## HuggingFace 🤗 Transformers
A state-of-the-art NLP library that provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio. More details at the following link:

https://github.com/huggingface/transformers


To immediately use a model on a given input (text, image, audio, ...), 🤗 Transformers provides the pipeline API. A pipeline groups together a pretrained model with the preprocessing that was used during that model's training.


For installation, you should install 🤗 Transformers in a virtual environment. First, create a virtual environment with the version of Python you're going to use and activate it. Then, you will need to install at least one of Flax, PyTorch or TensorFlow.

TensorFlow: https://www.tensorflow.org/install/

PyTorch: https://pytorch.org/get-started/locally/#start-locally

Flax: https://github.com/google/flax#quick-install

When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:

In [25]:
pip install transformers

Collecting transformers
  Using cached https://files.pythonhosted.org/packages/4a/7f/f1c28621af0d74794b18cbe5534ec7565ee782ba48257d08ec264bc4aacb/transformers-4.15.0-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Using cached https://files.pythonhosted.org/packages/98/58/b092e16beb8cc360025f8cd26e2f4deb1492e43a22de0cb499793d71ea30/tokenizers-0.10.3-cp37-cp37m-win_amd64.whl
Collecting sacremoses
  Using cached https://files.pythonhosted.org/packages/ec/e5/407e634cbd3b96a9ce6960874c5b66829592ead9ac762bd50662244ce20b/sacremoses-0.0.47-py2.py3-none-any.whl
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.47 tokenizers-0.10.3 transformers-4.15.0


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Note: you may need to restart the kernel to use updated packages.


Pipelines are made of:

1) A tokenizer in charge of mapping raw textual input to tokens.

2) A model to make predictions from the inputs.

3) Some (optional) post-processing to enhance the model’s output.
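
To make the third stage concrete, here is a minimal sketch of what post-processing can look like for a two-class classifier: raw model scores (logits) are turned into probabilities with a softmax, and the most probable class is mapped to a label name. The logits and the `ID2LABEL` mapping are made-up values for illustration, not the output of a real model, and `postprocess` is a hypothetical helper rather than part of the Transformers API.

```python
import math

# Hypothetical label mapping for a binary sentiment model (illustrative only).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def postprocess(logits, id2label):
    """Turn raw class scores into a {'label', 'score'} dictionary."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the most probable class and report its probability as the score.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits standing in for a model's output on one sentence.
result = postprocess([-4.0, 4.1], ID2LABEL)
print(result)
```

The dictionary has the same shape as the entries a sentiment pipeline returns, which is why the score can be read as a confidence.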

Below is how to quickly use a pipeline to classify positive versus negative texts:


In [26]:
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')
[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here the answer is "positive" with a confidence of 99.97%.

Many NLP tasks have a pretrained pipeline ready to go. For example, we can easily extract the answer to a question from a given context:



In [30]:
# Allocate a pipeline for question-answering (in practice, do specify the model checkpoint identifier; here the default is used)
question_answerer = pipeline('question-answering')
question_answerer({'question': 'What is the name of the repository ?','context': 'Pipeline has been included in the huggingface/transformers repository'})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.3097018003463745,
 'start': 34,
 'end': 58,
 'answer': 'huggingface/transformers'}
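
The `start` and `end` values in the output above can be read as character offsets into the context string: slicing the context with them reproduces the answer. The short check below uses only the strings and numbers shown in that output.

```python
context = 'Pipeline has been included in the huggingface/transformers repository'

# Values copied from the pipeline output above.
start, end = 34, 58

# Slicing the context with the offsets recovers the answer span.
answer = context[start:end]
print(answer)
```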

Below is an example of using a tokenizer and a model directly, together with the top_k_top_p_filtering function (importable from transformers in the version installed above, 4.15.0), to sample the next token following an input sequence of tokens.


In [32]:
#Import relevant packages
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

In [34]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get the logits at the last position of the sequence (scores for the next token)
next_token_logits = model(**inputs).logits[:, -1, :]

# filter: keep only the 50 highest-scoring tokens (top_p=1.0 disables nucleus filtering)
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)


Hugging Face is based in DUMBO, New York City, and is
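
The filter-then-sample step can also be written out by hand. The sketch below is a plain-Python illustration of top-k filtering followed by softmax sampling on a toy list of logits. It is not the library's implementation, it leaves out top-p (nucleus) filtering, and the function names and logits are invented for the example.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties may keep more)."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Convert logits to probabilities; -inf entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, k, rng):
    """Sample one token index from the top-k filtered distribution."""
    probs = softmax(top_k_filter(logits, k))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy scores for a vocabulary of four tokens.
toy_logits = [2.0, 0.5, 1.0, -1.0]
next_index = sample_next(toy_logits, k=2, rng=random.Random(0))
print(next_index)
```

With `k=2`, only the tokens at index 0 and index 2 can ever be sampled, because the other two receive probability zero after filtering.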


You can learn more about the tasks supported by the pipeline API in this tutorial:

https://huggingface.co/docs/transformers/task_summary

For the full list of tasks that pipelines support, refer to the following documentation:

https://huggingface.co/docs/transformers/main_classes/pipelines

## BERT in practice

### Set up BERT in Anaconda Prompt

#conda create -n bert python pytorch pandas tqdm
#conda install -c anaconda scikit-learn

In [None]:
# Install the PyTorch version of BERT from HuggingFace
#pip install pytorch-pretrained-bert (directly in this Notebook kernel)

In [63]:
#Check/locate your current working directory
import os

path = os.getcwd()

print(path)

C:\Users\HuyenNguyen\Dropbox (Erasmus Universiteit Rotterdam)\Hamburg\TEACHING_UHH\WiSo21-22\Text Analysis for Social Sciences in Python\Exercises\W14


To do text classification, we need a text classification dataset. For this guide, we use the Yelp Reviews Polarity dataset (from the Yelp Dataset Challenge 2015), which you can find in the dataset list at the link below:

https://course.fast.ai/datasets

After downloading, decompress the file to get the train.csv and test.csv files.

(The readme.txt file of the Yelp dataset contains the following. Please read it through!)

**ORIGIN**

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge

The Yelp reviews polarity dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is first used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).


**DESCRIPTION**

The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are taken randomly. In total there are 560,000 training samples and 38,000 testing samples. Negative polarity is class 1, and positive class 2.

The files train.csv and test.csv contain all the training samples as comma-separated values. There are 2 columns in them, corresponding to class index (1 and 2) and review text. The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".

In [2]:
pip install pytorch-pretrained-bert

Collecting pytorch-pretrained-bert
  Downloading https://files.pythonhosted.org/packages/d7/e0/c08d5553b89973d9a240605b9c12404bcf8227590de62bae27acbcfe076b/pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123kB)
Collecting regex
  Downloading https://files.pythonhosted.org/packages/a2/21/e9bf5c0eb6cb0f50bad4fd9f1635781996aef89df2cf0e582c522388618f/regex-2022.1.18-cp37-cp37m-win_amd64.whl (272kB)
Collecting torch>=0.4.1
  Downloading https://files.pythonhosted.org/packages/c0/f2/b12037765c40da46d7a48914dda220187a69d3a6a1ff102330c2e647f9a6/torch-1.10.1-cp37-cp37m-win_amd64.whl (226.6MB)
Installing collected packages: regex, torch, pytorch-pretrained-bert
Successfully installed pytorch-pretrained-bert-0.6.2 regex-2022.1.18 torch-1.10.1
Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [36]:
#Load and check the data 
import pandas as pd

train_df = pd.read_csv('train.csv', header=None)
train_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 2 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 1 | I don't know what Dr. Goldberg was like before... |
| 3 | 1 | I'm writing this review to give you a heads up... |
| 4 | 2 | All the food is great here. But the best thing... |


In [37]:
test_df = pd.read_csv('test.csv', header=None)
test_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 2 | Contrary to other reviews, I have zero complai... |
| 1 | 1 | Last summer I had an appointment to get new ti... |
| 2 | 2 | Friendly staff, same starbucks fair you get an... |
| 3 | 1 | The food is good. Unfortunately the service is... |
| 4 | 2 | Even when we didn't have a car Filene's Baseme... |


The data contain no headers, only two columns for the label and the text. The labels here are 1 and 2 instead of the typical 0 and 1: a label of 1 means the review is BAD, and a label of 2 means the review is GOOD.

Let's change this to the more familiar 0 and 1 labelling, where a label 0 indicates a BAD review, and a label 1 indicates a GOOD review.

In [38]:
train_df[0] = (train_df[0] == 2).astype(int)
test_df[0] = (test_df[0] == 2).astype(int)

In [39]:
train_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 0 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 1 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 0 | I don't know what Dr. Goldberg was like before... |
| 3 | 0 | I'm writing this review to give you a heads up... |
| 4 | 1 | All the food is great here. But the best thing... |


In [40]:
test_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 1 | Contrary to other reviews, I have zero complai... |
| 1 | 0 | Last summer I had an appointment to get new ti... |
| 2 | 1 | Friendly staff, same starbucks fair you get an... |
| 3 | 0 | The food is good. Unfortunately the service is... |
| 4 | 1 | Even when we didn't have a car Filene's Baseme... |


The BERT data loading classes used below, however, expect the data in a .tsv file with the following format:

Column 0: An ID for the row

Column 1: The label for the row (should be an integer)

Column 2: A column of the same letter for all rows. BERT wants this so we’ll give it, but we have no use for it.

Column 3: The text for the row
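
As a small illustration of this layout, the sketch below writes two made-up rows in the four-column order with Python's csv module and reads them back. The review texts and the in-memory buffer are assumptions for the example; the notebook itself writes the real files with pandas further down.

```python
import csv
import io

# Two made-up rows: (id, label, placeholder letter, text).
rows = [
    (0, 0, "a", "The service was slow."),
    (1, 1, "a", "Great food and friendly staff."),
]

# Write tab-separated rows without a header, as the format requires.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)

# Reading the rows back: column 1 is the label and column 3 is the text.
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```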

In [41]:
train_df_bert = pd.DataFrame({
    'id':range(len(train_df)),
    'label':train_df[0],
    'alpha':['a']*train_df.shape[0],
    'text': train_df[1].replace(r'\n', ' ', regex=True)
})

train_df_bert.head()

|   | id | label | alpha | text |
|---|---|---|---|---|
| 0 | 0 | 0 | a | Unfortunately, the frustration of being Dr. Go... |
| 1 | 1 | 1 | a | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 2 | 0 | a | I don't know what Dr. Goldberg was like before... |
| 3 | 3 | 0 | a | I'm writing this review to give you a heads up... |
| 4 | 4 | 1 | a | All the food is great here. But the best thing... |
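
One detail of the `replace` call is worth checking on your own copy of the data. With `regex=True`, the pattern `r'\n'` matches a real newline character, whereas the dataset readme describes newlines stored as the two characters backslash and `n`, which would need the pattern `r'\\n'`. The sketch below shows the difference with Python's `re` module on a made-up review. pandas uses the same regular-expression syntax, but whether your file contains literal `\n` sequences is an assumption to verify.

```python
import re

# A made-up review containing the two characters backslash + n, as described in the readme.
escaped = r"Great food.\nSlow service."
# The same review with a real newline character.
real = "Great food.\nSlow service."

# r'\n' is the regex for a newline character, so only `real` changes.
newline_on_escaped = re.sub(r'\n', ' ', escaped)
newline_on_real = re.sub(r'\n', ' ', real)

# r'\\n' matches a literal backslash followed by n, so only `escaped` changes.
literal_on_escaped = re.sub(r'\\n', ' ', escaped)
literal_on_real = re.sub(r'\\n', ' ', real)

print(repr(newline_on_escaped))
print(repr(literal_on_escaped))
```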


For convenience, we refer to the test data as dev data, because BERT comes with data loading classes that expect train and dev files in the above format.

We can use the train data to train our model, and the dev data to evaluate its performance. BERT’s data loading classes can also use a test file but it expects the test file to be unlabelled. Therefore, let's use the train and dev files instead.

In [42]:
dev_df_bert = pd.DataFrame({
    'id':range(len(test_df)),
    'label':test_df[0],
    'alpha':['a']*test_df.shape[0],
    'text': test_df[1].replace(r'\n', ' ', regex=True)
})

dev_df_bert.head()

|   | id | label | alpha | text |
|---|---|---|---|---|
| 0 | 0 | 1 | a | Contrary to other reviews, I have zero complai... |
| 1 | 1 | 0 | a | Last summer I had an appointment to get new ti... |
| 2 | 2 | 1 | a | Friendly staff, same starbucks fair you get an... |
| 3 | 3 | 0 | a | The food is good. Unfortunately the service is... |
| 4 | 4 | 1 | a | Even when we didn't have a car Filene's Baseme... |


After we have the data in the correct form, we need to save the train and dev data as .tsv files.


In [43]:
train_df_bert.to_csv('train.tsv', sep='\t', index=False, header=False)
dev_df_bert.to_csv('dev.tsv', sep='\t', index=False, header=False)

### Convert data to features

Before fine-tuning takes place, we need to convert the data into features that BERT uses. 

We will now see why we rearranged the data into the .tsv format in the previous section: it enables us to easily reuse the example classes that come with BERT for our own binary classification task.

In [44]:
from __future__ import absolute_import, division, print_function

import csv
import os
import sys
import logging

In [45]:
logger = logging.getLogger()
csv.field_size_limit(2147483647) # Increase the CSV reader's field limit in case we have long text.

#The first class, InputExample, is the format that a single example of our dataset should be in.
# We won’t be using the text_b attribute since that is not necessary for our binary classification task.
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

# DataProcessor and BinaryClassificationProcessor, are helper classes that can be used to read in .tsv files 
# and prepare them to be converted into features that will ultimately be fed into the actual BERT model.
class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines

#The BinaryClassificationProcessor class can read in the train.tsv 
#and dev.tsv files and convert them into lists of InputExample objects.

class BinaryClassificationProcessor(DataProcessor):
    """Processor for binary classification dataset."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

So far, we have the capability to read in tsv datasets and convert them into InputExample objects. BERT, being a neural network, cannot directly deal with text as we have in InputExample objects. The next step is to convert them into InputFeatures.

BERT has a constraint on the maximum length of a sequence after tokenizing. For any BERT model, the maximum sequence length after tokenization is 512. But we can set any sequence length equal to or below this value. For faster training, I’ll be using 128 as the maximum sequence length. A bigger number may give better results if there are sequences longer than this value.


An **InputFeatures** object consists of purely numerical data (with the proper sequence lengths) that can then be fed into the BERT model. It is prepared by tokenizing the text of each example, truncating longer sequences, and padding shorter sequences to the given maximum sequence length (128). The conversion of **InputExample** objects to **InputFeatures** objects is quite slow by default, so we modify the conversion code to utilize the multiprocessing library of Python to significantly speed up the process.


In [46]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def convert_example_to_feature(example_row):
    # return example_row
    example, label_map, max_seq_length, tokenizer, output_mode = example_row

    tokens_a = tokenizer.tokenize(example.text_a)

    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[:(max_seq_length - 2)]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    if output_mode == "classification":
        label_id = label_map[example.label]
    elif output_mode == "regression":
        label_id = float(example.label)
    else:
        raise KeyError(output_mode)

    return InputFeatures(input_ids=input_ids,
                         input_mask=input_mask,
                         segment_ids=segment_ids,
                         label_id=label_id)
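
To see what the truncation and padding in `convert_example_to_feature` produce, here is a reduced sketch for a single sequence that works on integer ids directly. The ids used for the special tokens and the padding (`cls_id`, `sep_id`, `pad_id`) and the helper name `pad_or_truncate` are placeholders chosen for the example, not values read from a BERT vocabulary.

```python
def pad_or_truncate(token_ids, max_seq_length, cls_id=1, sep_id=2, pad_id=0):
    """Build fixed-length input ids and an attention mask for one sequence."""
    # Leave room for the two special tokens, mirroring the "- 2" above.
    body = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * padding
    input_mask = input_mask + [0] * padding
    return input_ids, input_mask

# A short sequence is padded, a long one is truncated; both end up with length 8.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_seq_length=8)
long_ids, long_mask = pad_or_truncate(list(range(10, 30)), max_seq_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```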

### Training stage
Let’s import all the packages that we’ll need, and then get our paths straightened out.

In [47]:
pip install tools

Collecting tools
  Downloading https://files.pythonhosted.org/packages/de/20/2a2dddb083fd0ce56b453cf016768b2c49f3c0194090500f78865b7d110c/tools-0.1.9.tar.gz
Collecting pytils
  Downloading https://files.pythonhosted.org/packages/c6/c1/12b556b5bb393ce5130d57af862d045f57fee764797c0fe837e49cb2a5da/pytils-0.3.tar.gz (89kB)
Building wheels for collected packages: tools, pytils
  Building wheel for tools (setup.py): started
  Building wheel for tools (setup.py): finished with status 'done'
  Created wheel for tools: filename=tools-0.1.9-cp37-none-any.whl size=46764 sha256=70d28a2688eb198603f1e8887081c494dab55573ed63c971136960d296f2c724
  Stored in directory: C:\Users\HuyenNguyen\AppData\Local\pip\Cache\wheels\87\67\9b\1ca7dcb0b9ebfdc23a00c85a0644abb6fb14f9159a0df8e067
  Building wheel for pytils (setup.py): started
  Building wheel for pytils (setup.py): finished with status 'done'
  Created wheel for pytils: filename=pytils-0.3-cp37-none-any.whl size=40360 sha256=3aaebdc65aedf2963dec3a26d7f

You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Note: you may need to restart the kernel to use updated packages.


In [50]:
pip install utils

Collecting utils
  Downloading https://files.pythonhosted.org/packages/55/e6/c2d2b2703e7debc8b501caae0e6f7ead148fd0faa3c8131292a599930029/utils-1.0.1-py2.py3-none-any.whl
Installing collected packages: utils
Successfully installed utils-1.0.1


You should consider upgrading via the 'python -m pip install --upgrade pip' command.


Note: you may need to restart the kernel to use updated packages.


In [56]:
import torch
import pickle
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
from torch.nn import CrossEntropyLoss, MSELoss

from tqdm import tqdm_notebook, trange
import os
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

from multiprocessing import Pool, cpu_count
from tools import *

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Next, we set some paths for where files should be stored and where certain files can be found. We also set some configuration options for the BERT model.

In [81]:
from transformers import glue_convert_examples_to_features

In [64]:
# The input data dir. Should contain the .tsv files (or other data files) for the task.
DATA_DIR = path

# Bert pre-trained model selected in the list: bert-base-uncased, 
# bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased,
# bert-base-multilingual-cased, bert-base-chinese.
BERT_MODEL = 'bert-base-cased'

# The name of the task to train. Let's name this 'yelp'.
TASK_NAME = 'yelp'

# The output directory where the fine-tuned model and checkpoints will be written.
OUTPUT_DIR = f'outputs/{TASK_NAME}/'

# The directory where the evaluation reports will be written to.
REPORTS_DIR = f'reports/{TASK_NAME}_evaluation_report/'

# This is where BERT will look for pre-trained models to load parameters from.
CACHE_DIR = 'cache/'

# The maximum total input sequence length after WordPiece tokenization.
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
MAX_SEQ_LENGTH = 128

TRAIN_BATCH_SIZE = 24
EVAL_BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 1
RANDOM_SEED = 42
GRADIENT_ACCUMULATION_STEPS = 1
WARMUP_PROPORTION = 0.1
OUTPUT_MODE = 'classification'

CONFIG_NAME = "config.json"
WEIGHTS_NAME = "pytorch_model.bin"

In [65]:
output_mode = OUTPUT_MODE

cache_dir = CACHE_DIR

Finally, we will create the directories if they do not already exist.

In [66]:
if os.path.exists(REPORTS_DIR) and os.listdir(REPORTS_DIR):
    REPORTS_DIR += f'/report_{len(os.listdir(REPORTS_DIR))}'
    os.makedirs(REPORTS_DIR)
if not os.path.exists(REPORTS_DIR):
    os.makedirs(REPORTS_DIR)
    REPORTS_DIR += f'/report_{len(os.listdir(REPORTS_DIR))}'
    os.makedirs(REPORTS_DIR)

In [67]:
if os.path.exists(OUTPUT_DIR) and os.listdir(OUTPUT_DIR):
    raise ValueError("Output directory ({}) already exists and is not empty.".format(OUTPUT_DIR))
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)
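
A more compact way to express the same idea is `os.makedirs(..., exist_ok=True)`, which does nothing if the directory is already there. The sketch below numbers report folders in a similar way to the reports cell above, but inside a temporary directory so it can be run safely. `next_report_dir` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

def next_report_dir(reports_dir):
    """Create reports_dir if needed and return a new numbered subdirectory."""
    os.makedirs(reports_dir, exist_ok=True)  # no error if it already exists
    index = len(os.listdir(reports_dir))     # number of existing entries
    report_dir = os.path.join(reports_dir, f"report_{index}")
    os.makedirs(report_dir)
    return report_dir

# Run inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "reports", "yelp_evaluation_report")
    first = next_report_dir(base)
    second = next_report_dir(base)
    created = sorted(os.listdir(base))
    print(created)
```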

Next, we will use our BinaryClassificationProcessor to load in the data, and get everything ready for the tokenization step.


In [68]:
processor = BinaryClassificationProcessor()
train_examples = processor.get_train_examples(DATA_DIR)
train_examples_len = len(train_examples)

In [69]:
label_list = processor.get_labels() # ["0", "1"] for binary classification
num_labels = len(label_list)

In [70]:
num_train_optimization_steps = int(
    train_examples_len / TRAIN_BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS) * NUM_TRAIN_EPOCHS
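
As a worked example, with the 560,000 training examples of this dataset and the settings above, the formula gives the number of optimizer steps computed below. The warm-up count is an extra illustration: it assumes that `WARMUP_PROPORTION` is applied as a fraction of the total number of steps, which is a common convention but is not shown in this section.

```python
train_examples_len = 560_000  # size of the Yelp polarity training set
TRAIN_BATCH_SIZE = 24
GRADIENT_ACCUMULATION_STEPS = 1
NUM_TRAIN_EPOCHS = 1
WARMUP_PROPORTION = 0.1

# Same formula as in the cell above.
num_train_optimization_steps = int(
    train_examples_len / TRAIN_BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS) * NUM_TRAIN_EPOCHS

# Assumption: warm-up covers the first WARMUP_PROPORTION of all steps.
warmup_steps = int(num_train_optimization_steps * WARMUP_PROPORTION)

print(num_train_optimization_steps, warmup_steps)
```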

In [71]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)


In [74]:
label_map = {label: i for i, label in enumerate(label_list)}
train_examples_for_processing = [(example, label_map, MAX_SEQ_LENGTH, tokenizer, OUTPUT_MODE) for example in train_examples]

Here, we create our BinaryClassificationProcessor and use it to load the training examples. Then, we set some variables that we'll use while training the model. Next, we load BERT's pretrained tokenizer. In this case, we use the bert-base-cased model.

The convert_example_to_feature function expects a tuple containing an example, the label map, the maximum sequence length, a tokenizer, and the output mode. So lastly, we will create an examples list ready to be processed (tokenized, truncated/padded, and turned into InputFeatures) by the convert_example_to_feature function.
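
To illustrate what such a conversion involves, here is a minimal sketch of the tokenize, truncate, pad, and label-mapping steps for a single-sentence input. It is not the notebook's actual implementation: `ToyTokenizer` is a hypothetical whitespace tokenizer standing in for `BertTokenizer`, and `InputExample` and `InputFeatures` are simplified stand-ins defined here as named tuples.

```python
from collections import namedtuple

# Hypothetical stand-ins for the notebook's example and feature classes.
InputExample = namedtuple("InputExample", ["text_a", "label"])
InputFeatures = namedtuple(
    "InputFeatures", ["input_ids", "input_mask", "segment_ids", "label_id"])


class ToyTokenizer:
    """Whitespace tokenizer standing in for BertTokenizer (illustration only)."""

    def __init__(self):
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Assign a fresh id to every token not seen before.
        return [self.vocab.setdefault(t, len(self.vocab)) for t in tokens]


def convert_example_to_feature(example_row):
    """Turn one (example, label_map, max_len, tokenizer, mode) tuple into features."""
    example, label_map, max_seq_length, tokenizer, output_mode = example_row

    # Truncate so that [CLS] and [SEP] still fit in the sequence.
    tokens = tokenizer.tokenize(example.text_a)[:max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)   # 1 marks real tokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # single-sentence input: all segment 0

    # Pad every list with zeros up to the fixed sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    if output_mode == "classification":
        label_id = label_map[example.label]
    else:  # "regression"
        label_id = float(example.label)

    return InputFeatures(input_ids, input_mask, segment_ids, label_id)


row = (InputExample("the food was great", "1"), {"0": 0, "1": 1}, 8,
       ToyTokenizer(), "classification")
feature = convert_example_to_feature(row)
print(feature.input_ids)   # [1, 3, 4, 5, 6, 2, 0, 0]
print(feature.input_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Because the function takes a single tuple argument, it can be mapped directly over a list of such tuples, which is why the examples list is built in that shape.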

**Note**: This step takes a long time (1.5 hours minimum, depending on your processing power). Also, there is a bug for you to solve :)

In [84]:
process_count = max(1, cpu_count() - 1)  # keep at least one worker on single-core machines
if __name__ == '__main__':
    print(f'Preparing to convert {train_examples_len} examples..')
    print(f'Spawning {process_count} processes..')
    with Pool(process_count) as p:
        train_features = list(tqdm_notebook(p.imap(glue_convert_examples_to_features.convert_example_to_feature, train_examples_for_processing), total=train_examples_len))

Preparing to convert 560000 examples..
Spawning 7 processes..


AttributeError: 'function' object has no attribute 'convert_example_to_feature'

In [None]:
with open(os.path.join(DATA_DIR, "train_features.pkl"), "wb") as f:
    pickle.dump(train_features, f)

### Fine-tune BERT

In [None]:
# Load pre-trained model (weights)
model = BertForSequenceClassification.from_pretrained(BERT_MODEL, cache_dir=CACHE_DIR, num_labels=num_labels)
# model = BertForSequenceClassification.from_pretrained(CACHE_DIR + 'cased_base_bert_pytorch.tar.gz', cache_dir=CACHE_DIR, num_labels=num_labels)

In [None]:
model.to(device)


HuggingFace's PyTorch implementation of BERT comes with a function that automatically downloads the pretrained BERT model for us.

Afterwards, we need a few more configuration steps before training: grouping the parameters for weight decay and creating the optimizer. Here, we just use the default parameters.

In [None]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
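
To see how the grouping works, the sketch below applies the same substring test to a handful of parameter names. The names are hypothetical examples written in the style of `model.named_parameters()`; the real list depends on the model and library version.

```python
# Hypothetical parameter names in the style of model.named_parameters().
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

# Same substring test as in the cell above, applied to the names only.
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]

print("weight_decay=0.01:", decayed)
print("weight_decay=0.0: ", exempt)
```

In this example, biases and LayerNorm parameters end up in the group without weight decay, while the remaining weight matrices are decayed.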

In [None]:
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=LEARNING_RATE,
                     warmup=WARMUP_PROPORTION,
                     t_total=num_train_optimization_steps)
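
The `warmup` and `t_total` arguments control how the learning rate changes over the course of training. As a rough illustration, the sketch below assumes a linear warmup followed by a linear decay to zero; the exact schedule the optimizer applies may differ between library versions. `lr_multiplier` is a hypothetical helper, and the base learning rate is an assumed value.

```python
def lr_multiplier(step, t_total, warmup):
    """Fraction of the base learning rate to use at a given step."""
    progress = step / t_total
    if progress < warmup:
        # Ramp up linearly from 0 to 1 during the warmup fraction.
        return progress / warmup
    # Then decay linearly so that the multiplier reaches 0 at t_total.
    return max((1.0 - progress) / (1.0 - warmup), 0.0)


base_lr = 2e-5  # assumed value, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * lr_multiplier(step, t_total=1000, warmup=0.1))
```

With `warmup=0.1`, the first 10% of the steps are spent ramping up to the full learning rate, which then shrinks steadily for the rest of training.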

In [None]:
global_step = 0
nb_tr_steps = 0
tr_loss = 0

In [None]:
logger.info("***** Running training *****")
logger.info("  Num examples = %d", train_examples_len)
logger.info("  Batch size = %d", TRAIN_BATCH_SIZE)
logger.info("  Num steps = %d", num_train_optimization_steps)
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)

if OUTPUT_MODE == "classification":
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
elif OUTPUT_MODE == "regression":
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.float)

### Model Training

Let's set up the data for training.

In [None]:
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=TRAIN_BATCH_SIZE)

In [None]:
model.train()
for _ in trange(int(NUM_TRAIN_EPOCHS), desc="Epoch"):
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(tqdm_notebook(train_dataloader, desc="Iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        logits = model(input_ids, segment_ids, input_mask, labels=None)

        if OUTPUT_MODE == "classification":
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
        elif OUTPUT_MODE == "regression":
            loss_fct = MSELoss()
            loss = loss_fct(logits.view(-1), label_ids.view(-1))

        if GRADIENT_ACCUMULATION_STEPS > 1:
            loss = loss / GRADIENT_ACCUMULATION_STEPS

        loss.backward()
        print("\r%f" % loss.item(), end='')

        tr_loss += loss.item()
        nb_tr_examples += input_ids.size(0)
        nb_tr_steps += 1
        if (step + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1
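
The loop above calls `backward()` on every mini-batch but only steps the optimizer every `GRADIENT_ACCUMULATION_STEPS` batches, dividing each loss by that number first. The sketch below shows why the division matters, using plain Python rather than PyTorch: for a hypothetical one-parameter model `y = w * x` with a hand-written gradient, the scaled gradients of two micro-batches add up to the gradient of one large batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One large batch containing all four examples.
full_grad = mse_grad(w, xs, ys)

# Two micro-batches of two examples each. Dividing each gradient by the
# number of accumulation steps has the same effect as scaling the loss.
accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    stop = start + micro_batch_size
    accumulated += mse_grad(w, xs[start:stop], ys[start:stop]) / accumulation_steps

print(full_grad, accumulated)
```

This equivalence holds when the micro-batches have equal size; it lets a small `TRAIN_BATCH_SIZE` that fits in memory behave like a larger effective batch.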

In [None]:
model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself

# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(OUTPUT_DIR, WEIGHTS_NAME)
output_config_file = os.path.join(OUTPUT_DIR, CONFIG_NAME)

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(OUTPUT_DIR)