# Text classification with BERT

BERT (Bidirectional Encoder Representations from **Transformers**) is an NLP model developed by Google in 2018. It comes pre-trained on a corpus of about 2,500M words (roughly 170 GB) from Wikipedia.

![bert](https://www.advisa.fr/wp-content/uploads/2019/10/google-bert-algorithm.jpg)

To accomplish a particular NLP task, the pre-trained BERT model is used as a base and refined by adding an additional layer; the model can then be trained on a labeled data set dedicated to the NLP task to be performed. This is the very principle of **transfer learning**. It is important to note that BERT is a very large model with 12 layers, 12 attention heads and 110 million parameters (BERT base).

BERT and the models derived from it can be used for:

* translation
* text generation
* classification
* question-answering
* syntax analysis (tagging, parsing) 

**Why BERT?**

A quick look at the usual benchmarks shows that the models at the top of the leaderboard are all variants of BERT.

https://gluebenchmark.com/leaderboard

## Let's go !

To use BERT, you need to have either PyTorch or TensorFlow installed in your environment. Access to a GPU is also preferable. If you don't have one, use Google Colab.

**Exercise :** Use tensorflow or pytorch to check if you have a GPU.





Next, let’s install the [transformers](https://github.com/huggingface/transformers) package from Hugging Face. This package is an interface between BERT and pytorch and/or tensorflow.


``!pip install transformers``



## Load the data

The dataset comes from Odile. She's a bot that tries to answer general questions on a few BeCode Discord servers. The sentences all come from conversations between learners and Odile on Discord.

**Exercise :** Import the ``'./dataset/odile_data.csv'`` file into a dataframe.

In [2]:
import pandas as pd
odile = pd.read_csv('./dataset/odile_data.csv')

## Analyze the dataset ! 

It's time to take a quick look at our data. 

**Exercise :** You must answer the following questions: 
* How many observations does the dataset contain?
* How many different labels does the dataset contain?
* Which labels contain the most observations?
* Which labels contain the fewest observations?

In [3]:
print(len(odile))
print(len(odile.intent.unique()))

1555
95


In [4]:
observations = odile.groupby('intent').count().sort_values(['sentence'], ascending = False)
observations.head(5)  # top 5 intents with the most observations

intent,sentence
smalltalk_appraisal_good,89
smalltalk_confirmation_yes,82
smalltalk_user_likes_agent,77
smalltalk_confirmation_no,67
smalltalk_confirmation_cancel,62


In [5]:
observations.tail(5)

intent,sentence
smalltalk_agent_sure,5
smalltalk_user_will_be_back,5
smalltalk_dialog_what_do_you_mean,5
smalltalk_understand_binary,4
smalltalk_best_song,3


## It's time to clean up !

Not all NLP tasks require the same preprocessing. In this case, we have to ask ourselves some questions: 

- Are there unwanted characters in the dataset? For example, do you want to keep the smileys or not?  
  - If, for example, you want to create labels for sentiment analysis, it might be preferable to keep the smileys.
- Is it relevant to keep capital letters in sentences?
  - In this case, capital letters don't really matter: on the one hand, not everyone starts their sentences with a capital letter when chatting; on the other hand, the sentences are quite short and addressed directly to Odile.
- Is it necessary to limit the number of characters in a sentence?
  - Again, in this case it may be preferable to limit the number of words. The questions asked to Odile are supposed to be short, and overly long sentences could interfere with the classification if they contain too much information.

There is no universal answer. Everything will depend on the expected result. 

**Exercise :** Clean the dataset.
- Remove all unnecessary characters. You can choose to keep the smileys or not.
- Put all sentences in lower case.
- Limit text to 256 words.

In [6]:
import re
import contractions  # third-party package: pip install contractions

def clean_text(text):
    # expand contractions first ("don't" -> "do not")
    text = contractions.fix(text)
    # replace every non-word character (punctuation, smileys, ...) with a space
    text = re.sub(r'[^\w]', ' ', text)
    # make everything lowercase
    text = text.lower()
    # split() drops leading/trailing whitespace and collapses repeated spaces;
    # the slice keeps at most the first 256 words
    words = text.split()[:256]
    return ' '.join(words)

In [7]:
odile['sentence'] = odile['sentence'].apply(clean_text)

In [8]:
odile.head(5)

Unnamed: 0,sentence,intent
0,who are you,smalltalk_agent_acquaintance
1,all about you,smalltalk_agent_acquaintance
2,what is your personality,smalltalk_agent_acquaintance
3,define yourself,smalltalk_agent_acquaintance
4,what are you,smalltalk_agent_acquaintance


## Label encoding
As you know, the machine needs to convert words into numbers so that it can interpret them. It's the same with labels. So we are going to create a dictionary that will allow us to convert all labels into numbers. 

**Exercise :** Create a dictionary that contains all the labels and assigns an id to each one. (Of course, there should be no duplicates.)



In [9]:
list_uniques = list(odile.intent.unique())
label_dict = dict(enumerate(list_uniques))          # id -> label
label_map = {v: k for k, v in label_dict.items()}   # label -> id

**Exercise :** Create a column `id_label` in your dataframe and insert the ids of the labels.

In [10]:
odile['id_label'] = odile['intent'].map(label_map)
odile.head(5)

Unnamed: 0,sentence,intent,id_label
0,who are you,smalltalk_agent_acquaintance,0
1,all about you,smalltalk_agent_acquaintance,0
2,what is your personality,smalltalk_agent_acquaintance,0
3,define yourself,smalltalk_agent_acquaintance,0
4,what are you,smalltalk_agent_acquaintance,0


When we make our predictions, the model will return the label id as a prediction. So it may be useful to save your label dictionary to be able to reinterpret the label for a human later on. 

**Exercise:** Save your label dictionary with pickle (or other). 

In [11]:
import pickle

# export the label -> id mapping; the with-block closes the file automatically
with open('intent_encoder.pkl', 'wb') as output:
    pickle.dump(label_map, output)
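
To reinterpret a predicted id later on, the pickled mapping has to be loaded again and inverted. The sketch below shows that round trip on a tiny stand-in dictionary written to a temporary file, so it runs without the dataset. The second label's id and the temporary path are assumptions made for the illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the notebook's label_map (label -> id); the real one has 95 entries.
# The id given to the second label is made up for this example.
label_map = {"smalltalk_agent_acquaintance": 0, "smalltalk_greetings_how_are_you": 1}

# Write to a temporary directory so the real intent_encoder.pkl is not overwritten.
path = os.path.join(tempfile.mkdtemp(), "intent_encoder.pkl")
with open(path, "wb") as f:
    pickle.dump(label_map, f)

# Later (for example in another script): load the mapping back.
with open(path, "rb") as f:
    loaded_map = pickle.load(f)

# Invert label -> id into id -> label to decode the model's predictions.
id_to_label = {idx: label for label, idx in loaded_map.items()}
print(id_to_label[0])
```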

## Split your dataset !
After all this time, I dare to hope that it is not necessary to explain this step anymore!

**Exercise :** Create the variables X_train, X_test, y_train and y_test. 


In [12]:
X = odile.sentence
y = odile.id_label

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Tokenization 
If you no longer remember what tokenization is, look [here](../1.preprocessing/1.tokenization.ipynb).

We will use the tokenizer provided by BERT. This is a pre-trained model that will save us time. 

**Exercise :** Create a ``tokenizer`` variable and instantiate it with ``BertTokenizer.from_pretrained()`` from ``transformers``. You have to load the ``bert-base-uncased`` model. (Uncased means case-insensitive.)

[Documentation](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer)



In [14]:
import tensorflow as tf
from transformers import BertTokenizer, BertModel

In [15]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

Good! We have instantiated our tokenizer, but we have not yet encoded our sentences as vectors.
To do this, we will use the method ``tokenizer.batch_encode_plus()``. It converts each sentence into a vector of token ids and creates the attention mask.



**Exercise :** Create an ``encoded_data_train`` variable and assign it the result of `tokenizer.batch_encode_plus()`. First you have to specify the data, so pass the variable `X_train`.

You need to know 4 parameters.

- **padding :** pads all vectors to the same length. You can set it to True, which pads to the longest sentence in the batch.

- **return_attention_mask :** returns the attention mask along with the token ids. Set it to True. The mask tells the model which positions are real tokens (1) and which are padding (0).

- **max_length :** maximum length of the sequence. You can set it to 256. Note that it only takes effect together with `truncation=True` (or `padding='max_length'`).

- **return_tensors :** depending on the framework you are using (PyTorch vs. TensorFlow), you have to specify the type of tensors you want in return.

  - For PyTorch, specify "pt".
  - For TensorFlow, specify "tf".
  - For a NumPy array, specify "np".
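
To make padding and the attention mask concrete, here is a toy pure-Python sketch. It is not the Hugging Face implementation (the real tokenizer also splits words into word pieces and adds the special tokens itself). `pad_batch` is a hypothetical helper and the token ids are made up, except 101, 102 and 0, which are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two already-tokenized sentences of different lengths (made-up ids).
batch = [[101, 2040, 2024, 2017, 102], [101, 7592, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 2040, 2024, 2017, 102], [101, 7592, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The trailing zeros have the same origin as the ones visible at the end of every row in the tokenizer output.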


In [16]:
encoded_data_train = tokenizer.batch_encode_plus(X_train,
                                                 padding=True,
                                                 truncation=True,  # needed for max_length to apply
                                                 return_attention_mask=True,
                                                 max_length=256,
                                                 return_tensors='tf')

In [17]:
encoded_data_train

{'input_ids': <tf.Tensor: shape=(1244, 23), dtype=int32, numpy=
array([[ 101, 4931, 2054, ...,    0,    0,    0],
       [ 101, 5754, 5313, ...,    0,    0,    0],
       [ 101, 1045, 2064, ...,    0,    0,    0],
       ...,
       [ 101, 1997, 2607, ...,    0,    0,    0],
       [ 101, 2145, 3403, ...,    0,    0,    0],
       [ 101, 2129, 2079, ...,    0,    0,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1244, 23), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1244, 23), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

You must do the same for the test data set. 

**Exercise :** Create an `encoded_data_test` variable and do the same thing as above.

In [18]:
encoded_data_test = tokenizer.batch_encode_plus(X_test,
                                                padding=True,
                                                truncation=True,  # needed for max_length to apply
                                                return_attention_mask=True,
                                                max_length=256,
                                                return_tensors='tf')

In [19]:
encoded_data_test

{'input_ids': <tf.Tensor: shape=(311, 20), dtype=int32, numpy=
array([[  101,  1045,  2066, ...,     0,     0,     0],
       [  101,  2024,  2017, ...,     0,     0,     0],
       [  101,  2057,  2024, ...,     0,     0,     0],
       ...,
       [  101,  2017,  2024, ...,     0,     0,     0],
       [  101,  2515, 18750, ...,     0,     0,     0],
       [  101,  2085,  9061, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(311, 20), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(311, 20), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>}

If you do `print(encoded_data_train)`, you will see we have a dictionary with the following keys: `'input_ids'`, `'token_type_ids'` and `'attention_mask'`.

* **input_ids :** The sentence represented as a vector. The input_ids are the vocabulary indices corresponding to each token in our sentence.

* **attention_mask :** Indicates which tokens the model should pay attention to (1) and which ones it should ignore because they are padding (0).

* **token_type_ids :** Identifies which sequence each token belongs to when two sequences are encoded together. We will not use it in this case.  
 But you can find more information by following this [link](https://huggingface.co/transformers/glossary.html#token-type-ids)
 

**Exercise :** print ``encoded_data_train['input_ids']`` and ``encoded_data_train['attention_mask']``

In [20]:
print(encoded_data_train.input_ids, encoded_data_train.attention_mask)

tf.Tensor(
[[ 101 4931 2054 ...    0    0    0]
 [ 101 5754 5313 ...    0    0    0]
 [ 101 1045 2064 ...    0    0    0]
 ...
 [ 101 1997 2607 ...    0    0    0]
 [ 101 2145 3403 ...    0    0    0]
 [ 101 2129 2079 ...    0    0    0]], shape=(1244, 23), dtype=int32) tf.Tensor(
[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]], shape=(1244, 23), dtype=int32)


In [21]:
print(encoded_data_test.input_ids, encoded_data_test.attention_mask)

tf.Tensor(
[[  101  1045  2066 ...     0     0     0]
 [  101  2024  2017 ...     0     0     0]
 [  101  2057  2024 ...     0     0     0]
 ...
 [  101  2017  2024 ...     0     0     0]
 [  101  2515 18750 ...     0     0     0]
 [  101  2085  9061 ...     0     0     0]], shape=(311, 20), dtype=int32) tf.Tensor(
[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]], shape=(311, 20), dtype=int32)


## Prepare the dataset
Whether it's for Pytorch or Tensorflow, we have to prepare the datasets (more simply said, convert the dataframes to tensors). 

We need to convert `y_train` and `y_test` into tensors. For PyTorch, use ``torch.tensor()``; for TensorFlow, use ``tf.convert_to_tensor()``.

**Exercise :** Create a variable `labels_train` and create a tensor with `y_train`.


In [22]:
labels_train = tf.convert_to_tensor(y_train)

In [23]:
labels_train

<tf.Tensor: shape=(1244,), dtype=int64, numpy=array([56, 37, 61, ..., 38, 80, 52])>

**Exercise :** Create a variable `labels_test` and create a tensor with `y_test`.

In [24]:
labels_test = tf.convert_to_tensor(y_test)

In [25]:
labels_test

<tf.Tensor: shape=(311,), dtype=int64, numpy=
array([70, 10, 20, 27, 75, 88,  4, 24, 32, 20,  2, 67, 38, 89, 32, 69, 20,
       20, 87, 44, 42, 86, 45, 32,  9, 20, 13, 15, 15, 29, 37,  6, 45, 18,
       31, 27, 88, 38, 83, 37,  4,  6, 37, 39, 32, 38, 14, 56, 20, 40, 37,
       76, 10, 58, 70, 93, 37, 31, 64, 28, 25, 37,  1, 52, 54,  3, 75,  6,
       52, 54, 27, 31, 43, 58, 79, 53, 15, 27, 12, 53, 37, 32, 35, 20, 13,
       13,  3,  4, 74,  4, 45, 51, 38, 30, 40, 70,  6, 45, 35, 16, 27, 55,
       37, 47, 20,  6, 13, 27, 12, 60, 10, 33, 51, 49, 51, 31,  4, 89,  2,
       32, 27, 24, 32, 12, 52, 82, 30, 24, 27, 32, 48, 58, 38, 14, 47, 10,
       70, 38, 37, 31, 42, 12, 32, 52, 32, 81, 34, 20, 16, 27, 24, 27, 70,
       25, 69,  3, 27, 50, 42, 32, 51, 22,  6, 31, 10, 54, 16, 38, 19, 16,
        4, 19, 76, 70, 19, 51, 29,  0, 89, 26, 25, 70, 67, 64, 75, 40, 31,
       84, 31, 49, 31, 31, 20, 38, 42, 49, 27, 27, 70,  4,  4, 29, 33, 57,
       20,  3, 62, 52, 40, 70,  6, 25, 31,  6, 32, 50,

Define the batch size.  

**Exercise:** Create a `batch_size` variable. The right number of samples depends on several factors, such as the memory of your graphics card. If your graphics card is not very powerful, I advise you to use a small batch size of 8.

In [26]:
batch_size = 8
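
The batch size also determines how many steps one epoch takes. A small worked example, assuming the 1244 training sentences seen in the shape of `encoded_data_train`:

```python
import math

n_train = 1244   # training sentences, from the shape of encoded_data_train
batch_size = 8

# The last batch is smaller when n_train is not a multiple of batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)  # 156 4
```

The 156 steps are consistent with the `11/156` progress counter in the ktrain output at the end of this notebook.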

Now we need to convert our encoded data into a dataset.

**Exercise :** Create the ``dataset_train`` and ``dataset_test`` variables by converting ``encoded_data_train`` and ``encoded_data_test``, together with their labels, into datasets.

**PYTORCH  :** [Use torch.utils.data.Dataset class](https://classyvision.ai/tutorials/classy_dataset)  
**Tensorflow :** [Use tf.data.Dataset.from_tensor_slices](https://medium.com/when-i-work-data/converting-a-pandas-dataframe-into-a-tensorflow-dataset-752f3783c168)





In [37]:
dataset_train = tf.data.Dataset.from_tensor_slices((encoded_data_train.input_ids, labels_train))

In [38]:
dataset_train.element_spec

(TensorSpec(shape=(23,), dtype=tf.int32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [39]:
dataset_test = tf.data.Dataset.from_tensor_slices((encoded_data_test.input_ids, labels_test))

In [40]:
dataset_test.element_spec

(TensorSpec(shape=(20,), dtype=tf.int32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

## Load BERT model
Depending on what you use (pytorch or tensorflow) you will have to use the following class: 

pytorch = ``BertForSequenceClassification``  
tensorflow = ``TFBertForSequenceClassification.from_pretrained()``

⚠️ You must use the same model as the one used for tokenization. So in our case  ``bert-base-uncased``. 


[doc pytorch](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification)   
[doc tensorflow](https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification)

**Exercise:** Create a `model` variable and instantiate it with `BertForSequenceClassification.from_pretrained()` (or `TFBertForSequenceClassification.from_pretrained()`). As a parameter, you must indicate the number of labels (normally 95).



In [29]:
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 95)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**🔦 Pytorch only :** Assign the model to "cuda" device   
``model.to("cuda")``

In [30]:
# 🔦 PYTORCH user only !! 
# Assign the model to gpu

## Train your model

It's time to start training the model!
For this, the Hugging Face package simplifies our life by providing a ``Trainer()`` class.

To use this class, we must first configure the training with the ``TrainingArguments()`` class. It is this class that will allow us to set the batch size, the number of epochs, ...

⚠️ For TensorFlow you have to use `TFTrainer()` and `TFTrainingArguments()` !! Note that `TFTrainer` is deprecated in more recent releases of transformers, which recommend training TensorFlow models with Keras instead.

**Exercise :** Import `Trainer` and `TrainingArguments` from transformers.

In [41]:
from transformers import TFTrainer
from transformers import TFTrainingArguments

**Exercise :** Create the ``training_args`` variable and instantiate the class `TrainingArguments` (or `TFTrainingArguments`). You need to specify several parameters : 
* `output_dir` : directory path for saving your model.
* `num_train_epochs` : number of epochs. It will depend on your machine, batch size, etc.
* `per_device_train_batch_size` : batch size per GPU for training. Here again the number will depend on your machine. If you have a weak GPU, I advise you to use 8 or 16.
* `per_device_eval_batch_size` : batch size per GPU for **testing**. During evaluation, no gradients are computed and no backpropagation is run, so you can set a larger batch size.
* `learning_rate` : by default it is `5e-5`, but you will most likely have to change it. Again, only your own tests can determine a good learning rate.
* `logging_dir` : directory path for storing logs.





In [42]:
training_args = TFTrainingArguments(
    output_dir='./results',          
    num_train_epochs=3,              
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=16,
    learning_rate=1e-6,
    logging_dir='./logs',            
)

We are also going to track additional metrics, notably the F1 score.   
[Copy and paste the compute_metrics found in this documentation.](https://huggingface.co/transformers/training.html#codecell14)

In [43]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # average='binary' only works with two classes; with 95 intents the scores
    # must be averaged across classes ('weighted' weights each class by its support)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
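
For a multi-class problem such as this one, the F1 score has to be averaged across classes; "macro" and "weighted" are two common choices. The pure-Python sketch below works through the arithmetic on made-up label ids for three classes. `per_class_f1` is a hypothetical helper written for this illustration, and unlike scikit-learn it ignores labels that appear only in the predictions.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label present in y_true."""
    scores = {}
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up label ids, only to illustrate the arithmetic.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

f1 = per_class_f1(y_true, y_pred)   # class 0: 0.75, class 1: 0.8, class 2: 0.0
support = Counter(y_true)           # number of true samples per class

macro = sum(f1.values()) / len(f1)                            # every class counts equally
weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)  # classes weighted by support
print(round(macro, 3), round(weighted, 3))  # 0.517 0.657
```

With very unbalanced classes, as in this dataset, the macro average is pulled down by rare classes that are predicted poorly, while the weighted average mostly reflects the frequent ones.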

**Exercise :** Create the ``trainer`` variable and instantiate the ``Trainer()`` or ``TFTrainer()`` class. You need to specify several parameters :
* `model` : the `model` variable.
* `args` : the `training_args` variable.
* `compute_metrics` : the `compute_metrics` function.
* `train_dataset` : the `dataset_train` variable.
* `eval_dataset` : the `dataset_test` variable.

In [44]:
trainer = TFTrainer(model=model, args=training_args, compute_metrics=compute_metrics,
                    train_dataset= dataset_train, eval_dataset=dataset_test)

**Exercise :** Train your model with `trainer.train()` method.

In [46]:
trainer.train()

TypeError: 'NoneType' object is not callable

## Evaluate your model

**Exercise :** Evaluate your model with `trainer.evaluate()` method.

In [47]:
trainer.evaluate()

KeyboardInterrupt: 

If you do not have an f1 score of at least 0.8, your model could be improved. If your score is very low or stagnant, change the learning rate values and adjust the batch size. You can also increase the number of epochs. Unfortunately, there is no magic parameter, it all depends on your environment. You will have to do some tests to find the right hyper-parameters.

**Exercise :** Test your model by making a prediction on the phrase "Hello how are you?".
You should get the label "smalltalk_greetings_how_are_you".
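
Turning the model output back into a readable label can be sketched as follows, assuming the model returns one raw score (logit) per label. The three-entry mapping and the logits are made up for the illustration. In the notebook, `label_dict` (id -> label) would play the role of `id_to_label`, and `decode_prediction` is a hypothetical helper.

```python
import math


def decode_prediction(logits, id_to_label):
    """Turn one row of raw logits into (label, probability) via softmax + argmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
    return id_to_label[best], probs[best]


# Hypothetical mapping and logits (the real model returns 95 scores per sentence).
id_to_label = {
    0: "smalltalk_agent_acquaintance",
    1: "smalltalk_greetings_how_are_you",
    2: "smalltalk_confirmation_yes",
}
logits = [0.2, 2.5, -1.0]

label, prob = decode_prediction(logits, id_to_label)
print(label)  # smalltalk_greetings_how_are_you
```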

## Building the BERT model with ktrain

### Creating the training and test sets

In [61]:
import numpy as np
import pandas as pd
import tensorflow as tf
import ktrain
from ktrain import text as txt

In [68]:
# Split the dataset and preprocess it for BERT
(X_train_bert, y_train_bert), (X_test_bert, y_test_bert), preproc = txt.texts_from_df(
    odile, 'sentence',
    label_columns='id_label',
    lang='en',
    maxlen=256,
    val_pct=0.2,
    preprocess_mode='bert')

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


In [69]:
# Load model
model = txt.text_classifier('bert', (X_train_bert, y_train_bert), preproc=preproc)

Is Multi-Label? False
maxlen is 256
done.


In [70]:
# Wrap model in ktrain learner object
learner = ktrain.get_learner(model, train_data=(X_train_bert, y_train_bert),
                             val_data=(X_test_bert, y_test_bert), batch_size=8)

In [1]:
# Find a good learning rate
# learner.lr_find()  # briefly simulates training to find a good learning rate
# learner.lr_plot()

In [2]:
# Train model and stop at best epochs
# learner.autofit(1e-5)

In [3]:
# getting predictor variable
# predictor = ktrain.get_predictor(learner.model, preproc)

In [4]:
# make prediction
# y_pred = predictor.predict(X_new)  # X_new: a list of raw sentences, e.g. test_df['text'].tolist()
# y_pred

In [None]:
# Train the model with a fixed learning rate (searching for a good one takes too long)
learner.fit_onecycle(3e-5, 2)



begin training using onecycle policy with max lr of 3e-05...
Epoch 1/2
 11/156 [=>............................] - ETA: 1:55:27 - loss: 4.7916 - accuracy: 0.0010    