# Text Classification with BERT

![bert](https://res.cloudinary.com/practicaldev/image/fetch/s--ozy733MJ--/c_imagga_scale,f_auto,fl_progressive,h_420,q_auto,w_1000/https://dev-to-uploads.s3.amazonaws.com/i/q5e65ugnue96bir3usyk.png)

BERT (Bidirectional Encoder Representations from Transformers) is an NLP model developed by Google in 2018. It comes pre-trained on a large text corpus that includes about 2,500M words from English Wikipedia.

To accomplish a particular NLP task, the pre-trained BERT model is used as a base and refined by adding an additional layer; the model can then be trained on a labeled dataset dedicated to the NLP task to be performed. This is the very principle of transfer learning. It is important to note that BERT is a very large model with 12 layers, 12 attention heads and 110 million parameters (BERT base).

BERT-based models have been applied to many NLP tasks, including:

*   Translation
*   Text generation
*   Classification
*   Question-answering
*   ...

### Why BERT?

Looking at the General Language Understanding Evaluation ([GLUE](https://gluebenchmark.com/)) benchmark [leaderboard](https://gluebenchmark.com/leaderboard), it's easy to see that many of the models on the list are variants of BERT.

## Let's go!
To use BERT you need either PyTorch or TensorFlow installed in your environment. Access to a GPU is also preferable. If you don't have one, you can use [Google Colab](https://colab.research.google.com/).

Next, let's install the `transformers` package from Hugging Face. This package provides pre-trained models such as BERT together with PyTorch and TensorFlow interfaces to them.




In [None]:
!pip install transformers

## Load the Data

For this project we will use the data from Odile. Odile is a bot that tries to answer general questions on a few BeCode Discord servers. The sentences all come from conversations between learners and Odile on Discord.

You'll find the data in `./dataset/odile_data.csv`. Load it into a dataframe and display it.

**Tip:** if you are using Google Colab, you can upload the CSV to your Google Drive and connect your notebook to it (search online for how to do that!).





## Explore the data

It's time to take a quick look at our data.

As you can see, the questions from the learners are classified by intent (i.e. the goal the user has in mind when typing a question or comment).

**Exercise:** Use your data exploration and visualization skills to answer the following questions:

*   How many observations does the dataset contain?
*   How many different labels does the dataset contain?
*   Which labels contain the most observations?
*   Which labels contain the fewest observations?
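
These questions can be approached with a few pandas calls. The sketch below runs on a tiny inline sample rather than the real file, and the column names `text` and `intent` are assumptions; check the actual header of `odile_data.csv` and adjust them.

```python
import io

import pandas as pd

# Tiny stand-in for ./dataset/odile_data.csv; the column names are assumed.
sample_csv = io.StringIO(
    "text,intent\n"
    "hello odile,greeting\n"
    "hi there,greeting\n"
    "where is the planning?,planning\n"
    "thanks a lot,thanks\n"
    "good morning,greeting\n"
)
df = pd.read_csv(sample_csv)

n_observations = len(df)                     # number of rows
n_labels = df["intent"].nunique()            # number of distinct intents
label_counts = df["intent"].value_counts()   # sorted, most frequent first

print(n_observations, n_labels)
print(label_counts)
```

The head of `label_counts` shows the most frequent intents and its tail the rarest ones; a bar plot of the same series makes any imbalance easy to spot.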

## It's time to clean up !


Not all NLP tasks require the same preprocessing. In this case, we have to ask ourselves some questions: 

- Are there unwanted characters in the dataset? For example, do you want to keep the smileys or not?  
  - If, for example, you want to create labels for sentiment analysis, it might be preferable to keep the smileys.
- Is it relevant to keep capital letters in sentences?
  - In this case, capital letters don't really matter: not everyone starts their sentences with a capital letter when chatting, and the sentences are quite short and addressed directly to Odile. 
- Is it necessary to limit the number of words in a sentence?
  - In this case it may be preferable to do so. The questions asked to Odile are supposed to be short, and overly long sentences could interfere with the classification if they contain too much information.

There is no universal answer. Everything will depend on the expected result. 

**Exercise :** Clean the dataset.
- Remove all unnecessary characters. You can choose whether or not to keep the smileys.
- Put all sentences in lower case.
- Limit text to 256 words.

What other preprocessing steps can you think of?
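
As a starting point, the three cleaning steps can be combined in one small function. This is a minimal sketch: the helper name `clean_text` is hypothetical, and the set of characters it keeps (unaccented letters, digits, apostrophes and question marks) is an assumption that you should adapt to your data, for instance if you decide to keep smileys or accented letters.

```python
import re

MAX_WORDS = 256  # limit suggested in the exercise


def clean_text(text: str, max_words: int = MAX_WORDS) -> str:
    """Lowercase, drop unwanted characters, and cap the number of words."""
    text = text.lower()
    # Keep letters, digits, whitespace, apostrophes and question marks;
    # everything else (smileys, other punctuation) becomes a space.
    text = re.sub(r"[^a-z0-9\s'?]", " ", text)
    # split() with no argument also collapses repeated whitespace.
    words = text.split()
    return " ".join(words[:max_words])


print(clean_text("Hello   Odile!! Where is the PLANNING? :-)"))
```

With a dataframe, the same function could be applied to the text column with `.apply(clean_text)`.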

## Defining observations (`X`) and labels (`y`)

As you know, training a model requires a set of observations (`X`) and their corresponding labels (`y`).

In that case, `X` is your clean text and `y` is the intent.

Do not forget that we are dealing with a multi-class classification problem, so you may have to **one-hot encode** the target values. Keep track of the mapping between the one-hot encoding and the labels in a dictionary.
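
One possible way to build the encoding and the mapping is sketched below in plain Python. The intents in `labels` are made up for illustration, and the names `label_to_index`, `index_to_label` and `one_hot` are hypothetical; in the notebook the labels would come from your dataframe.

```python
# Toy intents; in the notebook these would come from the label column.
labels = ["greeting", "planning", "greeting", "thanks"]

# Sort the unique labels so the mapping is reproducible between runs.
classes = sorted(set(labels))
label_to_index = {label: i for i, label in enumerate(classes)}
index_to_label = {i: label for label, i in label_to_index.items()}


def one_hot(label: str) -> list:
    """Return a vector with a 1 at the label's index and 0 elsewhere."""
    vector = [0] * len(classes)
    vector[label_to_index[label]] = 1
    return vector


y = [one_hot(label) for label in labels]
print(label_to_index)
print(y)
```

Keeping `index_to_label` around is what later lets you turn a predicted class index back into a readable intent.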

## Split your dataset!

After all this time, I dare to hope that it is not necessary to explain this step anymore!

**Exercise :** Create the variables `X_train`, `X_val`, `X_test`, `y_train`, `y_val` and `y_test`. 
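
If you want to see the mechanics, here is a sketch of a three-way split written with the standard library only. The helper `three_way_split` and the 80/10/10 proportions are assumptions, not part of the exercise. In practice, scikit-learn's `train_test_split` (called twice, optionally with `stratify`) is the more common choice, and unlike this sketch it can preserve the label proportions in each subset.

```python
import random


def three_way_split(X, y, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(X) * test_fraction)
    n_val = int(len(X) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(seq, idx):
        # Select the same positions from X and y so pairs stay aligned.
        return [seq[i] for i in idx]

    return (pick(X, train_idx), pick(X, val_idx), pick(X, test_idx),
            pick(y, train_idx), pick(y, val_idx), pick(y, test_idx))


# Toy data: 20 sentences with alternating labels.
X = [f"sentence {i}" for i in range(20)]
y = [i % 2 for i in range(20)]
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```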

## Tokenization 
If you need a refresher on tokenization, look [here](../1.preprocessing/1.tokenization.ipynb).

We will use the tokenizer that ships with DistilBERT, a smaller, distilled version of BERT. Its vocabulary is already built, which will save us time. 

**Exercise:** Create a `tokenizer` variable by calling `DistilBertTokenizer.from_pretrained()` from `transformers`. Load the `distilbert-base-uncased` model ("uncased" means the model is case-insensitive).

Read more: [Tokenizer documentation](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertTokenizer).

In [None]:
import tensorflow as tf
from transformers import DistilBertTokenizer


### Tokenize the dataset

Good! We have instantiated our tokenizer, but we have not yet encoded our texts.
To do this, we apply the tokenizer to our dataset. This converts each text into a sequence of token IDs that the model can read.


**Exercise:** Create the `train_encodings`, `val_encodings` and `test_encodings` by calling the tokenizer on `X_train`,  `X_val` and `X_test`.

You need to know 3 parameters. 

- **max_length:** Maximum length of the sequence, in tokens. You can set it to 200.
- **truncation:** Truncates sequences to the maximum length given by the `max_length` argument. When the input is a pair of sequences, tokens are removed one at a time from the longer sequence until the proper length is reached. You can set it to `True`.
- **padding:** Pads the sequences so that they all have the same length. You can set it to `True`.

[More info here](https://huggingface.co/docs/transformers/preprocessing)
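
To make truncation and padding concrete, the toy sketch below mimics their effect on lists of token IDs without using the real tokenizer. The helper `pad_and_truncate`, the token IDs and the padding ID `0` are all assumptions chosen for illustration. The real tokenizer does more than this, for example it keeps the special end-of-sequence token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Mimic truncation=True and padding=True on lists of token IDs."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Padding: extend every sequence to the longest one left in the batch.
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (target - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of token IDs with different lengths.
batch = [[5, 6, 7], [5, 8, 9, 10, 11, 12]]
encodings = pad_and_truncate(batch, max_length=5)
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

The output has the same two keys, `input_ids` and `attention_mask`, that the DistilBERT tokenizer returns, which is why the encodings can later be converted to a `dict` and fed to the model.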

## Prepare the datasets for training

You can now convert your training, validation and test sets into datasets that contain both observations and labels. Use the [from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) method from TensorFlow to create the datasets. Pass it a tuple with two elements:

*   The encodings that you have just created (cast to a `dict`)
*   The labels



In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

## Training

### Load BERT model

You will need to load the pre-trained DistilBERT model using the class `TFDistilBertForSequenceClassification`.

⚠️ You must use the same checkpoint as the one used for tokenization, so in our case `distilbert-base-uncased`. 

* [BERT for Sequence Classification Documentation](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertForSequenceClassification)

**Exercise:** Create a `model` variable and load it using `TFDistilBertForSequenceClassification.from_pretrained()`. As a parameter, you must indicate the number of labels (get this number from your original dataframe).



In [None]:
from transformers import TFDistilBertForSequenceClassification

### Training arguments

Let's define the training arguments and compile our model.

*   Define the optimizer (Adam) and its learning rate
*   Define the loss function that will be used (remember that we have one-hot encoded output data)
*   Define appropriate evaluation metrics
*   Compile the model with the right metrics
*   Display the model summary

In [None]:
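OPTIMIZER = ...  # TODO: an Adam optimizer with your chosen learning rate
LOSS = ...       # TODO: a loss that matches one-hot encoded labels
METRICS = ...    # TODO: a list of evaluation metrics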

model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
model.summary()
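
With one-hot encoded labels, the usual fit is categorical cross-entropy. The Hugging Face TensorFlow classification models return raw logits rather than probabilities, so in Keras this loss is typically created with `from_logits=True`, for example `tf.keras.losses.CategoricalCrossentropy(from_logits=True)`. To show what that loss measures, the sketch below computes it by hand for a single example using only the standard library; the logits and the target are made up.

```python
import math


def categorical_cross_entropy_from_logits(logits, one_hot_target):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    # Subtract the max logit for numerical stability before exponentiating.
    shifted = [z - max(logits) for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    log_probs = [z - log_sum_exp for z in shifted]
    # Only the true class contributes, because the target is one-hot.
    return -sum(t * lp for t, lp in zip(one_hot_target, log_probs))


logits = [2.0, 0.5, -1.0]   # made-up model outputs for three intents
target = [1, 0, 0]          # the true intent is class 0
loss = categorical_cross_entropy_from_logits(logits, target)
print(round(loss, 4))
```

The loss shrinks as the logit of the true class grows relative to the others, which is what training pushes the model towards.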


### Training

First define the number of epochs and the batch size for the training.

The batch size will depend on your machine. If you have a weak GPU, I advise a batch size of 8 or 16.

The number of epochs will depend on your machine, the batch size, etc. You can start with 5, for example.

In [None]:
BATCH_SIZE = 8
EPOCHS = 5

In [None]:
history = model.fit(
    train_dataset.batch(BATCH_SIZE),
    epochs=EPOCHS,
    validation_data=val_dataset.batch(BATCH_SIZE)
)

### Plot the learning curve of your model

In [None]:
from matplotlib import pyplot as plt

def plot_history(history):
    """Plot accuracy and loss for the training and validation sets.

    `history` is the Keras `History` object returned by `model.fit`. The
    'accuracy' keys assume the model was compiled with a metric named
    'accuracy'.
    """
    fig, axs = plt.subplots(1, 2, figsize=(12, 6))

    # Left panel: accuracy per epoch.
    axs[0].plot(history.history['accuracy'], label='training set')
    axs[0].plot(history.history['val_accuracy'], label='validation set')
    axs[0].set(xlabel='Epoch', ylabel='Accuracy', ylim=[0, 1])
    axs[0].legend(loc='lower right')

    # Right panel: loss per epoch.
    axs[1].plot(history.history['loss'], label='training set')
    axs[1].plot(history.history['val_loss'], label='validation set')
    axs[1].set(xlabel='Epoch', ylabel='Loss', ylim=[0, 10])
    axs[1].legend(loc='upper right')

    plt.show()

plot_history(history)

## Model Evaluation

We can now evaluate our model on the test set. Use the `model.evaluate()` function.

In [None]:
loss, accuracy = model.evaluate(test_dataset.batch(BATCH_SIZE))
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

**Exercise:** Is accuracy the best metric for this dataset? Explain your answer!
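
One way to investigate this question is to compare accuracy with per-class scores. The sketch below uses made-up labels and predictions, and the helper `per_class_f1` is hypothetical; scikit-learn's `classification_report` gives a fuller version of the same idea.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1 score} computed from true and predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


# Made-up predictions: the rare "thanks" intent is never predicted.
y_true = ["greeting"] * 8 + ["thanks"] * 2
y_pred = ["greeting"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)
print(accuracy, f1, macro_f1)
```

Compare the two numbers on your own predictions and think about what the gap tells you about the label distribution.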

## Test your model

Well done, you did it :-)

Oh... I have an idea! Try to classify the sentence *Well done !* with your model.

Remember to apply all the preprocessing steps before predicting the intent of the user.

**Tip:** use the mapping you created above to retrieve the original label of the prediction!

In [None]:
text = "Well done !"
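
After cleaning and tokenizing the sentence and passing it to the model, you get one logit per intent. The last step, turning those logits back into a label, can be sketched as follows. The mapping and the logits here are made up; use the dictionary you built when encoding the labels and the output of your own model.

```python
# Hypothetical mapping built when the labels were encoded.
index_to_label = {0: "greeting", 1: "planning", 2: "thanks"}

# Made-up logits standing in for the model's output on one sentence.
logits = [0.3, -1.2, 2.7]

# The predicted class is the index of the largest logit.
predicted_index = max(range(len(logits)), key=lambda i: logits[i])
predicted_label = index_to_label[predicted_index]
print(predicted_label)
```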


