# BERT
In this section we will use BERT to classify our documents.
We will use the pytorch-transformers package by Hugging Face, which provides a PyTorch implementation of the model.

BERT can be used in different ways:
1. Use the model weights pre-trained on a large corpus to predict the labels for our custom task
1. Fine-tune the model weights by unfreezing a certain number of the top layers, thereby better fitting the model to our custom task

We will use the [FastBert](https://github.com/kaushaltrivedi/fast-bert) package, which simplifies the use of BERT and similar transformer models and provides an API in the spirit of [fast.ai](https://github.com/fastai/fastai), aiming to expose the most important settings and take care of the rest for you.

In [None]:
!pip install fast_bert

### 1. Create a DataBunch object
The databunch object takes training, validation and test CSV files and converts the data into the internal representation for BERT, RoBERTa, DistilBERT or XLNet. The object also instantiates the correct data loaders based on the device profile, the batch size and the maximum sequence length.

For this workshop, we will use the DistilBERT model, from the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). The model is a 6-layer, 768-hidden, 12-heads, 66M parameters model (compared to BERT Base which is a 12-layer, 768-hidden, 12-heads, 110M parameters model, and BERT large which is a 24-layer, 1024-hidden, 16-heads, 340M parameters model) and trains faster, with lighter memory requirements.
Check [this link](https://huggingface.co/transformers/pretrained_models.html) for more info on the available pretrained models.
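
The parameter counts quoted above can be roughly reproduced with some arithmetic. The sketch below is a back-of-the-envelope estimate, not an official formula. It assumes a 30,522-token vocabulary, 512 position embeddings and a feed-forward size of four times the hidden size (typical values for the uncased English checkpoints, none of which are stated above), and the function name is made up for illustration. Note that the number of attention heads does not affect the count.

```python
def estimate_encoder_params(layers, hidden, vocab=30522, max_pos=512,
                            type_vocab=2, pooler=True):
    """Rough parameter count for a BERT-style encoder (assumed layout)."""
    ffn = 4 * hidden  # assumption: feed-forward size is 4x the hidden size
    # Token, position and segment embeddings, plus one LayerNorm (weight + bias)
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block, each with a bias
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer (weight + bias each)
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + pooler_params

# DistilBERT drops the segment embeddings and the pooler
distilbert = estimate_encoder_params(6, 768, type_vocab=0, pooler=False)
bert_base = estimate_encoder_params(12, 768)
bert_large = estimate_encoder_params(24, 1024)

print(f"DistilBERT: {distilbert / 1e6:.1f}M")  # 66.4M
print(f"BERT base:  {bert_base / 1e6:.1f}M")   # 109.5M
print(f"BERT large: {bert_large / 1e6:.1f}M")  # 335.1M
```

The estimate for BERT large lands at about 335M, slightly below the commonly quoted 340M figure.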

In [None]:
model_type = 'distilbert'
tokenizer = 'distilbert-base-uncased'
multi_label = False

DATA_PATH = 'data/'
LABEL_PATH = 'data/'

In [1]:
from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer=tokenizer,
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=False,
                          multi_label=multi_label,
                          model_type=model_type)

The CSV files should contain the columns `index`, `text` and `label`. If your column names differ from the default `text` and `label`, pass them to the databunch through the `text_col` and `label_col` parameters.  
`labels.csv` contains a list of all unique labels, or all possible tags in the case of multi-label classification.
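
As an illustration of the expected layout, the sketch below writes a tiny `train.csv` and a matching `labels.csv` with the standard `csv` module. The records and the `write_dataset` helper are hypothetical, and the example assumes that `labels.csv` holds one label per line without a header row; check the FastBert documentation if your version expects something different.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy records; replace with your own documents.
records = [
    ("the match went to extra time", "sport"),
    ("the central bank raised interest rates", "business"),
    ("the striker signed a new contract", "sport"),
]

def write_dataset(data_dir, records, file_name="train.csv"):
    """Write one split with index/text/label columns, plus labels.csv."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with open(data_dir / file_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "text", "label"])
        for i, (text, label) in enumerate(records):
            writer.writerow([i, text, label])
    # Assumption: one unique label per line, no header row
    labels = sorted({label for _, label in records})
    with open(data_dir / "labels.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for label in labels:
            writer.writerow([label])
    return labels

data_dir = tempfile.mkdtemp()  # throwaway folder so real data is not overwritten
labels = write_dataset(data_dir, records)
print(labels)  # ['business', 'sport']
```

In practice the label list should be built from all splits together, so that a label appearing only in the validation set is not missed.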

### 2. Create a Learner Object
BertLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model such as training, validation and inference.

In [2]:
import torch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
import logging

OUTPUT_DIR = DATA_PATH
logger = logging.getLogger()
logger.setLevel(logging.INFO)

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path='distilbert-base-uncased',
    metrics=[{'name': 'accuracy', 'function': accuracy}],
    device=torch.device("cuda"),  # GPU; use torch.device("cpu") if none is available
    logger=logger,
    output_dir=OUTPUT_DIR,
    finetuned_wgts_path=None,
    warmup_steps=500,
    multi_gpu=True,  # set to False on a single-GPU machine
    is_fp16=False,
    multi_label=multi_label,
    logging_steps=100)

| parameter | description |
| ------------------- | ------------------------------------------------------------ |
| databunch | Databunch object created earlier |
| pretrained_path | Directory containing the pretrained model files, or the name of one of the pretrained models, e.g. bert-base-uncased, xlnet-large-cased, etc. |
| metrics | List of metric functions that you want the model to calculate on the validation set, e.g. accuracy, fbeta, etc. |
| device | torch.device of type _cuda_ or _cpu_ |
| logger | Logger object |
| output_dir | Directory where the model saves trained artifacts, tokenizer vocabulary and tensorboard files |
| finetuned_wgts_path | Location of a fine-tuned language model (experimental feature) |
| warmup_steps | Number of warmup steps for the learning rate scheduler |
| multi_gpu | Whether multiple GPUs are available, e.g. when running on an AWS p3.8xlarge instance |
| is_fp16 | Whether to use FP16 training |
| multi_label | Whether this is a multi-label classification task |
| logging_steps | Number of steps between each tensorboard metrics calculation. Set it to 0 to disable tensorboard logging. Setting this value too low will slow down training, as the model is evaluated each time the metrics are logged |
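
To choose `warmup_steps` and `logging_steps` sensibly, it helps to know how many optimizer steps a run will take. The following is a rough sketch, assuming a single GPU, no gradient accumulation, and a hypothetical training set of 20,000 documents (this number is not taken from the workshop data). The `training_budget` helper is illustrative only.

```python
import math

def training_budget(n_train, batch_size, epochs, warmup_steps, logging_steps):
    """Estimate step counts for a run (single GPU, no gradient accumulation)."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Share of the whole run spent warming up the learning rate
        "warmup_fraction": warmup_steps / total_steps,
        # How many times metrics are logged (0 disables logging)
        "logging_events": total_steps // logging_steps if logging_steps > 0 else 0,
    }

budget = training_budget(n_train=20_000, batch_size=16, epochs=6,
                         warmup_steps=500, logging_steps=100)
print(budget)
```

With a much smaller training set, 500 warmup steps can cover a large share of the run, in which case a lower value may be worth trying.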


### 3. Train the model


In [3]:
learner.fit(epochs=6,
            lr=6e-5,
            validate=True, 	# Evaluate the model after each epoch
            schedule_type="warmup_cosine",
            optimizer_type="lamb")

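
The `warmup_cosine` schedule used above typically ramps the learning rate up linearly during the warmup steps and then decays it along a half cosine. The function below is a plain-Python sketch of that common shape, not FastBert's own code; `total_steps=7500` is a hypothetical value, and the library's implementation may differ in details.

```python
import math

def warmup_cosine_lr(step, base_lr=6e-5, warmup_steps=500, total_steps=7500):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, capped at 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

for step in (0, 250, 500, 4000, 7500):
    print(step, f"{warmup_cosine_lr(step):.2e}")
```

The rate peaks at `lr` exactly when warmup ends, which is why `warmup_steps` should be well below the total number of training steps.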

### 4. Save trained model artifacts
Model artifacts are saved under `output_dir/model_out`, where `output_dir` is the path provided to the learner object. The following files are saved:

| File name               | description                                      |
| ----------------------- | ------------------------------------------------ |
| pytorch_model.bin       | trained model weights                            |
| spiece.model            | SentencePiece tokenizer vocabulary (for XLNet models) |
| vocab.txt               | WordPiece tokenizer vocabulary (for BERT models) |
| special_tokens_map.json | special tokens mappings                          |
| config.json             | model config                                     |
| added_tokens.json       | list of new tokens                               |

Because all model artifacts are stored in the same folder, you can later instantiate the learner object for inference by pointing `pretrained_path` to this location.

In [None]:
learner.save_model()
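
After saving, a quick check of the output folder can confirm which of the files listed above were actually written. The helper below is a hypothetical convenience, not part of FastBert, and it runs here against a throwaway directory so that it is self-contained. Not every file is expected for every model type (for example, `spiece.model` applies to XLNet models only).

```python
import tempfile
from pathlib import Path

# File names taken from the table above
EXPECTED_FILES = [
    "pytorch_model.bin",
    "spiece.model",
    "vocab.txt",
    "special_tokens_map.json",
    "config.json",
    "added_tokens.json",
]

def check_artifacts(model_dir):
    """Map each expected file name to True if it exists in model_dir."""
    model_dir = Path(model_dir)
    return {name: (model_dir / name).is_file() for name in EXPECTED_FILES}

# Demo on a temporary folder holding a few empty placeholder files
demo_dir = Path(tempfile.mkdtemp())
for name in ("pytorch_model.bin", "config.json", "vocab.txt"):
    (demo_dir / name).touch()

status = check_artifacts(demo_dir)
for name, present in status.items():
    print(f"{name:25s} {'found' if present else 'missing'}")
```

In the notebook, the same helper could be pointed at the `model_out` folder inside `OUTPUT_DIR`.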

### 5. Model Inference


To perform inference, you first need to save the model (see step 4 above).  
To make predictions, we initialize a `BertClassificationPredictor` object with the path of the model files:

In [None]:
from fast_bert.prediction import BertClassificationPredictor

MODEL_PATH = OUTPUT_DIR + 'model_out'

predictor = BertClassificationPredictor(
                model_path=MODEL_PATH,
                label_path=LABEL_PATH, # location for labels.csv file
                multi_label=False,
                model_type=model_type,
                do_lower_case=True)

In [None]:
# Single prediction
sentence_to_predict = "the problem with the middleeast is those damn data scientists. All they do is just look at the computer screen instead of laying outside in the warm sun."
single_prediction = predictor.predict(sentence_to_predict)
single_prediction

In [None]:
# Batch predictions
texts = [
    "this is the first text",
    "this is the second text"
    ]

multiple_predictions = predictor.predict_batch(texts)
multiple_predictions

Next, let's run the model on our validation dataset and report its performance.

In [None]:
# Batch predictions on the validation set
import pandas as pd
df = pd.read_csv(DATA_PATH + 'val.csv')  # the validation file passed to the databunch
texts = df['text'].tolist()

multiple_predictions = predictor.predict_batch(texts)
# Each prediction includes the softmax value of all possible labels, sorted in descending order.
# We thus use only the first element, which is the most probable label, and keep only the first element in that tuple, which is the name of the label
y_predicted = [x[0][0] for x in multiple_predictions]
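
To turn these predictions into a performance number, compare the top label of each prediction with the true label. A minimal sketch in plain Python follows. The predictions and labels are made-up toy values that mimic the structure described above (a list of `(label, probability)` pairs per text, sorted by probability), and the helper names are illustrative only.

```python
from collections import Counter

# Toy predictions in the assumed predict_batch output structure
toy_predictions = [
    [("sport", 0.91), ("business", 0.09)],
    [("business", 0.75), ("sport", 0.25)],
    [("sport", 0.60), ("business", 0.40)],
    [("business", 0.55), ("sport", 0.45)],
]
toy_true = ["sport", "business", "business", "business"]

def top_labels(predictions):
    """Keep the name of the most probable label for each text."""
    return [pred[0][0] for pred in predictions]

def accuracy_score(y_true, y_pred):
    """Fraction of texts whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_pred = top_labels(toy_predictions)
print(accuracy_score(toy_true, y_pred))  # 0.75

# Count (true, predicted) pairs for the mistakes, to see which labels get confused
errors = Counter((t, p) for t, p in zip(toy_true, y_pred) if t != p)
print(errors)
```

In the notebook, the same functions could be applied to `multiple_predictions` and the label column of the validation data frame, assuming that column stores the labels as the same strings that appear in `labels.csv`.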

**Try it yourself:** Can you improve the performance of our classifier? Let's see how high you can get.  
Good luck!

### Language Model Fine-Tuning
It is also possible to fine-tune BERT's language model to fit a custom domain. This process requires a long training time, even on powerful GPUs. See [this link](https://github.com/kaushaltrivedi/fast-bert#language-model-fine-tuning) for instructions.