## Experiments with using Transformer LMs for sentence classification

1. Finetune a classifier head on top of pretrained BERT.
2. Take embeddings from pretrained BERT and train a classifier on top of them. This is not finetuning of BERT, since BERT is used only to produce the embeddings.
3. Finetune a GPT-based LM to classify sentences.

- Use pretrained DistilBERT model from HuggingFace

In [None]:
from IPython.display import display, HTML
display(HTML("<style>:root { --jp-notebook-max-width: 100% !important; }</style>"))

In [None]:
import numpy as np
import pandas as pd

In [None]:
from functools import partial

In [None]:
from transformers import AutoTokenizer

In [None]:
from transformers import DistilBertModel, DistilBertConfig

In [None]:
from transformers import DataCollatorWithPadding

In [None]:
import evaluate

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
from datasets import load_dataset

In [None]:
import torch

In [None]:
torch.cuda.is_available()

## 1. Finetune a classifier head on top of pretrained BERT

In [None]:
# train_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})
# dev_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/dev.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})
# test_df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None).rename(columns={0: "text", 1:"label"})

# dev_df.shape

# base_url = "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/"
# data = load_dataset("csv", data_files={"train": f"{base_url}train.tsv", "validate": f"{base_url}dev.tsv", "test": f"{base_url}test.tsv"}, delimiter="\t")

# train_df["data_type"] = "train"
# dev_df["data_type"] = "dev"
# test_df["data_type"] = "test"

# df = pd.concat([train_df, dev_df, test_df], axis=0).reset_index(drop=True)

# print(f"Shape of df: {df.shape}")

# df.head()

# ## Quite balanced
# df[df.data_type =='train'].label.value_counts(normalize=True)

## Load dataset at https://huggingface.co/datasets/stanfordnlp/sst2

In [None]:
df = load_dataset('stanfordnlp/sst2')

DistilBERT is a Transformer model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained with three objectives:

1. Distillation loss: the model was trained to return the same probabilities as the BERT base model.
2. Masked language modeling (MLM): this is part of the original training loss of the BERT base model. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
3. Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.

https://huggingface.co/distilbert/distilbert-base-uncased
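As a rough illustration of the three objectives above, here is a minimal pure-Python sketch. It is not the actual DistilBERT training code: the temperature of 2.0, the equal weighting of the three terms, and all toy numbers are assumptions chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's.
    The temperature value is an assumption, not taken from the notebook."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

def mlm_loss(student_logits, target_id):
    """Cross-entropy of the student's prediction for one masked position."""
    return -math.log(softmax(student_logits)[target_id])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

# Toy values over a 3-token vocabulary (made up for illustration)
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]

# Equal weighting of the three terms is a simplifying assumption
total = (distillation_loss(student_logits, teacher_logits)
         + mlm_loss(student_logits, target_id=0)
         + cosine_loss([0.2, 0.4, 0.1], [0.25, 0.35, 0.15]))
print(f"combined toy loss: {total:.4f}")
```

Each term is zero (or minimal) exactly when the student matches the teacher or the target, which is what makes the sum usable as a training signal.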

## Step 1: Get the tokenizer for the specific model

In [None]:
## Based on the name of the model (distilbert), AutoTokenizer automatically instantiates one of the library's tokenizer classes from a pretrained model vocabulary.
## https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
## WordPiece-based tokenizer
## Returns DistilBertTokenizerFast when use_fast=True, otherwise DistilBertTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", use_fast=True)

In [None]:
print(f"tokenizer model_max_length: {tokenizer.model_max_length}")  ## a very large value here means no real limit is configured, so do not rely on it
print(f"tokenizer truncation_side: {tokenizer.truncation_side}")
print(f"tokenizer padding_side: {tokenizer.padding_side}") 
print(f"tokenizer model_input_names: {tokenizer.model_input_names}") 
print(f"tokenizer bos_token: {tokenizer.bos_token}") 
print(f"tokenizer eos_token: {tokenizer.eos_token}") 
print(f"tokenizer unk_token: {tokenizer.unk_token}") 
print(f"tokenizer sep_token: {tokenizer.sep_token}") 
print(f"tokenizer pad_token: {tokenizer.pad_token}") 
print(f"tokenizer cls_token: {tokenizer.cls_token}") 
print(f"tokenizer mask_token: {tokenizer.mask_token}") 

In [None]:
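## Check configuration of pretrained DistilBERT model
## DistilBertConfig() alone would only return the library defaults, so load the config of the checkpoint itself
configuration = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased")
print(f"DistilBERT config: {configuration}")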

In [None]:
def preprocess_function(batch, text_column="text"):
    ## batch is the dict of columns that Dataset.map passes in when batched=True
    ## truncation=True ensures that sequences are no longer than DistilBERT’s maximum input length
    ## https://huggingface.co/docs/transformers/v4.40.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
    return tokenizer(batch[text_column], truncation=True)

## The tokenizer returns input_ids (token ids) and attention_mask, which are the inputs to the model
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True)

## encode returns input_ids
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
sample_encoding = tokenizer.encode('A stirring , Funny and finally transporting re imagining of beauty and the beast and 1930s horror films amzertfys', truncation=True)

## decode converts input_ids back to tokens and joins them into a single string
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer.decode(sample_encoding)

## See the tokenization (WordPiece result) using convert_ids_to_tokens
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer.convert_ids_to_tokens(sample_encoding)
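To make the `##` pieces in the output above easier to read, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. The tiny vocabulary is hypothetical, so the real tokenizer (with its roughly 30k-entry vocabulary) may split the same words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens, greedily taking the
    longest vocabulary match at each position. Non-initial pieces are
    prefixed with '##'. Returns [unk_token] if any part cannot be matched."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"trans", "##port", "##ing", "film", "##s", "funny"}

print(wordpiece_tokenize("transporting", toy_vocab))
print(wordpiece_tokenize("films", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A made-up word like `amzertfys` in the sample sentence gets broken into several short `##` pieces in the same way, as long as each piece exists in the vocabulary.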

## Step 2: Tokenize the entries in the sentence column to get input_ids (token ids) and attention masks

In [None]:
#tokenized_dict_list = preprocess_function(df, text_column="text")
tokenized_df = df.map(partial(preprocess_function, text_column="sentence"), batched=True)

In [None]:
tokenized_df

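## Step 3: Pad shorter sequences so that all sequences in a batch have the same length (in Step 2 we only truncated long sequences)

DataCollatorWithPadding pads dynamically to the longest sequence in each batch, not to a fixed 512 tokens.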

https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
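What the collator does to one batch can be sketched in plain Python. This is only an illustration of the idea, not the library implementation; the helper name `pad_batch`, the pad id of 0, and the token ids in the example are assumptions.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Right-pad every sequence to the length of the longest one in the batch
    and build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids; two sequences of different lengths
batch = pad_batch([[101, 2023, 2003, 2307, 102], [101, 2919, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding is computed per batch, a batch of short sentences stays short, which saves compute compared with padding everything to the model maximum.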

## Step 4: Get evaluation metrics (scikit-learn or the evaluate library)
https://huggingface.co/docs/evaluate/package_reference/loading_methods

In [None]:
##evaluate.list_evaluation_modules(module_type="metric", include_community=True, with_details=True)

In [None]:
[metric for metric in evaluate.list_evaluation_modules(module_type="metric", include_community=True) if 'f1' in metric]

In [None]:
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

In [None]:
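def compute_metrics(eval_pred):
    ## eval_pred unpacks into the model logits and the reference labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    ## each evaluate metric is called through .compute(), which returns a dict such as {"accuracy": ...}
    accuracy_value = accuracy.compute(predictions=predictions, references=labels)
    precision_value = precision.compute(predictions=predictions, references=labels)
    recall_value = recall.compute(predictions=predictions, references=labels)
    f1_value = f1.compute(predictions=predictions, references=labels)
    ## Trainer expects a single dict mapping metric names to values
    return {**accuracy_value, **precision_value, **recall_value, **f1_value}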
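For reference, the four metrics can also be computed by hand from logits and labels. The sketch below is pure Python and independent of the evaluate library; the function name `binary_metrics` and the toy logits are made up for illustration.

```python
def binary_metrics(logits, labels, positive=1):
    """Argmax each row of logits, then compute accuracy, precision,
    recall and F1 for the positive class. Undefined ratios become 0.0."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

# Toy logits for four sentences (columns: negative, positive)
toy_logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-1.0, 1.5]]
toy_labels = [0, 1, 1, 0]
metrics = binary_metrics(toy_logits, toy_labels)
print(metrics)
```

Returning a single dict keyed by metric name mirrors the shape that a `compute_metrics` callback is expected to return.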

## Step 5: Get id2label and label2id mapping

In [None]:
id2label = {0:"negative", 1:"positive"}
label2id = {"negative":0, "positive":1}

## Step 6: Train model
1. Use the Trainer API by Hugging Face, which abstracts the training loop
2. Manually write the training loop in native PyTorch/TensorFlow

### Step 6.1 Use Trainer API
https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

In [None]:
## https://huggingface.co/docs/transformers/v4.40.1/en/model_doc/auto#transformers.AutoModelForSequenceClassification
## the model will be instantiated with a randomly initialized classification head that outputs one logit per label; softmax is applied in the loss, not by the model itself
## https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/configuration#transformers.PretrainedConfig
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

In [None]:
print(f"Pretrained model with classification head architecture: {model}")
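Since the classification head produces raw logits, turning a prediction into a readable label takes a softmax and a lookup in `id2label`. The following is a minimal sketch with made-up logits; `predict_label` is a hypothetical helper, not part of transformers.

```python
import math

id2label = {0: "negative", 1: "positive"}

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, id2label):
    """Return the most probable label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for one sentence (order: negative, positive)
label, prob = predict_label([-1.2, 2.3], id2label)
print(f"{label} ({prob:.3f})")
```

Softmax does not change which class wins, so taking the argmax directly on the logits gives the same label.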

In [None]:
##https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments
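## Sets hyperparameters such as num_train_epochs, learning_rate and weight_decay
## evaluation_strategy matches the v4.40 docs linked in this notebook; newer transformers releases rename it to eval_strategy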
training_arguments = TrainingArguments(output_dir="./results", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=5, weight_decay=0.01, evaluation_strategy="epoch")
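To get a feel for what these arguments imply, the sketch below estimates the number of optimizer steps and the learning rate over time. It rests on several assumptions: a single device, no gradient accumulation, a train split of 67,349 sentences for SST-2, and a linear decay schedule without warmup (which is, as far as I know, the Trainer default). The helper names are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

def linear_schedule_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 67,349 is the assumed size of the SST-2 train split
steps = total_training_steps(num_examples=67349, batch_size=16, num_epochs=5)
print(f"total optimizer steps: {steps}")
for step in (0, steps // 2, steps):
    print(step, linear_schedule_lr(step, steps, base_lr=2e-5))
```

With these settings the learning rate starts at 2e-5 and reaches zero at the final step, so later epochs make progressively smaller updates.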

In [None]:
## https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.Trainer
## Information like the train/validation datasets, the data_collator and the metrics to compute
trainer = Trainer(model=model, args=training_arguments, train_dataset=tokenized_df["train"], eval_dataset=tokenized_df["validation"], tokenizer=tokenizer, data_collator=data_collator, compute_metrics=compute_metrics)

In [None]:
tokenized_df

In [None]:
trainer.train()