## Experiment: using a Transformer LM for sentence classification
https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

1. Finetune a classifier head on top of pretrained BERT (Using Native PyTorch)
<!-- 2. Take embeddings from pretrained BERT and train a classifier on top of it. This is not finetuning of BERT since BERT is used only for getting embeddings
3. Finetune GPT based LM to classify sentence. -->

- Use pretrained DistilBERT model from HuggingFace

In [None]:
from IPython.display import display, HTML
display(HTML("<style>:root { --jp-notebook-max-width: 100% !important; }</style>"))

In [None]:
import numpy as np
import pandas as pd

In [None]:
from functools import partial

In [None]:
from transformers import AutoTokenizer

In [None]:
from transformers import DistilBertModel, DistilBertConfig

In [None]:
from transformers import DataCollatorWithPadding

In [None]:
import evaluate

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
from datasets import load_dataset

In [None]:
import torch

In [None]:
from torch.utils.data import DataLoader

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

In [None]:
from tqdm.auto import tqdm

In [None]:
torch.cuda.is_available()

## 1. Finetune a classifier head on top of pretrained BERT

## Load dataset at https://huggingface.co/datasets/stanfordnlp/sst2

In [None]:
df = load_dataset('stanfordnlp/sst2')

In [None]:
df["test"][0]

DistilBERT is a Transformer model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained with three objectives:

1. Distillation loss: the model was trained to return the same probabilities as the BERT base model.
2. Masked language modeling (MLM): this is part of the original training loss of the BERT base model. Given a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
3. Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.
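
The masking step of the MLM objective can be illustrated with a minimal pure-Python sketch. The function name `mask_tokens` and its parameters are hypothetical, and the sketch simplifies the real recipe: every selected token becomes `[MASK]`, whereas BERT sometimes keeps the original token or swaps in a random one.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly mask_prob of the tokens with mask_token (simplified MLM masking)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere,
    # so a loss would only be computed on the tokens the model must recover.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "a stirring , funny and finally transporting re imagining of beauty and the beast".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the model sees the tokens on both sides of each `[MASK]`, predicting the hidden token forces it to use left and right context together.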

https://huggingface.co/distilbert/distilbert-base-uncased

## Step1: Get tokenizer for specific model

In [None]:
## Based on the model name (distilbert), AutoTokenizer automatically instantiates the matching tokenizer class of the library from the pretrained model vocabulary.
## https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer
## WordPiece-based tokenizer
## Returns DistilBertTokenizerFast when use_fast=True, otherwise DistilBertTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", use_fast=True)

In [None]:
print(f"tokenizer model_max_length: {tokenizer.model_max_length}") ## If this prints a very large value, no real limit is configured and it should not be relied on
print(f"tokenizer truncation_side: {tokenizer.truncation_side}")
print(f"tokenizer padding_side: {tokenizer.padding_side}") 
print(f"tokenizer model_input_names: {tokenizer.model_input_names}") 
print(f"tokenizer bos_token: {tokenizer.bos_token}") 
print(f"tokenizer eos_token: {tokenizer.eos_token}") 
print(f"tokenizer unk_token: {tokenizer.unk_token}") 
print(f"tokenizer sep_token: {tokenizer.sep_token}") 
print(f"tokenizer pad_token: {tokenizer.pad_token}") 
print(f"tokenizer cls_token: {tokenizer.cls_token}") 
print(f"tokenizer mask_token: {tokenizer.mask_token}") 

In [None]:
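## Check configuration of pretrained DistilBERT model
## DistilBertConfig() alone would only give the library defaults, so load the config stored with the checkpoint
configuration = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased")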
print(f"DistilBERT config: {configuration}")

## padding="max_length" pads every sentence to the length given by the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided.
## https://huggingface.co/docs/transformers/v4.40.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
## WE ARE NOT USING DataCollatorWithPadding

In [None]:
def preprocess_function(df, text_column="text"):
    ## truncation=True ensures that sequences are no longer than max_length (512 here, DistilBERT's maximum input length)
    ## https://huggingface.co/docs/transformers/v4.40.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.__call__
    ## Added padding="max_length" SINCE WE ARE NOT USING DataCollatorWithPadding
    return tokenizer(df[text_column], truncation=True, padding="max_length", max_length=512)

## tokenizer returns input_ids (token id) and attention_mask to be input to model
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True, padding="max_length")
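
To make the relationship between `input_ids`, `attention_mask`, truncation and `padding="max_length"` concrete, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and the token ids are hypothetical, and the sketch ignores special tokens (the real tokenizer keeps `[CLS]` and `[SEP]` in place when it truncates).

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    ids = token_ids[:max_length]                    # mimics truncation=True
    attention_mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_token_id] * n_pad              # mimics padding="max_length"
    attention_mask = attention_mask + [0] * n_pad   # 0 marks padding to be ignored
    return {"input_ids": ids, "attention_mask": attention_mask}

# Illustrative ids only; they are not taken from a real tokenizer run.
short_example = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_example = pad_or_truncate(list(range(10)), max_length=4)
print(short_example)
print(long_example)
```

The attention mask is what lets the model skip the padded positions, so padding every row to a fixed length changes the compute cost but not which tokens are attended to.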

## encode returns input_ids
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
sample_encoding = tokenizer.encode('A stirring , Funny and finally transporting re imagining of beauty and the beast and 1930s horror films amzertfys', truncation=True, padding="max_length", max_length=512)

## decode converts input_ids (token ids) back to tokens and joins them into a string
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer.decode(sample_encoding)

## See the tokenization (Wordpiece result) using convert_ids_to_tokens
https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast

In [None]:
tokenizer.convert_ids_to_tokens(sample_encoding)
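
The `##` pieces printed by `convert_ids_to_tokens` come from WordPiece's greedy longest-match-first splitting. The sketch below shows that idea on a toy vocabulary; the function `wordpiece` and the vocabulary are hypothetical and much smaller than the real DistilBERT vocabulary, so the splits are illustrative only.

```python
def wordpiece(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"re", "##ima", "##gin", "##ing", "film", "##s"}
print(wordpiece("reimagining", toy_vocab))
print(wordpiece("films", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

A rare or made-up word is therefore split into known fragments where possible, and only falls back to the unknown token when no fragment matches.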

## Step2: Tokenize the entries in text column to get input_ids(token_ids) and attention masks

In [None]:
#tokenized_dict_list = preprocess_function(df, text_column="text")
tokenized_df = df.map(partial(preprocess_function, text_column="sentence"), batched=True)

In [None]:
tokenized_df["train"]

### Step 2.1 Remove the text column (and the idx column) because the model does not accept raw text as an input:

In [None]:
tokenized_df = tokenized_df.remove_columns(["sentence", "idx"])

### Step 2.2 Rename the label column to labels because the model expects the argument to be named labels:

In [None]:
tokenized_df = tokenized_df.rename_column("label", "labels")

### Step 2.3 Set the format of the dataset to return PyTorch tensors instead of lists:

In [None]:
tokenized_df.set_format("torch")

In [None]:
tokenized_df

## Step 3: Prepare data using DataLoader

In [None]:
type(tokenized_df["train"])

In [None]:
train_dataloader = DataLoader(tokenized_df["train"], shuffle=True, batch_size=8)
eval_dataloader = DataLoader(tokenized_df["validation"], shuffle=False, batch_size=8)  ## No need to shuffle evaluation data

In [None]:
## Iterate over dataloader
dataiter = iter(eval_dataloader)

## Step 4: Load model with expected number of labels

In [None]:
id2label = {0:"negative", 1:"positive"}
label2id = {"negative":0, "positive":1}

In [None]:
## https://github.com/huggingface/transformers/blob/main/src/transformers/models/distilbert/modeling_distilbert.py#L928
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

In [None]:
print(f"Pretrained model with classification head architecture: {model}")

## This gives us a pretrained DistilBERT model with an untrained classification head.
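
The classification head produces one raw score (logit) per label. As a rough sketch of how those scores become a label, the snippet below applies a softmax and picks the highest-probability class. The logit values are hypothetical and `softmax` is written by hand here rather than taken from a library.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "negative", 1: "positive"}
logits = [-1.2, 2.3]                      # hypothetical output for one sentence
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
print(id2label[pred_id], probs)
```

Before fine-tuning, the head's weights are randomly initialised, so these probabilities carry no information until training has run.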

## Step5 : Set up Optimizer (AdamW) and scheduler

In [None]:
## https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/optimizer_schedules#transformers.AdamW
## Note: the AdamW used here is the one imported from torch.optim above, not transformers.AdamW
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

In [None]:
## model.parameters() gives the parameters of the model that need to be optimized by AdamW
for i in model.parameters():
    print(i)
    print(f"Shape of tensor: {i.size()}")
    print("==========")

In [None]:
num_epochs = 5
num_training_steps = num_epochs*len(train_dataloader)
print(f"num_training_steps: {num_training_steps}")

In [None]:
lr_scheduler = get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
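
The `"linear"` schedule ramps the learning rate up during warmup and then decays it linearly to zero. The sketch below reproduces that shape in plain Python under the assumption that it matches the library's linear-with-warmup rule; `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: grow from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    # Decay: shrink from base_lr to 0, never going negative.
    return base_lr * max(0.0, remaining / decay_span)

# With num_warmup_steps=0, as in the cell above, the rate starts at its peak.
for step in (0, 50, 100):
    print(step, linear_schedule_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=100))
```

This is why `lr_scheduler.step()` must be called once per batch: the schedule is defined over `num_training_steps`, which counts batches rather than epochs.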

## Step6: Set device to cuda if available

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## Step7: Set training loop
## Forward pass: https://github.com/huggingface/transformers/blob/main/src/transformers/models/distilbert/modeling_distilbert.py#L928
## Very slow on local machine but faster on Colab with T4 GPU

In [None]:
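progress_bar = tqdm(range(num_training_steps)) ## Progress bar advances once per batch, across all epochs
model.to(device) ## Move the model to the same device as the batches, otherwise the forward pass fails on GPU
model.train(mode=True) ## Set model to train mode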
for epoch in range(num_epochs):
    for batch in train_dataloader:
        ## Bring tensor to device
        batch = {k: v.to(device) for k, v in batch.items()}
        ## Pass batch through the model in train mode
        outputs = model(**batch)
        ## Get loss
        loss = outputs.loss ## returns CE loss function which is defined at https://github.com/huggingface/transformers/blob/main/src/transformers/models/distilbert/modeling_distilbert.py#L928
        
        ## loss.backward() computes dloss/dx for every parameter x which has requires_grad=True. These are accumulated into x.grad for every parameter x.
        loss.backward() ## https://discuss.pytorch.org/t/what-does-the-backward-function-do/9944: Calculate dF(loss function)/dx to update params

        ## optimizer.step updates the value of x using the gradient x.grad. For example, the SGD optimizer performs: x += -lr * x.grad
        optimizer.step()
        lr_scheduler.step()

        ## optimizer.zero_grad() clears x.grad for every parameter x in the optimizer.
        optimizer.zero_grad()
        progress_bar.update(1)
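
The backward / step / zero_grad cycle in the loop above can be seen in isolation on a one-parameter model. This is a sketch only: it uses plain SGD with hand-written gradients instead of AdamW and autograd, and the data points are hypothetical.

```python
# One-parameter model y_hat = w * x with squared-error loss.
w, grad = 0.0, 0.0
lr = 0.1
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical (x, y) pairs with true w = 2

losses = []
for epoch in range(20):
    for x, y in data:
        y_hat = w * x                     # forward pass
        loss = (y_hat - y) ** 2           # loss for this "batch"
        grad += 2 * (y_hat - y) * x       # like loss.backward(): accumulate dloss/dw
        w += -lr * grad                   # like optimizer.step() for SGD
        grad = 0.0                        # like optimizer.zero_grad()
        losses.append(loss)

print(f"w = {w:.4f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

Skipping the reset would add each new gradient onto the previous ones, which is why the real loop clears gradients after every update.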

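## Step3: Pad shorter sequences with a data collator (in Step 2 we truncated long sequences)

With padding=True, DataCollatorWithPadding pads each batch to the length of its longest sequence. Because preprocess_function already padded every row to 512 tokens, the collator has nothing left to add for this dataset.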

https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
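
Dynamic padding means each batch is padded only to its own longest sequence. A minimal sketch of that behaviour, assuming unpadded features as input, is shown below; `collate_with_padding` is a hypothetical stand-in and returns plain lists rather than tensors.

```python
def collate_with_padding(features, pad_token_id=0):
    """Pad every sequence in the batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch

# Illustrative ids only.
features = [{"input_ids": [101, 5, 102]}, {"input_ids": [101, 5, 6, 7, 102]}]
batch = collate_with_padding(features)
print(batch)
```

Compared with padding everything to 512 tokens up front, this keeps batches of short sentences small, which usually reduces wasted computation.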

## Step 4: Get evaluation metrics (scikit-learn or the evaluate library)
https://huggingface.co/docs/evaluate/package_reference/loading_methods

In [None]:
##evaluate.list_evaluation_modules(module_type="metric", include_community=True, with_details=True)

In [None]:
[metric for metric in evaluate.list_evaluation_modules(module_type="metric", include_community=True) if 'f1' in metric]

In [None]:
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy_value = accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    precision_value = precision.compute(predictions=predictions, references=labels)["precision"]
    recall_value = recall.compute(predictions=predictions, references=labels)["recall"]
    f1_value = f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy_value, "precision":precision_value, "recall":recall_value, "f1":f1_value}
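
For reference, the four metrics returned by `compute_metrics` can be computed by hand for the binary case. The sketch below is an assumption-labelled reimplementation, not the `evaluate` library's code; `binary_metrics` and the sample predictions are hypothetical.

```python
def binary_metrics(predictions, references, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = binary_metrics(predictions=[1, 0, 1, 1, 0], references=[1, 0, 0, 1, 1])
print(metrics)
```

Reporting precision and recall alongside accuracy shows whether errors lean towards false positives or false negatives.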

## Step 5: Get id2label and label2id mapping

## Step 6: Train model
1. Use the Trainer API by Hugging Face, which abstracts the training loop
2. Manually write the training loop in native PyTorch/TensorFlow

### Step 6.1 Use Trainer API
https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

In [None]:
## https://huggingface.co/docs/transformers/v4.40.1/en/model_doc/auto#transformers.AutoModelForSequenceClassification
## The model is instantiated with an untrained classification head (Linear -> ReLU -> Dropout -> Linear) that outputs raw logits; softmax is not applied inside the model
## https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/configuration#transformers.PretrainedConfig
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

In [None]:
print(f"Pretrained model with classification head architecture: {model}")

In [None]:
##https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments
## Sets num_train_epochs plus the arguments that control the optimizer, such as the learning rate and weight decay
## Checkpoints are saved every 500 steps since save_strategy=steps and save_steps=500 by default
training_arguments = TrainingArguments(output_dir="./results", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=16, num_train_epochs=5, weight_decay=0.01, evaluation_strategy="epoch")
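
To get a feel for how many optimizer steps and checkpoints these arguments imply, the arithmetic can be sketched as below. The dataset size of 10,000 rows is a hypothetical placeholder, and the sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

def training_steps(num_examples, batch_size, num_epochs, save_steps=500):
    """Steps per epoch, total steps, and the steps at which checkpoints would be written."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return steps_per_epoch, total_steps, checkpoints

steps_per_epoch, total_steps, checkpoints = training_steps(
    num_examples=10_000,  # hypothetical dataset size
    batch_size=16,        # per_device_train_batch_size above
    num_epochs=5,         # num_train_epochs above
)
print(steps_per_epoch, total_steps, checkpoints)
```

Replacing the placeholder with `len(tokenized_df["train"])` gives the figures for the actual run.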

In [None]:
## https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.Trainer
## Information such as the train/validation datasets, data_collator, metrics to compute, and the optimizer and scheduler to use (default: AdamW with get_linear_schedule_with_warmup())
trainer = Trainer(model=model, args=training_arguments, train_dataset=tokenized_df["train"], eval_dataset=tokenized_df["validation"], tokenizer=tokenizer, data_collator=data_collator, compute_metrics=compute_metrics)

In [None]:
tokenized_df

In [None]:
trainer.train()

In [None]:
trainer.save_model("./results/finetuned_model")

## Evaluating on validation dataset outside model training

In [None]:
trainer.evaluate(eval_dataset=tokenized_df["validation"])

In [None]:
tokenized_df["test"][0]

In [None]:
## The label column was renamed to labels in Step 2.2, so that is the column to drop
test_dataset = tokenized_df["test"].remove_columns("labels")

In [None]:
predictions = trainer.predict(test_dataset)

In [None]:
print(predictions.predictions.shape)

In [None]:
preds = np.argmax(predictions.predictions, axis=-1)

In [None]:
predicted_test_dataset = test_dataset.add_column("prediction", [id2label[pred] for pred in preds])

In [None]:
predicted_test_dataset

## Load finetuned model

In [None]:
finetuned_model = AutoModelForSequenceClassification.from_pretrained("./results/finetuned_model")

In [None]:
from transformers import pipeline

In [None]:
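## device=0 selects the first GPU; fall back to CPU (-1) when CUDA is unavailable
clf = pipeline(task="text-classification", model="./results/finetuned_model", device=0 if torch.cuda.is_available() else -1)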

In [None]:
## The sentence column was removed from the tokenized dataset, so read raw text from the original dataset
clf(list(df["test"]["sentence"]))

In [None]:
list(df["test"]["sentence"])

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
tokenizer.decode(tokenizer(['a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films', 'my name is hardik'], truncation=True, padding="max_length")["input_ids"][0])