<a href="https://colab.research.google.com/github/liampotts/CompInvesting/blob/main/lab8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 8: Transformer language models for classification

This lab was adapted from a few different Huggingface tutorials and notebooks, including [this](https://huggingface.co/blog/sentiment-analysis-python) and [this](https://huggingface.co/docs/transformers/training#finetune-with-trainer).

This lab demonstrates how to use [DistilBERT](https://arxiv.org/abs/1910.01108), a lightweight version of BERT that uses far fewer resources while being nearly as accurate on important benchmarks. If you would like to use DistilBERT for your projects, [this Huggingface page](https://huggingface.co/docs/transformers/model_doc/distilbert) has links to many different notebooks for various kinds of tasks. *Pro tip: cribbing from other people's Colab notebooks is basically how this kind of work is done. Just make sure to acknowledge your sources.*

We will work with two datasets. The first is a freely available set of 50K movie reviews from IMDB, which is available through the Huggingface datasets library. The second is the good old clickbait vs. real news headline database. You will adapt the code I've given you for the IMDB dataset to the clickbait data to show me that you know how to convert your own datasets into the expected format for DistilBERT for classification.

For this lab, you will turn in **this notebook with all the output** so that we can see that you ran everything and with the (very few) questions answered in the appropriate places. This lab is due on **Thursday, November 10, 2022, at 11:59pm EST**.

## Part 1: Setting things up
### Activate the GPU. 
Using the GPU will make everything much faster, and the free version of Colab allows you to use a GPU for limited periods of time. To activate a GPU, go to the `Runtime` menu, then select `Change runtime type`, then pick `GPU` from the dropdown menu.

If you want to remain on good terms with Colab, just remember to disconnect from the GPU when you are done with it (`Runtime-> Disconnect and delete runtime`). Colab will disconnect you automatically after idle time anyway, but it's wise to try to use as few resources as possible.

### Mount your Google Drive
You're going to be loading the clickbait dataset from your Drive later on, so let's mount it now.

**Don't forget to create a folder in your Drive called `lab8` where you can put the files I included in the repo for the clickbait dataset.**

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Install some libraries
Next, let's install some libraries. The first, [datasets](https://huggingface.co/docs/datasets/index), is a Huggingface library that lets programmers easily download and process NLP datasets. The second is the [transformers](https://huggingface.co/docs/transformers/index) library, which is one of the most widely used libraries for downloading and fine-tuning pre-trained models for NLP.

In [7]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Part 2: Loading and processing the data
Now we will load the IMDB dataset from datasets. It contains 50K movie reviews that are very positive or negative, with half for training and half for testing. It will take a bit of time to download, but conveniently, you get a nice progress bar so you'll have some idea of how long you'll need to wait. **Don't go away! You'll want to sit by your computer until you start training, a few steps from now.**

In [8]:
from datasets import load_dataset
imdb = load_dataset("imdb")



  0%|          | 0/3 [00:00<?, ?it/s]

For fun later on, you can train with the whole dataset, but one of the things that is cool about BERT and friends is that they are pre-trained on billions of words, so you don't need that much data to fine-tune to your task. Let's use a **subset of 3000 from the training set and 300 from the test set**. The code below selects a random subset of 3000 and 300.

In [9]:
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
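
To make the shuffle-then-select step concrete, here is a minimal pure-Python sketch of the same idea on a toy list. It is only an illustration, not how the `datasets` library implements `shuffle()` or `select()`; the helper name `random_subset` and the toy data are made up for this example.

```python
import random

def random_subset(rows, n, seed=42):
    """Shuffle a copy of `rows` with a fixed seed, then keep the first `n`."""
    rng = random.Random(seed)      # fixed seed -> same subset on every run
    shuffled = list(rows)          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    return shuffled[:n]

toy_reviews = [f"review {i}" for i in range(10)]   # stand-in for the real dataset
subset_a = random_subset(toy_reviews, 3)
subset_b = random_subset(toy_reviews, 3)
print(subset_a)
print(subset_a == subset_b)        # same seed, same subset
```

Because the seed is fixed, everyone who runs the cell gets the same subset, which makes results easier to compare.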




Let's print out a few training examples to see what they look like. After the text, you can see the labels, which are 1=positive and 0=negative.

In [10]:
small_train_dataset[0]
small_train_dataset[2]

{'text': 'George P. Cosmatos\' "Rambo: First Blood Part II" is pure wish-fulfillment. The United States clearly didn\'t win the war in Vietnam. They caused damage to this country beyond the imaginable and this movie continues the fairy story of the oh-so innocent soldiers. The only bad guys were the leaders of the nation, who made this war happen. The character of Rambo is perfect to notice this. He is extremely patriotic, bemoans that US-Americans didn\'t appreciate and celebrate the achievements of the single soldier, but has nothing but distrust for leading officers and politicians. Like every film that defends the war (e.g. "We Were Soldiers") also this one avoids the need to give a comprehensible reason for the engagement in South Asia. And for that matter also the reason for every single US-American soldier that was there. Instead, Rambo gets to take revenge for the wounds of a whole nation. It would have been better to work on how to deal with the memories, rather than suppressi

You remember **tokenization** from the very beginning of the year, and in fact, it's still incredibly important! Even with all the fancy neural networks, it's still necessary to tokenize your text. The AutoTokenizer below will tokenize your data so that it plays nicely with the `distilbert-base-uncased` model.

In [11]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Here's a little function that will run the tokenizer for you. You pass in one of your datasets (in this case, either the training or the test set), and then it will tokenize whatever is in the "text" field of each row of the dataset. The `truncation=True` part truncates texts that are longer than the model's maximum input length (512 tokens for `distilbert-base-uncased`).

In [12]:
def preprocess_function(examples):
  return tokenizer(examples["text"], truncation=True)
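
As a rough picture of what truncation does to a long review, the sketch below cuts a list of token IDs so that the result, including the `[CLS]` and `[SEP]` markers, fits in a maximum length. This is a simplified stand-in for the tokenizer's behaviour, not its actual code: the helper `truncate_ids` is hypothetical, and the limit of 512 and the IDs 101/102 are assumed to match the `distilbert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID = 101, 102   # assumed IDs of [CLS] and [SEP] in this vocabulary

def truncate_ids(token_ids, max_length=512):
    """Keep at most `max_length` IDs in total, including [CLS] and [SEP]."""
    body = token_ids[: max_length - 2]     # leave room for the two special tokens
    return [CLS_ID] + body + [SEP_ID]

long_review = list(range(1000, 1600))      # 600 made-up token IDs
encoded = truncate_ids(long_review)
print(len(encoded), encoded[0], encoded[-1])
```

Anything past the limit is simply dropped, so for very long reviews the model never sees the ending.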

Now we'll run the tokenizer function on our train and test sets, and we'll save the results out to new variables.

In [13]:
# (removing batched=True)
tokenized_small_train = small_train_dataset.map(preprocess_function)
tokenized_small_test = small_test_dataset.map(preprocess_function)


  0%|          | 0/3000 [00:00<?, ?ex/s]

  0%|          | 0/300 [00:00<?, ?ex/s]

What do they look like now? You'll see that the tokenization step is more than just separating punctuation and that sort of thing! Each object still has the original text and the label, but now there are two additional components: a list of input IDs (i.e., unique integer IDs for each token in the input sequence) and a list of the same length full of 1s, called the "attention mask". This will be used later on during padding. Tokens that are non-padding tokens will have an attention mask of 1, while tokens that are padding tokens will have an attention mask of 0. (Padding is discussed a bit below.)

In [14]:
tokenized_small_train[0]

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...',
 'label': 1,
 'input_ids': [101,
  2045,
  2003,
  2053,
  7189,
  2012,
  2035,
  2090,
  3481,
  3771,
  1998,
  6337,
  2099,
  2021,
  1996,
  2755,
  2008,
  2119,
  2024,
  2610,
  2186,
  2055,
  6355,
  6997,
  1012,
  

Below we create a **data collator** that will later be used to convert the training samples to PyTorch tensors and to give them the correct amount of padding. Training expects every input sample in a batch to be the same length. You can ensure this by "padding" the shorter inputs with a special padding token (ID 0 for this tokenizer), added at the end of the sequence by default, until they match the longest input in the batch.

In [15]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
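
To see how padding and the attention mask fit together, here is a minimal pure-Python sketch that pads a toy batch by hand. It is not the implementation of `DataCollatorWithPadding` (which also returns PyTorch tensors); the helper `pad_batch` is hypothetical, and the padding ID of 0 is assumed to match this tokenizer.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)          # padding goes at the end
        masks.append([1] * len(ids) + [0] * n_pad)     # 1 = real token, 0 = padding
    return padded, masks

batch = [[101, 2045, 2003, 102], [101, 2053, 102]]     # two toy sequences
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Padding each batch only to its own longest sequence, rather than to the global maximum, keeps short batches small and saves computation.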

## Part 3: Setting up the training (a.k.a. the fine-tuning)
Now we're going to do a few things to get everything ready for actually doing the fine-tuning of  DistilBERT to our task of classifying movie reviews according to their sentiment.

First, we need to download the model for classification, which is the `distilbert-base-uncased` model. Conveniently, Huggingface has set the whole thing up for us to do classification easily. We don't actually need to write the softmax layer or anything like this since they have done it all for us. We just need to get the pretrained model designed specifically for sequence classification.

You will get a warning saying that some weights were not used when initializing, and that others were newly initialized. That makes sense since we are fine-tuning to a new task, so it's okay.

In [16]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w
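
For intuition about the softmax step that the classification model handles for us: the model produces one raw score (a logit) per label, and a softmax turns those scores into probabilities. The sketch below shows that calculation in plain Python on made-up logits; it is an illustration of the formula, not code from the `transformers` library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([-1.2, 2.3])                           # made-up logits for [negative, positive]
predicted_label = probs.index(max(probs))              # index of the most probable label
print(probs, predicted_label)
```

Since softmax preserves the ordering of the scores, taking the largest logit directly gives the same predicted label.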

When we are done with our training, we want to be able to report the metrics that are actually interesting to us, not just loss, which is what is typically reported. So let's define what we want to see: accuracy and F1.

In [17]:
import numpy as np
from datasets import load_metric
 
def compute_metrics(eval_pred):
   load_accuracy = load_metric("accuracy")
   load_f1 = load_metric("f1")
  
   logits, labels = eval_pred
   predictions = np.argmax(logits, axis=-1)
   accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
   f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
   return {"accuracy": accuracy, "f1": f1}
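
If you want to check what these two metrics actually measure, the sketch below computes them by hand for a binary task in plain Python. It is a simplified illustration, not the code behind `load_metric`; the helper `accuracy_and_f1` and the toy logits and labels are made up, and label 1 is assumed to be the positive class.

```python
def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 from per-example logits and gold labels."""
    predictions = [row.index(max(row)) for row in logits]   # argmax per example
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)
```

Here three of four predictions are correct (accuracy 0.75), while precision is 1.0 and recall is 2/3, which gives an F1 of 0.8.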

Now we are setting all the hyperparameters and arguments we need for training. The `TrainingArguments` object contains many of the hyperparameters we've talked about for training neural nets, like batch size, learning rate, number of epochs, and weight decay.

The `Trainer` object takes the training arguments as one of its arguments, along with the model (DistilBERT), the data collator that converts training instances into tensors of the right length, the set of evaluation metrics we want to use, and pointers to the training and testing data.

In [18]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/lab8/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_small_train,
    eval_dataset=tokenized_small_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
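
As a quick sanity check on these settings, the number of optimization steps follows from the dataset size, the batch size, and the number of epochs. The arithmetic below assumes a single device and no gradient accumulation, which matches the values reported in the training log.

```python
import math

num_examples = 3000    # size of the small training set
batch_size = 16        # per_device_train_batch_size
num_epochs = 5         # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)   # the last batch is only partly full
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)                      # 188 940
```

The result agrees with the `Total optimization steps = 940` line that the trainer prints when training starts.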


## Part 4: Training

The code cell below will do the training, i.e., the fine-tuning of DistilBERT to the task of sentiment classification, on this small dataset of movie reviews.  When I ran this with a GPU, it took 12 minutes or so.

In [19]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 940
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.2526


Saving model checkpoint to /content/drive/MyDrive/lab8/results/checkpoint-500
Configuration saved in /content/drive/MyDrive/lab8/results/checkpoint-500/config.json
Model weights saved in /content/drive/MyDrive/lab8/results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/lab8/results/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/lab8/results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=940, training_loss=0.16632050453348363, metrics={'train_runtime': 716.3871, 'train_samples_per_second': 20.938, 'train_steps_per_second': 1.312, 'total_flos': 1964686083560256.0, 'train_loss': 0.16632050453348363, 'epoch': 5.0})

## Part 5: Evaluation

The code cell below will evaluate the fine-tuned model with the test data. How does it know what data to use? We gave it a pointer to the test data when we created the `Trainer` object a few cells ago. This step should be quick.

In [20]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 300
  Batch size = 16

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

{'eval_loss': 0.4964480698108673,
 'eval_accuracy': 0.8766666666666667,
 'eval_f1': 0.877887788778878,
 'eval_runtime': 7.0045,
 'eval_samples_per_second': 42.829,
 'eval_steps_per_second': 2.713,
 'epoch': 5.0}

I got F1=88.2 and accuracy=88. You'll probably get something similar, but it won't be exactly the same: the shuffle uses a fixed seed, but the new classification layers are randomly initialized each time the model is loaded. Not bad accuracy at all, though I think this dataset is meant to be easy. Now we'll try with a different dataset, where you will be writing (copying and pasting) the code.

## Part 6: Doing it all over again with our own data

Now I would like you to write (copy and paste) the code from above to fine-tune and test on a different dataset, namely, the clickbait vs. news headlines dataset. In the repo, I distributed new versions of the data in a format that will be compatible with `load_dataset()`. Don't forget to put those two files in a folder on your Google Drive called `lab8`.

The remainder of this notebook will mostly involve (1) being extremely fastidious about replacing variable names, and (2) figuring out which steps you need to do and which ones you don't. 

I am getting you started below since the process of loading a dataset from a csv is different from downloading a dataset from Huggingface. 


In [21]:
# load the csv files I provided into a dataset
clickbait = load_dataset('csv',data_files={'train': '/content/drive/MyDrive/lab8/click_train.csv', 
                                           'test': '/content/drive/MyDrive/lab8/click_test.csv'})





Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-3b1b14e736ba3a1a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3b1b14e736ba3a1a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]
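
For reference, here is a small sketch of the kind of CSV layout that the rest of this section relies on: one row per headline, with a text column and a label column. The column name `text` appears in the training log; the name `label`, the example headlines, and the 1=clickbait convention are assumptions made for illustration. The sketch uses only the standard `csv` module rather than `load_dataset()`.

```python
import csv
import io

# Made-up rows; the column names and the label convention are assumptions.
toy_csv = io.StringIO()
writer = csv.DictWriter(toy_csv, fieldnames=["text", "label"])
writer.writeheader()
writer.writerow({"text": "You won't believe what happened next", "label": 1})
writer.writerow({"text": "Senate passes budget bill after long debate", "label": 0})

# Read the rows back and convert the labels from strings to integers.
toy_csv.seek(0)
rows = [{"text": r["text"], "label": int(r["label"])} for r in csv.DictReader(toy_csv)]
print(rows[0])
```

Keeping the same column names as the IMDB data is what lets `preprocess_function` be reused unchanged.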

In [22]:
## Insert the rest of your code in this code block and other additional code blocks.
## Don't waste time installing libraries or setting variables that have already been set.


small_train_dataset_clickbait = clickbait["train"].shuffle(seed=42).select(range(3000))
small_test_dataset_clickbait = clickbait["test"].shuffle(seed=42).select(range(300))


tokenized_small_train_clickbait = small_train_dataset_clickbait.map(preprocess_function)
tokenized_small_test_clickbait = small_test_dataset_clickbait.map(preprocess_function)

training_args_clickbait = TrainingArguments(
    output_dir="/content/drive/MyDrive/lab8/clickbait_results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer_clickbait = Trainer(
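    # NOTE: `model` was already fine-tuned on IMDB above, so training continues from
    # those weights; reload distilbert-base-uncased first to start from scratch.
    model=model,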
    args=training_args_clickbait,
    train_dataset=tokenized_small_train_clickbait,
    eval_dataset=tokenized_small_test_clickbait,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


trainer_clickbait.train()




  0%|          | 0/3000 [00:00<?, ?ex/s]

  0%|          | 0/300 [00:00<?, ?ex/s]

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 940
  Number of trainable parameters = 66955010


Step,Training Loss
500,0.0769


Saving model checkpoint to /content/drive/MyDrive/lab8/clickbait_results/checkpoint-500
Configuration saved in /content/drive/MyDrive/lab8/clickbait_results/checkpoint-500/config.json
Model weights saved in /content/drive/MyDrive/lab8/clickbait_results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/lab8/clickbait_results/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/lab8/clickbait_results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=940, training_loss=0.04323281531638287, metrics={'train_runtime': 76.9192, 'train_samples_per_second': 195.01, 'train_steps_per_second': 12.221, 'total_flos': 74152766003904.0, 'train_loss': 0.04323281531638287, 'epoch': 5.0})

In [23]:
trainer_clickbait.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 300
  Batch size = 16


{'eval_loss': 0.1307637095451355,
 'eval_accuracy': 0.9666666666666667,
 'eval_f1': 0.9652777777777778,
 'eval_runtime': 2.2192,
 'eval_samples_per_second': 135.185,
 'eval_steps_per_second': 8.562,
 'epoch': 5.0}

### Q1 What accuracy and F1 did you get on the clickbait dataset? How does this compare with the various approaches you used previously for this dataset?

I got about 0.97 for both accuracy (0.967) and F1 (0.965). This is better than the previous approaches we used for this dataset (SVM, Naive Bayes, KNN), which gave us accuracy and F1 of 90-94%.

### Q2 Do you think you will use a BERT or related approach in your project? Why or why not?

I think we will use BERT in our project to classify sarcasm vs. not sarcasm. We will do this by using the outputs of BERT as inputs to another neural classifier. As evidenced in this lab, BERT has given the best performance so far, so it would make sense to use it in our project.