In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstallin

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix
import torch
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback


# Data
For fine-tuning, a relatively small sample (here n = 1000 reviews) should be sufficient to obtain good results.


In [None]:
# Read the data and draw a random sample of 1000 reviews
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
data = pd.read_csv(url).sample(n=1000)
# Encode the sentiment labels as integers (positive=1, negative=0)
sentiment_mapping = {"positive": 1, "negative": 0}
data = data.replace({"sentiment": sentiment_mapping})
# Hold out 20% of the sample as test data
train, test = train_test_split(data, test_size=0.2, shuffle=True)
train.head().style

Unnamed: 0,review,sentiment
39435,"I love buying those cheap, lousy DVD's from Alpha Video. One day, I happened to buy this one. It's the perfect silly science fiction film of the 50's, all sexed up. Replete with unscientific EVERYTHING, scantily clad girls and plenty of melodrama, it's an enjoyable film, to those who appreciate this kind of stuff. And if you can 'suspend your disbelief' enough, you can actually get creeped out-- not just by the psychotic head or by the beating of the thing in the closet, but toward the end, with the character of 'the perfect body'. It's so . . . . what's another word for mindf***ing?",1
29999,"Having watched 10 minutes of this movie I was bewildered, having watched 30 minutes my toes were curling - I simply couldn't believe it: The movie is really awful. In fact it is so awful, that I had to watch all of it just to be convinced(!). During this, I came to realize that it reminded me of a bunch of Danish so-called comedies from the 60's and 70's. The pattern is as follows: Take one extremely popular comedian, make a script putting this comedian in as many grotesque situations as possible, add a bunch of jokes (especially one-liners), and spice it up with a couple of beautiful young girls - film that, and you have a success! I wouldn't know if this movie was a success, but unlike the Danish tradition which died quietly (with a few great comedians) it seems that there is a market for this kind of movie in the US.",0
30081,"Director: Tay Garnett, Ford Beebe, Cast: Mike Mazurki, Vic Christy, Fritz Ford, Tay Garnett. Based on the number of comments I see on IMDb, this seems to be a forgotten movie. This seems rather ironic to me because it is actually one of the first movies that I remember. My mom took me and my little brother to see this film at The Garland theater in Spokane when it first came out in the mid 1970's and I still remember it. I am going by memory here but I believe this move is about a trapper who was accused of a crime which he did not commit and the law goes after him. I believe it to be set in 1800's Alaska. A narrator tells the story of the trapper played by Mike Mazurki. Really, this is a very good film with a great setting. It could be compared to the 1981 film Death Hunt with Charles Bronson. The two films have a very similar story line. The main difference between the two is Death Hunt is an adult orientated film whereas Challenge is a family friendly film. Mike Mazurki and Tay Garnett were both rather old when this movie was made which I find rather impressive when one considers that this movie was filmed on location in the wilds of Alaska. This was the last film made by Tay Garnett before he died which was just a few years later. They both had been around since the silent era.",1
34784,"According to the blurb on the back of the DVD case; Jonothan Ross 'laughed until a little bit of wee came out'. I suspect that that has more to do with his being full of it. I never watched the series for one reason or another, so maybe I'm missing some essential cues. As to this movie; I watched the first 45 minutes or so. I laughed once, smiled once, then reached for the newspaper whilst waiting for something else entertaining to happen. Nothing did. Evidently intended to be a surreal spoof upon life in the post-Python, gross-humour style, this one falls absolutely flat. There's been a host of comedy series on television in the last few years, not the least of which were 'Bottom', 'The Fast Show' 'The Vicar Of Dibley' and 'Father Ted', each one engaging a group of bizarre but hilarious characters and sketches. Any one of these could knock this crap into a cocked hat. If the series was anything like this movie; I'm surprised they got the funding. Happily it was one of those £2 Tesco bran-tub purchases and is now in the local charity shop. The moral of the story is; don't believe the pundits, never pay top dollar.",0
27505,"This movie is not about entertainment, or not even a movie you want to see to pass the time. This movie is a genuinely a display of true love that can only come from God. One cannot help but be touched deeply by looking at this movie. We have several dimensions of love that contributes to the value of this movie. There is the divine love of God that is beautifully portrayed. God's love transcends the heart and mind and endures and is eternal. There is the love in a marriage. While the main character grapples with his wife's disease, he realizes through God's love that he loves his wife more than he could ever imagine. He knows that he and his wife are one and can never be separated. Finally, you have the love of child and parent. The kids in the family come together and realize that nothing else matters except that love conquers fear. Dear friends, love is not love unless it comes from God, because God is love and love comes from God. Talk to someone and let them know you love them. Love does no good unless it is given to another. I pray this movie can inspire and change the lives of everyone who sees it. Amen!!",1


# Model Selection
- We select the pretrained model "bert-base-uncased", which is described on https://huggingface.co/bert-base-uncased.
- The model was pretrained on language modelling tasks, but it can be fine-tuned for several downstream tasks such as question answering, multiple choice, and token classification.
- To predict whether a movie review is positive or negative, we choose `BertForSequenceClassification` with `num_labels=2`.
- We must use the same tokenizer that was used in pre-training, so that text is mapped to the token ids the model was trained on.
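
To illustrate why the tokenizer has to match the one used in pre-training, here is a toy sketch. Both vocabularies are made up for illustration and are not BERT's real vocabulary. The same words map to different ids, so embeddings learned under one vocabulary would be looked up at the wrong rows under the other.

```python
# Toy illustration: both vocabularies are hypothetical, not BERT's real one.
pretraining_vocab = {"[UNK]": 0, "great": 1, "movie": 2, "awful": 3}
other_vocab = {"[UNK]": 0, "movie": 1, "awful": 2, "great": 3}


def encode(text, vocab):
    """Map lowercased, whitespace-separated words to ids; unknown words map to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


text = "Great movie"
ids_pretraining = encode(text, pretraining_vocab)
ids_other = encode(text, other_vocab)

# The embedding table is indexed by id, so mismatched ids would
# retrieve the embeddings of entirely different words.
print(ids_pretraining, ids_other)
```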

In [None]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# Data Preprocessing
- We split our training data further into a training and a validation set.
- We tokenize the training, validation, and test data.
- Texts longer than 512 tokens are truncated, because BERT cannot handle longer sequences. Shorter texts are padded to the length of the longest sequence in the same tokenizer call (at most 512 tokens). This allows us to feed batches of equal-length sequences into the model at the same time.
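
As a rough sketch of what padding and truncation do, the helper below works on plain lists of token ids. The function name `pad_and_truncate` and the token ids are hypothetical, and `max_length=4` stands in for BERT's 512. It is a simplification: the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Hypothetical token ids for one short and one long text
batch = [[101, 7, 102], [101, 5, 6, 8, 9, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)
print(input_ids)
print(attention_mask)
```

Both rows end up with the same length, which is what makes it possible to stack them into one batch.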

In [None]:

X = list(train["review"])
y = list(train["sentiment"])
X_test = list(test["review"])
y_test = list(test["sentiment"])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)


In [None]:
tokenizer

PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Now, let’s turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing `torch.utils.data.Dataset` and implementing `__len__` and `__getitem__`.

In [None]:
class Dataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings (and optional labels) for use with the Trainer."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: a tensor per encoding field (input_ids, attention_mask, ...)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Test for None explicitly; truthiness is ambiguous for array-like labels
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(encodings=X_train_tokenized, labels=y_train)
val_dataset = Dataset(encodings=X_val_tokenized, labels=y_val)
test_dataset = Dataset(encodings=X_test_tokenized, labels=y_test)
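
To make the role of `__getitem__` and `__len__` more concrete, the following torch-free sketch mirrors the class above with plain lists. `ToyDataset` and the token ids are hypothetical stand-ins. The last lines imitate, in a very simplified way, how a data loader gathers single items by index and groups them into a batch.

```python
class ToyDataset:
    """Torch-free stand-in that mirrors the Dataset class, using plain lists."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: the idx-th entry of every encoding field
        item = {key: val[idx] for key, val in self.encodings.items()}
        if self.labels is not None:
            item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])


# Hypothetical encodings for two already padded texts
encodings = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyDataset(encodings, labels=[1, 0])
first = toy[0]

# Simplified batching: fetch items one by one, then group them by field
items = [toy[i] for i in range(len(toy))]
batch = {key: [item[key] for item in items] for key in items[0]}
print(first)
print(batch["labels"])
```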



# Fine-Tuning
- We define the setup for model fine-tuning.
- Specifically, we define metrics that are evaluated on the validation data set every 10 steps.
- If the validation loss does not improve in 3 consecutive evaluations, training is stopped early.
- A model checkpoint is stored at every evaluation, and the best model in terms of validation loss is loaded at the end.

In [None]:
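def compute_metrics(p):
    # p holds the model's raw outputs (logits) and the true labels
    pred, labels = p
    # Pick the class with the highest logit for each example
    pred = np.argmax(pred, axis=1)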

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Define Trainer
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps = 2,
    num_train_epochs=3,
    seed=0,
    #warmup_steps = 50,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
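
As a sanity check on this configuration, the sketch below works out how many optimizer steps to expect. It assumes training on a single device and uses only values from the notebook: 1000 sampled reviews, two 80/20 splits, a per-device batch size of 8, 2 gradient accumulation steps, and 3 epochs. The variable names are made up for the example. The results should match what the Trainer reports when training starts.

```python
import math

n_sampled = 1000
n_test = n_sampled // 5            # test_size=0.2 is one fifth: 200 test examples
n_train_val = n_sampled - n_test   # 800 examples left
n_val = n_train_val // 5           # second 80/20 split: 160 validation examples
n_train = n_train_val - n_val      # 640 training examples

per_device_batch_size = 8
gradient_accumulation_steps = 2
num_train_epochs = 3

# Effective batch size per optimizer step (single device assumed)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

batches_per_epoch = math.ceil(n_train / per_device_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_train_epochs

print(n_train, effective_batch_size, updates_per_epoch, total_steps)
```

With `eval_steps=10`, one epoch therefore spans 4 evaluations.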



Train the classification model.

In [None]:
trainer.train()



***** Running training *****
  Num examples = 640
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 2
  Total optimization steps = 120


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
10,No log,0.691503,0.53125,0.0,0.0,0.0
20,No log,0.662874,0.50625,0.487013,1.0,0.655022
30,No log,0.462581,0.88125,0.868421,0.88,0.874172
40,No log,0.411106,0.85625,0.788889,0.946667,0.860606
50,No log,0.330943,0.88125,0.868421,0.88,0.874172
60,No log,0.347679,0.88125,0.924242,0.813333,0.865248
70,No log,0.408029,0.8625,0.804598,0.933333,0.864198
80,No log,0.377452,0.89375,0.939394,0.826667,0.879433
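
The metric columns can be reproduced by hand. The sketch below recomputes the step-20 row from confusion counts. These counts are not logged directly; they are inferred here from the table (160 validation examples, recall 1.0, precision 0.487013), so treat them as an assumption of this example.

```python
# Confusion counts inferred from the step-20 row (an inference, not logged output)
tp, fp, tn, fn = 75, 79, 6, 0

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```

At this point the model labels almost every review as positive, which explains the perfect recall combined with an accuracy close to chance.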


***** Running Evaluation *****
  Num examples = 160
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to output/checkpoint-10
Configuration saved in output/checkpoint-10/config.json
Model weights saved in output/checkpoint-10/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 160
  Batch size = 8
Saving model checkpoint to output/checkpoint-20
Configuration saved in output/checkpoint-20/config.json
Model weights saved in output/checkpoint-20/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 160
  Batch size = 8
Saving model checkpoint to output/checkpoint-30
Configuration saved in output/checkpoint-30/config.json
Model weights saved in output/checkpoint-30/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 160
  Batch size = 8
Saving model checkpoint to output/checkpoint-40
Configuration saved in output/checkpoint-40/config.json
Model weights saved in output/checkpoint-40/pytorch_model.bin
*****

TrainOutput(global_step=80, training_loss=0.4195139408111572, metrics={'train_runtime': 198.3416, 'train_samples_per_second': 9.68, 'train_steps_per_second': 0.605, 'total_flos': 336782150860800.0, 'train_loss': 0.4195139408111572, 'epoch': 2.0})
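
Training ended at step 80 rather than at the planned 120 steps. The sketch below replays the logged validation losses through a simplified patience rule. It is an illustration of the idea, not the actual `EarlyStoppingCallback` implementation, and the variable names are hypothetical.

```python
# Validation losses copied from the training log (step -> loss)
val_losses = {
    10: 0.691503, 20: 0.662874, 30: 0.462581, 40: 0.411106,
    50: 0.330943, 60: 0.347679, 70: 0.408029, 80: 0.377452,
}
patience = 3

best_step, best_loss = None, float("inf")
evals_without_improvement = 0
stopped_at = None

for step, loss in val_losses.items():
    if loss < best_loss:
        # New best validation loss: remember it and reset the counter
        best_step, best_loss = step, loss
        evals_without_improvement = 0
    else:
        evals_without_improvement += 1
        if evals_without_improvement >= patience:
            stopped_at = step
            break

print(best_step, stopped_at)
```

Under this rule the best validation loss is reached at step 50, and the three following evaluations (steps 60, 70, and 80) trigger the stop.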

In [None]:
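# Load the fine-tuned model from the last saved checkpoint (step 80).
# Note: by validation loss, the best checkpoint in the log above is step 50.
# With load_best_model_at_end=True, trainer.model should already hold that best model.
model_path = "output/checkpoint-80"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)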



loading configuration file output/checkpoint-80/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file output/checkpoint-80/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClass

# Testing

In [None]:
# Define test trainer
test_trainer = Trainer(model)

# Make prediction
raw_pred, labels, metrics = test_trainer.predict(test_dataset)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
  Num examples = 200
  Batch size = 8


In [None]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)
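
The raw predictions are logits, one value per class and example. The sketch below uses made-up logits to show how the arg max yields the predicted class and how a softmax would turn the same logits into probabilities. It uses plain Python, and the helper names are hypothetical.

```python
import math

# Hypothetical logits for two examples: columns are [negative, positive]
logits = [[2.0, -1.0], [-0.5, 1.5]]


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


pred_classes = [argmax(row) for row in logits]
probs = [softmax(row) for row in logits]
print(pred_classes)
print(probs)
```

Because softmax preserves the ordering of the logits, taking the arg max of the logits directly gives the same class as taking it over the probabilities.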

In [None]:
metrics

{'test_loss': 0.3502556085586548,
 'test_runtime': 6.9167,
 'test_samples_per_second': 28.915,
 'test_steps_per_second': 3.614}

In [None]:
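# Store the model's predictions (y_pred). `labels` holds the true labels;
# assigning it here makes the crosstab compare the labels with themselves,
# which yields a trivially perfect diagonal like the one recorded below.
test['prediction'] = y_pred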

In [None]:
pd.crosstab(test.sentiment, test.prediction, margins=True, normalize='all')

prediction,0,1,All
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.49,0.0,0.49
1,0.0,0.51,0.51
All,0.49,0.51,1.0
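
With `normalize='all'`, every cell of the crosstab is a count divided by the total number of examples. The sketch below reproduces that calculation in plain Python on a small, made-up set of labels and predictions.

```python
from collections import Counter

# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

counts = Counter(zip(y_true, y_hat))
total = len(y_true)

# Each cell: share of all examples with this (true label, prediction) pair
table = {(t, p): counts[(t, p)] / total for t in (0, 1) for p in (0, 1)}

# Row margins: share of examples per true label
row_margins = {t: table[(t, 0)] + table[(t, 1)] for t in (0, 1)}
print(table)
print(row_margins)
```

The diagonal cells `(0, 0)` and `(1, 1)` add up to the accuracy, and the off-diagonal cells are the two kinds of error.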


In [None]:
model.save_pretrained("output/final")

Configuration saved in output/final/config.json
Model weights saved in output/final/pytorch_model.bin


In [None]:
!zip -r /content/file.zip /content/output/final

  adding: content/output/final/ (stored 0%)
  adding: content/output/final/config.json (deflated 49%)
  adding: content/output/final/pytorch_model.bin (deflated 7%)


In [None]:
# Verify access to GPU
import tensorflow as tf  # required for this check; not imported earlier in the notebook
tf.test.gpu_device_name()

'/device:GPU:0'

In [None]:
from google.colab import files
files.download("/content/file.zip")
