# intermediate_IMDB
This notebook fine-tunes `roberta-base` on our prepared IMDB sentiment dataset to produce an intermediate model.

## Imports & Settings

First, change the working directory to the parent folder so that the custom modules (`params`, `utils`, `trainer`) can be imported.

In [1]:
import os
os.chdir('..')
# os.getcwd( )

In [2]:
import params
from utils import *
from trainer import *

import numpy as np
import pandas as pd
import torch  # used directly below (torch.Generator, torch.cat, torch.optim)

from sklearn.model_selection import train_test_split
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

# suppress model warnings (aliased so it does not clash with the stdlib logging module)
from transformers import logging as hf_logging
hf_logging.set_verbosity_error()

# set logging level
import logging
logging.basicConfig(format='%(message)s', level=logging.INFO)


In [4]:
# set general seeds
set_seeds(1)

# set dataloader generator seed
g = torch.Generator()
g.manual_seed(1)

# set params for this model
params.num_labels = 2
params.output_dir = "model_saves/intermediate_IMDB_01"

# Ensure we're on an ARM environment if necessary.
platform_check()

We're Armed: macOS-13.1-arm64-i386-64bit


## Load Data

### IMDB

In [4]:
dataset_path = 'data/inter_IMDB_sentiment/IMDB_preped_train.csv'
df = pd.read_csv(dataset_path)

df.head()

|   | text | label | num_word_text |
|---|------|-------|---------------|
| 0 | This film has been compared to the hilarious B... | 0 | 186 |
| 1 | Reasonably effective horror/science-fiction a ... | 1 | 61 |
| 2 | The inspiration for the "Naked Gun" movies cas... | 1 | 169 |
| 3 | When this film was originally released it was ... | 1 | 634 |
| 4 | I happened upon this by chance. I was at my fr... | 1 | 334 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   text           40000 non-null  object
 1   label          40000 non-null  int64 
 2   num_word_text  40000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 937.6+ KB


In [6]:
df['label'].value_counts()

1    20044
0    19956
Name: label, dtype: int64

### Target Text & Labels

In [7]:
text = df.text.values
labels = df.label.values

## Preprocess

In [8]:
token_id = []
attention_masks = []

for sample in text:
  encoding_dict = preprocessing(sample, params.tokenizer)
  token_id.append(encoding_dict['input_ids']) 
  attention_masks.append(encoding_dict['attention_mask'])


token_id = torch.cat(token_id, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)
labels = torch.tensor(labels)
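
The `preprocessing` helper comes from `utils` and is not shown in this notebook. As a rough sketch only, a helper of this kind might look like the function below. The name `preprocessing_sketch` and the `max_length=512` default are assumptions made for illustration (512 matches the `(1, 512)` input size used in the model summary later on); the real helper may differ.

```python
def preprocessing_sketch(input_text, tokenizer, max_length=512):
    """Hypothetical stand-in for utils.preprocessing (not the author's code).

    Returns the tokenizer's encoding, whose 'input_ids' and 'attention_mask'
    entries are tensors of shape [1, max_length] for a Hugging Face tokenizer.
    """
    return tokenizer(
        input_text,
        add_special_tokens=True,     # add the model's start/end tokens
        max_length=max_length,       # assumed sequence length
        padding="max_length",        # pad short reviews up to max_length
        truncation=True,             # cut long reviews down to max_length
        return_attention_mask=True,  # 1 for real tokens, 0 for padding
        return_tensors="pt",         # PyTorch tensors with a leading batch dim of 1
    )
```

Because each encoding has a leading dimension of 1, concatenating the per-sample tensors with `torch.cat(..., dim=0)` yields one `[num_samples, max_length]` tensor, which is what the cell above relies on.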

## Data Split
We split the dataset into train (80%) and validation (20%) sets, stratified by label, and wrap each set in a `torch.utils.data.DataLoader`.

In [9]:
val_ratio = 0.2

# Indices of the train and validation splits stratified by labels
train_idx, val_idx = train_test_split(
    np.arange(len(labels)),
    test_size = val_ratio,
    shuffle = True,
    stratify = labels,
    random_state=1)

# Train and validation sets
train_set = TensorDataset(token_id[train_idx], 
                          attention_masks[train_idx], 
                          labels[train_idx])

val_set = TensorDataset(token_id[val_idx], 
                        attention_masks[val_idx], 
                        labels[val_idx])

# Prepare DataLoader
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = params.batch_size,
            worker_init_fn=seed_worker,
            generator=g,
        )

validation_dataloader = DataLoader(
            val_set,
            sampler = RandomSampler(val_set),
            batch_size = params.batch_size,
            worker_init_fn=seed_worker,
            generator=g,
        )
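
The value of `params.batch_size` is not shown in this notebook. The short sketch below works out how many batches each loader yields per epoch; `batch_size = 16` is an assumption, chosen because it is the integer consistent with the 2000 training and 500 validation batches shown in the progress bars further down.

```python
import math

def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

n_train = 16_035 + 15_965   # class counts printed for train_set further down
n_val = 40_000 - n_train    # remaining rows form the validation set

batch_size = 16             # assumed value of params.batch_size (not shown)
print(batches_per_epoch(n_train, batch_size))  # 2000
print(batches_per_epoch(n_val, batch_size))    # 500
```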

In [10]:
one = 0
zero = 0
for i in train_set:
    if i[2] == 1:
        one+=1
    elif i[2] == 0:
        zero += 1
         
print("one", one)
print("zero", zero)

one 16035
zero 15965
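
As a quick sanity check on the stratified split, the sketch below compares the share of positive labels in the full dataset with the share in the training split, using only the counts printed in this notebook. The helper name `label_fractions` is made up for this example.

```python
from collections import Counter

def label_fractions(labels):
    """Return {label: fraction of samples} for an iterable of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts reported above: full dataset (value_counts) and training split (loop output)
full = [1] * 20_044 + [0] * 19_956
train = [1] * 16_035 + [0] * 15_965

print(round(label_fractions(full)[1], 4))   # 0.5011
print(round(label_fractions(train)[1], 4))  # 0.5011
```

With the actual tensors, the same helper could be applied to `labels[train_idx].tolist()` instead of looping over `train_set`.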


## Train

Download `transformers.RobertaForSequenceClassification`, which is a RoBERTa model with a linear layer for sentence classification (or regression) on top of the pooled output:

In [10]:
# Load the RobertaForSequenceClassification model
model = RobertaForSequenceClassification.from_pretrained('roberta-base',
                                                         num_labels = params.num_labels,
                                                         output_attentions = False,
                                                         output_hidden_states = False,
                                                         )

from torchinfo import summary
summary(model, input_size=(1, 512), dtypes=['torch.IntTensor'])

Layer (type:depth-idx)                                       Output Shape              Param #
RobertaForSequenceClassification                             [1, 2]                    --
├─RobertaModel: 1-1                                          [1, 512, 768]             --
│    └─RobertaEmbeddings: 2-1                                [1, 512, 768]             --
│    │    └─Embedding: 3-1                                   [1, 512, 768]             38,603,520
│    │    └─Embedding: 3-2                                   [1, 512, 768]             768
│    │    └─Embedding: 3-3                                   [1, 512, 768]             394,752
│    │    └─LayerNorm: 3-4                                   [1, 512, 768]             1,536
│    │    └─Dropout: 3-5                                     [1, 512, 768]             --
│    └─RobertaEncoder: 2-2                                   [1, 512, 768]             --
│    │    └─ModuleList: 3-6                                  --               

Move the model to the target device and initialize the trainer.

In [11]:
model.to(params.device)
print(f"Trained Dataset: {dataset_path}")
print(f"Device: {params.device}")

optimizer = torch.optim.Adam(params=model.parameters(), lr=params.learning_rate) #roberta

trainer = Trainer(model=model,
                  device=params.device,
                  tokenizer=params.tokenizer,
                  train_dataloader=train_dataloader,
                  validation_dataloader=validation_dataloader,
                  epochs=params.epochs,
                  optimizer=optimizer,
                  val_loss_fn=params.val_loss_fn,
                  num_labels=params.num_labels,
                  output_dir=params.output_dir,
                  save_freq=params.save_freq,
                  checkpoint_freq=params.checkpoint_freq)

Trained Dataset: data/inter_IMDB_sentiment/IMDB_preped_train.csv
Device: mps


Fit the model to our training data.

In [12]:
trainer.fit()

Epoch 1: 100%|██████████| 2000/2000 [45:50<00:00,  1.38s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.53batch/s]


	 - Train loss: 0.213777
	 - Validation Loss: 0.163193
	 - Validation Accuracy: 0.938000
	 - Validation F1: 0.934431
	 - Validation Recall: 0.943260
	 - Validation Precision: 0.934511
	 * Model @ epoch 1 saved to model_saves/intermediate_IMDB_01/E01_A0.94_F0.93


Epoch 2: 100%|██████████| 2000/2000 [45:13<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.52batch/s]


	 - Train loss: 0.134534
	 - Validation Loss: 0.160835
	 - Validation Accuracy: 0.940500
	 - Validation F1: 0.937554
	 - Validation Recall: 0.957213
	 - Validation Precision: 0.927027
	 * Model @ epoch 2 saved to model_saves/intermediate_IMDB_01/E02_A0.94_F0.94


Epoch 3: 100%|██████████| 2000/2000 [45:14<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.52batch/s]


	 - Train loss: 0.088236
	 - Validation Loss: 0.192872
	 - Validation Accuracy: 0.938500
	 - Validation F1: 0.935895
	 - Validation Recall: 0.962625
	 - Validation Precision: 0.919498
	 * Model @ epoch 3 saved to model_saves/intermediate_IMDB_01/E03_A0.94_F0.94


Epoch 4: 100%|██████████| 2000/2000 [45:14<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.52batch/s]


	 - Train loss: 0.056261
	 - Validation Loss: 0.213532
	 - Validation Accuracy: 0.939250
	 - Validation F1: 0.936261
	 - Validation Recall: 0.954764
	 - Validation Precision: 0.926800
	 * Model @ epoch 4 saved to model_saves/intermediate_IMDB_01/E04_A0.94_F0.94


Epoch 5: 100%|██████████| 2000/2000 [45:28<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:23<00:00,  2.46batch/s]


	 - Train loss: 0.041385
	 - Validation Loss: 0.258444
	 - Validation Accuracy: 0.938250
	 - Validation F1: 0.935756
	 - Validation Recall: 0.961460
	 - Validation Precision: 0.920246
	 * Model @ epoch 5 saved to model_saves/intermediate_IMDB_01/E05_A0.94_F0.94


Epoch 6: 100%|██████████| 2000/2000 [46:03<00:00,  1.38s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:22<00:00,  2.47batch/s]


	 - Train loss: 0.032413
	 - Validation Loss: 0.240136
	 - Validation Accuracy: 0.937500
	 - Validation F1: 0.935199
	 - Validation Recall: 0.951838
	 - Validation Precision: 0.927626
	 * Model @ epoch 6 saved to model_saves/intermediate_IMDB_01/E06_A0.94_F0.94


Epoch 7: 100%|██████████| 2000/2000 [45:49<00:00,  1.37s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:20<00:00,  2.50batch/s]


	 - Train loss: 0.023676
	 - Validation Loss: 0.263007
	 - Validation Accuracy: 0.940625
	 - Validation F1: 0.937400
	 - Validation Recall: 0.949269
	 - Validation Precision: 0.933921
	 * Model @ epoch 7 saved to model_saves/intermediate_IMDB_01/E07_A0.94_F0.94


Epoch 8: 100%|██████████| 2000/2000 [45:44<00:00,  1.37s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:19<00:00,  2.50batch/s]


	 - Train loss: 0.020455
	 - Validation Loss: 0.283110
	 - Validation Accuracy: 0.938500
	 - Validation F1: 0.935669
	 - Validation Recall: 0.951117
	 - Validation Precision: 0.928768
	 * Model @ epoch 8 saved to model_saves/intermediate_IMDB_01/E08_A0.94_F0.94


Epoch 9: 100%|██████████| 2000/2000 [45:22<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.52batch/s]


	 - Train loss: 0.018911
	 - Validation Loss: 0.366701
	 - Validation Accuracy: 0.934500
	 - Validation F1: 0.932137
	 - Validation Recall: 0.959832
	 - Validation Precision: 0.914201
	 * Model @ epoch 9 saved to model_saves/intermediate_IMDB_01/E09_A0.93_F0.93


Epoch 10: 100%|██████████| 2000/2000 [45:14<00:00,  1.36s/batch]
	 Validation 499: 100%|██████████| 500/500 [03:18<00:00,  2.52batch/s]


	 - Train loss: 0.015637
	 - Validation Loss: 0.365525
	 - Validation Accuracy: 0.934125
	 - Validation F1: 0.932048
	 - Validation Recall: 0.955656
	 - Validation Precision: 0.918202
	 * Model @ epoch 10 saved to model_saves/intermediate_IMDB_01/E10_A0.93_F0.93
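
Since a checkpoint is saved after every epoch, one of them has to be picked for later use. The sketch below copies the logged metrics into a list and ranks the epochs by validation loss, accuracy, and F1. The selection criteria are an illustration, not necessarily how the checkpoint was chosen in this project.

```python
# (epoch, train_loss, val_loss, val_accuracy, val_f1) copied from the log above
history = [
    (1, 0.213777, 0.163193, 0.938000, 0.934431),
    (2, 0.134534, 0.160835, 0.940500, 0.937554),
    (3, 0.088236, 0.192872, 0.938500, 0.935895),
    (4, 0.056261, 0.213532, 0.939250, 0.936261),
    (5, 0.041385, 0.258444, 0.938250, 0.935756),
    (6, 0.032413, 0.240136, 0.937500, 0.935199),
    (7, 0.023676, 0.263007, 0.940625, 0.937400),
    (8, 0.020455, 0.283110, 0.938500, 0.935669),
    (9, 0.018911, 0.366701, 0.934500, 0.932137),
    (10, 0.015637, 0.365525, 0.934125, 0.932048),
]

best_by_loss = min(history, key=lambda row: row[2])  # lowest validation loss
best_by_acc = max(history, key=lambda row: row[3])   # highest validation accuracy
best_by_f1 = max(history, key=lambda row: row[4])    # highest validation F1

print(best_by_loss[0])  # 2
print(best_by_acc[0])   # 7
print(best_by_f1[0])    # 2
```

Train loss falls in every epoch, while validation loss is lowest at epoch 2 and rises overall afterwards, which suggests the later epochs mostly overfit. Validation accuracy at epoch 7 (0.940625) is only marginally above epoch 2 (0.940500), so epoch 2 looks like a reasonable candidate under any of the three criteria.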
