***Imports***

In [1]:
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

***Loading Data***

In [2]:
ds = load_dataset('imdb')

***Tokenizer***

In [3]:
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

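# Tokenize a batch of reviews: truncate to the model's maximum length and
# pad every review in the batch to the longest sequence in that batch
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, padding=True)

# Apply the tokenizer to every split of the dataset, one batch at a time
tokenized_data = ds.map(tokenize, batched=True)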

***Model***

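For this model we will use DistilBERT, a distilled version of BERT that is smaller and faster. Its authors report about 40% fewer parameters and roughly 60% faster inference while retaining about 97% of BERT's language-understanding performance.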

In [4]:
model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased',
                                                            num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


***Evaluation Metric Setup***

In [5]:
# Use F1 as the evaluation metric
metric = evaluate.load('f1')

# Called by the Trainer during evaluation with a (logits, labels) pair
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # class with the highest logit
    return metric.compute(predictions=predictions, references=labels)
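
To make the metric concrete, here is a minimal sketch of what `compute_metrics` works out for a binary task, written by hand with NumPy. The helper name `f1_from_logits` and the toy arrays are made up for illustration, and the sketch assumes the `f1` metric's default binary averaging with label `1` as the positive class.

```python
import numpy as np

def f1_from_logits(logits, labels, positive_label=1):
    """Binary F1 computed by hand from raw logits (hypothetical helper)."""
    # Same step as in compute_metrics: pick the class with the highest logit
    predictions = np.argmax(logits, axis=-1)

    # Count true positives, false positives and false negatives
    tp = np.sum((predictions == positive_label) & (labels == positive_label))
    fp = np.sum((predictions == positive_label) & (labels != positive_label))
    fn = np.sum((predictions != positive_label) & (labels == positive_label))

    # F1 = 2*TP / (2*TP + FP + FN); return 0.0 when nothing is positive
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Toy data: four examples, two classes (values are illustrative only)
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 1])

toy_f1 = f1_from_logits(toy_logits, toy_labels)
print(toy_f1)
```

The toy predictions are `[0, 1, 1, 0]`, which gives one true positive, one false positive and one false negative, so the helper returns an F1 of 0.5.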

***Training***

In [6]:
training_args = TrainingArguments(
    output_dir='my_model',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

{'loss': 0.322, 'grad_norm': 13.952038764953613, 'learning_rate': 1.3602047344849649e-05, 'epoch': 0.32}
{'loss': 0.2414, 'grad_norm': 4.882312297821045, 'learning_rate': 7.204094689699297e-06, 'epoch': 0.64}
{'loss': 0.2135, 'grad_norm': 10.112589836120605, 'learning_rate': 8.061420345489445e-07, 'epoch': 0.96}
{'train_runtime': 4109.802, 'train_samples_per_second': 6.083, 'train_steps_per_second': 0.38, 'train_loss': 0.25685090951559564, 'epoch': 1.0}


TrainOutput(global_step=1563, training_loss=0.25685090951559564, metrics={'train_runtime': 4109.802, 'train_samples_per_second': 6.083, 'train_steps_per_second': 0.38, 'total_flos': 3311684966400000.0, 'train_loss': 0.25685090951559564, 'epoch': 1.0})
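
The step count and the logged learning rates can be checked with a little arithmetic. The sketch below assumes the 25,000-example IMDB train split, a single device with no gradient accumulation, and the `Trainer` defaults of a linear decay schedule with no warmup and logging every 500 steps. None of these defaults are set explicitly in the cell above, so treat them as assumptions.

```python
import math

train_examples = 25_000   # assumed size of the IMDB train split
batch_size = 16           # per_device_train_batch_size from the cell above
base_lr = 2e-5            # learning_rate from the cell above
logging_steps = 500       # assumed Trainer default

# One optimizer step per batch; the last partial batch still counts
total_steps = math.ceil(train_examples / batch_size)

def linear_decay_lr(step, total=total_steps, lr=base_lr):
    """Learning rate after `step` updates under linear decay without warmup."""
    return lr * max(0.0, (total - step) / total)

for step in range(logging_steps, total_steps + 1, logging_steps):
    epoch = step / total_steps
    print(f"step={step} epoch={epoch:.2f} lr={linear_decay_lr(step):.6e}")
```

Under these assumptions `total_steps` comes out to 1563, matching `global_step` above, and the learning rates at steps 500, 1000 and 1500 agree with the three logged values (about 1.36e-05, 7.20e-06 and 8.06e-07).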

***Evaluate***

In [7]:
trainer.evaluate()

{'eval_loss': 0.1963438242673874,
 'eval_f1': 0.9264570028517493,
 'eval_runtime': 1281.265,
 'eval_samples_per_second': 19.512,
 'eval_steps_per_second': 1.22,
 'epoch': 1.0}

**Insights:** An `f1` score of `0.9265` on the test split after a single epoch shows that the fine-tuned model performs well. It is also a much better result than our original Random Forest baseline. There is still room to optimize these results, which we will do in the next notebook.

***Saving Model***

In [8]:
trainer.save_model('../models/fine_tune_model')
tokenizer.save_pretrained('../models/fine_tune_model')

Due to GitHub issues, none of the models could be saved to a folder during this project. The cell above can easily be rerun to save the model if desired.