### An example in finetuning
Using Hugging Face for data

Required reading: https://karpathy.github.io/2019/04/25/recipe/ - "A recipe for training neural networks"

#### Become one with the data

In [1]:
import torch
import random
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from pprint import pprint

def set_seed(seed=42):
    random.seed(seed)
    torch.manual_seed(seed)

set_seed()



In [2]:
from datasets import load_dataset, Dataset

In [3]:
dataset = load_dataset("stanfordnlp/imdb")
train_df = dataset["train"].to_pandas()

In [4]:
train_df['dataset'] = 'train'

In [5]:
print(train_df.head())
print("Number of null values:")
print(train_df.isnull().sum())

                                                text  label dataset
0  I rented I AM CURIOUS-YELLOW from my video sto...      0   train
1  "I Am Curious: Yellow" is a risible and preten...      0   train
2  If only to avoid making this type of film in t...      0   train
3  This film was probably inspired by Godard's Ma...      0   train
4  Oh, brother...after hearing about this ridicul...      0   train
Number of null values:
text       0
label      0
dataset    0
dtype: int64


In [6]:
print("Dataframe Info:")
print(train_df.info())
print("\n")
print("Dataframe Description:")
print(train_df.describe())
print("\n")
print("Number of unique values in each column:")
print(train_df.nunique())

Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     25000 non-null  object
 1   label    25000 non-null  int64 
 2   dataset  25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB
None


Dataframe Description:
             label
count  25000.00000
mean       0.50000
std        0.50001
min        0.00000
25%        0.00000
50%        0.50000
75%        1.00000
max        1.00000


Number of unique values in each column:
text       24904
label          2
dataset        1
dtype: int64


In [7]:
random_index = random.randint(0, len(train_df) - 1)
pprint(train_df.loc[random_index, 'text'])

('Arguably this is a very good "sequel", better than the first live action '
 'film 101 Dalmatians. It has good dogs, good actors, good jokes and all right '
 'slapstick! <br /><br />Cruella DeVil, who has had some rather major therapy, '
 'is now a lover of dogs and very kind to them. Many, including Chloe Simon, '
 'owner of one of the dogs that Cruella once tried to kill, do not believe '
 'this. Others, like Kevin Shepherd (owner of 2nd Chance Dog Shelter) believe '
 'that she has changed. <br /><br />Meanwhile, Dipstick, with his mate, have '
 'given birth to three cute dalmatian puppies! Little Dipper, Domino and '
 'Oddball...<br /><br />Starring Eric Idle as Waddlesworth (the hilarious '
 'macaw), Glenn Close as Cruella herself and Gerard Depardieu as Le Pelt '
 '(another baddie, the name should give a clue), this is a good family film '
 'with excitement and lots more!! One downfall of this film is that is has a '
 'lot of painful slapstick, but not quite as excessive as the l

In [8]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_reviews = train_df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
review_token_lengths = tokenized_reviews.apply(len)
print(f"Shortest review length (in tokens): {review_token_lengths.min()}")
print(f"Longest review length (in tokens): {review_token_lengths.max()}")
print(f"Average review length (in tokens): {review_token_lengths.mean()}")


Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


Shortest review length (in tokens): 13
Longest review length (in tokens): 3127
Average review length (in tokens): 313.87132


The warning above shows that some reviews exceed the model's 512-token maximum (the longest is 3127 tokens), so we'll need to be careful with how we handle the longer reviews in the dataset.
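To size the problem, we can measure what share of reviews exceed the model limit. The sketch below is a minimal, self-contained illustration: `share_over_limit` is a hypothetical helper, and the values in `sample_lengths` are illustrative only (13, 720 and 3127 appear in the output above, the others are made up). In the notebook, `review_token_lengths` could be passed in instead.

```python
def share_over_limit(lengths, limit=512):
    """Return the fraction of sequences longer than `limit` tokens."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    over = sum(1 for n in lengths if n > limit)
    return over / len(lengths)

# Illustrative token counts only, not the real IMDB distribution.
sample_lengths = [13, 120, 314, 720, 3127]
print(f"Share over 512 tokens: {share_over_limit(sample_lengths):.0%}")
```

If the share turns out to be small, plain truncation is probably acceptable; if it is large, it may be worth checking how much the truncated tail matters for the label.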

#### Get baselines

In [9]:
train_dataset = Dataset.from_pandas(train_df)

In [10]:
def tokenize_function(examples):
    # Truncate to the model maximum and pad each map batch to its longest sequence.
    return tokenizer(examples['text'], padding="longest", truncation=True)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)


Map: 100%|██████████| 25000/25000 [00:10<00:00, 2375.22 examples/s]


In [11]:
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
test_df = dataset["test"].to_pandas()
test_df['dataset'] = 'test'

In [13]:
test_dataset = Dataset.from_pandas(test_df)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 25000/25000 [00:10<00:00, 2385.44 examples/s]


In [15]:
# Evaluation only: score the not-yet-finetuned model on the test set as a baseline.
training_args = TrainingArguments(
    output_dir='./temp_results',
    do_train=False,
    do_eval=True,
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_test_dataset
)

test_results = trainer.evaluate()
print(test_results)

100%|██████████| 3125/3125 [01:35<00:00, 32.86it/s]

{'eval_loss': 0.7193034291267395, 'eval_runtime': 101.787, 'eval_samples_per_second': 245.611, 'eval_steps_per_second': 30.701}
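A useful reference point for the `eval_loss` above: a binary classifier that assigns probability 0.5 to each class has a cross-entropy loss of ln 2 ≈ 0.693, whatever the true label is. The sketch below computes that chance-level value and compares it with the observed loss; `chance_level_loss` is a hypothetical helper, not part of any library.

```python
import math

def chance_level_loss(num_classes=2):
    """Cross-entropy of a uniform prediction over `num_classes` classes."""
    return -math.log(1.0 / num_classes)

observed_eval_loss = 0.7193034291267395  # from trainer.evaluate() above
baseline = chance_level_loss(2)
print(f"Chance-level loss: {baseline:.4f}")
print(f"Observed minus chance: {observed_eval_loss - baseline:.4f}")
```

An observed loss this close to ln 2 is consistent with a newly initialized classification head that has not learned the task yet.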





#### Padding

When feeding examples to a model during training, we typically feed them in as batches so they can be processed in parallel, saving time. If the examples in a batch have different lengths, they cannot be stacked into a single rectangular tensor, so this parallel operation is not possible and we would have to process each sequence individually. To force all examples in a batch to be the same length, we use padding and truncation.

In [16]:
example_texts = ["I am a short sentence.", "I am a medium sentence. Not too long, not too short.", "I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on."]

fixed_length_tokens = []
for text in example_texts:
    tokenized_text = tokenizer(text, padding="max_length", max_length=50)
    fixed_length_tokens.append(tokenized_text)

for i in range(3):
    print(f"Original text: {example_texts[i]}")
    print(f"Tokenized text: {fixed_length_tokens[i]['input_ids']}")
    print("\n")

Original text: I am a short sentence.
Tokenized text: [101, 1045, 2572, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Original text: I am a medium sentence. Not too long, not too short.
Tokenized text: [101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146, 1010, 2025, 2205, 2460, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Original text: I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on.
Tokenized text: [101, 1045, 2572, 2019, 17003, 2135, 2146, 11101, 2098, 1010, 12034, 9232, 1010, 7566, 1011, 1037, 1011, 2843, 1010, 21707, 6251, 2008, 2064, 1005, 1056, 2644, 2770, 2006, 1998, 2006, 1998, 2006, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]




Padding every example to a fixed `max_length` is inefficient. It is better to use dynamic padding, where each example in a batch is padded only to the length of the longest sequence in that batch.

In [18]:
dynamically_padded = tokenizer(example_texts, padding='longest')

for i in range(3):
    print(f"Original text: {example_texts[i]}")
    print(f"Tokenized text: {dynamically_padded['input_ids'][i]}")
    print(f"Length of tokenized text: {len(dynamically_padded['input_ids'][i])}")
    print("\n")


Original text: I am a short sentence.
Tokenized text: [101, 1045, 2572, 1037, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of tokenized text: 34


Original text: I am a medium sentence. Not too long, not too short.
Tokenized text: [101, 1045, 2572, 1037, 5396, 6251, 1012, 2025, 2205, 2146, 1010, 2025, 2205, 2460, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Length of tokenized text: 34


Original text: I am an exceedingly longwinded, verbose, talks-a-lot, redundant sentence that can't stop running on and on and on.
Tokenized text: [101, 1045, 2572, 2019, 17003, 2135, 2146, 11101, 2098, 1010, 12034, 9232, 1010, 7566, 1011, 1037, 1011, 2843, 1010, 21707, 6251, 2008, 2064, 1005, 1056, 2644, 2770, 2006, 1998, 2006, 1998, 2006, 1012, 102]
Length of tokenized text: 34
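To put a number on the difference, the sketch below pads three stand-in sequences with the same lengths as our examples (8, 16 and 34 tokens) both ways and counts the pad tokens. `pad_batch` and `count_pad` are hypothetical helpers written for illustration, not the tokenizer's implementation. The `attention_mask` mimics the mask that Hugging Face tokenizers return alongside `input_ids` (1 for real tokens, 0 for padding).

```python
PAD_ID = 0  # pad id seen in the tokenized outputs above

def pad_batch(sequences, target_length=None, pad_id=PAD_ID):
    """Pad token-id lists to `target_length`, or to the longest one if None."""
    longest = max(len(seq) for seq in sequences)
    if target_length is None:
        target_length = longest
    if target_length < longest:
        raise ValueError("target_length is shorter than the longest sequence")
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target_length - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def count_pad(padded):
    """Count padded positions using the attention mask."""
    return sum(row.count(0) for row in padded["attention_mask"])

# Stand-in sequences with the same lengths as the three example texts.
batch = [[1] * 8, [1] * 16, [1] * 34]

fixed = pad_batch(batch, target_length=50)
dynamic = pad_batch(batch)

print("pad tokens, max_length=50:", count_pad(fixed))
print("pad tokens, dynamic:", count_pad(dynamic))
```

The gap grows with the distance between the fixed `max_length` and the longest sequence actually present in each batch.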




#### Truncation

Truncation cuts off sequences that are longer than a chosen limit. With `truncation=True` and no `max_length`, the tokenizer truncates to the longest length permitted by the model (512 tokens here, per the earlier warning), so short inputs are left unchanged. We can also set `max_length` to truncate to a shorter length.

In [19]:
tokenized_no_truncation = tokenizer(example_texts, truncation=False)
print("Length of non-truncated tokens:")
for idx, tok in enumerate(tokenized_no_truncation['input_ids']):
    print(f"Text {idx+1}: {len(tok)}")

tokenized_default_truncation = tokenizer(example_texts, truncation=True)
print("Length of default truncated tokens:")
for idx, tok in enumerate(tokenized_default_truncation['input_ids']):
    print(f"Text {idx+1}: {len(tok)}")

tokenized_max_length = tokenizer(example_texts, truncation=True, max_length=5)
print("Length of truncated tokens when max_length = 5:")
for idx, tok in enumerate(tokenized_max_length['input_ids']):
    print(f"Text {idx+1}: {len(tok)}")

Length of non-truncated tokens:
Text 1: 8
Text 2: 16
Text 3: 34
Length of default truncated tokens:
Text 1: 8
Text 2: 16
Text 3: 34
Length of truncated tokens when max_length = 5:
Text 1: 5
Text 2: 5
Text 3: 5
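The lengths above tell us how many tokens survive, but not which ones. The sketch below shows the behaviour as we understand it for the default right-side truncation: the leading `[CLS]` (101) and trailing `[SEP]` (102) are kept, and tokens are dropped from the end of the text in between. `truncate_ids` is a hypothetical helper for illustration, not the library's implementation.

```python
def truncate_ids(input_ids, max_length):
    """Truncate from the right, keeping the leading [CLS] and trailing [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(input_ids) <= max_length:
        return list(input_ids)
    body = input_ids[1:-1]
    return [input_ids[0]] + body[: max_length - 2] + [input_ids[-1]]

# Token ids of "I am a short sentence." from the padding output above.
short_sentence = [101, 1045, 2572, 1037, 2460, 6251, 1012, 102]

print(truncate_ids(short_sentence, max_length=5))
print(truncate_ids(short_sentence, max_length=512))
```

With `max_length=5` only three content tokens remain, which is a reminder that special tokens count toward the limit.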


The rest of the finetuning process will be done in Google Colab to access GPUs.