# TaD Q5 Context Vectors using BERT
Isaac Tabb

02/26/23

### Step 0: Load in the datasets

As always, we will begin by loading the training and validation sets.

In [None]:
import pandas as pd
from google.colab import files
uploaded = files.upload()

import io 
train_df = pd.read_csv(io.BytesIO(uploaded['training_set.csv']))
valid_df = pd.read_csv(io.BytesIO(uploaded['validation_set.csv']))

Saving training_set.csv to training_set.csv
Saving validation_set.csv to validation_set.csv


We will then convert the datasets to dictionaries and create two separate lists for each dataset, one for tweets and one for labels.

In [None]:
# turn the dataframes into dictionaries
train_dct = train_df.to_dict('records')
valid_dct = valid_df.to_dict('records')

# create two separate lists, the tweets and the labels for each set
train_tweets, train_labels = [], []
for tweet in train_dct:
  train_tweets.append(tweet['text'])
  train_labels.append(tweet['team'])

valid_tweets, valid_labels = [], []
for tweet in valid_dct:
  valid_tweets.append(tweet['text'])
  valid_labels.append(tweet['team'])
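
Because the results later in this notebook hinge on how the labels are distributed, it can help to check the class balance right after building the label lists. The following is a minimal sketch, not part of the original run: `toy_labels` is a made-up stand-in for `train_labels`, and `label_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy labels standing in for train_labels (the real list is built above).
toy_labels = ["MiamiHeat", "MiamiHeat", "MiamiHeat", "BostonCeltics", "DenverNuggets"]

def label_shares(labels):
    """Return (label, share of dataset) pairs, most frequent label first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, count / total) for label, count in counts.most_common()]

shares = label_shares(toy_labels)
for label, share in shares:
    print(f"{label}: {share:.1%}")
```

Calling `label_shares(train_labels)` on the real data would show how dominant the most frequent team is.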

### (a) 
Encode the text of your documents using the ‘feature-extraction’ pipeline from the HuggingFace library with the
‘roberta_base’ model. Use only the first context vector for each document (which should represent the start token).
Pass the context vectors (without any other previous features) into a LogisticRegression classifier from scikit-learn
and train using the training set. Report the evaluation metrics on the validation set. 

First let's install the transformers and datasets libraries.

In [None]:
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.1-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-

Now let's create the feature-extraction pipeline using the roberta-base model.

In [None]:
import torch
from transformers import pipeline

# initialize the feature-extraction pipeline from hugging face using the roberta-base model
pipe = pipeline('feature-extraction', model="roberta-base")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We will now encode and save the first context vector for each document.

In [None]:
from tqdm import tqdm

# cvlist will hold all of the start token context vectors
cvlist = []
# iterate through the training tweets
for tweet in tqdm(train_tweets):
  # apply the pipeline to the tweet and keep only the first context vector (the start token);
  # device placement is configured when the pipeline is created, so no device argument is passed here
  cv = pipe(tweet, return_tensors='pt')[0,0,:]
  cvlist.append(cv)

# stack the context vectors into one pytorch tensor
cvall_train = torch.stack(cvlist)

# do the same on the validation set
cvlist = []
for tweet in tqdm(valid_tweets):
  # same call as for the training set: keep only the start-token vector
  cv = pipe(tweet, return_tensors='pt')[0,0,:]
  cvlist.append(cv)

cvall_valid = torch.stack(cvlist)

100%|██████████| 6000/6000 [15:01<00:00,  6.65it/s]
100%|██████████| 2000/2000 [04:55<00:00,  6.77it/s]


Let's look at the shape of the training tensor.

In [None]:
cvall_train.shape

torch.Size([6000, 768])

As we can see, the tensor is of size 6000 tweets x 768 features, as we expected!
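
To make the indexing `[0,0,:]` concrete, here is a small sketch with plain nested lists instead of tensors. The values and the 3-dimensional vectors are made up for readability (the real vectors have 768 features), and `first_context_vector` is a hypothetical helper, not a library function.

```python
# Toy stand-in for the pipeline output of one document, shaped
# [1][num_tokens][hidden_size]. All numbers are invented.
toy_output = [[
    [0.1, 0.2, 0.3],  # position 0: start token (<s> for RoBERTa)
    [0.4, 0.5, 0.6],  # position 1: first word piece
    [0.7, 0.8, 0.9],  # position 2: end token
]]

def first_context_vector(output):
    """Return the vector at batch index 0, token position 0."""
    return output[0][0]

# One start-token vector per document gives a documents x features matrix.
features = [first_context_vector(toy_output) for _ in range(2)]
print(len(features), len(features[0]))
```

With 6000 real documents and 768 features per vector, the same construction yields the 6000 x 768 shape shown above.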

Now let's define the classifier. We will use the Logistic Regression classifier with its default settings.

In [None]:
from sklearn.linear_model import LogisticRegression

# default settings are used; the solver stops at the default max_iter limit (see the warning below)
clf = LogisticRegression().fit(cvall_train, train_labels)
labels_predicted = clf.predict(cvall_valid)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


And finally, let's calculate the accuracy, precision, recall, and F1 of the classifier on the validation set.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(valid_labels, labels_predicted)
print(f"{accuracy=:.3f}")

precision = precision_score(valid_labels, labels_predicted, average='macro')
print(f"{precision=:.3f}")

recall = recall_score(valid_labels, labels_predicted, average='macro')
print(f"{recall=:.3f}")

f1 = f1_score(valid_labels, labels_predicted, average='macro')
print(f"{f1=:.3f}")

accuracy=0.726
precision=0.505
recall=0.273
f1=0.256


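As we can see above, the Logistic Regression classifier does not perform well: accuracy is 0.726, but macro-averaged recall (0.273) and F1 (0.256) are low, which suggests that most predictions fall into the most frequent class.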
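
The gap between accuracy and the macro-averaged scores is easier to see when the macro average is written out by hand. The sketch below is a plain-Python illustration of the idea behind `average='macro'` (per-class scores averaged with equal weight), not scikit-learn's implementation; the toy labels are invented and undefined ratios are counted as 0 by assumption.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1; undefined ratios count as 0."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical toy data: 8 of 10 examples are class "A", and every prediction is "A".
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"{accuracy=:.3f} {precision=:.3f} {recall=:.3f} {f1=:.3f}")
```

In this toy case accuracy stays high because the frequent class is always right, while the two classes that are never predicted pull every macro score down.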

### (b)
Train an end-to-end classifier using the ‘trainer’ function from the HuggingFace library, again using the
‘roberta_base’ model. Use a learning rate = 1e-4, epochs = 1, batch_size = 16 and no weight decay. Report the
evaluation metrics on the validation set.


First, let's define our tokenizer and our model for sequence classification. We will again use the 'roberta-base' model. We also specify the id2label mapping (along with its reverse, label2id) and set the num_labels parameter to 4, the number of teams.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

# define our tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# initialize dictionaries with id->label and label->id
id2label = {0: "MiamiHeat", 1: "LosAngelesLakers", 2: "BostonCeltics", 3: "DenverNuggets"}
label2id = {"MiamiHeat": 0, "LosAngelesLakers": 1, "BostonCeltics": 2, "DenverNuggets": 3}
# create our sequence classification model using roberta-base
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4, id2label=id2label, label2id=label2id)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/ff46155979338ff8063cdad90908b498ab91b181/config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading file vocab.json from cache at /root/.cache/huggingface/hub/mo

We need to define our training and validation datasets using the Dataset functionality of the datasets library.

In [None]:
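# iterate through the training dictionary, tokenizing each tweet and padding it to max_length
# max_length is 157, the length of the longest tweet in the dataset
# note: tokenizer.encode returns only the input ids, so no attention mask marking the padding is stored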
final_train_dct = {'input_ids': [], 'labels': []}
for dct in train_dct:
  final_train_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_train_dct['labels'].append(label2id[dct['team']])

# do the same for the validation set
final_valid_dct = {'input_ids': [], 'labels': []}
for dct in valid_dct:
  final_valid_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_valid_dct['labels'].append(label2id[dct['team']])

# using the Dataset functionality of the datasets library, set up the training and validation sets
from datasets import Dataset
train_dataset = Dataset.from_dict(final_train_dct)
valid_dataset = Dataset.from_dict(final_valid_dct)
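
The cell above stores only `input_ids`, because `tokenizer.encode` returns just the ids. As a rough illustration of what padding to a fixed length involves, and of the attention mask that usually accompanies padded ids, here is a pure-Python sketch. `pad_to_max_length` is a hypothetical helper, not a transformers function; the pad id 1 and the start/end ids 0 and 2 are taken from the roberta-base config printed earlier, and the other token ids are made up. The truncation here is a plain slice, which is a simplification of what a real tokenizer does.

```python
PAD_TOKEN_ID = 1  # pad_token_id from the roberta-base config printed earlier

def pad_to_max_length(token_ids, max_length, pad_id=PAD_TOKEN_ID):
    """Hypothetical helper: truncate or pad ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask

# Made-up token ids framed by the start (0) and end (2) token ids.
input_ids, attention_mask = pad_to_max_length([0, 100, 200, 2], max_length=6)
print(input_ids)
print(attention_mask)
```

Calling the tokenizer object directly, rather than `tokenizer.encode`, returns both `input_ids` and `attention_mask`, so that would be one way to keep the mask in the dataset.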

Let's set our training arguments.

In [None]:
# assign training arguments
training_args = TrainingArguments(
    output_dir="NBA Tweets Model",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0,
    evaluation_strategy="epoch"
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


We will define our data collator.

In [None]:
from transformers import DataCollatorWithPadding

# define the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We will set our trainer up.

In [None]:
# setup the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

And let's now train.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6000
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 124648708
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.897562 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=0.9043717447916667, metrics={'train_runtime': 185.4231, 'train_samples_per_second': 32.358, 'train_steps_per_second': 2.022, 'total_flos': 484091923536000.0, 'train_loss': 0.9043717447916667, 'epoch': 1.0})
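
The step count in the log can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation (as the log states) and that the last partial batch is kept; `total_optimization_steps` is a hypothetical helper name. It also assumes the logging interval of 500 steps, which as far as we can tell is the default `logging_steps` of `TrainingArguments`, and which would explain the "No log" entry for training loss.

```python
import math

LOGGING_STEPS = 500  # assumed default logging interval of TrainingArguments

def total_optimization_steps(num_examples, batch_size, epochs):
    """Batches per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

steps_b = total_optimization_steps(6000, 16, 1)
print(steps_b)                   # matches "Total optimization steps" in the log
print(steps_b < LOGGING_STEPS)   # True means no training loss is logged in this run
```

The same formula gives 564 steps for batch size 32 with 3 epochs and 1880 steps with 10 epochs, which are the values reported for the runs in (c).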

Let's run our predictions on the validation dataset for evaluation.

In [None]:
predictions, label_ids, metrics = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 16


Here are the metrics returned.

In [None]:
metrics

{'test_loss': 0.8975624442100525,
 'test_runtime': 16.1787,
 'test_samples_per_second': 123.62,
 'test_steps_per_second': 7.726}

Let's create a list of the predicted labels, in the same format as the output of the scikit-learn classifier's predict method. This will be used to calculate our evaluation metrics.

In [None]:
import numpy as np

labels_predicted = []
# iterate through the predictions and save the label that was predicted
# using the argmax function
for prediction in predictions:
  labels_predicted.append(id2label[np.argmax(prediction)])
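
For readers unfamiliar with this step, the following sketch shows the same argmax-to-label mapping on made-up logits, using only plain Python. `logits_to_labels` is a hypothetical helper and the numbers are invented; `id2label` is copied from the definition earlier in the notebook.

```python
# id2label as defined earlier in the notebook
id2label = {0: "MiamiHeat", 1: "LosAngelesLakers", 2: "BostonCeltics", 3: "DenverNuggets"}

# Made-up logits for two documents, one score per class.
toy_logits = [
    [2.1, -0.3, 0.4, -1.0],
    [-0.5, 0.2, 1.7, 0.1],
]

def logits_to_labels(logits, id2label):
    """Map each row of class scores to the label with the highest score."""
    labels = []
    for row in logits:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(id2label[best_index])
    return labels

toy_predicted = logits_to_labels(toy_logits, id2label)
print(toy_predicted)
```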

Let's see the scores!

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(valid_labels, labels_predicted)
print(f"{accuracy=:.3f}")

precision = precision_score(valid_labels, labels_predicted, average='macro')
print(f"{precision=:.3f}")

recall = recall_score(valid_labels, labels_predicted, average='macro')
print(f"{recall=:.3f}")

f1 = f1_score(valid_labels, labels_predicted, average='macro')
print(f"{f1=:.3f}")

accuracy=0.723
precision=0.181
recall=0.250
f1=0.210


  _warn_prf(average, modifier, msg_start, len(result))


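As you can see, the scores are poor. Looking closer, this is because the classifier always predicts the majority class: macro recall is exactly 0.250 (one of four classes fully recalled, the other three never predicted), and accuracy (0.723) equals the share of that class in the validation set. This is likely a result of class imbalance and a lack of diversity of tweets in the dataset.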
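
The reported numbers can be checked against what a classifier that always predicts the majority class would score. This is a small worked calculation rather than anything from the original run; it assumes that precision for classes that are never predicted is counted as 0, and `majority_baseline_macro` is a hypothetical helper name.

```python
def majority_baseline_macro(prevalence, num_classes):
    """Macro precision, recall and F1 when every prediction is the majority class."""
    precision_major = prevalence  # every prediction is the majority class
    recall_major = 1.0            # every majority example is recovered
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)
    # all other classes contribute 0 to each average
    return (precision_major / num_classes,
            recall_major / num_classes,
            f1_major / num_classes)

# 0.723 is the validation accuracy reported above, taken as the majority-class share.
precision, recall, f1 = majority_baseline_macro(0.723, 4)
print(f"{precision=:.3f}")
print(f"{recall=:.3f}")
print(f"{f1=:.3f}")
```

The three values agree with the precision, recall and F1 printed above, which supports the reading that only one class is ever predicted.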

### (c)
Try different values for the model, learning_rate, epochs and batch_size. Normally, you would do some form of
systematic search across these values, but due to computational costs, you should not do that. Pick three different
sets of these hyperparameters and describe your motivation for these choices. Retrain the models from scratch on
the training set and report the evaluation metrics on the validation set for those three settings in a table along with
the hyperparameter settings from (b).

Settings choices:

(1) model = ‘distilbert-base-uncased’, learning_rate=1e-5, batch_size=32, epochs=3

(2) model = ‘roberta-base’, learning_rate=1e-5, batch_size=32, epochs=3

(3) model = ‘distilbert-base-uncased’, learning_rate=1e-5, batch_size=32, epochs=10

#### Option 1

Let's define our distilbert-base-uncased tokenizer and model. The hyperparameters are the ones reported for the DistilBERT-base-uncased model fine-tuned on SST-2, a sentiment classification benchmark, on the assumption that settings which work well for that short-text classification task will also suit our tweets.

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# this time we are using specifically the distilbert tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# similarly, we are using distilbert for sequence classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4, id2label=id2label, label2id=label2id)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/tokenizer_config.json


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBer

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/pytorch_model.bin
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBert

Let's set up our training and validation datasets.

In [None]:
# these are set up in the same way as earlier
final_train_dct = {'input_ids': [], 'labels': []}
for dct in train_dct:
  final_train_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_train_dct['labels'].append(label2id[dct['team']])

final_valid_dct = {'input_ids': [], 'labels': []}
for dct in valid_dct:
  final_valid_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_valid_dct['labels'].append(label2id[dct['team']])

train_dataset = Dataset.from_dict(final_train_dct)
valid_dataset = Dataset.from_dict(final_valid_dct)

And our training arguments, which have notably changed.

In [None]:
training_args = TrainingArguments(
    output_dir="DistilBERT NBA Tweets Model",
    learning_rate=1e-5, # learning rate decreased
    per_device_train_batch_size=32, # batch size increased
    per_device_eval_batch_size=32,
    num_train_epochs=3, # epochs increased
    weight_decay=0,
    evaluation_strategy="epoch"
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


We will use the same Data Collator but with the new tokenizer.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We will define our trainer.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

And now let's train!

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6000
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 564
  Number of trainable parameters = 66956548


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.818447 |
| 2 | No log | 0.784614 |
| 3 | 0.826700 | 0.771171 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
Saving model checkpoint to DistilBERT NBA Tweets Model/checkpoint-500
Configuration saved in DistilBERT NBA Tweets Model/checkpoint-500/config.json
Model weights saved in DistilBERT NBA Tweets Model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in DistilBERT NBA Tweets Model/checkpoint-500/tokenizer_config.json
Special tokens file saved in DistilBERT NBA Tweets Model/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=564, training_loss=0.8202929260037469, metrics={'train_runtime': 267.4279, 'train_samples_per_second': 67.308, 'train_steps_per_second': 2.109, 'total_flos': 731184024816000.0, 'train_loss': 0.8202929260037469, 'epoch': 3.0})

We will now make predictions on our validation set.

In [None]:
predictions, label_ids, metrics = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 32


And save the labels predicted.

In [None]:
labels_predicted = []
for prediction in predictions:
  labels_predicted.append(id2label[np.argmax(prediction)])

And the scores!

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(valid_labels, labels_predicted)
print(f"{accuracy=:.3f}")

precision = precision_score(valid_labels, labels_predicted, average='macro')
print(f"{precision=:.3f}")

recall = recall_score(valid_labels, labels_predicted, average='macro')
print(f"{recall=:.3f}")

f1 = f1_score(valid_labels, labels_predicted, average='macro')
print(f"{f1=:.3f}")

accuracy=0.752
precision=0.517
recall=0.374
f1=0.398


  _warn_prf(average, modifier, msg_start, len(result))


As we can see here, the macro F1 score improved over the RoBERTa base model from (b), from 0.210 to 0.398, but it is still not very good.

#### Option 2

Now let's try our second option: training the RoBERTa base model with the same hyperparameters as the DistilBERT model in Option 1.

In [None]:
# define tokenizer and model, note we are using roberta-base model
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4, id2label=id2label, label2id=label2id)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--roberta-base/snapshots/ff46155979338ff8063cdad90908b498ab91b181/config.json
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading file vocab.json from cache at /root/.cache/huggingface/hub/mo

We will set up our datasets.

In [None]:
final_train_dct = {'input_ids': [], 'labels': []}
for dct in train_dct:
  final_train_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_train_dct['labels'].append(label2id[dct['team']])

final_valid_dct = {'input_ids': [], 'labels': []}
for dct in valid_dct:
  final_valid_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_valid_dct['labels'].append(label2id[dct['team']])

train_dataset = Dataset.from_dict(final_train_dct)
valid_dataset = Dataset.from_dict(final_valid_dct)

Set our training arguments (unchanged from Option 1, apart from the output directory).

In [None]:
training_args = TrainingArguments(
    output_dir="NBA Tweets Model",
    learning_rate=1e-5,  # learning rate decreased from first RoBERTa model (1e-4)
    per_device_train_batch_size=32,  # batch size increased from first RoBERTa model
    per_device_eval_batch_size=32,
    num_train_epochs=3,   # epochs increased from first RoBERTa model
    weight_decay=0,
    evaluation_strategy="epoch"
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Set our Data Collator.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Set up our trainer.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

And train!

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6000
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 564
  Number of trainable parameters = 124648708
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.862428 |
| 2 | No log | 0.831801 |
| 3 | 0.865900 | 0.826666 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
Saving model checkpoint to NBA Tweets Model/checkpoint-500
Configuration saved in NBA Tweets Model/checkpoint-500/config.json
Model weights saved in NBA Tweets Model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in NBA Tweets Model/checkpoint-500/tokenizer_config.json
Special tokens file saved in NBA Tweets Model/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=564, training_loss=0.8618091759106792, metrics={'train_runtime': 533.4153, 'train_samples_per_second': 33.745, 'train_steps_per_second': 1.057, 'total_flos': 1452275770608000.0, 'train_loss': 0.8618091759106792, 'epoch': 3.0})

Now let's make our predictions on the validation set.

In [None]:
predictions, label_ids, metrics = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 32


And save the labels predicted.

In [None]:
labels_predicted = []
for prediction in predictions:
  labels_predicted.append(id2label[np.argmax(prediction)])

And the scores!

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(valid_labels, labels_predicted)
print(f"{accuracy=:.3f}")

precision = precision_score(valid_labels, labels_predicted, average='macro')
print(f"{precision=:.3f}")

recall = recall_score(valid_labels, labels_predicted, average='macro')
print(f"{recall=:.3f}")

f1 = f1_score(valid_labels, labels_predicted, average='macro')
print(f"{f1=:.3f}")

accuracy=0.723
precision=0.181
recall=0.250
f1=0.210


  _warn_prf(average, modifier, msg_start, len(result))


The scores are still very poor and identical to those in (b). It seems that the model continues to always predict the majority class, MiamiHeat.

#### Option 3

Finally, we will try running the DistilBERT model with epochs increased to 10. This will let us see if the model validation loss begins to plateau, or possibly even overfit.

In [None]:
# define tokenizer and model, back to distilbert-base-uncased
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4, id2label=id2label, label2id=label2id)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_tok

We will set up our datasets.

In [None]:
final_train_dct = {'input_ids': [], 'labels': []}
for dct in train_dct:
  final_train_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_train_dct['labels'].append(label2id[dct['team']])

final_valid_dct = {'input_ids': [], 'labels': []}
for dct in valid_dct:
  final_valid_dct['input_ids'].append(tokenizer.encode(dct['text'], padding='max_length', max_length=157, truncation=True))
  final_valid_dct['labels'].append(label2id[dct['team']])

train_dataset = Dataset.from_dict(final_train_dct)
valid_dataset = Dataset.from_dict(final_valid_dct)

And our training arguments with 10 epochs.

In [None]:
training_args = TrainingArguments(
    output_dir="DistilBERT NBA Tweets Model",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,  # epochs increase to 10
    weight_decay=0,
    evaluation_strategy="epoch"
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


We will use the same data collator.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We set up our trainer.

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

And train!

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6000
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1880
  Number of trainable parameters = 66956548


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.832915 |
| 2 | No log | 0.757714 |
| 3 | 0.813600 | 0.698658 |
| 4 | 0.813600 | 0.683569 |
| 5 | 0.813600 | 0.676288 |
| 6 | 0.654600 | 0.658739 |
| 7 | 0.654600 | 0.693933 |
| 8 | 0.573100 | 0.665691 |
| 9 | 0.573100 | 0.666985 |
| 10 | 0.573100 | 0.667161 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
Saving model checkpoint to DistilBERT NBA Tweets Model/checkpoint-500
Configuration saved in DistilBERT NBA Tweets Model/checkpoint-500/config.json
Model weights saved in DistilBERT NBA Tweets Model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in DistilBERT NBA Tweets Model/checkpoint-500/tokenizer_config.json
Special tokens file saved in DistilBERT NBA Tweets Model/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 32
Saving model checkpoint to DistilBERT NBA Tweets Model/checkpoint-1000
Configuration saved in DistilBERT NBA Tweets Model/checkpoint-1000/config.json
Model weights saved in DistilBERT NBA Tweets Model/checkpoint-1000/pyto

TrainOutput(global_step=1880, training_loss=0.6494517711882896, metrics={'train_runtime': 889.3864, 'train_samples_per_second': 67.462, 'train_steps_per_second': 2.114, 'total_flos': 2437280082720000.0, 'train_loss': 0.6494517711882896, 'epoch': 10.0})
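
Since the point of this run is to see whether the validation loss plateaus, a quick way to read the log is to find the epoch with the lowest validation loss. The sketch below simply copies the validation losses from the training log above into a dictionary; it is an illustration and was not part of the original run.

```python
# Validation loss per epoch, copied from the training log above.
val_loss = {
    1: 0.832915, 2: 0.757714, 3: 0.698658, 4: 0.683569, 5: 0.676288,
    6: 0.658739, 7: 0.693933, 8: 0.665691, 9: 0.666985, 10: 0.667161,
}

# Epoch whose validation loss is lowest.
best_epoch = min(val_loss, key=val_loss.get)
# Number of later epochs that did not improve on it.
epochs_since_best = max(val_loss) - best_epoch

print(best_epoch, val_loss[best_epoch], epochs_since_best)
```

The predictions below come from the model as it stands after the final epoch, not from the epoch with the lowest validation loss.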

Let's run the predictions.

In [None]:
predictions, label_ids, metrics = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 32


And retrieve our predicted labels.

In [None]:
labels_predicted = []
for prediction in predictions:
  labels_predicted.append(id2label[np.argmax(prediction)])

And the scores!

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(valid_labels, labels_predicted)
print(f"{accuracy=:.3f}")

precision = precision_score(valid_labels, labels_predicted, average='macro')
print(f"{precision=:.3f}")

recall = recall_score(valid_labels, labels_predicted, average='macro')
print(f"{recall=:.3f}")

f1 = f1_score(valid_labels, labels_predicted, average='macro')
print(f"{f1=:.3f}")

accuracy=0.770
precision=0.633
recall=0.538
f1=0.576


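As we can see, increasing the epochs to 10 does result in better performance: the macro F1 score of 0.576 is the best score we have seen from any of the classifiers in the project. The validation loss reaches its minimum at epoch 6 (0.659) and then levels off around 0.667 while the training loss keeps falling (to 0.573), which suggests the model is beginning to overfit.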
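
To compare the four fine-tuning runs side by side, the reported validation metrics can be collected into one table. The snippet below is a small sketch that formats the numbers printed earlier in this notebook; `format_table` is a hypothetical helper and nothing here is re-computed from the models.

```python
# (setting, model, learning rate, batch size, epochs, accuracy, precision, recall, F1)
# All values are copied from the outputs reported above.
results = [
    ("(b)", "roberta-base", "1e-4", 16, 1, 0.723, 0.181, 0.250, 0.210),
    ("(c) 1", "distilbert-base-uncased", "1e-5", 32, 3, 0.752, 0.517, 0.374, 0.398),
    ("(c) 2", "roberta-base", "1e-5", 32, 3, 0.723, 0.181, 0.250, 0.210),
    ("(c) 3", "distilbert-base-uncased", "1e-5", 32, 10, 0.770, 0.633, 0.538, 0.576),
]

def format_table(rows):
    """Return the rows as lines of a pipe-delimited table."""
    header = ["Setting", "Model", "LR", "Batch", "Epochs",
              "Accuracy", "Precision", "Recall", "F1"]
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for setting, model, lr, batch, epochs, *metrics in rows:
        cells = [setting, model, lr, str(batch), str(epochs)]
        cells += [f"{m:.3f}" for m in metrics]
        lines.append("| " + " | ".join(cells) + " |")
    return lines

table = format_table(results)
print("\n".join(table))

# Setting with the highest macro F1.
best = max(results, key=lambda row: row[-1])
print(best[0])
```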