<a href="https://colab.research.google.com/github/justina-tran/yelp-reviews/blob/master/notebooks/06_transformers_text_clf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I will leverage open-source Python libraries to load our dataset, fine-tune a pretrained [Transformer](https://huggingface.co/docs/transformers/tasks/sequence_classification) model used for text classification, and evaluate its performance.

For the fine-tuning process, I will use a pretrained model available on Hugging Face through the `transformers` library: a [RoBERTa-base](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model trained on ~124M tweets and fine-tuned for sentiment analysis.

Out of the box, this model is not suited to our use case of review classification: it was trained on general tweets and predicts three sentiment labels, not the five star-rating categories we aim to predict from reviews. I will fine-tune the model on this dataset and evaluate the fine-tuned model using macro F1-score, AUCPR (area under the precision-recall curve), and accuracy.

In [1]:
!pip install datasets transformers evaluate gradio -q

In [2]:
!pip install transformers[torch] -q
!pip install accelerate -U

In [3]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
from collections import Counter
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback, DataCollatorWithPadding
import evaluate
from sklearn.metrics import f1_score, average_precision_score, accuracy_score
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Import dataset from HuggingFace

In [4]:
entire_dataset = load_dataset("justina/yelp_boba_reviews")
print(entire_dataset)

DatasetDict({
    train: Dataset({
        features: ['store', 'date', 'username', 'user_loc', 'rating', 'reviews', 'cleaned', 'cleaned_text', 'positive', 'negative', 'neutral', 'compound', 'sentiment'],
        num_rows: 13211
    })
})


In [5]:
entire_dataset = entire_dataset.class_encode_column("rating")
rating_distribution = Counter(entire_dataset['train']['rating'])
rating_distribution

Counter({4: 6811, 0: 533, 2: 1431, 3: 3797, 1: 639})

In [6]:
# Select columns to keep
selected_features = ["reviews", "cleaned_text", "rating"]
selected_dataset = entire_dataset['train'].select_columns(selected_features)
# selected_dataset = selected_dataset.rename_column('rating','labels').rename_column('cleaned_text','text')

In [7]:
# Create train, validation, and test splits
train_set = selected_dataset.train_test_split(test_size=0.3, stratify_by_column='rating', seed=5)
valid_test = train_set['test'].train_test_split(test_size=0.5, stratify_by_column='rating', seed=5)

# Gather all datasets in a single DatasetDict
dataset = DatasetDict({
    'train': train_set['train'],
    'test': valid_test['test'],
    'valid': valid_test['train']})
print(dataset)
print('Train Set Distribution:\n', dataset['train'].to_pandas()['rating'].value_counts())
print('\nValidation Set Distribution:\n', dataset['valid'].to_pandas()['rating'].value_counts())
print('\nTest Set Distribution:\n',dataset['test'].to_pandas()['rating'].value_counts())

DatasetDict({
    train: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 9247
    })
    test: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1982
    })
    valid: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1982
    })
})
Train Set Distribution:
 4    4767
3    2658
2    1002
1     447
0     373
Name: rating, dtype: int64

Validation Set Distribution:
 4    1022
3     570
2     214
1      96
0      80
Name: rating, dtype: int64

Test Set Distribution:
 4    1022
3     569
2     215
1      96
0      80
Name: rating, dtype: int64
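
As a quick sanity check, the split sizes printed above can be reproduced by hand. This is a minimal sketch that assumes the held-out share is rounded up to a whole row (the convention scikit-learn uses, and one that matches the numbers shown here); `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up to a whole row.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# 13,211 reviews -> 70% train, 30% held out
n_train, n_holdout = split_sizes(13211, 0.3)
# The held-out rows are then halved into validation and test
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 9247 1982 1982
```

The result is roughly a 70/15/15 split, with stratification keeping the rating proportions close to identical across the three sets.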


## Load Model for Fine-Tuning

The Twitter-roBERTa-base model outputs the labels:
-  0 -> Negative
-  1 -> Neutral
-  2 -> Positive

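To test this model on our dataset, I would expect a negative label for reviews with 1 or 2 stars, a neutral label for reviews with 3 stars, and a positive label for reviews with 4 or 5 stars. Note that `class_encode_column` stores the ratings as class ids 0 to 4, which here correspond to 1 to 5 stars, so an "Actual" value of 3 in the outputs below is a 4-star review.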

To prepare the model for fine-tuning on a new task, I will:
1. Load the pre-trained Twitter-roBERTa-base model.  
2. Modify the model to output star rating labels instead of sentiment labels.
  - Adjust the number of output units to 5 (for star ratings 1 to 5).  
  - Adjust the id2label and label2id mappings to correspond to the new star rating labels.  
3. Tokenize text to transform our data as inputs for this model.

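After these adjustments, the tokenizer converts the review text into model inputs and the model outputs one of five star-rating labels instead of a sentiment label. Because the new 5-unit classification head is randomly initialized, its predictions are essentially arbitrary until the model is fine-tuned.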



In [8]:
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
# Model output with original labels on 10 reviews
encoded_input = tokenizer(dataset['train']['reviews'][:10], return_tensors='pt', padding=True, truncation=True,
                          max_length=500, add_special_tokens = True).to(device)

with torch.no_grad():
  logits = model(**encoded_input).logits

for i in range(10):
  prediction = logits[i].argmax(-1).item()
  print("Predicted class:", model.config.id2label[prediction], ' | Actual:', dataset['train'][i]['rating'])

Predicted class: positive  | Actual: 3
Predicted class: negative  | Actual: 0
Predicted class: negative  | Actual: 1
Predicted class: negative  | Actual: 1
Predicted class: positive  | Actual: 4
Predicted class: positive  | Actual: 3
Predicted class: negative  | Actual: 0
Predicted class: positive  | Actual: 4
Predicted class: positive  | Actual: 3
Predicted class: positive  | Actual: 4
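
The ten predictions above can be checked against the expected sentiment-to-stars mapping. The sketch below assumes that encoded ratings 0 to 4 stand for 1 to 5 stars; the dictionary name `EXPECTED_SENTIMENT` is mine, and the pairs are copied from the printed output.

```python
# Expected sentiment for each encoded rating.
# Assumption: class_encode_column mapped star ratings 1..5 to ids 0..4.
EXPECTED_SENTIMENT = {
    0: "negative",  # 1 star
    1: "negative",  # 2 stars
    2: "neutral",   # 3 stars
    3: "positive",  # 4 stars
    4: "positive",  # 5 stars
}

# (predicted sentiment, encoded rating) pairs copied from the output above
pairs = [
    ("positive", 3), ("negative", 0), ("negative", 1), ("negative", 1),
    ("positive", 4), ("positive", 3), ("negative", 0), ("positive", 4),
    ("positive", 3), ("positive", 4),
]

matches = sum(pred == EXPECTED_SENTIMENT[rating] for pred, rating in pairs)
print(f"{matches}/{len(pairs)} predictions agree with the expected sentiment")
```

On this small sample the sentiment model agrees with the coarse mapping every time, but it has no way to separate 1 from 2 stars or 4 from 5 stars, which is the gap fine-tuning is meant to close.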


In [10]:
star_labels = ["1", "2", "3", "4", "5"]

model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=5,
                                                           id2label={i: label for i, label in enumerate(star_labels)},
                                                           label2id={label: i for i, label in enumerate(star_labels)},
                                                           ignore_mismatched_sizes=True)
model.to(device);

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpo

In [11]:
print(model.config.id2label[4])
print(model.config.label2id['3'])

5
2


In [12]:
print('Review:', dataset['train']['reviews'][0])
print('Star rating:', dataset['train']['rating'][0])

Review: I've been searching for the perfect bubble tea in NYC.

While my quest for "perfection" is still ongoing, this is one of my top contenders from the long list of bubble tea places I've tried.

Friendly staff, a lot of drink options, and thankfully their stuff isn't over the top sweet and syrupy.

I got the lychee green tea with bubbles. The regular sweetness level was sweet, but not nearly as pow over the head with sugar as other places.

You can modify sweetness level etc here too.
Star rating: 3


In [13]:
# Model output with new labels on 10 reviews
sample_texts = [dataset['train'][i]['reviews'] for i in range(10)]
encoded_input = tokenizer(sample_texts, return_tensors='pt', padding=True, truncation=True,
                          max_length=500, add_special_tokens = True).to(device)

with torch.no_grad():
  logits = model(**encoded_input).logits

for i in range(10):
  prediction = logits[i].argmax(-1).item()
  print("Predicted class:", model.config.id2label[prediction], ' | Actual:', dataset['train'][i]['rating'])

Predicted class: 1  | Actual: 3
Predicted class: 5  | Actual: 0
Predicted class: 5  | Actual: 1
Predicted class: 5  | Actual: 1
Predicted class: 1  | Actual: 4
Predicted class: 4  | Actual: 3
Predicted class: 3  | Actual: 0
Predicted class: 4  | Actual: 4
Predicted class: 5  | Actual: 3
Predicted class: 4  | Actual: 4


# Data Sampling

I will select a small subset of the data to compare how the fine-tuned model performs on the original review text versus the preprocessed (cleaned) review text.


In [14]:
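def tokenize_and_map(dataset, text_col):
  def tokenize_function(example_batch):
    # Return plain lists: `map` stores them in the dataset, and the Trainer
    # moves each batch to the right device during training.
    # padding=True pads every text to the longest one in the current map batch.
    encoded_inputs = tokenizer(example_batch[text_col], truncation=True, padding=True, max_length=500)
    encoded_inputs['labels'] = example_batch['rating']
    return encoded_inputs

  # Tokenize every split (train/valid/test) with the same function
  tokenized_datasets = {subset_name: dataset[subset_name].map(tokenize_function, batched=True)
                        for subset_name in dataset.keys()
                        }
  return tokenized_datasets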
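
`tokenize_function` pads with `padding=True`, so every review is padded to the longest one in its `map` batch. `DataCollatorWithPadding` can instead pad each training batch to its own longest sequence at collation time. As a rough illustration of the difference, the sketch below counts pad tokens under both strategies using made-up token lengths (not taken from the dataset); `count_pad_tokens` is a hypothetical helper.

```python
def count_pad_tokens(lengths, batch_size):
    """Pad tokens needed when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Hypothetical token counts for eight reviews, one of them very long
lengths = [40, 55, 60, 500, 35, 48, 52, 45]

# Padding at tokenization time: one big batch, everything padded to 500
static_pad = count_pad_tokens(lengths, batch_size=len(lengths))
# Dynamic padding: each training batch of 4 is padded to its own maximum
dynamic_pad = count_pad_tokens(lengths, batch_size=4)

print(static_pad, dynamic_pad)  # 3165 1373
```

A single long review inflates every row that shares its padding batch, so smaller padding batches generally mean fewer wasted tokens per training step.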

In [15]:
# DataCollator for sequence classification
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [16]:
train_sampled_data = Dataset.from_dict(dataset['train'][:10])

encoded_input = tokenizer(train_sampled_data['reviews'], return_tensors='pt', padding=True, truncation=True,
                          max_length=500, add_special_tokens = True).to(device)
encoded_input['labels'] = train_sampled_data['rating']
encoded_input.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [17]:
sampled_data = DatasetDict({
    'train': Dataset.from_dict(dataset['train'][:1000]),
    'valid': Dataset.from_dict(dataset['valid'][:1000]),
    'test': Dataset.from_dict(dataset['test'][:1000])

    })
sampled_data

DatasetDict({
    train: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1000
    })
    valid: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1000
    })
})

In [18]:
tokenized_org_datasets = tokenize_and_map(sampled_data, text_col='reviews')
tokenized_cleaned_datasets = tokenize_and_map(sampled_data, text_col='cleaned_text')

In [19]:
tokenized_org_datasets

{'train': Dataset({
     features: ['reviews', 'cleaned_text', 'rating', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 }),
 'valid': Dataset({
     features: ['reviews', 'cleaned_text', 'rating', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['reviews', 'cleaned_text', 'rating', 'input_ids', 'attention_mask', 'labels'],
     num_rows: 1000
 })}

In [20]:
tokenizer.decode(tokenized_org_datasets['train'][0]['input_ids'])

'<s>I\'ve been searching for the perfect bubble tea in NYC.\n\nWhile my quest for "perfection" is still ongoing, this is one of my top contenders from the long list of bubble tea places I\'ve tried.\n\nFriendly staff, a lot of drink options, and thankfully their stuff isn\'t over the top sweet and syrupy.\n\nI got the lychee green tea with bubbles. The regular sweetness level was sweet, but not nearly as pow over the head with sugar as other places.\n\nYou can modify sweetness level etc here too.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

## Fine-tune model on sampled dataset

In [21]:
def compute_metrics(p):
  # p.predictions holds the logits with shape (n_samples, n_classes)
  predictions = np.argmax(p.predictions, axis=1)
  references = p.label_ids
  f1_macro = f1_score(references, predictions, average='macro')
  # One-hot encode the labels with one column per class (not one per sample)
  n_classes = p.predictions.shape[1]
  aucpr_macro = average_precision_score(np.eye(n_classes)[references], p.predictions, average='macro')
  accuracy = accuracy_score(references, predictions)
  return {"f1_macro": f1_macro,
          "aucpr_macro": aucpr_macro,
          "accuracy": accuracy}

early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=2,  # Number of consecutive evaluations without improvement before stopping
    early_stopping_threshold=0, # Minimum change in score to be considered as improvement
)
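
A likely reason to track macro F1 alongside accuracy is the class imbalance in the ratings. As a rough illustration with made-up labels (not from the dataset), the sketch below computes both metrics by hand for a classifier that always predicts the majority class. `macro_f1` is a hypothetical helper that follows the textbook definition; it is not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

# Made-up, imbalanced labels: class 4 (5 stars) dominates
y_true = [4, 4, 4, 4, 4, 3, 3, 2, 1, 0]
y_pred = [4] * len(y_true)  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred, n_classes=5), 3))  # 0.5 0.133
```

Accuracy rewards the majority-class guess, while macro F1 gives each rating equal weight and exposes the four classes that are never predicted.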

In [22]:
sample_training_args = TrainingArguments(
    output_dir='./sample-clf',
    evaluation_strategy="steps",
    eval_steps=100,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    logging_dir='./logs',
    logging_steps=25,
    weight_decay=0.01,
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True, # required for early_stopping_callback
)

In [23]:
# Create the Trainer using dataset with original text
org_trainer = Trainer(
    model=model,
    args=sample_training_args,
    train_dataset=tokenized_org_datasets['train'],
    eval_dataset=tokenized_org_datasets['valid'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
)

# Start training
org_trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | F1 Macro | AUCPR Macro | Accuracy |
|---|---|---|---|---|---|
| 100 | 0.8058 | 0.845148 | 0.40696 | 0.535215 | 0.634 |
| 200 | 0.7012 | 0.810152 | 0.470447 | 0.546532 | 0.673 |
| 300 | 0.6971 | 0.894795 | 0.477519 | 0.553419 | 0.67 |
| 400 | 0.4298 | 0.859028 | 0.521949 | 0.575395 | 0.671 |
| 500 | 0.4825 | 0.841717 | 0.545494 | 0.584146 | 0.683 |
| 600 | 0.4745 | 0.867872 | 0.553464 | 0.588791 | 0.68 |


TrainOutput(global_step=625, training_loss=0.641634896850586, metrics={'train_runtime': 628.3419, 'train_samples_per_second': 7.957, 'train_steps_per_second': 0.995, 'total_flos': 1284756555000000.0, 'train_loss': 0.641634896850586, 'epoch': 5.0})

In [24]:
# Evaluate on test set
org_trainer.evaluate(tokenized_org_datasets['test'])

{'eval_loss': 0.8349262475967407,
 'eval_f1_macro': 0.5491654729162183,
 'eval_aucpr_macro': 0.5933266301441792,
 'eval_accuracy': 0.669,
 'eval_runtime': 29.532,
 'eval_samples_per_second': 33.862,
 'eval_steps_per_second': 4.233,
 'epoch': 5.0}

In [25]:
# Create the Trainer using dataset with cleaned text
cleaned_trainer = Trainer(
    model=model,
    args=sample_training_args,
    train_dataset=tokenized_cleaned_datasets['train'],
    eval_dataset=tokenized_cleaned_datasets['valid'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
)

# Start training
cleaned_trainer.train()

| Step | Training Loss | Validation Loss | F1 Macro | AUCPR Macro | Accuracy |
|---|---|---|---|---|---|
| 100 | 0.8466 | 1.012614 | 0.404675 | 0.471747 | 0.609 |
| 200 | 0.7761 | 0.973569 | 0.468842 | 0.480057 | 0.622 |
| 300 | 0.657 | 1.076638 | 0.471818 | 0.479306 | 0.614 |
| 400 | 0.4993 | 1.071007 | 0.4787 | 0.482404 | 0.617 |
| 500 | 0.4861 | 1.169387 | 0.484871 | 0.469933 | 0.622 |
| 600 | 0.4318 | 1.16485 | 0.471792 | 0.478419 | 0.617 |


TrainOutput(global_step=625, training_loss=0.6378323684692383, metrics={'train_runtime': 417.1203, 'train_samples_per_second': 11.987, 'train_steps_per_second': 1.498, 'total_flos': 747728315010000.0, 'train_loss': 0.6378323684692383, 'epoch': 5.0})

In [26]:
# Evaluate on test set
cleaned_trainer.evaluate(tokenized_cleaned_datasets['test'])

{'eval_loss': 1.1293377876281738,
 'eval_f1_macro': 0.49172105568597296,
 'eval_aucpr_macro': 0.5260173356763254,
 'eval_accuracy': 0.621,
 'eval_runtime': 22.5275,
 'eval_samples_per_second': 44.39,
 'eval_steps_per_second': 5.549,
 'epoch': 5.0}

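With a 1,000-row sample for training, the raw review text outperformed the cleaned, preprocessed text on the test set (macro F1 0.549 vs. 0.492; accuracy 0.669 vs. 0.621). One caveat: both trainers were given the same `model` object, so the cleaned-text run started from weights already fine-tuned on the raw text rather than from the pretrained checkpoint, and the comparison is not fully controlled. I will proceed with fine-tuning on the raw text from the complete dataset. Before that, I will experiment with undersampling because the rating labels are highly imbalanced.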

# Undersampling

Since the majority of the reviews are 5-star ratings and there may be computational constraints, I will undersample the training data. This gives a smaller, more balanced training set.

1. Perform train_test_split and save the test split for evaluation
2. Perform train_test_split on the train split to get the train and validation sets
3. Find the size of the smallest class in the training set
4. Undersample every class in the training set down to that size
5. Combine the train, validation, and test sets into a single DatasetDict
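
Steps 3 and 4 can be sketched in plain Python on toy labels. This is a minimal illustration rather than the `datasets`-based code used in the notebook; `undersample` and the toy label counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(labels, seed=0):
    """Return sorted row indices with every class cut to the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Step 3: the smallest class sets the per-class sample count
    n_keep = min(len(rows) for rows in by_class.values())
    # Step 4: draw that many rows at random from every class
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_keep))
    return sorted(kept)

# Toy training labels, skewed toward class 4 like the ratings
train_labels = [4] * 50 + [3] * 30 + [2] * 10 + [1] * 6 + [0] * 4
kept = undersample(train_labels)
balanced = Counter(train_labels[i] for i in kept)
print(balanced)
```

Only the training split is balanced this way; the validation and test splits keep their natural rating distribution so the metrics reflect realistic data.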

In [27]:
rating_distribution.items()

dict_items([(4, 6811), (0, 533), (2, 1431), (3, 3797), (1, 639)])

In [28]:
undsampled_dataset1 = entire_dataset['train'].train_test_split(test_size=0.3, stratify_by_column='rating')
undsampled_dataset2 = undsampled_dataset1['train'].train_test_split(test_size=0.3, stratify_by_column='rating')
undsampled_dataset2

DatasetDict({
    train: Dataset({
        features: ['store', 'date', 'username', 'user_loc', 'rating', 'reviews', 'cleaned', 'cleaned_text', 'positive', 'negative', 'neutral', 'compound', 'sentiment'],
        num_rows: 6472
    })
    test: Dataset({
        features: ['store', 'date', 'username', 'user_loc', 'rating', 'reviews', 'cleaned', 'cleaned_text', 'positive', 'negative', 'neutral', 'compound', 'sentiment'],
        num_rows: 2775
    })
})

In [29]:
train_class_dist = Counter(undsampled_dataset2['train']['rating'])
print('Train Class Distribution:', train_class_dist)

# Use the size of the smallest class as the per-class sample count
minority_count = min(train_class_dist.values())
print('Minority sample size:', minority_count)

# Perform undersampling
undersampled_data = []
for rating, count in train_class_dist.items():
  samples = undsampled_dataset2['train'] \
    .filter(lambda example: example['rating'] == rating) \
    .shuffle() \
    [:minority_count]
  undersampled_data.append(Dataset.from_dict(samples))

Train Class Distribution: Counter({4: 3337, 3: 1860, 2: 701, 1: 313, 0: 261})
Minority sample size: 261


In [30]:
balanced_dataset = concatenate_datasets(undersampled_data)
balanced_dataset

Dataset({
    features: ['store', 'date', 'username', 'user_loc', 'rating', 'reviews', 'cleaned', 'cleaned_text', 'positive', 'negative', 'neutral', 'compound', 'sentiment'],
    num_rows: 1305
})

In [31]:
# Encode label
balanced_dataset = balanced_dataset.class_encode_column("rating")

# Gather all datasets in a single DatasetDict
balanced_dataset_dict = DatasetDict({
    'train': balanced_dataset,
    'valid': undsampled_dataset2['test'],
    'test': undsampled_dataset1['test'],
})

print("Balanced Train Class Distribution:", Counter(balanced_dataset_dict['train']['rating']))
print("Validation Class Distribution:", Counter(balanced_dataset_dict['valid']['rating']))
print("Test Class Distribution:", Counter(balanced_dataset_dict['test']['rating']))

Balanced Train Class Distribution: Counter({4: 261, 0: 261, 1: 261, 3: 261, 2: 261})
Validation Class Distribution: Counter({4: 1430, 3: 798, 2: 301, 1: 134, 0: 112})
Test Class Distribution: Counter({4: 2044, 3: 1139, 2: 429, 1: 192, 0: 160})


In [32]:
# tokenized data
tokenized_undsampled_dataset = tokenize_and_map(balanced_dataset_dict, text_col='reviews')

## Fine-tune model on undersampled dataset

In [33]:
undsampled_training_args = TrainingArguments(
    output_dir='./undersampled-review-clf',
    evaluation_strategy="steps",
    eval_steps=100,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    logging_dir='./logs',
    save_steps=1000,
    logging_steps=100,
    weight_decay=0.01,
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True # required for early_stopping_callback
)

In [34]:
undsampled_trainer = Trainer(
    model=model,
    args=undsampled_training_args,
    train_dataset=tokenized_undsampled_dataset['train'],
    eval_dataset=tokenized_undsampled_dataset['valid'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
)
undsampled_trainer.train()

| Step | Training Loss | Validation Loss | F1 Macro | AUCPR Macro | Accuracy |
|---|---|---|---|---|---|
| 100 | 0.9348 | 0.728649 | 0.613222 | 0.624362 | 0.696216 |
| 200 | 0.7438 | 0.785659 | 0.623172 | 0.621529 | 0.673514 |
| 300 | 0.6275 | 0.831714 | 0.597611 | 0.609245 | 0.677838 |
| 400 | 0.5561 | 0.817602 | 0.620016 | 0.623758 | 0.686847 |

TrainOutput(global_step=410, training_loss=0.7119309553285924, metrics={'train_runtime': 873.5083, 'train_samples_per_second': 7.47, 'train_steps_per_second': 0.469, 'total_flos': 1676607304275000.0, 'train_loss': 0.7119309553285924, 'epoch': 5.0})

In [35]:
undsampled_trainer.evaluate(tokenized_undsampled_dataset['test'])

{'eval_loss': 0.7951959371566772,
 'eval_f1_macro': 0.6014773630797381,
 'eval_aucpr_macro': 0.6287525136672937,
 'eval_accuracy': 0.6869323915237134,
 'eval_runtime': 118.955,
 'eval_samples_per_second': 33.324,
 'eval_steps_per_second': 4.17,
 'epoch': 5.0}

# Fine-tune model on entire dataset

In [36]:
dataset

DatasetDict({
    train: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 9247
    })
    test: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1982
    })
    valid: Dataset({
        features: ['reviews', 'cleaned_text', 'rating'],
        num_rows: 1982
    })
})

In [37]:
tokenized_datasets = tokenize_and_map(dataset, text_col='reviews')

In [39]:
fullds_training_args = TrainingArguments(
    output_dir='./full-review-clf',
    evaluation_strategy="steps",
    eval_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    logging_dir='./logs',
    save_steps=1000,
    logging_steps=100,
    weight_decay=0.01,
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True # required for early_stopping_callback
)

In [40]:
full_trainer = Trainer(
    model=model,
    args=fullds_training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['valid'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[early_stopping_callback],
)
full_trainer.train()

| Step | Training Loss | Validation Loss | F1 Macro | AUCPR Macro | Accuracy |
|---|---|---|---|---|---|
| 500 | 0.723 | 0.757558 | 0.597876 | 0.665158 | 0.683148 |
| 1000 | 0.7307 | 0.686171 | 0.636758 | 0.675224 | 0.718466 |
| 1500 | 0.5828 | 0.739832 | 0.643908 | 0.666064 | 0.72553 |
| 2000 | 0.6236 | 0.787836 | 0.621158 | 0.668974 | 0.706862 |
| 2500 | 0.3739 | 0.813828 | 0.64472 | 0.675228 | 0.716953 |
| 3000 | 0.4235 | 0.804803 | 0.649032 | 0.667344 | 0.72553 |
| 3500 | 0.3684 | 0.961478 | 0.6483 | 0.671477 | 0.720484 |
| 4000 | 0.3243 | 1.093125 | 0.643181 | 0.663212 | 0.723512 |

TrainOutput(global_step=4000, training_loss=0.5463592495918274, metrics={'train_runtime': 3354.4709, 'train_samples_per_second': 13.783, 'train_steps_per_second': 1.723, 'total_flos': 8221671098067000.0, 'train_loss': 0.5463592495918274, 'epoch': 3.46})
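
The run above ended at step 4000 (epoch 3.46 of 5), which suggests early stopping was triggered. The sketch below replays a simplified patience rule over the validation macro F1 values from the log. It is an approximation for illustration only: the library callback tracks the best metric through the trainer state, and `first_stop_step` is a hypothetical helper.

```python
def first_stop_step(steps, scores, patience):
    """Return the step at which `patience` consecutive evaluations failed to
    beat the best score so far, or None if that never happens."""
    best = float("-inf")
    stale = 0
    for step, score in zip(steps, scores):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None

# Validation macro F1 copied from the training log above
steps = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000]
f1 = [0.597876, 0.636758, 0.643908, 0.621158,
      0.64472, 0.649032, 0.6483, 0.643181]

print(first_stop_step(steps, f1, patience=2))  # 4000
```

Under this rule the best score is reached at step 3000, and the two evaluations that follow fail to improve on it, so training stops at step 4000.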

In [41]:
full_trainer.evaluate(tokenized_datasets['test'])

{'eval_loss': 0.8197635412216187,
 'eval_f1_macro': 0.6357510859552182,
 'eval_aucpr_macro': 0.665778394832688,
 'eval_accuracy': 0.7184661957618567,
 'eval_runtime': 59.2065,
 'eval_samples_per_second': 33.476,
 'eval_steps_per_second': 4.189,
 'epoch': 3.46}

# Summary Results

In [42]:
# Get all validation and test results
validation_dfs = []
test_dfs = []

# Define list of trainers, experiment name, and tokenized datasets
trainers = [cleaned_trainer, org_trainer, undsampled_trainer, full_trainer]
trainer_names = ['cleaned_sample', 'original_sample', 'original_undersampled', 'original_full']
tokenized_ds = [tokenized_cleaned_datasets, tokenized_org_datasets, tokenized_undsampled_dataset, tokenized_datasets]

for trainer, trainer_name, data in zip(trainers, trainer_names, tokenized_ds):
  validation_metrics = trainer.evaluate()
  test_metrics = trainer.evaluate(data['test'])

  validation_df = {
      "Experiment": trainer_name,
      "Validation F1": validation_metrics["eval_f1_macro"],
      "Validation AUCPR": validation_metrics["eval_aucpr_macro"],
      "Validation Accuracy": validation_metrics["eval_accuracy"]
  }
  validation_dfs.append(pd.DataFrame([validation_df]))

  test_df= {
      "Experiment": trainer_name,
      "Test F1": test_metrics["eval_f1_macro"],
      "Test AUCPR": test_metrics["eval_aucpr_macro"],
      "Test Accuracy": test_metrics["eval_accuracy"]
  }
  test_dfs.append(pd.DataFrame([test_df]))

In [43]:
validation_result_df = pd.concat(validation_dfs, ignore_index=True)
test_result_df = pd.concat(test_dfs, ignore_index=True)
result_df = validation_result_df.merge(test_result_df, on="Experiment")
result_df

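| Experiment | Validation F1 | Validation AUCPR | Validation Accuracy | Test F1 | Test AUCPR | Test Accuracy |
|---|---|---|---|---|---|---|
| cleaned_sample | 0.451017 | 0.478846 | 0.615 | 0.441426 | 0.501866 | 0.615 |
| original_sample | 0.64311 | 0.635833 | 0.72 | 0.613258 | 0.623404 | 0.703 |
| original_undersampled | 0.783916 | 0.831282 | 0.836396 | 0.779919 | 0.828606 | 0.846367 |
| original_full | 0.649032 | 0.667344 | 0.72553 | 0.635751 | 0.665778 | 0.718466 |

Note: except for `original_full`, these figures do not match the per-experiment evaluations earlier in the notebook (for example, `original_undersampled` reaches a test F1 of 0.780 here versus 0.601 right after its own training run). This is consistent with all four trainers holding the same `model` object, so every row is scored with the weights left by the final full-dataset run. The earlier per-experiment evaluations are the fairer comparison.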
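
Every `Trainer` in this notebook is constructed with the same `model` variable, so the experiments are not independent of one another. The toy sketch below shows the aliasing effect and one way around it. `ToyModel` and `ToyTrainer` are hypothetical stand-ins written for this example, not `transformers` classes.

```python
import copy

class ToyModel:
    """Stand-in for a neural network: one trainable number."""
    def __init__(self):
        self.weight = 0.0

class ToyTrainer:
    """Stand-in for a trainer: keeps a reference to the model it is given."""
    def __init__(self, model):
        self.model = model

    def train(self, delta):
        self.model.weight += delta

base = ToyModel()

# Shared object: the second run also changes what the first trainer holds
shared_a, shared_b = ToyTrainer(base), ToyTrainer(base)
shared_a.train(1.0)
shared_b.train(10.0)
print(shared_a.model.weight, shared_b.model.weight)  # 11.0 11.0

# Independent copies: each run starts from the same untouched weights
fresh = ToyModel()
indep_a = ToyTrainer(copy.deepcopy(fresh))
indep_b = ToyTrainer(copy.deepcopy(fresh))
indep_a.train(1.0)
indep_b.train(10.0)
print(indep_a.model.weight, indep_b.model.weight)  # 1.0 10.0
```

In this notebook, the analogous step would be to call `AutoModelForSequenceClassification.from_pretrained(...)` again before building each `Trainer`, or to pass a `model_init` callable to `Trainer` so that each run starts from the pretrained checkpoint.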


# Inference

In [44]:
text = "The drinks at this place are absolutely perfect! Each sip is bursting with flavor."
encoded_input = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
  logits = full_trainer.model(**encoded_input).logits
prediction = logits.argmax(-1).item()
print("Predicted star-rating:", full_trainer.model.config.id2label[prediction])

Predicted star-rating: 5


In [45]:
text = "The staff was rude and the drinks were overpriced but super tasty."
encoded_input = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
  logits = full_trainer.model(**encoded_input).logits
prediction = logits.argmax(-1).item()
print("Predicted star-rating:", full_trainer.model.config.id2label[prediction])

Predicted star-rating: 3
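
The inference cells print only the arg-max label. For a mixed review like the one above, the spread of probability across the star classes can be more informative. The sketch below applies a softmax to hypothetical logits (invented for illustration, not real model output).

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the five star classes (not real model output)
logits = [-1.2, -0.3, 1.9, 1.4, -0.8]
probs = softmax(logits)

star_labels = ["1", "2", "3", "4", "5"]
best = max(range(len(probs)), key=probs.__getitem__)
print("Predicted star-rating:", star_labels[best])
for label, p in zip(star_labels, probs):
    print(f"{label} stars: {p:.2f}")
```

With the real model, the same idea would apply to the `logits` tensor returned in the cells above, for example through `torch.softmax(logits, dim=-1)`.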


# Save Model

In [46]:
from huggingface_hub import notebook_login

In [55]:
notebook_login()

In [57]:
full_trainer.push_to_hub()

'https://huggingface.co/justina/full-review-clf/tree/main/'

In [58]:
undsampled_trainer.push_to_hub()

'https://huggingface.co/justina/undersampled-review-clf/tree/main/'

In [59]:
tokenizer.push_to_hub(repo_id="justina/undersampled-review-clf")

CommitInfo(commit_url='https://huggingface.co/justina/undersampled-review-clf/commit/8cc7d4879893db499dc14fc184ab1fb1fccce156', commit_message='Upload tokenizer', commit_description='', oid='8cc7d4879893db499dc14fc184ab1fb1fccce156', pr_url=None, pr_revision=None, pr_num=None)