This notebook shows how to load the datasets and use them to train a sentiment classifier.
The classifier only learns to replicate the sentiment labels that have already been applied to the data.

We will review the dataset first, and then train the model.

In [1]:
from pathlib import Path

PROJECT_ROOT = Path(".").resolve().parent
MODEL_FOLDER = PROJECT_ROOT / "models"
DATA_RAW_FOLDER = PROJECT_ROOT / "data" / "raw"

# these are unique by the combination of title, text AND sentiment
BURGER_KING_FILE = DATA_RAW_FOLDER / "burger-king.csv" # ~145k rows
WENDYS_FILE = DATA_RAW_FOLDER / "wendys.csv" # ~50k rows

MODEL_RUN_FOLDER = MODEL_FOLDER / "example-sentiment"
MODEL_RUN_FOLDER.mkdir(exist_ok=True, parents=True)

MODEL_NAME = "bert-base-uncased" # ~110M parameters, small by current standards
BATCH_SIZE = 8 # adjust to your RAM

In [2]:
import pandas as pd

wendys_df = pd.read_csv(WENDYS_FILE)

# make it unique by text
wendys_df = wendys_df.drop_duplicates(subset="text")
wendys_df = wendys_df.reset_index(drop=True)

wendys_df

Unnamed: 0,title,text,sentiment
0,RT @libsoftiktok $500 fine for burning down a ...,RT @libsoftiktok $500 fine for burning down a ...,neutral
1,RT @RealSweet17 Do you think the punishment fi...,RT @RealSweet17 Do you think the punishment fi...,negative
2,RT @CollinRugg JUST IN: Two rioters who were r...,RT @CollinRugg JUST IN: Two rioters who were r...,negative
3,RT @charliekirk11 Whoa! More two-tiered justic...,RT @charliekirk11 Whoa! More two-tiered justic...,negative
4,RT @RepMTG J6’ers are being locked up for year...,RT @RepMTG J6’ers are being locked up for year...,negative
...,...,...,...
46232,@sara_ash88 when I was working hard on my weig...,@sara_ash88 when I was working hard on my weig...,positive
46233,nigga dat work at wendys said ohh ik you and s...,nigga dat work at wendys said ohh ik you and s...,neutral
46234,"🔆 The sunny days are calling, and so is the bo...","🔆 The sunny days are calling, and so is the bo...",neutral
46235,Cara the GM at edwardsville Wendy's,Cara the GM at edwardsville Wendy's,neutral


Each row holds the title of the thread or article and the text around the match.
For Twitter the title is the same as the text, except that the whitespace has been normalized.

To train with this we need to generate a single text column and convert the sentiment into an index.
We can just drop the title column as this is really a demonstration - you may wish to do something more sophisticated.
To convert the sentiment into an index we can convert it into a [category](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).
The category datatype is used for distinct values (like our positive/neutral/negative sentiment column) and creates a mapping between an integer value and the associated original value.
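
If you did want to keep the title, one possible approach is to prepend it to the text only when the two differ.
This is a minimal sketch on made-up rows. The `combine` helper and its whitespace comparison are my assumptions for illustration and are not part of the dataset tooling.

```python
import pandas as pd

def combine(title: str, text: str) -> str:
    """Prepend the title unless it only repeats the text (ignoring whitespace)."""
    normalized_title = " ".join(title.split())
    normalized_text = " ".join(text.split())
    if normalized_title == normalized_text:
        return text
    return f"{title}\n\n{text}"

# made-up rows for illustration only
example_df = pd.DataFrame(
    {
        "title": ["Lunch thread", "great fries today"],
        "text": ["The new burger is fine", "great  fries\ttoday"],
    }
)
example_df["combined"] = [
    combine(title, text)
    for title, text in zip(example_df.title, example_df.text)
]
```

The second row mimics a tweet, where the title adds nothing, so only the text is kept.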

In [3]:
wendys_df["sentiment"] = wendys_df.sentiment.astype("category")
wendys_df["label"] = wendys_df.sentiment.cat.codes
label_to_sentiment = dict(enumerate(wendys_df.sentiment.cat.categories))

wendys_df

Unnamed: 0,title,text,sentiment,label
0,RT @libsoftiktok $500 fine for burning down a ...,RT @libsoftiktok $500 fine for burning down a ...,neutral,1
1,RT @RealSweet17 Do you think the punishment fi...,RT @RealSweet17 Do you think the punishment fi...,negative,0
2,RT @CollinRugg JUST IN: Two rioters who were r...,RT @CollinRugg JUST IN: Two rioters who were r...,negative,0
3,RT @charliekirk11 Whoa! More two-tiered justic...,RT @charliekirk11 Whoa! More two-tiered justic...,negative,0
4,RT @RepMTG J6’ers are being locked up for year...,RT @RepMTG J6’ers are being locked up for year...,negative,0
...,...,...,...,...
46232,@sara_ash88 when I was working hard on my weig...,@sara_ash88 when I was working hard on my weig...,positive,2
46233,nigga dat work at wendys said ohh ik you and s...,nigga dat work at wendys said ohh ik you and s...,neutral,1
46234,"🔆 The sunny days are calling, and so is the bo...","🔆 The sunny days are calling, and so is the bo...",neutral,1
46235,Cara the GM at edwardsville Wendy's,Cara the GM at edwardsville Wendy's,neutral,1


In [4]:
label_to_sentiment

{0: 'negative', 1: 'neutral', 2: 'positive'}

## Training

The NLP model that we will use expects to receive tokens instead of text, which means we have to encode the text as well.

Training on all 50k rows would take too long.
Instead I am going to reduce this dataset to 1,000 rows for training, 100 for validation and 100 as a test set.
Ideally the test set would come from a different dataset (the Burger King posts might be good for this, although they are still social media posts about restaurants).

In [5]:
wendys_df = wendys_df[["text", "label"]]

# 1k for training, 100 for validation, 100 for test
wendys_df = wendys_df[:1_200]

# ♬ he's making a list,
# he's checking it twice,
# he's gonna find out,
# who's setting on a copy of a slice ♬
wendys_df = wendys_df.copy()

wendys_df

Unnamed: 0,text,label
0,RT @libsoftiktok $500 fine for burning down a ...,1
1,RT @RealSweet17 Do you think the punishment fi...,0
2,RT @CollinRugg JUST IN: Two rioters who were r...,0
3,RT @charliekirk11 Whoa! More two-tiered justic...,0
4,RT @RepMTG J6’ers are being locked up for year...,0
...,...,...
1195,@got_cake @Soulvintageone Damn that is scrimp!...,0
1196,@AMK_PhD @RepMTG He was married but fell aslee...,0
1197,@CollinRugg You people are more bent over not ...,0
1198,@radicalricci Ma’am this is a Wendy’s,1


In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# three labels: negative, neutral and positive
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
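def encode(text: str) -> list[int]:
    # convert a post into the token ids that the model expects
    tokenized = tokenizer(
        text,
        truncation=True, # stay within the model's maximum input length
        return_attention_mask=False,
        return_token_type_ids=False,
    )
    return tokenized.input_ids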

wendys_df["input_ids"] = wendys_df.text.apply(encode)
wendys_df

Unnamed: 0,text,label,input_ids
0,RT @libsoftiktok $500 fine for burning down a ...,1,"[101, 19387, 1030, 5622, 5910, 15794, 5480, 18..."
1,RT @RealSweet17 Do you think the punishment fi...,0,"[101, 19387, 1030, 2613, 26760, 15558, 16576, ..."
2,RT @CollinRugg JUST IN: Two rioters who were r...,0,"[101, 19387, 1030, 22180, 26549, 2290, 2074, 1..."
3,RT @charliekirk11 Whoa! More two-tiered justic...,0,"[101, 19387, 1030, 4918, 23630, 2243, 14526, 2..."
4,RT @RepMTG J6’ers are being locked up for year...,0,"[101, 19387, 1030, 16360, 20492, 2290, 1046, 2..."
...,...,...,...
1195,@got_cake @Soulvintageone Damn that is scrimp!...,0,"[101, 1030, 2288, 1035, 9850, 1030, 3969, 6371..."
1196,@AMK_PhD @RepMTG He was married but fell aslee...,0,"[101, 1030, 2572, 2243, 1035, 8065, 1030, 1636..."
1197,@CollinRugg You people are more bent over not ...,0,"[101, 1030, 22180, 26549, 2290, 2017, 2111, 20..."
1198,@radicalricci Ma’am this is a Wendy’s,1,"[101, 1030, 7490, 7277, 6895, 5003, 1521, 2572..."


At this point we have the dataset prepared.
We need to split it into the train, valid and test sets.

I am converting the dataframes into datasets here, which is a custom Hugging Face format.
Converting the dataframes to a list of dictionaries would also work.
Unfortunately the trainer does not work with dataframes directly.

You can check the full documentation for the datasets library [here](https://huggingface.co/docs/datasets/index).
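
For reference, the list of dictionaries alternative can be produced with pandas alone.
This is a small sketch on made-up rows. The column names match the ones used in this notebook, and the token ids are invented.

```python
import pandas as pd

# made-up rows with the columns that training needs
example_df = pd.DataFrame(
    {
        "label": [1, 0],
        "input_ids": [[101, 2023, 102], [101, 2008, 102]],
    }
)

# one dictionary per row, keyed by column name
records = example_df.to_dict(orient="records")
```

Each entry in `records` then looks like `{"label": 1, "input_ids": [101, 2023, 102]}`.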

In [8]:
from datasets import Dataset

train_df = wendys_df[:1_000]
valid_df = wendys_df[1_000:1_100]
test_df = wendys_df[1_100:1_200]

assert not (set(train_df.text) & set(valid_df.text)), "rows shared between train and valid"
assert not (set(train_df.text) & set(test_df.text)), "rows shared between train and test"
assert not (set(valid_df.text) & set(test_df.text)), "rows shared between valid and test"

train_ds = Dataset.from_pandas(train_df)
valid_ds = Dataset.from_pandas(valid_df)
test_ds = Dataset.from_pandas(test_df)

When training we want to be able to see how well our model has trained.
This can also be used to select the best model at the end of training.

We can calculate the accuracy metric by comparing the predictions to the gold labels.
There are many other metrics that may provide more detailed performance information.
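
As one example of such a metric, macro F1 averages the per-class F1 scores so that a rare class counts as much as a common one.
The sketch below is a plain numpy version on made-up labels. The `macro_f1` helper is hypothetical, and scoring an undefined F1 as zero is my choice rather than something the trainer requires.

```python
import numpy as np

def macro_f1(predictions: np.ndarray, targets: np.ndarray, num_labels: int) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in range(num_labels):
        true_positive = np.sum((predictions == label) & (targets == label))
        predicted = np.sum(predictions == label)
        actual = np.sum(targets == label)
        if true_positive == 0:
            # covers classes that are never predicted or never present
            scores.append(0.0)
            continue
        precision = true_positive / predicted
        recall = true_positive / actual
        scores.append(2 * precision * recall / (precision + recall))
    return float(np.mean(scores))

# made-up example in which class 2 is never predicted
targets = np.array([0, 0, 1, 1, 2])
predictions = np.array([0, 0, 1, 1, 0])

accuracy = float((predictions == targets).mean())
f1 = macro_f1(predictions, targets, num_labels=3)
```

Here accuracy is 0.8 while macro F1 is 0.6, because the missed class pulls the average down.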

In [9]:
from transformers import EvalPrediction

def compute_metrics(results: EvalPrediction) -> dict[str, float]:
    predictions = results.predictions.argmax(axis=1)
    targets = results.label_ids
    correct = predictions == targets
    return {
        "accuracy": correct.mean(),
    }

Now we can train the model.
I don't know what sort of computer you have and I want this to run quickly, so I have made the training run _very short_.
You can alter `max_steps` to change how long training runs, `eval_steps` to change how often the evaluation is run, and `logging_steps` to change how often the training loss is logged.

Check the full documentation for the trainer [here](https://huggingface.co/docs/transformers/main_classes/trainer).

In [10]:
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import Dataset

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

training_args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,
    warmup_ratio=0.06,

    report_to=[],

    # very short as this is a demonstration
    # (recent transformers releases rename evaluation_strategy to eval_strategy)
    evaluation_strategy="steps",
    max_steps=50,
    logging_steps=10,
    eval_steps=10,

    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    
    no_cuda=True, # lets you run it on any machine (recent transformers releases prefer use_cpu=True)

    # output_dir is compulsory
    logging_dir=MODEL_RUN_FOLDER / "output",
    output_dir=MODEL_RUN_FOLDER / "output",
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy
10,0.9771,0.948106,0.48
20,0.828,0.874752,0.56
30,0.8756,0.802535,0.69
40,0.6941,0.777884,0.7
50,0.7248,0.763196,0.72


TrainOutput(global_step=50, training_loss=0.8199179649353028, metrics={'train_runtime': 48.6785, 'train_samples_per_second': 8.217, 'train_steps_per_second': 1.027, 'total_flos': 17595709810560.0, 'train_loss': 0.8199179649353028, 'epoch': 0.4})

The model has been trained on 8 (batch size) x 50 (steps) = 400 rows of data.
With 1,000 training rows that is only 0.4 of an epoch, so we didn't even make it through one epoch.
The model will not perform well. Remember, this is a demonstration!
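
The same arithmetic can be used to work out how many steps a full epoch would need.
This sketch assumes a single device and no gradient accumulation, which matches the settings used above.

```python
import math

train_rows = 1_000
batch_size = 8
max_steps = 50

rows_seen = batch_size * max_steps                    # rows processed during training
epochs_completed = rows_seen / train_rows             # fraction of one pass over the data
steps_per_epoch = math.ceil(train_rows / batch_size)  # steps needed for one full pass
```

Setting `max_steps` to `steps_per_epoch` (125 here) would train for exactly one epoch.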

## Evaluation

With our "trained" model we can now evaluate it.
Here we are testing against our test dataset and using the sklearn classification report to describe the performance.

In [11]:
import torch
from transformers import AutoModelForSequenceClassification
import pandas as pd

@torch.inference_mode()
def predict(model: AutoModelForSequenceClassification, tokens: list[int]) -> int:
    tokens_tensor = torch.tensor(tokens)
    tokens_tensor = tokens_tensor.to(model.device)
    tokens_tensor = tokens_tensor[None] # add a batch dimension
    
    output = model(input_ids=tokens_tensor)
    predictions = output.logits
    return predictions.argmax().item()

In [12]:
predictions = test_df.input_ids.apply(lambda tokens: predict(model, tokens))

In [13]:
from sklearn.metrics import classification_report

print(
    classification_report(
        y_true=test_df.label.map(label_to_sentiment),
        y_pred=predictions.map(label_to_sentiment),
        zero_division=0,
    )
)

              precision    recall  f1-score   support

    negative       0.80      0.76      0.78        49
     neutral       0.72      0.85      0.78        46
    positive       0.00      0.00      0.00         5

    accuracy                           0.76       100
   macro avg       0.51      0.53      0.52       100
weighted avg       0.73      0.76      0.74       100



These results are **not** good: the model never correctly predicts positive, and the support column shows that the test dataset is wildly imbalanced (only 5 of the 100 rows are positive).

In [14]:
train_df.label.map(label_to_sentiment).value_counts()

label
negative    489
neutral     462
positive     49
Name: count, dtype: int64

We can see that the training dataset is very imbalanced as well.
This has resulted in a model which doesn't properly predict positive.

Fixing this would involve balancing the datasets correctly, and likely training for longer.
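
One simple way to balance a dataset is to downsample every class to the size of the smallest one.
The sketch below does this on made-up rows. It is only one option, as it throws data away, and the fixed `random_state` is just there to make the example repeatable.

```python
import pandas as pd

# made-up labels with the same kind of imbalance as the training set
example_df = pd.DataFrame(
    {
        "text": [f"post {i}" for i in range(20)],
        "label": [0] * 10 + [1] * 8 + [2] * 2,
    }
)

smallest_class = example_df.label.value_counts().min()

balanced_df = (
    example_df
    .groupby("label")
    .sample(n=smallest_class, random_state=0)  # same number of rows per class
    .sample(frac=1, random_state=0)            # shuffle so the classes are interleaved
    .reset_index(drop=True)
)
```

With the real data you would balance before slicing off the train, valid and test rows, so that each split has enough positive examples.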

You could also think about ways to make sure that the dataset is diverse.
The rows in these datasets are ordered by time.
This means that the distribution reflects the conversation around Wendy's at the time the posts were collected.
If there is a crisis then the conversation will skew negative.
The current theme of conversation may also result in very similar posts by different people.
Ensuring that you have a diversity of topics would also improve the quality of your model.
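
Because the rows are ordered by time, one rough way to get more variety is to sample from across the whole file instead of taking the first rows.
The sketch below cuts made-up rows into equal slices by position and samples from each slice. The `num_buckets` and `rows_per_bucket` values are hypothetical, and row position is only a stand-in for time as I have not used any timestamp column.

```python
import numpy as np
import pandas as pd

# made-up posts, already ordered by time like the source files
example_df = pd.DataFrame({"text": [f"post {i}" for i in range(100)]})

num_buckets = 10     # hypothetical: how many time slices to cut the data into
rows_per_bucket = 2  # hypothetical: how many rows to keep from each slice

# assign each row to a slice based on its position in the file
bucket = np.arange(len(example_df)) * num_buckets // len(example_df)

spread_df = (
    example_df
    .groupby(bucket)
    .sample(n=rows_per_bucket, random_state=0)
    .sort_index()
)
```

This spreads the sample over time but does not remove near-duplicate posts, so it only addresses part of the diversity problem.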