# Downloading and reading the data

First, download the data from [Kaggle](https://www.kaggle.com/datasets/tirendazacademy/fifa-world-cup-2022-tweets/) to the path `../data/fifa-tweet-sentiment`, which is the directory the code below reads from. Then, extract the .csv in that same directory.

In [1]:
import pandas as pd

data_path = "../data/fifa-tweet-sentiment/fifa_world_cup_2022_tweets.csv"
df = pd.read_csv(data_path)
df.head()

Unnamed: 0.1,Unnamed: 0,Date Created,Number of Likes,Source of Tweet,Tweet,Sentiment
0,0,2022-11-20 23:59:21+00:00,4,Twitter Web App,What are we drinking today @TucanTribe \n@MadB...,neutral
1,1,2022-11-20 23:59:01+00:00,3,Twitter for iPhone,Amazing @CanadaSoccerEN #WorldCup2022 launch ...,positive
2,2,2022-11-20 23:58:41+00:00,1,Twitter for iPhone,Worth reading while watching #WorldCup2022 htt...,positive
3,3,2022-11-20 23:58:33+00:00,1,Twitter Web App,Golden Maknae shinning bright\n\nhttps://t.co/...,positive
4,4,2022-11-20 23:58:28+00:00,0,Twitter for Android,"If the BBC cares so much about human rights, h...",negative


## Fine-tuning a pretrained Hugging Face sentiment classifier

Most of the following is adapted from section 3.a of "[Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)." This workflow loads a pretrained DistilBERT model with a newly initialized classification head and fine-tunes it on labeled examples. We take this approach to classify tweets from the FIFA dataset above.

Load the pretrained tokenizer to convert text into model inputs, and a data collator to pad each batch to a common length.

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [5]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
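
To make the collator's role concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only to the length of its longest sequence, and an attention mask marks real tokens versus padding. The helper name `pad_batch`, the toy token ids, and the pad id of 0 are illustrative assumptions; this is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length in the batch (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens come first, padding is appended on the right.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy "tokenized tweets" of different lengths (made-up ids).
batch = pad_batch([[101, 2023, 102], [101, 2023, 2136, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short tweets from being processed as long rows of padding.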

Define a helper function for calculating accuracy.

In [30]:
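import numpy as np
from datasets import load_metric

# Load the metric once rather than on every evaluation call.
# Note: newer releases of `datasets` drop `load_metric`; the `evaluate`
# package provides `evaluate.load("accuracy")` as the replacement.
accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model logits and the true label ids.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"accuracy": accuracy}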
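
As a quick sanity check on what this helper computes, the sketch below applies the same argmax-then-compare logic to a handful of made-up logits, using plain NumPy in place of the metric object. The numbers are invented for illustration only.

```python
import numpy as np

# Toy logits for 4 examples and 3 classes (made-up numbers).
logits = np.array([
    [2.0, 0.1, -1.0],  # largest value at index 0
    [0.2, 0.3, 1.5],   # largest value at index 2
    [0.0, 3.0, 0.5],   # largest value at index 1
    [1.2, 0.4, 0.9],   # largest value at index 0
])
labels = np.array([0, 2, 1, 2])

# Predicted class = index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)
# Accuracy = fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())
print(predictions.tolist(), accuracy)  # [0, 2, 1, 0] 0.75
```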

Authenticate the notebook so we can write to our repo. Make sure to use a token with `write` privileges.

In [26]:
from huggingface_hub import notebook_login
notebook_login() # login with a write token here


Get a pretrained model. Note that we use 3 labels since our data has three possible sentiments (`negative`, `neutral`, `positive`).

In [34]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map the sentiment strings to integer labels, and build a dataframe with `text` and `label` columns that can be converted into a Dataset for the Hugging Face trainer.

In [11]:
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}
pre_df = pd.DataFrame({'text': df['Tweet'], 'label': df['Sentiment'].apply(lambda x: sentiment_map[x])})
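
If the CSV ever contains a sentiment value outside the three expected ones, the lookup above fails on the first offending row. The following is a small sketch, under the assumption that a clearer error is wanted, of an encoder that reports every unexpected value at once and also builds the inverse mapping. The names `encode_labels` and `id2label` are hypothetical and not part of the original notebook.

```python
sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}

def encode_labels(sentiments, mapping):
    """Map sentiment strings to ints, failing loudly on unexpected values."""
    unknown = sorted(set(sentiments) - set(mapping))
    if unknown:
        raise ValueError(f"Unmapped sentiment values: {unknown}")
    return [mapping[s] for s in sentiments]

# Inverse mapping, useful for turning predicted ids back into names.
id2label = {idx: name for name, idx in sentiment_map.items()}

encoded = encode_labels(["neutral", "positive", "negative"], sentiment_map)
print(encoded)      # [1, 2, 0]
print(id2label[2])  # positive
```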

In [12]:
from sklearn.model_selection import train_test_split

train_split, test_split = train_test_split(pre_df, train_size=0.7)
train_split.shape

(15766, 2)

Make the Dataset objects for training/eval.

In [13]:
import datasets

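# preserve_index=False keeps the shuffled pandas index out of the dataset columns
train_dataset = datasets.Dataset.from_pandas(train_split, preserve_index=False)
test_dataset = datasets.Dataset.from_pandas(test_split, preserve_index=False)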

In [14]:
def preprocess(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = train_dataset.map(preprocess, batched=True)
tokenized_test = test_dataset.map(preprocess, batched=True)

Map:   0%|          | 0/15766 [00:00<?, ? examples/s]

Map:   0%|          | 0/6758 [00:00<?, ? examples/s]

Train the model!

In [36]:
from transformers import TrainingArguments, Trainer

repo_name = f"finetuning-sentiment-model-fifa-{train_split.shape[0]}-samples"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=1e-2,
    save_strategy="epoch",
    push_to_hub=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [37]:
trainer.train()

Step,Training Loss
500,0.6152
1000,0.4882
1500,0.3229
2000,0.3193
2500,0.2058
3000,0.2072
3500,0.1427
4000,0.1334
4500,0.092


Checkpoint destination directory ./finetuning-sentiment-model-fifa-15766-samples/checkpoint-986 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finetuning-sentiment-model-fifa-15766-samples/checkpoint-1972 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=4930, training_loss=0.26403545356425506, metrics={'train_runtime': 235.0164, 'train_samples_per_second': 335.423, 'train_steps_per_second': 20.977, 'total_flos': 1833253679951172.0, 'train_loss': 0.26403545356425506, 'epoch': 5.0})
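
The step counts in this output can be reproduced with a little arithmetic. The sketch below assumes training on a single device with no gradient accumulation, which is consistent with the numbers logged above but is not stated explicitly in the notebook.

```python
import math

train_samples = 15766  # rows in train_split
batch_size = 16        # per_device_train_batch_size
epochs = 5             # num_train_epochs

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 986 4930
```

This matches the `checkpoint-986` and `checkpoint-1972` directories (one checkpoint per epoch, given `save_strategy="epoch"`) and the reported `global_step=4930`.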

Evaluate the model on the eval set.

In [38]:
trainer.evaluate()

{'eval_loss': 0.711454451084137,
 'eval_accuracy': 0.8393015685113939,
 'eval_runtime': 7.1526,
 'eval_samples_per_second': 944.835,
 'eval_steps_per_second': 59.14,
 'epoch': 5.0}

Push the model to the remote repo, then try the trained model on some unseen tweets.

In [None]:
trainer.push_to_hub()

In [43]:
from transformers import pipeline

back_map = {val: key for key, val in sentiment_map.items()}
huggingface_username = "<YOUR_HUGGINGFACE_USERNAME>"

sentiment_model = pipeline(model=f"{huggingface_username}/finetuning-sentiment-model-fifa-{train_split.shape[0]}-samples")
examples = ["This team is the greatest!", "I can't with this team...", "This game is a snoozefest", "What is soccer?"]
preds = sentiment_model(examples)
for i, pred in enumerate(preds):
    print(f'{back_map[int(pred["label"][-1])]} with score={pred["score"]:6.5f} ("{examples[i]}")')

positive with score=0.99932 ("This team is the greatest!")
negative with score=0.99918 ("I can't with this team...")
negative with score=0.99802 ("This game is a snoozefest")
neutral with score=0.99411 ("What is soccer?")
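
The loop above recovers the class id from the last character of each label, which works when labels look like `LABEL_0`, `LABEL_1`, `LABEL_2` and there are fewer than ten classes. A slightly more defensive sketch is shown below; the helper name `label_to_sentiment` and the stand-in predictions are hypothetical, and the `LABEL_<n>` format is an assumption inferred from the lookup used above.

```python
import re

sentiment_map = {"negative": 0, "neutral": 1, "positive": 2}
back_map = {val: key for key, val in sentiment_map.items()}

def label_to_sentiment(label):
    """Convert a label string such as 'LABEL_2' to its sentiment name."""
    match = re.fullmatch(r"LABEL_(\d+)", label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return back_map[int(match.group(1))]

# Stand-in predictions (made-up values) shaped like the pipeline output.
fake_preds = [
    {"label": "LABEL_2", "score": 0.99},
    {"label": "LABEL_0", "score": 0.98},
]
names = [label_to_sentiment(p["label"]) for p in fake_preds]
print(names)  # ['positive', 'negative']
```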
