# Exercise 1 

In this exercise, we will practice fine-tuning a pre-trained model using the Hugging Face `transformers` library.

### Exercise 1(a) (6 points)

Read the `financial_data_sentiment.csv`. Do the following:

- Encode the `Sentiment` column as follows: `negative -> 0`, `neutral -> 1`, and `positive -> 2`.
- Rename the columns to `text` and `label`.
- Split the data into `train` (80%) and `test` (20%).

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from transformers import pipeline

df = pd.read_csv('financial_data_sentiment.csv')
df['Sentiment'] = df['Sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})
df.columns = ['text', 'label']

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

### Exercise 1(b) (6 points)

Using `pipeline`, load the `ProsusAI/finbert` model and predict the sentiment of each `text`. Report the accuracy of the model.

In [7]:
# load model
finbert_md = pipeline('sentiment-analysis', model='ProsusAI/finbert')

# predict on the test
y_pred = X_test['text'].apply(lambda x: finbert_md(x)[0]['label'])

# calculate accuracy
accuracy = accuracy_score(X_test['label'], y_pred.map({'negative': 0, 'neutral': 1, 'positive': 2}))
print(f'Accuracy: {accuracy:.2f}')

Device set to use cpu


Accuracy: 0.75
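
Overall accuracy can hide which sentiment classes the model gets wrong. As a minimal sketch, per-class recall can be computed from the true and predicted labels. The helper `per_class_recall` and the toy label lists below are hypothetical and are not taken from the exercise data.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Hypothetical toy labels using the exercise's encoding
# (0 = negative, 1 = neutral, 2 = positive)
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 0]

recall = per_class_recall(y_true, y_pred)
print(recall)
```

In the notebook, the same function could be called with `X_test['label'].tolist()` and the mapped predictions converted to a list.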


### Exercise 1(c) (20 points)

Fine-tune `ProsusAI/finbert` on the `train` dataset and report the accuracy of the tuned model on the `test` dataset.

In [4]:
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import Dataset
import torch

# load model
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')

# prepare data
train_dataset = Dataset.from_pandas(X_train)
test_dataset = Dataset.from_pandas(X_test)

# tokenize data
train_dataset = train_dataset.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True, max_length = 128), batched=True)
test_dataset = test_dataset.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True, max_length = 128), batched=True)

# convert to torch
train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])

Map: 100%|██████████| 4673/4673 [00:02<00:00, 1601.57 examples/s]
Map: 100%|██████████| 1169/1169 [00:00<00:00, 1665.84 examples/s]
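
One thing worth checking before fine-tuning: the pretrained classification head has its own index-to-label order, exposed as `model.config.id2label`, and it need not match the `negative -> 0`, `neutral -> 1`, `positive -> 2` encoding chosen in 1(a). If the two differ, the head starts out predicting the wrong indices and training has to relearn the mapping. The sketch below derives the encoding from an `id2label` dictionary. The dictionary shown is a hypothetical stand-in, and `label_encoding_from_config` is an illustrative helper, so inspect the real config rather than trusting these values.

```python
def label_encoding_from_config(id2label):
    """Invert an id2label dict into {label_name: class_index}."""
    return {name.lower(): int(idx) for idx, name in id2label.items()}


# Hypothetical stand-in for model.config.id2label; check the actual value
example_id2label = {0: 'positive', 1: 'negative', 2: 'neutral'}
encoding = label_encoding_from_config(example_id2label)
print(encoding)

# The encoding used in 1(a), for comparison
exercise_encoding = {'negative': 0, 'neutral': 1, 'positive': 2}
print('orders match:', encoding == exercise_encoding)
```

If the orders differ, one option is to map `Sentiment` with the derived encoding instead of the hand-written one, so the integer labels agree with the pretrained head.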


In [9]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # accuracy_score expects (y_true, y_pred)
    return {'accuracy': accuracy_score(labels, predictions)}

# define training arguments
args = TrainingArguments(
    output_dir='./Inclass_results',
    evaluation_strategy='steps',
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=1,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy')

# define trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics)

# train model
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=37, training_loss=0.6015726553427206, metrics={'train_runtime': 2488.6317, 'train_samples_per_second': 1.878, 'train_steps_per_second': 0.015, 'total_flos': 307382250260736.0, 'train_loss': 0.6015726553427206, 'epoch': 1.0})
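
The empty `Step, Training Loss, Validation Loss` table is consistent with the step count: 4,673 training examples at a batch size of 128 give 37 optimizer steps, matching `global_step=37`. Assuming the evaluation and logging intervals were left at the library default of 500 steps (neither is set in `TrainingArguments` above), no evaluation or loss logging is triggered during training. A quick check of the arithmetic, assuming a single device and no gradient accumulation:

```python
import math

num_train_examples = 4673      # from the Map progress output
batch_size = 128               # per_device_train_batch_size
num_epochs = 1                 # num_train_epochs
assumed_eval_interval = 500    # assumption: default step interval, not set in the notebook

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
evals_during_training = total_steps // assumed_eval_interval

print(total_steps, evals_during_training)
```

Under these assumptions, `load_best_model_at_end=True` has no intermediate checkpoints to choose from. Evaluating once per epoch, or at a step interval smaller than 37, would produce the intermediate scores.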

In [10]:
model.eval()
trainer.predict(test_dataset).metrics

{'test_loss': 0.542752742767334,
 'test_accuracy': 0.7801539777587682,
 'test_runtime': 118.0679,
 'test_samples_per_second': 9.901,
 'test_steps_per_second': 0.085}

### Exercise 1(d) (3 points)

Which model would you use to predict the sentiment? Be specific.

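Based on my results, I would use the fine-tuned FinBERT model (`ProsusAI/finbert` after one epoch of training on the `train` split), because its accuracy on the test data (0.78) is higher than that of the pre-trained model used as-is (0.75).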
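
To judge how much weight the gap between 0.75 and 0.78 can bear, a rough 95% interval for each accuracy can be computed from the test-set size (1,169 examples). This is a sketch using the normal approximation to the binomial. `accuracy_interval` is a hypothetical helper, and because both models are scored on the same examples, a paired comparison would be more precise.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


n_test = 1169  # test-set size from the Map progress output

base_low, base_high = accuracy_interval(0.75, n_test)  # pre-trained model, 1(b)
ft_low, ft_high = accuracy_interval(0.78, n_test)      # fine-tuned model, 1(c)

print(f'pre-trained: [{base_low:.3f}, {base_high:.3f}]')
print(f'fine-tuned:  [{ft_low:.3f}, {ft_high:.3f}]')
```

The two intervals overlap, so the improvement from one epoch of fine-tuning is suggestive rather than conclusive on this test set.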