***Imports***

In [61]:
from transformers import pipeline
import pandas as pd
from sklearn.metrics import f1_score

***Load Data***

In [13]:
train_set = pd.read_csv('../data/train_set.csv')

***Create a Pipeline for the Model***

The pre-trained model used for this is SiEBERT - English-Language Sentiment Classification. It can be found here: https://huggingface.co/siebert/sentiment-roberta-large-english

In [10]:
pipe = pipeline('sentiment-analysis', model='siebert/sentiment-roberta-large-english')

***Creating Predictions***

In [26]:
# turning the text column into a list
texts = train_set['text'].to_list()

# truncating data to stay within the model's max length
# (note: this slices characters, not tokens, so it is only a rough limit)
max_length = pipe.tokenizer.model_max_length
data = []
for text in texts:
    truncated_text = text[:max_length]
    data.append(truncated_text)

# creating predictions on data. only using the first 1000 rows due to computation time
preds = pipe(data[:1000])
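
Slicing strings limits characters rather than tokens. A minimal sketch of an alternative is to let the tokenizer do the truncation, assuming the text-classification pipeline forwards `truncation=True` to its tokenizer at call time. `predict_in_batches` and `batch_size` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def predict_in_batches(pipe, texts, batch_size=32):
    """Run `pipe` over `texts` in chunks, letting the tokenizer truncate long inputs."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Assumption: truncation=True is passed through to the tokenizer,
        # which then cuts each input to the model's maximum token length.
        results.extend(pipe(batch, truncation=True))
    return results
```

With the objects defined in this notebook, a call might look like `preds = predict_in_batches(pipe, train_set['text'].to_list()[:1000])`. Chunking also makes it easier to add progress reporting later if the run is slow.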

***Inspecting and Evaluating Predictions***

In [40]:
preds = pd.DataFrame(preds)

preds.head()

      label     score
0  POSITIVE  0.998582
1  NEGATIVE  0.999419
2  NEGATIVE  0.999514
3  NEGATIVE  0.999505
4  NEGATIVE  0.999491


In [41]:
preds['label'].value_counts()

label
NEGATIVE    900
POSITIVE    100
Name: count, dtype: int64

**Insights:** Of the 1,000 reviews scored, the model predicts 900 as negative and only 100 as positive, so its predictions skew heavily toward negative.
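
A skew in the predictions only tells us something once it is compared with the share of each class in the true labels. The sketch below turns counts into fractions; `label_shares` is a hypothetical helper introduced here, and the `predicted` list simply rebuilds the 900/100 split reported by `value_counts()`. The same function could be applied to the true label column for comparison.

```python
from collections import Counter

def label_shares(labels):
    """Return the fraction of each distinct value in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Rebuild the predicted split from the value_counts() output
predicted = ['NEGATIVE'] * 900 + ['POSITIVE'] * 100
print(label_shares(predicted))
```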

In [44]:
# changing predictions into 0 and 1
predictions = []
for label in preds['label']:
    if label == 'POSITIVE':
        predictions.append(1)
    else:
        predictions.append(0)
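
The loop above treats every label other than `'POSITIVE'` as 0. A slightly stricter sketch uses an explicit mapping, so an unexpected label raises an error instead of silently becoming 0. `LABEL_TO_INT` and `encode_labels` are hypothetical names, and the mapping assumes that 1 means positive in the true labels.

```python
# Assumption: 1 encodes positive sentiment and 0 encodes negative sentiment
LABEL_TO_INT = {'NEGATIVE': 0, 'POSITIVE': 1}

def encode_labels(labels):
    """Map pipeline label strings to 0/1; raises KeyError on an unknown label."""
    return [LABEL_TO_INT[label] for label in labels]
```

Usage with this notebook's DataFrame might be `predictions = encode_labels(preds['label'])`.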

In [60]:
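# turning into dataframe
predictions = pd.DataFrame(predictions)

# testing using f1 score against the second column of train_set,
# which is assumed to hold the true 0/1 labels for the same 1000 rows
results = f1_score(train_set.iloc[:1000, 1], predictions)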
print(results)

0.0


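**Insights:** An F1 score of `0.0` means there were no true positives: none of the reviews the model predicted as positive are labeled `1` in the data. Because the F1 score here is computed for the positive class only, it says nothing about how well the negative reviews were predicted. This could indicate that the pre-trained model chosen is not ideal for the data we have, although a score of exactly zero also makes it worth checking that the label column and its 0/1 encoding match the mapping used for the predictions.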
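
To see why the score can be zero even when many negatives are predicted correctly, here is a minimal hand-rolled sketch of the binary F1 score. `binary_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the toy labels are made up rather than taken from the dataset.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, returning 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: two negatives are predicted correctly,
# but the one positive is missed and the one positive prediction is wrong.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 0]
print(binary_f1(y_true, y_pred))  # 0.0
```

Correct negative predictions never enter the formula, so a complementary metric such as accuracy or a confusion matrix would be needed to judge them.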