### Philosophy
I will start with a beginner-friendly model and upgrade to a higher-performance one later.

Beginner-friendly model: scikit-learn, with a TF-IDF vectorizer for features, logistic regression for classification, and joblib to save and load the model.

Performance model: a pre-trained BERT or DistilBERT model from the Transformers library.

##### Beginner-friendly model:
I downloaded the dataset from https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis?select=twitter_training.csv and renamed twitter_validation to twitter_test.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import accuracy_score
import numpy as np
import joblib

In [3]:
# I noticed that the text column contains arrays with a single string in them instead of just pure strings. I want to fix this.


In [4]:
labels = ['id', 'org', 'sentiment', 'text']
data = pd.read_csv('archive/twitter_training.csv', header=None, names=labels)
data

| | id | org | sentiment | text |
|---|---|---|---|---|
| 0 | 2401 | Borderlands | Positive | im getting on borderlands and i will murder yo... |
| 1 | 2401 | Borderlands | Positive | I am coming to the borders and I will kill you... |
| 2 | 2401 | Borderlands | Positive | im getting on borderlands and i will kill you ... |
| 3 | 2401 | Borderlands | Positive | im coming on borderlands and i will murder you... |
| 4 | 2401 | Borderlands | Positive | im getting on borderlands 2 and i will murder ... |
| ... | ... | ... | ... | ... |
| 74677 | 9200 | Nvidia | Positive | Just realized that the Windows partition of my... |
| 74678 | 9200 | Nvidia | Positive | Just realized that my Mac window partition is ... |
| 74679 | 9200 | Nvidia | Positive | Just realized the windows partition of my Mac ... |
| 74680 | 9200 | Nvidia | Positive | Just realized between the windows partition of... |


In [5]:
unique_values = data['org'].unique()
sentiments = data['sentiment'].unique()
print(unique_values)
print(sentiments)

['Borderlands' 'CallOfDutyBlackopsColdWar' 'Amazon' 'Overwatch'
 'Xbox(Xseries)' 'NBA2K' 'Dota2' 'PlayStation5(PS5)' 'WorldOfCraft'
 'CS-GO' 'Google' 'AssassinsCreed' 'ApexLegends' 'LeagueOfLegends'
 'Fortnite' 'Microsoft' 'Hearthstone' 'Battlefield'
 'PlayerUnknownsBattlegrounds(PUBG)' 'Verizon' 'HomeDepot' 'FIFA'
 'RedDeadRedemption(RDR)' 'CallOfDuty' 'TomClancysRainbowSix' 'Facebook'
 'GrandTheftAuto(GTA)' 'MaddenNFL' 'johnson&johnson' 'Cyberpunk2077'
 'TomClancysGhostRecon' 'Nvidia']
['Positive' 'Neutral' 'Negative' 'Irrelevant']


In [4]:
# The data is split across many different organizations (the 'org' column). I think we can ignore these for sentiment classification.

# Split the data:
data_test = pd.read_csv('archive/twitter_test.csv', header=None, names=labels)
x_test = data_test['text'].values.tolist()
y_test = data_test['sentiment'].values.tolist()

x_train = data['text'].values.tolist()
y_train = data['sentiment'].values.tolist()

display(x_train[:3], y_train[:3])
display(x_test[:3], y_test[:3])


['im getting on borderlands and i will murder you all ,',
 'I am coming to the borders and I will kill you all,',
 'im getting on borderlands and i will kill you all,']

['Positive', 'Positive', 'Positive']

['I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣',
 "BBC News - Amazon boss Jeff Bezos rejects claims company acted like a 'drug dealer' bbc.co.uk/news/av/busine…",
 '@Microsoft Why do I pay for WORD when it functions so poorly on my @SamsungUS Chromebook? 🙄']

['Irrelevant', 'Neutral', 'Negative']

In [5]:
# Vectorize the tweets:
vectorizer = TfidfVectorizer()
for i in range(len(x_train)):
    if isinstance(x_train[i], float) and np.isnan(x_train[i]):
        x_train[i] = ''
# Same cleanup for the test tweets (check x_test here, not x_train).
for i in range(len(x_test)):
    if isinstance(x_test[i], float) and np.isnan(x_test[i]):
        x_test[i] = ''
x_train_processed = vectorizer.fit_transform(x_train)
x_test_processed = vectorizer.transform(x_test)
x_train_processed

<74682x31062 sparse matrix of type '<class 'numpy.float64'>'
	with 1213145 stored elements in Compressed Sparse Row format>
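
The NaN cleanup loops above could also be done in pandas before converting to lists. This is only a sketch on a made-up three-row frame (`toy` and its contents are hypothetical, not the Kaggle data), assuming missing tweets show up as `None`/NaN in the `text` column:

```python
import pandas as pd

# Hypothetical toy frame standing in for the tweet data; one text value is missing.
toy = pd.DataFrame({
    'sentiment': ['Positive', 'Negative', 'Neutral'],
    'text': ['great game', None, 'patch notes are out'],
})

# Replace missing text with an empty string, then convert to plain Python lists.
texts = toy['text'].fillna('').astype(str).tolist()
labels = toy['sentiment'].tolist()

print(texts)
print(labels)
```

Doing the fill on the column keeps the train and test cleanup identical, which avoids mixing up the two lists in hand-written loops.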

In [6]:
model = LogisticRegression(max_iter=500)
model.fit(x_train_processed, y_train)

y_pred = model.predict(x_test_processed)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

Accuracy: 0.918
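
A single accuracy number does not show which of the four classes get confused with each other. A possible next step is a confusion matrix and a per-class report from `sklearn.metrics`. The labels below are invented for illustration (they are not the notebook's `y_test` and `y_pred`):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ['Positive', 'Negative', 'Neutral', 'Negative', 'Positive', 'Irrelevant']
y_hat = ['Positive', 'Negative', 'Negative', 'Negative', 'Neutral', 'Irrelevant']

class_names = ['Positive', 'Neutral', 'Negative', 'Irrelevant']

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes, both in class_names order.
cm = confusion_matrix(y_true, y_hat, labels=class_names)
# Precision, recall, and F1 per class; zero_division=0 avoids warnings for empty classes.
report = classification_report(y_true, y_hat, labels=class_names, zero_division=0)

print(f'Accuracy: {acc:.3f}')
print(cm)
print(report)
```

In the notebook, the same calls would presumably take `y_test` and `y_pred` in place of the toy lists.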


In [7]:
# Save the model:
joblib.dump(model, 'tweet_emotion_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']
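
The cell above saves the model and the vectorizer as two separate files, so both have to be loaded together later. One alternative is to wrap them in a single pipeline with `make_pipeline` (imported at the top but not used so far) and save that. The sketch below trains on a tiny made-up dataset and writes to a temporary directory; `toy_texts`, `toy_labels`, and `toy_pipeline.pkl` are hypothetical names, not part of the notebook:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, only to make the example runnable.
toy_texts = ['i love this game', 'this game is awful', 'love it so much', 'awful awful service']
toy_labels = ['Positive', 'Negative', 'Positive', 'Negative']

# One object that vectorizes raw strings and then classifies them.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=500))
pipeline.fit(toy_texts, toy_labels)

# Round trip: save to disk, load back, and check the predictions match.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same = list(restored.predict(toy_texts)) == list(pipeline.predict(toy_texts))
print(same)
```

With a pipeline, the loaded object accepts raw tweet strings directly, so there is no separate `vectorizer.transform` step to remember.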

In [8]:
# Playing around with predict:
tweets = [
    "I HATE AWS ITS THE WORST SERVICE IVE EVER USED",
    "AWS IS THE BOMB IT'S EXPLODING ALL MY WORRIES",
    "Anyone that plays a bad luck albatross deck in hearthstone is a literal cop. \n \n Fucking fun police. pic.twitter.com/jY6TRq351e"
]

tweets_processed = vectorizer.transform(tweets)

tweets_pred = model.predict(tweets_processed)

for prediction in tweets_pred:
    print(prediction)

Negative
Positive
Neutral
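
`predict` only returns the top label. Logistic regression also exposes `predict_proba`, which can show how confident the model is in each prediction. This is a sketch on a tiny made-up training set (`toy_texts`, `toy_labels`, and `new_tweets` are hypothetical); in the notebook the saved `model` and `vectorizer` would take their place:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, only to make the example runnable.
toy_texts = ['i love this game', 'this game is awful', 'love it so much', 'awful awful service']
toy_labels = ['Positive', 'Negative', 'Positive', 'Negative']

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=500)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

new_tweets = ['i love it', 'awful game']
# One row per tweet, one column per class, in the order of toy_model.classes_.
probs = toy_model.predict_proba(toy_vectorizer.transform(new_tweets))

for tweet, row in zip(new_tweets, probs):
    best = row.argmax()
    print(f'{tweet!r}: {toy_model.classes_[best]} ({row[best]:.2f})')
```

A low top probability can flag tweets where the label is close to a coin flip and worth checking by hand.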
