<h1>Sentiment Analysis</h1>

This notebook demonstrates how a text vectorizer can turn a corpus of text into a matrix of numeric values that can then be used in regular machine learning applications. This way, a model can be trained to predict a label from a piece of text. At the end of the notebook, test strings are used as input and the model returns a probability for each of the possible labels.

The dataset used in this notebook originates from [Kaggle](https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets). A random sample of 100,000 observations was taken from the English negative, uncertainty and positive observations, so the litigious label and all other languages are excluded.

Bas Michielsen MSc 2023

In [1]:
import json

import pandas, sklearn
pandas.set_option("max_colwidth", 200)
pandas.set_option("display.float_format", '{:.2f}'.format)
random_state = 42

# 📃 Sample the data
A random sample of 25 observations is taken from the dataset.

In [2]:
df = pandas.read_csv("./data/data.zip")
df.sample(25)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Language,Label
90310,24221,32859,@Mayamayyyy_ I'm afraid that's not possible.,en,uncertainty
85301,610858,828784,@brettbutlerisok @BREW_MATHs Twins never get far post season. Maybe their curse for using juiced balls?,en,uncertainty
6253,372349,505082,@AriMelber @TheBeatWithAri Trump lost the election because he couldn’t mustard enough votes.,en,negative
69730,105056,142623,no bc i’m excited to see what happens in season 2 of yellowjackets,en,positive
20105,368157,499380,@29namimori The people live a moment at a risk of their life. Small mistakes would lost their lives even amount of mayo.,en,negative
38437,531447,721083,MLB has the best blackout restrictions https://t.co/R9VVIWMbIu,en,positive
53363,254051,344640,"Siloed data, different toolsets, manual handoffs and redundant processes slow down and frustrate design teams. \n\nSound like your team? See how you can unleash the art of multi-disciplinary desig...",en,positive
23775,101788,138181,Maybe I should go find a job this summer nalang instead of studying abroad? I need an income.,en,uncertainty
93940,603455,818635,@1StrayWoo DBDKSKS yeah thatd probably be best,en,uncertainty
65201,232654,315582,@KarlvanBeek @sthnjeff Ive had a small business for over 30 years. If your business model is so poor it cant handle a public holiday and you earn less than your staff then your business model is h...,en,negative


# Preprocessing
## 🆔 Encoding

Here the labels are mapped to integers. Because `uncertainty` is the neutral value, it is mapped to `0`; `positive` becomes `+1` and `negative` becomes `-1`. The result is stored in a new `Target` column.

In [3]:
label_map = {"negative": -1, "uncertainty": 0, "positive": 1}
df["Target"] = df["Label"].map(label_map)
df.sample(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Language,Label,Target
25004,572338,776580,"""Leadership is about going somewhere ... \nFull Steam Ahead! - https://t.co/1XtQMuv14N https://t.co/4KgrXVcvkU",en,uncertainty,0
10765,124922,169535,"@fakehockeyteam Calling people out, to their faces, and having high standards is now bad in a world where people endlessly talk shit about each other online.\n\nSetting expectations, leading by ex...",en,negative,-1
97532,406985,552019,@deliot8 @MrMekzy_ This picture is everything... A food photographers dream 🖤🖤,en,positive,1
71928,423999,575112,"Wawa’s #databreach highlights the special risks posed by gas station pumps. Exposed, unattended &amp; expensive to upgrade, pumps are prime targets for payment card thieves. #cybersecurity https:/...",en,uncertainty,0
38776,341005,462661,I almost forgot what music sounds like... y’all sure it’s been 4 years??? 😔 https://t.co/VyY0a5xwEl,en,uncertainty,0
57751,497157,674520,"A no brainer. Sam has one of the best young basketball minds around &amp; is going to do incredible things at Kirkwood. Let’s get to it! @SamBriscoe3 Once a Pioneer always a Pioneer, eh? https://...",en,positive,1
76327,689256,935177,@Fojim13 Poor jimin \nNobody want him 😹 https://t.co/auStgb3zh9,en,negative,-1
60043,636738,863756,"@GOGUYGO_ Thanks again buddy and I hope you feel fantastic with @rokwhiskey . For me is like an inspiration, like an american dream. \nAnd many times, the dreams are happen in our life.",en,positive,1
69831,679627,922230,@StevenTDennis https://t.co/qzIfMuhHFJ is probably breaking the news that Alexandria Ocasio-Cortez is actually Trump's biological daughter.,en,uncertainty,0
46400,306267,415677,"@troyhill91 @trfcste @Yami5trfc Slightly bigger maybe, much bigger behave",en,uncertainty,0
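
One property of this mapping is worth checking: `Series.map` with a dictionary returns `NaN` for any label that is not a key. The sketch below uses a hypothetical four-row frame (not the notebook's data), in which `litigious` stands in for any unmapped label, to show how such rows could be detected and dropped.

```python
import pandas

# Hypothetical mini-frame; "litigious" stands in for any label missing from the map.
label_map = {"negative": -1, "uncertainty": 0, "positive": 1}
toy = pandas.DataFrame({"Label": ["negative", "positive", "litigious", "uncertainty"]})

toy["Target"] = toy["Label"].map(label_map)

# Rows whose label is not a key in label_map end up as NaN.
unmapped = toy[toy["Target"].isna()]
print(unmapped["Label"].tolist())

# Drop unmapped rows and restore an integer dtype before training.
clean = toy.dropna(subset=["Target"]).astype({"Target": int})
print(clean["Target"].tolist())
```

In this notebook the sample was already restricted to the three mapped labels, so the check is expected to find nothing.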


To decrease training time, a sample of 5000 observations is taken and the vectorizer is limited to a maximum of 100 words. These values can be increased at the cost of training time, possibly improving the outcome quality; in this case, however, even dramatic increases seemed to yield only insignificant improvements. The vectorizer then turns the corpus of texts (5000 observations) into numeric representations of the 100 most frequent words, excluding English stop words and words that occur in fewer than 1% of the observations (`min_df`). For every observation it gives a `0` for any word that is not present in the observation, and a positive TF-IDF weight for a word that is. The expected shape of the output is therefore 5000 by 100 values.

In [4]:
sample_size = 5000
max_words = 100
min_df = .01

df = df.sample(sample_size, random_state=random_state)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=max_words, min_df=min_df, stop_words="english")
X_vectorized = vectorizer.fit_transform(df["Text"]).toarray()
X_vectorized.shape

(5000, 100)

Because the vectorized matrix no longer contains the original text, the text is added back here as an extra column. This way the same data can also be used by humans to evaluate the predictions on the test set.

In [5]:
X = pandas.DataFrame(X_vectorized)
X[max_words] = df["Text"].values
y = df["Target"]
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,0.00,0.00,0.00,0.00,0.00,0.28,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,4/ Product Risk is the risk that your founding insight will not be powerful enough for you to achieve product-market fit.\n\nThe best market-risk companies have evidence that there will be demand ...
1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,People are not having Good nutritional food in school and old age centers \n\nAgencies take Bottled water and throw in Airport \n\nThanks to Rules\n\nCooked food is thrown away as Trash each night...
2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@fletch_biggsss Fletcher u probably still suck dick for Xanax
3,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@WlTCHOFSCARLET ""Your right. That wouldn't be a good idea."""
4,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@elgoonishshive @shadowraptor51 99.999% of all the drama and problems in the Herald's books are the direct result evil acts by evil people for the sake of power, almost nothing (with a very few ve..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@PACinTX @DNC @SpeakerPelosi @SenSchumer @RepAdamSchiff @KamalaHarris @DickDurbin @TeamPelosi @JoeBiden @tedlieu @CNN Which is why the president is probably promoting Johnson and Johnson… Because ...
4996,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@Zakiyyah6 I almost made the tomb raider uncharted comparison but went with assassins creed and witcher because i am playing witcher 3. Also take out the swinging and spiderman PS4 is basically a ...
4997,0.00,0.00,0.00,0.45,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@atinyseongstar I got ateez audience 🥹 I’m so excited to see them, but still a little sad at how fast m&amp;g went 🥲🖤"
4998,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@PennieRoyalTea @StandingforXX @RepanseDe Nope. But keep trying to pigeon-hole me if it helps you. I don't mind replying.\n\nHere's a question for you, since I'm answering all yours. What is a tra..."


## 🪓 Splitting into train/test

Here the dataset is split into a train set and a test set. From the train set the original text will be removed again, as this is not a numeric feature and cannot be used in training.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=random_state)
X_train = X_train.drop([max_words], axis=1)
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
4227,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4676,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
800,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3671,0.00,0.40,0.00,0.00,0.00,0.00,0.00,0.48,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4193,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4426,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
466,0.00,0.00,0.00,0.00,0.66,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3092,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3772,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.70,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
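
An alternative ordering, shown here only as a sketch on hypothetical toy data, is to split the raw texts first and fit the vectorizer on the train texts alone. The test texts then remain available without adding and dropping a text column, and the vocabulary is not influenced by the test set. The names `texts`, `targets` and the `_toy` variables are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df["Text"] and df["Target"].
texts = [
    "great game tonight", "awful traffic again", "maybe it rains later",
    "love this song", "terrible service today", "perhaps we should wait",
    "what a great day", "awful weather today", "maybe tomorrow then",
    "love the new album",
]
targets = [1, -1, 0, 1, -1, 0, 1, -1, 0, 1]

# Split the raw text first, so the test texts stay available for inspection.
text_train, text_test, y_train_toy, y_test_toy = train_test_split(
    texts, targets, test_size=0.2, random_state=42
)

# Fit on the train texts only; the test texts are transformed with that vocabulary.
toy_vectorizer = TfidfVectorizer(stop_words="english")
X_train_toy = toy_vectorizer.fit_transform(text_train).toarray()
X_test_toy = toy_vectorizer.transform(text_test).toarray()

print(X_train_toy.shape, X_test_toy.shape)
```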


# Modelling
Because the target variable is an ordered scale from negative to positive that is expressed as classes, the sigmoid kernel is assumed to be a good fit for this problem. Setting the `class_weight` hyperparameter to `"balanced"` weights each class inversely proportional to its frequency in the training data, so that less frequent classes are not neglected.

In [7]:
from sklearn.svm import SVC
model = SVC(kernel="sigmoid", probability=True, class_weight="balanced")
model.fit(X_train, y_train)
X_test_messages = X_test[max_words]
X_test = X_test.drop([max_words], axis=1)
score = model.score(X_test, y_test)
print("Accuracy:", score)

Accuracy: 0.934
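
As a worked example of the `"balanced"` heuristic, scikit-learn computes each class weight as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula in plain Python to the train-set class counts that appear in the support column of the classification report in this notebook.

```python
# Class counts copied from the "support" column of the train-set report.
class_counts = {"negative": 1451, "uncertainty": 1116, "positive": 1433}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.3f}")
```

The least frequent class, `uncertainty`, receives the largest weight (about 1.195), while the two larger classes are weighted slightly below 1.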


# Evaluation
Now a prediction is given for every observation in the test set, together with the true label and the original text. For brevity, only a random sample of 50 is displayed.

In [8]:
pred = model.predict_proba(X_test)
predictions = pandas.DataFrame(pred, columns=label_map.keys())
predictions["truth"] = y_test.map(dict((v,k) for k, v in label_map.items())).values
predictions["text"] = X_test_messages.values
predictions.sample(50)

Unnamed: 0,negative,uncertainty,positive,truth,text
456,0.82,0.14,0.04,negative,"@ampleswap thank you for having an airdrop event or giveaway, will always support you and stay calm without any problems.\n@Syaiful09286777 @syafiqmughni9 @Alkucluki1"
921,0.11,0.76,0.12,uncertainty,"this dude went from “ we not having anymore kids “ to “ you gotta give me a son “ 😂 ... if it’s in God’s Plans we might can make that happen later... if not, oh well. 🤷🏾‍♀️"
711,0.02,0.26,0.73,positive,See this the reason why we be staring at our man all day after a dream of him cheating 😂 https://t.co/337Q4clM1Y
272,0.0,0.06,0.93,positive,@Queen_NoCrown Love ya best life honey 🥂
251,0.97,0.03,0.0,negative,@Sen_JoeManchin Senator Manchin it just demonstrates how flawed the review process is and the lack of accountability for the Supreme Court. It seems to me on its face that they both lied and there...
881,0.0,0.0,1.0,positive,Senior varsity Bulldogs improve to 2 to 1 record https://t.co/faAhrgFl9W
940,0.0,0.0,1.0,positive,no i’m so excited guys this looks so fun
969,0.0,1.0,0.0,uncertainty,@NetflixFR Izombie saison 4 sur Netflix possible ?
598,0.01,0.23,0.76,positive,happy cancer season it’s a cancer new moon today listen to Lana del ray okay? okay okay okay okay okay you live in my dream state relocate my fantasy whoops sorry https://t.co/l1POouHZID
489,1.0,0.0,0.0,negative,@TechCentral There is a broken `Add A Comment` button on the bottom of your articles
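
The column names above rely on `label_map` listing its keys in the same order as the model's classes. `predict_proba` orders its columns by `model.classes_`, which holds the sorted class values (`-1`, `0`, `1`), so the two happen to agree here. A more defensive variant derives the names from `classes_`; the sketch below uses tiny synthetic data and a `LogisticRegression` purely as a fast stand-in for the SVC.

```python
import numpy
import pandas
from sklearn.linear_model import LogisticRegression

label_map = {"negative": -1, "uncertainty": 0, "positive": 1}
inverse_map = {value: key for key, value in label_map.items()}

# Tiny synthetic data; LogisticRegression is only a fast stand-in for the SVC.
features = numpy.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
targets = numpy.array([1, 1, -1, -1, 0, 0])
toy_model = LogisticRegression().fit(features, targets)

# classes_ holds the sorted class values; predict_proba columns follow that order.
column_names = [inverse_map[value] for value in toy_model.classes_]
probabilities = pandas.DataFrame(toy_model.predict_proba(features), columns=column_names)

print(column_names)
print(probabilities.round(2))
```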


A classification report gives the precision and recall per class. The report can be run on the test set as well as on the train set. If the outcomes differ strongly, the model may be overfitted. Here the outcomes are rather similar, so the model is likely to generalize to unseen data.

In [9]:
from sklearn.metrics import classification_report

pred = model.predict(X_train)
report = classification_report(y_train, pred, target_names=label_map.keys())
print("Train set")
print(report)

pred = model.predict(X_test)
report = classification_report(y_test, pred, target_names=label_map.keys())
print("Test set")
print(report)

Train set
              precision    recall  f1-score   support

    negative       0.94      0.94      0.94      1451
 uncertainty       0.87      0.86      0.87      1116
    positive       0.93      0.94      0.93      1433

    accuracy                           0.92      4000
   macro avg       0.91      0.91      0.91      4000
weighted avg       0.92      0.92      0.92      4000

Test set
              precision    recall  f1-score   support

    negative       0.97      0.95      0.96       335
 uncertainty       0.89      0.91      0.90       273
    positive       0.94      0.94      0.94       392

    accuracy                           0.93      1000
   macro avg       0.93      0.93      0.93      1000
weighted avg       0.93      0.93      0.93      1000
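
The train/test comparison can also be done numerically instead of by eye. As a sketch, `classification_report` accepts `output_dict=True`, which returns the same figures as a dictionary. The labels below are hypothetical and only illustrate the comparison.

```python
from sklearn.metrics import classification_report

# Hypothetical true/predicted labels, for illustration only.
y_true_train = [-1, -1, 0, 0, 1, 1, 1, -1]
y_pred_train = [-1, -1, 0, 1, 1, 1, 1, -1]
y_true_test = [-1, 0, 1, 1]
y_pred_test = [-1, 0, 1, 0]

train_report = classification_report(y_true_train, y_pred_train, output_dict=True)
test_report = classification_report(y_true_test, y_pred_test, output_dict=True)

# A large positive gap (train much better than test) hints at overfitting.
gap = train_report["accuracy"] - test_report["accuracy"]
print(f"train={train_report['accuracy']:.3f} test={test_report['accuracy']:.3f} gap={gap:.3f}")
```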



# Inference

In [10]:
message = "the broken car is useless"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,1.0,0.0,0.0


In [11]:
message = "the sun shines and everything is good"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.0,0.0,1.0


In [12]:
message = "anything may happen at any given moment"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.08,0.88,0.04


---
Next, I want to try predicting Dutch texts. We also need to do this for the project assignment (proftaak).

In [13]:
message = "Hallo, ik ben Pepijn"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.08,0.88,0.04


In [14]:
message = "Vandaag heb ik heel veel mooie dingen gezien. ik heb het erg naar mijn zin gehad. super leuk."
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.08,0.88,0.04
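
Both Dutch messages receive exactly the same probabilities. A likely explanation is that none of their words occur in the vocabulary learned from the English tweets, so each message is transformed into an all-zero vector, and identical inputs give identical outputs. The sketch below reproduces this effect with a hypothetical English mini-corpus; `toy_vectorizer` is an illustrative name, not the notebook's fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical English corpus standing in for the training tweets.
english_corpus = ["good morning sunshine", "broken car again", "good coffee today"]
toy_vectorizer = TfidfVectorizer(stop_words="english").fit(english_corpus)

dutch_messages = ["Hallo, ik ben Pepijn", "super leuk"]
vectors = toy_vectorizer.transform(dutch_messages).toarray()

# No token is in the learned vocabulary, so every row is all zeros.
print(vectors.sum(axis=1))
```

In the notebook, the same check could be run on `message_vectorized` with `message_vectorized.sum()`.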


As shown above, the existing model cannot predict Dutch texts well, because it was trained on English texts. Now we are going to train a new model on Dutch texts. To begin, we have to adapt the stop words of the vectorizer. We do this by removing the English stop words and adding the Dutch stop words.

In [28]:
import json  # import the json package
with open("./data/stop_words_dutch.json", "r") as file:  # open the JSON file and close it again afterwards
    stop_words = json.load(file)  # load the Dutch stop words
vectorizer_nl = TfidfVectorizer(max_features=max_words, min_df=min_df, stop_words=stop_words)  # create a new vectorizer with the Dutch stop words

In [30]:
message = "Vandaag heb ik heel veel mooie dingen gezien. ik heb het erg naar mijn zin gehad. super leuk."
message_vectorized = vectorizer_nl.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

NotFittedError: The TF-IDF vectorizer is not fitted

To get this working, I would need a labelled Dutch training set: the new vectorizer has to be fitted on Dutch text and the model has to be retrained on it. After some searching online, I did not find a dataset that I considered good enough to use without a lot of data cleaning.
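
The `NotFittedError` itself comes from calling `transform` before `fit`. As a minimal sketch, assuming a hypothetical three-sentence Dutch corpus and a hand-written stop-word list (the notebook loads its list from `./data/stop_words_dutch.json` instead), fitting the vectorizer first makes the transform step work:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical Dutch corpus and stop-word list, for illustration only.
dutch_corpus = [
    "ik heb een super leuke dag gehad",
    "het weer is vandaag erg slecht",
    "misschien gaan we morgen naar het strand",
]
dutch_stop_words = ["ik", "een", "het", "is", "we", "naar"]

toy_vectorizer_nl = TfidfVectorizer(stop_words=dutch_stop_words)
toy_vectorizer_nl.fit(dutch_corpus)  # learns the vocabulary and IDF weights

vector = toy_vectorizer_nl.transform(["super leuke dag"]).toarray()
print(vector.shape)
print((vector > 0).sum())  # number of message words found in the vocabulary
```

This only covers the vectorizer. The classifier would still have to be retrained on labelled Dutch data, because its features are tied to the vocabulary of the English vectorizer.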
