Trying out simple sentiment analysis on tweets using the Sentiment140 dataset.

This is heavily based on this [article](https://investigate.ai/investigating-sentiment-analysis/designing-your-own-sentiment-analysis-tool/) by investigate.ai.

In [83]:
# imports
import pandas as pd

In [84]:
# extract the zip
import zipfile
import os

# Path to the zip file
zip_file_path = "sentiment140.zip"
# Directory where you want to extract the files
extract_dir = "data"

# Create the directory if it doesn't exist
os.makedirs(extract_dir, exist_ok=True)

# Open the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified directory
    zip_ref.extractall(extract_dir)

print(f"Files extracted to {extract_dir}")

Files extracted to data


The dataset was downloaded from Kaggle [here](https://www.kaggle.com/datasets/kazanova/sentiment140?resource=download).

In [85]:
# load dataset

df = pd.read_csv("data/sentiment140.csv", encoding="ISO-8859-1", header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


In [86]:
# preprocessing
# polarity = positive or negative (0 = neg, 4 = pos)
# id = id of the tweet
# date = date of the tweet
# flag = query associated
# user = user who tweeted the text
# text = actual text of the tweet
df.columns = ['polarity', 'id', 'date', 'flag', 'user', 'text']
df.head()

Unnamed: 0,polarity,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."
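
The 0/4 polarity codes can be mapped to readable labels when inspecting rows. A minimal sketch on a hypothetical four-row frame (`demo` and the `sentiment` column are illustration-only names, not part of this notebook):

```python
import pandas as pd

# Hypothetical mini-frame using the same coding: 0 = negative, 4 = positive
demo = pd.DataFrame({"polarity": [0, 4, 4, 0]})

# Map the numeric codes to readable labels
demo["sentiment"] = demo["polarity"].map({0: "negative", 4: "positive"})
print(demo)
```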


In [87]:
# cut it down: take 50000 random samples from the dataset (skip this cell if already done)
df = df.sample(n=50000, random_state=42)
df.head()

Unnamed: 0,polarity,id,date,flag,user,text
541200,0,2200003196,Tue Jun 16 18:18:12 PDT 2009,NO_QUERY,LaLaLindsey0609,@chrishasboobs AHHH I HOPE YOUR OK!!!
750,0,1467998485,Mon Apr 06 23:11:14 PDT 2009,NO_QUERY,sexygrneyes,"@misstoriblack cool , i have no tweet apps for my razr 2"
766711,0,2300048954,Tue Jun 23 13:40:11 PDT 2009,NO_QUERY,sammydearr,"@TiannaChaos i know just family drama. its lame.hey next time u hang out with kim n u guys like have a sleepover or whatever, ill call u"
285055,0,1993474027,Mon Jun 01 10:26:07 PDT 2009,NO_QUERY,Lamb_Leanne,School email won't open and I have geography stuff on there to revise! *Stupid School* :'(
705995,0,2256550904,Sat Jun 20 12:56:51 PDT 2009,NO_QUERY,yogicerdito,upper airways problem


In [88]:
df.shape

(50000, 6)

In [89]:
# info() is a method; without the parentheses the notebook only echoes the bound-method repr
df.info()

In [90]:
# counting the totals of each possible value in polarity (how many positive compared to negative tweets)
df.polarity.value_counts()

polarity
4    25014
0    24986
Name: count, dtype: int64

In [91]:
# vectorizing the tweets
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()


Unnamed: 0,10,100,11,12,15,1st,20,2day,30,able,...,yesterday,yet,yo,you,your,yours,yourself,youtube,yummy,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.373036,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [92]:
# storing the features (X) and labels (y)
X = words_df
y = df.polarity
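
A dense 50000 x 1000 float64 table takes roughly 400 MB. As an optional alternative, scikit-learn estimators generally accept the sparse matrix returned by `fit_transform` directly. A minimal sketch on made-up data (all `toy_*` names are hypothetical):

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up texts and labels, only for illustration
toy_texts = ["love it", "so good", "hate it", "so bad"]
toy_labels = [4, 4, 0, 0]

# fit_transform returns a SciPy sparse matrix; no .toarray() needed
toy_sparse = TfidfVectorizer().fit_transform(toy_texts)
print(sparse.issparse(toy_sparse), toy_sparse.shape)

# The classifier is fitted on the sparse matrix as-is
toy_model = LogisticRegression(max_iter=1000).fit(toy_sparse, toy_labels)
print(toy_model.classes_)
```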

In [93]:
# importing the classifiers to compare
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [94]:
%%time
# logistic reg
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000, random_state=42)
logreg.fit(X, y)

CPU times: total: 16.6 s
Wall time: 5.86 s


In [95]:
%%time
# random forest
rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X, y)

CPU times: total: 20.8 s
Wall time: 1min 1s


In [96]:
%%time
# svc
svc = LinearSVC(random_state=42)
svc.fit(X, y)

CPU times: total: 250 ms
Wall time: 519 ms


In [97]:
%%time
# naive bayes classifier
# MultinomialNB has no random_state parameter, so it is created without one
bayes = MultinomialNB()
bayes.fit(X, y)

In [98]:
pd.set_option("display.max_colwidth", 200)

test = pd.DataFrame({'content': [
    "The new marvel movie is kinda mid.",
    "hate the new keyboard, not worth it at all",
    "Not sure how i feel about soggy toast in the morning",
    "Feeling indifferent about the weather today, it's just... meh.",
    "did you see the new joker movie? it was okay.",
    "my figure arrived late and the head broke offff what the heck",
    "i cant stop watching dunmeshi god it's so addicting",
    "the concert was ok but coulda been better",
    "people have to stop doomscrolling its so tiring seeing all the bad stuff on here",
    "idk why people can't just fucking be normal is it that hard to be nice????",
    "won't lie sote is good but some areas are just... empty. like what...?",
]})
test

Unnamed: 0,content
0,The new marvel movie is kinda mid.
1,"hate the new keyboard, not worth it at all"
2,Not sure how i feel about soggy toast in the morning
3,"Feeling indifferent about the weather today, it's just... meh."
4,did you see the new joker movie? it was okay.
5,my figure arrived late and the head broke offff what the heck
6,i cant stop watching dunmeshi god it's so addicting
7,the concert was ok but coulda been better
8,people have to stop doomscrolling its so tiring seeing all the bad stuff on here
9,idk why people can't just fucking be normal is it that hard to be nice????


In [99]:
print(vectorizer.get_feature_names_out())

['10' '100' '11' '12' '15' '1st' '20' '2day' '30' 'able' 'about'
 'absolutely' 'account' 'actually' 'add' 'afraid' 'after' 'afternoon'
 'again' 'ago' 'agree' 'ah' 'ahead' 'ahh' 'ahhh' 'aint' 'air' 'airport'
 'album' 'all' 'almost' 'alone' 'along' 'already' 'alright' 'also'
 'although' 'always' 'am' 'amazing' 'amp' 'an' 'and' 'annoying' 'another'
 'answer' 'any' 'anymore' 'anyone' 'anything' 'anyway' 'apparently'
 'apple' 'are' 'aren' 'around' 'art' 'as' 'ask' 'asleep' 'ass' 'at' 'ate'
 'aw' 'awake' 'awards' 'away' 'awesome' 'aww' 'awww' 'baby' 'back' 'bad'
 'band' 'bbq' 'bday' 'be' 'beach' 'beat' 'beautiful' 'because' 'bed'
 'been' 'beer' 'before' 'behind' 'being' 'believe' 'best' 'bet' 'better'
 'big' 'bike' 'birthday' 'bit' 'black' 'blip' 'blog' 'blood' 'blue' 'body'
 'boo' 'book' 'books' 'bored' 'boring' 'both' 'bought' 'bout' 'box' 'boy'
 'boyfriend' 'boys' 'break' 'breakfast' 'bring' 'bro' 'broke' 'broken'
 'brother' 'brothers' 'btw' 'bus' 'busy' 'but' 'buy' 'by' 'bye' 'cake'
 'ca

In [100]:
test_vectors = vectorizer.transform(test.content)
test_words_df = pd.DataFrame(test_vectors.toarray(), columns=vectorizer.get_feature_names_out())
test_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,30,able,...,yesterday,yet,yo,you,your,yours,yourself,youtube,yummy,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.214724,0.0,0.0,0.0,0.0,0.0,0.0


In [101]:
test_words_df.shape

(11, 1000)

In [102]:
# model predictions

# log reg predictions + probabilities (column 1 = probability of polarity 4, i.e. positive)
test['pred_logreg'] = logreg.predict(test_words_df)
test['pred_logreg_proba'] = logreg.predict_proba(test_words_df)[:,1]

# random forest predictions + probabilities
test['pred_forest'] = rf.predict(test_words_df)
test['pred_forest_proba'] = rf.predict_proba(test_words_df)[:,1]

# svc predictions
test['pred_svc'] = svc.predict(test_words_df)

# naive bayes predictions + probabilities
test['pred_bayes'] = bayes.predict(test_words_df)
test['pred_bayes_proba'] = bayes.predict_proba(test_words_df)[:,1]
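
`LinearSVC` has no `predict_proba`, which is why the SVC gets no probability column. If a confidence-like number is wanted, `decision_function` returns a signed distance from the separating boundary (positive means the second class, here 4). A minimal sketch on made-up data, with all `toy_*` names hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data, only for illustration
toy_texts = ["love it", "so good", "great day", "hate it", "so bad", "awful day"]
toy_labels = [4, 4, 4, 0, 0, 0]

toy_vec = TfidfVectorizer()
toy_svc = LinearSVC(random_state=42).fit(toy_vec.fit_transform(toy_texts), toy_labels)

new_vectors = toy_vec.transform(["love this great day", "hate this awful day"])

# Signed scores: > 0 leans towards class 4, <= 0 towards class 0
scores = toy_svc.decision_function(new_vectors)
preds = toy_svc.predict(new_vectors)
print(scores, preds)
```

These scores are not probabilities, so they are not directly comparable to the `*_proba` columns.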

In [103]:
test

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,The new marvel movie is kinda mid.,4,0.879679,4,0.98,4,4,0.655362
1,"hate the new keyboard, not worth it at all",0,0.146346,0,0.16,0,0,0.349759
2,Not sure how i feel about soggy toast in the morning,0,0.306354,0,0.34,0,4,0.510399
3,"Feeling indifferent about the weather today, it's just... meh.",0,0.235788,0,0.28,0,0,0.365722
4,did you see the new joker movie? it was okay.,4,0.938208,4,0.93,4,4,0.767261
5,my figure arrived late and the head broke offff what the heck,0,0.159757,0,0.22,0,0,0.227229
6,i cant stop watching dunmeshi god it's so addicting,4,0.504499,0,0.4,4,0,0.492799
7,the concert was ok but coulda been better,0,0.432442,0,0.32,0,0,0.445777
8,people have to stop doomscrolling its so tiring seeing all the bad stuff on here,0,0.21273,0,0.2,0,0,0.397773
9,idk why people can't just fucking be normal is it that hard to be nice????,0,0.136796,0,0.44,0,0,0.338457


In [104]:
# using train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
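
Note that the vectorizer above was fitted on all 50000 tweets before the split, so the test rows influenced the vocabulary and IDF weights. One way to avoid that, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a pipeline. All `toy_*` names are hypothetical, and `stratify` is an optional extra that keeps the class balance in both splits:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up texts and labels, only for illustration
toy_texts = ["love it", "so good", "great day", "really nice",
             "hate it", "so bad", "awful day", "really sad"]
toy_labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Split the raw text first, so the test rows never reach the vectorizer's fit
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42)

# The pipeline fits TF-IDF on the training text only, then the classifier
toy_pipeline = make_pipeline(TfidfVectorizer(max_features=1000),
                             LogisticRegression(max_iter=1000))
toy_pipeline.fit(texts_train, labels_train)

toy_preds = toy_pipeline.predict(texts_test)
print(list(zip(texts_test, toy_preds)))
```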

In [105]:
%%time

print("Training logistic regression...")
logreg.fit(X_train, y_train)

print("Training random forest...")
rf.fit(X_train, y_train)

print("Training svc...")
svc.fit(X_train, y_train)

print("Training naive bayes...")
bayes.fit(X_train, y_train)

Training logistic regression...
Training random forest...
Training svc...
Training naive bayes...
CPU times: total: 25.1 s
Wall time: 41.3 s


In [106]:
# checking accuracy
from sklearn.metrics import confusion_matrix, accuracy_score

# logreg
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
logreg_acc = accuracy_score(y_true, y_pred)

# value count
print("Actual values")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)


Actual values


Unnamed: 0,Predicted negative,Predicted positive
Is negative,4628,1605
Is positive,1446,4821


In [107]:
# percentage
print("Percentage")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Percentage


Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.7425,0.2575
Is positive,0.230732,0.769268
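
The row percentages can also be requested from `confusion_matrix` itself through `normalize="true"`. The sketch below rebuilds label arrays from the logistic regression counts shown above, so it runs without the notebook's data (`counts`, `y_true_demo`, and `y_pred_demo` are illustration-only names):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Logistic regression counts copied from the output above
# (rows = actual negative/positive, columns = predicted negative/positive)
counts = np.array([[4628, 1605],
                   [1446, 4821]])

# Rebuild label arrays that reproduce exactly these counts
y_true_demo = np.repeat([0, 0, 4, 4], counts.ravel())
y_pred_demo = np.repeat([0, 4, 0, 4], counts.ravel())

# normalize="true" divides each row by the number of actual cases in that row
normalized = confusion_matrix(y_true_demo, y_pred_demo,
                              labels=[0, 4], normalize="true")
print(normalized.round(6))
```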


In [108]:
# random forest
y_true = y_test
y_pred = rf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
rf_acc = accuracy_score(y_true, y_pred)

# value count
print("Actual values")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Actual values


Unnamed: 0,Predicted negative,Predicted positive
Is negative,4730,1503
Is positive,1726,4541


In [109]:
# percentage
print("Percentage")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Percentage


Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.758864,0.241136
Is positive,0.275411,0.724589


In [110]:
# svc
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
svc_acc = accuracy_score(y_true, y_pred)

# value count
print("Actual values")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Actual values


Unnamed: 0,Predicted negative,Predicted positive
Is negative,4635,1598
Is positive,1434,4833


In [111]:
# percentage
print("Percentage")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Percentage


Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.743623,0.256377
Is positive,0.228818,0.771182


In [112]:
# naive bayes
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
bayes_acc = accuracy_score(y_true, y_pred)

# value count
print("Actual values")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Actual values


Unnamed: 0,Predicted negative,Predicted positive
Is negative,4735,1498
Is positive,1730,4537


In [113]:
# percentage
print("Percentage")
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Percentage


Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.759666,0.240334
Is positive,0.276049,0.723951


In [114]:
# comparing acc scores
print("Logistic regression accuracy score: ", logreg_acc)
print("Random forest accuracy score: ", rf_acc)
print("SVC accuracy score: ", svc_acc)
print("Naive Bayes accuracy score: ", bayes_acc)

Logistic regression accuracy score:  0.75592
Random forest accuracy score:  0.74168
SVC accuracy score:  0.75744
Naive Bayes accuracy score:  0.74176


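Overall, SVC had the highest accuracy score (0.75744), with logistic regression a close second (0.75592), Naive Bayes third (0.74176), and random forest last (0.74168). The spread between the best and worst model is under two percentage points.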
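
The accuracy scores can be cross-checked against the confusion matrices above, since accuracy is the diagonal (correct predictions) divided by the total. A small sketch using the counts copied from those outputs (`matrices` and `accuracies` are illustration-only names):

```python
import numpy as np

# Confusion-matrix counts copied from the outputs above
# (rows = actual negative/positive, columns = predicted negative/positive)
matrices = {
    "Logistic regression": np.array([[4628, 1605], [1446, 4821]]),
    "Random forest": np.array([[4730, 1503], [1726, 4541]]),
    "SVC": np.array([[4635, 1598], [1434, 4833]]),
    "Naive Bayes": np.array([[4735, 1498], [1730, 4537]]),
}

# Accuracy = correct predictions (diagonal) / all predictions
accuracies = {name: np.trace(m) / m.sum() for name, m in matrices.items()}

# Print the models from highest to lowest accuracy
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.5f}")
```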