## Sentiment140 ✨ Logistic Regression

### Dataset

Load the `sentiment140` dataset, assign column names, and draw a random sample of 100,000 rows.

In [6]:
import numpy as np
import pandas as pd
import random

# Load dataset
df = pd.read_csv('data/sentiment140.csv', header=None, usecols=[0, 5]) # no header in .csv
df.columns = ['Sentiment', 'Text']

# Sample
sample_size = 100_000
sample_indices = random.sample(range(len(df)), sample_size)
df = df.iloc[sample_indices]

display(df.head())

| | Sentiment | Text |
|---|---|---|
| 1139615 | 4 | I clearly need 2 get 2 gether so I can get mov... |
| 383930 | 0 | @thatahanitya iya taaaat susah banget td mate... |
| 1001364 | 4 | @JawshKrewz Bye josh. See you at work tomorrow. |
| 966494 | 4 | @ElleBows Congrats on your Etsy It Up! win |
| 372380 | 0 | @allisongrayce what do you mean? you must have... |
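
The sampling cell above draws different rows on every run because the generator is not seeded. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable here; `n_rows` is a hypothetical stand-in for `len(df)` and the seed value `0` is arbitrary:

```python
import random

# Hypothetical stand-in for len(df); the real dataset is much larger.
n_rows = 1_000
sample_size = 100

# A dedicated, seeded generator keeps the draw independent of global state.
rng = random.Random(0)
sample_indices = rng.sample(range(n_rows), sample_size)

# A second generator with the same seed reproduces exactly the same indices.
same_indices = random.Random(0).sample(range(n_rows), sample_size)
print(sample_indices == same_indices)  # True
```

With pandas, `df.sample(n=sample_size, random_state=0)` should give a comparable seeded sample in one call.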


### Preprocessing

Convert the text to lowercase, `remove punctuation` and numbers, and strip leading and trailing whitespace.

In [7]:
df['Text'] = df['Text'].str.lower()                              # Convert text to lowercase
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation (raw string keeps \w and \s literal)
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)      # Remove numbers
df['Text'] = df['Text'].str.strip()                              # Strip leading and trailing whitespace
display(df.head())

| | Sentiment | Text |
|---|---|---|
| 1139615 | 4 | i clearly need get gether so i can get movin... |
| 383930 | 0 | thatahanitya iya taaaat susah banget td matem... |
| 1001364 | 4 | jawshkrewz bye josh see you at work tomorrow |
| 966494 | 4 | ellebows congrats on your etsy it up win |
| 372380 | 0 | allisongrayce what do you mean you must have n... |
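
To see what each cleaning step does, the same four operations can be applied to a single string with the standard `re` module. This is a sketch, not part of the notebook; `clean_text` is a hypothetical helper name:

```python
import re

def clean_text(text: str) -> str:
    """Apply the four cleaning steps from the cell above to one string."""
    text = text.lower()                   # lowercase
    text = re.sub(r'[^\w\s]', '', text)   # drop punctuation (keeps letters, digits, _ and whitespace)
    text = re.sub(r'\d+', '', text)       # drop digits
    return text.strip()                   # trim leading and trailing whitespace

print(clean_text("@ElleBows Congrats on your Etsy It Up! win"))
# ellebows congrats on your etsy it up win
```

Note that removing digits can leave double spaces inside the text (`"need 2 get"` becomes `"need  get"`). This is usually harmless for a word-based vectorizer, which splits on word boundaries rather than on single spaces.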


### Training

Split the data into training and test sets, convert the text to `numerical features` with TF-IDF, and fit a Logistic Regression model.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the dataset into training and testing sets
X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

# Convert text into numerical features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

# Train the model
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X1, y1)

# Make predictions on the test set
y_pred = classifier.predict(X2)

print("Sentiment140 samples =", sample_size)
print("Score on Train:", classifier.score(X1, y1).round(2))
print("Score on Test:", classifier.score(X2, y2).round(2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 samples = 100000
Score on Train: 0.84
Score on Test: 0.75
Report:               precision    recall  f1-score   support

           0       0.76      0.74      0.75      9910
           4       0.75      0.77      0.76     10090

    accuracy                           0.75     20000
   macro avg       0.75      0.75      0.75     20000
weighted avg       0.75      0.75      0.75     20000
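
The "numerical features" step can be illustrated by computing TF-IDF weights by hand. The sketch below assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. It skips stop-word removal and the library's tokenization, and the three-document corpus is made up for illustration:

```python
import math
from collections import Counter

# Toy corpus, not taken from Sentiment140
docs = [
    "good movie good plot",
    "bad movie",
    "good day",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: rare terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 2) for t, w in sorted(vec.items())})
```

In the second document, `bad` occurs in only one document and so receives a larger weight than `movie`, which occurs in two. That rarity signal is what the classifier then learns from.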



### Predictions

Predict the sentiment of unseen `reviews` loaded from the reviews.csv file.

In [9]:
# Load reviews
df_reviews = pd.read_csv('data/reviews.csv')

# Convert the sentiment labels
df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = df_reviews['Review']
y_unknown = df_reviews['Sentiment']

# Convert text into numerical features
X_unknown = tfidf_vectorizer.transform(X_unknown)

# Make predictions on unknown set
y_unknown_pred = classifier.predict(X_unknown)
score = classifier.score(X_unknown, y_unknown)

print("Reviews (unknown):")
print("Unknown:\t", y_unknown.values)
print("Prediction:\t", y_unknown_pred)
print("Score on Unknown:", score.round(2))
print("Report:", classification_report(y_unknown, y_unknown_pred), "\n")

Reviews (unknown):
Unknown:	 [0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0]
Prediction:	 [0 4 4 4 4 0 0 4 0 4 4 4 4 4 0 4 4 4 4 4 4 4 4 4 0 4 0 4 4 0 0 4 4 4 4]
Score on Unknown: 0.63
Report:               precision    recall  f1-score   support

           0       0.67      0.38      0.48        16
           4       0.62      0.84      0.71        19

    accuracy                           0.63        35
   macro avg       0.64      0.61      0.60        35
weighted avg       0.64      0.63      0.61        35
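
The figures in the report can be recomputed directly from the two printed label arrays. A small sketch using only the standard library; the lists are copied from the output above and `per_class` is a hypothetical helper name:

```python
# Labels copied from the printed output (0 = negative, 4 = positive)
y_true = [0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 4, 4,
          4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0]
y_pred = [0, 4, 4, 4, 4, 0, 0, 4, 0, 4, 4, 4, 4, 4, 0, 4, 4, 4,
          4, 4, 4, 4, 4, 4, 0, 4, 0, 4, 4, 0, 0, 4, 4, 4, 4]

def per_class(label):
    """Return (precision, recall, f1) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the model chose this label
    actual = sum(t == label for t in y_true)     # how often the label is really present
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label in (0, 4):
    p, r, f = per_class(label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The low recall for class 0 comes from only 6 of the 16 negative reviews being predicted as negative: on these reviews the model leans towards the positive label.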
 



### Random One

Predict the sentiment of `one random` review.

In [12]:
random_index = df_reviews.sample().index.item()

X_unknown_one = [df_reviews['Review'][random_index]]
y_unknown_one = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown_one[0])
print("Expected:", labels[y_unknown_one])

# Convert using vectorizer
X_unknown_one = tfidf_vectorizer.transform(X_unknown_one)
y_pred_one = classifier.predict(X_unknown_one)

print("Prediction:", labels[y_pred_one[0]])  # use the prediction for this review, not the first entry of the earlier batch

Review:
 This book is worthless; I sent it back. You may derive some benefit if you are new to programming, but this book is full of warm milk for programming babies; it contains no strong meat, which pros require to produce commercial quality product at scale.
Expected: negative
Prediction: negative
