## Sentiment140 ✨ Logistic Regression

### Dataset

Load the `sentiment140` dataset, assign column names, and draw a random sample of 100,000 rows.

In [53]:
import numpy as np
import pandas as pd
import random

# Load dataset
df = pd.read_csv('data/sentiment140.csv', header=None, usecols=[0, 5]) # no header in .csv
df.columns = ['Sentiment', 'Text']

# Sample
sample_size = 100_000
sample_indices = random.sample(range(len(df)), sample_size)
df = df.iloc[sample_indices]

display(df.head())

| | Sentiment | Text |
|---|---|---|
| 1073703 | 4 | @yourscenekid Cheer up |
| 195697 | 0 | It's too cloudy for sunbathing. |
| 1275633 | 4 | @MonieParedes Oh is that the excuse? I could h... |
| 959937 | 4 | @shotbykim Nope just a cat (Elmo) |
| 374604 | 0 | i have not gotten a new moon poster where do t... |
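
Because `random.sample` is called without a seed, every run draws a different subset, so the scores reported further down can shift slightly between runs. A minimal sketch of a seeded draw, assuming a hypothetical seed of 42 and a small stand-in row count instead of the real dataset:

```python
import random

# Hypothetical values for illustration; the notebook uses len(df) and 100_000.
n_rows = 1_000
sample_size = 100
seed = 42  # assumed seed, any fixed integer works

# A dedicated Random instance keeps the draw independent of global state.
rng = random.Random(seed)
sample_indices = rng.sample(range(n_rows), sample_size)

# Re-creating the generator with the same seed reproduces the same indices.
repeat_indices = random.Random(seed).sample(range(n_rows), sample_size)
print(sample_indices == repeat_indices)
```

The resulting indices can be passed to `df.iloc[...]` exactly as in the cell above. pandas can also sample rows directly with `df.sample(n=sample_size, random_state=seed)`.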


### Preprocessing

Convert the text to lowercase, remove punctuation and digits, and strip leading and trailing whitespace.

In [54]:
df['Text'] = df['Text'].str.lower()                              # Convert text to lowercase
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation (raw string avoids invalid-escape warnings)
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)      # Remove digits
df['Text'] = df['Text'].str.strip()                              # Strip leading/trailing whitespace
display(df.head())

| | Sentiment | Text |
|---|---|---|
| 1073703 | 4 | yourscenekid cheer up |
| 195697 | 0 | its too cloudy for sunbathing |
| 1275633 | 4 | monieparedes oh is that the excuse i could hav... |
| 959937 | 4 | shotbykim nope just a cat elmo |
| 374604 | 0 | i have not gotten a new moon poster where do t... |
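
The cleaning steps can also be wrapped in one plain function, which makes it easy to apply identical cleaning to any text that is scored later. A sketch using only the standard `re` module; the name `clean_text` is an assumption and does not appear in the notebook:

```python
import re

def clean_text(text: str) -> str:
    """Apply the same steps as the cell above to a single string."""
    text = text.lower()                  # lowercase
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)      # drop digits
    return text.strip()                  # trim leading/trailing whitespace

print(clean_text("@shotbykim Nope just a cat (Elmo)"))
```

With pandas this would be `df['Text'].map(clean_text)`. Applying the same function to `df_reviews['Review']` before vectorizing would keep the training and prediction inputs consistent, since the reviews are otherwise passed to the vectorizer uncleaned.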


### Training
Split the data into training and test sets, convert the text to numerical features with TF-IDF, and fit a logistic regression model.

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the dataset into training and testing sets
X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

# Convert text into numerical features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

# Train the model
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X1, y1)

# Make predictions on the test set
y_pred = classifier.predict(X2)

print("Sentiment140 samples =", sample_size)
print("Score on Train:", round(classifier.score(X1, y1), 2))
print("Score on Test:", round(classifier.score(X2, y2), 2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 samples = 100000
Score on Train: 0.84
Score on Test: 0.76
Report:               precision    recall  f1-score   support

           0       0.77      0.74      0.75     10023
           4       0.75      0.78      0.76      9977

    accuracy                           0.76     20000
   macro avg       0.76      0.76      0.76     20000
weighted avg       0.76      0.76      0.76     20000
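
To make the "numerical features" step concrete, the sketch below computes TF-IDF weights by hand for a tiny made-up corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It skips stop-word removal and is an illustration only, not the library's implementation.

```python
import math
from collections import Counter

# Made-up, already cleaned documents (not taken from Sentiment140).
docs = ["good movie", "bad movie", "good good day"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # unit-length row

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

A term that occurs in fewer documents ("bad") receives a larger weight than a term shared across documents ("movie"), which is what lets the classifier focus on distinctive words.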



### Predictions

Evaluate the model on a separate set of labelled reviews from `reviews.csv`, which it has not seen during training.

In [56]:
# Load reviews
df_reviews = pd.read_csv('data/reviews.csv')

# Convert the sentiment labels
df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = df_reviews['Review']
y_unknown = df_reviews['Sentiment']

# Convert text into numerical features
X_unknown = tfidf_vectorizer.transform(X_unknown)

# Make predictions on unknown set
y_unknown_pred = classifier.predict(X_unknown)
score = classifier.score(X_unknown, y_unknown)

print("Reviews (unknown):")
print("Unknown:\t", y_unknown.values)
print("Prediction:\t", y_unknown_pred)
print("Score on Unknown:", round(score, 2))
print("Report:", classification_report(y_unknown, y_unknown_pred), "\n")

Reviews (unknown):
Unknown:	 [0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0]
Prediction:	 [4 4 4 4 4 0 4 4 4 4 4 4 4 4 0 4 4 4 4 4 0 4 0 4 0 4 0 4 4 0 0 4 4 4 0]
Score on Unknown: 0.69
Report:               precision    recall  f1-score   support

           0       0.78      0.44      0.56        16
           4       0.65      0.89      0.76        19

    accuracy                           0.69        35
   macro avg       0.72      0.67      0.66        35
weighted avg       0.71      0.69      0.67        35
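
The figures in this report can be re-derived from the two printed arrays. The sketch below copies them verbatim and counts the outcomes for class 0 (negative); the variable names are illustrative.

```python
# Arrays copied from the output above.
y_true = [0,0,4,4,4,4,4,4,4,4,4,4,4,0,0,0,4,4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,4,4,0]
y_pred = [4,4,4,4,4,0,4,4,4,4,4,4,4,4,0,4,4,4,4,4,0,4,0,4,0,4,0,4,4,0,0,4,4,4,0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 0 and p == 0 for t, p in pairs)  # negatives correctly found
fn = sum(t == 0 and p == 4 for t, p in pairs)  # negatives missed
fp = sum(t == 4 and p == 0 for t, p in pairs)  # positives flagged as negative

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Only 7 of the 16 negative reviews are recognized (recall 0.44), so most errors are negative reviews predicted as positive. With just 35 reviews, each single misclassification moves accuracy by almost three percentage points, so these scores are rough estimates.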
 



### Random One

Predict the sentiment of one randomly chosen review.

In [57]:
random_index = df_reviews.sample().index.item()

X_unknown_one = [df_reviews['Review'][random_index]]
y_unknown_one = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown_one[0])
print("Expected:", labels[y_unknown_one])

# Convert using vectorizer
X_unknown_one = tfidf_vectorizer.transform(X_unknown_one)
y_pred_one = classifier.predict(X_unknown_one)

print("Prediction:", labels[y_pred_one[0]])  # use the prediction for this review, not the first of the full set

Review:
 The book starts by explaining each function with an example that makes sense. But it builds on these pieces as the book continues. The approach is well disciplined and pretty quickly you are building solutions. A mix of theory and code with emphasis on the latter. You won't learn calculus but you will learn how to do it from the code. Good for those who dropped out of Andrew Ng's courses and mortals wanting results. Me: 40 years development, MBA, and 5 years data science. Imagining and prediction mostly.
Expected: positive
Prediction: positive
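
One way to guarantee that a single review goes through the same cleaning and vectorizing as the training data is to bundle the steps into a scikit-learn pipeline. The sketch below is a toy under stated assumptions: the training texts are invented, `clean_text` is a hypothetical helper mirroring the preprocessing cell, and the model is far too small to be meaningful.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean_text(text):
    # Hypothetical helper: lowercase, drop punctuation and digits, trim.
    text = re.sub(r'[^\w\s]', '', text.lower())
    return re.sub(r'\d+', '', text).strip()

# Invented toy data using the notebook's label scheme (0 = negative, 4 = positive).
texts = ["I love this book", "Great read, very clear", "Terrible and boring", "I hate it"]
targets = [4, 4, 0, 0]

# preprocessor= replaces the vectorizer's built-in lowercasing step.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, targets)

labels = {0: 'negative', 4: 'positive'}
prediction = model.predict(["What a GREAT book!"])[0]
print("Prediction:", labels[prediction])
```

Because the pipeline accepts raw strings, there is no separate `transform` call to forget, and the prediction printed always belongs to the text that was passed in.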
