### Sentiment140 Dataset / Logistic Regression

Load the `sentiment140` dataset, assign column names, and draw a random sample of 100,000 rows.

In [2]:
import numpy as np
import pandas as pd
import random

# Load dataset
df = pd.read_csv('sentiment140.csv', header=None, usecols=[0, 5]) # no header in .csv
df.columns = ['Sentiment', 'Text']

# Sample
sample_size = 100_000
sample_indices = random.sample(range(len(df)), sample_size)
df = df.iloc[sample_indices]

display(df.head())

| | Sentiment | Text |
|---|---|---|
| 956867 | 4 | oh would ya look at that! |
| 108587 | 0 | NOOOOOOOOOOOO PEYTON IS GOING TO DIE D: AND LU... |
| 996058 | 4 | @indiaknight Oh I didn't know that - I was des... |
| 1156621 | 4 | Tesbihaaaaaaatt hihi, I'm so cheated the other... |
| 1356050 | 4 | Just drove to target I feel so accomplished haha |
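
The sample above is drawn with `random.sample` and no seed, so each run selects different rows and the scores can shift slightly. A minimal sketch of a seeded variant, assuming reproducibility is wanted; the helper name `sample_row_positions` and the seed value `42` are illustrative choices, not part of the original notebook:

```python
import random

def sample_row_positions(n_rows, sample_size, seed=42):
    """Return `sample_size` distinct row positions in [0, n_rows), reproducibly."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(range(n_rows), sample_size)

# Toy sizes for illustration; the notebook would pass len(df) and 100_000
positions = sample_row_positions(n_rows=1_000, sample_size=10)
print(positions)
```

The returned list could be passed to `df.iloc[...]` in the same way as `sample_indices`.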


### Preprocessing

Convert the text to lowercase, remove punctuation and digits, and strip leading and trailing whitespace.

In [3]:
df['Text'] = df['Text'].str.lower()                              # Convert text to lowercase
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)  # Remove punctuation
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)      # Remove digits
df['Text'] = df['Text'].str.strip()                              # Strip leading/trailing whitespace
display(df.head())

| | Sentiment | Text |
|---|---|---|
| 956867 | 4 | oh would ya look at that |
| 108587 | 0 | noooooooooooo peyton is going to die d and luc... |
| 996058 | 4 | indiaknight oh i didnt know that i was desper... |
| 1156621 | 4 | tesbihaaaaaaatt hihi im so cheated the others ... |
| 1356050 | 4 | just drove to target i feel so accomplished haha |
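
The cleaning steps above can also be written as a single function on plain strings, which makes them easy to check on one tweet at a time. This is a sketch using the same regular expressions with the standard `re` module; the name `clean_text` is hypothetical:

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation and digits, and trim surrounding whitespace."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # keep only word characters and whitespace
    text = re.sub(r'\d+', '', text)      # remove runs of digits
    return text.strip()

print(clean_text("@indiaknight Oh I didn't know that!"))
# → indiaknight oh i didnt know that
```

Note that only the ends are trimmed: removing a number from the middle of a sentence leaves a double space behind. The reviews scored later in this notebook are passed to the vectorizer without these steps, so applying the same function to them first (for example with `df_reviews['Review'].map(clean_text)`) might keep training and prediction text consistent; whether that changes the scores has not been checked here.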


### Training
Split the data into training and test sets, convert the text to numerical TF-IDF features, and fit a Logistic Regression model.



In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the dataset into training and testing sets
X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

# Convert text into numerical features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

# Train the model
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X1, y1)

# Make predictions on the test set
y_pred = classifier.predict(X2)

print("Sentiment140 samples =", sample_size)
print("Score on Train:", round(classifier.score(X1, y1), 2))
print("Score on Test:", round(classifier.score(X2, y2), 2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 samples = 100000
Score on Train: 0.84
Score on Test: 0.76
Report:               precision    recall  f1-score   support

           0       0.77      0.75      0.76      9945
           4       0.76      0.77      0.77     10055

    accuracy                           0.76     20000
   macro avg       0.76      0.76      0.76     20000
weighted avg       0.76      0.76      0.76     20000
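
To make the "numerical features" step more concrete, here is a small pure-Python sketch of TF-IDF weighting. It follows the default scheme documented for `TfidfVectorizer` (smoothed IDF, rows scaled to unit L2 length) but assumes a plain whitespace tokenizer and no stop-word list, so its numbers will not match the features used above. The function name `tfidf_rows` and the toy documents are made up for illustration:

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["good movie", "bad movie", "good good fun"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that occur in fewer documents get a larger weight: in the second toy document, `bad` (one document) outweighs `movie` (two documents).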



### Predictions

Predict the sentiment of unseen reviews loaded from the `reviews_unknown.csv` file.

In [5]:
# Load reviews
df_reviews = pd.read_csv('reviews_unknown.csv')

# Convert the sentiment labels
df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = df_reviews['Review']
y_unknown = df_reviews['Sentiment']

# Convert text into numerical features
X_unknown = tfidf_vectorizer.transform(X_unknown)

# Make predictions on unknown set
y_unknown_pred = classifier.predict(X_unknown)
score = classifier.score(X_unknown, y_unknown)

print("Reviews (unknown):")
print("Unknown:\t", y_unknown.values)
print("Prediction:\t", y_unknown_pred)
print("Score on Unknown:", round(score, 2))
print("Report:", classification_report(y_unknown, y_unknown_pred), "\n")

Reviews (unknown):
Unknown:	 [0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0]
Prediction:	 [0 0 4 4 4 0 4 4 4 0 4 4 4 4 0 4 4 4 4 4 4 4 4 4 0 4 0 4 4 0 0 4 4 4 4]
Score on Unknown: 0.69
Report:               precision    recall  f1-score   support

           0       0.78      0.44      0.56        16
           4       0.65      0.89      0.76        19

    accuracy                           0.69        35
   macro avg       0.72      0.67      0.66        35
weighted avg       0.71      0.69      0.67        35
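
The per-class figures in the report above can be recomputed by hand from the two printed arrays, which shows where the errors fall. A minimal sketch in plain Python; the helper name `per_class_metrics` is hypothetical, and the lists are copied from the `Unknown` and `Prediction` output:

```python
y_true = [0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 4, 4, 4, 4,
          4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0]
y_pred = [0, 0, 4, 4, 4, 0, 4, 4, 4, 0, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4,
          4, 4, 4, 4, 0, 4, 0, 4, 4, 0, 0, 4, 4, 4, 4]

def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the model chose this label
    actual = sum(t == label for t in y_true)     # support: true occurrences of the label
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, actual

for label in (0, 4):
    p, r, f1, support = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
# → 0: precision=0.78 recall=0.44 f1=0.56 support=16
# → 4: precision=0.65 recall=0.89 f1=0.76 support=19
```

The low recall for class 0 means that 9 of the 16 negative reviews were predicted as positive, while only 2 of the 19 positive reviews were predicted as negative.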
 



### Random One

Predict the sentiment of one randomly chosen review.

In [7]:
random_index = df_reviews.sample().index.item()

X_unknown_one = [df_reviews['Review'][random_index]]
y_unknown_one = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown_one[0])
print("Expected:", labels[y_unknown_one])

# Convert using vectorizer
X_unknown_one = tfidf_vectorizer.transform(X_unknown_one)
y_pred_one = classifier.predict(X_unknown_one)

print("Prediction:", labels[y_pred_one[0]])

Review:
 Had a lot of issues trying to get the coding running. The plot_interactive_tree.py used through out the book used a imread module which no longer exists in its old place, you will get a bitter taste of the 'dependency hell' of python out from these code. For me I had to give up after a couple days of trying.
Expected: negative
Prediction: negative
