## Sentiment140 ✨ KNN Classifier

### Dataset

Sentiment140 contains `1,600,000 tweets` extracted using the Twitter API.  
Each tweet is annotated with a polarity label (`0 = negative, 4 = positive`), so the dataset can be used to train sentiment classifiers.

In [86]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/sentiment140.csv", header=None, usecols=[0, 5]) # no header in .csv

# Assign column names
df.columns = ['Sentiment', 'Text']

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


Shape: (1600000, 2)


### Sample

Draw a random `sample` of 10,000 tweets, because working with all 1.6 million rows is impractical in this notebook.

In [87]:
import random

# Sample
sample_size = 10_000
indices = random.sample(range(len(df)), sample_size)
df = df.iloc[indices]

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 177203 | 0 | We're going to panahra for dinner. miss you g... |
| 901481 | 4 | @AndrewHansen1 Thanks Andrew! It hasn't yet be... |
| 265976 | 0 | @morganmarie i exaggerate, but it will want to... |
| 1170494 | 4 | @zenworm I see dirty diapers in my future... M... |
| 745487 | 0 | Bah. My head hurts and I wish I were napping. ... |


Shape: (10000, 2)
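The draw above changes on every run because `random.sample` is not seeded. A minimal sketch of a repeatable draw, assuming a fixed seed is acceptable; the seed value `0` and the toy sizes are illustrative and not taken from the notebook:

```python
import random

population_size = 1_000  # stand-in for len(df); illustrative value
sample_size = 10         # illustrative, the notebook uses 10_000

# A dedicated Random instance with a fixed seed yields the same indices on every run.
indices_a = random.Random(0).sample(range(population_size), sample_size)
indices_b = random.Random(0).sample(range(population_size), sample_size)

print(indices_a == indices_b)                 # identical draws for identical seeds
print(len(set(indices_a)) == sample_size)     # sampling is without replacement
```

The resulting list can be passed to `df.iloc[...]` in the same way as `indices` in the cell above.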


### Split data

Split the sample into training (80%) and test (20%) `data sets`.

In [88]:
from sklearn.model_selection import train_test_split

X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

display(X1.head())
display(y1.head())

1140200      I GOT A HAIRCUTTT  it's short, but i love it .)
1172245    @harryjandu aww a gorgeous little funny wabbit...
493718     My mighty mouse is not so mighty, this morning...
1314859    @gorillaglo this i know, and yes we do i say s...
1476784                     You should text me 210-343-9168 
Name: Text, dtype: object

1140200    4
1172245    4
493718     0
1314859    4
1476784    4
Name: Sentiment, dtype: int64

### Vectorize

Convert the text into `numerical` TF-IDF features. The vectorizer is fitted on the training texts only and then applied to the test texts.

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

print(X1.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
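The printed matrix is almost entirely zeros because each tweet contains only a few of the vocabulary's terms. The sketch below recomputes TF-IDF by hand on a made-up three-document corpus, assuming scikit-learn's documented defaults (smoothed idf, L2-normalised rows). The whitespace tokenizer is a simplification of what `TfidfVectorizer` does, and all names here are illustrative:

```python
import math
from collections import Counter

docs = ["good day", "bad day", "good good vibes"]  # hypothetical toy corpus
tokenized = [doc.split() for doc in docs]          # simplified tokenizer
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Return one L2-normalised TF-IDF row aligned with `vocabulary`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that do not occur in a document get a weight of exactly zero, so with a vocabulary of thousands of terms nearly every entry in a row is zero.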


### KNN Classifier

Train the KNN `classifier` on the training set (`X1`, `y1`) and evaluate it on the test set (`X2`, `y2`).

In [90]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier()
knn.fit(X1, y1)

# Make predictions on the test set
y_pred = knn.predict(X2)

# Reports
print("Sentiment140 / samples =", sample_size)
print("Score on Train:", knn.score(X1, y1).round(2))
print("Score on Test:", knn.score(X2, y2).round(2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 / samples = 10000
Score on Train: 0.79
Score on Test: 0.68
Report:               precision    recall  f1-score   support

           0       0.66      0.73      0.70      1018
           4       0.69      0.62      0.65       982

    accuracy                           0.68      2000
   macro avg       0.68      0.67      0.67      2000
weighted avg       0.68      0.68      0.67      2000
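With default settings, `KNeighborsClassifier` looks at the 5 nearest training points and predicts the majority label among them. A minimal pure-Python sketch of that idea, using Euclidean distance on made-up 2-D points; the function `knn_predict` and the data are hypothetical, and tie-breaking may differ from scikit-learn:

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=5):
    """Return the majority label among the k training vectors closest to query."""
    # Euclidean distance from the query to every training vector.
    distances = [
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    # Keep the k nearest neighbours and let them vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: label 0 clusters near the origin, label 4 near (1, 1).
train_vectors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (0.8, 0.9)]
train_labels = [0, 0, 0, 4, 4, 4]

near_origin = knn_predict(train_vectors, train_labels, (0.1, 0.1), k=3)
near_one = knn_predict(train_vectors, train_labels, (0.9, 0.9), k=3)
print(near_origin, near_one)
```

Because nothing is learned beyond storing the training points, every prediction compares the query against the whole training set, which is one reason the method scales poorly to very large datasets.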



### Predictions

Predict the sentiment of `reviews` loaded from a separate .csv file.

In [91]:
df_reviews = pd.read_csv('data/reviews.csv')
df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = df_reviews['Review']
y_unknown = df_reviews['Sentiment']

labels = {0: 'negative', 4: 'positive'}

# Convert using vectorizer
X_unknown = tfidf_vectorizer.transform(X_unknown)
y_pred_unknown = knn.predict(X_unknown)

print("Reviews (unknown):")
print("Unknown:\t", y_unknown.values)
print("Prediction:\t", y_pred_unknown)
print("Score on Reviews:", knn.score(X_unknown, y_unknown).round(2))

Reviews (unknown):
Unknown:	 [0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0]
Prediction:	 [4 0 4 4 0 0 0 0 0 4 4 0 0 0 0 0 4 0 0 0 0 4 4 0 4 0 0 0 4 0 0 4 0 0 0]
Score on Reviews: 0.49
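The score can be checked by hand from the two arrays printed above. The sketch below copies them verbatim and also breaks the hits down per class, which the single accuracy figure hides; the variable names are illustrative:

```python
# Arrays copied from the output above.
expected_text = "0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0"
predicted_text = "4 0 4 4 0 0 0 0 0 4 4 0 0 0 0 0 4 0 0 0 0 4 4 0 4 0 0 0 4 0 0 4 0 0 0"

expected = [int(v) for v in expected_text.split()]
predicted = [int(v) for v in predicted_text.split()]
labels = {0: "negative", 4: "positive"}

# Accuracy is the share of positions where prediction and expectation agree.
hits = sum(e == p for e, p in zip(expected, predicted))
accuracy = hits / len(expected)
print("Accuracy:", round(accuracy, 2))

# Per-class breakdown: how many reviews of each true class were predicted correctly.
for value, name in labels.items():
    total = expected.count(value)
    correct = sum(e == p == value for e, p in zip(expected, predicted))
    print(f"{name}: {correct} of {total} correct")
```

An accuracy near 0.5 on two classes is close to what guessing would give, so the model trained on tweets does not appear to carry over well to these reviews.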


### Random One

Predict the sentiment of `one random` review.

In [95]:
random_index = df_reviews.sample().index.item()

X_unknown_one = [df_reviews['Review'][random_index]]
y_unknown_one = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown_one[0])
print("Expected:", labels[y_unknown_one])

# Convert using vectorizer
X_unknown_one = tfidf_vectorizer.transform(X_unknown_one)
y_pred_one = knn.predict(X_unknown_one)

print("Prediction:", labels[y_pred_one[0]])

Review:
 The book is meant to be introductory but dives straight into Python programming with NumPy and sklearn without showing the ropes of the libraries. The introduction to the ML concepts is gentle and well explained but the code is shoved down your throat and you better run to the docs to see what is actually does.
Saving point is: if you are teaching ML (like me) and need good well designed examples go for this book; also if you need very visual explanations. Would not recommend the book for a student though."
Expected: negative
Prediction: negative
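Each prediction cell repeats the same two steps: `transform` with the fitted vectorizer, then `predict`. One way to bundle them is scikit-learn's `make_pipeline`, sketched here on a made-up six-sentence corpus; the data and `n_neighbors=3` are illustrative assumptions, not the notebook's settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus using the same 0/4 label convention as Sentiment140.
texts = [
    "love this", "great love", "really great",
    "hate this", "awful hate", "really awful",
]
targets = [4, 4, 4, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# and applies both steps whenever predict() receives raw text.
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(texts, targets)

labels = {0: "negative", 4: "positive"}
prediction = model.predict(["great love it"])
print("Prediction:", labels[int(prediction[0])])
```

With a pipeline, raw strings go straight into `predict`, so the vectorizer cannot be accidentally skipped or refitted on new data.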
