## Sentiment140 ✨ KNN Classifier

### Dataset

Sentiment140 contains `1,600,000 tweets` extracted using the Twitter API.  
The tweets are annotated with a sentiment label (`0 = negative, 4 = positive`) and can be used to train sentiment classifiers.

In [96]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/sentiment140.csv", header=None, usecols=[0, 5]) # no header in .csv

# Assign column names
df.columns = ['Sentiment', 'Text']

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


Shape: (1600000, 2)


### Sample

Take a random `sample` of 10,000 tweets; the full dataset is too large to process here.

In [97]:
import random

# Sample
sample_size = 10_000
indices = random.sample(range(len(df)), sample_size)
df = df.iloc[indices]

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 31871 | 0 | oh lordddy i missed the live show |
| 1314017 | 4 | Starcraft 2 confirmed by the end of 2009 |
| 1426281 | 4 | @ladysybilla \t I am reading in Foforks Russet... |
| 180967 | 0 | @jdenisse i miss you |
| 394029 | 0 | @PhantomPai omg it's really happening!!! and u... |


Shape: (10000, 2)
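
The draw above changes on every run because `random` is never seeded. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value `42` and the helper name `sample_indices` are illustrative and not part of the notebook:

```python
import random

def sample_indices(n_rows, sample_size, seed=42):
    """Return `sample_size` distinct row positions drawn from range(n_rows)."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(range(n_rows), sample_size)

indices = sample_indices(1_600_000, 10_000)
# With the DataFrame from the cell above: df = df.iloc[indices]
```

pandas offers the same in one call: `df.sample(n=10_000, random_state=42)`.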


### Split data

Split the sample into training and test `data sets` (80% / 20%).

In [98]:
from sklearn.model_selection import train_test_split

X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

display(X1.head())
display(y1.head())

964772                        @lalaliiindsey flippin' yeaah 
158919     @KateriRose   sometimes i hate papas. did you ...
1413751    @miss_magpie : what a gorgie day for a bbq! Ha...
1239651    @DJRzSpinz ill update you &amp; @BabyBree96 ab...
592149     Sao hôm nay không thể follow back mọi người ...
Name: Text, dtype: object

964772     4
158919     0
1413751    4
1239651    4
592149     0
Name: Sentiment, dtype: int64
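
A plain random split does not guarantee the same class proportions in both parts. Passing `stratify` to `train_test_split` preserves them. The sketch below uses made-up toy data (20 placeholder texts, 10 per class) purely to show the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (made up): 10 negative (0) and 10 positive (4) texts
texts = [f"tweet {i}" for i in range(20)]
labels = [0] * 10 + [4] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(Counter(y_train))  # 8 of each class
print(Counter(y_test))   # 2 of each class
```

In the notebook this would correspond to adding `stratify=y` to the existing call.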

### Vectorize

Convert the text into `numerical` TF-IDF features. The vectorizer is fitted on the training set only and then applied to the test set.

In [99]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

print(X1.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
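
`fit_transform` returns a SciPy sparse matrix, so `X1.toarray()` builds a dense copy that is almost entirely zeros. A small sketch on a made-up three-document corpus shows how the matrix can be inspected without densifying it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
corpus = [
    "i love this sunny day",
    "i hate this rainy day",
    "what a day",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

print("Shape:", X.shape)  # (documents, vocabulary size)
print("Non-zero entries:", X.nnz)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
```

Note that the default tokenizer drops single-character tokens such as "i" and "a", so they never reach the vocabulary.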


### KNN Classifier

Train the KNN `classifier` on the training set (X1), then evaluate it on both the training set and the test set (X2).

In [100]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier()
knn.fit(X1, y1)

# Make predictions on the test set
y_pred = knn.predict(X2)

# Reports
print("Sentiment140 / samples =", sample_size)
print("Score on Train:", knn.score(X1, y1).round(2))
print("Score on Test:", knn.score(X2, y2).round(2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 / samples = 10000
Score on Train: 0.8
Score on Test: 0.67
Report:               precision    recall  f1-score   support

           0       0.65      0.71      0.68       991
           4       0.69      0.62      0.65      1009

    accuracy                           0.67      2000
   macro avg       0.67      0.67      0.67      2000
weighted avg       0.67      0.67      0.67      2000
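
`KNeighborsClassifier()` runs with its defaults here, including `n_neighbors=5`. One thing that may be worth trying is a different number of neighbors. The sketch below compares a few values of `k` with 3-fold cross-validation; the toy corpus and the candidate values `(1, 3, 5)` are assumptions for illustration, not results from this notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy corpus: 0 = negative, 4 = positive
texts = [
    "i love this", "what a great day", "love the great weather",
    "so happy today", "great show tonight", "happy and in love",
    "i hate this", "what a bad day", "hate the bad weather",
    "so sad today", "bad show tonight", "sad and in pain",
]
labels = [4] * 6 + [0] * 6

X = TfidfVectorizer().fit_transform(texts)

scores = {}
for k in (1, 3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 3 stratified folds
    scores[k] = cross_val_score(knn, X, labels, cv=3).mean()

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On the real sample the same loop would run over `X1` and `y1`, keeping the test set untouched until the final evaluation.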



### Predictions

Predict the sentiment of `reviews` loaded from a separate .csv file.

In [101]:
df_reviews = pd.read_csv('data/reviews.csv')
df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = df_reviews['Review']
y_unknown = df_reviews['Sentiment']

labels = {0: 'negative', 4: 'positive'}

# Convert using vectorizer
X_unknown = tfidf_vectorizer.transform(X_unknown)
y_pred_unknown = knn.predict(X_unknown)

print("Reviews (unknown):")
print("Unknown:\t", y_unknown.values)
print("Prediction:\t", y_pred_unknown)
print("Score on Reviews:", knn.score(X_unknown, y_unknown).round(2))

Reviews (unknown):
Unknown:	 [0 0 4 4 4 4 4 4 4 4 4 4 4 0 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0 0 0 4 4 0]
Prediction:	 [4 0 0 0 4 0 4 4 0 4 4 4 4 4 4 4 4 0 4 4 4 0 4 4 4 0 4 0 0 0 0 4 4 4 0]
Score on Reviews: 0.57
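
New text has to be transformed with the vectorizer that was fitted on the training data, which the cell above does by hand. A scikit-learn pipeline can bundle both steps so that raw strings go straight into `predict`. A minimal sketch on made-up data; the training texts, the name `label_names`, and the choice of `n_neighbors=3` are assumptions, not the notebook's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up training data: 0 = negative, 4 = positive
train_texts = [
    "great book, loved it", "excellent and clear examples",
    "loved the clear writing", "terrible book, hated it",
    "confusing and boring examples", "hated the boring writing",
]
train_labels = [4, 4, 4, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(train_texts, train_labels)

label_names = {0: "negative", 4: "positive"}
predictions = model.predict(["loved the excellent examples", "boring and terrible"])
print([label_names[p] for p in predictions])
```

Because the vectorizer lives inside the pipeline, there is no separate `transform` call to forget when scoring new reviews.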


### Random One

Predict the sentiment of `one random` review and compare it with the expected label.

In [105]:
random_index = df_reviews.sample().index.item()

X_unknown_one = [df_reviews['Review'][random_index]]
y_unknown_one = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown_one[0])
print("Expected:", labels[y_unknown_one])

# Convert using vectorizer
X_unknown_one = tfidf_vectorizer.transform(X_unknown_one)
y_pred_one = knn.predict(X_unknown_one)

print("Prediction:", labels[y_pred_one[0]])

Review:
 The topics and concepts in the book are the ones that the Pro programmer will use. However, the author did a pure job of implementing it. Unfortunately, there are too many unneeded talks and comparisons instead of getting straight to the point. In addition, the author's example is so long that you must go back and forth between chapters to see pieces of the code.
Expected: negative
Prediction: negative
