## Sentiment140 ✨ KNN Classifier

### Dataset

Sentiment140 contains `1,600,000 tweets` extracted using the Twitter API.  
The tweets are annotated (`0 = negative, 4 = positive`) and can be used to train a sentiment classifier.

In [44]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/sentiment140.csv", header=None, usecols=[0, 5]) # no header in .csv

# Assign column names
df.columns = ['Sentiment', 'Text']

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


Shape: (1600000, 2)


### Sample

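Draw a random `sample` of 10,000 tweets. KNN compares every query against all stored training points, so fitting and scoring on the full 1.6M rows is impractically slow.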

In [45]:
import random

# Sample
sample_size = 10_000
indices = random.sample(range(len(df)), sample_size)
df = df.iloc[indices]

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 145039 | 0 | @she_writes what's wrong |
| 1138503 | 4 | @ferntreacy oh hon! i'm sorry you have to go t... |
| 1290030 | 4 | I should start charging people for seeking me ... |
| 448798 | 0 | I spilled hot soup on my arm. |
| 1451413 | 4 | Valerie danced amazing, she had one 10 and the... |


Shape: (10000, 2)
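
The sampling cell above is unseeded, so each run draws different rows and the scores further down will shift a little between runs. A minimal sketch of a reproducible alternative, assuming pandas' `DataFrame.sample` with a fixed `random_state`. The toy frame and the names `toy_df`, `sample_a` and `sample_b` are illustrative only and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the full Sentiment140 frame (hypothetical data)
toy_df = pd.DataFrame({
    "Sentiment": [0, 4] * 50,
    "Text": [f"tweet {i}" for i in range(100)],
})

# A fixed random_state makes the draw repeatable across runs
sample_a = toy_df.sample(n=10, random_state=0)
sample_b = toy_df.sample(n=10, random_state=0)

print("Same rows:", sample_a.index.equals(sample_b.index))
print("Shape:", sample_a.shape)
```

In the notebook itself this would presumably amount to `df.sample(n=sample_size, random_state=0)` in place of the `random.sample` / `iloc` pair.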


### Split data

Split the sample into training (80%) and testing (20%) `data sets`.

In [46]:
from sklearn.model_selection import train_test_split

X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

display(X1.head())
display(y1.head())

638997     looks like we'll be under self-quarantine for ...
933045     *stepping on an oreo on the ground*  me:oops a...
1317049    Heeeey guys!! J&amp;C day for me 2day yeeeeey ...
376238     I'm exhausted but my damn ocd won't let me sle...
250243                                 @Seany_ aww  *cuddle*
Name: Text, dtype: object

638997     0
933045     4
1317049    4
376238     0
250243     0
Name: Sentiment, dtype: int64

### Vectorize

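Convert the text into `numerical` features using TF-IDF. The vectorizer is fitted on the training text only and then reused to transform the test text.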

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X1 = tfidf_vectorizer.fit_transform(X1)
X2 = tfidf_vectorizer.transform(X2)

# The result is a sparse matrix; toarray() densifies it for inspection only
print(X1.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### KNN Classifier

Train the KNN `classifier` on the training set (X1), then evaluate it on the test set (X2).

In [48]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier()
knn.fit(X1, y1)

# Make predictions on the test set
y_pred = knn.predict(X2)

# Reports
print("Sentiment140 / samples =", sample_size)
# Built-in round() works whether score() returns a Python or a NumPy float
print("Score on Train:", round(knn.score(X1, y1), 2))
print("Score on Test:", round(knn.score(X2, y2), 2))
print("Report:", classification_report(y2, y_pred))

Sentiment140 / samples = 10000
Score on Train: 0.79
Score on Test: 0.65
Report:               precision    recall  f1-score   support

           0       0.64      0.71      0.67      1002
           4       0.67      0.60      0.63       998

    accuracy                           0.65      2000
   macro avg       0.66      0.65      0.65      2000
weighted avg       0.66      0.65      0.65      2000
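
The gap between the train score (0.79) and the test score (0.65) suggests the default `n_neighbors=5` may not be the best choice for this data. One common way to check is a cross-validated search over a few values of `k`. The sketch below is an assumption about how that could look, not something the notebook does. It runs on a tiny made-up corpus, and the names `texts`, `labels`, `pipe` and `grid` are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch runnable
texts = ["love this", "great day", "so happy", "best ever", "really good", "nice one",
         "hate this", "bad day", "so sad", "worst ever", "really awful", "not nice"]
labels = [4] * 6 + [0] * 6

# Putting the vectorizer in the pipeline refits it inside each fold,
# so validation text never influences the vocabulary
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier()),
])

grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5]}, cv=3, scoring="accuracy")
grid.fit(texts, labels)

print("Best k:", grid.best_params_["knn__n_neighbors"])
print("CV accuracy:", round(grid.best_score_, 2))
```

On the real sample, the raw `Text` column and `Sentiment` labels of the training split would take the place of `texts` and `labels`, with a wider range of `k` values.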



### Predictions

Predict the sentiment of `one review` picked at random from a .csv file.

In [49]:
df_reviews = pd.read_csv('data/reviews.csv')
random_index = df_reviews.sample().index.item()

df_reviews['Sentiment'] = df_reviews['Sentiment'].map({'negative': 0, 'positive': 4})

X_unknown = [df_reviews['Review'][random_index]]
y_unknown = df_reviews['Sentiment'][random_index]

labels = {0: 'negative', 4: 'positive'}

print("Review:\n", X_unknown[0], "\n")
print("Expected:", labels[y_unknown])

# Convert using vectorizer
X_unknown = tfidf_vectorizer.transform(X_unknown)
y_pred_unknown = knn.predict(X_unknown)

print("Prediction:", labels[y_pred_unknown[0]])

Review:
 I found this book to be quite informative for someone who just get into machine learning. Lots of good examples provided ... however, I want to see some more real world application examples rather than sample data sets. 

Expected: negative
Prediction: negative
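
The transform-then-predict steps above can be wrapped in a small helper so that several texts are scored in one call. This is a sketch under stated assumptions: `predict_sentiment`, `LABELS`, `toy_vectorizer` and `toy_knn` are hypothetical names, and the four training sentences are made up so the block runs on its own:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

LABELS = {0: "negative", 4: "positive"}

def predict_sentiment(texts, vectorizer, model):
    """Return a label name for each raw text in `texts`."""
    # Reuse the fitted vocabulary; the vectorizer must not be refitted here
    features = vectorizer.transform(texts)
    return [LABELS[int(p)] for p in model.predict(features)]

# Made-up training data, standing in for the notebook's fitted objects
train_texts = ["i love this", "what a great day", "i hate this", "what a bad day"]
train_labels = [4, 4, 0, 0]

toy_vectorizer = TfidfVectorizer().fit(train_texts)
toy_knn = KNeighborsClassifier(n_neighbors=1)
toy_knn.fit(toy_vectorizer.transform(train_texts), train_labels)

print(predict_sentiment(["i love this", "i hate this"], toy_vectorizer, toy_knn))
# → ['positive', 'negative']
```

In the notebook the same helper would be called with `tfidf_vectorizer` and `knn`. A single correct prediction says little about quality, and book reviews differ in style from tweets, so scoring the whole reviews file would give a fairer picture.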
