## Sentiment140 ✨ KNN Classifier

### Dataset

Sentiment140 contains `1,600,000 tweets` extracted using the Twitter API.  
The tweets are annotated with a sentiment label (`0 = negative, 4 = positive`) and can be used to train sentiment classifiers.
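
The numeric labels can be mapped to readable names when inspecting results. A minimal sketch on toy labels; the `label_names` mapping is an illustrative assumption and not part of the notebook:

```python
import pandas as pd

# Toy labels in the Sentiment140 encoding (0 = negative, 4 = positive).
labels = pd.Series([0, 4, 4, 0], name="Sentiment")

# Hypothetical lookup table, introduced here only for illustration.
label_names = {0: "negative", 4: "positive"}
readable = labels.map(label_names)

print(readable.tolist())
```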

In [66]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/sentiment140.csv", header=None, usecols=[0, 5]) # no header in .csv

# Assign column names
df.columns = ['Sentiment', 'Text']

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


Shape: (1600000, 2)


### Sample

Draw a random `sample` of 10,000 tweets; the full dataset is too large to process here.

In [67]:
import random

# Sample
sample_size = 10_000
indices = random.sample(range(len(df)), sample_size)
df = df.iloc[indices]

display(df.head())
print("Shape:", df.shape)

| | Sentiment | Text |
|---|---|---|
| 980293 | 4 | Rushing to get to the pool what is this thing... |
| 1395650 | 4 | @Annette836 oh wow your going twice. lucky you... |
| 876833 | 4 | I love how Pacquiao doesn't talk trash about o... |
| 574372 | 0 | Been here for almost an hour, still waiting |
| 390409 | 0 | day is a bit of a blah dayfor me... it's the 2... |


Shape: (10000, 2)
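
The sampling cell above is unseeded, so each run draws different rows. One possible alternative, sketched here on a toy frame (the `toy` data and variable names are assumptions), is `DataFrame.sample` with a fixed `random_state`, which returns the same rows on every run:

```python
import pandas as pd

# Toy frame standing in for the full dataset (same column names as the notebook).
toy = pd.DataFrame({
    "Sentiment": [0, 4] * 50,
    "Text": [f"tweet {i}" for i in range(100)],
})

# A fixed random_state makes the draw repeatable; rows are sampled without replacement.
sample_a = toy.sample(n=10, random_state=0)
sample_b = toy.sample(n=10, random_state=0)

print("Same rows both times:", sample_a.index.equals(sample_b.index))
print("Shape:", sample_a.shape)
```

In the notebook this would correspond to something like `df.sample(n=sample_size, random_state=0)`, though the rows selected would differ from the ones shown above.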


### Split data

Split the sample into training (80%) and testing (20%) `data sets`.

In [68]:
from sklearn.model_selection import train_test_split

X = df['Text']
y = df['Sentiment']
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=0)

display(X1.head())
display(y1.head())

1150473    @markhoppus I'm thinking vegas ... Back to bac...
764544                  @mskatrina25 YOU!!! YOU M.I.A AGAIN 
567307      hair is too short now. still sick. have to le...
870840     @KeirPoole Everyones saying that but I loved t...
244227     @regengirl Glad to hear you're home safe. Big ...
Name: Text, dtype: object

1150473    4
764544     0
567307     0
870840     4
244227     0
Name: Sentiment, dtype: int64
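
A plain random split can leave the two classes slightly unbalanced between the training and test sets. As a sketch on toy data (the frame and variable names are assumptions), passing `stratify` to `train_test_split` keeps the class proportions the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 50 negative (0) and 50 positive (4) tweets.
toy = pd.DataFrame({
    "Text": [f"tweet {i}" for i in range(100)],
    "Sentiment": [0] * 50 + [4] * 50,
})

# stratify preserves the 50/50 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy["Text"],
    toy["Sentiment"],
    test_size=0.2,
    random_state=0,
    stratify=toy["Sentiment"],
)

print("Train counts:", y_train.value_counts().to_dict())
print("Test counts:", y_test.value_counts().to_dict())
```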

### Vectorize

Convert the text into `numerical` TF-IDF features. The vectorizer is fitted on the training set only and then applied to the test set.

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
X1 = v.fit_transform(X1)
X2 = v.transform(X2)

print(X1.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### KNN Classifier

Train the KNN `classifier` on the training set (`X1`, `y1`) and evaluate it on the test set (`X2`, `y2`).

In [70]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier()
knn.fit(X1, y1)

# Make predictions on the test set
y_pred = knn.predict(X2)

# Reports
print("Sentiment140 / samples =", sample_size)
# Built-in round() works whether score() returns a Python float or a NumPy float
print("Score on Train:", round(knn.score(X1, y1), 2))
print("Score on Test:", round(knn.score(X2, y2), 2))
# Print the report on its own lines so the column headers stay aligned
print("Report:")
print(classification_report(y2, y_pred))

Sentiment140 / samples = 10000
Score on Train: 0.79
Score on Test: 0.67
Report:
              precision    recall  f1-score   support

           0       0.65      0.72      0.68       982
           4       0.70      0.63      0.66      1018

    accuracy                           0.67      2000
   macro avg       0.68      0.67      0.67      2000
weighted avg       0.68      0.67      0.67      2000
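
`KNeighborsClassifier()` defaults to `n_neighbors=5`, and the gap between the train and test scores suggests the choice of k may be worth checking. The following is a minimal sketch, on made-up toy texts, of comparing a few values of k with cross-validation; the texts, labels, and candidate values are assumptions and the scores it prints say nothing about Sentiment140 itself:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus: 6 negative (0) and 6 positive (4) texts, for illustration only.
texts = [
    "i hate this", "so sad today", "this is awful", "feeling bad again",
    "worst day ever", "i am upset", "i love this", "so happy today",
    "this is great", "feeling good again", "best day ever", "i am glad",
]
labels = [0] * 6 + [4] * 6

scores = {}
for k in (1, 3, 5):
    # The pipeline refits the vectorizer inside each fold, so no test text leaks into training.
    model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, texts, labels, cv=3).mean()

for k, score in scores.items():
    print(f"k={k}: mean accuracy {score:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary from being fitted on held-out folds, mirroring the fit-on-train-only approach used in the Vectorize step.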

