# Importing Libraries and Data

In [1]:
from math import sqrt
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

In [None]:
## IMPORT DATA HERE (TF-IDF TRANSFORMED SPARSE MATRIX)

# Building the Model
K-Nearest Neighbors (KNN) is a supervised learning algorithm. It computes the distance between data points using a chosen metric (Manhattan, Euclidean, ...), then selects the k neighbors at minimum distance, i.e. those closest to the data point we want to classify.
## Calculating Distance
For this case we will calculate the Euclidean distance between the rows. Since we have a sparse matrix of normalized count values from our TF-IDF transformer, we can find the Euclidean distance between two tweets by summing the squared differences of their word-weight values at each position and taking the square root of that sum.
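
Written out, for two tweet vectors $a$ and $b$ with $n$ word-weight entries each (the symbols $a$, $b$ and $n$ are introduced here only for illustration), the distance described above is:

$$
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$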


In [2]:
def euclidean_distance(tweet0, tweet1):
    ## init distance to 0
    distance = 0.0
    ## loop through the word counts of both tweets and take the
    ## difference at each position; the last column is skipped
    ## because it holds the class label
    for i in range(len(tweet0)-1):
        distance += (tweet0[i] - tweet1[i])**2
    ## return the square-root of the summed squared differences
    return sqrt(distance)
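
Since TF-IDF rows are mostly zeros, the same distance could also be computed over only the non-zero entries. The sketch below is one possible way to do that, assuming each tweet is stored as a dict mapping a word index to its weight. That layout and the name `sparse_euclidean_distance` are hypothetical and not part of this notebook.

```python
from math import sqrt

def sparse_euclidean_distance(tweet0, tweet1):
    ## ASSUMPTION: each tweet is a dict {word_index: weight};
    ## indices missing from a dict count as 0.0
    ## only positions that are non-zero in at least one tweet contribute
    indices = set(tweet0) | set(tweet1)
    total = 0.0
    for i in indices:
        total += (tweet0.get(i, 0.0) - tweet1.get(i, 0.0)) ** 2
    return sqrt(total)

a = {0: 3.0, 5: 1.0}
b = {5: 1.0, 9: 4.0}
print(sparse_euclidean_distance(a, b))  # 5.0
```

Unlike the list-based version, this sketch assumes the class label is stored separately from the weights, so no column is skipped.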

## Getting the Neighbors
In this step, we use the distances from the previous step to find the k closest neighbors of our data point.

In [8]:
def get_neighbors(training_set, test_tweet, k):
    ## init list of distances to store (tweet, distance) tuples
    distances = list()
    ## loop through tweets in the dataset and calculate distance to test_tweet
    for tweet in training_set:
        dist = euclidean_distance(test_tweet, tweet)
        distances.append((tweet, dist))
    ## sort the (tweet, distance) entries based on distance
    distances.sort(key=lambda entry: entry[1])
    ## Now we can get the neighbors based on the specified k
    neighbors = list()
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors
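
As an alternative to sorting every distance, the k closest rows could be picked with `heapq.nsmallest` from the standard library. This is a minimal sketch, not a replacement for the cell above: the name `get_neighbors_heap` is hypothetical, and the distance function is repeated only so the block runs on its own.

```python
import heapq
from math import sqrt

def euclidean_distance(tweet0, tweet1):
    ## same distance as above; the last column is the class label
    return sqrt(sum((tweet0[i] - tweet1[i]) ** 2 for i in range(len(tweet0) - 1)))

def get_neighbors_heap(training_set, test_tweet, k):
    ## keep the k rows with the smallest distance to test_tweet
    return heapq.nsmallest(
        k, training_set, key=lambda tweet: euclidean_distance(test_tweet, tweet)
    )

rows = [[1.0, 1.0, 0], [2.0, 2.0, 0], [9.0, 9.0, 1]]
print(get_neighbors_heap(rows, [1.5, 1.0, 0], 2))
```

One difference worth noting: if `k` is larger than the number of rows, `heapq.nsmallest` simply returns all rows, whereas indexing `distances[i]` in a loop would raise an `IndexError`.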

## Class Prediction
Now that we have a way to calculate the distance and a method to get the neighbors using that distance measure, we can start making predictions. 
A prediction of class label `y` is basically the most frequent class of the k-neighbors closest to our test data-point. 

In [3]:
def predict_class(training_set, test_tweet, k):
    ## getting the neighbors of our test data point
    neighbors = get_neighbors(training_set, test_tweet, k)
    ## getting the class labels of the k-neighbors
    labels = [row[-1] for row in neighbors]
    ## Now we make a prediction based on the most frequent class
    prediction = max(set(labels), key=labels.count)
    return prediction
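
The majority vote can also be expressed with `collections.Counter` from the standard library, which counts each label once instead of calling `labels.count` per distinct label. A small sketch, with the hypothetical helper name `most_frequent_label`:

```python
from collections import Counter

def most_frequent_label(labels):
    ## most_common(1) returns a one-element list [(label, count)]
    return Counter(labels).most_common(1)[0][0]

print(most_frequent_label([0, 1, 0]))  # 0
```

On a tie, `Counter` on Python 3.7+ returns the tied label that appeared first in `labels`, while `max(set(labels), ...)` depends on set iteration order, so choosing an odd `k` for two classes avoids the question entirely.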

## Testing on Fake data (Remove later)
A small test on a fake dataset to make sure everything works.

In [6]:
dataset = [[2.7810836,2.550537003,0],
    [1.465489372,2.362125076,0],
    [3.396561688,4.400293529,0],
    [1.38807019,1.850220317,0],
    [3.06407232,3.005305973,0],
    [7.627531214,2.759262235,1],
    [5.332441248,2.088626775,1],
    [6.922596716,1.77106367,1],
    [8.675418651,-0.242068655,1],
    [7.673756466,3.508563011,1]]
tweet0 = dataset[0]
for i, tweet in enumerate(dataset):
    distance = euclidean_distance(tweet0, tweet)
    print("Distance between tweet0 and tweet%d is %.3f" % (i, distance))

Distance between tweet0 and tweet0 is 0.000
Distance between tweet0 and tweet1 is 1.329
Distance between tweet0 and tweet2 is 1.949
Distance between tweet0 and tweet3 is 1.559
Distance between tweet0 and tweet4 is 0.536
Distance between tweet0 and tweet5 is 4.851
Distance between tweet0 and tweet6 is 2.593
Distance between tweet0 and tweet7 is 4.214
Distance between tweet0 and tweet8 is 6.522
Distance between tweet0 and tweet9 is 4.986


In [9]:
neighbors = get_neighbors(dataset, tweet0, 3)
for neighbor in neighbors:
  print(neighbor)

[2.7810836, 2.550537003, 0]
[3.06407232, 3.005305973, 0]
[1.465489372, 2.362125076, 0]


In [13]:
prediction = predict_class(dataset, tweet0, 3)
print('Expected Class: %d\nGot Class: %d'% (dataset[0][-1], prediction))

Expected Class: 0
Got Class: 0


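Seems like it's working. Note that `tweet0` is itself part of `dataset`, so it is always returned as its own nearest neighbor (distance 0.000), which makes this check a bit optimistic.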
#### TBD
- [ ] Import data set as TF-IDF transformed matrix. (Or redo transformation, then split into train, test, and validation sets.)
- [ ] Transform labels to numerical values (string to 1/0): useful for the max-count extraction in class prediction, and faster to process than string comparison.
- [ ] Run model on dataset   
- [ ] Create function to get prediction metrics
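
For the last item, one possible starting point is a plain accuracy function. This is only a sketch: the name `accuracy` and its two-list signature are assumptions, and other metrics (precision, recall) may be what is ultimately needed.

```python
def accuracy(actual, predicted):
    ## ASSUMPTION: both arguments are equal-length lists of class labels
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    ## avoid dividing by zero on an empty test set
    if not actual:
        return 0.0
    ## count positions where the prediction matches the true label
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))  # 0.75
```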