#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# KNN or K-Nearest-Neighbors

KNN is a simple concept: define some distance metric between the items in your dataset, and find the K closest items. You can then use those items to predict some property of a test item, by having them somehow "vote" on it.
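To make the "voting" idea concrete, here is a minimal from-scratch sketch. The function name `knn_predict` and the toy data are made up for illustration, and Euclidean distance with a simple majority vote is only one possible choice.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Each neighbor "votes" with its label; the most common label wins.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters labeled 0 and 1.
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                       [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_points, toy_labels, np.array([1.8, 1.8])))
print(knn_predict(toy_points, toy_labels, np.array([6.2, 6.4])))
```

The first query sits inside the cluster labeled 0 and the second inside the cluster labeled 1, so each query's three nearest neighbors all share one label.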

## Overview

### Learning Objectives

* Understand the basic concept of KNN 
* Use KNN to solve a classification problem

### Prerequisites

* Introduction to Colab
* Intermediate Python
* Introduction to Pandas
* Visualizations

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |


## K-Nearest-Neighbors for Classification


In this example, we will use KNN to predict whether or not a person will be diagnosed with diabetes. You can download the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/pima-indians-diabetes-database.zip/1). Save it as diabetes.csv and upload the file into the runtime. 

In [0]:
import pandas as pd 

diabetes = pd.read_csv('./diabetes.csv')
diabetes.head()

In [0]:
print("dimension of diabetes data: {}".format(diabetes.shape))

In [0]:
list(diabetes)

In this example, our features are 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', and 'Age'. Our target is 'Outcome', which is currently encoded with a 1 for a positive diabetes diagnosis and 0 for a negative diabetes diagnosis.

In [0]:
print(diabetes.groupby('Outcome').size())

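We notice that several columns in the dataset contain 0's where a zero is not physically plausible (for example, a glucose level or BMI of 0). These are likely cases where the data simply wasn't collected or stored properly. We need to clean these up or they will distort the distances that KNN computes and therefore its predictions. Below, we replace each such zero with the mean of the remaining values in that column.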

In [0]:
import numpy as np
no_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for column in no_zero:
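  # Mark zeros as missing so they are excluded from the mean
  diabetes[column] = diabetes[column].replace(0, np.nan)
  # Mean of the remaining values, truncated to an integer
  mean = int(diabetes[column].mean(skipna=True))
  # Fill the missing entries with that mean
  diabetes[column] = diabetes[column].replace(np.nan, mean)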
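As a quick sanity check of this cleaning step, the sketch below applies the same idea to a tiny made-up column. The `toy` DataFrame and its values are assumptions for illustration only, and `fillna` is used here as an equivalent way to fill the missing entries.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; 0 stands in for a missing measurement.
toy = pd.DataFrame({'Glucose': [100, 0, 140, 120]})

# Count the zeros before cleaning.
print((toy['Glucose'] == 0).sum())

# Treat zeros as missing, then fill them with the mean of the remaining values.
toy['Glucose'] = toy['Glucose'].replace(0, np.nan)
toy_mean = int(toy['Glucose'].mean(skipna=True))  # (100 + 140 + 120) / 3 = 120
toy['Glucose'] = toy['Glucose'].fillna(toy_mean)

print(toy['Glucose'].tolist())
```

Note that the column becomes floating point once it holds a missing value, which is harmless for the distance computations that follow.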

We create training and testing sets, remembering to separate 'Outcome' as our target value y.

In [0]:
from sklearn.model_selection import train_test_split

X = diabetes.iloc[:,0:8]
y = diabetes.iloc[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Now we scale our features using StandardScaler. 

In [0]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
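Scaling matters for KNN because a feature measured on a large scale would otherwise dominate the distance. The sketch below uses made-up arrays (`toy_train`, `toy_test`) to show what `StandardScaler` does: it learns the mean and standard deviation from the training data only, then reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column is on a much larger scale than the first.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation from the training data.
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses the training mean and standard deviation; it does not refit.
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_train_scaled.std(axis=0))   # approximately 1 for each column
print(toy_test_scaled)
```

After scaling, both training columns have the same spread, so neither dominates a Euclidean distance.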

Finally, we use the scikit-learn KNN model, KNeighborsClassifier.

In [0]:
from sklearn.neighbors import KNeighborsClassifier

n_neighbors = 14

KNN = KNeighborsClassifier(n_neighbors = n_neighbors, p=2, metric = 'euclidean')
KNN.fit(X_train, y_train)

y_pred = KNN.predict(X_test)

We now evaluate our model. We look at the confusion matrix, the F1 score, and the accuracy.

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
accuracy = accuracy_score(y_test,y_pred)


print('The confusion matrix is', cm)
print('The f1 score is', f1)
print('The accuracy score is', accuracy)
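To see how these three numbers relate, the following sketch recomputes accuracy and F1 by hand from a confusion matrix. The label arrays `toy_true` and `toy_pred` are made up for illustration; for binary labels 0 and 1, scikit-learn lays out the matrix as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true labels and predictions for illustration.
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Unpack the 2x2 matrix in row-major order: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

# Accuracy: fraction of all predictions that are correct.
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: of the predicted positives, how many are truly positive.
precision = tp / (tp + fp)
# Recall: of the true positives, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_accuracy, accuracy_score(toy_true, toy_pred))
print(manual_f1, f1_score(toy_true, toy_pred))
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when one class is much more common than the other, as is the case for 'Outcome' in this dataset.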



## K-Nearest-Neighbors for Regression

We can also use KNN for regression. In this example, we also define our own distance metric (as Euclidean distance is not well defined for these data) and write our own KNN function. 
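For contrast with the hand-written approach used in this section, here is a minimal sketch of KNN regression with scikit-learn's `KNeighborsRegressor` on made-up one-dimensional data. It relies on the default Euclidean distance and is not the method used for the movie data; it only illustrates that the prediction is the average target value of the K nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data for illustration: one feature, and a target equal to that feature.
toy_X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
toy_y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

toy_regressor = KNeighborsRegressor(n_neighbors=3)
toy_regressor.fit(toy_X, toy_y)

# Each prediction is the mean target of the 3 nearest training points.
toy_predictions = toy_regressor.predict(np.array([[2.0], [11.0]]))
print(toy_predictions)
```

For the query 2.0 the three nearest points are 1.0, 2.0, and 3.0, so the prediction is their mean target; the query 11.0 works the same way with 10.0, 11.0, and 12.0.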

Let's look at the MovieLens data which you can download [here](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) **. Once you download and unzip the file, please upload the ratings.csv and movies.csv files. 


---
** MovieLens data is available in relation to the following paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872



We'll use KNN to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity.

To start, we'll load up every rating in the data set into a Pandas DataFrame:

In [0]:
import pandas as pd

ratings = pd.read_csv('./ratings.csv')
ratings.head()

Now, we'll group everything by movie ID, and compute the total number of ratings (each movie's popularity) and the average rating for every movie.

In [0]:
import numpy as np

# 'size' counts the ratings per movie; 'mean' averages them
movieProperties = ratings.groupby('movieId').agg({'rating': ['size', 'mean']})
movieProperties.head()

The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the min-max normalized number of ratings. A value of 0 means the movie has the fewest ratings of any movie in the data set, and a value of 1 means it's the most popular movie there is.

In [0]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()
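The min-max normalization used above maps the smallest count to 0 and the largest to 1. A small sketch with made-up rating counts (`toy_counts` is hypothetical) shows the arithmetic:

```python
import pandas as pd

# Made-up rating counts for four movies.
toy_counts = pd.Series([1, 51, 101, 201])

# (x - min) / (max - min): the minimum maps to 0 and the maximum maps to 1.
toy_normalized = (toy_counts - toy_counts.min()) / (toy_counts.max() - toy_counts.min())

print(toy_normalized.tolist())
```

The intermediate values keep their relative spacing, so a movie with 51 ratings lands a quarter of the way between the least and most rated movies.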

Now, let's get the genre information from the movies.csv file. In the genres column, we see the list of genres for each movie separated by a '|'. Note that a movie may have more than one genre. 

First we read the file into a DataFrame. 

In [0]:
movies = pd.read_csv('./movies.csv')
movies.head()

Now we split the genres column on the '|' and create a new DataFrame called movies_split. 

In [0]:
movies_split = movies.genres.str.split('|', expand=True)
movies_split.head()

We now create a list of all the unique genres that appear in this DataFrame. It is called genres_list. 

In [0]:
genres = pd.unique(movies_split[[0,1,2,3,4,5,6,7,8,9]].values.ravel('K'))
genres_list = list(genres)
genres_list.remove(None)
print(genres_list)
print(len(genres_list))

In the movies DataFrame, we want to recode the values of the genres column to be a list of twenty 0's and 1's that correspond to the values in genres_list (in the order they appear in genres_list). For example, if a movie has genres Adventure and Children, then we would like the element in the genres column to be: \
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The function defined below iterates through a list of genres and compares the values to the elements of genres_list. It then returns an appropriate array of 0's and 1's as described above. 

In [0]:
#Definition of the function f to create the array of 0's and 1's based on genres
def f(mylist):
  a = np.zeros(20).astype(int)
  for i in mylist:
    for j in range(20):
      if i == genres_list[j]:
        a[j] = 1
  return a

#Test that f works as expected on an example list 
print(f(['Adventure', 'Children']))
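The same encoding can be written more compactly with a list comprehension. The sketch below is an alternative, not the notebook's function: `encode_genres` and the shortened `toy_genres_list` are hypothetical names, and the real genres_list has 20 entries in whatever order they appear in the data.

```python
import numpy as np

# Hypothetical, shortened genre list for illustration only.
toy_genres_list = ['Adventure', 'Animation', 'Children', 'Comedy']


def encode_genres(movie_genres, all_genres):
    """Return a 0/1 array with a 1 at the position of each genre the movie has."""
    return np.array([1 if genre in movie_genres else 0 for genre in all_genres])


print(encode_genres(['Adventure', 'Children'], toy_genres_list))
```

Passing the genre list as an argument avoids hard-coding its length, so the function keeps working if the set of genres changes.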

We split the genres column of the movies DataFrame to be a list (in preparation for applying the function, f).

In [0]:
movies['genres'] = movies.genres.str.split('|')
movies.head()

We apply the function f to the genres column to change the elements to arrays of 0's and 1's representing the genres. We also set the index to be the movieID. 

In [0]:
movies['genres'] = movies.genres.apply(f)
movies = movies.set_index('movieId')
movies.head()

In [0]:
df = pd.concat([movies, movieNormalizedNumRatings, movieProperties], axis=1)
df.head()

Convert the DataFrame to a dictionary, and check the first entry. 

In [0]:
movieDict = df.T.to_dict('list')
movieDict[1]

Now let's define a function that computes the "distance" between two movies based on how similar their genres are and how similar their popularity is. Just to make sure it works, we'll compute the distance between movie IDs 2 and 4:

In [0]:
from scipy import spatial

def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB) #this will be 1 if there are no overlapping genres
    #print('The genre distance is', genreDistance)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    #print('The popularity distance is', popularityDistance)
    return genreDistance + popularityDistance
    
ComputeDistance(movieDict[2], movieDict[4])
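To build intuition for the genre part of this distance, the sketch below evaluates `spatial.distance.cosine` on a few made-up 0/1 genre vectors (the four-element vectors are hypothetical and shorter than the real 20-element ones).

```python
import numpy as np
from scipy import spatial

# Made-up genre vectors for illustration.
genres_a = np.array([1, 0, 1, 0])
genres_same = np.array([1, 0, 1, 0])      # identical genres to genres_a
genres_partial = np.array([1, 1, 0, 0])   # shares one of two genres with genres_a
genres_disjoint = np.array([0, 1, 0, 1])  # shares no genres with genres_a

dist_same = spatial.distance.cosine(genres_a, genres_same)
dist_partial = spatial.distance.cosine(genres_a, genres_partial)
dist_disjoint = spatial.distance.cosine(genres_a, genres_disjoint)

print(dist_same, dist_partial, dist_disjoint)
```

Identical genre sets give a distance of 0, disjoint sets give 1, and partial overlap falls in between (0.5 in this example). Since the normalized popularity difference also lies between 0 and 1, the two terms in ComputeDistance contribute on comparable scales.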



Remember the higher the distance, the less similar the movies are. Let's check what movies 2 and 4 actually are - and confirm they're not really all that similar:

In [0]:
print(movieDict[2])
print(movieDict[4])


Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. We then sort those by distance and print out the K nearest neighbors:

In [0]:
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

K = 10
avgRating = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    # Index 0 holds the title and index 4 holds the movie's mean rating
    avgRating += movieDict[neighbor][4]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][4]))
    
avgRating /= K

While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:

In [0]:
avgRating

How does this compare to Toy Story's actual average rating?

In [0]:
movieDict[1]

Not too bad!


# Exercises

## Exercise 1: K-Nearest-Neighbors for Classification


Our choice of 14 for the number of neighbors was arbitrary - what effect do different values have on the results? Run some tests and explain what you find. 

### Student Solution

ANSWER = 

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 2: K-Nearest-Neighbors for Classification


Create a plot of accuracy vs. n_neighbors (i.e. accuracy on the y-axis and n_neighbors on the x-axis). Let the number of neighbors (x-axis) range from 1 to 20.  Your plot should contain two lines. The first line should plot the model's training accuracy and the second line should show the model's testing accuracy. 

### Student Solution

In [0]:
# Your code goes here.

### Answer Key

**Solution**

In [0]:
from sklearn.model_selection import train_test_split

X = diabetes.iloc[:,0:8]
y = diabetes.iloc[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)


from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 20 (range excludes its upper bound, so stop at 21)
neighbors_settings = range(1, 21)
for n_neighbors in neighbors_settings:
    # build the model
    KNN = KNeighborsClassifier(n_neighbors=n_neighbors)
    KNN.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(KNN.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(KNN.score(X_test, y_test))
    
import matplotlib.pyplot as plt
   
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')

**Validation**

In [0]:
# TODO

## Exercise 3: K-Nearest-Neighbors for Regression


Our choice of 10 for K was arbitrary - what effect do different K values have on the results?


### Student Solution

ANSWER = 

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO

## Exercise 4: Challenge (Ungraded)

Our distance metric was also somewhat arbitrary - we just took the cosine distance between the genres and added it to the difference between the normalized popularity scores. Can you improve on that?


### Student Solution

ANSWER = 

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO