# Classification with _k_-nearest neighbors
In this example, we try to predict the genre of a movie. We again use a modified [IMDB (Internet Movie Database) data set on movies](https://www.kaggle.com/nielspace/imdb-data).

In [2]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split #We need this to split the data

In [3]:
df = pd.read_csv('movies.csv')
df = df.dropna() #first get rid of rows with empty cells
df.head()

Unnamed: 0,title,runtime,metascore,rating,genre
0,The Dark Knight,152,82.0,9.0,action
2,Inception,148,74.0,8.8,action
3,Interstellar,169,74.0,8.6,drama
4,Kimi no na wa,106,79.0,8.6,drama
5,The Intouchables,112,57.0,8.6,comedy


In [4]:
df['genre'].value_counts() #Let's have a look at the 'genre' variable

drama     314
action    289
comedy    223
Name: genre, dtype: int64

## Training the k-NN model

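Let's try to predict genre based on runtime, metascore and rating. Maybe dramas are longer and better rated than action movies, for instance. Let's try the k-NN algorithm. Because k-NN relies on distances between data points, the features should be on comparable scales; a common way to achieve this is to standardize each feature (make the mean 0 and the standard deviation 1). Note that the `normalize` function used below does something different: it rescales each row (each movie) to unit length rather than standardizing each column.

Let's start by normalizing the data and then splitting it into a training set and a test set.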

In [5]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = df[['runtime', 'metascore', 'rating']] #create the X matrix
X = normalize(X) #rescale each row to unit (L2) length; note that this is not the same as standardizing each column
y = df['genre'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
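The difference between standardizing and `normalize` is easy to check on a small array. The sketch below is only an illustration: it reuses the five rows shown in the `df.head()` output rather than the full data set, and the names `X_demo`, `X_std` and `X_unit` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# runtime, metascore and rating of the five movies shown in df.head()
X_demo = np.array([
    [152.0, 82.0, 9.0],
    [148.0, 74.0, 8.8],
    [169.0, 74.0, 8.6],
    [106.0, 79.0, 8.6],
    [112.0, 57.0, 8.6],
])

# Standardizing works per column: each feature gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_demo)

# normalize works per row: each movie's feature vector gets Euclidean length 1
X_unit = normalize(X_demo)

print(X_std.mean(axis=0).round(2))              # column means after standardizing
print(X_std.std(axis=0).round(2))               # column standard deviations after standardizing
print(np.linalg.norm(X_unit, axis=1).round(2))  # row lengths after normalize
```

When standardizing real data, a common practice is to fit the scaler on `X_train` only and then apply it to `X_test` with `transform`, so that no information from the test set leaks into the preprocessing.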

In [None]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=5) #create a k-NN classifier with 5 neighbors (the default)
knn = knn.fit(X_train, y_train) #fit the k-nearest neighbors model to the training data
knn.score(X_test, y_test) #calculate the accuracy on the test data

About 42% of the movies in the test set are predicted correctly. So, is that good or bad?

Well, given that 38% of the movies are dramas, we could actually get close to this performance by predicting that _everything_ is 'drama'. So, not so great, but kind of expected given the variables. Let's look at the _confusion matrix_ to see how well the model tells the different genres apart. A confusion matrix cross-tabulates the actual classes (rows) against the predicted classes (columns) and gives the number of movies for each combination.
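The "predict everything as drama" baseline can be made explicit with scikit-learn's `DummyClassifier`. This is a minimal sketch under some assumptions: the labels are rebuilt from the genre counts printed earlier instead of being read from `movies.csv`, the baseline is scored on all movies rather than on the test split, and the names `y_all`, `X_dummy` and `baseline` are made up for this example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Genre labels rebuilt from the value_counts() output above
y_all = np.array(['drama'] * 314 + ['action'] * 289 + ['comedy'] * 223)
X_dummy = np.zeros((len(y_all), 1))  # placeholder features; this baseline ignores them

# Always predict the most frequent class seen during fitting
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_all)

baseline_accuracy = baseline.score(X_dummy, y_all)
print(round(baseline_accuracy, 2))  # share of movies that are dramas
```

Any model worth keeping should clearly beat this number on the same data.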

In [62]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[48, 23, 16],
       [25, 25, 12],
       [41, 26, 32]], dtype=int64)
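When reading this array, it helps to know which class belongs to which row. By default, `confusion_matrix` puts the labels in sorted order (alphabetical for strings); a different order can be requested with the `labels` argument. The toy labels below are invented purely to show this behaviour.

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted genres, invented for this illustration
y_true_demo = ['drama', 'drama', 'action', 'comedy', 'drama']
y_pred_demo = ['drama', 'action', 'action', 'comedy', 'drama']

# Default: rows and columns follow the sorted labels (action, comedy, drama)
cm_sorted = confusion_matrix(y_true_demo, y_pred_demo)

# Explicit order: rows and columns follow the list passed as labels
cm_custom = confusion_matrix(y_true_demo, y_pred_demo, labels=['drama', 'action', 'comedy'])

print(cm_sorted)
print(cm_custom)
```

In `cm_sorted` the drama counts end up in the last row, while in `cm_custom` they are in the first row, so any labels added afterwards have to follow the order that was actually used.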

In [63]:
#In order to read it easily, let's make a dataframe out of it and add labels to it.
#confusion_matrix orders the classes alphabetically, so the labels must follow that order
conf_matrix = pd.DataFrame(cm, index=['action', 'comedy', 'drama'], columns=['action_p', 'comedy_p', 'drama_p'])
conf_matrix

Unnamed: 0,action_p,comedy_p,drama_p
action,48,23,16
comedy,25,25,12
drama,41,26,32


The way to read this is that of the action movies, 48 are correctly predicted as 'action', 23 are instead predicted as 'comedy' and 16 as 'drama'. _Recall_ is the share of the actual action movies that the model labels as action; _precision_ is the share of the movies predicted as action that really are action. For the category action they are:

$\text{recall} = \frac{48}{48 + 23 + 16} = .55$

$\text{precision} = \frac{48}{48 + 25 + 41} = .42$
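The same arithmetic can be done for all classes at once. The sketch below copies the confusion matrix from the output above into a NumPy array (named `cm_values` here to avoid clashing with `cm`) and computes recall per row and precision per column, in the row order of the matrix.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm_values = np.array([[48, 23, 16],
                      [25, 25, 12],
                      [41, 26, 32]])

correct = np.diag(cm_values)                 # correctly predicted movies per class
recall = correct / cm_values.sum(axis=1)     # divide by the row totals (actual counts)
precision = correct / cm_values.sum(axis=0)  # divide by the column totals (predicted counts)
accuracy = correct.sum() / cm_values.sum()   # overall share of correct predictions

print(recall.round(2))
print(precision.round(2))
print(round(accuracy, 2))
```

The first entries of `recall` and `precision` reproduce the two fractions worked out by hand, and `accuracy` matches the score reported earlier.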

We might improve our scores by trying out different values of _k_.
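One possible way to compare values of _k_ is cross-validation on the training data. The sketch below is an assumption about how that could look, not part of the original analysis: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the candidate values in `k_values` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train and y_train: 3 features, 3 classes
X_syn, y_syn = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, n_classes=3,
                                   n_clusters_per_class=1, random_state=1)

k_values = [1, 3, 5, 11, 21, 41]  # arbitrary candidates for illustration
mean_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(model, X_syn, y_syn, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)  # k with the highest mean accuracy
print(mean_scores)
print(best_k)
```

Choosing _k_ this way keeps the test set untouched, so `knn.score(X_test, y_test)` can still serve as an honest final check.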