# Neighborhood-Based Recommender

The key idea is that user u's rating for a new item i is likely to be similar to that of another user v if u and v have rated other items in a similar way. Likewise, u is likely to rate two items i and j similarly if other users have given similar ratings to these two items.

#### Similarity metrics:

- Cosine Similarity/Distance (works well for sparse, high-dimensional data)
- Jaccard Similarity/Distance (only works on binarized vectors)
- Pearson Correlation/Distance (cosine similarity on centered vectors)
- Euclidean Distance/Similarity (not good for sparse, high-dimensional data)

You can find many more metrics here: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
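To make the four metrics above concrete, here is a minimal NumPy sketch on two made-up rating vectors. The helper names and the toy values are illustrative assumptions, not part of the notebook; `scipy.spatial.distance` offers equivalent distance functions (`cosine`, `jaccard`, `correlation`, `euclidean`).

```python
import numpy as np

# two toy rating vectors over five items (0 = not rated); values are made up
u = np.array([5.0, 3.0, 0.0, 1.0, 0.0])
v = np.array([4.0, 0.0, 0.0, 1.0, 2.0])

def cosine_similarity(a, b):
    # angle between the vectors; ignores their overall magnitude
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_similarity(a, b):
    # binarize first: only "rated or not" matters, not the rating value
    a_bin, b_bin = a > 0, b > 0
    return float((a_bin & b_bin).sum() / (a_bin | b_bin).sum())

def pearson_similarity(a, b):
    # cosine similarity of the mean-centered vectors
    return cosine_similarity(a - a.mean(), b - b.mean())

def euclidean_distance(a, b):
    # straight-line distance; a distance, so smaller means more similar
    return float(np.linalg.norm(a - b))

cosine_sim = cosine_similarity(u, v)
jaccard_sim = jaccard_similarity(u, v)
pearson_sim = pearson_similarity(u, v)
euclid_dist = euclidean_distance(u, v)
```

Note that the unrated items are treated as zeros here, which is also what happens when a sparse user-item matrix is passed to a distance function.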

In [16]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from movie_dataset import preprocess_movies
from movie_dataset import DISNEY_MOVIE_IDS, DISNEY_RELEVANT_HITS
from sklearn.metrics import pairwise
import sklearn.neighbors as nb
import sklearn
import pickle

In [17]:
ratings, movies, R = preprocess_movies()

In [18]:
ratings.shape, movies.shape

((66658, 4), (9742, 2))

In [19]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [20]:
movies.head()

| movieId | title | genres |
|---|---|---|
| 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |


In [21]:
movies.loc[DISNEY_MOVIE_IDS]

| movieId | title | genres |
|---|---|---|
| 4470 | Ariel (1988) | Drama |
| 48 | Pocahontas (1995) | Animation\|Children\|Drama\|Musical\|Romance |
| 594 | Snow White and the Seven Dwarfs (1937) | Animation\|Children\|Drama\|Fantasy\|Musical |
| 27619 | Lion King 1½, The (2004) | Adventure\|Animation\|Children\|Comedy |
| 152081 | Zootopia (2016) | Action\|Adventure\|Animation\|Children\|Comedy |
| 595 | Beauty and the Beast (1991) | Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX |
| 616 | Aristocats, The (1970) | Animation\|Children |
| 1029 | Dumbo (1941) | Animation\|Children\|Drama\|Musical |


### Initialize the Model

- pick a distance metric
- at this point the model only stores the user-item matrix. All calculations take place later!

In [22]:
import sklearn.neighbors as nb

# which metrics can we use for sparse matrices?
sorted(nb.VALID_METRICS_SPARSE['brute'])

['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan', 'precomputed']

In [23]:
# initialize the unsupervised model
model = nb.NearestNeighbors(metric='cosine')
model.fit(R)

NearestNeighbors(metric='cosine')
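The "store now, compute later" behavior can be seen on a toy example. The sketch below assumes a small hand-made sparse matrix (`toy_R`, `toy_model`, and the query values are made up for illustration): `fit` only keeps the matrix, and the cosine distances are computed when `kneighbors` is called.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# toy user-item matrix: 4 users x 4 items, 0 = not rated (made-up values)
toy_R = csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float))

# brute force is the algorithm that supports cosine distance on sparse input
toy_model = NearestNeighbors(metric='cosine', algorithm='brute')
toy_model.fit(toy_R)  # nothing is computed yet, the matrix is only stored

# the distances are calculated at query time
query = csr_matrix(np.array([[5, 5, 0, 0]], dtype=float))
toy_distances, toy_rows = toy_model.kneighbors(query, n_neighbors=2)
```

The returned `toy_rows` are row positions in the fitted matrix, sorted from the closest to the farthest neighbor.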

### Prepare a User Vector

In [24]:
# new user vector: needs to have the same format as the training data,
# i.e. one entry per column of R; pre-fill it with zeros
user_vec = np.zeros(R.shape[1])

# fill in the ratings that arrived from the query
user_vec[DISNEY_MOVIE_IDS] = 5
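When the query carries individual ratings rather than one fixed value, the same step could be wrapped in a small helper. This is a sketch only: `build_user_vector` and its dictionary input format are hypothetical, and it assumes, as the cell above does, that an item's ID can be used directly as its column position.

```python
import numpy as np

def build_user_vector(query_ratings, n_items):
    """Build a dense rating vector from {column index: rating} (hypothetical format)."""
    vec = np.zeros(n_items)  # unrated items stay at zero
    for item_idx, rating in query_ratings.items():
        if not 0 <= item_idx < n_items:
            raise IndexError(f"item index {item_idx} is outside 0..{n_items - 1}")
        vec[item_idx] = rating
    return vec

# toy usage: five items, two of them rated
example_vec = build_user_vector({1: 5.0, 3: 4.0}, n_items=5)
```

In the notebook the second argument would be the number of columns of `R`, so that the vector matches the training data.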

### Find 10 most similar Users

In [25]:
# calculate the distance to all other users in the data
distances, userIds = model.kneighbors([user_vec], n_neighbors=10, return_distance=True)

# sklearn returns 2-D arrays with a single row (= 1 query user)
distances = distances[0]
userIds = userIds[0]

In [26]:
distances, userIds

(array([0.8059715 , 0.83263452, 0.8346348 , 0.85242967, 0.85242967,
        0.8526059 , 0.85274623, 0.85989566, 0.86081367, 0.86770655]),
 array([476,  43, 563,   5, 170, 484,  58, 235,  20, 216]))

In [27]:
# ratings of 10 most similar users
neighborhood = ratings.set_index('userId').loc[userIds]
neighborhood

| userId | movieId | rating | timestamp |
|---|---|---|---|
| 476 | 1 | 4.0 | 835021447 |
| 476 | 2 | 4.0 | 835021693 |
| 476 | 10 | 3.0 | 835021420 |
| 476 | 11 | 3.0 | 835021635 |
| 476 | 32 | 4.0 | 835021513 |
| ... | ... | ... | ... |
| 216 | 3996 | 4.0 | 982169907 |
| 216 | 4002 | 3.0 | 975212110 |
| 216 | 4023 | 3.0 | 982169946 |
| 216 | 4025 | 2.0 | 982169965 |


In [28]:
# sum the neighbors' ratings for each movie
scores = neighborhood.groupby('movieId')['rating'].sum()
scores

movieId
1         20.5
2         12.0
3          8.0
5         12.0
7         10.0
          ... 
106920     3.0
112552     4.5
117529     3.5
119145     4.0
134853     4.0
Name: rating, Length: 543, dtype: float64
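Summing raw ratings treats all ten neighbors as equally relevant. One common variation, shown here only as a sketch and not as the notebook's method, weights each rating by the neighbor's similarity (1 minus the cosine distance). The toy frame and similarity values below are made up, and unlike `neighborhood` the toy frame keeps `userId` as a regular column.

```python
import pandas as pd

# toy neighborhood: ratings of two neighbors (made-up values)
toy_neighborhood = pd.DataFrame({
    'userId': [476, 476, 43, 43],
    'movieId': [1, 2, 1, 3],
    'rating': [4.0, 4.0, 5.0, 3.0],
})

# similarity per neighbor, e.g. 1 - cosine distance (made-up values)
toy_similarity = pd.Series({476: 0.2, 43: 0.1})

# look up each row's neighbor weight, then sum the weighted ratings per movie
weights = toy_neighborhood['userId'].map(toy_similarity)
weighted_scores = (
    (toy_neighborhood['rating'] * weights)
    .groupby(toy_neighborhood['movieId'])
    .sum()
)
```

With this weighting, a rating from a close neighbor contributes more to a movie's score than the same rating from a distant one.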

### Give recommendations

In [29]:
# give a zero score to movies the user has already seen
already_seen = scores.index.isin(DISNEY_MOVIE_IDS)
scores.loc[already_seen] = 0

In [30]:
scores = scores.sort_values(ascending=False)
recommendations = scores.head(10).index
recommendations

Int64Index([588, 364, 34, 356, 318, 596, 597, 457, 590, 150], dtype='int64', name='movieId')

In [31]:
# let's see the recommendations!
movies.loc[recommendations]

| movieId | title | genres |
|---|---|---|
| 588 | Aladdin (1992) | Adventure\|Animation\|Children\|Comedy\|Musical |
| 364 | Lion King, The (1994) | Adventure\|Animation\|Children\|Drama\|Musical\|IMAX |
| 34 | Babe (1995) | Children\|Drama |
| 356 | Forrest Gump (1994) | Comedy\|Drama\|Romance\|War |
| 318 | Shawshank Redemption, The (1994) | Crime\|Drama |
| 596 | Pinocchio (1940) | Animation\|Children\|Fantasy\|Musical |
| 597 | Pretty Woman (1990) | Comedy\|Romance |
| 457 | Fugitive, The (1993) | Thriller |
| 590 | Dances with Wolves (1990) | Adventure\|Drama\|Western |
| 150 | Apollo 13 (1995) | Adventure\|Drama\|IMAX |


#### Save the fitted model to disk

In [None]:
with open('./distance_recommender.pkl', 'wb') as file:
    pickle.dump(model, file)
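Loading the model back works the same way in reverse. The sketch below is a self-contained round trip on a tiny stand-in model written to a temporary directory; the file name and the toy data are assumptions chosen for illustration, not the notebook's model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors

# tiny stand-in model: three "users", each rating exactly one item
toy_model = NearestNeighbors(n_neighbors=1, metric='cosine').fit(np.eye(3))

# save to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'toy_recommender.pkl')
with open(path, 'wb') as file:
    pickle.dump(toy_model, file)

# load it back; note the 'rb' mode for reading binary data
with open(path, 'rb') as file:
    restored = pickle.load(file)

# the restored model answers queries like the original one
_, restored_rows = restored.kneighbors([[1.0, 0.0, 0.0]], n_neighbors=1)
```

Pickle files should only be loaded from sources you trust, and ideally with the same scikit-learn version that was used to save them.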