# Recommender system with KNN
### INF1608 - Numerical Analysis
### Student: Leonardo E. Wajnsztok
### Student ID: 1312737



## Introduction

Recommender systems are an area of data mining concerned with deciding which information is most likely to interest a user. Such systems are widely used today by companies such as Netflix, Google and Amazon.

## Types of filtering
- Content-based Filtering

This filter uses only the similarity between items. Taking Netflix movies as an example, after a user watches a movie, the system tries to recommend another movie with similar characteristics, such as genre, cast, theme or release year.

Example:
> If you watched and liked 'Star Wars: A New Hope', also watch 'Star Wars: The Empire Strikes Back'

- Collaborative Filtering

Collaborative filtering instead uses information from other users to recommend a new item. As seen on e-commerce sites such as Amazon, when a user buys an item, the system recommends items that other buyers of that same item also bought.

Example:
> Users who bought a 'Kitchen Stove' also bought 'Pans' and a 'Plate Set'

## Formulation


Using Netflix movie recommendation as an example, we can formulate the problem by defining an $n_f \times n_u$ matrix $A$, where $n_u$ is the number of users and $n_f$ is the number of movies.

$$
A_{fu}=\left\{
\begin{array}{ll}
     R_{uf}, & \text{the rating that user } u \text{ gave to movie } f \\
     ?, & \text{no rating exists for the pair } (u, f)
\end{array}\right.
$$


Example of a matrix $A_{fu}$:
![matrix](ex_matriz.png)
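
The definition of $A_{fu}$ can be illustrated with a few lines of plain Python 3. This is only a sketch: the rating triples are made up, and `None` is an assumed stand-in for the "?" entries.

```python
# Hypothetical (user, movie, rating) triples; ids start at 1 as in MovieLens.
triples = [(1, 1, 5), (1, 3, 2), (2, 2, 4), (3, 1, 3), (3, 3, 1)]

n_u = max(u for u, _, _ in triples)  # number of users
n_f = max(f for _, f, _ in triples)  # number of movies

# A[f][u] is the rating user u+1 gave to movie f+1; None marks the "?" case.
A = [[None] * n_u for _ in range(n_f)]
for u, f, r in triples:
    A[f - 1][u - 1] = r

for row in A:
    print(row)
```

Each row of `A` is a movie and each column a user, matching the $n_f \times n_u$ layout above.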

## References

- http://cs229.stanford.edu/proj2008/Wen-RecommendationSystemBasedOnCollaborativeFiltering.pdf
- https://github.com/farzades/kNN_recommendations/
- https://github.com/bv123/Knn-for-Movie-Recommendation/
- https://pt.slideshare.net/seydahatipoglu111/collaborative-filtering-using-knn

## Content-based Filtering example

In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()


| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |


In [2]:
r_cols = ['id', 'name', 'date', '-', 'link','unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies_info = pd.read_csv('ml-100k/u.item', sep='|', names=r_cols, usecols=range(24))
movies_info.head()



| | id | name | date | - | link | unknown | Action | Adventure | Animation | Childrens | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [3]:
import numpy as np

movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
movieProperties.head()

| movie_id | rating (size) | rating (mean) |
|---|---|---|
| 1 | 452 | 3.878319 |
| 2 | 131 | 3.206107 |
| 3 | 90 | 3.033333 |
| 4 | 209 | 3.550239 |
| 5 | 86 | 3.302326 |


In [4]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()

| movie_id | size |
|---|---|
| 1 | 0.774914 |
| 2 | 0.223368 |
| 3 | 0.152921 |
| 4 | 0.357388 |
| 5 | 0.146048 |
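
The normalization in the cell above is a min-max rescaling, $x' = (x - \min) / (\max - \min)$. The Python 3 sketch below reproduces the five values shown, under the assumption that the smallest and largest rating counts in the dataset are 1 and 583. Those two extremes are inferred from the printed results, not read from the data, and `min_max` is a hypothetical helper.

```python
def min_max(values, lo=None, hi=None):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

# Rating counts of movies 1 to 5, taken from the movieProperties output.
counts = [452, 131, 90, 209, 86]

# Assumed dataset-wide extremes (inferred, see the note above).
scaled = min_max(counts, lo=1, hi=583)
print([round(s, 6) for s in scaled])
```

After rescaling, the popularity term lies in $[0, 1]$, the same range that the cosine distance between genre vectors takes for non-negative flags.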


In [5]:
movieDict = {}
name_to_id = {}
with open('ml-100k/u.item') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # Fields 5 to 23 hold the 19 genre flags (0 or 1)
        genres = [int(g) for g in fields[5:24]]
        name_to_id[name] = movieID
        # Entry layout: (name, genre flags, normalized number of ratings, mean rating)
        movieDict[movieID] = (name, genres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))

print movieDict[1]

('Toy Story (1995)', [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.77491408934707906, 3.8783185840707963)


In [6]:
def find_movie_contains(substring):
    for m_id, m_info in movieDict.items():
        if substring in m_info[0]:
            print m_id, m_info[0]

In [7]:
find_movie_contains("Star Wars")
find_movie_contains("Godfather")
find_movie_contains("Aladdin")

50 Star Wars (1977)
127 Godfather, The (1972)
187 Godfather: Part II, The (1974)
95 Aladdin (1992)
422 Aladdin and the King of Thieves (1996)


In [47]:
from scipy import spatial

def compute_distance(a, b):
    # a and b are movieDict entries: (name, genres, popularity, mean rating)
    genresA = a[1]
    genresB = b[1]
    # Cosine distance between the genre flag vectors
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    # Absolute difference of the normalized rating counts
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
    
print compute_distance(movieDict[2], movieDict[4])
print movieDict[2]
print movieDict[4]


0.800687285223
('GoldenEye (1995)', [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 0.22336769759450173, 3.2061068702290076)
('Get Shorty (1995)', [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.35738831615120276, 3.5502392344497609)


In [9]:
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = compute_distance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
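
`getNeighbors` sorts every distance and then keeps the first K. A possible alternative, sketched here in Python 3, uses `heapq.nsmallest` from the standard library, which avoids sorting the full list and returns fewer results instead of raising an `IndexError` when K exceeds the number of candidates. The `k_nearest` helper and the toy data are hypothetical, not part of the notebook.

```python
import heapq

def k_nearest(target, items, distance, k):
    """Return the k keys of `items` closest to `target`, excluding itself."""
    candidates = (key for key in items if key != target)
    return heapq.nsmallest(
        k, candidates, key=lambda key: distance(items[target], items[key])
    )

# Hypothetical one-dimensional "movies" with absolute difference as distance.
points = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 0.9}
print(k_nearest("a", points, lambda x, y: abs(x - y), k=2))
```

With the notebook's data, the call would look like `k_nearest(50, movieDict, compute_distance, 3)`.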

In [10]:
def show_k_neighbors(movie_id, K):
    neighbors = getNeighbors(movie_id, K)
    for neighbor in neighbors:
        print "- ", movieDict[neighbor][0], "- id:", neighbor

In [11]:
print "Recommendations for 'Star Wars: A New Hope':\n"
show_k_neighbors(50, 3)
print
print "Recommendations for 'The Godfather':\n"
show_k_neighbors(127, 3)
print
print "Recommendations for 'Aladdin':\n"
show_k_neighbors(95, 3)

Recommendations for 'Star Wars: A New Hope':

-  Return of the Jedi (1983) - id: 181
-  Empire Strikes Back, The (1980) - id: 172
-  Independence Day (ID4) (1996) - id: 121

Recommendations for 'The Godfather':

-  Pulp Fiction (1994) - id: 56
-  Godfather: Part II, The (1974) - id: 187
-  Titanic (1997) - id: 313

Recommendations for 'Aladdin':

-  Lion King, The (1994) - id: 71
-  Beauty and the Beast (1991) - id: 588
-  Mary Poppins (1964) - id: 419


## Collaborative Filtering example


In [12]:
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |


In [95]:
import numpy as np

num_user = max(ratings.user_id)
print "Number of users:", num_user

num_movies = max(ratings.movie_id)
print "Number of movies:", num_movies

users = list(ratings.user_id.unique())
movies = list(ratings.movie_id.unique())

# Map each user id and movie id to a row and column index
userId_to_idx = {}
for i, user in enumerate(users):
    userId_to_idx[user] = i
    
movieId_to_idx = {}
for i, movie in enumerate(movies):
    movieId_to_idx[movie] = i

# Despite its name, this matrix holds ratings: one row per user, one column per movie.
# The extra row and column from the "+ 1" are never filled and stay at zero.
similarity_matrix = np.zeros((num_user + 1, num_movies + 1), dtype=float)

Number of users: 943
Number of movies: 1682


In [96]:
for i in range(len(ratings)):
    # Read the row first: row[0] is user_id, row[1] is movie_id, row[2] is rating
    row = ratings.iloc[i]
    userIndex = userId_to_idx[row[0]]
    movieIndex = movieId_to_idx[row[1]]
    similarity_matrix[userIndex][movieIndex] = row[2]

In [97]:
print similarity_matrix

[[ 3.  0.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  0.  2. ...,  0.  0.  0.]
 ..., 
 [ 0.  5.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
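
The id-to-index bookkeeping can be seen on a small scale in the Python 3 sketch below. It sizes the matrix by the number of distinct ids, so no all-zero padding row or column appears. The first three rating rows come from the `ratings.head()` output, the fourth is made up, and `index_map` is a hypothetical helper.

```python
# (user_id, movie_id, rating); the last row is invented for illustration.
ratings = [(196, 242, 3), (186, 302, 3), (22, 377, 1), (196, 302, 4)]

def index_map(ids):
    """Map each distinct id to an index, in order of first appearance."""
    mapping = {}
    for value in ids:
        mapping.setdefault(value, len(mapping))
    return mapping

user_idx = index_map(u for u, _, _ in ratings)
movie_idx = index_map(m for _, m, _ in ratings)

# One row per user, one column per movie; 0.0 means "not rated".
matrix = [[0.0] * len(movie_idx) for _ in user_idx]
for u, m, r in ratings:
    matrix[user_idx[u]][movie_idx[m]] = float(r)

for row in matrix:
    print(row)
```

Using 0 for a missing rating keeps the matrix numeric, at the cost of treating "not rated" like a very low score in the cosine computation.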


In [120]:
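from sklearn.neighbors import NearestNeighbors

k = 10

nbrs = NearestNeighbors(n_neighbors=k, algorithm='brute', metric='cosine', n_jobs=4)
nbrs.fit(similarity_matrix)

# For each row (user): cosine distances to, and row indices of, its k nearest rows
distancias, indices = nbrs.kneighbors(similarity_matrix)
print distancias
print indices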


[[  0.00000000e+00   6.80121643e-01   7.01828027e-01 ...,   7.44456038e-01
    7.59371288e-01   7.64030279e-01]
 [ -2.22044605e-16   6.14302216e-01   6.76651627e-01 ...,   6.93096190e-01
    6.95675697e-01   7.02848512e-01]
 [ -2.22044605e-16   5.13815760e-01   5.46671666e-01 ...,   5.85715224e-01
    5.86540107e-01   5.92641382e-01]
 ..., 
 [  0.00000000e+00   4.21793675e-01   4.33956743e-01 ...,   5.15826231e-01
    5.18544446e-01   5.21404968e-01]
 [  1.11022302e-16   5.24611785e-01   5.97733212e-01 ...,   6.56201288e-01
    6.60061127e-01   6.60847118e-01]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   1.00000000e+00]]
[[  0 582 570 ..., 540 369 114]
 [  1 318 530 ..., 543 395 782]
 [  2 639 151 ...,  54 819  26]
 ..., 
 [941 806 916 ..., 526 170  66]
 [942 809 484 ..., 289 620 113]
 [628 627 632 ..., 623 625 636]]
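
To make the two arrays above easier to read, here is a brute-force version of the same search in plain Python 3 on a made-up 4-user matrix. It is a sketch, not scikit-learn's implementation. The convention that an all-zero row sits at distance 1 from everything is an assumption, chosen because it agrees with the all-ones last row printed above.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumed convention: an all-zero row is at distance 1 from every row.
    return 1.0 if norm == 0 else 1.0 - dot / norm

def k_neighbors(rows, k):
    """For every row, return the indices of its k nearest rows (itself included)."""
    result = []
    for a in rows:
        order = sorted(range(len(rows)), key=lambda j: cosine_distance(a, rows[j]))
        result.append(order[:k])
    return result

# Hypothetical ratings: users 0 and 1 share tastes, as do users 2 and 3.
rows = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
]
print(k_neighbors(rows, k=2))
```

As in the notebook output, the first neighbor of each row is the row itself, because its distance to itself is zero up to rounding.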


In [163]:
def get_recommendation_for_user(user_id):
    idx = userId_to_idx[user_id]  # row of this user in similarity_matrix
    print "\nRecommended for user_id", user_id
    print
    # indices[user_id] lists row (user) indices of similarity_matrix;
    # each one is then looked up in movieDict as if it were a movie id.
    for i in indices[user_id]:
        print movieDict[i][0]
    

In [167]:
def all_movies_user_watched(user_id):
    print "user_id", user_id, "watched:\n"
    for i in ratings[ratings.user_id == user_id].movie_id:
        print movieDict[i][0]


In [168]:
all_movies_user_watched(1)

user_id 1 watched:

Three Colors: White (1994)
Grand Day Out, A (1992)
Desperado (1995)
Glengarry Glen Ross (1992)
Angels and Insects (1995)
Groundhog Day (1993)
Delicatessen (1991)
Hunt for Red October, The (1990)
Dirty Dancing (1987)
Rock, The (1996)
Ed Wood (1994)
Star Trek: First Contact (1996)
Pillow Book, The (1995)
Horseman on the Roof, The (Hussard sur le toit, Le) (1995)
Star Trek VI: The Undiscovered Country (1991)
From Dusk Till Dawn (1996)
So I Married an Axe Murderer (1993)
Shawshank Redemption, The (1994)
True Romance (1993)
Star Trek: The Wrath of Khan (1982)
Kull the Conqueror (1997)
Independence Day (ID4) (1996)
Wallace & Gromit: The Best of Aardman Animation (1996)
Wizard of Oz, The (1939)
Faster Pussycat! Kill! Kill! (1965)
Citizen Kane (1941)
Silence of the Lambs, The (1991)
Blues Brothers, The (1980)
Breaking the Waves (1996)
Robert A. Heinlein's The Puppet Masters (1994)
Crimson Tide (1995)
Four Weddings and a Funeral (1994)
Three Colors: Blue (1993)
Good, The Ba

In [166]:
get_recommendation_for_user(1)


Recommended for user_id 1

Toy Story (1995)
Schindler's List (1993)
Man Who Would Be King, The (1975)
Cold Comfort Farm (1995)
Dangerous Minds (1995)
Kansas City (1996)
Titanic (1997)
Misérables, Les (1995)
Robin Hood: Men in Tights (1993)
Little Odessa (1994)
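
The neighbor search returns rows of similar users, so one more step is needed to reach movie suggestions. One common approach, sketched below in Python 3, scores each movie the target user has not rated by the mean rating its neighbors gave it. This is an illustration on made-up ratings under that assumption, not the notebook's `get_recommendation_for_user`; `recommend` and its parameters are hypothetical.

```python
def recommend(user, ratings, neighbors, top=2):
    """Rank unseen movies by the mean rating given by the user's neighbors.

    ratings: list of rows (one per user), with 0 meaning "not rated".
    neighbors: row indices of similar users, excluding `user` itself.
    """
    scores = {}
    for movie, own in enumerate(ratings[user]):
        if own != 0:
            continue  # already rated by the target user
        votes = [ratings[n][movie] for n in neighbors if ratings[n][movie] != 0]
        if votes:
            scores[movie] = sum(votes) / len(votes)
    # Highest mean first; ties broken by the lower column index.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top]

ratings = [
    [5, 4, 0, 0, 0],  # target user
    [4, 5, 0, 1, 5],
    [5, 5, 3, 0, 4],
]
print(recommend(0, ratings, neighbors=[1, 2]))
```

The returned values are column indices; with the notebook's data they would have to be mapped back to movie ids by inverting `movieId_to_idx` before looking up titles in `movieDict`.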
