# Sistema de recomendação com KNN
### INF1608 - Análise Numérica
### Aluno: Leonardo E. Wajnsztok
### Matrícula: 1312737



## Introdução

Sistemas de recomendação é uma das áreas de Data Mining que consiste em decidir quais informações tem a maior probabilidade de ser do interesse do usuário. Tais sistemas são de grande importância nos dias de hoje e são usados por empresas como Netflix, Google e Amazon.

## Tipos de filtro
- Content-based Filtering 

Esse filtro usa apenas as informações de similaridade entre os items. Usando filmes do Netflix como exemplo, ao assistir um filme, o sistema iria tentar recomendar um outro filme que tivesse caracteristicas parecidas, como gênero, atores, tema, ano de lançamento.

Ex:
> Se você assistiu e gostou de 'Star Wars Uma Nova Esperança', assista também 'Star Wars O Império Contra-Ataca'

- Collaborative Filtering

Já no filtro colaborativo, são usadas informações de outros usuários para recomendar um novo item. Como visto em e-commerces como Amazon, ao comprar um item, o sistema recomenda items que outros usuários que compraram o mesmo item também compraram.

Ex:
>Usuários que compraram um 'Fogão de Cozinha', também compraram 'Panelas' e 'Conjunto de pratos'

## Formulação


Usando a recomendação de filmes do Netflix como exemplo, podemos formular o problema e definir uma $n_f$ x $n_u$, sendo $n_u$ o número de usuários e $n_f$ o número de filmes.

$$
A_{fu}=\left\{
\begin{array}{c 2}	
     R_{uf}\ \text{ , nota que o usuário } u \text{ deu ao filme } f \\
      ?\    \text{ ,nota não existe para tupla } u, f
\end{array}\right.
$$


Exemplo de matriz $A_{fu}$:
![matriz](ex_matriz.png)

## Exemplo Content-based Filtering

In [4]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()


Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [5]:
r_cols = ['id', 'name', 'date', '-', 'link','unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies_info = pd.read_csv('ml-100k/u.item', sep='|', names=r_cols, usecols=range(24))
movies_info.head()



Unnamed: 0,id,name,date,-,link,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [6]:
import numpy as np

movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
movieProperties.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie_id,Unnamed: 1_level_2,Unnamed: 2_level_2
1,452,3.878319
2,131,3.206107
3,90,3.033333
4,209,3.550239
5,86,3.302326


In [7]:
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()

Unnamed: 0_level_0,size
movie_id,Unnamed: 1_level_1
1,0.774914
2,0.223368
3,0.152921
4,0.357388
5,0.146048


In [8]:
movieDict = {}
name_to_id = {}
with open('ml-100k/u.item') as f:
    temp = ''
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        genres = fields[5:25]
        genres = map(int, genres)
        name_to_id[name] = movieID
        movieDict[movieID] = (name, genres, movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))

print movieDict[1]

('Toy Story (1995)', [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 0.77491408934707906, 3.8783185840707963)


In [9]:
def find_movie_contains(substring):
    for m_id, m_info in movieDict.items():
        if substring in m_info[0]:
            print m_id, m_info[0]

In [10]:
find_movie_contains("Star Wars")
find_movie_contains("Godfather")
find_movie_contains("Aladdin")

50 Star Wars (1977)
127 Godfather, The (1972)
187 Godfather: Part II, The (1974)
95 Aladdin (1992)
422 Aladdin and the King of Thieves (1996)


In [11]:
from scipy import spatial

def compute_distance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
    
print movieDict[1][0], movieDict[2][0]
print movieDict[1][1]
print movieDict[2][1]
print "Distancia:", compute_distance(movieDict[1], movieDict[2])

print

print movieDict[50][0]
print movieDict[181][0]
print compute_distance(movieDict[50], movieDict[181])



Toy Story (1995) GoldenEye (1995)
[0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Distancia: 1.55154639175

Star Wars (1977)
Return of the Jedi (1983)
0.13058419244


In [12]:
import operator

def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = compute_distance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors

In [13]:
def show_k_neighbors(movie_id, K):
    neighbors = getNeighbors(movie_id, K)
    for neighbor in neighbors:
        print "- ", movieDict[neighbor][0], "- id:", neighbor

In [14]:
print "Recomendações para 'Star wars A New Hope':\n"
show_k_neighbors(50, 3)
print
print "Recomendações para 'The Godfather':\n"
show_k_neighbors(127, 3)
print
print "Recomendações para 'Alladin':\n"
show_k_neighbors(95, 3)

Recomendações para 'Star wars A New Hope':

-  Return of the Jedi (1983) - id: 181
-  Empire Strikes Back, The (1980) - id: 172
-  Independence Day (ID4) (1996) - id: 121

Recomendações para 'The Godfather':

-  Pulp Fiction (1994) - id: 56
-  Godfather: Part II, The (1974) - id: 187
-  Titanic (1997) - id: 313

Recomendações para 'Alladin':

-  Lion King, The (1994) - id: 71
-  Beauty and the Beast (1991) - id: 588
-  Mary Poppins (1964) - id: 419


## Exemplo Collaborative Filtering


In [15]:
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [16]:
import numpy as njp

num_user = max(ratings.user_id)
print "Numero de usuarios:", num_user

num_movies = max(ratings.movie_id)
print "Numero de filmes:", num_movies

users = list(ratings.user_id.unique())
movies = list(ratings.movie_id.unique())

userId_to_idx = {}
for i in range(num_user):
    userId_to_idx[users[i]] = i
    
movieId_to_idx = {}
for i in range(num_movies):
    movieId_to_idx[movies[i]] = i

similarity_matrix = np.zeros((num_user + 1, num_movies + 1), dtype=float)

Numero de usuarios: 943
Numero de filmes: 1682


In [19]:
for i in range(len(ratings)):
    row = ratings.iloc[i]
    userIndex = userId_to_idx[row[0]]
    movieIndex = movieId_to_idx[row[1]]

    similarity_matrix[userIndex][movieIndex] = row[2]

In [20]:
print similarity_matrix

[[ 3.  0.  0. ...,  0.  0.  0.]
 [ 0.  3.  0. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 ..., 
 [ 0.  4.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


In [29]:
from sklearn.neighbors import NearestNeighbors

k = 10

nbrs = NearestNeighbors(n_neighbors=k, algorithm='brute', metric='cosine', n_jobs=4)
nbrs.fit(similarity_matrix)

distances, indices = nbrs.kneighbors(similarity_matrix)
user_similars = {}
for user_idx in range(len(distances)):
    try:
        user_id = users[user_idx-1]
    except:
        print user_idx
    sim_list = []
    for nei in range(1, k):
        nei_id = users[indices[user_idx][nei]]
        sim = distances[user_idx][nei]
        sim_list.append([nei_id, sim])
    user_similars[user_id] = sim_list[:]

In [41]:
user_similars

{1: [[222, 0.40458003366858908],
  [109, 0.41557908745764871],
  [864, 0.41877963946806451],
  [484, 0.43876749219550937],
  [497, 0.44007811969889099],
  [301, 0.44335638744189831],
  [648, 0.44580495222781114],
  [545, 0.45821626874025723],
  [727, 0.4605649749566092]],
 2: [[356, 0.41590288424018329],
  [740, 0.42737317751216697],
  [856, 0.43797664213701271],
  [510, 0.45243867588912889],
  [134, 0.45546964892950093],
  [111, 0.47114911758098887],
  [827, 0.47124227601988167],
  [258, 0.48129503878521951],
  [810, 0.48648918987282241]],
 3: [[590, 0.46882908732720496],
  [789, 0.48583199338291128],
  [634, 0.50495980930550055],
  [936, 0.51876814956589845],
  [473, 0.53270628219599236],
  [501, 0.53825023108141323],
  [470, 0.54209099291800267],
  [733, 0.54431358533287677],
  [150, 0.55434238492161525]],
 4: [[502, 0.52546623048920338],
  [166, 0.53571917986141293],
  [33, 0.54443201463226054],
  [589, 0.54627886297053552],
  [515, 0.54797952888585866],
  [816, 0.55177644897232569

In [43]:
def all_movies_user_watched(user_id):
    print "user_id", user_id, "assistiu:\n"
    all_movies = list()
    for i in ratings[ratings.user_id == user_id].movie_id:
        all_movies.append(movieDict[i][0]) 
    return all_movies

def get_all_movies_for_neighbors(user_id):
    idx = userId_to_idx[user_id]
    neigh = indices[idx]
    all_movies = set()
    for n in neigh:
        movies = all_movies_user_watched(n)
        for m in movies:
            all_movies.add(m)
    return all_movies

get_all_movies_for_neighbors(196)
        

{0, 109, 219, 329, 348, 511, 570, 582, 647, 860}

In [None]:
def all_movies_user_watched(user_id):
    print "user_id", user_id, "assistiu:\n"
    for i in ratings[ratings.user_id == user_id].movie_id:
        print movieDict[i][0]


In [None]:
all_movies_user_watched(1)

In [None]:
get_recommendation_for_user(1)

## Referência

- http://cs229.stanford.edu/proj2008/Wen-RecommendationSystemBasedOnCollaborativeFiltering.pdf
- https://github.com/farzades/kNN_recommendations/
- https://github.com/bv123/Knn-for-Movie-Recommendation/
- https://pt.slideshare.net/seydahatipoglu111/collaborative-filtering-using-knn