# ***Recommendation System - Movies***

Project description: a project that uses Machine Learning methods to build a movie recommendation system.

Dataset obtained online from GroupLens.

Data dictionary (column names as they appear in the source files):

* *movieId*: ID of the movie.
* *title*: Title of the movie.
* *userId*: ID of the user.
* *rating*: Rating the user gave to the movie.

## Importing libraries and preparing the data

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [2]:
# Reading the movies dataset
filmes = pd.read_csv('./data/movies.csv',
                     usecols = ['movieId', 'title'],
                     dtype = {'movieId': 'int32', 'title': 'str'})

filmes.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [3]:
# Renaming the columns to Portuguese
filmes.columns = ['filmeId', 'titulo']
filmes.head()

,filmeId,titulo
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [4]:
# Total of 9742 movies in the dataset
filmes.shape[0]

9742

In [5]:
# Reading the movie ratings dataset
notas = pd.read_csv('./data/ratings.csv',
                    usecols = ['userId', 'movieId', 'rating'],
                    dtype = {'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

notas.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [6]:
# Renaming the columns to Portuguese
notas.columns = ['usuarioId', 'filmeId', 'nota']
notas.head()

,usuarioId,filmeId,nota
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [7]:
# Total of 100836 ratings in the dataset
notas.shape[0]

100836

In [8]:
# Using merge to join the two DataFrames into a single one
filmes_notas = pd.merge(notas, filmes, on = 'filmeId')
filmes_notas.head()

,usuarioId,filmeId,nota,titulo
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)
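
`pd.merge` performs an inner join by default, so any rating whose `filmeId` has no match in `filmes` would be dropped without warning. The sketch below shows one way to check for that, using small made-up frames (`notas_toy` and `filmes_toy` are invented for illustration and are not the GroupLens data).

```python
import pandas as pd

# Toy data, invented for illustration only (not the GroupLens files)
notas_toy = pd.DataFrame({
    'usuarioId': [1, 1, 2],
    'filmeId': [10, 20, 30],
    'nota': [4.0, 3.5, 5.0],
})
filmes_toy = pd.DataFrame({
    'filmeId': [10, 20],
    'titulo': ['Filme A', 'Filme B'],
})

# validate='many_to_one' raises if filmeId is duplicated in filmes_toy;
# indicator=True adds a '_merge' column telling where each row came from
verificacao = pd.merge(notas_toy, filmes_toy, on='filmeId',
                       how='left', validate='many_to_one', indicator=True)

# These are the ratings that a default inner join would have discarded
sem_filme = verificacao[verificacao['_merge'] == 'left_only']
print(sem_filme[['usuarioId', 'filmeId', 'nota']])
```

If `sem_filme` is empty on the real data, the inner join used in the notebook loses nothing.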


In [9]:
# Dropping rows whose rating is missing
filmes_notas.dropna(subset = ['nota'], inplace = True)

In [10]:
# Number of ratings per movie (grouped by title)
filmes_notas_contador = filmes_notas.groupby(by = ['titulo'])['nota'].count().reset_index()
filmes_notas_contador

,titulo,nota
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2
...,...,...
9714,eXistenZ (1999),22
9715,xXx (2002),24
9716,xXx: State of the Union (2005),5
9717,¡Three Amigos! (1986),26
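
Grouping by `titulo` assumes that each title identifies a single movie. If two different `filmeId` values share a title, their ratings are counted together. A minimal sketch for detecting that situation, using a made-up `filmes_toy` frame (not the real data):

```python
import pandas as pd

# Toy data, invented for illustration: ids 1 and 3 share the same title
filmes_toy = pd.DataFrame({
    'filmeId': [1, 2, 3],
    'titulo': ['Filme A', 'Filme B', 'Filme A'],
})

# Number of distinct ids behind each title
ids_por_titulo = filmes_toy.groupby('titulo')['filmeId'].nunique()

# Titles that map to more than one movie id
titulos_repetidos = ids_por_titulo[ids_por_titulo > 1]
print(titulos_repetidos)
```

Running the same check on `filmes` would show whether grouping by `filmeId` instead of `titulo` makes a difference for this dataset.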


In [11]:
# Renaming the 'nota' column to 'total_de_notas'
filmes_notas_contador = filmes_notas_contador.rename(columns = {'nota': 'total_de_notas'})
filmes_notas_contador.head()

,titulo,total_de_notas
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2


In [12]:
# Attaching to each rating the total number of ratings of its movie
notas_com_total_de_notas = filmes_notas.merge(filmes_notas_contador, on = 'titulo', how = 'left')
notas_com_total_de_notas.head()

,usuarioId,filmeId,nota,titulo,total_de_notas
0,1,1,4.0,Toy Story (1995),215
1,5,1,4.0,Toy Story (1995),215
2,7,1,4.5,Toy Story (1995),215
3,15,1,2.5,Toy Story (1995),215
4,17,1,4.5,Toy Story (1995),215
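
The count, rename and merge steps above can also be collapsed into one step with `groupby(...).transform('count')`, which returns one value per original row. This is only an alternative sketch on made-up data (`filmes_notas_toy` is hypothetical), not the approach the notebook takes.

```python
import pandas as pd

# Toy data, invented for illustration only
filmes_notas_toy = pd.DataFrame({
    'usuarioId': [1, 2, 3, 1],
    'titulo': ['Filme A', 'Filme A', 'Filme A', 'Filme B'],
    'nota': [4.0, 3.0, 5.0, 2.5],
})

# transform('count') broadcasts each group's count back to its rows,
# so no intermediate counter DataFrame or merge is needed
filmes_notas_toy['total_de_notas'] = (
    filmes_notas_toy.groupby('titulo')['nota'].transform('count')
)
print(filmes_notas_toy)
```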


In [13]:
# Checking the distribution of the number of ratings per movie
pd.set_option('display.float_format', lambda x: '%.3f' % x)
filmes_notas_contador['total_de_notas'].describe()

count   9719.000
mean      10.375
std       22.406
min        1.000
25%        1.000
50%        3.000
75%        9.000
max      329.000
Name: total_de_notas, dtype: float64
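
The summary above shows a heavily skewed distribution: the median is 3 ratings while the maximum is 329. Before fixing a popularity cutoff it can help to see how many movies each candidate threshold would keep. A minimal sketch, assuming a made-up series of counts (`total_de_notas_toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rating counts per movie, invented for illustration
total_de_notas_toy = pd.Series([1, 1, 3, 9, 60, 120, 329], name='total_de_notas')

# Number of movies that survive each candidate threshold
mantidos = {
    limite: int((total_de_notas_toy >= limite).sum())
    for limite in (10, 50, 100)
}

for limite, restantes in mantidos.items():
    print(f'threshold={limite}: {restantes} of {len(total_de_notas_toy)} movies kept')
```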

In [14]:
# Keeping only the movies with at least 50 ratings
limitador = 50
# The @ prefix refers to a Python variable defined outside the DataFrame
notas_filmes_populares = notas_com_total_de_notas.query('total_de_notas >= @limitador')
notas_filmes_populares.head()

,usuarioId,filmeId,nota,titulo,total_de_notas
0,1,1,4.0,Toy Story (1995),215
1,5,1,4.0,Toy Story (1995),215
2,7,1,4.5,Toy Story (1995),215
3,15,1,2.5,Toy Story (1995),215
4,17,1,4.5,Toy Story (1995),215
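
For readers less familiar with `query`, the filter above is equivalent to a boolean mask. The sketch below compares both forms on a made-up frame (`df_toy` is hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only
df_toy = pd.DataFrame({
    'titulo': ['Filme A', 'Filme B', 'Filme C'],
    'total_de_notas': [215, 12, 50],
})
limitador = 50

# query: @limitador looks up the Python variable of that name
por_query = df_toy.query('total_de_notas >= @limitador')

# Boolean mask: same rows, selected with plain indexing
por_mascara = df_toy[df_toy['total_de_notas'] >= limitador]

print(por_query.equals(por_mascara))
```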


In [15]:
# Checking how many movies remain in the new DataFrame
len(notas_filmes_populares['titulo'].unique())

450

In [16]:
# Pivoting: movies as rows, users as columns and ratings as values (0 where the user did not rate the movie)
filmes_notas_pivot = notas_filmes_populares.pivot_table(index = 'titulo', columns = 'usuarioId', values = 'nota').fillna(0)
filmes_notas_pivot.head()

usuarioId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
titulo
10 Things I Hate About You (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2001: A Space Odyssey (1968),0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,5.0,0.0,3.0,0.0,4.5
28 Days Later (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,5.0
300 (2007),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0
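
After `fillna(0)`, a 0 in the pivot means "no rating", not "rated zero", and most cells end up being 0. The sketch below builds a tiny pivot from made-up data (`notas_toy` is hypothetical) and measures which fraction of the cells holds a real rating.

```python
import pandas as pd

# Toy data, invented for illustration only
notas_toy = pd.DataFrame({
    'usuarioId': [1, 1, 2, 3],
    'titulo': ['Filme A', 'Filme B', 'Filme A', 'Filme C'],
    'nota': [4.0, 3.0, 5.0, 2.0],
})

# Pivot before filling: missing (movie, user) pairs are NaN
pivot_sem_zeros = notas_toy.pivot_table(index='titulo', columns='usuarioId', values='nota')

# Same pivot with missing ratings replaced by 0, as in the notebook
pivot_toy = pivot_sem_zeros.fillna(0)

# Fraction of cells that contain an actual rating
densidade = pivot_sem_zeros.notna().to_numpy().mean()

print(pivot_toy)
print(f'density: {densidade:.2f}')
```

A low density is what makes a sparse representation worthwhile.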


## Using NearestNeighbors

In [17]:
# Building a csr_matrix to store the values more efficiently before fitting NearestNeighbors
filmes_notas_pivot_matrix = csr_matrix(filmes_notas_pivot)

modelo_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
modelo_knn.fit(filmes_notas_pivot_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')
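
With `metric = 'cosine'`, the distance between two movies is one minus the cosine of the angle between their rating vectors, so it depends on the pattern of ratings rather than on their magnitude. The sketch below recomputes the distances by hand on a made-up matrix (`X` and `distancia_cosseno` are hypothetical names) and compares them with what `kneighbors` returns.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy movie x user matrix, invented for illustration (0 = no rating)
X = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

def distancia_cosseno(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

modelo = NearestNeighbors(metric='cosine', algorithm='brute').fit(csr_matrix(X))
distancias, indices = modelo.kneighbors(X[0].reshape(1, -1), n_neighbors=3)

# Manual distances from row 0 to each returned neighbor, in the same order
manual = [distancia_cosseno(X[0], X[i]) for i in indices.flatten()]

print(indices.flatten())
print(distancias.flatten())
print(manual)
```

Rows 0 and 1 share raters and end up close together, while row 2 has no rater in common with row 0 and sits at distance 1.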

In [18]:
# Picking a movie at random
query_index = np.random.choice(filmes_notas_pivot.shape[0])
distancias, indices = modelo_knn.kneighbors(filmes_notas_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

print(f'Movie: {filmes_notas_pivot.index[query_index]} - index {query_index}')

Movie: Moulin Rouge (2001) - index 278


In [19]:
# Displaying the recommendations (position 0 is expected to be the queried movie itself, so it is skipped)
print('Recommendations for {0}:\n'.format(filmes_notas_pivot.index[query_index]))
for i, (indice, distancia) in enumerate(zip(indices.flatten()[1:], distancias.flatten()[1:]), start = 1):
    print('{0}: {1}, with a distance of {2}'.format(i, filmes_notas_pivot.index[indice], distancia))

Recommendations for Moulin Rouge (2001):

1: Bridget Jones's Diary (2001), with a distance of 0.44030672311782837
2: Legally Blonde (2001), with a distance of 0.5300582051277161
3: Harry Potter and the Chamber of Secrets (2002), with a distance of 0.5409270524978638
4: Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001), with a distance of 0.543553352355957
5: Signs (2002), with a distance of 0.5448830127716064
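
The query and display steps can be wrapped in a small function so that recommendations can be requested by title instead of by random index. This is only a sketch: `recomendar_filmes` is a hypothetical helper that does not appear in the notebook, and it is exercised here on made-up data.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recomendar_filmes(titulo, pivot, modelo, n=5):
    """Hypothetical helper: return up to n (title, distance) pairs nearest to `titulo`."""
    if titulo not in pivot.index:
        raise KeyError(f'Title not found: {titulo}')
    # Ask for one extra neighbor because the queried movie is usually returned too
    n_vizinhos = min(n + 1, pivot.shape[0])
    consulta = pivot.loc[[titulo]].to_numpy()
    distancias, indices = modelo.kneighbors(consulta, n_neighbors=n_vizinhos)
    # Filter by title instead of assuming the query is always at position 0
    resultado = [
        (pivot.index[i], float(d))
        for d, i in zip(distancias.flatten(), indices.flatten())
        if pivot.index[i] != titulo
    ]
    return resultado[:n]

# Toy movie x user pivot, invented for illustration only
pivot_toy = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0]],
    index=['Filme A', 'Filme B', 'Filme C'],
    columns=[1, 2, 3, 4],
)
modelo_toy = NearestNeighbors(metric='cosine', algorithm='brute')
modelo_toy.fit(csr_matrix(pivot_toy.to_numpy()))

recs = recomendar_filmes('Filme A', pivot_toy, modelo_toy, n=2)
for posicao, (nome, distancia) in enumerate(recs, start=1):
    print(f'{posicao}: {nome}, with a distance of {distancia:.3f}')
```

With the notebook's objects, the call would take `filmes_notas_pivot` and `modelo_knn` as arguments.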


# ***End***