# Brazilian Film Recommendation System: Letterboxd

### 1. Description:
This project applies machine learning algorithms to build models (predictive, descriptive, or hybrid) that extract patterns or knowledge from the dataset used in the previous assignments.

### 2. Choose one of the following task/problem categories:
a. Regression<br>
b. Classification<br>
c. Clustering<br>
d. Association Rules<br>
e. Outlier Detection<br>
f. Dimensionality Reduction and Feature Selection<br>

### 3. Choose a performance evaluation metric. Justify the choice.

### Content-based filtering recommendation system: it relies only on the items' own attributes/features to suggest recommendations.

In content-based filtering, we recommend items that are similar to items the user already likes, based on the properties/attributes of those items.
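
As a minimal sketch of the idea, the snippet below ranks three hypothetical items by how close their text descriptions are to one the user liked. The item names and descriptions are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (not taken from the dataset).
items = {
    "Film A": "crime drama police violence favela",
    "Film B": "crime drama police corruption",
    "Film C": "romantic comedy wedding family",
}

titles = list(items)
vectors = TfidfVectorizer().fit_transform(items.values())
scores = cosine_similarity(vectors)

# Rank the other items by similarity to "Film A", highest first.
liked = titles.index("Film A")
ranked = sorted(
    (i for i in range(len(titles)) if i != liked),
    key=lambda i: scores[liked, i],
    reverse=True,
)
best_match = titles[ranked[0]]
print(best_match)
```

No user ratings are involved: the ranking depends only on the words the descriptions share.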

In [1]:
# Import libraries:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Read the file:
base_filmes = pd.read_csv('eda_filmes.csv')

In [3]:
base_filmes.head()

Unnamed: 0,Ano,País 1,Título US,Título Original,Diretor,Watchedby,Listado (qtd),Liked,Ranking,Fans,Rating,Duração,Sinopse,Gênero,Gênero 1,Gênero 2,Gênero 3,Taxa_Liked_Watched
0,2002,Brazil,City of God,‘Cidade de Deus’,Fernando Meirelles,379000,77000,133000,24.0,10000,4.4,130,Buscapé was raised in a very violent environme...,"['crime', 'drama']",crime,drama,--,0.35
1,2000,Brazil,A Dog’s Will,‘O Auto da Compadecida’,Guel Arraes,68000,11000,29000,5.0,1400,4.5,104,The lively João Grilo and the sly Chicó are po...,"['drama', 'fantasy', 'comedy']",drama,fantasy,comedy,0.43
2,1998,Brazil,Central Station,‘Central do Brasil’,Walter Salles,52000,15000,19000,22.0,1800,4.4,110,"An emotive journey of a former school teacher,...",['drama'],drama,--,--,0.37
3,2014,Brazil,The Way He Looks,‘Hoje Eu Quero Voltar Sozinho’,Daniel Ribeiro,71000,16000,22000,,1100,4.0,96,Leonardo is a blind teenager dealing with an o...,"['romance', 'drama']",romance,drama,--,0.31
4,2015,Brazil,The Second Mother,‘Que Horas Ela Volta?’,Anna Muylaert,45000,12000,14000,112.0,429,4.2,112,After leaving her daughter Jessica in a small ...,['drama'],drama,--,--,0.31


In [4]:
base_filmes.shape

(2445, 18)

In [5]:
# Add a unique Id to the dataframe (one per row, without hardcoding the row count):
base_filmes['Id'] = np.arange(len(base_filmes))

In [6]:
base_filmes.tail()

Unnamed: 0,Ano,País 1,Título US,Título Original,Diretor,Watchedby,Listado (qtd),Liked,Ranking,Fans,Rating,Duração,Sinopse,Gênero,Gênero 1,Gênero 2,Gênero 3,Taxa_Liked_Watched,Id
2440,2019,Brazil,Em Reforma,Em Reforma,Diana Coelho,80,17,10,,0,3.4,19,"Bianca, a public school teacher, decides to re...",['drama'],drama,--,--,0.12,2440
2441,2020,Brazil,Luz Acesa,Luz Acesa,Guilherme Coelho,43,27,3,,0,3.3,70,The documentary portrays five people trying to...,['documentary'],documentary,--,--,0.07,2441
2442,2020,Brazil,De Costas Pro Rio,De Costas Pro Rio,Felipe Aufiero,78,20,1,,0,2.9,16,"After the contact of a spirit from the forest,...",['drama'],drama,--,--,0.01,2442
2443,2014,Brazil,Ressurgentes — Um Filme de Ação Direta,Ressurgentes — Um Filme de Ação Direta,Dácia Ibiapina,55,23,6,,0,3.5,74,"This film touches on political thought, worldv...",['documentary'],documentary,--,--,0.11,2443
2444,2005,Brazil,From Bereavement to Fight,‘Do luto à luta’,Evaldo Mocarzel,59,31,3,,0,3.5,75,This documentary gives a look without prejudic...,['documentary'],documentary,--,--,0.05,2444


In [7]:
# Rows whose synopsis is blank (a single space)
base_filmes.loc[base_filmes['Sinopse']==' ']

Unnamed: 0,Ano,País 1,Título US,Título Original,Diretor,Watchedby,Listado (qtd),Liked,Ranking,Fans,Rating,Duração,Sinopse,Gênero,Gênero 1,Gênero 2,Gênero 3,Taxa_Liked_Watched,Id
1319,2021,Brazil,Abjetas 288,Abjetas 288,Julia da Costa,250,72,54,,0,3.5,20,,['science-fiction'],science-fiction,--,--,0.22,1319
1351,2015,Brazil,Rapsódia Para o Homem Negro,Rapsódia Para o Homem Negro,Gabriel Martins,243,62,49,,0,3.8,24,,"['thriller', 'drama']",thriller,drama,--,0.20,1351
1355,2020,Brazil,Yaõkwá: Image and Memory,‘Yaõkwa - Imagem e Memória’,Vincent Carelli,190,65,59,,1,3.8,21,,['documentary'],documentary,--,--,0.31,1355
1368,1972,Brazil,Independência ou Morte,Independência ou Morte,Carlos Coimbra,179,88,23,,1,3.1,118,,"['drama', 'history']",drama,history,--,0.13,1368
1376,2016,Brazil,Deixa Na Régua,Deixa Na Régua,Emílio Domingos,135,66,38,,4,3.8,73,,['documentary'],documentary,--,--,0.28,1376
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2412,2020,Brazil,Matriz.doc,Matriz.doc,Otávio Sousa,60,13,15,,0,3.7,60,,"['documentary', 'music']",documentary,music,--,0.25,2412
2419,1942,Brazil,O Despertar da Redentora,O Despertar da Redentora,Humberto Mauro,55,26,0,,0,3.0,18,,['history'],history,--,--,0.00,2419
2429,2020,Brazil,A Nave de Mané Socó,A Nave de Mané Socó,Severino Dadá,81,13,5,,0,3.4,18,,['science-fiction'],science-fiction,--,--,0.06,2429
2433,2015,Brazil,Damas do Samba,Damas do Samba,Susanna Lira,45,40,6,,0,3.4,75,,['documentary'],documentary,--,--,0.13,2433


In [8]:
indexNames = base_filmes.loc[base_filmes['Sinopse']==' '].index
base_filmes.drop(indexNames,inplace=True)
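
The filter above matches synopses that are exactly one space. If blank values could also appear as empty strings, longer runs of whitespace, or NaN (an assumption, not something observed in this dataset), a more defensive check might look like the sketch below, shown on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering several kinds of "blank" synopsis.
toy = pd.DataFrame({
    "Título US": ["A", "B", "C", "D"],
    "Sinopse": ["A real synopsis.", " ", "", np.nan],
})

# Blank means: missing, or empty once surrounding whitespace is stripped.
is_blank = toy["Sinopse"].isna() | (toy["Sinopse"].str.strip() == "")
cleaned = toy.loc[~is_blank]
print(cleaned)
```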

In [9]:
df = base_filmes.copy()

In [10]:
df.head()

Unnamed: 0,Ano,País 1,Título US,Título Original,Diretor,Watchedby,Listado (qtd),Liked,Ranking,Fans,Rating,Duração,Sinopse,Gênero,Gênero 1,Gênero 2,Gênero 3,Taxa_Liked_Watched,Id
0,2002,Brazil,City of God,‘Cidade de Deus’,Fernando Meirelles,379000,77000,133000,24.0,10000,4.4,130,Buscapé was raised in a very violent environme...,"['crime', 'drama']",crime,drama,--,0.35,0
1,2000,Brazil,A Dog’s Will,‘O Auto da Compadecida’,Guel Arraes,68000,11000,29000,5.0,1400,4.5,104,The lively João Grilo and the sly Chicó are po...,"['drama', 'fantasy', 'comedy']",drama,fantasy,comedy,0.43,1
2,1998,Brazil,Central Station,‘Central do Brasil’,Walter Salles,52000,15000,19000,22.0,1800,4.4,110,"An emotive journey of a former school teacher,...",['drama'],drama,--,--,0.37,2
3,2014,Brazil,The Way He Looks,‘Hoje Eu Quero Voltar Sozinho’,Daniel Ribeiro,71000,16000,22000,,1100,4.0,96,Leonardo is a blind teenager dealing with an o...,"['romance', 'drama']",romance,drama,--,0.31,3
4,2015,Brazil,The Second Mother,‘Que Horas Ela Volta?’,Anna Muylaert,45000,12000,14000,112.0,429,4.2,112,After leaving her daughter Jessica in a small ...,['drama'],drama,--,--,0.31,4


### Dimensionality Reduction

An unsupervised method that generates new attributes from linear combinations of the original attributes.
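
To make "linear combination" concrete, the sketch below fits `TruncatedSVD` on a small random matrix (toy data, not the film dataset) and checks that each reduced attribute equals a weighted sum of the original columns, with the weights stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # toy data: 6 samples, 4 original attributes

svd_toy = TruncatedSVD(n_components=2, random_state=0)
svd_toy.fit(X)
Z = svd_toy.transform(X)

# Each new attribute is a weighted sum of the original columns;
# the weights are the rows of components_.
Z_manual = X @ svd_toy.components_.T
print(Z.shape, np.allclose(Z, Z_manual))
```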

In [11]:
# Build a new attribute by concatenating the original string attributes
df['atributos'] = df['Título US'] +  ' ' + df['Diretor'] + ' ' + df['Sinopse'] + ' ' + df['Gênero 1'] + ' ' + df['Gênero 2'] +' ' + df['Gênero 3']

In [12]:
df['atributos']

0       City of God Fernando Meirelles Buscapé was rai...
1       A Dog’s Will Guel Arraes The lively João Grilo...
2       Central Station Walter Salles An emotive journ...
3       The Way He Looks Daniel Ribeiro Leonardo is a ...
4       The Second Mother Anna Muylaert After leaving ...
                              ...                        
2440    Em Reforma Diana Coelho Bianca, a public schoo...
2441    Luz Acesa Guilherme Coelho The documentary por...
2442    De Costas Pro Rio Felipe Aufiero After the con...
2443    Ressurgentes — Um Filme de Ação Direta Dácia I...
2444    From Bereavement to Fight Evaldo Mocarzel This...
Name: atributos, Length: 2369, dtype: object
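
The genre columns use `--` as a placeholder, so the combined text contains it as well. With the default token pattern of `TfidfVectorizer`, which keeps only tokens of two or more word characters, these placeholders should not become features. A quick check on two hypothetical strings shaped like the ones above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical combined strings, including the "--" genre placeholder.
docs = [
    "Central Station Walter Salles drama -- --",
    "City of God Fernando Meirelles crime drama --",
]

vec = TfidfVectorizer()
vec.fit(docs)
vocab = set(vec.vocabulary_)  # learned terms (lowercased by default)

print("--" in vocab)
print(sorted(vocab))
```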

In [13]:
# Drop the attributes that were combined, plus some that are not considered relevant:
df.drop(columns=['País 1','Título US','Diretor','Sinopse','Gênero','Gênero 1','Gênero 2','Gênero 3', 'Watchedby', 'Listado (qtd)', 'Liked', 'Ranking', 'Fans'], inplace=True)

In [14]:
df.head()

Unnamed: 0,Ano,Título Original,Rating,Duração,Taxa_Liked_Watched,Id,atributos
0,2002,‘Cidade de Deus’,4.4,130,0.35,0,City of God Fernando Meirelles Buscapé was rai...
1,2000,‘O Auto da Compadecida’,4.5,104,0.43,1,A Dog’s Will Guel Arraes The lively João Grilo...
2,1998,‘Central do Brasil’,4.4,110,0.37,2,Central Station Walter Salles An emotive journ...
3,2014,‘Hoje Eu Quero Voltar Sozinho’,4.0,96,0.31,3,The Way He Looks Daniel Ribeiro Leonardo is a ...
4,2015,‘Que Horas Ela Volta?’,4.2,112,0.31,4,The Second Mother Anna Muylaert After leaving ...


In [15]:
df.rename(columns={'Título Original': 'title'}, inplace = True)

In [16]:
df.isnull().sum()

Ano                   0
title                 0
Rating                0
Duração               0
Taxa_Liked_Watched    0
Id                    0
atributos             0
dtype: int64

In [17]:
df.shape

(2369, 7)

In [18]:
df.drop_duplicates(inplace=True)

In [19]:
df.shape

(2369, 7)

Data vectorization and transformation

In [20]:
# TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features. (sklearn documentation)
tfidf = TfidfVectorizer(max_features=5000)

In [21]:
# Transform the data
dado_vetorizado = tfidf.fit_transform(df['atributos'].values)
dado_vetorizado

<2369x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 88375 stored elements in Compressed Sparse Row format>
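
The matrix has 5000 columns because `max_features` caps the vocabulary at the terms with the highest frequency across the corpus. The toy corpus below is hypothetical and only illustrates that behavior on a much smaller scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus-wide term counts: drama=3, crime=2, comedy=1, romance=1.
docs = ["drama crime", "drama crime comedy", "drama romance"]

full = TfidfVectorizer().fit(docs)
capped = TfidfVectorizer(max_features=2).fit(docs)

full_terms = sorted(full.vocabulary_)
capped_terms = sorted(capped.vocabulary_)
print(full_terms)
print(capped_terms)
```

Rarer terms are discarded before any TF-IDF weight is computed, so a distinctive but infrequent word in a synopsis may not contribute to the similarity at all.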

In [22]:
df_vetorizado = pd.DataFrame(dado_vetorizado.toarray(), index=df['atributos'].index.tolist())

In [23]:
df_vetorizado.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.102357,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


SVD: Dimensionality reduction using truncated SVD.

In [24]:
# sklearn.decomposition.TruncatedSVD
svd = TruncatedSVD(n_components=3000)

In [25]:
# Fit 
reduced_data = svd.fit_transform(df_vetorizado)

In [26]:
# Shape of the reduced data
reduced_data.shape

(2369, 2369)

In [27]:
svd.explained_variance_ratio_.cumsum()

array([0.0065457 , 0.01706554, 0.02416757, ..., 0.99999064, 0.99999668,
       1.        ])
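
The cumulative sum can also guide the choice of `n_components`: keep the smallest number of components whose cumulative ratio reaches a chosen threshold. The ratios and the 0.85 threshold below are hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical explained-variance ratios (not the values computed above).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = ratios.cumsum()

target = 0.85  # assumed threshold, chosen only for illustration
# Index of the first cumulative value >= target, converted to a component count.
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(n_keep)
```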

Cosine similarity: measures the similarity between two vectors in a vector space as the cosine of the angle between them

In [28]:
# Metric: Compute cosine similarity between samples in X and Y.
similarity = cosine_similarity(reduced_data)
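
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below computes it by hand on three toy vectors and compares the result with `cosine_similarity`; the vectors are arbitrary examples.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual = [cosine(a, b), cosine(a, c)]
library = cosine_similarity([a], [b, c])[0]
print(manual, library)
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no shared components score 0.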

In [29]:
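def recommendation(filme):
    # Rows of `similarity` are positional, so look up the film's row position in df
    # rather than its index label (labels have gaps after the rows dropped earlier).
    pos_filme = np.flatnonzero(df['title'].values == filme)[0]
    distances = similarity[pos_filme]
    # Sort by similarity, highest first; skip the first entry (normally the film itself)
    # and keep the next two.
    lista_filmes = sorted(enumerate(distances), reverse=True, key=lambda x: x[1])[1:3]

    for i, _ in lista_filmes:
        print(df.iloc[i].title)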
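
Because rows were dropped earlier without resetting the index, index labels and row positions in `df` can differ, while the rows of a similarity matrix are always positional. A toy sketch of the distinction, using hypothetical data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy = toy.drop(index=[1])  # remaining index labels: 0, 2, 3

# Label-based lookup returns the original index label of "D".
label = toy[toy["title"] == "D"].index[0]
# Position-based lookup returns where "D" sits among the remaining rows.
position = int(np.flatnonzero(toy["title"].to_numpy() == "D")[0])
print(label, position)
```

Only the position is a safe index into a matrix built from the remaining rows.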

In [30]:
recommendation('‘Tropa de Elite’')

‘Tropa de Elite 2’
‘Eu Matei Lúcio Flávio’


In [60]:
recommendation('‘Sonhos Roubados’')

‘Que Horas Ela Volta?’
Por Que Você Não Chora?
‘A Vizinhança do Tigre’


In [34]:
# Generate the CSV:
df.to_csv('recomenda_filmebr.csv', index=False)

In [48]:
list(df['title'])

['‘Cidade de Deus’',
 '‘O Auto da Compadecida’',
 '‘Central do Brasil’',
 '‘Hoje Eu Quero Voltar Sozinho’',
 '‘Que Horas Ela Volta?’',
 '‘Tropa de Elite’',
 '‘Democracia em Vertigem’',
 '‘Pixote: A Lei do Mais Fraco’',
 '‘Cabra Marcado Para Morrer’',
 '‘Deus e o Diabo na Terra do Sol’',
 '‘Tropa de Elite 2’',
 '‘O Som ao Redor’',
 '‘A Menina Que Matou os Pais’',
 '‘O Menino Que Matou Meus Pais’',
 '‘O Auto da Compadecida’',
 '‘O Menino e o Mundo’',
 '‘Terra em Transe’',
 'Limite',
 '‘O Pagador de Promessas’',
 '‘Ilha das Flores’',
 'Marighella',
 '‘Jogo de Cena’',
 '‘À Meia Noite Levarei Sua Alma’',
 '‘Edifício Master’',
 '‘Tudo Bem no Natal Que Vem’',
 '‘Minha Mãe é uma Peça: O Filme’',
 '‘Bicho de Sete Cabeças’',
 '‘O Lobo Atrás da Porta’',
 '‘Eles não usam black-tie’',
 '‘Bingo: O Rei das Manhãs’',
 '‘Vidas Secas’',
 '‘Emicida: AmarElo - É Tudo Pra Ontem’',
 '‘Minha Mãe é uma Peça 3: O Filme’',
 'Ratatoing',
 '‘O Homem Que Copiava’',
 '‘Lisbela e o Prisioneiro’',
 '‘Turma da Mônica: