# Sistema de recomendaciones _Content-Based_

Construimos un sistema de recomendación utilizando los metadatos de las películas rankeadas en el dataset de MovieLens. Sobre los datos a utilizar realizamos un feature engineering y limpiamos outliers en (1). 

El sistema resultante será utilizado en un esquema de recomendación híbrido, otmando como entrada los datos de salida del recomendador colaborativo Kmeans + knn.

(1) : https://colab.research.google.com/drive/1m-flwEvpFchQrqaMLviU_nS2MRlD1d7M#scrollTo=WiHCRLWtL6IX



Montamos Google Drive y cargamos los datos

In [1]:
# Montamos drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [153]:
import os
from ast import literal_eval

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#KNN:
#Importing tqdm function of tqdm module 
from tqdm import tqdm
from time import sleep
from sklearn.neighbors import NearestNeighbors

In [10]:
# Cargamos los datos mergeados de Imdb y MovieLens
files_path = '/content/drive/MyDrive/LaboDatos_1C2022/tp_final/data'

movies_data = os.path.join(files_path, 'df_movies.csv')
users_data =  os.path.join(files_path, 'df_ratings.csv')
clusters = os.path.join(files_path, '9-features-maxlocal-5.csv')


df_movies = pd.read_csv(movies_data)
df_ratings = pd.read_csv(users_data)
df_clusters = pd.read_csv(clusters, usecols=['userId', 'cluster'])

In [11]:
df_ratings = df_ratings.merge(df_clusters, on='userId', how='inner')
df_ratings

Unnamed: 0,index,userId,movieId,rating,timestamp,movies_count,cluster
0,0,1,296,5.0,2006-05-17,70,4
1,1,1,306,3.5,2006-05-17,70,4
2,2,1,307,5.0,2006-05-17,70,4
3,3,1,665,5.0,2006-05-17,70,4
4,4,1,899,3.5,2006-05-17,70,4
...,...,...,...,...,...,...,...
11909121,25000090,162541,50872,4.5,2009-04-28,180,2
11909122,25000091,162541,55768,2.5,2009-04-28,180,2
11909123,25000092,162541,56176,2.0,2009-04-28,180,2
11909124,25000093,162541,58559,4.0,2009-04-28,180,2


## 1. Limpieza de datos

Primero llevemos df_movies al formato adecuado para hacer recomendaciones, utilizando todas las películas, y guardemos el archivo para usar más adelante.

In [203]:
metadata = df_movies.drop(['startYear', 'runtimeMinutes', 'writers'], axis=1)
metadata.set_index('movieId', inplace = True)
metadata

Unnamed: 0_level_0,title,genres,averageRating,numVotes,directors,cast,keywords
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Toy Story (1995),"['Fantasy', 'Comedy', 'Animation', 'Children',...",8.3,965708.0,['John Lasseter'],"['Tom Hanks', 'Tim Allen', 'Don Rickles']","['jealousy', 'toy', 'boy', 'friendship', 'frie..."
2,Jumanji (1995),"['Fantasy', 'Children', 'Adventure']",7.0,337453.0,['Joe Johnston'],"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['board game', 'disappearance', ""based on chil..."
3,Grumpier Old Men (1995),"['Comedy', 'Romance']",6.6,27243.0,['Howard Deutch'],"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret']","['fishing', 'best friend', 'duringcreditssting..."
4,Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']",5.9,10877.0,['Forest Whitaker'],"['Whitney Houston', 'Angela Bassett', 'Loretta...","['based on novel', 'interracial relationship',..."
5,Father of the Bride Part II (1995),['Comedy'],6.0,37702.0,['Charles Shyer'],"['Steve Martin', 'Diane Keaton', 'Martin Short']","['baby', 'midlife crisis', 'confidence', 'agin..."
...,...,...,...,...,...,...,...
209155,Santosh Subramaniam (2008),"['Comedy', 'Action', 'Romance']",7.4,1638.0,['Mohan Raja'],,
209157,We (2018),['Drama'],5.6,2677.0,['Rene Eller'],,
209159,Window of the Soul (2001),['Documentary'],7.8,668.0,['João Jardim' 'Walter Carvalho'],,
209163,Bad Poems (2018),"['Comedy', 'Drama']",7.5,2491.0,['Gábor Reisz'],,


In [204]:
metadata.isnull().sum()

title                0
genres               0
averageRating        0
numVotes             0
directors            0
cast             13059
keywords         13059
dtype: int64

Pasamos todas las palabras a minusculas, quitamos espacios y armamos una sopa entre director y género. De esta manera estamos buscando similitud usando ambos features.

In [205]:
# Para los directores sacamos espacios y mayusculas
def literal_eval_(x):
  res = None
  try:
    res = literal_eval(x)  
  except: ValueError

  return res
  

def clean_staff(lis):
  if isinstance(lis, list):
    return [str.lower(string.replace(" ", "")) for string in lis]

def clean_genres(lis):
    if isinstance(lis, list):
      return [string.lower() for string in lis]
   
def words_soup(df):
  """Retorna un string listo para vectorizar
  que contiene la información de directores y género"""
  return " ".join(df["genres"]) + " " + " ".join(df["directors"]) + " " + " ".join(df["cast"]) + " " + " ".join(df["keywords"])     

In [206]:
#Paso todos los strings a objetos de python
metadata['directors'] = metadata['directors'].apply(literal_eval_)
metadata['genres'] = metadata['genres'].apply(literal_eval)
metadata['cast'] = metadata['cast'].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else x
    )

#Limpio los features utilizados para recomendar
metadata['directors'] = metadata['directors'].apply(clean_staff)
metadata['genres'] = metadata['genres'].apply(clean_genres)
metadata['cast'] = metadata['cast'].apply(clean_staff)

In [207]:
metadata['directors'] = metadata['directors'].apply(lambda x: [] if x is None else x)

Ponemos 3 veces al director así lo ponderamos respecto del resto del cast

In [208]:
metadata['directors'] = metadata['directors'].apply(lambda x: x + x + x)

Para limpiar los keywords tenemos que hacer un poquito más de trabajo: Sacar las que no son muy frecuentes y lematizar

In [209]:
metadata['keywords'] = metadata['keywords'].apply(
    lambda x: literal_eval(x) if isinstance(x, str) else x
    )

In [210]:
# Creamos un diccionario de palabras
s = metadata.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

# Tiramos las que aparecen una sola vez
s = s.value_counts()
s = s[s > 1]

# Definimos una funcion para limpiar las keywords usando s
def filter_keywords(x, s):
    words = []
    if isinstance(x, list):
      for i in x:
          if i in s:
              words.append(i)
    return words

  


In [211]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

metadata['keywords'] = metadata['keywords'].apply(lambda x: filter_keywords(x, s))
metadata['keywords'] = metadata['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
metadata['keywords'] = metadata['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [212]:
metadata['cast'] = metadata['cast'].apply(lambda x: [] if x is None else x) 
metadata

Unnamed: 0_level_0,title,genres,averageRating,numVotes,directors,cast,keywords
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Toy Story (1995),"[fantasy, comedy, animation, children, adventure]",8.3,965708.0,"[johnlasseter, johnlasseter, johnlasseter]","[tomhanks, timallen, donrickles]","[jealousi, toy, boy, friendship, friend, rival..."
2,Jumanji (1995),"[fantasy, children, adventure]",7.0,337453.0,"[joejohnston, joejohnston, joejohnston]","[robinwilliams, jonathanhyde, kirstendunst]","[boardgam, disappear, basedonchildren'sbook, n..."
3,Grumpier Old Men (1995),"[comedy, romance]",6.6,27243.0,"[howarddeutch, howarddeutch, howarddeutch]","[waltermatthau, jacklemmon, ann-margret]","[fish, bestfriend, duringcreditssting, oldmen]"
4,Waiting to Exhale (1995),"[comedy, drama, romance]",5.9,10877.0,"[forestwhitaker, forestwhitaker, forestwhitaker]","[whitneyhouston, angelabassett, lorettadevine]","[basedonnovel, interracialrelationship, single..."
5,Father of the Bride Part II (1995),[comedy],6.0,37702.0,"[charlesshyer, charlesshyer, charlesshyer]","[stevemartin, dianekeaton, martinshort]","[babi, midlifecrisi, confid, age, daughter, mo..."
...,...,...,...,...,...,...,...
209155,Santosh Subramaniam (2008),"[comedy, action, romance]",7.4,1638.0,"[mohanraja, mohanraja, mohanraja]",[],[]
209157,We (2018),[drama],5.6,2677.0,"[reneeller, reneeller, reneeller]",[],[]
209159,Window of the Soul (2001),[documentary],7.8,668.0,"[joãojardimwaltercarvalho, joãojardimwaltercar...",[],[]
209163,Bad Poems (2018),"[comedy, drama]",7.5,2491.0,"[gáborreisz, gáborreisz, gáborreisz]",[],[]


Juntamos todo en una feature para vectorizar

In [213]:
#Agrego la columna con los documentos a vectorizar
metadata['documents'] = metadata.apply(words_soup, axis=1)
metadata.drop(columns=['genres', 'directors', 'cast', 'keywords'], inplace=True) #le saco las columnas que ya no nos interesan
# Miro como quedó mi dataset
metadata.head()

Unnamed: 0_level_0,title,averageRating,numVotes,documents
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Toy Story (1995),8.3,965708.0,fantasy comedy animation children adventure jo...
2,Jumanji (1995),7.0,337453.0,fantasy children adventure joejohnston joejohn...
3,Grumpier Old Men (1995),6.6,27243.0,comedy romance howarddeutch howarddeutch howar...
4,Waiting to Exhale (1995),5.9,10877.0,comedy drama romance forestwhitaker forestwhit...
5,Father of the Bride Part II (1995),6.0,37702.0,comedy charlesshyer charlesshyer charlesshyer ...


In [214]:
metadata.to_csv(os.path.join(files_path, 'metadata.csv'), index=False)

## 3. CountVectorizer + Distancia Coseno

Antes de arrancar, seleccionamos un subconjunto de metadata para hacer la recomendación -así no saturamos la ram-. 

Cuando tengamos la salida del recomendador colaborativo, esta parte se puede obviar, ya que el subconjunto de metadata será el de las películas vistas por los vecinos más cercanos a nuestro usuario.

In [105]:
#seleccionemos un usuario random, llamemoslo Juan:
juan_id = np.random.randint(0, 142246)                           #nos da un id valido de Juan
juan_cluster = df_ratings[df_ratings['userId'] == juan_id].iloc[0].cluster   #nos da el cluster de Juan

# Como subconjunto de películas, tomamos el cluster de Juan
cluster_movies = df_ratings[df_ratings['cluster'] == juan_cluster].movieId

#creamos el df con la metadata de usuarios que nos sirven:
metadata = metadata[metadata.index.isin(cluster_movies.unique())]#.reset_index(drop=True)

Armamos un Bag of Words (BoW) con CountVectorizer y usamos get_recomendations y distancia coseno para obtener similitudes.

In [26]:
# Bag of Words
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['documents'])

# Distancia coseno
cosine_sim = cosine_similarity(count_matrix, count_matrix)

Necesitamos tambien la manera de mapear cada vector con la pelicula asociada:

In [27]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

Ahora ya tenemos todo para definir la funcion de recomendacion:

In [28]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim, indices):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

Hagamos algunas pruebas:

In [29]:
peli = 'Reservoir Dogs (1992)'

print('Con similaridad por director y Género: ')
print(get_recommendations(peli, cosine_sim, indices))

Con similaridad por director y Género: 
movieId
124          Star Maker, The (Uomo delle stelle, L') (1995)
2691      Legend of 1900, The (a.k.a. The Legend of the ...
3992                                          Malèna (2000)
162414                                            Moonlight
5112                                     Last Orders (2001)
59995                                          Boy A (2007)
96821               Perks of Being a Wallflower, The (2012)
1768                      Mother and Son (Mat i syn) (1997)
133739                                 Lenin's Guard (1965)
3078                                 Liberty Heights (1999)
Name: title, dtype: object


Vemos que en este caos las películas recomendadas no están pesadas por ningún tipo de puntuación. Se puede, usar datos como el average rating y el número de votos de cada película para sopesar las recomendaciones. Armemos un recomendador pesado por estos features



In [30]:
def weighted_rating(x, C, m):
  """Devuelve la calificación pesada por el numero de votos
  
  recibe:
  -------
  x: pandas.DataFrame
  Dataset películas

  C: float
  Promedio de todos los ratings del dataset

  m: float
  Cantidad mínima de votos requeridos
  """
  v = x['numVotes']
  R = x['averageRating']
  
  return (v/(v+m) * R) + (m/(m+v) * C)


def improved_recommendations(title, cosine_sim, indices):
  """Recomienda títulos similares pero pesandolos
  por su rating y numero de votos"""
  # Get the index of the movie that matches the title
  idx = indices[title]

  # Get the pairwsie similarity scores of all movies with that movie
  sim_scores = list(enumerate(cosine_sim[idx]))

  # Sort the movies based on the similarity scores
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # Get the scores of the 10 most similar movies
  sim_scores = sim_scores[1:11]

  # Get the movie indices
  movie_indices = [i[0] for i in sim_scores]
  
  # Get movies but with rating
  movies = metadata.iloc[movie_indices][['title', 'numVotes', 'averageRating']]
  
  # Vote counts and Average rating 
  vote_counts = movies[movies['numVotes'].notnull()]['numVotes'].astype('int')
  vote_averages = movies[movies['averageRating'].notnull()]['averageRating'].astype('int')

  # Mean ratings and most voted movies
  mean_rating = vote_averages.mean()
  most_voted = vote_counts.quantile(0.01)

  # Sort by weighted rating
  qualified = movies[
                     ((movies['numVotes'] >= most_voted) & 
                     (movies['numVotes'].notnull()) & 
                     (movies['averageRating'].notnull()))
                     ]
  qualified['numVotes'] = qualified['numVotes'].astype('int')
  qualified['averageRating'] = qualified['averageRating'].astype('int')

  qualified['wr'] = qualified.apply(lambda x: weighted_rating(x, mean_rating, most_voted), axis=1)
  qualified = qualified.sort_values('wr', ascending=False).head(10)

  return qualified

In [31]:
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
peli = 'Showgirls (1995)'

recommendations = improved_recommendations(peli, cosine_sim, indices)

print(f'Similares a {peli} -por director, genero, actores y keywords- pesado por ranking: ')
print(*recommendations['title'], sep='\n')
print('-'*60)
print('Con similaridad por director y Género: ')
print(get_recommendations(peli, cosine_sim, indices))


Similares a Showgirls (1995) -por director, genero, actores y keywords- pesado por ranking: 
Kill Bill: Vol. 2 (2004)
State of Play (2009)
Trash (2014)
Solace (2015)
Sacrament, The (2013)
Smoke & Mirrors (2016)
First Kill (2017)
Echelon Conspiracy (2009)
Morgan (2016)
------------------------------------------------------------
Con similaridad por director y Género: 
movieId
67361     Echelon Conspiracy (2009)
146688                Solace (2015)
162594                Morgan (2016)
116893                 Trash (2014)
7438       Kill Bill: Vol. 2 (2004)
165885       Smoke & Mirrors (2016)
175783            First Kill (2017)
113565        Sacrament, The (2013)
68159          State of Play (2009)
1709            Legal Deceit (1997)
Name: title, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


# 4. Otras métodos de recomendación

Probamos algunos otros métodos para hacer recomendaciones basadas en contenido.
La idea es medir una vez que tengamos métrica y elegir con cuál quedarnos.

## Tf-idf + Cosine Distance

Probamos con la matriz de términos tf-idf. A priori debería dar un poco peor que usando BoW, ya que nos va a pesar poco los términos muy usuales, que en este caso podría ser un director.

In [None]:
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['documents'])

# Para poder calcular la similaridad del coseno, alcanza con hacer el producto interno de los vectores.
cosine_sim_tfidf= linear_kernel(tfidf_matrix, tfidf_matrix)     

In [None]:
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()
peli = 'Showgirls (1995)'

recommendations = improved_recommendations(peli, cosine_sim_tfidf, indices)

print(f'Similares a {peli} -por director, genero, actores y keywords- pesado por ranking: ')
print(*recommendations['title'], sep='\n')
print('-'*60)
print('Con similaridad por director y Género: ')
print(get_recommendations(peli, cosine_sim_tfidf, indices))

NameError: ignored

## Td-idf + SVD

# 5. Métricas

O sea deberiamos hacer un par de cosas más a mi entender:

1) alguna manera de agarrar la peli mejor punteada (o algun otro criterio) por cada usuario, y en base a eso pasarle la funcion get_recomendations().

2) codear la medida de correctitud con la idea que planteó Enzo. nos fijamos en qué pelis le estamos recomendando al usuario. y podemos contar cuántas de esas pelis ya vio. algo tipo: 
        
        correctitud = (pelis ya vistas) / (total pelis recomendadas)

Busquemos las 10 películas favoritas de un usuario ficticio

In [78]:
#intentemos hacer eso de la medida de correctitud de la recomendacion:

# Encontramos las 10 pelis mejor puntuadas por Juan:
df_ratings_juan = df_ratings[df_ratings['userId'] == juan_id]
print(f'Juan vio {len(df_ratings_juan)} películas')
juan_max_rating = df_ratings_juan.rating.sort_values(ascending=False)[:10]

juan_peli_fav_id = df_ratings.iloc[juan_max_rating.index.to_list(), :].movieId
juan_peli_fav = df_movies[df_movies.movieId.isin(juan_peli_fav_id)].title
print(juan_peli_fav)

Juan vio 87 películas
251              Star Wars: Episode IV - A New Hope (1977)
1154     Star Wars: Episode V - The Empire Strikes Back...
1155                            Princess Bride, The (1987)
1167     Star Wars: Episode VI - Return of the Jedi (1983)
1193                                     Local Hero (1983)
1219                                  Groundhog Day (1993)
1502                      Men in Black (a.k.a. MIB) (1997)
1531                      Hunt for Red October, The (1990)
2664                                      Airplane! (1980)
11490                         Bourne Ultimatum, The (2007)
Name: title, dtype: object


Recomendemos 10 Pelis para cada una de esas 10 películas. Con las 100 películas recomendadas, tomamos las 10 mejor puntuadas y vemos cuántas vio Juan. 

In [33]:
total_recommended_movies = []
for idx, title in juan_peli_fav.iteritems():
  total_recommended_movies.append(get_recommendations(title, cosine_sim, indices).index.values)

total_recommended_movies = np.asarray(total_recommended_movies).flatten()
ratings_id = df_ratings.iloc[total_recommended_movies, :].movieId

recommended_movies = metadata[metadata.index.isin(ratings_id)].sort_values(by='averageRating', ascending=False)[:10]
recommended_movies

Unnamed: 0_level_0,title,averageRating,numVotes,documents
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
318,"Shawshank Redemption, The (1994)",9.3,2592905.0,crime drama frankdarabont frankdarabont frankd...
7153,"Lord of the Rings: The Return of the King, The...",9.0,1781358.0,fantasy adventure drama action peterjackson pe...
58559,"Dark Knight, The (2008)",9.0,2563798.0,crime drama action christophernolan christophe...
4993,"Lord of the Rings: The Fellowship of the Ring,...",8.8,1802552.0,fantasy adventure peterjackson peterjackson pe...
593,"Silence of the Lambs, The (1991)",8.6,1388234.0,horror crime thriller jonathandemme jonathande...
589,Terminator 2: Judgment Day (1991),8.6,1070317.0,sci-fi action jamescameron jamescameron jamesc...
2324,Life Is Beautiful (La Vita è bella) (1997),8.6,677871.0,comedy drama war romance robertobenigni robert...
3578,Gladiator (2000),8.5,1457131.0,adventure drama action ridleyscott ridleyscott...
923,Citizen Kane (1941),8.3,435008.0,drama mystery orsonwelles orsonwelles orsonwel...
899,Singin' in the Rain (1952),8.3,237681.0,musical comedy romance stanleydonengenekelly s...


In [79]:
#Recien ahora creo que estamos en condiciones de crear la medida de correctitud:
#como tenemos el df_ratings_juan, sabemos exactamente cuales son las pelis que ya vio el usuario:
#sabemos que la cantidad de recomendaciones siempre es 10, asi lo definimos en get_recommendations:

#dada una serie de recomendations, se fija si el usuario ya las vio
def correctitud(recommendations):
  pelis_recomendadas = 10
  # Creo que e problema es que el index no es el movieId
  pelis_vistas_count = sum(1 for movie_id in recommendations.index if movie_id in list(df_ratings_juan.movieId))
  
  return pelis_vistas_count/pelis_recomendadas

In [80]:
recommended_movies = get_recommendations('Toy Story (1995)', cosine_sim, indices)
print('Correctitud con similaridad por generos:', correctitud(recommended_movies))
#no me convence, me voy a dormir a ver si me viene la respuesta en un sueño...
# Tenemos que buscar un usuario con más pelis vistas para probar!!

Correctitud con similaridad por generos: 0.0


Intra-list Similarity: It is the average similarity of all recommended items to a user. This similarity can be taken on any item feature like the genre for a movie. If very similar items in terms of the chosen feature are recommended, the intra-list similarity is high. This can be high or low depending on how the programmer wants the recommendation to be.

**Personalization**: How personalized the recommendations are is also very important. Your recommender should not recommend the same movies to different kinds of users. This can be checked by checking the similarity in a recommendation for users.

In [46]:
#defino acá el nuevo proceso de recomendaciones quw toma en cuenta los 10 knn de juan:

#tenemos que hacer varias cosas:
#1) encuentro los 10 vecinos mas cercanos dado un juan en df_users
#2) achico el universo de df_ratings a aquellos 10 vecinos --- metadata_10nn
#3) repito el proceso con la nueva metadata


"""este es el codigo de sofi para encontrar los 10nn"""

#levanto df_normalizado con las features usadas para clusterizar y la columna de clusters
df_normalizado = pd.read_csv('/content/drive/MyDrive/LaboDatos_1C2022/tp_final/data/df-normalizado-9-features-maxlocal-5.csv', index_col=[0])

#levanto df_users con los usuarios, las features de usuarios y los clusters obtenidos
df_users_knn = pd.read_csv('/content/drive/MyDrive/LaboDatos_1C2022/tp_final/data/9-features-maxlocal-5.csv',index_col=[0])

In [196]:
#seleccionemos un usuario random, llamemoslo Juan:
juan_id = random.choice(list(df_ratings.userId.values))                           #nos da un id valido de Juan
juan_cluster = df_ratings[df_ratings['userId'] == juan_id].iloc[0].cluster   #nos da el cluster de Juan

In [197]:
#esto funca para el cluster de juan
for label in tqdm([juan_cluster]):#range(0,data.cluster.max()):
    # Features
    features = list(set(df_normalizado.columns).difference(set(['cluster'])))
    # me quedo con las features del cluster actual
    X = df_normalizado[df_normalizado.cluster == label][features]
    #Unsupervised learner for implementing neighbor searches.
    nnbrs = NearestNeighbors(n_neighbors=10)
    #Fit the nearest neighbors estimator from the training dataset.
    nearest_neighbors = nnbrs.fit(X)
    # Finds the K-neighbors of a point. 
    # Returns indices of and distances to the neighbors of each point.
    distance_mat, neighbours_mat = nnbrs.kneighbors(X)

100%|██████████| 1/1 [00:01<00:00,  1.29s/it]


In [198]:
#creo una columna con los vecinos de los usuarios del cluster 0 
aux = pd.DataFrame(neighbours_mat)
cols = list(aux.columns)
aux['vecinos'] = aux[cols].apply(lambda row: ('_'.join(row.values.astype(str)).split('_')), axis=1)
aux = aux[['vecinos']]

#visualizo
aux.reset_index(inplace=True)
aux.rename(columns={'index' : 'userId'}, inplace=True)
aux

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,userId,vecinos
0,0,"[0, 28920, 23476, 7408, 22929, 27520, 26847, 2..."
1,1,"[1, 13279, 10230, 26598, 2222, 3666, 7194, 268..."
2,2,"[2, 2823, 2663, 418, 6697, 134, 23798, 26300, ..."
3,3,"[3, 26850, 23033, 10439, 24079, 16275, 10461, ..."
4,4,"[4, 13802, 29643, 28251, 25490, 22283, 3662, 2..."
...,...,...
31168,31168,"[31168, 17498, 14258, 29281, 17271, 2585, 1487..."
31169,31169,"[31169, 23680, 12854, 9202, 5803, 8922, 3796, ..."
31170,31170,"[31170, 21101, 12081, 13836, 697, 6984, 3739, ..."
31171,31171,"[31171, 25467, 8183, 12393, 4692, 20004, 21804..."


In [199]:
df_ratings_cluster_juan = df_ratings[df_ratings.cluster == juan_cluster]
df_ratings_cluster_juan = df_ratings_cluster_juan.merge(aux, how='right', on='userId')

df_ratings_cluster_juan
#aca debería tener entonces el df_cluster_juan, con los vecinos por cada userId
#df_ratings_cluster_juan
#ahora filtro aún más el df con los vecinos:

vecinos_10nn_juan = list(df_ratings_cluster_juan[df_ratings_cluster_juan['userId'] == juan_id].vecinos.values)[0]
vecinos_10nn_juan

['22820',
 '29590',
 '28718',
 '11915',
 '16176',
 '262',
 '3011',
 '615',
 '10576',
 '11740']

In [200]:
#hacemos el df con los vecinos de juan
df_ratings_vecinos_juan = df_ratings[df_ratings.userId.isin([int(i) for i in vecinos_10nn_juan])]
df_ratings_vecinos_juan

Unnamed: 0,index,userId,movieId,rating,timestamp,movies_count,cluster
20046,34348,262,6,4.0,1996-07-23,61,0
20047,34349,262,10,3.0,1996-07-23,61,0
20048,34350,262,32,4.0,1996-07-23,61,0
20049,34351,262,47,3.0,1996-07-23,61,0
20050,34352,262,50,5.0,1996-07-23,61,0
...,...,...,...,...,...,...,...
2164129,4502612,29590,1387,4.0,1997-01-15,125,2
2164130,4502613,29590,1394,4.0,1997-01-15,125,2
2164131,4502614,29590,1405,4.0,1997-03-29,125,2
2164132,4502615,29590,1461,5.0,1997-03-29,125,2


In [215]:
#y ahora usamos este df_ratings_vecinos_juan para crear el metadata:
#repetimos todo el proceso:

# Como subconjunto de películas, tomamos los vecinos de Juan
vecinos_movies = df_ratings_vecinos_juan.movieId

#creamos el df con la metadata de usuarios que nos sirven:
metadata = metadata[metadata.index.isin(vecinos_movies.unique())]#.reset_index(drop=True)

# Bag of Words
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['documents'])

# Distancia coseno
cosine_sim = cosine_similarity(count_matrix, count_matrix)

#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

In [216]:
# Encontramos las 10 pelis mejor puntuadas por Juan:
df_ratings_juan = df_ratings[df_ratings['userId'] == juan_id]
print(f'Juan vio {len(df_ratings_juan)} películas')
juan_max_rating = df_ratings_juan.rating.sort_values(ascending=False)[:10]

juan_peli_fav_id = df_ratings.iloc[juan_max_rating.index.to_list(), :].movieId
juan_peli_fav = df_movies[df_movies.movieId.isin(juan_peli_fav_id)].title
print(juan_peli_fav)

Juan vio 132 películas
514                    Schindler's List (1993)
572          Terminator 2: Judgment Day (1991)
1152    One Flew Over the Cuckoo's Nest (1975)
1161                       12 Angry Men (1957)
1163                Clockwork Orange, A (1971)
1913                Saving Private Ryan (1998)
2031            Gods Must Be Crazy, The (1980)
2146                    Few Good Men, A (1992)
2405                 Planet of the Apes (1968)
2963               Grapes of Wrath, The (1940)
Name: title, dtype: object


In [219]:
#le pido que me recomiende segun la primera peli fav:
get_recommendations(juan_peli_fav.iloc[0], cosine_sim, indices)

movieId
55269                       Darjeeling Limited, The (2007)
92259                                  Intouchables (2011)
2502                                   Office Space (1999)
86882                             Midnight in Paris (2011)
6373                                 Bruce Almighty (2003)
48385    Borat: Cultural Learnings of America for Make ...
94959                              Moonrise Kingdom (2012)
91658              Girl with the Dragon Tattoo, The (2011)
517                                      Rising Sun (1993)
476                                    Inkwell, The (1994)
Name: title, dtype: object