## Netflix Project
Gergana Rosenova Tsakova, Kristina Petrova Tsvetkova

### Content-based recommender

Import the required libraries.

In [66]:
import pandas as pd
import numpy as np

We read the file containing the movies. Each movie has an ID, a title, and a list of genres.

In [67]:
movies = pd.read_csv('ml-10M100K/movies.dat', delimiter = '::', header = None, engine='python' )
movies.rename(columns={0: 'MovieID', 1: 'Title', 2: 'Genres'}, inplace=True)
movies['Genres'] = movies['Genres'].str.split("|")
movies

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),[Comedy]
...,...,...,...
10676,65088,Bedtime Stories (2008),"[Adventure, Children, Comedy]"
10677,65091,Manhattan Melodrama (1934),"[Crime, Drama, Romance]"
10678,65126,Choke (2008),"[Comedy, Drama]"
10679,65130,Revolutionary Road (2008),"[Drama, Romance]"
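
The `::` separator is longer than one character, so pandas interprets it as a regular expression, which requires the Python parsing engine; this is why `engine='python'` is passed above. A minimal sketch on an in-memory sample (the two rows are copied from the output above; the `names` argument is an assumed shortcut that replaces the separate `rename` call):

```python
import io
import pandas as pd

# Two sample rows in the same "::"-delimited layout as movies.dat.
raw = io.StringIO(
    "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy\n"
    "2::Jumanji (1995)::Adventure|Children|Fantasy\n"
)

# A multi-character separator is treated as a regex and needs engine='python'.
toy_movies = pd.read_csv(
    raw,
    sep='::',
    header=None,
    engine='python',
    names=['MovieID', 'Title', 'Genres'],  # assumption: name columns at read time
)
toy_movies['Genres'] = toy_movies['Genres'].str.split('|')
print(toy_movies)
```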


The dataset contains 10,681 distinct movies.

In [68]:
len(movies['MovieID'].unique())

10681

There are no missing values.

In [69]:
movies.isnull().sum()

MovieID    0
Title      0
Genres     0
dtype: int64

We one-hot encode the movies so that each genre becomes its own column.

In [70]:
movies1 = pd.get_dummies(pd.DataFrame(movies.pop('Genres').values.tolist()), prefix='', prefix_sep='').groupby(axis=1, level=0).max()
movies = pd.concat([movies, movies1], axis=1)
#movies = movies.set_index('MovieID')
movies

Unnamed: 0,MovieID,Title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10676,65088,Bedtime Stories (2008),0,0,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10677,65091,Manhattan Melodrama (1934),0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
10678,65126,Choke (2008),0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10679,65130,Revolutionary Road (2008),0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
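
As an illustration of the encoding step, the sketch below one-hot encodes a multi-label column on toy data. It uses `Series.str.get_dummies` directly on the pipe-separated string, which is an alternative to the `get_dummies` + `groupby` combination in the cell above, not the notebook's own method; the toy rows are invented.

```python
import pandas as pd

# Toy data shaped like movies.dat before the genres are split (invented values).
toy = pd.DataFrame({
    'MovieID': [1, 2, 3],
    'Title': ['A', 'B', 'C'],
    'Genres': ['Adventure|Comedy', 'Comedy', 'Drama|Romance'],
})

# str.get_dummies splits each string on the separator and returns
# one 0/1 indicator column per distinct label.
genre_dummies = toy['Genres'].str.get_dummies(sep='|')
toy_encoded = pd.concat([toy[['MovieID', 'Title']], genre_dummies], axis=1)
print(toy_encoded)
```

Each row ends up with a 1 in every genre column it belongs to, so a movie with several genres has several 1s.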


We read the tags that users have assigned to movies. Each row represents one tag given by one user to one movie at a specific time.

In [71]:
tags = pd.read_csv('ml-10M100K/tags.dat', delimiter = '::', header = None, engine='python' )
tags.rename(columns={0: 'UserID', 1: 'MovieID', 2: 'Tag', 3: 'Timestamp'}, inplace=True)
tags

Unnamed: 0,UserID,MovieID,Tag,Timestamp
0,15,4973,excellent!,1215184630
1,20,1747,politics,1188263867
2,20,1747,satire,1188263867
3,20,2424,chick flick 212,1188263835
4,20,2424,hanks,1188263835
...,...,...,...,...
95575,71556,1377,Gothic,1188263571
95576,71556,2424,chick flick,1188263606
95577,71556,3033,comedy,1188263626
95578,71556,3081,Gothic,1188263565


This table contains 16,529 unique tags for 7,601 distinct movies.

In [72]:
print('Number of movies:', len(tags['MovieID'].unique()))
print('Number of distinct tags:', len(tags['Tag'].unique()))

Number of movies: 7601
Number of distinct tags: 16529


This number of tags is too large for each one to be a predictor variable, so we reduce it by keeping only the tags that have been used more than 5 times. This leaves 2,528 tags.

In [73]:
values = tags['Tag'].value_counts()
values = values[values > 5]
print('Number of tags with more than 5 occurrences:', len(values))

Number of tags with more than 5 occurrences: 2528


We filter the tag table.

In [74]:
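#tags that passed the frequency threshold (named to avoid shadowing the built-in filter)
frequent_tags = values.index.tolist()

tags = tags[tags['Tag'].isin(frequent_tags)]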
tags = tags[['MovieID', 'Tag']]
tags

Unnamed: 0,MovieID,Tag
0,4973,excellent!
1,1747,politics
2,1747,satire
6,2947,action
7,2947,bond
...,...,...
95570,30701,Orson Welles
95572,1219,Alfred Hitchcock
95574,1302,Baseball
95576,2424,chick flick
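
The frequency filter can be seen end to end on a small example. This is a sketch on invented data; `min_count` is a hypothetical name, set to 2 here only so the toy table has something to drop (the notebook uses 5).

```python
import pandas as pd

# Invented (MovieID, Tag) pairs.
toy_tags = pd.DataFrame({
    'MovieID': [1, 1, 2, 3, 3, 4],
    'Tag': ['funny', 'pixar', 'funny', 'funny', 'dark', 'pixar'],
})

min_count = 2  # hypothetical threshold for this toy example

# Count how often each tag occurs, then keep tags used more than min_count times.
counts = toy_tags['Tag'].value_counts()
kept = counts[counts > min_count].index

# Keep only the rows whose tag survived the threshold.
filtered = toy_tags[toy_tags['Tag'].isin(kept)]
print(filtered)
```

Note that the comparison is strict: a tag used exactly `min_count` times is dropped.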


We group the tags by movie and then one-hot encode them, so that tags are represented in the same way as genres.

In [75]:
#group the tags of each movie into one list
grouped = tags.groupby(['MovieID'])['Tag'].apply(','.join).reset_index()
grouped['Tag'] = grouped['Tag'].str.split(",")

#get the dummy variables
tags1 = pd.get_dummies(pd.DataFrame(grouped.pop('Tag').values.tolist()), prefix='', prefix_sep='').groupby(axis=1, level=0).max()
#concatenate to recover the MovieID
tags = pd.concat([grouped, tags1], axis=1)
#drop the column for the empty tag
tags = tags.drop(['\xa0'], axis=1)
#tags = tags.set_index('MovieID')
tags

Unnamed: 0,MovieID,"""damn dirty apes""",(s)vcd,*Good* Musicals,007,007 (series),06 Oscar Nominated Best Movie - Animation,1,1.5,16th century,...,writers,writers at work,writing,wry,wuxia,yakuza,youth,zibri studio,zombie,zombies
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,6,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7256,64993,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7257,65006,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7258,65037,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7259,65126,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
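
One thing to keep in mind with the join-then-split approach above is that a tag containing a comma would be split into several pieces. The sketch below shows an alternative that never joins the tags into a string, using `pd.crosstab` on invented data. It is not the notebook's method, and on the real data it could produce a different set of columns.

```python
import pandas as pd

# Invented (MovieID, Tag) pairs, including a repeated tag and a tag with a comma.
toy_tags = pd.DataFrame({
    'MovieID': [1, 1, 2, 2, 2],
    'Tag': ['pixar', 'funny', 'funny', 'funny', 'slow, but good'],
})

# crosstab counts each (MovieID, Tag) pair; clip(upper=1) turns counts into 0/1 flags.
tag_matrix = pd.crosstab(toy_tags['MovieID'], toy_tags['Tag']).clip(upper=1)

# Bring MovieID back as a regular column, as in the notebook's tags table.
tag_matrix = tag_matrix.reset_index()
print(tag_matrix)
```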


We join the genre table with the tag table to obtain a table in which each movie is represented by its genres and its tags. The columns are the 20 genres followed by the 2,527 tags, so each movie is represented as a vector of zeros and ones. We store the result in a DataFrame.

In [76]:
result = pd.merge(movies, tags, how = 'left', on='MovieID', suffixes=['_genre','_tag'])
result = result.drop(['Title'], axis=1)
result = result.fillna(0) #replace missing values with 0 (movies without tags)
result

Unnamed: 0,MovieID,(no genres listed),Action_genre,Adventure_genre,Animation_genre,Children_genre,Comedy_genre,Crime_genre,Documentary_genre,Drama_genre,...,writers,writers at work,writing,wry,wuxia,yakuza,youth,zibri studio,zombie,zombies
0,1,0,0,1,1,1,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,0,0,1,0,1,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0,0,0,0,0,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10676,65088,0,0,1,0,1,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10677,65091,0,0,0,0,0,0,1,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10678,65126,0,0,0,0,0,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10679,65130,0,0,0,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
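
The merge above explains two details of the output: only column names that appear in both tables receive the `_genre` / `_tag` suffix, and tag columns become floats because movies without tags get `NaN` before `fillna`. A small sketch on invented data (the final cast back to integers is an optional extra, not something the notebook does):

```python
import pandas as pd

# Invented genre and tag tables; 'Comedy' exists in both, movie 2 has no tags.
genres = pd.DataFrame({'MovieID': [1, 2, 3], 'Comedy': [1, 0, 1]})
tag_cols = pd.DataFrame({'MovieID': [1, 3], 'funny': [1, 0], 'Comedy': [0, 1]})

# Left merge keeps every movie; clashing column names get the suffixes.
merged = pd.merge(genres, tag_cols, how='left', on='MovieID',
                  suffixes=('_genre', '_tag'))

# Movies without tags have NaN in the tag columns; treat them as "tag absent".
merged = merged.fillna(0)

# Optional: restore integer dtype after the NaN round trip.
merged = merged.astype(int)
print(merged)
```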


We read the file with the users' ratings. The file contains 10,000,054 ratings. Each row is a combination of user, movie, rating (from 0.5 to 5 in half-star steps), and timestamp.

In [63]:
ratings = pd.read_csv('ml-10M100K/ratings.dat', delimiter = '::', header = None, engine='python' )
ratings.rename(columns={0: 'UserID', 1: 'MovieID', 2: 'Rating', 3: 'Timestamp'}, inplace=True)
ratings

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,122,5.0,838985046
1,1,185,5.0,838983525
2,1,231,5.0,838983392
3,1,292,5.0,838983421
4,1,316,5.0,838983392
...,...,...,...,...
10000049,71567,2107,1.0,912580553
10000050,71567,2126,2.0,912649143
10000051,71567,2294,5.0,912577968
10000052,71567,2338,2.0,912578016


#### Linear model

We import the required libraries. We get a list of user IDs and then iterate over it. For each user, we:
- get the movies they have rated and the corresponding ratings.
- build a DataFrame with the predictor variables X, which are the genres and tags of each movie.
- build the dependent variable Y, which is the rating of each movie.
- split the data into a training set (80%) and a test set (20%). The latter is used to validate the model.
- fit a Ridge model. We use a regularized linear model because of the multicollinearity that arises when the data are converted into dummy variables. We chose Ridge over Lasso because we do not want the model to "drop" variables by setting their coefficients to exactly 0.
- store the model coefficients in a DataFrame.
- predict on the test set for validation and store the mean squared error (we work with MSE, an average rather than a sum of squared errors, because the number of observations differs from user to user).

We print the mean error over all users and the DataFrame with the linear model coefficients.

In [78]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#list of all users
users = ratings['UserID'].unique()
#empty DataFrame for the model coefficients
coeff = pd.DataFrame()
#list for the per-user error
error = []

for user in users:
    #ratings of this user, sorted by MovieID
    user_ratings = ratings[ratings['UserID'] == user]
    user_ratings = user_ratings.sort_values(by='MovieID') #df
    user_movies = user_ratings['MovieID'].tolist() #list
    
    #movies rated by the user (one-hot encoded)
    #predictor variables
    X = result[result['MovieID'].isin(user_movies)] 
    X = X.reset_index(drop=True).sort_values(by='MovieID') #df
    X = X.set_index('MovieID')
    #ratings
    #dependent variable (same MovieID order as X)
    Y = user_ratings['Rating']
    
    #split into test and training sets (20%, 80%)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
    
    #Ridge linear model
    #without an intercept term
    model = Ridge(fit_intercept=False).fit(X_train, Y_train)
    #coefficients of this user
    coeff_user = pd.DataFrame(model.coef_).transpose()
    #rename the columns to the feature names
    coeff_user.columns = X.columns
    #add a column with the user ID (UserID)
    coeff_user['UserID'] = user
    coeff = pd.concat([coeff, coeff_user])
    
    #validation on the test set
    Y_pred = model.predict(X_test)
    error.append(mean_squared_error(Y_test, Y_pred))
    
coeff = coeff.set_index('UserID')

#mean of the per-user MSE
print('Mean error:', np.mean(error))
#DataFrame with 2,547 columns (the predictor variables) and 69,878 rows (one per user)
coeff

Mean error: 2.499404822558787


UserID,(no genres listed),Action_genre,Adventure_genre,Animation_genre,Children_genre,Comedy_genre,Crime_genre,Documentary_genre,Drama_genre,Fantasy_genre,...,writers,writers at work,writing,wry,wuxia,yakuza,youth,zibri studio,zombie,zombies
1,0.0,0.919216,0.738271,0.381652,1.290167,1.610046,0.534488,0.000000,0.189030,0.425190,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0
2,0.0,0.702326,0.449313,0.000000,0.075185,0.710329,-0.020579,0.000000,0.498736,0.075185,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0
3,0.0,0.271687,0.176024,0.055394,0.055394,1.161958,0.516271,0.000000,1.907107,0.047335,...,0.0,0.0,0.000000,0.019087,0.0,0.000000,0.000000,0.0,0.035255,0.0
4,0.0,0.494784,0.677436,0.066142,0.314168,1.250852,0.305595,0.000000,1.093009,0.547024,...,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0
5,0.0,0.080057,0.199923,-0.018254,-0.018254,0.810756,1.231700,0.000000,2.584088,0.037319,...,0.0,0.0,-0.182834,-0.029188,0.0,0.000000,0.012973,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71563,0.0,1.070128,0.101079,0.053322,0.053322,0.210979,0.606582,0.000000,0.675048,0.018517,...,0.0,0.0,0.000000,0.000000,0.0,0.034118,0.000000,0.0,0.000000,0.0
71564,0.0,0.712589,0.036890,0.016617,0.016617,1.608676,0.064517,0.048512,1.158193,0.120380,...,0.0,0.0,0.000000,0.026568,0.0,0.000000,0.000000,0.0,0.011808,0.0
71565,0.0,1.144791,0.544190,0.680022,0.317095,1.738618,0.216168,0.203038,2.126420,0.211515,...,0.0,0.0,0.000000,0.151728,0.0,0.000000,0.000000,0.0,0.000000,0.0
71566,0.0,0.580557,0.435949,0.009117,0.013534,1.817091,0.352195,0.391059,1.055668,0.013534,...,0.0,0.0,0.091531,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0
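
To make the per-user model more concrete, the sketch below computes ridge coefficients in closed form with NumPy for a single toy user. It assumes the objective documented for scikit-learn's `Ridge`, ||y - Xw||² + α||w||², with no intercept and α = 1.0 (the scikit-learn default, which the cell above does not override). The feature matrix and ratings are invented.

```python
import numpy as np

# Toy user: 4 rated movies described by 3 binary features (invented values).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
y = np.array([4.0, 3.0, 5.0, 4.5])  # invented ratings
alpha = 1.0  # regularization strength; assumed equal to the scikit-learn default

# Closed-form ridge solution without intercept: w = (X^T X + alpha I)^-1 X^T y.
# np.linalg.solve avoids forming the inverse explicitly.
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# In-sample predictions and their mean squared error.
y_hat = X @ w
mse = np.mean((y - y_hat) ** 2)
print('coefficients:', w)
print('MSE:', mse)
```

The regularization term keeps the system solvable even when dummy columns are collinear or when a user has rated fewer movies than there are features, which is the situation described in the list above.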


In [79]:
coeff.to_csv('coeficientes.csv')

After obtaining each user's coefficients, we can use them to compute the similarity between the user's preference vector and the vectors of the movies they have not seen. The movies with the highest similarity (utility for the user) become the recommendations.

We create a list of the users we want recommendations for. For each user, we:
- get the vectors of the movies they have not seen.
- get their linear model coefficients (their preference vector).
- subtract the vectors and compute the Euclidean distance.
- compute the similarity (utility) of each movie as 1 / (1 + distance).
- sort by similarity and keep the top 10 movies, which become the recommendations for this user.

In [None]:
#restore UserID as the index so that coeff.loc[[user]] selects by user ID
coeff = pd.read_csv('coeficientes.csv', index_col='UserID')

In [80]:
desired_users = [1,3,59269] 

for user in desired_users:
    #movies the user has not seen
    user_ratings = ratings[ratings['UserID'] == user]
    movies_not_watched = result[~result['MovieID'].isin(user_ratings['MovieID'].tolist())] 
    movies_not_watched = movies_not_watched.set_index('MovieID')
    
    #coefficients of the user
    user_coeff = coeff.loc[[user]]
    #function that returns the squared difference per feature
    func = lambda x: (np.asarray(x) - np.asarray(user_coeff)[0])**2
    distance = movies_not_watched.apply(func, axis=1)
    distance = pd.DataFrame(distance)
    distance = pd.DataFrame(distance[0].to_list(), columns=movies_not_watched.columns, index=distance.index)
    #Euclidean distance
    distance['distance'] = np.sqrt(distance.sum(axis=1))
    #similarity between the vectors
    distance['similarity'] = 1 / (1 + distance['distance'])

    #sort by similarity
    distance = distance.sort_values(by='similarity', ascending=False)
    recommendations = distance.head(10)
    
    print('')
    print('User', user)
    print('Recommendations')
    print(recommendations[['similarity']])


User 1
Recommendations
         similarity
MovieID            
258        0.248814
7262       0.247732
2041       0.247039
26792      0.247039
2720       0.245164
26828      0.242416
55603      0.240199
6714       0.240199
8883       0.239090
1011       0.239090

User 3
Recommendations
         similarity
MovieID            
6792       0.289499
27797      0.287576
8671       0.287034
26521      0.287034
52838      0.287034
7270       0.287034
6171       0.287034
2556       0.287034
27644      0.287034
4269       0.287034

User 59269
Recommendations
         similarity
MovieID            
7350       0.044190
3582       0.044190
27884      0.044190
49265      0.044096
32314      0.044073
682        0.044070
26176      0.044070
53833      0.044070
1149       0.044070
5402       0.044070
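
The distance and similarity computation can also be written with NumPy broadcasting instead of a row-wise `apply`. The sketch below shows the same formulas on invented data; the movie vectors, the user vector, and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented binary movie vectors, indexed by MovieID.
movie_vectors = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0]],
    index=pd.Index([10, 20, 30], name='MovieID'),
    columns=['Comedy', 'Drama', 'funny'],
)
# Invented preference vector (one coefficient per feature).
user_vector = np.array([0.9, 0.1, 0.8])

# Broadcasting subtracts the user vector from every movie row at once.
diff = movie_vectors.to_numpy() - user_vector
# Euclidean distance per movie, then similarity = 1 / (1 + distance).
dist = np.sqrt((diff ** 2).sum(axis=1))
similarity = pd.Series(1 / (1 + dist), index=movie_vectors.index, name='similarity')

# Highest similarity first; keep the top 2 for this toy example.
top = similarity.sort_values(ascending=False).head(2)
print(top)
```

Working on the whole matrix at once avoids building an intermediate DataFrame of squared differences, which may matter with several thousand feature columns.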


Similarly, with the data we obtained for content-based recommendations, we can also predict the rating a user would give to a movie they have not seen. The movies with the highest predicted rating become the recommendations.

The only difference is that we multiply the coefficients by the feature values of the unseen movies and, for each movie, obtain the rating by summing the products.

In [81]:
#rating prediction for unseen movies

#list of users
desired_users = [1,3] #59269


for user in desired_users:
    #movies the user has not seen
    user_ratings = ratings[ratings['UserID'] == user]
    movies_not_watched = result[~result['MovieID'].isin(user_ratings['MovieID'].tolist())] 
    movies_not_watched = movies_not_watched.set_index('MovieID')
    
    #coefficients of the user
    user_coeff = coeff.loc[[user]]
    #function that multiplies the arrays element-wise
    func = lambda x: np.asarray(x) * np.asarray(user_coeff)[0]
    predicted_ratings = movies_not_watched.apply(func, axis=1)
    predicted_ratings = pd.DataFrame(predicted_ratings)
    predicted_ratings = pd.DataFrame(predicted_ratings[0].to_list(), columns=movies_not_watched.columns, index=predicted_ratings.index)
    
    #rating of each movie (sum of the products)
    predicted_ratings['Rating'] = predicted_ratings.sum(axis=1)
    predicted_ratings = predicted_ratings.sort_values(by='Rating', ascending=False)
    recommendations = predicted_ratings.head(10)
    
    print('')
    print('User', user)
    print('Recommendations')
    print(recommendations[['Rating']])


User 3
Recommendations
           Rating
MovieID          
356      7.992616
608      6.980432
2959     6.765860
3418     6.711573
296      6.638151
7361     6.529750
778      6.401008
587      6.134478
8949     6.124589
541      6.018033

User 1
Recommendations
           Rating
MovieID          
4306     8.621632
1197     7.757841
8961     7.266438
1        7.218827
380      7.138364
8360     6.981210
7361     6.952783
2080     6.864713
546      6.816515
296      6.609084
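
Multiplying each feature by its coefficient and summing is a dot product, so the predicted ratings for all unseen movies can be obtained with a single matrix-vector product. The sketch below uses invented vectors and hypothetical names. Because the model was fitted without an intercept, this product should match what `model.predict` returns for the same rows.

```python
import numpy as np
import pandas as pd

# Invented binary movie vectors, indexed by MovieID.
movie_vectors = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0]],
    index=pd.Index([10, 20, 30], name='MovieID'),
    columns=['Comedy', 'Drama', 'funny'],
)
# Invented coefficient vector for one user.
user_vector = np.array([2.0, 1.0, 0.5])

# Predicted rating = sum over features of (feature value * coefficient).
predicted = pd.Series(
    movie_vectors.to_numpy() @ user_vector,
    index=movie_vectors.index,
    name='Rating',
)

# Highest predicted rating first.
ranking = predicted.sort_values(ascending=False)
print(ranking)
```

A linear model is not constrained to the rating scale, which is consistent with the predicted values above 5 in the output; they are best read as a ranking score rather than as literal star ratings.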
