# Movies Recommender

<img src="figures/recommendation-system-project.png" width="50%">


In [1]:
# load libraries
import pandas as pd 
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as cos
from sklearn.metrics import jaccard_score as jacc

## Step 1: Compute the user-item rating (interaction) matrix (L529x100)

### Reading the movie ratings data:


- **userId**: ID of the user.

- **movieId**: ID of the movie.

- **rating**: rating the user gives to each movie.

- **timestamp**: timestamp of the record in the database.

In [6]:
# Load the dataset 
ratings = pd.read_csv("datasets/ratings.csv", encoding="ISO-8859-1")
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Data preparation

Let's see how many movies the dataset contains.

In [7]:
# Number of movies in the dataset
np.unique(ratings["movieId"]).shape

(9724,)

To reduce the computational cost, we keep only the first 100 movies (the 100 smallest `movieId` values).

In [8]:
# take a subset of the data to make it easier to work with: the first 100 movies
num_movies = 100
ratings = ratings[ratings["movieId"] < np.unique(ratings["movieId"])[num_movies]]
np.unique(ratings["movieId"]).shape

(100,)
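
The filter above keeps every `movieId` strictly below the 101st smallest ID, which amounts to keeping the 100 smallest IDs. As a minimal sketch on a made-up toy table (the data and the names `toy`, `by_threshold` and `by_membership` are illustrative assumptions, not part of the dataset), the same selection can also be written with `isin`:

```python
import numpy as np
import pandas as pd

# Toy ratings table (made-up values, for illustration only)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 47, 10, 3, 50, 6],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5, 1.0],
})

num_movies = 3
unique_ids = np.unique(toy["movieId"])  # sorted unique IDs: 3, 6, 10, 47, 50

# Threshold form, as in the cell above: IDs strictly below the (num_movies + 1)-th smallest ID
by_threshold = toy[toy["movieId"] < unique_ids[num_movies]]

# Membership form: keep rows whose ID is among the num_movies smallest IDs
by_membership = toy[toy["movieId"].isin(unique_ids[:num_movies])]

print(np.unique(by_threshold["movieId"]).tolist())  # [3, 6, 10]
print(by_threshold.equals(by_membership))           # True
```

One practical difference: the threshold form indexes `unique_ids[num_movies]`, so it raises an `IndexError` when `num_movies` is not smaller than the number of distinct movies, while the slice in the membership form simply keeps everything.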

We drop the columns we do not need.

In [9]:
# drop the timestamp column
del ratings['timestamp']

As you may have noticed, the dataset is not in the format we need, namely a matrix L where each row is a user and each column is an item.

To transform the original dataset into this matrix L, we use the Pandas function `pivot_table`.

In [10]:
help(pd.pivot_table)

Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) -> 'DataFrame'
    Create a spreadsheet-style pivot table as a DataFrame.
    
    The levels in the pivot table will be stored in MultiIndex objects
    (hierarchical indexes) on the index and columns of the result DataFrame.
    
    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same lengt

In [11]:
# pivot the data to obtain a matrix with users in the rows
# and movies in the columns, holding the corresponding ratings
new_ratings = pd.pivot_table(ratings, values='rating', 
                     index=['userId'], 
                     columns='movieId')
new_ratings

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,102,103,104,105,106,107,108,110,111,112
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,4.0,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,4.0,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,4.0,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,1.0,,4.0,3.0,,,,5.0,,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,3.5,,,,3.5,3.5,
607,4.0,,,,,,,,,,...,,,,,,,,5.0,,2.0
608,2.5,2.0,2.0,,,,,,,4.0,...,,,3.0,,,3.0,,4.0,3.0,
609,3.0,,,,,,,,,4.0,...,,,,,,,,3.0,,
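
To make the reshaping concrete, here is a small sketch of `pd.pivot_table` on a made-up long-format table (the values and the names `toy` and `wide` are illustrative assumptions). It shows how each (user, movie) pair becomes one cell and how unrated pairs turn into `NaN`:

```python
import pandas as pd

# Toy long-format ratings (made-up values, for illustration only)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become NaN
wide = pd.pivot_table(toy, values="rating", index="userId", columns="movieId")

print(wide.shape)                    # (3, 3)
print(wide.loc[1, 10])               # 4.0
print(int(wide.isna().sum().sum()))  # 5 missing user-movie pairs out of 9
```

Because the default `aggfunc` is `'mean'`, a user who rated the same movie more than once would end up with the average of those ratings in the corresponding cell.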


As you can see, there are many missing values, which correspond to movies that a given user has not rated. We replace all of them with 0.

This is the most basic approach; it would be better to fill them using the mean ratings per user and per movie, but we do it this way for illustration.

In [12]:
# replace the NA values with 0
new_ratings = new_ratings.fillna(0)
new_ratings

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,102,103,104,105,106,107,108,110,111,112
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0
6,0.0,4.0,5.0,3.0,5.0,4.0,4.0,3.0,0.0,3.0,...,1.0,0.0,4.0,3.0,0.0,0.0,0.0,5.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,3.5,0.0,0.0,0.0,3.5,3.5,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,2.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,3.0,0.0,0.0,3.0,0.0,4.0,3.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
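
As a point of comparison with zero-filling, the mean-based alternative mentioned earlier could look roughly like the sketch below. It fills each missing rating with that user's own mean rating; the toy matrix and the names `wide`, `user_means` and `filled` are assumptions made for illustration, and this is only one of several possible mean-based schemes (per-movie means or a blend of both are others).

```python
import numpy as np
import pandas as pd

# Toy user-item matrix with missing ratings (made-up values)
wide = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, np.nan],
     [3.0, 3.0, np.nan]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Mean rating of each user, ignoring the missing entries
user_means = wide.mean(axis=1)

# Fill each user's missing ratings with that user's own mean
filled = wide.apply(lambda row: row.fillna(user_means[row.name]), axis=1)

print(filled.loc[1, 20])  # 3.0, the mean of user 1's ratings (4.0 and 2.0)
print(filled.loc[2, 10])  # 5.0, user 2 rated a single movie with 5.0
```

A user with no ratings at all would have a `NaN` mean and would stay unfilled, so such rows would still need a fallback value.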


The final table is a 529x100 matrix L, that is, 529 users and 100 movies.

In [13]:
# dimensions of the new table
new_ratings.shape

(529, 100)

## Step 2: Compute the similarity matrix between movies (S100x100)

Once again we use cosine similarity to compute the similarities between items (movies, in this case). Transposing `new_ratings` puts one movie per row, so each movie is compared through its vector of user ratings.

In [14]:
# compute the cosine similarity between movies
sim = cos(X = np.transpose(new_ratings))
sim

array([[1.        , 0.41056206, 0.2969169 , ..., 0.48105392, 0.34135086,
        0.32822857],
       [0.41056206, 1.        , 0.28243799, ..., 0.41531401, 0.22871507,
        0.19350527],
       [0.2969169 , 0.28243799, 1.        , ..., 0.25074299, 0.15018003,
        0.26106922],
       ...,
       [0.48105392, 0.41531401, 0.25074299, ..., 1.        , 0.36078562,
        0.26866901],
       [0.34135086, 0.22871507, 0.15018003, ..., 0.36078562, 1.        ,
        0.15259629],
       [0.32822857, 0.19350527, 0.26106922, ..., 0.26866901, 0.15259629,
        1.        ]])

Let's check the dimensions of the similarity matrix.

In [15]:
# What are the dimensions of the similarity matrix?
print('Dimensions = ' + str(sim.shape))

Dimensions = (100, 100)
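
For intuition about what the similarity matrix contains, the cosine similarity between two item vectors a and b is a·b / (‖a‖ ‖b‖). The following is a minimal NumPy sketch of that formula on a made-up 3x3 rating matrix; the data and the names `L`, `items`, `unit` and `S` are illustrative assumptions, and the sketch is not the scikit-learn implementation itself.

```python
import numpy as np

# Toy user-item matrix: 3 users (rows) x 3 items (columns), made-up values
L = np.array([
    [4.0, 0.0, 4.0],
    [0.0, 5.0, 0.0],
    [2.0, 0.0, 2.0],
])

items = L.T                            # one row per item, as with np.transpose above
norms = np.linalg.norm(items, axis=1)  # Euclidean norm of each item vector
norms[norms == 0] = 1.0                # items nobody rated keep a zero vector
unit = items / norms[:, None]          # scale each item vector to unit length
S = unit @ unit.T                      # S[i, j] = cosine similarity of items i and j

print(S.shape)  # (3, 3)
print(np.allclose(S, [[1, 0, 1], [0, 1, 0], [1, 0, 1]]))  # True
```

Items 0 and 2 were rated identically by every user, so their similarity is 1; item 1 shares no raters with them, so its similarity to both is 0. The matrix is symmetric with ones on the diagonal, which matches the pattern visible in the `sim` output above.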


## Step 3: Multiply the rating matrix by the similarity matrix to find the item most likely to interest each user

We multiply each user's interaction vector by the 100x100 matrix `sim`. Let's start with an example for the user with `userId` 3.

In [27]:
# Build the interest vector for userId 3
interes_user = np.matmul(np.transpose(new_ratings.loc[3]),sim)
interes_user

array([0.08919115, 0.12242016, 0.10384738, 0.14072192, 0.12612045,
       0.12493966, 0.13040986, 0.0618676 , 0.09325648, 0.10953536,
       0.1167058 , 0.03711561, 0.02435405, 0.10023025, 0.0566988 ,
       0.15955689, 0.12844715, 0.06226388, 0.07832284, 0.04468134,
       0.11601145, 0.09527901, 0.07133849, 0.13259877, 0.111825  ,
       0.16367374, 0.11474318, 0.02309003, 0.03497231, 0.        ,
       0.5       , 0.10053874, 0.10417679, 0.11755513, 0.01623603,
       0.11503792, 0.        , 0.11966456, 0.03570385, 0.10745863,
       0.15024817, 0.10126354, 0.13840959, 0.13069755, 0.09038294,
       0.        , 0.1151603 , 0.11866681, 0.        , 0.07971735,
       0.07306215, 0.08699588, 0.01018399, 0.08134009, 0.07518165,
       0.1729233 , 0.0205223 , 0.        , 0.07830499, 0.08677133,
       0.        , 0.00702748, 0.05936947, 0.08575096, 0.04028047,
       0.0344951 , 0.07207143, 0.02261217, 0.08928961, 0.        ,
       0.08771039, 0.10435214, 0.        , 0.0693046 , 0.08891

We find the item with the highest value for this user.

In [25]:
top1 = np.argmax(interes_user)
top1

30

Therefore, the top recommendation for `userId` 3 is the movie at column position 30 (0-based) of `new_ratings`. Note that `np.argmax` returns a position, not a `movieId`; `new_ratings.columns[30]` gives the corresponding `movieId`.
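
The single-user computation can be traced by hand on a small example. The sketch below uses a made-up 3x3 similarity matrix and hypothetical column labels (`movie_ids`, `user_ratings`, `S` and the values in them are assumptions for illustration) to show the vector-matrix product, the `argmax`, and the extra step of turning the 0-based position returned by `np.argmax` into a `movieId`:

```python
import numpy as np
import pandas as pd

# Hypothetical column labels of the rating matrix
movie_ids = pd.Index([10, 20, 30], name="movieId")

# A user who rated only the movie with ID 20, with a score of 0.5
user_ratings = np.array([0.0, 0.5, 0.0])

# Made-up item-item similarity matrix (symmetric, ones on the diagonal)
S = np.array([
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])

interest = user_ratings @ S             # 0.5 * row 1 of S -> [0.1, 0.5, 0.2]
top_pos = int(np.argmax(interest))      # position of the highest score -> 1
top_movie_id = int(movie_ids[top_pos])  # label at that position -> 20

print(top_pos, top_movie_id)  # 1 20
```

In this toy case the highest score belongs to the movie the user already rated, because every item has similarity 1 with itself. Whether that is acceptable depends on the use case.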

Now we do the same for all users in a single step. We multiply the matrix of all users' interactions, L (529x100), by the similarity matrix `sim` (100x100).

This gives the interest matrix I (529x100), where:

$$L_{529 \times 100} \cdot S_{100 \times 100} = I_{529 \times 100}$$

In [18]:
# Build the users' interest matrix
matriz_interes_users = np.matmul(new_ratings, sim)
matriz_interes_users

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,102,103,104,105,106,107,108,110,111,112
1,14.811554,10.773412,11.092618,2.515074,7.757282,15.092448,7.757281,4.907405,4.609195,11.675893,...,2.354306,1.707382,11.274231,4.148923,0.172005,6.450271,2.235830,15.862341,11.416311,8.513057
3,0.089191,0.122420,0.103847,0.140722,0.126120,0.124940,0.130410,0.061868,0.093256,0.109535,...,0.022029,0.000000,0.094563,0.082478,0.000000,0.092788,0.000000,0.146341,0.086427,0.060813
4,4.727342,3.375838,3.447148,1.859406,3.104564,5.450892,4.161224,1.634746,1.705860,3.998947,...,0.548394,1.532693,3.280830,2.222521,5.709307,1.920727,1.691891,5.017887,4.162392,4.087645
5,14.379056,9.838357,7.709797,3.019508,7.083995,11.238392,8.403417,3.537087,2.872262,10.159275,...,1.553202,1.719576,8.533247,5.901166,0.950777,5.714782,2.815002,14.516611,9.443482,7.533456
6,45.184721,46.097227,50.784418,32.875592,50.214188,51.395529,51.684598,38.914071,26.561304,45.152446,...,18.955089,19.726501,45.075799,31.607242,0.928717,27.654433,4.720440,49.963055,30.151986,46.542671
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,22.617154,18.697069,15.289605,9.067902,14.580405,20.123632,20.713832,9.195805,5.341041,17.571609,...,4.251834,4.628820,16.041625,17.319686,0.669465,10.450286,4.815802,24.015817,21.197835,14.791248
607,12.235452,8.351102,6.886511,3.674156,7.588112,10.122438,8.194105,5.057798,4.079190,9.189962,...,2.066616,3.020351,7.517587,5.435817,0.000000,4.404844,1.810283,12.774316,6.930989,8.987510
608,26.326027,23.728658,20.117196,6.772310,16.289792,24.336092,16.108098,11.131629,8.553971,24.989048,...,5.040733,3.937798,23.615417,9.568758,0.593615,15.695859,2.983199,28.553900,20.708393,17.791362
609,6.025455,4.148400,2.614796,0.782161,2.567627,4.051591,2.547221,1.515299,1.215293,6.661503,...,0.927061,0.341249,3.566502,1.890755,0.000000,2.076331,0.251174,6.409540,2.934511,2.931102


We check the dimensions of this interest matrix I.

In [19]:
# dimensions of the interest matrix
matriz_interes_users.shape

(529, 100)

Finally, we extract the top recommendation for each user.

In [20]:
# get the column position of the highest-interest item for each user
tops1 = [np.argmax(matriz_interes_users.iloc[row]) for row in range(matriz_interes_users.shape[0])]
tops1

[43,
 30,
 20,
 97,
 84,
 0,
 43,
 37,
 5,
 35,
 43,
 31,
 43,
 46,
 43,
 43,
 0,
 44,
 9,
 95,
 31,
 5,
 43,
 32,
 31,
 46,
 97,
 0,
 31,
 20,
 9,
 20,
 24,
 10,
 46,
 0,
 43,
 97,
 0,
 99,
 31,
 97,
 30,
 44,
 97,
 98,
 2,
 31,
 97,
 31,
 97,
 98,
 44,
 46,
 97,
 43,
 31,
 31,
 97,
 46,
 33,
 55,
 43,
 0,
 98,
 43,
 43,
 0,
 5,
 46,
 31,
 97,
 97,
 33,
 48,
 0,
 53,
 46,
 92,
 24,
 31,
 95,
 97,
 97,
 31,
 0,
 0,
 21,
 84,
 43,
 0,
 1,
 43,
 97,
 74,
 31,
 43,
 92,
 31,
 10,
 31,
 43,
 97,
 24,
 9,
 4,
 5,
 43,
 43,
 46,
 97,
 43,
 97,
 97,
 97,
 43,
 31,
 43,
 97,
 31,
 97,
 98,
 18,
 97,
 0,
 46,
 97,
 46,
 31,
 4,
 31,
 24,
 92,
 43,
 0,
 0,
 20,
 0,
 31,
 0,
 33,
 97,
 97,
 31,
 92,
 98,
 0,
 32,
 31,
 97,
 10,
 97,
 0,
 43,
 31,
 10,
 5,
 97,
 0,
 1,
 43,
 6,
 31,
 9,
 0,
 44,
 5,
 55,
 46,
 31,
 24,
 97,
 31,
 31,
 97,
 43,
 24,
 88,
 97,
 46,
 0,
 0,
 46,
 32,
 0,
 31,
 31,
 31,
 43,
 97,
 35,
 61,
 92,
 43,
 46,
 97,
 43,
 97,
 43,
 0,
 20,
 31,
 31,
 1,
 43,
 31,
 9,
 97,
 1
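
The values in the list above are 0-based column positions returned by `np.argmax`, not `movieId` values. The following sketch, on a made-up 3-user example, shows one way to compute all the positions at once and map them back to movie IDs. It also includes an optional variant that skips movies the user has already rated. That variant is not part of the notebook; it assumes a rating of 0 means "not rated". All data and names (`ratings_wide`, `S`, `recs`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy zero-filled rating matrix (made-up values); 0 is assumed to mean "not rated"
ratings_wide = pd.DataFrame(
    [[5.0, 0.0, 0.0],
     [0.0, 0.5, 0.0],
     [4.0, 0.0, 3.0]],
    index=pd.Index([1, 3, 4], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Made-up item-item similarity matrix
S = np.array([
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.4],
    [0.6, 0.4, 1.0],
])

L = ratings_wide.to_numpy()
interest = L @ S                    # interest matrix I = L . S

# Plain argmax per row, equivalent to the list comprehension above
top_pos = interest.argmax(axis=1)   # [0, 1, 0]

# Optional variant: ignore movies the user has already rated
masked = np.where(L > 0, -np.inf, interest)
top_pos_unseen = masked.argmax(axis=1)  # [2, 2, 1]

# Map positions back to movieId labels, indexed by userId
recs = pd.Series(
    ratings_wide.columns.to_numpy()[top_pos_unseen],
    index=ratings_wide.index,
    name="recommended_movieId",
)
print(recs.tolist())  # [30, 30, 20]
```

Without the mask, all three toy users would be pointed to a movie they have already rated. A user who has rated every movie would get an all `-inf` row, for which `argmax` returns position 0, so that case would need separate handling.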

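## Results:

The entries of `tops1` follow the row order of `new_ratings` (userId 1, 3, 4, 5, 6, ...). Each value is a 0-based column position returned by `np.argmax`, not a `movieId`; `new_ratings.columns[position]` gives the corresponding `movieId`.

**userId 1:**
The recommended movie is the one at column position 43.

**userId 3:**
The recommended movie is the one at column position 30.

**userId 4:**
The recommended movie is the one at column position 20.

**userId 5:**
The recommended movie is the one at column position 97.

**userId 6:**
The recommended movie is the one at column position 84.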

...