# Práctico 2: Recomendación de videojuegos

En este práctico trabajaremos con un subconjunto de datos sobre [videojuegos de Steam](http://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data). Para facilitar un poco el práctico, se les dará el conjunto de datos previamente procesado. En este mismo notebook mostraremos el proceso de limpieza, para que quede registro del proceso (de todas maneras, por el tamaño de los datos no recomendamos que pierdan tiempo en el proceso salvo que lo consideren útil a fines personales). 

El conjunto de datos se basa en dos partes: lista de juegos (items), y lista de reviews de usuarios sobre distintos juegos. Este último, en su versión original es muy grande, (pesa 1.3GB), por lo que será solo una muestra del mismo sobre la que trabajarán.

A diferencia del conjunto de datos de LastFM utilizados en el [Práctico 1](./practico1.ipynb), en este caso los datos no están particularmente pensados para un sistema de recomendación, por lo que requerirá de un poco más de trabajo general sobre el dataset.

La idea es que, de manera similar al práctico anterior, realicen un sistema de recomendación. A diferencia del práctico anterior, este será un poco más completo y deberán hacer dos sistemas, uno que, dado un nombre de usuario le recomiende una lista de juegos, y otro que dado el título de un juego, recomiende una lista de juegos similares. Además, en este caso se requiere que el segundo sistema (el que recomienda juegos basado en el nombre de un juego en particular) haga uso de la información de contenido (i.e. o bien harán un filtrado basado en contenido o algo híbrido).

## Obtención y limpieza del conjunto de datos

El conjunto de datos originalmente se encuentra en archivos que deberían ser de formato "JSON". Sin embargo, en realidad es un archivo donde cada línea es un objeto de JSON. Hay un problema no obstante y es que las líneas están mal formateadas, dado que no respetan el estándar JSON de utilizar comillas dobles (**"**) y en su lugar utilizan comillas simples (**'**). Afortunadamente, se pueden evaluar como diccionarios de Python, lo cuál permite trabajarlos directamente.

### Descarga

La siguiente celda descarga los conjuntos de datos crudos. Nuevamente, no es necesario ejecutarla y pueden ir [más abajo](#Conjunto-de-datos-limpio) para ejecutar la celda que descargará el conjunto ya procesado.

In [None]:
%%bash

mkdir -p data/steam/
curl -L -o data/steam/steam_games.json.gz http://cseweb.ucsd.edu/\~wckang/steam_games.json.gz
curl -L -o data/steam/steam_reviews.json.gz http://cseweb.ucsd.edu/\~wckang/steam_reviews.json.gz

### Carga de datos

Como se dijo, por la naturaleza de los datos, necesitamos utilizar Python para trabajarlos (no podemos leerlos con JSON).

In [None]:
import gzip
from tqdm import tqdm_notebook  # To print a progress bar (comes with Anaconda or can be installed)

with gzip.open("./data/steam/steam_games.json.gz") as fh:
    games = []
    for game in tqdm_notebook(fh, total=32135):
        try:
            games.append(eval(game))
        except SyntaxError:
            continue

print("Loaded {} games".format(len(games)))

with gzip.open("./data/steam/steam_reviews.json.gz") as fh:
    reviews = []
    for review in tqdm_notebook(fh, total=7793069):
        try:
            reviews.append(eval(review))
        except SyntaxError:
            continue

print("Loaded {} user reviews".format(len(reviews)))

### Exploración de los datos

En esta parte necesitamos revisar la estructura general, para poder pasarlos a un formato más amigable (e.g. CSV).

In [None]:
games[0]

In [None]:
reviews[0]

### Transformación de los datos

Viendo los datos que tenemos de cada tipo, podemos utilizar pandas para leer los registros y trabajar con algo más sencillo.

In [None]:
import pandas as pd

In [None]:
games = pd.DataFrame.from_records(games)
games.head(3)

In [None]:
reviews = pd.DataFrame.from_records(reviews)
reviews.head(3)

### Selección de características

Teniendo los datos, podemos hacer una selección muy superficial (no basada en EDA) de algunas características que consideremos irrelevantes. En particular, para el caso del dataset de juegos, vemos que las columnas `url` y `reviews_url` no son útiles a los propósitos de este práctico, por lo que las removeremos.

Por el lado del dataset de opiniones todas parecen útiles. Aunque, si vemos muy por arriba `recommended` vemos que para todos los valores son `True`, por lo que la podemos sacar también.

In [None]:
games.drop(columns=["url", "reviews_url"], inplace=True)
games.head(3)

In [None]:
reviews.drop(columns=["recommended"], inplace=True)
reviews.head(3)

### Muestreo y guarda de datos

Como dijimos, tenemos muchas reviews. Sería excelente trabajarlas a todas, pero el dataset es medio pesado (en RAM llega a ocupar más de 8 GB). Por lo que optaremos por hacer un muestreo de reviews. Esto quiere decir que, probablemente, algunos usuarios/juegos queden afuera. Podríamos hacer algún muestreo estratificado, pero iremos por algo más sencillo. Dejaremos aproximadamente el 10% del dataset (700 mil reviews).

El conjunto de datos de juegos lo dejaremos como está. Lo guardaremos con formato JSON para conservar la información de aquellas columnas que sean de tipo lista.

In [None]:
games.to_json("./data/steam/games.json.gz", orient="records")
reviews.sample(n=int(7e5), random_state=42).to_json("./data/steam/reviews.json.gz", orient="records")

## Conjunto de datos limpio

Para descargar el conjunto de datos que se utilizará en el práctico, basta con ejecutar la siguiente celda.

In [1]:
%%bash

mkdir -p data/steam/
curl -L -o data/steam/games.json.gz https://cs.famaf.unc.edu.ar/\~ccardellino/diplomatura/games.json.gz
curl -L -o data/steam/reviews.json.gz https://cs.famaf.unc.edu.ar/\~ccardellino/diplomatura/reviews.json.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 98 1927k   98 1904k    0     0  2151k      0 --:--:-- --:--:-- --:--:-- 2148k100 1927k  100 1927k    0     0  1842k      0  0:00:01  0:00:01 --:--:-- 1842k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  2  121M    2 2672k    0     0  4423k      0  0:00:28 --:--:--  0:00:28 4416k 13  121M   13 16.2M    0     0  10.4M      0  0:00:11  0:00:01  0:00:10 10.4M 24  121M   24 29.5M    0     0  11.4M      0  0:00:10  0:00:02  0:00:08 11.4M 35  121M   35 42.6M    0     0  11.9M      0  0:00:10  0:00:03  0:00:07 11.9M 46  121M   46 55.8M    0     0  12.2M      0  0:0

## Ejercicio 1: Análisis Exploratorio de Datos

Ya teniendo los datos, podemos cargarlos y empezar con el práctico. Antes que nada vamos a hacer una exploración de los datos. Lo principal a tener en cuenta para este caso es que debemos identificar las variables con las que vamos a trabajar. A diferencia del práctico anterior, este conjunto de datos no está documentado, por lo que la exploración es necesaria para poder entender que cosas van a definir nuestro sistema de recomendación.

In [2]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import train_test_split

### Características del conjunto de datos sobre videojuegos

Las características del conjunto de datos de videojuegos tienen la información necesaria para hacer el "vector de contenido" utilizado en el segundo sistema de recomendación. Su tarea es hacer un análisis sobre dicho conjunto de datos y descartar aquella información redundante.

In [3]:
games = pd.read_json("./data/steam/games.json.gz")
games.head()

Unnamed: 0,publisher,genres,app_name,title,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,[Single-player],4.99,False,761140.0,Kotoshiro,,
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980.0,Secret Level SRL,Mostly Positive,
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290.0,Poolians.com,Mostly Positive,
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",0.83,[Single-player],0.99,False,767400.0,彼岸领域,,
4,,,Log Challenge,,,"[Action, Indie, Casual, Sports]",1.79,"[Single-player, Full controller support, HTC V...",2.99,False,773570.0,,,


In [4]:
df_games_title = games[['title','app_name']]
df_games_title.head()

Unnamed: 0,title,app_name
0,Lost Summoner Kitty,Lost Summoner Kitty
1,Ironbound,Ironbound
2,Real Pool 3D - Poolians,Real Pool 3D - Poolians
3,弹炸人2222,弹炸人2222
4,,Log Challenge


In [5]:
df_games_tags = games[['genres','tags']]
df_games_tags.head()

Unnamed: 0,genres,tags
0,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]"
1,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game..."
2,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind..."
3,"[Action, Adventure, Casual]","[Action, Adventure, Casual]"
4,,"[Action, Indie, Casual, Sports]"


In [6]:
games.drop(columns=["genres","title"], inplace=True)

In [7]:
games.head()

Unnamed: 0,publisher,app_name,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,[Single-player],4.99,False,761140.0,Kotoshiro,,
1,"Making Fun, Inc.",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980.0,Secret Level SRL,Mostly Positive,
2,Poolians.com,Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290.0,Poolians.com,Mostly Positive,
3,彼岸领域,弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",0.83,[Single-player],0.99,False,767400.0,彼岸领域,,
4,,Log Challenge,,"[Action, Indie, Casual, Sports]",1.79,"[Single-player, Full controller support, HTC V...",2.99,False,773570.0,,,


### Características del conjunto de datos de reviews

Este será el conjunto de datos a utilizar para obtener información sobre los usuarios y su interacción con videojuegos. Como se puede observar no hay un rating explícito, sino uno implícito a calcular, que será parte de su trabajo (deberán descubrir que característica les puede dar información que puede ser equivalente a un rating).

In [8]:
reviews = pd.read_json("./data/steam/reviews.json.gz")
reviews.head()

Unnamed: 0,username,product_id,page_order,text,hours,products,date,early_access,page,compensation,found_funny,user_id
0,SPejsMan,227940,0,Just one word... Balance!,23.0,92.0,2015-02-25,True,3159,,,
1,Spodermen,270170,4,Graphics: none\nMusic: Makes me want to sleep\...,4.9,217.0,2014-08-26,False,231,,,7.65612e+16
2,josh,41700,1,"cheeki breeki iv danke, stalker",53.2,78.0,2015-12-25,False,191,,,
3,Sammyrism,332310,9,I am really underwhelmed by the small about of...,16.2,178.0,2015-06-04,True,570,,,
4,moonmirroir,303210,9,"I came into the game expecting nothing, of cou...",1.8,13.0,2015-10-02,False,967,,,


Según un análisis inicial se puede tomar como las horas de juego como un rating, por lo que buscamos el máximo y el mínimo y luego dividimos por el maximo para obtener valor entre 0 y 1 

In [9]:
max_hours = reviews.hours.max()
min_hours = reviews.hours.min()

max_hours , min_hours

(18570.9, 0.0)

In [10]:
reviews['rating'] = reviews['hours'] / max_hours
reviews.head()

Unnamed: 0,username,product_id,page_order,text,hours,products,date,early_access,page,compensation,found_funny,user_id,rating
0,SPejsMan,227940,0,Just one word... Balance!,23.0,92.0,2015-02-25,True,3159,,,,0.001238
1,Spodermen,270170,4,Graphics: none\nMusic: Makes me want to sleep\...,4.9,217.0,2014-08-26,False,231,,,7.65612e+16,0.000264
2,josh,41700,1,"cheeki breeki iv danke, stalker",53.2,78.0,2015-12-25,False,191,,,,0.002865
3,Sammyrism,332310,9,I am really underwhelmed by the small about of...,16.2,178.0,2015-06-04,True,570,,,,0.000872
4,moonmirroir,303210,9,"I came into the game expecting nothing, of cou...",1.8,13.0,2015-10-02,False,967,,,,9.7e-05


## Ejercicio 2 - Sistema de Recomendación Basado en Usuarios

Este sistema de recomendación deberá entrenar un algoritmo y desarrollar una interfaz que, dado un usuario, le devuelva una lista con los juegos más recomendados.

*** Un poco de limpieza de datos ***

In [11]:
games.head()

Unnamed: 0,publisher,app_name,release_date,tags,discount_price,specs,price,early_access,id,developer,sentiment,metascore
0,Kotoshiro,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.49,[Single-player],4.99,False,761140.0,Kotoshiro,,
1,"Making Fun, Inc.",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980.0,Secret Level SRL,Mostly Positive,
2,Poolians.com,Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290.0,Poolians.com,Mostly Positive,
3,彼岸领域,弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",0.83,[Single-player],0.99,False,767400.0,彼岸领域,,
4,,Log Challenge,,"[Action, Indie, Casual, Sports]",1.79,"[Single-player, Full controller support, HTC V...",2.99,False,773570.0,,,


In [12]:
reviews.head(10)

Unnamed: 0,username,product_id,page_order,text,hours,products,date,early_access,page,compensation,found_funny,user_id,rating
0,SPejsMan,227940,0,Just one word... Balance!,23.0,92.0,2015-02-25,True,3159,,,,0.001238
1,Spodermen,270170,4,Graphics: none\nMusic: Makes me want to sleep\...,4.9,217.0,2014-08-26,False,231,,,7.65612e+16,0.000264
2,josh,41700,1,"cheeki breeki iv danke, stalker",53.2,78.0,2015-12-25,False,191,,,,0.002865
3,Sammyrism,332310,9,I am really underwhelmed by the small about of...,16.2,178.0,2015-06-04,True,570,,,,0.000872
4,moonmirroir,303210,9,"I came into the game expecting nothing, of cou...",1.8,13.0,2015-10-02,False,967,,,,9.7e-05
5,brotherdave84,311340,4,"i havent got to play this game yet,IT WILL NOT...",45.8,8.0,2014-12-25,False,406,,,7.65612e+16,0.002466
6,Amaraen,422970,1,If you enjoy skill-based FPS games and don't m...,35.6,803.0,2016-06-22,False,205,,3.0,,0.001917
7,CaptainPlanet,214950,1,"If you like slaughtering in the name of Rome, ...",203.3,274.0,2016-10-27,False,285,,,7.65612e+16,0.010947
8,BlacKobra246,440,6,Good game\nI recommend this game\nbecause is g...,2.5,7.0,2015-05-10,False,14090,,1.0,,0.000135
9,TpaXep,526790,0,Hi! My name's Sasha Zenko. I'm from Belarus.\n...,0.7,1213.0,2016-10-23,False,21,Product received for free,2.0,,3.8e-05


In [13]:
reviews_short = reviews.drop(columns=['page_order','text','hours','products','date','early_access','page',
                                  'compensation','found_funny','username'])

reviews_short.head()


Unnamed: 0,product_id,user_id,rating
0,227940,,0.001238
1,270170,7.65612e+16,0.000264
2,41700,,0.002865
3,332310,,0.000872
4,303210,,9.7e-05


In [14]:
reviews_short.size

2100000

In [15]:
reviews_short = reviews_short.dropna()
reviews_short.size

854565

In [16]:
reviews_short = reviews_short.reset_index()
reviews_short.head()

Unnamed: 0,index,product_id,user_id,rating
0,1,270170,7.65612e+16,0.000264
1,5,311340,7.65612e+16,0.002466
2,7,214950,7.65612e+16,0.010947
3,10,10500,7.65612e+16,0.00063
4,12,360940,7.65612e+16,0.000237


In [17]:
reviews_short = reviews_short.drop(columns=['index'])
reviews_short.head()

Unnamed: 0,product_id,user_id,rating
0,270170,7.65612e+16,0.000264
1,311340,7.65612e+16,0.002466
2,214950,7.65612e+16,0.010947
3,10500,7.65612e+16,0.00063
4,360940,7.65612e+16,0.000237


In [18]:
games['product_id'] = games['id']
games_short = games.drop(columns=['publisher','release_date','discount_price','specs','price','early_access','developer',
                                  'sentiment','metascore','id'])
games_short.head()

Unnamed: 0,app_name,tags,product_id
0,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",761140.0
1,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0
2,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0
3,弹炸人2222,"[Action, Adventure, Casual]",767400.0
4,Log Challenge,"[Action, Indie, Casual, Sports]",773570.0


In [19]:
games_short.size

96405

In [20]:
games_short = games_short.dropna()
games_short.size

95910

In [21]:
games_short.head()

Unnamed: 0,app_name,tags,product_id
0,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",761140.0
1,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0
2,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0
3,弹炸人2222,"[Action, Adventure, Casual]",767400.0
4,Log Challenge,"[Action, Indie, Casual, Sports]",773570.0


In [22]:
games_short_inner = pd.merge(left=games_short,right=reviews_short, left_on='product_id', right_on='product_id')

games_short_inner.shape
games_short_inner


Unnamed: 0,app_name,tags,product_id,user_id,rating
0,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0,7.656120e+16,0.000022
1,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0,7.656120e+16,0.000059
2,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0,7.656120e+16,0.001061
3,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0,7.656120e+16,0.001152
4,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0,7.656120e+16,0.000027
...,...,...,...,...,...
285122,Counter-Strike: Condition Zero,"[Action, FPS, Shooter, Multiplayer, Singleplay...",80.0,7.656120e+16,0.000011
285123,Counter-Strike: Condition Zero,"[Action, FPS, Shooter, Multiplayer, Singleplay...",80.0,7.656120e+16,0.000032
285124,Snail Trek - Chapter 3: Lettuce Be,"[Adventure, Indie, Retro, Point & Click, Pixel...",761480.0,7.656120e+16,0.000070
285125,Kebab it Up!,"[Action, Indie, Casual, Violent, Adventure]",745400.0,7.656120e+16,0.000382


In [23]:
games_short_inner = games_short_inner.drop(columns=['user_id','rating'])
games_short_inner = games_short_inner.drop_duplicates(subset=('app_name','product_id'))
games_short_inner.head()

Unnamed: 0,app_name,tags,product_id
0,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0
1,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0
2,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0
12,Half-Life,"[FPS, Classic, Action, Sci-fi, Singleplayer, S...",70.0
333,Vaporwave Simulator,"[Casual, Indie, Simulation]",766850.0


In [24]:
games_short_inner.size

28416

In [25]:
games_short_inner=games_short_inner.reset_index()
games_short_inner.head()

Unnamed: 0,index,app_name,tags,product_id
0,0,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0
1,1,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0
2,2,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0
3,12,Half-Life,"[FPS, Classic, Action, Sci-fi, Singleplayer, S...",70.0
4,333,Vaporwave Simulator,"[Casual, Indie, Simulation]",766850.0


In [26]:
games_short_inner=games_short_inner.drop(columns=['index'])
games_short_inner.head()

Unnamed: 0,app_name,tags,product_id
0,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980.0
1,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...",670290.0
2,Carmageddon Max Pack,"[Racing, Action, Classic, Indie, Gore, 1990's,...",282010.0
3,Half-Life,"[FPS, Classic, Action, Sci-fi, Singleplayer, S...",70.0
4,Vaporwave Simulator,"[Casual, Indie, Simulation]",766850.0


In [27]:
game_data_map = games_short_inner[['product_id','app_name']]
game_data_map.head()

Unnamed: 0,product_id,app_name
0,643980.0,Ironbound
1,670290.0,Real Pool 3D - Poolians
2,282010.0,Carmageddon Max Pack
3,70.0,Half-Life
4,766850.0,Vaporwave Simulator


In [28]:
game_titles_name = dict(zip(game_data_map['app_name'], game_data_map['product_id']))
game_titles = dict(zip(game_data_map['product_id'], game_data_map['app_name']))

In [29]:
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
 
# instantiate a reader and read in our rating data
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(reviews_short[['user_id','product_id','rating']], reader)
 
# train SVD on 80% of known rates
trainset, testset = train_test_split(data, test_size=.2)
algorithm = SVD()
algorithm.fit(trainset)
predictions = algorithm.test(testset)
 
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictions)


RMSE: 0.0306


0.03061315412752615

In [30]:
users = reviews[['user_id','username']]
users = users.dropna()
users.shape

(285587, 2)

In [31]:
users = users.drop_duplicates(subset=('user_id','username'))
users.shape

(238974, 2)

In [32]:
pd.set_option('display.float_format', lambda x: '%f' % x)

In [33]:
Mapping_file = dict(zip(game_data_map.app_name.tolist(),game_data_map.product_id.tolist()))

In [34]:
def pred_user_rating(ui):
    if ui in users.user_id.unique():
        ui_list = reviews_short[reviews_short.user_id == ui].product_id.tolist()
        d = {k: v for k,v in Mapping_file.items() if not v in ui_list}        
        predictedL = []
        for i, j in d.items():     
            predicted = algorithm.predict(ui, j)
            predictedL.append((i, predicted[3])) 
        pdf = pd.DataFrame(predictedL, columns = ['app_name', 'rating'])
        pdf.sort_values('rating', ascending=False, inplace=True)  
        pdf.set_index('app_name', inplace=True)    
        return pdf.head(10)        
    else:
        print("User Id does not exist in the list!")
        return None

*** Una vez entrenado, buscamos el nombre de un usuario y con ese obtenemos los primeros 10 mas recomendados. ***

In [35]:
# check the preferences of a particular user
user = 'brotherdave84'
user_id = users[users['username']==user]
user_id_float = int(user_id['user_id'])
user_id_float

76561198161372176

In [36]:
pred_user_rating(user_id_float)

Unnamed: 0_level_0,rating
app_name,Unnamed: 1_level_1
Kung Fu Strike - The Warrior's Rise,0.333237
RollerCoaster Tycoon® 3: Platinum,0.329885
Clockwork Tales: Of Glass and Ink,0.328822
Virtual Pool 4,0.324053
Grave Prosperity: Redux- part 1,0.323806
GAUGE,0.320693
Resilience: Wave Survival,0.320554
Grappledrome,0.319291
meleng,0.304545
Oik 2,0.303672


## Ejercicio 3 - Sistema de Recomendación Basado en Juegos

Similar al caso anterior, con la diferencia de que este sistema espera como entrada el nombre de un juego y devuelve una lista de juegos similares. El sistema deberá estar programado en base a información de contenido de los juegos (i.e. filtrado basado en contenido o sistema híbrido).

In [37]:
from scipy.sparse import csr_matrix

def create_X(df):
    """
    Generates a sparse matrix from ratings dataframe.
    
    Args:
        df: pandas dataframe containing 3 columns (userId, movieId, rating)
    
    Returns:
        X: sparse matrix
        user_mapper: dict that maps user id's to user indices
        user_inv_mapper: dict that maps user indices to user id's
        movie_mapper: dict that maps movie id's to movie indices
        movie_inv_mapper: dict that maps movie indices to movie id's
    """
    M = df['user_id'].nunique()
    N = df['product_id'].nunique()

    user_mapper = dict(zip(np.unique(df["user_id"]), list(range(M))))
    game_mapper = dict(zip(np.unique(df["product_id"]), list(range(N))))
    
    user_inv_mapper = dict(zip(list(range(M)), np.unique(df["user_id"])))
    game_inv_mapper = dict(zip(list(range(N)), np.unique(df["product_id"])))
    
    user_index = [user_mapper[i] for i in df['user_id']]
    item_index = [game_mapper[i] for i in df['product_id']]

    X = csr_matrix((df["rating"], (user_index,item_index)), shape=(M,N))
    
    return X, user_mapper, game_mapper, user_inv_mapper, game_inv_mapper

X, user_mapper, game_mapper, user_inv_mapper, game_inv_mapper = create_X(reviews_short)

In [38]:
from sklearn.neighbors import NearestNeighbors
def find_similar_game_name(game_name, X, game_mapper, game_inv_mapper, k, metric='cosine'):
    """
    Finds k-nearest neighbours for a given movie id.
    
    Args:
        artist_id: id of the artist of interest
        X: user-item utility matrix
        k: number of similar artist to retrieve
        metric: distance metric for kNN calculations
    
    Output: returns list of k similar movie ID's
    """
    X = X.T
    neighbour_ids = []
    game_output = []
    
    game_id = game_titles_name[game_name]
    game_ind = game_mapper[game_id] 
    game_vec = X[game_ind]
    if isinstance(game_vec, (np.ndarray)):
        game_vec = game_vec.reshape(1,-1)
    # use k+1 since kNN output includes the artistID of interest
    kNN = NearestNeighbors(n_neighbors=k+1, algorithm="brute", metric=metric)
    kNN.fit(X)
    neighbour = kNN.kneighbors(game_vec, return_distance=False)
    for i in range(0,k):
        n = neighbour.item(i)
        neighbour_ids.append(game_inv_mapper[n])
        
    neighbour_ids.pop(0)
    
    for i in neighbour_ids:
        game_output.append(game_titles[i])
    game_output.pop(0)
    
    return game_output

Una vez entrenado, devuelve los 10 juegos mas similares para un nombre ingresado.

In [39]:
#Busco los juegos similares por nombre
juegos_a_recomendar = find_similar_game_name('Ironbound', X, game_mapper, game_inv_mapper, k=10)
juegos_a_recomendar

['Mathoria: It All Adds Up',
 'Formicide',
 'Simutrans',
 'Blood and Bacon',
 'Forbidden Planet',
 'Pythagoria',
 'The Chosen RPG',
 'Lost Castle']