# 1. ETL: Data Extraction, Transformation, and Loading

In this process we extract the data we need and clean it so that it is in the right format for our purposes.

#### 1.1 Import libraries, define functions and constants

In [1]:
import pandas as pd
import json
import ast
import warnings
from io import StringIO
import hashlib
import matplotlib.pyplot as plt

from typing import List, Dict
import base64, csv

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import plotly.express as px
import seaborn as sns
import importlib

warnings.filterwarnings('ignore')

In [2]:

def showPie(columna):
  count_values = pd.Series(columna).value_counts()
  if len(count_values) > 15:
    count_values = count_values.iloc[0:15]
  datos = pd.DataFrame({"valor":count_values.index, "ocurrencia": count_values.values})

  plt.title(columna.name)
  plt.pie(datos["ocurrencia"], labels=datos['valor'], autopct='%1.1f%%')
  plt.show()

def concatenar(data_1, data_2,  axis=1):
  return pd.concat([data_1, data_2], axis=axis)

def contar_nulos(data):
  return data.isna().sum()

def mapear(columna: pd.Series, mapa={'NO': 0, 'SI':1}):
  return columna.map(mapa)

def showPiePx(columna, max=15):
  count_values = pd.Series(columna).value_counts()
  if len(count_values) > max:
    count_values = count_values.iloc[0:max]
  datos = pd.DataFrame({"valor":count_values.index, "ocurrencia": count_values.values})
  fig = px.pie(datos, values='ocurrencia', names='valor', title=columna.name)
  fig.update_traces(textposition='outside', textinfo='percent+label')
  fig.show()
  
def nulos_filas(data):
    return pd.DataFrame(data.isna().sum(axis=1).value_counts().sort_values(ascending=False).reset_index().values, columns=['cant_col_nulas', 'cantidad'])
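
As a quick illustration of what a helper like `nulos_filas` reports, the following minimal sketch counts null cells per row on a small made-up frame and then tallies how many rows share each count. The data and the names `toy`, `nulls_per_row`, and `summary` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): row 0 has no nulls, row 1 has one, row 2 is fully empty
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", "y", None],
    "c": [10.0, 20.0, np.nan],
})

# Number of null cells in each row
nulls_per_row = toy.isna().sum(axis=1)

# How many rows share each null count (the same idea as nulos_filas)
summary = (
    nulls_per_row.value_counts()
    .rename_axis("cant_col_nulas")
    .reset_index(name="cantidad")
)
print(summary)
```

A row whose null count equals the number of columns is completely empty, which is the case examined in section 1.2.2.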

In [3]:
URL_STEAM_GAMES = 'datasets/origin/output_steam_games.json'
URL_USERS_ITEMS = 'datasets/australian_users_items.json'
URL_USERS_REVIEWS = 'datasets/australian_user_reviews.json'

## 1.2 ETL of the Steam Games Dataset

#### 1.2.1 Read the JSON file

Since the file is in JSON format, calling the pandas function ```read_json()``` is enough to create a ```DataFrame``` and read the values. The ```lines=True``` argument tells pandas that the file holds one JSON object per line.

In [4]:
df_games_all = pd.read_json(URL_STEAM_GAMES, lines=True)
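
To make the effect of `lines=True` concrete, here is a minimal sketch that reads two made-up records from an in-memory string instead of the real file. The records and the name `toy_games` are hypothetical.

```python
from io import StringIO
import pandas as pd

# Two hypothetical records, one JSON object per line (JSON Lines layout)
raw = '{"id": 1, "app_name": "Game A"}\n{"id": 2, "app_name": "Game B"}\n'

# lines=True parses each line as a separate record
toy_games = pd.read_json(StringIO(raw), lines=True)
print(toy_games.shape)
```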

#### 1.2.2 Null Value Analysis

In [5]:
# Returns the number of rows for each count of null columns; for example,
# 88,310 rows have 13 null values and 22,530 rows have 0 null values

nulos_filas(df_games_all)

Unnamed: 0,cant_col_nulas,cantidad
0,13,88310
1,0,22530
2,1,6070
3,5,1940
4,3,733
5,4,391
6,2,349
7,6,121
8,10,1


Looking at the null counts along axis 1, there is a very large number of completely empty rows (88,310): the dataset has 13 columns, and the null count for these rows equals that value. We therefore keep only rows with at most n-1 null values, that is, rows with at least one non-null column.

In [6]:
n = len(df_games_all.columns)

df_games = df_games_all.drop( df_games_all[df_games_all.isna().sum(axis=1) > (n - 1) ].index)
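
The same filter can be expressed with pandas' built-in `dropna(how="all")`. The sketch below compares both forms on a small made-up frame; the data and the names `toy`, `by_count`, and `by_dropna` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle row is completely empty
toy = pd.DataFrame({
    "id": [10.0, np.nan, 30.0],
    "app_name": ["A", None, None],
})

n = len(toy.columns)

# Approach used above: drop rows whose null count exceeds n - 1 (i.e. equals n)
by_count = toy.drop(toy[toy.isna().sum(axis=1) > (n - 1)].index)

# Built-in equivalent: drop rows in which every column is null
by_dropna = toy.dropna(how="all")

print(by_count.equals(by_dropna))
```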

#### 1.2.3 Check for duplicates

In [7]:
# Check for fully duplicated rows. Every cell is first cast to its string form,
# because list-valued columns (such as genres and tags) cannot be hashed directly
df_str = df_games.astype(str).duplicated()

print(f'Number of duplicated rows: {df_str.sum()}')


Number of duplicated rows: 0
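
The string cast matters because cells that hold Python lists are not hashable, so a direct duplicate check on such columns typically fails. A minimal sketch on made-up data (the names `toy` and `dup_mask` are hypothetical):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical, including their list cell
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "genres": [["Action", "Indie"], ["Action", "Indie"], ["Indie"]],
})

# Casting to str turns ['Action', 'Indie'] into the text "['Action', 'Indie']",
# which can be hashed and compared row by row
dup_mask = toy.astype(str).duplicated()
print(dup_mask.sum())
```

Note that this comparison is order-sensitive: two lists with the same elements in a different order count as different rows.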


### 1.2.4 Review all columns to remove redundancy and information not useful for the analysis

The ```title``` and ```app_name``` columns appear to contain the same information, so we compare their null counts:

In [8]:
df_games[['title','app_name']].isna().sum()

title       2050
app_name       2
dtype: int64

The ```url```, ```reviews_url```, and ```specs``` columns contain information considered superfluous for the analysis; ```publisher``` duplicates information, as does ```early_access```. We also drop ```title```, since ```app_name``` has far fewer nulls.

In [9]:
columnas_a_quitar = ['title','url','reviews_url', 'early_access', 'publisher', 'specs']

In [10]:
df_games.drop(columnas_a_quitar, axis=1, inplace=True)

We reorder the columns for convenience

In [11]:
df_games = df_games[['id','app_name', 'genres', 'release_date', 'tags', 'price', 'developer']]

We continue working on the individual columns

#### 1.2.4.1. ```App_Name``` and ```Title```

In [12]:
df_games[df_games.app_name.isna()]

Unnamed: 0,id,app_name,genres,release_date,tags,price,developer
88384,,,,,,19.99,
90890,317160.0,,"[Action, Indie]",2014-08-26,"[Action, Indie]",,


In [13]:
try:
    df_games.drop(88384, inplace=True)
except KeyError:
    # The row is no longer present (for example, the cell was run twice)
    print('Not found')

The duplicated information that we do not use for the analysis is still available; looking there, we recover the missing value from the ```title``` column

In [14]:
df_games.loc[90890, 'app_name'] = 'Duet'
df_games.loc[90890]

id                     317160.0
app_name                   Duet
genres          [Action, Indie]
release_date         2014-08-26
tags            [Action, Indie]
price                      None
developer                  None
Name: 90890, dtype: object

#### 1.2.4.2 ```Id```

In [15]:
df_games[df_games.id.isna()]

Unnamed: 0,id,app_name,genres,release_date,tags,price,developer
119271,,Batman: Arkham City - Game of the Year Edition,"[Action, Adventure]",2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",19.99,"Rocksteady Studios,Feral Interactive (Mac)"


As before, we find the missing value in columns that duplicate information

In [16]:
df_games.loc[119271,'id'] = 200260

At this point it is appropriate to reset the index

In [17]:
df_games.reset_index(inplace=True)

In [18]:
df_games.drop('index', axis=1, inplace=True)

In [19]:
df_games

Unnamed: 0,id,app_name,genres,release_date,tags,price,developer
0,761140.0,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro
1,643980.0,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,Secret Level SRL
2,670290.0,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",Free to Play,Poolians.com
3,767400.0,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域
4,773570.0,Log Challenge,,,"[Action, Indie, Casual, Sports]",2.99,
...,...,...,...,...,...,...,...
32129,773640.0,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS"""
32130,733530.0,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada
32131,610660.0,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich
32132,658870.0,EXIT 2 - Directions,"[Casual, Indie]",2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",4.99,"xropi,stev3ns"


We look for duplicated items

In [20]:
cuenta_duplicados = df_games.id.value_counts()
id_dups = cuenta_duplicados[cuenta_duplicados.values > 1].keys()



In [21]:
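# For each duplicated id, collect the index label of its first occurrence
# (all ids in the group are equal, so idxmax returns the first one)
to_b = []

for i in id_dups:
  to_b.append(df_games[df_games['id'] == i].id.idxmax())

# Drop those rows, keeping the remaining occurrence(s) of each id
df_games.drop(to_b, axis=0, inplace=True)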
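
When each duplicated id appears exactly twice, the loop above has the same effect as `drop_duplicates(subset="id", keep="last")`. The sketch below checks this on made-up data; the names `toy`, `looped`, and `deduped` are hypothetical. If an id appeared three or more times, the loop would leave duplicates behind while `drop_duplicates` would not, so the two are not interchangeable in general.

```python
import pandas as pd

# Hypothetical data: id 10 appears twice
toy = pd.DataFrame({"id": [10, 20, 10, 30], "app_name": ["A", "B", "A again", "C"]})

# Loop-based approach, as above: drop the first row of each repeated id
counts = toy["id"].value_counts()
dup_ids = counts[counts > 1].index
first_rows = [toy[toy["id"] == i]["id"].idxmax() for i in dup_ids]
looped = toy.drop(first_rows, axis=0)

# Built-in alternative: keep only the last row for each id
deduped = toy.drop_duplicates(subset="id", keep="last")

print(looped.equals(deduped))
```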

In [22]:
df_games = df_games.set_index(df_games['id'].astype(int))
df_games.id = df_games.id.astype(int)
df_games.rename_axis('index')

index,id,app_name,genres,release_date,tags,price,developer
761140,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro
643980,643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,Secret Level SRL
670290,670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",Free to Play,Poolians.com
767400,767400,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域
773570,773570,Log Challenge,,,"[Action, Indie, Casual, Sports]",2.99,
...,...,...,...,...,...,...,...
773640,773640,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS"""
733530,733530,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada
610660,610660,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich
658870,658870,EXIT 2 - Directions,"[Casual, Indie]",2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",4.99,"xropi,stev3ns"


#### 1.2.4.3 ``` Price ```

In [23]:
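def isnumber(x):
    # Return x as a float, or 0 when it cannot be parsed (e.g. 'Free to Play' or None)
    try:
        return float(x)
    except (TypeError, ValueError):
        return 0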

In [24]:
# Convert prices to numbers; non-numeric values such as 'Free to Play' become 0

df_games.price = df_games.price.apply(isnumber)
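
A vectorized alternative, shown here only as a sketch, is `pd.to_numeric` with `errors="coerce"` followed by `fillna(0)`. The sample values and the names `prices` and `numeric` are hypothetical. One difference to keep in mind: this version also maps missing prices to 0, whereas `isnumber` passes a float `NaN` through unchanged.

```python
import pandas as pd

# Hypothetical sample of the kinds of values found in the price column
prices = pd.Series([4.99, "Free To Play", "0.99", None], dtype="object")

# Unparseable strings and missing values become NaN, which is then replaced by 0
numeric = pd.to_numeric(prices, errors="coerce").fillna(0)
print(numeric.tolist())
```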

#### 1.2.4.4 ```release_date``` and ```release_year```

In [25]:
# to_date = lambda x: pd.to_datetime(x, errors='coerce') if pd.notna(x) else pd.to_datetime('1900-01-01')
# to_date = lambda x: pd.to_datetime(x, errors='coerce').fillna(pd.to_datetime('1900-01-01'))
to_date = lambda x: pd.to_datetime(x, errors='coerce')



In [26]:
fechas = df_games['release_date'].apply(to_date)


In [27]:
df_games['release_year'] = df_games['release_date'].apply(to_date)

In [28]:
df_games['release_year'] = df_games['release_year'].dt.year.fillna(1900).astype(int)
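
The two steps above can also be sketched as one vectorized chain. The sample dates and the names `dates`, `parsed`, and `years` are hypothetical. One caveat: recent pandas versions infer a single date format from the first value when parsing a whole column, so a column with mixed formats may yield more `NaT` values this way than with the element-wise `apply` used above.

```python
import pandas as pd

# Hypothetical sample: two valid dates, one missing value, one unparseable string
dates = pd.Series(["2018-01-04", "2017-07-24", None, "not a date"])

# Invalid or missing dates become NaT
parsed = pd.to_datetime(dates, errors="coerce")

# Extract the year and use 1900 as the placeholder for unknown dates
years = parsed.dt.year.fillna(1900).astype(int)
print(years.tolist())
```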

In [29]:
df_games

id,id,app_name,genres,release_date,tags,price,developer,release_year
761140,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro,2018
643980,643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",0.00,Secret Level SRL,2018
670290,670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",0.00,Poolians.com,2017
767400,767400,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域,2017
773570,773570,Log Challenge,,,"[Action, Indie, Casual, Sports]",2.99,,1900
...,...,...,...,...,...,...,...,...
773640,773640,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS""",2018
733530,733530,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada,2018
610660,610660,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich,2018
658870,658870,EXIT 2 - Directions,"[Casual, Indie]",2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",4.99,"xropi,stev3ns",2017


#### 1.2.4.5 ``` genres ``` and ``` tags ```

In [30]:
juego_genero = df_games['genres'].explode()

In [31]:
juego_genero

id
761140        Action
761140        Casual
761140         Indie
761140    Simulation
761140      Strategy
             ...    
610660        Racing
610660    Simulation
658870        Casual
658870         Indie
681550          None
Name: genres, Length: 74833, dtype: object

In [32]:
df_games['genres_tags'] = df_games['genres'] + df_games['tags']

In [33]:
df_games['genres_tags'] = df_games['genres_tags'].apply(lambda x: pd.Series(x).drop_duplicates().tolist())

In [34]:
# generos_tags = []
# df_games['genres_tags'].fillna("", inplace=True)
# def to_set(x):
#     try:
#         x = list(set(x))
#     except:
#         x = []
#         pass
#     return x

# df_games['genres_tags'] = df_games['genres_tags'].apply(to_set)

In [35]:
# Note: explode() returns a new Series, so this in-place fill does not modify df_games
df_games['genres_tags'].explode().fillna('Generic', inplace=True)
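
Adding two list columns with `+` yields `NaN` whenever either side is missing, so a game with tags but no genres ends up with `[nan]` instead of its tags. The following is a minimal sketch of a NaN-safe combination, not the method used in this notebook; the helpers `as_list` and `merge_labels` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second game has tags but no genres
toy = pd.DataFrame({
    "genres": [["Action", "Indie"], np.nan],
    "tags": [["Indie", "2D"], ["Casual"]],
})

def as_list(value):
    # Treat anything that is not a list (e.g. NaN or None) as an empty list
    return value if isinstance(value, list) else []

def merge_labels(genres, tags):
    # dict.fromkeys removes duplicates while keeping first-seen order
    merged = list(dict.fromkeys(as_list(genres) + as_list(tags)))
    # Fall back to a placeholder label when both sides are empty
    return merged if merged else ["Generic"]

toy["genres_tags"] = pd.Series(
    [merge_labels(g, t) for g, t in zip(toy["genres"], toy["tags"])],
    index=toy.index,
)
print(toy["genres_tags"].tolist())
```

Adopting something like this would change the downstream counts, since games that currently contribute `nan` would contribute their tags instead.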

In [36]:
print(df_games.query("id == 761140")['genres_tags'].explode())

id
761140        Action
761140        Casual
761140         Indie
761140    Simulation
761140      Strategy
Name: genres_tags, dtype: object


In [37]:
# df_games.loc[1]

In [38]:
# df_games.loc[1, 'genres_tags']

In [91]:
generos_tags = [[row['id'], *row['genres_tags']] for _, row in df_games.iterrows()]

tuplas = [[i[0], j] for i in generos_tags for j in i[1:]]

# generos_tags

In [92]:
df_juego_genero_tag = pd.DataFrame(tuplas, columns=['id_juego', 'genero_tag'])

# Choose how many genres/tags to keep for the analysis dataset
nro_generos_tag = 40

recorte = df_juego_genero_tag['genero_tag'].value_counts().reset_index().head(nro_generos_tag)

df_juego_genero_tag = df_juego_genero_tag[df_juego_genero_tag.genero_tag.isin(recorte['genero_tag'])]

In [93]:
matriz_dummies = pd.get_dummies(df_juego_genero_tag, dtype=int, prefix='gen').groupby('id_juego').sum()
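
To see how the (game, label) pairs turn into one row per game, here is a small sketch with made-up pairs. The names `pairs` and `matrix` are hypothetical, and the `columns` argument is spelled out for clarity even though the call above relies on pandas encoding only the non-numeric column.

```python
import pandas as pd

# Hypothetical long-format pairs: game 10 has two labels, game 20 has one
pairs = pd.DataFrame({
    "id_juego": [10, 10, 20],
    "genero_tag": ["Action", "Indie", "Indie"],
})

# One-hot encode each pair, then add up the rows of each game
matrix = (
    pd.get_dummies(pairs, columns=["genero_tag"], dtype=int, prefix="gen")
    .groupby("id_juego")
    .sum()
)
print(matrix)
```

Each row of `matrix` is a multi-hot vector: a 1 marks a genre or tag that the game has.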

In [94]:
matriz_dummies.sum().sort_values(ascending=False)

gen_Indie                    16352
gen_Action                   11898
gen_Adventure                 9221
gen_Casual                    8879
gen_Strategy                  7365
gen_Simulation                7062
gen_RPG                       5797
gen_Singleplayer              4195
gen_Multiplayer               2283
gen_Free to Play              2230
gen_Great Soundtrack          2166
gen_Puzzle                    2009
gen_2D                        1929
gen_Atmospheric               1839
gen_Early Access              1462
gen_Platformer                1418
gen_Story Rich                1408
gen_Sports                    1308
gen_Fantasy                   1280
gen_Open World                1273
gen_Difficult                 1270
gen_Massively Multiplayer     1250
gen_Pixel Graphics            1236
gen_Sci-fi                    1225
gen_Co-op                     1197
gen_Horror                    1155
gen_Female Protagonist        1153
gen_Shooter                   1141
gen_Racing          

In [95]:
filter_games_df = None

items_reviews_filter = False

if items_reviews_filter:
    items_unicos = pd.read_csv('id_items_unicos.csv')
    items_unicos.columns = ['index', 'id_juego']
    # matriz_dummies = pd.merge(matriz_dummies, items_unicos,left_on='id_juego', right_on='id_juego')
    filter_games_df = items_unicos

In [96]:
stats_filter = True

if stats_filter:
    stats = pd.read_csv('df_games_stats.csv')
    filter_games_df = stats
    # stats.info()

In [97]:
if filter_games_df is not None and not filter_games_df.empty:
    matriz_dummies = pd.merge(matriz_dummies, filter_games_df, left_on='id_juego', right_on='id_juego')

# matriz_dummies.info()

In [98]:
df_juego_genero = pd.DataFrame(juego_genero)

df_juego_genero.rename_axis('id_juego', inplace=True)

In [99]:
df_games['id_juego'] = df_games['id']

matriz_generos = pd.merge(df_juego_genero, filter_games_df, left_on='id_juego', right_on='id_juego')

In [100]:
matriz_generos

Unnamed: 0,id_juego,genres,playtime_forever,playtime_2weeks,total_play
0,282010,Action,1577,0,1577
1,282010,Indie,1577,0,1577
2,282010,Racing,1577,0,1577
3,70,Action,334490,1544,336034
4,1640,Strategy,3030,0,3030
...,...,...,...,...,...
20013,200980,Strategy,471,0,471
20014,200980,RPG,471,0,471
20015,200980,Indie,471,0,471
20016,13230,Action,71360,162,71522


In [149]:
genero_estadisticas = pd.merge(matriz_generos, df_games, left_on='id_juego', right_on='id_juego').groupby(['genres_x', 'release_year']).sum()
genero_estadisticas.drop(columns=['id', 'app_name', 'genres_y', 'release_date', 'tags', 'price', 'developer', 'genres_tags'], inplace=True)
genero_estadisticas.to_csv('generos_estadisticas.csv')
# genero_estadisticas.drop(columns=['id'], inplace=True)
# genero_estadisticas.info()

In [154]:
genero_estadisticas = pd.read_csv('generos_estadisticas.csv')
genero_estadisticas = genero_estadisticas.set_index('genres_x')
genero_estadisticas

genres_x,release_year,id_juego,playtime_forever,playtime_2weeks,total_play
Action,1900,841650,560429,2244,562673
Action,1983,227380,597,0,597
Action,1984,240340,109,0,109
Action,1988,308000,2843,0,2843
Action,1989,302330,48,0,48
...,...,...,...,...,...
Web Publishing,2013,699790,74442,324,74766
Web Publishing,2014,587670,1344,0,1344
Web Publishing,2015,1806090,159463,19684,179147
Web Publishing,2016,478960,0,0,0


In [161]:
genero_estadisticas = pd.read_csv('generos_estadisticas.csv')
genero_estadisticas = genero_estadisticas.set_index(['genres_x', 'release_year'])

genero_string = 'Web Publishing'

año_maximo = genero_estadisticas.query(f"genres_x == '{genero_string}'")['playtime_forever'].sort_values(ascending=False).head(1)

pd.DataFrame(año_maximo).reset_index().release_year.values[0]

2012
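
The lookup above (filter by genre, sort by playtime, take the first row) can also be sketched with `xs` and `idxmax`. The numbers below are made up, and the function name `top_year` is hypothetical.

```python
import pandas as pd

# Hypothetical per-genre, per-year totals with the same two-level index as above
stats = pd.DataFrame({
    "genres_x": ["Action", "Action", "Indie"],
    "release_year": [2010, 2012, 2015],
    "playtime_forever": [100, 500, 50],
}).set_index(["genres_x", "release_year"])

def top_year(stats_df, genre):
    # Select one genre; the remaining index level is release_year
    per_year = stats_df.xs(genre, level="genres_x")["playtime_forever"]
    # idxmax returns the index label (the year) of the largest value
    return int(per_year.idxmax())

print(top_year(stats, "Action"))
```

If two years tie for the maximum, `idxmax` returns the first one in index order.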

In [112]:
matriz_dummies

Unnamed: 0,id_juego,gen_2D,gen_Action,gen_Adventure,gen_Anime,gen_Arcade,gen_Atmospheric,gen_Casual,gen_Classic,gen_Co-op,...,gen_Simulation,gen_Singleplayer,gen_Sports,gen_Story Rich,gen_Strategy,gen_Survival,gen_Turn-Based,playtime_forever,playtime_2weeks,total_play
0,10,0,1,0,0,0,0,0,1,0,...,0,0,0,0,1,1,0,2106016,21209,2127225
1,20,0,1,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,168263,13071,181334
2,30,0,1,0,0,0,0,0,1,1,...,0,1,0,0,0,0,0,158986,13019,172005
3,40,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,4389,27,4416
4,50,0,1,1,0,0,1,0,1,1,...,0,1,0,1,0,0,0,96324,323,96647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7841,527510,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,154,154,308
7842,527520,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7843,527810,0,1,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
7844,527900,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,44,44,88


In [50]:
df_games.query('release_year > 2016')

id,id,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego
761140,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro,2018,"[Action, Casual, Indie, Simulation, Strategy]",761140
643980,643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",0.00,Secret Level SRL,2018,"[Free to Play, Indie, RPG, Strategy, Card Game...",643980
670290,670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",0.00,Poolians.com,2017,"[Casual, Free to Play, Indie, Simulation, Spor...",670290
767400,767400,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域,2017,"[Action, Adventure, Casual]",767400
772540,772540,Battle Royale Trainer,"[Action, Adventure, Simulation]",2018-01-04,"[Action, Adventure, Simulation, FPS, Shooter, ...",3.99,Trickjump Games Ltd,2018,"[Action, Adventure, Simulation, FPS, Shooter, ...",772540
...,...,...,...,...,...,...,...,...,...,...
745400,745400,Kebab it Up!,"[Action, Adventure, Casual, Indie]",2018-01-04,"[Action, Indie, Casual, Violent, Adventure]",1.99,Bidoniera Games,2018,"[Action, Adventure, Casual, Indie, Violent]",745400
773640,773640,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS""",2018,"[Casual, Indie, Simulation, Strategy]",773640
733530,733530,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada,2018,"[Casual, Indie, Strategy]",733530
610660,610660,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich,2018,"[Indie, Racing, Simulation]",610660


In [51]:
# matriz_dummies_rev.to_csv('matriz_dummies_rev.csv')

In [52]:
matriz_dummies.index = matriz_dummies.index.astype(int)

to_drop = matriz_dummies.sum()[matriz_dummies.sum() == 0].index.to_list()

matriz_dummies.drop(to_drop, axis=1, inplace=True)

In [53]:
matriz_dummies

Unnamed: 0,id_juego,gen_1980s,gen_1990's,gen_2.5D,gen_2D,gen_2D Fighter,gen_3D Platformer,gen_3D Vision,gen_4 Player Local,gen_4X,...,gen_Warhammer 40K,gen_Web Publishing,gen_Western,gen_World War I,gen_World War II,gen_Zombies,gen_e-sports,playtime_forever,playtime_2weeks,total_play
0,10,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,2106016,21209,2127225
1,20,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,168263,13071,181334
2,30,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,158986,13019,172005
3,40,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,4389,27,4416
4,50,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,96324,323,96647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7917,527510,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,154,154,308
7918,527520,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7919,527810,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7920,527900,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,44,44,88


In [54]:
matriz_dummies.rename_axis('index', inplace=True)

In [55]:
matriz_dummies.index = matriz_dummies.id_juego.astype(int)

In [56]:
matriz_dummies.columns

Index(['id_juego', 'gen_1980s', 'gen_1990's', 'gen_2.5D', 'gen_2D',
       'gen_2D Fighter', 'gen_3D Platformer', 'gen_3D Vision',
       'gen_4 Player Local', 'gen_4X',
       ...
       'gen_Warhammer 40K', 'gen_Web Publishing', 'gen_Western',
       'gen_World War I', 'gen_World War II', 'gen_Zombies', 'gen_e-sports',
       'playtime_forever', 'playtime_2weeks', 'total_play'],
      dtype='object', length=304)

In [57]:
matriz_dummies = matriz_dummies.drop(matriz_dummies.columns[[0, -1, -2, -3]], axis=1)

In [58]:
matriz_dummies.to_csv('matriz_dummies.csv')

In [59]:
matriz_dummies.sample(2)

id_juego,gen_1980s,gen_1990's,gen_2.5D,gen_2D,gen_2D Fighter,gen_3D Platformer,gen_3D Vision,gen_4 Player Local,gen_4X,gen_6DOF,...,gen_Walking Simulator,gen_War,gen_Wargame,gen_Warhammer 40K,gen_Web Publishing,gen_Western,gen_World War I,gen_World War II,gen_Zombies,gen_e-sports
441770,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
409380,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
from sklearn.metrics.pairwise import cosine_similarity

def comparar(id_1, id_2):
    row1 = matriz_dummies.loc[id_1].values.reshape(1,-1)
    row2 = matriz_dummies.loc[id_2].values.reshape(1,-1)
    return cosine_similarity(row1, row2)    


In [61]:
def get_recommended(id_juego):
    lista = []
    # id_juego = matriz_dummies.sample().index

    for i in matriz_dummies.index.tolist():
        if i != id_juego:
            (a, b) = i, comparar(id_juego, i)
            if 0.5 < b[0][0] <= 1:
                lista.append((a, b[0][0]))
    
    return pd.DataFrame(lista, columns=['id_juego', 'similitud']).sort_values('similitud', ascending=False).head(5)
    

In [62]:
id_juego = 30

# pd.DataFrame(lista, columns=['id_juego', 'similitud']).sort_values('similitud', ascending=False).head(5)

get_recommended(id_juego)

Unnamed: 0,id_juego,similitud
19,2640,0.819892
58,24840,0.787726
11,1200,0.774597
18,2630,0.751469
6,300,0.750555
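
Calling `comparar` once per game works, but it loops in Python over every row. A vectorized sketch of the same idea, computing the cosine similarity of one game against all others with a single matrix product, is shown below. It uses plain NumPy rather than `cosine_similarity`, and the toy matrix and the function name `recommend` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-hot matrix: one row per game, one column per label
toy = pd.DataFrame(
    {"gen_Action": [1, 1, 0], "gen_Indie": [1, 0, 0], "gen_RPG": [0, 0, 1]},
    index=pd.Index([10, 20, 30], name="id_juego"),
)

def recommend(matrix, game_id, top=5, threshold=0.5):
    values = matrix.to_numpy(dtype=float)
    target = matrix.loc[game_id].to_numpy(dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    denom = np.linalg.norm(values, axis=1) * np.linalg.norm(target)
    # Rows with zero norm get similarity 0 instead of a division by zero
    sims = np.divide(values @ target, denom, out=np.zeros_like(denom), where=denom > 0)
    result = pd.Series(sims, index=matrix.index, name="similitud").drop(game_id)
    # Keep only sufficiently similar games, most similar first
    return result[result > threshold].sort_values(ascending=False).head(top)

print(recommend(toy, 10))
```

In this toy case game 20 shares one of the two labels of game 10 (similarity about 0.707), while game 30 shares none and is filtered out by the threshold.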


In [63]:
# big = cosine_similarity(matriz_dummies,matriz_dummies)

In [64]:
# big[0:5, 0: 5]

In [65]:
generos_filtrados = df_juego_genero_tag['genero_tag'].value_counts().head(50).reset_index().head(38)['genero_tag'].to_list()

In [66]:
mask = df_juego_genero_tag['genero_tag'].isin(generos_filtrados)

In [67]:
df_juego_genero_tag['genero_tag'][mask]

0               Action
1               Casual
2                Indie
3           Simulation
4             Strategy
              ...     
159282          Casual
159283           Indie
159284          Puzzle
159285    Singleplayer
159286     Atmospheric
Name: genero_tag, Length: 110776, dtype: object

In [68]:
a_clustear = pd.get_dummies(df_juego_genero_tag['genero_tag'][mask], dtype='int')

In [69]:
a_clustear

Unnamed: 0,2D,Action,Adventure,Anime,Arcade,Atmospheric,Casual,Co-op,Difficult,Early Access,...,Retro,Sandbox,Sci-fi,Shooter,Simulation,Singleplayer,Sports,Story Rich,Strategy,Turn-Based
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159282,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159283,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159284,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159285,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [70]:
df_games.query('id == 12500')

id,id,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego
12500,12500,PuzzleQuest: Challenge of the Warlords,[Casual],2007-10-10,"[Puzzle, Casual, Match 3, RPG, Fantasy, 2D, St...",9.99,Infinite Interactive,2007,"[Casual, Puzzle, Match 3, RPG, Fantasy, 2D, St...",12500


In [71]:
from sklearn.cluster import KMeans
import numpy as np

# The data to cluster is the one-hot matrix a_clustear built above

# Create a KMeans instance with 38 clusters
kmeans = KMeans(n_clusters=38)

# Fit the KMeans model to your data
kmeans.fit(a_clustear)

# Get the cluster labels for each data point
cluster_labels = kmeans.labels_

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

In [72]:
preds = kmeans.predict(a_clustear)

preds.shape

(110776,)

In [73]:
%pip install scikit-learn




In [74]:
df_games

id,id,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego
761140,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro,2018,"[Action, Casual, Indie, Simulation, Strategy]",761140
643980,643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",0.00,Secret Level SRL,2018,"[Free to Play, Indie, RPG, Strategy, Card Game...",643980
670290,670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",0.00,Poolians.com,2017,"[Casual, Free to Play, Indie, Simulation, Spor...",670290
767400,767400,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域,2017,"[Action, Adventure, Casual]",767400
773570,773570,Log Challenge,,,"[Action, Indie, Casual, Sports]",2.99,,1900,[nan],773570
...,...,...,...,...,...,...,...,...,...,...
773640,773640,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS""",2018,"[Casual, Indie, Simulation, Strategy]",773640
733530,733530,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada,2018,"[Casual, Indie, Strategy]",733530
610660,610660,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich,2018,"[Indie, Racing, Simulation]",610660
658870,658870,EXIT 2 - Directions,"[Casual, Indie]",2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",4.99,"xropi,stev3ns",2017,"[Casual, Indie, Puzzle, Singleplayer, Atmosphe...",658870


In [75]:
df_recortado = df_games.query("release_year.notnull()").copy()


In [76]:
df_recortado['release_year'] = df_recortado['release_year'].astype(int)

In [77]:
df_recortado.query("id == 12500")

id,id,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego
12500,12500,PuzzleQuest: Challenge of the Warlords,[Casual],2007-10-10,"[Puzzle, Casual, Match 3, RPG, Fantasy, 2D, St...",9.99,Infinite Interactive,2007,"[Casual, Puzzle, Match 3, RPG, Fantasy, 2D, St...",12500


In [78]:
df_recortado.index = df_recortado.index.astype(int)

In [79]:
df_recortado['id'] = df_recortado['id'].astype(int)

In [80]:
df_recortado.drop('id', axis=1)

id,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego
761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,Kotoshiro,2018,"[Action, Casual, Indie, Simulation, Strategy]",761140
643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]",2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",0.00,Secret Level SRL,2018,"[Free to Play, Indie, RPG, Strategy, Card Game...",643980
670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]",2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",0.00,Poolians.com,2017,"[Casual, Free to Play, Indie, Simulation, Spor...",670290
767400,弹炸人2222,"[Action, Adventure, Casual]",2017-12-07,"[Action, Adventure, Casual]",0.99,彼岸领域,2017,"[Action, Adventure, Casual]",767400
773570,Log Challenge,,,"[Action, Indie, Casual, Sports]",2.99,,1900,[nan],773570
...,...,...,...,...,...,...,...,...,...
773640,Colony On Mars,"[Casual, Indie, Simulation, Strategy]",2018-01-04,"[Strategy, Indie, Casual, Simulation]",1.99,"Nikita ""Ghost_RUS""",2018,"[Casual, Indie, Simulation, Strategy]",773640
733530,LOGistICAL: South Africa,"[Casual, Indie, Strategy]",2018-01-04,"[Strategy, Indie, Casual]",4.99,Sacada,2018,"[Casual, Indie, Strategy]",733530
610660,Russian Roads,"[Indie, Racing, Simulation]",2018-01-04,"[Indie, Simulation, Racing]",1.99,Laush Dmitriy Sergeevich,2018,"[Indie, Racing, Simulation]",610660
658870,EXIT 2 - Directions,"[Casual, Indie]",2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",4.99,"xropi,stev3ns",2017,"[Casual, Indie, Puzzle, Singleplayer, Atmosphe...",658870


In [81]:
items_unicos = pd.read_csv('id_items_unicos.csv')
items_unicos.columns = ['index', 'id_juego']

items_unicos

Unnamed: 0,index,id_juego
0,0,1250
1,1,22200
2,2,43110
3,3,251610
4,4,227300
...,...,...
3677,3677,307130
3678,3678,209120
3679,3679,220090
3680,3680,262850


In [82]:
df_recortado_2 = pd.merge(items_unicos, df_recortado.drop('id', axis=1), left_on='id_juego', right_on='id')
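
Because the `id` column is dropped before merging, the merge above appears to rely on pandas matching `right_on='id'` against the name of the index. A more explicit way to express a join between a column and an index, sketched here on made-up data, is `right_index=True`. The names `items`, `games`, and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical data: a list of item ids and a games table indexed by id
items = pd.DataFrame({"id_juego": [10, 30]})
games = pd.DataFrame(
    {"app_name": ["A", "B", "C"]},
    index=pd.Index([10, 20, 30], name="id"),
)

# Join the id_juego column of the left frame against the index of the right frame
merged = pd.merge(items, games, left_on="id_juego", right_index=True, how="inner")
print(merged)
```

Only games whose id appears in `items` survive the inner join, which is how the dataset is trimmed to the items that users actually own.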

In [83]:
df_recortado_2.query("app_name.str.contains('boid')")


Unnamed: 0,index,id_juego_x,app_name,genres,release_date,tags,price,developer,release_year,genres_tags,id_juego_y
281,318,108600,Project Zomboid,"[Indie, RPG, Simulation, Early Access]",2013-11-08,"[Early Access, Survival, Zombies, Open World, ...",14.99,The Indie Stone,2013,"[Indie, RPG, Simulation, Early Access, Surviva...",108600


In [84]:
# lista_comp = []
# for i in matriz_dummies.index:
#     temp = []
#     for j in matriz_dummies.index:
#         temp.append(comparar(i,j))
#     lista_comp.append(temp)


In [85]:
# matriz = pd.DataFrame(lista_comp)

# matriz.to_csv('matrizon.csv')