<img src=https://d31uz8lwfmyn8g.cloudfront.net/Assets/logo-henry-white-lg.png><p>

# Extraction, Transformation, and Loading (ETL)

First, we import the libraries needed for this project:

In [4]:
import pandas as pd
import numpy as np
import ast
import warnings
warnings.filterwarnings('ignore')
import nltk
import re

We have three JSON files with data about Steam games, user items, and user reviews. Below, each file is loaded into a pandas DataFrame and its contents are inspected:

## Steam Games:

In [2]:
# json parses each line of the file
import json

rows = []                                           # list that will hold the DataFrame rows
with open(r'Datasets\output_steam_games.json') as file:       # open the original file
    for line in file:                               # read the file line by line
        data = json.loads(line)
        rows.append(data)

df_games = pd.DataFrame(rows)                       # build a pandas DataFrame from the rows
df_games.head()
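
The loop above treats the file as JSON Lines, that is, one JSON object per line. A minimal sketch of the same pattern follows, using an in-memory sample instead of the real file. The two records and the `read_json_lines` helper are illustrative assumptions, not part of the original notebook.

```python
import io
import json

# Hypothetical two-record sample standing in for the real file (one JSON object per line).
sample = io.StringIO(
    '{"id": "761140", "app_name": "Lost Summoner Kitty"}\n'
    '{"id": "643980", "app_name": "Ironbound"}\n'
)

def read_json_lines(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_json_lines(sample)
print(len(records), records[0]["app_name"])
```

The resulting list can be passed to `pd.DataFrame(records)`. For files in this layout, `pd.read_json(path, lines=True)` is a pandas alternative worth considering.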


In [201]:
df_games.tail()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"
120444,,,Maze Run VR,,http://store.steampowered.com/app/681550/Maze_...,,"[Early Access, Adventure, Indie, Action, Simul...",http://steamcommunity.com/app/681550/reviews/?...,"[Single-player, Stats, Steam Leaderboards, HTC...",4.99,True,681550,


In [292]:
# Inspect the dataset information:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 11.9+ MB


In [293]:
df_games.shape

(120445, 13)

As the output above shows, the dataset includes columns that are not needed to answer the queries required by the API, so they are dropped:

In [3]:
df_games = df_games.drop(columns=["url","reviews_url", "early_access"])


Count the null values per column:

In [205]:
df_games.isnull().sum()

publisher       96362
genres          91593
app_name        88312
title           90360
release_date    90377
tags            88473
specs           88980
price           89687
id              88312
developer       91609
dtype: int64

Every column contains a large number of null values, so the rows that contain them are dropped:

In [328]:
# Drop every row with a null in any column (the id-only variant is kept commented out)
#df_games1 = df_games.dropna(subset="id").reset_index(drop=True)
df_games1 = df_games.dropna().reset_index(drop=True)
df_games1.shape

(22530, 10)
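
The cell above keeps two variants: dropping rows with a null in any column, and (commented out) dropping only rows with a null `id`. A small sketch of how the two differ, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one row has no id, another lacks only a publisher.
toy = pd.DataFrame({
    "id": ["10", None, "30"],
    "publisher": ["A", "B", np.nan],
    "price": [4.99, 0.99, 1.99],
})

strict = toy.dropna()                 # drops any row with at least one null
by_id = toy.dropna(subset=["id"])     # drops only rows whose id is null

print(strict.shape, by_id.shape)
```

The stricter variant is simpler but discards games that are only missing, for example, a publisher, so the choice depends on which columns the later queries actually need.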

Next, check whether the "id" column contains duplicate values:

In [207]:
print("Duplicate values in id:", df_games1['id'].duplicated().sum())

Duplicate values in id: 1


In [329]:
# Drop the duplicate rows
df_games1 = df_games1.drop_duplicates(subset="id")

Create the "release_year" column, which allows grouping the data by release year:

In [330]:
# Extract the year into a new column called "release_year" using a regular expression:
df_games1['release_year'] = df_games1['release_date'].str.extract(r'(\d{4})')

# Drop the original column
df_games1 = df_games1.drop(columns="release_date")
df_games1.head()


Unnamed: 0,publisher,genres,app_name,title,tags,specs,price,id,developer,release_year
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",Free To Play,643980,Secret Level SRL,2018
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",Free to Play,670290,Poolians.com,2017
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017
4,Trickjump Games Ltd,"[Action, Adventure, Simulation]",Battle Royale Trainer,Battle Royale Trainer,"[Action, Adventure, Simulation, FPS, Shooter, ...","[Single-player, Steam Achievements]",3.99,772540,Trickjump Games Ltd,2018
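
As a sketch of how the year extraction behaves on irregular input, the example below applies the same regular expression to a few made-up date strings. Converting to the nullable `Int64` dtype is an assumption added here for illustration; the notebook itself drops nulls and casts to `int` later on.

```python
import pandas as pd

# Made-up release_date values: an ISO date, a free-text date, and unparseable entries.
dates = pd.Series(["2018-01-04", "Jun 2009", "coming soon", None])

# expand=False returns a Series; rows without four consecutive digits become NaN.
years = dates.str.extract(r"(\d{4})", expand=False)

# The nullable integer dtype keeps the missing years instead of requiring a dropna first.
years_int = pd.to_numeric(years).astype("Int64")
print(years_int.tolist())
```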


The "price" column holds numeric values for paid games and strings for free ones, so all of its values are converted to numbers below:

In [331]:
# List of the values that cannot be converted to float
no_floats = []

for valor in df_games1["price"].unique():   # for each unique value in the price column
    try:
        float(valor)                        # try converting the value to float
    except ValueError:
        no_floats.append(valor)             # if that fails, add it to the list of strings

In [299]:
# Show the string values found in the price column:
no_floats

['Free To Play',
 'Free to Play',
 'Free',
 'Free Demo',
 'Play for Free!',
 'Install Now',
 'Play WARMACHINE: Tactics Demo',
 'Free Mod',
 'Play Now',
 'Free HITMAN™ Holiday Pack',
 'Play the Demo',
 'Third-party']

Judging by these descriptions, almost all of them are free trials or demos, so these values are replaced with zero:

In [332]:
# Replace the free-game labels with zero
df_games1['price'] = df_games1['price'].replace(no_floats, 0.0)

# Convert the column to float
df_games1['price'] = df_games1['price'].astype(float)
df_games1.head()

Unnamed: 0,publisher,genres,app_name,title,tags,specs,price,id,developer,release_year
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",0.0,643980,Secret Level SRL,2018
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",0.0,670290,Poolians.com,2017
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017
4,Trickjump Games Ltd,"[Action, Adventure, Simulation]",Battle Royale Trainer,Battle Royale Trainer,"[Action, Adventure, Simulation, FPS, Shooter, ...","[Single-player, Steam Achievements]",3.99,772540,Trickjump Games Ltd,2018
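
The detection and replacement steps above can also be sketched with `pd.to_numeric`, which turns every non-numeric entry into NaN when called with `errors="coerce"`. The sample values below are made up, and this is an alternative formulation rather than the notebook's own method.

```python
import pandas as pd

# Made-up sample mixing numeric prices with text labels like those listed above.
prices = pd.Series([4.99, "Free To Play", 0.99, "Free Demo", "Third-party"])

# Non-numeric entries become NaN instead of raising ValueError.
numeric = pd.to_numeric(prices, errors="coerce")

# Labels that failed the conversion (the counterpart of the no_floats list).
text_labels = prices[numeric.isna()].unique().tolist()

# Treating every label as price 0.0 mirrors the notebook's choice.
clean = numeric.fillna(0.0)
print(text_labels, clean.tolist())
```

A label such as 'Third-party' does not necessarily mean the game is free, so listing the labels before zeroing them, as the notebook does, remains a useful check.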


The "genres" column stores a list of genres for each game, so the unique values are inspected next:

In [336]:
# Explode the DataFrame on the "genres" column:
df_games_genre = df_games1[['genres', 'id']]
df_games_genre = df_games_genre.explode('genres')

# Get the unique values
unique_genres = df_games_genre['genres'].unique()
unique_genres

array(['Action', 'Casual', 'Indie', 'Simulation', 'Strategy',
       'Free to Play', 'RPG', 'Sports', 'Adventure', 'Racing',
       'Early Access', 'Massively Multiplayer',
       'Animation &amp; Modeling', 'Web Publishing', 'Education',
       'Software Training', 'Utilities', 'Design &amp; Illustration',
       'Audio Production', 'Video Production', 'Photo Editing'],
      dtype=object)

In [214]:
len(unique_genres)

21

In [337]:
df_games_genre

Unnamed: 0,genres,id
0,Action,761140
0,Casual,761140
0,Indie,761140
0,Simulation,761140
0,Strategy,761140
...,...,...
22528,Indie,610660
22528,Racing,610660
22528,Simulation,610660
22529,Casual,658870


Because each genre has to be analyzed separately, the genres are expanded into one column each:

In [338]:
# Fix the genre names that contain the HTML entity "&amp;":
df_games_genre['genres'] = df_games_genre['genres'].replace({
    "Animation &amp; Modeling": "Animation and Modeling",
    "Design &amp; Illustration": "Design and llustration",
})

# Create one column per genre:
df_games_genre = pd.get_dummies(data = df_games_genre, columns=["genres"], dtype = int, prefix="", prefix_sep = "")

print(df_games_genre.shape)
df_games_genre.head(2)

(55611, 22)


Unnamed: 0,id,Action,Adventure,Animation and Modeling,Audio Production,Casual,Design and llustration,Early Access,Education,Free to Play,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
0,761140,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,761140,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [339]:
# Row count of the original DataFrame, for comparison:
df_games1.shape

(22529, 10)

The genres DataFrame contains several rows per game (one per genre), so the rows are grouped by id before joining it to the original DataFrame:

In [348]:
# Group the genres by id:
df_games_genre = df_games_genre.groupby("id").sum().reset_index()

df_games_genre.shape

(22529, 22)
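
The explode, one-hot encode, and group-by sequence above can be sketched on a toy frame. The ids and genres below are made up, and the variable names are illustrative.

```python
import pandas as pd

# Toy data (made-up ids) with a list-valued "genres" column.
games = pd.DataFrame({
    "id": ["1", "2"],
    "genres": [["Action", "Indie"], ["Indie"]],
})

# One row per (id, genre) pair.
long_df = games.explode("genres")

# One indicator column per genre; still one row per pair.
dummies = pd.get_dummies(long_df, columns=["genres"], prefix="", prefix_sep="", dtype=int)

# Collapse back to one row per id by summing the indicators.
wide = dummies.groupby("id").sum().reset_index()
print(wide)
```

Summing works as a collapse step because each (id, genre) pair appears once, so every indicator in the result is 0 or 1.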

Join the new DataFrame to the original one:

In [358]:
df_steam_games = pd.merge(df_games1, df_games_genre, on = "id", how = "inner")
df_steam_games.head()

Unnamed: 0,publisher,genres,app_name,title,tags,specs,price,id,developer,release_year,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018,...,0,0,0,1,0,0,1,0,0,0
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",0.0,643980,Secret Level SRL,2018,...,0,1,0,0,0,0,1,0,0,0
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",0.0,670290,Poolians.com,2017,...,0,0,0,1,0,1,0,0,0,0
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017,...,0,0,0,0,0,0,0,0,0,0
4,Trickjump Games Ltd,"[Action, Adventure, Simulation]",Battle Royale Trainer,Battle Royale Trainer,"[Action, Adventure, Simulation, FPS, Shooter, ...","[Single-player, Steam Achievements]",3.99,772540,Trickjump Games Ltd,2018,...,0,0,0,1,0,0,0,0,0,0



Then drop the columns that will not be used:

In [359]:
df_steam_games = df_steam_games.drop(columns=["publisher","genres","specs", "tags"])

In [360]:
df_steam_games.columns

Index(['app_name', 'title', 'price', 'id', 'developer', 'release_year',
       'Action', 'Adventure', 'Animation and Modeling', 'Audio Production',
       'Casual', 'Design and llustration', 'Early Access', 'Education',
       'Free to Play', 'Indie', 'Massively Multiplayer', 'Photo Editing',
       'RPG', 'Racing', 'Simulation', 'Software Training', 'Sports',
       'Strategy', 'Utilities', 'Video Production', 'Web Publishing'],
      dtype='object')

A final check of the data types:

In [364]:
df_steam_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22527 entries, 0 to 22528
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   app_name                22527 non-null  object 
 1   title                   22527 non-null  object 
 2   price                   22527 non-null  float64
 3   item_id                 22527 non-null  object 
 4   developer               22527 non-null  object 
 5   release_year            22527 non-null  int32  
 6   Action                  22527 non-null  int32  
 7   Adventure               22527 non-null  int32  
 8   Animation and Modeling  22527 non-null  int32  
 9   Audio Production        22527 non-null  int32  
 10  Casual                  22527 non-null  int32  
 11  Design and llustration  22527 non-null  int32  
 12  Early Access            22527 non-null  int32  
 13  Education               22527 non-null  int32  
 14  Free to Play            22527 non-null  int

In [362]:
# Drop null values and cast release_year to int
df_steam_games = df_steam_games.dropna(subset="release_year")
df_steam_games["release_year"] = df_steam_games["release_year"].astype(int)

Rename the id column:

In [363]:
df_steam_games = df_steam_games.rename(columns = {"id" : "item_id"})

Export the modified DataFrame to a parquet file:

In [365]:
#df_steam_games.to_csv('steam_games.csv', index=False)

In [None]:
df_steam_games.to_parquet('steam_games.parquet', engine="pyarrow")

## User reviews:

The user items and user reviews files are nested JSON files, so the function below helps load them into pandas DataFrames:

In [2]:
def cargar_df(ruta, variable_anidada):
    '''Take the path to a nested json file and load its contents into a
    pandas DataFrame'''
    rows = []
    with open(ruta, 'r', encoding='utf-8') as file:                         # read the file line by line, appending each record to the list
        for line in file:
            rows.append(ast.literal_eval(line)) 
    
    df = pd.DataFrame(rows)                                                 # load the records into a pandas DataFrame
    df = df.explode(variable_anidada).reset_index()                         # one row per nested item, then reset the index
    df = df.drop(columns="index")                                           # drop the old index column
    df = pd.concat([df, pd.json_normalize(df[variable_anidada])], axis=1)   # expand the nested dicts into columns
    df = df.drop(columns=variable_anidada)                                  # drop the original nested column

    return df
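
The function parses each line with `ast.literal_eval` rather than `json.loads`. A plausible reason, assumed here rather than stated in the notebook, is that the nested files store Python-style literals (single-quoted strings, `True`/`False`), which strict JSON parsing rejects. A minimal sketch with a made-up line:

```python
import ast
import json

# Made-up line in the assumed style of the nested files: single quotes and Python booleans.
line = "{'user_id': 'u1', 'reviews': [{'item_id': '10', 'recommend': True}]}"

try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False       # strict JSON requires double-quoted strings

record = ast.literal_eval(line)  # evaluates Python literals only, never arbitrary code
print(parsed_as_json, record["reviews"][0]["item_id"])
```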

In [222]:
# Load the reviews DataFrame:
df_reviews = cargar_df(r'Datasets\australian_user_reviews.json', "reviews")
df_reviews.head()

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...


Check the dataset information:

In [223]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59333 entries, 0 to 59332
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      59333 non-null  object
 1   user_url     59333 non-null  object
 2   funny        59305 non-null  object
 3   posted       59305 non-null  object
 4   last_edited  59305 non-null  object
 5   item_id      59305 non-null  object
 6   helpful      59305 non-null  object
 7   recommend    59305 non-null  object
 8   review       59305 non-null  object
dtypes: object(9)
memory usage: 4.1+ MB


In [224]:
df_reviews.columns

Index(['user_id', 'user_url', 'funny', 'posted', 'last_edited', 'item_id',
       'helpful', 'recommend', 'review'],
      dtype='object')

Drop the unnecessary columns:

In [225]:
df_reviews = df_reviews.drop(columns=["user_url", "funny", "last_edited", "helpful"])
df_reviews.head()

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...
1,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.
2,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...
4,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...


Check the number of null values:

In [226]:
df_reviews.isna().sum()

user_id       0
posted       28
item_id      28
recommend    28
review       28
dtype: int64

In [227]:
# Drop the null values:
df_reviews = df_reviews.dropna(subset="item_id")
df_reviews.shape

(59305, 5)

Check the number of duplicate rows:

In [228]:
has_duplicates = df_reviews.duplicated().sum()
has_duplicates

874

Inspect the duplicate rows:

In [229]:
df_reviews[df_reviews.duplicated()]

Unnamed: 0,user_id,posted,item_id,recommend,review
1114,bokkkbokkk,"Posted September 24, 2015.",346110,True,yep
2894,ImSeriouss,"Posted January 10, 2014.",218620,True,"Good graphics, fun heists! A bit laggy"
2895,ImSeriouss,"Posted January 10, 2014.",105600,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...
2896,ImSeriouss,"Posted December 17, 2014.",570,True,bobo pinoy
2897,ImSeriouss,"Posted January 13, 2014.",211820,True,If you want to play this game.. expect glithes...
...,...,...,...,...,...
44456,76561198092022514,Posted July 3.,422400,True,Muy entretenido y una coleccion de armas prome...
44457,76561198092022514,Posted June 1.,218620,True,"Tiene una jugabilidad y tematica muy buena :D,..."
44458,76561198092022514,"Posted August 17, 2014.",261820,True,"Buen juego, no importa el desarrrollo que tien..."
44459,76561198092022514,"Posted February 17, 2014.",224260,True,exelente aporte :D¡¡¡ es una buen mod basado e...


In [230]:
mascara = (df_reviews["user_id"] == "bokkkbokkk")
df_reviews[mascara]

Unnamed: 0,user_id,posted,item_id,recommend,review
1113,bokkkbokkk,"Posted September 24, 2015.",346110,True,yep
1114,bokkkbokkk,"Posted September 24, 2015.",346110,True,yep


These are indeed duplicates, so they are dropped:

In [231]:
df_reviews = df_reviews.drop_duplicates()
df_reviews.shape

(58431, 5)

Next, the year in which the review was posted is extracted into a new "posted_year" column:

In [232]:
df_reviews["posted_year"] = df_reviews["posted"].str.extract(r'(\d{4})')

Check for missing values in the new column:

In [233]:
df_reviews["posted_year"].isnull().sum()

9933

Extracting the year leaves some values missing, so the affected rows are inspected:

In [234]:
df_reviews[df_reviews["posted_year"].isna()]

Unnamed: 0,user_id,posted,item_id,recommend,review,posted_year
6,evcentric,Posted February 3.,248820,True,A suitably punishing roguelike platformer. Wi...,
27,76561198079601835,Posted May 20.,730,True,ZIKA DO BAILE,
28,MeaTCompany,Posted July 24.,730,True,BEST GAME IN THE BLOODY WORLD,
31,76561198156664158,Posted June 16.,252950,True,love it,
32,76561198077246154,Posted June 11.,440,True,mt bom,
...,...,...,...,...,...,...
59328,76561198312638244,Posted July 10.,70,True,a must have classic from steam definitely wort...,
59329,76561198312638244,Posted July 8.,362890,True,this game is a perfect remake of the original ...,
59330,LydiaMorley,Posted July 3.,273110,True,had so much fun plaing this and collecting res...,
59331,LydiaMorley,Posted July 20.,730,True,:D,
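
The rows above share one trait: the "posted" text has no year (for example "Posted July 3."), so the four-digit pattern finds nothing to match. A small sketch with the standard `re` module, where `posted_year` is a hypothetical helper used only for illustration:

```python
import re

YEAR = re.compile(r"(\d{4})")

def posted_year(text):
    """Return the four-digit year found in a 'Posted ...' string, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None

# Both formats appear in the outputs above.
print(posted_year("Posted November 5, 2011."))
print(posted_year("Posted July 3."))
```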


In [235]:
df_reviews.shape

(58431, 6)

The reviews DataFrame is joined to the games DataFrame so that the missing years can be filled with the release year:

In [266]:
# Join the DataFrames
df_reviews_merged = pd.merge(df_reviews, df_steam_games, on = "item_id", how= "inner")

In [267]:
# Fill the missing values:
df_reviews_merged["posted_year"] = df_reviews_merged["posted_year"].fillna(df_reviews_merged["release_year"])
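
A toy sketch of the join-and-fill step, with made-up ids and years. It also illustrates a side effect of `how="inner"`: reviews whose `item_id` has no match in the games table are dropped.

```python
import pandas as pd

# Made-up reviews: the second has no year, the third refers to a game missing from the catalog.
reviews = pd.DataFrame({
    "item_id": ["10", "20", "99"],
    "posted_year": [2014, None, 2015],
})
games = pd.DataFrame({"item_id": ["10", "20"], "release_year": [2012, 2016]})

merged = pd.merge(reviews, games, on="item_id", how="inner")

# Fill the missing review year with the game's release year (index-aligned).
merged["posted_year"] = merged["posted_year"].fillna(merged["release_year"]).astype(int)
print(merged)
```

Using the release year is an approximation, since a review can be posted any year after release.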

In [268]:
df_reviews_merged.isnull().sum()

user_id                   0
posted                    0
item_id                   0
recommend                 0
review                    0
posted_year               0
app_name                  0
title                     0
price                     0
developer                 0
release_year              0
Action                    0
Adventure                 0
Animation and Modeling    0
Audio Production          0
Casual                    0
Design and llustration    0
Early Access              0
Education                 0
Free to Play              0
Indie                     0
Massively Multiplayer     0
Photo Editing             0
RPG                       0
Racing                    0
Simulation                0
Software Training         0
Sports                    0
Strategy                  0
Utilities                 0
Video Production          0
Web Publishing            0
dtype: int64

In [269]:
# Drop the columns that are not needed:
df_reviews_merged = df_reviews_merged.drop(columns=['app_name', 'title', 'price', 'developer', 'release_year', 'Action',
       'Adventure', 'Animation and Modeling', 'Audio Production', 'Casual',
       'Design and llustration', 'Early Access', 'Education', 'Free to Play',
       'Indie', 'Massively Multiplayer', 'Photo Editing', 'RPG', 'Racing',
       'Simulation', 'Software Training', 'Sports', 'Strategy', 'Utilities',
       'Video Production', 'Web Publishing'])

In [270]:
# Drop duplicates:
df_reviews_merged = df_reviews_merged.drop_duplicates()

In [271]:
# Drop the original "posted" column:
df_reviews_merged = df_reviews_merged.drop(columns="posted")
df_reviews_merged.head()

Unnamed: 0,user_id,item_id,recommend,review,posted_year
0,76561197970982479,1250,1,Simple yet with great replayability. In my opi...,2011
1,death-hunter,1250,1,"Amazing, Non-stop action of blowing stuff to b...",2015
2,DJKamBer,1250,1,"Compared to Left 4 Dead 2, this game REALLY gi...",2013
3,diego9031,1250,1,Jogo ♥♥♥♥.,2015
4,76561198081962345,1250,1,cara nas imagens esse jogo da pouco de medo ma...,2014


The "recommend" column holds boolean values, so they are replaced with numeric values:

In [272]:
df_reviews_merged['recommend'].value_counts()

recommend
1    43262
0     5077
Name: count, dtype: int64

In [273]:
# Import the label encoder and instantiate it:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Replace the column with its encoded version:
df_reviews_merged['recommend'] = le.fit_transform(df_reviews_merged['recommend'])

# Check that the counts match:
df_reviews_merged['recommend'].value_counts()

recommend
1    43262
0     5077
Name: count, dtype: int64
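
For a boolean column, a plain cast gives the same 0/1 coding as the label encoder (False becomes 0, True becomes 1). A minimal sketch with made-up values, offered as an alternative and not as the notebook's method:

```python
import pandas as pd

# Made-up boolean recommendations.
recommend = pd.Series([True, False, True], name="recommend")

# Casting booleans to int maps False -> 0 and True -> 1.
encoded = recommend.astype(int)
print(encoded.tolist())
```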

Finally, the "review" column is replaced with a "sentiment_analysis" column, obtained by applying NLP sentiment analysis on the following scale: <br>

    - If the review is negative, it takes the value '0'.
    - If the review is neutral, it takes the value '1'.
    - If the review is positive, it takes the value '2'.
    - If there is no review, it takes the value '1'.

In [274]:
# First, import the tools used for this task:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
#nltk.download('vader_lexicon')

Next, a function is defined that returns the classifications listed above, based on the score produced by SentimentIntensityAnalyzer() from the nltk library:

In [275]:
def classify_sentiment(text):
    analyzer = SentimentIntensityAnalyzer()         # instantiate the analyzer
    if pd.isnull(text) or text == '':               # null cell or empty string
        return 1
    elif isinstance(text, str):                         # otherwise, if it is a valid string
        sentiment = analyzer.polarity_scores(text)      # get the sentiment scores
        compound_score = sentiment['compound']          # keep the "compound" score

        # Classify the text according to the compound score
        if compound_score >= 0.05:
            return 2
        elif compound_score <= (-0.05):
            return 0
        else:
            return 1
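
The thresholding inside the function can be sketched on its own, without the analyzer, which makes the boundary cases easy to check. `label_from_compound` and the scores below are hypothetical and used only for illustration.

```python
def label_from_compound(compound_score, threshold=0.05):
    """Map a compound score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if compound_score >= threshold:
        return 2
    if compound_score <= -threshold:
        return 0
    return 1

# Hypothetical scores, chosen only to exercise each branch and both boundaries.
print([label_from_compound(s) for s in (0.62, 0.0, -0.4, 0.05, -0.05)])
```

Scores exactly at ±0.05 fall into the positive and negative classes respectively, matching the `>=` and `<=` comparisons used above.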


In [277]:
# Apply the function defined above
df_reviews_merged['sentiment_analysis'] = df_reviews_merged["review"].apply(classify_sentiment)
df_reviews_merged.head()

Unnamed: 0,user_id,item_id,recommend,review,posted_year,sentiment_analysis
0,76561197970982479,1250,1,Simple yet with great replayability. In my opi...,2011,2
1,death-hunter,1250,1,"Amazing, Non-stop action of blowing stuff to b...",2015,2
2,DJKamBer,1250,1,"Compared to Left 4 Dead 2, this game REALLY gi...",2013,0
3,diego9031,1250,1,Jogo ♥♥♥♥.,2015,1
4,76561198081962345,1250,1,cara nas imagens esse jogo da pouco de medo ma...,2014,1


In [278]:
# Check the counts for each class:
df_reviews_merged['sentiment_analysis'].value_counts()

sentiment_analysis
2    30765
1     9838
0     7736
Name: count, dtype: int64

Drop the columns that are no longer needed:

In [279]:
df_reviews_merged= df_reviews_merged.drop(columns=["review"])

A final check of the data types:

In [282]:
df_reviews_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48339 entries, 0 to 120669
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             48339 non-null  object
 1   item_id             48339 non-null  object
 2   recommend           48339 non-null  int64 
 3   posted_year         48339 non-null  int32 
 4   sentiment_analysis  48339 non-null  int64 
dtypes: int32(1), int64(2), object(2)
memory usage: 2.0+ MB


In [281]:
# Cast the "posted_year" column to int
df_reviews_merged["posted_year"] = df_reviews_merged["posted_year"].astype(int)

Finally, export the DataFrame to a parquet file:

In [283]:
#df_reviews_merged.to_csv('user_reviews.csv', index=False)

In [None]:
df_reviews_merged.to_parquet('user_reviews.parquet', engine="pyarrow")

## User items:

First, the data is loaded into a pandas DataFrame:

In [1]:
# EXPANDED DATASET
#df_items = cargar_df(r'Datasets\australian_users_items.json', "items")
#df_items.head()

In [5]:
rows = []
with open('australian_users_items.json', 'r', encoding='utf-8') as file:       # read the file line by line, appending each record to the list
    for line in file:
        rows.append(ast.literal_eval(line)) 

df = pd.DataFrame(rows)                             # load the records into a pandas DataFrame


In [6]:
# Work on a copy of the dataset so that any changes leave the original untouched:
user_items = df.copy()

In [7]:
user_items.tail()

Unnamed: 0,user_id,items_count,steam_id,user_url,items
88305,76561198323066619,22,76561198323066619,http://steamcommunity.com/profiles/76561198323...,"[{'item_id': '413850', 'item_name': 'CS:GO Pla..."
88306,76561198326700687,177,76561198326700687,http://steamcommunity.com/profiles/76561198326...,"[{'item_id': '11020', 'item_name': 'TrackMania..."
88307,XxLaughingJackClown77xX,0,76561198328759259,http://steamcommunity.com/id/XxLaughingJackClo...,[]
88308,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"[{'item_id': '304930', 'item_name': 'Unturned'..."
88309,edward_tremethick,0,76561198331598578,http://steamcommunity.com/id/edward_tremethick,[]


Inspect the information contained in the DataFrame:

In [8]:
user_items.shape

(88310, 5)

In [9]:
user_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88310 entries, 0 to 88309
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      88310 non-null  object
 1   items_count  88310 non-null  int64 
 2   steam_id     88310 non-null  object
 3   user_url     88310 non-null  object
 4   items        88310 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.4+ MB


Count the null values:

In [10]:
user_items.isnull().sum()

user_id        0
items_count    0
steam_id       0
user_url       0
items          0
dtype: int64

In [None]:
# Inspect the rows with null records to decide whether to drop them:
user_items[user_items["item_id"].isnull()]

Unnamed: 0,user_id,items_count,steam_id,user_url,item_id,item_name,playtime_forever,playtime_2weeks
3733,Wackky,0,76561198039117046,http://steamcommunity.com/id/Wackky,,,,
3849,76561198079601835,0,76561198079601835,http://steamcommunity.com/profiles/76561198079...,,,,
6019,hellom8o,0,76561198117222320,http://steamcommunity.com/id/hellom8o,,,,
6523,starkillershadow553,0,76561198059648579,http://steamcommunity.com/id/starkillershadow553,,,,
7237,darkenkane,0,76561198058876001,http://steamcommunity.com/id/darkenkane,,,,
...,...,...,...,...,...,...,...,...
5169470,76561198316380182,0,76561198316380182,http://steamcommunity.com/profiles/76561198316...,,,,
5169471,76561198316970597,0,76561198316970597,http://steamcommunity.com/profiles/76561198316...,,,,
5169472,76561198318100691,0,76561198318100691,http://steamcommunity.com/profiles/76561198318...,,,,
5170006,XxLaughingJackClown77xX,0,76561198328759259,http://steamcommunity.com/id/XxLaughingJackClo...,,,,


In [11]:
# Drop the null values for the "item_id" subset
user_items = user_items.dropna(subset="item_id").reset_index()
user_items = user_items.drop(columns="index")
user_items.shape

KeyError: ['item_id']

Drop the columns that will not be used in the analysis:

In [12]:
user_items = user_items.dropna()

In [13]:
user_items = user_items.drop(columns=["user_url","steam_id"])
user_items.shape

(88310, 3)

Check whether there are duplicate rows:

In [14]:
user_items.duplicated().sum()

TypeError: unhashable type: 'list'

In [15]:
user_items[user_items.duplicated()].head()

TypeError: unhashable type: 'list'
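
The `TypeError` arises because the "items" column holds Python lists, which cannot be hashed, and `duplicated()` relies on hashing. One possible workaround, sketched on a made-up frame, is to compare a string representation of that column. This is an assumption about how the check could be made to run, not the notebook's own solution.

```python
import pandas as pd

# Made-up frame in the shape of user_items: "items" holds lists of dicts.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "items_count": [1, 1, 0],
    "items": [[{"item_id": "10"}], [{"item_id": "10"}], []],
})

try:
    toy.duplicated()
    hashable = True
except TypeError:
    hashable = False        # lists cannot be hashed

# Workaround: flag duplicates using the repr of the list column.
mask = toy.assign(items=toy["items"].map(repr)).duplicated()
deduped = toy[~mask].reset_index(drop=True)
print(hashable, len(deduped))
```

Exploding "items" into one row per item, as the expanded dataset does, is another way to make every column hashable.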

In [None]:
# Drop the duplicate rows:
user_items = user_items.drop_duplicates().reset_index()

In [None]:
user_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5094092 entries, 0 to 5094091
Data columns (total 6 columns):
 #   Column            Dtype  
---  ------            -----  
 0   index             int64  
 1   user_id           object 
 2   items_count       int64  
 3   item_id           object 
 4   item_name         object 
 5   playtime_forever  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 233.2+ MB


In [None]:
# Cast the "user_id" column to string
user_items["user_id"] = user_items["user_id"].astype(str)

Finally, export the DataFrame to a parquet file:

In [None]:
#user_items.to_csv('user_items.csv', index=False)

In [None]:
#user_items.to_parquet('user_items_extended.parquet', engine="pyarrow")

In [18]:
user_items.to_parquet('user_items.parquet', engine="pyarrow")