This notebook walks through the steps to extract, load, and transform (ELT) the dataset used in the MVP (Minimum Viable Product).

In [11]:
import pandas as pd
import numpy as np
import re

In [12]:
# Load the steam_games.json.gz dataset
dataset_game = '..\\Datasets\\steam_games.json.gz'

# If the dataset is not found at the path given by "dataset_game", an error message is shown
try:
    df_game = pd.read_json(dataset_game, compression = 'gzip', lines = True, orient='records')
    print('Dataset steam_games.json.gz loaded successfully ... :)')
except FileNotFoundError:
    print('ERROR: steam_games.json.gz does not exist at the given path ... :(')

Dataset steam_games.json.gz loaded successfully ... :)
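
The loading cell prints a message when the file is missing and then continues, so later cells would fail because `df_game` is undefined. It also hard-codes Windows path separators. A minimal sketch of a more portable loader follows. The function name `load_games` and the constant `DATASET_PATH` are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_games(path):
    """Load a gzip-compressed JSON Lines file into a DataFrame.

    Raises FileNotFoundError instead of printing, so later cells
    never run against an undefined DataFrame.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{path} does not exist")
    return pd.read_json(path, compression="gzip", lines=True)


# Hypothetical location, mirroring the relative path used in this notebook.
# Path joins the parts with the separator of the current operating system.
DATASET_PATH = Path("..") / "Datasets" / "steam_games.json.gz"
```

Calling `df_game = load_games(DATASET_PATH)` would then either return the DataFrame or stop the notebook with a clear error.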


In [13]:
# Inspect the dimensions (rows, columns) of the dataset
print(f'Rows : {df_game.shape[0]}, Columns : {df_game.shape[1]}')

Rows : 120445, Columns : 13


In [14]:
# Inspect the first rows of the dataset
df_game.head(3)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,


In [15]:
# Inspect the number of records, the variables, their data types and the non-null count per column
df_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   publisher     24083 non-null  object 
 1   genres        28852 non-null  object 
 2   app_name      32133 non-null  object 
 3   title         30085 non-null  object 
 4   url           32135 non-null  object 
 5   release_date  30068 non-null  object 
 6   tags          31972 non-null  object 
 7   reviews_url   32133 non-null  object 
 8   specs         31465 non-null  object 
 9   price         30758 non-null  object 
 10  early_access  32135 non-null  float64
 11  id            32133 non-null  float64
 12  developer     28836 non-null  object 
dtypes: float64(2), object(11)
memory usage: 11.9+ MB


We see an "id" variable, which is the unique identifier of each item. It has 88312 null values (120445 rows minus 32133 non-null), so as a first step we drop every row where id is NaN.

In [16]:
# Count the null values per variable
nulos_por_variables = [(column, df_game[column].isnull().sum()) for column in df_game.columns ]
nulos_por_variables

[('publisher', 96362),
 ('genres', 91593),
 ('app_name', 88312),
 ('title', 90360),
 ('url', 88310),
 ('release_date', 90377),
 ('tags', 88473),
 ('reviews_url', 88312),
 ('specs', 88980),
 ('price', 89687),
 ('early_access', 88310),
 ('id', 88312),
 ('developer', 91609)]
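
The per-column null count above is built with a list comprehension. As a sketch, the same numbers can be obtained directly from pandas, along with the share of nulls per column. The `toy` frame below uses made-up values and only stands in for `df_game`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_game (values are made up for illustration).
toy = pd.DataFrame({
    "id": [10.0, np.nan, 30.0, np.nan],
    "title": ["A", None, None, None],
})

null_counts = toy.isnull().sum()    # nulls per column, as a Series
null_share = toy.isnull().mean()    # fraction of nulls per column

print(null_counts)
print(null_share)
```

On the real data, `df_game.isnull().sum()` should report the same counts as the list of tuples shown above.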

In [17]:
# Drop the rows where "id" is NaN; the operation modifies the DataFrame in place.
# The "id" lets us relate these rows to data held in other datasets.
df_game.dropna(subset= "id", axis = 0, inplace=True)

In [18]:
# Inspect the structure of the data after dropping the 88312 rows with a null "id"
df_game.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 88310 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   publisher     24082 non-null  object 
 1   genres        28851 non-null  object 
 2   app_name      32132 non-null  object 
 3   title         30084 non-null  object 
 4   url           32133 non-null  object 
 5   release_date  30067 non-null  object 
 6   tags          31971 non-null  object 
 7   reviews_url   32133 non-null  object 
 8   specs         31464 non-null  object 
 9   price         30756 non-null  object 
 10  early_access  32133 non-null  float64
 11  id            32133 non-null  float64
 12  developer     28835 non-null  object 
dtypes: float64(2), object(11)
memory usage: 3.4+ MB
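
After the drop, the index still starts at 88310 and "id" is still stored as float64. The following sketch shows one possible variant that returns a new frame, resets the index, and casts the identifier to an integer. It runs on a two-row toy frame built from the titles shown later in this notebook. The names `toy`, `cleaned`, and `rows_dropped` are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [np.nan, np.nan, 761140.0, 643980.0],
    "title": [None, None, "Lost Summoner Kitty", "Ironbound"],
})

rows_before = len(toy)
# dropna without inplace returns a new frame; reset_index renumbers rows from 0.
cleaned = toy.dropna(subset=["id"]).reset_index(drop=True)
rows_dropped = rows_before - len(cleaned)

# With no NaN left, the float ids can be stored as integers.
cleaned["id"] = cleaned["id"].astype("int64")

print(rows_dropped)
print(cleaned)
```

Whether to reset the index is a design choice: keeping the original labels preserves traceability to the raw file, while a fresh index is simpler to work with downstream.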


In [19]:
# Exclude the variables that we consider will not contribute information to the model
variables_excluidas = ["publisher", "app_name", "url", "reviews_url", "tags", "specs", "early_access"]
df_game.drop(variables_excluidas,axis=1,inplace=True)
df_game.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 88310 to 120444
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   genres        28851 non-null  object 
 1   title         30084 non-null  object 
 2   release_date  30067 non-null  object 
 3   price         30756 non-null  object 
 4   id            32133 non-null  float64
 5   developer     28835 non-null  object 
dtypes: float64(1), object(5)
memory usage: 1.7+ MB


In [20]:
# Look for null values in the remaining variables.
nulos_por_variables = [(column, df_game[column].isnull().sum()) for column in df_game.columns ]
nulos_por_variables

[('genres', 3282),
 ('title', 2049),
 ('release_date', 2066),
 ('price', 1377),
 ('id', 0),
 ('developer', 3298)]

In [21]:
# Drop the rows that have a null value in any of the remaining columns
df_game.dropna(axis=0, inplace=True)
df_game.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27462 entries, 88310 to 120443
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   genres        27462 non-null  object 
 1   title         27462 non-null  object 
 2   release_date  27462 non-null  object 
 3   price         27462 non-null  object 
 4   id            27462 non-null  float64
 5   developer     27462 non-null  object 
dtypes: float64(1), object(5)
memory usage: 1.5+ MB


In [22]:

# Convert the object-typed "price" variable to float; any string value (e.g. "Free to play") becomes 0.0
df_game["price"] = df_game["price"].apply(lambda fila: 0.0 if isinstance(fila, str) else float(fila))

In [23]:
df_game["price"].describe()

count    27462.000000
mean         9.010884
std         15.987402
min          0.000000
25%          2.990000
50%          4.990000
75%          9.990000
max        995.000000
Name: price, dtype: float64
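
The conversion above maps every string in "price" to 0.0, whether or not it actually denotes a free game. As a hedged alternative, the sketch below sets the price to zero only for labels that contain "free" and leaves other text as missing. The sample values are made up and are only assumptions about what the raw column may contain.

```python
import pandas as pd

# Made-up sample; the labels are assumptions about what the raw column may contain.
raw = pd.Series([4.99, "Free to Play", "Free", "Third-party", 0.99], dtype="object")

# Non-numeric strings become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# Flag the entries whose text mentions "free", ignoring case.
is_free = raw.astype(str).str.contains("free", case=False)

# Only the "free"-like labels become 0.0; other text stays NaN.
price = numeric.mask(is_free, 0.0)

print(price)
```

Rows that remain NaN can then be reviewed or dropped explicitly, rather than being silently counted as free.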

In [24]:
# Inspect the structure of our dataset before working on the release_date column, where only the year is of interest
df_game.head(3)

Unnamed: 0,genres,title,release_date,price,id,developer
88310,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,4.99,761140.0,Kotoshiro
88311,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,0.0,643980.0,Secret Level SRL
88312,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,0.0,670290.0,Poolians.com


In [25]:
# Extract the year from release_date; values not in YYYY-MM-DD format are labeled 'Sin Dato' (no data)
df_game["release_date"] = df_game["release_date"].apply(lambda fila: 'Sin Dato' if not re.match(r'^\d{4}-\d{2}-\d{2}$', fila) else fila.split('-')[0])
df_game.head(3)

Unnamed: 0,genres,title,release_date,price,id,developer
88310,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro
88311,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018,0.0,643980.0,Secret Level SRL
88312,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017,0.0,670290.0,Poolians.com
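
The cell above keeps the year as text and uses the placeholder 'Sin Dato' for dates that do not match, which leaves the column as a string type. The following sketch uses the same regular expression but stores the year in a nullable integer column. The non-ISO sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample mixing ISO dates with formats the regex would reject.
dates = pd.Series(["2018-01-04", "2017-07-24", "Soon..", "Jun 2009"])

# str.extract returns the captured group, or NaN when the pattern does not match.
year_text = dates.str.extract(r"^(\d{4})-\d{2}-\d{2}$", expand=False)

# Convert to numbers, then to the nullable Int64 dtype so missing years stay <NA>.
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(year)
```

A numeric year allows range filters and sorting without handling a text placeholder, at the cost of having to treat `<NA>` explicitly when exporting or aggregating.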


In [26]:
# Expand the genres list so that each genre gets its own row
df_game = df_game.explode('genres')
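
`DataFrame.explode` turns each element of a list-valued cell into its own row and repeats the original index label for every new row. The sketch below shows this on a two-row toy frame with shortened genre lists, and one way to obtain a unique index afterwards. The names `toy`, `exploded`, and `flat` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Lost Summoner Kitty", "Ironbound"],
        "genres": [["Action", "Casual"], ["Free to Play", "Indie", "RPG"]],
    },
    index=[88310, 88311],
)

# Each list element becomes a row; the other columns are repeated.
exploded = toy.explode("genres")

# The index now contains duplicates, so renumber it if unique labels are needed.
flat = exploded.reset_index(drop=True)

print(exploded)
print(flat)
```

Duplicate index labels are harmless for many operations, but label-based lookups such as `exploded.loc[88310]` return several rows instead of one.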

In [27]:
# Rename id to item_id
renombrar_columna = {"id": "item_id"}
df_game.rename(columns=renombrar_columna, inplace=True)
df_game.head()

Unnamed: 0,genres,title,release_date,price,item_id,developer
88310,Action,Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro
88310,Casual,Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro
88310,Indie,Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro
88310,Simulation,Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro
88310,Strategy,Lost Summoner Kitty,2018,4.99,761140.0,Kotoshiro


In [28]:
# Write the result to a Parquet file
dataset_game = '..\\Datasets\\steam_games.parquet'
df_game.to_parquet(dataset_game)