In this notebook we carry out the ETL process for the file australian_user_reviews.json, located in the data folder. We read it, analyze null values and empty records, remove them or impute values where necessary, and finish by writing the cleaned data to a CSV file.

### Data extraction: users_reviews

We import the libraries we will use.

In [20]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
import warnings
from datetime import datetime
import Tools as t
warnings.filterwarnings("ignore")

We extract the data from the dataset.

In [21]:
import ast

ruta_review = 'data/australian_user_reviews.json'
filas_review = []

with open(ruta_review, encoding='utf-8') as f:
    for line in f:
        try:
            data = json.loads(line)         # Try to parse the line as standard JSON
        except json.JSONDecodeError:        # If that fails (e.g. single-quoted strings):
            data = ast.literal_eval(line)   # fall back to ast.literal_eval(), which parses Python-literal syntax
        filas_review.append(data)           # Append the parsed record to the list

df_review = pd.DataFrame(filas_review)      # Once every line is loaded, build a DataFrame from the list
df_review                                   # Inspect the DataFrame

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."
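
The fallback used in the loading loop can be tried in isolation. The sketch below uses two made-up lines (not taken from the file) and a hypothetical helper name, `parse_line`: the first line is valid JSON, the second uses single quotes and the Python literal `True`, which `json.loads` rejects.

```python
import ast
import json

def parse_line(line):
    """Parse one record: standard JSON first, Python-literal syntax as a fallback."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Single-quoted strings and True/False are valid Python literals, not valid JSON
        return ast.literal_eval(line)

valid = parse_line('{"user_id": "js41637", "reviews": []}')
python_style = parse_line("{'user_id': 'js41637', 'reviews': [{'recommend': True}]}")
print(valid["user_id"], python_style["reviews"][0]["recommend"])
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, booleans, None), so it is a safer choice than `eval` for this kind of file.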


From the reviews column, we extract the information stored in the dictionaries.

In [22]:
df_clean_reviews = df_review.explode('reviews')     # Unnest the reviews column: one row per review

# Drop the 'reviews' column from df_clean_reviews, expand each review dictionary into
# one column per key, and join both parts into a new DataFrame, df_reviews2
df_reviews2 = pd.concat([df_clean_reviews.drop(['reviews'], axis=1), df_clean_reviews['reviews'].apply(pd.Series)], axis=1)



In [23]:
df_reviews2.head()                                  # Check how the DataFrame looks

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
1,js41637,http://steamcommunity.com/id/js41637,,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
1,js41637,http://steamcommunity.com/id/js41637,,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,


A column containing only NaN was created, so we remove it.

In [24]:
del df_reviews2[0]
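
The all-NaN column `0` most likely comes from users whose `reviews` list is empty: `explode` turns an empty list into NaN, and expanding a NaN produces a column labelled `0`. As a sketch on made-up data (the names `toy`, `fields` and `flat` are illustrative), the expansion can also be done only on the rows that actually hold a dictionary, which avoids the extra column:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["a", "b"],
    "reviews": [
        [{"item_id": "10", "recommend": True}, {"item_id": "20", "recommend": False}],
        [],  # a user without reviews
    ],
})

# One row per review; the empty list becomes a single NaN row
exploded = toy.explode("reviews").reset_index(drop=True)

# Expand only the rows that hold a dictionary
reviews = exploded.loc[exploded["reviews"].notna(), "reviews"]
fields = pd.DataFrame(reviews.tolist(), index=reviews.index)

# Join back; the user without reviews keeps NaN in the review fields
flat = exploded.drop(columns="reviews").join(fields)
print(flat)
```

The user without reviews is still present in `flat`, so those rows show up later as nulls in every review field.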

We verify that the information was stored correctly in the new DataFrame.

In [25]:
fila_etiqueta = df_review.loc[1564]                                 # Data in the original df
print(fila_etiqueta)
contenido_reviews = df_review.loc[1564, 'reviews']
print(contenido_reviews)

user_id                                     76561198056300439
user_url    http://steamcommunity.com/profiles/76561198056...
reviews     [{'funny': '1 person found this review funny',...
Name: 1564, dtype: object
[{'funny': '1 person found this review funny', 'posted': 'Posted April 26, 2015.', 'last_edited': '', 'item_id': '730', 'helpful': '0 of 1 people (0%) found this review helpful', 'recommend': False, 'review': '♥♥♥♥ING ♥♥♥♥ OF COMPETITIVE MATCHESTHIS ♥♥♥♥ING ♥♥♥♥ FIND GAMES THAT MY TEAM HAS ♥♥♥♥ING 20 HOURS OF GAME, AND I HAVE 700, ♥♥♥♥ THIS ♥♥♥♥, NEVER GOT ONE GOOD TEAM ON THIS ♥♥♥♥ING FREAKING ♥♥♥♥♥♥♥♥♥♥♥♥ GAME'}, {'funny': '', 'posted': 'Posted January 17.', 'last_edited': '', 'item_id': '304240', 'helpful': 'No ratings yet', 'recommend': True, 'review': 'Best Game ever'}, {'funny': '', 'posted': 'Posted May 10, 2014.', 'last_edited': '', 'item_id': '238320', 'helpful': 'No ratings yet', 'recommend': True, 'review': "Well what can i say about this game?It's the BEST SURVIVAL 

In [26]:
fila_etiqueta = df_reviews2.loc[1564]                               # Data in the unnested df
fila_etiqueta

Unnamed: 0,user_id,user_url,funny,posted,last_edited,item_id,helpful,recommend,review
1564,76561198056300439,http://steamcommunity.com/profiles/76561198056...,1 person found this review funny,"Posted April 26, 2015.",,730,0 of 1 people (0%) found this review helpful,False,♥♥♥♥ING ♥♥♥♥ OF COMPETITIVE MATCHESTHIS ♥♥♥♥IN...
1564,76561198056300439,http://steamcommunity.com/profiles/76561198056...,,Posted January 17.,,304240,No ratings yet,True,Best Game ever
1564,76561198056300439,http://steamcommunity.com/profiles/76561198056...,,"Posted May 10, 2014.",,238320,No ratings yet,True,Well what can i say about this game?It's the B...
1564,76561198056300439,http://steamcommunity.com/profiles/76561198056...,,"Posted May 10, 2014.",,225260,No ratings yet,True,This Game Is AWESOME!!!It's very fun to play i...


#### Data transformation

We check the data type of each variable and the number of records and columns.

In [27]:
df_reviews2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59333 entries, 0 to 25798
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      59333 non-null  object
 1   user_url     59333 non-null  object
 2   funny        59305 non-null  object
 3   posted       59305 non-null  object
 4   last_edited  59305 non-null  object
 5   item_id      59305 non-null  object
 6   helpful      59305 non-null  object
 7   recommend    59305 non-null  object
 8   review       59305 non-null  object
dtypes: object(9)
memory usage: 6.5+ MB


We count the null values in df_reviews2, starting with the records that are entirely null, which we remove if any are found.

In [28]:
cantidad_filas_nulas = df_reviews2.isnull().all(axis=1).sum()
print("Number of rows with all values null:", cantidad_filas_nulas)
if cantidad_filas_nulas > 0:
    df_reviews2 = df_reviews2.dropna(how='all').reset_index(drop=True)
    print('The completely empty records were removed')
else:
    print('No records were removed, since no completely empty records were found')

Number of rows with all values null: 0
No records were removed, since no completely empty records were found


Next, we check how many nulls remain in our DataFrame.

In [29]:
nulos = t.PorcentajeNulos(df_reviews2)
nulos

Unnamed: 0,%_valores_nulos,Cantidad_Nulos,Cantidad_NO_Nulos,Total_Registros
user_id,0.0,0,59333,59333
user_url,0.0,0,59333,59333
funny,0.05,28,59305,59333
posted,0.05,28,59305,59333
last_edited,0.05,28,59305,59333
item_id,0.05,28,59305,59333
helpful,0.05,28,59305,59333
recommend,0.05,28,59305,59333
review,0.05,28,59305,59333
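
The `Tools` module is not shown in this notebook, so the implementation of `t.PorcentajeNulos` is unknown. As a rough sketch, a helper that produces a table with the same columns could look like the following; the function name `null_summary` and its body are assumptions, not the project's actual code.

```python
import pandas as pd

def null_summary(df):
    """Per-column null statistics, mirroring the columns of the table above."""
    total = len(df)
    nulls = df.isnull().sum()
    return pd.DataFrame({
        "%_valores_nulos": (nulls / total * 100).round(2),
        "Cantidad_Nulos": nulls,
        "Cantidad_NO_Nulos": total - nulls,
        "Total_Registros": total,
    })

# Made-up data: one missing review out of four rows
toy = pd.DataFrame({"user_id": ["a", "b", "c", "d"], "review": ["x", None, "y", "z"]})
summary = null_summary(toy)
print(summary)
```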


We drop the rows with null values, since they account for only about 0.05% of the records (28 of 59,333), well below a 4% threshold.

In [30]:
df_reviews2 = df_reviews2.dropna()

In the funny column, 86% of the cells are empty strings (not nulls), so we drop the column from the DataFrame.

In [31]:
vacios = pd.DataFrame(df_reviews2['funny'].value_counts() / len(df_reviews2) * 100)
vacios = vacios.rename(columns={'count': '%Vacios'})
vacios

Unnamed: 0_level_0,%Vacios
funny,Unnamed: 1_level_1
,86.255796
1 person found this review funny,8.734508
2 people found this review funny,2.077397
3 people found this review funny,0.827923
4 people found this review funny,0.450215
...,...
58 people found this review funny,0.001686
405 people found this review funny,0.001686
105 people found this review funny,0.001686
"1,130 people found this review funny",0.001686


In [32]:
df_reviews2 = df_reviews2.drop('funny', axis=1)                 # Drop the column

The same happens with last_edited, where about 90% of the cells are empty strings.

In [33]:
vacios = pd.DataFrame(df_reviews2['last_edited'].value_counts() / len(df_reviews2) * 100)
vacios = vacios.rename(columns={'count': '%Vacios'})
vacios

Unnamed: 0_level_0,%Vacios
last_edited,Unnamed: 1_level_1
,89.646741
"Last edited November 25, 2013.",0.166934
"Last edited October 17, 2015.",0.032038
"Last edited June 6, 2015.",0.030352
Last edited January 3.,0.028665
...,...
"Last edited May 30, 2015.",0.001686
"Last edited May 21, 2015.",0.001686
"Last edited February 11, 2014.",0.001686
"Last edited May 8, 2014.",0.001686


In [34]:
df_reviews2 = df_reviews2.drop('last_edited', axis=1)           # Drop the column

We check for duplicates.
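
Empty strings are not counted by `isnull()`, which is why these columns looked complete in the null summary. As a sketch on made-up data, the share of empty strings can be computed for several columns at once by comparing against `''`:

```python
import pandas as pd

toy = pd.DataFrame({
    "funny": ["", "", "", "1 person found this review funny"],
    "last_edited": ["", "Last edited January 3.", "", ""],
})

# eq('') marks empty strings; the mean of a boolean column is the share of True values
empty_share = toy.eq("").mean().mul(100)
print(empty_share)
```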

In [35]:
filas_duplicadas = df_reviews2[df_reviews2.duplicated(keep='first')]
filas_duplicadas.head()

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review
456,bokkkbokkk,http://steamcommunity.com/id/bokkkbokkk,"Posted September 24, 2015.",346110,1 of 1 people (100%) found this review helpful,True,yep
1182,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 10, 2014.",218620,1 of 3 people (33%) found this review helpful,True,"Good graphics, fun heists! A bit laggy"
1182,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 10, 2014.",105600,0 of 2 people (0%) found this review helpful,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...
1182,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted December 17, 2014.",570,No ratings yet,True,bobo pinoy
1182,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 13, 2014.",211820,No ratings yet,True,If you want to play this game.. expect glithes...


After checking the rows flagged as duplicates, we see that they are not really duplicate records, so they will not be removed.

In [36]:
user_id = 'ImSeriouss'

fila_usuario = df_reviews2.loc[df_reviews2['user_id'] == user_id]
fila_usuario.head()

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review
1181,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 10, 2014.",218620,1 of 3 people (33%) found this review helpful,True,"Good graphics, fun heists! A bit laggy"
1181,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 10, 2014.",105600,0 of 2 people (0%) found this review helpful,True,So fun! DEFINITELY NOT RIP OFF OF MINECRAFT! e...
1181,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted December 17, 2014.",570,No ratings yet,True,bobo pinoy
1181,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 13, 2014.",211820,No ratings yet,True,If you want to play this game.. expect glithes...
1181,ImSeriouss,http://steamcommunity.com/id/ImSeriouss,"Posted January 10, 2014.",440,No ratings yet,True,Really good game! fun! Good for people who wan...
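
When reviewing flagged rows it helps to remember that `DataFrame.duplicated` compares column values only and ignores index labels. The sketch below, on made-up rows, uses `keep=False` to list every copy side by side so that each group can be judged as a whole.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u1"],
        "item_id": ["440", "570", "440", "570"],
        "review": ["fun", "ok", "fun", "ok"],
    },
    index=[1181, 1181, 1182, 1182],  # different index labels, same column values
)

flags = toy.duplicated(keep="first")           # True for the second and later copies
copies = toy[toy.duplicated(keep=False)]       # every row that has at least one twin
copies = copies.sort_values(["user_id", "item_id"])
print(copies)
```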


To simplify the DataFrame, we add a year_posted column containing only the year, for later analysis. Dates that carry no year are filled with the median year.

In [37]:
df_reviews2['year_posted'] = df_reviews2['posted'].str.extract(r'(\d{4})')          # Extract the four-digit year
mediana = df_reviews2['year_posted'].dropna().astype(float).median()                  # More than 10,000 rows have no year, so we take the median of the years that are present
df_reviews2['year_posted'] = df_reviews2['year_posted'].fillna(mediana)               # Fill the missing years with that median
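
The extraction and imputation can be followed on a few made-up dates. This sketch converts the extracted text to numbers right away, so the column ends up with a single integer type instead of a mix of strings and floats.

```python
import pandas as pd

posted = pd.Series([
    "Posted November 5, 2011.",
    "Posted July 21.",            # no year in the text
    "Posted June 24, 2014.",
    "Posted May 10, 2014.",
])

# str.extract returns a DataFrame with one column per capture group; take the first
year = pd.to_numeric(posted.str.extract(r"(\d{4})")[0])

# Fill missing years with the median of the known ones, then cast to int
year = year.fillna(year.median()).astype(int)
print(year.tolist())
```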

### Feature Engineering

#### Sentiment analysis

In [40]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
import langid

sia = SentimentIntensityAnalyzer()

We check which language the reviews are written in.

In [41]:
contador_idiomas = {}

for i in df_reviews2['review']:                                         # For each review in the DataFrame
    idioma,_ = langid.classify(i)                                       # Classify its language
    contador_idiomas[idioma] = contador_idiomas.get(idioma,0) + 1       # Add 1 to that language's counter in the dictionary

contador_idiomas

{'en': 50527,
 'pl': 199,
 'de': 342,
 'pt': 1800,
 'fr': 302,
 'es': 1211,
 'da': 205,
 'th': 712,
 'sv': 153,
 'mr': 33,
 'ru': 204,
 'hi': 190,
 'lt': 77,
 'tl': 20,
 'it': 327,
 'zh': 322,
 'ko': 187,
 'nn': 62,
 'no': 559,
 'nl': 367,
 'wa': 85,
 'ne': 208,
 'br': 30,
 'km': 58,
 'af': 30,
 'eu': 89,
 'sl': 49,
 'cy': 45,
 'ms': 20,
 'id': 68,
 'mk': 3,
 'sk': 51,
 'fi': 68,
 'an': 20,
 'hy': 34,
 'eo': 22,
 'tr': 32,
 'nb': 17,
 'la': 78,
 'et': 75,
 'is': 12,
 'ja': 32,
 'cs': 29,
 'az': 10,
 'lv': 10,
 'ka': 16,
 'ca': 19,
 'se': 2,
 'ro': 33,
 'mt': 60,
 'gl': 24,
 'be': 3,
 'bg': 9,
 'mg': 12,
 'xh': 9,
 'sw': 8,
 'ky': 1,
 'ur': 5,
 'lo': 4,
 'vi': 5,
 'rw': 6,
 'ug': 5,
 'vo': 3,
 'he': 3,
 'sq': 6,
 'hu': 22,
 'hr': 8,
 'sr': 7,
 'lb': 4,
 'am': 11,
 'ht': 4,
 'bs': 5,
 'zu': 1,
 'fo': 2,
 'mn': 5,
 'qu': 4,
 'kk': 9,
 'jv': 1,
 'or': 1,
 'oc': 2,
 'uk': 3,
 'bn': 3,
 'ku': 1,
 'ar': 2,
 'ml': 1,
 'fa': 1,
 'ga': 1}

To make the result easier to read, we convert the dictionary to a DataFrame and compute the percentage of all reviews that each language represents.

In [47]:
contador_idiomas_sorted = dict(sorted(contador_idiomas.items(), key=lambda item: item[1], reverse=True))    # Sort the dictionary from highest to lowest count
pd_idiomas = pd.DataFrame(list(contador_idiomas_sorted.items()), columns=['Idioma', 'Conteo'])              # Store the dictionary in a DataFrame
pd_idiomas['Proporcion'] = round(pd_idiomas['Conteo'] * 100 / pd_idiomas['Conteo'].sum(), 2)                # Percentage of reviews in each language
pd_idiomas

Unnamed: 0,Idioma,Conteo,Proporcion
0,en,50527,85.20
1,pt,1800,3.04
2,es,1211,2.04
3,th,712,1.20
4,no,559,0.94
...,...,...,...
82,or,1,0.00
83,ku,1,0.00
84,ml,1,0.00
85,fa,1,0.00
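
If the detected language codes are kept in a Series rather than a dictionary, the same table can be built with `value_counts`, which already sorts from most to least frequent. This is a sketch on made-up codes; the column names follow the ones used above.

```python
import pandas as pd

# Made-up language codes, one per review
codes = pd.Series(["en", "en", "en", "pt", "es"])

summary = codes.value_counts().rename_axis("Idioma").reset_index(name="Conteo")
summary["Proporcion"] = (summary["Conteo"] * 100 / summary["Conteo"].sum()).round(2)
print(summary)
```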


We can see that 85% of the reviews are in English, so to keep processing time down we treat all reviews as English in this analysis. VADER's lexicon is English, so reviews in other languages will mostly receive a neutral score.

##### VADER

In [48]:
res = {}                                                                    # Empty dictionary to store the results
for i, linea in tqdm(df_reviews2.iterrows(), total=len(df_reviews2)):       # Iterate over every row of the df
    text = linea['review']
    myid = linea['user_id']
    res[myid] = sia.polarity_scores(text)                                   # Polarity scores; res is keyed by user_id, so a user with several reviews keeps only the last one scored
 


In [50]:
vaders = pd.DataFrame(res).T                                                                # VADER results (neg, neu, pos, compound), one row per user_id

merged_df = vaders.merge(df_reviews2, left_index=True, right_on='user_id', how='left')      # Join the VADER results with the reviews DataFrame on user_id
merged_df

Unnamed: 0,neg,neu,pos,compound,user_id,user_url,posted,item_id,helpful,recommend,review,year_posted
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.,2011
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011
1,0.081,0.541,0.378,0.7713,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014
1,0.081,0.541,0.378,0.7713,js41637,http://steamcommunity.com/id/js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013
...,...,...,...,...,...,...,...,...,...,...,...,...
25797,0.026,0.692,0.282,0.9786,76561198312638244,http://steamcommunity.com/profiles/76561198312...,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...,2014.0
25797,0.026,0.692,0.282,0.9786,76561198312638244,http://steamcommunity.com/profiles/76561198312...,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...,2014.0
25798,0.000,0.203,0.797,0.8349,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,2014.0
25798,0.000,0.203,0.797,0.8349,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,Posted July 20.,730,No ratings yet,True,:D,2014.0
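
Because `res` is keyed by `user_id`, the merge attaches a single set of scores to every review written by the same user, which is why the first rows of the output repeat the same values. If one score per review is wanted instead, a possible approach is to score row by row and attach the results by position. The sketch below is an assumption-laden illustration: `add_review_scores` is a hypothetical helper, and `toy_scorer` is a stand-in for `sia.polarity_scores` so the example runs without NLTK data.

```python
import pandas as pd

def add_review_scores(df, score_fn, text_col="review"):
    """Return a copy of df with one set of scores per row (that is, per review)."""
    out = df.reset_index(drop=True)   # unique positional index, safe for concat
    scores = pd.DataFrame([score_fn(text) for text in out[text_col]])
    return pd.concat([out, scores], axis=1)

def toy_scorer(text):
    # Hypothetical stand-in for sia.polarity_scores, only to keep the example runnable
    positive, negative = {"great", "fun"}, {"bad", "laggy"}
    words = text.lower().split()
    value = sum(w in positive for w in words) - sum(w in negative for w in words)
    return {"compound": float(value)}

toy = pd.DataFrame(
    {"user_id": ["a", "a", "b"], "review": ["great fun", "bad laggy", "fun"]},
    index=[0, 0, 1],   # repeated labels, as produced by explode
)
scored = add_review_scores(toy, toy_scorer)
print(scored)
```

In the notebook, `score_fn` would be `sia.polarity_scores`, which returns a dictionary with the keys `neg`, `neu`, `pos` and `compound`.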


We verify that the merge was performed correctly by looking up one user's rows in df_reviews2.

In [51]:
fila = df_reviews2.loc[df_reviews2['user_id'] == 'js41637']
fila

Unnamed: 0,user_id,user_url,posted,item_id,helpful,recommend,review,year_posted
1,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014
1,js41637,http://steamcommunity.com/id/js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013
1,js41637,http://steamcommunity.com/id/js41637,"Posted November 29, 2013.",239030,1 of 4 people (25%) found this review helpful,True,Very fun little game to play when your bored o...,2013


We recode the results of the sentiment analysis using the compound column, an overall indicator of the review's sentiment that combines the pos, neg and neu scores, and assign new values:
* 0 if it is Bad
* 1 if it is Neutral
* 2 if it is Good

In [52]:
merged_df['score'] = 1                                                              # Default the whole column to 1 (neutral)


for index, row in merged_df.iterrows():                                             # Iterate over the rows
    if row['compound'] > 0:                                                         # Update 'score' according to the compound value
        merged_df.at[index, 'score'] = 2
    elif row['compound'] < 0:
        merged_df.at[index, 'score'] = 0
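
The same mapping can be written without a Python loop. The sketch below, on made-up compound values, uses `numpy.select`; it does not depend on index labels, which matters here because the index contains repeated labels after `explode`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"compound": [0.9117, 0.0, -0.4215]})

conditions = [toy["compound"] > 0, toy["compound"] < 0]
choices = [2, 0]                                   # 2 = good, 0 = bad
toy["score"] = np.select(conditions, choices, default=1)   # 1 = neutral
print(toy)
```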


In [53]:
merged_df

Unnamed: 0,neg,neu,pos,compound,user_id,user_url,posted,item_id,helpful,recommend,review,year_posted,score
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted November 5, 2011.",1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011,2
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted July 15, 2011.",22200,No ratings yet,True,It's unique and worth a playthrough.,2011,2
0,0.000,0.704,0.296,0.9117,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"Posted April 21, 2011.",43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,2011,2
1,0.081,0.541,0.378,0.7713,js41637,http://steamcommunity.com/id/js41637,"Posted June 24, 2014.",251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,2014,2
1,0.081,0.541,0.378,0.7713,js41637,http://steamcommunity.com/id/js41637,"Posted September 8, 2013.",227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,2013,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25797,0.026,0.692,0.282,0.9786,76561198312638244,http://steamcommunity.com/profiles/76561198312...,Posted July 10.,70,No ratings yet,True,a must have classic from steam definitely wort...,2014.0,2
25797,0.026,0.692,0.282,0.9786,76561198312638244,http://steamcommunity.com/profiles/76561198312...,Posted July 8.,362890,No ratings yet,True,this game is a perfect remake of the original ...,2014.0,2
25798,0.000,0.203,0.797,0.8349,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,Posted July 3.,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,2014.0,2
25798,0.000,0.203,0.797,0.8349,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,Posted July 20.,730,No ratings yet,True,:D,2014.0,2


We verify that it worked correctly by listing the rows scored as neutral.

In [54]:
score = merged_df.loc[merged_df['score'] == 1]
score

Unnamed: 0,neg,neu,pos,compound,user_id,user_url,posted,item_id,helpful,recommend,review,year_posted,score
6,0.0,1.0,0.0,0.0,76561198079601835,http://steamcommunity.com/profiles/76561198079...,Posted May 20.,730,0 of 1 people (0%) found this review helpful,True,ZIKA DO BAILE,2014.0,1
8,0.0,1.0,0.0,0.0,76561198089393905,http://steamcommunity.com/profiles/76561198089...,"Posted February 1, 2015.",72850,3 of 3 people (100%) found this review helpful,True,"Killed the Emperor, nobody cared and got away ...",2015,1
8,0.0,1.0,0.0,0.0,76561198089393905,http://steamcommunity.com/profiles/76561198089...,"Posted June 20, 2014.",440,3 of 3 people (100%) found this review helpful,True,10/10 would eat your money for hats and keys,2014,1
10,0.0,1.0,0.0,0.0,76561198077246154,http://steamcommunity.com/profiles/76561198077...,Posted June 11.,440,No ratings yet,True,mt bom,2014.0,1
10,0.0,1.0,0.0,0.0,76561198077246154,http://steamcommunity.com/profiles/76561198077...,"Posted August 25, 2014.",304930,No ratings yet,True,É muito bom,2014,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25777,0.0,1.0,0.0,0.0,3214213216,http://steamcommunity.com/id/3214213216,Posted February 7.,240,0 of 2 people (0%) found this review helpful,True,잼꾸르잼,2014.0,1
25784,0.0,1.0,0.0,0.0,76561198272389051,http://steamcommunity.com/profiles/76561198272...,Posted June 17.,440,1 of 1 people (100%) found this review helpful,True,Hi I don't know what this is,2014.0,1
25787,0.0,1.0,0.0,0.0,943525,http://steamcommunity.com/id/943525,Posted March 5.,298110,No ratings yet,False,"uplay, everytime",2014.0,1
25791,0.0,1.0,0.0,0.0,sexyawp,http://steamcommunity.com/id/sexyawp,Posted April 25.,427730,1 of 2 people (50%) found this review helpful,True,dont ask,2014.0,1


We drop the columns that are no longer needed: neg, neu, pos, compound, review, posted, user_url and helpful.

In [55]:
eliminar = ['neg','neu','pos','compound','review','posted', 'user_url', 'helpful']
merged_df = merged_df.drop(columns=eliminar)

We check how the DataFrame looks.

In [56]:
merged_df.head()

Unnamed: 0,user_id,item_id,recommend,year_posted,score
0,76561197970982479,1250,True,2011,2
0,76561197970982479,22200,True,2011,2
0,76561197970982479,43110,True,2011,2
1,js41637,251610,True,2014,2
1,js41637,227300,True,2013,2


We recode the recommend column to simplify the analysis, where  
False = 0  
True = 1

Before the change, we count how many True and False values there are.

In [None]:
merged_df['recommend'].value_counts()

recommend
True     52473
False     6832
Name: count, dtype: int64

We apply the change.

In [58]:
merged_df['recommend'] = merged_df['recommend'].replace({False: 0, True: 1})
merged_df

Unnamed: 0,user_id,item_id,recommend,year_posted,score
0,76561197970982479,1250,1,2011,2
0,76561197970982479,22200,1,2011,2
0,76561197970982479,43110,1,2011,2
1,js41637,251610,1,2014,2
1,js41637,227300,1,2013,2
...,...,...,...,...,...
25797,76561198312638244,70,1,2014.0,2
25797,76561198312638244,362890,1,2014.0,2
25798,LydiaMorley,273110,1,2014.0,2
25798,LydiaMorley,730,1,2014.0,2


We verify that the number of True and False values has not changed.

In [None]:
merged_df['recommend'].value_counts()

recommend
1    52473
0     6832
Name: count, dtype: int64
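
An alternative to `replace` is a direct cast with `astype(int)`, which maps True to 1 and False to 0. In recent pandas versions `replace` on an object column may emit a downcasting warning, which the cast avoids. A sketch on made-up values:

```python
import pandas as pd

# Object-typed booleans, as they come out of the parsed dictionaries
recommend = pd.Series([True, True, False, True], dtype=object)

as_int = recommend.astype(int)                 # True -> 1, False -> 0
counts = as_int.value_counts().to_dict()
print(counts)
```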

Now that the DataFrame is clean, we convert the data types of the columns that need it.

In [None]:
merged_df['user_id'] = merged_df['user_id'].astype(str)
merged_df['year_posted'] = merged_df['year_posted'].astype(int)

We save the cleaned dataset to a CSV file.

In [None]:
merged_df.to_csv('users_reviews_cleaned.csv', index = False, encoding='utf-8')
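
CSV files do not store data types, so columns that look numeric, such as item_id, are inferred as integers when the file is read back unless `dtype` is given. The sketch below uses made-up rows and an in-memory buffer instead of a file to show a round trip that keeps the identifiers as strings.

```python
import io
import pandas as pd

clean = pd.DataFrame({
    "user_id": ["76561197970982479", "js41637"],
    "item_id": ["1250", "251610"],
    "recommend": [1, 1],
    "year_posted": [2011, 2014],
    "score": [2, 2],
})

buffer = io.StringIO()                 # stands in for the CSV file on disk
clean.to_csv(buffer, index=False)
buffer.seek(0)

# Ask for the identifier columns as strings when reading back
restored = pd.read_csv(buffer, dtype={"user_id": str, "item_id": str})
print(restored.dtypes)
```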

This is the end of this ETL. Please click [here](02_EDA.ipynb) to continue with the EDA.