# Big Data Final Project - Steam Reviews 2021

Project for the Big Data course on 2021 Steam reviews. It combines an exploratory analysis with a machine learning model that predicts, from a handful of features, whether a game will receive a positive or a negative review.

**Dataset**: https://www.kaggle.com/datasets/najzeko/steam-reviews-2021

**Group**: Guilherme Lunetta, Rafael Monteiro, and João Vitor Magalhães

In [1]:
import dask
from dask.distributed import Client
import dask.dataframe as dd
import dask.multiprocessing

### Dask client

In [2]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')

In [3]:
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 4
Total threads: 8
Total memory: 7.45 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:59297
Workers: 4
Dashboard: http://127.0.0.1:8787/status
Total threads: 8
Started: Just now
Total memory: 7.45 GiB

Comm: tcp://127.0.0.1:59331
Total threads: 2
Dashboard: http://127.0.0.1:59332/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:59302
Local directory: C:\Users\usuario\AppData\Local\Temp\dask-worker-space\worker-i0btmm6y

Comm: tcp://127.0.0.1:59334
Total threads: 2
Dashboard: http://127.0.0.1:59335/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:59303
Local directory: C:\Users\usuario\AppData\Local\Temp\dask-worker-space\worker-momswm5e

Comm: tcp://127.0.0.1:59328
Total threads: 2
Dashboard: http://127.0.0.1:59329/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:59300
Local directory: C:\Users\usuario\AppData\Local\Temp\dask-worker-space\worker-16oj55gb

Comm: tcp://127.0.0.1:59337
Total threads: 2
Dashboard: http://127.0.0.1:59338/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:59301
Local directory: C:\Users\usuario\AppData\Local\Temp\dask-worker-space\worker-w3am_9yq
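
The memory figures in the client summary look different from the `memory_limit='2GB'` passed to `Client`, but they appear to be the same quantity in different units: the limit is given in decimal gigabytes, while the summary reports binary gibibytes. The short sketch below redoes that arithmetic in plain Python; the constant names `GB` and `GiB` are introduced here only for illustration.

```python
GB = 10**9    # decimal gigabyte, the unit used in memory_limit='2GB'
GiB = 2**30   # binary gibibyte, the unit shown in the client summary

n_workers = 4
threads_per_worker = 2
limit_per_worker = 2 * GB

per_worker_gib = limit_per_worker / GiB
total_gib = n_workers * limit_per_worker / GiB
total_threads = n_workers * threads_per_worker

print(f"{per_worker_gib:.2f} GiB per worker")
print(f"{total_gib:.2f} GiB in total")
print(f"{total_threads} threads in total")
```

The results match the 1.86 GiB per worker, 7.45 GiB total, and 8 threads reported by the client.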


### Opening the dataset and first steps

In [23]:
# We had to set dtype explicitly for every column because Dask was inferring the column types very poorly,
# which made the analysis much harder. We also had to read all integer and float columns as strings.

cols = ['Unnamed: 0', 'app_id', 'app_name', 'review_id', 'language', 'timestamp_created', 'timestamp_updated', 'recommended',
        'votes_helpful', 'votes_funny', 'weighted_vote_score', 'comment_count', 'steam_purchase', 'received_for_free',
        'written_during_early_access', 'author.steamid', 'author.num_games_owned', 'author.num_reviews', 'author.playtime_forever',
        'author.playtime_last_two_weeks', 'author.playtime_at_review', 'author.last_played']

reviews = dd.read_csv('steam_reviews.csv',
                      usecols=cols,
                      sep=',',
                      encoding='UTF-8',
                      engine='python',
                      on_bad_lines='skip',
                      dtype={'Unnamed: 0': 'str',
                              'app_id': 'str',
                              'app_name': 'str',
                              'review_id': 'str',
                              'language': 'str',
                              'review': 'str',
                              'timestamp_created': 'str',
                              'timestamp_updated': 'str',
                              'recommended': 'str',
                              'votes_helpful': 'str',
                              'votes_funny': 'str',
                              'weighted_vote_score': 'str',
                              'comment_count': 'str',
                              'steam_purchase': 'str',
                              'received_for_free': 'str',
                              'written_during_early_access': 'str',
                              'author.steamid': 'str',
                              'author.num_games_owned': 'str',
                              'author.num_reviews': 'str',
                              'author.playtime_forever': 'str',
                              'author.playtime_last_two_weeks': 'str',
                              'author.playtime_at_review': 'str',
                              'author.last_played': 'str'})

In [24]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,app_id,app_name,review_id,language,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
0,0,292030,The Witcher 3: Wild Hunt,85185598,schinese,1611381629,1611381629,True,0,0,...,True,False,False,76561199095369542,6,2,1909.0,1448.0,1909.0,1611343383.0
1,1,292030,The Witcher 3: Wild Hunt,85185250,schinese,1611381030,1611381030,True,0,0,...,True,False,False,76561198949504115,30,10,2764.0,2743.0,2674.0,1611386307.0
2,2,292030,The Witcher 3: Wild Hunt,85185111,schinese,1611380800,1611380800,True,0,0,...,True,False,False,76561199090098988,5,1,1061.0,1061.0,1060.0,1611383777.0
3,3,292030,The Witcher 3: Wild Hunt,85184605,english,1611379970,1611379970,True,0,0,...,True,False,False,76561199054755373,5,3,5587.0,3200.0,5524.0,1611383744.0
4,4,292030,The Witcher 3: Wild Hunt,85184287,schinese,1611379427,1611379427,True,0,0,...,True,False,False,76561199028326951,7,4,217.0,42.0,217.0,1610788249.0
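
Because every column is read as a string, the values need to be converted before they can be used numerically. The sketch below shows the idea on a single record in plain Python, using values copied from the first row shown above. The helper `parse_row` is hypothetical, and treating the timestamps as Unix epoch seconds in UTC is an assumption based on their magnitude. On the Dask dataframe the same conversions would be applied column-wise, for example with `astype`.

```python
from datetime import datetime, timezone

# Values copied from the first row of the output above; every field arrives as a string.
raw_row = {
    "recommended": "True",
    "votes_helpful": "0",
    "author.playtime_forever": "1909.0",
    "timestamp_created": "1611381629",
}

def parse_row(row):
    """Hypothetical helper: convert the string fields of one review into typed values."""
    return {
        "recommended": row["recommended"] == "True",
        "votes_helpful": int(row["votes_helpful"]),
        "author.playtime_forever": float(row["author.playtime_forever"]),
        # Assumption: timestamps are Unix epoch seconds, interpreted here as UTC.
        "timestamp_created": datetime.fromtimestamp(
            int(row["timestamp_created"]), tz=timezone.utc
        ),
    }

typed = parse_row(raw_row)
print(typed["recommended"], typed["votes_helpful"], typed["author.playtime_forever"])
print(typed["timestamp_created"].isoformat())
```

Under that assumption the first review was created on 2021-01-23, which is consistent with a dataset collected in early 2021.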


In [25]:
linhas = len(reviews)

print(f'The dataset has {linhas} rows!')

The dataset has 21756295 rows!


In [12]:
# Rename the first column (the original row index) to "index"

reviews = reviews.rename(columns={reviews.columns[0]: "index"})

In [13]:
reviews.head()

Unnamed: 0,index,app_id,app_name,review_id,language,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
0,0,292030,The Witcher 3: Wild Hunt,85185598,schinese,1611381629,1611381629,True,0,0,...,True,False,False,76561199095369542,6,2,1909.0,1448.0,1909.0,1611343383.0
1,1,292030,The Witcher 3: Wild Hunt,85185250,schinese,1611381030,1611381030,True,0,0,...,True,False,False,76561198949504115,30,10,2764.0,2743.0,2674.0,1611386307.0
2,2,292030,The Witcher 3: Wild Hunt,85185111,schinese,1611380800,1611380800,True,0,0,...,True,False,False,76561199090098988,5,1,1061.0,1061.0,1060.0,1611383777.0
3,3,292030,The Witcher 3: Wild Hunt,85184605,english,1611379970,1611379970,True,0,0,...,True,False,False,76561199054755373,5,3,5587.0,3200.0,5524.0,1611383744.0
4,4,292030,The Witcher 3: Wild Hunt,85184287,schinese,1611379427,1611379427,True,0,0,...,True,False,False,76561199028326951,7,4,217.0,42.0,217.0,1610788249.0


### Data cleaning

Our dataset has a `review` column holding the free text each user wrote about a game. Because that text often contains commas and the .csv file is itself comma-separated, many rows are parsed incorrectly, which leaves a large share of the data "dirty" and impractical to work with.

We tried several ways around this problem but found none that let us keep working with the full dataset. We therefore decided to use **ONLY** the reviews in Simplified Chinese (`schinese`): written Chinese normally uses the full-width comma (，) rather than the ASCII comma (,) that delimits the CSV fields, so these reviews rarely corrupt the parsing and the project could continue.
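
For context, a comma inside a review is not a problem in itself when the field is wrapped in double quotes, which is the usual CSV convention. Line breaks inside a quoted field are a likelier source of trouble with Dask, because it splits the file into blocks at line boundaries. The sketch below uses Python's standard `csv` module on a hypothetical three-column sample to show a quoted field surviving both a comma and a line break. Whether `steam_reviews.csv` is consistently quoted is an assumption that was not verified in this project.

```python
import csv
import io

# Hypothetical sample: the second review contains a comma and a line break,
# both inside a double-quoted field.
raw = (
    'app_id,review,recommended\n'
    '292030,"Great game, loved it",True\n'
    '292030,"Long story,\nbut worth it",False\n'
)

rows = list(csv.DictReader(io.StringIO(raw)))

print(len(rows))              # number of parsed records
print(rows[0]["review"])
print(repr(rows[1]["review"]))
```

If the file is quoted this way, one option that may be worth trying is passing `blocksize=None` to `dd.read_csv`, so that each file is parsed as a single block, at the cost of losing parallelism within the file.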

In [29]:
chinese = reviews[reviews["language"] == "schinese"]
chinese.head()

Unnamed: 0.1,Unnamed: 0,app_id,app_name,review_id,language,timestamp_created,timestamp_updated,recommended,votes_helpful,votes_funny,...,steam_purchase,received_for_free,written_during_early_access,author.steamid,author.num_games_owned,author.num_reviews,author.playtime_forever,author.playtime_last_two_weeks,author.playtime_at_review,author.last_played
0,0,292030,The Witcher 3: Wild Hunt,85185598,schinese,1611381629,1611381629,True,0,0,...,True,False,False,76561199095369542,6,2,1909.0,1448.0,1909.0,1611343383.0
1,1,292030,The Witcher 3: Wild Hunt,85185250,schinese,1611381030,1611381030,True,0,0,...,True,False,False,76561198949504115,30,10,2764.0,2743.0,2674.0,1611386307.0
2,2,292030,The Witcher 3: Wild Hunt,85185111,schinese,1611380800,1611380800,True,0,0,...,True,False,False,76561199090098988,5,1,1061.0,1061.0,1060.0,1611383777.0
4,4,292030,The Witcher 3: Wild Hunt,85184287,schinese,1611379427,1611379427,True,0,0,...,True,False,False,76561199028326951,7,4,217.0,42.0,217.0,1610788249.0
8,8,292030,The Witcher 3: Wild Hunt,85183227,schinese,1611377703,1611377703,True,0,0,...,True,False,False,76561198130808993,581,17,6921.0,222.0,6921.0,1611317275.0


### New dataset

Even so, the filtered dataset is still very large: about 3.67 million rows.

In [30]:
linhas = len(chinese)

print(f"The new dataset has {linhas} rows")

The new dataset has 3670537 rows
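
To put the filter in perspective, the two row counts reported in this notebook can be compared directly. This is a small arithmetic sketch; the variable names are introduced here only for illustration.

```python
total_rows = 21_756_295     # rows in the full dataset
chinese_rows = 3_670_537    # rows where language == "schinese"

share = chinese_rows / total_rows
dropped = total_rows - chinese_rows

print(f"{share:.2%} of the reviews are kept")
print(f"{dropped} rows are dropped")
```

Roughly one review in six is kept, so any conclusion drawn from the filtered data applies to Simplified Chinese reviews only.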


### Exploratory analysis

In [32]:
# Ten most-reviewed games

top_10_games = chinese["app_name"].value_counts().nlargest(10).compute()
top_10_games



KeyboardInterrupt: 