# Video Game Sales

`Matheus Raz (mrol@cin.ufpe.br)`

`João Paulo Lins (jplo@cin.ufpe.br)`

## Preprocessing (step 1)

In [136]:
from IPython.display import display

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

In [80]:
df = pd.read_csv('vgsales.csv')
df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
Rank            16598 non-null int64
Name            16598 non-null object
Platform        16598 non-null object
Year            16327 non-null float64
Genre           16598 non-null object
Publisher       16540 non-null object
NA_Sales        16598 non-null float64
EU_Sales        16598 non-null float64
JP_Sales        16598 non-null float64
Other_Sales     16598 non-null float64
Global_Sales    16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [82]:
df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
5,6,Tetris,GB,1989.0,Puzzle,Nintendo,23.20,2.26,4.22,0.58,30.26
6,7,New Super Mario Bros.,DS,2006.0,Platform,Nintendo,11.38,9.23,6.50,2.90,30.01
7,8,Wii Play,Wii,2006.0,Misc,Nintendo,14.03,9.20,2.93,2.85,29.02
8,9,New Super Mario Bros. Wii,Wii,2009.0,Platform,Nintendo,14.59,7.06,4.70,2.26,28.62
9,10,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31


In [83]:
minor = df['Year'].min()
major = df['Year'].max()
print("Earliest year in the dataset: %d"%(minor))
display(df[df['Year'] == minor])
print("Latest year in the dataset: %d"%(major))
display(df[df['Year'] == major])

Earliest year in the dataset: 1980


Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
258,259,Asteroids,2600,1980.0,Shooter,Atari,4.0,0.26,0.0,0.05,4.31
544,545,Missile Command,2600,1980.0,Shooter,Atari,2.56,0.17,0.0,0.03,2.76
1766,1768,Kaboom!,2600,1980.0,Misc,Activision,1.07,0.07,0.0,0.01,1.15
1969,1971,Defender,2600,1980.0,Misc,Atari,0.99,0.05,0.0,0.01,1.05
2669,2671,Boxing,2600,1980.0,Fighting,Activision,0.72,0.04,0.0,0.01,0.77
4025,4027,Ice Hockey,2600,1980.0,Sports,Activision,0.46,0.03,0.0,0.01,0.49
5366,5368,Freeway,2600,1980.0,Action,Activision,0.32,0.02,0.0,0.0,0.34
6317,6319,Bridge,2600,1980.0,Misc,Activision,0.25,0.02,0.0,0.0,0.27
6896,6898,Checkers,2600,1980.0,Misc,Atari,0.22,0.01,0.0,0.0,0.24


Latest year in the dataset: 2020


Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
5957,5959,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.27,0.0,0.0,0.02,0.29


In [84]:
df.drop(labels=5957,inplace=True)
display(df[df['Year'] == major])

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales


The game `Imagine: Makeup Artist` has sales recorded in several regions even though its listed release year, 2020, is later than every other entry in the dataset. We first suspected pre-orders, but after some research we found that the game was released in 2009 and had no release planned for 2020. The year is therefore most likely a noisy value, which also makes us doubt the other values in that row, so we chose to delete the row from our dataset.

In [85]:
def verify(x):
    # Report whether x can be converted to an integer
    try:
        int(x)
        print("It worked!!")
    except (TypeError, ValueError, OverflowError):
        print("Value", x, "is not a number!!")

df.columns

Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')

In [86]:
# df['Year'].apply(verify)
qtdAnosNan = df['Year'].isna().sum()
display("Number of \"NaN\" values in the Year column: %d"%(qtdAnosNan))
df.dropna(inplace=True)  # Drops rows with a NaN in any column, not only Year
display("Number after dropping these \"NaN\" values: %d"%(df['Year'].isna().sum()))
df['Year'] = df['Year'].astype(int)  # Convert the Year column to integer
display(df.head())

'Number of "NaN" values in the Year column: 271'

'Number after dropping these "NaN" values: 0'

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


### Handling "NaN" values

We checked every column of the dataframe for `NaN` values. The `Year` column (the game's release year) had 271 of them, and the `df.info()` output above shows that `Publisher` also has 58 missing entries. Because `dropna()` removes every row with a missing value in any column, both kinds of rows were dropped. We chose to drop them because they are few compared with the size of the dataframe (about 16,600 rows).
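A per-column count is a quick way to run this kind of check. The sketch below uses a small made-up frame (the rows are illustrative and not taken from `vgsales.csv`) to show the count and the difference between dropping on any column and dropping on `Year` only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for vgsales.csv (values are made up for illustration)
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Year': [2006.0, np.nan, 1985.0, 2010.0],
    'Publisher': ['X', 'Y', None, 'Z'],
})

# Number of missing values in each column
nan_per_column = toy.isna().sum()
print(nan_per_column)

# dropna() with no arguments removes rows with a NaN in ANY column,
# while subset=['Year'] only considers the Year column
dropped_any = toy.dropna()
dropped_year_only = toy.dropna(subset=['Year'])
print(len(dropped_any), len(dropped_year_only))
```

If only the rows without a release year should go, `subset=['Year']` keeps the rows whose sole gap is in another column.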

In [87]:
print("number of distinct values for platform: {}".format(df['Platform'].nunique()))
print("number of distinct values for genre: {}".format(df['Genre'].nunique()))
print("number of distinct values for publisher: {}".format(df['Publisher'].nunique()))

number of distinct values for platform: 31
number of distinct values for genre: 12
number of distinct values for publisher: 576


## Normalization

We normalized each sales column by dividing its values by that column's maximum (`value/maxValue`).
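As a minimal sketch of this `value/maxValue` scaling, the example below applies it to a toy frame. The numbers are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy sales figures (made up); each column is scaled by its own maximum
toy = pd.DataFrame({
    'NA_Sales': [4.0, 2.0, 1.0],
    'EU_Sales': [0.5, 2.0, 1.0],
})

# toy.max() is a Series of per-column maxima; the division aligns on column names
normalized = toy / toy.max()
print(normalized)
```

After this scaling, the largest value in each column is 1 and every other value is a fraction of it.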

In [88]:
max_NA_Sales = df['NA_Sales'].max()
print(max_NA_Sales)
max_EU_Sales = df['EU_Sales'].max()
print(max_EU_Sales)
max_JP_Sales = df['JP_Sales'].max()
print(max_JP_Sales)
max_Other_Sales = df['Other_Sales'].max()
print(max_Other_Sales)
max_Global_Sales = df['Global_Sales'].max()
print(max_Global_Sales)

41.49
29.02
10.22
10.57
82.74


In [89]:
# Divide each sales column (NA, EU, JP, Other, Global) by its own maximum
sales_columns = df.columns[6:]
df[sales_columns] = df[sales_columns] / df[sales_columns].max()
df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006,Sports,Nintendo,1.000000,1.000000,0.368885,0.800378,1.000000
1,2,Super Mario Bros.,NES,1985,Platform,Nintendo,0.700892,0.123363,0.666341,0.072848,0.486343
2,3,Mario Kart Wii,Wii,2008,Racing,Nintendo,0.382020,0.443832,0.370841,0.313150,0.432922
3,4,Wii Sports Resort,Wii,2009,Sports,Nintendo,0.379610,0.379394,0.320939,0.280038,0.398840
4,5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,0.271632,0.306340,1.000000,0.094607,0.379139
5,6,Tetris,GB,1989,Puzzle,Nintendo,0.559171,0.077877,0.412916,0.054872,0.365724
6,7,New Super Mario Bros.,DS,2006,Platform,Nintendo,0.274283,0.318057,0.636008,0.274361,0.362702
7,8,Wii Play,Wii,2006,Misc,Nintendo,0.338154,0.317023,0.286693,0.269631,0.350737
8,9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,0.351651,0.243280,0.459883,0.213813,0.345903
9,10,Duck Hunt,NES,1984,Shooter,Nintendo,0.649072,0.021709,0.027397,0.044465,0.342156


In [90]:
# Build a one-hot byte string for every category of a column
def changeCategoric(column):
    # Categories ordered by frequency (most common first)
    binarys_reference = df[column].value_counts().index
    n = len(binarys_reference)
    # '1' at the category's position, '0' everywhere else
    values = [('0' * i + '1' + '0' * (n - i - 1)).encode('ascii') for i in range(n)]
    return values,binarys_reference

convert_platform = changeCategoric('Platform')
print("CONVERTED PLATFORM\n")
display(convert_platform[0])
display(convert_platform[1])
print("CONVERTED GENRE\n")
convert_genre = changeCategoric('Genre')
display(convert_genre[0])
display(convert_genre[1])

CONVERTED PLATFORM

[b'1000000000000000000000000000000',
 b'0100000000000000000000000000000',
 b'0010000000000000000000000000000',
 b'0001000000000000000000000000000',
 b'0000100000000000000000000000000',
 b'0000010000000000000000000000000',
 b'0000001000000000000000000000000',
 b'0000000100000000000000000000000',
 b'0000000010000000000000000000000',
 b'0000000001000000000000000000000',
 b'0000000000100000000000000000000',
 b'0000000000010000000000000000000',
 b'0000000000001000000000000000000',
 b'0000000000000100000000000000000',
 b'0000000000000010000000000000000',
 b'0000000000000001000000000000000',
 b'0000000000000000100000000000000',
 b'0000000000000000010000000000000',
 b'0000000000000000001000000000000',
 b'0000000000000000000100000000000',
 b'0000000000000000000010000000000',
 b'0000000000000000000001000000000',
 b'0000000000000000000000100000000',
 b'0000000000000000000000010000000',
 b'0000000000000000000000001000000',
 b'0000000000000000000000000100000',
 b'0000000000000000000000000010000',
 b'0000000000000000000000000001000',
 b'0000000000000000000000000000100',
 b'0000000000000000000000000000010',
 b'0000000000000000000000000000001']

Index(['DS', 'PS2', 'PS3', 'Wii', 'X360', 'PSP', 'PS', 'PC', 'XB', 'GBA', 'GC',
       '3DS', 'PSV', 'PS4', 'N64', 'SNES', 'XOne', 'SAT', 'WiiU', '2600',
       'NES', 'GB', 'DC', 'GEN', 'NG', 'SCD', 'WS', '3DO', 'TG16', 'GG',
       'PCFX'],
      dtype='object')

CONVERTED GENRE

[b'100000000000',
 b'010000000000',
 b'001000000000',
 b'000100000000',
 b'000010000000',
 b'000001000000',
 b'000000100000',
 b'000000010000',
 b'000000001000',
 b'000000000100',
 b'000000000010',
 b'000000000001']

Index(['Action', 'Sports', 'Misc', 'Role-Playing', 'Shooter', 'Adventure',
       'Racing', 'Platform', 'Simulation', 'Fighting', 'Strategy', 'Puzzle'],
      dtype='object')
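The conversion above stores each category as a single one-hot byte string. As an alternative sketch, not the method used in this notebook, pandas' `get_dummies` builds one indicator column per category instead. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy genre column (made up); the real dataset has 12 genres
toy = pd.DataFrame({'Genre': ['Action', 'Sports', 'Action', 'Puzzle']})

# One indicator column per distinct genre, named Genre_<category>
one_hot = pd.get_dummies(toy['Genre'], prefix='Genre')
print(one_hot)
```

Separate indicator columns can be fed directly to scikit-learn models, whereas the byte strings would need to be unpacked first.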

In [91]:
def changePlatform(x):
    # Look up x in the platform index; raises KeyError for an unknown platform
    # instead of silently returning the last encoding
    return convert_platform[0][convert_platform[1].get_loc(x)]

def changeGenre(x):
    # Same lookup for genres
    return convert_genre[0][convert_genre[1].get_loc(x)]

In [92]:
df['Platform'] = df['Platform'].apply(changePlatform)
df['Genre'] = df['Genre'].apply(changeGenre)

In [93]:
df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,b'0001000000000000000000000000000',2006,b'010000000000',Nintendo,1.000000,1.000000,0.368885,0.800378,1.000000
1,2,Super Mario Bros.,b'0000000000000000000010000000000',1985,b'000000010000',Nintendo,0.700892,0.123363,0.666341,0.072848,0.486343
2,3,Mario Kart Wii,b'0001000000000000000000000000000',2008,b'000000100000',Nintendo,0.382020,0.443832,0.370841,0.313150,0.432922
3,4,Wii Sports Resort,b'0001000000000000000000000000000',2009,b'010000000000',Nintendo,0.379610,0.379394,0.320939,0.280038,0.398840
4,5,Pokemon Red/Pokemon Blue,b'0000000000000000000001000000000',1996,b'000100000000',Nintendo,0.271632,0.306340,1.000000,0.094607,0.379139
5,6,Tetris,b'0000000000000000000001000000000',1989,b'000000000001',Nintendo,0.559171,0.077877,0.412916,0.054872,0.365724
6,7,New Super Mario Bros.,b'1000000000000000000000000000000',2006,b'000000010000',Nintendo,0.274283,0.318057,0.636008,0.274361,0.362702
7,8,Wii Play,b'0001000000000000000000000000000',2006,b'001000000000',Nintendo,0.338154,0.317023,0.286693,0.269631,0.350737
8,9,New Super Mario Bros. Wii,b'0001000000000000000000000000000',2009,b'000000010000',Nintendo,0.351651,0.243280,0.459883,0.213813,0.345903
9,10,Duck Hunt,b'0000000000000000000010000000000',1984,b'000010000000',Nintendo,0.649072,0.021709,0.027397,0.044465,0.342156


In [94]:
print('max: {}'.format(df['Year'].max()))
print('min: {}'.format(df['Year'].min()))

max: 2017
min: 1980


In [95]:
# print(convert_genre[0])
# print(convert_genre[1])
df[(df['Publisher'] == 'Electronic Arts') & (df['Genre'] == convert_genre[0][1])]

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
77,78,FIFA 16,b'0000000000000100000000000000000',2015,b'010000000000',Electronic Arts,0.026753,0.208822,0.005871,0.119205,0.102611
112,113,FIFA 14,b'0010000000000000000000000000000',2013,b'010000000000',Electronic Arts,0.018800,0.148863,0.006849,0.163671,0.083394
121,122,FIFA 12,b'0010000000000000000000000000000',2011,b'010000000000',Electronic Arts,0.020246,0.148863,0.010763,0.134342,0.080856
124,125,FIFA 15,b'0000000000000100000000000000000',2014,b'010000000000',Electronic Arts,0.019041,0.147829,0.004892,0.139073,0.079647
199,200,FIFA Soccer 11,b'0010000000000000000000000000000',2010,b'010000000000',Electronic Arts,0.014461,0.113370,0.005871,0.106906,0.061397
211,212,Madden NFL 06,b'0100000000000000000000000000000',2005,b'010000000000',Electronic Arts,0.095927,0.008959,0.000978,0.062441,0.059343
219,220,FIFA 15,b'0010000000000000000000000000000',2014,b'010000000000',Electronic Arts,0.013738,0.108201,0.003914,0.101230,0.058255
221,222,FIFA 17,b'0000000000000100000000000000000',2016,b'010000000000',Electronic Arts,0.006749,0.129221,0.005871,0.065279,0.057650
238,239,Madden NFL 2005,b'0100000000000000000000000000000',2004,b'010000000000',Electronic Arts,0.100747,0.008959,0.000978,0.007569,0.054750
240,241,Madden NFL 07,b'0100000000000000000000000000000',2006,b'010000000000',Electronic Arts,0.087491,0.008270,0.000978,0.057711,0.054266


## Learning (step 2)

In [137]:
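# Restrict to numeric columns; recent pandas versions raise an error on object columns
df.select_dtypes(include='number').corr()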

Unnamed: 0,Rank,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Rank,1.0,0.178128,-0.400318,-0.37915,-0.269333,-0.332742,-0.426983
Year,0.178128,1.0,-0.091301,0.006151,-0.169379,0.041157,-0.074637
NA_Sales,-0.400318,-0.091301,1.0,0.768925,0.451284,0.634519,0.94127
EU_Sales,-0.37915,0.006151,0.768925,1.0,0.436377,0.726256,0.903264
JP_Sales,-0.269333,-0.169379,0.451284,0.436377,1.0,0.290558,0.612773
Other_Sales,-0.332742,0.041157,0.634519,0.726256,0.290558,1.0,0.747964
Global_Sales,-0.426983,-0.074637,0.94127,0.903264,0.612773,0.747964,1.0


In [145]:
df_RNA = df[['NA_Sales','EU_Sales']].copy()
classes_RNA = df['Other_Sales'].copy()
display(df_RNA)
classes_RNA

Unnamed: 0,NA_Sales,EU_Sales
0,1.000000,1.000000
1,0.700892,0.123363
2,0.382020,0.443832
3,0.379610,0.379394
4,0.271632,0.306340
5,0.559171,0.077877
6,0.274283,0.318057
7,0.338154,0.317023
8,0.351651,0.243280
9,0.649072,0.021709


0        0.800378
1        0.072848
2        0.313150
3        0.280038
4        0.094607
5        0.054872
6        0.274361
7        0.269631
8        0.213813
9        0.044465
10       0.260170
11       0.181646
12       0.067171
13       0.203406
14       0.169347
15       0.157994
16       0.391675
17       1.000000
18       0.052034
19       0.193945
20       0.129612
21       0.039735
22       0.043519
23       0.130558
24       0.168401
25       0.047304
26       0.077578
27       0.111637
28       0.109745
29       0.124882
           ...   
16568    0.000000
16569    0.000000
16570    0.000000
16571    0.000000
16572    0.000000
16573    0.000000
16574    0.000000
16575    0.000000
16576    0.000000
16577    0.000000
16578    0.000000
16579    0.000000
16580    0.000000
16581    0.000000
16582    0.000000
16583    0.000000
16584    0.000000
16585    0.000000
16586    0.000000
16587    0.000000
16588    0.000000
16589    0.000000
16590    0.000000
16591    0.000000
16592    0

In [146]:
x_treino, x_teste, y_treino, y_teste = train_test_split(df_RNA,classes_RNA,test_size = 0.3)

# 1. Create the model
model = LinearRegression()
# 2. Fit it on the training split
model.fit(x_treino, y_treino)
# 3. Score it on the test split (score() returns the coefficient of determination, R²)
score_regressao = model.score(x_teste, y_teste)
print("R² of the linear regression, as a percentage: {}%".format(score_regressao*100))

R² of the linear regression, as a percentage: 77.2092165010196%


### Hypothesis:

#### Is it possible to predict sales in other regions of the world from sales in North America, Europe, and Japan?

Sales in North America and Europe correlate fairly strongly with sales in the rest of the world (about 0.63 and 0.73 in the correlation table above), so they are relevant predictors. Japan's sales correlate much more weakly with `Other_Sales` (about 0.29) and were not used as a feature. The linear regression reached an R² of roughly 0.77 on the test split, meaning the model explains about 77% of the variance in `Other_Sales`. This is a goodness-of-fit measure, not a percentage of correct predictions. The result supports the hypothesis that a game's sales in other regions can be predicted from its sales in North America and Europe.
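The value returned by `LinearRegression.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand in plain Python; the helper name `r_squared` and the sample numbers are illustrative and not part of the notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions, only to exercise the formula
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                   # predictions match exactly
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
print(perfect, baseline)
```

A perfect fit gives 1, and a model no better than predicting the mean gives 0, so a score near 0.77 sits between those two reference points.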