# Summary

A neural network built with the Keras module of the TensorFlow platform is used to predict video game sales in North America, Europe, and Japan from a few attributes of each game.<br>
The dataset is Video Game Sales, from Kaggle.

# Importing resources

In [1]:
import pandas as pd
from tensorflow.keras.layers import Dense, Dropout, Activation, Input # classes for defining layers
from tensorflow.keras.models import Model # neural network class that is more flexible than Sequential
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [2]:
ds = pd.read_csv('/kaggle/input/card-12-games/games.csv')
ds

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,0.00,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,0.00,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,0.00,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,0.00,0.01,,,,,,


# Data preparation

## Dropping irrelevant columns

In [3]:
ds.columns

Index(['Name', 'Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Score',
       'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Rating'],
      dtype='object')

In [4]:
ds = ds.drop('Other_Sales', axis=1) # the goal is to predict only EU_Sales, NA_Sales and JP_Sales
ds = ds.drop('Global_Sales', axis=1) # the goal is to predict only EU_Sales, NA_Sales and JP_Sales
ds = ds.drop('Developer', axis=1) # redundant attribute (with respect to Publisher) with many missing values
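
The three drops can also be written as a single call. A minimal sketch on a made-up frame that reuses the notebook's column names (the values and the `toy` variable are illustrative only):

```python
import pandas as pd

# Toy frame with a few of the notebook's column names (values are made up).
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Codemasters"],
    "NA_Sales": [41.36, 0.0],
    "Other_Sales": [8.45, 0.0],
    "Global_Sales": [82.53, 0.01],
    "Developer": ["Nintendo", None],
})

# One call drops all three columns; `columns=` makes the axis explicit.
toy = toy.drop(columns=["Other_Sales", "Global_Sales", "Developer"])
print(list(toy.columns))
```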

## Handling missing values

In [5]:
ds.isnull().sum()

Name                  2
Platform              0
Year_of_Release     269
Genre                 2
Publisher            54
NA_Sales              0
EU_Sales              0
JP_Sales              0
Critic_Score       8582
Critic_Count       8582
User_Score         6704
User_Count         9129
Rating             6769
dtype: int64

For categorical attributes, we fill the missing values with the mode (note that "Year_of_Release" is treated as categorical here):

In [6]:
catCol = ['Name', 'Year_of_Release', 'Genre', 'Publisher', 'Rating']
for i in catCol:
    mode_val = ds[i].mode()[0]
    ds[i] = ds[i].fillna(mode_val)
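
A small sketch of the same mode-filling loop on a made-up frame, to show what each step returns. The `toy` frame and its values are illustrative and not taken from the dataset. One consequence worth keeping in mind is that every missing entry in a column receives the same value, which is why previously empty "Rating" cells all show the same rating later in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Genre": ["Action", None, "Sports", "Action"],
    "Rating": ["E", "E", None, "T"],
})

for col in ["Genre", "Rating"]:
    # mode() returns a Series of the most frequent value(s); [0] takes the first one
    mode_val = toy[col].mode()[0]
    toy[col] = toy[col].fillna(mode_val)

print(toy)
```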

In [7]:
ds.isnull().sum()

Name                  0
Platform              0
Year_of_Release       0
Genre                 0
Publisher             0
NA_Sales              0
EU_Sales              0
JP_Sales              0
Critic_Score       8582
Critic_Count       8582
User_Score         6704
User_Count         9129
Rating                0
dtype: int64

For numeric attributes, we fill the missing values with the mean.

First, we have to deal with the "User_Score" column, which stores its numbers as strings.<br>
Some of its entries are the placeholder text "tbd" rather than a number, so the column cannot be converted to a numeric type directly.<br>
We replace those entries with missing values:

In [8]:
ds.loc[ds['User_Score'] == 'tbd', 'User_Score'] = None

In [9]:
ds['User_Score'].value_counts()

User_Score
7.8    324
8      290
8.2    282
8.3    254
8.5    253
      ... 
1.5      2
0.3      2
1.1      2
0        1
9.7      1
Name: count, Length: 95, dtype: int64

In [10]:
ds.isnull().sum()

Name                  0
Platform              0
Year_of_Release       0
Genre                 0
Publisher             0
NA_Sales              0
EU_Sales              0
JP_Sales              0
Critic_Score       8582
Critic_Count       8582
User_Score         9129
User_Count         9129
Rating                0
dtype: int64

Now we convert the column's values to float:

In [11]:
ds['User_Score'] = ds['User_Score'].astype(float)
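
As an alternative sketch (not what the notebook does), `pd.to_numeric` with `errors="coerce"` performs the replacement and the conversion in one step: anything that cannot be parsed as a number, including "tbd", becomes NaN. The `scores` series is made up for illustration.

```python
import pandas as pd

# Made-up sample mimicking the column: numeric strings, a placeholder, and a missing entry.
scores = pd.Series(["8", "tbd", "8.3", None])

# Unparseable entries are coerced to NaN instead of raising an error.
numeric = pd.to_numeric(scores, errors="coerce")
print(numeric)
```

The trade-off is that coercion hides every unparseable value, not only "tbd", so it is worth checking `value_counts()` first, as the notebook does.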

Now we replace the missing values in the numeric columns with the column mean:

In [12]:
numCol = ['Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']
for i in numCol:
    ds[i] = ds[i].fillna(ds[i].mean())
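
A short sketch of what the mean fill does, on made-up values. `Series.mean()` skips NaN by default, so the fill value is the average of the observed entries only. In the notebook more than half of the rows are missing these columns, so a large share of rows ends up with identical values in them.

```python
import pandas as pd

toy = pd.DataFrame({"Critic_Score": [76.0, None, 82.0, 80.0, None]})

# NaN entries are ignored: (76 + 82 + 80) / 3
col_mean = toy["Critic_Score"].mean()
toy["Critic_Score"] = toy["Critic_Score"].fillna(col_mean)
print(col_mean)
print(toy)
```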

In [13]:
ds.isnull().sum()

Name               0
Platform           0
Year_of_Release    0
Genre              0
Publisher          0
NA_Sales           0
EU_Sales           0
JP_Sales           0
Critic_Score       0
Critic_Count       0
User_Score         0
User_Count         0
Rating             0
dtype: int64

In [14]:
ds.shape

(16719, 13)

In [15]:
ds['Name'].value_counts()

Name
Need for Speed: Most Wanted                         14
FIFA 14                                              9
Ratatouille                                          9
LEGO Marvel Super Heroes                             9
Madden NFL 07                                        9
                                                    ..
Jewels of the Tropical Lost Island                   1
Sherlock Holmes and the Mystery of Osborne House     1
The King of Fighters '95 (CD)                        1
Megamind: Mega Team Unite                            1
Haitaka no Psychedelica                              1
Name: count, Length: 11562, dtype: int64

The "Name" column has 11,562 distinct values across 16,719 rows, too many to be useful as a categorical feature, so it is dropped.

In [16]:
ds = ds.drop('Name', axis=1)

# Creating the feature and target sets

In [17]:
ds.columns

Index(['Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Critic_Score', 'Critic_Count', 'User_Score',
       'User_Count', 'Rating'],
      dtype='object')

In [18]:
X = ds.iloc[:, [0, 1, 2, 3, 7, 8, 9, 10, 11]].values

In [19]:
y_na = ds.iloc[:, 4].values
y_eu = ds.iloc[:, 5].values
y_jp = ds.iloc[:, 6].values
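
Positional indices such as `[0, 1, 2, 3, 7, 8, 9, 10, 11]` stop matching the intended columns if the column order ever changes. A sketch of an equivalent selection by column name, on a made-up frame; `target_cols` and the `_toy` names are illustrative and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Platform": ["Wii", "NES"],
    "NA_Sales": [41.36, 29.08],
    "EU_Sales": [28.96, 3.58],
    "JP_Sales": [3.77, 6.81],
    "Rating": ["E", "E"],
})

target_cols = ["NA_Sales", "EU_Sales", "JP_Sales"]

# Features: every column that is not a target, in the original order.
X_toy = toy.drop(columns=target_cols).values
# Targets: one array per region.
y_na_toy, y_eu_toy, y_jp_toy = (toy[c].values for c in target_cols)
print(X_toy.shape, y_na_toy)
```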

## Applying one-hot encoding to the non-ordinal categorical attributes

In [20]:
ds.columns

Index(['Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Critic_Score', 'Critic_Count', 'User_Score',
       'User_Count', 'Rating'],
      dtype='object')

In [21]:
cols = [0, 2, 3, 8]
onehotencoder = ColumnTransformer(transformers = [('OneHot', OneHotEncoder(), cols)], remainder = 'passthrough')
X = onehotencoder.fit_transform(X).toarray()

In [22]:
X.shape

(16719, 637)
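
The width of 637 is the total number of distinct categories across the four encoded columns plus the five columns passed through unchanged. The toy example below checks that relationship on made-up rows with only two categorical columns; `toy`, `cat_cols`, and `expected_width` are illustrative names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB"],
    "Year_of_Release": [2006.0, 1985.0, 2008.0, 1996.0],
    "Genre": ["Sports", "Platform", "Racing", "Role-Playing"],
})
cat_cols = ["Platform", "Genre"]

ct = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(), cat_cols)],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# One column per distinct category, plus the untouched columns.
expected_width = sum(toy[c].nunique() for c in cat_cols) + (toy.shape[1] - len(cat_cols))
print(encoded.shape, expected_width)
```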

In [23]:
ds

Unnamed: 0,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating
0,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,76.000000,51.000000,8.000000,322.000000,E
1,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,68.967679,26.360821,7.125046,162.229908,E
2,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,82.000000,73.000000,8.300000,709.000000,E
3,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,80.000000,73.000000,8.000000,192.000000,E
4,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,68.967679,26.360821,7.125046,162.229908,E
...,...,...,...,...,...,...,...,...,...,...,...,...
16714,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,68.967679,26.360821,7.125046,162.229908,E
16715,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,68.967679,26.360821,7.125046,162.229908,E
16716,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,68.967679,26.360821,7.125046,162.229908,E
16717,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,68.967679,26.360821,7.125046,162.229908,E


# Defining the neural network structure

First, we create the layers:

In [24]:
input_layer = Input(shape=(637,))
hidden_layer1 = Dense(units = 320, activation='relu')(input_layer) # (637+3)/2 == 320
hidden_layer2 = Dense(units = 320, activation='relu')(hidden_layer1)

# now we define one output layer per target attribute, all connected to the last hidden layer:
output_layer1 = Dense(units = 1, activation = 'linear')(hidden_layer2)
output_layer2 = Dense(units = 1, activation = 'linear')(hidden_layer2)
output_layer3 = Dense(units = 1, activation = 'linear')(hidden_layer2)
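
The hidden-layer size follows the rule of thumb in the code comment: the average of the number of inputs and outputs. As a back-of-the-envelope check, the arithmetic below also counts the trainable parameters this architecture implies, assuming each `Dense` layer keeps its default bias term. The helper `dense_params` is a hypothetical name introduced here.

```python
n_inputs, n_hidden, n_outputs = 637, 320, 3

# Rule of thumb from the notebook comment: (inputs + outputs) / 2.
assert (n_inputs + n_outputs) // 2 == n_hidden

def dense_params(fan_in, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return fan_in * units + units

hidden1 = dense_params(n_inputs, n_hidden)        # 637*320 + 320
hidden2 = dense_params(n_hidden, n_hidden)        # 320*320 + 320
heads = n_outputs * dense_params(n_hidden, 1)     # three single-unit output layers
total = hidden1 + hidden2 + heads
print(hidden1, hidden2, heads, total)  # 204160 102720 963 307843
```

With roughly 308 thousand parameters and 16,719 training rows, the network has far more parameters than examples.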

Now we create the regressor instance from the layers defined above:

In [25]:
regressor = Model(inputs = input_layer, outputs = [output_layer1, output_layer2, output_layer3])

Now we compile the model:

In [26]:
regressor.compile(optimizer = 'adam', loss = 'mse') # the loss is mean squared error (MSE), which penalizes large errors more heavily than mean absolute error

Now we train the model:

In [27]:
regressor.fit(X, [y_na, y_eu, y_jp], epochs = 500, batch_size = 100)

Epoch 1/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - loss: 1307.8806
Epoch 2/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 81.0566
Epoch 3/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 4.7737
Epoch 4/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 18.5930
Epoch 5/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 19.1213
Epoch 6/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 1.5461
Epoch 7/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 1.5103
Epoch 8/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 3.7719
Epoch 9/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 2.9633
Epoch 10/500
168/168 ━━━━━━━━━━━━━━━━━━━━ 1s
...

<keras.src.callbacks.history.History at 0x7d31247d7670>
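
The step counts printed by Keras follow directly from the number of rows and the batch size: the last batch is partial, so the division is rounded up. The check below reproduces the 168 steps per epoch from `batch_size = 100`, and the 523 steps reported by `predict`, which uses a default batch size of 32 when none is given.

```python
import math

n_rows = 16719  # from ds.shape

steps_fit = math.ceil(n_rows / 100)     # batch_size passed to fit()
steps_predict = math.ceil(n_rows / 32)  # Keras default batch size for predict()
print(steps_fit, steps_predict)  # 168 523
```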

Now we can check the regressor's error on the same data it was trained on. Because no rows were held out, these figures measure fit to the training set, not performance on unseen games:

In [28]:
predictions_na, predictions_eu, predictions_jp = regressor.predict(X)

[1m523/523[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step
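
Since `predict` runs here on the same `X` used for training, the resulting errors say nothing about unseen data. One common way to get such an estimate, sketched below on random stand-in arrays, is to hold out part of the rows with scikit-learn's `train_test_split`, which accepts several arrays and splits them consistently. All `_toy`, `_train`, and `_test` names are hypothetical, and the 25% test size is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for X, y_na, y_eu and y_jp (not the real data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_na_toy = rng.random(100)
y_eu_toy = rng.random(100)
y_jp_toy = rng.random(100)

# All four arrays are split with the same row permutation.
splits = train_test_split(
    X_toy, y_na_toy, y_eu_toy, y_jp_toy, test_size=0.25, random_state=0
)
X_train, X_test, na_train, na_test, eu_train, eu_test, jp_train, jp_test = splits
print(X_train.shape, X_test.shape)
```

The model would then be fitted on `X_train` with `[na_train, eu_train, jp_train]` and evaluated on `X_test`.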


In [29]:
from sklearn.metrics import mean_absolute_error

In [30]:
predictions_na, predictions_na.mean()

(array([[11.134612  ],
        [ 0.6874395 ],
        [10.3420925 ],
        ...,
        [-0.1135484 ],
        [ 0.1231426 ],
        [-0.06312561]], dtype=float32),
 0.25452745)

In [31]:
y_na, y_na.mean()

(array([4.136e+01, 2.908e+01, 1.568e+01, ..., 0.000e+00, 1.000e-02,
        0.000e+00]),
 0.26333034272384714)

In [32]:
mean_absolute_error(y_na, predictions_na)

0.2377292282712952
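
The model returns predictions with shape `(n, 1)` while the targets have shape `(n,)`. `mean_absolute_error` handles this mismatch, but a hand-written version has to flatten the predictions first, otherwise NumPy broadcasting produces an `(n, n)` matrix of differences. A sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([4.0, 0.5, 0.0])                           # shape (3,)
y_pred = np.array([[3.0], [1.0], [0.2]], dtype=np.float32)   # shape (3, 1), like Keras output

mae_sklearn = mean_absolute_error(y_true, y_pred)

# Manual MAE: flatten the predictions so both arrays have shape (3,).
mae_manual = np.abs(y_true - y_pred.ravel()).mean()

# Without ravel(), (3,) minus (3, 1) broadcasts to (3, 3).
broadcast_shape = (y_true - y_pred).shape
print(mae_sklearn, mae_manual, broadcast_shape)
```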

In [33]:
predictions_eu, predictions_eu.mean()

(array([[ 7.8973193 ],
        [ 0.3298332 ],
        [ 7.4216347 ],
        ...,
        [-0.03838557],
        [ 0.08199322],
        [-0.01394254]], dtype=float32),
 0.14252864)

In [34]:
y_eu, y_eu.mean()

(array([28.96,  3.58, 12.76, ...,  0.  ,  0.  ,  0.  ]), 0.14502482205873557)

In [35]:
mean_absolute_error(y_eu, predictions_eu)

0.1503973397092604

In [36]:
predictions_jp, predictions_jp.mean()

(array([[2.7647595 ],
        [0.20193282],
        [2.5426745 ],
        ...,
        [0.07347815],
        [0.07219375],
        [0.07213075]], dtype=float32),
 0.06851211)

In [37]:
y_jp, y_jp.mean()

(array([3.77, 6.81, 3.79, ..., 0.01, 0.  , 0.01]), 0.07760212931395419)

In [38]:
mean_absolute_error(y_jp, predictions_jp)

0.10182372961685528
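
To put these errors in context, they can be compared with a trivial baseline. Assuming sales figures are never negative, a predictor that always outputs 0 has a mean absolute error equal to the mean of the target, so the means printed by the notebook double as baseline errors. The sketch below uses only numbers copied from the notebook outputs; `results` and `beats_zero` are illustrative names.

```python
# Values copied from the notebook outputs (target mean and model MAE per region).
results = {
    "NA": {"mean_y": 0.26333034272384714, "mae_model": 0.2377292282712952},
    "EU": {"mean_y": 0.14502482205873557, "mae_model": 0.1503973397092604},
    "JP": {"mean_y": 0.07760212931395419, "mae_model": 0.10182372961685528},
}

# Assumption: y >= 0 everywhere, so MAE of an all-zeros predictor == mean(y).
ratios = {region: r["mae_model"] / r["mean_y"] for region, r in results.items()}
beats_zero = {region: ratio < 1.0 for region, ratio in ratios.items()}

for region in results:
    print(region, round(ratios[region], 2), beats_zero[region])
```

Under that assumption, the model is only slightly better than the all-zeros baseline for North America and worse than it for Europe and Japan, even though it is evaluated on its own training data.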