**Games Dataset**

*   Predict game sales
*   Regression with multiple outputs



**Initial Imports**




In [0]:
import numpy as np  # used later to build the X and y arrays
import pandas as pd
from keras.layers import Dense, Dropout, Activation, Input
from keras.models import Model

**Reading the dataset and creating the DataFrame**

In [0]:
df = pd.read_csv('games.csv')

**Viewing and analyzing the DataFrame**

In [0]:
df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16714,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,Tecmo Koei,0.00,0.00,0.01,0.00,0.01,,,,,,
16715,LMA Manager 2007,X360,2006.0,Sports,Codemasters,0.00,0.01,0.00,0.00,0.01,,,,,,
16716,Haitaka no Psychedelica,PSV,2016.0,Adventure,Idea Factory,0.00,0.00,0.01,0.00,0.01,,,,,,
16717,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.00,0.00,0.00,0.01,,,,,,


The ***describe*** function shows summary statistics for the numeric columns of the DataFrame:

*   count (number of non-null values)
*   mean
*   std (standard deviation)
*   percentiles (25%, 50%, 75%)
*   maximum values
*   minimum values
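As a minimal sketch of what these statistics look like, the snippet below runs `describe` on a small toy DataFrame. The column names and values are hypothetical and are not taken from `games.csv`; the point is only that `count` ignores NaN entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "sales": [1.0, 2.0, 3.0, 4.0],
    "score": [80.0, np.nan, 60.0, 70.0],
})

# describe() returns one row per statistic and one column per numeric column
resumo = toy.describe()
print(resumo)

# count skips missing values, so "score" reports 3 rather than 4
print(resumo.loc["count", "score"])
```

This is why, in a summary like this, columns with missing values show a smaller `count` than fully populated ones.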









In [0]:
df.describe()

Unnamed: 0,Year_of_Release,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Count
count,16450.0,16719.0,16719.0,16719.0,16719.0,16719.0,8137.0,8137.0,7590.0
mean,2006.487356,0.26333,0.145025,0.077602,0.047332,0.533543,68.967679,26.360821,162.229908
std,5.878995,0.813514,0.503283,0.308818,0.18671,1.547935,13.938165,18.980495,561.282326
min,1980.0,0.0,0.0,0.0,0.0,0.01,13.0,3.0,4.0
25%,2003.0,0.0,0.0,0.0,0.0,0.06,60.0,12.0,10.0
50%,2007.0,0.08,0.02,0.0,0.01,0.17,71.0,21.0,24.0
75%,2010.0,0.24,0.11,0.04,0.03,0.47,79.0,36.0,81.0
max,2020.0,41.36,28.96,10.22,10.57,82.53,98.0,113.0,10665.0


In [0]:
# Keep only the games whose NA_Sales and EU_Sales are both greater than 1
df = df.loc[df['NA_Sales'] > 1]
df = df.loc[df['EU_Sales'] > 1]

**Dropping some columns that will not be used to predict the targets**




In [0]:
# Drop all four columns in a single call
df = df.drop(columns=['Other_Sales', 'Global_Sales', 'Developer', 'Name'])

**Selecting the object-type columns**

*   The object-type columns (categorical variables) can be label encoded, that is, converted to numeric types.

*   Once the categorical columns are label encoded, the remaining null values can be replaced with the column mean. That way we do not lose a large amount of data, as we would if we dropped the rows containing NaN.





In [0]:
df_objeto = df.select_dtypes(include=[object]).columns

In [0]:
df_objeto

Index(['Platform', 'Genre', 'Publisher', 'User_Score', 'Rating'], dtype='object')

**Importing the Label Encoder**

*   Import the function and apply it
*   Loop over the list of object columns (categorical columns), converting them to numeric variables



In [0]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()

In [0]:
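# Encode each categorical column as integer codes.
# The values are cast to str first, so every entry is treated as a text category.
for coluna in df_objeto:
  if coluna in df:
    df[coluna] = label_encoder.fit_transform(df[coluna].astype(str))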
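To make the effect of this loop concrete, here is a small sketch on a toy DataFrame. The toy values and the `mapping` dictionary are hypothetical additions for illustration; `LabelEncoder` assigns codes following the sorted order of the distinct values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Platform": ["Wii", "NES", "Wii", "GB"],
    "Genre": ["Sports", "Platform", "Racing", "Role-Playing"],
})

encoder = LabelEncoder()
mapping = {}  # hypothetical helper: keeps the classes learned for each column

for coluna in toy.columns:
    # fit_transform refits the encoder, so earlier classes are overwritten
    toy[coluna] = encoder.fit_transform(toy[coluna].astype(str))
    mapping[coluna] = list(encoder.classes_)

print(toy)
print(mapping)
```

Because a single encoder is refit for every column, only the classes of the last column remain in `encoder.classes_`. Saving them per column, as the assumed `mapping` dictionary does, is one way to keep the codes interpretable later.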

Viewing the df after the transformation

In [0]:
df

Unnamed: 0,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating
0,15,2006.0,10,15,41.36,28.96,3.77,76.0,51.0,35,322.0,0
1,7,1985.0,4,15,29.08,3.58,6.81,,,49,,4
2,15,2008.0,6,15,15.68,12.76,3.79,82.0,73.0,38,709.0,0
3,15,2009.0,10,15,15.61,10.93,3.28,80.0,73.0,35,192.0,0
4,2,1996.0,7,15,11.27,8.89,10.22,,,49,,4
...,...,...,...,...,...,...,...,...,...,...,...,...
591,15,2007.0,10,15,1.05,1.05,0.24,79.0,47.0,35,124.0,1
603,17,2009.0,0,22,1.04,1.22,0.03,,,49,,4
610,10,2001.0,6,1,1.13,1.12,0.06,80.0,15.0,34,46.0,3
624,9,1998.0,0,6,1.15,1.14,0.06,,,49,,4


Checking how many NaN values the DataFrame has

In [0]:
df.isnull().sum()

Platform            0
Year_of_Release     1
Genre               0
Publisher           0
NA_Sales            0
EU_Sales            0
JP_Sales            0
Critic_Score       95
Critic_Count       95
User_Score          0
User_Count         91
Rating              0
dtype: int64

**Data preprocessing**


*   Handling missing data

*   Splitting the DataFrame into:
    1. Features
    2. venda_na
    3. venda_eu
    4. venda_jp





**Replacing NaN values with the column mean**


In [0]:
# Fill the missing values of each numeric column with that column's mean
colunas_numericas = ['Year_of_Release', 'Critic_Score', 'Critic_Count', 'User_Count']
df[colunas_numericas] = df[colunas_numericas].fillna(df[colunas_numericas].mean())
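The mean imputation can be checked on a toy DataFrame. This is only a sketch with hypothetical values; the column names are borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Critic_Score": [80.0, np.nan, 60.0],
    "User_Count": [10.0, 30.0, np.nan],
})

cols = ["Critic_Score", "User_Count"]

# mean() skips NaN, and fillna() aligns each column with its own mean
toy[cols] = toy[cols].fillna(toy[cols].mean())

print(toy)
print(toy.isnull().sum())
```

Each gap receives the mean of the observed values in the same column, so no rows are dropped.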

**Splitting the DataFrame**

*   features
*   venda_na
*   venda_eu
*   venda_jp





In [0]:
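# Predictors: every column except the three sales targets (positions 4, 5 and 6)
features = df.iloc[:, [0,1,2,3,7,8,9,10,11]].values
# Targets: NA_Sales, EU_Sales and JP_Sales
venda_na = df.iloc[:,4].values
venda_eu = df.iloc[:,5].values
venda_jp = df.iloc[:,6].values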

**Using OneHotEncoder for the categorical attributes, which are currently represented by integer sequences**

In [0]:
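from sklearn.compose import ColumnTransformer
# One-hot encode Platform (0), Genre (2), Publisher (3) and Rating (8) in features;
# the remaining numeric columns are passed through unchanged
onehotencoder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0,2,3,8])],remainder='passthrough')
features = onehotencoder.fit_transform(features).toarray()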
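The following sketch shows how the encoded columns are laid out, using a tiny hypothetical array with two integer-coded categorical columns and one numeric column. It also guards the `toarray()` call, since `ColumnTransformer` may return either a sparse matrix or a dense array depending on how many zeros the result contains.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: [platform_code, year, genre_code]
toy = np.array([
    [0, 2006, 1],
    [1, 1985, 0],
    [0, 2008, 2],
], dtype=float)

ct = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(), [0, 2])],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)

# Convert only when the result is sparse; a dense ndarray has no toarray()
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# Layout: 2 platform columns, 3 genre columns, then the passthrough year
print(encoded.shape)
print(encoded[0])
```

The one-hot columns come first and the passthrough columns are appended at the end, so the width of the result is the sum of the category counts plus the number of untouched columns.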

**Cross-validation**

In [0]:
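from sklearn.model_selection import cross_val_score
# Note: keras.wrappers.scikit_learn is not available in recent Keras/TensorFlow releases;
# the separate SciKeras package (scikeras.wrappers.KerasRegressor) provides a replacement.
from keras.wrappers.scikit_learn import KerasRegressor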

**Neural network structure**


*   Multiple-output regression



In [0]:
def criarRede():
  # Input layer: 71 values per sample (assumed to be the width of the encoded feature matrix)
  camada_entradas = Input(shape=(71,))
  # Two hidden layers shared by all outputs
  camada_oculta1 = Dense(units=32, activation='sigmoid')(camada_entradas)
  camada_oculta2 = Dense(units=2, activation='sigmoid')(camada_oculta1)
  # One linear output per target (presumably NA, EU and JP sales)
  camada_saida1 = Dense(units=1, activation='linear')(camada_oculta2)
  camada_saida2 = Dense(units=1, activation='linear')(camada_oculta2)
  camada_saida3 = Dense(units=1, activation='linear')(camada_oculta2)

  regressor = Model(inputs=camada_entradas,
                    outputs=[camada_saida1, camada_saida2, camada_saida3])

  # Mean squared error is applied to each of the three outputs
  regressor.compile(optimizer='adam', loss='mse')

  return regressor

In [0]:
regressor = KerasRegressor(build_fn=criarRede,
                         epochs = 20,
                         batch_size = 100)

**Cross-validation**

LeaveOneOut() is equivalent to KFold(n_splits=n) and to LeavePOut(p=1), where ***n*** is the number of samples.


* Because the number of test sets is high (equal to the number of samples), this cross-validation method can be very expensive

*   For large datasets, use ***KFold***, ***ShuffleSplit*** or ***StratifiedKFold*** instead
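The equivalence stated above can be verified with a short sketch on a hypothetical array of five samples: both splitters should produce one test fold per sample, in the same order.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy data: 5 samples with 2 features each
X = np.arange(10).reshape(5, 2)

# Collect the test indices produced by each splitter
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X)]
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(X)).split(X)]

print(loo_tests)
print(kfold_tests)
print(LeaveOneOut().get_n_splits(X))
```

With n samples the model is trained n times, which is where the cost mentioned above comes from.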



In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut

In [0]:
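X = np.array(features)
# Note: this stacks the three targets as rows, so y has shape (3, n_samples).
# Splitting on y therefore iterates over the 3 target arrays, not over the samples;
# np.column_stack((venda_na, venda_eu, venda_jp)) would give shape (n_samples, 3).
y = np.array((venda_na,venda_eu,venda_jp))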

In [0]:
loo = LeaveOneOut()

In [0]:
loo.get_n_splits(y)

3
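The value 3 reflects the shape of `y` rather than the number of games. The sketch below, using hypothetical target values, contrasts stacking the targets as rows with stacking them as columns; `LeaveOneOut` always counts the first axis as samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical toy targets for 4 samples
venda_na = np.array([1.0, 2.0, 3.0, 4.0])
venda_eu = np.array([0.5, 1.5, 2.5, 3.5])
venda_jp = np.array([0.1, 0.2, 0.3, 0.4])

# Targets as rows: shape (3, 4), so the splitter sees 3 "samples"
y_rows = np.array((venda_na, venda_eu, venda_jp))
# Targets as columns: shape (4, 3), one row per sample
y_cols = np.column_stack((venda_na, venda_eu, venda_jp))

n_rows = LeaveOneOut().get_n_splits(y_rows)
n_cols = LeaveOneOut().get_n_splits(y_cols)

print(y_rows.shape, n_rows)
print(y_cols.shape, n_cols)
```

For a leave-one-out evaluation over games, the column layout is the one that matches the rows of `X`.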

In [0]:
for train_index, test_index in loo.split(y):
  print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  print(X_train, X_test, y_train, y_test)

TRAIN: [1 2] TEST: [0]
[[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 