In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping
tf.random.set_seed(16)
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

<h2> An experimental report: Forecasting video game sales from 1980 to 2020 across the globe </h2> 

<h4> Introduction </h4>

This experimental report aims to predict the global sales of video games released between 1980 and 2020. The dataset covers video games that had sales greater than 100,000 copies. It includes sales figures for North America, Europe, Japan and other regions, as well as global sales. 

This report is a regression task using several types of models. The aim is to tune as many hyperparameters as possible in order to find a combination that minimises the validation mean absolute error (val_mae), that is, the average gap between the predictions and our global_sales target. The hyperparameters tuned in the coursework range from the number of layers and neurons to activation functions, loss functions and optimizers. 

The first part looks at the raw dataset, its strengths and weaknesses, and applies data preprocessing methods before creating models and tuning hyperparameters. At the time of writing, I have created over 20 models, each with a corresponding number. Along the way I dropped roughly four of them, as they showed no improvement or duplicated other models. To keep the models in order, they are numbered from Model 1 to Model 20, with a few numbers missing. 

This report also includes a performance assessment of the best model, which achieved a val_mae of 0.46, and of another model that achieved a val_mae of 0.49. The assessment showed that the 0.46 model, with an architecture of three layers of 132 neurons, relu activation, the rmsprop optimizer and mse loss, is the best fit for this regression task. 

In the following section I will look at the methodology of this experimental report. 


<h4> Methodology </h4>

<h4> Dataset </h4> 
The first thing to do is to look at the raw data so I can assess the state it is in. This specific dataset was taken from Kaggle. It was scraped with BeautifulSoup from www.vgchartz.com, a popular platform that tracks global historical video game sales. 

I started with data cleaning, such as renaming columns to ensure there are no spaces or anything else that might stop the code from running properly. I also performed some exploratory data analysis to establish the mean, minimum and maximum values relevant to the regression model. 

The dataset contains 329 missing values across its 16,598 rows, too few to affect the analysis materially. The missing cells could not sensibly be replaced by a mean or median, as they represent years of publication and publishers' names, so I chose to drop the affected rows.  

Several columns contain strings, so I changed their datatype to categorical, both to follow best practice and because the machine learning models would otherwise not be able to learn from them. The next step was to encode the categorical data. After dropping the regional sales columns and Rank, I only had two numerical columns, *global_sales* and *year*. The first is our target variable. A target variable should not be standardized, therefore I only dealt with the categorical columns. 

<h4> Encoding </h4> 
The reason I chose label encoding is that I first tried one-hot encoding, which created 10,000 more columns than I had previously. This had the potential to confuse the model and create an information bottleneck. It also uses too much computational power and can introduce noise, especially through the name column, in which many games have several entries. With label encoding, all entries of one game share a code (e.g. 2002 FIFA World Cup had 9 entries across several years and several platforms, so all entries for that game are assigned the same integer). This seemed to work fine for the models I ran. The one limitation of label encoding is that a machine learning model could treat the game with code 10 as higher or more important than the game with code 4, even though the games have no rank. 
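
To make the column-count argument concrete, the sketch below compares the two encodings on a small hypothetical DataFrame (the rows are illustrative and not taken from the dataset). `pd.get_dummies` creates one column per distinct value, whereas `LabelEncoder` keeps a single integer column per feature.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "name": ["Game A", "Game A", "Game B", "Game C"],
    "platform": ["PS2", "GC", "PS2", "DS"],
})

# One-hot encoding: one new column per distinct value in each column.
one_hot = pd.get_dummies(toy, columns=["name", "platform"])

# Label encoding: one integer column per original column.
label_encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))

print(one_hot.shape[1])        # 3 names + 3 platforms = 6 columns
print(label_encoded.shape[1])  # still 2 columns
print(label_encoded["name"].tolist())  # repeated titles share one code
```

With thousands of distinct game names, the one-hot column count grows with every new title, while the label-encoded frame keeps its width.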


<h4> Validation </h4> 
The models in this report use holdout validation, where the dataset is split into a training set, a validation set and a test set. The models are trained on the training set, compared on the validation set and, at the end, tested on the test set. Holdout validation is easy to implement and trains models faster than schemes that retrain on several folds.

A common way to check the accuracy of a regression model is the mean absolute error, so I will be focusing on that (represented by val_mae). 
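
As a minimal sketch of what this metric measures, the example below computes the mean absolute error by hand and with scikit-learn. The sales figures and predictions are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sales figures and predictions, for illustration only.
y_true = np.array([0.10, 0.50, 1.20, 4.00])
y_pred = np.array([0.30, 0.40, 1.00, 2.50])

# MAE is the average of the absolute differences.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Both values agree, and the result is expressed in the same unit as the target, which makes val_mae easy to read against global_sales.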

<h4> Hyperparameters </h4> 

This report is concerned with tuning several types of hyperparameters, using the performance on the validation data as a feedback signal. This tuning is itself a form of learning: a search for a good configuration that works with my dataset (Chollet, 2021, pg 133). 

<h4> Preventing information leaks </h4> 

The report is structured in a way that limits information leaks, which happen when a model is tuned and run several times against the same validation data. The first model will be run once and, based on its val_mae, I will build further models with other hyperparameters. The models will then be compared by their minimum val_mae (validation mean absolute error). The two best models will have their performance assessed against the test data. 


<h4> Model structure </h4> 

The general rule of thumb is to start small with any model. "Deep Learning with Python" (Chollet, 2021, pg 145) highlights the importance of starting with few layers and parameters, then increasing and tuning them as you go along, "until you see diminishing returns with regard to validation loss".

<h4> Overfitting </h4> 

Model 20 adds two dropout layers to prevent overfitting; however, its val_mae is not dissimilar to that of the other models I ran. 

<h4> Structure </h4> 
Each model will have its architecture and a history object, which records all the metrics logged during training (Chollet, 2021, pg 102). 
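
The sketch below shows how such a history can be read. It uses a plain dictionary with hypothetical numbers as a stand-in for `history.history`, which Keras exposes as a dict mapping each metric name to one value per epoch.

```python
import numpy as np

# Hypothetical stand-in for `history.history`; the values are made up.
history_dict = {
    "mae": [0.80, 0.62, 0.55, 0.51, 0.50],
    "val_mae": [0.70, 0.58, 0.49, 0.52, 0.53],
}

val_mae = history_dict["val_mae"]
best_val_mae = min(val_mae)
best_epoch = int(np.argmin(val_mae)) + 1  # epochs are counted from 1

print(f"best val_mae = {best_val_mae:.2f} at epoch {best_epoch}")
```

Looking up the epoch alongside the minimum shows whether the best score came early, with later epochs drifting upwards, or at the very end of training.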

<h4> Loss function </h4> 

Mean squared error (MSE) is the usual loss function for regression, therefore I will stick with it for the majority of the models. 

In the next section I will look at the raw data and its structure and perform data preprocessing. 




<h2> Raw data </h2>

In [2]:
df = pd.read_csv("vgsales.csv")
df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,11078,Action Man-Operation Extreme,PS,,Action,,0.05,0.03,0.00,0.01,0.09
1,3219,Advance Wars: Days of Ruin,DS,,Strategy,Nintendo,0.44,0.13,0.00,0.06,0.63
2,1515,Adventure,2600,,Adventure,Atari,1.21,0.08,0.00,0.01,1.30
3,16249,Agarest Senki: Re-appearance,PS3,,Role-Playing,Idea Factory,0.00,0.00,0.01,0.00,0.01
4,2115,Air-Sea Battle,2600,,Shooter,Atari,0.91,0.06,0.00,0.01,0.98
...,...,...,...,...,...,...,...,...,...,...,...
16593,1971,Defender,2600,1980.0,Misc,Atari,0.99,0.05,0.00,0.01,1.05
16594,5368,Freeway,2600,1980.0,Action,Activision,0.32,0.02,0.00,0.00,0.34
16595,4027,Ice Hockey,2600,1980.0,Sports,Activision,0.46,0.03,0.00,0.01,0.49
16596,1768,Kaboom!,2600,1980.0,Misc,Activision,1.07,0.07,0.00,0.01,1.15


In [3]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [4]:
df.shape

(16598, 11)

In [5]:
df_missing = df.isna()
df_missing.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,False,False,False,True,False,True,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,False


In [6]:
df_missing = df_missing.sum()
df_missing

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

In [7]:
df.isna().mean().round(4) * 100 

Rank            0.00
Name            0.00
Platform        0.00
Year            1.63
Genre           0.00
Publisher       0.35
NA_Sales        0.00
EU_Sales        0.00
JP_Sales        0.00
Other_Sales     0.00
Global_Sales    0.00
dtype: float64

In [8]:
df.describe()

Unnamed: 0,Rank,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
count,16598.0,16327.0,16598.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.406443,0.264667,0.146652,0.077782,0.048063,0.537441
std,4791.853933,5.828981,0.816683,0.505351,0.309291,0.188588,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,10.57,82.74


<h1> Renaming and dropping columns </h1>

In [9]:
print(df.columns)

Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')


<h4> In this report, I will be predicting the global sales of video games, which are the sum of the sales in the North America, Europe, Japan and Other sales columns. Therefore, I will be dropping those columns, along with Rank, so that the machine learning algorithm focuses solely on global sales. </h4>

In [10]:
columns_to_goaway = ['Rank', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales']

df_small = df.drop(columns=columns_to_goaway)
df_small.head(3)

Unnamed: 0,Name,Platform,Year,Genre,Publisher,Global_Sales
0,Action Man-Operation Extreme,PS,,Action,,0.09
1,Advance Wars: Days of Ruin,DS,,Strategy,Nintendo,0.63
2,Adventure,2600,,Adventure,Atari,1.3


In [11]:
df_small.rename(columns = {'Name':'name', 'Platform':'platform', 'Year':'year', 'Genre':'genre', 'Publisher':'publisher', 'Global_Sales':'global_sales'}, inplace = True)
print(df_small.columns)

Index(['name', 'platform', 'year', 'genre', 'publisher', 'global_sales'], dtype='object')


<h1> Missing data </h1>

<h3> There are several ways to deal with missing data in a data science report. According to Chollet (2021), "if a feature is numerical, avoid inputting an arbitrary value like 0, because it may create discontinuity in the latent space formed by our features, making it harder for a model trained on it to generalize". 
    
The author goes on to suggest replacing a missing value with the mean or median value of that feature in the dataset. However, given the nature of the missing data (years of publication and publishers' names), this would amount to tampering with the dataset, as the missing values cannot be approximated. 
    
As noted earlier, given the high number of entries in our data (over 16,000) and the small percentage of missing data, I will be dropping the rows with missing values. </h3>
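
One detail worth checking before dropping is the difference between missing cells and affected rows: the counts above give 329 missing cells (271 + 58), yet the cleaned frame ends up with 16,291 rows, so 307 rows are removed, which is consistent with some rows missing both fields. The sketch below illustrates the distinction on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "year": [2005.0, np.nan, np.nan, 2010.0],
    "publisher": ["Atari", None, "Sega", "Ubisoft"],
    "global_sales": [0.5, 0.1, 0.2, 1.3],
})

missing_cells = int(toy.isna().sum().sum())          # every empty cell
rows_with_gaps = int(toy.isna().any(axis=1).sum())   # rows with any gap
rows_kept = len(toy.dropna())

print(missing_cells, rows_with_gaps, rows_kept)
```

Here three cells are missing but only two rows are dropped, because one row lacks both the year and the publisher.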

In [12]:
df_clean = df_small.dropna()
df_clean

Unnamed: 0,name,platform,year,genre,publisher,global_sales
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04
...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05
16594,Freeway,2600,1980.0,Action,Activision,0.34
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15


<h1> Converting data </h1>

<h3> In order for the machine learning model to be able to use the data, the entries need to be either categorical or numerical (float or integer). Our global_sales is already a float, so I will be converting the string columns. </h3>

In [13]:
df_clean.dtypes

name             object
platform         object
year            float64
genre            object
publisher        object
global_sales    float64
dtype: object

In [14]:
datatype_convert = ['name', 'platform','genre','publisher']
df_clean[datatype_convert] = df_clean[datatype_convert].astype('category')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean[datatype_convert] = df_clean[datatype_convert].astype('category')
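
The warning above appears because pandas flags `df_clean` as derived from `df_small` and cannot be sure which object the assignment is meant to change. A common way to avoid it, sketched here on a hypothetical toy frame rather than the real data, is to take an explicit copy right after filtering.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_small.
toy_small = pd.DataFrame({
    "genre": ["Action", "Sports", None],
    "global_sales": [0.3, 1.1, 0.2],
})

# .copy() makes the cleaned frame independent of its source.
toy_clean = toy_small.dropna().copy()
toy_clean["genre"] = toy_clean["genre"].astype("category")

print(toy_clean.dtypes)
```

The conversion itself is unchanged; only the explicit copy is new, and it would apply equally to the later cells that add the encoded columns.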


In [15]:
df_clean

Unnamed: 0,name,platform,year,genre,publisher,global_sales
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04
...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05
16594,Freeway,2600,1980.0,Action,Activision,0.34
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15


<h1> Encoding of categorical columns </h1>

<h3> In this report, my target variable is global_sales. I will not be encoding or standardizing the target. The target variable is the variable I am trying to predict, and its scale will not affect the behavior of the machine learning models I am working with, because they predict based on relationships and differences in values rather than absolute values.

Therefore, I will encode only the categorical columns, using fit_transform, so that the machine learning model can understand the data and learn from it. Essentially, this translates the data into a language that the model can perceive. </h3>
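
As a hedged alternative to repeating one cell per column, the sketch below encodes several columns in a loop and keeps each fitted encoder in a dictionary. The `encoders` dictionary and the toy rows are assumptions made for illustration and are not part of the cells that follow.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "genre": ["Sports", "Action", "Sports"],
    "platform": ["PS2", "DS", "PS2"],
})

encoders = {}  # one fitted encoder per column, keyed by column name
for column in ["genre", "platform"]:
    encoders[column] = LabelEncoder()
    toy[f"{column}_encoded"] = encoders[column].fit_transform(toy[column])

# A kept encoder can map integer codes back to the original labels.
decoded = encoders["genre"].inverse_transform(toy["genre_encoded"])
print(toy)
print(list(decoded))
```

Keeping one encoder per column means the integer codes can still be translated back into genre or platform names when inspecting predictions later.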

In [16]:
# Label encoding for 'genre'
label_encoder_genre = LabelEncoder()
df_clean['genre_encoded'] = label_encoder_genre.fit_transform(df_clean['genre'])
df_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['genre_encoded'] = label_encoder_genre.fit_transform(df_clean['genre'])


Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29,9
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01,0
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03,7
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01,7
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04,1
...,...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05,3
16594,Freeway,2600,1980.0,Action,Activision,0.34,0
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49,10
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15,3


In [17]:
# Label encoding for 'name'
label_encoder_genre = LabelEncoder()
df_clean['name_encoded'] = label_encoder_genre.fit_transform(df_clean['name'])
df_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['name_encoded'] = label_encoder_genre.fit_transform(df_clean['name'])


Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded,name_encoded
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29,9,4240
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01,0,1044
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03,7,7098
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01,7,7098
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04,1,30
...,...,...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05,3,1979
16594,Freeway,2600,1980.0,Action,Activision,0.34,0,3263
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49,10,4206
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15,3,4659


In [18]:
# Label encoding for 'platform' 
label_encoder_genre = LabelEncoder()
df_clean['platform_encoded'] = label_encoder_genre.fit_transform(df_clean['platform'])
df_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['platform_encoded'] = label_encoder_genre.fit_transform(df_clean['platform'])


Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded,name_encoded,platform_encoded
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29,9,4240,4
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01,0,1044,20
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03,7,7098,18
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01,7,7098,20
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04,1,30,2
...,...,...,...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05,3,1979,0
16594,Freeway,2600,1980.0,Action,Activision,0.34,0,3263,0
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49,10,4206,0
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15,3,4659,0


In [19]:
# Label encoding for 'publisher' 
label_encoder_genre = LabelEncoder()
df_clean['publisher_encoded'] = label_encoder_genre.fit_transform(df_clean['publisher'])
df_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['publisher_encoded'] = label_encoder_genre.fit_transform(df_clean['publisher'])


Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded,name_encoded,platform_encoded,publisher_encoded
271,Imagine: Makeup Artist,DS,2020.0,Simulation,Ubisoft,0.29,9,4240,4,524
272,Brothers Conflict: Precious Baby,PSV,2017.0,Action,Idea Factory,0.01,0,1044,20,230
273,Phantasy Star Online 2 Episode 4: Deluxe Package,PS4,2017.0,Role-Playing,Sega,0.03,7,7098,18,445
274,Phantasy Star Online 2 Episode 4: Deluxe Package,PSV,2017.0,Role-Playing,Sega,0.01,7,7098,20,445
275,12-Sai. Koisuru Diary,3DS,2016.0,Adventure,Happinet,0.04,1,30,2,212
...,...,...,...,...,...,...,...,...,...,...
16593,Defender,2600,1980.0,Misc,Atari,1.05,3,1979,0,53
16594,Freeway,2600,1980.0,Action,Activision,0.34,0,3263,0,21
16595,Ice Hockey,2600,1980.0,Sports,Activision,0.49,10,4206,0,21
16596,Kaboom!,2600,1980.0,Misc,Activision,1.15,3,4659,0,21


In [20]:
df_clean.query('name == "MLB 09: The Show"')

Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded,name_encoded,platform_encoded,publisher_encoded
6178,MLB 09: The Show,PS3,2009.0,Sports,Sony Computer Entertainment,0.72,10,5279,17,455
6179,MLB 09: The Show,PS2,2009.0,Sports,Sony Computer Entertainment,0.33,10,5279,16,455
6180,MLB 09: The Show,PSP,2009.0,Sports,Sony Computer Entertainment,0.26,10,5279,19,455


In [21]:
df_clean.query('publisher == "Eidos Interactive"')

Unnamed: 0,name,platform,year,genre,publisher,global_sales,genre_encoded,name_encoded,platform_encoded,publisher_encoded
5517,Batman: Arkham Asylum,PS3,2009.0,Action,Eidos Interactive,4.25,0,658,17,137
5518,Batman: Arkham Asylum,X360,2009.0,Action,Eidos Interactive,3.50,0,658,28,137
5519,Batman: Arkham Asylum,PC,2009.0,Action,Eidos Interactive,0.32,0,658,13,137
5523,Battlestations: Pacific,X360,2009.0,Strategy,Eidos Interactive,0.33,11,718,28,137
5524,Battlestations: Pacific,PC,2009.0,Strategy,Eidos Interactive,0.03,11,718,13,137
...,...,...,...,...,...,...,...,...,...,...
15606,Tomb Raider III: Adventures of Lara Croft,PS,1997.0,Action,Eidos Interactive,3.54,0,10191,15,137
15731,Machine Head,PS,1996.0,Shooter,Eidos Interactive,0.07,8,5337,15,137
15855,The Incredible Hulk: The Pantheon Saga,PS,1996.0,Action,Eidos Interactive,0.16,0,9753,15,137
15868,Tomb Raider,PS,1996.0,Action,Eidos Interactive,4.63,0,10187,15,137


<h1> Dropping unnecessary columns </h1>
    
<h4> Encoding the categorical columns created new columns, e.g. "name_encoded", which will be used in my models; therefore I will be dropping the original columns. </h4>

In [22]:
columns_bye = ['name', 'genre', 'publisher', 'platform']

df_clean = df_clean.drop(columns=columns_bye)
df_clean.head(3)

Unnamed: 0,year,global_sales,genre_encoded,name_encoded,platform_encoded,publisher_encoded
271,2020.0,0.29,9,4240,4,524
272,2017.0,0.01,0,1044,20,230
273,2017.0,0.03,7,7098,18,445


In [23]:
df_clean.shape

(16291, 6)

In [24]:
# training set = 7,982 rows
# validation set = 3,421 rows
# test set = 4,888 rows


<h1> Splitting the data </h1>

<h4> I will separate the global_sales column from the features and keep it as a target array, which the machine learning models below will learn to predict. </h4>

In [25]:
target = df_clean['global_sales']  # select the target by name rather than by position
df_clean = df_clean.drop(columns='global_sales')  # keep only the feature columns
df_clean

Unnamed: 0,year,genre_encoded,name_encoded,platform_encoded,publisher_encoded
271,2020.0,9,4240,4,524
272,2017.0,0,1044,20,230
273,2017.0,7,7098,18,445
274,2017.0,7,7098,20,445
275,2016.0,1,30,2,212
...,...,...,...,...,...
16593,1980.0,3,1979,0,53
16594,1980.0,0,3263,0,21
16595,1980.0,10,4206,0,21
16596,1980.0,3,4659,0,21


In [26]:
target

271      0.29
272      0.01
273      0.03
274      0.01
275      0.04
         ... 
16593    1.05
16594    0.34
16595    0.49
16596    1.15
16597    2.76
Name: global_sales, Length: 16291, dtype: float64

<h4> Now that we have the separate target array for global_sales, we will look at the rest of the data. 

I will split df_clean into separate train, validation and test sets. The first split keeps 70% of df_clean for training and 30% for testing, a commonly used ratio. The same 70/30 split is then applied to the training portion to set aside a validation set, which leaves roughly 49% of the rows for training, 21% for validation and 30% for testing. 

The random state is set to 16 to make the data split reproducible. The models start from random weights, which are controlled separately by the TensorFlow seed set at the top of the notebook; this helps the results to be reproduced at a later date if needed. </h4>

In [27]:
train, test, train_target, test_target = train_test_split(df_clean,target, test_size=0.3, train_size=0.7, random_state=16, shuffle=True)

In [28]:
train.shape, train_target.shape

((11403, 5), (11403,))

In [29]:
test.shape, test_target.shape

((4888, 5), (4888,))

In [30]:
train, validation, train_target, validation_target = train_test_split(train, train_target, test_size=0.3, train_size=0.7, random_state=16, shuffle=True)

In [31]:
validation.shape, validation_target.shape

((3421, 5), (3421,))

In [32]:
train.shape, train_target.shape

((7982, 5), (7982,))
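
The shapes above can be checked independently of the data. The sketch below repeats the two-stage split on a plain array of row indices, used as a stand-in for the real rows, and prints the share of each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 16291  # number of rows in df_clean after dropping missing values
indices = np.arange(n_rows)  # stand-in for the real rows

# First split: 70% for training and validation, 30% for testing.
train_idx, test_idx = train_test_split(
    indices, test_size=0.3, train_size=0.7, random_state=16, shuffle=True
)
# Second split: 30% of the remaining rows become the validation set.
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.3, train_size=0.7, random_state=16, shuffle=True
)

for name, part in [("train", train_idx), ("validation", val_idx), ("test", test_idx)]:
    print(f"{name}: {len(part)} rows ({len(part) / n_rows:.0%})")
```

The set sizes match the shapes reported above, so the models are trained on about half of the cleaned data.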

In [33]:
validation

Unnamed: 0,year,genre_encoded,name_encoded,platform_encoded,publisher_encoded
9378,2007.0,0,10321,19,21
10742,2005.0,6,3211,4,564
2774,2012.0,2,7043,28,55
9608,2006.0,1,1445,16,243
8560,2007.0,7,2693,19,156
...,...,...,...,...,...
8034,2008.0,7,9263,4,347
13803,2001.0,10,278,16,17
2835,2012.0,0,8175,2,525
4504,2010.0,10,2889,13,138


In [34]:
train

Unnamed: 0,year,genre_encoded,name_encoded,platform_encoded,publisher_encoded
1208,2015.0,10,10662,20,69
16269,1993.0,7,7952,23,465
15741,1996.0,10,5298,15,455
14891,1999.0,0,8951,15,288
11953,2004.0,3,8276,16,445
...,...,...,...,...,...
7155,2008.0,1,1761,4,499
15762,1996.0,10,6424,15,138
4348,2010.0,1,1149,17,524
11189,2005.0,7,8821,19,230


<h4> From here on I will look at several models, based on the standard model taken from the Boston lab. I will start with the initial model and then tune hyperparameters or add more as I go along.

There is a wide range of updates to follow, mostly concerning loss functions, activation functions, optimizers, the number of layers and the number of neurons. 

As a reminder, the most important value I am looking at is val_mae, the mean absolute error on the validation set, which measures how far the predictions are from the true values on average and so evaluates the performance of the model. Looking at the description of the initial dataset above, three quarters of the global_sales values lie between 0.01 and 0.47 and the median is 0.17, while a few titles sold far more, up to a maximum of 82.74. 

With the next models, I am trying to achieve the lowest val_mae possible, showing that the models can predict global video game sales. </h4>
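
Because most sales values are small, a val_mae is easier to judge next to a naive baseline. The sketch below is an assumption of mine rather than part of the original experiments: it uses scikit-learn's `DummyRegressor` to always predict the training median, on hypothetical right-skewed data, and reports the resulting mean absolute error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical right-skewed sales figures, for illustration only.
rng = np.random.default_rng(16)
y_train = rng.lognormal(mean=-1.5, sigma=1.2, size=1000)
y_val = rng.lognormal(mean=-1.5, sigma=1.2, size=300)

# DummyRegressor ignores the features, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_val, baseline.predict(X_val))
print(f"baseline val_mae = {baseline_mae:.3f}")
```

Run on the real `train_target` and `validation_target`, the same few lines would give a reference value that any neural network should beat.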
    
    

![image.png](attachment:436e50ff-fdba-4b72-abc8-bdc67c698486.png)

<h3> MODEL 1:  BOSTON MODEL </h3>

<h4> As mentioned above, the first model and my starting point is the basic Boston model created by Jeremie Wenger in Lab 5. </h4>

In [35]:
tf.keras.backend.clear_session()
model1 = tf.keras.models.Sequential()
model1.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model1.add(tf.keras.layers.Dense(64, activation = 'relu'))
model1.add(tf.keras.layers.Dense(1))
model1.compile(optimizer='rmsprop',
        loss='mse',
        metrics=['mae'])


In [36]:
model1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                384       
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 4,609
Trainable params: 4,609
Non-trainable params: 0
_________________________________________________________________
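
The parameter counts in this summary can be reproduced by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The helper below is a hypothetical function written for illustration, not part of Keras.

```python
def dense_param_counts(input_dim, layer_sizes):
    """Return the parameter count of each Dense layer in a stack."""
    counts = []
    previous = input_dim
    for units in layer_sizes:
        # One weight per (input, unit) pair plus one bias per unit.
        counts.append(previous * units + units)
        previous = units
    return counts

# Model 1: five input features, then Dense(64), Dense(64), Dense(1).
counts = dense_param_counts(5, [64, 64, 1])
print(counts, sum(counts))  # [384, 4160, 65] 4609
```

The same function gives a quick estimate of how much larger the later models become when layers of 132 neurons are added.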


In [37]:
history1 = model1.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 5, verbose= 1 
        )

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [38]:
min(history1.history['val_mae'])

0.506214439868927

------------------------------------------------

<h3> MODEL 2: 132 NEURONS ON ONE LAYER AND BATCH SIZE OF 15 </h3> 

In [39]:
tf.keras.backend.clear_session()
model2 = tf.keras.models.Sequential()
model2.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model2.add(tf.keras.layers.Dense(64, activation = 'relu'))
model2.add(tf.keras.layers.Dense(132, activation = 'relu'))
model2.add(tf.keras.layers.Dense(1))
model2.compile(optimizer='rmsprop',
        loss='mse',
        metrics=['mae'])


In [40]:
history2 = model2.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 15, verbose= 1 
        )

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [41]:
min(history2.history['val_mae'])

0.5378407835960388

-----------------------------------------------------

<h3> MODEL 3:  ONE MORE LAYER OF 64 AND BATCH SIZE OF 20</h3> 

<h4> There is not much of a difference between Model 1 and Model 2 (val_mae of 0.51 and 0.54 respectively), so I am adding more layers, with a first layer of 132 neurons, and raising the batch size to 20. </h4>

In [42]:
tf.keras.backend.clear_session()
model3 = tf.keras.models.Sequential()
model3.add(tf.keras.layers.Dense(132, activation = 'relu', input_shape = (train.shape[1],)))
model3.add(tf.keras.layers.Dense(64, activation = 'relu'))
model3.add(tf.keras.layers.Dense(64, activation = 'relu'))
model3.add(tf.keras.layers.Dense(64, activation = 'relu'))
model3.add(tf.keras.layers.Dense(1))
model3.compile(optimizer='rmsprop',
        loss='mse',
        metrics=['mae'])

In [None]:
history3 = model3.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 20, verbose= 1 
        )

Epoch 1/25
  3/400 [..............................] - ETA: 12s - loss: 130362.1094 - mae: 247.5479

In [None]:
min(history3.history['val_mae'])

-------------------------------------

<h3> MODEL 5: 132 NEURONS ON THREE LAYERS AND BATCH SIZE OF 30 </h3> 

<h4> I wonder if increasing the number of neurons and the batch size will be able to give us a much smaller val_mae.  </h4> 

In [None]:
tf.keras.backend.clear_session()
model5 = tf.keras.models.Sequential()
model5.add(tf.keras.layers.Dense(132, activation = 'relu', input_shape = (train.shape[1],)))
model5.add(tf.keras.layers.Dense(132, activation = 'relu'))
model5.add(tf.keras.layers.Dense(132, activation = 'relu'))
model5.add(tf.keras.layers.Dense(64, activation = 'relu'))
model5.add(tf.keras.layers.Dense(1))
model5.compile(optimizer='rmsprop',
        loss='mse',
        metrics=['mae'])
    

In [None]:
history5 = model5.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 30, verbose= 1 
        )

In [None]:
min(history5.history['val_mae'])

<h4> As mentioned above, the value of interest is val_mae. It started from 0.52 and is now brought down to 0.46, which makes this the most successful model so far. I will introduce a visualisation in order to compare the training loss with the validation loss. </h4>

<h3> Visualisation of the best model so far </h3>

In [None]:
def plot_history(history):
    """Plot training and validation loss per epoch for a Keras History object."""
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(loss) + 1)

    # A single axes holds both curves.
    fig, ax = plt.subplots(constrained_layout=True, figsize=(10, 3))

    ax.plot(epochs, loss, label="Training loss")
    ax.plot(epochs, val_loss, label="Validation loss")
    ax.set_title("Training and validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.legend()

    plt.show()

In [None]:
plot_history(history5)

<h3> It is interesting to point out that the training loss started very high, at 3592.4045, and then declined so rapidly that the rest of its curve is hard to tell apart from the validation loss, which follows a fairly steady trajectory. </h3>
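
When the first epoch dwarfs the rest, a logarithmic y-axis can keep the later epochs readable. The sketch below is one possible variant, with a hypothetical function name and made-up values shaped like the curve described above. It takes the plain dictionary found in `history.history`.

```python
import matplotlib.pyplot as plt

def plot_loss_log(history_dict):
    """Plot training and validation loss on a logarithmic y-axis."""
    loss = history_dict["loss"]
    val_loss = history_dict["val_loss"]
    epochs = range(1, len(loss) + 1)

    fig, ax = plt.subplots(figsize=(6, 3), constrained_layout=True)
    ax.plot(epochs, loss, label="Training loss")
    ax.plot(epochs, val_loss, label="Validation loss")
    ax.set_yscale("log")  # compresses the large first-epoch value
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss (log scale)")
    ax.legend()
    return fig

# Hypothetical values, for illustration only.
example = {"loss": [3592.4, 4.1, 2.6, 2.3], "val_loss": [3.0, 2.8, 2.7, 2.7]}
fig = plot_loss_log(example)
```

For a trained model, the call would be `plot_loss_log(history5.history)`.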

-------------------------------

<h3> MODEL 6: SGD, LEARNING RATE OF 0.01 AND BATCH SIZE OF 5 </h3> 

<h4> The next model keeps the layer sizes of our best model so far, with three updates: a stochastic gradient descent optimizer with a learning rate of 0.01, sigmoid activations in place of relu, and a batch size of 5. </h4> 
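
To illustrate why the learning rate matters, the sketch below runs plain gradient descent on a toy one-dimensional objective, f(w) = (w - 3)**2, whose minimum is at w = 3. It is a simplified illustration under that assumption: it is neither stochastic nor a model of the network above.

```python
def gradient_descent(learning_rate, steps=25, start=0.0):
    """Minimise f(w) = (w - 3)**2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)  # derivative of (w - 3)**2
        w -= learning_rate * gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 25 steps = {gradient_descent(lr):.4f}")
```

On this toy problem, the smallest rate is still far from 3 after 25 steps, the middle rate lands close to it, and the largest rate overshoots further on every step and diverges.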

In [None]:
tf.keras.backend.clear_session()
model6 = tf.keras.models.Sequential()
model6.add(tf.keras.layers.Dense(132, activation = 'sigmoid', input_shape = (train.shape[1],)))
model6.add(tf.keras.layers.Dense(132, activation = 'sigmoid'))
model6.add(tf.keras.layers.Dense(132, activation = 'sigmoid'))
model6.add(tf.keras.layers.Dense(64, activation = 'sigmoid'))
model6.add(tf.keras.layers.Dense(1))
model6.compile(optimizer = SGD(learning_rate=0.01),
        loss='mse',
        metrics=['mae'])
    

In [None]:
history6 = model6.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 5, verbose= 1 
        )

In [None]:
min(history6.history['val_mae'])

In [None]:
def plot_history(history):
    """Plot training and validation loss per epoch for a Keras History object."""
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(loss) + 1)

    # A single panel is enough: only the loss curves are drawn
    fig, ax = plt.subplots(constrained_layout=True, figsize=(10, 3))
    ax.plot(epochs, loss, label="Training loss")
    ax.plot(epochs, val_loss, label="Validation loss")
    ax.set_title("Training and validation loss")
    ax.set_xlabel("Epoch")
    ax.legend()
    plt.show()

In [None]:
plot_history(history6)

<h3> With some updates, this model brought the val_mae to 0.49. Both the training loss and the validation loss have quite a steady trajectory. 

--------------------------

<h3> MODEL 7: TWO LAYERS OF 64 NEURONS, SGD, LEARNING RATE OF 0.03 AND 20 EPOCHS </h3> 


In [None]:
tf.keras.backend.clear_session()
model7 = tf.keras.models.Sequential()
model7.add(tf.keras.layers.Dense(64, activation = 'sigmoid', input_shape = (train.shape[1],)))
model7.add(tf.keras.layers.Dense(64, activation = 'sigmoid'))
model7.add(tf.keras.layers.Dense(1))
model7.compile(optimizer = SGD(learning_rate=0.03),
        loss='mse',
        metrics=['mae'])
    

In [None]:
history7 = model7.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 5, verbose= 1 
        )

In [None]:
min(history7.history['val_mae'])

-------------------------------------------

<h3> MODEL 8: TANH, LOW BATCH SIZE OF 1 AND LEARNING RATE OF 0.001 </h3> 

<h4> According to Chollet (2021), the tanh activation function used to be popular in the early days of neural networks, so I will try it here to see how the model responds. </h4>

In [None]:
tf.keras.backend.clear_session()
model8 = tf.keras.models.Sequential()
model8.add(tf.keras.layers.Dense(64, activation = 'tanh', input_shape = (train.shape[1],)))
model8.add(tf.keras.layers.Dense(64, activation = 'tanh'))
model8.add(tf.keras.layers.Dense(1))
model8.compile(optimizer = SGD(learning_rate=0.001),
        loss='mse',
        metrics=['mae'])
    

In [None]:
history8 = model8.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history8.history['val_mae'])

<h4> It seems the tanh activation function gives quite a high val_mae, so I will not be using it for the rest of this report. </h4>

----------------------------------------------------

<h3> MODEL 9: EVEN LOWER LEARNING RATE OF 0.0001 </h3> 

<h4> Even though I have tuned several parameters so far and brought the val_mae down to a low 0.46, I want to update more parameters to see if I can continue lowering the validation MAE. I will now lower the learning rate to 0.0001, using the two-layer sigmoid architecture of Model 7 with a batch size of 1. </h4>

In [None]:
tf.keras.backend.clear_session()
model9 = tf.keras.models.Sequential()
model9.add(tf.keras.layers.Dense(64, activation = 'sigmoid', input_shape = (train.shape[1],)))
model9.add(tf.keras.layers.Dense(64, activation = 'sigmoid'))
model9.add(tf.keras.layers.Dense(1))
model9.compile(optimizer = SGD(learning_rate=0.0001),
        loss='mse',
        metrics=['mae'])
    

In [None]:
history9 = model9.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history9.history['val_mae'])

--------------------------------------------------------------------------

<h3> MODEL 10: HUBER LOSS, BATCH SIZE OF 1 AND 20 EPOCHS </h3>

In [None]:
tf.keras.backend.clear_session()
model10 = tf.keras.models.Sequential()
model10.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model10.add(tf.keras.layers.Dense(64, activation = 'relu'))
model10.add(tf.keras.layers.Dense(1))
model10.compile(optimizer = SGD(learning_rate=0.0001),
        loss='huber_loss',
        metrics=['mae'])
    

In [None]:
history10 = model10.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history10.history['val_mae'])

<h4> It is not lower than 0.46; however, a val_mae of 0.47 shows that this new model can also bring the val_mae down, meaning a lower error in predicting the global sales for video games. Let us look at a visualisation to see how the loss behaved. </h4>

In [None]:
def plot_history(history):
    """Plot training and validation loss per epoch for a Keras History object."""
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(loss) + 1)

    # A single panel is enough: only the loss curves are drawn
    fig, ax = plt.subplots(constrained_layout=True, figsize=(10, 3))
    ax.plot(epochs, loss, label="Training loss")
    ax.plot(epochs, val_loss, label="Validation loss")
    ax.set_title("Training and validation loss")
    ax.set_xlabel("Epoch")
    ax.legend()
    plt.show()

In [None]:
plot_history(history10)
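For reference, the Huber loss used by Model 10 is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than mse. The sketch below is a plain-Python illustration rather than the Keras implementation. It assumes the threshold `delta=1.0`, which is the Keras default, and the target and prediction values are invented.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss over paired lists of targets and predictions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error ** 2               # quadratic for small errors
        else:
            total += delta * (error - 0.5 * delta)  # linear for large errors
    return total / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values: three small sales figures and one outlier
y_true = [0.1, 0.3, 0.5, 82.0]
y_pred = [0.2, 0.2, 0.6, 2.0]
print(huber(y_true, y_pred), mse(y_true, y_pred))
```

In this example the single outlier dominates the mse far more than the Huber loss, which is the behaviour that makes Huber attractive for a target with a few very large values.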

-----------------------------------------------------------------------

<h3> MODEL 11: LOGCOSH FUNCTION </h3>


In [None]:
tf.keras.backend.clear_session()
model11 = tf.keras.models.Sequential()
model11.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model11.add(tf.keras.layers.Dense(64, activation = 'relu'))
model11.add(tf.keras.layers.Dense(1))
model11.compile(optimizer = SGD(learning_rate=0.0001),
        loss='logcosh',
        metrics=['mae'])
    

In [None]:
history11 = model11.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history11.history['val_mae'])

<h4> This is a new model that has also brought the val_mae to a low 0.46. It differs from the first successful model in its architecture, optimizer and loss: the first successful model used rmsprop with the mse loss and had three layers of 132 neurons and one layer of 64 neurons, whereas this model has two layers of 64 neurons. It performs just as well with an SGD optimizer, a learning rate of 0.0001 and a logcosh loss function. I will look at the visualisation for the loss. </h4>

In [None]:
def plot_history(history):
    """Plot training and validation loss per epoch for a Keras History object."""
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(loss) + 1)

    # A single panel is enough: only the loss curves are drawn
    fig, ax = plt.subplots(constrained_layout=True, figsize=(10, 3))
    ax.plot(epochs, loss, label="Training loss")
    ax.plot(epochs, val_loss, label="Validation loss")
    ax.set_title("Training and validation loss")
    ax.set_xlabel("Epoch")
    ax.legend()
    plt.show()

In [None]:
plot_history(history11)
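The logcosh loss used by Model 11 takes the logarithm of the hyperbolic cosine of the prediction error. It behaves roughly like half the squared error for small errors and like the absolute error for large ones. The sketch below is a plain-Python illustration under that definition, not the Keras code; the function name `log_cosh` is my own.

```python
import math


def log_cosh(y_true, y_pred):
    """Mean of log(cosh(error)) over paired lists of targets and predictions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        x = abs(p - t)
        # Numerically stable form of log(cosh(x)); avoids overflow in cosh for large x
        total += x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)
    return total / len(y_true)


# Small error: close to 0.1 ** 2 / 2. Large error: close to 80 - log(2).
print(log_cosh([0.0], [0.1]), log_cosh([0.0], [80.0]))
```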

<h3> Looking at all the visualisations I have done so far, the training loss always declines rapidly after the first epoch, gets as low as possible, and then follows the same trajectory as the validation loss. The learning rate strongly influences how quickly the training loss falls, which shows how valuable it is. I will keep the learning rate of 0.0001 for the next model. </h3>

-------------------------------

<h3> MODEL 13: mean_squared_logarithmic_error LOSS FUNCTION </h3>

<h4> For all models that have mean_squared_logarithmic_error, their validation and training loss cannot be directly compared to the losses of other models with different loss functions. However, as this is an experimental report, I will be trying it out to see the results it gives. 
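To illustrate why these losses are on different scales, the sketch below computes mse and mean squared logarithmic error on the same two invented predictions. It assumes the usual definition of MSLE, the mean of (log(1 + y) - log(1 + prediction)) squared, and non-negative values.

```python
import math


def msle(y_true, y_pred):
    """Mean squared logarithmic error for non-negative targets and predictions."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values: one small sales figure and one large outlier
y_true = [0.5, 82.0]
y_pred = [1.0, 41.0]
print(msle(y_true, y_pred), mse(y_true, y_pred))
```

Because the logarithm compresses large values, the outlier contributes far less to the MSLE than to the mse, so a low MSLE loss does not imply a low val_mae. This is why val_mae remains the value used to compare models.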

In [None]:
tf.keras.backend.clear_session()
model13 = tf.keras.models.Sequential()
model13.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model13.add(tf.keras.layers.Dense(64, activation = 'relu'))
model13.add(tf.keras.layers.Dense(1))
model13.compile(optimizer = SGD(learning_rate=0.0001),
        loss='mean_squared_logarithmic_error',
        metrics=['mae'])
    

In [None]:
history13 = model13.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history13.history['val_mae'])

---------------------------------------

<h3> MODEL 14: SGD, LOW LEARNING RATE AND 3 LAYERS WITH 64 NEURONS </h3>

<h4> As we are keeping the new SGD with the learning rate of 0.0001 and the mean_squared_logarithmic_error loss function, we are now looking to update the number of layers again to see if it changes the outcome. I will now add one more layer of 64 neurons to the model. </h4>

In [None]:
tf.keras.backend.clear_session()
model14 = tf.keras.models.Sequential()
model14.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model14.add(tf.keras.layers.Dense(64, activation = 'relu'))
model14.add(tf.keras.layers.Dense(64, activation = 'relu'))
model14.add(tf.keras.layers.Dense(1))
model14.compile(optimizer = SGD(learning_rate=0.0001),
        loss='mean_squared_logarithmic_error',
        metrics=['mae'])
    

In [None]:
history14 = model14.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history14.history['val_mae'])

----------------------------------

<h3> MODEL 15: THREE LAYERS OF 132 NEURONS, RELU, SGD AND mean_squared_logarithmic_error </h3> 

In [None]:
tf.keras.backend.clear_session()
model15 = tf.keras.models.Sequential()
model15.add(tf.keras.layers.Dense(132, activation = 'relu', input_shape = (train.shape[1],)))
model15.add(tf.keras.layers.Dense(132, activation = 'relu'))
model15.add(tf.keras.layers.Dense(132, activation = 'relu'))
model15.add(tf.keras.layers.Dense(1))
model15.compile(optimizer = SGD(learning_rate=0.0001),
        loss='mean_squared_logarithmic_error',
        metrics=['mae'])
        

In [None]:
history15 = model15.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1 
        )

In [None]:
min(history15.history['val_mae'])

------------------------

<H3>  MODEL 17: LINEAR ACTIVATION FUNCTION </H3>

In [None]:
tf.keras.backend.clear_session()
model17 = tf.keras.models.Sequential()
model17.add(tf.keras.layers.Dense(64, activation = 'linear', input_shape = (train.shape[1],)))
model17.add(tf.keras.layers.Dense(64, activation = 'linear'))
model17.add(tf.keras.layers.Dense(64, activation = 'linear'))
model17.add(tf.keras.layers.Dense(1))
model17.compile(optimizer = SGD(learning_rate=0.0001),
        loss='mean_squared_logarithmic_error',
        metrics=['mae'])
        

In [None]:
history17 = model17.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1
        )

In [None]:
min(history17.history['val_mae'])

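<h4> With only linear activations, the stacked layers collapse into a single linear transformation, so the network cannot model non-linear relationships. As noted in the Results section, this model reached a val_mae of 4080, so I will not be using it. </h4>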
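To see why stacking layers with a linear activation adds no modelling power, the sketch below multiplies one sample through two small weight matrices and compares the result with a single combined matrix. The weights are invented and biases are left out for brevity; with biases the same argument holds, since they fold into one combined bias.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Two "linear" layers written as weight matrices (illustrative values)
w1 = [[1.0, 2.0], [0.0, 1.0]]
w2 = [[3.0], [-1.0]]
x = [[0.5, 2.0]]  # one sample with two features

two_layers = matmul(matmul(x, w1), w2)  # pass through layer 1, then layer 2
one_layer = matmul(x, matmul(w1, w2))   # a single layer with the combined weights
print(two_layers, one_layer)
```

Both computations give the same output, so the extra linear layers do not let the model represent anything a single linear layer could not.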

--------------------------------------------

<h3> MODEL 19: FIVE LAYERS OF 64 NEURONS </h3>

In [None]:
tf.keras.backend.clear_session()
model19 = tf.keras.models.Sequential()
model19.add(tf.keras.layers.Dense(64, activation = 'relu', input_shape = (train.shape[1],)))
model19.add(tf.keras.layers.Dense(64, activation = 'relu'))
model19.add(tf.keras.layers.Dense(64, activation = 'relu'))
model19.add(tf.keras.layers.Dense(64, activation = 'relu'))
model19.add(tf.keras.layers.Dense(64, activation = 'relu'))
model19.add(tf.keras.layers.Dense(1))
model19.compile(optimizer = SGD(learning_rate=0.000001),
        loss='mean_squared_logarithmic_error',
        metrics=['mae'])

In [None]:
history19 = model19.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 1, verbose= 1
        )

In [None]:
min(history19.history['val_mae'])

<h4> It is quite clear that mean_squared_logarithmic_error is not a useful loss function for this task. The val_mae values it produced are extremely high, which reflects the very low accuracy of these models. </h4>

<h3> MODEL 20 - DROPOUT LAYERS

<h4> This is an example of a simple model with just two dense layers, modified to take on two dropout layers to prevent overfitting. 
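As a rough illustration of what a dropout layer does during training, the sketch below zeroes each activation with probability `rate` and scales the remaining ones by 1 / (1 - rate), so that the expected sum stays the same. It is a plain-Python sketch, not the Keras layer; the helper name and the seed are my own choices.

```python
import random


def dropout(values, rate, rng):
    """Apply training-time (inverted) dropout to a list of activations."""
    keep_scale = 1.0 / (1.0 - rate)
    # Each value is dropped independently with probability `rate`
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(16)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

At prediction time Keras switches dropout off, so `predict` uses all neurons.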

In [None]:
tf.keras.backend.clear_session()
model_play = tf.keras.models.Sequential()
model_play.add(tf.keras.layers.Dense(64, activation = 'sigmoid', input_shape = (train.shape[1],)))
model_play.add(tf.keras.layers.Dropout(0.3))
model_play.add(tf.keras.layers.Dense(64, activation = 'sigmoid'))
model_play.add(tf.keras.layers.Dropout(0.3))
model_play.add(tf.keras.layers.Dense(1))
model_play.compile(optimizer = SGD(learning_rate=0.03),
        loss='mse',
        metrics=['mae'])
    

In [None]:
history_play = model_play.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 10, batch_size= 5, verbose= 1 
        )

In [None]:
prediction_play=model_play.predict(test)
print(prediction_play)

In [None]:
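# Count predictions that differ from this value, using a float tolerance instead of exact equality
(~np.isclose(prediction_play, 0.44713214)).sum()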

In [None]:
# Same check against the second value that appears in the printed predictions
(~np.isclose(prediction_play, 0.4397072)).sum()
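The two cells above appear to check how many predictions differ from one particular value, in other words whether the model predicts almost the same number for every game. A minimal sketch of a more general check is shown below. `constant_share` is a hypothetical helper that expects a flat sequence of numbers (for example `prediction_play.ravel()`), and the example values are invented.

```python
from collections import Counter


def constant_share(predictions, decimals=4):
    """Return the most common rounded prediction and the fraction of rows equal to it."""
    rounded = [round(float(p), decimals) for p in predictions]
    most_common_value, count = Counter(rounded).most_common(1)[0]
    return most_common_value, count / len(rounded)


# Invented predictions: three identical values and one different one
example_predictions = [0.4471, 0.4471, 0.4471, 0.9]
print(constant_share(example_predictions))
```

A share close to 1 would suggest that the model has collapsed to predicting a single value.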

--------------------------------------------

<h4> According to Chollet (2021, pg 133), you train the model on the training data and evaluate it on the validation data, which is what all the models so far have done. </h4>

He goes on to say that once you have a trained model, you use the predict method to apply it in a more practical way, so that is what I will be doing below. Now that I have chosen a model that produces satisfactory results on this dataset (val_mae=0.46), I will test it against the test data with the predict method. 


<h3> A. PERFORMANCE ASSESSMENT ON FIRST SUCCESSFUL MODEL - MODEL 5</h3> 

As I have looked at multiple hyperparameters, including activation functions, loss functions, optimizers, and numbers of layers and neurons, I will stop here and continue with the assessment of performance. I am looking at two successful models, one that gave a val_mae of 0.46 and one that gave 0.49. Both values were measured on the validation set. As a reminder, the val_mae of the very first model started at 31. 

- The first model had three layers of 132 neurons and one layer of 64 neurons. It had a relu activation function, an rmsprop optimizer and the mse loss. 
- The second model had two layers of 64 neurons each, a sigmoid activation function, an SGD optimizer with a learning rate of 0.03, and the mse loss. 

I will now look to assess the model performance against the test set I previously split.

In [None]:
tf.keras.backend.clear_session()
model_best = tf.keras.models.Sequential()
model_best.add(tf.keras.layers.Dense(132, activation = 'relu', input_shape = (train.shape[1],)))
model_best.add(tf.keras.layers.Dense(132, activation = 'relu'))
model_best.add(tf.keras.layers.Dense(132, activation = 'relu'))
model_best.add(tf.keras.layers.Dense(64, activation = 'relu'))
model_best.add(tf.keras.layers.Dense(1))
model_best.compile(optimizer='rmsprop',
        loss='mse',
        metrics=['mae'])
    

In [None]:
history_best = model_best.fit(
        train,
        train_target,
        validation_data=(validation, validation_target),
        epochs= 25, batch_size= 30, verbose= 1 
        )

In [None]:
min(history_best.history['val_mae'])

I will check the individual predictions of global sales now. 

In [None]:
test_predictions=model_best.predict(test)
print(test_predictions)

In [None]:
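# Count predictions that differ from this value, using a float tolerance instead of exact equality
(~np.isclose(test_predictions, 0.52884823)).sum()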

In [None]:
mae_predictions=model_best.predict(test)
print(mae_predictions)

In [None]:
mean_absolute_error(test_target, test_predictions)

<h4> The mean absolute error of the model on the test set is 0.59. On average, the model’s predictions are 0.59 units away from the real global sales in the test set. </h4>
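For clarity, the mean absolute error is simply the average of the absolute differences between the real values and the predictions. The sketch below reproduces the calculation by hand on three invented values; it is meant as an illustration of the formula, not a replacement for `mean_absolute_error` from scikit-learn.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values: absolute errors of 0.5, 0.0 and 1.0
y_true = [0.5, 1.0, 2.0]
y_pred = [1.0, 1.0, 1.0]
print(mean_absolute_error_by_hand(y_true, y_pred))  # prints 0.5
```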

<h3> B. PERFORMANCE ASSESSMENT ON ANOTHER SUCCESSFUL MODEL - MODEL 7</h3>

In [None]:
tf.keras.backend.clear_session()
second_best = tf.keras.models.Sequential()
second_best.add(tf.keras.layers.Dense(64, activation = 'sigmoid', input_shape = (train.shape[1],)))
second_best.add(tf.keras.layers.Dense(64, activation = 'sigmoid'))
second_best.add(tf.keras.layers.Dense(1))
second_best.compile(optimizer = SGD(learning_rate=0.03),
        loss='mse',
        metrics=['mae'])
    

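In [None]:
# Fit the rebuilt model with Model 7's settings before predicting;
# without this step predict() would use randomly initialised weights.
second_best.fit(train, train_target,
        validation_data=(validation, validation_target),
        epochs= 20, batch_size= 5, verbose= 1)
test_predictions_two=second_best.predict(test)
print(test_predictions_two)

In [None]:
mean_absolute_error(test_target, test_predictions_two)

<h4> In the run reported here, the rebuilt model was not fitted before predict was called, and the mean absolute error on the test set was 0.82. That figure therefore reflects untrained weights rather than the trained model that was reported with a val_mae of 0.49. The test MAE needs to be recomputed with the training step above before it can be compared with the 0.59 of the best model. </h4>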

<h1> Results </h1> 

<h4> Batch sizes </h4> 

- It seems that the bigger the batch size, the faster each epoch runs. However, it is a trade-off: a bigger batch uses more memory, and the model updates its weights less often, so it has fewer chances per epoch to learn from individual examples. 
- If I lower the batch size, the model runs slower, but it updates its weights more often and needs less memory per step. As this was an exploratory analysis of how the model performs whilst updating different hyperparameters, it was useful and necessary to experiment with both types of batch sizes. 
- For this specific regression task, batch sizes of both 5 and 30 were successful. 
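The speed difference between batch sizes comes largely from the number of weight updates per epoch, which is the number of training rows divided by the batch size, rounded up. The sketch below works this out for the batch sizes used in this report. The training set size of 10,000 is a made-up figure for illustration, not the size of my training split.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


n_samples = 10_000  # hypothetical training set size
for batch_size in (1, 5, 30):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```

With these numbers, a batch size of 1 needs 10,000 updates per epoch, while a batch size of 30 needs 334.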

<h4> Loss function </h4> 

- Throughout this report, I have used several loss functions: mse, huber loss, logcosh and mean_squared_logarithmic_error. The best model (Model 5) used mse, while two other strong models (Model 10 and Model 11) used huber loss and logcosh respectively. 
- By comparison, the models using the mean_squared_logarithmic_error loss function reached val_mae values as high as 4080 (Model 17), which is far too high to be useful for this regression task. 

<h4> Optimizer & Learning Rate </h4> 

- Among the SGD models, the best outcome came with learning_rate=0.0001, which is unsurprising. Chollet (2021) highlights the importance of the learning rate and how tuning it ever so slightly can have a massive impact on the validation result.
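The effect of the learning rate can be illustrated with plain gradient descent on a very simple function. The sketch below minimises f(w) = w squared, whose gradient is 2w, using the update w = w - learning_rate * gradient. It is a toy example with invented settings, not the Keras SGD optimizer or my model.

```python
def sgd_minimise(learning_rate, steps=20, start=5.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                # derivative of w**2
        w = w - learning_rate * gradient  # gradient descent update
    return w


for learning_rate in (0.0001, 0.1, 1.1):
    print(learning_rate, sgd_minimise(learning_rate))
```

In this toy setting a very small learning rate barely moves away from the starting point in 20 steps, a moderate one gets close to the minimum at 0, and one that is too large overshoots and diverges. Which value works best depends on the problem, which is why it has to be tuned.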

<h4> Activation function </h4> 

- It is unsurprising that the best model, with a val_mae of 0.46, uses relu activation and rmsprop as optimizer, as both are a good default for a regression task, according to Deep Learning with Python (Chollet, 2021). 


<h4> Numbers of layers and neurons </h4> 

- The first model I ran was the Boston lab done by Jeremie Wenger, and it had two layers of 64 neurons each. My most successful model had three layers of 132 neurons each and one layer with 64 neurons.  


<h1> Conclusion 

In conclusion, this experimental report aimed to predict the global sales for video games from 1980 to 2020. I have tuned several hyperparameters in order to obtain the best possible model to predict the sales. 

Typical sales values were between 0.01 and 0.47, with some outliers reaching 82. The best model reached a validation mean absolute error of 0.46. Ideally, I would have wanted to reduce it to 0.35, but time constraints made that impossible. 

The predictions of the best model are on average 0.59 units away from the real global_sales, which is not a bad result. I have also paid attention to overfitting and used dropout layers; however, the validation MAE was on average the same as that of the unsuccessful models. 

This was a very interesting report to conduct, and a great introduction into the world of the many machine learning model possibilities. Ideally, moving forward, I want to focus on learning to reduce the validation mae through other options. 

<h4> References

Chollet, F. (2021) ‘1-6’, in Deep learning with python. 2nd edn. Shelter Island: Manning Publications. 

Smith, G. (2016) Video game sales, Kaggle. Available at: https://www.kaggle.com/datasets/gregorut/videogamesales (Accessed: 02 November 2023). 

VGChartz (no date) Video game charts, game sales, top sellers, game data. Available at: https://www.vgchartz.com/ (Accessed: 02 November 2023). 