# Machine Learning Model
## By: Martin Mwangi Wambui, student at Maseno University
## Source: Machine Learning for Beginners by Oliver Theobald
# We will design a system to predict the global sales of video games using gradient boosting by performing the following six steps:

## 1. Import necessary libraries
## 2. Import the dataset
## 3. Scrub the dataset
## 4. Split the data into training and test data
## 5. Select an algorithm and configure its hyperparameters
## 6. Evaluate the results

In [1]:
# Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23




In [2]:
# Import the dataset
vgsales = pd.read_csv(r"C:\Users\Admin\Documents\Python\Kaggle Datasets\PS4_GamesSales.csv", encoding = "Latin")
vgsales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1034 entries, 0 to 1033
Data columns (total 9 columns):
Game             1034 non-null object
Year             825 non-null float64
Genre            1034 non-null object
Publisher        825 non-null object
North America    1034 non-null float64
Europe           1034 non-null float64
Japan            1034 non-null float64
Rest of World    1034 non-null float64
Global           1034 non-null float64
dtypes: float64(6), object(3)
memory usage: 72.8+ KB


In [3]:
vgsales.iloc[11]

Game             Star Wars Battlefront 2015
Year                                   2015
Genre                               Shooter
Publisher                   Electronic Arts
North America                          3.31
Europe                                 3.19
Japan                                  0.23
Rest of World                           1.3
Global                                 8.03
Name: 11, dtype: object

In [4]:
# Delete the columns that we don't need
del vgsales['Game']
del vgsales['Europe']
del vgsales['Japan']
del vgsales['Rest of World']

We removed Game because it only holds the title of each game, which we won't use in the model.
We also removed sales in Europe, Japan and the Rest of World because our aim is to predict Global sales after releasing the game in the North American market.
Sales figures in the dataset are expressed in millions (USD$).

In [5]:
vgsales

Unnamed: 0,Year,Genre,Publisher,North America,Global
0,2014.0,Action,Rockstar Games,6.06,19.39
1,2015.0,Shooter,Activision,6.18,15.09
2,2018.0,Action-Adventure,Rockstar Games,5.26,13.94
3,2017.0,Shooter,Activision,4.67,13.40
4,2017.0,Sports,EA Sports,1.27,11.80
...,...,...,...,...,...
1029,,Role-Playing,,0.00,0.00
1030,2017.0,Racing,Tammeka Games,0.00,0.00
1031,,Action,,0.00,0.00
1032,,Action,,0.00,0.00


In [6]:
vgsales.isna().sum()

Year             209
Genre              0
Publisher        209
North America      0
Global             0
dtype: int64

We can see that two columns in our dataset, Year and Publisher, contain null values (209 each).
A closer look at the data shows that the rows with null values also have zeros in sales, so the best decision is to remove those rows.
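One way to check this claim before dropping anything is to look at the sales figures of only the rows that have a missing value. The sketch below uses a small made-up frame (the rows are illustrative, not taken from the dataset) with the same column names as `vgsales`; the same two lines could be run on the real frame.

```python
import pandas as pd

# Illustrative stand-in for vgsales; the values are made up for this example.
sample = pd.DataFrame({
    "Year": [2014.0, None, 2017.0, None],
    "Publisher": ["Rockstar Games", None, "Activision", None],
    "North America": [6.06, 0.0, 4.67, 0.0],
    "Global": [19.39, 0.0, 13.40, 0.0],
})

# Keep only the rows where at least one column is missing.
rows_with_nulls = sample[sample.isna().any(axis=1)]

# If the largest sales figure among those rows is 0,
# dropping them removes no sales information.
max_sales_in_null_rows = rows_with_nulls[["North America", "Global"]].max().max()
print(len(rows_with_nulls), max_sales_in_null_rows)
```

If the maximum were above zero on the real data, dropping the rows would discard genuine sales records and another strategy might be worth considering.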

In [7]:
vgsales=vgsales.dropna()
vgsales

Unnamed: 0,Year,Genre,Publisher,North America,Global
0,2014.0,Action,Rockstar Games,6.06,19.39
1,2015.0,Shooter,Activision,6.18,15.09
2,2018.0,Action-Adventure,Rockstar Games,5.26,13.94
3,2017.0,Shooter,Activision,4.67,13.40
4,2017.0,Sports,EA Sports,1.27,11.80
...,...,...,...,...,...
1025,2019.0,Action,THQ Nordic,0.00,0.00
1026,2017.0,Platform,THQ Nordic,0.00,0.00
1027,2017.0,Adventure,Daedalic Entertainment,0.00,0.00
1028,2018.0,Action,Bandai Namco Entertainment,0.00,0.00


We have now dropped all the rows with missing values


In [8]:
vgsales.isnull().sum()

Year             0
Genre            0
Publisher        0
North America    0
Global           0
dtype: int64

Next we will convert the columns containing non-numerical data to numerical data.
We will use one-hot encoding, which converts categorical features into binary columns holding 1 or 0.
These two numbers represent "True" and "False".
With pandas, one-hot encoding can be performed using the pd.get_dummies function.
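To see what one-hot encoding does on a small scale, here is a minimal sketch on a three-row frame. The rows are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Tiny illustrative frame; the rows are made up for this example.
toy = pd.DataFrame({
    "Genre": ["Action", "Shooter", "Action"],
    "Global": [19.39, 15.09, 13.94],
})

# Each distinct Genre value becomes its own 0/1 column.
# dtype=int asks for 1/0 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Genre"], dtype=int)
print(encoded.columns.tolist())
print(encoded["Genre_Action"].tolist())
```

Recent pandas releases return boolean columns by default, so passing `dtype=int` is one way to keep the 1/0 form described above.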


In [9]:
num_vgsales= pd.get_dummies(vgsales, columns=["Genre","Publisher"])
num_vgsales

Unnamed: 0,Year,North America,Global,Genre_Action,Genre_Action-Adventure,Genre_Adventure,Genre_Fighting,Genre_MMO,Genre_Misc,Genre_Music,...,Publisher_Ubisoft,Publisher_Unknown,Publisher_Versus Evil,Publisher_Wander MMO,Publisher_Warner Bros. Interactive,Publisher_Warner Bros. Interactive Entertainment,Publisher_Wired Productions,Publisher_Xseed Games,Publisher_Yacht Club Games,Publisher_Yeti
0,2014.0,6.06,19.39,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2015.0,6.18,15.09,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2018.0,5.26,13.94,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2017.0,4.67,13.40,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2017.0,1.27,11.80,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025,2019.0,0.00,0.00,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1026,2017.0,0.00,0.00,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1027,2017.0,0.00,0.00,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1028,2018.0,0.00,0.00,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next we need to remove the Global column from the features, because this column will act as our y (dependent) variable.

In [10]:
del num_vgsales['Global']

Finally, create the X and y arrays from the dataset using .values.

In [11]:
X = num_vgsales.values
y = vgsales['Global'].values

We can now split the dataset into training and test segments.
We will do a 70/30 split by calling the Scikit-learn function train_test_split with a test_size argument of 0.3.
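As a quick illustration of the 70/30 proportion, the sketch below splits 100 synthetic rows and prints the resulting shapes. The data and the `random_state=0` seed are assumptions added for this example; fixing the seed is optional and simply makes the shuffle repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows with 3 features each.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# random_state fixes the shuffle so the same rows land in each set on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Without a fixed seed, each run produces a different split, so the error figures reported later would vary slightly from run to run.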


In [12]:
# rows are shuffled using shuffle parameter
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, 
                                                   shuffle = True)

## Select the algorithm and configure its hyperparameters

In [13]:
# In this exercise we use the gradient boosting algorithm
model = ensemble.GradientBoostingRegressor(
    n_estimators=150,
    learning_rate=0.1,
    max_depth=4,
    min_samples_split=4,
    min_samples_leaf=4,
    max_features=0.5,
    loss='huber')

## n_estimators
represents how many decision trees to build. A high number of trees will generally improve accuracy but will also increase the model's processing time. 150 is our initial starting point.
## learning_rate
controls the rate at which additional decision trees influence the overall prediction.
This effectively shrinks the contribution of each tree by the set learning_rate.
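The shrinkage can be written as the standard gradient boosting update. The symbols are introduced here for illustration: $F_m$ is the ensemble's prediction after $m$ trees, $h_m$ is the $m$-th tree, and $\nu$ is the learning_rate.

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, \text{n\_estimators}
```

With $\nu = 0.1$, each new tree moves the prediction by only a tenth of its own output, which is why a small learning rate usually needs more trees to reach the same fit.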
## max_depth
defines the maximum number of layers (depth) for each decision tree.
If None is selected, the nodes expand until all the leaves are pure or until all leaves contain fewer than min_samples_split samples.
## min_samples_split
defines the minimum number of samples required to implement a new binary split.
For example, min_samples_split = 10 means there must be 10 samples in a node in order to create a new branch.
## min_samples_leaf
represents the minimum number of samples that must appear in each child node (leaf) before a new branch can be created.
This helps to mitigate the impact of outliers and anomalies in the form of a low number of samples found in one leaf as a result of a binary split.
For example, min_samples_leaf = 4 requires at least four samples within each leaf in order for a new branch to be created.
## max_features
is the total number of features presented to the model when determining the best split.
As mentioned in Chapter 13 of the source book, random forests and gradient boosting restrict the total number of features shown to each individual tree in order to create multiple classifiers that can be voted upon later.
If the value is an integer (whole number), the model will consider max_features features at each split (branch).
If the value is a float, then max_features is the fraction of total features randomly selected.
Although it sets a maximum, the number of features considered may exceed the set limit if no split is initially found.
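The integer and float cases can be sketched as a small helper. `features_per_split` is a hypothetical function written for this illustration, not part of Scikit-learn, and the feature count of 160 is an assumed figure rather than the real width of the encoded dataset. The rounding rule follows the one given in the Scikit-learn documentation for a float value.

```python
def features_per_split(max_features, n_features):
    """Hypothetical helper: how many features are considered at each split."""
    if isinstance(max_features, int):
        # An integer is used as the count directly.
        return max_features
    # A float is treated as a fraction of the total, with at least one feature.
    return max(1, int(max_features * n_features))

print(features_per_split(0.5, 160))  # float: half of the assumed 160 features
print(features_per_split(10, 160))   # integer: exactly 10 features
```

With `max_features = 0.5` as in this exercise, each split looks at half of the encoded columns.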
## loss
calculates the model's error rate.
For this exercise we are using huber, which protects against outliers and anomalies.
Alternative error rate options include ls (least squares regression), lad (least absolute deviations) and quantile (quantile regression). Recent Scikit-learn releases name the first two squared_error and absolute_error.
Huber is a combination of ls and lad.
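To make the "combination" concrete, here is a minimal sketch of the Huber loss for a single prediction. The function name and the fixed threshold `delta` are assumptions for illustration. In Scikit-learn the threshold is not passed directly but derived from the `alpha` parameter, a quantile of the residuals.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Illustrative Huber loss for one prediction (delta is an assumed threshold)."""
    residual = abs(y_true - y_pred)
    if residual <= delta:
        # Small errors are penalised quadratically, like least squares.
        return 0.5 * residual ** 2
    # Large errors are penalised linearly, like least absolute deviations.
    return delta * (residual - 0.5 * delta)

small_error = huber_loss(3.0, 2.5)   # residual 0.5, quadratic branch
large_error = huber_loss(10.0, 2.0)  # residual 8.0, linear branch
print(small_error, large_error)
```

A pure squared error would charge 32.0 for the second case (0.5 × 8²), while the Huber version charges 7.5, which is how it limits the influence of outliers.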

In [14]:
# to start model training
model.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='huber', max_depth=4,
                          max_features=0.5, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=4, min_samples_split=4,
                          min_weight_fraction_leaf=0.0, n_estimators=150,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

Lastly, we save the trained model as a file using the joblib.dump function.
This allows us to use the trained model again in the future to predict new values without needing to rebuild the model from scratch.

In [15]:
joblib.dump(model, 'videogames_trained_model.pkl')

['videogames_trained_model.pkl']
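Reloading a saved model is done with `joblib.load`. The sketch below shows the save and load round trip on a small synthetic model. The data, the temporary file name and the reduced `n_estimators` are assumptions chosen to keep the example fast and self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Tiny synthetic training set, only to have a fitted model to save.
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = X_demo.sum(axis=1)
demo_model = GradientBoostingRegressor(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# Save to a temporary folder, then load the file back as a new object.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(demo_model, path)
restored = joblib.load(path)

# The restored model gives the same predictions as the original.
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

In a later session, the saved file from this notebook could be loaded the same way and used to call `predict` on new rows that were encoded with the same columns.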

### Evaluating the results

In [16]:
# We use mean absolute error to evaluate the accuracy of the model
mae = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.4f" % mae)

Training Set Mean Absolute Error: 0.1180


Here we input our y values, which represent the correct results from the training data.
The model.predict function is then called on the X training set and generates a prediction.
The mean absolute error function then measures the average absolute difference between the model's predictions and the actual values.
The same process is repeated with the test data.
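The calculation behind mean absolute error is simple enough to do by hand. The sketch below uses three made-up sales figures and a hypothetical helper, `mean_abs_error`, that follows the same definition as Scikit-learn's function.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up sales figures in millions, for illustration only.
actual = [8.0, 2.0, 0.5]
predicted = [7.5, 2.5, 0.5]

# Absolute errors are 0.5, 0.5 and 0.0, so the mean is 1.0 / 3.
mae_demo = mean_abs_error(actual, predicted)
print(round(mae_demo, 4))
```

The result is in the same units as the target, so here it reads as an average miss of about 0.33 million.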


In [17]:
mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mae)

Test Set Mean Absolute Error: 0.1991


Now run the entire program by right-clicking and pressing Run,
or by choosing Cell > Run All from the Jupyter Notebook menu.

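A test MAE of 0.1991 means that, on average, the model's prediction of global sales misses the actual figure by about 0.1991 (in the dataset's units of millions). MAE is an absolute error, not a percentage, so it does not by itself show that the model is more than 80% accurate; it should be judged against the typical size of the Global values. The error is higher on the test set (0.1991) than on the training set (0.1180), which is expected because the model has not seen the test rows.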
 