<a href="https://colab.research.google.com/github/janinerottmann/Practical-Data-Science/blob/master/2.%20Predicting%20Video%20Game%20Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Practical Data Science 19/20*
# Programming Assignment 2 - Predicting Video Game Sales

In this programming assignment you will apply your new (or refreshed) machine learning knowledge. You will create a modeling pipeline that trains and evaluates a machine learning model built on several numeric and categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game from the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [0]:
import pandas as pd

In [0]:
data_path = 'https://raw.githubusercontent.com/pds1920/_a2-template/master/data/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data.head()

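| | Name | Platform | Year_of_Release | Genre | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | 40.24 | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 31.37 | NaN | NaN | NaN | NaN | NaN |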


## Splitting the Dataset

Before you can start training a machine learning model, you have to split the dataframe into the features and the target variable (try to use as many features as possible):

In [0]:
# target variable
y = game_sales_data['Global_Sales']

# feature variables
game_sales_features = ['Platform', 'Year_of_Release', 'Genre', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 'Rating']
X = game_sales_data[game_sales_features]
X.head()

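| | Platform | Year_of_Release | Genre | Critic_Score | Critic_Count | User_Score | User_Count | Rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Wii | 2006.0 | Sports | 76.0 | 51.0 | 8.0 | 322.0 | E |
| 1 | NES | 1985.0 | Platform | NaN | NaN | NaN | NaN | NaN |
| 2 | Wii | 2008.0 | Racing | 82.0 | 73.0 | 8.3 | 709.0 | E |
| 3 | Wii | 2009.0 | Sports | 80.0 | 73.0 | 8.0 | 192.0 | E |
| 4 | GB | 1996.0 | Role-Playing | NaN | NaN | NaN | NaN | NaN |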


In [0]:
X.describe()

| | Year_of_Release | Critic_Score | Critic_Count | User_Score | User_Count |
|---|---|---|---|---|---|
| count | 16442.0 | 8130.0 | 8130.0 | 10007.0 | 7585.0 |
| mean | 2006.486437 | 68.976015 | 26.358549 | 7.126238 | 162.277521 |
| std | 5.87973 | 13.935162 | 18.978236 | 1.30619 | 561.459579 |
| min | 1980.0 | 13.0 | 3.0 | 0.0 | 4.0 |
| 25% | 2003.0 | 60.0 | 12.0 | 6.8 | 10.0 |
| 50% | 2007.0 | 71.0 | 21.0 | 7.13 | 24.0 |
| 75% | 2010.0 | 79.0 | 36.0 | 8.0 | 81.0 |
| max | 2020.0 | 98.0 | 113.0 | 9.7 | 10665.0 |


In [0]:
y.describe()

count    16711.000000
mean         0.533713
std          1.548282
min          0.010000
25%          0.060000
50%          0.170000
75%          0.470000
max         82.530000
Name: Global_Sales, dtype: float64

Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80\% of the data for training and 20\% for evaluation (take a look at the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to identify the relevant parameters):

In [0]:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.20, random_state=2)
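
To see how the split parameters behave, the following minimal sketch applies `train_test_split` to a small synthetic array. The names `X_toy` and `y_toy` are hypothetical and not part of the assignment data. Setting `test_size=0.2` alone is enough, since the training share defaults to the complement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 50 rows, 2 features
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.arange(50)

# 20% of the rows go to the test set, the remaining 80% to the training set;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2
)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
```

Fixing `random_state` is what keeps the reported errors comparable between runs.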

## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters). You can decide if you want to use the simple or the advanced imputation strategy (or just try both).

In [0]:
# columns with missing values:
cols_with_missing = X_train.columns.values[X_train.isna().sum() > 0]
cols_with_missing

array(['Year_of_Release', 'Genre', 'Critic_Score', 'Critic_Count',
       'User_Score', 'User_Count', 'Rating'], dtype=object)

In [0]:
# identify categorical features
low_cardinality_cols_train = [cname for cname in X_train.columns if X_train[cname].dtype == "object"]
low_cardinality_cols_test = [cname for cname in X_test.columns if X_test[cname].dtype == "object"]
low_cardinality_cols_train

['Platform', 'Genre', 'Rating']

In [0]:
# identify numeric features
numerical_cols_train = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]
numerical_cols_test = [cname for cname in X_test.columns if X_test[cname].dtype in ['int64', 'float64']]
numerical_cols_train

['Year_of_Release', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']

In [0]:
# simple imputer
from sklearn.impute import SimpleImputer

# numeric imputation: mean by default
simple_imputer_num = SimpleImputer()
train_X_num = pd.DataFrame(simple_imputer_num.fit_transform(X_train[numerical_cols_train]), columns=numerical_cols_train, index=X_train.index)
test_X_num = pd.DataFrame(simple_imputer_num.transform(X_test[numerical_cols_test]), columns=numerical_cols_test, index=X_test.index)

# categorical imputation: most_frequent
simple_imputer_cat = SimpleImputer(strategy = 'most_frequent')
train_X_cat = pd.DataFrame(simple_imputer_cat.fit_transform(X_train[low_cardinality_cols_train]), columns=low_cardinality_cols_train, index=X_train.index)
test_X_cat = pd.DataFrame(simple_imputer_cat.transform(X_test[low_cardinality_cols_test]), columns=low_cardinality_cols_test, index=X_test.index)
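
The two imputers above can also be bundled into a single object. The sketch below is one possible alternative, not part of the assignment template: it uses scikit-learn's `ColumnTransformer` on a small toy dataframe (`toy_train`, `toy_test` and their values are hypothetical) so that the imputers are fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not taken from the assignment dataset)
toy_train = pd.DataFrame({
    "Critic_Score": [76.0, np.nan, 82.0, 80.0],
    "Genre": pd.Series(["Sports", "Platform", np.nan, "Sports"], dtype=object),
})
toy_test = pd.DataFrame({
    "Critic_Score": [np.nan, 60.0],
    "Genre": pd.Series([np.nan, "Racing"], dtype=object),
})

num_cols = ["Critic_Score"]
cat_cols = ["Genre"]

# Mean imputation for numeric columns, most frequent value for categorical ones
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# Fit on the training data only, then apply the learned values to the test data
train_imputed = pd.DataFrame(
    imputer.fit_transform(toy_train), columns=num_cols + cat_cols, index=toy_train.index
)
test_imputed = pd.DataFrame(
    imputer.transform(toy_test), columns=num_cols + cat_cols, index=toy_test.index
)

print(test_imputed)
```

The missing test values are filled with the training mean and the most frequent training category, so no information from the test set leaks into the imputation.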

## Encoding categorical variables

Prior to training your model you will have to encode the categorical variables. Inspect all categorical variables and use the ``LabelEncoder`` or the ``OneHotEncoder`` where appropriate. Remember that you have to combine the numerical as well as the label encoded and the one hot encoded dataframes at the end.

In [0]:
# label encoding
from sklearn.preprocessing import LabelEncoder

# copy to protect original data
label_X_train_cat = train_X_cat.copy()
label_X_test_cat = test_X_cat.copy()

label_encoder = LabelEncoder()
for col in train_X_cat:
    label_X_train_cat[col] = label_encoder.fit_transform(label_X_train_cat[col])
    label_X_test_cat[col] = label_encoder.transform(label_X_test_cat[col])
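
The cell above uses label encoding for all three categorical columns. As a hedged illustration of the other option named in the task, the sketch below one-hot encodes a toy column with `OneHotEncoder`. The dataframes `toy_train_cat` and `toy_test_cat` are hypothetical. With `handle_unknown="ignore"`, a category that appears only in the test data is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (hypothetical); "Puzzle" never appears in training
toy_train_cat = pd.DataFrame({"Genre": ["Sports", "Racing", "Sports"]}, dtype=object)
toy_test_cat = pd.DataFrame({"Genre": ["Racing", "Puzzle"]}, dtype=object)

# Unknown test categories are mapped to an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")

# The encoder returns a sparse matrix by default; convert it for inspection
train_encoded = encoder.fit_transform(toy_train_cat).toarray()
test_encoded = encoder.transform(toy_test_cat).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(test_encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category. That trade-off matters most for a column with many distinct values.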

In [0]:
# join numeric and categorical columns
X_train_joined = train_X_num.join(label_X_train_cat)
X_test_joined = test_X_num.join(label_X_test_cat)
X_train_joined.head()

| | Year_of_Release | Critic_Score | Critic_Count | User_Score | User_Count | Platform | Genre | Rating |
|---|---|---|---|---|---|---|---|---|
| 13348 | 2005.0 | 54.0 | 9.0 | 7.13 | 162.891477 | 7 | 4 | 1 |
| 15085 | 2009.0 | 69.012498 | 26.358895 | 7.122507 | 162.891477 | 28 | 1 | 1 |
| 12943 | 2002.0 | 69.012498 | 26.358895 | 7.122507 | 162.891477 | 29 | 4 | 1 |
| 16675 | 2006.0 | 69.012498 | 26.358895 | 7.122507 | 162.891477 | 19 | 5 | 1 |
| 8135 | 2011.0 | 69.012498 | 26.358895 | 7.13 | 162.891477 | 26 | 0 | 2 |


In [0]:
# check if imputation worked: list the columns that still contain missing values
missing_values_train = X_train_joined.columns.values[X_train_joined.isna().sum() > 0]
missing_values_test = X_test_joined.columns.values[X_test_joined.isna().sum() > 0]

print(missing_values_train)
print(missing_values_test)

[]
[]


## Train the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample and the out-of-sample performance of both models using the mean absolute error.

In [0]:
# decision tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Define
dt_model = DecisionTreeRegressor(random_state=1)

# Fit
dt_model.fit(X_train_joined, y_train)

# Prediction
preds = dt_model.predict(X_test_joined)

# Evaluate
score = mean_absolute_error(y_test, preds)
print('MAE: {}'.format(score))

MAE: 0.568390723559546


In [0]:
# random forest
from sklearn.ensemble import RandomForestRegressor

# Define
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=0)  # depth of 10 works best

# Fit
rf_model.fit(X_train_joined, y_train)

# Prediction
preds = rf_model.predict(X_test_joined)

# Evaluate
score = mean_absolute_error(y_test, preds)
print('MAE: {}'.format(score))

MAE: 0.45003181539666814
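
The two cells above report only the out-of-sample error. The following sketch shows one way to compute the in-sample and the out-of-sample mean absolute error side by side. It runs on synthetic data (`X_syn`, `y_syn` are hypothetical) so that it is self-contained, and it reuses the model settings from above. In the notebook, the same loop could be applied to `X_train_joined`, `y_train`, `X_test_joined` and `y_test`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical), with noise added to the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 4))
y_syn = X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2
)

models = {
    "decision tree": DecisionTreeRegressor(random_state=1),
    "random forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        # in-sample: error on the rows the model was trained on
        "train_mae": mean_absolute_error(y_tr, model.predict(X_tr)),
        # out-of-sample: error on rows the model has never seen
        "test_mae": mean_absolute_error(y_te, model.predict(X_te)),
    }
    print(name, results[name])
```

A fully grown decision tree can memorize the training rows, so its in-sample error is close to zero and says little about how it generalizes. The out-of-sample error is the figure to compare between the two models.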
