*Practical Data Science 19/20*
# Programming Assignment 2 - Predicting Video Game Sales

In this programming assignment you will apply your new (or refreshed) machine learning knowledge. You will create a modeling pipeline that trains and evaluates a machine learning model built on several numeric as well as categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game from the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [None]:
import pandas as pd

In [None]:
data_path = 'https://github.com/pds2021/course/raw/main/assignments/Data/02/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data.head()

## Splitting the Dataset

Before you can get started training a machine learning model you will have to split the dataframe into features and the target variable (try to use as many features as possible):

In [None]:
game_sales_data.set_index('Name', inplace=True)

In [None]:
game_sales_data.columns

In [None]:
y = game_sales_data['Global_Sales']
X = game_sales_data.drop('Global_Sales', axis=1)
print(y.head())
print(X.head())

Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80\% of the data for training and 20\% for evaluation (take a look at the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to identify the relevant parameters):

In [None]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, 
                                                  train_size=0.8, 
                                                  random_state = 0)
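As a quick sanity check of the split proportions, the following minimal sketch applies `train_test_split` to a small made-up dataframe. The names `toy_X` and `toy_y` and their values are illustrative assumptions and not part of the assignment data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, so an 80/20 split should yield 8 and 2 rows.
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series(range(10), name="target")

toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=0
)

# Features and target are shuffled together, so their indices stay aligned.
print(toy_train_X.shape, toy_val_X.shape)
print(toy_train_X.index.equals(toy_train_y.index))
```

Fixing `random_state` makes the shuffle reproducible, which keeps later model comparisons on the same split.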

## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters). You can decide if you want to use the simple or the advanced imputation strategy (or just try both).

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
train_X.dtypes

In [None]:
# Numeric columns (any integer or float dtype) and categorical (string) columns
num_cols = train_X.select_dtypes(include='number').columns.tolist()
cat_cols = train_X.select_dtypes(include='object').columns.tolist()

In [None]:
num_imputer = SimpleImputer(strategy='mean')

train_X_num_imputed = pd.DataFrame(num_imputer.fit_transform(train_X[num_cols]), 
                                   columns=num_cols, index=train_X.index)
val_X_num_imputed = pd.DataFrame(num_imputer.transform(val_X[num_cols]), 
                                   columns=num_cols, index=val_X.index)

cat_imputer = SimpleImputer(strategy='most_frequent')

train_X_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(train_X[cat_cols]), 
                                   columns=cat_cols, index=train_X.index)
val_X_cat_imputed = pd.DataFrame(cat_imputer.transform(val_X[cat_cols]), 
                                   columns=cat_cols, index=val_X.index)
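To see what the two imputation strategies do, here is a minimal sketch on a three-row toy dataframe. The column names `score` and `genre` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "score": pd.Series([1.0, np.nan, 3.0]),
    "genre": pd.Series(["Action", np.nan, "Action"], dtype=object),
})

# Numerical column: replace NaN with the column mean, here (1 + 3) / 2.
toy_num_imputer = SimpleImputer(strategy="mean")
toy_num = pd.DataFrame(toy_num_imputer.fit_transform(toy[["score"]]),
                       columns=["score"], index=toy.index)

# Categorical column: replace NaN with the most frequent value.
toy_cat_imputer = SimpleImputer(strategy="most_frequent")
toy_cat = pd.DataFrame(toy_cat_imputer.fit_transform(toy[["genre"]]),
                       columns=["genre"], index=toy.index)

print(toy_num)
print(toy_cat)
```

Note that the imputers are fitted on the training data only and then applied to the validation data, so that no information from the validation set leaks into the imputed values.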

## Encoding categorical variables

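Prior to training your model you will have to encode the categorical variables. Inspect all categorical variables and use label (ordinal) encoding or one-hot encoding where appropriate. For feature columns, scikit-learn provides the ``OrdinalEncoder`` for label encoding; the ``LabelEncoder`` is intended for the target variable. The one-hot variant is the ``OneHotEncoder``. Remember that you have to combine the numerical, the label-encoded, and the one-hot-encoded dataframes at the end.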

In [None]:
for cat in cat_cols:
    print("{}: {}".format(cat, game_sales_data[cat].nunique()))

In [None]:
game_sales_data['Rating'].value_counts()

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

In [None]:
# Categories that only occur in the validation data are encoded as -1 instead of raising an error
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_X_cat_label = pd.DataFrame(ordinal_encoder.fit_transform(train_X_cat_imputed[['Platform', 'Genre']]),
                                 columns=['Platform', 'Genre'],
                                 index=train_X_cat_imputed.index)
val_X_cat_label = pd.DataFrame(ordinal_encoder.transform(val_X_cat_imputed[['Platform', 'Genre']]),
                                 columns=['Platform', 'Genre'],
                                 index=val_X_cat_imputed.index)


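# `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
train_X_cat_ohe = pd.DataFrame(ohe_encoder.fit_transform(train_X_cat_imputed[['Rating']]),
                                 columns=ohe_encoder.get_feature_names_out(['Rating']),
                                 index=train_X_cat_imputed.index)
# String column names keep the concatenated dataframe free of mixed (int and str) column labels
val_X_cat_ohe = pd.DataFrame(ohe_encoder.transform(val_X_cat_imputed[['Rating']]),
                                 columns=ohe_encoder.get_feature_names_out(['Rating']),
                                 index=val_X_cat_imputed.index)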

In [None]:
train_X = pd.concat([train_X_num_imputed, train_X_cat_label, train_X_cat_ohe], axis=1)
val_X = pd.concat([val_X_num_imputed, val_X_cat_label, val_X_cat_ohe], axis=1)

## Train the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample as well as the out-of-sample performance of both models using the mean absolute error.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

In [None]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest and a decision tree on the training data and
    return their mean absolute errors (forest, tree) on the validation data."""
    model_rf = RandomForestRegressor(n_estimators=100, random_state=1)
    model_rf.fit(X_train, y_train)
    preds_rf = model_rf.predict(X_valid)
    model_dt = DecisionTreeRegressor(random_state=1)
    model_dt.fit(X_train, y_train)
    preds_dt = model_dt.predict(X_valid)
    return mean_absolute_error(y_valid, preds_rf), mean_absolute_error(y_valid, preds_dt)

In [None]:
oos_rf, oos_dt = score_dataset(train_X, val_X, train_y, val_y)
is_rf, is_dt = score_dataset(train_X, train_X, train_y, train_y)

In [None]:
print('Out-of-sample\nRandom Forest: {}\nDecision Tree: {}'.format(oos_rf, oos_dt))
print('------------------------------')
print('In-sample\nRandom Forest: {}\nDecision Tree: {}'.format(is_rf, is_dt))
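Why the in-sample error is usually far smaller than the out-of-sample error can be seen in a minimal sketch on synthetic data. The data-generating process below (a linear signal plus Gaussian noise) is an assumption chosen purely for illustration and is unrelated to the video game dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: y = 2 * x + noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0.0, 1.0, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

# An unrestricted tree can memorize the training data, including its noise.
tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

mae_in = mean_absolute_error(y_tr, tree.predict(X_tr))
mae_out = mean_absolute_error(y_va, tree.predict(X_va))
print('In-sample MAE: {:.3f}'.format(mae_in))
print('Out-of-sample MAE: {:.3f}'.format(mae_out))
```

The gap between the two numbers is a sign of overfitting. Only the out-of-sample error indicates how the model is likely to perform on unseen games.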