*Practical Data Science 20/21*
# Programming Assignment 3 - Predicting Video Game Sales with Deep Learning

In this programming assignment, you apply your new deep learning knowledge. In contrast to PA2, you don't have to worry about feature engineering: you will build an artificial neural network with multiple layers that learns features from the raw data.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Again, your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game from the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [1]:
import pandas as pd

In [2]:
data_path = 'https://github.com/pds2021/course/raw/main/assignments/Data/02/video_game_sales.csv'
df = pd.read_csv(data_path)
df

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating
0,Wii Sports,Wii,2006.0,Sports,82.53,76.0,51.0,8.0,322.0,E
1,Super Mario Bros.,NES,1985.0,Platform,40.24,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,35.52,82.0,73.0,8.3,709.0,E
3,Wii Sports Resort,Wii,2009.0,Sports,32.77,80.0,73.0,8.0,192.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,31.37,,,,,
...,...,...,...,...,...,...,...,...,...,...
16706,Samurai Warriors: Sanada Maru,PS3,2016.0,Action,0.01,,,,,
16707,LMA Manager 2007,X360,2006.0,Sports,0.01,,,,,
16708,Haitaka no Psychedelica,PSV,2016.0,Adventure,0.01,,,,,
16709,Spirits & Spells,GBA,2003.0,Platform,0.01,,,,,


## Creating Dataloaders

First, import everything we need for the tabular model:

In [None]:
!pip install -Uqq fastai  # upgrade fastai on colab
from fastai.tabular.all import *
from sklearn.model_selection import train_test_split

To create [TabularDataLoaders](https://docs.fast.ai/tabular.data.html#TabularDataLoaders), you need to specify which column is the dependent variable and which columns are categorical or continuous features.

In [None]:
# Write your code here
y_names = 'Global_Sales'
cat_names = ['Platform', 'Genre', 'Rating']
cont_names = ['Year_of_Release', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']

fast.ai contains classes that define [transformations](https://docs.fast.ai/tabular.transform.html) for preprocessing. Provide a list of appropriate preprocessing steps.

In [None]:
# Write your code here
procs = [FillMissing, Categorify, Normalize]
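
To make the three preprocessing steps less of a black box, here is a minimal sketch of the ideas behind them in plain Python. It is not fastai's implementation: the helper names `fill_missing`, `categorify` and `normalize` are hypothetical, and the details (median fill with a was-missing flag, code 0 reserved for missing categories, standardization with the population standard deviation) are assumptions about the defaults. The sample values are the `Critic_Score` and `Rating` entries of the first five rows shown above.

```python
from statistics import mean, median, pstdev

def fill_missing(values):
    """Replace None with the median of the observed values and flag the filled rows."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def categorify(values):
    """Map each category to an integer code; code 0 is reserved for missing values."""
    vocab = sorted({v for v in values if v is not None})
    codes = {cat: i + 1 for i, cat in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def normalize(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

critic_score = [76.0, None, 82.0, 80.0, None]  # first five rows of the table above
rating = ['E', None, 'E', 'E', None]

filled, was_missing = fill_missing(critic_score)
rating_codes, rating_vocab = categorify(rating)
scaled = normalize(filled)
print(filled, was_missing)
print(rating_codes, rating_vocab)
```

The was-missing flags are the reason the dataloaders end up with more categorical features than the three listed in `cat_names`.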

You also need to define the training and validation sets (using indices!).
- Use a train/validation split of 80/20

In [None]:
# Write your code here
train_idx, valid_idx = train_test_split(range(len(df)), test_size=0.2, random_state = 0)
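
The idea of splitting by row position rather than by copying data can be sketched without scikit-learn. The helper `split_indices` below is hypothetical and will not reproduce the exact shuffle of `train_test_split`; it only illustrates that the result is two disjoint index lists that together cover every row.

```python
import random

def split_indices(n_rows, valid_frac=0.2, seed=0):
    """Shuffle row positions reproducibly and split them into (train, valid) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_valid = int(round(n_rows * valid_frac))  # size of the validation part
    return indices[n_valid:], indices[:n_valid]

train_idx_demo, valid_idx_demo = split_indices(100)
print(len(train_idx_demo), len(valid_idx_demo))
```

Only the validation indices are later passed to the dataloaders; every row not listed there is treated as training data.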

Now you're ready to create the [TabularDataLoaders](https://docs.fast.ai/tabular.data.html#TabularDataLoaders) that you'll use for training.
- Use the Factory Method that creates the `dls` from a dataframe
- Set the batch size to 16

In [None]:
dls = TabularDataLoaders.from_df(df=df, y_names=y_names, valid_idx=valid_idx, 
                                 procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 bs=16)

How many numerical and categorical features are created by the dataloaders after preprocessing the data?
- print the names and the count of these features

In [None]:
# Write your code here
print('Categorical feature names:', dls.cat_names)
print('Number of categorical features:', len(dls.cat_names))
print('Continuous feature names:', dls.cont_names)
print('Number of continuous features:', len(dls.cont_names))

## Modeling

### Create Learner

Create an [appropriate learner](https://docs.fast.ai/tabular.learner.html) for the data. A learner creates a neural network for us.
- Use 100 nodes in the first and 50 nodes in the second layer
- Choose the [metrics](https://docs.fast.ai/metrics.html) *root mean squared error* and *mean absolute error*. You can pass a list of metrics to the learner.
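
As a reminder of what the two metrics measure, here is a small plain-Python sketch. The function names `mae_score` and `rmse_score` are hypothetical (chosen so they do not shadow the fastai metric objects), and the numbers are made up for illustration only.

```python
import math

def mae_score(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_score(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true_demo = [1.0, 2.0, 4.0]  # made-up targets
y_pred_demo = [2.0, 1.0, 2.0]  # made-up predictions, errors are -1, 1 and 2
print(mae_score(y_true_demo, y_pred_demo))   # 4/3
print(rmse_score(y_true_demo, y_pred_demo))  # sqrt(2)
```

Because sales are heavily skewed towards a few blockbuster titles, the two metrics can differ considerably on this dataset, which is one reason to report both.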

In [None]:
# Write your code here
# Limiting the output activation fixes unstable training and leads to better results
range_scale = 1.2
y_range = (float(df.iloc[train_idx].Global_Sales.min() * range_scale),
           float(df.iloc[train_idx].Global_Sales.max() * range_scale))

learn = tabular_learner(dls, layers=[100,50], metrics=[rmse, mae], y_range=y_range)
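
To see how a `y_range` can bound the predictions, the following sketch rescales a sigmoid to a given interval. This is, to my understanding, the mechanism fastai applies to the last layer when `y_range` is set, but treat that as an assumption. The helper name `scaled_sigmoid` and the bounds 0 and 100 are hypothetical.

```python
import math

def scaled_sigmoid(x, low, high):
    """Squash a raw network output x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

low_demo, high_demo = 0.0, 100.0  # hypothetical bounds, not the values computed above
for raw in (-10.0, 0.0, 10.0):
    print(raw, scaled_sigmoid(raw, low_demo, high_demo))
```

A sigmoid never reaches its bounds exactly, so predictions lie strictly inside the interval. Widening the upper bound by a factor such as `range_scale` leaves the largest training target reachable. Note that multiplying a positive minimum by 1.2 raises the lower bound slightly above the smallest target.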

How many embeddings are there in the model? Use `learn.dls.show_batch()` to explain why!

In [None]:
# Write your answer here
learn.model

In [None]:
learn.dls.show_batch()

In [None]:
# Write your answer here
# Eight embeddings: three for the categorical variables Platform, Genre and Rating,
# plus five for the `_na` indicator columns that FillMissing adds for the
# continuous variables containing missing values
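
The count in the answer above can be reproduced with a few lines of plain Python. The sketch also includes the heuristic that fastai uses, as far as I recall, to choose the width of each embedding from the number of categories. The names `embedding_width`, `cat_cols` and `cont_with_missing` are hypothetical, and the list of columns with missing values is taken from the answer above rather than recomputed.

```python
def embedding_width(n_cat):
    """Heuristic embedding width for a column with n_cat distinct codes (assumed fastai rule)."""
    return min(600, round(1.6 * n_cat ** 0.56))

cat_cols = ['Platform', 'Genre', 'Rating']
# Per the answer above, all five continuous columns contain missing values
cont_with_missing = ['Year_of_Release', 'Critic_Score', 'Critic_Count',
                     'User_Score', 'User_Count']

# Each original categorical column and each was-missing flag gets its own embedding
embedded_columns = cat_cols + [f'{name}_na' for name in cont_with_missing]
print(len(embedded_columns))  # 8
print(embedding_width(3))     # a flag column has three codes: missing marker, False, True
```

In the printed `learn.model`, each entry of the embedding list shows the number of codes and the width chosen for one of these columns.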

### Find the learning rate 
You need to find a suitable learning rate for the training.
- Read the [docs](https://fastai1.fast.ai/callbacks.one_cycle.html) on how to find the right learning rate
- Repeat this step until you get a meaningful plot ;)

In [None]:
# Write your code here
learn.lr_find()
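
The learning rate finder trains for a few mini-batches while increasing the learning rate exponentially and records the loss at each step. The sketch below only generates such a schedule. The helper name `lr_range_test_schedule` is hypothetical, and the defaults (from `1e-7` to `10` over 100 iterations) are my recollection of fastai's defaults, so verify them against your installed version.

```python
import math

def lr_range_test_schedule(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

lrs_demo = lr_range_test_schedule()
print(lrs_demo[0], lrs_demo[len(lrs_demo) // 2], lrs_demo[-1])
```

A common rule of thumb is to pick a rate from the region where the loss still falls steeply, roughly an order of magnitude below the point where it starts to diverge.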

### Fit the model

- How many epochs (cycles) are necessary to train the model? Is there a problem with overfitting?

In [None]:
# Write your answer and code here
# Training for more than about 3 epochs may increase the risk of overfitting;
# compare the training and validation loss per epoch to check.
# Trial and error!
learn.fit_one_cycle(10, lr_max=5e-3)
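
One simple way to answer the overfitting question is to look for the epoch after which the validation loss stops improving while the training loss keeps falling. The sketch below uses invented loss values purely for illustration; they are not results from this model, and the helper names are hypothetical. In fastai the per-epoch values should be available from the recorder (for example `learn.recorder.values`), which is worth verifying against your installed version.

```python
def best_epoch(valid_losses):
    """Return the 0-based epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=lambda i: valid_losses[i])

def generalization_gaps(train_losses, valid_losses):
    """Validation loss minus training loss per epoch; a growing gap hints at overfitting."""
    return [v - t for t, v in zip(train_losses, valid_losses)]

# Invented numbers for illustration only
train_losses_demo = [2.9, 2.1, 1.6, 1.2, 0.9]
valid_losses_demo = [2.8, 2.3, 2.2, 2.4, 2.7]

best_demo = best_epoch(valid_losses_demo)
gaps_demo = generalization_gaps(train_losses_demo, valid_losses_demo)
print(best_demo, gaps_demo)
```

With these made-up values the validation loss bottoms out at the third epoch (index 2) and the gap widens afterwards, which is the pattern to look for in the real training log.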

### Evaluate the model

Report the in-sample as well as the out-of-sample performance using the mean absolute error.

In [None]:
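# Write your code here
# learn.validate returns [loss, rmse, mae]; index 2 is the mean absolute error.
# ds_idx=0 selects the training set.
print("in-sample performance: {}".format(np.array(learn.validate(ds_idx=0)[2])))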

In [None]:
# Write your code here
# ds_idx=1 selects the validation set; index 2 is again the mean absolute error.
print("out-of-sample performance: {}".format(np.array(learn.validate(ds_idx=1)[2])))

## Discussion

Looking at the results, discuss the advantages and disadvantages of deep learning for tabular data.

In [None]:
# Write your answer here
# Some points you could have mentioned:

# Advantages
# - does not require feature engineering
# - thus, it can flexibly adapt to new challenges

# Disadvantages
# - often requires a large amount of data
# - requires high computing power
# - training is often time-consuming