# Homework 2: Rossmann Kaggle: Forecasting Sales

# Part 1: Preprocessing

## Table of Contents 

* **HW-2: Rossmann Kaggle: Forecasting Sales**
  * Instructions
  * Learning Goals
  * Loading the DataFrame
  * Q1: Data Preprocessing and Understanding the Data **(10 marks)** (HW1_Part1)
  * Q2: Modelling without Entity Embeddings **(30 marks)** (HW1_Part2)
    * 2.1 Baseline model - Linear Regression  
    * 2.2 Random Forest 
    * 2.3 XGBoost 
    * 2.4 Multi Layer Perceptron 
  * Q3: Modelling MLP with Entity Embeddings **(10 marks)** (HW1_Part3)
  * Q4: Modelling other models with Entity Embeddings **(40 marks)** (HW1_Part4)
    * 4.1 Baseline model - Linear Regression  
    * 4.2 Random Forest 
    * 4.3 XGBoost
  * Q5: Final Comments **(10 marks)** (HW1_Part4)

## Instructions


- Running cells out of order is a common pitfall in notebooks. To make sure your code works, please restart the kernel and run the entire notebook again before you submit.

- We have tried to include all the libraries you may need to do the assignment in the imports statement at the top of this notebook.

- Comment your code well.

- Please use .head() when viewing data. Do not submit a notebook that is **excessively long**. 

- Your plots should include clear labels for the $x$ and $y$ axes as well as a descriptive title ("MSE plot" is not a descriptive title; "95 % confidence interval of coefficients of polynomial degree 5" is).

- **Make appropriate plots for every question where they are applicable, whether or not they are explicitly asked for.**

<hr style="height:2pt">

## Learning Goals

**We will look here into the practicalities of Trees, MLPs and Entity Embedding.**

The homework is divided into four main parts:
1. Data-preprocessing
2. Developing and evaluating different models without Entity Embeddings
3. Passing the entity embeddings from the Neural Network model to other models and evaluating those models
4. Compare the models

## Read this first!

The homework is divided into **4 notebooks**
1. Preprocessing and Storing Data
2. Modelling without Entity Embeddings
3. MultiLayer Perceptron with Entity Embeddings 
4. Modelling with Entity Embeddings and Comparing the results


This homework is based on the **paper attached in the data folder**.

Let's talk about the paper first:

Put very simply, the paper sets out to show how the accuracy of a model changes when Entity Embeddings are used.

Things to note:

1. We want our results to resemble the results shown in the paper, with one exception: **we will not be implementing KNNs. Instead, for our comparison, we will use Linear Regression.**
2. The paper states the parameters it uses to achieve these results, and we will use the same ones.
3. We will use MAPE as the metric, just like the paper does.
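
For reference, MAPE can be sketched as follows. This is a minimal sketch assuming the usual definition (the mean of the absolute errors divided by the true values); `mape` is a hypothetical helper name, and you should check the paper for the exact form it reports.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The ratio is undefined wherever a true value is zero (division by zero).
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Made-up numbers: both predictions are off by 10% of the true value.
example_error = mape([100.0, 200.0], [110.0, 180.0])
```

Because the true value appears in the denominator, records with zero sales cannot be scored with this metric, which is one reason to exclude them.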


#### So let's get started! Please note: this notebook covers only data preprocessing and saving the data files. Modelling without Entity Embeddings, the MLP with Entity Embeddings, and the other models with Entity Embeddings are covered in Part 2, Part 3, and Part 4, respectively.

**Why are we doing this?** 

Each of these steps requires a lot of RAM, which you may or may not have access to. We therefore split the work into four parts and load the output of each part in the next one. This also makes the work more modular.

Since we have not covered entity embeddings, the code for them is provided for you.

Learn from it! It follows on from the session we had on embeddings.

In [None]:
#importing libraries
import numpy as np
import scipy.stats
import scipy.special
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm
import pandas as pd
from sklearn.pipeline import make_pipeline, make_union, Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid
from keras.models import Sequential
from keras.models import Model as KerasModel
from keras.layers import Input, Dense, Activation, Reshape
from keras.layers import Concatenate
from keras.layers import Embedding
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model
import pickle
import csv
from datetime import datetime
from sklearn import preprocessing
from keras.callbacks import ModelCheckpoint
import xgboost as xgb
%matplotlib inline

## Q1. Data Pre-Processing and Saving the data

### 1.1 Loading and understanding the data

#### About the data

Most of the fields are self-explanatory. The following are descriptions for those that aren't. 

1. **Id** - an Id that represents a (Store, Date) pair within the test set
2. **Store** - a unique Id for each store
3. **Sales** - the turnover for any given day (this is what you are predicting)
4. **Customers** - the number of customers on a given day
5. **Open** - an indicator for whether the store was open: 
    * 0 = closed
    * 1 = open
6. **StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. 
    * a = public holiday 
    * b = Easter holiday
    * c = Christmas
    * 0 = None
7. **SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools
8. **StoreType** - differentiates between 4 different store models: a, b, c, d
9. **Assortment** - describes an assortment level: 
    * a = basic
    * b = extra
    * c = extended
10. **CompetitionDistance** - distance in meters to the nearest competitor store
11. **CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened
12. **Promo** - indicates whether a store is running a promo on that day
13. **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 
    * 0 = store is not participating
    * 1 = store is participating
14. **Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2
15. **PromoInterval** - describes the consecutive intervals at which Promo2 is restarted, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

**Note: since this data is large, we do not want to convert it into DataFrames. Instead, we will store it as a list of dictionaries and pass that to the models.**

**Also, we recommend using Google Colab for completing this homework.** 


In [None]:
# Store the path to each CSV file as a string
#your code here 
train_set = ("________")
stores_set = ("________")
store_states = ("________")
test_set = ("________")

We will now define functions:
1. To convert our csv files into dictionaries
2. To replace nan values

In [None]:
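def csv2dicts(csvfile):
    """Convert the rows of a csv.reader into a list of dictionaries keyed by the header row."""
    data = []
    keys = []
    for row_index, row in enumerate(csvfile):
        if row_index == 0:
            # The first row holds the column names
            keys = row
            print(row)
            continue
        data.append({key: value for key, value in zip(keys, row)})
    return data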

In [None]:
def set_nan_as_string(data, replace_str='0'):
    """Replace empty-string values with replace_str, modifying data in place."""
    for i, x in enumerate(data):
        for key, value in x.items():
            if value == '':
                x[key] = replace_str
        data[i] = x

Convert the train_set into a list of dictionaries using the csv2dicts function defined above. 

Further save this as a pickle file - call it **train_set.pickle**
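
The round trip from CSV rows to a pickled list of dictionaries can be tried out on toy data first. The sketch below uses in-memory buffers instead of real files, and its column values are made up; it mirrors what csv2dicts and set_nan_as_string do rather than replacing your solution.

```python
import csv
import io
import pickle

# Toy CSV text standing in for a real file (values are made up).
toy_csv = io.StringIO("Store,Sales,Open\n1,5263,1\n2,,1\n")

reader = csv.reader(toy_csv, delimiter=',')
header = next(reader)  # the first row holds the column names
rows = [dict(zip(header, row)) for row in reader]

# Replace empty strings with '0', as set_nan_as_string does.
for row in rows:
    for key, value in row.items():
        if value == '':
            row[key] = '0'

# Round trip through pickle using an in-memory buffer instead of a file.
buffer = io.BytesIO()
pickle.dump(rows, buffer, pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)
```

Note that every value stays a string after loading, including the numbers.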

In [None]:
# Convert train_set into a list of dictionaries using the csv2dicts function defined above.
# Save this as a pickle file - call it train_set.pickle
#your code here
with open(train_set) as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    with open('train_set.pickle', 'wb') as f:
        #your code here

If you look at store_states, it simply records which state each store is located in. Hence we will add this information to the stores_set itself.
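
One way to attach the state to each store record is sketched below on toy data. It assumes both files share a 'Store' column and that the state column is named 'State'; the variable names and values are hypothetical.

```python
# Toy records (made-up values) standing in for the two CSV files.
toy_stores = [{'Store': '1', 'StoreType': 'c'}, {'Store': '2', 'StoreType': 'a'}]
toy_states = [{'Store': '2', 'State': 'TH'}, {'Store': '1', 'State': 'HE'}]

# Look up each store's state by its Id rather than relying on row order.
state_by_store = {row['Store']: row['State'] for row in toy_states}
for store in toy_stores:
    store['State'] = state_by_store[store['Store']]
```

Matching on the Id keeps the merge correct even if the two files list the stores in a different order.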

In [None]:
# Let's do the same for stores_set and store_states - call this pickle store_set.pickle
#your code here

Next we want to store the length of the training data. Load the data back from the saved pickle files and assign the length of the train data to num_records.

In [None]:
with open('train_set.pickle', 'rb') as f:
    train_data = pickle.load(f)
    num_records = len(train_data)
with open('store_set.pickle', 'rb') as f:
    store_data = pickle.load(f)

If you have saved and loaded the files correctly then **train_data[1]** and **store_data[1]** should be as follows:

![Mape.jpeg](https://drive.google.com/uc?export=view&id=1D7IMgfjbRvWNuJV_v5nx5H7TfzGjP811)


Check that the column names are the same. If not, recheck the previous code.


In [None]:
#check the same
train_data[1], store_data[1]

### 1.2 Feature list

We will define a function to extract features from the data.

The function should return the following features:
* store index = this should come from train_set 'Store'
* year = this should come from train_set 'Date'
* month = this should come from train_set 'Date'
* day = this should come from train_set 'Date'
* day_of_week = this should come from train_set 'DayOfWeek'
* open = this should come from train_set 'Open'
    * if the value is available - save that 
    * else it should save 1
* promo = this should come from train_set 'Promo'
* store_state = this should come from the store data 'State' for that store
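
The date-related features can be pulled out of the 'Date' string with `datetime.strptime`. The sketch below assumes the dates are formatted as YYYY-MM-DD (check this against train_data[1]); the record shown is made up.

```python
from datetime import datetime

# Made-up example record; real records carry more fields.
record = {'Date': '2015-07-31', 'DayOfWeek': '5'}

dt = datetime.strptime(record['Date'], '%Y-%m-%d')
year, month, day = dt.year, dt.month, dt.day

# Values read from the CSV are strings, so convert them explicitly.
day_of_week = int(record['DayOfWeek'])
```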

In [None]:
def feature_list(record):
    #your code here

Now let's create two lists: train_data_X and train_data_y.

* Loop through train_data and check that 'Sales' is not equal to 0 and 'Open' is not equal to 0
* If so, store the features (from feature_list) in a variable named f1
* Append f1 to train_data_X
* Append the corresponding (non-zero) 'Sales' value to train_data_y
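
One pitfall when filtering: the CSV reader returns every value as a string, so a comparison against the integer 0 never filters anything out. The toy records below (made-up values) illustrate the difference.

```python
toy_records = [
    {'Sales': '5263', 'Open': '1'},
    {'Sales': '0', 'Open': '0'},
    {'Sales': '0', 'Open': '1'},
]

# Comparing against the string '0' removes the closed and zero-sales records.
kept = [r for r in toy_records if r['Sales'] != '0' and r['Open'] != '0']

# '0' != 0 is always True, so this version keeps every record by mistake.
kept_wrong = [r for r in toy_records if r['Sales'] != 0 and r['Open'] != 0]
```

Converting with `int(...)` before comparing is an equally valid alternative.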

In [None]:
train_data_X = []
train_data_y = []

for record in train_data:
  #your code here

In [None]:
#again check how your train_data_X looks
train_data_X[1]

The next step is to label-encode train_data_X. We do this using LabelEncoder from sklearn, fitting one encoder per feature column.

We will run this for the complete train_data_X.
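
Column-wise label encoding can be sketched on a toy array as follows. The values and the names `toy_X` and `encoders` are made up for illustration; this shows the general technique, not necessarily the exact solution expected here.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values); mixed features end up as strings.
toy_X = np.array([
    ['1', '2015', 'HE'],
    ['2', '2014', 'TH'],
    ['1', '2015', 'TH'],
])

encoders = []
for i in range(toy_X.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(toy_X[:, i])                      # learn the categories of column i
    encoders.append(le)                      # keep the encoder for later reuse
    toy_X[:, i] = le.transform(toy_X[:, i])  # replace values with integer codes

# The array still has a string dtype, so cast it to integers at the end.
toy_X = toy_X.astype(int)
```

Keeping the fitted encoders lets you apply exactly the same mapping to new data later.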

In [None]:
check_X = train_data_X
check_X = np.array(check_X)
train_data_X = np.array(train_data_X)
les = []
for i in range(train_data_X.shape[1]):
    #your code here

In [None]:
#again check how your train_data_X looks 
train_data_X[1]

We will dump the les list (defined in the previous step) into les.pickle.

In [None]:
with open('les.pickle', 'wb') as f:
    #your code here

Next, convert train_data_X to an integer datatype and store train_data_y as a NumPy array.

In [None]:
#your code here

Finally we will store our train_data_X, train_data_y in a pickle file - **feature_train_data.pickle**
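
Two objects can share one pickle file if they are dumped together as a tuple and unpacked on load. The sketch below assumes that layout and uses an in-memory buffer with made-up arrays; the later notebooks may expect a specific layout, so check how they load the file.

```python
import io
import pickle
import numpy as np

toy_X = np.array([[0, 1], [1, 0]])
toy_y = np.array([5263, 6064])

# Dump both arrays as a single tuple.
buffer = io.BytesIO()
pickle.dump((toy_X, toy_y), buffer, pickle.HIGHEST_PROTOCOL)

# Load the tuple back and unpack it in the same order.
buffer.seek(0)
loaded_X, loaded_y = pickle.load(buffer)
```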

In [None]:
with open('feature_train_data.pickle', 'wb') as f:
  #your code here

## You are done with Part 1 of the Homework!


Save all the pickle files locally in your system/drive!