# Machine Learning Challenge

Below are two data challenges that test your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a GitHub repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions. 
    - A readme file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. I encourage you to try and showcase your talents. The more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/TensorFlow/PyTorch/Caffe models** in addition to your Linear/Tree-based models.

## 0. Import dependencies

In [1]:
import pandas as pd
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from math import sqrt

import xgboost
from xgboost import XGBRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score, precision_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

from pprint import pprint



### Functions used in Task 1

In [2]:
def preProcess(data):
    """
    Function to preprocess similar, chronologically ordered datasets:
    Takes in a dataframe, fills null values with the previous row's value (and any leading
    nulls with the next row's value), and replaces categorical columns with dummy variables.
    """
    df = data.copy()     # Work on a copy so the caller's dataframe is not modified
    df = df.ffill()      # As the data is arranged chronologically, fill a missing value with that of the previous hour/day
    df = df.bfill()      # In case some NaNs are at the start

    categorical_columns = df.select_dtypes(include=['object']).columns
    dummy_columns = pd.get_dummies(df[categorical_columns])

    df = pd.concat([df.drop(columns=categorical_columns), dummy_columns], axis=1)

    return df

In [3]:
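def normalize(df):
    """
    Function that takes in a dataset with numerical values and standardizes it
    (zero mean and unit variance per column), keeping the column names and index
    """
    standard_sc = scale.StandardScaler()
    x_std = standard_sc.fit_transform(df)
    df_scaled = pd.DataFrame(x_std, columns=df.columns, index=df.index)
    return df_scaled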

## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [57]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
pd.read_csv("forecasting_dataset.csv").drop(labels=['date', 'time', 'y'], axis='columns').head()

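|   | co_gt | nhmc | c6h6 | s2 | nox | s3 | no2 | s4 | s5 | t | rh | ah | level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -200.0 | -200.0 | 7.2 | 867.0 | -200.0 | 834.0 | -200.0 | 1314.0 | 891.0 | 14.8 | 57.3 | 0.9603 | NaN |
| 1 | 0.5 | -200.0 | 3.9 | 704.0 | -200.0 | 861.0 | -200.0 | 1603.0 | 860.0 | 24.4 | 65.0 | 1.9612 | Low |
| 2 | 3.7 | -200.0 | 23.3 | 1386.0 | NaN | 626.0 | 109.0 | 2138.0 | NaN | 23.3 | 38.6 | 1.0919 | High |
| 3 | 2.1 | -200.0 | 12.1 | 1052.0 | 183.0 | 779.0 | NaN | 1690.0 | 952.0 | 28.5 | 27.3 | 1.0479 | High |
| 4 | 4.4 | -200.0 | 21.7 | 1342.0 | 786.0 | 499.0 | 206.0 | 1546.0 | 2006.0 | 12.9 | 54.1 | 0.8003 | High |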


### Importing Data

Leaving the date and time as separate columns makes chronological sorting awkward, so we parse them into a single `date_time` column, sort by it, and then drop it.

In [5]:
df = pd.read_csv("forecasting_dataset.csv", parse_dates=[['date','time']]).sort_values(by = ['date_time'])
df.drop('date_time', axis=1, inplace=True)

### Preprocessing

We'll use the same preprocess function we used in the first task to fill the NaNs with adjacent values and create dummy variable columns using native pandas functions. 

In [6]:
df = preProcess(df)
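
One caveat with `pd.get_dummies` is that the dummy columns depend on the categories present in the frame being encoded, so held-out data can produce a different column layout than the training data. A minimal sketch of one possible remedy, assuming the training column list is stored with the model (the toy frames and variable names below are hypothetical):

```python
import pandas as pd

# Made-up training and held-out frames; "Very High" never appears in training.
train = pd.DataFrame({"t": [14.8, 24.4, 23.3], "level": ["Low", "High", "Low"]})
held_out = pd.DataFrame({"t": [28.5, 12.9], "level": ["High", "Very High"]})

train_encoded = pd.get_dummies(train)
training_columns = train_encoded.columns  # would be saved alongside the model

# Encode the held-out frame, then force it onto the training layout:
# dummy columns unseen in training are dropped, missing ones are added as zeros.
held_out_encoded = pd.get_dummies(held_out).reindex(
    columns=training_columns, fill_value=0
)
```

After the `reindex`, the held-out frame has exactly the columns the model was trained on, in the same order.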

### Label Construction

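As in Task 1, we can convert the time series dataset into a supervised learning problem.
Given the feature values at one timestamp, we need to predict the value of `y` six hours later.

Because the rows are sorted chronologically, shifting `y` up by six rows aligns the `y` value from six hours later with the features of the present timestep (assuming one row per hour). The last six rows have no future value and are dropped.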

In [7]:
df['y_6_hours_later'] = df.y.shift(-6)
df.drop('y', axis=1, inplace=True)
df.dropna(inplace=True)
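
To make the effect of `shift(-6)` concrete, here is a small sketch on made-up values. It assumes, as the cell above does, that the rows are sorted by time with exactly one row per hour and no gaps.

```python
import pandas as pd

# Ten consecutive hourly readings; the values are made up for illustration.
toy = pd.DataFrame({"y": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]})

# shift(-6) moves every value six rows up, so row i receives y from row i + 6.
toy["y_6_hours_later"] = toy["y"].shift(-6)

# The last six rows have no value six hours ahead and become NaN; drop them.
toy = toy.dropna()
```

Only the first four rows survive, and each one pairs the reading at hour `i` with the reading at hour `i + 6`.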
    

Splitting the dataset into X and y 

In [8]:
df_X = df.iloc[:,:-1]
df_y = df.iloc[:,-1:]

In [9]:
df_y.head()

|   | y_6_hours_later |
|---|---|
| 3974 | 1185.0 |
| 6374 | 1136.0 |
| 883 | 1094.0 |
| 5937 | 1010.0 |
| 8292 | 1011.0 |


### Modelling

We first split the data into training (60%), validation (20%), and test (20%) sets. From here onwards we forget that the test set exists.

We use just the validation set to choose the model and tune its hyperparameters. Only at the end, once every decision is final, do we measure the test error.

In [10]:
xtrain, xtest, ytrain, ytest = train_test_split(df_X, df_y, test_size = 0.4, random_state = 19 )
xval, xtest, yval, ytest = train_test_split(xtest, ytest, test_size = 0.5)
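
The two calls above produce a 60/20/20 split. Note that the second call has no `random_state`, so which rows land in the validation and test sets changes between runs. A minimal sketch on placeholder arrays, with both seeds fixed (reusing the seed 19 for the second call is an assumption made here for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with a single feature.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: 60% for training, 40% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=19
)
# Second split: the held-back 40% is halved into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=19
)

print(len(X_train), len(X_val), len(X_test))
```

With both seeds fixed, rerunning the cell reproduces the same three subsets.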

In [11]:
xtrain.head()

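|   | co_gt | nhmc | c6h6 | s2 | nox | s3 | no2 | s4 | s5 | t | rh | ah | level_High | level_Low | level_Moderate | level_Very High | level_Very low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2825 | 1.3 | -200.0 | 6.4 | 829.0 | 82.0 | 1124.0 | 75.0 | 1333.0 | 621.0 | 26.1 | 21.1 | 0.7023 | 1 | 0 | 0 | 0 | 0 |
| 1102 | 0.7 | -200.0 | 3.9 | 707.0 | 63.0 | 1063.0 | 85.0 | 1325.0 | 417.0 | 40.2 | 17.7 | 1.2989 | 0 | 1 | 0 | 0 | 0 |
| 6433 | 0.9 | -200.0 | 4.1 | 718.0 | 106.0 | 981.0 | 71.0 | 1223.0 | 872.0 | 16.3 | 56.8 | 1.047 | 0 | 1 | 0 | 0 | 0 |
| 8038 | 1.3 | -200.0 | 4.4 | 731.0 | 272.0 | 815.0 | 149.0 | 936.0 | 1018.0 | 1.0 | 67.7 | 0.4523 | 1 | 0 | 0 | 0 | 0 |
| 5435 | -200.0 | -200.0 | 8.5 | 885.0 | -200.0 | 884.0 | -200.0 | 1588.0 | 974.0 | 11.0 | 66.1 | 1.0236 | 0 | 0 | 0 | 0 | 1 |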


### Model Selection

For simplicity we'll compare three models: linear regression, random forest, and XGBoost.

In [12]:
def linearRegression(xtrain,xtest, ytrain, ytest):
    LR = LinearRegression()
    model = LR.fit(xtrain,ytrain)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest)
    print("R^2 score: ", (score))
    print("Coefficients of the model: ", model.coef_)
    
    print('\nRMSE:', sqrt(mean_squared_error(ytest,pred)))


In [27]:
def randomForestRegressor(xtrain,xtest, ytrain, ytest):
    # SPOILER: the tuned parameters found by the random search below are
    # bootstrap=True, max_depth=70, min_samples_leaf=1, min_samples_split=2, n_estimators=1400
    RF = RandomForestRegressor()
    model = RF.fit(xtrain,ytrain)
    score = RF.score(xtest, ytest)
    pred = model.predict(xtest)
    print("R^2 score: ", (score))
    print("\nRMSE:", sqrt(mean_squared_error(ytest,pred)))


In [14]:
def XGBRegression(xtrain,xtest, ytrain, ytest):
    XGB = XGBRegressor()
    model = XGB.fit(xtrain,ytrain)
    score = XGB.score(xtest, ytest)
    pred = model.predict(xtest)
    print("R2 score",(score))
    #print(pred)
    print("\nRMSE:", sqrt(mean_squared_error(ytest,pred)))
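
The three functions above share the same fit, score, and predict steps, so they could be folded into one helper that accepts any estimator. The following is a sketch only: `evaluate` is a hypothetical name, and synthetic data from `make_regression` stands in for the pollution dataset. An `XGBRegressor` instance could be added to the dictionary in the same way.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(model, x_train, x_eval, y_train, y_eval):
    """Fit `model` on the training split and return (R^2, RMSE) on the evaluation split."""
    model.fit(x_train, y_train)
    predictions = model.predict(x_eval)
    r2 = model.score(x_eval, y_eval)
    rmse = sqrt(mean_squared_error(y_eval, predictions))
    return r2, rmse


# Synthetic regression data, used here only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=19)
x_train, x_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=19
)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=19),
}
results = {
    name: evaluate(model, x_train, x_eval, y_train, y_eval)
    for name, model in candidates.items()
}
for name, (r2, rmse) in results.items():
    print(f"{name}: R^2 = {r2:.3f}, RMSE = {rmse:.2f}")
```

Returning the metrics instead of printing them inside the helper also makes it easy to compare models programmatically.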

In [15]:
linearRegression(xtrain,xval, ytrain, yval)

R^2 score:  0.4374621144731929
Coefficients of the model:  [[ 3.83556912e-01  2.36019205e-01 -1.47677657e+00 -2.26211659e-01
   1.75088855e-01  5.57624178e-02 -3.40337098e-01 -4.52689167e-02
   3.29726856e-01  6.50345551e+00  2.26607814e+00 -3.88754555e+00
  -1.46289951e+01  4.88009047e+01  1.93546516e+00 -1.32297519e+02
   9.61901441e+01]]

RMSE: 250.45362050999765


In [16]:
XGBRegression(xtrain,xval,ytrain,yval)

R2 score 0.48476455945239566

RMSE: 239.69241445454526


In [17]:
randomForestRegressor(xtrain, xval, ytrain, yval)


R^2 score:  0.49462090364806693

RMSE: 237.38871150957598


### Hyperparameter tuning

Random forest gives the highest R^2 score and the lowest RMSE on the validation set, so we choose to tune its hyperparameters. A useful tool for this is scikit-learn's `RandomizedSearchCV`.

In [18]:
from pprint import pprint

rf = RandomForestRegressor()
print('Current Parameters: \n')
pprint(rf.get_params())  #pretty print

Current Parameters: 

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [19]:
#Method of selecting samples for training each tree
bootstrap = True, False

#Max levels in tree
max_depth = [int(x) for x in np.linspace(10,110, num=11)]
max_depth.append(None)

#Max number of features to consider at every split
#(defined here but not added to the search grid below)
max_features = ['auto', 'sqrt']

#Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,4]

#Minimum number of samples required to split a node
min_samples_split = [2,5,10]

# Number of Trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]

random_grid = {'bootstrap'        : bootstrap,
               'max_depth'        : max_depth,
               'min_samples_leaf' : min_samples_leaf,
               'min_samples_split': min_samples_split,
               'n_estimators':n_estimators}
print("Our search grid: \n")
pprint(random_grid)

Our search grid: 

{'bootstrap': (True, False),
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


Random Search Training

In [20]:
RFR = RandomForestRegressor()

#Random search using 3-fold cross-validation, searching across 200 parameter combinations

RFR_random = RandomizedSearchCV(estimator = RFR,
                                param_distributions = random_grid,
                                n_iter = 200, 
                                cv = 3,
                                verbose =2, 
                                random_state = 19,
                                n_jobs = -1)

RFR_random.fit(xtrain, ytrain)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 150 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 353 tasks      | elapsed: 24.8min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 43.5min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=200, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'bootstrap': (True, False), 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]},
          pre_dispatch='2*n_jobs', random_state=19, refit=True,
          return_train_score='warn', scoring=None, verbose=2)

In [23]:
RFR_random.best_params_

{'bootstrap': True,
 'max_depth': 70,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1400}

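I suspect this configuration might overfit the data: with `max_depth = 70` and `min_samples_leaf = 1`, the trees are allowed to grow almost without restriction.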
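
One simple way to probe this concern is to compare the R^2 score on the training data with the score on held-out data; a large gap suggests overfitting. A minimal sketch, assuming synthetic data in place of the pollution dataset and 100 trees instead of 1400 to keep it fast:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic data, used here only so the sketch runs on its own.
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=19)
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=19
)

# Deep, unrestricted trees, similar in spirit to the tuned configuration above.
forest = RandomForestRegressor(
    n_estimators=100, max_depth=70, min_samples_leaf=1, random_state=19
)
forest.fit(x_train, y_train)

train_r2 = forest.score(x_train, y_train)
val_r2 = forest.score(x_val, y_val)
gap = train_r2 - val_r2
print(f"train R^2 = {train_r2:.3f}, validation R^2 = {val_r2:.3f}, gap = {gap:.3f}")
```

If the gap is large, raising `min_samples_leaf` or lowering `max_depth` are the usual levers for reducing it.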

### Evaluation

We can finally evaluate on the test data!

In [29]:
randomForestRegressor(xtrain, xtest, ytrain, ytest)


R^2 score:  0.5940845519491698

RMSE: 224.28082656117505
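
The deliverables ask for the trained model to be stored in a file. A minimal sketch of saving and reloading a fitted estimator with `joblib`, assuming synthetic data and a hypothetical file name (the notebook's own model and path would be substituted):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and a small forest, used here only so the sketch runs on its own.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=19)
forest = RandomForestRegressor(n_estimators=20, random_state=19).fit(X, y)

# Hypothetical file name; a temporary directory keeps the example self-cleaning.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "task2_random_forest.pkl")
    joblib.dump(forest, model_path)
    restored = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == forest.predict(X)).all())
```

Note that a model saved this way only applies the preprocessing that is part of the stored object, so the preprocessing steps would need to be saved with it for `predict(X)` to accept raw data.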
