# Machine Learning Challenge

Below are two data challenges that test your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a GitHub repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions. 
    - A README file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. I encourage you to try and showcase your talents. The more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/Tensorflow/Pytorch/Caffe models** in addition to your Linear/Tree-based models.

## 0. Import dependencies

In [163]:
import pandas as pd
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import xgboost
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB 
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, average_precision_score, precision_score, f1_score, recall_score, roc_auc_score

## Task 1
`predictive_maintenance_dataset.csv` is a file that contains parameters and settings (`operational_setting_1`, `operational_setting_2`, `sensor_measurement_1`, `sensor_measurement_2`, etc.) for many wind turbines. There is a column called `unit_number` which specifies which turbine it is, and one called `status`, in which a value of 1 means the turbine broke down that day, and 0 means it didn't. Your task is to create a model that, when fed with operational settings and sensor measurements (`unit_number` and `time_stamp` will *not* be fed in), outputs 1 if the turbine will break down within the next 40 days, and 0 if not.

**NOTE:** The model should output 1 if the turbine is anywhere between 40 and 0 days away from failure, not *only* 40 days from failure.

In [118]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the operational_setting_3 column looks like
df_X = pd.read_csv("predictive_maintenance_dataset.csv").drop(labels=['status', 'unit_number', 'time_stamp'], axis='columns')
df_X

Unnamed: 0,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,42.0007,0.8415,High,445.00,,1362.47,1143.17,3.91,5.70,142.53,...,133.75,2388.50,8129.92,9.1182,,332.0,2212.0,100.00,10.77,6.5717
1,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,14.62,21.61,554.60,...,522.19,2388.00,8135.70,8.3817,0.03,393.0,2388.0,100.00,39.07,23.3958
2,,0.6216,Low,462.54,536.71,1250.87,1037.52,7.05,9.00,174.56,...,163.11,2028.06,7867.90,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,42.0006,,High,,549.28,1349.42,1114.02,3.91,5.71,137.97,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.00,10.60,6.2614
4,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,14.62,21.61,551.30,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.00,38.39,23.0682
5,25.0046,0.6219,Low,462.54,536.72,,1047.79,7.05,9.03,175.36,...,164.97,2028.40,7880.19,10.8625,0.02,308.0,1915.0,84.93,14.38,8.6381
6,,0.6200,Low,462.54,536.79,1267.31,1045.78,7.05,9.03,174.81,...,165.05,2028.37,7881.95,10.9150,0.02,307.0,1915.0,84.93,14.18,8.5752
7,42.0053,0.8400,High,445.00,548.84,1348.71,1119.73,3.91,5.71,138.95,...,130.38,2387.86,8079.78,9.3526,0.02,329.0,2212.0,100.00,10.64,6.5382
8,0.0029,-0.0003,High,,642.48,1588.88,1393.88,14.62,21.61,,...,522.01,2388.06,,8.3743,0.03,392.0,2388.0,100.00,38.95,23.4351
9,10.0008,0.2504,High,489.05,604.49,1498.95,1309.51,10.52,15.49,394.85,...,371.56,2388.09,8128.11,,0.03,368.0,2319.0,100.00,28.48,17.2737


In [49]:
df = pd.read_csv("predictive_maintenance_dataset.csv").sort_values(by = ['unit_number', 'time_stamp'], ascending = True)
df

Unnamed: 0,unit_number,time_stamp,status,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
73382,2,2017-04-01 12:00:00,0,-0.0018,0.0006,High,518.67,641.89,1583.84,1391.28,...,522.33,2388.06,8137.72,8.3905,0.03,391.0,2388.0,100.00,38.94,23.4585
90923,2,2017-04-02 12:00:00,0,0.0043,-0.0003,High,518.67,641.82,1587.05,1393.13,...,522.70,2387.98,8131.09,8.4167,0.03,,2388.0,100.00,39.06,23.4085
82527,2,2017-04-03 12:00:00,0,0.0018,0.0003,High,518.67,641.55,1588.32,1398.96,...,522.58,2387.99,8140.58,8.3802,0.03,391.0,2388.0,100.00,39.11,23.4250
96521,2,2017-04-04 12:00:00,0,0.0035,-0.0004,High,518.67,641.68,1584.15,1396.08,...,522.49,2387.93,8140.44,8.4018,0.03,391.0,2388.0,100.00,39.13,23.5027
73137,2,2017-04-05 12:00:00,0,0.0005,0.0004,High,518.67,641.73,1579.03,1402.52,...,522.27,2387.94,8136.67,8.3867,0.03,390.0,2388.0,100.00,39.18,23.4234
6093,2,2017-04-06 12:00:00,0,-0.0010,0.0004,High,518.67,641.30,1577.50,1396.76,...,522.80,2387.99,8133.65,8.3800,0.03,392.0,2388.0,100.00,39.15,23.4270
91573,2,2017-04-07 12:00:00,0,0.0001,-0.0002,High,518.67,642.03,1587.49,1400.65,...,522.14,2388.04,8136.33,8.3941,0.03,391.0,2388.0,100.00,39.10,23.4718
77471,2,2017-04-08 12:00:00,0,0.0015,-0.0004,High,518.67,642.55,1590.41,,...,522.77,,,8.3861,0.03,391.0,2388.0,100.00,,23.4381
93541,2,2017-04-09 12:00:00,0,0.0017,-0.0004,High,518.67,641.98,1581.99,1395.01,...,522.40,2387.98,8145.29,8.3868,0.03,390.0,2388.0,100.00,39.06,23.4875
30788,2,2017-04-10 12:00:00,0,,0.0002,High,518.67,,1586.37,1394.86,...,521.99,2387.97,8138.64,8.3982,0.03,391.0,2388.0,100.00,,23.6005


### 1. Preprocess data

In [141]:
def preProcess(data):
    """
    Function to preprocess similar datasets whose rows are in chronological order:
    Takes in a dataframe, forward-fills missing values (then back-fills any NaNs left at the start)
    and replaces categorical value columns with dummy variables"""
    
    df = data.copy()        #Work on a copy so the caller's dataframe is not modified
    df = df.ffill()         #As the data is arranged chronologically, we fill each missing value with that of the previous hour/day
    df = df.bfill()         #In case some NaNs are at the start
    
    categorical_columns = df.select_dtypes(include=['object']).columns
    dummy_columns = pd.get_dummies(df[categorical_columns])
    
    df = pd.concat([df.drop(categorical_columns, axis=1), dummy_columns], axis=1)
    
    """
    for col in df.columns:
        
        t = type(df[df[col].notnull()][col].iat[0]) #type of first non-null value in the column (just incase first value is null)
        print(t)
        if t == str: #hopefully there's no edge case where a string column isn't categorical!
            df[col].fillna(df[col].mode()[0], inplace=True) #for categorical column we just fill with the most repeated value
            #if df[col].dtype.name == 'category':
            df = pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col])], axis=1) #quantifying categorical column
        
         #more data types to account for should be added here
    """
    #df.fillna(df.mean(axis=0), axis=0, inplace=True)
    
     
    return df
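
`pd.get_dummies` only creates columns for the categories present in the frame it is given, so held-out data with an unseen category or a new column would produce a different set of columns than the model was trained on. One possible guard, sketched below on toy data, is to remember the training columns and reindex to them at prediction time. The helper name `align_to_training_columns` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def align_to_training_columns(encoded, training_columns):
    """Keep only the columns seen in training, in the same order.

    Columns missing from `encoded` are added and filled with 0;
    columns that were not seen in training are dropped.
    """
    return encoded.reindex(columns=training_columns, fill_value=0)

# Toy training data (hypothetical values): the categorical column takes High and Low
train = pd.DataFrame({"sensor": [1.0, 2.0], "operational_setting_3": ["High", "Low"]})
train_encoded = pd.get_dummies(train)
training_columns = list(train_encoded.columns)

# Toy held-out data: an unseen category ("Medium") and a column never seen in training
held_out = pd.DataFrame({
    "sensor": [3.0, 4.0],
    "operational_setting_3": ["High", "Medium"],
    "brand_new_column": [0, 1],
})
held_out_encoded = align_to_training_columns(pd.get_dummies(held_out), training_columns)
print(held_out_encoded.columns.tolist())
```

The list of training columns would need to be saved alongside the model so that it is available when the model is loaded.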

### 2. Setting Labels

We have an interesting case here: we need to predict whether a turbine will fail within the next 40 days. In other words, given the operational settings and sensor measurements for one day, we want to classify whether that unit fails within a 40-day window.


So we just have to identify the date each turbine failed and mark the rows going back up to a maximum of 40 days before it as failures as well.

In [37]:
def setLabels(df, y_col, limit=40):
    """
    Function that takes in the dataframe, the target-variable column, and a number defining the window period for failure. 
    Returns the dataframe with y_col replaced by the new target column y_col + '_new'
    """
     
    y_col_new = y_col + '_new'
    df[y_col_new] = df[y_col]
    df[y_col_new] = df[y_col_new].replace(0, np.nan) #Let's replace all the 0s with NaNs and then we work backwards
    df[y_col_new] = df[y_col_new].bfill(limit=limit) # fill backward up to `limit` days. Thankfully the data is frequent and daily
    df[y_col_new] = df[y_col_new].fillna(0) #fill the rest with zeros (the number 0, not the string '0')
    df.drop(y_col, axis = 1, inplace=True)
    
    return df

## Gotta have this outside the function!

df = df.drop(['time_stamp', 'unit_number'], axis = 1)
status = df['status']
df = df.drop(['status'], axis = 1)
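
`setLabels` back-fills over the whole frame, so if one turbine's record ended without a failure, the fill from the next turbine could spill into it. A minimal sketch of a per-turbine variant follows. It assumes `unit_number` is still present when the labels are built and that rows are sorted by time within each unit; the function name `label_failure_window` and the toy frame are hypothetical.

```python
import pandas as pd

def label_failure_window(df, unit_col="unit_number", status_col="status", limit=40):
    """Return 1 for rows at, or up to `limit` rows before, a failure of the same unit; else 0."""
    failures_only = df[status_col].where(df[status_col] == 1)        # keep the 1s, turn the 0s into NaN
    filled = failures_only.groupby(df[unit_col]).bfill(limit=limit)  # back-fill within each unit only
    return filled.fillna(0).astype(int)

# Toy data: unit 1 never fails in the sample, unit 2 fails on its second row
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 1, 2, 2],
    "status":      [0, 0, 0, 0, 0, 1],
})
toy["status_new"] = label_failure_window(toy, limit=2)
print(toy)
```

With a plain frame-wide back-fill and `limit=2`, the last row of unit 1 would also be marked as failing; grouping keeps the window inside unit 2.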

In [73]:
def normalize(df):
    """
    Function that takes in a dataset with numerical values and standardizes it
    """
    standard_sc = scale.StandardScaler()
    x_std = standard_sc.fit_transform(df)
    df_scaled = pd.DataFrame(x_std)
    return df_scaled
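
Note that `normalize` fits a new `StandardScaler` every time it is called, so calling it again at prediction time scales the held-out data by its own mean and standard deviation rather than the training ones. A small sketch of the difference, using made-up numbers, is below; in practice the fitted scaler would have to be stored with the model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.array([[0.0], [2.0], [4.0]])    # toy training column, mean 2
held_out_features = np.array([[2.0], [6.0]])         # toy held-out column

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_features)

# Reuse the training statistics on held-out data (transform, not fit_transform)
held_out_scaled = scaler.transform(held_out_features)

# For comparison: refitting on the held-out data gives different values
refit_scaled = StandardScaler().fit_transform(held_out_features)

print(held_out_scaled.ravel(), refit_scaled.ravel())
```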

### 3. Modelling

In [91]:
def dataSplit(df_X, y, dtype, test_size=0.2):
    """Function to split the data into training and testing sets and convert the target variable to the required type"""
    xtrain, xtest, ytrain, ytest = train_test_split(df_X, y, test_size = test_size, random_state = 19)
    #xtrain, xval, ytrain, yval = train_test_split(df_X, y, test_size = valid_size, random_state = 19)
    
    ytrain, ytest = ytrain.astype(dtype), ytest.astype(dtype)
    return xtrain, xtest, ytrain, ytest
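
A random row-wise split puts days from the same turbine in both the training and test sets, and neighbouring days of one turbine tend to look alike, so the test score may be optimistic. One alternative, sketched here under the assumption that the `unit_number` values are kept aside before that column is dropped, is to split by turbine with scikit-learn's `GroupShuffleSplit`. The helper `split_by_unit` and the toy arrays are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_unit(X, y, groups, test_size=0.2, random_state=19):
    """Return train/test row positions such that no group appears on both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))
    return train_idx, test_idx

# Toy data: 5 turbines with 4 rows each
groups = np.repeat([1, 2, 3, 4, 5], 4)
X = np.arange(20).reshape(-1, 1)
y = np.zeros(20, dtype=int)

train_idx, test_idx = split_by_unit(X, y, groups)
print(sorted(set(groups[train_idx])), sorted(set(groups[test_idx])))
```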

In [105]:
def metrics(ytest, pred):
    """
    Function to print classification metrics for the true labels (ytest) against the predicted labels (pred)
    """
    print('accuracy score: ', accuracy_score(ytest, pred))
    print('Recall score: ', recall_score(ytest,pred))
    
    print('Precision Score: ',precision_score(ytest,pred))
    print('F1_score: ',f1_score(ytest, pred))
    print('roc_auc_score: ', roc_auc_score(ytest, pred))
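
One caveat on the last metric: `roc_auc_score` is computed here from hard 0/1 predictions, which collapses the ROC curve to a single threshold. When the model exposes probabilities (for example through `predict_proba`), passing those instead keeps the ranking information. The toy numbers below are made up to show that the two can differ.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.4, 0.45, 0.9]                    # toy predicted probabilities for class 1
hard_labels = [int(p >= 0.5) for p in probabilities]      # thresholded at 0.5 -> [0, 0, 0, 1]

auc_from_probabilities = roc_auc_score(y_true, probabilities)   # every positive outranks every negative
auc_from_labels = roc_auc_score(y_true, hard_labels)            # one positive is tied with the negatives

print(auc_from_probabilities, auc_from_labels)
```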

In [102]:
def score(training_model, xtrain, xtest, ytrain, ytest):
    """Fit the model on the training split, print metrics on the test split, and return the predictions"""
    model = training_model.fit(xtrain.values,ytrain.values)
    pred = model.predict(xtest.values)
    metrics(ytest, pred)    # true labels first, then predictions
    return pred

In [176]:
def logisticRegression(xtrain,xtest,ytrain,ytest):
    LR = LogisticRegression()    # binary target, so the default settings are enough
    pred = score(LR, xtrain, xtest, ytrain, ytest)
    return pred

def randomForestClassifier(xtrain,xtest,ytrain,ytest,n_estimators=25,min_samples_split=25,max_depth=5,random_state=72):
    RF = RandomForestClassifier(n_estimators=n_estimators, min_samples_split=min_samples_split, max_depth=max_depth, random_state=random_state)
    pred = score(RF, xtrain, xtest, ytrain, ytest)
    return pred

def gaussianNaiveBayes(xtrain,xtest,ytrain,ytest):
    GNB = GaussianNB()
    pred = score(GNB, xtrain, xtest, ytrain, ytest)
    return pred

def supportVectorMachine(xtrain,xtest,ytrain,ytest):
    svc = SVC(kernel='linear')
    pred = score(svc, xtrain, xtest, ytrain, ytest)
    return pred
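
The four wrapper functions above differ only in the estimator they build. A sketch of one way to remove the repetition is to keep the candidates in a dictionary and loop over them. `compare_models` and the synthetic data from `make_classification` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def compare_models(models, xtrain, xtest, ytrain, ytest):
    """Fit each model and return its F1 score on the test split, keyed by model name."""
    results = {}
    for name, model in models.items():
        model.fit(xtrain, ytrain)
        results[name] = f1_score(ytest, model.predict(xtest))
    return results

# Synthetic stand-in for the turbine features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=19)

candidate_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=25, max_depth=5, random_state=72),
    "gaussian_nb": GaussianNB(),
}
results = compare_models(candidate_models, xtrain, xtest, ytrain, ytest)
for name, f1 in results.items():
    print(name, round(f1, 3))
```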

In [5]:
def readData(dataset):
    return pd.read_csv(dataset)

In [64]:
def sortDrop(dataset, first_column, second_column):
    """Function that sorts by first and second column and drops those columns after"""
    return dataset.sort_values(by = [first_column, second_column]).drop([first_column,second_column],axis=1)

In [168]:
def xgbClassifier(xtrain,xval,ytrain,yval, max_depth=9, n_estimators=50, learning_rate=0.05, objective='binary:logistic'):
    # Pass the arguments through instead of repeating the default values
    xgb = xgboost.XGBClassifier(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate, objective=objective)
    model = xgb.fit(xtrain,ytrain)
    pred = model.predict(xval)
    metrics(yval, pred)
    
    #pkl.dump(model, open("xgb_model.pkl","wb")) 
    joblib.dump(model, "xgbmodel")
    print ("model saved")
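
The hyperparameters above are fixed by hand. A minimal sketch of searching over them with `GridSearchCV` is shown below. To keep the example runnable with scikit-learn alone it uses `GradientBoostingClassifier` as a stand-in for `xgboost.XGBClassifier`, and the synthetic data and the small grid are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the turbine features and labels
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# Hypothetical grid; real ranges would depend on the data
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid,
    scoring="f1",   # F1 rather than accuracy, since failures are the rarer class
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because `XGBClassifier` follows the scikit-learn estimator interface, it could be passed to `GridSearchCV` in the same way.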

In [169]:
def train(dataset):
    
    df = readData(dataset)
    
    df = sortDrop(df, 'unit_number', 'time_stamp')
    
    df = preProcess(df)
    
    df = setLabels(df, 'status', 40)
    
    df_y = df['status_new']
    df_X = df.drop('status_new', axis = 1)
    
    df_X = normalize(df_X)
    
    xtrain, xtest, ytrain, ytest = dataSplit(df_X, df_y, int, test_size=0.25)
    
    xgbClassifier(xtrain,xtest,ytrain,ytest)
    
    
    return df_X
    
    
    
    

In [170]:
a = train("predictive_maintenance_dataset.csv")


accuracy score:  0.944939114033
Recall score:  0.800825435647
Precision Score:  0.884816753927
F1_score:  0.840728556527
roc_auc_score:  0.888856921287
model saved


In [173]:
def Predict(dataset):
    """Preprocess the given dataframe, load the saved model, and return its predictions"""
    
    #df = readData(dataset)
    df = preProcess(dataset)
    
    df = normalize(df)
    
    #loaded_model = pkl.load(open("xgb_model.pkl", "rb"))
    loaded_model = joblib.load("xgbmodel")
    
    y_pred = loaded_model.predict(df)
    
    #metrics(ytest,y_pred)
    print(y_pred)
    return y_pred
    
    
    

In [174]:
Predict(df_X)

[1 0 0 ..., 0 1 1]
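
The challenge asks for a stored model whose `predict(X)` does the preprocessing itself and copes with new columns, unseen categories, and NA values. A rough sketch of one way to package that is below: a small wrapper class around a scikit-learn `Pipeline`. The class name `SelfContainedModel`, the choice of `LogisticRegression`, the imputation strategies, and the toy data are all assumptions for illustration, not the approach taken in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class SelfContainedModel:
    """Hypothetical wrapper: preprocessing and classifier behind a single predict(X)."""

    def fit(self, X, y):
        self.columns_ = list(X.columns)                        # columns seen in training
        numeric = list(X.select_dtypes(include="number").columns)
        categorical = [c for c in self.columns_ if c not in numeric]
        preprocess = ColumnTransformer([
            ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                              ("scale", StandardScaler())]), numeric),
            ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                              ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
        ])
        self.pipeline_ = Pipeline([("preprocess", preprocess),
                                   ("model", LogisticRegression())])
        self.pipeline_.fit(X, y)
        return self

    def predict(self, X):
        # Drop columns not seen in training; add missing ones as NaN for the imputers
        X = X.reindex(columns=self.columns_)
        return self.pipeline_.predict(X)


# Toy training data (made-up values)
train_X = pd.DataFrame({
    "operational_setting_1": [0.0, 42.0, 25.0, 10.0, 0.1, 41.9],
    "operational_setting_3": ["High", "Low", "High", "Low", "High", "Low"],
})
train_y = np.array([0, 1, 0, 1, 0, 1])
model = SelfContainedModel().fit(train_X, train_y)

# Store and reload the whole object, preprocessing included
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

# Held-out data with an NA value, an unseen category, and a new column
held_out = pd.DataFrame({
    "operational_setting_1": [np.nan, 30.0],
    "operational_setting_3": ["Medium", np.nan],
    "new_sensor": [1.0, 2.0],
})
predictions = loaded.predict(held_out)
print(predictions)
```

When a custom class like this is pickled, its definition has to be importable wherever the file is loaded, so it would live in a module inside the repository rather than only in a notebook cell.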


## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [299]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
pd.read_csv("forecasting_dataset.csv").drop(labels=['date', 'time', 'y'], axis='columns')

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
0,-200.0,-200.0,7.2,867.0,-200.0,834.0,-200.0,1314.0,891.0,14.8,57.3,0.9603,
1,0.5,-200.0,3.9,704.0,-200.0,861.0,-200.0,1603.0,860.0,24.4,65.0,1.9612,Low
2,3.7,-200.0,23.3,1386.0,,626.0,109.0,2138.0,,23.3,38.6,1.0919,High
3,2.1,-200.0,12.1,1052.0,183.0,779.0,,1690.0,952.0,28.5,27.3,1.0479,High
4,4.4,-200.0,21.7,1342.0,786.0,499.0,206.0,1546.0,2006.0,12.9,54.1,0.8003,High
5,-200.0,,7.0,859.0,-200.0,892.0,-200.0,,754.0,17.7,54.0,1.0826,Very low
6,-200.0,-200.0,,1004.0,123.0,818.0,119.0,,783.0,32.4,22.4,1.0701,Very low
7,2.3,-200.0,,1035.0,100.0,746.0,112.0,1611.0,1062.0,27.9,27.6,1.0252,High
8,1.0,-200.0,4.5,737.0,49.0,899.0,63.0,,611.0,25.1,55.7,,Moderate
9,1.9,-200.0,10.2,984.0,110.0,838.0,92.0,1897.0,976.0,26.6,47.6,1.6315,High


In [470]:
df = pd.read_csv("forecasting_dataset.csv").sort_values(by = ['date', 'time'], ascending = True)
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(by = ['date', 'time'], ascending = [True, False])
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
6874,1/1/2005,2018-09-28 23:00:00,1091,1.7,-200.0,,773.0,,820.0,115.0,1003.0,1232.0,5.6,59.7,0.5463,High
353,1/1/2005,2018-09-28 22:00:00,1118,2.1,-200.0,6.4,830.0,295.0,765.0,130.0,1058.0,1313.0,5.7,59.9,0.5523,
2244,1/1/2005,2018-09-28 21:00:00,1176,2.3,-200.0,8.1,,334.0,718.0,137.0,1104.0,1389.0,6.2,59.6,0.5698,High
6069,1/1/2005,2018-09-28 20:00:00,1198,2.5,-200.0,7.9,897.0,402.0,720.0,151.0,1072.0,1436.0,7.8,54.6,0.5786,High
3046,1/1/2005,2018-09-28 19:00:00,1328,3.6,-200.0,11.4,1029.0,622.0,637.0,172.0,1188.0,1611.0,8.1,54.1,0.5882,High
7296,1/1/2005,2018-09-28 18:00:00,1472,4.7,-200.0,16.6,1198.0,832.0,555.0,191.0,1344.0,1735.0,,51.8,0.5961,
3712,1/1/2005,2018-09-28 17:00:00,1281,3.0,-200.0,12.1,1053.0,510.0,659.0,165.0,1192.0,1438.0,10.9,39.7,0.5166,High
4670,1/1/2005,2018-09-28 16:00:00,1102,2.1,-200.0,7.7,885.0,313.0,772.0,139.0,1051.0,1142.0,12.8,32.6,,High
4195,1/1/2005,2018-09-28 15:00:00,1085,2.2,-200.0,7.9,896.0,299.0,760.0,147.0,1049.0,1138.0,12.5,32.3,0.4670,High
1524,1/1/2005,2018-09-28 14:00:00,1117,2.4,-200.0,8.9,934.0,357.0,721.0,153.0,1075.0,1206.0,10.9,35.9,0.4680,High


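The `time` column doesn't sort properly as a string: 1 am shows up after 7. Make sure to convert it to a datetime first!

Now we shift `y` by 6 rows to generate our labels. Because the rows within each date are sorted from the latest hour to the earliest, `shift(6)` gives each row the value of `y` from 6 hours later, provided no hours are missing from the data.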

In [471]:
df['y_6_hours_later'] = df.y.shift(6)
df = df.iloc[6:]
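
A row-based shift only lines up with "6 hours later" when every hour is present and the rows are in strict time order. A sketch of a timestamp-based alternative is below. It assumes a single `timestamp` column has already been built from `date` and `time` (the exact string formats in the CSV are not shown here, so that step is left out); `add_future_target` and the toy frame are hypothetical.

```python
import pandas as pd

def add_future_target(df, timestamp_col="timestamp", y_col="y", hours=6):
    """Attach the value of `y_col` recorded `hours` later, matched by timestamp."""
    y_by_time = df.set_index(timestamp_col)[y_col]              # lookup table: timestamp -> y
    future_time = df[timestamp_col] + pd.Timedelta(hours=hours)
    out = df.copy()
    out[f"{y_col}_{hours}_hours_later"] = future_time.map(y_by_time)  # NaN when that hour is absent
    return out

# Toy hourly data in which 07:00 is missing
hours_present = [0, 1, 2, 3, 4, 5, 6, 8, 9]
toy = pd.DataFrame({
    "timestamp": pd.Timestamp("2005-01-01") + pd.to_timedelta(hours_present, unit="h"),
    "y": [hour * 10 for hour in hours_present],
})

labelled = add_future_target(toy)
naive = toy["y"].shift(-6)   # row-based shift on ascending data, for comparison
print(labelled)
```

For the 01:00 row the timestamp lookup returns NaN because 07:00 is absent, whereas the row-based shift silently pairs it with the 08:00 value.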

In [472]:
df = df.dropna()
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level,y_6_hours_later
3712,1/1/2005,2018-09-28 17:00:00,1281,3.0,-200.0,12.1,1053.0,510.0,659.0,165.0,1192.0,1438.0,10.9,39.7,0.5166,High,1091.0
4195,1/1/2005,2018-09-28 15:00:00,1085,2.2,-200.0,7.9,896.0,299.0,760.0,147.0,1049.0,1138.0,12.5,32.3,0.4670,High,1176.0
1524,1/1/2005,2018-09-28 14:00:00,1117,2.4,-200.0,8.9,934.0,357.0,721.0,153.0,1075.0,1206.0,10.9,35.9,0.4680,High,1198.0
6211,1/1/2005,2018-09-28 11:00:00,1011,1.7,-200.0,5.4,782.0,225.0,846.0,113.0,992.0,1019.0,6.8,48.6,0.4822,High,1281.0
1909,1/1/2005,2018-09-28 10:00:00,973,1.2,-200.0,4.7,748.0,190.0,878.0,97.0,968.0,991.0,4.7,57.2,0.4932,High,1102.0
5968,1/1/2005,2018-09-28 08:00:00,915,1.1,-200.0,3.0,653.0,169.0,973.0,94.0,905.0,901.0,2.6,63.2,0.4721,High,1117.0
236,1/1/2005,2018-09-28 07:00:00,974,1.4,-200.0,4.5,736.0,168.0,888.0,97.0,945.0,966.0,3.0,60.7,0.4667,High,1149.0
7848,1/1/2005,2018-09-28 05:00:00,1004,1.4,-200.0,4.8,753.0,181.0,879.0,106.0,942.0,1036.0,4.2,57.1,0.4759,High,1011.0
6035,1/1/2005,2018-09-28 03:00:00,1163,2.7,-200.0,7.6,881.0,-200.0,748.0,-200.0,1001.0,1296.0,4.9,53.9,0.4693,High,939.0
2768,1/1/2005,2018-09-28 02:00:00,1173,2.5,-200.0,7.5,878.0,300.0,738.0,129.0,1002.0,1355.0,5.9,50.0,0.4689,High,915.0


In [448]:
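# Stale cell: the helper defined in Task 1 is named preProcess (there is no preProcessData),
# and the cells around this one handle the NaNs and the `level` column directly.
# df = preProcess(df)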

In [473]:
df = pd.concat([df.drop('level', axis=1), pd.get_dummies(df['level'])], axis=1)
                

In [474]:
df.isnull().sum()

date               0
time               0
y                  0
co_gt              0
nhmc               0
c6h6               0
s2                 0
nox                0
s3                 0
no2                0
s4                 0
s5                 0
t                  0
rh                 0
ah                 0
y_6_hours_later    0
High               0
Low                0
Moderate           0
Very High          0
Very low           0
dtype: int64

In [282]:
y = df_scaled.iloc[:, :1]
df_scaled 
y

Unnamed: 0,0
0,0.505041
1,0.493843
2,0.495088
3,0.497576
4,0.495088
5,0.496332
6,0.488866
7,0.482645
8,0.480157
9,0.481401


In [475]:
df_X = df.iloc[:, 3:-1]  # skip date, time, y (the first three columns) and the last dummy column ('Very low')

In [476]:
df_X = df_X.drop('y_6_hours_later', axis =1)

In [477]:
df_X

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,High,Low,Moderate,Very High
3712,3.0,-200.0,12.1,1053.0,510.0,659.0,165.0,1192.0,1438.0,10.9,39.7,0.5166,1,0,0,0
4195,2.2,-200.0,7.9,896.0,299.0,760.0,147.0,1049.0,1138.0,12.5,32.3,0.4670,1,0,0,0
1524,2.4,-200.0,8.9,934.0,357.0,721.0,153.0,1075.0,1206.0,10.9,35.9,0.4680,1,0,0,0
6211,1.7,-200.0,5.4,782.0,225.0,846.0,113.0,992.0,1019.0,6.8,48.6,0.4822,1,0,0,0
1909,1.2,-200.0,4.7,748.0,190.0,878.0,97.0,968.0,991.0,4.7,57.2,0.4932,1,0,0,0
5968,1.1,-200.0,3.0,653.0,169.0,973.0,94.0,905.0,901.0,2.6,63.2,0.4721,1,0,0,0
236,1.4,-200.0,4.5,736.0,168.0,888.0,97.0,945.0,966.0,3.0,60.7,0.4667,1,0,0,0
7848,1.4,-200.0,4.8,753.0,181.0,879.0,106.0,942.0,1036.0,4.2,57.1,0.4759,1,0,0,0
6035,2.7,-200.0,7.6,881.0,-200.0,748.0,-200.0,1001.0,1296.0,4.9,53.9,0.4693,1,0,0,0
2768,2.5,-200.0,7.5,878.0,300.0,738.0,129.0,1002.0,1355.0,5.9,50.0,0.4689,1,0,0,0


In [461]:
y = df['y_6_hours_later']
y

3712    1091.0
4195    1176.0
1524    1198.0
6211    1281.0
1909    1102.0
5968    1117.0
236     1149.0
7848    1011.0
6035     939.0
2768     915.0
3051    1001.0
2107    1004.0
3903    1054.0
4557    1163.0
5242    1275.0
6354    1046.0
3878    1361.0
2920    1366.0
6068    1372.0
5157    1410.0
2001    1451.0
3065    1335.0
1492    1413.0
4534     989.0
2281     960.0
7329    1052.0
6935    1178.0
3923    1433.0
2890    1438.0
5828    1508.0
         ...  
6674    1172.0
2631     958.0
5140    1008.0
828      991.0
8091    1104.0
2221    1311.0
2374    1228.0
4885     882.0
6806     856.0
4371     847.0
2622    1100.0
583     1226.0
5384    -200.0
3063    -200.0
8063    -200.0
4197    -200.0
8173    -200.0
5845    1297.0
2992    -200.0
2150    -200.0
1670    -200.0
2172    -200.0
5784    1081.0
1764    1089.0
3473    1329.0
4160    1425.0
4849    1315.0
1545    1166.0
2564    1148.0
4525     889.0
Name: y_6_hours_later, Length: 4321, dtype: float64

In [462]:
df_scaled = normalize(df_X)

In [463]:
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.518415,-0.302851,0.246461,0.447871,1.368847,-0.451535,0.892289,-0.479799,1.023294,0.015883,0.014936,0.191154,0.818703,-0.442673,-0.180269,-0.026358
1,0.508618,-0.302851,0.146393,-0.006820,0.561416,-0.142562,0.755062,-0.786932,0.368172,0.052314,-0.127821,0.189899,0.818703,-0.442673,-0.180269,-0.026358
2,0.511067,-0.302851,0.170219,0.103232,0.783364,-0.261868,0.800805,-0.731090,0.516666,0.015883,-0.058372,0.189924,0.818703,-0.442673,-0.180269,-0.026358
3,0.502496,-0.302851,0.086828,-0.336978,0.278242,0.120525,0.495857,-0.909355,0.108306,-0.077474,0.186629,0.190284,0.818703,-0.442673,-0.180269,-0.026358
4,0.496373,-0.302851,0.070150,-0.435447,0.144308,0.218418,0.373877,-0.960902,0.047161,-0.125291,0.352535,0.190562,0.818703,-0.442673,-0.180269,-0.026358
5,0.495149,-0.302851,0.029646,-0.710578,0.063948,0.509037,0.351006,-1.096212,-0.149375,-0.173108,0.468283,0.190028,0.818703,-0.442673,-0.180269,-0.026358
6,0.498822,-0.302851,0.065385,-0.470200,0.060121,0.249009,0.373877,-1.010301,-0.007432,-0.164000,0.420055,0.189892,0.818703,-0.442673,-0.180269,-0.026358
7,0.498822,-0.302851,0.072533,-0.420966,0.109868,0.221477,0.442491,-1.016744,0.145430,-0.136676,0.350606,0.190124,0.818703,-0.442673,-0.180269,-0.026358
8,0.514741,-0.302851,0.139245,-0.050262,-1.348099,-0.179271,-1.890363,-0.890025,0.713203,-0.120737,0.288873,0.189957,0.818703,-0.442673,-0.180269,-0.026358
9,0.512292,-0.302851,0.136863,-0.058951,0.565243,-0.209863,0.617836,-0.887877,0.842044,-0.097967,0.213637,0.189947,0.818703,-0.442673,-0.180269,-0.026358


In [226]:
# might want to fill the NaNs with a moving window average instead
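
The idea in the comment above could look roughly like the sketch below: fill each missing value with the mean of the non-missing values in a window around it. The function name `fill_with_rolling_mean`, the window size, and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_rolling_mean(series, window=3, center=False):
    """Fill NaNs with the mean of the non-missing values in a rolling window."""
    rolling_mean = series.rolling(window=window, min_periods=1, center=center).mean()
    return series.fillna(rolling_mean)

readings = pd.Series([1.0, np.nan, 3.0, 4.0])

trailing = fill_with_rolling_mean(readings)                # uses only the current and earlier rows
centered = fill_with_rolling_mean(readings, center=True)   # also uses the following row

print(trailing.tolist(), centered.tolist())
```

A trailing window avoids looking ahead in time, which matters for a forecasting task. If the `-200.0` entries visible in the data are a missing-value marker (an assumption; the notebook does not say), they would need to be converted first, for example with `series.replace(-200, np.nan)`.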


### Modelling

In [478]:
xtrain, xtest, ytrain, ytest = train_test_split(df_X, y, test_size = 0.2, random_state = 19 )
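
For a forecasting problem, a shuffled split lets the model train on rows that come after the ones it is tested on. A hedged alternative is a time-ordered split such as scikit-learn's `TimeSeriesSplit`, sketched here on toy arrays that are assumed to be sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data, assumed to be in chronological order (oldest row first)
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # Every training row precedes every test row in each fold
    print("train up to row", train_idx.max(), "| test rows", test_idx.min(), "to", test_idx.max())
```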

In [466]:
xtrain

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1009,0.530660,-0.302851,0.377504,0.954692,2.122704,0.141939,0.930408,0.495293,1.540841,0.127455,0.609110,0.209852,0.818703,-0.442673,-0.180269,-0.026358
452,-1.967358,-0.302851,0.132097,-0.076327,-1.348099,-0.115029,-1.890363,0.136613,-0.003065,0.141117,0.736434,0.214118,-1.221444,-0.442673,-0.180269,-0.026358
3538,0.502496,-0.302851,0.186897,0.184324,-0.027893,0.117466,0.617836,0.278367,-0.496590,0.550976,-0.378609,0.204206,0.818703,-0.442673,-0.180269,-0.026358
1273,0.517190,-0.302851,0.360825,0.896770,1.211953,-0.818633,0.572094,0.396495,1.117195,0.177549,0.367968,0.208193,0.818703,-0.442673,-0.180269,-0.026358
39,0.550252,-0.302851,0.613379,1.719269,2.157144,-1.158198,0.899913,1.004316,1.748297,0.065976,0.462496,0.201941,0.818703,-0.442673,-0.180269,-0.026358
363,0.538007,-0.302851,0.384651,0.974965,2.191584,-0.986886,1.098129,0.269776,1.687152,0.063699,0.443205,0.201420,0.818703,-0.442673,-0.180269,-0.026358
1816,0.519639,2.527905,0.291731,0.636119,0.132828,0.322429,0.564470,0.637046,0.584362,0.068253,0.053519,0.194073,0.818703,-0.442673,-0.180269,-0.026358
727,-1.967358,-0.302851,0.344147,0.833055,-1.348099,-0.754391,-1.890363,0.639194,1.143400,0.280014,0.321669,0.216008,-1.221444,-0.442673,-0.180269,-0.026358
1891,0.530660,-0.302851,0.420390,1.105291,0.370082,-0.301637,0.755062,1.158956,0.892270,0.134286,0.217495,0.201091,0.818703,-0.442673,-0.180269,-0.026358
3256,0.511067,-0.302851,0.267905,0.537651,0.148135,-0.160916,0.953279,0.499588,0.034059,0.635224,-0.432624,0.205478,0.818703,-0.442673,-0.180269,-0.026358


In [392]:
from sklearn.metrics import r2_score

# LogisticRegression is a classifier and was being called on the class rather than an instance;
# y is continuous here, so fit a LinearRegression instance instead
lin_reg = LinearRegression()
lin_reg.fit(xtrain, ytrain)
print("Train R2 score is:", round(r2_score(ytrain, lin_reg.predict(xtrain)), 3))
print("Train RMSE is: {} \n".format(round(np.sqrt(mean_squared_error(ytrain, lin_reg.predict(xtrain))), 3)))

In [479]:
def logisticRegression(xtrain,xtest, ytrain, ytest):
    LR = LogisticRegression()
    model = LR.fit(xtrain.values,ytrain.values)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest.values)
    print(score)
    metrics(ytest,pred)

In [480]:
def linearRegression(xtrain,xtest, ytrain, ytest):
    LR = LinearRegression()
    model = LR.fit(xtrain,ytrain)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest)
    print(model)
    print(score)
    print(pred)
    print(model.coef_)
    #print(xtrain, ytrain)
    metrics(ytest,pred)

In [481]:
def randomForestRegression(xtrain,xtest, ytrain, ytest):
    LR = RandomForestRegressor(max_depth = 50,random_state = 9, max_features = 'sqrt', n_estimators = 200,min_samples_split = 5,min_samples_leaf =2)
    model = LR.fit(xtrain,ytrain)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest)
    print(model)
    print(score)
    print(pred)
    print(mean_squared_error(ytest,pred))
    #metrics(ytest,pred)

In [482]:
def metrics(ytest, pred):
    """
    Function to print the RMSE of the predictions (pred) against the true values (ytest)
    """
    # print('accuracy score: ', accuracy_score(ytest, pred))
    print('RMSE:', np.sqrt(mean_squared_error(ytest,pred)))   # square root of the MSE
  #  print('Recall score: ', recall_score(ytest,pred))
    
  #  print('average_precision_score: ', average_precision_score(ytest,pred))
  #  print('Precision Score: ',precision_score(ytest,pred))
  #  print('F1_score: ',f1_score(ytest, pred))
  #  print('roc_auc_score: ', roc_auc_score(ytest, pred))
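
An error figure is easier to judge next to a trivial baseline. The sketch below computes RMSE as the square root of `mean_squared_error` and compares a model's predictions against a baseline that always predicts the training mean. The helper `rmse` and the toy numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 6.0])         # toy model predictions
model_rmse = rmse(y_true, y_pred)

# Baseline that ignores the features and always predicts the mean of y
features = np.zeros((3, 1))
baseline = DummyRegressor(strategy="mean").fit(features, y_true)
baseline_rmse = rmse(y_true, baseline.predict(features))

print(round(model_rmse, 3), round(baseline_rmse, 3))
```

In this toy case the model does worse than the mean baseline, which is the kind of result a bare error number would not reveal.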

In [483]:
len(ytrain)

3456

In [484]:
#logisticRegression(xtrain,xtest, ytrain, ytest)

In [485]:
linearRegression(xtrain,xtest, ytrain, ytest)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
0.259850412404
[ 1076.56611894  1312.51548513  1126.69516765  1087.47486933  1117.9740421
  1135.45451341  1052.16659451   989.12912746   949.46217848  1143.53076422
  1005.30327668  1016.17086992  1069.85610054   905.18853121  1143.27434806
  1134.93297522  1027.60064605  1282.23039496  1128.48864534  1146.2855783
  1110.63895191   997.00887389  1121.7967745   1041.66659109  1043.24134499
  1182.1231417   1153.10931891  1118.40303207  1138.51770465  1077.96136893
   989.88892395  1059.94025894  1058.55160528   950.11773905  1119.61830668
  1035.50863113  1093.61999696  1118.11830656  1104.92803513  1141.27625854
  1120.82344203  1080.27315719  1051.39456789  1107.71847783  1036.61132747
  1268.46611155  1292.06243288   996.4885906   1112.88866484  1190.92999655
  1045.27662235  1072.49227243  1084.3736004   1080.17681956   997.73491149
  1155.72417773  1119.62212726  1055.54990842  1230.71146975  1216.2675383

In [488]:
randomForestRegression(xtrain, xtest, ytrain, ytest)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=2, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=9, verbose=0, warm_start=False)
0.37339946306
[  926.52234524  1207.64236093  1160.87845437  1080.32447619  1159.32342965
  1096.15320454  1022.59174669   958.02604672   999.34352824  1205.45633333
  1060.03151629  1014.61126281  1097.4914991    857.32931962  1256.11052381
  1102.7359504    988.04146356  1243.92524008  1037.46265079  1175.2839329
  1119.54067983  1010.05164543  1115.01926858   997.14182937   908.37939399
  1105.7564807   1136.8439999   1098.14244553  1162.68916071  1068.97149008
  1002.29738312  1155.22038294  1051.7575119   1145.94134102  1181.71105556
  1013.36155447  1141.43545238  1161.55439033  1082.31539989  1342.06547619
