# Machine Learning Challenge

Below are 2 data challenges that test your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a GitHub repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions. 
    - A readme file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.
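
One way to make `predict(X)` robust to the edge cases listed above is to remember the training columns and force any new data into the same layout. The sketch below is an illustration only, not part of the challenge material: `ColumnAligner` and the toy column names are hypothetical, and it assumes that one-hot encoding via `pd.get_dummies` and mean imputation are acceptable choices.

```python
import numpy as np
import pandas as pd


class ColumnAligner:
    """Hypothetical helper: remember the encoded training columns and reuse them."""

    def fit(self, df):
        encoded = pd.get_dummies(df, dtype=float)
        self.columns_ = list(encoded.columns)
        # Column means of the training data, used later to fill NA values
        self.fill_values_ = encoded.mean()
        return self

    def transform(self, df):
        encoded = pd.get_dummies(df, dtype=float)
        # reindex drops columns that were not seen in training (new columns,
        # unseen category levels) and adds missing training columns as zeros
        aligned = encoded.reindex(columns=self.columns_, fill_value=0.0)
        return aligned.fillna(self.fill_values_)


# Made-up training data and held-out data with all three edge cases
train = pd.DataFrame({"sensor": [1.0, 3.0], "setting": ["High", "Low"]})
held_out = pd.DataFrame({
    "sensor": [np.nan, 5.0],        # NA value
    "setting": ["Medium", "Low"],   # "Medium" was never seen in training
    "extra": [9, 9],                # column that was not in the training data
})

aligner = ColumnAligner().fit(train)
out = aligner.transform(held_out)
print(out)
```

The unseen category ends up as a row of zeros in the known dummy columns, the extra column is ignored, and the missing sensor value is filled with the training mean.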

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. I encourage you to try and showcase your talents. The more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/TensorFlow/PyTorch/Caffe models** in addition to your Linear/Tree-based models.

## 0. Import dependencies

In [2]:
import pandas as pd
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases

import xgboost
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB 
from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score, precision_score, f1_score,recall_score, roc_auc_score



## Task 1
`predictive_maintenance_dataset.csv` is a file that contains parameters and settings (`operational_setting_1`, `operational_setting_2`, `sensor_measurement_1`, `sensor_measurement_2`, etc.) for many wind turbines. There is a column called `unit_number` which specifies which turbine it is, and one called `status`, in which a value of 1 means the turbine broke down that day, and 0 means it didn't. Your task is to create a model that, when fed with operational settings and sensor measurements (`unit_number` and `time_stamp` will *not* be fed in), outputs 1 if the turbine will break down within the next 40 days, and 0 if not.

**NOTE:** The model should output 1 if the turbine is anywhere between 40 and 0 days away from failure, not *only* 40 days from failure.

In [118]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the operational_setting_3 column looks like
df_X = pd.read_csv("predictive_maintenance_dataset.csv").drop(labels=['status', 'unit_number', 'time_stamp'], axis='columns')
df_X

Unnamed: 0,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,42.0007,0.8415,High,445.00,,1362.47,1143.17,3.91,5.70,142.53,...,133.75,2388.50,8129.92,9.1182,,332.0,2212.0,100.00,10.77,6.5717
1,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,14.62,21.61,554.60,...,522.19,2388.00,8135.70,8.3817,0.03,393.0,2388.0,100.00,39.07,23.3958
2,,0.6216,Low,462.54,536.71,1250.87,1037.52,7.05,9.00,174.56,...,163.11,2028.06,7867.90,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,42.0006,,High,,549.28,1349.42,1114.02,3.91,5.71,137.97,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.00,10.60,6.2614
4,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,14.62,21.61,551.30,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.00,38.39,23.0682
5,25.0046,0.6219,Low,462.54,536.72,,1047.79,7.05,9.03,175.36,...,164.97,2028.40,7880.19,10.8625,0.02,308.0,1915.0,84.93,14.38,8.6381
6,,0.6200,Low,462.54,536.79,1267.31,1045.78,7.05,9.03,174.81,...,165.05,2028.37,7881.95,10.9150,0.02,307.0,1915.0,84.93,14.18,8.5752
7,42.0053,0.8400,High,445.00,548.84,1348.71,1119.73,3.91,5.71,138.95,...,130.38,2387.86,8079.78,9.3526,0.02,329.0,2212.0,100.00,10.64,6.5382
8,0.0029,-0.0003,High,,642.48,1588.88,1393.88,14.62,21.61,,...,522.01,2388.06,,8.3743,0.03,392.0,2388.0,100.00,38.95,23.4351
9,10.0008,0.2504,High,489.05,604.49,1498.95,1309.51,10.52,15.49,394.85,...,371.56,2388.09,8128.11,,0.03,368.0,2319.0,100.00,28.48,17.2737


In [49]:
df = pd.read_csv("predictive_maintenance_dataset.csv").sort_values(by = ['unit_number', 'time_stamp'], ascending = True)
df

Unnamed: 0,unit_number,time_stamp,status,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
73382,2,2017-04-01 12:00:00,0,-0.0018,0.0006,High,518.67,641.89,1583.84,1391.28,...,522.33,2388.06,8137.72,8.3905,0.03,391.0,2388.0,100.00,38.94,23.4585
90923,2,2017-04-02 12:00:00,0,0.0043,-0.0003,High,518.67,641.82,1587.05,1393.13,...,522.70,2387.98,8131.09,8.4167,0.03,,2388.0,100.00,39.06,23.4085
82527,2,2017-04-03 12:00:00,0,0.0018,0.0003,High,518.67,641.55,1588.32,1398.96,...,522.58,2387.99,8140.58,8.3802,0.03,391.0,2388.0,100.00,39.11,23.4250
96521,2,2017-04-04 12:00:00,0,0.0035,-0.0004,High,518.67,641.68,1584.15,1396.08,...,522.49,2387.93,8140.44,8.4018,0.03,391.0,2388.0,100.00,39.13,23.5027
73137,2,2017-04-05 12:00:00,0,0.0005,0.0004,High,518.67,641.73,1579.03,1402.52,...,522.27,2387.94,8136.67,8.3867,0.03,390.0,2388.0,100.00,39.18,23.4234
6093,2,2017-04-06 12:00:00,0,-0.0010,0.0004,High,518.67,641.30,1577.50,1396.76,...,522.80,2387.99,8133.65,8.3800,0.03,392.0,2388.0,100.00,39.15,23.4270
91573,2,2017-04-07 12:00:00,0,0.0001,-0.0002,High,518.67,642.03,1587.49,1400.65,...,522.14,2388.04,8136.33,8.3941,0.03,391.0,2388.0,100.00,39.10,23.4718
77471,2,2017-04-08 12:00:00,0,0.0015,-0.0004,High,518.67,642.55,1590.41,,...,522.77,,,8.3861,0.03,391.0,2388.0,100.00,,23.4381
93541,2,2017-04-09 12:00:00,0,0.0017,-0.0004,High,518.67,641.98,1581.99,1395.01,...,522.40,2387.98,8145.29,8.3868,0.03,390.0,2388.0,100.00,39.06,23.4875
30788,2,2017-04-10 12:00:00,0,,0.0002,High,518.67,,1586.37,1394.86,...,521.99,2387.97,8138.64,8.3982,0.03,391.0,2388.0,100.00,,23.6005


### 1. Preprocess data

In [6]:
def preProcess(data):
    """
    Function to preprocess similar datasets:
    Takes in a dataframe, fills null values from the neighbouring rows (forward fill,
    then backward fill) and replaces categorical columns with dummy variables
    """

    df = data.copy()  # work on a copy so the caller's dataframe is not modified
    df = df.ffill()   # as the data is arranged chronologically, fill each missing value with that of the previous hour/day
    df = df.bfill()   # in case some NaNs are at the start

    categorical_columns = df.select_dtypes(include=['object']).columns
    dummy_columns = pd.get_dummies(df[categorical_columns])

    df = pd.concat([df.drop(categorical_columns, axis=1), dummy_columns], axis=1)

    return df

### 2. Setting Labels

The task asks whether a turbine will fail in 40 days or less. In other words, given the operational settings and sensor measurements for a day, we want to predict whether that unit is within 40 days of failure.

To build this target, we identify the day each turbine failed and mark the rows going back up to a maximum of 40 days before it as failures as well.
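
The labelling idea above can be sketched on a small made-up frame. This version back-fills within each unit, so one turbine's failure never labels the last rows of another turbine. The helper name `label_failure_window` and the toy data are assumptions for illustration, and it assumes one row per unit per day.

```python
import pandas as pd


def label_failure_window(df, unit_col="unit_number", time_col="time_stamp",
                         status_col="status", window=40):
    """Return a 0/1 Series: 1 on a failure row and on up to `window` rows before it."""
    ordered = df.sort_values([unit_col, time_col])
    # Keep the 1s, turn the 0s into NaN so they can be back-filled
    failures = ordered[status_col].where(ordered[status_col] == 1)
    # Back-fill at most `window` rows, separately for each unit
    filled = failures.groupby(ordered[unit_col]).bfill(limit=window)
    return filled.fillna(0).astype(int)


# Made-up data: unit 1 never fails, unit 2 fails on its fourth day
readings = pd.DataFrame({
    "unit_number": [1, 1, 2, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2017-04-01", "2017-04-02",
        "2017-04-01", "2017-04-02", "2017-04-03", "2017-04-04",
    ]),
    "status": [0, 0, 0, 0, 0, 1],
})

labels = label_failure_window(readings, window=2)
print(labels.tolist())
```

With `window=2`, only the failure row and the two days before it in unit 2 are marked; unit 1 stays at 0 even though its rows sit directly above unit 2 in the sorted frame.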

In [37]:
def setLabels(df, y_col, limit=40):
    """
    Function that takes in the dataframe, the target-variable column, and a number defining the window period for failure.
    Returns a dataframe with the required target variable.
    Assumes the rows are sorted chronologically, with one row per day.
    """

    y_col_new = y_col + '_new'
    df[y_col_new] = df[y_col].replace(0, np.nan)      # replace all the 0s with NaNs, then work backwards
    df[y_col_new] = df[y_col_new].bfill(limit=limit)  # fill backward up to `limit` days before each failure
    df[y_col_new] = df[y_col_new].fillna(0)           # fill the rest with zeros
    df = df.drop(y_col, axis=1)

    return df

## Gotta have this outside the function!

df = df.drop(['time_stamp', 'unit_number'], axis = 1)
status = df['status']
df = df.drop(['status'], axis = 1)

In [19]:
def normalize(df):
    """
    Function that takes in a dataset with numerical values and standardizes it
    """
    standard_sc = scale.StandardScaler()
    x_std = standard_sc.fit_transform(df)
    df_scaled = pd.DataFrame(x_std)
    return df_scaled
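
`normalize` fits a fresh `StandardScaler` every time it is called, so data scaled at prediction time would use different statistics from the training data. A possible alternative, sketched here on made-up data, is to bundle imputation, scaling and the model into one scikit-learn `Pipeline`, so the statistics learned in `fit` are reused by `predict`. The column names, the median imputation and the `LogisticRegression` stand-in are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is fitted on the training data only and reused at predict time
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Made-up training data with a few missing values
X_train = pd.DataFrame({
    "sensor_a": [1.0, 2.0, np.nan, 10.0, 11.0, 12.0],
    "sensor_b": [0.1, 0.2, 0.3, 0.9, np.nan, 1.1],
})
y_train = [0, 0, 0, 1, 1, 1]
pipeline.fit(X_train, y_train)

# New data goes straight into predict(); NA values are handled inside the pipeline
X_new = pd.DataFrame({"sensor_a": [np.nan, 11.5], "sensor_b": [0.2, 1.0]})
preds = pipeline.predict(X_new)
print(preds)
```

The fitted `pipeline` object can then be saved with `joblib.dump` and loaded as a single object whose `predict(X)` already includes the preprocessing.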

### Modelling

In [91]:
def dataSplit(df_X, y, dtype, test_size=0.2):
    """Function to split the data into training and testing sets and convert the target variable to the required type"""
    xtrain, xtest, ytrain, ytest = train_test_split(df_X, y, test_size = test_size, random_state = 19)

    ytrain, ytest = ytrain.astype(dtype), ytest.astype(dtype)
    return xtrain, xtest, ytrain, ytest

In [105]:
def metrics(ytest, pred):
    """
    Function to print classification metrics for a model's predictions against the true labels
    """
    print('accuracy score: ', accuracy_score(ytest, pred))
    print('Recall score: ', recall_score(ytest,pred))
    
    print('Precision Score: ',precision_score(ytest,pred))
    print('F1_score: ',f1_score(ytest, pred))
    print('roc_auc_score: ', roc_auc_score(ytest, pred))

In [102]:
def score(training_model, xtrain, xtest, ytrain, ytest):
    """Fit the model on the training split, print metrics on the test split and return the predictions"""
    model = training_model.fit(xtrain.values,ytrain.values)
    pred = model.predict(xtest.values)
    metrics(ytest,pred)  # true labels first, predictions second
    return pred

In [176]:
def logisticRegression(xtrain,xtest,ytrain,ytest):
    LR = LogisticRegression()
    pred = score(LR, xtrain, xtest, ytrain, ytest)
    return pred

def randomForestClassifier(xtrain,xtest,ytrain,ytest,n_estimators=25,min_samples_split=25,max_depth=5,random_state=72):
    RF = RandomForestClassifier(n_estimators=n_estimators, min_samples_split=min_samples_split, max_depth=max_depth, random_state=random_state)
    pred = score(RF, xtrain, xtest, ytrain, ytest)
    return pred

def gaussianNaiveBayes(xtrain,xtest,ytrain,ytest):
    GNB = GaussianNB()
    pred = score(GNB, xtrain, xtest, ytrain, ytest)
    return pred

def supportVectorMachine(xtrain,xtest,ytrain,ytest):
    svc = SVC(kernel='linear')
    pred = score(svc, xtrain, xtest, ytrain, ytest)
    return pred

In [5]:
def readData(dataset):
    return pd.read_csv(dataset)

In [64]:
def sortDrop(dataset, first_column, second_column):
    """Function that sorts by first and second column and drops those columns after"""
    return dataset.sort_values(by = [first_column, second_column]).drop([first_column,second_column],axis=1)

In [168]:
def xgbClassifier(xtrain,xval,ytrain,yval, max_depth=9, n_estimators=50, learning_rate=0.05, objective='binary:logistic'):
    # Pass the arguments through instead of hard-coding the hyperparameters
    xgb = xgboost.XGBClassifier(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate, objective=objective)
    xgb.fit(xtrain,ytrain)
    pred = xgb.predict(xval)
    metrics(yval, pred)

    joblib.dump(xgb, "xgbmodel")
    print ("model saved")

In [169]:
def train(dataset):
    
    df = readData(dataset)
    
    df = sortDrop(df, 'unit_number', 'time_stamp')
    
    df = preProcess(df)
    
    df = setLabels(df, 'status', 40)
    
    df_y = df['status_new']
    df_X = df.drop('status_new', axis = 1)
    
    df_X = normalize(df_X)
    
    xtrain, xtest, ytrain, ytest = dataSplit(df_X, df_y, int, test_size=0.25)
    
    xgbClassifier(xtrain,xtest,ytrain,ytest)
    
    
    return df_X
    
    
    
    

In [170]:
a = train("predictive_maintenance_dataset.csv")


accuracy score:  0.944939114033
Recall score:  0.800825435647
Precision Score:  0.884816753927
F1_score:  0.840728556527
roc_auc_score:  0.888856921287
model saved


In [173]:
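def Predict(dataset):
    """Preprocess the raw features, load the saved model and print its predictions"""
    df = preProcess(dataset)

    # Note: normalize() fits a new scaler on the data passed in, so the scaling here
    # is based on this dataset rather than on the statistics seen during training
    df = normalize(df)

    loaded_model = joblib.load("xgbmodel")

    y_pred = loaded_model.predict(df)

    print(y_pred)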
    
    
    

In [174]:
Predict(df_X)

[1 0 0 ..., 0 1 1]


## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [40]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
pd.read_csv("forecasting_dataset.csv").drop(labels=['date', 'time', 'y'], axis='columns')

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
0,-200.0,-200.0,7.2,867.0,-200.0,834.0,-200.0,1314.0,891.0,14.8,57.3,0.9603,
1,0.5,-200.0,3.9,704.0,-200.0,861.0,-200.0,1603.0,860.0,24.4,65.0,1.9612,Low
2,3.7,-200.0,23.3,1386.0,,626.0,109.0,2138.0,,23.3,38.6,1.0919,High
3,2.1,-200.0,12.1,1052.0,183.0,779.0,,1690.0,952.0,28.5,27.3,1.0479,High
4,4.4,-200.0,21.7,1342.0,786.0,499.0,206.0,1546.0,2006.0,12.9,54.1,0.8003,High
5,-200.0,,7.0,859.0,-200.0,892.0,-200.0,,754.0,17.7,54.0,1.0826,Very low
6,-200.0,-200.0,,1004.0,123.0,818.0,119.0,,783.0,32.4,22.4,1.0701,Very low
7,2.3,-200.0,,1035.0,100.0,746.0,112.0,1611.0,1062.0,27.9,27.6,1.0252,High
8,1.0,-200.0,4.5,737.0,49.0,899.0,63.0,,611.0,25.1,55.7,,Moderate
9,1.9,-200.0,10.2,984.0,110.0,838.0,92.0,1897.0,976.0,26.6,47.6,1.6315,High


In [79]:
df = pd.read_csv("forecasting_dataset.csv").sort_values(by = ['date', 'time'], ascending = True)
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(by = ['date', 'time'], ascending = [False, True])
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
4525,9/9/2004,2018-09-29 00:00:00,999,1.4,-200.0,7.5,879.0,180.0,1114.0,119.0,1261.0,960.0,24.9,28.3,0.8775,High
1925,9/9/2004,2018-09-29 01:00:00,980,1.2,-200.0,7.4,873.0,,1088.0,98.0,1258.0,949.0,23.2,30.5,0.8558,High
5965,9/9/2004,2018-09-29 02:00:00,859,0.6,-200.0,3.8,698.0,105.0,,89.0,1157.0,819.0,22.3,33.6,0.8959,Low
1820,9/9/2004,2018-09-29 03:00:00,814,,-200.0,2.3,606.0,-200.0,1370.0,-200.0,1131.0,,23.0,33.0,0.9184,Low
5156,9/9/2004,2018-09-29 04:00:00,806,0.2,-200.0,2.3,605.0,74.0,1366.0,68.0,,725.0,21.4,35.7,0.8982,Low
6668,9/9/2004,2018-09-29 05:00:00,813,0.4,-200.0,2.7,631.0,91.0,1327.0,65.0,1108.0,710.0,,33.4,0.8545,
2899,9/9/2004,2018-09-29 06:00:00,889,0.6,-200.0,4.9,,143.0,1224.0,81.0,1213.0,806.0,20.3,35.5,0.8342,Low
6122,9/9/2004,2018-09-29 07:00:00,1200,3.0,,18.2,,553.0,798.0,136.0,1574.0,1416.0,19.7,37.5,,High
5954,9/9/2004,2018-09-29 08:00:00,1289,3.2,,21.4,1335.0,542.0,724.0,130.0,1681.0,1521.0,21.0,35.6,,High
2564,9/9/2004,2018-09-29 09:00:00,1355,3.5,-200.0,21.3,1333.0,590.0,741.0,157.0,1691.0,1605.0,25.0,28.9,0.9025,High


As plain text, the `time` column doesn't sort chronologically (1 am shows up after 7), so it is converted to a datetime before sorting.

Now we shift `y` up by 6 rows (6 hours, since the readings are hourly) to generate our labels.

In [86]:
df['y_6_hours_later'] = df.y.shift(-6)
#df = df.iloc[6:]
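
Shifting by rows only equals shifting by hours when the rows are in strict chronological order with one reading per hour. The sketch below, on made-up data, builds a single timestamp from `date` and `time`, sorts on it, and drops the last rows that have no value 6 hours ahead. The helper name `add_future_target`, the `y_future` column and the date/time formats are assumptions for illustration.

```python
import pandas as pd


def add_future_target(df, horizon=6):
    """Add `y_future`, the value of `y` `horizon` rows (hours) later."""
    # Assumption: `date` looks like "9/9/2004" and `time` like "20:00:00"
    timestamp = pd.to_datetime(df["date"] + " " + df["time"],
                               format="%m/%d/%Y %H:%M:%S")
    ordered = df.assign(timestamp=timestamp).sort_values("timestamp")
    # With hourly rows, the value `horizon` rows down is the value `horizon` hours later
    ordered["y_future"] = ordered["y"].shift(-horizon)
    # The last `horizon` rows have no future value, so drop them rather than fill them
    return ordered.dropna(subset=["y_future"])


# Made-up hourly readings across midnight, deliberately out of order
raw = pd.DataFrame({
    "date": ["9/10/2004", "9/9/2004", "9/10/2004", "9/9/2004",
             "9/10/2004", "9/9/2004", "9/10/2004", "9/9/2004"],
    "time": ["00:00:00", "20:00:00", "01:00:00", "21:00:00",
             "02:00:00", "22:00:00", "03:00:00", "23:00:00"],
    "y": [104, 100, 105, 101, 106, 102, 107, 103],
})

labelled = add_future_target(raw, horizon=6)
print(labelled[["timestamp", "y", "y_future"]])
```

Sorting the `date` strings directly would place "9/10/2004" before "9/9/2004", which is why the sketch sorts on the parsed timestamp instead.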

In [87]:
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level,y_6_hours_later
4525,9/9/2004,2018-09-29 00:00:00,999,1.4,-200.0,7.5,879.0,180.0,1114.0,119.0,1261.0,960.0,24.9,28.3,0.8775,High,889.0
1925,9/9/2004,2018-09-29 01:00:00,980,1.2,-200.0,7.4,873.0,,1088.0,98.0,1258.0,949.0,23.2,30.5,0.8558,High,1200.0
5965,9/9/2004,2018-09-29 02:00:00,859,0.6,-200.0,3.8,698.0,105.0,,89.0,1157.0,819.0,22.3,33.6,0.8959,Low,1289.0
1820,9/9/2004,2018-09-29 03:00:00,814,,-200.0,2.3,606.0,-200.0,1370.0,-200.0,1131.0,,23.0,33.0,0.9184,Low,1355.0
5156,9/9/2004,2018-09-29 04:00:00,806,0.2,-200.0,2.3,605.0,74.0,1366.0,68.0,,725.0,21.4,35.7,0.8982,Low,1277.0
6668,9/9/2004,2018-09-29 05:00:00,813,0.4,-200.0,2.7,631.0,91.0,1327.0,65.0,1108.0,710.0,,33.4,0.8545,,1156.0
2899,9/9/2004,2018-09-29 06:00:00,889,0.6,-200.0,4.9,,143.0,1224.0,81.0,1213.0,806.0,20.3,35.5,0.8342,Low,1054.0
6122,9/9/2004,2018-09-29 07:00:00,1200,3.0,,18.2,,553.0,798.0,136.0,1574.0,1416.0,19.7,37.5,,High,1096.0
5954,9/9/2004,2018-09-29 08:00:00,1289,3.2,,21.4,1335.0,542.0,724.0,130.0,1681.0,1521.0,21.0,35.6,,High,1160.0
2564,9/9/2004,2018-09-29 09:00:00,1355,3.5,-200.0,21.3,1333.0,590.0,741.0,157.0,1691.0,1605.0,25.0,28.9,0.9025,High,1148.0


In [88]:
df = preProcess(df.drop(['date','time'], axis =1))

In [89]:
df_y = df['y_6_hours_later']

In [90]:
df.drop(['y','y_6_hours_later'], axis=1, inplace=True)

In [18]:
df_scaled = normalize(df_scale)

NameError: name 'normalize' is not defined

In [108]:
df = normalize(df)

In [22]:
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.702336,0.505041,-0.31012,0.250231,0.440084,1.362754,-0.429167,0.884451,-0.465154,1.015552,0.030681,0.025911,0.199353,0.143314,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
1,0.176381,0.493843,-0.31012,0.148613,-0.039186,0.604932,-0.087248,0.684131,-0.759865,0.372046,0.072553,-0.106420,0.199353,0.222643,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
2,0.126430,0.495088,-0.31012,0.153232,-0.007805,0.551077,-0.123558,0.745768,-0.764045,0.363350,0.065941,-0.112011,0.198140,0.393054,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
3,0.220456,0.497576,-0.31012,0.176327,0.100601,0.774192,-0.241565,0.791996,-0.709701,0.511183,0.030681,-0.044914,0.198164,0.457693,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
4,0.314481,0.495088,-0.31012,0.164779,0.043545,0.870362,-0.178023,0.745768,-0.709701,0.589447,-0.000172,0.053868,0.198722,0.839648,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
5,0.161690,0.496332,-0.31012,0.164779,-0.067714,0.404898,-0.069093,0.576267,-0.699251,0.391612,-0.020006,0.098599,0.198722,1.262737,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
6,-0.091004,0.488866,-0.31012,0.095494,-0.333025,0.266413,0.136664,0.483811,-0.883183,0.104644,-0.059675,0.191790,0.198512,0.701556,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
7,-0.202659,0.482645,-0.31012,0.079328,-0.430020,0.131775,0.233491,0.360537,-0.933347,0.043771,-0.105954,0.352078,0.198781,0.175633,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
8,-0.302561,0.480157,-0.31012,0.040066,-0.712447,-0.041332,0.590540,0.275787,-1.081747,0.043771,-0.123584,0.385626,0.198475,0.125685,0.798622,-0.435587,-0.17462,-0.03779,-0.498328
9,-0.373080,0.481401,-0.31012,0.040066,-0.701036,0.050992,0.520946,0.337424,-1.065026,-0.151889,-0.152233,0.463907,0.198265,0.219705,0.798622,-0.435587,-0.17462,-0.03779,-0.498328


In [226]:
# might want to fill the NaNs with a moving window average instead


In [109]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,0.485760,-0.309854,0.145134,-0.055717,0.092902,0.948941,0.530898,-0.324081,-0.021255,0.339660,-0.188259,0.207790,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
1,0.483273,-0.309854,0.142832,-0.072812,0.092902,0.870101,0.368902,-0.330368,-0.045155,0.302181,-0.147088,0.207259,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
2,0.475815,-0.309854,0.059985,-0.571427,-0.195436,0.870101,0.299475,-0.542049,-0.327613,0.282339,-0.089075,0.208241,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
3,0.475815,-0.309854,0.025466,-0.833556,-1.368008,1.725206,-1.929896,-0.596542,-0.327613,0.297771,-0.100303,0.208792,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
4,0.470842,-0.309854,0.025466,-0.836405,-0.314615,1.713077,0.137479,-0.596542,-0.531852,0.262497,-0.049775,0.208297,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
5,0.473328,-0.309854,0.034671,-0.762325,-0.249259,1.594818,0.114337,-0.644746,-0.564443,0.262497,-0.092817,0.207227,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
6,0.475815,-0.309854,0.085300,-0.762325,-0.049345,1.282492,0.237763,-0.424682,-0.355859,0.238245,-0.053518,0.206730,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
7,0.505649,-0.309854,0.391373,-0.762325,1.526900,-0.009262,0.662037,0.331921,0.969521,0.225017,-0.016089,0.206730,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
8,0.508136,-0.309854,0.465014,1.243530,1.484610,-0.233651,0.615753,0.556177,1.197660,0.253678,-0.051646,0.206730,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
9,0.511865,-0.309854,0.462713,1.237832,1.669146,-0.182102,0.824033,0.577136,1.380171,0.341865,-0.177031,0.208403,0.795965,-0.434437,-0.177781,-0.037776,-0.495320


### Modelling

In [110]:
xtrain, xtest, ytrain, ytest = train_test_split(df, df_y, test_size = 0.2, random_state = 19 )

In [111]:
xtrain

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
430,0.493218,-0.309854,0.241788,0.209261,0.285127,-0.163908,0.500042,0.044789,0.322040,0.317613,-0.004861,0.213540,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
8278,0.490732,-0.309854,0.119819,-0.195329,0.619598,-0.173005,0.368902,-0.609117,0.187330,-0.052772,0.665105,0.204705,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
3520,0.485760,-0.309854,0.158941,0.012665,-0.253103,0.782165,0.276333,0.044789,-0.560098,0.394777,-0.334230,0.204372,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
6217,-2.017873,-0.309854,0.004754,-1.021604,-1.368008,1.500817,-1.929896,-1.001041,-0.775200,-0.112298,0.618320,0.201054,-1.256337,-0.434437,-0.177781,-0.037776,2.018898
1728,-2.017873,-0.309854,0.110614,-0.252313,-0.399194,0.218160,0.230049,0.199882,-0.236357,0.414619,-0.053518,0.219150,-1.256337,-0.434437,-0.177781,-0.037776,2.018898
5820,0.544186,-0.309854,0.361456,1.357500,1.807548,-0.827979,1.186595,-0.022278,2.155845,0.028801,0.262751,0.202873,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
3664,0.493218,-0.309854,0.163544,0.035459,-0.222347,0.448613,0.314903,0.392701,-0.623108,0.352888,-0.090946,0.212668,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
5780,0.484517,-0.309854,0.092204,-0.349187,0.142880,0.081707,0.492327,-0.416298,-0.192902,0.055257,0.635163,0.211023,0.795965,-0.434437,-0.177781,-0.037776,-0.495320
6959,0.477058,-0.309854,0.055383,-0.599919,-0.341527,0.360677,-0.024516,-0.206713,-0.325440,0.242654,0.444278,0.222348,-1.256337,2.301832,-0.177781,-0.037776,-0.495320
8196,0.525539,-0.309854,0.573175,1.591136,3.495283,-1.122111,1.487445,0.470247,2.225373,-0.088047,0.695048,0.202628,0.795965,-0.434437,-0.177781,-0.037776,-0.495320


In [112]:
# y is continuous here, so a regressor is fitted; the estimator must be instantiated before calling fit
from sklearn.metrics import r2_score

lin_reg = LinearRegression()
lin_reg.fit(xtrain, ytrain)
print("Train R2 score is:", round(r2_score(ytrain, lin_reg.predict(xtrain)), 3))
print("Train RMSE is: {} \n".format(np.sqrt(mean_squared_error(ytrain, lin_reg.predict(xtrain))).round(3)))

In [113]:
def logisticRegression(xtrain,xtest, ytrain, ytest):
    LR = LogisticRegression()
    model = LR.fit(xtrain.values,ytrain.values)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest.values)
    print(score)
    metrics2(ytest,pred)

In [114]:
def linearRegression(xtrain,xtest, ytrain, ytest):
    LR = LinearRegression()
    model = LR.fit(xtrain,ytrain)
    pred = model.predict(xtest)
    print(pred)
    print(model.coef_)
    metrics2(ytest,pred)

In [115]:
def randomForestRegressor(xtrain,xtest, ytrain, ytest):
    RF = RandomForestRegressor(max_depth = 50,random_state = 9, max_features = 'sqrt', n_estimators = 200,min_samples_split = 5,min_samples_leaf =2)
    model = RF.fit(xtrain,ytrain)
    score = RF.score(xtest, ytest)  # R^2 on the test split
    pred = model.predict(xtest)
    print(model)
    print(score)
    print(pred)
    print(mean_squared_error(ytest,pred))
    metrics2(ytest,pred)

In [116]:
def metrics2(ytest, pred):
    """
    Function to evaluate a regression model's predictions against the true values
    """
    # mean_squared_error returns the MSE; np.sqrt of this value gives the RMSE
    print('MSE:', mean_squared_error(ytest,pred))
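
The challenge is scored on RMSE, while `mean_squared_error` returns the mean of the squared errors. A minimal sketch of the conversion, using a hypothetical helper `rmse` and made-up numbers:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors of 100 and -300 give an MSE of 50000, so the RMSE is sqrt(50000)
example_rmse = rmse([1000.0, 1200.0], [900.0, 1500.0])
print(example_rmse)
```

Unlike the MSE, the RMSE is in the same units as `y`, which makes it easier to compare against the range of the target values.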

In [117]:
ytest

6685     924.0
6726    1167.0
7896     865.0
898     1317.0
4375     889.0
2717    1150.0
6786     875.0
3971    1056.0
7801    1196.0
3973    1409.0
511     1390.0
1200     874.0
7346    1223.0
654     1281.0
960     1300.0
310     1352.0
1984     973.0
6446     815.0
2519     925.0
1327    2008.0
2620    1454.0
7736    -200.0
5935    1251.0
1453     897.0
2128     935.0
1682     839.0
4561    1093.0
2285     820.0
472     1252.0
3574    1029.0
         ...  
1445     991.0
4202     753.0
5027    1116.0
608      939.0
3280     972.0
7918     862.0
1285     976.0
4946    1569.0
4010    1023.0
5797    1014.0
2868    -200.0
8167    1236.0
8294    1016.0
5780    1337.0
996     1200.0
4421     930.0
8149     817.0
1511    1121.0
7137     720.0
351     1176.0
5670    1191.0
329     1481.0
4682     884.0
2541    -200.0
6289    1012.0
6827    1251.0
2103    1982.0
6262    1448.0
4508    1052.0
7973     882.0
Name: y_6_hours_later, Length: 1685, dtype: float64

In [118]:
#logisticRegression(xtrain,xtest, ytrain, ytest)

In [119]:
linearRegression(xtrain,xtest, ytrain, ytest)

[1036.82471394 1005.39985816  935.78670756 ... 1222.79149115 1054.6582839
  148.79992204]
[  5.35258669  29.28711771  15.63948488 -80.43074747  34.29233592
   2.33961896 -34.77066078   6.8333999  101.97702856 120.81013412
 106.47268092 -61.97287516  -9.94222359  10.9650486   -3.29662165
 -11.70363661   4.64309292]
MSE: 83711.56104224092


In [120]:
randomForestRegressor(xtrain, xtest, ytrain, ytest)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=50,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=2, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=9, verbose=0, warm_start=False)
0.4470845924233477
[ 947.65384921  877.05493723  849.89543254 ... 1146.18155435 1044.42585642
  258.71022012]
61580.30219853067
MSE: 61580.30219853067
