# Machine Learning Challenge

Below are 2 data challenges that test for your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a github repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions. 
    - A readme file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. I encourage you to try and showcase your talents. The more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/Tensorflow/Pytorch/Caffe models** in addition to your Linear/Tree-based models.

## 0. Import dependencies

In [102]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

import xgboost
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.ensemble import RandomForestClassifier 
from sklearn.naive_bayes import GaussianNB 
from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score, precision_score, f1_score,recall_score, roc_auc_score

## Task 1
`predictive_maintenance_dataset.csv` is a file that contains parameters and settings (`operational_setting_1`, `operational_setting_2`, `sensor_measurement_1`, `sensor_measurement_2`, etc.) for many wind turbines. There is a column called `unit_number` which specifies which turbine it is, and one called `status`, in which a value of 1 means the turbine broke down that day, and 0 means it didn't. Your task is to create a model that, when fed with operational settings and sensor measurements (`unit_number` and `time_stamp` will *not* be fed in), outputs 1 if the turbine will break down within the next 40 days, and 0 if not.

**NOTE:** The model should output 1 if the turbine is anywhere between 40 and 0 days away from failure, not *only* 40 days from failure.

In [3]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the operational_setting_3 column looks like
df_X = pd.read_csv("predictive_maintenance_dataset.csv").drop(labels=['status', 'unit_number', 'time_stamp'], axis='columns')
df_X

Unnamed: 0,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,42.0007,0.8415,High,445.00,,1362.47,1143.17,3.91,5.70,142.53,...,133.75,2388.50,8129.92,9.1182,,332.0,2212.0,100.00,10.77,6.5717
1,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,14.62,21.61,554.60,...,522.19,2388.00,8135.70,8.3817,0.03,393.0,2388.0,100.00,39.07,23.3958
2,,0.6216,Low,462.54,536.71,1250.87,1037.52,7.05,9.00,174.56,...,163.11,2028.06,7867.90,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,42.0006,,High,,549.28,1349.42,1114.02,3.91,5.71,137.97,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.00,10.60,6.2614
4,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,14.62,21.61,551.30,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.00,38.39,23.0682
5,25.0046,0.6219,Low,462.54,536.72,,1047.79,7.05,9.03,175.36,...,164.97,2028.40,7880.19,10.8625,0.02,308.0,1915.0,84.93,14.38,8.6381
6,,0.6200,Low,462.54,536.79,1267.31,1045.78,7.05,9.03,174.81,...,165.05,2028.37,7881.95,10.9150,0.02,307.0,1915.0,84.93,14.18,8.5752
7,42.0053,0.8400,High,445.00,548.84,1348.71,1119.73,3.91,5.71,138.95,...,130.38,2387.86,8079.78,9.3526,0.02,329.0,2212.0,100.00,10.64,6.5382
8,0.0029,-0.0003,High,,642.48,1588.88,1393.88,14.62,21.61,,...,522.01,2388.06,,8.3743,0.03,392.0,2388.0,100.00,38.95,23.4351
9,10.0008,0.2504,High,489.05,604.49,1498.95,1309.51,10.52,15.49,394.85,...,371.56,2388.09,8128.11,,0.03,368.0,2319.0,100.00,28.48,17.2737


### 1. Import data

In [79]:
df = pd.read_csv("predictive_maintenance_dataset.csv").sort_values(by = ['unit_number', 'time_stamp'], ascending = True).drop('time_stamp',axis=1)
df

Unnamed: 0,unit_number,status,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
73382,2,0,-0.0018,0.0006,High,518.67,641.89,1583.84,1391.28,14.62,...,522.33,2388.06,8137.72,8.3905,0.03,391.0,2388.0,100.00,38.94,23.4585
90923,2,0,0.0043,-0.0003,High,518.67,641.82,1587.05,1393.13,14.62,...,522.70,2387.98,8131.09,8.4167,0.03,,2388.0,100.00,39.06,23.4085
82527,2,0,0.0018,0.0003,High,518.67,641.55,1588.32,1398.96,14.62,...,522.58,2387.99,8140.58,8.3802,0.03,391.0,2388.0,100.00,39.11,23.4250
96521,2,0,0.0035,-0.0004,High,518.67,641.68,1584.15,1396.08,14.62,...,522.49,2387.93,8140.44,8.4018,0.03,391.0,2388.0,100.00,39.13,23.5027
73137,2,0,0.0005,0.0004,High,518.67,641.73,1579.03,1402.52,14.62,...,522.27,2387.94,8136.67,8.3867,0.03,390.0,2388.0,100.00,39.18,23.4234
6093,2,0,-0.0010,0.0004,High,518.67,641.30,1577.50,1396.76,14.62,...,522.80,2387.99,8133.65,8.3800,0.03,392.0,2388.0,100.00,39.15,23.4270
91573,2,0,0.0001,-0.0002,High,518.67,642.03,1587.49,1400.65,14.62,...,522.14,2388.04,8136.33,8.3941,0.03,391.0,2388.0,100.00,39.10,23.4718
77471,2,0,0.0015,-0.0004,High,518.67,642.55,1590.41,,14.62,...,522.77,,,8.3861,0.03,391.0,2388.0,100.00,,23.4381
93541,2,0,0.0017,-0.0004,High,518.67,641.98,1581.99,1395.01,14.62,...,522.40,2387.98,8145.29,8.3868,0.03,390.0,2388.0,100.00,39.06,23.4875
30788,2,0,,0.0002,High,518.67,,1586.37,1394.86,14.62,...,521.99,2387.97,8138.64,8.3982,0.03,391.0,2388.0,100.00,,23.6005


In [72]:
categorical_columns = df.select_dtypes(include=['object'])
categorical_columns = categorical_columns.fillna(method='ffill')
    
#dummy_columns = pd.get_dummies(categorical_columns)
    
#df = pd.concat([df.drop(categorical_columns, axis=1), dummy_columns], axis=1)
#df
categorical_columns


Unnamed: 0,operational_setting_3
73382,High
90923,High
82527,High
96521,High
73137,High
6093,High
91573,High
77471,High
93541,High
30788,High


In [None]:
df.groupby('unit_number').count().where(df['unit_number']<40) # No unit number values less than 40

In [20]:
df.select_dtypes(include=['object']).dtypes

time_stamp               object
operational_setting_3    object
dtype: object

In [14]:
df['time_stamp'] = pd.to_datetime(df['time_stamp'])

In [24]:
df.select_dtypes(include=['object'])

Unnamed: 0,time_stamp,operational_setting_3
73382,2017-04-01 12:00:00,High
90923,2017-04-02 12:00:00,High
82527,2017-04-03 12:00:00,High
96521,2017-04-04 12:00:00,High
73137,2017-04-05 12:00:00,High
6093,2017-04-06 12:00:00,High
91573,2017-04-07 12:00:00,High
77471,2017-04-08 12:00:00,High
93541,2017-04-09 12:00:00,High
30788,2017-04-10 12:00:00,High


### 2. Explore data

Are there any null columns? 

In [80]:
# Check for columns with Null values
nullcols = []

for col in df.columns:
    nbnull = (df[col].isnull()*1).sum()
    if (nbnull>0): 
        t = type(df[df[col].notnull()][col].iat[0]) # type of first non-null value
        nullcols.append([col,t])
        print(col, nbnull, t)

operational_setting_1 7141 <class 'numpy.float64'>
operational_setting_2 7196 <class 'numpy.float64'>
operational_setting_3 7227 <class 'str'>
sensor_measurement_1 7209 <class 'numpy.float64'>
sensor_measurement_2 7198 <class 'numpy.float64'>
sensor_measurement_3 7190 <class 'numpy.float64'>
sensor_measurement_4 7335 <class 'numpy.float64'>
sensor_measurement_5 7244 <class 'numpy.float64'>
sensor_measurement_6 7444 <class 'numpy.float64'>
sensor_measurement_7 7213 <class 'numpy.float64'>
sensor_measurement_8 7276 <class 'numpy.float64'>
sensor_measurement_9 7207 <class 'numpy.float64'>
sensor_measurement_10 7191 <class 'numpy.float64'>
sensor_measurement_11 7180 <class 'numpy.float64'>
sensor_measurement_12 7227 <class 'numpy.float64'>
sensor_measurement_13 7115 <class 'numpy.float64'>
sensor_measurement_14 7068 <class 'numpy.float64'>
sensor_measurement_15 7257 <class 'numpy.float64'>
sensor_measurement_16 7059 <class 'numpy.float64'>
sensor_measurement_17 7167 <class 'numpy.float64'>

That's a lot of empty values! 

### Categorical value

 Consider replacement with mode and creating dummy variables 

In [81]:
df['operational_setting_3'].fillna(df['operational_setting_3'].mode()[0], inplace=True)

Converting to dummy variable to numerically quantify categories
and further reduce variables by only including the high column to indicate if the load is high or low (1 or 0)

In [82]:
df = pd.concat([df.drop('operational_setting_3', axis=1), pd.get_dummies(df.operational_setting_3)], axis=1)

### Numerical values

Some 7000 values are missing out of 144000, that's about 5%, a significant number. This could be valuable information that otherwise may skew our data if not used. 

There are several ways we can approach the missing numerical values. We could use the mean or median values for the entire data set, or narrow down to those values of the individual units. 

Interestingly, the 'status' column gives us a score of how often each turbine breaks down. 

So now we just fill in the missing nan's with the average values of each individual tubine

In [83]:
#df.fillna(df.mean(axis=0), axis=0, inplace=True)
df.fillna(method='ffill', inplace=True)

### Setting Labels

We have an interesting case here: where we're checking if a turbine is going to fail in 40 days or less. So essentially we're trying to figure out a problem where given all the parameters what is the likelihood that a certain unit fails within a 40 day timespan. 


So we just have to identify the date the turbines failed and mark any other data point going back up to a maximum of 40 days as a failure as well.

In [84]:
df.groupby(['status']).count()

Unnamed: 0_level_0,unit_number,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,143570,143570,143570,143570,143570,143570,143570,143570,143570,143570,...,143570,143570,143570,143570,143570,143570,143570,143570,143570,143570
1,633,633,633,633,633,633,633,633,633,633,...,633,633,633,633,633,633,633,633,633,633


In [85]:
#df = pd.read_csv("processed_dataset.csv")

In [86]:
df

Unnamed: 0,unit_number,status,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
73382,2,0,-0.0018,0.0006,518.67,641.89,1583.84,1391.28,14.62,21.60,...,8137.72,8.3905,0.03,391.0,2388.0,100.00,38.94,23.4585,1,0
90923,2,0,0.0043,-0.0003,518.67,641.82,1587.05,1393.13,14.62,21.61,...,8131.09,8.4167,0.03,391.0,2388.0,100.00,39.06,23.4085,1,0
82527,2,0,0.0018,0.0003,518.67,641.55,1588.32,1398.96,14.62,21.60,...,8140.58,8.3802,0.03,391.0,2388.0,100.00,39.11,23.4250,1,0
96521,2,0,0.0035,-0.0004,518.67,641.68,1584.15,1396.08,14.62,21.61,...,8140.44,8.4018,0.03,391.0,2388.0,100.00,39.13,23.5027,1,0
73137,2,0,0.0005,0.0004,518.67,641.73,1579.03,1402.52,14.62,21.61,...,8136.67,8.3867,0.03,390.0,2388.0,100.00,39.18,23.4234,1,0
6093,2,0,-0.0010,0.0004,518.67,641.30,1577.50,1396.76,14.62,21.61,...,8133.65,8.3800,0.03,392.0,2388.0,100.00,39.15,23.4270,1,0
91573,2,0,0.0001,-0.0002,518.67,642.03,1587.49,1400.65,14.62,21.61,...,8136.33,8.3941,0.03,391.0,2388.0,100.00,39.10,23.4718,1,0
77471,2,0,0.0015,-0.0004,518.67,642.55,1590.41,1400.65,14.62,21.61,...,8136.33,8.3861,0.03,391.0,2388.0,100.00,39.10,23.4381,1,0
93541,2,0,0.0017,-0.0004,518.67,641.98,1581.99,1395.01,14.62,21.60,...,8145.29,8.3868,0.03,390.0,2388.0,100.00,39.06,23.4875,1,0
30788,2,0,0.0017,0.0002,518.67,641.98,1586.37,1394.86,14.62,21.60,...,8138.64,8.3982,0.03,391.0,2388.0,100.00,39.06,23.6005,1,0


In [87]:
df['status'] = df['status'].replace(0, np.NaN) #Let's replace all the 0s with NaNs and then we work backwords

In [88]:
df['status'] = df['status'].fillna(method='bfill', limit=40) # fill backward up to 40days. Thankfully the data is frequent and daily
df['status'] = df['status'].fillna('0') #fill the rest with zeros

In [89]:
df.groupby(['status']).count()

Unnamed: 0_level_0,unit_number,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,25953,25953,25953,25953,25953,25953,25953,25953,25953,25953,...,25953,25953,25953,25953,25953,25953,25953,25953,25953,25953
0.0,118250,118250,118250,118250,118250,118250,118250,118250,118250,118250,...,118250,118250,118250,118250,118250,118250,118250,118250,118250,118250


In [91]:
df = df.drop(['unit_number'], axis = 1)
status = df['status']
df = df.drop(['status'], axis = 1)

In [92]:
df.head()

Unnamed: 0,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
73382,-0.0018,0.0006,518.67,641.89,1583.84,1391.28,14.62,21.6,554.53,2388.01,...,8137.72,8.3905,0.03,391.0,2388.0,100.0,38.94,23.4585,1,0
90923,0.0043,-0.0003,518.67,641.82,1587.05,1393.13,14.62,21.61,554.77,2387.98,...,8131.09,8.4167,0.03,391.0,2388.0,100.0,39.06,23.4085,1,0
82527,0.0018,0.0003,518.67,641.55,1588.32,1398.96,14.62,21.6,555.14,2388.04,...,8140.58,8.3802,0.03,391.0,2388.0,100.0,39.11,23.425,1,0
96521,0.0035,-0.0004,518.67,641.68,1584.15,1396.08,14.62,21.61,554.25,2387.98,...,8140.44,8.4018,0.03,391.0,2388.0,100.0,39.13,23.5027,1,0
73137,0.0005,0.0004,518.67,641.73,1579.03,1402.52,14.62,21.61,555.12,2388.03,...,8136.67,8.3867,0.03,390.0,2388.0,100.0,39.18,23.4234,1,0


In [93]:
standard_sc = scale.StandardScaler()
x_std = standard_sc.fit_transform(df)
df_scaled = pd.DataFrame(x_std)

In [94]:
df_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
count,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,...,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0,144203.0
mean,-2.83817e-17,1.166803e-16,1.94257e-15,1.576761e-16,-1.393857e-15,9.365961e-16,2.869705e-16,3.468874e-17,2.648959e-16,5.562813e-15,...,-1.089227e-14,2.642652e-15,-1.3434e-15,9.208285e-16,-7.52115e-16,-1.748628e-15,-1.009127e-16,-1.860578e-16,-2.44398e-17,2.44398e-17
std,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,...,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003,1.000003
min,-1.033451,-1.105096,-1.35183,-1.464823,-1.90589,-1.747285,-1.411984,-1.367677,-1.29194,-2.529554,...,-2.997123,-1.189098,-1.04587,-1.901897,-2.524492,-2.903865,-1.357807,-1.358341,-2.985677,-0.3349324
25%,-1.032852,-1.102924,-1.205957,-1.123459,-0.9355601,-0.9902227,-1.044146,-1.006238,-1.065005,-0.4400264,...,-0.2329706,-0.8143819,-1.04587,-0.9346152,-0.4394793,0.3443687,-1.000667,-1.00057,0.3349324,-0.3349324
50%,-0.4274883,-0.4195824,0.1657078,0.2060015,0.2160565,0.2299775,0.1366856,0.1509877,0.1806149,0.3438362,...,0.3760103,-0.4107696,0.956142,0.2261228,0.3467883,0.3443687,0.1878068,0.1871813,0.3349324,-0.3349324
75%,1.08453,1.177054,1.068543,1.049413,1.003154,1.02795,1.097283,1.105001,1.101033,0.795943,...,0.622834,0.392194,0.956142,0.999948,0.796084,0.3443687,1.091764,1.092054,0.3349324,-0.3349324
max,1.50841,1.182483,1.068543,1.114345,1.258834,1.312857,1.097283,1.105001,1.201365,0.8000882,...,2.537892,2.686699,0.956142,1.225647,0.796084,0.3443687,1.182331,1.184875,0.3349324,2.985677


In [None]:
df.to_csv("processed_dataset.csv")

### Modelling

In [95]:
#train, xtest, ytrain, ytest = train_test_split(df, status, test_size = 0.2, random_state = 19 )
xtrain, xval, ytrain, yval = train_test_split(df, status, test_size=0.25)

In [98]:
ytrain=ytrain.astype(int)
yval=yval.astype(int)

In [20]:
xtrain.describe()

Unnamed: 0,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
count,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,...,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0,115362.0
mean,17.095803,0.40679,486.109731,597.692097,1468.055363,1261.922977,9.93332,14.480519,361.261443,2274.685667,...,8089.525072,9.049242,0.025221,360.96602,2274.496905,98.407854,26.03417,15.6235,0.899274,0.100726
std,16.533247,0.368365,30.437328,42.506023,118.130153,136.415871,4.267104,6.447521,174.27596,142.30753,...,80.403452,0.750428,0.004995,31.003386,142.437511,4.632375,11.702132,7.020771,0.300967,0.300967
min,-0.0087,-0.0006,445.0,535.48,1242.98,1023.77,3.91,5.67,136.17,1914.72,...,7848.43,8.1563,0.02,302.0,1915.0,84.93,10.16,6.1008,0.0,0.0
25%,0.0012,0.0002,449.44,549.99,1357.64,1127.04,5.48,8.0,175.74,2212.13,...,8070.81,8.4377,0.02,332.0,2212.0,100.0,14.34,8.603825,1.0,0.0
50%,10.0078,0.2519,491.19,606.46,1493.67,1292.71,10.52,15.45,392.84,2323.68,...,8119.65,8.7423,0.03,367.0,2324.0,100.0,28.24,16.9367,1.0,0.0
75%,35.0014,0.84,518.67,642.35,1586.7,1402.31,14.62,21.61,553.32,2388.05,...,8139.61,9.3439,0.03,392.0,2388.0,100.0,38.83,23.2983,1.0,0.0
max,42.008,0.842,518.67,644.71,1616.91,1441.16,14.62,21.61,570.81,2388.6,...,8293.72,11.0669,0.03,399.0,2388.0,100.0,39.89,23.9505,1.0,1.0


In [21]:
xtest

Unnamed: 0,operational_setting_1,operational_setting_2,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,sensor_measurement_8,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,High,Low
47990,41.9983,0.8400,445.00,550.45,1357.09,1140.43,3.91,5.71,137.93,2211.79,...,8081.70,9.4303,0.02,331.0,2212.0,100.00,10.63,6.2737,1,0
113866,42.0021,0.8420,445.00,549.18,1345.36,1117.40,3.91,5.70,138.39,2211.93,...,8074.88,9.3927,0.02,330.0,2212.0,100.00,10.52,6.4533,1,0
11120,0.0009,0.0002,518.67,642.20,1586.06,1405.09,14.62,21.58,550.65,2387.27,...,8123.42,8.4448,0.03,391.0,2388.0,100.00,38.83,23.2921,1,0
118337,-0.0011,0.0001,518.67,642.73,1582.28,1396.55,14.62,21.61,553.61,2388.06,...,8131.22,8.3617,0.03,393.0,2388.0,100.00,39.07,23.3611,1,0
65767,35.0004,0.8407,449.44,555.19,1350.23,1114.68,5.48,7.98,193.61,2222.90,...,8059.90,9.2970,0.02,332.0,2223.0,100.00,14.82,8.8429,1,0
52637,0.0000,0.0000,518.67,643.00,1597.09,1423.39,14.62,21.61,552.22,2388.14,...,8168.40,8.4757,0.03,396.0,2388.0,100.00,38.40,23.1119,1,0
55706,0.0013,0.0000,518.67,642.90,1608.01,1417.04,14.62,21.61,552.83,2388.20,...,8199.57,8.4871,0.03,395.0,2388.0,100.00,38.62,23.0893,1,0
134913,0.0029,0.0003,518.67,642.88,1589.56,1412.40,14.62,21.61,553.22,2388.14,...,8154.49,8.4568,0.03,393.0,2388.0,100.00,38.71,23.1353,1,0
25959,20.0039,0.8400,449.44,555.52,1372.47,1135.44,5.48,8.00,194.61,2222.95,...,8066.54,9.3328,0.02,331.0,2223.0,100.00,14.76,8.9531,1,0
16428,0.0008,0.0000,518.67,642.52,1583.29,1396.95,14.62,21.60,556.80,2388.04,...,8143.30,8.3717,0.03,391.0,2388.0,100.00,39.11,23.3853,1,0


In [None]:
xtrain

In [None]:
xtrain.describe()

In [None]:
ytest

In [None]:
def score(training_model):
    model = training_model.fit(xtrain.values,ytrain.values)
    pred = model.predict(xtest.values)
    metrics(pred,ytest)
    

In [116]:
def logisticRegression(xtrain,xval, ytrain, yval):
    LR = LogisticRegression()
    model = LR.fit(xtrain, ytrain)
    pred = model.predict(xval)
    metrics(pred,yval)


In [117]:
logisticRegression(xtrain,xval, ytrain, yval)

accuracy score:  0.810795817037
Recall score:  0.320833333333
Precision Score:  0.0351973182996
F1_score:  0.063435397501


In [118]:
randomForestClassifier(xtrain,xval,ytrain,yval, n_estimators=20,min_samples_split=2,max_depth=25,random_state=72)

accuracy score:  0.912041274861
Recall score:  0.597135456346
Precision Score:  0.88146648673
F1_score:  0.711962939413


In [124]:
xgbClassifier(xtrain,xval,ytrain,yval)

accuracy score:  0.944939114033
Recall score:  0.809690690233
Precision Score:  0.878347107438
F1_score:  0.842622690874


In [123]:
gaussianNaiveBayes(xtrain,xval,ytrain,yval)

In [109]:
def randomForestClassifier(xtrain,xval,ytrain,yval,n_estimators=25,min_samples_split=25,max_depth=5,random_state=72):
    RF = RandomForestClassifier(n_estimators = 25, min_samples_split=25, max_depth =5, random_state=72)
    
    model = RF.fit(xtrain,ytrain)
    pred = RF.predict(xval)
    metrics(yval, pred)

In [106]:
def xgbClassifier(xtrain,xval,ytrain,yval, max_depth=9, n_estimators=50, learning_rate=0.05, objective='binary:logistic'):
    xgb = xgboost.XGBClassifier( max_depth=9, n_estimators=50, learning_rate=0.05, objective='binary:logistic')
    #score(xgb)
    model = xgb.fit(xtrain,ytrain)
    pred = xgb.predict(xval)
    metrics(yval, pred)

In [121]:
def gaussianNaiveBayes(xtrain,xtest,ytrain,ytest):
    GNB = GaussianNB()
    model = GNB.fit(xtrain,ytrain)
    pred = GNB.predict(xtest)
    metrics = (ytest, pred)

In [115]:
def metrics(ytest, pred):
    """
    Function to evaluate models against models 
    """
    print('accuracy score: ', accuracy_score(ytest, pred))
    #print('RMSE:', mean_squared_error(ytest,pred))
    print('Recall score: ', recall_score(ytest,pred))
    
    #print('average_precision_score: ', average_precision_score(ytest,pred))
    print('Precision Score: ',precision_score(ytest,pred))
    print('F1_score: ',f1_score(ytest, pred))
    #print('roc_auc_score: ', roc_auc_score(ytest, pred))

### Conclusion

100% on everything? That's very fishy! Perhaps, my way of marking due for repair might be doing something to the dataset?
or the extent of data scaling might be? 

## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [None]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
pd.read_csv("forecasting_dataset.csv").head().drop(labels=['date', 'time', 'y'], axis='columns')

In [125]:
df = pd.read_csv("forecasting_dataset.csv").sort_values(by = ['date', 'time'], ascending = True)
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(by = ['date', 'time'], ascending = [True, False])
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
6874,1/1/2005,2018-09-29 23:00:00,1091,1.7,-200.0,,773.0,,820.0,115.0,1003.0,1232.0,5.6,59.7,0.5463,High
353,1/1/2005,2018-09-29 22:00:00,1118,2.1,-200.0,6.4,830.0,295.0,765.0,130.0,1058.0,1313.0,5.7,59.9,0.5523,
2244,1/1/2005,2018-09-29 21:00:00,1176,2.3,-200.0,8.1,,334.0,718.0,137.0,1104.0,1389.0,6.2,59.6,0.5698,High
6069,1/1/2005,2018-09-29 20:00:00,1198,2.5,-200.0,7.9,897.0,402.0,720.0,151.0,1072.0,1436.0,7.8,54.6,0.5786,High
3046,1/1/2005,2018-09-29 19:00:00,1328,3.6,-200.0,11.4,1029.0,622.0,637.0,172.0,1188.0,1611.0,8.1,54.1,0.5882,High
7296,1/1/2005,2018-09-29 18:00:00,1472,4.7,-200.0,16.6,1198.0,832.0,555.0,191.0,1344.0,1735.0,,51.8,0.5961,
3712,1/1/2005,2018-09-29 17:00:00,1281,3.0,-200.0,12.1,1053.0,510.0,659.0,165.0,1192.0,1438.0,10.9,39.7,0.5166,High
4670,1/1/2005,2018-09-29 16:00:00,1102,2.1,-200.0,7.7,885.0,313.0,772.0,139.0,1051.0,1142.0,12.8,32.6,,High
4195,1/1/2005,2018-09-29 15:00:00,1085,2.2,-200.0,7.9,896.0,299.0,760.0,147.0,1049.0,1138.0,12.5,32.3,0.4670,High
1524,1/1/2005,2018-09-29 14:00:00,1117,2.4,-200.0,8.9,934.0,357.0,721.0,153.0,1075.0,1206.0,10.9,35.9,0.4680,High


In [126]:
df['y_6_hours_later'] = df.y.shift(6)
df = df.iloc[6:]

In [127]:
df = preProcessData(df.drop(['date','time'], axis =1))

NameError: name 'preProcessData' is not defined