# Machine Learning Challenge

Below are 2 data challenges that test for your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a GitHub repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions. 
    - A readme file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on
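
As a rough illustration of the stored-model deliverable, the sketch below saves a fitted scikit-learn estimator with Python's `pickle` module and loads it back. The toy data and the file name `model.pkl` are placeholders chosen for this example, not part of the challenge.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real features and labels (illustrative only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# Store the fitted model in a file; "model.pkl" is a placeholder name
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Loading it back gives an object whose predict(X) can be called directly
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

loaded_predictions = loaded_model.predict(X_toy)
```

The readme instructions can then be as short as the last three statements: open the file, `pickle.load` it, and call `predict(X)`.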

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.
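
One possible way to make `predict(X)` handle its own preprocessing is to wrap a scikit-learn `Pipeline` in a small class. The sketch below is an assumption about how this could be done, not a required design: the class name `RobustModel`, the column names, and the toy data are all hypothetical. It imputes NA values, ignores categories that were not seen during training, and drops columns that were not present at fit time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class RobustModel:
    """Hypothetical wrapper: preprocessing and classifier behind one predict(X)."""

    def __init__(self, numeric_cols, categorical_cols):
        self.columns = list(numeric_cols) + list(categorical_cols)
        numeric = Pipeline([
            ("impute", SimpleImputer(strategy="mean")),           # NA -> column mean
            ("scale", StandardScaler()),
        ])
        categorical = Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),  # NA -> mode
            ("encode", OneHotEncoder(handle_unknown="ignore")),   # unseen value -> all zeros
        ])
        self.pipeline = Pipeline([
            ("prep", ColumnTransformer([
                ("num", numeric, list(numeric_cols)),
                ("cat", categorical, list(categorical_cols)),
            ])),
            ("clf", LogisticRegression()),
        ])

    def fit(self, X, y):
        self.pipeline.fit(X[self.columns], y)
        return self

    def predict(self, X):
        # Keep only the columns seen during training, so new columns are ignored
        return self.pipeline.predict(X.reindex(columns=self.columns))


# Toy training data with NA values in both column types (illustrative only)
train = pd.DataFrame({
    "sensor": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "setting": ["High", "Low", "High", np.nan, "Low", "High"],
})
labels = [0, 0, 0, 1, 1, 1]
model = RobustModel(["sensor"], ["setting"]).fit(train, labels)

held_out = pd.DataFrame({
    "sensor": [np.nan, 3.5],
    "setting": ["Medium", "Low"],  # "Medium" was never seen in training
    "new_column": [7, 8],          # column that was not in the training data
})
held_out_predictions = model.predict(held_out)
```

Because the imputers and the encoder are fitted once on the training data, the same statistics are reused at prediction time, which keeps the preprocessing reproducible.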

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. I encourage you to try and showcase your talents. The more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/TensorFlow/PyTorch/Caffe models** in addition to your Linear/Tree-based models.

## 0. Import dependencies

In [254]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split


from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.naive_bayes import GaussianNB 
from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score, precision_score, f1_score,recall_score, roc_auc_score

## Task 1
`predictive_maintenance_dataset.csv` is a file that contains parameters and settings (`operational_setting_1`, `operational_setting_2`, `sensor_measurement_1`, `sensor_measurement_2`, etc.) for many wind turbines. There is a column called `unit_number` which specifies which turbine it is, and one called `status`, in which a value of 1 means the turbine broke down that day, and 0 means it didn't. Your task is to create a model that, when fed with operational settings and sensor measurements (`unit_number` and `time_stamp` will *not* be fed in), outputs 1 if the turbine will break down within the next 40 days, and 0 if not.

**NOTE:** The model should output 1 if the turbine is anywhere between 40 and 0 days away from failure, not *only* 40 days from failure.

In [160]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the operational_setting_3 column looks like
df_X = pd.read_csv("predictive_maintenance_dataset.csv").drop(labels=['status', 'unit_number', 'time_stamp'], axis='columns')
df_X

Unnamed: 0,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,42.0007,0.8415,High,445.00,,1362.47,1143.17,3.91,5.70,142.53,...,133.75,2388.50,8129.92,9.1182,,332.0,2212.0,100.00,10.77,6.5717
1,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,14.62,21.61,554.60,...,522.19,2388.00,8135.70,8.3817,0.03,393.0,2388.0,100.00,39.07,23.3958
2,,0.6216,Low,462.54,536.71,1250.87,1037.52,7.05,9.00,174.56,...,163.11,2028.06,7867.90,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,42.0006,,High,,549.28,1349.42,1114.02,3.91,5.71,137.97,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.00,10.60,6.2614
4,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,14.62,21.61,551.30,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.00,38.39,23.0682
5,25.0046,0.6219,Low,462.54,536.72,,1047.79,7.05,9.03,175.36,...,164.97,2028.40,7880.19,10.8625,0.02,308.0,1915.0,84.93,14.38,8.6381
6,,0.6200,Low,462.54,536.79,1267.31,1045.78,7.05,9.03,174.81,...,165.05,2028.37,7881.95,10.9150,0.02,307.0,1915.0,84.93,14.18,8.5752
7,42.0053,0.8400,High,445.00,548.84,1348.71,1119.73,3.91,5.71,138.95,...,130.38,2387.86,8079.78,9.3526,0.02,329.0,2212.0,100.00,10.64,6.5382
8,0.0029,-0.0003,High,,642.48,1588.88,1393.88,14.62,21.61,,...,522.01,2388.06,,8.3743,0.03,392.0,2388.0,100.00,38.95,23.4351
9,10.0008,0.2504,High,489.05,604.49,1498.95,1309.51,10.52,15.49,394.85,...,371.56,2388.09,8128.11,,0.03,368.0,2319.0,100.00,28.48,17.2737


In [163]:
df = pd.read_csv("predictive_maintenance_dataset.csv")
df

Unnamed: 0,unit_number,time_stamp,status,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,540,2017-02-19 12:00:00,0,42.0007,0.8415,High,445.00,,1362.47,1143.17,...,133.75,2388.50,8129.92,9.1182,,332.0,2212.0,100.00,10.77,6.5717
1,396,2017-11-21 12:00:00,0,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,...,522.19,2388.00,8135.70,8.3817,0.03,393.0,2388.0,100.00,39.07,23.3958
2,513,2017-02-12 12:00:00,0,,0.6216,Low,462.54,536.71,1250.87,1037.52,...,163.11,2028.06,7867.90,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,211,2014-06-05 12:00:00,0,42.0006,,High,,549.28,1349.42,1114.02,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.00,10.60,6.2614
4,460,2014-11-27 12:00:00,0,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.00,38.39,23.0682
5,306,2015-09-10 12:00:00,0,25.0046,0.6219,Low,462.54,536.72,,1047.79,...,164.97,2028.40,7880.19,10.8625,0.02,308.0,1915.0,84.93,14.38,8.6381
6,124,2015-12-29 12:00:00,0,,0.6200,Low,462.54,536.79,1267.31,1045.78,...,165.05,2028.37,7881.95,10.9150,0.02,307.0,1915.0,84.93,14.18,8.5752
7,199,2014-08-14 12:00:00,0,42.0053,0.8400,High,445.00,548.84,1348.71,1119.73,...,130.38,2387.86,8079.78,9.3526,0.02,329.0,2212.0,100.00,10.64,6.5382
8,89,2014-11-06 12:00:00,0,0.0029,-0.0003,High,,642.48,1588.88,1393.88,...,522.01,2388.06,,8.3743,0.03,392.0,2388.0,100.00,38.95,23.4351
9,692,2015-08-11 12:00:00,0,10.0008,0.2504,High,489.05,604.49,1498.95,1309.51,...,371.56,2388.09,8128.11,,0.03,368.0,2319.0,100.00,28.48,17.2737


### 1. Preprocess data

In [108]:
def preProcessData(data):
    """
    Function to preprocess similar datasets:
    Takes in a dataframe, checks for null values, replaces categorical columns that contain
    nulls with dummy variables (after filling the nulls with the column mode),
    and fills the remaining null values in the numerical columns with the mean of that column
    """

    df = data.copy()  # work on a copy so the caller's dataframe is not modified

    for col in df.columns:

        nbnull = df[col].isnull().sum()
        if nbnull > 0:  # if there are null values in a column
            t = type(df[col].dropna().iat[0])  # type of first non-null value
            print(col, nbnull, t)

            if t == str:  # assumes every string column with nulls is categorical
                df[col] = df[col].fillna(df[col].mode()[0])
                df = pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col])], axis=1)

    # fill the remaining nulls with column means (numeric columns only)
    df = df.fillna(df.mean(numeric_only=True))

    # more data types to account for can be added here

    return df

### 2. Setting Labels

We have an interesting case here: we're checking whether a turbine is going to fail in 40 days or less. In other words, given all the parameters, how likely is it that a given unit fails within a 40-day window?

So we just have to identify the date each turbine failed and also mark the preceding rows, going back up to a maximum of 40 days, as failures.
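
The labelling idea can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: each turbine has one row per day, the helper name `label_failure_window` and the output column `label` are made up for this example, and the back-fill is done per `unit_number` so that one turbine's failure never marks rows belonging to another turbine.

```python
import pandas as pd


def label_failure_window(df, window=40):
    """Return a sorted copy of df with a 0/1 `label` column (hypothetical helper)."""
    ordered = df.sort_values(["unit_number", "time_stamp"])
    # Keep 1 on failure rows and NaN everywhere else
    failures = ordered["status"].where(ordered["status"] == 1)
    # Back-fill at most `window` rows, separately for each turbine
    filled = failures.groupby(ordered["unit_number"]).bfill(limit=window)
    return ordered.assign(label=filled.fillna(0).astype(int))


# Toy data: unit 1 never fails, units 2 and 3 fail on their last day
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "time_stamp": (list(pd.date_range("2017-01-01", periods=3))
                   + list(pd.date_range("2017-01-01", periods=2))
                   + list(pd.date_range("2017-01-01", periods=5))),
    "status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})
labelled = label_failure_window(toy, window=2)
```

With `window=2`, the failure row and the two rows before it are labelled 1 for each failing unit, while unit 1 stays at 0 even though unit 2's failure comes right after it in the sorted frame.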

In [164]:
df.groupby(['status']).count()

status,unit_number,time_stamp,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,143570,143570,136461,136402,136375,136393,136399,136411,136277,136350,...,136375,136490,136527,136350,136544,136434,136337,136408,136582,136432
1,633,633,601,605,601,601,606,602,591,609,...,601,598,608,596,600,602,601,600,603,606


In [165]:
def setLabels(df, y_col, limit=40):
    """
    Function that takes in the dataframe (sorted by unit and time), the target-variable column,
    and a number defining the window period for failure
    """
    print("Initial split of functioning to non-functioning machines: ", df.groupby([y_col]).count())
    df[y_col] = df[y_col].replace(0, np.nan)  # replace all the 0s with NaNs, then work backwards
    df[y_col] = df[y_col].bfill(limit=limit)  # fill backward up to `limit` days; the data is daily
    df[y_col] = df[y_col].fillna(0).astype(int)  # fill the rest with integer zeros (not the string '0')
    print("Final split of functioning to non-functioning machines: ", df.groupby([y_col]).count())

    return df

## Gotta have this outside the function!

In [83]:
df = df.drop(['time_stamp', 'unit_number'], axis = 1)
status = df['status']
df = df.drop(['status'], axis = 1)

In [219]:
def normalize(df):
    """
    Function that takes in a dataset with numerical values and standardizes it
    """
    standard_sc = scale.StandardScaler()
    x_std = standard_sc.fit_transform(df)
    df_scaled = pd.DataFrame(x_std)
    return df_scaled

### Modelling

In [169]:
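def dataSplit(df, status, test_size=0.25):
    """
    Splits the features `df` and the labels `status` into train and test sets
    """
    # use the arguments that were passed in rather than global variables
    xtrain, xtest, ytrain, ytest = train_test_split(df, status, test_size=test_size, random_state=19)
    ytrain, ytest = ytrain.astype(int), ytest.astype(int)
    return xtrain, xtest, ytrain, ytest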

In [141]:
def metrics(ytest, pred):
    """
    Function to print evaluation metrics for predictions against the true labels
    """
    print('accuracy score: ', accuracy_score(ytest, pred))
    print('RMSE:', np.sqrt(mean_squared_error(ytest,pred)))  # RMSE is the square root of the MSE
    print('Recall score: ', recall_score(ytest,pred))
    
    print('average_precision_score: ', average_precision_score(ytest,pred))
    print('Precision Score: ',precision_score(ytest,pred))
    print('F1_score: ',f1_score(ytest, pred))
    print('roc_auc_score: ', roc_auc_score(ytest, pred))

In [None]:
def score(training_model):
    model = training_model.fit(xtrain.values,ytrain.values)
    pred = model.predict(xtest.values)
    metrics(ytest, pred)  # true labels first, then predictions
    return pred

In [176]:
def logisticRegression(xtrain,xtest,ytrain,ytest):
    LR = LogisticRegression()  # the labels are binary, so the default setting is sufficient
    pred = score(LR)
    return pred

def randomForestClassifier(xtrain,xtest,ytrain,ytest,n_estimators=25,min_samples_split=25,max_depth=5,random_state=72):
    RF = RandomForestClassifier(n_estimators=n_estimators, min_samples_split=min_samples_split, max_depth=max_depth, random_state=random_state)
    pred = score(RF)
    return pred

def gaussianNaiveBayes(xtrain,xtest,ytrain,ytest):
    GNB = GaussianNB()
    pred = score(GNB)
    return pred

def supportVectorMachine(xtrain,xtest,ytrain,ytest):
    svc = SVC(kernel='linear')
    pred = score(svc)
    return pred

In [None]:
def train(df, status):
    
    xtrain, xtest, ytrain, ytest = dataSplit(df, status, test_size=0.2)
    
    
    

In [None]:
X = pd.read_csv("predictive_maintenance_dataset.csv")

df = X.sort_values(by = ['unit_number', 'time_stamp'])
    
df = preProcessData(df)
    
df = setLabels(df, 'status', 40)
    
df = df.drop(['time_stamp', 'unit_number'], axis = 1)
    
df_scaled = normalize(df)

predictions =

In [178]:
def Predict(X):
    
    
    
    

    
    
    

## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [299]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
pd.read_csv("forecasting_dataset.csv").drop(labels=['date', 'time', 'y'], axis='columns')

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
0,-200.0,-200.0,7.2,867.0,-200.0,834.0,-200.0,1314.0,891.0,14.8,57.3,0.9603,
1,0.5,-200.0,3.9,704.0,-200.0,861.0,-200.0,1603.0,860.0,24.4,65.0,1.9612,Low
2,3.7,-200.0,23.3,1386.0,,626.0,109.0,2138.0,,23.3,38.6,1.0919,High
3,2.1,-200.0,12.1,1052.0,183.0,779.0,,1690.0,952.0,28.5,27.3,1.0479,High
4,4.4,-200.0,21.7,1342.0,786.0,499.0,206.0,1546.0,2006.0,12.9,54.1,0.8003,High
5,-200.0,,7.0,859.0,-200.0,892.0,-200.0,,754.0,17.7,54.0,1.0826,Very low
6,-200.0,-200.0,,1004.0,123.0,818.0,119.0,,783.0,32.4,22.4,1.0701,Very low
7,2.3,-200.0,,1035.0,100.0,746.0,112.0,1611.0,1062.0,27.9,27.6,1.0252,High
8,1.0,-200.0,4.5,737.0,49.0,899.0,63.0,,611.0,25.1,55.7,,Moderate
9,1.9,-200.0,10.2,984.0,110.0,838.0,92.0,1897.0,976.0,26.6,47.6,1.6315,High


In [301]:
df = pd.read_csv("forecasting_dataset.csv").sort_values(by = ['date', 'time'], ascending = True)
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(by = ['date', 'time'], ascending = [True, False])
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
6874,1/1/2005,2018-09-28 23:00:00,1091,1.7,-200.0,,773.0,,820.0,115.0,1003.0,1232.0,5.6,59.7,0.5463,High
353,1/1/2005,2018-09-28 22:00:00,1118,2.1,-200.0,6.4,830.0,295.0,765.0,130.0,1058.0,1313.0,5.7,59.9,0.5523,
2244,1/1/2005,2018-09-28 21:00:00,1176,2.3,-200.0,8.1,,334.0,718.0,137.0,1104.0,1389.0,6.2,59.6,0.5698,High
6069,1/1/2005,2018-09-28 20:00:00,1198,2.5,-200.0,7.9,897.0,402.0,720.0,151.0,1072.0,1436.0,7.8,54.6,0.5786,High
3046,1/1/2005,2018-09-28 19:00:00,1328,3.6,-200.0,11.4,1029.0,622.0,637.0,172.0,1188.0,1611.0,8.1,54.1,0.5882,High
7296,1/1/2005,2018-09-28 18:00:00,1472,4.7,-200.0,16.6,1198.0,832.0,555.0,191.0,1344.0,1735.0,,51.8,0.5961,
3712,1/1/2005,2018-09-28 17:00:00,1281,3.0,-200.0,12.1,1053.0,510.0,659.0,165.0,1192.0,1438.0,10.9,39.7,0.5166,High
4670,1/1/2005,2018-09-28 16:00:00,1102,2.1,-200.0,7.7,885.0,313.0,772.0,139.0,1051.0,1142.0,12.8,32.6,,High
4195,1/1/2005,2018-09-28 15:00:00,1085,2.2,-200.0,7.9,896.0,299.0,760.0,147.0,1049.0,1138.0,12.5,32.3,0.4670,High
1524,1/1/2005,2018-09-28 14:00:00,1117,2.4,-200.0,8.9,934.0,357.0,721.0,153.0,1075.0,1206.0,10.9,35.9,0.4680,High


The `time` column doesn't sort properly while it is stored as text: 1 am shows up after 7. Make sure to convert it to a datetime format first!

Now we shift `y` by 6 rows (one row per hour) to generate our labels.
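
A gap-tolerant variant of this step is sketched below. It rests on several assumptions that should be checked against the real file: dates look like `1/1/2005` and are month/day/year, times look like `23:00:00`, and each timestamp appears only once. The helper name `add_future_target` and the column name `y_future` are made up for this example. Instead of shifting by a fixed number of rows, it looks up the value of `y` at exactly six hours after each row's timestamp.

```python
import pandas as pd


def add_future_target(df, horizon_hours=6):
    """Return df sorted by time with `y_future`, the value of y `horizon_hours` later."""
    # Assumed formats: dates like "1/1/2005" (month/day/year), times like "23:00:00"
    timestamp = pd.to_datetime(df["date"] + " " + df["time"],
                               format="%m/%d/%Y %H:%M:%S")
    ordered = df.assign(timestamp=timestamp).sort_values("timestamp")
    # Look up y at (timestamp + horizon); rows without a matching timestamp get NaN
    y_by_time = ordered.set_index("timestamp")["y"]
    future_time = ordered["timestamp"] + pd.Timedelta(hours=horizon_hours)
    ordered["y_future"] = future_time.map(y_by_time)
    return ordered.dropna(subset=["y_future"])


# Toy data: eight hourly rows in shuffled order, with y equal to ten times the hour
hours = [3, 0, 7, 1, 6, 2, 5, 4]
toy = pd.DataFrame({
    "date": ["1/1/2005"] * len(hours),
    "time": [f"{h:02d}:00:00" for h in hours],
    "y": [float(h * 10) for h in hours],
})
with_target = add_future_target(toy)
```

Only the rows for 00:00 and 01:00 survive here, because they are the only ones whose timestamp plus six hours exists in the toy data.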

In [314]:
df['y_6_hours_later'] = df.y.shift(6)
df = df.iloc[6:]

In [315]:
df

Unnamed: 0,date,time,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,...,s5,t,rh,ah,High,Low,Moderate,Very High,Very low,y_6_hours_later
6211,1/1/2005,2018-09-28 11:00:00,1281.0,1.700000,-200.000000,5.400000,782.00000,225.000000,846.000000,113.000000,...,1019.000000,6.800000,48.600000,0.482200,1,0,0,0,0,1091.0
1909,1/1/2005,2018-09-28 10:00:00,1102.0,1.200000,-200.000000,4.700000,748.00000,190.000000,878.000000,97.000000,...,991.000000,4.700000,57.200000,0.493200,1,0,0,0,0,1118.0
8353,1/1/2005,2018-09-28 09:00:00,1085.0,1.000000,-200.000000,3.000000,649.00000,145.000000,996.000000,86.000000,...,970.416968,3.900000,59.000000,0.480700,1,0,0,0,0,1176.0
5968,1/1/2005,2018-09-28 08:00:00,1117.0,1.100000,-200.000000,3.000000,653.00000,169.000000,973.000000,94.000000,...,901.000000,2.600000,63.200000,0.472100,1,0,0,0,0,1198.0
236,1/1/2005,2018-09-28 07:00:00,1149.0,1.400000,-200.000000,4.500000,736.00000,168.000000,888.000000,97.000000,...,966.000000,3.000000,60.700000,0.466700,1,0,0,0,0,1328.0
3412,1/1/2005,2018-09-28 06:00:00,1097.0,1.500000,-154.657311,5.300000,777.00000,171.000000,859.000000,99.000000,...,1009.000000,3.500000,58.300000,0.463600,1,0,0,0,0,1472.0
7848,1/1/2005,2018-09-28 05:00:00,1011.0,1.400000,-200.000000,4.800000,753.00000,181.000000,879.000000,106.000000,...,1036.000000,4.200000,57.100000,0.475900,1,0,0,0,0,1281.0
3197,1/1/2005,2018-09-28 04:00:00,973.0,1.900000,-200.000000,5.600000,791.00000,253.000000,830.000000,126.000000,...,1131.000000,4.300000,55.300000,0.465000,1,0,0,0,0,1102.0
6035,1/1/2005,2018-09-28 03:00:00,939.0,2.700000,-200.000000,7.600000,881.00000,-200.000000,748.000000,-200.000000,...,1296.000000,4.900000,53.900000,0.469300,1,0,0,0,0,1085.0
2768,1/1/2005,2018-09-28 02:00:00,915.0,2.500000,-200.000000,7.500000,878.00000,300.000000,738.000000,129.000000,...,1355.000000,5.900000,50.000000,0.468900,1,0,0,0,0,1117.0


In [316]:
df = preProcessData(df)

In [284]:
df_scale = df.iloc[:, 2:]
df_scale

Unnamed: 0,y,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,High,Low,Moderate,Very High,Very low
3712,1091.0,3.000000,-200.000000,12.100000,1053.00000,510.000000,659.000000,165.000000,1192.000000,1438.000000,10.900000,39.700000,0.516600,1,0,0,0,0
4670,1118.0,2.100000,-200.000000,7.700000,885.00000,313.000000,772.000000,139.000000,1051.000000,1142.000000,12.800000,32.600000,-7.799045,1,0,0,0,0
4195,1176.0,2.200000,-200.000000,7.900000,896.00000,299.000000,760.000000,147.000000,1049.000000,1138.000000,12.500000,32.300000,0.467000,1,0,0,0,0
1524,1198.0,2.400000,-200.000000,8.900000,934.00000,357.000000,721.000000,153.000000,1075.000000,1206.000000,10.900000,35.900000,0.468000,1,0,0,0,0
5266,1328.0,2.200000,-200.000000,8.400000,914.00000,382.000000,742.000000,147.000000,1414.719486,1242.000000,9.500000,41.200000,0.490800,1,0,0,0,0
6498,1472.0,2.300000,-200.000000,1.128952,875.00000,261.000000,778.000000,125.000000,1080.000000,1151.000000,8.600000,43.600000,-7.799045,1,0,0,0,0
6211,1281.0,1.700000,-200.000000,5.400000,782.00000,225.000000,846.000000,113.000000,992.000000,1019.000000,6.800000,48.600000,0.482200,1,0,0,0,0
1909,1102.0,1.200000,-200.000000,4.700000,748.00000,190.000000,878.000000,97.000000,968.000000,991.000000,4.700000,57.200000,0.493200,1,0,0,0,0
8353,1085.0,1.000000,-200.000000,3.000000,649.00000,145.000000,996.000000,86.000000,897.000000,970.416968,3.900000,59.000000,0.480700,1,0,0,0,0
5968,1117.0,1.100000,-200.000000,3.000000,653.00000,169.000000,973.000000,94.000000,905.000000,901.000000,2.600000,63.200000,0.472100,1,0,0,0,0


In [285]:
df_scaled = normalize(df_scale)

In [291]:
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.143314,5.193151e-01,-0.316484,2.581822e-01,0.451259,1.395045e+00,-0.445220,0.909603,-4.770707e-01,1.042621,2.291174e-02,3.206680e-02,2.068201e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
1,0.222643,5.078192e-01,-0.316484,1.546368e-01,-0.040820,6.178126e-01,-0.093003,0.703779,-7.790962e-01,0.382597,6.674627e-02,-1.030617e-01,-7.289722e-16,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
2,0.393054,5.090965e-01,-0.316484,1.593434e-01,-0.008600,5.625779e-01,-0.130407,0.767109,-7.833802e-01,0.373678,5.982503e-02,-1.087713e-01,2.055865e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
3,0.457693,5.116511e-01,-0.316484,1.828764e-01,0.102703,7.914077e-01,-0.251969,0.814607,-7.276876e-01,0.525305,2.291174e-02,-4.025548e-02,2.056114e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
4,0.839648,5.090965e-01,-0.316484,1.711099e-01,0.044122,8.900413e-01,-0.186512,0.767109,4.870401e-16,0.605578,-9.387381e-03,6.061507e-02,2.061784e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
5,1.262737,5.103738e-01,-0.316484,1.045077e-17,-0.070110,4.126548e-01,-0.074302,0.592950,-7.169775e-01,0.402666,-3.015110e-02,1.062923e-01,-7.289722e-16,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
6,0.701556,5.027099e-01,-0.316484,1.005108e-01,-0.342511,2.706225e-01,0.137652,0.497954,-9.054757e-01,0.108331,-7.167855e-02,2.014532e-01,2.059645e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
7,0.175633,4.963233e-01,-0.316484,8.403764e-02,-0.442098,1.325356e-01,0.237395,0.371293,-9.568843e-01,0.045896,-1.201272e-01,3.651299e-01,2.062381e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
8,0.125685,4.937686e-01,-0.316484,4.403147e-02,-0.732073,-4.500485e-02,0.605196,0.284214,-1.108968e+00,0.000000,-1.385839e-01,3.993878e-01,2.059272e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
9,0.219705,4.950460e-01,-0.316484,4.403147e-02,-0.720357,4.968337e-02,0.533506,0.347544,-1.091832e+00,-0.154787,-1.685759e-01,4.793230e-01,2.057133e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667


In [290]:
y = df_scaled.iloc[:, :1]
df_scaled 
y

Unnamed: 0,0
0,0.143314
1,0.222643
2,0.393054
3,0.457693
4,0.839648
5,1.262737
6,0.701556
7,0.175633
8,0.125685
9,0.219705


In [321]:
df_X = df.iloc[:, 3:-1]

In [322]:
df_X

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,High,Low,Moderate,Very High,Very low
6211,1.700000,-200.000000,5.400000,782.00000,225.000000,846.000000,113.000000,992.000000,1019.000000,6.800000,48.600000,0.482200,1,0,0,0,0
1909,1.200000,-200.000000,4.700000,748.00000,190.000000,878.000000,97.000000,968.000000,991.000000,4.700000,57.200000,0.493200,1,0,0,0,0
8353,1.000000,-200.000000,3.000000,649.00000,145.000000,996.000000,86.000000,897.000000,970.416968,3.900000,59.000000,0.480700,1,0,0,0,0
5968,1.100000,-200.000000,3.000000,653.00000,169.000000,973.000000,94.000000,905.000000,901.000000,2.600000,63.200000,0.472100,1,0,0,0,0
236,1.400000,-200.000000,4.500000,736.00000,168.000000,888.000000,97.000000,945.000000,966.000000,3.000000,60.700000,0.466700,1,0,0,0,0
3412,1.500000,-154.657311,5.300000,777.00000,171.000000,859.000000,99.000000,954.000000,1009.000000,3.500000,58.300000,0.463600,1,0,0,0,0
7848,1.400000,-200.000000,4.800000,753.00000,181.000000,879.000000,106.000000,942.000000,1036.000000,4.200000,57.100000,0.475900,1,0,0,0,0
3197,1.900000,-200.000000,5.600000,791.00000,253.000000,830.000000,126.000000,967.000000,1131.000000,4.300000,55.300000,0.465000,1,0,0,0,0
6035,2.700000,-200.000000,7.600000,881.00000,-200.000000,748.000000,-200.000000,1001.000000,1296.000000,4.900000,53.900000,0.469300,1,0,0,0,0
2768,2.500000,-200.000000,7.500000,878.00000,300.000000,738.000000,129.000000,1002.000000,1355.000000,5.900000,50.000000,0.468900,1,0,0,0,0


In [323]:
y = df['y_6_hours_later']
y

6211    1091.0
1909    1118.0
8353    1176.0
5968    1198.0
236     1328.0
3412    1472.0
7848    1281.0
3197    1102.0
6035    1085.0
2768    1117.0
547     1149.0
3051    1097.0
2107    1011.0
3903     973.0
4557     939.0
7502     915.0
5242     974.0
6354    1001.0
2558    1004.0
4823    1054.0
3878    1163.0
5896    1173.0
694     1275.0
2920    1046.0
6068    1177.0
7782    1312.0
692     1361.0
5157    1385.0
2001    1365.0
3065    1366.0
         ...  
8173    -200.0
1185    -200.0
5845    -200.0
2992    -200.0
332     -200.0
2150    -200.0
1670    -200.0
5836    -200.0
6350    1297.0
1219    -200.0
3802    -200.0
2172    -200.0
5784    -200.0
1764    -200.0
6216    -200.0
7545    -200.0
3473    -200.0
4160    -200.0
4849    1081.0
1545    1089.0
2564    1123.0
5954    1299.0
6122    1329.0
2899    1425.0
6668    1315.0
5156    1166.0
1820    1148.0
5965    1160.0
1925    1096.0
4525    1054.0
Name: y_6_hours_later, Length: 8409, dtype: float64

In [234]:
df_scaled = normalize(df_X)

In [235]:
df_scaled

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,5.193151e-01,-0.316484,2.581822e-01,0.451259,1.395045e+00,-0.445220,0.909603,-4.770707e-01,1.042621,2.291174e-02,3.206680e-02,2.068201e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
1,5.078192e-01,-0.316484,1.546368e-01,-0.040820,6.178126e-01,-0.093003,0.703779,-7.790962e-01,0.382597,6.674627e-02,-1.030617e-01,-7.289722e-16,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
2,5.090965e-01,-0.316484,1.593434e-01,-0.008600,5.625779e-01,-0.130407,0.767109,-7.833802e-01,0.373678,5.982503e-02,-1.087713e-01,2.055865e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
3,5.116511e-01,-0.316484,1.828764e-01,0.102703,7.914077e-01,-0.251969,0.814607,-7.276876e-01,0.525305,2.291174e-02,-4.025548e-02,2.056114e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
4,5.090965e-01,-0.316484,1.711099e-01,0.044122,8.900413e-01,-0.186512,0.767109,4.870401e-16,0.605578,-9.387381e-03,6.061507e-02,2.061784e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
5,5.103738e-01,-0.316484,1.045077e-17,-0.070110,4.126548e-01,-0.074302,0.592950,-7.169775e-01,0.402666,-3.015110e-02,1.062923e-01,-7.289722e-16,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
6,5.027099e-01,-0.316484,1.005108e-01,-0.342511,2.706225e-01,0.137652,0.497954,-9.054757e-01,0.108331,-7.167855e-02,2.014532e-01,2.059645e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
7,4.963233e-01,-0.316484,8.403764e-02,-0.442098,1.325356e-01,0.237395,0.371293,-9.568843e-01,0.045896,-1.201272e-01,3.651299e-01,2.062381e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
8,4.937686e-01,-0.316484,4.403147e-02,-0.732073,-4.500485e-02,0.605196,0.284214,-1.108968e+00,0.000000,-1.385839e-01,3.993878e-01,2.059272e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667
9,4.950460e-01,-0.316484,4.403147e-02,-0.720357,4.968337e-02,0.533506,0.347544,-1.091832e+00,-0.154787,-1.685759e-01,4.793230e-01,2.057133e-01,0.764882,-0.420622,-0.170973,-0.03779,-0.482667


In [226]:
# might want to fill the NaNs with a moving window average instead
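
A minimal sketch of that moving-window idea, assuming the series is already in time order. The helper name `fill_with_rolling_mean` and the window size are illustrative choices. Each NaN is replaced by the mean of the observations in the trailing window, and the overall mean is used as a fallback when the window holds no observed values.

```python
import numpy as np
import pandas as pd


def fill_with_rolling_mean(series, window=3):
    """Fill NaNs with the mean of the trailing window (hypothetical helper)."""
    # min_periods=1 lets the mean be computed from whatever values are present
    rolling_mean = series.rolling(window, min_periods=1).mean()
    filled = series.fillna(rolling_mean)
    # Fall back to the overall mean when the whole window was NaN
    return filled.fillna(series.mean())


toy = pd.Series([1.0, np.nan, 3.0, 5.0, np.nan])
toy_filled = fill_with_rolling_mean(toy, window=3)

leading_gap = pd.Series([np.nan, 2.0, 4.0])
leading_gap_filled = fill_with_rolling_mean(leading_gap, window=3)
```

Compared with filling by the column mean, this keeps imputed values close to their neighbours in time, which may matter for a forecasting task.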


### Modelling

In [324]:
xtrain, xtest, ytrain, ytest = train_test_split(df_X, y, test_size = 0.2, random_state = 19 )

In [325]:
ytrain.describe()

count    6727.000000
mean     1039.321540
std       342.067327
min      -200.000000
25%       913.000000
50%      1049.000000
75%      1220.000000
max      2008.000000
Name: y_6_hours_later, dtype: float64

In [310]:
# y is continuous, so fit a regression model; the estimator must be instantiated before calling fit
LR = LinearRegression()
LR.fit(xtrain, ytrain)
print("Train R2 score is:", round(LR.score(xtrain, ytrain), 3))  # score() returns R2 for regressors
print("Train RMSE is: {} \n".format(np.sqrt(mean_squared_error(ytrain, LR.predict(xtrain))).round(3)))

In [338]:
def linearRegression(xtrain,xtest, ytrain, ytest):
    LR = LinearRegression()
    model = LR.fit(xtrain.values,ytrain.values)
    score = LR.score(xtest, ytest)
    pred = model.predict(xtest.values)
    print(score)
    metrics(ytest,pred)

In [339]:
def metrics(ytest, pred):
    """
    Function to print evaluation metrics for regression predictions against the true values
    """
    #print('accuracy score: ', accuracy_score(ytest, pred))
    # mean_squared_error returns the MSE; take np.sqrt of it to get the RMSE
    print('MSE:', mean_squared_error(ytest,pred))
  #  print('Recall score: ', recall_score(ytest,pred))
    
  #  print('average_precision_score: ', average_precision_score(ytest,pred))
  #  print('Precision Score: ',precision_score(ytest,pred))
  #  print('F1_score: ',f1_score(ytest, pred))
  #  print('roc_auc_score: ', roc_auc_score(ytest, pred))

In [340]:
linearRegression(xtrain,xtest, ytrain, ytest)

0.209674398311
MSE: 87984.9846328


In [298]:
ytest

Unnamed: 0,0
6029,-0.409052
8142,-0.594154
5295,-0.229827
5276,0.005222
2086,-0.276837
114,-0.444310
3576,1.671135
8123,-0.514825
4702,0.046356
2939,-0.235703
