# Machine Learning Challenge

Below are 2 data challenges that test for your ability to:
- Wrangle/clean data to make it usable by a model
- Figure out how to set up X's and y's for a use case, given a dataset
- Write code to robustly and reproducibly preprocess data
- Pick/design the right model, and tune hyperparameters to get the best performance

You can use any programming language, model, and package to solve these problems. Let us know of any assumptions you make in your process.

#### Deliverables:
- A link to a GitHub repository that contains:
    - Clearly commented code that was written to solve these problems
    - Your trained models stored in a file (`.pkl`, `.h5`, `.tar` - whatever is appropriate). The models must have `predict(X)` functions.
    - A README file that contains:
        - Instructions to easily access/load the above
        - A writeup explaining any significant design decisions and your reasons for making them. 
        - If needed, a brief writeup explaining anything you are particularly proud of in your implementation that you might want us to focus on

#### How we'll assess your work:
- Accuracy/RMSE of your model when predicting on held-out data
- How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:
    - A new column that wasn't present in the dataset given to you
    - New value in a categorical field that wasn't seen in the dataset given to you
    - NA values
- Efficiency of the code. 
    - Is it easy to understand? 
    - Are the variable names descriptive? 
    - Are there any variables created that aren't used? 
    - Is redundant code replaced with function calls? 
    - Is vectorized implementation used instead of nested for loops? 
    - Are classes defined and objects created where applicable? 
    - Are packages used to perform tasks instead of implementing them from scratch?
    
**NOTE:** Your stored models, once loaded, should *just work* when fed with our held-out data (which looks similar to the data we've given you). We won't do any preprocessing before we feed it into the model's `predict(X)` function; `predict(X)` should handle the preprocessing. Pay particular attention to handling the edge cases we've talked about.

Feel free to ask questions to clarify things. Submit everything you tried, not just the things that worked. We encourage you to showcase your talents: the more you go above and beyond what's expected, the more impressed we'll be. **Bonus points if you fit Keras/TensorFlow/PyTorch/Caffe models** in addition to your linear/tree-based models.

## 0. Import dependencies

In [1]:
import pandas as pd
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt
from math import sqrt

from sklearn import preprocessing as scale
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed from scikit-learn; joblib is imported directly instead

import xgboost 
from xgboost import XGBRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from sklearn.metrics import mean_squared_error, accuracy_score, average_precision_score, precision_score, f1_score,recall_score, roc_auc_score



## Task 1
`predictive_maintenance_dataset.csv` is a file that contains parameters and settings (`operational_setting_1`, `operational_setting_2`, `sensor_measurement_1`, `sensor_measurement_2`, etc.) for many wind turbines. There is a column called `unit_number` which specifies which turbine it is, and one called `status`, in which a value of 1 means the turbine broke down that day, and 0 means it didn't. Your task is to create a model that, when fed with operational settings and sensor measurements (`unit_number` and `time_stamp` will *not* be fed in), outputs 1 if the turbine will break down within the next 40 days, and 0 if not.

**NOTE:** The model should output 1 if the turbine is anywhere between 40 and 0 days away from failure, not *only* 40 days from failure.

In [47]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the operational_setting_3 column looks like
df_X = pd.read_csv("predictive_maintenance_dataset.csv").drop(labels=['status', 'unit_number', 'time_stamp'], axis='columns')
df_X.head()

Unnamed: 0,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,sensor_measurement_6,sensor_measurement_7,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
0,42.0007,0.8415,High,445.0,,1362.47,1143.17,3.91,5.7,142.53,...,133.75,2388.5,8129.92,9.1182,,332.0,2212.0,100.0,10.77,6.5717
1,-0.0023,0.0004,High,518.67,642.33,1581.03,1400.06,14.62,21.61,554.6,...,522.19,2388.0,8135.7,8.3817,0.03,393.0,2388.0,100.0,39.07,23.3958
2,,0.6216,Low,462.54,536.71,1250.87,1037.52,7.05,9.0,174.56,...,163.11,2028.06,7867.9,10.8827,,306.0,1915.0,84.93,14.33,8.6202
3,42.0006,,High,,549.28,1349.42,1114.02,3.91,5.71,137.97,...,130.58,2387.71,8074.81,9.3776,0.02,,2212.0,100.0,10.6,6.2614
4,-0.0016,0.0004,High,518.67,643.84,1604.53,1431.41,14.62,21.61,551.3,...,519.44,2388.24,8135.95,8.5223,0.03,396.0,2388.0,100.0,38.39,23.0682


In [7]:
df = pd.read_csv("predictive_maintenance_dataset.csv").sort_values(by = ['unit_number', 'time_stamp'], ascending = True)
df.head()

Unnamed: 0,unit_number,time_stamp,status,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,...,sensor_measurement_12,sensor_measurement_13,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21
73382,2,2017-04-01 12:00:00,0,-0.0018,0.0006,High,518.67,641.89,1583.84,1391.28,...,522.33,2388.06,8137.72,8.3905,0.03,391.0,2388.0,100.0,38.94,23.4585
90923,2,2017-04-02 12:00:00,0,0.0043,-0.0003,High,518.67,641.82,1587.05,1393.13,...,522.7,2387.98,8131.09,8.4167,0.03,,2388.0,100.0,39.06,23.4085
82527,2,2017-04-03 12:00:00,0,0.0018,0.0003,High,518.67,641.55,1588.32,1398.96,...,522.58,2387.99,8140.58,8.3802,0.03,391.0,2388.0,100.0,39.11,23.425
96521,2,2017-04-04 12:00:00,0,0.0035,-0.0004,High,518.67,641.68,1584.15,1396.08,...,522.49,2387.93,8140.44,8.4018,0.03,391.0,2388.0,100.0,39.13,23.5027
73137,2,2017-04-05 12:00:00,0,0.0005,0.0004,High,518.67,641.73,1579.03,1402.52,...,522.27,2387.94,8136.67,8.3867,0.03,390.0,2388.0,100.0,39.18,23.4234


### 1. Preprocess data

In [2]:
def preProcess(data):
    """
    Function to preprocess similar datasets: 
    Takes in a dataframe, forward-fills null values (then back-fills any that remain at the start),
    and replaces categorical value columns with dummy variables
    """
    
    df = data.copy()         #Work on a copy so the caller's dataframe is left untouched
    df = df.ffill()          #As the data is arranged chronologically, we fill each missing value with that of the previous hour/day
    df = df.bfill()          #In case some NaNs are at the start
    
    categorical_columns = df.select_dtypes(include=['object']).columns
    dummy_columns = pd.get_dummies(df[categorical_columns])
    
    df = pd.concat([df.drop(categorical_columns, axis=1), dummy_columns], axis=1)
    
    return df

### 2. Setting Labels

We have an interesting case here: we need to check whether a turbine is going to fail in 40 days or less. In other words, given all the parameters for a given day, we want to predict whether that unit fails within the following 40-day window.

So we just have to identify the date on which each turbine failed and mark the rows going back up to a maximum of 40 days before it as failures as well.

In [9]:
def setLabels(df,y_col, limit=40):
    """
    Function that takes in the dataframe, the target-variable column, and a number defining the window period for failure. 
    Returns a dataframe with the required target variable
    """
     
    y_col_new = y_col + '_new'
    df[y_col_new] = df[y_col].replace(0, np.nan) #Let's replace all the 0s with NaNs and then we work backwards
    df[y_col_new] = df[y_col_new].bfill(limit=limit) # fill backward up to `limit` rows. Thankfully the data is frequent and daily, so one row is one day
    df[y_col_new] = df[y_col_new].fillna(0) #fill the rest with zeros
    df.drop(y_col, axis = 1, inplace=True)
    
    return df
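
The backward fill above works on row positions, so it relies on the rows being sorted by unit and time with exactly one row per day. As a hedged alternative, the sketch below computes the number of days to the next failure from the timestamps, separately for each unit. The helper name `label_failure_window` is hypothetical, and the sketch assumes the `unit_number`, `time_stamp`, and `status` columns are still present when it is called.

```python
import pandas as pd

def label_failure_window(df, window_days=40):
    """Return 1 for rows that are between 0 and `window_days` days before a failure of the same unit, else 0."""
    ordered = df.assign(time_stamp=pd.to_datetime(df["time_stamp"]))
    ordered = ordered.sort_values(["unit_number", "time_stamp"])

    # Timestamp of each failure row; NaT everywhere else.
    failure_time = ordered["time_stamp"].where(ordered["status"] == 1)
    # Propagate each failure time backwards, but never across units.
    next_failure = failure_time.groupby(ordered["unit_number"]).bfill()

    days_to_failure = (next_failure - ordered["time_stamp"]).dt.days
    labels = days_to_failure.between(0, window_days).astype(int)
    return labels.reindex(df.index)

# Toy data (hypothetical): unit 1 fails on its 50th day, unit 2 never fails.
toy = pd.DataFrame({
    "unit_number": [1] * 50 + [2] * 10,
    "time_stamp": list(pd.date_range("2017-01-01", periods=50))
                  + list(pd.date_range("2017-01-01", periods=10)),
    "status": [0] * 49 + [1] + [0] * 10,
})
toy_labels = label_failure_window(toy)
```

In the toy data, the failure day and the 40 days before it are labelled 1 for unit 1, and every row of unit 2 stays 0 because no failure is observed for it.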

In [10]:
def normalize(df):
    """
    Function that takes in a dataset with numerical values and standardizes it
    """
    standard_sc = scale.StandardScaler()
    x_std = standard_sc.fit_transform(df)
    df_scaled = pd.DataFrame(x_std)
    return df_scaled

### 3. Modelling

In [3]:
def dataSplit(df_X, y, dtype, test_size=0.2):
    """Function to split the data into training and testing sets and convert the target variable to the required type"""
    xtrain, xtest, ytrain, ytest = train_test_split(df_X, y, test_size = test_size, random_state = 19)
    
    ytrain, ytest = ytrain.astype(dtype), ytest.astype(dtype)
    return xtrain, xtest, ytrain, ytest
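
A random row-level split places days from the same turbine in both the training and the test set, which can make the test scores look better than they would on unseen turbines. The sketch below shows one possible alternative, splitting by group with scikit-learn's `GroupShuffleSplit`. The helper name `split_by_group` and the toy data are hypothetical, and in this notebook `unit_number` would have to be set aside before it is dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_group(X, y, groups, test_size=0.2, random_state=19):
    """Split X and y so that every group ends up entirely in train or entirely in test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

# Toy data (hypothetical): 10 units with 5 rows each.
units = pd.Series(np.repeat(np.arange(10), 5))
features = pd.DataFrame({"sensor": np.arange(50, dtype=float)})
target = pd.Series(np.tile([0, 0, 0, 1, 1], 10))

xtrain_g, xtest_g, ytrain_g, ytest_g = split_by_group(features, target, units)
train_units = set(units.loc[xtrain_g.index])
test_units = set(units.loc[xtest_g.index])
```

After the split, `train_units` and `test_units` share no unit, so the test rows come only from turbines the model has not seen.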

In [12]:
def metrics(ytest, pred):
    """
    Function to print evaluation metrics for a model's predictions against the true labels
    """
    print('accuracy score: ', accuracy_score(ytest, pred))
    print('Recall score: ', recall_score(ytest,pred))
    
    print('Precision Score: ',precision_score(ytest,pred))
    print('F1_score: ',f1_score(ytest, pred))
    print('roc_auc_score: ', roc_auc_score(ytest, pred))
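
`roc_auc_score` is computed above from hard 0/1 predictions, which reduces the ROC curve to a single operating point. The area under the curve is usually computed from scores or probabilities instead. The following is a small sketch on synthetic data (all names and data here are illustrative, not part of the notebook) that contrasts the two calls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem, for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X[:150], y[:150])

hard_labels = clf.predict(X[150:])                   # 0/1 predictions
positive_scores = clf.predict_proba(X[150:])[:, 1]   # probability of class 1

auc_from_labels = roc_auc_score(y[150:], hard_labels)
auc_from_scores = roc_auc_score(y[150:], positive_scores)
print("AUC from hard labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

Any classifier that exposes `predict_proba`, including the XGBoost and random forest classifiers used here, can be evaluated the same way.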

In [13]:
def score(training_model, xtrain, xtest, ytrain, ytest):
    """
    Function to receive a model and the train/test split, fit the model, and evaluate its predictions
    """
    model = training_model.fit(xtrain.values,ytrain.values)
    pred = model.predict(xtest.values)
    metrics(ytest,pred)
    return pred

In [3]:
def logisticRegression(xtrain,xtest,ytrain,ytest):
    LR = LogisticRegression()
    pred = score(LR, xtrain, xtest, ytrain, ytest)
    return pred

def randomForestClassifier(xtrain,xtest,ytrain,ytest,n_estimators=50,min_samples_split=25,max_depth=10,random_state=72):
    RF = RandomForestClassifier(n_estimators=n_estimators, min_samples_split=min_samples_split, max_depth=max_depth, random_state=random_state)
    pred = score(RF, xtrain, xtest, ytrain, ytest)
    return pred

In [14]:
def readData(dataset):
    return pd.read_csv(dataset)

In [15]:
def sortDrop(dataset, first_column, second_column):
    """Function that sorts by the first and second columns and then drops them (very specific, I know)"""
    return dataset.sort_values(by = [first_column, second_column]).drop([first_column,second_column],axis=1)

In [16]:
def xgbClassifier(xtrain,xval,ytrain,yval):
    """Function that fits an XGB Classifier, prints the validation metrics, and saves the fitted model with joblib"""
    xgb = xgboost.XGBClassifier(max_depth=20, n_estimators=200, learning_rate=0.05, objective='binary:logistic')
    xgb.fit(xtrain,ytrain)
    pred = xgb.predict(xval)
    metrics(yval, pred)
    
    joblib.dump(xgb, "xgbmodel")
    print("model saved")

In [24]:
def train(dataset):
    """
    Function that receives the filename as a string, reads the data, preprocesses, splits, and trains with an XGB Classifier
    which then saves the model"""
    
    df = readData(dataset)
    
    df = sortDrop(df, 'unit_number', 'time_stamp')
    
    df = preProcess(df)
    
    df = setLabels(df, 'status', 40)
    
    df_y = df['status_new']
    df_X = df.drop('status_new', axis = 1)
    
    df_X = normalize(df_X)
    
    xtrain, xtest, ytrain, ytest = dataSplit(df_X, df_y, int, test_size=0.25)
    
    xgbClassifier(xtrain,xtest,ytrain,ytest)
        

In [20]:
train("predictive_maintenance_dataset.csv")


accuracy score:  0.9476574852292585
Recall score:  0.8214613268113726
Precision Score:  0.8819957328081405
F1_score:  0.8506529481598734
roc_auc_score:  0.8985479394909484
model saved




In [21]:
def Predict(dataset):
    """
    Function that receives a dataframe of features, preprocesses and standardizes it,
    loads the saved XGB model, and returns its predictions
    """
    df = preProcess(dataset)
    
    df = normalize(df)
    
    loaded_model = joblib.load("xgbmodel")
    
    y_pred = loaded_model.predict(df)
    
    return y_pred

In [48]:
Predict(df_X)
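
Because `pd.get_dummies` builds its columns from whatever values are present, held-out data with an unseen category or an extra column produces a different column set from the one the model was trained on. One possible safeguard, sketched below, is to store the training column list and reindex new data to it before predicting. The helper name `align_to_training_columns` and the toy frames are hypothetical.

```python
import pandas as pd

def align_to_training_columns(df, training_columns):
    """Drop columns not seen in training and add missing ones (filled with 0), in training order."""
    return df.reindex(columns=training_columns, fill_value=0)

# Columns produced at training time (toy example).
train_features = pd.get_dummies(
    pd.DataFrame({"sensor": [1.0, 2.0], "setting": ["High", "Low"]})
)
training_columns = list(train_features.columns)

# New data with an unseen category ("Medium") and an extra column ("extra").
new_features = pd.get_dummies(
    pd.DataFrame({"sensor": [3.0], "setting": ["Medium"], "extra": [9]})
)
aligned = align_to_training_columns(new_features, training_columns)
```

The list in `training_columns` would need to be saved next to the model (for example in the same joblib file) so that it is available at prediction time.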

## Task 2
`forecasting_dataset.csv` is a file that contains pollution data for a city. Your task is to create a model that, when fed with columns `co_gt`, `nhmc`, `c6h6`, `s2`, `nox`, `s3`, `no2`, `s4`, `s5`, `t`, `rh`, `ah`, and `level`, predicts the value of `y` six hours later.

**NOTE:** In the data we've given you, the value of `y` for a given row is the value of `y` *for the timestamp of that same row*. We're asking you to predict the value of `y` 6 hours *after the timestamp of that row*.

In [20]:
## What the data that we'll feed into your model's predict(X) function will look like:
# Notice what the level column looks like
A = pd.read_csv("forecasting_dataset.csv").drop(labels=['date', 'time', 'y'], axis='columns')
A.tail()

Unnamed: 0,co_gt,nhmc,c6h6,s2,nox,s3,no2,s4,s5,t,rh,ah,level
8416,-200.0,-200.0,1.9,578.0,-200.0,1017.0,-200.0,876.0,607.0,5.8,59.0,0.5493,Very low
8417,1.3,-200.0,9.7,968.0,115.0,1276.0,112.0,1325.0,768.0,31.8,13.8,0.6394,High
8418,,54.0,2.6,626.0,39.0,1325.0,52.0,1284.0,755.0,9.6,58.5,0.7,Low
8419,-200.0,-200.0,21.0,1324.0,-200.0,527.0,-200.0,1886.0,1571.0,22.9,52.6,1.4519,Very low
8420,1.6,-200.0,6.4,832.0,244.0,748.0,119.0,1230.0,934.0,,59.9,0.9261,High


In [4]:
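def readData2(data):
    data = pd.read_csv(data)
    # Combine the date and time columns into a single timestamp and sort chronologically
    data['date_time'] = pd.to_datetime(data['date'] + ' ' + data['time'])
    data = data.sort_values(by = ['date_time']).drop(['date', 'time', 'date_time'], axis=1)
    return data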
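
The preview of the held-out data above shows the value `-200.0` in several columns (`co_gt`, `nhmc`, `nox`, `no2`). This looks like a placeholder for a missing reading rather than a real measurement, although the task description does not say so. Under that assumption, a sketch like the following would turn the placeholder into `NaN` so that it is handled by the same fill logic as other missing values. The helper name `mask_sentinel` is hypothetical.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # assumption: -200 marks a missing reading, not a real value

def mask_sentinel(df, sentinel=MISSING_SENTINEL):
    """Return a copy of df in which the sentinel is replaced by NaN in numeric columns."""
    numeric_cols = df.select_dtypes(include="number").columns
    cleaned = df.copy()
    cleaned[numeric_cols] = cleaned[numeric_cols].replace(sentinel, np.nan)
    return cleaned

# Toy rows modelled on the preview above.
sample = pd.DataFrame({
    "co_gt": [-200.0, 1.3],
    "nox": [-200.0, 115.0],
    "level": ["Very low", "High"],
})
cleaned = mask_sentinel(sample)
```

Whether to apply this depends on confirming what `-200` means in the dataset; if it is a genuine value, this step should be skipped.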

In [5]:
def setLabels2(df):
    """
    Function that converts the time series to a supervised learning problem by shifting the time labels by 6 hours
    """
    df['y_6_hours_later'] = df.y.shift(-6)
    df.drop('y', axis=1, inplace=True)
    df.dropna(inplace=True)
    return df
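
Shifting by six rows equals shifting by six hours only if the data has exactly one row per hour with no gaps. As a hedged alternative, the sketch below shifts by time instead of by position, so a row whose timestamp six hours later is absent gets `NaN` rather than a value from the wrong hour. The helper name `add_future_target` is hypothetical, and the sketch assumes the timestamp is kept as a unique `DatetimeIndex` instead of being dropped.

```python
import numpy as np
import pandas as pd

def add_future_target(df, target="y", horizon=pd.Timedelta(hours=6)):
    """Attach the target observed `horizon` after each row's timestamp (NaN if that timestamp is absent)."""
    # Move each observation's index back by `horizon`, so y(t + horizon) is labelled with t.
    future = df[target].shift(periods=-1, freq=horizon)
    out = df.copy()
    out[target + "_future"] = future  # assignment aligns on the datetime index
    return out

# Toy hourly series (hypothetical) where y equals the hour number and hour 7 is missing.
index = pd.date_range("2020-01-01 00:00", periods=10, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"y": np.arange(10, dtype=float)}, index=index)
hourly = hourly.drop(index[7])

with_target = add_future_target(hourly)
```

In the toy series, the row at hour 0 receives the value from hour 6, while the row at hour 1 receives `NaN` because hour 7 is missing; a positional `shift(-6)` would have given it the value from hour 8.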

In [6]:
def generateXandY(df):
    """
    Function that splits the dataset into training examples and labels
    """
    df_X = df.iloc[:,:-1]
    df_y = df.iloc[:,-1]     #Select the last column as a 1-D Series, which is the shape scikit-learn expects for y
    return df_X, df_y

In [25]:
def randomForestRegression(xtrain,xtest, ytrain, ytest):
    RFR = RandomForestRegressor(max_depth = 25,random_state = 9, n_estimators = 210, min_samples_split = 2,min_samples_leaf =1)
    
    model = RFR.fit(xtrain,ytrain)
    pred = model.predict(xtest)
    
    R2score = RFR.score(xtest, ytest)    
    print("R2 score",(R2score))
    print(sqrt(mean_squared_error(ytest,pred)))
    
    joblib.dump(model, "RandomForestRegressormodel")
    print ("model saved")
    
    #metrics(ytest,pred)

In [26]:
def train(dataset):
    
    df = readData2(dataset)
    df = preProcess(df)
    df = setLabels2(df)
    
    df_X, df_y = generateXandY(df)
    
    xtrain, xtest, ytrain, ytest = dataSplit(df_X, df_y, float, test_size=0.25)
    
    randomForestRegression(xtrain, xtest, ytrain, ytest)    
       

In [27]:
train("forecasting_dataset.csv")



R2 score 0.5902549350238033
221.45892805223028
model saved


In [13]:
def predict(dataset):    
    """
    Function that receives a dataframe of features, preprocesses it,
    loads the saved Random Forest Regressor, and returns its predictions
    """
    df = preProcess(dataset)
    
    loaded_model = joblib.load("RandomForestRegressormodel")
    
    y_pred = loaded_model.predict(df)
    return y_pred

In [17]:
predict(A)

array([ 893.56571429, 1209.14214286,  819.57142857, ..., 1196.71142857,
       1230.38857143, 1103.31857143])