In this project, I use machine learning to predict mechanical failures of oil well machinery. I was given a training set of 60,000 rows of readings taken from 107 sensors on different wells, along with a target column that is 0 for a surface-related failure and 1 for a below-ground failure. The goal was to use this training set to predict where each well in a test set would fail (at the surface or below ground).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree, metrics
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE


import warnings
warnings.filterwarnings('ignore')

For this project, I decided to fill NaN cells with the mean of their column. Because every column is a numerical sensor reading, I felt this was a fair way to handle missing values, especially since the Kaggle competition did not allow rows to be removed when training the model or predicting values. Ideally, I would also have trained a separate model on only the rows without NaN values and compared the two to see which was more accurate.

In [4]:
train = pd.read_csv('equip_failures_training_set.csv')
test = pd.read_csv('equip_failures_test_set.csv')

train.fillna(train.mean(), inplace=True)
test.fillna(test.mean(), inplace=True)
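
The cell above fills each frame with its own column means. As a minimal sketch of a variant, and of the complete-rows comparison mentioned earlier, the toy example below computes the means on the training frame only and reuses them for the test frame. The frames and values are made up for illustration, and this is an assumption about how one might do it, not what this notebook ran.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the real train/test sets (hypothetical values).
toy_train = pd.DataFrame({"sensor_a": [1.0, np.nan, 3.0],
                          "sensor_b": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"sensor_a": [np.nan, 5.0],
                         "sensor_b": [np.nan, 40.0]})

# Compute the column means once, on the training data only.
train_means = toy_train.mean()

# Reuse the same means for both frames so they are imputed consistently.
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

# Alternative for comparison: keep only the rows with no missing values.
toy_train_complete = toy_train.dropna()
```

Reusing the training means keeps information from the test set out of the preprocessing step; whether that changes the score on this data set is untested here.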

In [5]:
train.head()

Unnamed: 0,id,target,sensor1_measure,sensor2_measure,sensor3_measure,sensor4_measure,sensor5_measure,sensor6_measure,sensor7_histogram_bin0,sensor7_histogram_bin1,...,sensor105_histogram_bin2,sensor105_histogram_bin3,sensor105_histogram_bin4,sensor105_histogram_bin5,sensor105_histogram_bin6,sensor105_histogram_bin7,sensor105_histogram_bin8,sensor105_histogram_bin9,sensor106_measure,sensor107_measure
0,1,0,76698,0.713189,2130706000.0,280.0,0.0,0.0,0.0,0.0,...,1240520.0,493384.0,721044.0,469792.0,339156.0,157956.0,73224.0,0.0,0.0,0.0
1,2,0,33058,0.713189,0.0,190620.639314,0.0,0.0,0.0,0.0,...,421400.0,178064.0,293306.0,245416.0,133654.0,81140.0,97576.0,1500.0,0.0,0.0
2,3,0,41040,0.713189,228.0,100.0,0.0,0.0,0.0,0.0,...,277378.0,159812.0,423992.0,409564.0,320746.0,158022.0,95128.0,514.0,0.0,0.0
3,4,0,12,0.0,70.0,66.0,0.0,10.0,0.0,0.0,...,240.0,46.0,58.0,44.0,10.0,0.0,0.0,0.0,4.0,32.0
4,5,0,60874,0.713189,1368.0,458.0,0.0,0.0,0.0,0.0,...,622012.0,229790.0,405298.0,347188.0,286954.0,311560.0,433954.0,1218.0,0.0,0.0


In this data set, 100 of the sensors take a single measurement, while 7 of the sensors (sensors 7, 24, 25, 26, 64, 69 and 105) each have 10 values taken at equal time intervals. I decided to try 3 models: one using all of the values for each of these sensors, one using the mean of the 10 values, and one using the median of the 10 values. This gives me more chances to find the most accurate model.

I am first going to use the data set as is with all 10 values for each of the histogram sensors.

In [6]:
y_train = train[train.columns[1]]
X_train = train[train.columns[2:]]
X_test = test[test.columns[1:]]

I am going to try all of these options with 4 classification algorithms: Logistic Regression, Decision Tree, Random Forest, and XGBoost.

In [7]:
#logistic regression
lreg = LogisticRegression()
lreg.fit(X_train, y_train)
y_pred = lreg.predict(X_test)

In [None]:
#tested accuracy = 0.97678

In [8]:
#decision tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

In [None]:
#tested accuracy = 0.97464

In [9]:
#random forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
#tested accuracy = 0.98857

In [None]:
#xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(X_train, y_train)
#pass the DataFrame directly; as_matrix() was removed in pandas 1.0
y_pred = xgb.predict(X_test)

In [11]:
#tested accuracy = 0.96517
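
The scores above come from Kaggle submissions. For a quicker local estimate, a hold-out split of the labelled data can be scored directly. The sketch below is only an illustration: it uses synthetic data in place of X_train and y_train, the names X_demo, y_demo, and val_scores are hypothetical, and accuracy_score is assumed as the metric, which may differ from the competition's judging policy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical); the notebook would use X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hold out a quarter of the labelled rows for local scoring.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the same split and record its hold-out accuracy.
val_scores = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))
```

Scoring every model on the same split keeps the comparison like-for-like without spending a submission on each variant.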

As you can see, the random forest model performed the best per the judging policy of the Kaggle competition. I am going to run these models again, but this time I will replace the 10 separate values for each of the histogram sensors with a single value, the mean of those 10 values, and test with the same mean in the test set. I also decided not to use the XGBoost algorithm on the next two versions, since it was the least accurate of the four on the original data set.

First, I need to create a column for the mean of each of these sensors and remove the 10 separate values from the data frame. I hard-coded the column indices because I knew which sensors were histogram sensors, but in the future I would write a function that finds these sensors and calculates the mean automatically.

In [12]:
col_7 = train.loc[:, train.columns[8]:train.columns[17]]
train['sensor_7_mean'] = col_7.mean(axis=1)
col_24 = train.loc[:, train.columns[34]:train.columns[43]]
train['sensor_24_mean'] = col_24.mean(axis=1)
col_25 = train.loc[:, train.columns[44]:train.columns[53]]
train['sensor_25_mean'] = col_25.mean(axis=1)
col_26 = train.loc[:, train.columns[54]:train.columns[63]]
train['sensor_26_mean'] = col_26.mean(axis=1)
col_64 = train.loc[:, train.columns[101]:train.columns[110]]
train['sensor_64_mean'] = col_64.mean(axis=1)
col_69 = train.loc[:, train.columns[115]:train.columns[124]]
train['sensor_69_mean'] = col_69.mean(axis=1)
col_105 = train.loc[:, train.columns[160]:train.columns[169]]
train['sensor_105_mean'] = col_105.mean(axis=1)

train.drop(columns=col_7.columns, inplace=True)
train.drop(columns=col_24.columns, inplace=True)
train.drop(columns=col_25.columns, inplace=True)
train.drop(columns=col_26.columns, inplace=True)
train.drop(columns=col_64.columns, inplace=True)
train.drop(columns=col_69.columns, inplace=True)
train.drop(columns=col_105.columns, inplace=True)
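
The hard-coded slices above could be automated by matching on the column names instead of positions. The following is a minimal sketch under the assumption that every histogram column contains the text "_histogram_bin", as in the train.head() output. The function name collapse_histogram_sensors and its parameters are hypothetical, and the new columns are named like sensor7_mean rather than sensor_7_mean.

```python
import pandas as pd

def collapse_histogram_sensors(df, how="mean", marker="_histogram_bin"):
    """Replace each sensor's histogram-bin columns with one aggregated column."""
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    out = df.copy()
    bin_cols = [c for c in out.columns if marker in c]
    # Group bin columns by sensor prefix, e.g. "sensor7" for "sensor7_histogram_bin0".
    groups = {}
    for col in bin_cols:
        groups.setdefault(col.split(marker)[0], []).append(col)
    # Add one row-wise aggregate per sensor, then drop the individual bins.
    for prefix, cols in groups.items():
        out[f"{prefix}_{how}"] = getattr(out[cols], how)(axis=1)
    return out.drop(columns=bin_cols)

# Toy frame with made-up values to show the effect.
toy = pd.DataFrame({
    "id": [1, 2],
    "sensor1_measure": [5.0, 7.0],
    "sensor7_histogram_bin0": [1.0, 0.0],
    "sensor7_histogram_bin1": [2.0, 0.0],
    "sensor7_histogram_bin2": [6.0, 9.0],
})
toy_mean = collapse_histogram_sensors(toy, how="mean")
toy_median = collapse_histogram_sensors(toy, how="median")
```

Because the function returns a copy, the same loaded frame can produce both the mean and the median variants without reloading the CSV files.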

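I now create the new X and y training frames, apply the same column calculation to the test set, and rerun the models on the new data frames. Note that these runs also oversample the training data with SMOTE before fitting, which the first set of models did not do.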

In [13]:
X_train = train[train.columns[2:]]
y_train = train[train.columns[1]]

In [14]:
col_7_t = test.loc[:, test.columns[7]:test.columns[16]]
test['sensor_7_mean'] = col_7_t.mean(axis=1)
col_24_t = test.loc[:, test.columns[33]:test.columns[42]]
test['sensor_24_mean'] = col_24_t.mean(axis=1)
col_25_t = test.loc[:, test.columns[43]:test.columns[52]]
test['sensor_25_mean'] = col_25_t.mean(axis=1)
col_26_t = test.loc[:, test.columns[53]:test.columns[62]]
test['sensor_26_mean'] = col_26_t.mean(axis=1)
col_64_t = test.loc[:, test.columns[100]:test.columns[109]]
test['sensor_64_mean'] = col_64_t.mean(axis=1)
col_69_t = test.loc[:, test.columns[114]:test.columns[123]]
test['sensor_69_mean'] = col_69_t.mean(axis=1)
col_105_t = test.loc[:, test.columns[159]:test.columns[168]]
test['sensor_105_mean'] = col_105_t.mean(axis=1)

test.drop(columns=col_7_t.columns, inplace=True)
test.drop(columns=col_24_t.columns, inplace=True)
test.drop(columns=col_25_t.columns, inplace=True)
test.drop(columns=col_26_t.columns, inplace=True)
test.drop(columns=col_64_t.columns, inplace=True)
test.drop(columns=col_69_t.columns, inplace=True)
test.drop(columns=col_105_t.columns, inplace=True)

In [15]:
X_test = test[test.columns[1:]]

In [16]:
#logistic regression
#oversample with SMOTE (fit_sample was renamed fit_resample in imbalanced-learn)
smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)

lreg = LogisticRegression()
lreg.fit(X_train, y_train)
y_pred = lreg.predict(X_test)

In [19]:
#tested accuracy = 0.96839

In [17]:
#decision tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

In [20]:
#tested accuracy = 0.94642

In [18]:
#random forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [21]:
#tested accuracy = 0.98330

As you can see, every model run on the mean of the histogram sensors was less accurate than the same algorithm run on the original data. Because these runs also added SMOTE oversampling, the drop cannot be attributed to the averaging alone. The random forest algorithm was still the most accurate, but not as accurate as the previous random forest.

I am now going to follow the same procedure, but using the median instead of the mean to see if this is any more accurate. I must first reload the original datasets to re-create the median columns from the original data.

In [22]:
train = pd.read_csv('equip_failures_training_set.csv')
test = pd.read_csv('equip_failures_test_set.csv')

train.fillna(train.mean(), inplace=True)
test.fillna(test.mean(), inplace=True)

In [23]:
#the new columns keep the _mean suffix but now hold the median of each sensor's 10 values
col_7 = train.loc[:, train.columns[8]:train.columns[17]]
train['sensor_7_mean'] = col_7.median(axis=1)
col_24 = train.loc[:, train.columns[34]:train.columns[43]]
train['sensor_24_mean'] = col_24.median(axis=1)
col_25 = train.loc[:, train.columns[44]:train.columns[53]]
train['sensor_25_mean'] = col_25.median(axis=1)
col_26 = train.loc[:, train.columns[54]:train.columns[63]]
train['sensor_26_mean'] = col_26.median(axis=1)
col_64 = train.loc[:, train.columns[101]:train.columns[110]]
train['sensor_64_mean'] = col_64.median(axis=1)
col_69 = train.loc[:, train.columns[115]:train.columns[124]]
train['sensor_69_mean'] = col_69.median(axis=1)
col_105 = train.loc[:, train.columns[160]:train.columns[169]]
train['sensor_105_mean'] = col_105.median(axis=1)

train.drop(columns=col_7.columns, inplace=True)
train.drop(columns=col_24.columns, inplace=True)
train.drop(columns=col_25.columns, inplace=True)
train.drop(columns=col_26.columns, inplace=True)
train.drop(columns=col_64.columns, inplace=True)
train.drop(columns=col_69.columns, inplace=True)
train.drop(columns=col_105.columns, inplace=True)

X_train = train[train.columns[2:]]
y_train = train[train.columns[1]]

In [24]:
#as in the training frame, the _mean columns below hold medians
col_7_t = test.loc[:, test.columns[7]:test.columns[16]]
test['sensor_7_mean'] = col_7_t.median(axis=1)
col_24_t = test.loc[:, test.columns[33]:test.columns[42]]
test['sensor_24_mean'] = col_24_t.median(axis=1)
col_25_t = test.loc[:, test.columns[43]:test.columns[52]]
test['sensor_25_mean'] = col_25_t.median(axis=1)
col_26_t = test.loc[:, test.columns[53]:test.columns[62]]
test['sensor_26_mean'] = col_26_t.median(axis=1)
col_64_t = test.loc[:, test.columns[100]:test.columns[109]]
test['sensor_64_mean'] = col_64_t.median(axis=1)
col_69_t = test.loc[:, test.columns[114]:test.columns[123]]
test['sensor_69_mean'] = col_69_t.median(axis=1)
col_105_t = test.loc[:, test.columns[159]:test.columns[168]]
test['sensor_105_mean'] = col_105_t.median(axis=1)

test.drop(columns=col_7_t.columns, inplace=True)
test.drop(columns=col_24_t.columns, inplace=True)
test.drop(columns=col_25_t.columns, inplace=True)
test.drop(columns=col_26_t.columns, inplace=True)
test.drop(columns=col_64_t.columns, inplace=True)
test.drop(columns=col_69_t.columns, inplace=True)
test.drop(columns=col_105_t.columns, inplace=True)

X_test = test[test.columns[1:]]
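
The median of ten values can differ sharply from their mean when the values are skewed, so the two aggregates may give the models quite different features. The toy example below illustrates this with made-up counts for two hypothetical wells; the numbers are not taken from the data set.

```python
import pandas as pd

# Made-up values for two hypothetical wells, ten per sensor.
bins = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1000],               # all mass in one bin
        [90, 95, 100, 100, 100, 100, 100, 105, 105, 105],  # evenly spread
    ],
    columns=[f"bin{i}" for i in range(10)],
)

# Row-wise aggregates, as in the mean and median variants of the notebook.
row_mean = bins.mean(axis=1)      # 100.0 for both rows
row_median = bins.median(axis=1)  # 0.0 for the first row, 100.0 for the second
```

Both rows share the same mean, but the median separates them, which shows that the two summaries keep different information about the ten values.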

In [25]:
#logistic regression
#oversample with SMOTE (fit_sample was renamed fit_resample in imbalanced-learn)
smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)

lreg = LogisticRegression()
lreg.fit(X_train, y_train)
y_pred = lreg.predict(X_test)

In [28]:
#tested accuracy = 0.97160

In [26]:
#decision tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

In [29]:
#tested accuracy = 0.78223

In [27]:
#random forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [30]:
#tested accuracy = 0.97946

As the accuracies above show, using the median did not improve on the original data frame for any model, and it beat the mean only for logistic regression (0.97160 versus 0.96839). The most accurate combination I found was the Random Forest algorithm on the original data set, including all 10 measurements for each of the 7 histogram sensors, with a tested accuracy of 0.98857. This model is an effective way to predict whether an oil well will have a surface or below-ground failure.
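
To make the comparison easier to scan, the tested accuracies reported in the cells above can be collected into one table. This is a small sketch using only those reported numbers. The labels "all bins", "mean of bins", and "median of bins" are names chosen here for the three feature variants. XGBoost was only run on the original data, so its other cells are empty.

```python
import pandas as pd

# Tested accuracies copied from the result cells of this notebook.
# Note: the mean and median runs also applied SMOTE before fitting.
scores = pd.DataFrame(
    [
        ("all bins", "logistic regression", 0.97678),
        ("all bins", "decision tree", 0.97464),
        ("all bins", "random forest", 0.98857),
        ("all bins", "xgboost", 0.96517),
        ("mean of bins", "logistic regression", 0.96839),
        ("mean of bins", "decision tree", 0.94642),
        ("mean of bins", "random forest", 0.98330),
        ("median of bins", "logistic regression", 0.97160),
        ("median of bins", "decision tree", 0.78223),
        ("median of bins", "random forest", 0.97946),
    ],
    columns=["features", "model", "score"],
)

# One row per model, one column per feature variant.
summary = scores.pivot(index="model", columns="features", values="score")

# The single best combination across all runs.
best = scores.loc[scores["score"].idxmax()]
print(summary)
print(best["features"], best["model"], best["score"])
```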