# Car breakdown prediction

We have a fleet of automatic cars of the same make and model. Since the drivers don't own the cars, they drive them abusively, which causes more wear and tear. All cars are equipped with sensors that report the state of the car on a daily basis (1 reading / day).


1. **Driving Mode sensors:**
The car has 3 different driving modes (Auto, City, Sports), one of which the driver can select only once per day. The sensor readings are not discrete, though; they are captured as 3 different real numbers: **ecoMode, cityMode, sportMode**

2. **Engine Sensors**: 
Every car is equipped with 21 different kinds of sensors (e.g. engine RPM, engine oil level, AC temperature, battery voltage, ...), captured as **s1, s2, s3, ..., s21**

## Dataset

[Car breakdown dataset](car_breakdown_data.ipynb)

# What to predict?

We are given time-series data for every **vehicleId**, along with the day of failure. Can we predict whether a breakdown is going to happen within 30 days?

## Expectations from the project

* An implementation with decent accuracy, e.g. 70%
* Check the code in to GitHub and email the link
* README.md should describe your implementation approach
* Document other techniques that could be used to address the problem, given only the data provided in this project.

In [341]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [342]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100)

In [343]:
train_df = pd.read_csv('data/car_breakdown_train.tsv', sep='\t', header=0)
train_df.head()

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,-0.0007,-0.0004,100,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100,39.06,23.419
1,1,2,0.0019,-0.0003,100,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100,39.0,23.4236
2,1,3,-0.0043,0.0003,100,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100,38.95,23.3442
3,1,4,0.0007,0.0,100,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100,38.88,23.3739
4,1,5,-0.0019,-0.0002,100,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100,38.9,23.4044


In [344]:
train_df.shape

(20631, 26)

In [345]:
non_imp_features = []
for i in range(len(train_df.columns)):
    print(train_df.columns[i],train_df.iloc[:,i].nunique())
    if train_df.iloc[:,i].nunique() == 1:
        non_imp_features.append(train_df.columns[i])

vehicleId 100
days 362
ecoMode 158
cityMode 13
sportMode 1
s1 1
s2 310
s3 3012
s4 4051
s5 1
s6 2
s7 513
s8 53
s9 6403
s10 1
s11 159
s12 427
s13 56
s14 6078
s15 1918
s16 1
s17 13
s18 1
s19 1
s20 120
s21 4745


In [346]:
non_imp_features

['sportMode', 's1', 's5', 's10', 's16', 's18', 's19']
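The loop above flags columns that hold a single distinct value. A minimal sketch of the same check using `DataFrame.nunique()` directly follows; the helper name `constant_columns` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold exactly one distinct value."""
    counts = df.nunique()
    return counts[counts == 1].index.tolist()


# Toy frame (hypothetical values) mirroring the situation above:
# two constant columns and one column that actually varies.
toy = pd.DataFrame({
    "sportMode": [100, 100, 100],
    "s1": [518.67, 518.67, 518.67],
    "s2": [641.82, 642.15, 642.35],
})
constant = constant_columns(toy)
print(constant)
```

Columns found this way carry no information for a classifier, which is why they are excluded from the feature lists later on.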

## Solve this problem as classification: predict whether a car will break down in the next 30 days
### We create the Y label, `is_carbrokedown`, to denote this

In [347]:
vehicle_brokedown_day_dict = train_df.groupby('vehicleId')["days"].max().to_dict()

In [348]:
train_df['car_brokedown_day'] = train_df.vehicleId.apply(lambda x:vehicle_brokedown_day_dict[x])

In [349]:
train_df.sample(5)

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,car_brokedown_day
2632,13,87,0.0007,0.0003,100,518.67,642.93,1590.9,1412.87,14.62,21.61,552.76,2388.14,9066.14,1.3,47.63,520.66,2388.11,8139.76,8.4561,0.03,393,2388,100,38.71,23.3149,163
17515,86,176,0.0022,0.0005,100,518.67,643.1,1593.97,1411.73,14.62,21.61,553.1,2388.1,9067.16,1.3,47.61,521.13,2388.09,8145.75,8.4363,0.03,394,2388,100,38.83,23.3606,278
4140,20,207,0.0003,0.0,100,518.67,643.37,1596.99,1424.85,14.62,21.61,551.63,2388.19,9071.75,1.3,47.99,520.53,2388.18,8140.09,8.4903,0.03,395,2388,100,38.53,23.067,234
6781,34,170,-0.002,0.0003,100,518.67,643.22,1599.85,1421.7,14.62,21.61,552.28,2388.14,9113.24,1.3,47.73,520.31,2388.17,8179.94,8.4975,0.03,394,2388,100,38.73,23.2821,195
265,2,74,-0.0013,0.0001,100,518.67,642.39,1584.04,1396.13,14.62,21.6,555.36,2388.01,9060.56,1.3,47.25,521.97,2388.03,8138.8,8.3932,0.03,391,2388,100,39.07,23.4015,287


In [350]:
train_df['RUL'] = train_df.apply(lambda x: int(x['car_brokedown_day'] - x['days']),axis=1)

In [351]:
train_df.sample(5)

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,car_brokedown_day,RUL
3321,17,17,-0.0016,0.0,100,518.67,642.2,1583.35,1402.41,14.62,21.61,554.5,2388.06,9060.22,1.3,47.28,522.07,2388.02,8144.86,8.3929,0.03,392,2388,100,39.06,23.4013,276,259
12245,62,119,0.0002,0.0,100,518.67,642.8,1597.84,1423.39,14.62,21.61,553.25,2388.13,9057.04,1.3,47.62,520.5,2388.12,8139.12,8.4728,0.03,393,2388,100,38.64,23.2475,180,61
9046,46,252,0.0011,0.0002,100,518.67,643.74,1604.29,1426.03,14.62,21.61,551.22,2388.29,9053.35,1.3,48.23,519.83,2388.26,8126.53,8.4992,0.03,397,2388,100,38.68,22.9664,256,4
19811,96,260,0.0019,-0.0003,100,518.67,642.74,1592.98,1414.04,14.62,21.61,552.96,2388.07,9061.14,1.3,47.51,520.97,2388.07,8139.49,8.4968,0.03,395,2388,100,38.62,23.1429,336,76
5751,29,122,0.0014,0.0003,100,518.67,642.58,1587.41,1414.2,14.62,21.61,553.02,2388.14,9059.42,1.3,47.68,521.09,2388.08,8137.02,8.4484,0.03,394,2388,100,38.68,23.1895,163,41


In [352]:
train_df[train_df.vehicleId==1]

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,car_brokedown_day,RUL
0,1,1,-0.0007,-0.0004,100,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100,39.06,23.419,192,191
1,1,2,0.0019,-0.0003,100,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100,39.0,23.4236,192,190
2,1,3,-0.0043,0.0003,100,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100,38.95,23.3442,192,189
3,1,4,0.0007,0.0,100,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100,38.88,23.3739,192,188
4,1,5,-0.0019,-0.0002,100,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100,38.9,23.4044,192,187
5,1,6,-0.0043,-0.0001,100,518.67,642.1,1584.47,1398.37,14.62,21.61,554.67,2388.02,9049.68,1.3,47.16,521.68,2388.03,8132.85,8.4108,0.03,391,2388,100,38.98,23.3669,192,186
6,1,7,0.001,0.0001,100,518.67,642.48,1592.32,1397.77,14.62,21.61,554.34,2388.02,9059.13,1.3,47.36,522.32,2388.03,8132.32,8.3974,0.03,392,2388,100,39.1,23.3774,192,185
7,1,8,-0.0034,0.0003,100,518.67,642.56,1582.96,1400.97,14.62,21.61,553.85,2388.0,9040.8,1.3,47.24,522.47,2388.03,8131.07,8.4076,0.03,391,2388,100,38.97,23.3106,192,184
8,1,9,0.0008,0.0001,100,518.67,642.12,1590.98,1394.8,14.62,21.61,553.69,2388.05,9046.46,1.3,47.29,521.79,2388.05,8125.69,8.3728,0.03,392,2388,100,39.05,23.4066,192,183
9,1,10,-0.0033,0.0001,100,518.67,641.71,1591.24,1400.46,14.62,21.61,553.59,2388.05,9051.7,1.3,47.03,521.79,2388.06,8129.38,8.4286,0.03,393,2388,100,38.95,23.4694,192,182


In [353]:
train_df['is_carbrokedown'] = train_df.RUL.apply(lambda x: 1 if x <=30 else 0)

In [354]:
train_df.is_carbrokedown.value_counts()

0    17531
1     3100
Name: is_carbrokedown, dtype: int64
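The label construction in the cells above (last recorded day per vehicle, remaining days, 30-day flag) can also be written without row-wise `apply`. The following is a sketch under the same assumption the notebook makes for the training data, namely that a vehicle's last recorded day is its breakdown day. The function name `add_breakdown_label` and the toy data are hypothetical.

```python
import pandas as pd

HORIZON_DAYS = 30  # same 30-day window as in the cells above


def add_breakdown_label(df: pd.DataFrame, horizon: int = HORIZON_DAYS) -> pd.DataFrame:
    """Add car_brokedown_day, RUL and is_carbrokedown columns to a copy of df."""
    out = df.copy()
    # Last recorded day per vehicle, broadcast back onto every row of that vehicle.
    last_day = out.groupby("vehicleId")["days"].transform("max")
    out["car_brokedown_day"] = last_day
    out["RUL"] = last_day - out["days"]
    out["is_carbrokedown"] = (out["RUL"] <= horizon).astype(int)
    return out


# Toy data: vehicle 1 runs for 3 days, vehicle 2 for 2 days.
toy = pd.DataFrame({
    "vehicleId": [1, 1, 1, 2, 2],
    "days": [1, 2, 3, 1, 2],
})
labelled = add_breakdown_label(toy, horizon=1)
print(labelled)
```

With a horizon of 1 day in the toy example, only the first reading of vehicle 1 falls outside the window.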

## Create test data

In [355]:
valid_df = pd.read_csv('data/car_breakdown_test.tsv',sep='\t',header=0)

In [356]:
valid_df.head()

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,0.0023,0.0003,100,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100,38.86,23.3735
1,1,2,-0.0027,-0.0003,100,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100,39.02,23.3916
2,1,3,0.0003,0.0001,100,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100,39.08,23.4166
3,1,4,0.0042,0.0,100,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100,39.0,23.3737
4,1,5,0.0014,0.0,100,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100,38.99,23.413


In [357]:
valid_data_gt = pd.read_csv('data/car_breakdown_test_truth.tsv',sep='\t',header=0)

In [358]:
valid_data_gt.head(),valid_data_gt.vehicleId.nunique(),valid_df.vehicleId.nunique()

(   vehicleId  RUL
 0          1  112
 1          2   98
 2          3   69
 3          4   82
 4          5   91, 100, 100)

In [359]:
valid_vechicle_id_car_rul_dict = valid_data_gt.set_index('vehicleId')['RUL'].to_dict()

In [360]:
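# NOTE: this treats the RUL in the truth file as counted from day 1, so any row with
# days > RUL + 1 gets a negative RUL (and is labelled 1 by the <=30 rule used below).
valid_df['RUL'] = valid_df.apply(lambda x: int(valid_vechicle_id_car_rul_dict[x['vehicleId']] - x['days']+1),axis=1)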

In [361]:
valid_df[valid_df.vehicleId==10]

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,RUL
896,10,1,-0.0017,0.0,100,518.67,642.07,1584.19,1403.69,14.62,21.61,554.53,2388.01,9057.35,1.3,47.32,522.13,2388.01,8145.46,8.4039,0.03,391,2388,100,38.75,23.353,96
897,10,2,0.0061,-0.0001,100,518.67,642.32,1584.48,1388.37,14.62,21.61,554.43,2387.98,9061.92,1.3,47.17,522.59,2388.03,8146.38,8.3981,0.03,392,2388,100,39.08,23.4908,95
898,10,3,0.0027,-0.0003,100,518.67,641.77,1574.22,1400.07,14.62,21.6,554.38,2388.04,9058.28,1.3,46.95,522.22,2388.0,8141.22,8.3763,0.03,391,2388,100,39.31,23.4285,94
899,10,4,-0.0028,-0.0004,100,518.67,642.83,1583.9,1404.2,14.62,21.61,554.5,2388.04,9061.39,1.3,47.45,521.93,2387.97,8144.96,8.425,0.03,392,2388,100,39.04,23.3622,93
900,10,5,0.0013,-0.0002,100,518.67,642.04,1585.22,1403.5,14.62,21.6,554.52,2388.01,9071.21,1.3,47.34,522.28,2388.0,8142.12,8.4127,0.03,391,2388,100,38.94,23.4059,92
901,10,6,-0.0007,-0.0001,100,518.67,641.87,1583.49,1395.45,14.62,21.61,554.44,2388.02,9057.8,1.3,47.35,522.24,2387.98,8141.46,8.3845,0.03,392,2388,100,38.78,23.3883,91
902,10,7,0.0006,-0.0002,100,518.67,642.02,1580.84,1398.72,14.62,21.61,553.42,2387.96,9060.98,1.3,47.33,521.97,2388.0,8148.3,8.3737,0.03,392,2388,100,39.15,23.4004,90
903,10,8,0.0024,0.0002,100,518.67,641.62,1583.96,1399.01,14.62,21.61,554.99,2387.99,9060.89,1.3,47.1,521.89,2388.05,8142.62,8.3911,0.03,391,2388,100,38.98,23.523,89
904,10,9,-0.0006,-0.0004,100,518.67,642.21,1586.17,1398.14,14.62,21.61,554.78,2388.01,9063.92,1.3,47.24,522.91,2388.0,8142.14,8.414,0.03,393,2388,100,38.99,23.4151,88
905,10,10,-0.0008,-0.0005,100,518.67,641.77,1584.88,1391.26,14.62,21.61,555.02,2388.02,9061.13,1.3,47.14,522.42,2388.04,8142.58,8.3729,0.03,389,2388,100,39.12,23.3973,87


In [362]:
valid_df['is_carbrokedown'] = valid_df.RUL.apply(lambda x: 1 if x <=30 else 0)

In [363]:
valid_df[valid_df['is_carbrokedown']==True].head(10)

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,RUL,is_carbrokedown
119,3,40,-0.0004,-0.0005,100,518.67,642.52,1588.58,1407.3,14.62,21.61,553.38,2388.16,9053.33,1.3,47.57,521.66,2388.18,8132.02,8.4341,0.03,393,2388,100,38.73,23.303,30,1
120,3,41,-0.0047,-0.0005,100,518.67,642.94,1596.7,1406.09,14.62,21.61,553.78,2388.15,9048.09,1.3,47.49,521.34,2388.12,8137.15,8.4601,0.03,393,2388,100,38.85,23.4,29,1
121,3,42,-0.0011,-0.0004,100,518.67,642.74,1590.47,1415.23,14.62,21.61,552.79,2388.13,9051.32,1.3,47.36,521.37,2388.1,8134.08,8.4292,0.03,393,2388,100,38.75,23.312,28,1
122,3,43,0.0017,0.0001,100,518.67,643.02,1589.26,1413.2,14.62,21.61,553.34,2388.11,9052.23,1.3,47.5,521.59,2388.1,8134.12,8.4194,0.03,394,2388,100,38.84,23.4232,27,1
123,3,44,-0.0019,0.0,100,518.67,642.16,1586.46,1410.72,14.62,21.61,552.95,2388.13,9055.43,1.3,47.59,521.77,2388.14,8133.77,8.4204,0.03,394,2388,100,38.74,23.3035,26,1
124,3,45,0.0003,-0.0004,100,518.67,642.55,1588.36,1405.75,14.62,21.61,552.3,2388.13,9058.3,1.3,47.57,521.0,2388.14,8125.84,8.4227,0.03,394,2388,100,38.86,23.3064,25,1
125,3,46,-0.0006,-0.0002,100,518.67,642.45,1587.99,1401.94,14.62,21.61,553.75,2388.09,9055.86,1.3,47.55,521.21,2388.09,8127.02,8.421,0.03,393,2388,100,38.83,23.1494,24,1
126,3,47,0.0015,-0.0005,100,518.67,642.63,1587.05,1407.79,14.62,21.61,553.6,2388.14,9051.69,1.3,47.56,520.81,2388.11,8137.44,8.4516,0.03,394,2388,100,38.88,23.3113,23,1
127,3,48,0.0008,0.0,100,518.67,642.26,1586.63,1411.59,14.62,21.61,553.31,2388.08,9056.76,1.3,47.58,521.85,2388.09,8135.46,8.4567,0.03,393,2388,100,38.9,23.2519,22,1
128,3,49,0.0016,-0.0003,100,518.67,642.29,1592.85,1406.74,14.62,21.61,553.64,2388.12,9047.48,1.3,47.54,521.54,2388.08,8131.23,8.4352,0.03,394,2388,100,38.85,23.1975,21,1


In [364]:
valid_df.is_carbrokedown.value_counts()

1    8887
0    4209
Name: is_carbrokedown, dtype: int64
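The counts above show 8887 rows labelled 1. With 100 vehicles, at most 100 × 31 = 3100 rows can have an RUL between 0 and 30, so the RUL formula used for the validation data must be producing negative values for many rows. One possible explanation is that the truth file gives the remaining life at each vehicle's last recorded day instead of a total life counted from day 1. This is an assumption, not something the data description confirms. Under that assumption, the per-row RUL could be sketched as follows (the function name and toy data are hypothetical):

```python
import pandas as pd


def rul_from_last_day(readings: pd.DataFrame, truth: pd.DataFrame) -> pd.Series:
    """Per-row RUL, ASSUMING truth['RUL'] is the remaining life at each
    vehicle's last recorded day (an assumption, not confirmed by the source)."""
    last_day = readings.groupby("vehicleId")["days"].transform("max")
    rul_at_last_day = readings["vehicleId"].map(truth.set_index("vehicleId")["RUL"])
    # Earlier readings have more remaining life than the last one.
    return rul_at_last_day + (last_day - readings["days"])


# Toy data: vehicle 1 observed for 3 days with 10 days left at the end,
# vehicle 2 observed for 2 days with 0 days left at the end.
readings = pd.DataFrame({"vehicleId": [1, 1, 1, 2, 2], "days": [1, 2, 3, 1, 2]})
truth = pd.DataFrame({"vehicleId": [1, 2], "RUL": [10, 0]})
rul = rul_from_last_day(readings, truth)
print(rul.tolist())
```

Under this reading the RUL can never be negative, since it is the truth value plus a non-negative number of days.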

In [365]:
valid_data_max_days = valid_df.groupby('vehicleId')['days'].max().to_dict()

In [366]:
valid_df['is_final'] = valid_df.apply(lambda x: x['days']==valid_data_max_days[x['vehicleId']],axis=1)

In [367]:
features = list(set(train_df.columns) -set(['is_carbrokedown','RUL','car_brokedown_day','vehicleId']))

In [368]:
scaler = preprocessing.StandardScaler()
X_subset = pd.DataFrame(scaler.fit_transform(train_df[features]))
X_subset.columns = [col for col in features]
X_remain = pd.DataFrame(train_df[[col for col in train_df.columns if col not in features]])
X_remain.columns = [col for col in train_df.columns if col not in features]
train_data = pd.concat([X_subset,X_remain],axis=1)

In [369]:
X_subset.shape,train_df.shape,X_remain.shape,train_data.shape

((20631, 25), (20631, 29), (20631, 4), (20631, 29))
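The scaling cells fit a `StandardScaler` on the training features and then stitch the scaled columns back together with the untouched ones. The sketch below shows the same pattern, keeping the original index so that the two parts stay row-aligned when concatenated. The helper `scale_frame` and the toy frames are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_frame(scaler, df, columns):
    """Scale `columns` with an already-fitted scaler; keep other columns and the index."""
    scaled = pd.DataFrame(scaler.transform(df[columns]), columns=columns, index=df.index)
    return pd.concat([scaled, df.drop(columns=columns)], axis=1)


# Toy frames (hypothetical values).
train_toy = pd.DataFrame({"vehicleId": [1, 1, 1], "s2": [1.0, 2.0, 3.0], "s3": [10.0, 20.0, 30.0]})
valid_toy = pd.DataFrame({"vehicleId": [7, 7], "s2": [2.0, 4.0], "s3": [20.0, 50.0]})
feature_cols = ["s2", "s3"]

# Fit on the training data only, then reuse the same statistics for validation data.
scaler = StandardScaler().fit(train_toy[feature_cols])
train_scaled = scale_frame(scaler, train_toy, feature_cols)
valid_scaled = scale_frame(scaler, valid_toy, feature_cols)
print(valid_scaled)
```

Fitting on the training data only means the validation data is scaled with the training mean and standard deviation, which is what the notebook does with `scaler.transform`.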

In [370]:
train_data.head()

Unnamed: 0,s14,s16,s2,s10,cityMode,s7,s1,s20,s6,ecoMode,s17,sportMode,s18,s21,s4,s13,s19,days,s8,s15,s9,s5,s3,s12,s11,vehicleId,car_brokedown_day,RUL,is_carbrokedown
0,-0.269071,-1.0,-1.721725,0.0,-1.372953,1.121141,0.0,1.348493,0.141683,-0.31598,-0.78171,0.0,0.0,1.194427,-0.925936,-1.05889,0.0,-1.56517,-0.516338,-0.603816,-0.862813,-1.0,-0.134255,0.334262,-0.266467,1,192,191,0
1,-0.642845,-1.0,-1.06178,0.0,-1.03172,0.43193,0.0,1.016528,0.141683,0.872722,-0.78171,0.0,0.0,1.236922,-0.643726,-0.363646,0.0,-1.550652,-0.798093,-0.275852,-0.958818,-1.0,0.211528,1.174899,-0.191583,1,192,190,0
2,-0.551629,-1.0,-0.661813,0.0,1.015677,1.008155,0.0,0.739891,0.141683,-1.961874,-2.073094,0.0,0.0,0.503423,-0.525953,-0.919841,0.0,-1.536134,-0.234584,-0.649144,-0.557139,-1.0,-0.413166,1.364721,-1.015303,1,192,189,0
3,-0.520176,-1.0,-0.661813,0.0,-0.008022,1.222827,0.0,0.352598,0.141683,0.32409,-0.78171,0.0,0.0,0.777792,-0.784831,-0.224597,0.0,-1.521616,0.188048,-1.971665,-0.713826,-1.0,-1.261314,1.961302,-1.539489,1,192,188,0
4,-0.521748,-1.0,-0.621816,0.0,-0.690488,0.714393,0.0,0.463253,0.141683,-0.864611,-0.136018,0.0,0.0,1.059552,-0.301518,-0.780793,0.0,-1.507098,-0.516338,-0.339845,-0.457059,-1.0,-1.251528,1.052871,-0.977861,1,192,187,0


In [371]:
train_df.head()

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,car_brokedown_day,RUL,is_carbrokedown
0,1,1,-0.0007,-0.0004,100,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100,39.06,23.419,192,191,0
1,1,2,0.0019,-0.0003,100,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100,39.0,23.4236,192,190,0
2,1,3,-0.0043,0.0003,100,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100,38.95,23.3442,192,189,0
3,1,4,0.0007,0.0,100,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100,38.88,23.3739,192,188,0
4,1,5,-0.0019,-0.0002,100,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100,38.9,23.4044,192,187,0


In [372]:
X_subset = pd.DataFrame(scaler.transform(valid_df[features]))
X_subset.columns = [col for col in features]
X_remain = pd.DataFrame(valid_df[[col for col in valid_df.columns if col not in features]])
X_remain.columns = [col for col in valid_df.columns if col not in features]
valid_data = pd.concat([X_subset,X_remain],axis=1)

In [373]:
valid_data.head()

Unnamed: 0,s14,s16,s2,s10,cityMode,s7,s1,s20,s6,ecoMode,s17,sportMode,s18,s21,s4,s13,s19,days,s8,s15,s9,s5,s3,s12,s11,vehicleId,RUL,is_carbrokedown,is_final
0,-0.954235,-1.0,0.678077,0.0,1.015677,0.601408,0.0,0.241943,0.141683,1.055599,-0.78171,0.0,0.0,0.774097,-1.19148,-0.919841,0.0,-1.56517,-0.798093,-0.985107,-0.682579,-1.0,-0.85355,0.415614,-1.277396,1,112,0,False
1,-0.216648,-1.0,-1.941707,0.0,-1.03172,1.674769,0.0,1.127183,0.141683,-1.230366,-0.136018,0.0,0.0,0.941305,-1.501467,-0.502695,0.0,-1.550652,-1.220725,-1.649034,-0.490117,-1.0,-0.338137,1.012195,-0.154141,1,111,0,False
2,-0.715712,-1.0,-0.441831,0.0,0.333211,0.838677,0.0,1.459148,0.141683,0.141213,-0.136018,0.0,0.0,1.172256,-0.843717,-0.919841,0.0,-1.536134,-0.657216,0.052112,-0.375093,-1.0,-0.584426,0.754581,-0.154141,1,110,0,False
3,-0.568929,-1.0,-0.481827,0.0,-0.008022,0.793483,0.0,1.016528,0.141683,1.924266,-1.427402,0.0,0.0,0.775945,-0.279297,-0.641744,0.0,-1.521616,-0.93897,-1.345067,-0.90357,-1.0,-1.044384,-0.045381,-0.977861,1,109,0,False
4,-0.745069,-1.0,-0.341839,0.0,-0.008022,0.89517,0.0,0.9612,0.141683,0.644125,-2.073094,0.0,0.0,1.138999,-0.779276,-0.919841,0.0,-1.507098,-1.220725,-1.041101,-0.937081,-1.0,-0.54365,0.998637,-0.865536,1,108,0,False


In [374]:
valid_df.head()

Unnamed: 0,vehicleId,days,ecoMode,cityMode,sportMode,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,RUL,is_carbrokedown,is_final
0,1,1,0.0023,0.0003,100,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100,38.86,23.3735,112,0,False
1,1,2,-0.0027,-0.0003,100,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100,39.02,23.3916,111,0,False
2,1,3,0.0003,0.0001,100,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100,39.08,23.4166,110,0,False
3,1,4,0.0042,0.0,100,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100,39.0,23.3737,109,0,False
4,1,5,0.0014,0.0,100,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100,38.99,23.413,108,0,False


## Create train and test X and Y

In [375]:
def prepare_train_test_set(train_data,valid_data,features,normalize=True):
    train_data_x = train_data[features]
    train_data_y = train_data[['is_carbrokedown']]
    
    y_train = np.array(train_data_y).squeeze()
    X_train = np.array(train_data_x)
    valid_data_x = valid_data[features]
    valid_data_y = valid_data[['is_carbrokedown']]
    valid_data_y = np.array(valid_data_y).squeeze()
    
    valid_data_final = valid_data[valid_data.is_final==True]
    valid_data_final_x = valid_data_final[features]
    valid_data_final_y = valid_data_final[['is_carbrokedown']]

    valid_data_final_y = np.array(valid_data_final_y).squeeze()
    
    return X_train, y_train,  valid_data_x, valid_data_y, valid_data_final_x, valid_data_final_y

## Build a sample logistic regression model with the features as-is

In [376]:
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, roc_auc_score, precision_score

In [377]:
def build_logistic_model(X_train, y_train,iterations=2000):
    logistic_regressor = LogisticRegression(solver='liblinear',max_iter=iterations,C=1,class_weight='balanced',penalty='l2')
    logistic_regressor.fit(X_train, y_train)
    return logistic_regressor
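The model above uses `class_weight='balanced'`, which matters here because only 3100 of the 20631 training rows are positive. scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy for the class counts seen earlier; the helper name is hypothetical, and it assumes the labels are the integers 0..k-1 with every class present.

```python
import numpy as np


def balanced_class_weights(y):
    """Weights per class as n_samples / (n_classes * count_of_class).

    Assumes integer labels 0..k-1, each occurring at least once.
    """
    y = np.asarray(y)
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)


# Same class counts as the training labels above: 17531 negatives, 3100 positives.
y = np.array([0] * 17531 + [1] * 3100)
weights = balanced_class_weights(y)
print(weights)
```

For these counts the weights come out at roughly 0.59 for class 0 and 3.33 for class 1, so each positive row counts for about as much as five to six negative rows during fitting.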

In [378]:
def get_predictions(logistic_regressor,X_test,y_test):    
    predictions = logistic_regressor.predict(X_test)
    score = logistic_regressor.score(X_test,y_test)
    cm = metrics.confusion_matrix(y_test, predictions)
    return predictions, score, cm

In [379]:
sel_feat1 = list(set(train_data.columns) -set(['is_carbrokedown','RUL','car_brokedown_day','vehicleId']))
sel_feat1 = [col for col in sel_feat1 if col not in non_imp_features]

In [380]:
X_train, y_train, valid_data_x, valid_data_y,valid_data_final_x, valid_data_final_y = prepare_train_test_set(train_data, valid_data,sel_feat1)
print(sel_feat1)
clf1 = build_logistic_model(X_train,y_train)
valid_pred, valid_score, cm = get_predictions(clf1, valid_data_x, valid_data_y)
valid_final_pred,valid_final_score, fcm = get_predictions(clf1, valid_data_final_x, valid_data_final_y)
print("normalized input:", valid_score, valid_final_score, X_train.shape)

['s11', 's14', 's12', 'cityMode', 's7', 's20', 's6', 'ecoMode', 's17', 's21', 's4', 's13', 'days', 's8', 's15', 's9', 's3', 's2']
normalized input: 0.35949908368967626 0.47 (20631, 18)


In [381]:
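THRESHOLD = 0.97
# NOTE: np.where(cond, 0, 1) predicts 0 when P(class 1) exceeds THRESHOLD and 1 otherwise,
# i.e. the reverse of the usual "probability above threshold means positive" rule.
preds1 = np.where(clf1.predict_proba(valid_data_final_x)[:,1] > THRESHOLD, 0, 1)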
print(accuracy_score(valid_data_final_y, preds1), recall_score(valid_data_final_y, preds1),precision_score(valid_data_final_y, preds1))

0.66 0.8354430379746836 0.7586206896551724


In [382]:
valid_preds1 = np.where(clf1.predict_proba(valid_data_x)[:,1] > THRESHOLD, 0, 1)
print(accuracy_score(valid_data_y, valid_preds1), recall_score(valid_data_y, valid_preds1),precision_score(valid_data_y, valid_preds1))

0.66905925473427 0.9859345110836053 0.6755069000077095
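The two cells above turn predicted probabilities into labels with a custom threshold and then report accuracy, recall and precision. Note that they map a probability above `THRESHOLD` to class 0. For comparison, here is a minimal NumPy sketch of the conventional direction, where a probability above the threshold means class 1, with the three metrics computed by hand. The function names and toy probabilities are hypothetical.

```python
import numpy as np


def predict_with_threshold(proba_positive, threshold):
    """Label 1 when the positive-class probability exceeds the threshold, else 0."""
    return (np.asarray(proba_positive) > threshold).astype(int)


def accuracy_recall_precision(y_true, y_pred):
    """Accuracy, recall and precision for binary labels (positive class = 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Toy positive-class probabilities and ground truth.
proba = np.array([0.10, 0.40, 0.60, 0.90])
y_true = np.array([0, 1, 1, 1])
y_pred = predict_with_threshold(proba, threshold=0.5)
acc, rec, prec = accuracy_recall_precision(y_true, y_pred)
print(y_pred, acc, rec, prec)
```

Raising the threshold in this convention trades recall for precision, which is the usual reason to tune it on a validation set.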


## Add rolling window average of features to capture change in data values

### Here a window size of 2 is used

In [383]:
# Rolling mean over the current and previous reading (window=2), computed per vehicle
# for the columns at positions 5..25 of train_data.
# NOTE: train_data's column order comes from a set, so these positions are not s1..s21.
# In the output below they skip s14, s16, s2, s10 and cityMode, and position 25 is
# vehicleId (which is why a21 equals the vehicle id).
for vehicleId in train_data.vehicleId.unique():
    for i in range(5,26):
        train_data.loc[train_data.vehicleId==vehicleId,"a"+str(i-4)] = train_data.loc[train_data.vehicleId==vehicleId].iloc[:,i].rolling(window=2,min_periods=1).mean()

In [384]:
train_data.head()

Unnamed: 0,s14,s16,s2,s10,cityMode,s7,s1,s20,s6,ecoMode,s17,sportMode,s18,s21,s4,s13,s19,days,s8,s15,s9,s5,s3,s12,s11,vehicleId,car_brokedown_day,RUL,is_carbrokedown,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21
0,-0.269071,-1.0,-1.721725,0.0,-1.372953,1.121141,0.0,1.348493,0.141683,-0.31598,-0.78171,0.0,0.0,1.194427,-0.925936,-1.05889,0.0,-1.56517,-0.516338,-0.603816,-0.862813,-1.0,-0.134255,0.334262,-0.266467,1,192,191,0,1.121141,0.0,1.348493,0.141683,-0.31598,-0.78171,0.0,0.0,1.194427,-0.925936,-1.05889,0.0,-1.56517,-0.516338,-0.603816,-0.862813,-1.0,-0.134255,0.334262,-0.266467,1.0
1,-0.642845,-1.0,-1.06178,0.0,-1.03172,0.43193,0.0,1.016528,0.141683,0.872722,-0.78171,0.0,0.0,1.236922,-0.643726,-0.363646,0.0,-1.550652,-0.798093,-0.275852,-0.958818,-1.0,0.211528,1.174899,-0.191583,1,192,190,0,0.776535,0.0,1.18251,0.141683,0.278371,-0.78171,0.0,0.0,1.215675,-0.784831,-0.711268,0.0,-1.557911,-0.657216,-0.439834,-0.910815,-1.0,0.038637,0.754581,-0.229025,1.0
2,-0.551629,-1.0,-0.661813,0.0,1.015677,1.008155,0.0,0.739891,0.141683,-1.961874,-2.073094,0.0,0.0,0.503423,-0.525953,-0.919841,0.0,-1.536134,-0.234584,-0.649144,-0.557139,-1.0,-0.413166,1.364721,-1.015303,1,192,189,0,0.720043,0.0,0.878209,0.141683,-0.544576,-1.427402,0.0,0.0,0.870172,-0.58484,-0.641744,0.0,-1.543393,-0.516338,-0.462498,-0.757978,-1.0,-0.100819,1.26981,-0.603443,1.0
3,-0.520176,-1.0,-0.661813,0.0,-0.008022,1.222827,0.0,0.352598,0.141683,0.32409,-0.78171,0.0,0.0,0.777792,-0.784831,-0.224597,0.0,-1.521616,0.188048,-1.971665,-0.713826,-1.0,-1.261314,1.961302,-1.539489,1,192,188,0,1.115491,0.0,0.546244,0.141683,-0.818892,-1.427402,0.0,0.0,0.640607,-0.655392,-0.572219,0.0,-1.528875,-0.023268,-1.310405,-0.635482,-1.0,-0.83724,1.663011,-1.277396,1.0
4,-0.521748,-1.0,-0.621816,0.0,-0.690488,0.714393,0.0,0.463253,0.141683,-0.864611,-0.136018,0.0,0.0,1.059552,-0.301518,-0.780793,0.0,-1.507098,-0.516338,-0.339845,-0.457059,-1.0,-1.251528,1.052871,-0.977861,1,192,187,0,0.96861,0.0,0.407926,0.141683,-0.27026,-0.458864,0.0,0.0,0.918672,-0.543175,-0.502695,0.0,-1.514357,-0.164145,-1.155755,-0.585442,-1.0,-1.256421,1.507087,-1.258675,1.0
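The rolling average above is computed per vehicle by selecting columns by position. A sketch of the same idea that selects columns by name and lets `groupby(...).transform` handle the per-vehicle split is shown below. The function name, the `_mean{window}` suffix and the toy data are assumptions for illustration, and the sketch assumes rows are already ordered by `days` within each vehicle.

```python
import pandas as pd


def add_rolling_mean(df, columns, window, group_col="vehicleId"):
    """Add a per-group rolling mean column for each name in `columns`."""
    out = df.copy()
    grouped = df.groupby(group_col, sort=False)
    for col in columns:
        # min_periods=1 keeps the first rows of each vehicle instead of producing NaN.
        out[f"{col}_mean{window}"] = grouped[col].transform(
            lambda s: s.rolling(window=window, min_periods=1).mean()
        )
    return out


# Toy data: the window never mixes readings from different vehicles.
toy = pd.DataFrame({
    "vehicleId": [1, 1, 1, 2, 2],
    "s2": [1.0, 3.0, 5.0, 10.0, 20.0],
})
rolled = add_rolling_mean(toy, ["s2"], window=2)
print(rolled)
```

Selecting by name avoids depending on column order, and the first row of vehicle 2 restarts the window, so it equals its own reading.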


In [385]:
# Same per-vehicle rolling mean (window=2) for the validation data.
for vehicleId in valid_data.vehicleId.unique():
    for i in range(5,26):
        valid_data.loc[valid_data.vehicleId==vehicleId,"a"+str(i-4)] = valid_data.loc[valid_data.vehicleId==vehicleId].iloc[:,i].rolling(window=2,min_periods=1).mean()

In [386]:
valid_data.head()

Unnamed: 0,s14,s16,s2,s10,cityMode,s7,s1,s20,s6,ecoMode,s17,sportMode,s18,s21,s4,s13,s19,days,s8,s15,s9,s5,s3,s12,s11,vehicleId,RUL,is_carbrokedown,is_final,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21
0,-0.954235,-1.0,0.678077,0.0,1.015677,0.601408,0.0,0.241943,0.141683,1.055599,-0.78171,0.0,0.0,0.774097,-1.19148,-0.919841,0.0,-1.56517,-0.798093,-0.985107,-0.682579,-1.0,-0.85355,0.415614,-1.277396,1,112,0,False,0.601408,0.0,0.241943,0.141683,1.055599,-0.78171,0.0,0.0,0.774097,-1.19148,-0.919841,0.0,-1.56517,-0.798093,-0.985107,-0.682579,-1.0,-0.85355,0.415614,-1.277396,1.0
1,-0.216648,-1.0,-1.941707,0.0,-1.03172,1.674769,0.0,1.127183,0.141683,-1.230366,-0.136018,0.0,0.0,0.941305,-1.501467,-0.502695,0.0,-1.550652,-1.220725,-1.649034,-0.490117,-1.0,-0.338137,1.012195,-0.154141,1,111,0,False,1.138088,0.0,0.684563,0.141683,-0.087383,-0.458864,0.0,0.0,0.857701,-1.346473,-0.711268,0.0,-1.557911,-1.009409,-1.31707,-0.586348,-1.0,-0.595844,0.713905,-0.715769,1.0
2,-0.715712,-1.0,-0.441831,0.0,0.333211,0.838677,0.0,1.459148,0.141683,0.141213,-0.136018,0.0,0.0,1.172256,-0.843717,-0.919841,0.0,-1.536134,-0.657216,0.052112,-0.375093,-1.0,-0.584426,0.754581,-0.154141,1,110,0,False,1.256723,0.0,1.293165,0.141683,-0.544576,-0.136018,0.0,0.0,1.05678,-1.172592,-0.711268,0.0,-1.543393,-0.93897,-0.798461,-0.432605,-1.0,-0.461282,0.883388,-0.154141,1.0
3,-0.568929,-1.0,-0.481827,0.0,-0.008022,0.793483,0.0,1.016528,0.141683,1.924266,-1.427402,0.0,0.0,0.775945,-0.279297,-0.641744,0.0,-1.521616,-0.93897,-1.345067,-0.90357,-1.0,-1.044384,-0.045381,-0.977861,1,109,0,False,0.81608,0.0,1.237838,0.141683,1.032739,-0.78171,0.0,0.0,0.9741,-0.561507,-0.780793,0.0,-1.528875,-0.798093,-0.646478,-0.639332,-1.0,-0.814405,0.3546,-0.566001,1.0
4,-0.745069,-1.0,-0.341839,0.0,-0.008022,0.89517,0.0,0.9612,0.141683,0.644125,-2.073094,0.0,0.0,1.138999,-0.779276,-0.919841,0.0,-1.507098,-1.220725,-1.041101,-0.937081,-1.0,-0.54365,0.998637,-0.865536,1,108,0,False,0.844327,0.0,0.988864,0.141683,1.284196,-1.750248,0.0,0.0,0.957472,-0.529286,-0.780793,0.0,-1.514357,-1.079848,-1.193084,-0.920325,-1.0,-0.794017,0.476628,-0.921699,1.0


In [387]:
sel_feat2 = list(set(train_data.columns) -set(['is_carbrokedown','RUL','car_brokedown_day','vehicleId']))
sel_feat2 = [col for col in sel_feat2 if col not in non_imp_features and not col.startswith('s')]

In [388]:
X_train,  y_train, valid_data_x, valid_data_y, valid_data_final_x,valid_data_final_y = prepare_train_test_set(train_data, valid_data,sel_feat2)
print(sel_feat2)
clf2 = build_logistic_model(X_train,y_train)
valid_pred, valid_score, valid_cm = get_predictions(clf2,valid_data_x,valid_data_y)
valid_final_pred, valid_final_score, valid_final_cm = get_predictions(clf2,valid_data_final_x,valid_data_final_y)

['a4', 'a6', 'a3', 'a10', 'a16', 'a7', 'a8', 'a17', 'a18', 'a15', 'days', 'a20', 'a9', 'a21', 'a2', 'a11', 'a13', 'cityMode', 'a14', 'a5', 'a19', 'a12', 'a1', 'ecoMode']


In [389]:
valid_score, valid_final_score,X_train.shape

(0.35797189981673794, 0.46, (20631, 24))

In [390]:
THRESHOLD = 0.995
preds2 = np.where(clf2.predict_proba(valid_data_final_x)[:,1] > THRESHOLD, 0, 1)
print(accuracy_score(valid_data_final_y, preds2), recall_score(valid_data_final_y, preds2),precision_score(valid_data_final_y, preds2))

0.71 0.8987341772151899 0.7717391304347826


In [391]:
valid_preds2 = np.where(clf2.predict_proba(valid_data_x)[:,1] > THRESHOLD, 0, 1)
print(accuracy_score(valid_data_y, valid_preds2), recall_score(valid_data_y, valid_preds2),precision_score(valid_data_y, valid_preds2))

0.6748625534514355 0.9944863283447732 0.6773971027822487


## Add rolling window std of features to capture change in data values
## Also add the change in feature values over the window

### Here a window size of 5 is used

In [392]:
# Recompute the rolling mean with window=5 and add the rolling std and the change
# (last minus first value) over the same window, per vehicle.
for vehicleId in train_data.vehicleId.unique():
    for i in range(5,26):
        train_data.loc[train_data.vehicleId==vehicleId,"a"+str(i-4)] = train_data.loc[train_data.vehicleId==vehicleId].iloc[:,i].rolling(window=5,min_periods=1).mean()
        train_data.loc[train_data.vehicleId==vehicleId,"std"+str(i-4)] = train_data.loc[train_data.vehicleId==vehicleId].iloc[:,i].rolling(window=5,min_periods=1).std(ddof=0)
        # raw=True passes each window as a NumPy array, so x[-1] and x[0] are positional
        train_data.loc[train_data.vehicleId==vehicleId,"diff"+str(i-4)] = train_data.loc[train_data.vehicleId==vehicleId].iloc[:,i].rolling(window=5,min_periods=1).apply(lambda x:x[-1]-x[0], raw=True)

  


In [393]:
train_data.head()

Unnamed: 0,s14,s16,s2,s10,cityMode,s7,s1,s20,s6,ecoMode,s17,sportMode,s18,s21,s4,s13,s19,days,s8,s15,s9,s5,s3,s12,s11,vehicleId,car_brokedown_day,RUL,is_carbrokedown,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21,std1,diff1,std2,diff2,std3,diff3,std4,diff4,std5,diff5,std6,diff6,std7,diff7,std8,diff8,std9,diff9,std10,diff10,std11,diff11,std12,diff12,std13,diff13,std14,diff14,std15,diff15,std16,diff16,std17,diff17,std18,diff18,std19,diff19,std20,diff20,std21,diff21
0,-0.269071,-1.0,-1.721725,0.0,-1.372953,1.121141,0.0,1.348493,0.141683,-0.31598,-0.78171,0.0,0.0,1.194427,-0.925936,-1.05889,0.0,-1.56517,-0.516338,-0.603816,-0.862813,-1.0,-0.134255,0.334262,-0.266467,1,192,191,0,1.121141,0.0,1.348493,0.141683,-0.31598,-0.78171,0.0,0.0,1.194427,-0.925936,-1.05889,0.0,-1.56517,-0.516338,-0.603816,-0.862813,-1.0,-0.134255,0.334262,-0.266467,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.642845,-1.0,-1.06178,0.0,-1.03172,0.43193,0.0,1.016528,0.141683,0.872722,-0.78171,0.0,0.0,1.236922,-0.643726,-0.363646,0.0,-1.550652,-0.798093,-0.275852,-0.958818,-1.0,0.211528,1.174899,-0.191583,1,192,190,0,0.776535,0.0,1.18251,0.141683,0.278371,-0.78171,0.0,0.0,1.215675,-0.784831,-0.711268,0.0,-1.557911,-0.657216,-0.439834,-0.910815,-1.0,0.038637,0.754581,-0.229025,1.0,0.344605,-0.68921,0.0,0.0,0.165982,-0.331965,0.0,0.0,0.594351,1.188702,0.0,0.0,0.0,0.0,0.0,0.0,0.021247,0.042495,0.141105,0.28221,0.347622,0.695244,0.0,0.0,0.007259,0.014518,0.140877,-0.281755,0.163982,0.327964,0.048002,-0.096004,0.0,0.0,0.172892,0.345784,0.420319,0.840637,0.037442,0.074884,0.0,0.0
2,-0.551629,-1.0,-0.661813,0.0,1.015677,1.008155,0.0,0.739891,0.141683,-1.961874,-2.073094,0.0,0.0,0.503423,-0.525953,-0.919841,0.0,-1.536134,-0.234584,-0.649144,-0.557139,-1.0,-0.413166,1.364721,-1.015303,1,192,189,0,0.853742,0.0,1.03497,0.141683,-0.468377,-1.212171,0.0,0.0,0.978257,-0.698538,-0.780793,0.0,-1.550652,-0.516338,-0.509604,-0.792923,-1.0,-0.111964,0.957961,-0.491118,1.0,0.301812,-0.112985,0.0,0.0,0.248803,-0.608602,0.0,0.0,1.162226,-1.645895,0.608764,-1.291384,0.0,0.0,0.0,0.0,0.336207,-0.691004,0.167829,0.399983,0.30038,0.139049,0.0,0.0,0.011854,0.029036,0.230052,0.281755,0.16632,-0.045328,0.171269,0.305674,0.0,0.0,0.255517,-0.27891,0.447778,1.030459,0.371914,-0.748837,0.0,0.0
3,-0.520176,-1.0,-0.661813,0.0,-0.008022,1.222827,0.0,0.352598,0.141683,0.32409,-0.78171,0.0,0.0,0.777792,-0.784831,-0.224597,0.0,-1.521616,0.188048,-1.971665,-0.713826,-1.0,-1.261314,1.961302,-1.539489,1,192,188,0,0.946013,0.0,0.864377,0.141683,-0.27026,-1.104556,0.0,0.0,0.928141,-0.720111,-0.641744,0.0,-1.543393,-0.340242,-0.875119,-0.773149,-1.0,-0.399302,1.208796,-0.75321,1.0,0.306365,0.101687,0.0,0.0,0.365695,-0.995894,0.0,0.0,1.063404,0.64007,0.559186,0.0,0.0,0.0,0.0,0.0,0.303827,-0.416635,0.15007,0.141105,0.354506,0.834293,0.0,0.0,0.016232,0.043554,0.364312,0.704386,0.64927,-1.367849,0.152227,0.148988,0.0,0.0,0.544661,-1.127059,0.582352,1.62704,0.556613,-1.273022,0.0,0.0
4,-0.521748,-1.0,-0.621816,0.0,-0.690488,0.714393,0.0,0.463253,0.141683,-0.864611,-0.136018,0.0,0.0,1.059552,-0.301518,-0.780793,0.0,-1.507098,-0.516338,-0.339845,-0.457059,-1.0,-1.251528,1.052871,-0.977861,1,192,187,0,0.899689,0.0,0.784153,0.141683,-0.389131,-0.910848,0.0,0.0,0.954423,-0.636393,-0.669553,0.0,-1.536134,-0.375461,-0.768064,-0.709931,-1.0,-0.569747,1.177611,-0.798141,1.0,0.28926,-0.406747,0.0,0.0,0.364322,-0.88524,0.0,0.0,0.980399,-0.548632,0.632647,0.645692,0.0,0.0,0.0,0.0,0.276788,-0.134875,0.214598,0.624418,0.321921,0.278098,0.0,0.0,0.020532,0.058073,0.333377,0.0,0.618938,0.263971,0.185808,0.405754,0.0,0.0,0.594584,-1.117273,0.524593,0.718609,0.505894,-0.711395,0.0,0.0
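
The `car_brokedown_day`, `RUL` and `is_carbrokedown` columns above can be read as a remaining-useful-life target. In the first row, day 1 of a vehicle that breaks down on day 192 gives an RUL of 191. A minimal sketch of how such columns could be derived follows. It assumes that the last recorded day of each vehicle is its breakdown day and that the label is 1 when the RUL is at most 30 days; both are assumptions made for illustration, not confirmed by the notebook. The `toy` frame and `HORIZON_DAYS` are hypothetical, and `days` is used here as a raw day count rather than the normalized value shown in the table.

```python
import pandas as pd

HORIZON_DAYS = 30  # assumed horizon, taken from the "within 30 days" task statement

# Hypothetical toy data: raw day counters for two vehicles.
toy = pd.DataFrame({
    "vehicleId": [1, 1, 1, 2, 2],
    "days": [1, 150, 192, 1, 40],
})

# Assumption: the last recorded day of a vehicle is the day it broke down.
toy["car_brokedown_day"] = toy.groupby("vehicleId")["days"].transform("max")

# Remaining useful life: days left until the breakdown day.
toy["RUL"] = toy["car_brokedown_day"] - toy["days"]

# Assumed labelling rule: 1 if the breakdown is at most HORIZON_DAYS away.
toy["is_carbrokedown"] = (toy["RUL"] <= HORIZON_DAYS).astype(int)

print(toy)
```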


In [394]:
# Add 5-day rolling features per vehicle for the columns at positions 5..25.
# Note: the suffix is the positional offset (i - 4), not the sensor number.
for vehicleId in valid_data.vehicleId.unique():
    mask = valid_data.vehicleId == vehicleId
    for i in range(5, 26):
        rolling = valid_data.loc[mask].iloc[:, i].rolling(window=5, min_periods=1)
        valid_data.loc[mask, "a" + str(i - 4)] = rolling.mean()
        valid_data.loc[mask, "std" + str(i - 4)] = rolling.std(ddof=0)
        # raw=True passes a NumPy array, so x[-1] and x[0] index by position
        valid_data.loc[mask, "diff" + str(i - 4)] = rolling.apply(lambda x: x[-1] - x[0], raw=True)
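
The per-vehicle loop in the cell above can also be written with `groupby(...).rolling(...)`, which avoids re-filtering the frame for every vehicle and column. The following is a minimal sketch rather than the notebook's method: `add_rolling_features` and the toy frame are hypothetical names, columns are selected by name instead of by position, and the frame is assumed to have a unique index.

```python
import numpy as np
import pandas as pd


def add_rolling_features(df, cols, window=5, group_col="vehicleId"):
    """Return a copy of df with a<k>, std<k> and diff<k> columns for each column in cols."""
    out = df.copy()
    grouped = out.groupby(group_col, sort=False)
    for k, col in enumerate(cols, start=1):
        roll = grouped[col].rolling(window=window, min_periods=1)
        # Group-wise rolling results carry a (group, original index) MultiIndex;
        # dropping the group level realigns them with the original rows.
        out[f"a{k}"] = roll.mean().reset_index(level=0, drop=True)
        out[f"std{k}"] = roll.std(ddof=0).reset_index(level=0, drop=True)
        # raw=True hands each window to the lambda as a NumPy array.
        out[f"diff{k}"] = roll.apply(lambda x: x[-1] - x[0], raw=True).reset_index(level=0, drop=True)
    return out


# Hypothetical toy data: one sensor column for two vehicles.
toy = pd.DataFrame({
    "vehicleId": [1, 1, 1, 2, 2],
    "s1": [1.0, 2.0, 4.0, 10.0, 13.0],
})
feats = add_rolling_features(toy, ["s1"], window=2)
print(feats)
```

Because each window is computed within a group, the first rows of a vehicle never see readings from the previous vehicle, which matches the behaviour of the explicit loop.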


In [400]:
# Keep only the engineered rolling features (a*, std*, diff*), dropping targets, ids and unimportant columns.
excluded = {'is_carbrokedown', 'RUL', 'car_brokedown_day', 'vehicleId'}
sel_feat3 = [col for col in set(train_data.columns) - excluded
             if col not in non_imp_features and col.startswith(('a', 'std', 'diff'))]
sel_feat3.extend(['ecoMode', 'days', 'cityMode'])

In [401]:
X_train, y_train, valid_data_x, valid_data_y, valid_data_final_x,valid_data_final_y = prepare_train_test_set(train_data, valid_data,sel_feat3)
print(sel_feat3)
clf3 = build_logistic_model(X_train,y_train)
valid_pred, valid_score, valid_cm = get_predictions(clf3,valid_data_x,valid_data_y)
valid_final_pred, valid_final_score, valid_final_cm = get_predictions(clf3,valid_data_final_x,valid_data_final_y)

['std7', 'std11', 'a10', 'std6', 'a16', 'std17', 'std13', 'diff10', 'a8', 'diff9', 'std9', 'std12', 'a18', 'diff1', 'std10', 'diff5', 'diff6', 'std18', 'std20', 'std19', 'std14', 'diff7', 'std8', 'a2', 'a13', 'std21', 'diff20', 'a5', 'diff17', 'std4', 'diff12', 'diff8', 'diff15', 'a4', 'std1', 'a6', 'std15', 'diff13', 'a3', 'a7', 'diff4', 'std16', 'diff14', 'diff2', 'a17', 'std2', 'a15', 'a20', 'a9', 'std5', 'a21', 'a11', 'diff16', 'std3', 'diff19', 'a14', 'a19', 'diff11', 'a12', 'diff18', 'a1', 'diff3', 'diff21', 'ecoMode', 'days', 'cityMode']


In [402]:
print("normalized input:", valid_score, valid_final_score,X_train.shape,valid_data_final_x.shape)

normalized input: 0.3583536957849725 0.46 (20631, 66) (100, 66)


In [403]:
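from sklearn.metrics import accuracy_score, precision_score, recall_score

THRESHOLD = 0.995
# A row is labelled 0 only when its predicted class-1 probability exceeds THRESHOLD; every other row is labelled 1.
preds3 = np.where(clf3.predict_proba(valid_data_final_x)[:, 1] > THRESHOLD, 0, 1)
# Prints accuracy, recall and precision, in that order, on valid_data_final.
print(accuracy_score(valid_data_final_y, preds3), recall_score(valid_data_final_y, preds3), precision_score(valid_data_final_y, preds3))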

0.71 0.8987341772151899 0.7717391304347826


In [404]:
# Same thresholding rule applied to the full validation set; prints accuracy, recall and precision.
valid_preds3 = np.where(clf3.predict_proba(valid_data_x)[:, 1] > THRESHOLD, 0, 1)
print(accuracy_score(valid_data_y, valid_preds3), recall_score(valid_data_y, valid_preds3), precision_score(valid_data_y, valid_preds3))

0.6751679902260233 0.9949364239900979 0.6774959773197456
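
The two cells above evaluate a single hand-picked `THRESHOLD`. A small sweep makes the trade-off between accuracy, precision and recall visible. The sketch below is illustrative only: `threshold_report` is a hypothetical helper, the labels and probabilities are made-up toy values, and it uses the conventional direction (probability above the threshold means class 1), whereas the cells above assign 0 when the probability exceeds `THRESHOLD`.

```python
import numpy as np


def threshold_report(y_true, proba_pos, thresholds):
    """Return (threshold, accuracy, precision, recall) tuples for each threshold."""
    y_true = np.asarray(y_true)
    proba_pos = np.asarray(proba_pos)
    rows = []
    for t in thresholds:
        # Conventional rule: predict class 1 when its probability exceeds t.
        pred = (proba_pos > t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        accuracy = float(np.mean(pred == y_true))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, accuracy, precision, recall))
    return rows


# Hypothetical toy labels and class-1 probabilities.
y_toy = [0, 0, 1, 1, 1]
p_toy = [0.1, 0.6, 0.4, 0.7, 0.9]
report = threshold_report(y_toy, p_toy, thresholds=[0.3, 0.5, 0.8])
for t, acc, prec, rec in report:
    print(f"threshold={t:.2f} accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

With the real model, `proba_pos` would be `clf3.predict_proba(valid_data_x)[:, 1]`. Choosing the threshold on the validation split and then scoring once on `valid_data_final_x` keeps the final number from being tuned on the data it reports.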
