# PREDICT

- [1. Overview](#1)
- [2. Function](#2)
- [3. Training](#3)
    - [3.1 Create Data Train Feature (Optional)](#31)
    - [3.2 Load Data Train](#32)
    - [3.3 Train Model](#33)
- [4. Predict Data Test](#4)
    - [4.1 Create Data Test Feature](#41)
    - [4.2 Predict Probability](#42)
    - [4.3 Predict Label with Probability Cut Off](#43)
- [5. Save Results](#5)
    - [5.1 Into DataFrame](#51)
    - [5.2 Into CSV file](#52)






<a id="1"></a>

## 1. Overview
The features used in this model are inspired by the Weekly Safety Report from [Grab](https://help.grab.com/driver/en-my/360001944868-Weekly-Safety-Report) and the [Insurance Telematics paper by Peter Handel et al.](https://ieeexplore.ieee.org/document/6936433/authors#authors), as summarized below.

| Peter Handel et al. Feature | Description                                                         | Grab Weekly Safety Report | Available Variable   |
|---------------------|---------------------------------------------------------------------|--------------|----------------------|
| Acceleration        | Number of rapid acceleration events and their harshness             | ✔            | Speed + Time, Gyro         |
| Braking             | Number of harsh braking events and their harshness                  | ✔            | Speed + Time, Gyro         |
| Speeding            | Amount of absolute speeding                                         | _              | Speed                |
| Smoothness          | Long-term speed variations around a nominal speed                   | _              | Speed                |
| Swerving            | Number of abrupt steering maneuvers and their harshness             |_             | Gyro + Acceleration  |
| Cornering           | Number of events when turning at too high speed and their harshness | ✔            | Bearing + Gyro + Time |
| Elapsed time        | Time duration of the trip                                           |_              | Time                 |
  
The Grab telematics data consists of bookingID, Accuracy, Bearing, acceleration, gyro, second, and Speed.
The 16,135,561 telematics data points are aggregated into one data point per bookingID, for a total of 20,000 data points.
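
As an illustration of the one-row-per-bookingID aggregation, here is a minimal sketch using pandas `groupby` on a tiny made-up frame. The toy values and the names `raw` and `agg` are assumptions for illustration only; the notebook itself builds its features with the helper functions in section 2.

```python
import pandas as pd

# Toy telematics rows (made-up values): two trips with a few samples each.
raw = pd.DataFrame({
    "bookingID": [1, 1, 1, 2, 2],
    "second":    [0, 1, 2, 0, 1],
    "Speed":     [0.0, 4.0, 8.0, 3.0, 5.0],
})

# Collapse all rows of a trip into a single row of summary statistics.
agg = raw.groupby("bookingID").agg(
    max_second=("second", "max"),
    mean_Speed=("Speed", "mean"),
    median_Speed=("Speed", "median"),
    max_Speed=("Speed", "max"),
    std_Speed=("Speed", "std"),
).reset_index()

print(agg)
```

Each bookingID ends up as exactly one row, which is the shape the classifier expects.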


| Original Feature              | Feature Aggregation per bookingID                                | Description                          |
|----------------------|------------------------------------------------------------------|--------------------------------------|
| second               | ```max_second```                                                       | `Elapsed Time`                                |
| Speed                | ```mean_Speed```<br/>```median_Speed```<br/>```max_Speed```<br/>```std_Speed```<br/>```speed_diff``` | ```speed_diff``` is the average speed difference over time; together with `std_Speed` it estimates ```Smoothness```.<br/>```max_Speed```, `mean_Speed`, and `median_Speed` estimate ```Speeding```.                      |
| ```acceleration_(x,y,z)``` | ```mean_acceleration_(x,y,z)```<br/>```median_acceleration_(x,y,z)```<br/>```max_acceleration_(x,y,z)```<br/>```min_acceleration_(x,y,z)```<br/>```std_acceleration_(x,y,z)```<br/>```count(1,2,3)_acceleration_(x,y,z)```                                                  | ```count(1,2,3)_acceleration_(x,y,z)``` counts how many times ```acceleration_(x,y,z)``` deviates too far from its median, as described in advanced.pdf. Used to estimate harsh `Acceleration` and `Braking`.<br/><br/>Harsh acceleration and braking result in medium (count2) to high (count3) changes in the acceleration_z value.                           |
| gyro_(x,y,z)         | ```mean_gyro_(x,y,z)```<br/>```median_gyro_(x,y,z)```<br/>```max_gyro_(x,y,z)```<br/>```min_gyro_(x,y,z)```<br/>```std_gyro_(x,y,z)```<br/>```count(1,2,3)_gyro_(x,y,z)```                              | ```count(1,2,3)_gyro_(x,y,z)``` counts how many times ```gyro_(x,y,z)``` deviates too far from its median, as described in advanced.pdf. Used to estimate harsh `Acceleration`, `Braking`, and `Swerving`.<br/><br/>Harsh acceleration and braking create medium (count2) to high (count3) changes in gyro values; swerving creates high (count3) changes in gyro values.      |


The aggregated data is the input to a stacking algorithm consisting of LogisticRegression, RandomForestClassifier, and GradientBoostingClassifier, optimized for AUC. The class prediction probability cut-off is set to 0.5 by default; this number can be adjusted to balance true positives (TP) against false positives (FP).
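
To make the effect of the cut-off concrete, here is a small sketch with made-up probabilities and labels (not results from this model). The helper `confusion_counts` is hypothetical; it only shows that lowering the cut-off catches more dangerous trips at the cost of more false alarms.

```python
def confusion_counts(probs, labels, cutoff):
    """Count true and false positives when predicting 1 for prob >= cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    return tp, fp

# Made-up class-1 probabilities and true labels, for illustration only.
probs = [0.9, 0.6, 0.4, 0.35, 0.2]
labels = [1, 0, 1, 0, 0]

for cutoff in (0.5, 0.3):
    tp, fp = confusion_counts(probs, labels, cutoff)
    print(f"cutoff={cutoff}: TP={tp}, FP={fp}")
```

With these toy numbers, moving the cut-off from 0.5 to 0.3 finds the second dangerous trip but also flags one more safe trip.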

<br/>
  
_* Cornering detection isn't implemented due to insufficient time to integrate bearing + gyro + time_  
_** This notebook contains only the algorithm and final procedure of my solution for detecting dangerous driving. More detailed explanations of the exploratory data analysis, feature engineering, and thought process are in baseline (notebook|pdf) and advanced (notebook|pdf). **_


<a id="2"></a>

## 2. Function

In [9]:




%matplotlib inline
import pandas as pd
import numpy as np
import glob
import gc
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
import copy
import pickle
import swifter
import sys
pd.set_option('display.max_columns', 500)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve, auc, classification_report 
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier


In [5]:
"""
This cell contains functions to save and load objects.

Used to cache aggregated data on disk to save time.
Will not be used in submission.
"""

def save_obj(obj, name ):
    with open('/jet/prs/aiforsea/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name ):
    with open('/jet/prs/aiforsea/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [6]:
def statform(booking_id,var):
    """
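    Compute summary statistics of one variable for a single bookingID.
    Use this in pandas apply (row-wise).
    
    Input:
    booking_id      row of the label data; its bookingID attribute is used
    var             name of the column to summarize
    
    Relies on the global df_dict (dataframes indexed by bookingID).
    Returns std, max, min, mean, median.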
    """
    booking_id = booking_id.bookingID
    std = df_dict[booking_id][var].std()
    maxx = df_dict[booking_id][var].max()
    minn = df_dict[booking_id][var].min()
    mean = df_dict[booking_id][var].mean()
    med = df_dict[booking_id][var].median()
    
    return std, maxx, minn, mean, med

In [7]:
def adv_statform(df_dict,data,booking_id,var):
    """
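    Compute the mean absolute deviation from the median and count how often
    a variable deviates from its median, in three bands of increasing size.
    
    Input:
    df_dict         dataframes indexed by bookingID
    data            aggregated data containing the 'median_'+var column
    booking_id      bookingID
    var             name of the column (gyro_* or acceleration_*)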
    """
#     booking_id = booking_id.bookingID
    var_series = df_dict[booking_id][var]
    med = data.loc[data.bookingID == booking_id]['median_'+var].values[0]
    mad = abs(var_series-med).mean()
    
    cnt = var_series.count()
    
    cut1 = 0
    cut2 = 0
    cut3 = 0
## ver 4 gyro
    if var=='gyro_x' or var=='gyro_y' or var=='gyro_z':
        cut1 = len(np.where((abs(med-var_series)>(0.1)) & (abs(med-var_series)<=(0.2)))[0])
        cut2 = len(np.where((abs(med-var_series)>(0.2)) & (abs(med-var_series)<=(0.4)))[0])
        cut3 = len(np.where(abs(med-var_series)>(0.4))[0])
    
    if var=='acceleration_x' or var=='acceleration_y' or var=='acceleration_z':
        cut1 = len(np.where((abs(med-var_series)>(1.0)) & (abs(med-var_series)<=(2.0)))[0])
        cut2 = len(np.where((abs(med-var_series)>(2.0)) & (abs(med-var_series)<=(4.0)))[0])
        cut3 = len(np.where(abs(med-var_series)>(4.0))[0])
    
#     return cut1,cut2,cut3,cut4,cut5
    return mad,cut1,cut2,cut3
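
The banded counting behind `cut1`, `cut2`, and `cut3` can be tried on a toy series. This is a minimal sketch, assuming the same acceleration thresholds as `adv_statform` (1.0, 2.0, 4.0); the function name `band_counts` and the sample values are made up for illustration.

```python
import numpy as np

def band_counts(values, low, mid, high):
    """Count samples by absolute deviation from the median, in three bands."""
    values = np.asarray(values, dtype=float)
    dev = np.abs(values - np.median(values))
    count1 = int(np.sum((dev > low) & (dev <= mid)))   # mild deviation
    count2 = int(np.sum((dev > mid) & (dev <= high)))  # medium deviation
    count3 = int(np.sum(dev > high))                   # large deviation
    return count1, count2, count3

# Made-up acceleration_z samples; the median is 9.8.
acc_z = [9.8, 9.8, 9.8, 11.3, 12.8, 15.0, 9.8]
print(band_counts(acc_z, 1.0, 2.0, 4.0))
```

Here one sample falls into each band, so a trip with many harsh events would show large `count2` and `count3` values.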

In [8]:
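def ediff(df_dict,booking_id,var):
    """Return the sum of consecutive differences of var divided by its sample count."""
    var_series = df_dict[booking_id][var]
    diffs = np.ediff1d(var_series) # consecutive differences; renamed to avoid shadowing the function
    return diffs.sum()/var_series.count()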
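
One property of `ediff` is worth noting: the sum of consecutive differences telescopes to the last value minus the first, so the result depends only on the endpoints of the series. The sketch below checks this on made-up speeds. The mean absolute difference is shown only as a possible alternative (an assumption, not something the notebook uses).

```python
import numpy as np

# Made-up speed samples for one trip.
speed = np.array([0.0, 5.0, 2.0, 6.0, 3.0])

diffs = np.ediff1d(speed)           # consecutive differences
signed = diffs.sum() / speed.size   # what ediff() computes for this series
endpoints = (speed[-1] - speed[0]) / speed.size

# Alternative (assumption, not used in the notebook): mean absolute change.
mean_abs = np.abs(diffs).mean()

print(signed, endpoints, mean_abs)
```

The signed value and the endpoint formula agree, while the absolute version also reflects the ups and downs in between.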

<a id="3"></a>


## 3. Training

<a id="31"></a>


### 3.1 Create Data Train Feature (Optional)

<a id="32"></a>


### 3.2 Load Data Train

In [10]:
train_data = load_obj('train_data')
train_data_adv = load_obj('train_data_adv')


<a id="33"></a>


### 3.3 Train Model

In [None]:
X = pd.merge(train_data, train_data_adv, on=['bookingID','label'])
X = X.drop('label', axis=1)

# drop columns that carry no useful information
X.drop('bookingID', axis=1, inplace=True)
X.drop(['median_second','mean_second','std_second','min_second'], axis=1, inplace=True) # no meaning, represented by max second
X.drop(['min_Speed'], axis=1, inplace=True) # all 0 
X.drop(['trip_duration'], axis=1, inplace=True) # max_second better

Y = train_data_adv['label']



In [None]:
from pystacknet.pystacknet import StackNetClassifier

modelx=StackNetClassifier(models, metric="auc", folds=4,
    restacking=False,use_retraining=True, use_proba=True, 
    random_state=12345,n_jobs=1, verbose=1)

modelx.fit(X_train,y_train)
# model = RandomForestClassifier() # Backup just in case your pystacknet couldn't be installed properly
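
The cell above relies on `models`, `X_train`, and `y_train`, which are not defined in the cells shown here. For readers without pystacknet, the following is a rough sketch of the same idea using scikit-learn's `StackingClassifier` in place of `StackNetClassifier`. It is a substitute technique, not the author's pipeline: the synthetic data, the hold-out split, and all hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the aggregated per-trip features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=12345)

# Tree-based base learners; a logistic regression combines their outputs.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=12345)),
        ("gb", GradientBoostingClassifier(random_state=12345)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=4,                          # mirrors folds=4 in the StackNet call
    stack_method="predict_proba",  # feed class probabilities to the combiner
)
stack.fit(X_tr, y_tr)

proba = stack.predict_proba(X_te)
print("hold-out AUC: %.3f" % roc_auc_score(y_te, proba[:, 1]))
```

The second column of `proba` is the class-1 probability, which is the quantity the cut-off in section 4.3 is applied to.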




<a id="4"></a>


## 4. Predict Data Test

<a id="41"></a>


### 4.1 Create Data Test Feature

<a id="42"></a>


### 4.2 Predict Probability


In [None]:
preds=modelx.predict_proba(X_test)
preds_train=modelx.predict_proba(X_train)

<a id="43"></a>


### 4.3 Predict Label with Probability Cut Off 

<a id="5"></a>


## 5. Save Results

In [None]:
from sklearn.metrics import confusion_matrix

t = 0.5 # probability cut-off for class 1 (default 0.5)
t1 = 1 - t
pred = pd.DataFrame({'pred0':preds[:,0],'pred1':preds[:,1]})
pred['pred'] = np.where(pred['pred0']<=t1, 1, 0) # label 1 when P(class 0) <= 1 - t

pred_train = pd.DataFrame({'pred0':preds_train[:,0],'pred1':preds_train[:,1]})
pred_train['pred'] = np.where(pred_train['pred0']<=t1, 1, 0)

print ("TRAIN")
print ("Accuracy  : %.8f" % accuracy_score(y_train, pred_train.pred.values))
print ("Recall    : %.8f" % recall_score(y_train, pred_train.pred.values))
print ("Precision : %.8f" % precision_score(y_train, pred_train.pred.values))
print ("AUC score : %.8f" % roc_auc_score(y_train, preds_train[:,1]))
print ("Confusion Matrix : \n", confusion_matrix(y_train, pred_train.pred.values))
print(classification_report(y_train, pred_train.pred.values))
# use the train predictions here (the original printed the test predictions twice)
print("Train predict distribution\n",pd.Series(pred_train.pred.values).value_counts(normalize=True))

print("Test predict distribution\n",pd.Series(pred.pred.values).value_counts(normalize=True))


<a id="51"></a>

### 5.1 Into DataFrame

<a id="52"></a>


### 5.2 Into CSV file