# Safety Category: Models

In this notebook, several models are trained on the preprocessed data and evaluated on a held-out 25% test split.  

**Models:**   
1) Random Forest  
2) Logistic Regression  
3) Support Vector Machine  
4) Neural Network

**Performance (test set):**  

| Model                  | Accuracy   | AUC   |
|------------------------|------------|-------|
| Random Forest          | 0.777      | 0.588 |
| Logistic Regression    | 0.753      | 0.500 |
| Support Vector Machine | 0.752      | 0.499 |
| Neural Network         | 0.745      | 0.532 |

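In the experiments below, **Random Forest** performs best on both accuracy (0.777) and AUC (0.588). The AUC values are computed from hard class predictions rather than predicted probabilities, so they equal balanced accuracy. On that measure, Logistic Regression and the Support Vector Machine score at chance level (about 0.5).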


## Reading the data

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [3]:
# safety_dataset_new = pd.read_csv('safety_dataset_new.csv')
safety_dataset_new = pd.read_csv('safety_dataset_filtered.csv')
print(safety_dataset_new.shape)
safety_dataset_new.head()

(20000, 12)


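|   | bookingID | Speed_perc70 | acceleration_x_min | acceleration_z_std | Bearing_std | acceleration_x_std | Speed_std | acceleration_y_std | acceleration_z_max | Speed_max | time   | label |
|---|-----------|--------------|--------------------|--------------------|-------------|--------------------|-----------|--------------------|--------------------|-----------|--------|-------|
| 0 | 0         | 14.473692    | -4.692294          | 1.141266           | 129.231351  | 0.928022           | 7.199919  | 0.639934           | 2.318857           | 22.946083 | 1589.0 | 0     |
| 1 | 1         | 12.118372    | -5.352994          | 0.854271           | 89.861236   | 0.744157           | 7.059362  | 0.533915           | 1.481293           | 21.882141 | 1034.0 | 1     |
| 2 | 2         | 5.038032     | -2.971295          | 1.020021           | 119.31652   | 0.756589           | 2.897762  | 0.505693           | 2.31287            | 9.360483  | 825.0  | 1     |
| 3 | 4         | 8.217        | -2.866458          | 0.779529           | 71.273774   | 0.52722            | 5.595901  | 0.598023           | 0.296381           | 19.780001 | 1094.0 | 1     |
| 4 | 6         | 7.770113     | -4.352792          | 0.942163           | 111.868249  | 0.826271           | 5.314844  | 0.61721            | 7.977724           | 16.394695 | 1094.0 | 0     |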


In [4]:
X = safety_dataset_new.drop(['label', 'bookingID'], axis=1)
y = safety_dataset_new.label

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [6]:
features = list(X_train.columns)
features

['Speed_perc70',
 'acceleration_x_min',
 'acceleration_z_std',
 'Bearing_std',
 'acceleration_x_std',
 'Speed_std',
 'acceleration_y_std',
 'acceleration_z_max',
 'Speed_max',
 'time']

These are the features selected in the feature selection notebook.
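
The accuracies reported in this notebook cluster around 0.75, so it may be worth comparing them against a majority-class baseline before reading much into them. The sketch below is one way to do that. It runs on synthetic data from `make_classification` as a stand-in for `X` and `y`. The 75/25 class split and the `stratify` option are assumptions for illustration, not properties confirmed by the dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (assumption: ~75% of labels are 0).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Share of each class among the test labels.
class_share = np.bincount(yd_test) / len(yd_test)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(Xd_train, yd_train)
baseline_accuracy = baseline.score(Xd_test, yd_test)

print("Class share:", class_share)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

If a model's accuracy is close to this baseline, most of its score comes from the class imbalance rather than from the features.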

### 1) Random Forest

In [7]:
RSEED = 50

# Create the RF model with 100 trees
rf_model = RandomForestClassifier(n_estimators=100, random_state=RSEED,
                                  max_features='sqrt', n_jobs=-1, verbose=1)

# Fit on training data
rf_model.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    3.0s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=50, verbose=1, warm_start=False)

In [15]:
accuracy_rf = rf_model.score(X_test, y_test)
print("Accuracy:", accuracy_rf)

Accuracy: 0.777


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


In [16]:
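# Note: passing hard class labels (predict) rather than scores (predict_proba)
# makes this value equal to balanced accuracy.
auc_score_rf = roc_auc_score(y_test, rf_model.predict(X_test))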
print("AUC score:", auc_score_rf)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


AUC score: 0.588262849982
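
The AUC above is computed from `rf_model.predict`, which returns class labels. With labels, the ROC curve has a single interior point and the area reduces to the mean of the true positive and true negative rates. A ranking-based AUC needs continuous scores such as `predict_proba(...)[:, 1]`. The sketch below contrasts the two on synthetic data. The dataset and the `*_demo` names are stand-ins, and no claim is made about what the real data would yield.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=50)
rf_demo.fit(Xd_train, yd_train)

hard_labels = rf_demo.predict(Xd_test)                  # 0/1 class labels
positive_scores = rf_demo.predict_proba(Xd_test)[:, 1]  # probability of class 1

# AUC from hard labels coincides with balanced accuracy.
auc_from_labels = roc_auc_score(yd_test, hard_labels)
balanced_acc = balanced_accuracy_score(yd_test, hard_labels)

# AUC from probabilities measures how well the model ranks positives above negatives.
auc_from_scores = roc_auc_score(yd_test, positive_scores)

print("AUC from labels:", auc_from_labels)
print("Balanced accuracy:", balanced_acc)
print("AUC from probabilities:", auc_from_scores)
```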


### 2) Logistic Regression

In [13]:
logistic_regression = LogisticRegression(solver='lbfgs', max_iter=4000)
logistic_regression.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=4000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [17]:
accuracy_logreg = logistic_regression.score(X_test, y_test)
print("Accuracy:", accuracy_logreg)

Accuracy: 0.7532


In [19]:
auc_score_logreg = roc_auc_score(y_test, logistic_regression.predict(X_test))
print("AUC score:", auc_score_logreg)

AUC score: 0.5
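
An AUC of exactly 0.5 from hard labels means the true positive rate equals the false positive rate. That typically happens when a model predicts the same class for every row. One way to check is to count the predicted labels. The sketch below does that and shows one possible remedy: standardizing the features and setting `class_weight='balanced'`. Both are assumptions about what might help rather than settings used in this notebook, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Standardize features, then fit a class-weighted logistic regression.
logreg_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=4000, class_weight="balanced"),
)
logreg_demo.fit(Xd_train, yd_train)

# Count how often each label is predicted; a single key means collapsed predictions.
predicted = logreg_demo.predict(Xd_test)
labels, counts = np.unique(predicted, return_counts=True)
prediction_counts = dict(zip(labels.tolist(), counts.tolist()))

# Probability-based AUC for the positive class.
auc_logreg_demo = roc_auc_score(yd_test, logreg_demo.predict_proba(Xd_test)[:, 1])

print("Predicted label counts:", prediction_counts)
print("AUC from probabilities:", auc_logreg_demo)
```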


### 3) Support Vector Machine

In [22]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [29]:
accuracy_svm = svm.score(X_test, y_test)
print("Accuracy:", accuracy_svm)

Accuracy: 0.7516


In [24]:
auc_score_svm = roc_auc_score(y_test, svm.predict(X_test))
print("AUC score:", auc_score_svm)

AUC score: 0.499210284633
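
The RBF kernel depends on distances between rows, so features on large scales can dominate it. In the `head()` output above, `time` is in the thousands while the acceleration standard deviations are below about 1.2. The sketch below compares an unscaled `SVC` with one wrapped in a `StandardScaler` pipeline. It uses synthetic data, and the `column_scales` factors are hypothetical values that loosely imitate that spread. Because `SVC` is fitted without `probability=True`, the AUC here uses `decision_function` scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
# Hypothetical factors that stretch the columns to very different ranges.
column_scales = np.array([1, 1, 1, 100, 1, 10, 1, 1, 10, 1000], dtype=float)
X_demo = X_demo * column_scales

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same SVC settings as the notebook, without and with standardization.
svm_raw = SVC(gamma="auto").fit(Xd_train, yd_train)
svm_scaled = make_pipeline(StandardScaler(), SVC(gamma="auto")).fit(Xd_train, yd_train)

accuracy_raw = svm_raw.score(Xd_test, yd_test)
accuracy_scaled = svm_scaled.score(Xd_test, yd_test)

# Signed distances to the separating surface serve as ranking scores for AUC.
auc_svm_scaled = roc_auc_score(yd_test, svm_scaled.decision_function(Xd_test))

print("Accuracy without scaling:", accuracy_raw)
print("Accuracy with scaling:", accuracy_scaled)
print("AUC with scaling:", auc_svm_scaled)
```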


### 4) Neural Network

In [26]:
mlpc = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(72, 24, 72), random_state=1, max_iter=500)
mlpc.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(72, 24, 72), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [30]:
accuracy_nn = mlpc.score(X_test, y_test)
print("Accuracy:", accuracy_nn)

Accuracy: 0.7452


In [31]:
auc_score_nn = roc_auc_score(y_test, mlpc.predict(X_test))
print("AUC score:", auc_score_nn)

AUC score: 0.53201080038
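
The per-model cells above could be folded into a single loop that builds the summary table directly. The sketch below is one way to do that. It uses synthetic data, the `ranking_scores` helper is hypothetical, and scaling the non-tree models is an assumption that the original cells do not make.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=50
    ),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(solver="lbfgs", max_iter=4000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(gamma="auto")),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(solver="adam", alpha=1e-5, hidden_layer_sizes=(72, 24, 72),
                      random_state=1, max_iter=500),
    ),
}


def ranking_scores(model, X):
    """Return continuous positive-class scores (hypothetical helper)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    # SVC without probability=True only exposes decision_function.
    return model.decision_function(X)


rows = []
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    rows.append({
        "Model": name,
        "Accuracy": model.score(Xd_test, yd_test),
        "AUC": roc_auc_score(yd_test, ranking_scores(model, Xd_test)),
    })

summary = pd.DataFrame(rows).set_index("Model").round(3)
print(summary)
```

Evaluating every model through the same loop keeps the split, the metric definitions, and the rounding identical across rows of the table.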
