# Model Definition 1

We'll be defining a few models here to compare: a simple random forest classifier, a logistic regressor, a 1D CNN, a multi-layer perceptron, and a combined hybrid neural network with converging CNN and MLP networks. The logistic regressor will serve as our baseline model.

Let's load our imports and data and get to work.

In [1]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib

import plaidml.keras as pk
pk.install_backend()

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Conv1D, Dropout, LeakyReLU, MaxPooling1D, Embedding, Flatten, Input, Concatenate

### Reading in training and validation data

In [2]:
with open('./sequence_data.pickle', 'rb') as f:
    sequence_data = pickle.load(f)
    
with open('./numerical_data.pickle', 'rb') as f:
    numeric_data = pickle.load(f)

In [3]:
X_seq, y = sequence_data

In [4]:
X_num, y = numeric_data

In [5]:
print(X_seq.shape)
print(X_num.shape)
print(y.shape)

(68486, 5)
(68486, 8)
(68486,)


In [6]:
X = np.concatenate([X_num, X_seq], axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.33)

In [7]:
xs_train, xs_test, ys_train, ys_test = train_test_split(X_seq, y, test_size=.33)
xn_train, xn_test, yn_train, yn_test = train_test_split(X_num, y, test_size=.33)
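Each `train_test_split` call above shuffles independently, so `x_train`, `xs_train`, and `xn_train` hold different rows, and the models below are validated on different held-out sets. If we wanted every model scored on the same rows, one option is to shuffle a single index array and reuse it. The following is only a sketch on small made-up arrays; `split_indices` and the `*_demo` names are hypothetical and not part of this notebook.

```python
import numpy as np

def split_indices(n_rows, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) from one shuffled permutation of row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]

# Toy stand-ins for X_seq, X_num and y (5 and 8 columns, as in the real data).
X_seq_demo = np.arange(50).reshape(10, 5)
X_num_demo = np.arange(80).reshape(10, 8)
y_demo = np.arange(10)

# One set of indices, reused for every array, keeps the rows aligned.
train_idx, test_idx = split_indices(len(y_demo))
xs_tr, xs_te = X_seq_demo[train_idx], X_seq_demo[test_idx]
xn_tr, xn_te = X_num_demo[train_idx], X_num_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]
```

`train_test_split` can achieve the same thing directly: it accepts several arrays in one call and a `random_state`, and returns a train/test pair for each array using the same shuffle.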

### Non-Deep-Learning Methods

We'll start with our non-deep-learning models -- the random forest and the logistic regressor -- beginning with the random forest.

#### Random Forest

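We're doing a randomized hyperparameter search over a wide parameter space: `RandomizedSearchCV` samples 100 candidate settings and scores each with 3-fold cross-validation. Sampling lets us cover a wide range of hyperparameter values at a fraction of the cost of an exhaustive grid search, though it isn't guaranteed to find the single best combination.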

In [None]:
rf = RandomForestClassifier()

n_estimators = list(range(1, 201, 10))
max_features = ['auto', 'sqrt']
max_depth = list(range(1, 101, 10)) + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

params = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap   
}

rf_grid = RandomizedSearchCV(estimator = rf, param_distributions = params,
                             n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
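To get a feel for how much of the space 100 random draws actually cover, we can count the combinations in the lists above. This is just back-of-the-envelope arithmetic; the dictionary below restates the list lengths from the cell above and is not used by the search itself.

```python
from math import prod

# Number of options in each hyperparameter list defined above.
space = {
    "n_estimators": len(range(1, 201, 10)),    # 1, 11, ..., 191
    "max_features": 2,
    "max_depth": len(range(1, 101, 10)) + 1,   # ten depths plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = prod(space.values())
n_iter, cv = 100, 3
coverage = n_iter / total_combinations   # fraction of the grid that gets sampled
total_fits = n_iter * cv                 # models trained during the search

print(total_combinations, f"{coverage:.2%}", total_fits)
```

The grid has 7,920 combinations, so the search samples a little over 1% of it and trains 300 models. In the scikit-learn version used here, `'auto'` and `'sqrt'` mean the same thing for a random forest classifier, so the number of genuinely distinct settings is half that.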

In [9]:
rf_grid.fit(x_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done 150 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  2.6min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [10]:
best_rf = rf_grid.best_estimator_

preds = best_rf.predict(x_test)
acc = accuracy_score(y_test, preds)  # scikit-learn expects (y_true, y_pred); accuracy is the same either way
print(f'Accuracy: {round(acc*100,2)}%')

Accuracy: 92.84%


Right out of the box, we get a very strong model! 92.8% accuracy is great, and given that our dataset is balanced by design, our other metrics should be around 93% as well. Let's take a look at the classification report to verify.

In [11]:
print(classification_report(preds, y_test))

              precision    recall  f1-score   support

           0       0.94      0.92      0.93     11705
           1       0.91      0.94      0.93     10896

    accuracy                           0.93     22601
   macro avg       0.93      0.93      0.93     22601
weighted avg       0.93      0.93      0.93     22601



In [12]:
roc_auc_score(preds, y_test)

0.9287260557036948

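As expected, precision, recall, F1, and AUC-ROC scores are all very good -- around 93%. One caveat: scikit-learn's metric functions expect `(y_true, y_pred)`, but `classification_report` and `roc_auc_score` above receive `preds` first. That swaps the precision and recall columns (and the support counts), and the AUC is computed from hard class labels rather than predicted probabilities. With scores this close together, the overall picture is unlikely to change much.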
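Argument order matters for these metrics because scikit-learn treats the first argument as the ground truth. The small pure-Python sketch below shows the effect on made-up labels: passing predictions first swaps precision and recall. `precision_recall` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from paired label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# Made-up labels, for illustration only.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

correct_order = precision_recall(y_true_demo, y_pred_demo)   # (0.6, 0.75)
swapped_order = precision_recall(y_pred_demo, y_true_demo)   # (0.75, 0.6)
print(correct_order, swapped_order)
```

Accuracy and F1 are symmetric in their two arguments, so they are unaffected by the order.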

We'll persist the fitted search object (which includes the best estimator) with joblib for use in further evaluation.

In [13]:
joblib.dump(rf_grid, './random_forest_model.joblib')

['./random_forest_model.joblib']

In [14]:
vocab_size = 300
max_length = 5
batch_size = 32

#### Baseline Logistic Regressor

Next we'll make a simple logistic regressor to serve as a performance baseline for our deep learning networks.

We'll first embed the inputs (the 13 columns of `x_train`), flatten the embeddings, and then pass them through a single sigmoid unit.

In [15]:
model = Sequential()

model.add(Embedding(vocab_size, 3, input_length=13))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy',
             metrics=['acc'])

INFO:plaidml:Opening device "metal_amd_radeon_pro_570x.0"
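To make the data flow concrete, here is a NumPy sketch of what this model computes at prediction time: an embedding lookup, a flatten, and one sigmoid unit. The weights are random stand-ins rather than the trained ones, and `predict_proba` is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, n_cols = 300, 3, 13

# Randomly initialised stand-ins for the learned weights.
embedding = rng.normal(size=(vocab_size, embed_dim))
w = rng.normal(size=(n_cols * embed_dim, 1))
b = np.zeros(1)

def predict_proba(x_int):
    """Embedding lookup -> flatten -> one sigmoid unit."""
    embedded = embedding[x_int]                  # (batch, 13, 3)
    flat = embedded.reshape(len(x_int), -1)      # (batch, 39)
    logits = flat @ w + b                        # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))

batch = rng.integers(0, vocab_size, size=(4, n_cols))  # made-up integer inputs
probs = predict_proba(batch)
n_params = embedding.size + w.size + b.size            # 900 + 39 + 1
```

The lookup only makes sense for integer indices in `[0, vocab_size)`. Since the model is fed all 13 columns, this assumes the eight numerical columns are integer-coded in that range, which we can't confirm from this notebook alone.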


In [16]:
model.fit(x_train, y_train,
         batch_size=batch_size,
         epochs=25,
         validation_data=(x_test, y_test),
         verbose=1)

Train on 45885 samples, validate on 22601 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7fa979161690>

The logistic regressor is scoring around 86% accuracy. This will make a fine baseline by which to judge our other models.

We'll save the full model (architecture and weights) as an h5 file.

In [17]:
model.save('./logistic_regressor.h5')

### Deep Learning Methods

For the deep learning approach we'll implement two separate models, a CNN and an MLP, and then combine the two into a hybrid model. The reason is that we have two types of data to model: sequence data and numerical/categorical data. The CNN will model the sequence data, and the MLP will model the numerical and categorical data. We'll then feed the outputs of both into a third model and see whether performance improves.

Let's start by developing our 1D CNN.

#### Convolutional NN

We'll start small, with only one convolutional layer followed by max pooling and a sigmoid output layer. The inputs will be our "route sentences".

In [18]:
cnn = Sequential()

cnn.add(Embedding(vocab_size, 3, input_length=5))

cnn.add(Conv1D(64, kernel_size=3, strides=1))
cnn.add(MaxPooling1D(pool_size=2))
cnn.add(Flatten())

cnn.add(Dense(1, activation='sigmoid'))

cnn.compile(optimizer='adam', loss='binary_crossentropy',
             metrics=['acc'])
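With sequences only five tokens long, it's worth tracing how little is left after the convolution and pooling. The helper functions below are a sketch of the usual output-length formulas for `'valid'` padding (the Keras default for both layers); they are written for this illustration and are not Keras functions.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

seq_len, embed_dim, filters = 5, 3, 64

after_conv = conv1d_out_len(seq_len, kernel_size=3)       # 3 positions
after_pool = maxpool1d_out_len(after_conv, pool_size=2)   # 1 position
flat_units = after_pool * filters                         # 64 features

# Trainable parameters per layer (weights plus biases).
params = {
    "embedding": 300 * embed_dim,
    "conv1d": 3 * embed_dim * filters + filters,
    "dense": flat_units * 1 + 1,
}
print(after_conv, after_pool, flat_units, params)
```

The pooling window covers only the first two of the three convolution outputs, so the third position never reaches the dense layer.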

In [19]:
cnn.fit(xs_train, ys_train,
         batch_size=batch_size,
         epochs=25,
         validation_data=(xs_test, ys_test),
         verbose=1)

Train on 45885 samples, validate on 22601 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7fa949f21990>

This CNN isn't performing very well -- it lands about one percentage point below the simple logistic regressor. Let's see if we can improve it by adding layers to the network.

In [20]:
cnn.save('./baseline_cnn.85acc.h5')

#### Improved CNN

Let's take the model above, double the convolution filters from 64 to 128, and add LeakyReLU activations plus three hidden dense layers with dropout.

In [21]:
cnn = Sequential()

cnn.add(Embedding(vocab_size, 3, input_length=5))

cnn.add(Conv1D(128, kernel_size=3, strides=1))
cnn.add(LeakyReLU())
cnn.add(MaxPooling1D(pool_size=2))

cnn.add(Flatten())

cnn.add(Dense(1024))
cnn.add(LeakyReLU())
cnn.add(Dropout(.5))

cnn.add(Dense(512))
cnn.add(LeakyReLU())
cnn.add(Dropout(.5))

cnn.add(Dense(256))
cnn.add(LeakyReLU())
cnn.add(Dropout(.5))

cnn.add(Dense(1, activation='sigmoid'))

cnn.compile(optimizer='adam', loss='binary_crossentropy',
             metrics=['acc'])

In [22]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', patience=3, 
                                            verbose=2, factor=0.5, min_lr=0.00001)

best_model = ModelCheckpoint('./cnn.2.h5', monitor='val_acc', verbose=2, 
                             save_best_only=True, mode='max')

early_stopping = EarlyStopping(monitor='val_loss', min_delta=1e-10, 
                               patience=5, restore_best_weights=True)
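The `factor` and `min_lr` settings determine every learning rate that `ReduceLROnPlateau` can step down to. Here is a quick sketch that enumerates them; `reduction_schedule` is a hypothetical helper, and the starting rate of 0.001 is an assumption based on the Keras default for Adam.

```python
def reduction_schedule(initial_lr, factor, min_lr):
    """List the learning rates produced by successive plateau reductions."""
    rates = []
    lr = initial_lr
    while lr > min_lr:
        lr = max(lr * factor, min_lr)   # each reduction is clamped at min_lr
        rates.append(lr)
    return rates

# 0.001 is assumed as the starting rate (the Keras default for Adam).
schedule = reduction_schedule(initial_lr=1e-3, factor=0.5, min_lr=1e-5)
for step, lr in enumerate(schedule, start=1):
    print(f"reduction {step}: {lr:.6g}")
```

There are at most seven reductions before the rate bottoms out at `1e-05`. Note that the learning-rate and checkpoint callbacks watch `val_acc`, while early stopping watches `val_loss`, so they can disagree about which epoch was best.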

In [23]:
hist = cnn.fit(xs_train, ys_train,
         batch_size=batch_size,
         epochs=50,
         validation_data=(xs_test, ys_test),
         callbacks=[learning_rate_reduction,best_model,early_stopping],
         verbose=1)

Train on 45885 samples, validate on 22601 samples
Epoch 1/50

Epoch 00001: val_acc improved from -inf to 0.83669, saving model to ./cnn.2.h5
Epoch 2/50

Epoch 00002: val_acc improved from 0.83669 to 0.84815, saving model to ./cnn.2.h5
Epoch 3/50

Epoch 00003: val_acc improved from 0.84815 to 0.85151, saving model to ./cnn.2.h5
Epoch 4/50

Epoch 00004: val_acc did not improve from 0.85151
Epoch 5/50

Epoch 00005: val_acc improved from 0.85151 to 0.86151, saving model to ./cnn.2.h5
Epoch 6/50

Epoch 00006: val_acc did not improve from 0.86151
Epoch 7/50

Epoch 00007: val_acc did not improve from 0.86151
Epoch 8/50

Epoch 00008: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00008: val_acc did not improve from 0.86151
Epoch 9/50

Epoch 00009: val_acc improved from 0.86151 to 0.86240, saving model to ./cnn.2.h5
Epoch 10/50

Epoch 00010: val_acc improved from 0.86240 to 0.86341, saving model to ./cnn.2.h5
Epoch 11/50

Epoch 00011: val_acc improved from 0.86341 to 


Epoch 00039: val_acc improved from 0.87363 to 0.87390, saving model to ./cnn.2.h5
Epoch 40/50

Epoch 00040: val_acc improved from 0.87390 to 0.87443, saving model to ./cnn.2.h5
Epoch 41/50

Epoch 00041: val_acc did not improve from 0.87443
Epoch 42/50

Epoch 00042: val_acc improved from 0.87443 to 0.87465, saving model to ./cnn.2.h5
Epoch 43/50

Epoch 00043: val_acc did not improve from 0.87465
Epoch 44/50

Epoch 00044: val_acc did not improve from 0.87465
Epoch 45/50

Epoch 00045: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.

Epoch 00045: val_acc did not improve from 0.87465
Epoch 46/50

Epoch 00046: val_acc improved from 0.87465 to 0.87487, saving model to ./cnn.2.h5
Epoch 47/50

Epoch 00047: val_acc did not improve from 0.87487
Epoch 48/50

Epoch 00048: val_acc did not improve from 0.87487
Epoch 49/50

Epoch 00049: val_acc improved from 0.87487 to 0.87509, saving model to ./cnn.2.h5
Epoch 50/50

Epoch 00050: val_acc did not improve from 0.87509


It's performing a bit better now, at 87.5% validation accuracy. This is satisfactory as a version 1 model, so let's move on to our MLP.

#### MLP

Our MLP will be straightforward: four hidden dense layers (512, 1024, 1024, and 512 units), each followed by LeakyReLU and 50% dropout, and a sigmoid output layer.

In [24]:
mlp = Sequential()

mlp.add(Dense(512, input_shape=(8,)))
mlp.add(LeakyReLU())
mlp.add(Dropout(.5))

mlp.add(Dense(1024))
mlp.add(LeakyReLU())
mlp.add(Dropout(.5))

mlp.add(Dense(1024))
mlp.add(LeakyReLU())
mlp.add(Dropout(.5))

mlp.add(Dense(512))
mlp.add(LeakyReLU())
mlp.add(Dropout(.5))

mlp.add(Dense(1, activation='sigmoid'))

mlp.compile(optimizer='adam', loss='binary_crossentropy',
             metrics=['acc'])

learning_rate_reduction_mlp = ReduceLROnPlateau(monitor='val_acc', patience=3, 
                                            verbose=2, factor=0.5, min_lr=0.00001)

best_model_mlp = ModelCheckpoint('./mlp.1.h5', monitor='val_acc', verbose=2, 
                             save_best_only=True, mode='max')

early_stopping_mlp = EarlyStopping(monitor='val_loss', min_delta=1e-10, 
                               patience=5, restore_best_weights=True)
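For a sense of scale, we can count the trainable parameters in this MLP from the layer widths alone. This is plain arithmetic rather than a Keras call, and `dense_params` is a hypothetical helper for the illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [8, 512, 1024, 1024, 512, 1]   # input width, hidden widths, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```

That comes to roughly 2.1 million parameters for eight input features, which is one reason the heavy dropout is there.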

In [25]:
mlp.fit(xn_train, yn_train,
         batch_size=batch_size,
         epochs=5,
         validation_data=(xn_test, yn_test),
         callbacks=[learning_rate_reduction_mlp,best_model_mlp,early_stopping_mlp],
         verbose=1
)

Train on 45885 samples, validate on 22601 samples
Epoch 1/5

Epoch 00001: val_acc improved from -inf to 0.86607, saving model to ./mlp.1.h5
Epoch 2/5

Epoch 00002: val_acc improved from 0.86607 to 0.86624, saving model to ./mlp.1.h5
Epoch 3/5

Epoch 00003: val_acc did not improve from 0.86624
Epoch 4/5

Epoch 00004: val_acc did not improve from 0.86624
Epoch 5/5

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00005: val_acc did not improve from 0.86624


<keras.callbacks.History at 0x7fa939ac4bd0>

This model is performing about as well as the simple logistic regressor, reaching 86.6% validation accuracy after five epochs.

Let's combine our two models and see if the two parts form a stronger whole.

#### Hybrid Model

The hybrid model has two branches. The sequence data goes into a 1D CNN, and the numerical and categorical data go into an MLP. Each branch ends in a dense layer with 64 outputs. These outputs are concatenated and flow into a final MLP with four hidden dense layers and a sigmoid output layer.

Let's build the model and see how it does.

In [26]:
def build_model():

    # cnn
    seq_input = Input(shape=(max_length,))

    x = Embedding(vocab_size, 3, input_length=max_length)(seq_input)

    x = Conv1D(256, kernel_size=3, strides=1)(x)
    x = LeakyReLU()(x)
    x = MaxPooling1D(pool_size=2)(x)

    x = Flatten()(x)

    x = Dense(1024)(x)
    x = LeakyReLU()(x)
    x = Dropout(.5)(x)

    x = Dense(1024)(x)
    x = LeakyReLU()(x)
    x = Dropout(.5)(x)

    x = Dense(512)(x)
    x = LeakyReLU()(x)
    x = Dropout(.5)(x)
    
    x = Dense(512)(x)
    x = LeakyReLU()(x)
    x = Dropout(.5)(x)

    seq_output = Dense(64, activation='relu')(x)

    cnn = Model(inputs=seq_input, outputs=seq_output)

    # mlp
    num_input = Input(shape=(8,))
    y = Dense(512)(num_input)
    y = LeakyReLU()(y)
    y = Dropout(.5)(y)

    y = Dense(1024)(y)
    y = LeakyReLU()(y)
    y = Dropout(.5)(y)

    y = Dense(1024)(y)
    y = LeakyReLU()(y)
    y = Dropout(.5)(y)

    y = Dense(512)(y)
    y = LeakyReLU()(y)
    y = Dropout(.5)(y)

    mlp_output = Dense(64, activation='relu')(y)

    mlp = Model(inputs=num_input, outputs=mlp_output)

    # combine
    combined = Concatenate()([cnn.output, mlp.output])
    
    z = Dense(512)(combined)
    z = LeakyReLU()(z)
    z = Dropout(.5)(z)
    
    z = Dense(512)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    z = Dense(256)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    z = Dense(64)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    output = Dense(1, activation='sigmoid')(z)

    final_model = Model(inputs=mlp.inputs + cnn.inputs, outputs=[output])

    final_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    return final_model
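The merge step is the heart of this model, so here is a NumPy sketch of what happens to one batch at that point. The branch outputs and weights are random stand-ins, and the LeakyReLU slope of 0.3 is an assumption about the Keras default, so check it against your version.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Stand-ins for the two 64-unit branch outputs (ReLU outputs are non-negative).
cnn_features = np.maximum(rng.normal(size=(batch, 64)), 0.0)
mlp_features = np.maximum(rng.normal(size=(batch, 64)), 0.0)

# Concatenate joins along the last axis: 64 + 64 = 128 features per row.
combined = np.concatenate([cnn_features, mlp_features], axis=-1)

# First layer of the head: Dense(512) followed by LeakyReLU.
w = rng.normal(scale=0.05, size=(128, 512))
z = combined @ w
alpha = 0.3   # assumed negative slope of LeakyReLU
z = np.where(z > 0, z, alpha * z)

print(combined.shape, z.shape)
```

Because both branches feed one loss, the CNN and MLP weights are trained jointly rather than frozen from the earlier standalone runs.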

In [27]:
model = build_model()

In [28]:
num_train = x_train[:, :8]
seq_train = x_train[:, 8:]

num_test = x_test[:, :8]
seq_test = x_test[:, 8:]

xc_train = [num_train, seq_train] 
xc_test = [num_test, seq_test] 
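The slicing above relies on the column order from the earlier `np.concatenate([X_num, X_seq], axis=1)`: the first 8 columns are numerical and the last 5 are the sequence tokens. The toy example below, using made-up arrays, checks that the slices recover the original pieces and shows one side effect worth knowing about.

```python
import numpy as np

# Toy stand-ins: 8 float "numerical" columns and 5 integer "sequence" columns.
X_num_demo = np.linspace(0.0, 1.0, 24).reshape(3, 8)
X_seq_demo = np.array([[12, 7, 0, 0, 0],
                       [3, 44, 9, 1, 0],
                       [250, 2, 2, 5, 8]])

X_demo = np.concatenate([X_num_demo, X_seq_demo], axis=1)

num_part = X_demo[:, :8]   # first 8 columns -> MLP branch
seq_part = X_demo[:, 8:]   # last 5 columns  -> CNN branch

# Concatenating floats with ints upcasts everything to float.
print(X_demo.dtype, seq_part.dtype)
```

If the numerical data is stored as floats, the token IDs come back as floats too. The list order `[num, seq]` also has to match `inputs=mlp.inputs + cnn.inputs` in `build_model`, which it does here.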

In [29]:
learning_rate_reduction_combined = ReduceLROnPlateau(monitor='val_acc', patience=3, 
                                            verbose=2, factor=0.5, min_lr=0.00001)

best_model_combined = ModelCheckpoint('./combined_cnn_mlp_model.1.4.h5', monitor='val_acc', verbose=2, 
                             save_best_only=True, mode='max')

early_stopping_combined = EarlyStopping(monitor='val_loss', min_delta=1e-10, 
                               patience=10, restore_best_weights=True)
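Early stopping here watches `val_loss` with a patience of 10, while the checkpoint and learning-rate callbacks watch `val_acc`. To show how patience plays out, here is a simplified pure-Python sketch of the stopping rule. It is not the Keras implementation, `early_stopping_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stopped_epoch, best_epoch), 1-indexed; stopped_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses, for illustration only.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.40]

print(early_stopping_epoch(losses, patience=5))    # (8, 3)
print(early_stopping_epoch(losses, patience=10))   # (None, 3)
```

With a larger patience the run simply continues to the last epoch, which is what happens in the hybrid model's training log.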

In [30]:
hist = model.fit(xc_train, y_train,
         batch_size=batch_size,
         epochs=50,
         validation_data=(xc_test, y_test),
         callbacks = [learning_rate_reduction_combined, best_model_combined, early_stopping_combined],
         verbose=1
)

Train on 45885 samples, validate on 22601 samples
Epoch 1/50

Epoch 00001: val_acc improved from -inf to 0.88430, saving model to ./combined_cnn_mlp_model.2.h5
Epoch 2/50

Epoch 00002: val_acc improved from 0.88430 to 0.89372, saving model to ./combined_cnn_mlp_model.2.h5
Epoch 3/50

Epoch 00003: val_acc did not improve from 0.89372
Epoch 4/50

Epoch 00004: val_acc improved from 0.89372 to 0.89899, saving model to ./combined_cnn_mlp_model.2.h5
Epoch 5/50

Epoch 00005: val_acc did not improve from 0.89899
Epoch 6/50

Epoch 00006: val_acc did not improve from 0.89899
Epoch 7/50

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00007: val_acc did not improve from 0.89899
Epoch 8/50

Epoch 00008: val_acc improved from 0.89899 to 0.90385, saving model to ./combined_cnn_mlp_model.2.h5
Epoch 9/50

Epoch 00009: val_acc did not improve from 0.90385
Epoch 10/50

Epoch 00010: val_acc did not improve from 0.90385
Epoch 11/50

Epoch 00011: val_acc improved from


Epoch 00037: ReduceLROnPlateau reducing learning rate to 1e-05.

Epoch 00037: val_acc did not improve from 0.90983
Epoch 38/50

Epoch 00038: val_acc did not improve from 0.90983
Epoch 39/50

Epoch 00039: val_acc did not improve from 0.90983
Epoch 40/50

Epoch 00040: val_acc did not improve from 0.90983
Epoch 41/50

Epoch 00041: val_acc did not improve from 0.90983
Epoch 42/50

Epoch 00042: val_acc improved from 0.90983 to 0.91009, saving model to ./combined_cnn_mlp_model.2.h5
Epoch 43/50

Epoch 00043: val_acc did not improve from 0.91009
Epoch 44/50

Epoch 00044: val_acc did not improve from 0.91009
Epoch 45/50

Epoch 00045: val_acc did not improve from 0.91009
Epoch 46/50

Epoch 00046: val_acc did not improve from 0.91009
Epoch 47/50

Epoch 00047: val_acc did not improve from 0.91009
Epoch 48/50

Epoch 00048: val_acc did not improve from 0.91009
Epoch 49/50

Epoch 00049: val_acc did not improve from 0.91009
Epoch 50/50

Epoch 00050: val_acc did not improve from 0.91009


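The performance improved by a lot! The individual CNN and MLP models scored about 87.5% and 86.6% validation accuracy, while our hybrid model reached 91.0%. That is still a little short of the random forest's 92.8%, and because the CNN and MLP were validated on their own random splits, the comparison between models is approximate.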

Next, we will try tweaking the inputs for model_def version 2.