In this final solution, we will use neural networks to see whether they can boost the modeling performance. As in the previous solution, we will start by importing the necessary modules and then load the dataset. 

In [1]:
# Filter the unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import numpy
import numpy as np

# Fix the random seed
np.random.seed(7)

In [3]:
# Load the NumPy arrays that will serve as our datasets from now on
X_train, y_train = np.load("X_train.npy", allow_pickle=True), np.load("y_train.npy", allow_pickle=True)
X_test, y_test = np.load("X_test.npy", allow_pickle=True), np.load("y_test.npy", allow_pickle=True)

In [4]:
# Other imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from wandb.keras import WandbCallback
import wandb

In [5]:
# Fix TensorFlow's random seed for better reproducibility
import tensorflow as tf
tf.random.set_seed(666)

I find it convenient to wrap my model definition inside a function and return a compiled version of the model from the function. 

In [6]:
# Model building using the Sequential API
def get_training_model(data):
    model = Sequential()

    model.add(Dense(40, activation="relu",
              kernel_initializer="uniform", input_dim=data.shape[1]))
    model.add(Dense(30, activation="relu",
              kernel_initializer="uniform"))
    model.add(Dense(1, activation="sigmoid"))

    model.compile(loss="binary_crossentropy", optimizer=Adam(), metrics=["accuracy"])
    return model

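When using the `ReLU` activation function, it is generally a good idea to initialize the weights of your neural network with the He uniform scheme, which can be set by passing `"he_uniform"` to the `kernel_initializer` argument. Note that the model above uses the plain `"uniform"` initializer instead. 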
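
To make the He uniform scheme a little more concrete, here is a minimal NumPy sketch of what it does: weights are drawn from a uniform distribution on `[-limit, limit]` with `limit = sqrt(6 / fan_in)`, which gives them a variance of `2 / fan_in`. The helper name `he_uniform` and the layer sizes (30 inputs, 40 units) are assumptions made for illustration; in Keras you would simply pass `kernel_initializer="he_uniform"` to `Dense`.

```python
import numpy as np


def he_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit).

    limit = sqrt(6 / fan_in), so the weights have variance 2 / fan_in.
    """
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


# Hypothetical layer sizes, chosen only for illustration
rng = np.random.default_rng(7)
weights = he_uniform(fan_in=30, fan_out=40, rng=rng)

print(weights.shape)
print(weights.var(), 2.0 / 30)  # empirical variance vs. the theoretical 2 / fan_in
```

The point of scaling by `fan_in` is to keep the variance of the activations roughly stable from layer to layer when ReLU zeroes out about half of them.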

In [7]:
# Instantiate the model and print its summary
model = get_training_model(X_train)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 40)                1240      
_________________________________________________________________
dense_1 (Dense)              (None, 30)                1230      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 31        
Total params: 2,501
Trainable params: 2,501
Non-trainable params: 0
_________________________________________________________________
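
As a quick sanity check on the summary above, the parameter count of a `Dense` layer is `(inputs + 1) * units`: one weight per input for every unit, plus one bias per unit. The sketch below reproduces the numbers in the summary, assuming 30 input features (the value implied by the 1,240 parameters of the first layer); the helper name `dense_param_count` is made up for this example.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return (n_inputs + 1) * n_units


# 30 input features (assumed), followed by the three Dense layers of the model
layer_sizes = [30, 40, 30, 1]
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [1240, 1230, 31]
print(sum(counts))  # 2501
```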


Early stopping is a good way to prevent overfitting, and in many cases you won't need to train the model for all the epochs. If the monitored quantity does not improve for a fixed number of consecutive epochs (the patience, which you set), training is stopped and the best weights seen up to that point can be restored. Let's define an `EarlyStopping` callback. 

In [8]:
from tensorflow.keras.callbacks import EarlyStopping

In [9]:
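es_cb = EarlyStopping(monitor="loss", # watch the training loss (no validation data is passed to fit)
    patience=5, # number of epochs without improvement to wait before stopping
    restore_best_weights=True, # restore the best weights once the loss stops improving
    verbose=1)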
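
To see how the patience mechanism plays out, here is a small pure-Python sketch that replays a list of per-epoch losses and reports where training would stop and which epoch's weights would be kept. It is a simplified illustration of the idea and not the Keras implementation; the function name `simulate_early_stopping` and the loss values are made up.

```python
def simulate_early_stopping(losses, patience=5):
    """Return (stopping_epoch, best_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs. Epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without triggering early stopping
    return len(losses), best_epoch


# Made-up loss curve: improves for three epochs, then stalls
losses = [0.50, 0.40, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35, 0.20]
print(simulate_early_stopping(losses, patience=5))  # (8, 3)
```

Note that the last loss value (0.20) is never reached: training has already stopped at epoch 8, and the weights from epoch 3 would be the ones restored.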

Let's start training our shallow network. 

In [10]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import time 

In [11]:
def train_network(model, name):
    # Initialize Weights and Biases
    wandb.init(project="phishing-websites-detection", name=name)

    # Train the model and record the wall-clock training time
    start = time.time()
    history = model.fit(X_train, y_train, batch_size=64, epochs=128, verbose=1,
        callbacks=[es_cb, WandbCallback()])
    training_time = time.time() - start

    # predict_classes() was removed in TensorFlow 2.6; thresholding the sigmoid
    # outputs at 0.5 gives the same class labels
    prediction = (model.predict(X_test) > 0.5).astype("int32")

    # Compute the metrics once and log them to Weights and Biases
    accuracy = accuracy_score(y_test, prediction) * 100.0
    precision, recall, _, _ = precision_recall_fscore_support(y_test, prediction, average="macro")
    wandb.log({"accuracy": accuracy,
               "precision": precision,
               "recall": recall,
               "training_time": training_time})

    print("Accuracy score of the neural network classifier {0:.2f}%".format(accuracy))
    print("\n")
    print("----Classification report of the neural network classifier----")
    print("\n")
    print(classification_report(y_test, prediction, target_names=["Phishing Websites", "Normal Websites"]))

I would encourage you to pass other things, such as the data, the number of epochs, and the batch size, as arguments to the function for more flexibility and control. 
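
One possible shape for such a function is sketched below. The name `train_and_predict`, its keyword arguments, and the `threshold` parameter are all hypothetical; the sketch only assumes that `model` exposes Keras-style `fit` and `predict` methods, and it leaves the metric computation and logging to the caller.

```python
import time

import numpy as np


def train_and_predict(model, X_train, y_train, X_test, *,
                      epochs=128, batch_size=64, callbacks=None,
                      threshold=0.5, verbose=1):
    """Fit `model`, then return (history, class predictions, training time in seconds)."""
    start = time.time()
    history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,
                        verbose=verbose, callbacks=list(callbacks or []))
    training_time = time.time() - start

    # Turn the sigmoid outputs into 0/1 class labels
    probabilities = np.asarray(model.predict(X_test)).ravel()
    predictions = (probabilities > threshold).astype("int32")
    return history, predictions, training_time
```

A call could then look like `train_and_predict(model, X_train, y_train, X_test, epochs=50, batch_size=32, callbacks=[es_cb])`, which makes it easy to try different settings without editing the function body.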

In [12]:
train_network(model, "neural-network")

wandb: Wandb version 0.8.28 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


Train on 8844 samples
Epoch 1/128
...
Epoch 76/128

We already have a training accuracy of 97%. Can we push this even further? 

**TDLHBA** is a technique introduced [in this paper](https://dl.acm.org/citation.cfm?id=3227655). We will use the hyperparameter values presented in the paper to see whether they improve the model's performance.

In [13]:
# Building the model with the same topology as specified in the above-mentioned paper

model_TDLHBA = Sequential()

model_TDLHBA.add(Dense(40, activation="relu",
          kernel_initializer="uniform", input_dim=X_train.shape[1]))
model_TDLHBA.add(Dense(30, activation="relu",
          kernel_initializer="uniform"))
model_TDLHBA.add(Dense(1, activation="sigmoid"))

# `lr` is a deprecated alias; `learning_rate` is the supported argument name
adam = Adam(learning_rate=0.0017470)
model_TDLHBA.compile(loss="binary_crossentropy", optimizer=adam, metrics=["accuracy"])

In [14]:
wandb.init(project="phishing-websites-detection", name="neural-network-tdlhba")


W&B Run: https://app.wandb.ai/datagodzilla/phishing-websites-detection/runs/rvoco64x

In [15]:
# Train the model and record the wall-clock training time
start = time.time()
history_TDLHBA = model_TDLHBA.fit(X_train, y_train, batch_size=10, epochs=100, 
                                  verbose=1, callbacks=[es_cb, WandbCallback()])
training_time = time.time() - start

# predict_classes() was removed in TensorFlow 2.6; thresholding the sigmoid
# outputs at 0.5 gives the same class labels
prediction = (model_TDLHBA.predict(X_test) > 0.5).astype("int32")

# Compute the metrics once and log them to Weights and Biases
accuracy = accuracy_score(y_test, prediction) * 100.0
precision, recall, _, _ = precision_recall_fscore_support(y_test, prediction, average="macro")
wandb.log({"accuracy": accuracy,
           "precision": precision,
           "recall": recall,
           "training_time": training_time})

# Note: the printed labels say "Logistic Regression", but the model evaluated
# here is the neural network; they are kept to match the recorded output
print("Accuracy score of the Logistic Regression classifier with default hyperparameter values {0:.2f}%"
      .format(accuracy))
print("\n")
print("----Classification report of the Logistic Regression classifier with default hyperparameter value----")
print("\n")
print(classification_report(y_test, prediction, target_names=["Phishing Websites", "Normal Websites"]))

Train on 8844 samples
Epoch 1/100
Epoch 2/100
...
Epoch 54/100
Epoch 00054: early stopping
Accuracy score of the Logistic Regression classifier with default hyperparameter values 98.26%


----Classification report of the Logistic Regression classifier with default hyperparameter value----


                   precision    recall  f1-score   support

Phishing Websites       0.98  

This is by far the best performance we have seen so far. 