# ML Refresher | Neural Network Exposure

### Practicum Overview

We'll go through the machine learning loop to (i) preprocess a real dataset and (ii) train a random forest on it to predict GBH (grievous bodily harm) incidents. After building the random forest, we'll construct a neural network to give you an understanding of how training a deep learning model works. In doing so, we'll execute each major process of the machine learning loop.

<div style="text-align: center;"> <img src = "res/model_building/ml_wheel.jpg" width="25%"/> </div>

## Hyperparameters

The hyperparameters for the random forest appear below, though our primary focus in this practicum will be the neural network.

#### Random Forest.

<ul>
  <li> <strong>n_estimators.</strong> The number of trees in the forest. </li>
  <li> <strong>criterion.</strong> The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity, and "entropy" and "log_loss", both for the Shannon information gain. </li>
  <li> <strong>max_depth.</strong> The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. </li>
  <li> <strong>min_samples_split.</strong> The minimum number of samples required to split an internal node. </li>
  <li> <strong>min_samples_leaf.</strong> The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. </li>
  <li> <strong>min_weight_fraction_leaf.</strong> The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. </li>
  <li> <strong>max_features.</strong> The number of features to consider when looking for the best split. </li>
  <li> <strong>max_leaf_nodes.</strong> Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. </li>
  <li> <strong>min_impurity_decrease.</strong> A node will be split if this split induces a decrease of the impurity greater than or equal to this value. </li>
  <li> <strong>min_impurity_split.</strong> Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. This parameter was deprecated in favor of min_impurity_decrease and removed in scikit-learn 1.0. </li>
  <li> <strong>bootstrap.</strong> Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. </li>
  <li> <strong>oob_score.</strong> Whether to use out-of-bag samples to estimate the generalization accuracy. </li>
</ul>
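
To make the `criterion` options concrete, here is a minimal sketch of the two impurity measures, written in plain Python rather than taken from scikit-learn's implementation. The function names and the toy nodes are illustrative assumptions.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the class proportions."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

# Hypothetical leaf contents: one pure node and one evenly mixed node
pure_node = [0, 0, 0, 0]
mixed_node = [0, 0, 1, 1]

print("pure :", gini_impurity(pure_node), entropy(pure_node))
print("mixed:", gini_impurity(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why a split that lowers them is considered a good split.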

# 0 | Google Colab Setup

In [25]:
import os
import shutil
import stat

In [26]:
def copy_safe(src, dst, max_len=200):
    """Copy files, skip long paths"""
    skipped = 0
    for root, dirs, files in os.walk(src):
        rel_path = os.path.relpath(root, src)
        dst_root = os.path.join(dst, rel_path) if rel_path != '.' else dst
        if len(dst_root) < max_len:
            os.makedirs(dst_root, exist_ok=True)
            for file in files:
                dst_file = os.path.join(dst_root, file)
                if len(dst_file) < max_len:
                    try: shutil.copy2(os.path.join(root, file), dst_file)
                    except: skipped += 1
                else: skipped += 1
        else: skipped += len(files)
    return skipped

In [32]:
# Setup resources if needed
setup_ran = False
if not os.path.exists('res'):
    print("Setting up resources...")
    setup_ran = True
    
    # Cleanup, clone, copy
    repo = 'deep_learning_resources'
    if os.path.exists(repo):
        shutil.rmtree(repo, onerror=lambda f,p,e: os.chmod(p, stat.S_IWRITE) or f(p))
    
    !git clone --depth=1 https://github.com/jjv31/deep_learning_resources
    
    if os.path.exists(f'{repo}/res'):
        skipped = copy_safe(f'{repo}/res', 'res')
        print(f"Setup complete! {'(' + str(skipped) + ' long filenames skipped)' if skipped else ''}")
    
    shutil.rmtree(repo, onerror=lambda f,p,e: os.chmod(p, stat.S_IWRITE) or f(p))

Setting up resources...


Cloning into 'deep_learning_resources'...
error: unable to create file res/data_mining/www.college.police.uk/text_cleaned/www.college.police.uk_app_mental-health_mental-vulnerability-and-illness.txt: Filename too long
error: unable to create file res/data_mining/www.college.police.uk/text_raw/www.college.police.uk_app_mental-health_mental-vulnerability-and-illness.txt: Filename too long
fatal: unable to checkout working tree
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'



Setup complete! 


In [None]:
# Only refresh if we just downloaded resources
if setup_ran:
    from IPython.display import Javascript, display
    import time
    
    print("Refreshing images...")
    
    # Try browser refresh + aggressive image reload
    display(Javascript(f'''
    try {{ setTimeout(() => window.location.reload(true), 2000); }} catch(e) {{}}
    
    const t = {int(time.time())};
    document.querySelectorAll('img').forEach((img, i) => {{
        if (img.src.includes('res/')) {{
            const src = img.src.split('?')[0];
            setTimeout(() => img.src = src + '?v=' + t + '_' + i, i * 50);
        }}
    }});
    '''))
    
    print("If images don't appear, press Ctrl+Shift+R to hard refresh!")
else:
    print("Resources already exist, skipping setup.")

# 1 | Imports & Preprocessing

### 1.0 | Imports & Auxiliary Functions

Just run these. No need to modify them.

In [None]:
# Install imbalanced-learn, which provides SMOTE (artificial data generator for tabular data)
%pip install imbalanced-learn

In [None]:
#load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Scikit-learn libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

#Set plot styles
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

# Get Pandas to display all rows/columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 

In [None]:
# Neural Network
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import Adam
import keras

In [None]:
# Mutes Pandas' annoying future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
def print_univariates_categorical(data):
    data = data.astype(str) # Prevents error when sorting between strings and bools
    numbers = data.value_counts().sort_index()
    percentage = numbers / numbers.sum() * 100
    percentage = percentage.round(2)
    
    for c, i in enumerate(numbers.index):
        print(f"{i}\t\t{numbers.iloc[c]}\t{percentage.iloc[c]}%")

In [None]:
#Function to facilitate evaluating our models
def print_score(clf, X, y_true):

    # Gets predicted labels
    if isinstance(clf, keras.models.Sequential): # If the model is a Keras neural network
        y_pred = (clf.predict(X) >= 0.5).astype(int) 
    else: # Normal scikit-learn model
        y_pred = clf.predict(X)

    # Gets key performance indicators
    accuracy = round(accuracy_score(y_true, y_pred), 4)
    recall = round(recall_score(y_true, y_pred), 4)
    precision = round(precision_score(y_true, y_pred), 4)
    f1 = round(f1_score(y_true, y_pred), 4)

    # Displays them
    print(f"F1 = {f1:.4f} | Recall = {recall* 100:.2f}% | Precision = {precision*100:.2f}%")

In [None]:
# Let's visualize the confusion matrix via seaborn
def display_confusion_matrix(clf, X_test, y_true):

    # Gets predicted labels
    if isinstance(clf, keras.models.Sequential): # If the model is a Keras neural network
        y_hat = (clf.predict(X_test) >= 0.5).astype(int)
    else: # Normal scikit-learn model
        y_hat = clf.predict(X_test)

    # Uses the function's own arguments rather than the global model / y_test
    mat = confusion_matrix(y_true, y_hat)
    labels = ['No GBH', 'GBH']
 
    sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False, cmap='Blues',
            xticklabels=labels, yticklabels=labels)
 
    plt.xlabel('Predicted label')
    plt.ylabel('True label')

In [None]:
# Plots neural network training
def plot_performance(training_values, validation_values, metric_name = "Recall"):

    epochs = range(1, len(training_values) + 1)
    
    sns.set() 
    plt.plot(epochs, training_values, '-', label=f'Training {metric_name}')
    plt.plot(epochs, validation_values, ':', label=f'Validation {metric_name}')

    plt.title(f'Training and Validation {metric_name}')
    plt.xlabel('Epoch')
    plt.ylabel(metric_name)
    plt.legend(loc='lower right')
    plt.show()

### 1.1 | Import & Explore Data

In [None]:
# Import the data
df = pd.read_csv("res/model_building/domestic_gbh_v2.csv")
df.head(3)

In [None]:
# Display frequency
sns.countplot(x='GBH_12m', data=df)

The data appear wildly imbalanced. Let's inspect more closely.

In [None]:
#Display frequency table
print("GBH in 12 months?")
print_univariates_categorical(df["GBH_12m"])

Over 90% of individuals will not commit an act of GBH in 12 months. We'll explore this later.

### 1.2 | Preprocessing

In [None]:
# Checks whether there are any NAs (total count of missing values must be zero)
assert df.isnull().sum().sum() == 0
print("Congratulations. There are no NAs in your dataset.")

In [None]:
df.dtypes

In [None]:
# Recode features so they can be processed by a neural network.
# Don't worry about this. The preprocessing was handled for you.

df['GBH_12m'] = df['GBH_12m'].replace(('No', 'Yes'), (0, 1))
df['Sex'] = df['Sex'].replace(('M', 'F'), (0, 1))
df['EthnicAppearance_cleaned'] = df['EthnicAppearance_cleaned'].replace(('Afro-Caribbean', 'Arab', 'Asian', 'Black',
       'Chinese, Japanese or SE Asian', 'Middle Eastern',
       'North European - White', 'South European - White', 'Unknown',
       'White European'), (0,1,2,3,4,5,6,7,8,9))
df['InitialRisk'] = df['InitialRisk'].replace(('H', 'S', 'M', 'Unknown'), (0, 1,2,3) )
df['AccHowKnown'] = df['AccHowKnown'].replace(('Ex Boyfriend of victim', 'Boyfriend of victim',
       'Husband of victim', 'Ex Girlfriend of victim',
       'Girlfriend of victim', 'Ex Wife of victim', 'Wife of victim',
       'Ex Husband of victim', 'Same Sex Ex intimate Partner',
       'Ex Common Law Husband of victim', 'Common Law Husband of victim',
       'Common Law Wife of victim', 'Civil Partner Same Sex',
       'Ex Common Law Wife of victim', 'Same Sex Intimate Partner',
       'Ex Civil Partner Same Sex '), (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) )
df['Known 1'] = df['Known 1'].replace(('Boy/Girlfriend', 'Partner/Spouse'), (0, 1))
df['Known 2'] = df['Known 2'].replace(('Ex', 'Current'), (0, 1))
df['ProceedingsType'] = df['ProceedingsType'].replace(('Charge/further charge', 'Second time charged', 'Adult caution',
       'First time charged', 'Youth Conditional Caution',
       'Third and subsequent time charged', 'Summons',
       'Postal Charge Requisition', 'Conditional caution', 'TIC',
       'Youth Caution'), (0,1,2,3,4,5,6,7,8,9,10))
df['Crimes in pre 1'] = df['Crimes in pre 1'].replace(('No', 'Yes'), (0, 1))
df['DV in Pre 1'] = df['DV in Pre 1'].replace(('No', 'Yes'), (0, 1))

In [None]:
# Recode numeric columns (each column is converted from itself)
df['Age'] = pd.to_numeric(df['Age'], downcast="float")
df['Crimes in pre'] = pd.to_numeric(df['Crimes in pre'], downcast="float")
df['DV in Pre'] = pd.to_numeric(df['DV in Pre'], downcast="float")

df.info()

In [None]:
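# Makes the columns more intuitive.
# Caution: 'Sex' was recoded above as M = 0, F = 1, so a value of 1 in 'Offender_isMale' denotes a female offender.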
df.rename(columns={"Sex": 'Offender_isMale',
                  "EthnicAppearance_cleaned" : "Offender_ethnicAppearance",
                  "Age" : "Offender_Age",
                  "Crimes in pre" : "Previous_Crimes",
                  "DV in Pre" : "Previous_DV",
                  "InitialRisk" : "DASH_Assessment",}, inplace=True)

In [None]:
# Drops each person's ID
df.drop(["ID"], axis="columns", inplace=True)

In [None]:
# Displays preprocessed dataframe
df.head(3)

# 2 | Split Data into Training and Testing Sets

### 2.0 | Section Logic

We'll need to split our dataset into (i) a training set and (ii) a testing set. For both the training and testing sets, we'll need to split the features (X) from the label (y). We'll do that in this section.

<div style="text-align: center;"> <img src = "res/model_building/data_split_illustration.jpg" width="50%"/> </div>

### 2.1 | Split Dataset into Features (X) and Label (y)

In [None]:
# Separate the output (y) from the inputs (X). The output is what we're hoping to predict.
# In machine learning lingo, the input variable should be named X (capital x) and the output variable should be named y (lowercase y)
X = df.drop(['GBH_12m'], axis=1)
y = df['GBH_12m']

In [None]:
#Displays first rows of features. Confirms (a) they do not contain the label and (b) they look OK
X.head(3)

In [None]:
#Displays first 3 outputs. Confirms we only have the label (i.e. whether an individual will commit GBH)
y.head(3)

In [None]:
# Splits into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
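
The `stratify=y` argument keeps the share of GBH cases roughly equal in the training and testing sets. The sketch below illustrates the idea in plain Python; it is not scikit-learn's implementation, and `stratified_split` and the 90/10 toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 90 + [1] * 10  # hypothetical 90/10 class imbalance
train_idx, test_idx = stratified_split(toy_labels)
test_positive_rate = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

Without stratification, a random 20% sample of a heavily imbalanced dataset could end up with very few positive cases, which would make recall and precision on the testing set unstable.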

In [None]:
#Let's examine the sizes of the training and testing sets
print(f"Training set size = {y_train.size}\nTesting set size = {y_test.size}")

# 3 | Artificially Generate Future GBH via SMOTE

### 3.1 | Dataset Inspection

In [None]:
# First, let's have a look at the labels again.
# We'll only look at the training set because that's the dataset the model trains on, and therefore the one causing the model to behave poorly.

y_train.value_counts()

Our training dataset is wildly unbalanced. This lack of balance is very dangerous for machine learning because models learn that, if they ignore the minority case, they can achieve a very high accuracy score. 
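
A small worked example shows why accuracy is misleading here. It assumes a hypothetical 95/5 class split and a "model" that always predicts the majority class; the numbers are illustrative and are not taken from the GBH dataset.

```python
# Hypothetical labels: 95 negatives (no GBH) and 5 positives (GBH)
y_true = [0] * 95 + [1] * 5
# A degenerate model that ignores the minority class entirely
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"Accuracy = {accuracy:.2f} | Recall = {recall:.2f}")
```

The model scores high on accuracy while detecting none of the positive cases, which is why the evaluation functions in this practicum report recall, precision, and F1.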

Let's fix that via SMOTE (Synthetic Minority Over-sampling Technique).

### 3.2 | Artificial Data Generation via SMOTE

Note that we only apply SMOTE to the training set. We don't want to affect the testing set, which must keep the real class distribution so that it gives an honest estimate of performance.
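
At its core, SMOTE creates a synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that single step in plain Python, not the imbalanced-learn implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # The k nearest *minority* neighbours of the chosen sample (Euclidean distance)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment between the two points
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class samples with two numeric features
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]]
synthetic = smote_like_sample(minority_points, rng=random.Random(42))
print(synthetic)
```

Because each synthetic point lies on a line segment between two real minority samples, it stays inside the region the minority class already occupies.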

In [None]:
#Apply SMOTE to the training set.

try:
    from imblearn.over_sampling import SMOTE
    
    smote_resampler = SMOTE()
    X_train, y_train = smote_resampler.fit_resample(X_train, y_train)
    print("SMOTE ran successfully.")
    
except Exception as e:
    print("Caution - you do not have SMOTE installed. Perhaps you're running an incompatible Python version?")
    print(f"Error = {e}")

    print("Not to worry. We will use dataset backups that have undergone SMOTE")
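    X_train = pd.read_csv("res/model_building/X_train_SMOTE_backup.csv")
    # squeeze("columns") turns a single-column DataFrame into a Series, matching the type SMOTE returns
    y_train = pd.read_csv("res/model_building/y_train_SMOTE_backup.csv").squeeze("columns")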


y_train.value_counts()

### 3.3 | Synthetic vs. Real Data

In [None]:
# Let's compare the real data to the synthetic ones

print("REAL DATA")
print("GBH Label (1 = will commit GBH in 12 months):")
print(y_train[:2000].head(5) )

print("Features:")
X_train[:2000].head(5) 

In [None]:
print("SYNTHETIC DATA")
print("GBH Label (1 = will commit GBH in 12 months):")
print(y_train[2000:].head(5) )

print("Features:")
X_train[2000:].head(5) 

# 4 | Fit & Optimize Random Forest

### 4.1 | Creates basic Random Forest

In [None]:
#Fits a random forest via two lines of code
original_random_forest = RandomForestClassifier(random_state=42)
original_random_forest.fit(X_train, y_train) 

In [None]:
print("Random Forest Results")
print_score(original_random_forest, X_test, y_test)

### 4.2 | Optimize Random Forest 

In [None]:
rf_hyperparams = {
    'class_weight' : ["balanced", "balanced_subsample"],
    'criterion' : ["gini", "entropy", "log_loss"],
    'n_estimators': [10, 50, 100,],
    'max_depth': [2, 4, 6, None],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_leaf': [1, 2, 5, 10, 50, 200],
}


# Init the random search
rf_cv = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42), param_distributions=rf_hyperparams,
                           n_iter=100, # NOTE - We're only doing 100 searches for the sake of time. We won't find the optimal combination
                           cv=2, verbose=2, random_state=42, n_jobs=-1)


# Runs the search
rf_cv.fit(X_train, y_train)
rf_best_params = rf_cv.best_params_
print(f"Best parameters: {rf_best_params}")
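
To see why 100 random iterations cannot be expected to find the optimal combination, it helps to count the search space. The sketch below copies the `rf_hyperparams` dictionary from the cell above and uses only the standard library; the variable names `total_combinations`, `cv_fits`, and `coverage` are illustrative.

```python
from math import prod

# Same search space as in the cell above
rf_hyperparams = {
    "class_weight": ["balanced", "balanced_subsample"],
    "criterion": ["gini", "entropy", "log_loss"],
    "n_estimators": [10, 50, 100],
    "max_depth": [2, 4, 6, None],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 2, 5, 10, 50, 200],
}

total_combinations = prod(len(values) for values in rf_hyperparams.values())
n_iter, cv = 100, 2
cv_fits = n_iter * cv                    # one fit per sampled combination per fold
coverage = n_iter / total_combinations   # share of the grid that is actually tried

print(f"{total_combinations} combinations, {cv_fits} cross-validation fits, "
      f"{coverage:.1%} of the grid explored")
```

An exhaustive `GridSearchCV` over the same space would need every combination times the number of folds, which is the time cost the random search avoids.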

In [None]:
# Refits model with the best hyperparameters found (random_state fixed for reproducibility)
optimized_random_forest = RandomForestClassifier(**rf_best_params, random_state=42)
optimized_random_forest.fit(X_train, y_train)

In [None]:
print("\nOriginal Random Forest")
print_score(original_random_forest, X_test, y_test)

print("\nOptimized Random Forest")
print_score(optimized_random_forest, X_test, y_test)

# 5 | Create Neural Network

### 5.0 | Section Overview

Our initial neural network can be visualized via the diagram below. First we'll construct and compile it (§5.1), then we'll train it (§5.2) and evaluate its training (§5.3) before getting its ultimate performance (§5.4).

<div style="text-align: center;"> <img src = "res/model_building/gbh_nn_diagram.jpg" width="50%"/> </div>

### 5.1 | Constructs Neural Network

In [None]:
number_of_features = X.shape[1]
print(f"There are {number_of_features} features to be inputted into the neural network. Thus, there should be {number_of_features} input nodes.")

In [None]:
# Create neural network
model = Sequential() 
model.add( Input( shape= (number_of_features,) ) ) 
model.add(Dense(16, activation='relu', ))
model.add(Dense(1, activation='sigmoid')) 

# Compiles & Summarizes model
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=.0001), 
              metrics=[keras.metrics.Precision(name="precision"), keras.metrics.Recall(name="recall"), ]) 
model.summary()
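
The network above has one hidden layer of 16 ReLU units and a single sigmoid output. As a rough illustration of what `model.summary()` counts and what a forward pass computes, here is a plain-Python sketch. The feature count of 12 and the random weights are assumptions for illustration only; they are not the dataset's real feature count or the trained weights.

```python
import math
import random

def dense_param_count(n_inputs, n_units):
    """A dense layer has one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input -> Dense(16, relu) -> Dense(1, sigmoid), for a single sample."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

n_features = 12  # hypothetical; the notebook reads the real value from X.shape[1]
total_params = dense_param_count(n_features, 16) + dense_param_count(16, 1)
print(f"Trainable parameters: {total_params}")

rng = random.Random(0)
hidden_weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)] for _ in range(16)]
hidden_biases = [0.0] * 16
output_weights = [rng.uniform(-0.5, 0.5) for _ in range(16)]
sample = [rng.random() for _ in range(n_features)]

probability = forward(sample, hidden_weights, hidden_biases, output_weights, 0.0)
print(f"Predicted probability of GBH: {probability:.3f} -> label {int(probability >= 0.5)}")
```

The final comparison against 0.5 mirrors the thresholding done in `print_score` to turn the sigmoid output into a 0/1 prediction.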

### 5.2 | Trains Neural Network

In [None]:
#Trains the model.
hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=128)

### 5.3 | Evaluates Training

In [None]:
loss, val_loss = hist.history["loss"], hist.history["val_loss"]
plot_performance(loss, val_loss, "Loss (Error)")
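
The loss plotted here is binary cross-entropy, the quantity the model was compiled to minimize. The sketch below computes it by hand in plain Python. The clipping constant `eps` is an assumption that mimics the small epsilon numerical libraries use to avoid `log(0)`, and the example probabilities are made up.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0]
confident_right = binary_crossentropy(labels, [0.9, 0.1])
uncertain = binary_crossentropy(labels, [0.5, 0.5])
confident_wrong = binary_crossentropy(labels, [0.1, 0.9])

print(f"{confident_right:.3f} < {uncertain:.3f} < {confident_wrong:.3f}")
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily, so a falling loss curve means the predicted probabilities are moving toward the true labels.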

In [None]:
prec, val_prec = hist.history["precision"], hist.history["val_precision"]
plot_performance(prec, val_prec, "Precision")

In [None]:
recall, val_recall = hist.history["recall"], hist.history["val_recall"]
plot_performance(recall, val_recall)

### 5.4 | Evaluates Model

In [None]:
print_score(model, X_test, y_test)

In [None]:
display_confusion_matrix(model, X_test, y_test)
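
For reading the heatmap, it may help to see how the four cells are counted. This is a minimal plain-Python sketch with made-up labels, following the same layout as the plot (rows are true labels, columns are predicted labels); `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for binary labels: matrix[true_label][predicted_label]."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels: 0 = No GBH, 1 = GBH
toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 0]

matrix = confusion_counts(toy_true, toy_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The bottom-left cell (true GBH predicted as No GBH) holds the false negatives, which are the missed cases that recall measures.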

# 6 | Improve the Neural Network

### 6.0 | Section Overview

This section follows the same logic as the previous, except your task is to improve the ultimate performance of the neural network. This can involve any of the following:

<ul>
  <li> Add or remove neurons in the hidden layer. </li>
  <li> Add more hidden layer(s). </li>
  <li> Adjust the activation function. </li>
  <li> Adjust the learning rate. </li>
</ul>
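
To build intuition for the learning-rate option, the sketch below runs plain gradient descent on a toy one-parameter loss, (w - 3)^2. This is a deliberate simplification: the notebook uses the Adam optimizer, which adapts its step sizes, so the specific rates here are illustrative assumptions and not recommendations.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimize the toy loss (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient
    return w

for rate in (0.001, 0.1, 1.1):
    final_w = gradient_descent(rate)
    print(f"learning_rate={rate}: w = {final_w:.4f}, distance from optimum = {abs(final_w - 3.0):.4f}")
```

A rate that is too small barely moves in the available steps, a moderate rate converges, and a rate that is too large overshoots and diverges. When adjusting the learning rate in §6.1, the loss curve in §6.3 is the place to look for these patterns.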

### 6.1 | Compile Neural Network

In [None]:
# Create neural network. ADJUST THIS CODE so your neural network improves upon the existing one.
your_neural_network = Sequential() 
your_neural_network.add( Input( shape= (number_of_features,) ) ) 
your_neural_network.add(Dense(16, activation='relu', ))
your_neural_network.add(Dense(1, activation='sigmoid')) 

# Compiles & Summarizes model
your_neural_network.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=.0001), 
              metrics=[keras.metrics.Precision(name="precision"), keras.metrics.Recall(name="recall"), ]) 
your_neural_network.summary()

### 6.2 | Train Neural Network

In [None]:
#Trains the model.
your_hist = your_neural_network.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=128)

### 6.3 | Evaluate Your Neural Network's Training

In [None]:
# Plots loss
loss, val_loss = your_hist.history["loss"], your_hist.history["val_loss"]
plot_performance(loss, val_loss, "Loss (Error)")

In [None]:
# Precision
prec, val_prec = your_hist.history["precision"], your_hist.history["val_precision"]
plot_performance(prec, val_prec, "Precision")

In [None]:
recall, val_recall = your_hist.history["recall"], your_hist.history["val_recall"]
plot_performance(recall, val_recall)

### 6.4 | Evaluates Your Neural Network on Test Set

In [None]:
print_score(your_neural_network, X_test, y_test)

Does your F1 score exceed the previous neural network's F1 score?
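
As a reminder of what is being compared, F1 is the harmonic mean of precision and recall. The following plain-Python sketch derives all three from confusion-matrix counts; the counts are hypothetical and `precision_recall_f1` is an illustrative helper, not part of scikit-learn.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: many false alarms, few missed cases
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=10)
print(f"Precision = {precision:.2f} | Recall = {recall:.2f} | F1 = {f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing recall alone while precision stays low.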

In [None]:
display_confusion_matrix(your_neural_network, X_test, y_test)