# Practical 5: Neural Network

### In this practical
1. [Resuming from practical 4](#resume)
2. [Building your first neural network model](#build)
3. [Understanding your neural network model](#viz)
4. [Finding optimal hyperparameters with GridSearchCV](#gridsearch)
5. [Feature selection](#fselect)
6. [Comparing models](#comparison)

---

### Important Changelog:
* (25/07/2017) Made tutorial notes public.

This practical introduces neural network mining in Python. As in previous practicals, our objective is to build a neural network to classify lapsing donors based on their responses to the greeting card mailing campaign conducted by the national veterans' organisation. We will continue using the **PVA97NK** dataset to predict **TARGETB**.

With its exotic-sounding name, a neural network model is often regarded as a mysterious yet powerful predictive tool. Perhaps surprisingly, the most typical form of neural network is in fact a natural extension of the regression model. This form of neural network is called the **multilayer perceptron**, which is the subject of our practical today.

Whereas the strength of regression models lies in making decisions on data with linear relationships, the strength of multilayer perceptrons is their ability to go beyond linear relationships and model non-linear relationships in data.

Multilayer perceptron models were originally inspired by the structure and interconnections of neurons in the brain. They are often represented using a network diagram instead of an equation. The basic model form arranges neurons in layers. The first layer, called the **input layer**, connects to one or more **hidden layers**, which in turn connect to the final layer, called the **target/output layer**. The connections between each pair of layers correspond to a set of weights which, as in a regression model, are optimised during the training process.
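
To make the layer structure concrete, here is a minimal sketch of a forward pass through a network with one hidden layer, written with NumPy. The layer sizes, the random weights and the choice of ReLU and sigmoid activations are assumptions made for illustration only; in this practical, `MLPClassifier` learns the weights for us.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 4, 3  # hypothetical layer sizes

# One weight matrix and one bias vector per connection between layers.
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Propagate a batch of rows from the input layer to the output layer."""
    hidden = np.maximum(0, X @ W1 + b1)   # hidden layer with ReLU activation
    logits = hidden @ W2 + b2             # weighted sum in the output neuron
    return 1 / (1 + np.exp(-logits))      # sigmoid squashes it to a probability

X_demo = rng.normal(size=(5, n_inputs))   # 5 made-up input rows
probs = forward(X_demo)
print(probs.shape)
```

If the hidden layer is removed, what remains is a single weighted sum passed through a sigmoid, which is the form of a logistic regression. This is the sense in which the multilayer perceptron extends regression.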

At the end of this practical, we will have built a number of predictive models. In practice, given a new dataset, data science professionals will build and experiment with many different models. Thus, it is important to understand how to compare these models and choose the best one. The second part of this practical guides you through assessing all of the models we have built so far: decision trees, logistic regressions and neural networks.

In domains such as finance and health, the performance of a predictive model is crucial. To achieve even better performance, multiple models can be combined so that together they outperform any individual model. This approach is called **ensemble modeling** and it will be covered in the last part of this practical.

**These tutorial notes are an experimental version. Please give us feedback and suggestions on how to make them better. Ask your tutor if you have any questions or need clarification.**

## 1. Resuming from practical 4<a name="resume"></a>
As in practicals 3 and 4, we will again reuse the code for data preprocessing. Just like regression models, neural networks are sensitive to features on different scales, so we will also standardise the dataset:

In [1]:
# libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV
from dm_tools import data_prep
from sklearn.preprocessing import StandardScaler

# preprocessing step
df = data_prep()

# train test split
y = df['TargetB']
X = df.drop(['TargetB'], axis=1)
X_mat = X.to_numpy()  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.33, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train, y_train)
X_test = scaler.transform(X_test)
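
As a rough illustration of what `StandardScaler` is doing above, the sketch below standardises a small made-up array by hand with NumPy. The numbers are invented. The point is that the mean and standard deviation are estimated from the training rows only and then reused on the test rows.

```python
import numpy as np

# Made-up data: two features on very different scales.
train = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])
test = np.array([[2.0, 4000.0]])

# Statistics are estimated on the training rows only...
mean = train.mean(axis=0)
std = train.std(axis=0)

# ...and applied to both sets, mirroring fit_transform / transform.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Reusing the training statistics on the test rows keeps information about the test set out of the preprocessing step.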

## 2. Building your first neural network model<a name="build"></a>

Start by importing the neural network from the library. In `sklearn`, the neural network classifier is implemented as `MLPClassifier`, short for multilayer perceptron classifier.

In [2]:
from sklearn.neural_network import MLPClassifier

Let's train our first `MLPClassifier`. Instantiate the model without any additional parameters, fit it to the training data and test its performance on the test data.

In [3]:
model = MLPClassifier()
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

print(model)

Train accuracy: 0.880104792726
Test accuracy: 0.534563653425
             precision    recall  f1-score   support

          0       0.53      0.53      0.53      1599
          1       0.53      0.54      0.54      1598

avg / total       0.53      0.53      0.53      3197

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)


This default neural network performed only slightly better than chance, with an accuracy of about 0.535 on the test data. As with the first decision tree that we trained in practical 3, you should notice that the training accuracy (about 0.880) is much higher than the test accuracy. This indicates that the model overfits the training data, which we will address through GridSearchCV tuning.

In addition to the output messages, the neural network raises a "convergence is not reached" warning. As in regression, each neuron in a neural network computes its output by applying an activation function to a weighted sum of its inputs. These weights are learned through an iterative training process based on "backpropagation". In `MLPClassifier`, training runs until it converges or reaches the maximum number of iterations set by the `max_iter` parameter (default 200). If the latter occurs, the warning is raised, which suggests that the `max_iter` value should be increased.
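
The sketch below shows one way to check programmatically whether training stopped because it hit `max_iter`. It uses synthetic data from `make_classification` as a stand-in for our dataset, and a deliberately tiny `max_iter` so that the warning is triggered; both are assumptions made for this illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, not the PVA97NK dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately tiny iteration budget so that training stops early.
model_demo = MLPClassifier(max_iter=5, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model_demo.fit(X_demo, y_demo)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("Iterations run:", model_demo.n_iter_)
print("Stopped at max_iter without converging:", hit_limit)
```

The fitted model's `n_iter_` attribute reports how many iterations were actually run, so comparing it with `max_iter` is a quick way to see whether the budget was exhausted.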

Try to increase the `max_iter` to 500 from the original 200.

In [4]:
model = MLPClassifier(max_iter=500)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

print(model)

Train accuracy: 0.842810910772
Test accuracy: 0.537065999374
             precision    recall  f1-score   support

          0       0.54      0.56      0.55      1599
          1       0.54      0.51      0.53      1598

avg / total       0.54      0.54      0.54      3197

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)


With 500 max iterations, training managed to reach convergence. You can set the maximum number of iterations higher to make convergence more likely, at the cost of a slower training process.

## 4. Finding optimal hyperparameters with GridSearchCV<a name="gridsearch"></a>

Now that we have trained our first neural network, we will find the optimal hyperparameters using GridSearchCV. Neural networks are harder to tune than decision trees or regression models because they have relatively many parameters and train slowly. In this practical, we will focus on tuning two parameters:
1. `hidden_layer_sizes`: A tuple in which the i-th element is the number of neurons in the i-th hidden layer.
2. `alpha`: The strength of the L2 regularisation penalty applied to the network's weights.
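
As a small sketch of how `hidden_layer_sizes` shapes the network, the example below fits an `MLPClassifier` with two hidden layers on synthetic data and prints the shape of each learned weight matrix. The data, the layer sizes `(7, 3)` and the short `max_iter` are assumptions chosen only to keep the example quick.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Two hidden layers: 7 neurons in the first, 3 in the second.
model_demo = MLPClassifier(hidden_layer_sizes=(7, 3), alpha=0.001,
                           max_iter=50, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a short run may not converge; irrelevant here
    model_demo.fit(X_demo, y_demo)

# One weight matrix per pair of adjacent layers: input->7, 7->3, 3->output.
shapes = [w.shape for w in model_demo.coefs_]
print(shapes)
```

With 10 input features and a binary target, this should give three matrices of shapes `(10, 7)`, `(7, 3)` and `(3, 1)`. The `alpha` penalty acts on the values inside these matrices, discouraging large weights.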

Start by tuning the hidden layer sizes. There is no official guideline on how many neurons each layer should have, but for most data mining tasks a reasonable starting point is a single hidden layer with no more neurons than the number of input variables and no fewer than the number of output neurons (in this case, 1).

> #### Deep Learning
> You might have heard of deep learning, which is the process of building very complex neural networks (up to hundreds of layers and thousands of neurons, hence **deep**). Deep neural networks are typically used for complex tasks such as image recognition, Siri-like voice assistants, machine translation and self-driving cars.

Let's see how many input features we have.

In [5]:
print(X_train.shape)

(6489, 85)


With 85 features, we will start tuning with one hidden layer of 5 to 80 neurons, in increments of 25 (the `range` call below yields 5, 30, 55 and 80). This is going to be slow.

In [6]:
params = {'hidden_layer_sizes': [(x,) for x in range(5, 86, 25)]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=500), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

Train accuracy: 0.629989212513
Test accuracy: 0.551767281827
             precision    recall  f1-score   support

          0       0.55      0.56      0.55      1599
          1       0.55      0.55      0.55      1598

avg / total       0.55      0.55      0.55      3197

{'hidden_layer_sizes': (5,)}


This GridSearchCV returns 5 as the optimal number of neurons in the hidden layer. For this dataset, it seems that more complex models (more neurons in the hidden layer) tend to overfit. Based on this, we should tune around the lower numbers of neurons.
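
One way to check the overfitting hunch is to look at the per-candidate scores that GridSearchCV stores in `cv_results_`. The sketch below does this on synthetic data with a small hypothetical grid. Setting `return_train_score=True` is an addition not used in the cells above; it asks GridSearchCV to record training scores alongside validation scores.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

params_demo = {'hidden_layer_sizes': [(2,), (5,), (20,)]}
cv_demo = GridSearchCV(param_grid=params_demo,
                       estimator=MLPClassifier(max_iter=200, random_state=0),
                       cv=3, return_train_score=True)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence possible convergence warnings
    cv_demo.fit(X_demo, y_demo)

# Compare mean training and validation accuracy for every candidate.
results = cv_demo.cv_results_
for p, train_score, val_score in zip(results['params'],
                                     results['mean_train_score'],
                                     results['mean_test_score']):
    print(p, "train: %.3f" % train_score, "validation: %.3f" % val_score)
```

A candidate whose training score keeps rising while its validation score stalls or drops is a sign of overfitting.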

In [7]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=500), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

Train accuracy: 0.605486207428
Test accuracy: 0.559587112918
             precision    recall  f1-score   support

          0       0.55      0.61      0.58      1599
          1       0.57      0.51      0.54      1598

avg / total       0.56      0.56      0.56      3197

{'hidden_layer_sizes': (3,)}


We now have the optimal value for the neuron count. Next, we will tune the second parameter, which is `alpha`. The default value for `alpha` is `0.0001`, thus we will try `alpha` values around this default value.

In [8]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=500), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

Train accuracy: 0.597472646016
Test accuracy: 0.564279011573
             precision    recall  f1-score   support

          0       0.55      0.68      0.61      1599
          1       0.58      0.44      0.50      1598

avg / total       0.57      0.56      0.56      3197

{'alpha': 1e-05, 'hidden_layer_sizes': (3,)}


## 5. Dimensionality reduction<a name="fselect"></a>

Now, let's try to reduce the size of our feature set and see whether it improves the performance of the model. We will use the same techniques as covered last week.

### 5.1. Recursive Feature Elimination

Firstly, reduce the feature set size using RFE. RFE needs a base elimination model that assigns a weight or feature importance to each feature (like a regression or a decision tree). Unfortunately, neural networks provide neither, so we will use `LogisticRegression` as the base elimination model.

In [9]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfe = RFECV(estimator = LogisticRegression(), cv=10)
rfe.fit(X_train, y_train)

print(rfe.n_features_)

9


RFE has selected 9 features as the best set of features. Next, tune an `MLPClassifier` with the transformed data set as training data.

In [10]:
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# step = int((X_train_rfe.shape[1] + 5)/5);
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=1000), cv=10, n_jobs=-1)
cv.fit(X_train_rfe, y_train)

print("Train accuracy:", cv.score(X_train_rfe, y_train))
print("Test accuracy:", cv.score(X_test_rfe, y_test))

y_pred = cv.predict(X_test_rfe)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

Train accuracy: 0.584835876098
Test accuracy: 0.578041914295
             precision    recall  f1-score   support

          0       0.57      0.66      0.61      1599
          1       0.59      0.49      0.54      1598

avg / total       0.58      0.58      0.58      3197

{'alpha': 0.001, 'hidden_layer_sizes': (7,)}


The RFE-selected feature set improved on the original data set: test accuracy rose from about 0.564 to about 0.578. We also managed to bring train and test accuracy closer together, producing a model that generalises better.
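
A fitted `RFECV` object also records which columns it kept. The sketch below shows how to read them out, using synthetic data and hypothetical column names (`feat_0`, `feat_1`, ...) in place of our dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)
names_demo = np.array(['feat_%d' % i for i in range(X_demo.shape[1])])

rfe_demo = RFECV(estimator=LogisticRegression(), cv=5)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask over the original columns;
# ranking_ is 1 for every column that was kept.
kept = names_demo[rfe_demo.support_]
print("Kept features:", list(kept))
print("Ranking:", rfe_demo.ranking_)
```

In this practical, indexing `X.columns` with `rfe.support_` should give the names of the selected features in the same way, assuming `X` is still the DataFrame created in the preprocessing cell.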

As mentioned before, we could also use decision trees in RFE. Let's try to do it with `DecisionTreeClassifier`.

In [11]:
from sklearn.tree import DecisionTreeClassifier

rfe = RFECV(estimator = DecisionTreeClassifier(), cv=10)
rfe.fit(X_train, y_train)

print(rfe.n_features_)

56


In [12]:
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# step = int((X_train_rfe.shape[1] + 5)/5);
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=1000), cv=10, n_jobs=-1)
cv.fit(X_train_rfe, y_train)

print("Train accuracy:", cv.score(X_train_rfe, y_train))
print("Test accuracy:", cv.score(X_test_rfe, y_test))

y_pred = cv.predict(X_test_rfe)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

Train accuracy: 0.619201725998
Test accuracy: 0.558335939944
             precision    recall  f1-score   support

          0       0.55      0.60      0.57      1599
          1       0.56      0.52      0.54      1598

avg / total       0.56      0.56      0.56      3197

{'alpha': 1e-05, 'hidden_layer_sizes': (5,)}


The decision tree RFE selects 56 features, and the neural network tuned on them reaches a test accuracy of about 0.558, slightly lower than with the `LogisticRegression` RFE. In practice, both approaches are valid and you should try both when faced with a new data set.

### 5.2. Principal Component Analysis

The second feature reduction technique that we will try is principal component analysis, or PCA. As mentioned last week, PCA is a technique that finds underlying variables (known as principal components) that best differentiate your data points. The idea of PCA is to reduce the number of features while still retaining most of the variance in the feature set. As we did last week, we will set our variance threshold at 95%.

In [13]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)

sum_var = 0
for idx, val in enumerate(pca.explained_variance_ratio_):
    sum_var += val
    if (sum_var >= 0.95):
        print("N components with > 95% variance =", idx+1)
        break

N components with > 95% variance = 66
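
The loop above can also be written with a cumulative sum, and scikit-learn's `PCA` accepts a fraction between 0 and 1 as `n_components`, in which case it picks the number of components needed to reach that share of variance. The sketch below compares the two on synthetic data; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(n_samples=300, n_features=20, random_state=0)

# Cumulative share of variance explained by the first k components.
pca_full = PCA().fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cum_var >= 0.95)) + 1

# A float n_components asks PCA to pick the number of components itself.
pca_95 = PCA(n_components=0.95).fit(X_demo)

print(n_needed, pca_95.n_components_)
```

Both numbers should agree, so passing `n_components=0.95` is a shorter route to the same transformed dataset.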


66 components with cumulative > 95% variance are selected. Train and test on this transformed dataset.

In [14]:
pca = PCA(n_components=66)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=1000), cv=10, n_jobs=-1)
cv.fit(X_train_pca, y_train)

print("Train accuracy:", cv.score(X_train_pca, y_train))
print("Test accuracy:", cv.score(X_test_pca, y_test))

y_pred = cv.predict(X_test_pca)
print(classification_report(y_test, y_pred))

print(cv.best_params_)


The result shows improved performance over the original feature set. We also managed to reduce the feature set size to only 66, which shortens the training process. However, compared to the RFE-selected feature set, PCA still produces slightly worse performance.

### 5.3. Selecting using decision tree

Lastly, we will use a decision tree and the feature importances produced by the model to perform feature selection. To start, we need to tune a decision tree with GridSearchCV.

In [None]:
from sklearn.tree import DecisionTreeClassifier

params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(3, 10),
          'min_samples_leaf': range(20, 200, 20)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(), cv=10)
cv.fit(X_train, y_train)

In [None]:
from dm_tools import analyse_feature_importance

analyse_feature_importance(cv.best_estimator_, X.columns)

In [None]:
from sklearn.feature_selection import SelectFromModel

selectmodel = SelectFromModel(cv.best_estimator_, prefit=True)
X_train_sel_model = selectmodel.transform(X_train)
X_test_sel_model = selectmodel.transform(X_test)

print(X_train_sel_model.shape)
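
`SelectFromModel` is called above without an explicit threshold. For a tree-based estimator, scikit-learn then keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; the data and the tree settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=0)

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
importances = tree_demo.feature_importances_

# With no explicit threshold, features whose importance is at least
# the mean importance are kept.
selector_demo = SelectFromModel(tree_demo, prefit=True)
mask = selector_demo.get_support()

print("Mean importance:", importances.mean())
print("Kept columns:", np.flatnonzero(mask))
```

If the default keeps too many or too few columns, the `threshold` parameter of `SelectFromModel` can be set to a number or to a string such as `'median'`.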

Our `analyse_feature_importance` function shows around 7 important features according to this decision tree model. Based on this result, `SelectFromModel` reduces the original dataset to only 7 columns. Proceed to tune an `MLPClassifier` with this dataset.

In [None]:
params = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params, estimator=MLPClassifier(max_iter=1000), cv=10, n_jobs=-1)
cv.fit(X_train_sel_model, y_train)

print("Train accuracy:", cv.score(X_train_sel_model, y_train))
print("Test accuracy:", cv.score(X_test_sel_model, y_test))

y_pred = cv.predict(X_test_sel_model)
print(classification_report(y_test, y_pred))

print(cv.best_params_)

The test accuracy shows an improvement over both the original feature set and the RFE-selected feature set. This method yields the smallest feature set yet (only 7 rather than 85 features) and the best performance so far. We will keep this model as the best model.

## 6. Comparing Models<a name="comparison"></a>

After 5 weeks, we have learned how to perform data preprocessing, model tuning and various dimensionality reduction techniques. While we have been pretty straightforward with our decisions and process, in real-life projects you need to test many different techniques and approaches before arriving at a solution. You need to constantly expand your toolbox and knowledge to be a great data miner/data scientist/machine learning engineer.

Now, let's imagine that you have built multiple models, where each uses different data preprocessing/dimensionality reduction techniques. How do you choose the best performing model? One way to do this is by comparing statistics produced by each model. I will show you how.

Firstly, let's train and tune three models, a `DecisionTreeClassifier`, a `LogisticRegression` and an `MLPClassifier`, with GridSearchCV. We will use the original feature set with no dimensionality reduction for this demonstration, but the process would be much the same with a reduced feature set.

In [None]:
# grid search CV for decision tree
params_dt = {'criterion': ['gini'],
          'max_depth': range(2, 5),
          'min_samples_leaf': range(40, 61, 5)}

cv = GridSearchCV(param_grid=params_dt, estimator=DecisionTreeClassifier(), cv=10)
cv.fit(X_train, y_train)

dt_model = cv.best_estimator_
print(dt_model)

# grid search CV for logistic regression
params_log_reg = {'C': [pow(10, x) for x in range(-6, 4)]}

cv = GridSearchCV(param_grid=params_log_reg, estimator=LogisticRegression(), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

log_reg_model = cv.best_estimator_
print(log_reg_model)

# grid search CV for NN
params_nn = {'hidden_layer_sizes': [(3,), (5,), (7,), (9,)], 'alpha': [0.01,0.001, 0.0001, 0.00001]}

cv = GridSearchCV(param_grid=params_nn, estimator=MLPClassifier(max_iter=500), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)

nn_model = cv.best_estimator_
print(nn_model)

### 6.1. Test Accuracy

Once you have them trained, there are a number of statistics that we could use for comparing models. The first is the accuracy of the models on test data, which we have used so far.

**Note**: Accuracy is a good statistic when the target classes are roughly balanced, as they are in this dataset (50% donors vs 50% non-donors). When the classes are imbalanced (e.g. in a cancer detection task where most people in the dataset will not have cancer), metrics like precision/recall/F1 from `classification_report` or Cohen's kappa are preferred.
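
To see why accuracy can mislead on imbalanced data, consider the small made-up example below, in which a "model" always predicts the majority class. The labels are invented for illustration and have nothing to do with our dataset.

```python
# Made-up imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5

# A useless "model" that always predicts the majority class.
y_always_zero = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_always_zero))
accuracy = correct / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_zero))
recall_pos = true_pos / sum(t == 1 for t in y_true)

print("Accuracy:", accuracy)             # 0.95
print("Recall on class 1:", recall_pos)  # 0.0
```

The accuracy looks excellent even though the model never finds a single positive case, which is what recall on class 1 exposes.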

In [None]:
y_pred_dt = dt_model.predict(X_test)
y_pred_log_reg = log_reg_model.predict(X_test)
y_pred_nn = nn_model.predict(X_test)

print("Accuracy score on test for DT:", accuracy_score(y_test, y_pred_dt))
print("Accuracy score on test for logistic regression:", accuracy_score(y_test, y_pred_log_reg))
print("Accuracy score on test for NN:", accuracy_score(y_test, y_pred_nn))

On test accuracy score, logistic regression performs the best, followed by decision tree and neural network.

### 6.2. ROC AUC
Another metric commonly used to compare models is the area under the curve (AUC) of the receiver operating characteristic (ROC). The ROC describes how well a binary classifier (like the ones we have here) separates the two classes as its discrimination threshold is varied.

Most predictive classification models produce a probability for each target value given a set of inputs. `LogisticRegression` and `MLPClassifier` produce real-valued probabilities, while `DecisionTreeClassifier` uses the proportion of each class in the leaf node that a sample falls into. Most of the time, the discrimination threshold is set at 0.5, which means any predicted probability above 0.5 is considered positive (and the rest negative). For more clarity, see the code below:

In [None]:
# typical prediction
y_pred = dt_model.predict(X_test)

# probability prediction from decision tree
y_pred_proba_dt = dt_model.predict_proba(X_test)

print("Probability produced by decision tree for each class vs actual prediction on TargetB (0 = non-donor, 1 = donor). You should be able to see the default threshold of 0.5.")
print("(Probs on zero)\t(probs on one)\t(prediction made)")
# print top 10
for i in range(10):
    print(y_pred_proba_dt[i][0], '\t', y_pred_proba_dt[i][1], '\t', y_pred[i])
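
The threshold does not have to stay at 0.5. The sketch below applies a custom threshold to class-1 probabilities. The probabilities are made up and the helper `predict_with_threshold` is hypothetical, not part of `sklearn`.

```python
import numpy as np

# Made-up output in the shape predict_proba returns: one row per sample,
# column 0 = P(class 0), column 1 = P(class 1).
proba_demo = np.array([[0.80, 0.20],
                       [0.55, 0.45],
                       [0.35, 0.65],
                       [0.10, 0.90]])

def predict_with_threshold(proba, threshold):
    """Label a row 1 when its class-1 probability exceeds the threshold."""
    return (proba[:, 1] > threshold).astype(int)

print(predict_with_threshold(proba_demo, 0.5))  # [0 0 1 1]
print(predict_with_threshold(proba_demo, 0.4))  # [0 1 1 1]
```

Lowering the threshold turns more samples into predicted positives, which raises both the true positive rate and the false positive rate.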

With this concept in mind, the ROC AUC score summarises how well a model performs across all possible thresholds, rather than at 0.5 only. To compute our ROC AUC score, use the code below.

In [None]:
from sklearn.metrics import roc_auc_score

y_pred_proba_dt = dt_model.predict_proba(X_test)
y_pred_proba_log_reg = log_reg_model.predict_proba(X_test)
y_pred_proba_nn = nn_model.predict_proba(X_test)

roc_index_dt = roc_auc_score(y_test, y_pred_proba_dt[:, 1])
roc_index_log_reg = roc_auc_score(y_test, y_pred_proba_log_reg[:, 1])
roc_index_nn = roc_auc_score(y_test, y_pred_proba_nn[:, 1])

print("ROC index on test for DT:", roc_index_dt)
print("ROC index on test for logistic regression:", roc_index_log_reg)
print("ROC index on test for NN:", roc_index_nn)

`LogisticRegression` produces the best ROC AUC score. This means that, across varied discrimination thresholds, this logistic regression model performs better than the other two models.
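
A useful way to read the ROC AUC score: it equals the proportion of (positive, negative) pairs in which the positive sample receives the higher predicted probability. The sketch below computes it this way in plain Python on made-up labels and scores. The function `pairwise_auc` is a hypothetical helper written for illustration; for binary labels its result should match `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for class 1.
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_demo, scores_demo))  # 0.75
```

A score of 0.5 therefore corresponds to random ranking, and 1.0 means every positive is ranked above every negative.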

The ROC AUC score only tells one side of the story, however. Typically, instead of relying on the score alone, we plot a curve to show the performance of the model at different threshold values. The curve should look something like this, and the closer the curve is to the top left corner, the better the model is.

![ROC Curve](http://gim.unmc.edu/dxtests/roccomp.jpg)

Let's plot the ROC curve for our models. Firstly, we need to find the false positive rate, true positive rate and thresholds used for each model. We can get them from the `roc_curve` function.

In [None]:
from sklearn.metrics import roc_curve

fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_pred_proba_dt[:,1])
fpr_log_reg, tpr_log_reg, thresholds_log_reg = roc_curve(y_test, y_pred_proba_log_reg[:,1])
fpr_nn, tpr_nn, thresholds_nn = roc_curve(y_test, y_pred_proba_nn[:,1])
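
To show what each point returned by `roc_curve` represents, the sketch below computes the false positive rate and true positive rate by hand at a few thresholds. The labels, scores and the helper `rates_at_threshold` are made up for illustration.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

# Made-up labels and predicted probabilities for class 1.
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Each threshold gives one (FPR, TPR) point on the ROC curve.
for thr in (0.9, 0.5, 0.3, 0.0):
    print(thr, rates_at_threshold(y_demo, scores_demo, thr))
```

As the threshold drops from 0.9 to 0.0, the point moves from (0, 0) to (1, 1); the ROC curve is the path traced through these points.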

Once we have these values, plot them using `matplotlib`'s pyplot.

In [None]:
import matplotlib.pyplot as plt

plt.plot(fpr_dt, tpr_dt, label='ROC Curve for DT {:.3f}'.format(roc_index_dt), color='red', lw=0.5)
plt.plot(fpr_log_reg, tpr_log_reg, label='ROC Curve for Log reg {:.3f}'.format(roc_index_log_reg), color='green', lw=0.5)
plt.plot(fpr_nn, tpr_nn, label='ROC Curve for NN {:.3f}'.format(roc_index_nn), color='darkorange', lw=0.5)

# diagonal reference line: the ROC curve of a random classifier
plt.plot([0, 1], [0, 1], color='navy', lw=0.5, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

Here, you can see the curves for the different models. `LogisticRegression` again has the largest area under the curve of the three models. Thus, the test accuracy, the ROC AUC score and the ROC curve all agree on `LogisticRegression` being the best performing model overall.

While statistics are vital, in a real project, predictive performance is not always the only priority. Some of the other aspects to consider when choosing the best model are:
1. Interpretability: how well humans can use the model to make decisions. Decision trees (and regressions to some extent) excel at this, while neural networks do not.
2. Speed: how quickly the model can train and predict on large amounts of data. Again, decision trees and regressions are relatively fast, while neural networks take a while to train.
3. Adaptability: in some cases, you want your model to slowly adapt to trends in the data. Neural networks are great for this as they can be trained using "online training", while decision trees are not.
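
As a sketch of the online training mentioned in the last point, `MLPClassifier` provides a `partial_fit` method that updates an existing model with a new batch of data. The example below feeds synthetic data in three chunks; the data, the chunking and the layer size are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

model_online = MLPClassifier(hidden_layer_sizes=(5,), random_state=0)

# Feed the data in three chunks, as if each chunk arrived at a later time.
# The full list of classes must be given on the first call to partial_fit.
for X_chunk, y_chunk in zip(np.array_split(X_demo, 3), np.array_split(y_demo, 3)):
    model_online.partial_fit(X_chunk, y_chunk, classes=[0, 1])

print("Accuracy after three updates:", model_online.score(X_demo, y_demo))
```

Each call performs a single update pass over the chunk it is given, so in practice many such calls are needed before the model becomes accurate.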

## End Notes and Next Week

This week, we learned how to build, tune and explore the structure of neural network models. We also explored dimensionality reduction techniques to reduce the size of the feature set and improve performance of our neural network model. In addition, we tried numerous statistics to compare end-to-end performance of all models we have built so far.

Next week, we will have a drop-in help session where you focus on the assignment and ask your tutor questions regarding it.