# Ramia_Assignment6

**Introduction:** This assignment fits neural networks to the MNIST data and compares alternative network structures. The structures are evaluated in a benchmark experiment: a factorial design with two levels on each of two experimental factors, the number of hidden layers and the number of nodes per layer (a 2x2 completely crossed design). We assess the classification accuracy and processing time of each model to determine which one to recommend for optical character recognition.

We find that adding more layers improves test accuracy only modestly (by about 0.5 to 1.0 percentage points), while adding more nodes per layer improves it more (by about 1.3 to 1.8 percentage points). Processing time tends to increase with added complexity, although, oddly enough, the model with two layers of 20 nodes each runs slower than the model with five layers of 20 nodes each. I therefore conclude that the model with five layers of 20 nodes each, which has the highest test accuracy, is the one that should be used.

Load the MNIST dataset and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).

In [1]:
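import numpy as np
try:
    from sklearn.datasets import fetch_openml
    # as_frame=False (scikit-learn >= 0.22) returns NumPy arrays rather than a DataFrame
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    mnist.target = mnist.target.astype(np.int64)
except ImportError:
    # fallback for scikit-learn < 0.20, where fetch_openml is not available
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')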

In [2]:
X_train = mnist['data'][:60000]
y_train = mnist['target'][:60000]

X_test = mnist['data'][60000:]
y_test = mnist['target'][60000:]

Check the type and shape of the training and test sets:

In [3]:
print('\nX_train object:', type(X_train), X_train.shape)    
print('\ny_train object:', type(y_train),  y_train.shape)  
print('\nX_test object:', type(X_test),  X_test.shape)  
print('\ny_test object:', type(y_test),  y_test.shape)  


X_train object: <class 'numpy.ndarray'> (60000, 784)

y_train object: <class 'numpy.ndarray'> (60000,)

X_test object: <class 'numpy.ndarray'> (10000, 784)

y_test object: <class 'numpy.ndarray'> (10000,)


In [4]:
import pandas as pd
import time
from sklearn.neural_network import MLPClassifier
RANDOM_SEED = 9999

Define four methods to be benchmarked in our experiment. Here we will be comparing networks of varying structures. However, there are plenty of other hyperparameters that could be compared. 
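
As a minimal sketch, the four treatment conditions of the 2x2 design can also be generated from the two factors rather than typed by hand. The names `layer_levels`, `node_levels`, `design`, `conditions`, and `labels` are illustrative and are not part of the original notebook.

```python
from itertools import product

# Two factors, two levels each (a 2x2 completely crossed design).
layer_levels = [2, 5]    # number of hidden layers
node_levels = [10, 20]   # nodes per hidden layer

# product() varies the last factor fastest: (2, 10), (2, 20), (5, 10), (5, 20).
design = list(product(layer_levels, node_levels))

# hidden_layer_sizes expects one entry per hidden layer, e.g. (10, 10).
conditions = [(n_nodes,) * n_layers for n_layers, n_nodes in design]
labels = ['ANN-%d-Layers-%d-Nodes-per-Layer' % cell for cell in design]

for label, sizes in zip(labels, conditions):
    print(label, sizes)
```

Generating the conditions this way keeps the method names, the factor levels, and the `hidden_layer_sizes` tuples in step if a level is added or changed.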

In [7]:
names = ['ANN-2-Layers-10-Nodes-per-Layer',
         'ANN-2-Layers-20-Nodes-per-Layer',
         'ANN-5-Layers-10-Nodes-per-Layer',
         'ANN-5-Layers-20-Nodes-per-Layer']

layers = [2, 2, 5, 5]
nodes_per_layer = [10, 20, 10, 20]
treatment_condition = [(10, 10), 
                       (20, 20), 
                       (10, 10, 10, 10, 10), 
                       (20, 20, 20, 20, 20)]

In [8]:
# One classifier per treatment condition; all other hyperparameters are held fixed.
methods = [
    MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, activation='relu',
              solver='adam', alpha=0.0001, batch_size='auto',
              learning_rate='constant', learning_rate_init=0.001,
              power_t=0.5, max_iter=200, shuffle=True,
              random_state=RANDOM_SEED,
              tol=0.0001, verbose=False, warm_start=False, momentum=0.9,
              nesterovs_momentum=True, early_stopping=False,
              validation_fraction=0.083333, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
    for hidden_layer_sizes in treatment_condition
]

Train and score each model iteratively:

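*Note:* Deprecated time functions such as `time.clock()` produce warnings on older Python versions, and `time.clock()` was removed entirely in Python 3.8. The warnings are not errors and will not prevent the code from running where the function still exists. To avoid them, use a recommended time function such as `time.perf_counter()` instead.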
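
To illustrate how the recommended timers differ, the sketch below times a short pause with `time.perf_counter()` (elapsed wall-clock time) and `time.process_time()` (CPU time used by the current process). The `time.sleep` call is only a stand-in for model fitting.

```python
import time

wall_start = time.perf_counter()   # high-resolution wall-clock timer
cpu_start = time.process_time()    # CPU time consumed by this process

time.sleep(0.2)                    # stand-in workload: waits without using the CPU

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print("Wall-clock seconds: %f" % wall_elapsed)
print("CPU seconds: %f" % cpu_elapsed)
```

Because sleeping uses almost no CPU, the CPU figure should be far smaller than the wall-clock figure, which is why the choice of timer matters when reporting processing time.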

In [9]:
training_performance_results = []
test_performance_results = []
processing_time = []

for name, method in zip(names, methods):
    print('\n------------------------------------')
    print('\nMethod:', name)
    print('\n  Specification of method:', method)
    # time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time
    start_time = time.perf_counter()
    method.fit(X_train, y_train)
    end_time = time.perf_counter()
    runtime = end_time - start_time  # seconds of wall-clock time
    print("\nProcessing time (seconds): %f" % runtime)
    processing_time.append(runtime)

    # mean accuracy of prediction in training set
    training_performance = method.score(X_train, y_train)
    print("\nTraining set accuracy: %f" % training_performance)
    training_performance_results.append(training_performance)

    # mean accuracy of prediction in test set
    test_performance = method.score(X_test, y_test)
    print("\nTest set accuracy: %f" % test_performance)
    test_performance_results.append(test_performance)


------------------------------------

Method: ANN-2-Layers-10-Nodes-per-Layer

  Specification of method: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=9999, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.083333, verbose=False, warm_start=False)





Processing time (seconds): 490.651651

Training set accuracy: 0.941150

Test set accuracy: 0.925000

------------------------------------

Method: ANN-2-Layers-20-Nodes-per-Layer

  Specification of method: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(20, 20), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=9999, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.083333, verbose=False, warm_start=False)





Processing time (seconds): 670.844476

Training set accuracy: 0.970850

Test set accuracy: 0.942500

------------------------------------

Method: ANN-5-Layers-10-Nodes-per-Layer

  Specification of method: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10, 10, 10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=9999, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.083333, verbose=False, warm_start=False)





Processing time (seconds): 555.676125

Training set accuracy: 0.962017

Test set accuracy: 0.934700

------------------------------------

Method: ANN-5-Layers-20-Nodes-per-Layer

  Specification of method: MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(20, 20, 20, 20, 20), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=9999, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.083333, verbose=False, warm_start=False)





Processing time (seconds): 624.406146

Training set accuracy: 0.984217

Test set accuracy: 0.947400


Aggregate the results for the final report, using OrderedDict to preserve the order of the columns in the DataFrame.
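
On Python 3.7 and later, a plain `dict` also preserves insertion order, so the same column order can be obtained without `OrderedDict`. The sketch below shows this with a reduced, two-row frame; `demo` is an illustrative name and not part of the original notebook.

```python
import pandas as pd

# Plain dicts keep insertion order on Python 3.7+, and pandas uses that
# order for the DataFrame columns.
demo = pd.DataFrame({
    'Method Name': ['ANN-2-Layers-10-Nodes-per-Layer',
                    'ANN-5-Layers-20-Nodes-per-Layer'],
    'Layers': [2, 5],
    'Nodes per Layer': [10, 20],
})

print(list(demo.columns))
```

`OrderedDict` remains a reasonable choice when the notebook must also run on older Python versions.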

In [10]:
from collections import OrderedDict

In [12]:
results = pd.DataFrame(OrderedDict([('Method Name', names),
                        ('Layers', layers),
                        ('Nodes per Layer', nodes_per_layer),
                        ('Processing Time', processing_time),
                        ('Training Set Accuracy', training_performance_results),
                        ('Test Set Accuracy', test_performance_results)]))

print('\nBenchmark Experiment: Scikit Learn Artificial Neural Networks\n')
results



Benchmark Experiment: Scikit Learn Artificial Neural Networks



|   | Method Name | Layers | Nodes per Layer | Processing Time | Training Set Accuracy | Test Set Accuracy |
|---|---|---|---|---|---|---|
| 0 | ANN-2-Layers-10-Nodes-per-Layer | 2 | 10 | 490.651651 | 0.941150 | 0.9250 |
| 1 | ANN-2-Layers-20-Nodes-per-Layer | 2 | 20 | 670.844476 | 0.970850 | 0.9425 |
| 2 | ANN-5-Layers-10-Nodes-per-Layer | 5 | 10 | 555.676125 | 0.962017 | 0.9347 |
| 3 | ANN-5-Layers-20-Nodes-per-Layer | 5 | 20 | 624.406146 | 0.984217 | 0.9474 |
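
Because the experiment is a 2x2 factorial, the effect of each factor can be summarized from the four test accuracies above. The following is a small sketch of that arithmetic using the values reported in the table; the names `acc`, `nodes_effect`, `layers_effect`, and `interaction` are illustrative.

```python
# Test set accuracy keyed by (layers, nodes per layer), copied from the table.
acc = {
    (2, 10): 0.9250,
    (2, 20): 0.9425,
    (5, 10): 0.9347,
    (5, 20): 0.9474,
}

# Main effect of a factor: mean accuracy at its high level minus
# mean accuracy at its low level, averaged over the other factor.
nodes_effect = (acc[2, 20] + acc[5, 20]) / 2 - (acc[2, 10] + acc[5, 10]) / 2
layers_effect = (acc[5, 10] + acc[5, 20]) / 2 - (acc[2, 10] + acc[2, 20]) / 2

# Interaction: half the difference between the nodes effect at 5 layers
# and the nodes effect at 2 layers.
interaction = ((acc[5, 20] - acc[5, 10]) - (acc[2, 20] - acc[2, 10])) / 2

print("Main effect of nodes per layer: %.4f" % nodes_effect)
print("Main effect of layers: %.4f" % layers_effect)
print("Interaction: %.4f" % interaction)
```

With these numbers the nodes effect is about 1.5 percentage points and the layers effect about 0.7. Each cell was run only once, so these estimates carry no measure of run-to-run variability.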


**Conclusion:** These models performed fairly well, with test accuracy between 92.5% and 94.7%, though the results are somewhat disappointing for a neural network: we achieved better results with a random forest in the previous assignment. Further, all four models appear to be slightly overfitting the training data, with training accuracy exceeding test accuracy by roughly 1.6 to 3.7 percentage points.
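
The degree of overfitting can be read off as the gap between training and test accuracy. A short sketch using the accuracies reported in the results table (the names `scores` and `gaps` are illustrative):

```python
# (training accuracy, test accuracy) for each model, copied from the results table.
scores = {
    'ANN-2-Layers-10-Nodes-per-Layer': (0.941150, 0.925000),
    'ANN-2-Layers-20-Nodes-per-Layer': (0.970850, 0.942500),
    'ANN-5-Layers-10-Nodes-per-Layer': (0.962017, 0.934700),
    'ANN-5-Layers-20-Nodes-per-Layer': (0.984217, 0.947400),
}

# Generalization gap: how much better each model does on data it has seen.
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print("%s: gap = %.4f" % (name, gap))

largest = max(gaps, key=gaps.get)
print("Largest gap:", largest)
```

The largest gap belongs to the largest network, which is the pattern one would expect if added capacity is partly spent fitting the training set.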

To improve accuracy and reduce overfitting, neural networks offer a great deal of flexibility in their hyperparameters. Many other, more complex models deserve to be benchmarked alongside these simpler ones. For instance, one might increase the complexity of the model structure, explore different activation functions, or add regularization. Further, one warning we received while running these models stated that the models failed to converge before the maximum number of iterations (`max_iter=200`) was reached. Allowing the models to run for longer could also improve performance.
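
As a hedged sketch of some of these options, the example below raises `max_iter`, strengthens the L2 penalty `alpha`, and enables `early_stopping`. To keep it quick to run, it uses scikit-learn's small built-in digits dataset (`load_digits`, 8x8 images) rather than MNIST, so the accuracy it prints is not comparable to the results above. The specific values of `max_iter` and `alpha` are assumptions chosen for illustration, not tuned settings.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small stand-in dataset (8x8 digit images), not MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=9999, stratify=y)

# alpha: stronger L2 penalty than the 0.0001 used in the benchmark.
# max_iter: more epochs allowed than the 200 used in the benchmark.
# early_stopping: hold out validation_fraction of the training data and stop
# once its score has not improved for n_iter_no_change epochs.
clf = MLPClassifier(hidden_layer_sizes=(20, 20, 20, 20, 20),
                    alpha=0.001,
                    max_iter=1000,
                    early_stopping=True,
                    validation_fraction=0.1,
                    n_iter_no_change=10,
                    random_state=9999)

# Record any ConvergenceWarning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", category=ConvergenceWarning)
    clf.fit(X_tr, y_tr)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
test_accuracy = clf.score(X_te, y_te)

print("Epochs run:", clf.n_iter_)
print("Converged before max_iter:", converged)
print("Test set accuracy: %f" % test_accuracy)
```

The same settings could be added as further treatment conditions in the benchmark, at the cost of longer processing time.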

Of course, adjusting hyperparameters could come at the expense of time, which was not prohibitive for these simple models (the slowest took about 671 seconds, or just over eleven minutes, to train). For now, we recommend the model with five layers of 20 nodes each for the financial institution's optical character recognition problem.