'''
    Assignment tasks:
    1. Notebook configuration and naming convention -X
    2. Establishes Accuracy Scores with Two Hidden Layers -X (need snapshot)
    3. Establishes Accuracy Scores by varying parameters
    4. Explain changes in accuracy rates
    5. Articulation of Response

'''

# make sure the print statement in the code works the same way in both older and newer versions of Python.
from __future__ import print_function
# NumPy is used for working with arrays and matrices of numerical data. Provides tools to create,
# manipulate, and perform calculations on large collections of numbers efficiently. 
import numpy as np

# MNIST stands for Modified National Institute of Standards and Technology,
# and it is a widely-used dataset in the field of machine learning and computer vision.
# The MNIST dataset consists of 70,000 small images of handwritten digits (from 0 to 9),
# along with their corresponding labels (the actual digit that the image represents).
from keras.datasets import mnist

'''
The imports below bring in the Keras components used to build and configure a neural
network for machine learning tasks like image classification.
- Sequential defines the neural network architecture as a sequence of layers.
- Dense, Activation, and Dropout are the layer types added to the network.
    -- Dense layers compute weighted sums of their inputs.
    -- Activation layers apply an activation function such as ReLU.
    -- Dropout layers randomly zero out a fraction of their inputs during training.
- Adam is the optimization algorithm that adjusts the network's weights during training.
- np_utils provides helper functions such as to_categorical, used below to one-hot encode the labels.
'''

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import Adam
from keras.utils import np_utils


'''
The code below shows how to add a dense layer with L2 regularization on its weights.
It imports the regularizers module from Keras, which provides penalties that help
prevent overfitting. The dense layer has 64 nodes and an input dimension of 64. The
kernel_regularizer=regularizers.l2(0.01) argument applies L2 regularization to
the weights connecting the inputs to the nodes in this layer. L2 regularization adds a penalty
term to the loss function, 0.01 times the sum of the squared weights, which discourages the
weights from becoming too large during training. Smaller weights make the model less likely
to overfit the training data. The 0.01 value sets the strength of the penalty: larger values
increase the regularization effect. A layer regularized this way can help the model generalize
better to unseen data.

** 5 year old explanation

Imagine you're building a really cool tower with your building blocks. This tower is special because it can look at pictures of
numbers and tell you what number it is. Pretty neat, right?

Now, when you're building your tower, you want to make sure it doesn't get too big and wobbly. If you make it too
tall and unstable, it might fall over and not work properly anymore.

That's kind of like what can happen when we're training our special number-guessing tower (the neural network model). If we
let the tower get too big and complicated, it might start learning things that aren't really
important, and it won't be good at guessing new numbers it hasn't seen before.

So, these lines of code are like adding a special rule when we're building our tower.
The rule says that whenever we add a new layer of blocks (the dense layer with 64 nodes),
we have to be careful not to make it too big or heavy.

The kernel_regularizer=regularizers.l2(0.01) part is like telling the tower,
"Hey, when you're adding these new blocks, make sure you don't make them too heavy, or
you'll get a little penalty." The 0.01 part is like saying how strict we want to be with
this rule: a bigger number would mean we're super strict, and a smaller number means we're a little more relaxed.

By adding this rule, we're making sure our tower doesn't get too big and wobbly. It helps our tower stay nice and stable, so it
can keep guessing numbers correctly, even when it sees new pictures it hasn't seen before.


'''
from keras import regularizers
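# Illustration only: `model` is not created until later in this cell, so the call is
# left commented out here to avoid a NameError.
# model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))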
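
# A minimal NumPy sketch of the L2 penalty described above, assuming the penalty is
# the regularization factor times the sum of the squared weights. The helper name
# l2_penalty_sketch and the example weights are hypothetical, for illustration only.
import numpy as np


def l2_penalty_sketch(weights, factor=0.01):
    """Return factor * sum(w ** 2), the term added to the loss for one layer."""
    weights = np.asarray(weights, dtype="float64")
    return factor * np.sum(np.square(weights))


# Larger weights pay a much larger penalty, which nudges training toward small weights.
small_weights_penalty = l2_penalty_sketch([0.1, -0.2, 0.3])
large_weights_penalty = l2_penalty_sketch([1.0, -2.0, 3.0])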


# Seed NumPy's random number generator with a fixed value (1671) so that its random numbers are repeatable from run to run.
np.random.seed(1671) 

'''
Sets the configuration options and hyperparameters for the model that will be trained.
- NB_EPOCH=20 means the model makes 20 full passes over the training data.
- BATCH_SIZE=128 means the weights are updated after each batch of 128 examples.
- VERBOSE=1 controls how much information is printed to the screen during training.
- NB_CLASSES=10 is the number of output classes (the digits 0-9).
- OPTIMIZER=Adam() sets the optimization algorithm used to update the model's weights to Adam.
- N_HIDDEN=128 is the number of nodes in each hidden layer.
- VALIDATION_SPLIT=0.2 reserves 20% of the training data for validation, to monitor the model's performance during training.
- DROPOUT=0.3 is the fraction of node outputs that each Dropout layer zeroes out during training.
'''
NB_EPOCH = 20
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 

'''
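The difference between SGD & ADAM
- SGD (stochastic gradient descent) updates every weight with the same learning rate,
  stepping against the gradient computed on the current batch.
- Adam keeps running averages of each weight's gradient and squared gradient and uses
  them to give every weight its own adaptive step size, which often converges faster
  with little tuning.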
'''
OPTIMIZER = Adam()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2
DROPOUT = 0.3
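
# A rough NumPy sketch of how one SGD step differs from one Adam step for a single
# weight vector. It is an illustration under textbook defaults (the learning rates,
# beta_1=0.9, beta_2=0.999 and eps=1e-7 are assumptions), not the Keras implementation.
# sgd_step_sketch and adam_step_sketch are hypothetical helper names.
import numpy as np


def sgd_step_sketch(w, grad, lr=0.01):
    """Plain SGD: move against the gradient with one shared learning rate."""
    return w - lr * grad


def adam_step_sketch(w, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """Adam: keep running averages of the gradient (m) and squared gradient (v)."""
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    m_hat = m / (1.0 - beta_1 ** t)   # bias correction, matters most in the first steps
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# With gradients of very different sizes, SGD takes very different step sizes per weight,
# while Adam's first step has nearly the same magnitude (about lr) for every weight.
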
'''
Those lines are loading the MNIST dataset of handwritten digit images and
splitting it into training and testing sets. The mnist.load_data() function returns two tuples
- one for the training data
- one for the test data.
Each tuple contains two elements:
- the images themselves (X_train and X_test)
- their corresponding labels (y_train and y_test).
The images are unpacked into X_train and X_test, while the labels are unpacked
into y_train and y_test. This gives the code separate image and label data for
training a machine learning model and evaluating its performance on unseen test data.
RESHAPED = 784 is the length of the 1-dimensional vector that each 28x28 pixel image
is flattened into (28 * 28 = 784) before it is used as input to the model.
'''

(X_train, y_train), (X_test, y_test) = mnist.load_data()
RESHAPED = 784

'''
These lines of code pre-process the image data from the MNIST dataset to prepare it for training
and testing a machine learning model.
- The first two lines reshape each training and test image from its original 2D shape (28x28 pixels) into a 1D vector of length 784 (28*28).
This flattening step is needed because the Dense layers used below expect each sample to be a flat vector.
- The next two lines convert the training and test images from integers to 32-bit floating-point numbers,
so that the pixel values can be divided by 255 in the next step without losing the fractional part.
Together, the reshape and the type conversion turn the raw image data into a
format that the model can process in the subsequent steps.

'''
# Use the actual number of samples rather than hard-coding 60000 and 10000.
X_train = X_train.reshape(X_train.shape[0], RESHAPED)
X_test = X_test.reshape(X_test.shape[0], RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
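
# A small sketch of what the reshape and astype calls do, using a made-up batch of
# two 2x3 "images" instead of MNIST. demo_images and demo_flat are hypothetical names.
import numpy as np

demo_images = np.arange(12, dtype="uint8").reshape(2, 2, 3)   # 2 images, 2x3 pixels each
demo_flat = demo_images.reshape(demo_images.shape[0], -1).astype("float32")
# Each image becomes one row: demo_flat.shape is (2, 6) and its dtype is float32.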


'''
These lines rescale the pixel values of the training and test images.
The first two lines divide every pixel value in the training and test images by 255.
This maps the pixel values, which originally ranged from 0 to 255, into the range 0 to 1.

Scaling input data is a common practice in machine learning because inputs in a small,
consistent range tend to make training more stable. The next two lines print the number of
training and test samples in the dataset. This is a sanity check that the dataset has been loaded
and preprocessed correctly before training the model. It should
print "60000 train samples" and "10000 test samples", since the MNIST dataset
contains 60,000 training and 10,000 test images.
'''
X_train /= 255 
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
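
# A minimal sketch of the rescaling step on a few hand-picked pixel values
# (demo_pixels is a hypothetical array, not MNIST data).
import numpy as np

demo_pixels = np.array([0, 51, 255], dtype="float32")
demo_scaled = demo_pixels / 255
# The darkest pixel maps to 0.0 and the brightest to 1.0; everything else lands in between.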


'''
These lines convert the label data (y_train and y_test) from their original integer format into a
categorical format suitable for training a multi-class classification model. Initially,
the labels are single digits (0 to 9) indicating which handwritten digit the image
represents. The categorical cross-entropy loss used below expects "one-hot" encoded vectors instead
of integers. The np_utils.to_categorical function from Keras does this encoding. It takes the
integer labels and the total number of classes (NB_CLASSES=10 for digits 0-9), and converts each label
into a vector with 0s in all positions except a 1 at the index of that digit class. Each
one-hot label can then be compared directly with the 10 class probabilities the model outputs
during training and evaluation. The encoded labels are stored in Y_train and Y_test (capital Y),
while y_train and y_test keep the original integers.

** 5 year old version 

You know how when you're learning your numbers, your teacher might give you a worksheet with pictures
of numbers and ask you to circle or color in the correct number?

Well, imagine that instead of just circling the number, you had to color in a whole row of circles, and
the circle that matches the number in the picture is the only one you can color in.

For example, if the picture shows the number 3, you'd have a row of 10 circles (one for each number from 0 to 9), and you'd color in
the circle for 3 (the fourth one, because the first circle is for 0), leaving the rest blank.

That's kind of what these lines of code are doing with the labels (the actual numbers that match each picture).

Originally, the labels are just single numbers (like 3 or 7). But the computer program we're building
needs the labels in a special format where each number is represented by a row of circles, with only one circle colored in.

So these lines of code take the original labels (like 3 or 7) and convert them into rows of
10 circles, with only the circle matching the number colored in.

This special format with rows of circles (called "one-hot encoding") is what the computer program expects,
so it can learn to match the pictures of numbers with the correct row of circles (the correct label).
'''
Y_train = np_utils.to_categorical(y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(y_test, NB_CLASSES)
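
# A NumPy-only sketch of one-hot encoding, roughly what to_categorical produces for
# integer labels. one_hot_sketch is a hypothetical helper, not part of Keras.
import numpy as np


def one_hot_sketch(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype="int64")
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


# Label 3 turns into a row of ten values with the 1.0 at index 3 (the fourth position).
demo_one_hot = one_hot_sketch([3, 7], 10)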


'''
This code defines the architecture of a neural network model for classifying handwritten
digit images and compiles it. It first creates a Sequential model,
which is a linear stack of layers. It adds a Dense layer with N_HIDDEN (128) nodes as the first hidden
layer, which takes the 784 input values, followed by a ReLU activation and a Dropout layer. It then adds
a second Dense layer with N_HIDDEN nodes, another ReLU activation, and another Dropout layer, then a Dense
output layer with NB_CLASSES (10) nodes for the 10 digit classes, and finally
a softmax activation to output class probabilities. The model.summary() call prints a summary of the model architecture.
Finally, it compiles the model by specifying the loss function as categorical cross-entropy (for multi-class problems),
the previously set optimizer (Adam), and accuracy as the evaluation metric. The compiled model is then ready
to be trained on the preprocessed image data.

** 5 year old version
Imagine you have a big box of building blocks, and you want to build a really cool tower with them.
But not just any tower - you want to build one that can look at pictures of numbers and tell you what number it is!

First, we need to get all the blocks ready. The line model = Sequential() is like saying, "I'm going to build my tower one block at a time,
stacking them up in a line."

Then, we start adding different kinds of blocks to the tower. The lines like model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,))) and model.add(Activation('relu'))
are adding special blocks that are really good at looking at the pictures of numbers and finding patterns.

Some of the blocks are called "Dense" blocks, and they're like the bricks that make up the main part of the tower. The "Activation" blocks
are like little helpers that make sure the tower is working properly.

We keep adding more and more of these special blocks, until we have a really tall tower
with lots of different kinds of blocks. The last few blocks, like model.add(Dense(NB_CLASSES)) and model.add(Activation('softmax')), are the ones that
actually look at the picture of the number and tell us what number it is.

Finally, we have the line model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy']). This is like giving the tower some
instructions on how to learn and get better at recognizing the numbers. It's like telling the tower, "When you see a picture of a
number, try your best to guess what it is. If you get it wrong, don't worry, just learn from your mistakes and keep practicing!"

So, with all these special blocks and instructions, our tower is now ready to start learning how to recognize
numbers in pictures. It's like a really cool, number-guessing tower that we built ourselves!
'''

model = Sequential()

# hidden layer 1
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))

# hidden layer 2 
model.add(Dense(N_HIDDEN))
model.add(Activation('relu')) 
model.add(Dropout(DROPOUT))

# output layer
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))

model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=OPTIMIZER,
              metrics=['accuracy'])
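
# A back-of-the-envelope check of the parameter counts that model.summary() should
# report for the Dense layers, assuming each Dense layer has inputs * units weights
# plus one bias per unit (Activation and Dropout layers add no parameters).
# dense_param_count is a hypothetical helper.


def dense_param_count(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units


demo_layer_params = [
    dense_param_count(784, 128),   # hidden layer 1
    dense_param_count(128, 128),   # hidden layer 2
    dense_param_count(128, 10),    # output layer
]
demo_total_params = sum(demo_layer_params)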


'''
The purpose of adding dropout

Dropout is a technique used to reduce overfitting in neural networks during training.
Overfitting happens when the model learns the training data too well, including its noise and
irregularities, so that it performs poorly on new, unseen data. The line model.add(Dropout(DROPOUT))
adds a dropout layer to the model architecture. On each training step, this layer randomly sets to zero a
fraction (0.3, or 30%, from DROPOUT=0.3) of the outputs of the preceding layer and scales up the rest to compensate.
This keeps nodes from becoming too reliant on the presence of particular other nodes, forcing them to learn more robust
features. Because a different random subset of nodes is active on every step, the network behaves like an average of many
smaller networks, which tends to generalize better. After training, dropout is disabled, so the full network
is used to make predictions on new data.
'''
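
# A NumPy sketch of "inverted" dropout at training time, assuming kept activations
# are scaled by 1 / (1 - rate) so their expected value is unchanged. dropout_sketch is
# a hypothetical helper; it takes its own random generator so the notebook's global
# NumPy seed is left untouched.
import numpy as np


def dropout_sketch(activations, rate, rng):
    """Zero out roughly `rate` of the activations and rescale the survivors."""
    activations = np.asarray(activations, dtype="float32")
    keep_mask = rng.uniform(size=activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


demo_dropped = dropout_sketch(np.ones(10000), rate=0.3, rng=np.random.RandomState(0))
# Roughly 30% of the entries are now 0.0; the rest equal 1 / 0.7, so the mean stays near 1.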

# NEW STUFF
# Rebuild the same network from scratch: two hidden layers of N_HIDDEN nodes and 10 outputs.
model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=OPTIMIZER,
              metrics=['accuracy'])
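
# A NumPy sketch of the softmax output and the categorical cross-entropy loss named in
# model.compile, for a single sample. softmax_sketch and cross_entropy_sketch are
# hypothetical helpers and a simplification of what Keras computes internally.
import numpy as np


def softmax_sketch(logits):
    """Turn raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype="float64")
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def cross_entropy_sketch(one_hot_label, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -np.sum(np.asarray(one_hot_label) * np.log(probabilities))

# The loss is small when the true class gets a probability near 1 and grows as that
# probability shrinks.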


'''
This code trains the previously defined neural network model on the handwritten digit
image data, evaluates its performance on the test set, and prints the test metrics.
- model.fit() trains the model by showing it the training images (X_train) and labels
(Y_train)
- for a set number of passes over the data (epochs=20), in small batches (batch_size=128).
Each epoch is one full round through the training data; the weights are updated after every batch.
The validation_split reserves 20% of the training data, which the model never trains on. Checking
loss and accuracy on that held-out data after every epoch shows whether the model is starting to
overfit, without touching the test set.
After training, model.evaluate() computes the loss and accuracy of the trained model on the test images
(X_test) and labels (Y_test). Finally, the code prints the test loss (score[0]) and test accuracy (score[1]),
which show how well the model performs on data it has not seen before.

'''
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
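
# A quick arithmetic sketch of what the fit() settings above imply, assuming Keras sets
# aside the validation fraction first and then batches what is left. fit_plan_sketch is
# a hypothetical helper.
import math


def fit_plan_sketch(n_samples, validation_split, batch_size, epochs):
    """Return (training samples, validation samples, steps per epoch, total steps)."""
    n_val = int(n_samples * validation_split)
    n_train = n_samples - n_val
    steps_per_epoch = int(math.ceil(n_train / float(batch_size)))
    return n_train, n_val, steps_per_epoch, steps_per_epoch * epochs


# For 60,000 MNIST training images, a 0.2 split, batches of 128 and 20 epochs:
demo_fit_plan = fit_plan_sketch(60000, 0.2, 128, 20)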

