#### Optimization

This refers to the algorithms we will use to vary our model's parameters.

Two basic algorithms we can use to train our model:
1. Gradient descent (GD) - It iterates over the whole training set before updating the weights. It is slow, but it descends steadily.
2. Stochastic gradient descent (SGD) - It works in the same way as GD, but updates the weights many times inside a single epoch. It is much faster. The speed comes at a cost though: each update is based on only part of the data, so it approximates the true gradient.
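
The difference in update frequency can be sketched in plain Python. This is a toy illustration, not how TensorFlow implements it: the data, the single weight `w` and the helper names are all made up for the example, and the SGD pass visits the samples in a fixed order (real SGD would shuffle them).

```python
# Toy data for y = 2x, fitted with a single weight w (hypothetical example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

def gd_epoch(w, lr):
    """Batch GD: average the gradient over all samples, then update once."""
    g = sum(grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g, 1  # new weight, number of updates in this epoch

def sgd_epoch(w, lr):
    """SGD: update the weight after every single sample."""
    updates = 0
    for x, y in zip(xs, ys):
        w -= lr * grad(w, x, y)
        updates += 1
    return w, updates

w_gd, n_gd = gd_epoch(0.0, lr=0.01)
w_sgd, n_sgd = sgd_epoch(0.0, lr=0.01)
print("updates per epoch:", n_gd, "vs", n_sgd)
print("distance to the true weight 2.0:", abs(2.0 - w_gd), "vs", abs(2.0 - w_sgd))
```

After one epoch, GD has made one update and SGD has made four, which is why SGD gets closer to the true weight in the same number of passes over the data.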

##### Gradient Descent Pitfalls(Local Minima Pitfalls)

A single-batch GD would be slow, but will eventually reach a minimum.

SGD would be much faster, but will give us an approximate answer.

Each local minimum is a suboptimal solution to the optimization problem. GD is prone to this issue. It often falls into the minimum closest to the starting point rather than the global minimum.
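
A small sketch of this pitfall, assuming a hypothetical one-dimensional loss `f(x) = x**4 - 3*x**2 + x` that has a local minimum near x = 1.13 and a deeper, global minimum near x = -1.30. The function and the starting points are chosen only for illustration.

```python
def f(x):
    """A non-convex toy loss with two minima."""
    return x**4 - 3 * x**2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    """Plain gradient descent starting from x."""
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

x_from_right = gradient_descent(2.0)   # starts close to the local minimum
x_from_left = gradient_descent(-2.0)   # starts close to the global minimum
print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

Both runs use the same algorithm and learning rate. Only the starting point differs, and the run that starts on the right settles in the shallower minimum.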

#### Momentum

Both GD and SGD are good ways to train our models. We need not change them, but we can extend them. The simplest extension to apply is momentum.

Momentum: Without it, the algorithm will likely fall into a shallow dip instead of descending to the optimal solution. By including momentum, we also consider the speed with which we have been descending, so an update can carry on through a small dip.
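
A minimal sketch of the momentum update, assuming the common formulation in which a `velocity` term accumulates a fraction `alpha` of the previous update. The loss `f(w) = w**2`, the coefficient 0.9 and the function names are assumptions for the example, not values from the notes.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2 * w

def plain_gd(w, lr, steps):
    """Gradient descent without momentum."""
    for _ in range(steps):
        w -= lr * gradient(w)
    return w

def momentum_gd(w, lr, alpha, steps):
    """Gradient descent where each update keeps a fraction alpha of the previous one."""
    velocity = 0.0
    for _ in range(steps):
        velocity = alpha * velocity - lr * gradient(w)
        w += velocity
    return w

w_plain = plain_gd(5.0, lr=0.01, steps=50)
w_momentum = momentum_gd(5.0, lr=0.01, alpha=0.9, steps=50)
print(abs(w_plain), abs(w_momentum))
```

With the same learning rate and number of steps, the momentum run ends up closer to the minimum at 0 because it builds up speed while the gradient keeps pointing in the same direction.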

#### Learning Rate Schedules

Hyperparameters are set by us - for example, the width, depth and learning rate.

Parameters are found by optimizing - They are the weights and biases.

The learning rate (η) - it must be small enough so we gently descend, instead of oscillating wildly around the minimum or diverging to infinity. It also has to be big enough so the optimization takes place in a reasonable amount of time.
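
The trade-off can be seen on a toy problem. The sketch below assumes the loss `f(w) = w**2` and three hand-picked learning rates; none of these values come from the notes.

```python
def descend(lr, steps=50, w=5.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

too_small = descend(lr=0.001)   # barely moves away from the starting point
reasonable = descend(lr=0.1)    # ends up very close to the minimum at 0
too_big = descend(lr=1.1)       # overshoots further on every step and diverges
print(too_small, reasonable, too_big)
```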

A smart way to deal with choosing the proper learning rate is to adopt a learning rate schedule.

A learning rate schedule covers both requirements, big enough and small enough:
1. Start from a high initial learning rate - leads to faster training
2. At some point, we lower the rate to avoid oscillation.
3. Around the end, we pick a very small rate to get a precise answer.

How learning rate schedules are implemented in practice:
- The simplest way is to set a pre-determined piecewise learning rate.
- Exponential schedule - still simple but much better as it smoothly decays the learning rate.
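
Both schedules can be written as small functions of the epoch number. This is only a sketch: the thresholds, the initial rate of 0.1 and the decay constant are hypothetical values picked for illustration.

```python
import math

def piecewise_lr(epoch):
    """Pre-determined piecewise schedule: high at first, then lower, then very small."""
    if epoch < 5:
        return 0.1
    if epoch < 10:
        return 0.01
    return 0.001

def exponential_lr(epoch, initial_lr=0.1, decay=0.3):
    """Exponential schedule: the rate decays smoothly with every epoch."""
    return initial_lr * math.exp(-decay * epoch)

for epoch in (0, 4, 5, 10, 15):
    print(epoch, piecewise_lr(epoch), exponential_lr(epoch))
```

The piecewise schedule changes in jumps, while the exponential one lowers the rate a little at every epoch.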

#### Advanced Learning Rate Schedules

Types:
1. AdaGrad (Adaptive Gradient Algorithm): It dynamically varies the learning rate at each update and for each weight individually. In practice, we simply ask TensorFlow to use AdaGrad.
    - it is smart
    - it uses an adaptive learning rate schedule
    - it is based on the training itself
    - the adaptation is per weight

2. RMSProp (Root Mean Square Propagation): Similar to AdaGrad, but its learning rate is no longer monotonic, so it can adapt upwards and downwards. Both methods are logical and smart.

3. Adam (Adaptive Moment Estimation): The most advanced optimizer of the three (very fast and efficient). It also includes momentum.
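
To see why AdaGrad's rate can only go down while RMSProp's can move both ways, here is a sketch of the effective learning rate of a single weight under each method. It follows the usual textbook formulas (a running sum of squared gradients for AdaGrad, an exponential moving average for RMSProp). The gradient sequence and the constants are assumptions made for the example, and this is not TensorFlow's implementation.

```python
import math

def adagrad_rates(grads, lr=0.1, eps=1e-8):
    """Effective rate lr / sqrt(sum of all squared gradients seen so far)."""
    accumulated, rates = 0.0, []
    for g in grads:
        accumulated += g * g
        rates.append(lr / (math.sqrt(accumulated) + eps))
    return rates

def rmsprop_rates(grads, lr=0.1, beta=0.9, eps=1e-8):
    """Effective rate lr / sqrt(moving average of the squared gradients)."""
    average, rates = 0.0, []
    for g in grads:
        average = beta * average + (1 - beta) * g * g
        rates.append(lr / (math.sqrt(average) + eps))
    return rates

# Hypothetical gradients: large at first, then small.
grads = [1.0] * 5 + [0.01] * 40
ada = adagrad_rates(grads)
rms = rmsprop_rates(grads)
print("AdaGrad:", ada[4], "->", ada[-1])
print("RMSProp:", rms[4], "->", rms[-1])
```

Once the gradients become small, the RMSProp rate recovers because old gradients fade out of the moving average, whereas the AdaGrad rate never increases.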

#### Preprocessing 

This is the first activity before creating an ML algorithm. It refers to any manipulation applied to a dataset before running it through the model.

The motivation for preprocessing:
1. Compatibility with the library used.
2. Orders of magnitude - adjusting inputs of different magnitudes.
3. Generalization - same model, different issue.

#### Basic Preprocessing 

Relative metrics are especially useful when we have time-series data.

The relatives can further be transformed into logarithms.

Advantages:
- Faster computation
- Lower order of magnitude.
- Clearer relationships
- Homogeneous variance
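
As a small illustration of relative metrics and their logarithms, the sketch below uses a made-up series of three prices. The numbers and variable names are hypothetical.

```python
import math

# Hypothetical time series, e.g. a price observed on three consecutive days.
prices = [100.0, 110.0, 99.0]

# Relative change between consecutive observations.
relative_changes = [new / old - 1 for old, new in zip(prices, prices[1:])]

# Logarithm of the relative (the ratio of consecutive observations).
log_changes = [math.log(new / old) for old, new in zip(prices, prices[1:])]

print(relative_changes)
print(log_changes)
# Log changes add up over time: their sum equals the log of the total change.
print(sum(log_changes), math.log(prices[-1] / prices[0]))
```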

#### ML Preprocessing - Standardization

Also known as Feature Scaling and Normalization.

Standardization/Feature Scaling is the process of transforming data into a standard scale. This is done by subtracting the mean from the original variable and dividing by the standard deviation.
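
A minimal sketch of standardization in plain Python, using a small made-up list of values and the population standard deviation from the `statistics` module.

```python
import statistics

# Hypothetical values of a single input variable.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation

# Subtract the mean and divide by the standard deviation.
standardized = [(v - mean) / std for v in values]

print(mean, std)
print(standardized)
```

After the transformation, the variable has a mean of 0 and a standard deviation of 1, regardless of its original scale.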

Other methods include:
- Normalization - It consists of converting each sample into a unit length vector using the L1 or L2 Norm.
- PCA (Principal Component Analysis) - A dimension reduction technique used to combine several variables into a smaller number of composite variables.
- Whitening: Often performed after PCA. It removes most underlying correlations between data points.
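
For the normalization item above, a short sketch of converting one sample into a unit-length vector with the L1 or the L2 norm. The helper names are made up, and the sample is assumed to be non-zero.

```python
import math

def l1_normalize(sample):
    """Divide by the sum of absolute values (assumes a non-zero sample)."""
    norm = sum(abs(v) for v in sample)
    return [v / norm for v in sample]

def l2_normalize(sample):
    """Divide by the Euclidean length (assumes a non-zero sample)."""
    norm = math.sqrt(sum(v * v for v in sample))
    return [v / norm for v in sample]

sample = [3.0, 4.0]
l1_sample = l1_normalize(sample)
l2_sample = l2_normalize(sample)
print(l1_sample)
print(l2_sample)
```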

#### Dealing with Categorical Data

Categorical data refers to groups or categories (non-numerical data). An ML algorithm takes only numbers as values, so transforming categorical data means assigning numerical values to each category/group.

The 2 main ways to encode categories in a way useful for ML are:
1. One-hot encoding
2. Binary encoding

#### One-Hot vs Binary Encoding

- Binary encoding means numbering the categories in order and then writing each number in binary, i.e. with 0s and 1s.

- One-Hot encoding consists of creating as many columns as there are possible values. 

The trade-off between binary and one-hot encoding:
- We use one-hot encoding when there are few categories.
- We use binary when there are many categories.
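
The trade-off is mostly about the number of columns each encoding needs. The sketch below assumes the categories have already been numbered from 0 upwards. The helper names `one_hot` and `binary` are hypothetical.

```python
def one_hot(index, num_categories):
    """One column per category, with a single 1 at the category's position."""
    return [1 if i == index else 0 for i in range(num_categories)]

def binary(index, num_categories):
    """Write the category's number in binary, using just enough columns."""
    width = max(1, (num_categories - 1).bit_length())
    return [int(bit) for bit in format(index, f"0{width}b")]

print(one_hot(2, 4))   # 4 categories -> 4 columns
print(binary(5, 8))    # 8 categories -> 3 columns
print(len(one_hot(0, 100)), len(binary(0, 100)))  # columns needed for 100 categories
```

With 100 categories, one-hot encoding needs 100 columns while binary encoding needs 7, which is why binary encoding becomes attractive when there are many categories.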


#### MNIST Classification

The MNIST dataset consists of 70,000 images of handwritten digits. It is the 'Hello World' of ML. There are 10 classes, from 0 to 9. The objective is to build an algorithm that takes an image as input and then correctly determines which number is shown in that image.

Reasons for choosing this problem:
1. It is a visual problem - you can see the data and know what to expect.
2. It is extremely common.
3. It is easy to build up to CNNs (Convolutional Neural Networks) from the MNIST example.
4. Very big and preprocessed dataset - i.e. the dataset is large and clean; no missing values, wrong labels etc.

Read more on yann.lecun.com.

#### How to tackle Image Recognition problem

We can think about the problem as a 28x28 matrix, where input values are from 0 to 255. 0 -> black, 255 -> white. A 28x28 photo will have 784 pixels.

The approach for deep neural networks is to 'flatten' each image into a 784 x 1 vector.
 - Each photo consists of 784 pixels.
 - Each pixel is an input for our neural network.
 - Each pixel corresponds to the intensity of the color (255 is white, 0 is black).
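
Flattening can be sketched with nested Python lists. The image below is a made-up, all-black 28x28 picture with a single white pixel, used only to show where each pixel ends up in the vector.

```python
# Hypothetical 28x28 image: all black (0) except one white pixel (255).
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255

# Flatten row by row into a single vector of 784 inputs.
flat = [pixel for row in image for pixel in row]

# The pixel at row r, column c lands at position r * 28 + c.
print(len(flat), flat[3 * 28 + 5])
```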

##### The MNIST deep net

Each pixel is an input in the input layer. We will linearly combine them and add a non-linearity to get the first hidden layer. For our example, we will build a model with 2 hidden layers. We then produce the output layer. There are 10 digits => 10 classes => 10 output units in the output layer. We use a softmax activation function for the output layer, and the output is then compared to the target.
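
The softmax step of the output layer can be sketched in plain Python. The ten `logits` below are made-up numbers standing in for the output layer's values before the activation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the 10 units, one per digit 0-9.
logits = [0.1, 0.3, 2.5, 0.0, -1.0, 0.2, 0.4, 0.0, 1.1, -0.5]
probs = softmax(logits)
predicted_digit = probs.index(max(probs))
print(predicted_digit, probs[predicted_digit])
```

The unit with the highest probability gives the predicted digit, and the whole probability vector is what gets compared to the target.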

##### The MNIST action plan
1. Prepare our data and preprocess it. Create training, validation and test datasets.
2. Outline the model and choose the activation functions we want to employ.
3. Set the appropriate advanced optimizer and the loss function.
4. Train the model.
5. Test the accuracy of the model on the test dataset.

#### Deep Neural Network for MNIST Classification

We'll apply all the knowledge from the lectures in this section to write a deep neural network. The problem we've chosen is referred to as the 'Hello World' of deep learning because for most students, it is the first deep learning algorithm they see.

The dataset is called MNIST and refers to handwritten digit recognition. You can find more about it on Yann LeCun's website (Director of AI Research, Facebook). He is one of the pioneers of what we've been talking about and of more complex approaches that are widely used today, such as Convolutional Neural Networks (CNNs).

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

Our goal would be to build a neural network with 2 hidden layers.

#### Import the relevant packages

In [1]:
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds


In [2]:
conda install -c anaconda tensorflow-datasets


Note: you may need to restart the kernel to use updated packages.


#### Data

mnist_dataset = tfds.load(name = 'mnist')
#tfds has a large number of datasets ready for modelling
#the first time you execute tfds.load(), a dataset will be downloaded on your computer.
#Each consecutive time you run the code, it will automatically load this local copy.

mnist_dataset, mnist_info = tfds.load(name = 'mnist', with_info = True, as_supervised=True)

#tfds.load(name, with_info, as_supervised) loads a dataset from TensorFlow datasets.
#as_supervised = True loads the data in a 2-tuple structure (input, target).
#with_info = True also returns an object containing info about the version, features and number of samples of the dataset
#the first time it will take a bit longer (since you're actually downloading the dataset)

In [3]:
#Acquiring our data and storing in the mnist dataset.
mnist_dataset,mnist_info = tfds.load('mnist',with_info = True, as_supervised = True)




In [4]:
#Extract the train and test data
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

#Let's take 10% of the training dataset to serve as validation. 
#We can either count the num of train samples, or use the mnist_info

In [5]:
#The validation dataset
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
num_validation_samples = tf.cast(num_validation_samples, tf.int64)

#tf.cast(x, dtype) casts (converts) a variable into a given data type

#store the number of test samples in a dedicated variable
num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)

#normally, we'd like to scale our data in some way to make the result more numerically stable (e.g. inputs between 0 and 1)

In [None]:
#define a function called scale that will scale the inputs
def scale(image, label):
    # we make sure the value is a float
    image = tf.cast(image, tf.float32)
    # since the possible values for the inputs are 0 to 255 (256 different shades of grey),
    # dividing each element by 255 gives the desired result -> all elements will be between 0 and 1
    image /= 255
    return image, label

#dataset.map(*function*) applies a custom transformation to a given dataset.
#It takes as input a function which determines the transformation to a given dataset

scaled_train_and_validation_data = mnist_train.map(scale)
#this will scale the whole train dataset and store it in our new variable

test_dataset = mnist_test.map(scale)

BUFFER_SIZE = 10000 #the size of the shuffle buffer; it matters when we are dealing with enormous datasets

#when we are dealing with enormous datasets, we can't shuffle all data at once
#if buffer_size = 1, no shuffling will happen
#if buffer_size >= num_samples, shuffling will happen at once (uniformly)
#if 1 < buffer_size < num_samples, we will be optimizing the computational power

#we apply shuffling in the preprocessing stage.
#Shuffling = Keeping the same information but in a different order.
#note: by default shuffle() reshuffles on every pass over the data, so the take/skip split below can differ between passes;
#passing reshuffle_each_iteration=False keeps the split fixed
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)

#extracting the train and validation dataset
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

# the train_data is everything else, so we skip as many samples as there are in the validation dataset
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

#set a batch size and prepare our data for batching
#batch size of 1 = Stochastic gradient descent(SGD)
#batch size = num of samples = (single batch) GD
#1 < batch size < num of samples = mini-batch GD
BATCH_SIZE = 100

#dataset.batch(batch_size) is a method that combines the consecutive elements of a dataset into batches
train_data = train_data.batch(BATCH_SIZE) #this adds a leading dimension to our tensors that indexes the samples within each batch

validation_data = validation_data.batch(num_validation_samples)

#the test data needs no mini-batching: we only forward propagate it, so a single batch of num_test_samples is enough

#when batching, we find the average loss and average accuracy
#the model expects the validation dataset in batch form too.

#extract and convert the validation inputs and target appropriately
validation_inputs, validation_targets = next(iter(validation_data))

#### Model

##### Outline the model

In [7]:
#declare 3 variables for width of the input, output and hidden layer
input_size = 784
output_size = 10
hidden_layer_size = 50
#the underlying assumption is that all hidden layers are of the same size

#define the actual model and store in a variable called model
#our data(from tfds) is such that each input is 28x28x1
#tf.keras.layers.Flatten(original shape) transforms (flattens) a tensor into a vector
model = tf.keras.Sequential([
                            tf.keras.layers.Flatten(input_shape = (28, 28, 1)),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(output_size, activation = 'softmax')
                            ])

tf.keras.layers.Dense(output size) takes the inputs provided to the model, calculates the dot product of the inputs and the weights, and adds the bias. This is also where we can apply an activation function.

#### Choose the optimizer and the loss function

In [8]:
#specify the optimizer and the loss through the compile method
model.compile(optimizer ='adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
# Adam - Adaptive Moment Estimation
#configures the model for training

Types of Crossentropy:
1. Binary Crossentropy - used when there are only two classes (targets of 0 and 1)
2. Categorical Crossentropy - expects that you've one-hot encoded the targets
3. Sparse Categorical Crossentropy - expects integer class labels and applies the one-hot encoding for us
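
The relation between the categorical and the sparse variant can be sketched for a single sample. This is a plain-Python illustration with made-up probabilities for three classes, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, probs):
    """Expects the target as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

def sparse_categorical_crossentropy(label, probs):
    """Expects the target as an integer class label."""
    return -math.log(probs[label])

# Hypothetical softmax output for one sample whose true class is 1.
probs = [0.1, 0.7, 0.2]
loss_one_hot = categorical_crossentropy([0, 1, 0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
print(loss_one_hot, loss_sparse)
```

Both give the same loss. The sparse version simply saves us from one-hot encoding the targets ourselves, which is convenient here because the MNIST targets are the integers 0 to 9.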

#### Training

In [9]:
#fit the model we have built
NUM_EPOCHS = 5

model.fit(train_data, epochs = NUM_EPOCHS, validation_data = (validation_inputs, validation_targets), verbose = 2)

Epoch 1/5
540/540 - 10s - loss: 0.3980 - accuracy: 0.8877 - val_loss: 0.2292 - val_accuracy: 0.9403 - 10s/epoch - 19ms/step
Epoch 2/5
540/540 - 7s - loss: 0.1810 - accuracy: 0.9475 - val_loss: 0.1708 - val_accuracy: 0.9525 - 7s/epoch - 12ms/step
Epoch 3/5
540/540 - 8s - loss: 0.1395 - accuracy: 0.9589 - val_loss: 0.1557 - val_accuracy: 0.9570 - 8s/epoch - 14ms/step
Epoch 4/5
540/540 - 7s - loss: 0.1136 - accuracy: 0.9663 - val_loss: 0.1284 - val_accuracy: 0.9662 - 7s/epoch - 13ms/step
Epoch 5/5
540/540 - 7s - loss: 0.0974 - accuracy: 0.9703 - val_loss: 0.1101 - val_accuracy: 0.9680 - 7s/epoch - 13ms/step


<keras.callbacks.History at 0x182e4bee860>

WHAT HAPPENS INSIDE AN EPOCH
1. At the beginning of each epoch, the training loss will be set to 0
2. The algorithm will iterate over a preset number of batches, all from train_data.
3. The weights and biases will be updated as many times as there are batches.
4. We will get a value for the loss function, indicating how the training is going.
5. We will also see a training accuracy.
6. At the end of the epoch, the algorithm will forward propagate the whole validation set.
*When we reach the maximum number of epochs, the training will be over.

Explaining the result:
- Info on the number of epochs
- Number of batches = 540/540
- The time it took for the epoch to conclude
- The training loss
- The accuracy - it shows in what % of the cases our outputs were equal to the targets.
- The loss and the accuracy for the validation dataset.
- To assess the overall accuracy of the model, check the val_accuracy for the last epoch (97%) - this is the validation accuracy, our best estimate of the model's accuracy until we test it.
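
The 540 batches follow from the sizes used above. A quick check of the arithmetic, assuming the 60,000-image MNIST 'train' split with 10% held out for validation and a batch size of 100:

```python
import math

num_train_examples = 60_000                        # size of the MNIST 'train' split
num_validation_samples = num_train_examples // 10  # 10% held out for validation
num_training_samples = num_train_examples - num_validation_samples
BATCH_SIZE = 100

batches_per_epoch = math.ceil(num_training_samples / BATCH_SIZE)
print(num_validation_samples, num_training_samples, batches_per_epoch)
```

So the weights and biases are updated 540 times in every epoch.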

In [11]:
#Let's increase the hidden layer to 100 and re-run
input_size = 784
output_size = 10
hidden_layer_size = 100

model = tf.keras.Sequential([
                            tf.keras.layers.Flatten(input_shape = (28, 28, 1)),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                            tf.keras.layers.Dense(output_size, activation = 'softmax')
                            ])

In [13]:
#specify the optimizer and the loss through the compile method
model.compile(optimizer ='adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

In [14]:
NUM_EPOCHS = 5

model.fit(train_data, epochs = NUM_EPOCHS, validation_data = (validation_inputs, validation_targets), verbose = 2)

Epoch 1/5
540/540 - 9s - loss: 0.3359 - accuracy: 0.9050 - val_loss: 0.1946 - val_accuracy: 0.9497 - 9s/epoch - 17ms/step
Epoch 2/5
540/540 - 7s - loss: 0.1386 - accuracy: 0.9591 - val_loss: 0.1303 - val_accuracy: 0.9665 - 7s/epoch - 13ms/step
Epoch 3/5
540/540 - 7s - loss: 0.0982 - accuracy: 0.9705 - val_loss: 0.1036 - val_accuracy: 0.9717 - 7s/epoch - 13ms/step
Epoch 4/5
540/540 - 6s - loss: 0.0753 - accuracy: 0.9774 - val_loss: 0.0919 - val_accuracy: 0.9750 - 6s/epoch - 12ms/step
Epoch 5/5
540/540 - 7s - loss: 0.0604 - accuracy: 0.9813 - val_loss: 0.0769 - val_accuracy: 0.9790 - 7s/epoch - 13ms/step


<keras.callbacks.History at 0x182e5f68b50>

With wider hidden layers (100 units instead of 50), the val_accuracy increased to 97.90%!

#### Test the model

We train on the training data and validate on the validation data. We must make sure our parameters - the weights and the biases - don't overfit.

Create a model -> fiddle with the hyperparameters -> check validation accuracy. 

What we find are the hyperparameters that fit our validation data best. By fine-tuning them, we are overfitting the validation dataset.

The validation dataset is our reality check that prevents us from overfitting the parameters. The validation accuracy is a benchmark for how good the model is.

The test dataset is our reality check that prevents us from overfitting the hyperparameters(make sure our hyperparameters - width, depth, batch size, epochs etc. - don't overfit). It is the dataset the model has truly never seen.

In [15]:
#batch the test data in a single batch, then access the test accuracy by using the method evaluate.
test_data = test_dataset.batch(num_test_samples)
test_loss, test_accuracy = model.evaluate(test_data)

In [16]:
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy * 100.))

Our model has a test accuracy of 97.28%. This is the final stage of the ML process. After testing the model, we are no longer allowed to change it.

The main point of the test dataset is to simulate model deployment. If we get 50 to 60% testing accuracy, we will know that our model has overfit and it will fail miserably in real life. However, getting a test accuracy very close to the validation accuracy shows that we have not overfit.

The test accuracy is the accuracy we expect to observe if we deploy the model in the real world.