# Introduction to Deep Learning with Keras + TensorFlow

This notebook provides a very brief introduction to the theory behind deep neural networks. There are tons of great resources online for digging deeper into the math behind deep learning, so this introduction will be very high level and I'll gloss over some details. My goal with the conceptual intro is to give you just enough intuition to make sense of the code we'll look at later.

Following the theoretical foundation, we'll dive straight into some code examples that demonstrate how to build deep neural networks in TensorFlow using Keras. We'll look at a classic image classification task using the MNIST handwritten digit dataset. From there we'll move on to time series anomaly detection with LSTM.

## Deep Learning Basics

Deep learning is a field of machine learning that uses artificial neural networks to perform learning tasks. These neural networks are called "deep" because they have many layers of neurons. 

<img src="images/AI-ML-DL.svg" alt="deep learning in context" width="400">

<a href="https://commons.wikimedia.org/wiki/File:AI-ML-DL.svg">Original file: Avimanyu786SVG version: Tukijaaliwa</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons

While the theoretical beginnings of deep learning can be traced to the work of [Frank Rosenblatt](https://en.wikipedia.org/wiki/Frank_Rosenblatt) on the perceptron in the late 1950s and 1960s, advances in GPU technology and the availability of big data have helped the field take off in recent years. GPUs are particularly important as they enable massive parallelism for the linear algebra and calculus operations (mostly matrix arithmetic) used when training models.

### The Artificial Neuron

Essentially, a neural network is a function approximator. We'd like the network to learn a function `f(x)` that maps some input `x` to some prediction `y`. For example, given an image of a cat, we'd like the network to learn a function that takes the pixels of the image as input and correctly "predicts" that the image is a cat.


Deep neural networks are composed of layers of inter-connected neurons. Each neuron performs a simple linear transformation on its inputs and passes the result of that transformation through a non-linear "activation function." The non-linearity is important because it allows the network to learn complex non-linear functions. Without the non-linear activation function, the network would only be able to learn linear functions, no matter how many layers it had.
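To make that last point concrete, here is a minimal sketch using scalar "layers" (the weights and biases are arbitrary values picked for illustration). Stacking two linear layers collapses into a single linear function, while putting a simple non-linearity, `max(0, x)`, between them does not.

```python
def linear(w, b):
    """Return the function x -> w * x + b."""
    return lambda x: w * x + b

def relu(x):
    """A simple non-linearity: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

# Two hypothetical scalar layers with arbitrary parameters.
layer1 = linear(2.0, 1.0)    # h = 2x + 1
layer2 = linear(-3.0, 0.5)   # y = -3h + 0.5

def stacked_linear(x):
    # No activation in between: still just a linear function of x.
    return layer2(layer1(x))

# The same function written as ONE linear layer:
# w = w2 * w1 = -6, b = w2 * b1 + b2 = -2.5
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)

def stacked_relu(x):
    # With a non-linearity in between, the result is no longer a straight line.
    return layer2(relu(layer1(x)))

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, stacked_linear(x), collapsed(x), stacked_relu(x))
```

The second and third columns always match, which is why extra linear layers alone add no expressive power. The last column is flat for inputs where the first layer's output is negative and sloped elsewhere, something no single line can do.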

![single neuron](images/neuron.jpg)

`f(x) = wx + b` 

(this is our old friend `y = mx + b` from back in the day except that *w and x are vectors*)

`Neuron output = activation function(f(x))`
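As a rough sketch of the two formulas above in plain Python: the weighted sum is a dot product between the weight vector and the input vector, plus the bias. The input values, weights, and the use of `math.tanh` as the activation are all placeholder assumptions for illustration; any non-linear function could be passed in instead.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Compute activation(w . x + b) for a single neuron."""
    # Linear part: dot product of weights and inputs, plus the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear part: pass the result through the activation function.
    return activation(z)

# Hypothetical example values.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.5

pre_activation = neuron(x, w, b, activation=lambda z: z)  # f(x) = wx + b only
output = neuron(x, w, b, activation=math.tanh)            # full neuron output

print(pre_activation, output)
```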

AI researchers have explored different activation functions, and this is still an area of active research. One common activation function is the [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) or ReLU. This simple function returns zero when its input is negative, and returns the input unchanged when it is positive. In Python this is just `max(0, x)`. We'll use ReLU in our experiments today.
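Here is a small sketch of ReLU applied to a handful of made-up values. The `relu_grad` helper is an addition for illustration: it shows the slope of ReLU (using the common convention of 0 at an input of exactly zero), which is what training uses when working out how to adjust the weights.

```python
def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]  # arbitrary example inputs
activated = [relu(z) for z in pre_activations]
slopes = [relu_grad(z) for z in pre_activations]

print(activated)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(slopes)     # [0.0, 0.0, 0.0, 1.0, 1.0]
```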

### Training Deep Neural Networks

![neural network](images/network.jpg)

At a high level, the process to train a neural network is as follows:
1. Initialize the weights to small random numbers centered around zero (both positive and negative). Biases are often simply initialized to zero.
1. Make a prediction on a small (mini) batch of the training data. Typically NNs are trained on huge datasets so it's necessary to break the training data into "mini-batches." TensorFlow will do this for us automatically.
1. Use a loss function to calculate the "badness" of the prediction. There are different loss functions for different learning tasks. Since the network starts with random weights and biases, the initial predictions will be pretty bad.
1. Calculate the gradient (partial derivative) of the loss function at each of the outputs. This is a measure of how much the loss will change for a change in each output of the network.
1. Apply "back propagation" to calculate the gradient of the loss with respect to each weight and bias in the network. Again, this is a partial derivative that tells us how a small change in each weight will change the loss.
1. Adjust all of the weights and biases by a small amount (scaled by the "learning rate") in the direction opposite the gradient, since the gradient points toward increasing loss. In pseudo code this can be expressed by `weights -= learning_rate * weight_gradients` and `biases -= learning_rate * bias_gradients`.
1. Repeat the process on each batch of training data, adjusting the weights each time, until the model has seen all of the batches. One pass through all of the batches of the training data is called an epoch, and typically we'll need to train for several epochs.

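This process of taking small steps in the direction that reduces the loss, by adjusting the model weights and biases by the learning rate times the gradient, is called [gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent). When each step uses a mini-batch rather than the full dataset, as it does here, it is known as stochastic (or mini-batch) gradient descent.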
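The training steps above can be sketched in plain Python. To keep it short, this is a toy under loud assumptions: the "network" is a single linear neuron (so back propagation reduces to one application of the chain rule), the loss is mean squared error, and the data, learning rate, batch size, and epoch count are all made-up values. It is not how TensorFlow implements training internally.

```python
import random

random.seed(0)  # make the sketch reproducible

# Toy training data generated from y = 3x + 2 (an assumption for illustration).
data = [(i / 10, 3.0 * (i / 10) + 2.0) for i in range(-20, 21)]

# Step 1: small random weight centered on zero, bias at zero.
w = random.gauss(0.0, 0.1)
b = 0.0

learning_rate = 0.1
batch_size = 8
epochs = 100

def mse(w, b, batch):
    """Mean squared error of the predictions w * x + b on a batch."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)

initial_loss = mse(w, b, data)

for epoch in range(epochs):          # Step 7: one epoch = one pass over all batches
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]   # Step 2: take a mini-batch
        n = len(batch)
        # Steps 3-5: the loss is mean((pred - y)^2), so d(loss)/d(pred) is
        # 2 * (pred - y) / n. The chain rule then gives the gradient for
        # each parameter, since d(pred)/dw = x and d(pred)/db = 1.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
        # Step 6: step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mse(w, b, data)
print(f"w={w:.3f} b={b:.3f} loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

After training, `w` and `b` should land very close to the 3 and 2 used to generate the data. A real deep network repeats the same chain-rule step layer by layer, which is the part the frameworks automate.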

While it's fun to understand the theory because you can impress your friends at cocktail parties, in practice, modern deep learning frameworks like TensorFlow and PyTorch take care of nearly all of this for us.

Let's jump in.

## The "Hello World" of Deep Learning
The [MNIST](https://en.wikipedia.org/wiki/MNIST_database) handwritten digit classification task is sort of the "hello world" of deep learning. The [original paper](https://en.wikipedia.org/wiki/MNIST_database) by LeCun et al. is a classic worth reading.

LeCun's work on MNIST was a breakthrough in 1998. Today, we can replicate the results in TensorFlow with just a few lines of Python. First, we'll train a fully connected deep neural network (DNN) on MNIST. Then we'll train a convolutional neural network (CNN) and compare the difference in performance.

We can visualize our model using:

```python
tf.keras.utils.plot_model(model, show_shapes=True, rankdir="LR")
```