# Introduction to Data Science 
# Lecture 25: Neural Networks II
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

In this lecture, we'll continue discussing Neural Networks. 

Recommended Reading:
* A. Géron, [Hands-On Machine Learning with Scikit-Learn & TensorFlow](http://proquest.safaribooksonline.com/book/programming/9781491962282) (2017) 
* I. Goodfellow, Y. Bengio, and A. Courville, [Deep Learning](http://www.deeplearningbook.org/) (2016)
*  Y. LeCun, Y. Bengio, and G. Hinton, [Deep learning](https://www.nature.com/articles/nature14539), Nature (2015) 


## Recap: Neural Networks

Last time, we introduced *Neural Networks* and discussed how they can be used for classification and regression.

There are many different *network architectures* for Neural Networks, but our focus is on **Multi-layer Perceptrons**. Here, there is an *input layer*, typically drawn on the left-hand side, and an *output layer*, typically drawn on the right-hand side. The layers in between are called *hidden layers*. 


<img src="Colored_neural_network.svg" title="https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg" 
width="300">

Given a set of features $X = x^0 = \{x_1, x_2, ..., x_n\}$ and a target $y$, a neural network works as follows. 


Each layer applies an affine transformation and an [activation function](https://en.wikipedia.org/wiki/Activation_function) (e.g., ReLU, hyperbolic tangent, or logistic) to the output of the previous layer: 
$$
x^{j} = f ( A^{j} x^{j-1} + b^j ). 
$$
At the $j$-th hidden layer, the input is represented as the composition of $j$ such mappings. An additional function is applied to the output layer to give the prediction, $\hat y$: *e.g.*, [softmax](https://en.wikipedia.org/wiki/Softmax_function) for classification, or the identity for regression. 

<img src="activationFct.png" 
title="see Géron, Ch. 10" 
width="700">
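
As a rough illustration of the layer formula above, the following NumPy sketch pushes one input vector through a small multi-layer perceptron. The layer sizes, the random weights, and the helper names `relu` and `forward` are assumptions made for this example; they are not part of the lecture's code.

```python
import numpy as np

def relu(z):
    # ReLU activation, applied elementwise
    return np.maximum(z, 0.0)

def forward(x, layers, f=relu):
    # Apply x^j = f(A^j x^{j-1} + b^j) for each hidden layer in turn.
    # The last layer is left affine (no activation); its outputs are the scores
    # that a final function such as softmax would turn into a prediction.
    for A, b in layers[:-1]:
        x = f(A @ x + b)
    A, b = layers[-1]
    return A @ x + b

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # hypothetical: 4 inputs, hidden layers of 5 and 3, 2 outputs
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x0 = rng.normal(size=sizes[0])
scores = forward(x0, layers)
print(scores.shape)  # (2,)
```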


## Softmax function for classification 

The *softmax function*, $\sigma:\mathbb{R}^K \to (0,1)^K$ is defined by
$$
\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}
\qquad \qquad \textrm{for } j=1, \ldots, K.
$$
Note that each component is in the range $(0,1)$ and the values sum to 1. We interpret $\sigma(\mathbf{z})_j$ as the probability that $\mathbf{z}$ is a member of class $j$. 
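
A minimal NumPy sketch of the definition above may help. The function name `softmax` and the example scores are chosen for illustration. Subtracting the largest entry before exponentiating is a common numerical safeguard and does not change the result, since the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) leaves the result unchanged (the factor e^{-max} cancels
    # in the ratio) but avoids overflow in exp for large entries.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax([1.0, 2.0, 3.0])
print(p)             # three values in (0, 1), largest for the largest score
print(p.sum())       # sums to 1 up to rounding
print(np.argmax(p))  # index of the most likely class (0-based)
```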

## Training a neural network

Neural networks use a loss function of the form 
$$
Loss(\hat{y},y,W) =  \frac{1}{2} \sum_{i=1}^n g(\hat{y}_i(W),y_i) + \frac{\alpha}{2} \|W\|_2^2
$$
Here, 
+ $y_i$ is the label for the $i$-th example, 
+ $\hat{y}_i(W)$ is the predicted label for the $i$-th example, 
+ $g$ is a function that measures the error, typically $L^2$ difference for regression or cross-entropy for classification, and 
+ $\alpha$ is a regularization parameter. 
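
To make the two terms of the loss concrete, here is a small sketch that evaluates it for a regression setting, taking $g$ to be the squared difference. The function name `regularized_loss` and the toy numbers are assumptions for this example.

```python
import numpy as np

def regularized_loss(y_hat, y, weights, alpha):
    # data term: (1/2) * sum_i g(y_hat_i, y_i), with g the squared difference
    data_term = 0.5 * np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2)
    # penalty term: (alpha/2) * ||W||_2^2, summed over all weight arrays
    penalty = 0.5 * alpha * sum(np.sum(W ** 2) for W in weights)
    return data_term + penalty

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])
W = [np.array([[1.0, -2.0], [0.0, 1.0]])]  # a single hypothetical weight matrix

print(regularized_loss(y_hat, y, W, alpha=0.0))  # data term only: 0.625
print(regularized_loss(y_hat, y, W, alpha=0.1))  # about 0.925 (adds 0.05 * 6)
```

Increasing $\alpha$ raises the cost of large weights, which pushes the minimizer toward smaller weights.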

Starting from initial random weights, the loss function is minimized by repeatedly updating these weights. Various **optimization methods** can be used, *e.g.*, 
+ gradient descent,
+ quasi-Newton methods,
+ stochastic gradient descent, or 
+ Adam. 

There are various parameters associated with each method that must be tuned. 

**Back propagation** is a way of using the chain rule from calculus to compute the gradient of the $Loss$ function for optimization. 
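
The chain-rule bookkeeping can be sketched by hand for a tiny network. The example below assumes a hypothetical model with one hidden unit, $\hat y = w_2 \tanh(w_1 x + b_1)$, and loss $\frac{1}{2}(\hat y - y)^2$; it computes the gradient by working backwards from the loss and compares the result with central finite differences. All names and numbers are illustrative.

```python
import math

def loss(params, x, y):
    w1, b1, w2 = params
    h = math.tanh(w1 * x + b1)   # hidden activation
    y_hat = w2 * h               # prediction
    return 0.5 * (y_hat - y) ** 2

def backprop(params, x, y):
    w1, b1, w2 = params
    # forward pass, keeping the intermediate values
    z = w1 * x + b1
    h = math.tanh(z)
    y_hat = w2 * h
    # backward pass: apply the chain rule from the loss back to each parameter
    d_yhat = y_hat - y            # dLoss/dy_hat
    d_w2 = d_yhat * h             # dLoss/dw2
    d_h = d_yhat * w2             # dLoss/dh
    d_z = d_h * (1.0 - h * h)     # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                # dLoss/dw1
    d_b1 = d_z                    # dLoss/db1
    return [d_w1, d_b1, d_w2]

def finite_difference(params, x, y, eps=1e-6):
    # central differences, one parameter at a time (slow, used only as a check)
    grads = []
    for i in range(len(params)):
        up = list(params); up[i] += eps
        dn = list(params); dn[i] -= eps
        grads.append((loss(up, x, y) - loss(dn, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2]
print(backprop(params, x=2.0, y=1.0))
print(finite_difference(params, x=2.0, y=1.0))  # should agree to several digits
```

The backward pass reuses the intermediate values from the forward pass, so the whole gradient costs about as much as one extra evaluation of the network, whereas finite differences need two evaluations per parameter.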

## Neural Networks in scikit-learn

In the previous lecture, we used Neural Network implementations in scikit-learn to do both classification and regression:
+ [multi-layer perceptron (MLP) classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
+ [multi-layer perceptron (MLP) regressor](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)


However, there are several limitations to the scikit-learn implementation: 
- no GPU support
- limited network architectures 

## Neural networks with TensorFlow

Today, we'll use [TensorFlow](https://github.com/tensorflow/tensorflow) to train a Neural Network. 

TensorFlow is an open-source library designed for large-scale machine learning. 

### Installing TensorFlow

Instructions for installing TensorFlow are available at [the tensorflow install page](https://www.tensorflow.org/install).

It is recommended that you use the command: 
```
pip install tensorflow
```


In [2]:
import numpy as np  # needed by reset_graph below
import tensorflow as tf
print(tf.__version__)

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

1.13.2


TensorFlow represents computations by connecting op (operation) nodes into a computation graph.

<img src="graph.png" 
title="An example of computational graph" 
width="400">

A TensorFlow program usually has two components:
+ In the *construction phase*, a computational graph is built. During this phase, no computations are performed and the variables are not yet initialized. 
+ In the *execution phase*, the graph is evaluated, typically many times. In this phase, each operation is assigned to a CPU or GPU, variables are initialized, and functions can be evaluated. 

In [4]:
# construction phase
x = tf.Variable(3)
y = tf.Variable(4)
f = x*x*y + y + 2

# execution phase
with tf.Session() as sess: # initializes a "session" 
    x.initializer.run()
    y.initializer.run()
    print(f.eval())


# alternatively, all variables can be initialized as follows
init = tf.global_variables_initializer()
with tf.Session() as sess: # initializes a "session" 
    init.run() # initializes all the variables
    print(f.eval())


42
42


### Autodiff

TensorFlow can automatically compute the derivative of functions using [```gradients```](https://www.tensorflow.org/api_docs/python/tf/gradients). 

In [6]:
# construction phase
x = tf.Variable(3.0)
y = tf.Variable(4.0)
f = x + 2*y*y + 2
grads = tf.gradients(f,[x,y])

# execution phase
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) # initializes all variables
    print([g.eval() for g in grads])


[1.0, 16.0]


This is enormously helpful since training a NN requires the derivative of the loss function with respect to the parameters (and there are a lot of parameters). This is computed using backpropagation (chain rule) and TensorFlow does this work for you. 
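
To give a feel for how derivatives can be computed exactly from the operations themselves (rather than by finite differences), here is a toy forward-mode sketch using dual numbers. This is a different and simpler technique than the reverse-mode differentiation TensorFlow applies to its graph, and the `Dual` class is a hypothetical helper written only for this example. It reproduces the gradient of $f(x,y) = x + 2y^2 + 2$ at $(3,4)$.

```python
class Dual:
    """A number value + deriv*eps with eps**2 = 0; deriv carries the derivative."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    @staticmethod
    def lift(other):
        # treat plain numbers as constants (derivative zero)
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def f(x, y):
    return x + 2*y*y + 2

# seed the derivative of one input at a time
df_dx = f(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv
df_dy = f(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv
print([df_dx, df_dy])  # [1.0, 16.0]
```

Forward mode needs one pass per input, which is why reverse mode (backpropagation) is preferred when there are many parameters and a single scalar loss.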

**Exercise:** Use TensorFlow to compute the derivative of $f(x) = e^x$ at $x=2$.

In [4]:
# your code here


### Optimization methods
TensorFlow also has several built-in optimization methods. The available optimizers include:
+ [```tf.train.Optimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/Optimizer)
+ [```tf.train.GradientDescentOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/GradientDescentOptimizer)
+ [```tf.train.AdadeltaOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdadeltaOptimizer)
+ [```tf.train.AdagradOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdagradOptimizer)
+ [```tf.train.AdagradDAOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdagradDAOptimizer)
+ [```tf.train.MomentumOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/MomentumOptimizer)
+ [```tf.train.AdamOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdamOptimizer)
+ [```tf.train.FtrlOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/FtrlOptimizer)
+ [```tf.train.ProximalGradientDescentOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalGradientDescentOptimizer)
+ [```tf.train.ProximalAdagradOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalAdagradOptimizer)
+ [```tf.train.RMSPropOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/RMSPropOptimizer)

For more information, see the [TensorFlow training webpage](https://www.tensorflow.org/api_guides/python/train). 


Let's see how to use the [```GradientDescentOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/GradientDescentOptimizer). 

In [7]:
x = tf.Variable(3.0, trainable=True)
y = tf.Variable(2.0, trainable=True)
f = x*x + 100*y*y
opt = tf.train.GradientDescentOptimizer(learning_rate=5e-3).minimize(f)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        if i%100 == 0: print(sess.run([x,y,f]))
        sess.run(opt)
        

[3.0, 2.0, 409.0]
[1.0980968, 0.0, 1.2058167]
[0.40193906, 0.0, 0.161555]
[0.14712274, 0.0, 0.0216451]
[0.053851694, 0.0, 0.0029000049]
[0.019711465, 0.0, 0.00038854184]
[0.0072150305, 0.0, 5.2056665e-05]
[0.0026409342, 0.0, 6.9745333e-06]
[0.00096666755, 0.0, 9.344461e-07]
[0.00035383157, 0.0, 1.2519678e-07]
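
The printed values can be understood from the update rule itself. The gradient of $f$ is $(2x, 200y)$, so with learning rate $\eta = 0.005$ each step maps $x \mapsto (1 - 2\eta)x = 0.99x$ and $y \mapsto (1 - 200\eta)y = 0$. Hence $y$ reaches zero after one step while $x$ decays geometrically. The plain-Python sketch below (function names are illustrative) repeats the same iteration without TensorFlow.

```python
def grad_f(x, y):
    # gradient of f(x, y) = x^2 + 100 y^2
    return 2.0 * x, 200.0 * y

def gradient_descent(x, y, learning_rate, n_steps):
    for _ in range(n_steps):
        gx, gy = grad_f(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

x100, y100 = gradient_descent(3.0, 2.0, learning_rate=5e-3, n_steps=100)
print(x100, y100)          # x is close to 3 * 0.99**100, y is (numerically) zero
print(3.0 * 0.99 ** 100)   # closed form for x after 100 steps
```

The slow decay in $x$ illustrates a typical difficulty: the learning rate is limited by the steepest direction ($y$), which makes progress in the flat direction ($x$) slow.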


Using another optimizer, such as the [```MomentumOptimizer```](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/MomentumOptimizer), 
has similar syntax. 

In [7]:
x = tf.Variable(3.0, trainable=True)
y = tf.Variable(2.0, trainable=True)
f = x*x + 100*y*y
opt = tf.train.MomentumOptimizer(learning_rate=1e-2,momentum=.5).minimize(f)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        if i%100 == 0: print(sess.run([x,y,f]))
        sess.run(opt)
        

[3.0, 2.0, 409.0]
[0.043930665, 2.0290405e-15, 0.0019299033]
[0.0006126566, -1.547466e-30, 3.753481e-07]
[8.544106e-06, 0.0, 7.300175e-11]
[1.1915596e-07, 0.0, 1.4198143e-14]
[1.6617479e-09, 0.0, 2.761406e-18]
[2.3174716e-11, 0.0, 5.3706746e-22]
[3.2319424e-13, 0.0, 1.0445451e-25]
[4.5072626e-15, 0.0, 2.0315416e-29]
[6.285822e-17, 0.0, 3.951156e-33]
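
For comparison, here is a plain-Python sketch of gradient descent with momentum on the same function. It assumes the classical update, in which an accumulator is set to `momentum * accumulator + gradient` and the variable is then decreased by `learning_rate * accumulator`; this is my reading of the rule documented for `MomentumOptimizer`, and the function name is illustrative.

```python
def momentum_descent(x, y, learning_rate, momentum, n_steps):
    # velocity ("accumulator") for each variable, starting at zero
    vx, vy = 0.0, 0.0
    for _ in range(n_steps):
        gx, gy = 2.0 * x, 200.0 * y      # gradient of x^2 + 100 y^2
        vx = momentum * vx + gx          # accumulate past gradients
        vy = momentum * vy + gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return x, y

x100, y100 = momentum_descent(3.0, 2.0, learning_rate=1e-2, momentum=0.5, n_steps=100)
print(x100, y100)
```

Under this assumed rule, the value of $x$ after 100 steps should come out close to the second row printed above, and far smaller than what plain gradient descent reached in the same number of steps.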


**Exercise:** Use TensorFlow to find the minimum of the [Rosenbrock function](https://en.wikipedia.org/wiki/Rosenbrock_function): 
$$
f(x,y) = (x-1)^2 + 100(y-x^2)^2.
$$


In [8]:
# your code here


## Classifying the MNIST handwritten digit dataset

We now use TensorFlow to classify the handwritten digits in the MNIST dataset. 

### Using plain TensorFlow
We'll first follow [Géron, Ch. 10](https://github.com/ageron/handson-ml/blob/master/10_introduction_to_artificial_neural_networks.ipynb) to build a NN using plain TensorFlow. 



#### Construction phase

+ We specify the number of inputs and outputs and the size of each layer. Here the images are 28x28 and there are 10 classes (each corresponding to a digit). We'll choose 2 hidden layers, with 300 and 100 neurons respectively. 

+ Placeholder nodes are used to represent the training data and targets. We use the ```None``` keyword to leave the shape (of the training batch) unspecified. 

+ We add layers to the NN using the ```layers.dense()``` function. In each case, we specify the input, and the size of the layer. We also specify the activation function used in each layer. Here, we choose the ReLU function. 

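+ We specify that the output of the NN will be a softmax function. The loss function is cross entropy; ```sparse_softmax_cross_entropy_with_logits``` applies the softmax to the logits internally, so the network itself outputs the raw logits. 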

+ We then specify that we'll use the [GradientDescentOptimizer](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer) 
with a learning rate of 0.01. 

+ Finally, we specify how the model will be evaluated. The [```in_top_k```](https://www.tensorflow.org/api_docs/python/tf/nn/in_top_k) function checks to see if the  targets are in the top k predictions. 

We then initialize all of the variables and create a [```tf.train.Saver```](https://www.tensorflow.org/programmers_guide/saved_model) object to save the model. 
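
The loss and evaluation steps listed above can be sketched in NumPy. This is an illustration of the underlying formulas, not TensorFlow's implementation; the function names and the toy logits are assumptions for this example.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # log-softmax computed stably: subtract the row-wise max before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick out the log-probability of the true class for each example
    return -log_probs[np.arange(len(labels)), labels]

def top1_accuracy(logits, labels):
    # with k = 1, "target in the top k" reduces to argmax == label (ignoring ties)
    return np.mean(np.argmax(logits, axis=1) == labels)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.5, 0.0]])
labels = np.array([0, 2, 0])  # integer class labels, one per row

print(sparse_softmax_cross_entropy(logits, labels).mean())  # mean loss over the batch
print(top1_accuracy(logits, labels))                        # 2 of 3 rows are correct
```

Here "sparse" refers to the labels being integer class indices rather than one-hot vectors.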

#### Execution phase

At each *epoch*, the code breaks the training set into mini-batches of size 50. Cycling through the mini-batches, it uses gradient descent to train the NN. At the end of each epoch, the accuracy is evaluated on the last mini-batch and on the test dataset.  


In [10]:
import tensorflow as tf
import numpy as np    
from sklearn.metrics import confusion_matrix

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [11]:
# load the data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [12]:
# helper code
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


In [13]:
# construction phase

n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1",activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2",activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
    #y_proba = tf.nn.softmax(logits)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")


learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
            

Instructions for updating:
Use keras.layers.dense instead.


In [14]:
# execution phase

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 10
#n_batches = 50
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_valid = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Batch accuracy: 0.9 Validation accuracy: 0.9055
1 Batch accuracy: 0.9 Validation accuracy: 0.9208
2 Batch accuracy: 0.94 Validation accuracy: 0.9331
3 Batch accuracy: 0.94 Validation accuracy: 0.9406
4 Batch accuracy: 1.0 Validation accuracy: 0.9444
5 Batch accuracy: 0.98 Validation accuracy: 0.9481
6 Batch accuracy: 0.98 Validation accuracy: 0.9537
7 Batch accuracy: 0.98 Validation accuracy: 0.9566
8 Batch accuracy: 0.98 Validation accuracy: 0.9591
9 Batch accuracy: 1.0 Validation accuracy: 0.9597


Since the NN has been saved, we can use it for classification using the [```saver.restore```](https://www.tensorflow.org/programmers_guide/saved_model) function. 

We can also print the confusion matrix using scikit-learn's [```confusion_matrix```](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function. 

In [15]:
with tf.Session() as sess:
    saver.restore(sess, save_path)
    Z = logits.eval(feed_dict={X: X_test})
    y_pred = np.argmax(Z, axis=1)
    
print(confusion_matrix(y_test,y_pred))

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt
[[ 969    0    1    1    0    3    1    2    2    1]
 [   0 1115    2    2    0    1    4    2    9    0]
 [   7    1  980   12    5    0    6    9   11    1]
 [   1    0    2  984    0    1    0   10    7    5]
 [   1    0    3    1  930    0    7    1    5   34]
 [  10    2    1   25    3  820   10    1   14    6]
 [   8    3    0    2    9    6  925    0    5    0]
 [   0   10   12    4    2    0    0  984    2   14]
 [   3    2    3   13    4    1    8    8  929    3]
 [   5    7    0   13   11    2    1    5    4  961]]
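
A confusion matrix contains more than the overall accuracy. As a small sketch, the code below copies the matrix printed above (rows are true digits, columns are predicted digits, following scikit-learn's convention) and recovers the accuracy, the per-digit recall, and the most frequent confusion. The variable names are illustrative.

```python
import numpy as np

# confusion matrix copied from the output above
C = np.array([
    [969,    0,   1,   1,   0,   3,   1,   2,   2,   1],
    [  0, 1115,   2,   2,   0,   1,   4,   2,   9,   0],
    [  7,    1, 980,  12,   5,   0,   6,   9,  11,   1],
    [  1,    0,   2, 984,   0,   1,   0,  10,   7,   5],
    [  1,    0,   3,   1, 930,   0,   7,   1,   5,  34],
    [ 10,    2,   1,  25,   3, 820,  10,   1,  14,   6],
    [  8,    3,   0,   2,   9,   6, 925,   0,   5,   0],
    [  0,   10,  12,   4,   2,   0,   0, 984,   2,  14],
    [  3,    2,   3,  13,   4,   1,   8,   8, 929,   3],
    [  5,    7,   0,  13,  11,   2,   1,   5,   4, 961],
])

accuracy = np.trace(C) / C.sum()       # correct predictions lie on the diagonal
recall = np.diag(C) / C.sum(axis=1)    # fraction of each true digit that was found
print(accuracy)                        # 0.9597
print(recall.round(3))

off = C.copy()
np.fill_diagonal(off, 0)               # keep only the misclassifications
true_digit, predicted_digit = np.unravel_index(np.argmax(off), off.shape)
print(true_digit, predicted_digit, off.max())  # 4 9 34
```

The accuracy agrees with the final validation accuracy printed during training, and the largest off-diagonal entry shows that the most common error is a 4 classified as a 9.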


### Using TensorFlow's Keras API 

Next, we'll use TensorFlow's Keras API to build a NN for the MNIST dataset. 

[Keras](https://keras.io/) is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. We'll use it with TensorFlow. 

In [16]:
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix

In [17]:
(X_train, y_train),(X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

In [18]:
# set the model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(rate=0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

# specify the optimizer, loss, and metrics
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# train the model
model.fit(X_train, y_train, epochs=5)

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a4bf07470>

In [19]:
score = model.evaluate(X_test, y_test)
# print each metric name next to its value
for name, value in zip(model.metrics_names, score):
    print(name, value)
    

loss 0.07185092351303901
acc 0.9791


In [20]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 512)               401920    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________
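
The parameter counts in the summary follow from the layer sizes: a dense layer has one weight per (input, output) pair plus one bias per output. A short check, with a hypothetical helper `dense_params`:

```python
def dense_params(n_in, n_out):
    # one weight per (input, output) pair plus one bias per output
    return n_in * n_out + n_out

hidden = dense_params(28 * 28, 512)   # flattened 28x28 image -> 512 hidden units
output = dense_params(512, 10)        # 512 hidden units -> 10 classes
print(hidden, output, hidden + output)  # 401920 5130 407050
```

The `Flatten` and `Dropout` layers have no trainable parameters, which is why they show 0 in the summary.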


In [21]:
y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test,y_pred))

[[ 974    1    0    0    0    0    3    1    0    1]
 [   0 1124    3    2    0    1    2    0    3    0]
 [   5    0 1018    2    1    0    1    1    3    1]
 [   0    1    3  997    0    0    0    2    2    5]
 [   1    0    1    1  971    0    3    0    0    5]
 [   2    0    0   20    3  852    5    1    6    3]
 [   2    3    0    1    7    3  940    0    2    0]
 [   3    6   10    6    1    0    0  984    4   14]
 [   4    0    7    6    1    3    3    3  945    2]
 [   2    2    0    4   12    1    0    1    1  986]]


## Using a pre-trained network

There are many examples of pre-trained NNs that can be accessed [here](https://www.tensorflow.org/api_docs/python/tf/keras/applications). 
These NNs are very large, having been trained on giant computers using massive datasets. 

It can be very useful to initialize a NN using one of these. This is called [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning). 


We'll use a NN that was pre-trained for image recognition. This NN was trained on images from the [ImageNet](http://www.image-net.org/) project, which contains more than 14 million images belonging to more than 20,000 classes (synsets). 

In [22]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import vgg16

In [23]:
vgg_model = tf.keras.applications.VGG16(weights='imagenet',include_top=True)
vgg_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [24]:
img_path = 'images/scout1.jpeg'
img = image.load_img(img_path, target_size=(224, 224))

x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = vgg16.preprocess_input(x)

preds = vgg_model.predict(x)
print('Predicted:', vgg16.decode_predictions(preds, top=5)[0])

Predicted: [('n02098105', 'soft-coated_wheaten_terrier', 0.3554158), ('n02105641', 'Old_English_sheepdog', 0.23714595), ('n02095314', 'wire-haired_fox_terrier', 0.13490717), ('n02091635', 'otterhound', 0.0611032), ('n02093991', 'Irish_terrier', 0.052789364)]


**Exercise:** Repeat the above steps for an image of your own.

**Exercise:** There are several [other pre-trained networks in Keras](https://github.com/keras-team/keras-applications). Try these! 

In [25]:
# your code here


## Some NN topics that we didn't discuss
+ Recurrent neural networks (RNN) for time series
+ How NNs can be used for unsupervised learning problems and [reinforcement learning problems](https://en.wikipedia.org/wiki/Reinforcement_learning)
+ Special layers in NNs for image processing 
+ Using TensorFlow on a GPU 
+ ... 

## CPU vs. GPU

[CPUs (Central processing units)](https://en.wikipedia.org/wiki/Central_processing_unit) have just a few cores, so the number of processes that a CPU can run in parallel is limited. However, each core is very fast and is good for sequential tasks. 

[GPUs (Graphics processing units)](https://en.wikipedia.org/wiki/Graphics_processing_unit) have thousands of cores, so they can run many processes in parallel. GPU cores are typically slower and more limited than CPU cores. However, for the right kind of computations (think matrix multiplication), GPUs are very fast. GPUs also have their own memory and caching systems, which further improve the speed of some computations but also make GPUs more difficult to program. (You have to use something like [CUDA](https://en.wikipedia.org/wiki/CUDA).)  

TensorFlow can use GPUs to significantly speed up the training of NNs. See the programmer's guide [here](https://www.tensorflow.org/programmers_guide/using_gpu). 