
# Chapter 11: NNs

* problems include being a black box, many hyperparameters, long run time, and convergence issues

* inspired by biology of human thought

  * eg. neurons either fire/activate (output 1) or not (output 0); hidden layers pass the output of one layer to the next, and activation functions allow for outputs other than just 1 or 0

* overview of how it works:

  * layers: inputs fed to inner layers connected from one to another all the way to outputs

  * outputs of layers are fed through an activation function (analogous to an SVM kernel)

  * final output is usually a single number in regression, or c class probabilities in classification
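
The overview above can be sketched in plain Python as a single forward pass for a regression-style network. This is only an illustration: the weights, biases, and layer sizes below are made-up values, not taken from any fitted model.

```python
def relu(z):
    # ReLU activation: passes positive values through, clips negatives to 0
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the inputs plus a bias, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output number
x = [1.0, 2.0]
hidden = dense(x, [[0.5, -0.25], [-1.0, 0.75]], [0.0, 0.1], relu)
output = dense(hidden, [[2.0, -1.0]], [0.5], identity)
print(hidden, output)
```

For classification the last layer would instead return c scores that are turned into class probabilities.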

## 11.1. Example: Vertebrae Data

* we will use the default parameters shown in the output below

* MLP: ***multi-layer perceptron*** from `sklearn` has regression and classification estimators (eg. `MLPRegressor`, `MLPClassifier`)

* trains using backpropagation (see week 0 slides and lecture in intro to NN)

  * cross-entropy loss

* weights are initialized (and mini-batches sampled) with some randomness, so the `random_state` input allows reproducibility

  * *non-convex loss function* (eg. there exists more than one local minimum)
  
  * different ***random weight initializations*** can produce different validation accuracies

* `relu` function is default activation function

  * can also use: `'identity'`, `'logistic'`, `'tanh'`

      * the choice of activation function (including `'logistic'`) has nothing to do with GLM logistic model application

* possible solvers (default is `adam`):

  * `lbfgs`: an optimizer in the family of quasi-Newton methods (can converge faster and perform better on smaller datasets)

  * `sgd`: stochastic gradient descent

  * `adam`: stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba (works well on larger datasets)


* **ADAM**: an adaptive learning rate method

* uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network

  * combination of RMSprop and SGD with momentum
  
  * uses the squared gradients to scale the learning rate, like RMSprop
  
  * the momentum term is a moving average of the gradient (instead of the gradient itself, as in SGD with momentum)


* trains using some form of **gradient descent** and the gradients are calculated using **Backpropagation**

  * make sure you know these from week 0 NN intro


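*  For multi-class problems such as this one, the raw outputs pass through the softmax function to give class probabilities (for a binary problem the single raw output passes through the logistic function)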
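
The ADAM update described above can be sketched for a single scalar weight as follows. This is a minimal illustration under stated assumptions, not `sklearn`'s implementation: the function name `adam_minimize` and the toy loss (w - 3)^2 are made up here, while the default values of `beta_1`, `beta_2`, and `epsilon` mirror those printed in the `MLPClassifier` output.

```python
import math

def adam_minimize(grad, w, lr=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-8, steps=1000):
    m = 0.0  # moving average of the gradient (first moment)
    v = 0.0  # moving average of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g * g
        # bias correction, since both averages start at zero
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        # the step is scaled per weight by the root of the second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

# Toy loss (w - 3)^2, whose gradient is 2 (w - 3)
toy_grad = lambda w: 2 * (w - 3)
w_first = adam_minimize(toy_grad, w=0.0, lr=0.1, steps=1)
w_final = adam_minimize(toy_grad, w=0.0, lr=0.1, steps=500)
print(w_first, w_final)
```

Note that the very first step has size close to `lr` regardless of how large the gradient is, which is the effect of scaling by the squared gradients.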

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
my_path = '/content/drive/My Drive/ecs171_yancey/Lecture_Notes/Chapter_2/column_3C.dat'
vert = pd.read_csv(my_path, sep=' ',header=None)


from sklearn.neural_network import MLPClassifier
# columns 0-5 hold the 6 features, column 6 holds the class label
X_train, X_test, y_train, y_test = train_test_split(vert.iloc[:,0:6], vert.iloc[:,6], test_size=20)

clf = MLPClassifier(solver='sgd',hidden_layer_sizes=(3, ), random_state=1)
clf.fit(X_train, y_train)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(3,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=1, shuffle=True, solver='sgd',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

* `hidden_layer_sizes=(3,)` specifies 1 hidden layer with 3 neurons

* we will have 6 neurons in the input layer since we have 6 features, 3 neurons in the output layer since we have 3 classes, and the hidden layer in between


* weighted sum is fed into output layer (again, please, see week 0 lecture notes/video)

  * the weights are computed by minimizing the loss (cross-entropy for `MLPClassifier`, squared error for `MLPRegressor`) using sgd
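
As a quick check of the layer sizes described above, the shapes of the weight matrices and bias vectors follow directly from the 6-3-3 architecture. The helper name `mlp_parameter_shapes` is made up for this sketch.

```python
def mlp_parameter_shapes(layer_sizes):
    """Return ((n_in, n_out), n_bias) for each pair of consecutive layers."""
    return [((n_in, n_out), n_out)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

layer_sizes = [6, 3, 3]  # 6 features, hidden_layer_sizes=(3,), 3 classes
shapes = mlp_parameter_shapes(layer_sizes)
total = sum(rows * cols + bias for (rows, cols), bias in shapes)
print(shapes)  # one (weights, biases) entry per layer transition
print(total)   # total number of fitted parameters
```

The two bias vectors of length 3 match the two arrays returned by `clf.intercepts_`.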



In [0]:
clf.intercepts_

[array([ 0.29525839, -0.15850658,  0.10165376]),
 array([-0.81947072, -0.33695172,  0.33047785])]

### 11.1.1. Prediction

* p features are fed into the input layer and then through the fitted weights to the output (ie. a single forward pass, without updating the weights again)

In [0]:
clf.predict_proba(X_test.iloc[0:1,:])

array([[0.17306485, 0.28039109, 0.54654406]])

In [0]:
from sklearn.metrics import accuracy_score

y_pred= clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.25

### 11.1.2. Hyperparameters

There are many hyperparameters, but some of the useful ones include the setting of an adaptive learning rate or maximum number of iterations (m

* `tol`: tolerance for the optimization

  * When the loss/score does not improve by at least `tol` for `n_iter_no_change` (default = 10) consecutive iterations, convergence is considered to be reached and training stops (unless `learning_rate` is set to `adaptive`)

* `learning_rate='adaptive'`: keeps the learning rate at `learning_rate_init` as long as the training loss keeps decreasing; each time 2 consecutive epochs fail to decrease the training loss (or increase the validation score) by at least `tol`, the current learning rate is divided by 5 (only used when `solver='sgd'`)

* `learning_rate_init`: step-size in updating the weights


* `max_iter`: max number of iterations

  * iterates until convergence (determined by `tol`) or this number of iterations

     * note that for the stochastic solvers (`sgd`, `adam`) this determines the number of epochs
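
The interplay of `tol`, `n_iter_no_change`, and `max_iter` described above can be sketched on a made-up list of per-epoch losses. This is a simplified illustration and `sklearn`'s actual bookkeeping may differ in detail; the function name `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(losses, tol=1e-4, n_iter_no_change=10):
    """Return the epoch at which training would stop for a given loss history."""
    best = float("inf")
    stall = 0  # consecutive epochs without an improvement of at least tol
    for epoch, loss in enumerate(losses, start=1):
        if loss > best - tol:
            stall += 1
        else:
            stall = 0
        best = min(best, loss)
        if stall >= n_iter_no_change:
            return epoch      # convergence is declared
    return len(losses)        # ran out of epochs (the max_iter case)

# Loss improves for 3 epochs, then stays flat
history = [1.0, 0.5, 0.4] + [0.4] * 20
stopped_at = epochs_until_stop(history, n_iter_no_change=3)
print(stopped_at)
```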


## 11.2. Pitfall: Issues with Convergence, Platforms, etc.

* complexity of NN may require dealing with other issues



### 11.2.1. Convergence

* in contrast to SVM and linear models, the NN loss is non-convex, meaning the solution is not unique

  * for example, if we swap the top circle/neuron with the bottom one (along with their input and output lines) in the first layer, we would have a different set of weights but the same minimum loss

* linear (least squares) models are also non-iterative, which means we have an explicit closed-form solution.
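
The neuron-swapping argument above can be checked numerically. In this sketch (all weights are made-up values) two hidden neurons are swapped together with their incoming and outgoing weights, and the network output is unchanged even though the weight lists differ.

```python
def forward(x, W1, b1, w2, b2):
    # hidden layer with ReLU, then a single linear output
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

x = [1.0, 1.0]
W1, b1 = [[1.0, 2.0], [0.5, 0.5]], [0.1, -0.2]   # hypothetical weights
w2, b2 = [0.7, -1.5], 0.3

original = forward(x, W1, b1, w2, b2)
# swap the two hidden neurons: reverse their rows, biases and output weights
swapped = forward(x, W1[::-1], b1[::-1], w2[::-1], b2)
print(original, swapped)
```

Since every such permutation gives the same loss, there are several equally good minima rather than a unique one.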

## 11.3. Activation Functions



* if our loss curve looks like that in figure 11-3 we may have a *vanishing gradient* issue, meaning that even with a large learning rate (LR) it could take a long time to reach the minimum (or, with a very sharp curve, we may have an *exploding gradient*)
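
One common source of vanishing gradients is a saturating activation such as the logistic function. The sketch below assumes that setting (the figure itself is not reproduced here) and shows how small the derivative becomes away from zero, and how quickly a product of such factors shrinks with depth.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    # derivative of the logistic function: s(z) * (1 - s(z))
    s = logistic(z)
    return s * (1.0 - s)

for z in (0.0, 5.0, 10.0):
    print(z, logistic_grad(z))

# Backpropagation multiplies one such factor per layer, so even the
# largest possible value (0.25 at z = 0) shrinks quickly with depth.
ten_layer_factor = 0.25 ** 10
print(ten_layer_factor)
```

Because each weight update is the learning rate times the gradient, a gradient this small means tiny steps even with a large LR.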


## 11.4. Regularization

remember that the weights are pretty much our new value of p (the number of parameters), so we would like to find some way of reducing their size

* we could apply L1 or L2 regularization (similar to LASSO, ridge, or SVM)

  * note that this won't reduce the quantity of weights due to the loss function shape

### 11.4.2. Dropout

* dropout randomly removes a certain percentage of the neurons (and so their weights) at each training step, which acts as a form of dimension reduction
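
A minimal sketch of dropout applied to one layer's activations during training is shown below. It assumes the common "inverted" variant, where the surviving activations are rescaled so their expected sum is unchanged; the function name `dropout` and the rate of 0.2 are illustrative choices.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)      # seeded so the mask is reproducible
hidden = [1.0] * 10         # made-up hidden-layer outputs
dropped = dropout(hidden, rate=0.2, rng=rng)
print(dropped)
```

At prediction time no units are dropped, so the full network is used.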

## 11.5. Convergence Tricks

* ***scaling*** (or converting to between 0 and 1) is commonly used to ensure convergence is attained

  * some packages do scaling by default

* also ***learning rate adjustment*** can be used

* in ***early stopping*** we stop when the performance on the validation set begins to deteriorate

* we can also use ***momentum***, which is a modified update rule (as shown below) in gradient descent

* we update the weight using not just the learning rate times the gradient, but add a momentum factor (gamma) times the weight delta from the previous iteration

    * eg. ADAM

![alt text](https://visualstudiomagazine.com/articles/2017/08/01/~/media/ECG/visualstudiomagazine/Images/2017/08/0817vsm_McCaffreyFig1.ashx)
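
The momentum update described above can be sketched for a single weight as follows. The function name `momentum_descent`, the toy loss (w - 3)^2, and the values of `lr` and `gamma` are assumptions made for illustration.

```python
def momentum_descent(grad, w, lr=0.1, gamma=0.9, steps=200):
    delta = 0.0  # weight change from the previous iteration
    for _ in range(steps):
        # new change = momentum factor * previous change - LR * gradient
        delta = gamma * delta - lr * grad(w)
        w += delta
    return w

# Toy loss (w - 3)^2 with gradient 2 (w - 3)
w_final = momentum_descent(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```

Setting `gamma=0` recovers plain gradient descent.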

## 11.6. Fitting a Neural Network

### 11.6.1. Breast Cancer Data example

In [0]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [0]:
bc = pd.DataFrame(cancer['data'])
bc_y = pd.DataFrame(cancer['target'])
bc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [0]:
bc.shape

(569, 30)

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(bc, bc_y)

* let's try scaling to see if it helps convergence

* it seems to help accuracy

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
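
What the scaling step does can be sketched by hand for a single feature column: the mean and (population) standard deviation are estimated on the training data only and then reused for the test data. The numbers and the helper names `fit_scaler` and `transform` are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    # learn the centering and scaling constants from the training column
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    return [(v - mu) / sigma for v in column]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mu, sigma = fit_scaler(train)
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)   # uses the training mu and sigma
print(train_scaled, test_scaled)
```

Fitting on the training set only avoids leaking information from the test set into the model.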

* let's try as many neurons per layer as features

In [0]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report,confusion_matrix

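# 3 hidden layers, 30 neurons per layer
clf = MLPClassifier(hidden_layer_sizes=(30,30,30))


clf.fit(X_train,y_train)


y_pred = clf.predict(X_test)
# confusion matrix and per-class report (the two tables in the output below)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
accuracy_score(y_test, y_pred)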

  y = column_or_1d(y, warn=True)


[[43  6]
 [ 1 93]]
              precision    recall  f1-score   support

           0       0.98      0.88      0.92        49
           1       0.94      0.99      0.96        94

    accuracy                           0.95       143
   macro avg       0.96      0.93      0.94       143
weighted avg       0.95      0.95      0.95       143



* `coefs_[i]`: weight matrix between layer i and layer i+1

* `intercepts_[i]`: bias vector added to layer i+1

In [0]:
# rows of the first weight matrix = number of input features
len(clf.coefs_[0])

30

In [0]:
# number of weight matrices: 3 hidden layers plus the output layer
len(clf.coefs_)

4

In [0]:
clf.intercepts_

[array([-0.03681759,  0.1726946 , -0.01077704, -0.21764019,  0.18954937,
         0.23760169, -0.02629925,  0.33132044,  0.23864519, -0.1809671 ,
         0.22604855,  0.21129482,  0.29982368,  0.42792994, -0.03545507,
        -0.24234831, -0.2517509 ,  0.28436397,  0.0207698 ,  0.07590619,
         0.14254272, -0.15152401,  0.06133646,  0.24956177, -0.11689575,
         0.04504144, -0.26928886,  0.34279583,  0.09353027, -0.16461493]),
 array([ 0.08292715, -0.04508507, -0.24826857,  0.06094625,  0.01442469,
        -0.10336484,  0.03531113, -0.15584774, -0.07768522,  0.09098652,
         0.26873999,  0.05291085,  0.05009212, -0.07490764,  0.19832722,
        -0.00206966, -0.02598817,  0.335719  , -0.25091081, -0.25189414,
         0.12527096, -0.25643678, -0.06456207, -0.25735071,  0.04322971,
         0.06396341,  0.29060605,  0.1657436 ,  0.10089951,  0.12085279]),
 array([-0.1233255 ,  0.02190778,  0.22945016,  0.09904069,  0.09214511,
         0.37189609,  0.28433826,  0.00058068, 

### 11.6.2. Hyperparameters

* many are used for improving the convergence behaviour such as early stopping, LR, and adaptive methods

### 11.6.3. Grid Search: BC Data

In [0]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()

* let's try different solvers, activation functions, numbers of layers/neurons, and epochs...

* we could also try changing alpha (the L2 penalty regularization term)


In [0]:
parameter_space = {
    'hidden_layer_sizes': [(30,30,30), (10,10,), (30,)],
    #'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    #'learning_rate': ['constant','adaptive'],
    'max_iter': [10, 50, 100]#, 300]
}

* `n_jobs` sets the number of CPU cores from your computer to use (-1 for all cores available)

* `cv` sets the number of splits for cross-validation

* sklearn has hyper-parameter optimization tools.

  * GridSearchCV
  * RandomizedSearchCV

In [0]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=7)
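
To get a feel for the cost of this search, the sketch below enumerates the candidate settings from the uncommented entries of `parameter_space` and multiplies by the 7 cross-validation folds. It only counts model fits; it does not call `sklearn`.

```python
from itertools import product

parameter_space = {
    'hidden_layer_sizes': [(30, 30, 30), (10, 10), (30,)],
    'solver': ['sgd', 'adam'],
    'max_iter': [10, 50, 100],
}

names = list(parameter_space)
# every combination of one value per hyperparameter
candidates = [dict(zip(names, values))
              for values in product(*parameter_space.values())]

n_candidates = len(candidates)
n_fits = n_candidates * 7   # cv=7: each candidate is fitted once per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, one more fit on the whole training set follows for the best candidate, which is why `n_jobs=-1` is helpful here.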

In [0]:
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()


print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_pred = clf.predict(X_test)

print()
accuracy_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


Best parameters set found on development set:

{'hidden_layer_sizes': (30, 30, 30), 'max_iter': 100, 'solver': 'adam'}

Grid scores on development set:

0.512 (+/-0.268) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 10, 'solver': 'sgd'}
0.927 (+/-0.058) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 10, 'solver': 'adam'}
0.883 (+/-0.130) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 50, 'solver': 'sgd'}
0.962 (+/-0.045) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 50, 'solver': 'adam'}
0.941 (+/-0.055) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 100, 'solver': 'sgd'}
0.986 (+/-0.021) for {'hidden_layer_sizes': (30, 30, 30), 'max_iter': 100, 'solver': 'adam'}
0.523 (+/-0.360) for {'hidden_layer_sizes': (10, 10), 'max_iter': 10, 'solver': 'sgd'}
0.625 (+/-0.365) for {'hidden_layer_sizes': (10, 10), 'max_iter': 10, 'solver': 'adam'}
0.831 (+/-0.193) for {'hidden_layer_sizes': (10, 10), 'max_iter': 50, 'solver': 'sgd'}
0.953 (+/-0.054) for {'hidden_laye



0.958041958041958

* it looks like more than 10 iterations are required

* Adam also usually produces higher accuracy than sgd here

* so do the combinations with 30 neurons/layer

* note that the number of epochs is capped by max iterations, since training stops when the max is reached

* remember that in each epoch we go through all of the training examples

## 11.9. Relation to Polynomial Regression

* we are summing the weighted features and running them through an activation function, which acts similarly to the polynomial


* eg. if we had the activation function t^2 we would also have the squared terms (and cross-products)

* and another layer would give the 4th power of each term


* the activation functions can also usually be approximated by polynomials



* minimizes sum of squared errors
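
The polynomial connection above can be verified for one neuron with the activation t^2 and two inputs: the neuron's output equals a degree-2 polynomial in the features, and squaring again in a second layer gives degree 4. The weights and inputs are made-up values.

```python
def squared_neuron(x1, x2, w1, w2):
    # weighted sum of the features passed through the activation t^2
    return (w1 * x1 + w2 * x2) ** 2

def expanded_polynomial(x1, x2, w1, w2):
    # the same quantity written out as a polynomial regression model
    return w1**2 * x1**2 + 2 * w1 * w2 * x1 * x2 + w2**2 * x2**2

x1, x2, w1, w2 = 2.0, 3.0, 0.5, -1.0
layer1 = squared_neuron(x1, x2, w1, w2)
poly = expanded_polynomial(x1, x2, w1, w2)
layer2 = layer1 ** 2   # a second t^2 layer (unit weight) gives degree 4
print(layer1, poly, layer2)
```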

# CNN

* Let’s design a CNN for a MNIST demo in Keras

* **Keras** is a popular deep learning API

  * built on top of TensorFlow

* the output/input shapes are 3D tensors

  * input tensor of size (28, 28, 1) (the image size)

  * if we had a color image it would be 3 channels

  * the number of kernels is the first parameter to `Conv2D`

* say we use a conv filter size of 5 × 5 with ReLU activation and 2 × 2 max pooling


* summary can give us a table of layer shapes and the number of parameters



In [0]:
from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32,(5,5),activation='relu',input_shape=(28, 28,1)))
model.add(layers.MaxPooling2D((2, 2)))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
Total params: 832
Trainable params: 832
Non-trainable params: 0
_________________________________________________________________


* **param #** stands for the number of parameters of the conv2D layer 

* eg. weight matrix of window size 5×5 and a bias for each of the filters is 832 parameters (32 × (25 + 1))
  * +1 for bias term 

* **Max-pooling** then takes the max of each window, reducing the height and width passed to the next layer by 1/2 (from 24 to 12)

    * for every 2 elements in each direction we have the max

  * does not add params since it is an operation
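
The 2 × 2 max-pooling operation can be sketched on a small made-up 4 × 4 feature map in plain Python (not the Keras implementation). Each non-overlapping 2 × 2 window is replaced by its maximum, so height and width are both halved.

```python
def max_pool_2x2(image):
    """Take the max over each non-overlapping 2x2 window of a 2D list."""
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, len(image[0]) - 1, 2)]
        for r in range(0, len(image) - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

There is nothing to learn in this step, which is why the summary shows 0 parameters for the pooling layer.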


## Adding more layers 

* as we add more layers, height and width shrink while the number of feature maps (channels) increases

* the number of input channels will be 32 since that is the number of feature maps from the previous layer
  
  * Keras will calculate this for us though

* adding 64 5 × 5 window filters and a 2 × 2 pooling layer gives the second convolutional layer a total of ((5 × 5 × 32) + 1) × 64 = 51264 parameters
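
The shape and parameter arithmetic for both convolution/pooling stages can be reproduced with two small helpers. They assume no padding and stride 1 for the convolution (the Keras defaults) and a 2 × 2 pool; the helper names are made up for this sketch.

```python
def conv_output_size(size, kernel):
    # no padding, stride 1: the window fits in (size - kernel + 1) positions
    return size - kernel + 1

def conv_params(kernel, channels_in, filters):
    # one kernel x kernel x channels_in weight block plus a bias, per filter
    return (kernel * kernel * channels_in + 1) * filters

size, channels = 28, 1          # MNIST input: 28 x 28 with 1 channel
stages = []
for filters in (32, 64):
    params = conv_params(5, channels, filters)
    conv_size = conv_output_size(size, 5)
    size = conv_size // 2       # 2 x 2 max pooling halves height and width
    channels = filters
    stages.append((conv_size, size, channels, params))

for stage in stages:
    print(stage)
```

Each tuple is (size after convolution, size after pooling, channels, parameters), which can be compared against the rows printed by `model.summary()`.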
  

In [0]:
model = models.Sequential()
model.add(layers.Conv2D(32,(5,5),activation='relu', 
                                 input_shape=(28,28,1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 8, 64)          51264     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 4, 4, 64)          0         
Total params: 52,096
Trainable params: 52,096
Non-trainable params: 0
_________________________________________________________________


* after the second max-pooling layer we have 4 by 4 h by w dimensions

* next we will add a ***densely connected layer*** with **softmax** for final classification


* to do this the previous output tensor is first flattened to a vector of length 1024 (4 × 4 × 64)

  * since the number of possible outputs is 10 we now have 10 × 1024 weights + 10 biases = 10250 parameters

    * softmax squashes a vector of output scores (one per class) to the range (0, 1) so that the resulting probabilities add up to 1. It is applied to the output scores:


> ![alt text](https://latex.codecogs.com/gif.latex?f(s)_{i}&space;=&space;\frac{e^{s_{i}}}{\sum_{j}^{C}&space;e^{s_{j}}})
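
The softmax formula above can be written in a few lines of plain Python. The scores are made-up values, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    shift = max(scores)                          # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, so the predicted class is unchanged by the squashing.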


In [0]:
model = models.Sequential()
model.add(layers.Conv2D(32,(5,5),activation='relu', 
                                 input_shape=(28,28,1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (5, 5), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax'))
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 24, 24, 32)        832       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 8, 8, 64)          51264     
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 4, 4, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                10250     
Total params: 62,346
Trainable params: 62,346
Non-trainable params: 0
__________________________________________________

![alt text](https://miro.medium.com/max/1400/1*CKgnKTEITrPKaDiG_536Mg.png)

## Training and evaluation

* all the parameters of the convolutional layers are adjusted during training

In [0]:
from keras.datasets import mnist
from keras.utils import to_categorical
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print('X_train.shape')
print(X_train.shape)
print('X_test.shape')
print(X_test.shape)

X_train.shape
(60000, 28, 28)
X_test.shape
(10000, 28, 28)


* reshape into 4D tensors

* make sure everything is of the proper type

In [0]:
X_train = X_train.reshape((60000, 28, 28, 1))
X_train = X_train.astype('float32') / 255
X_test = X_test.reshape((10000, 28, 28, 1))
X_test = X_test.astype('float32') / 255
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

* similar to the NN above, we choose the optimizer and loss function


* we use a softmax activation plus a **Cross-Entropy loss** to output a class probability for each image

  * used for multi-class classification



![alt text](https://gombru.github.io/assets/cross_entropy_loss/softmax_CE_pipeline.png)


* In Multi-Class classification the labels are one-hot, so only the positive class keeps its term in the loss
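
The effect of one-hot labels on the cross-entropy loss can be seen in a small sketch: every term with a target of 0 vanishes, so the loss reduces to minus the log of the probability given to the true class. The probabilities and helper names are made up for illustration.

```python
import math

def one_hot(label, n_classes):
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def cross_entropy(target, probs):
    # terms with target 0 contribute nothing, so they are skipped
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

probs = [0.1, 0.7, 0.2]          # hypothetical softmax output for 3 classes
target = one_hot(1, 3)           # the true class is class 1
loss = cross_entropy(target, probs)
print(target, loss)
```

The loss is 0 only when the true class gets probability 1, and it grows as that probability shrinks.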

In [0]:
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=100,
          epochs=5,
          verbose=1)
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 0.9664999842643738


* training takes much longer on a CPU than on a GPU
