## Fine-Tuning Neural Network Hyperparameters
- One of the main drawbacks of NNs is that there are many hyperparameters to tweak

#### Option 1: try many combinations of hyperparameters and see which one works best on the validation set

In [8]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In [9]:
from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]): # provide reasonable defaults to the model
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)  # "lr" is a deprecated alias
    model.compile(loss="mse", optimizer=optimizer)
    return model

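# Sequential model for univariate regression (a single output neuron)

Wrap the model-building function with KerasRegressor so that it exposes the same methods as Scikit-Learn regressors. (*keras.wrappers.scikit_learn* has been deprecated and removed in newer TensorFlow releases; the SciKeras package offers equivalent wrappers.)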

In [10]:
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

Now keras_reg has the *fit(), score(), and predict()* methods, just like a Scikit-Learn regressor

In [14]:
keras_reg.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), callbacks=[keras.callbacks.EarlyStopping(patience=10)], verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fdcd90ee490>

In [15]:
# note: KerasRegressor.score() returns the negative MSE (higher is better)
mse_test = keras_reg.score(X_test, y_test)
y_pred = keras_reg.predict(X_test)



Use RandomizedSearchCV because there are many hyperparameters to check.

In [24]:
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_distribs = {
    "n_hidden": [0,1,2,3],
    "n_neurons": np.arange(1, 100).tolist(),
    "learning_rate": reciprocal(3e-4, 3e-2).rvs(1000).tolist() # 1000 samples from a reciprocal (log-uniform) distribution
}

rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3, verbose=2)
rnd_search_cv.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val), callbacks=[keras.callbacks.EarlyStopping(patience=10)], verbose=0)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END learning_rate=0.002058508271902106, n_hidden=3, n_neurons=14; total time=  15.2s
[CV] END learning_rate=0.002058508271902106, n_hidden=3, n_neurons=14; total time=  15.0s
[CV] END learning_rate=0.002058508271902106, n_hidden=3, n_neurons=14; total time=  15.1s
[CV] END learning_rate=0.004818367294934084, n_hidden=1, n_neurons=38; total time=  13.8s
[CV] END learning_rate=0.004818367294934084, n_hidden=1, n_neurons=38; total time=  13.9s
[CV] END learning_rate=0.004818367294934084, n_hidden=1, n_neurons=38; total time=  14.4s
[CV] END learning_rate=0.010812911851930444, n_hidden=1, n_neurons=33; total time=  14.3s
[CV] END learning_rate=0.010812911851930444, n_hidden=1, n_neurons=33; total time=  14.3s
[CV] END learning_rate=0.010812911851930444, n_hidden=1, n_neurons=33; total time=  14.4s
[CV] END learning_rate=0.0003145538090468832, n_hidden=2, n_neurons=62; total time=  15.0s
[CV] END learning_rate=0.0003145538090

RandomizedSearchCV(cv=3,
                   estimator=<tensorflow.python.keras.wrappers.scikit_learn.KerasRegressor object at 0x7fdcc888fd90>,
                   param_distributions={'learning_rate': [0.008072582276422957,
                                                          0.022999308832844866,
                                                          0.00037796389193578384,
                                                          0.0005764359101138321,
                                                          0.004475815288389972,
                                                          0.010119228808704825,
                                                          0.006221008037106827,
                                                          0.026292307701445267,
                                                          0.0008953239050796231,...
                                                          0.0003918753218625975,
                                                 

RandomizedSearchCV scores each candidate with K-fold cross-validation on the training set, so it does not use *validation_data* (X_val, y_val) for model selection; that data is only used for early stopping
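
To make the role of `cv=3` concrete, here is a minimal sketch of how a K-fold split carves up the training set. It assumes contiguous, unshuffled folds, and `k_fold_indices` is a hypothetical helper written for illustration, not a Scikit-Learn function.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, held_out_idx) pairs for contiguous, unshuffled K-fold splits."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, held_out_idx
        start += size


# Toy example: 10 training instances, 3 folds
folds = list(k_fold_indices(10, k=3))
for train_idx, held_out_idx in folds:
    print("train on:", train_idx, "| score on:", held_out_idx)
```

Each candidate is therefore trained three times, and every training instance is used for scoring exactly once, which is why the separate validation set is free to serve early stopping.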

In [25]:
rnd_search_cv.best_params_

{'n_neurons': 72, 'n_hidden': 2, 'learning_rate': 0.02408823900687922}

In [26]:
rnd_search_cv.best_score_

-0.3038170039653778

In [27]:
model = rnd_search_cv.best_estimator_.model

**This method of searching for the best hyperparameters becomes very inefficient when the hyperparameter space is large**

#### Option 2: use one of these Python libraries to optimize hyperparameters:
- *the core idea is that good regions of the hyperparameter space should be explored more*
- Hyperopt
- Hyperas, kopt, Talos
- Keras Tuner
- Scikit-Optimize
- Spearmint
- Hyperband
- Sklearn-Deap
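
The "explore good regions more" idea can be sketched by hand as a two-round random search: a wide first pass, then a narrower pass centered on the best value found. This is only an illustration under stated assumptions: `toy_validation_loss` is a made-up stand-in for "train a model and measure its validation loss", and the factor-of-3 zoom window is an arbitrary choice. The libraries listed above use more sophisticated strategies.

```python
import math
import random


def toy_validation_loss(learning_rate):
    """Hypothetical stand-in for training a model; lowest near learning_rate = 1e-2."""
    return (math.log10(learning_rate) + 2.0) ** 2


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(rng, low, high, n_iter):
    """Return (best_loss, best_learning_rate) over n_iter random draws."""
    candidates = [sample_log_uniform(rng, low, high) for _ in range(n_iter)]
    return min((toy_validation_loss(lr), lr) for lr in candidates)


rng = random.Random(42)

# Round 1: coarse search over a wide range
coarse_loss, coarse_lr = random_search(rng, 1e-5, 1.0, n_iter=20)

# Round 2: spend the same budget on a narrow window around the best value so far
fine_loss, fine_lr = random_search(rng, coarse_lr / 3, coarse_lr * 3, n_iter=20)

best_loss, best_lr = min((coarse_loss, coarse_lr), (fine_loss, fine_lr))
print(f"coarse: lr={coarse_lr:.2e}, loss={coarse_loss:.4f}")
print(f"fine:   lr={fine_lr:.2e}, loss={fine_loss:.4f}")
print(f"best:   lr={best_lr:.2e}, loss={best_loss:.4f}")
```

Sampling on a log scale mirrors the `reciprocal` distribution used in the RandomizedSearchCV cell: every order of magnitude of the learning rate gets the same share of the draws.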

#### Option 3: use a hyperparameter tuning service from:
- Google Cloud AI
- Arimo
- SigOpt
- CallDesk Oscar

Sidenotes:
- Hyperparameter tuning is still an active area of research
- Evolutionary algorithms are making a comeback
- Google's AutoML suite uses evolutionary algorithms to find the best hyperparameters

#### Tuning number of hidden layers
- *Train the network with increasing layers until it starts overfitting*
- In theory, an MLP with one hidden layer can model even very complex functions *if it has enough neurons*
- **However, deep networks have a much higher <u>parameter efficiency</u> than shallow ones**
    - they can model complex functions using exponentially fewer neurons than shallow nets, so they reach better performance with the same amount of training data
    
    
- Analogy:
    - In a task where I draw a forest with two kinds of software...
    - The first software does not allow copy and paste
        - This is terrible because every detail must be drawn repeatedly, even the leaves and branches
    - The second software will allow me to draw a leaf once, copy that leaf, draw a branch, paste the leaf on that branch, draw a tree, paste the branch+leaf on the tree, and then copy and paste the tree repeatedly to make a forest.
- With regards to the analogy, real-world data has this **hierarchical structure**
    - Deep NNs automatically take advantage of that
        - lower hidden layers draw the leaves, branches, and other things
        - higher hidden layers put them all together
    - Deep NNs converge faster 
    - Deep NNs generalize better
    - **Transfer Learning** example:
        - There is a model trained from faces...
        - Task is to make a Deep NN model that recognizes hairstyles
        - Rather than randomly initializing the weights and biases of the new Deep NN, initialize its lower layers with the weights and biases of the lower layers of the face model
        - This way, the lower-level structures will already be learnt and the model will only have to learn the higher-level structures (hairstyle)

- Very complex tasks in deep learning require dozens of layers (or even hundreds, though not fully connected ones) and a huge amount of training data.
    - However, I rarely need to train such a network from scratch: reuse parts of a pretrained state-of-the-art network that performs a similar task
        - This way training will be faster and need much less data

#### The number of neurons per Hidden layer
- *Train the network with an increasing # of neurons until it starts overfitting*
- \# of neurons in the input and output layers is determined by the task
    - e.g. MNIST requires 28*28=784 input neurons (one per pixel) and 10 output neurons (one per digit class)
- Old practice:
    - gradually decrease # of neurons per layer (300 in first layer, 200 in second, 100 in third) b/c low-level features can coalesce into high-level features
    - **status: abandoned** b/c using the same # of neurons in every layer works just as well, if not better, and leaves one hyperparameter to tune instead of one per layer
- **"Stretch pants" approach** to getting the right number of neurons:
    - pick a model with more layers and neurons than needed, then use early stopping and other regularization techniques to prevent overfitting
- if a layer is too small, it will not have enough representational power to preserve all the useful information from the inputs
    - reason: a layer with 2 neurons can only output 2D data. If it's processing 3D data, information will be lost and the rest of the network will never see it. **Ever.**
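
The information loss in a too-narrow layer can be checked with a few lines of arithmetic. The sketch below uses arbitrary illustrative weights and only the linear part of a 2-neuron Dense layer; `dense_2_neurons` and `cross` are hypothetical helpers. It builds two different 3D inputs that the layer maps to exactly the same 2D output.

```python
import math


def dense_2_neurons(x, weights, biases):
    """Linear part of a Dense layer with 2 neurons applied to a 3D input."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def cross(u, v):
    """Cross product of two 3D vectors: a direction orthogonal to both."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


# Arbitrary illustrative parameters (2 neurons, 3 inputs each)
weights = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
biases = [0.5, -0.5]

# Any movement along this direction is invisible to both neurons
lost_direction = cross(weights[0], weights[1])

x_a = [1.0, 0.0, 2.0]
x_b = [a + d for a, d in zip(x_a, lost_direction)]  # a genuinely different input

out_a = dense_2_neurons(x_a, weights, biases)
out_b = dense_2_neurons(x_b, weights, biases)
print("inputs: ", x_a, x_b)
print("outputs:", out_a, out_b)
```

Because an activation function is applied element-wise to these outputs, it cannot tell the two inputs apart either, and neither can any later layer.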

#### Tip: increasing # of layers instead of # of neurons per layer gives more bang for buck

#### Tuning Learning Rate
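- Arguably the most important hyperparameter
- To find a good value: train the model for a few hundred iterations, starting with a very low learning rate (e.g. 1e-5) and gradually increasing it to a very large one (e.g. 10)
    - multiply the learning rate by a constant factor at each iteration
    - plot the loss against the learning rate: the loss drops at first, then shoots back up once the learning rate is too large; pick a learning rate somewhat below that turning point (typically about 10 times lower)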
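
A minimal sketch of this procedure on a toy problem, assuming plain gradient descent on a made-up quadratic loss rather than a real network. The names `exponential_lr_schedule` and `lr_range_test`, the "stop when the loss exceeds 4x the best loss" rule, and the final "divide by 10" pick are illustrative assumptions, not part of Keras.

```python
import math


def exponential_lr_schedule(min_lr, max_lr, n_iterations):
    """Learning rates growing by a constant factor from min_lr to max_lr."""
    factor = (max_lr / min_lr) ** (1.0 / (n_iterations - 1))
    return [min_lr * factor ** i for i in range(n_iterations)]


def lr_range_test(curvatures, min_lr=1e-5, max_lr=10.0, n_iterations=500):
    """Gradient descent on loss = 0.5 * sum(c * w^2), raising the learning rate each step.

    Returns a list of (learning_rate, loss) pairs, cut off once the loss blows up.
    """
    weights = [1.0] * len(curvatures)
    history = []
    best_loss = math.inf
    for lr in exponential_lr_schedule(min_lr, max_lr, n_iterations):
        loss = 0.5 * sum(c * w * w for c, w in zip(curvatures, weights))
        if not math.isfinite(loss) or loss > 4 * best_loss:
            break  # the loss is climbing sharply: end the test
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        # The gradient of 0.5 * c * w^2 with respect to w is c * w
        weights = [w - lr * c * w for c, w in zip(curvatures, weights)]
    return history


history = lr_range_test(curvatures=[1.0, 10.0, 100.0])
turning_lr, lowest_loss = min(history, key=lambda pair: pair[1])
suggested_lr = turning_lr / 10  # heuristic assumption: stay well below the turning point
print(f"recorded {len(history)} iterations")
print(f"loss bottomed out at lr={turning_lr:.3g}; suggested lr={suggested_lr:.3g}")
```

With a real model, `history` would be collected from the training loop (for example in a callback) and plotted with the learning rate on a log axis.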
    
#### Picking the right optimizer
- Will be visited in Chapter 11

#### Tuning batch size
- Larger batch sizes let hardware accelerators such as GPUs process the data efficiently (more instances per second).
- Many researchers recommend using the largest batch size that can fit in GPU RAM
- But the legend **Yann LeCun** tweeted "friends don't let friends use mini-batches larger than 32"
    - the argument: small batches lead to better models in less training time
    
#### Picking the activation function
- hidden layers: The ReLU activation function will be a good default.
- output layers: depends on the task duhhhh

#### Tuning the number of training iterations
- does not need to be tweaked in most cases
- just use early stopping
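
For reference, the patience logic behind early stopping can be sketched in plain Python. This mirrors the idea behind `keras.callbacks.EarlyStopping(patience=...)` as used earlier in these notes, not its actual implementation; `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs pass without a new best loss.
    If that never happens, stop_epoch is simply the last epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 3
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

So the number of epochs passed to *fit()* only needs to be large enough; the stopping point is found automatically.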