<b><font size="6">Neural Networks - The parameters and GridSearch</font></b><br><br>

In this notebook we are going to check some of the most important parameters that can influence the performance of a neural network.

# <font color='#BFD72F'>Contents</font> <a class="anchor" id="toc"></a>

* [1. The needed steps](#1st-bullet)
* [2. Neural Networks](#nn)
    * [2.1. The hidden layer size - Number of hidden layers and units](#hidden)
    * [2.2. The activation function](#activation)
    * [2.3. The solver](#solver)
    * [2.4. The learning rate initialization](#lr_init)
    * [2.5. The learning rate](#lr)
    * [2.6. The batch size](#batch)
    * [2.7. The maximum iterations](#max_iter)
    * [2.8. Other parameters](#other)
* [3. The GridSearch](#gridsearch)


# 1. The needed steps <a class="anchor" id="1st-bullet"></a>
[Back to Contents](#toc)

<a class="anchor" id="2nd-bullet">

### 1.1. Import the needed libraries
    
</a>

__`Step 1`__ Import the needed libraries.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import time
from sklearn.neural_network import MLPClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
import numpy as np

<a class="anchor" id="3rd-bullet">

### 1.2. Import the dataset
    
</a>

__`Step 1`__ - Import the dataset __diabetes.csv__.

In [2]:
diabetes = pd.read_csv(r'data\diabetes.csv')

__`Step 2`__ - Define the independent variables as __X__ and the dependent variable as __y__. 

In [3]:
X = diabetes.iloc[:,:-1]
y = diabetes.iloc[:,-1]

__`Step 3`__ - Create a function named __avg_score__ that receives a model as a parameter, applies it with 10-fold cross-validation, and returns the average score for the train and the validation sets. It should also report the time the model takes to fit and the number of iterations needed.

In [4]:
def avg_score(model):
    # apply kfold
    kf = KFold(n_splits=10)
    # create lists to store the results from the different models 
    score_train = []
    score_val = []
    timer = []
    n_iter = []
    for train_index, val_index in kf.split(X):
        # get the indexes of the observations assigned for each partition
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        # start counting time
        begin = time.perf_counter()
        # fit the model to the data
        model.fit(X_train, y_train)
        # finish counting time
        end = time.perf_counter()
        # check the mean accuracy for the train
        value_train = model.score(X_train, y_train)
        # check the mean accuracy for the validation
        value_val = model.score(X_val, y_val)
        # append the accuracies, the time and the number of iterations in the corresponding list
        score_train.append(value_train)
        score_val.append(value_val)
        timer.append(end-begin)
        n_iter.append(model.n_iter_)
    # calculate the average and the std for each measure (accuracy, time and number of iterations)
    avg_time = round(np.mean(timer),3)
    avg_train = round(np.mean(score_train),3)
    avg_val = round(np.mean(score_val),3)
    std_time = round(np.std(timer),2)
    std_train = round(np.std(score_train),2)
    std_val = round(np.std(score_val),2)
    avg_iter = round(np.mean(n_iter),1)
    std_iter = round(np.std(n_iter),1)
    
    return (f'{avg_time}+/-{std_time}', f'{avg_train}+/-{std_train}',
            f'{avg_val}+/-{std_val}', f'{avg_iter}+/-{std_iter}')

__`Step 4`__ - Create a function named __show_results__ that fills a dataframe with the results returned by __avg_score__ (time, train score, validation score and iterations) for several given models.

In [5]:
def show_results(df, *args):
    """
    Receive an empty dataframe and the different models and call the function avg_score
    """
    count = 0
    # for each model passed as argument
    for arg in args:
        # obtain the results provided by avg_score
        # (the first value is named avg_time to avoid shadowing the time module)
        avg_time, avg_train, avg_val, avg_iter = avg_score(arg)
        # store the results in the right row
        df.iloc[count] = avg_time, avg_train, avg_val, avg_iter
        count+=1
    return df

# 2. Neural Networks <a class="anchor" id="nn"></a>
[Back to Contents](#toc)

__`Step 5`__ - Create an instance of MLPClassifier with the default parameters and name it as __model__. Check the results using the above created functions.

In [6]:
model = MLPClassifier()
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Raw'])
show_results(df, model)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| Raw | 1.187+/-0.49 | 0.752+/-0.02 | 0.712+/-0.06 | 123.1+/-38.0 |


<hr>
<a class="anchor" id="hidden">

### 2.1. The hidden layer size - Number of hidden layers and neurons (default = (100,))
    
</a>

__The number of hidden layers__<br>
-	Increasing the number of hidden layers might or might not improve the accuracy; it depends on the complexity of the problem
-	Adding more hidden layers than the problem needs tends to cause overfitting on the training set and a drop in accuracy on the test set

__The number of hidden units__ <br>
-	Using too few neurons in hidden layers will result in underfitting
-	Using too many neurons in the hidden layer may result in overfitting and increase the time it takes to train the neural network

The aim is to keep a good trade-off between the simplicity of the model and the performance accuracy! <br>

__Some rules of thumb:__
-	The number of hidden neurons should be between the size of the input layer and the size of the output layer
-	The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer
-	The number of hidden neurons should be less than twice the size of the input layer
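
As a rough illustration, the three rules of thumb above can be turned into numbers. The helper name `hidden_size_rules` and the example sizes (8 input features, 1 output unit) are assumptions made for this sketch, not values taken from the notebook.

```python
def hidden_size_rules(n_inputs, n_outputs):
    """Return the hidden-layer sizes suggested by the three rules of thumb."""
    low, high = sorted((n_inputs, n_outputs))
    return {
        # rule 1: somewhere between the output size and the input size
        'between_input_and_output': (low, high),
        # rule 2: 2/3 of the input size, plus the output size
        'two_thirds_input_plus_output': round(2 * n_inputs / 3 + n_outputs),
        # rule 3: strictly less than twice the input size
        'less_than_twice_input': 2 * n_inputs - 1,
    }

# Hypothetical sizes, chosen only for illustration
rules = hidden_size_rules(n_inputs=8, n_outputs=1)
for name, value in rules.items():
    print(name, value)
```

The rules give different answers on purpose: they are starting points for a search, not a substitute for validating each candidate size.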


<img src="image/neuralnetwork.gif" width="350px">

__`Step 6`__ - Create an MLPClassifier with one hidden layer and one neuron and name it __model_simple__

In [7]:
model_simple = MLPClassifier(hidden_layer_sizes=(1,))

__`Step 7`__ - Create an MLPClassifier with one hidden layer and 10 neurons and name it __model_medium__

In [8]:
model_medium = MLPClassifier(hidden_layer_sizes=(10,))

__`Step 8`__ - Create an MLPClassifier with four hidden layers and 100 neurons each and name it __model_complex__

In [9]:
model_complex = MLPClassifier(hidden_layer_sizes = (100,100,100,100))

__`Step 9`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [10]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['Simple','Medium','Complex'])
show_results(df, model_simple, model_medium, model_complex)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| Simple | 0.6+/-0.15 | 0.622+/-0.09 | 0.604+/-0.13 | 181.3+/-46.7 |
| Medium | 0.709+/-0.27 | 0.687+/-0.02 | 0.682+/-0.06 | 181.3+/-56.1 |
| Complex | 4.096+/-1.44 | 0.81+/-0.04 | 0.708+/-0.06 | 108.9+/-41.7 |


Results vary between runs, but the following conclusions usually hold:
- The more complex the model, the longer it takes to train;
- Matching the model complexity to the problem can improve performance: a model that is too simple underfits, and one that is too complex can overfit.
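
To make "complexity" concrete, the sketch below counts the weights and biases of the three architectures. The helper `count_parameters` is hypothetical, and the sizes of 8 input features and 1 output unit are assumptions for illustration only.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count the weights and biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # one weight per connection plus one bias per receiving unit
        total += fan_in * fan_out + fan_out
    return total

# Assumed sizes: 8 input features and 1 output unit (not taken from the notebook)
param_counts = {
    'Simple': count_parameters(8, (1,), 1),
    'Medium': count_parameters(8, (10,), 1),
    'Complex': count_parameters(8, (100, 100, 100, 100), 1),
}
print(param_counts)
```

Under these assumptions the complex model has several hundred times more parameters than the medium one, which is consistent with its longer training time.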

<hr>
<a class="anchor" id="activation">
    
### 2.2. The activation function (default = 'relu')
    
</a>

Check this link for more information regarding the advantages and disadvantages of different activation functions: <br>https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/

<img src="image/activation.png" width="350px">

__`Step 10`__ - Create an instance of MLPClassifier, define the activation as __relu__ and name it as __model_relu__

In [11]:
model_relu = MLPClassifier(activation = 'relu')

 - __Advantages:__
     - Computationally efficient - allows the network to converge very quickly.
     - Nonlinear - although it looks like a linear function, ReLU is piecewise linear, has a derivative, and allows for backpropagation.
 - __Disadvantages:__
     - The **dying ReLU problem** - when the inputs to a neuron are negative, the gradient of the function is zero, so the affected neurons stop updating during backpropagation and cannot learn.

__`Step 11`__ - Create an instance of MLPClassifier, define the activation as __logistic__ and name it as __model_logistic__

In [12]:
model_logistic = MLPClassifier(activation = 'logistic')

 - __Advantages:__
     - Smooth gradient, preventing “jumps” in output values.
     - Output values bound between 0 and 1, normalizing the output of each neuron.
 - __Disadvantages:__
     - Vanishing gradient - for very high or very low values of X, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network failing to learn further, or in slow convergence.
     - Computationally expensive.

__`Step 12`__ - Create an instance of MLPClassifier, define the activation as __tanh__ and name it as __model_tanh__

In [13]:
model_tanh = MLPClassifier(activation = 'tanh')

 - __Advantages:__
     - Zero centered - making it easier to model inputs that have strongly negative, neutral and strongly positive values. Otherwise like sigmoid function. <br>
 - __Disadvantages:__
     - Like the logistic function

__`Step 13`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [14]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['relu','logistic','tanh'])
show_results(df, model_relu, model_logistic, model_tanh)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| relu | 1.306+/-0.37 | 0.735+/-0.02 | 0.681+/-0.04 | 106.8+/-35.6 |
| logistic | 1.856+/-0.24 | 0.783+/-0.01 | 0.721+/-0.04 | 200.0+/-0.0 |
| tanh | 2.616+/-0.49 | 0.815+/-0.01 | 0.711+/-0.06 | 200.0+/-0.0 |


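Checking the results, we can draw some tentative conclusions:
- ReLU tends to be faster than logistic or tanh. In the run above, both logistic and tanh stopped at the default limit of 200 iterations in every fold, which suggests they had not yet converged.
- Sigmoid-shaped functions (logistic, tanh) reached slightly higher validation accuracy here, but the gap is about one standard deviation, so treat it as an observation for this dataset rather than a general rule for classification problems.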
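
The dying ReLU and vanishing gradient behaviours described above can be seen by evaluating the derivative of each activation at a few points. This is a minimal sketch using only the standard library; the function names are chosen for this example.

```python
import math

def relu_grad(x):
    # derivative of max(0, x); set to 0 at x == 0 by convention in this sketch
    return 1.0 if x > 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad(x):
    s = logistic(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f'x={x:6.1f}  relu={relu_grad(x):.1f}  '
          f'logistic={logistic_grad(x):.6f}  tanh={tanh_grad(x):.6f}')
```

The ReLU gradient is exactly zero for negative inputs, while the logistic and tanh gradients shrink towards zero as the input moves away from zero in either direction.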

<hr>
<a class="anchor" id="solver">
    
### 2.3. The solver (default = 'adam')

    
</a>
For more information check this paper: <br>   

http://www.robotics.stanford.edu/~ang/papers/icml11-OptimizationForDeepLearning.pdf <br>



__`Step 14`__ - Create an instance of MLPClassifier, define the solver as __sgd__ and name it as __model_sgd__

In [15]:
model_sgd = MLPClassifier(solver = 'sgd')

__When to use__
- If generalization is more important than time processing - Some recent papers observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance.
(https://papers.nips.cc/paper/7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning.pdf)

__Notes__
- While Gradient Descent uses the whole training data to do a single update, SGD uses a random data point (or a small batch) of the training data to update the parameters, so each SGD update is much cheaper than a GD update.
- It uses a **common learning rate** for all parameters, contrary to what happens in Adam.

__`Step 15`__ - Create an instance of MLPClassifier, define the solver as __adam__ and name it as __model_adam__

In [16]:
model_adam = MLPClassifier(solver = 'adam')

__When to use__ <br>
- It achieves good results fast - good for complex models, if processing time is an issue.

__Notes__ <br>
- It computes individual adaptive learning rates for different parameters
- Adam combines the advantages of RMSProp and AdaGrad <br>
(For more about Adam, check this: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c)
- Recent research papers have noted that it can fail to converge to an optimal solution under specific settings.
(The paper https://arxiv.org/pdf/1712.07628.pdf demonstrates that adaptive optimization techniques such as Adam generalize poorly compared to SGD)
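
The difference between a common learning rate and individual adaptive learning rates can be sketched with a single update step. The code below follows the standard Adam update rule with its usual default constants; it is an illustration written for this notebook, not sklearn's internal implementation, and the two gradient values are made up.

```python
import math

def sgd_step(grad, lr=0.001):
    # plain SGD: the step is proportional to the gradient
    return lr * grad

def adam_first_step(grad, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    # moment estimates after the first update, starting from zero
    m = (1 - beta_1) * grad
    v = (1 - beta_2) * grad ** 2
    # bias correction for time step t = 1
    m_hat = m / (1 - beta_1)
    v_hat = v / (1 - beta_2)
    return lr * m_hat / (math.sqrt(v_hat) + epsilon)

# Two hypothetical parameters whose gradients differ by four orders of magnitude
grads = [100.0, 0.01]
sgd_steps = [sgd_step(g) for g in grads]
adam_steps = [adam_first_step(g) for g in grads]
print('sgd :', sgd_steps)
print('adam:', adam_steps)
```

With SGD the two parameters move by very different amounts, while Adam rescales each gradient so that both first steps are close to the learning rate itself.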



__`Step 16`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the two models.

In [17]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['sgd','adam'])
show_results(df, model_sgd, model_adam)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| sgd | 0.482+/-0.19 | 0.713+/-0.02 | 0.676+/-0.05 | 51.6+/-16.4 |
| adam | 1.288+/-0.55 | 0.752+/-0.02 | 0.693+/-0.05 | 117.4+/-33.0 |


In sklearn, the number of iterations for __sgd__ and __adam__ corresponds to the number of epochs (an epoch is one full pass through the training data).

<img src="image/optimizers.gif" width="350px">

<hr>
<a class="anchor" id="lr_init">

### 2.4. The learning rate initialization - only for sgd and adam (default = 0.001)
    
</a>

The learning rate is one of the most important hyper-parameters to tune for training deep neural networks:

__Small LR__
- If the learning rate is small, then training is more reliable, but optimization will take a lot of time because steps towards the minimum of the loss function are tiny - a smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train.
- A learning rate that is too small may never converge or may get stuck on a suboptimal solution.

__Big LR__
- If the learning rate is high, then training may fail to converge or may even diverge. Weight changes can be so big that the optimizer overshoots the minimum and makes the loss worse - a large learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights.


The training should start from a relatively large learning rate because, in the beginning, random weights are far from optimal, and then the learning rate can decrease during training to allow more fine-grained weight updates.
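
The three regimes can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0; the function and the learning rates are chosen only for illustration and are unrelated to the diabetes data.

```python
def gradient_descent(lr, w=1.0, n_steps=50):
    """Minimise f(w) = w**2 starting from w, using a fixed learning rate."""
    for _ in range(n_steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad
    return w

final_w = {lr: gradient_descent(lr) for lr in (1.1, 0.1, 0.000001)}
for lr, w in final_w.items():
    print(f'lr={lr:<8} final w={w:.6g}')
```

With the largest rate every step overshoots the minimum and the weight grows without bound, the middle rate ends close to zero, and the smallest rate has barely moved from the starting point after 50 steps.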

__`Step 18`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate_init as __0.5__ and name it as __model_lr_big__

In [18]:
model_lr_big = MLPClassifier(solver = 'sgd', learning_rate_init = 0.5)

__`Step 19`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate_init as __0.001__ and name it as __model_lr_medium__

In [19]:
model_lr_medium = MLPClassifier(solver = 'sgd', learning_rate_init = 0.001)

__`Step 20`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate_init as __0.000001__ and name it as __model_lr_small__

In [20]:
model_lr_small = MLPClassifier(solver = 'sgd', learning_rate_init = 0.000001)

__`Step 21`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [21]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['big','medium','small'])
show_results(df, model_lr_big, model_lr_medium, model_lr_small)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| big | 0.122+/-0.02 | 0.651+/-0.01 | 0.651+/-0.07 | 12.0+/-0.0 |
| medium | 0.951+/-0.61 | 0.731+/-0.01 | 0.695+/-0.06 | 74.8+/-33.0 |
| small | 2.304+/-0.57 | 0.599+/-0.06 | 0.579+/-0.08 | 200.0+/-0.0 |


<hr>
<a class="anchor" id="lr">

### 2.5. The learning rate - Only for sgd (default = 'constant')
</a>

__`Step 22`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate as __constant__ and name it as __model_constant__

In [22]:
model_constant = MLPClassifier(solver = 'sgd', learning_rate = 'constant')

__Definition__<br>
If the learning rate is constant, as the name says, the learning rate will always remain equal to the initial learning rate.

__`Step 23`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate as __invscaling__ and name it as __model_invscaling__

In [23]:
model_invscaling = MLPClassifier(solver = 'sgd', learning_rate = 'invscaling')

__Definition__<br>
If the learning rate is invscaling, it gradually decreases the learning rate at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. <br><br>
$$\text{effective learning rate} = \frac{\mathrm{learning\_rate\_init}}{t^{\,\mathrm{power\_t}}}$$ <br>
The __power_t__ (default = 0.5) is another parameter that you can change.
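
As a worked example of the formula above, the sketch below evaluates the effective learning rate at a few time steps with the default values. The helper name `invscaling_lr` is hypothetical.

```python
def invscaling_lr(learning_rate_init, t, power_t=0.5):
    """Effective learning rate at time step t under inverse scaling."""
    return learning_rate_init / t ** power_t

schedule = {t: invscaling_lr(0.001, t) for t in (1, 4, 100, 10000)}
for t, lr in schedule.items():
    print(f't={t:<6} effective learning rate={lr:.6f}')
```

With power_t = 0.5 the rate is halved by t = 4 and is a hundred times smaller by t = 10000, so later updates are much smaller than the first ones.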

__`Step 24`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the learning_rate as __adaptive__ and name it as __model_adaptive__

In [24]:
model_adaptive = MLPClassifier(solver = 'sgd', learning_rate = 'adaptive')

__Definition__ <br>
If the learning rate is adaptive, then it keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. <br><br>
Each time two consecutive epochs fail to decrease training loss by at least __tol__ (another parameter that you can change), or fail to increase validation score by at least __tol__ if __early_stopping__ (another parameter that you can change) is on, the current learning rate is divided by 5.
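
The rule described above can be sketched as a small simulation over a list of per-epoch training losses. This is a simplified illustration of the description, not sklearn's internal code; the function name, the `patience` argument and the loss values are all made up for this example.

```python
def adaptive_schedule(losses, learning_rate_init=0.001, tol=1e-4, patience=2):
    """Return the learning rate in effect after each epoch."""
    lr = learning_rate_init
    best_loss = float('inf')
    stalled = 0           # consecutive epochs without enough improvement
    history = []
    for loss in losses:
        if loss < best_loss - tol:
            best_loss = loss
            stalled = 0
        else:
            stalled += 1
        if stalled >= patience:
            lr /= 5       # divide the current learning rate by 5
            stalled = 0
        history.append(lr)
    return history

# Hypothetical training losses: two plateaus of three epochs each
history = adaptive_schedule([1.0, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8])
print(history)
```

Each plateau triggers one division by 5, so the learning rate only shrinks when training stops making progress.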

__`Step 25`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [25]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['constant','invscaling','adaptive'])
show_results(df, model_constant, model_invscaling, model_adaptive)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| constant | 0.801+/-0.28 | 0.723+/-0.02 | 0.671+/-0.06 | 61.8+/-13.9 |
| invscaling | 0.822+/-1.07 | 0.633+/-0.07 | 0.637+/-0.07 | 68.6+/-86.0 |
| adaptive | 1.746+/-0.72 | 0.746+/-0.01 | 0.71+/-0.06 | 171.9+/-23.5 |


<hr>
<a class="anchor" id="batch">

### 2.6. The batch size (default = min(200, n_samples))
</a>

The batch size can significantly affect the performance and the speed of your training. When you pass a batch through your network, the gradients of its samples are averaged into a single update. <br>

__Small batch size__
- The smaller the batch size, the noisier the gradient estimate: the network's weights can "jump" around if your data is noisy, and the model might be unable to learn or might converge very slowly. Computation time per epoch also increases, because more weight updates are performed.
- It can be useful in some cases to escape local minima.
- Sometimes, and depending on your computational resources, this is the only option.

__Big batch size__
- If your batch size is big enough, averaging over many samples provides a stable estimate of what the gradient of the full dataset would be.
- A bigger batch also means fewer gradient updates per epoch, which speeds up computation.
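
The link between batch size and the number of weight updates per epoch is simple arithmetic. In the sketch below the training-set size of 691 samples is a hypothetical value used only for illustration, and `updates_per_epoch` is a helper written for this example.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and therefore weight updates) in one epoch."""
    # a batch cannot be larger than the training set
    batch_size = min(batch_size, n_samples)
    return math.ceil(n_samples / batch_size)

n_train = 691  # hypothetical number of training samples
updates = {b: updates_per_epoch(n_train, b) for b in (5, 50, 500)}
print(updates)
```

Under this assumption a batch size of 5 needs about 70 times more updates per epoch than a batch size of 500, which is why small batches take longer per epoch.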

__`Step 26`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the batch_size as __5__ and name it as __model_batch5__

In [26]:
model_batch5 = MLPClassifier(solver = 'sgd', batch_size = 5)

__`Step 27`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the batch_size as __50__ and name it as __model_batch50__

In [27]:
model_batch50 = MLPClassifier(solver = 'sgd', batch_size = 50)

__`Step 28`__ - Create an instance of MLPClassifier, define the solver as __sgd__, the batch_size as __500__ and name it as __model_batch500__

In [28]:
model_batch500 = MLPClassifier(solver = 'sgd', batch_size = 500)

__`Step 29`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [29]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['batch 5','batch 50','batch 500'])
show_results(df, model_batch5, model_batch50, model_batch500)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| batch 5 | 3.21+/-0.84 | 0.681+/-0.01 | 0.672+/-0.08 | 33.7+/-8.6 |
| batch 50 | 0.812+/-0.33 | 0.739+/-0.01 | 0.714+/-0.06 | 52.7+/-18.6 |
| batch 500 | 0.556+/-0.15 | 0.718+/-0.02 | 0.68+/-0.03 | 85.6+/-24.5 |


<hr>
<a class="anchor" id="max_iter">

### 2.7. The maximum iterations (default = 200)
</a>

By default, sklearn sets the maximum number of iterations to 200. While this can be enough for simple datasets, for complex problems you should try higher values so that the model has room to converge.

__`Step 30`__ - Create an instance of MLPClassifier, define the max_iter as __20__ and name it as __model_maxiter_20__

In [30]:
model_maxiter_20 = MLPClassifier(max_iter = 20)

__`Step 31`__ - Create an instance of MLPClassifier, define the max_iter as __100__ and name it as __model_maxiter_100__

In [31]:
model_maxiter_100 = MLPClassifier(max_iter = 100)

__`Step 32`__ - Create an instance of MLPClassifier, define the max_iter as __500__ and name it as __model_maxiter_500__

In [32]:
model_maxiter_500 = MLPClassifier(max_iter = 500)

__`Step 33`__ - Check the mean accuracy of each model by calling the function _show_results_ and pass as arguments the results dataframe and the three models.

In [33]:
df = pd.DataFrame(columns = ['Time','Train','Validation', 'Iterations'], index = ['max iter 20','max iter 100','max iter 500'])
show_results(df, model_maxiter_20, model_maxiter_100, model_maxiter_500)

| | Time | Train | Validation | Iterations |
|---|---|---|---|---|
| max iter 20 | 0.164+/-0.01 | 0.676+/-0.03 | 0.654+/-0.05 | 20.0+/-0.0 |
| max iter 100 | 0.74+/-0.2 | 0.726+/-0.02 | 0.669+/-0.06 | 83.5+/-18.4 |
| max iter 500 | 0.951+/-0.35 | 0.753+/-0.02 | 0.704+/-0.05 | 106.9+/-36.5 |


<hr>
<a class="anchor" id="other">

### 2.8. Other parameters
</a>

|Parameter| Definition | LBFGS | SGD | ADAM |
|---|---|---|---|---|
|alpha| L2 penalty (regularization term) parameter | yes | yes | yes |
| power_t | The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. | no | yes | no |
| shuffle | Whether to shuffle samples in each iteration. | no | yes | yes |
| tol | Tolerance for the optimization. When the loss or score is not improving by at least tol for n_iter_no_change consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops. | yes | yes | yes |
| warm_start | When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. | yes | yes | yes |
| momentum | Momentum for gradient descent update. Should be between 0 and 1. | no | yes | no |
| nesterovs_momentum | Whether to use Nesterov’s momentum.| no | yes | no |
| early_stopping | Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs. The split is stratified, except in a multilabel setting.  | no | yes | yes |
| validation_fraction | The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True | no | yes | yes|
| beta_1 | Exponential decay rate for estimates of first moment vector in adam, should be in [0, 1). | no | no | yes |
| beta_2 | Exponential decay rate for estimates of second moment vector in adam, should be in [0, 1).  | no | no | yes |
| epsilon | Value for numerical stability in adam. | no | no | yes |
| n_iter_no_change | Maximum number of epochs to not meet tol improvement. |  no | yes | yes |
| max_fun | Only used when solver=’lbfgs’. Maximum number of loss function calls. The solver iterates until convergence (determined by ‘tol’), number of iterations reaches max_iter, or this number of loss function calls. | yes | no | no |

# 3. Grid Search <a class="anchor" id="gridsearch"></a>
[Back to Contents](#toc)

__`Step 34`__ - From sklearn.model_selection import __GridSearchCV__

In [34]:
from sklearn.model_selection import GridSearchCV

__`Step 35`__ - Define a dictionary named as __parameter_space__ and define the following options to be considered during modelling:
- 'hidden_layer_sizes': [(50,50,50), (100,)]
- 'activation': ['tanh', 'relu']
- 'solver': ['sgd', 'adam']
- 'learning_rate_init' : [0.0001, 0.001, 0.01, 0.1]

In [35]:
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.0001, 0.001, 0.01, 0.1]
}
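
Before running the search it helps to know how many models it will train. The sketch below enumerates the combinations in the same parameter space with `itertools.product`. The fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, since no `cv` value is set in this notebook.

```python
from itertools import product

parameter_space = {
    'hidden_layer_sizes': [(50, 50, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'learning_rate_init': [0.0001, 0.001, 0.01, 0.1],
}

# every combination of one value per parameter
combinations = [dict(zip(parameter_space, values))
                for values in product(*parameter_space.values())]

n_combinations = len(combinations)
n_folds = 5                          # assumed default number of CV folds
n_fits = n_combinations * n_folds    # fits done during the search itself

print(n_combinations, 'combinations,', n_fits, 'fits')
```

Every extra value added to one of the lists multiplies the number of fits, so the grid grows quickly with the number of parameters being tuned.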

__`Step 36`__ - Create an instance of GridSearchCV named as __clf__ and pass as parameters the __model__ and the __parameter_space__

In [36]:
clf = GridSearchCV(model, parameter_space)

__`Step 37`__ - Fit your Grid Search to __X_train__ and __y_train__

__Step 37.1__ - By using the function `train_test_split` from `sklearn.model_selection`, split your dataset into train (70%) and test (30%).

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 150, shuffle = True, stratify = y)

__Step 37.2__ - Fit your instance to __X_train__ and __y_train__

In [38]:
clf.fit(X_train, y_train)

GridSearchCV(estimator=MLPClassifier(),
             param_grid={'activation': ['tanh', 'relu'],
                         'hidden_layer_sizes': [(50, 50, 50), (100,)],
                         'learning_rate_init': [0.0001, 0.001, 0.01, 0.1],
                         'solver': ['sgd', 'adam']})

__`Step 38`__ - Check the attribute __best_params___ to see which combination of parameters performed best. Create a final model with those parameters from the attribute __best_estimator___

In [39]:
clf.best_params_

{'activation': 'relu',
 'hidden_layer_sizes': (50, 50, 50),
 'learning_rate_init': 0.01,
 'solver': 'adam'}

In [40]:
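# best_estimator_ was already refit on X_train by GridSearchCV (refit=True by default);
# calling fit again retrains it from a new random initialization
final_model = clf.best_estimator_.fit(X_train, y_train)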
print('Train:', final_model.score(X_train, y_train))
print('Test:', final_model.score(X_test, y_test))

Train: 0.7746741154562383
Test: 0.70995670995671


__`Step 39`__ - Loop over the results stored by GridSearchCV to check the mean and the standard deviation of the validation score for each parameter combination

In [41]:
# Best parameter set
print('------------------------------------------------------------------------------------------------------------------------')
print('Best parameters found:\n', clf.best_params_)
print('------------------------------------------------------------------------------------------------------------------------')

# All results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std , params))

------------------------------------------------------------------------------------------------------------------------
Best parameters found:
 {'activation': 'relu', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.01, 'solver': 'adam'}
------------------------------------------------------------------------------------------------------------------------
0.656 (+/-0.010) for {'activation': 'tanh', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.0001, 'solver': 'sgd'}
0.665 (+/-0.030) for {'activation': 'tanh', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.0001, 'solver': 'adam'}
0.672 (+/-0.048) for {'activation': 'tanh', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.001, 'solver': 'sgd'}
0.672 (+/-0.012) for {'activation': 'tanh', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.001, 'solver': 'adam'}
0.676 (+/-0.011) for {'activation': 'tanh', 'hidden_layer_sizes': (50, 50, 50), 'learning_rate_init': 0.01, 'solver': 'sg