# Neural networks
A family of algorithms known as neural networks has become increasingly popular during the past few years under
the name “deep learning.” While deep learning shows great promise in many machine
learning applications, deep learning algorithms are often tailored very carefully to a
specific use case. Here, we will only discuss some relatively simple methods, namely
multilayer perceptrons for classification and regression, that can serve as a starting
point for more involved deep learning methods. Multilayer perceptrons (MLPs) are
also known as (vanilla) feed-forward neural networks, or sometimes just neural
networks.

MLPs can be viewed as generalizations of linear models that perform multiple stages
of processing to come to a decision. Remember that the prediction by a linear regressor is given as:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b

In plain English, ŷ is a weighted sum of the input features x[0] to x[p], weighted by
the learned coefficients w[0] to w[p]. We could visualize this graphically as shown in the following figure:

<img src="perceptron.png">
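
The weighted sum above can be sketched in a few lines of NumPy. The coefficient and feature values below are made up purely for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the formula.
w = np.array([0.5, -1.0, 2.0])   # learned coefficients w[0]..w[2]
b = 0.1                          # intercept
x = np.array([1.0, 2.0, 3.0])    # input features x[0]..x[2]

# Prediction of a linear regressor: weighted sum of the features plus b.
y_hat = np.dot(w, x) + b
print(y_hat)
```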

In an MLP this process of computing weighted sums is repeated multiple times, first
computing hidden units that represent an intermediate processing step, which are
again combined using weighted sums to yield the final result. Note that there can be several layers of hidden units, each of which will create a more complex representation of the input.

<img src="MLP.png">

Computing a series of weighted sums is mathematically the same as computing just
one weighted sum, so to make this model truly more powerful than a linear model,
we need one extra trick. After computing a weighted sum for each hidden unit, a
nonlinear function is applied to the result.
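
As a small numerical check of this argument, the sketch below builds a tiny network with random weights (all shapes and values are arbitrary assumptions) and compares it with a single weighted sum. It uses `tanh` as one example of a nonlinear function.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=3)          # one sample with 3 input features
W1 = rng.normal(size=(4, 3))    # weights from the inputs to 4 hidden units
b1 = rng.normal(size=4)
w2 = rng.normal(size=4)         # weights from the hidden units to the output
b2 = rng.normal()

# Without a nonlinearity, two stages of weighted sums...
two_stage_linear = w2 @ (W1 @ x + b1) + b2

# ...are identical to one weighted sum with combined coefficients.
w_combined = w2 @ W1
b_combined = w2 @ b1 + b2
one_stage_linear = w_combined @ x + b_combined
print(np.isclose(two_stage_linear, one_stage_linear))

# Applying a nonlinear function to each hidden unit breaks this equivalence.
hidden = np.tanh(W1 @ x + b1)
mlp_output = w2 @ hidden + b2
print(two_stage_linear, mlp_output)
```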

Let’s look into the workings of the MLP by applying the `MLPClassifier` to the
two moons dataset. Import the `make_moons` function and generate a dataset with 100 samples, `noise=0.25`, and `random_state=3`. Then make a scatter plot of the data.

Now import the `Perceptron` classifier from sklearn and apply it to the data.
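
One possible way to approach these two steps is sketched below. The `random_state=0` passed to `Perceptron` is an assumption added for reproducibility; the text does not specify it.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import Perceptron

# Two interleaving half-moons, using the settings named in the text.
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

# random_state=0 is an assumption, added only to make the run repeatable.
perceptron = Perceptron(random_state=0)
perceptron.fit(X, y)

print("Training accuracy: {:.2f}".format(perceptron.score(X, y)))
# The model learns one weight per input feature plus an intercept.
print(perceptron.coef_.shape, perceptron.intercept_.shape)
```

The data can then be shown with, for example, Matplotlib's `plt.scatter(X[:, 0], X[:, 1], c=y)`.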

Build a function to visualize the input data and the decision function of the model. You may look to the Random Forest exercise for inspiration.
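
A minimal sketch of such a helper follows. The names `decision_grid` and `plot_decision_function` are hypothetical, and the sketch colors the plane by predicted class rather than by a continuous decision value. It assumes a fitted model with a `predict` method and two-dimensional input data.

```python
import numpy as np

def decision_grid(model, X, steps=200, margin=0.5):
    """Evaluate model.predict on a regular grid covering the 2D data X."""
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = model.predict(grid).reshape(xx.shape)
    return xx, yy, zz

def plot_decision_function(model, X, y, ax):
    """Draw the predicted regions and the data points on a Matplotlib Axes."""
    xx, yy, zz = decision_grid(model, X)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
```

With Matplotlib imported as `plt`, a call could look like `fig, ax = plt.subplots()` followed by `plot_decision_function(model, X, y, ax)`.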

Visualize the classifier. Is this shape expected? Why is the perceptron a linear classifier? 

Now import the `MLPClassifier` and set `random_state=0` and `solver='lbfgs'`, leaving the rest of the parameters at their defaults. Fit the MLP model and then plot the results.
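
A sketch of the fitting step, under the settings named above, might look like this. Plotting is left out; any helper that draws the predictions of a fitted model can be applied to `mlp` afterwards.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

# All other parameters keep their default values.
mlp = MLPClassifier(solver="lbfgs", random_state=0)
mlp.fit(X, y)

print("Training accuracy: {:.2f}".format(mlp.score(X, y)))
```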

How many hidden units does the model have by default? Is this a good idea for such a small dataset? Try different numbers of hidden units and see the effect on the decision boundary.

With only 10 hidden units, the decision boundary looks somewhat more ragged. The
default nonlinearity is `relu`, shown in Figure 2-46. With a single hidden layer, this
means the decision function will be made up of 10 straight line segments. If we want
a smoother decision boundary, we could add more hidden units,
add a second hidden layer, or use the `tanh` nonlinearity. Try a model with two hidden layers of 10 units each, then try a model with one hidden layer of 10 units and `tanh` activation.
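
The two variants can be set up as sketched below. Keeping `solver="lbfgs"` and `random_state=0` from the previous step is an assumption; the text only fixes the layer sizes and the activation.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

# Two hidden layers of 10 units each, with the default relu nonlinearity.
mlp_two_layers = MLPClassifier(solver="lbfgs", random_state=0,
                               hidden_layer_sizes=(10, 10))
mlp_two_layers.fit(X, y)

# One hidden layer of 10 units, with the tanh nonlinearity.
mlp_tanh = MLPClassifier(solver="lbfgs", random_state=0,
                         hidden_layer_sizes=(10,), activation="tanh")
mlp_tanh.fit(X, y)

# coefs_ holds one weight matrix per layer transition.
print([w.shape for w in mlp_two_layers.coefs_])
print([w.shape for w in mlp_tanh.coefs_])
```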

Finally, we can also control the complexity of a neural network by using an L2 penalty
to shrink the weights toward zero, as we did in ridge regression. The parameter for this in the `MLPClassifier` is `alpha`, and it is set to a very low value (little regularization) by default. Generate a grid of plots covering four values of `alpha` (0.0001, 0.01, 1, and 2) for MLPs with two hidden layers of either 10 or 100 units each.
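
The loop below sketches how the eight models could be fitted. It stores them in a dictionary (a hypothetical choice) so that each one can later be drawn in its own subplot. The `solver` and `random_state` values are assumptions carried over from the earlier steps.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

fitted = {}
for n_hidden in [10, 100]:
    for alpha in [0.0001, 0.01, 1, 2]:
        mlp = MLPClassifier(solver="lbfgs", random_state=0,
                            hidden_layer_sizes=(n_hidden, n_hidden),
                            alpha=alpha)
        mlp.fit(X, y)
        fitted[(n_hidden, alpha)] = mlp
        print("n_hidden={}, alpha={}: training accuracy {:.2f}".format(
            n_hidden, alpha, mlp.score(X, y)))
```

A 2-by-4 grid from Matplotlib's `plt.subplots(2, 4)` gives one panel per entry of `fitted`.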

An important property of neural networks is that their weights are set randomly
before learning is started, and this random initialization affects the model that is
learned. That means that even when using exactly the same parameters, we can
obtain very different models when using different random seeds. If the networks are
large, and their complexity is chosen properly, this should not affect accuracy too
much, but it is worth keeping in mind (particularly for smaller networks). 
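
The effect of the random initialization can be observed with a sketch like the following, which fits the same small network several times and changes only `random_state`. The network size and the number of seeds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

# Same data and hyperparameters; only the seed for the initial weights changes.
models = []
for seed in range(4):
    mlp = MLPClassifier(solver="lbfgs", random_state=seed,
                        hidden_layer_sizes=(10, 10))
    models.append(mlp.fit(X, y))
    print("seed={}: training accuracy {:.2f}".format(seed, mlp.score(X, y)))

# Compare the learned first-layer weights of two of the models.
same_weights = np.allclose(models[0].coefs_[0], models[1].coefs_[0])
print("Seeds 0 and 1 learned identical weights:", same_weights)
```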

## Neural networks and the Breast cancer dataset

We will use the breast cancer dataset to see how neural networks perform on real-world data. Load the breast cancer dataset from sklearn and print the description.

Print the dataset target names, the feature names, and the input shape.
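
These inspection steps can be sketched with `load_breast_cancer`, which returns an object exposing the description, names, and data as attributes.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.DESCR)          # full text description of the dataset
print(cancer.target_names)   # class labels
print(cancer.feature_names)  # names of the input features
print(cancer.data.shape)     # (number of samples, number of features)
```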

Now perform a train/test split with `random_state=0` and train an MLP with the default parameters and `random_state=42`. After training, print the accuracy on the training and the test set.
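
A sketch of this step is shown below. It relies on the default split proportions of `train_test_split`, since the text does not specify a test size.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Default parameters apart from the seed.
mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
```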

You may compare the performance of the MLP with the SVC and the RandomForestClassifier. Remember to set `gamma='scale'` for the SVC and you can use `n_estimators=1000` for the Random Forest. Print the accuracies.
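
The comparison could be set up as in the sketch below. The `random_state=0` given to the random forest is an assumption added for reproducibility, and 1000 trees can take a few seconds to fit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

models = {
    "SVC": SVC(gamma="scale"),
    # random_state=0 is an assumption, not taken from the text.
    "RandomForest": RandomForestClassifier(n_estimators=1000, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("{}: train {:.3f}, test {:.3f}".format(name, *scores[name]))
```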

Which model has obtained the best performance? Why do you think that is? Compute the standard deviation for each feature in the dataset.

Now normalize the data: subtract the mean and divide by the standard deviation. Compute the mean and standard deviation on the training set only, and use those same values to scale the test set.

After doing that, check that the mean and standard deviation of each feature in the training set are now 0 and 1.
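
A manual version of this normalization, using only NumPy operations on the arrays, might look as follows. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Statistics are computed per feature, on the training set only.
mean_on_train = X_train.mean(axis=0)
std_on_train = X_train.std(axis=0)

# The same training-set statistics are applied to both sets.
X_train_scaled = (X_train - mean_on_train) / std_on_train
X_test_scaled = (X_test - mean_on_train) / std_on_train

# Check: per-feature mean close to 0 and std close to 1 on the training set.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_scaled.std(axis=0), 1.0))
```

The test set is scaled with the training statistics, so its per-feature mean and standard deviation will generally be close to, but not exactly, 0 and 1.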

Run the MLP again on the normalized data and print the accuracy on the training and test sets. Did the results improve?

You should see a warning saying that the optimization has not converged. This usually means we should allow more iterations, so set `max_iter` to 1000. What are the accuracies now? Is there a way you can think of to further improve the results?
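
Refitting on the normalized data with a larger iteration budget can be sketched as below. The scaling is repeated here so that the example runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Scale both sets with statistics from the training set.
mean_on_train = X_train.mean(axis=0)
std_on_train = X_train.std(axis=0)
X_train_scaled = (X_train - mean_on_train) / std_on_train
X_test_scaled = (X_test - mean_on_train) / std_on_train

# Allow up to 1000 iterations instead of the default.
mlp = MLPClassifier(max_iter=1000, random_state=42)
mlp.fit(X_train_scaled, y_train)

train_acc = mlp.score(X_train_scaled, y_train)
test_acc = mlp.score(X_test_scaled, y_test)
print("Accuracy on training set: {:.3f}".format(train_acc))
print("Accuracy on test set: {:.3f}".format(test_acc))
```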

Finally, compare the performance of the SVC and the random forest on the normalized data. Comment on the results.