# A Simple Neural network
## Part 4: Vectorization

This tutorial is part 4 of the previous tutorials on neural networks (TODO: url). While the previous tutorials described very simple single layer regression and classification models, this tutorial will describe a 2-class classification neural network with 2 input dimensions, and a non-linear hidden layer with 3 dimensions. While we didn't add the bias parameters to the previous 2 models, we will add them to this model. The network of this model is shown in the following figure:

![Image of the logistic model](https://dl.dropboxusercontent.com/u/8938051/Blog_images/SimpleANN03.png)

In [None]:
# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
from matplotlib.colors import colorConverter, ListedColormap # some plotting functions
from mpl_toolkits.mplot3d import Axes3D  # 3D plots
from matplotlib import cm # Colormaps
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)

## Define the dataset 

In this example the target classes $t$ will be generated from 2 class distributions: blue ($t=1$) and red ($t=0$). The red class is a circular distribution that surrounds the distribution of the blue class. This results in a 2D dataset that is not linearly separable. The model from part 2 won't be able to classify both classes correctly since it can only learn linear separators. By adding a hidden layer, the model will be able to learn a non-linear separator. In the code, each target is encoded as a row vector with one element per class: $[1, 0]$ for blue and $[0, 1]$ for red.

In [None]:
# Define and generate the samples
nb_of_samples_per_class = 50  # The number of samples in each class

# Generate blue samples
blue_mean = [0,0]  # The mean of the blue class
blue_std_dev = 0.3  # standard deviation of blue class
x_blue = np.random.randn(nb_of_samples_per_class, 2) * blue_std_dev + blue_mean

# Generate red samples as circle around blue samples
red_radius_mean = 1.3  # mean of the radius
red_radius_std_dev = 0.2  # standard deviation of the radius
red_rand_radius = np.random.randn(nb_of_samples_per_class) * red_radius_std_dev + red_radius_mean
red_rand_angle = 2 * np.pi * np.random.rand(nb_of_samples_per_class)
x_red = np.asmatrix([red_rand_radius * np.cos(red_rand_angle), 
                     red_rand_radius * np.sin(red_rand_angle)]).T

# Define target vectors for blue and red
t_blue_vector = np.asarray([1, 0])
t_red_vector = np.asarray([0, 1])
# Define the full target matrix for each class
t_blue = np.tile(t_blue_vector, (nb_of_samples_per_class, 1))
t_red = np.tile(t_red_vector, (nb_of_samples_per_class, 1))

# Merge samples in set of input variables x, and corresponding set of
# output variables t
X = np.vstack((x_blue, x_red))
T = np.vstack((t_blue, t_red))

In [None]:
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('$x_1$', fontsize=15)
plt.ylabel('$x_2$', fontsize=15)
plt.axis([-2, 2, -2, 2])
plt.title('red vs blue classes in the input space')
plt.show()

## Vectorization of backpropagation

### 1. Vectorization of the forward step

#### Compute activations of hidden layer
The $n$ input samples with $2$ variables each are given as a $n \times 2$ matrix $X$:

$$X =
\begin{bmatrix} 
x_{11} & x_{12} \\
\vdots & \vdots \\
x_{n1} & x_{n2}
\end{bmatrix}$$

Where $x_{ij}$ is the value of the $j$-th variable of the $i$-th input sample. These inputs are projected onto the 3 dimensions of the hidden layer $H$ by the $2 \times 3$ weight matrix $W_h$ ($w_{hij}$ is the weight of the connection between input variable $i$ and hidden neuron activation $j$) and the $1 \times 3$ bias vector $b_h$:

$$\begin{align}
W_h =
\begin{bmatrix} 
w_{h11} & w_{h12} & w_{h13} \\
w_{h21} & w_{h22} & w_{h23}
\end{bmatrix}
&& b_h = 
\begin{bmatrix} 
b_{h1} & b_{h2} & b_{h3}
\end{bmatrix}
\end{align}$$

using the following computation:

$$H = \sigma(X \cdot W_h + b_h) = \frac{1}{1+e^{-(X \cdot W_h + b_h)}} = \begin{bmatrix} 
h_{11} & h_{12} & h_{13} \\
\vdots & \vdots & \vdots \\
h_{n1} & h_{n2} & h_{n3}
\end{bmatrix}$$

With $\sigma$ the logistic function applied elementwise, and with $H$ resulting in an $n \times 3$ matrix. The bias row vector $b_h$ is added to each of the $n$ rows of $X \cdot W_h$.
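
As a minimal sketch of the shapes involved in this step, the snippet below uses plain NumPy arrays and hypothetical toy values (`X_toy`, `Wh_toy` and `bh_toy` are illustration names, not part of the tutorial). With plain arrays the matrix product has to be written with `np.dot`, since `*` is elementwise there; for `np.matrix` objects, such as the `x_red` created with `np.asmatrix` above, `*` is the matrix product.

```python
import numpy as np

def logistic(z):
    # Elementwise logistic function
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy values: n = 4 samples with 2 input variables each
X_toy = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
Wh_toy = np.ones((2, 3))    # 2 inputs -> 3 hidden neurons
bh_toy = np.zeros((1, 3))   # one bias per hidden neuron

# np.dot gives the (4, 3) matrix product; the (1, 3) bias row is
# broadcast over all 4 rows
H_toy = logistic(np.dot(X_toy, Wh_toy) + bh_toy)
print(H_toy.shape)  # (4, 3)
```

The first sample is all zeros, so its three hidden activations are all $\sigma(0) = 0.5$.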

#### Compute activations of output 

To compute the output activations, the hidden layer activations are projected onto the 2-dimensional output layer. This is done by the $3 \times 2$ weight matrix $W_o$ ($w_{oij}$ is the weight of the connection between hidden layer neuron $i$ and output activation $j$) and the $1 \times 2$ bias vector $b_o$:

$$\begin{align}
W_o =
\begin{bmatrix} 
w_{o11} & w_{o12} \\
w_{o21} & w_{o22} \\
w_{o31} & w_{o32}
\end{bmatrix}
&& b_o = 
\begin{bmatrix} 
b_{o1} & b_{o2}
\end{bmatrix}
\end{align}$$

using the following computation:

$$Y = \sigma(H \cdot W_o + b_o) = \frac{1}{1+e^{-(H \cdot W_o + b_o)}} = \begin{bmatrix} 
y_{11} & y_{12}\\
\vdots & \vdots \\
y_{n1} & y_{n2} 
\end{bmatrix}$$

With $\sigma$ the logistic function, and with $Y$ resulting in an $n \times 2$ matrix.

In [None]:
# Define the logistic function
def logistic(z): return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(X, Wh, bh):
    return logistic(X * Wh + bh)

# Define output layer feedforward
def output_activations(H, Wo, bo):
    return logistic(H * Wo + bo)

# Define the neural network function
def nn(X, Wh, bh, Wo, bo): 
    return output_activations(hidden_activations(X, Wh, bh), Wo, bo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(X, Wh, bh, Wo, bo): 
    return np.around(nn(X, Wh, bh, Wo, bo))

In [None]:
# Weights and biases
bh = np.asmatrix([[0, 0, 0]])
Wh = np.asmatrix([[1, 1, 1],[1, 1, 1]])

bo = np.asmatrix(([0, 0]))
Wo = np.asmatrix([[1, 1], [1, 1], [1, 1]])


# Print the shapes of the inputs and parameters of the hidden layer
print('X.shape: {}'.format(X.shape))
print('Wh.shape: {}'.format(Wh.shape))
print('bh.shape: {}'.format(bh.shape))

H = hidden_activations(X, Wh, bh)
print('H.shape: {}'.format(H.shape))
print('Wo.shape: {}'.format(Wo.shape))
print('bo.shape: {}'.format(bo.shape))

O = output_activations(H, Wo, bo)
print('O.shape: {}'.format(O.shape))

Y = nn(X, Wh, bh, Wo, bo)
print('Y.shape: {}'.format(Y.shape))

Y_pred = nn_predict(X, Wh, bh, Wo, bo)
print('Y_pred.shape: {}'.format(Y_pred.shape))

### 2. Vectorization of the backward step

#### Vectorization of the output layer backward step

##### Compute the error at the output

The cost function $\xi$ for $n$ samples and $C$ output units used in this model is the cross-entropy cost function, summed over every logistic output:

$$\xi(T,Y) = \sum_{i=1}^n \xi(\mathbf{t}_i,\mathbf{y}_i) = - \sum_{i=1}^n \sum_{c=1}^{C} \left[ t_{ic} \cdot \log(y_{ic}) + (1 - t_{ic}) \cdot \log(1 - y_{ic}) \right]$$

The error gradient $\delta_{o}$ of this cost function at the logistic output layer is simply:

$$\delta_{o} = \frac{\partial \xi}{\partial Z_o} = Y - T$$

With $Z_o$ an $n \times 2$ matrix of inputs to the logistic output layer ($Z_o = H \cdot W_o + b_o$), and $T$ an $n \times 2$ target matrix that corresponds to $Y$. Note that $\delta_{o}$ also results in an $n \times 2$ matrix.
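
The identity $\delta_o = Y - T$ can be checked numerically. The sketch below assumes logistic outputs and a cross-entropy that is summed over every output unit, which is what the `cost` function in this notebook's code computes; `Zo_toy` and `T_toy` are hypothetical toy values chosen only for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Y, T):
    # Cross-entropy summed over every sample and every output unit
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

# Hypothetical toy values: 3 samples, 2 output units
Zo_toy = np.array([[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5]])
T_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Analytical gradient: Y - T
delta_o = logistic(Zo_toy) - T_toy

# Central-difference estimate of d(cost)/d(Zo), one element at a time
eps = 1e-5
delta_num = np.zeros_like(Zo_toy)
for idx in np.ndindex(*Zo_toy.shape):
    Z_plus, Z_min = Zo_toy.copy(), Zo_toy.copy()
    Z_plus[idx] += eps
    Z_min[idx] -= eps
    delta_num[idx] = (cost(logistic(Z_plus), T_toy)
                      - cost(logistic(Z_min), T_toy)) / (2 * eps)

print(np.allclose(delta_o, delta_num, atol=1e-6))
```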

##### Update the output layer weights
At the output the gradient $\delta_{w_{oij}}$ is computed by ${\partial \xi}/{\partial w_{oij}}$. We worked out this formula for batch processing over all $n$ samples in part 2:

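$$\frac{\partial \xi}{\partial w_{oij}} = 
\frac{\partial Z_{o}}{\partial w_{oij}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = 
\frac{\partial Z_{o}}{\partial w_{oij}} \frac{\partial \xi}{\partial Z_o} =
\sum_{k=1}^n h_{ki} (y_{kj} - t_{kj}) = \sum_{k=1}^n h_{ki} \delta_{okj}$$

Here the sum runs over the $n$ samples $k$, $h_{ki}$ is the activation of hidden neuron $i$ for sample $k$, and $\delta_{okj}$ is the error of output $j$ for sample $k$.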

We can write this formula out as matrix operations with the parameters of the output layer:

$$\frac{\partial \xi}{\partial W_o} = H^T \cdot (Y-T) = H^T \cdot \delta_{o}$$

The resulting gradient is a $3 \times 2$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_{W_o} = \nabla_{W_o} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial w_{o11}} & \frac{\partial \xi}{\partial w_{o12}} \\
\frac{\partial \xi}{\partial w_{o21}} & \frac{\partial \xi}{\partial w_{o22}} \\
\frac{\partial \xi}{\partial w_{o31}} & \frac{\partial \xi}{\partial w_{o32}}
\end{bmatrix}$$

The output weights $W_o$ can then be updated with learning rate $\mu$ according to:

$$ W_o(k+1) = W_o(k) - \mu * J_{W_o} $$
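
To illustrate why the sum over samples collapses into a single matrix product, the following sketch compares an explicit triple loop with $H^T \cdot \delta_o$ on hypothetical random toy data (the names `H_toy`, `delta_o_toy`, `Wo_toy` and the value of `mu` are assumptions for this illustration), and then applies one update step.

```python
import numpy as np

rng = np.random.RandomState(0)   # hypothetical toy data, not the tutorial dataset
n = 5
H_toy = rng.rand(n, 3)           # hidden activations, n x 3
delta_o_toy = rng.randn(n, 2)    # output errors, n x 2

# Explicit form: for each weight (i, j), sum h_ki * delta_okj over samples k
J_loop = np.zeros((3, 2))
for k in range(n):
    for i in range(3):
        for j in range(2):
            J_loop[i, j] += H_toy[k, i] * delta_o_toy[k, j]

# Vectorized form: H^T . delta_o gives the same 3 x 2 matrix
J_vec = np.dot(H_toy.T, delta_o_toy)
print(np.allclose(J_loop, J_vec))

# One gradient descent step with a hypothetical learning rate mu
mu = 0.05
Wo_toy = np.zeros((3, 2))
Wo_next = Wo_toy - mu * J_vec
```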

##### Update the output layer bias
The bias $b_o$ can be updated in the same manner. The gradient $\delta_{b_{oi}}$ is computed by ${\partial \xi}/{\partial b_{oi}}$. We worked out this formula for batch processing over all $n$ samples in part 2:

$$\frac{\partial \xi}{\partial b_{oi}} = \frac{\partial Z_{o}}{\partial b_{oi}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = \sum_{k=1}^n 1 * (y_{ki} - t_{ki}) = \sum_{k=1}^n \delta_{oki}$$

The resulting gradient is a $1 \times 2$ Jacobian matrix:

$$J_{b_o} = \nabla_{b_o} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_{o1}} & \frac{\partial \xi}{\partial b_{o2}}
\end{bmatrix}$$

The output bias $b_o$ can then be updated with learning rate $\mu$ according to:

$$ b_o(k+1) = b_o(k) - \mu * J_{b_o} $$
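
The bias gradient is just a column-wise sum of the error matrix. The sketch below uses a hypothetical `delta_o_toy` and plain NumPy arrays; with plain arrays `keepdims=True` is needed to keep the $1 \times 2$ row shape, whereas summing an `np.matrix` along an axis already returns a 2D result.

```python
import numpy as np

# Hypothetical n x 2 matrix of output errors (n = 3)
delta_o_toy = np.array([[0.2, -0.2],
                        [-0.5, 0.5],
                        [0.1, -0.1]])

# Sum the errors of all samples per output unit;
# keepdims keeps the result as a 1 x 2 row vector
Jbo_toy = np.sum(delta_o_toy, axis=0, keepdims=True)
print(Jbo_toy.shape)  # (1, 2)

# Equivalent matrix form: a 1 x n row of ones times delta_o
ones_row = np.ones((1, delta_o_toy.shape[0]))
Jbo_matrix_form = np.dot(ones_row, delta_o_toy)
```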

In [None]:
# Define the cost function
def cost(Y, T):
    return - np.sum(np.multiply(T, np.log(Y)) + np.multiply((1-T), np.log(1-Y)))

# Define the error function at the output
def error_output(Y, T):
    return Y - T

# Define the gradient function for the weight parameters at the output layer
def gradient_weight_out(H, Eo): 
    return  H.T * Eo

# Define the gradient function for the bias parameters at the output layer
def gradient_bias_out(Eo): 
    return  np.sum(Eo, axis=0)

In [None]:
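# Compute the cost and the output layer gradients, and print their shapes
C = cost(Y, T)
print('C.shape: {}'.format(C.shape))

Eo = error_output(Y, T)
print('Eo.shape: {}'.format(Eo.shape))

JWo = gradient_weight_out(H, Eo)
print('JWo.shape: {}'.format(JWo.shape))

Jbo = gradient_bias_out(Eo)
print('Jbo.shape: {}'.format(Jbo.shape))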

#### Vectorization of the hidden layer backward step

##### Compute the error at the hidden layer

The error gradient $\delta_{h}$ of the cost function at the hidden layer is defined as:

$$\delta_{h} = \frac{\partial \xi}{\partial Z_h} = \frac{\partial H}{\partial Z_h} \frac{\partial \xi}{\partial H} = \frac{\partial H}{\partial Z_h} \frac{\partial Z_o}{\partial H} \frac{\partial \xi}{\partial Z_o}$$

With $Z_h$ an $n \times 3$ matrix of inputs to the logistic functions in the hidden neurons ($Z_h = X \cdot W_h + b_h$). Note that $\delta_{h}$ will also result in an $n \times 3$ matrix.

Let's first derive the error gradient $\delta_{hij}$ for one sample $i$ at hidden neuron $j$. The gradients that backpropagate from the output layer via the weighted connections are summed for each origin $h_{ij}$. 

$$\delta_{hij} = \frac{\partial \xi}{\partial z_{hij}} 
= \frac{\partial h_{ij}}{\partial z_{hij}} \frac{\partial z_{oi}}{\partial h_{ij}} \frac{\partial \xi}{\partial z_{oi}} 
= h_{ij} (1-h_{ij}) \sum_{k=1}^2 w_{ojk} (y_{ik}-t_{ik})
= h_{ij} (1-h_{ij}) [\delta_{oi} \cdot w_{oj}^T]$$

Where $w_{oj}$ is the $j$-th row of $W_o$, and thus a $1 \times 2$ vector and $\delta_{oi}$ is a $1 \times 2$ vector. The full $n \times 3$ error matrix $\delta_{h}$ can thus be calculated as:

$$\delta_{h} = \frac{\partial \xi}{\partial Z_h} = H \circ (1 - H) \circ [\delta_{o} \cdot W_o^T]$$

With $\circ$ the [elementwise product](http://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29). 
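
As a sanity check on this vectorization, the sketch below computes $\delta_h$ once element by element, following the per-sample formula, and once with the matrix expression. All values are hypothetical random toy data, and plain NumPy arrays are assumed, for which `*` is the elementwise product.

```python
import numpy as np

rng = np.random.RandomState(1)   # hypothetical toy data
n = 4
H_toy = rng.rand(n, 3)           # hidden activations in (0, 1)
Wo_toy = rng.randn(3, 2)         # output weights
delta_o_toy = rng.randn(n, 2)    # output errors

# Elementwise definition:
# delta_h[i, j] = h_ij * (1 - h_ij) * sum_k w_ojk * delta_o[i, k]
delta_h_loop = np.zeros((n, 3))
for i in range(n):
    for j in range(3):
        backprop_sum = sum(Wo_toy[j, k] * delta_o_toy[i, k] for k in range(2))
        delta_h_loop[i, j] = H_toy[i, j] * (1 - H_toy[i, j]) * backprop_sum

# Vectorized: H o (1 - H) o (delta_o . Wo^T)
delta_h_vec = H_toy * (1 - H_toy) * np.dot(delta_o_toy, Wo_toy.T)
print(np.allclose(delta_h_loop, delta_h_vec))
```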


##### Update the hidden layer weights
At the hidden layer the gradient ${\partial \xi}/{\partial w_{hij}}$ of each $w_{hij}$ over all $n$ samples can be computed by:

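$$\frac{\partial \xi}{\partial w_{hij}} = 
\frac{\partial Z_{h}}{\partial w_{hij}} \frac{\partial H}{\partial Z_{h}} \frac{\partial \xi}{\partial H} = 
\frac{\partial Z_{h}}{\partial w_{hij}} \frac{\partial \xi}{\partial Z_h} =
\sum_{k=1}^n x_{ki} \delta_{hkj}$$

Here the sum runs over the $n$ samples $k$, with $x_{ki}$ the $i$-th input variable of sample $k$ and $\delta_{hkj}$ the error at hidden neuron $j$ for sample $k$.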

We can write this formula out as matrix operations with the parameters of the hidden layer:

$$\frac{\partial \xi}{\partial W_h} = X^T \cdot \delta_{h}$$

The resulting gradient is a $2 \times 3$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_{W_h} = \nabla_{W_h} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial w_{h11}} & \frac{\partial \xi}{\partial w_{h12}} & \frac{\partial \xi}{\partial w_{h13}} \\
\frac{\partial \xi}{\partial w_{h21}} & \frac{\partial \xi}{\partial w_{h22}} & \frac{\partial \xi}{\partial w_{h23}} \\
\end{bmatrix}$$

The hidden layer weights $W_h$ can then be updated with learning rate $\mu$ according to:

$$ W_h(k+1) = W_h(k) - \mu * J_{W_h} $$

##### Update the hidden layer bias
The bias $b_h$ can be updated in the same manner. The gradient $\delta_{b_{hi}}$ is computed by ${\partial \xi}/{\partial b_{hi}}$. We worked out this formula for batch processing over all $n$ samples in part 2:

$$\frac{\partial \xi}{\partial b_{hi}} = \frac{\partial Z_{h}}{\partial b_{hi}} \frac{\partial H}{\partial Z_{h}} \frac{\partial \xi}{\partial H} 
= \sum_{k=1}^n \delta_{hki}$$

The resulting gradient is a $1 \times 3$ Jacobian matrix:

$$J_{b_h} = \nabla_{b_h} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_{h1}} & \frac{\partial \xi}{\partial b_{h2}} & \frac{\partial \xi}{\partial b_{h3}}
\end{bmatrix}$$

The hidden layer bias $b_h$ can then be updated with learning rate $\mu$ according to:

$$ b_h(k+1) = b_h(k) - \mu * J_{b_h} $$

In [None]:
# Define the error function at the hidden layer
def error_hidden(H, Wo, Eo):
    # H * (1-H) * (E . Wo^T)
    return np.multiply(np.multiply(H,(1 - H)), Eo.dot(Wo.T))

# Define the gradient function for the weight parameters at the hidden layer
def gradient_weight_hidden(X, Eh):
    return X.T * Eh

# Define the gradient function for the bias parameters at the hidden layer
def gradient_bias_hidden(Eh): 
    return  np.sum(Eh, axis=0)

In [None]:
# Compute the hidden layer gradients and print their shapes
Eh = error_hidden(H, Wo, Eo)
print('Eh.shape: {}'.format(Eh.shape))

JWh = gradient_weight_hidden(X, Eh)
print('JWh.shape: {}'.format(JWh.shape))

Jbh = gradient_bias_hidden(Eh)
print('Jbh.shape: {}'.format(Jbh.shape))

## Gradient checking

Programming the computation of the backpropagation gradient is prone to bugs. This is why it is recommended to [check the gradients](http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization) of your models. Gradient checking is done by computing the [numerical gradient](http://en.wikipedia.org/wiki/Numerical_differentiation) of each parameter and comparing this value to the gradient found by backpropagation.

The numerical gradient ${\partial \xi}/{\partial \theta_i}$ for a parameter $\theta_i$ can be computed by:

$$\frac{\partial \xi}{\partial \theta_i}
= \frac{f(\theta_1, \cdots, \theta_i+\epsilon, \cdots, \theta_m) - f(\theta_1, \cdots, \theta_i-\epsilon, \cdots, \theta_m)}{2\epsilon}$$

Where $f$ is the cost of the neural network output as a function of all parameters $\theta$, and $\epsilon$ is the small change that is used to perturb the parameter $\theta_i$. 

The numerical gradient for each parameter should be close to the backpropagation gradient for that parameter.
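
A minimal, generic sketch of this central-difference check is shown below. The helper `numerical_gradient` and the quadratic test function are hypothetical illustrations, not part of the tutorial's code; the test function is chosen because its exact gradient ($2\theta$) is known.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-4):
    """Central-difference estimate of the gradient of scalar function f at theta."""
    grad = np.zeros_like(theta, dtype=float)
    for idx in np.ndindex(*theta.shape):
        theta_plus = theta.astype(float)   # astype returns a copy
        theta_min = theta.astype(float)
        theta_plus[idx] += eps             # perturb a single parameter upwards
        theta_min[idx] -= eps              # and downwards
        grad[idx] = (f(theta_plus) - f(theta_min)) / (2 * eps)
    return grad

# Hypothetical test function: f(theta) = sum(theta^2), with gradient 2 * theta
theta_toy = np.array([[1.0, -2.0], [0.5, 3.0]])
grad_num = numerical_gradient(lambda th: np.sum(th ** 2), theta_toy)
print(np.allclose(grad_num, 2 * theta_toy))
```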

In [None]:
# Initialize weights and biases
init_var = 0.1
# Initialize hidden layer parameters
bh = np.random.randn(1, 3) * init_var
Wh = np.random.randn(2, 3) * init_var
# Initialize output layer parameters
bo = np.random.randn(1, 2) * init_var
Wo = np.random.randn(3, 2) * init_var

# Compute the gradients by backpropagation
# Compute the activations of the layers
H = hidden_activations(X, Wh, bh)
Y = output_activations(H, Wo, bo)
# Compute the gradients of the output layer
Eo = error_output(Y, T)
JWo = gradient_weight_out(H, Eo)
Jbo = gradient_bias_out(Eo)
# Compute the gradients of the hidden layer
Eh = error_hidden(H, Wo, Eo)
JWh = gradient_weight_hidden(X, Eh)
Jbh = gradient_bias_hidden(Eh)

In [None]:
# Combine all parameter matrices in a list
params = [Wh, bh, Wo, bo]
# Combine all parameter gradients in a list
grad_params = [JWh, Jbh, JWo, Jbo]

# Set the small change to compute the numerical gradient
eps = 0.0001

# Check each parameter matrix
for p_idx in range(len(params)):
    # Check each parameter in each parameter matrix
    for row in range(params[p_idx].shape[0]):
        for col in range(params[p_idx].shape[1]):
            # Copy the parameter matrix and change the current parameter slightly
            p_matrix_min = params[p_idx].copy()
            p_matrix_min[row,col] -= eps
            p_matrix_plus = params[p_idx].copy()
            p_matrix_plus[row,col] += eps
            # Copy the parameter list, and change the updated parameter matrix
            params_min = params[:]
            params_min[p_idx] = p_matrix_min
            params_plus = params[:]
            params_plus[p_idx] =  p_matrix_plus
            # Compute the numerical gradient
            grad_num = (cost(nn(X, *params_plus), T)-cost(nn(X, *params_min), T))/(2*eps)
            print('backprop gradient: {:.6f} ; numerical gradient {:.6f}'.format(grad_params[p_idx][row,col], grad_num))
            if not np.isclose(grad_num, grad_params[p_idx][row,col]):
                raise ValueError('Numerical gradient is not close to the backpropagation gradient!')

### Backpropagation updates
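
The update functions in this section apply gradient descent with a momentum term: each parameter keeps a velocity that accumulates past gradients, and that velocity is subtracted from the parameter. As a minimal sketch of the same rule, here it is applied to a hypothetical one-dimensional cost $c(w) = w^2$ (not the network's cost), using the same learning rate and momentum values as the training run in this section.

```python
# Hypothetical 1D cost c(w) = w^2, with gradient 2w
def grad(w):
    return 2.0 * w

w = 1.0               # parameter
v = 0.0               # velocity, starts at zero
learning_rate = 0.05
momentum_term = 0.9
costs = []

for _ in range(200):
    costs.append(w ** 2)
    # New velocity: decayed old velocity plus the scaled current gradient
    v = momentum_term * v + learning_rate * grad(w)
    # Move the parameter against the velocity
    w = w - v

print(costs[0], costs[-1])
```

With momentum the parameter can overshoot the minimum and oscillate, so the cost does not have to decrease at every single iteration even though it shrinks overall.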

In [None]:
# Define the update function to update the network parameters over 1 iteration
def backprop_gradients(X, T, Wh, bh, Wo, bo):
    # Compute the output of the network
    # Compute the activations of the layers
    H = hidden_activations(X, Wh, bh)
    Y = output_activations(H, Wo, bo)
    # Compute the gradients of the output layer
    Eo = error_output(Y, T)
    JWo = gradient_weight_out(H, Eo)
    Jbo = gradient_bias_out(Eo)
    # Compute the gradients of the hidden layer
    Eh = error_hidden(H, Wo, Eo)
    JWh = gradient_weight_hidden(X, Eh)
    Jbh = gradient_bias_hidden(Eh)
    return [JWh, Jbh, JWo, Jbo]

def update_velocity(X, T, ls_of_params, Vs, momentum_term, learning_rate):
    # ls_of_params = [Wh, bh, Wo, bo]
    # Js = [JWh, Jbh, JWo, Jbo]
    Js = backprop_gradients(X, T, *ls_of_params)
    return [momentum_term * V + learning_rate * J for V,J in zip(Vs, Js)]

def update_params(ls_of_params, Vs):
    # ls_of_params = [Wh, bh, Wo, bo]
    # Vs = [VWh, Vbh, VWo, Vbo]
    return [P - V for P,V in zip(ls_of_params, Vs)]

In [None]:
# Run backpropagation
# Initialize weights and biases
init_var = 0.1
# Initialize hidden layer parameters
bh = np.random.randn(1, 3) * init_var
Wh = np.random.randn(2, 3) * init_var
# Initialize output layer parameters
bo = np.random.randn(1, 2) * init_var
Wo = np.random.randn(3, 2) * init_var
# Note: this re-initializes the parameters that were used for the gradient checking
# Set the learning rate
learning_rate = 0.05
momentum_term = 0.9

# define the velocities Vs = [VWh, Vbh, VWo, Vbo]
Vs = [np.zeros_like(M) for M in [Wh, bh, Wo, bo]]

# Start the gradient descent updates and plot the iterations
nb_of_iterations = 200  # number of gradient descent updates
lr_update = learning_rate / nb_of_iterations # learning rate update rule
ls_costs = []  # list of cost over the iterations

for i in range(nb_of_iterations):
    # Add the current cost to the cost list
    current_cost = cost(nn(X, Wh, bh, Wo, bo), T)
    ls_costs.append(current_cost)
    # Update the velocities, then use them to update the parameters
    Vs = update_velocity(X, T, [Wh, bh, Wo, bo], Vs, momentum_term, learning_rate)
    Wh, bh, Wo, bo = update_params([Wh, bh, Wo, bo], Vs)

# Add the final cost to the cost list
ls_costs.append(cost(nn(X, Wh, bh, Wo, bo), T))
# Plot the cost over the iterations
plt.plot(ls_costs, 'b-')
plt.xlabel('iteration')
plt.ylabel('cost')
plt.title('Decrease of cost over backprop iteration')
plt.grid()
plt.show()

In [None]:
# Plot the resulting decision boundary
# Generate a grid over the input space to plot the color of the
#  classification at that grid point
nb_of_xs = 200
xs1 = np.linspace(-2, 2, num=nb_of_xs)
xs2 = np.linspace(-2, 2, num=nb_of_xs)
xx, yy = np.meshgrid(xs1, xs2) # create the grid
# Initialize and fill the classification plane
classification_plane = np.zeros((nb_of_xs, nb_of_xs))
for i in range(nb_of_xs):
    for j in range(nb_of_xs):
        pred = nn_predict(np.asmatrix([xx[i,j], yy[i,j]]), Wh, bh, Wo, bo)
        classification_plane[i,j] = pred[0,0]
# Create a color map to show the classification colors of each grid point
cmap = ListedColormap([
        colorConverter.to_rgba('r', alpha=0.30),
        colorConverter.to_rgba('b', alpha=0.30)])

# Plot the classification plane with decision boundary and input samples
plt.contourf(xx, yy, classification_plane, cmap=cmap)
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('$x_1$', fontsize=15)
plt.ylabel('$x_2$', fontsize=15)
plt.axis([-2, 2, -2, 2])
plt.title('red vs blue classification boundary')
plt.show()

In [None]:
H_blue = hidden_activations(x_blue, Wh, bh)
H_red = hidden_activations(x_red, Wh, bh)

# Plot the projection of the samples onto the 3 hidden layer dimensions
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(np.ravel(H_blue[:,0]), np.ravel(H_blue[:,1]), np.ravel(H_blue[:,2]), 'bo')
ax.plot(np.ravel(H_red[:,0]), np.ravel(H_red[:,1]), np.ravel(H_red[:,2]), 'ro')
ax.set_xlabel('$h_1$', fontsize=15)
ax.set_ylabel('$h_2$', fontsize=15)
ax.set_zlabel('$h_3$', fontsize=15)
ax.view_init(elev=10, azim=-30)
plt.title('Projection of the input X onto the hidden layer H')
plt.grid()
plt.show()