## Why a single layer is limited
In the previous chapter we saw that the single layer perceptron was unable to classify points as being inside or outside a circle centered around the origin. This was because the output of the model (the logits) only had connections directly from the input features.

Why exactly is this a limitation? Think about how the connections work in a single layer neural network. The neurons in the output layer have a connection coming in from each of the neurons in the input layer, but the connection weights are all just real numbers.

![title](img/single_layer_perceptron.png)

The single layer neural network architecture. The weight values are denoted $w_1$, $w_2$, and $w_3$.

Based on the diagram, the logits can be calculated as a linear combination of the input layer and weights:

$$\text{logits} = w_1 \cdot x + w_2 \cdot y + w_3$$

The above linear combination shows the single layer perceptron can model any linear boundary. However, the equation of the circle boundary is:

$$x^2 + y^2 = r^2$$

This equation contains squared terms in x and y, so it cannot be modeled by a single linear combination of the inputs.
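
One way to see why no choice of weights can work is sketched below in plain Python. The function name `linear_logit`, the example weights, and the convention that a positive logit means "inside the circle" are all assumptions made for illustration. For a linear logit, the values at a point and at its mirror image through the origin always sum to twice the logit at the origin. So if the origin is scored as inside, at least one of any two mirrored far-away points is scored as inside too.

```python
def linear_logit(x, y, w1, w2, w3):
    """Logit of a single layer perceptron: a linear combination of the inputs."""
    return w1 * x + w2 * y + w3

# Arbitrary example weights (assumed for illustration only).
w1, w2, w3 = 0.5, -1.5, 2.0

center = linear_logit(0.0, 0.0, w1, w2, w3)      # origin, inside the circle
far = linear_logit(10.0, 4.0, w1, w2, w3)        # a point far outside the circle
mirror = linear_logit(-10.0, -4.0, w1, w2, w3)   # its mirror image, also outside

# The two mirrored logits sum to twice the logit at the origin, so they
# cannot both be negative when the origin's logit is positive.
print(center, far, mirror)
print(far + mirror == 2 * center)
```

Trying other weights changes the individual values but not this relationship, which is why a straight-line boundary cannot enclose the circle.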

## Hidden layers

If a single linear combination doesn't work, what can we do? The answer is to add more linear combinations, as well as non-linearities. We add more linear combinations to our model by adding more layers. The single layer perceptron has only an input and output layer. We will now add an additional hidden layer between the input and output layers, officially making our model a multilayer perceptron. The hidden layer will have 5 neurons, which means that it will add an additional 5 linear combinations to the model's computation.

![title](img/multilayer_perception.png)

The multilayer perceptron architecture.

The 5 neuron hidden layer is more than enough for our circle example. However, for very complex datasets a model could have multiple hidden layers with hundreds of neurons per layer. In the next chapter, we discuss some tips on choosing the number of neurons and hidden layers in a model.
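
To make the "additional 5 linear combinations" concrete, here is a minimal sketch in plain Python of a 5-neuron hidden layer feeding one output neuron, with no activation function applied. The helper `dense` and all weight values are made up for illustration and are not the trained model's values. The sketch also shows why linear combinations alone are not enough: stacked linear layers still produce a function that is linear in x and y.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each neuron is a linear combination of the inputs."""
    return [sum(w * v for w, v in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]

# Made-up integer weights: 5 hidden neurons, each with a weight for x and for y.
hidden_w = [[1, 0], [0, 1], [1, 1], [1, -1], [-2, 1]]
hidden_b = [0, 1, -1, 2, 0]
# One output neuron with a weight for each of the 5 hidden neurons.
out_w = [[1, -1, 2, 1, 1]]
out_b = [3]

def forward_no_activation(x, y):
    hidden = dense([x, y], hidden_w, hidden_b)   # 5 linear combinations
    return dense(hidden, out_w, out_b)[0]        # 1 logit

# Without an activation function, moving one unit in x changes the logit
# by the same amount everywhere, which is the signature of a linear function.
step_x_near = forward_no_activation(1, 0) - forward_no_activation(0, 0)
step_x_far = forward_no_activation(8, 5) - forward_no_activation(7, 5)
print(step_x_near, step_x_far)
```

Both steps come out equal, so this network has the same limitation as the single layer perceptron until a non-linearity is added.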

## Non-linearity

We add non-linearities to our model through activation functions. These are non-linear functions that are applied within the neurons of a hidden layer. You've already seen an example of a non-linear activation function, the sigmoid function. We used this after the output layer of our model to convert the logits to probabilities.

Three of the most common activation functions used in deep learning are [tanh](https://en.wikipedia.org/wiki/Hyperbolic_function#Tanh), [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), and the aforementioned [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function). Each has its uses in deep learning, so it's normally best to choose activation functions based on the problem. However, the ReLU activation function has been shown to work well in most general-purpose situations, so we'll apply ReLU activation for our hidden layer.
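
For reference, the three functions can be sketched in a few lines of plain Python. This uses only the standard `math` module and is meant as an illustration of the definitions, not as the implementation TensorFlow uses. Sigmoid squashes its input into the range (0, 1), tanh into (-1, 1), and ReLU leaves positive values unchanged while mapping negative values to 0.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Keeps positive values and maps negative values to 0.
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```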

## ReLU

The equation for ReLU is very simple:

$$\text{ReLU}(x) = \max(0, x)$$

You might wonder why ReLU even works. While tanh and sigmoid are both inherently non-linear, the ReLU function seems pretty linear (it's just f(x) = 0 for x < 0 and f(x) = x for x ≥ 0). However, let's take a look at the following function:

$$f(x) = \text{ReLU}(x) + \text{ReLU}(-x) + \text{ReLU}(2x - 2) + \text{ReLU}(-2x - 2)$$

This is just a linear combination of ReLU terms. However, the graph it produces looks like this:

![title](img/relu_graph.png)

Though a bit rough on the edges, it looks somewhat like the quadratic function, $f(x) = x^2$. In fact, with enough linear combinations and ReLU activations, a model can closely approximate the quadratic transformation. This is exactly how our multilayer perceptron can learn the circle decision boundary.
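
The comparison can also be checked numerically. The short sketch below, in plain Python, evaluates the ReLU combination above next to the true quadratic at a few hand-picked sample points. The two agree at 0, ±1, and ±2 and drift apart elsewhere, which is the "rough on the edges" behavior seen in the graph.

```python
def relu(x):
    return max(0.0, x)

def f(x):
    # The linear combination of ReLU terms from the text.
    return relu(x) + relu(-x) + relu(2 * x - 2) + relu(-2 * x - 2)

# Compare the piecewise linear f(x) with x^2 at a few sample points.
for x in (-3.0, -2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"x={x:5.1f}  f(x)={f(x):4.1f}  x^2={x * x:5.2f}")
```

Adding more ReLU terms with different slopes and offsets would add more bends to the curve and tighten the fit.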

We've shown that ReLU can act as a non-linear activation function. So what makes it work well in general-purpose situations? Its aforementioned simplicity. The gradient of ReLU is exactly 1 for positive inputs, so it does not shrink as it is propagated back through many layers, which helps mitigate the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). Furthermore, the fact that it maps all negative values to 0 can help the model train faster and avoid overfitting (discussed in the next chapter).
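
A rough way to see the gradient argument is to compare the derivatives directly. The plain Python sketch below assumes a hypothetical chain of 10 layers and ignores the weights, which also multiply into the real gradient, so it only illustrates the contribution of the activation functions. The sigmoid derivative is at most 0.25, so multiplying it across layers shrinks quickly, while the ReLU derivative is exactly 1 for positive inputs.

```python
import math

def sigmoid_grad(x):
    # Derivative of the sigmoid function: s(x) * (1 - s(x)).
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x); the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

layers = 10  # hypothetical depth, chosen for illustration

# Best case for sigmoid: its derivative peaks at 0.25 when x = 0.
sigmoid_chain = sigmoid_grad(0.0) ** layers
# For positive inputs the ReLU derivative is exactly 1.
relu_chain = relu_grad(1.0) ** layers
print(sigmoid_chain, relu_chain)
```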

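```python
import tensorflow as tf

def model_layers(inputs, output_size):
  # Hidden layer: 5 neurons, each a linear combination of the inputs
  # followed by ReLU activation.
  hidden1_inputs = inputs
  hidden1 = tf.keras.layers.Dense(units=5,
                                  activation=tf.nn.relu,
                                  name='hidden1')(hidden1_inputs)

  # Output layer: no activation, so it returns the raw logits.
  logits_inputs = hidden1
  logits = tf.keras.layers.Dense(units=output_size,
                                 name='logits')(logits_inputs)
  return hidden1_inputs, hidden1, logits_inputs, logits
```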

After adding in the hidden layer to the model, the multilayer perceptron will be able to classify the 2-D circle dataset. Specifically, the classification plot will now look like:

![title](img/circle_classification.png)

Blue represents points the model thinks are outside the circle and red represents points the model thinks are inside. As you can see, the model is a lot more accurate now.