### What We'll Learn
Now that we know the basic algorithms underpinning Neural Networks, we need to learn how to design their architectures. In this lesson, we will learn how to:

![image.png](attachment:d4180a52-0901-4715-b2b7-992c4afb9e7f.png)

- Explain essential concepts in Neural Networks, including their origins
- Implement appropriate Neural Network architectures
- Distinguish between problems based on model objectives
- Design Neural Networks based on the decision boundaries in the data
  
Taken together, these skills, combined with knowledge of backpropagation and gradient descent, give us the ability to design our Neural Networks to solve our problems.


### Origins of the Term Neural Network

Neural Networks get their name from the fact that they are loosely modeled after biological neurons. Their basic unit, the perceptron, takes inputs, performs calculations on them, and decides whether to return one result or another (e.g., a one or a zero).

![image.png](attachment:f5acd343-7091-4a51-944e-7898b4cfaf0e.png)

In a similar way, neurons in the brain receive inputs (such as signals from other neurons) through their branching dendrites, and then decide whether to, in turn, send out a signal of their own.

Similar to how real neurons can be connected to one another to form layers, we will be chaining our perceptrons together: layering multiple perceptrons so that the output of one becomes the input of another.

### Multilayer Perceptrons are Neural Networks
The perceptron and neural networks are inspired by biological neurons. Though modern "perceptrons" use the Logistic Sigmoid Function or other activation functions, classical perceptrons use a step function.
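As a minimal sketch of this difference, the snippet below runs the same perceptron twice: once with a step function and once with the logistic sigmoid. The weights and bias are made-up values chosen for illustration (they make the unit behave like a logical AND), and the function names are hypothetical.

```python
import math

def step(z):
    """Classical perceptron activation: a hard 0/1 decision."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Logistic sigmoid: a smooth value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative weights and bias (assumed, not taken from the lesson).
weights, bias = [1.0, 1.0], -1.5

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hard = perceptron(point, weights, bias, step)
    soft = perceptron(point, weights, bias, sigmoid)
    print(point, hard, round(soft, 3))
```

The step version only says "yes" or "no", while the sigmoid version also tells us how close a point is to the boundary, which is what makes gradient-based training possible.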

![image.png](attachment:b30c32e9-df12-48c6-9eee-06a42030f7e4.png)

Neural Networks are a more general class of models that includes multi-layer perceptrons. They are defined by having one or more hidden layers and an output layer that emits a decision: either a predicted value, a probability, or a vector of probabilities, depending on the task.

### Neural Network Architecture

![image.png](attachment:804c1ae0-5214-42d3-be18-fdfad031fbd9.png) ![image.png](attachment:13d5dd7c-4a9c-4db7-be7a-7303e998a08f.png)

![image.png](attachment:0b80d3ab-9508-43aa-97ad-e78da2f8c181.png)

__Combining Models:__ We will combine two linear models to get our non-linear model. Essentially, the steps to do this are:

* Calculate the probability for each model
* Apply weights to the probabilities
* Add the weighted probabilities
* Apply the sigmoid function to the result
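The four steps above can be sketched in a few lines of plain Python. All numbers here (the two linear models, the mixing weights, and the extra bias term added before the final sigmoid) are assumptions made up for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_model_probability(x, weights, bias):
    """Step 1: probability from one linear model followed by a sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def combine(x, model_a, model_b, mix_weights, mix_bias):
    """Steps 2-4: weight the two probabilities, add them, apply a sigmoid."""
    p_a = linear_model_probability(x, *model_a)
    p_b = linear_model_probability(x, *model_b)
    weighted_sum = mix_weights[0] * p_a + mix_weights[1] * p_b + mix_bias
    return sigmoid(weighted_sum)

# Each model is a (weights, bias) pair; the values are illustrative only.
model_a = ([5.0, -2.0], -8.0)
model_b = ([7.0, -3.0], 1.0)
point = (0.4, 0.6)

probability = combine(point, model_a, model_b, mix_weights=(7.0, 5.0), mix_bias=-6.0)
print(probability)
```

This is already a tiny neural network: the two linear models are the hidden layer, and the final weighted sum plus sigmoid is the output layer.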


### Multiple layers
Now, not all neural networks look like the one above. They can be way more complicated! In particular, we can do the following things:

- Add more nodes to the input, hidden, and output layers.
- Add more layers.


![image.png](attachment:729b7ada-1ee6-4f90-ac3c-584a41d91b8a.png)

Neural networks have a certain special architecture with layers:

- The first layer is called the input layer, which contains the inputs.
- The next layer is called the hidden layer, which is the set of linear models created with the input layer.
- The final layer is called the output layer, which is where the linear models get combined to obtain a nonlinear model.

Neural networks can have different architectures, with varying numbers of nodes and layers:

__Input nodes.__ In general, if we have `n` nodes in the input layer, then we are modeling data in n-dimensional space (e.g., 3 nodes in the input layer means we are modeling data in 3-dimensional space).

__Output nodes.__ If there are more nodes in the output layer, this simply means we have more outputs—for example, we may have a multiclass classification model.

__Layers.__ If there are more layers then we have a deep neural network. Our linear models combine to create nonlinear models, which then combine to create even more nonlinear models!
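To make the effect of these choices concrete, here is a small sketch that builds the weight matrices for a fully connected network from a list of layer sizes and counts its trainable parameters. The helper names and the random initialization range are assumptions for illustration only.

```python
import random

def init_layer(n_in, n_out, rng):
    """One fully connected layer: n_out rows of n_in weights, plus n_out biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def build_network(layer_sizes, seed=0):
    """layer_sizes lists the node count of every layer, input layer first."""
    rng = random.Random(seed)
    pairs = zip(layer_sizes, layer_sizes[1:])
    return [init_layer(n_in, n_out, rng) for n_in, n_out in pairs]

def count_parameters(network):
    return sum(len(W) * len(W[0]) + len(b) for W, b in network)

# 3 input nodes (3-dimensional data), 4 hidden nodes, 2 output nodes.
network = build_network([3, 4, 2])
for i, (W, b) in enumerate(network, start=1):
    print(f"layer {i}: weight matrix {len(W)}x{len(W[0])}, {len(b)} biases")
print("trainable parameters:", count_parameters(network))
```

Adding nodes widens a weight matrix, while adding layers appends another matrix to the list, so a deeper list such as `[3, 8, 8, 2]` gives a deep neural network.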

### Multi-Class Classification
Here we look more closely at what to do when our neural network needs to produce more than one output.
Note: The softmax mentioned in the video is the activation function used for multiclass classification, which we will cover shortly in this lesson.

When we have three or more classes, we could construct three separate neural networks, one for predicting each class. However, this is not necessary. Instead, we can add more nodes in the output layer. Each of these nodes will give us the probability that the item belongs to the given class.

![image.png](attachment:b13ea096-0091-4387-a6d0-92ab9991496c.png)


**Feedforward Process in Neural Networks:**

Feedforward is the process neural networks use to turn the input into an output. In general terms, the process looks like this:

1. Take the input vector.
2. Apply a sequence of linear models and sigmoid functions.
3. Combine maps to create a highly non-linear map.

The general feedforward formula is:

$$
\hat{y} = \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)
$$

Where:
- $\hat{y}$ is the predicted output.
- $\sigma$ is the activation function (e.g., sigmoid).
- $W^{(1)}$ and $W^{(2)}$ are the weight matrices for the layers.
- $x$ is the input vector.
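The formula can be read right to left as code. The sketch below is one possible plain-Python rendering for a network with two inputs, three hidden units, and one output. The weight values are arbitrary, and biases are left out to match the formula as written (in practice they are often folded into the weight matrices by appending a constant 1 to the input).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Matrix-vector product W x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def feedforward(x, W1, W2):
    """y_hat = sigma(W2 sigma(W1 x)), applied layer by layer."""
    hidden = [sigmoid(z) for z in matvec(W1, x)]    # first linear map + sigmoid
    return [sigmoid(z) for z in matvec(W2, hidden)]  # second linear map + sigmoid

# Arbitrary illustrative weights.
W1 = [[0.5, -0.2],
      [0.1, 0.4],
      [-0.3, 0.8]]          # 3 hidden units, 2 inputs
W2 = [[1.0, -1.0, 0.5]]     # 1 output, 3 hidden units

y_hat = feedforward([1.0, 2.0], W1, W2)
print(y_hat)
```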


![image.png](attachment:ac6b48d9-50f7-4f86-b770-670787615092.png)

### Activation Functions

#### Activation Function Properties
There are a wide variety of activation functions that we can use. Activation functions should be:

- Nonlinear
- Differentiable -- preferably everywhere
- Monotonic
- Close to the identity function at the origin
  
  ![image.png](attachment:df1e4f50-0967-44dd-a3d3-f17c4504b3e1.png)
  
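We can loosen these restrictions slightly. For example, ReLU is not differentiable at the origin. Monotonicity is usually kept, although some widely used activations such as GELU and Swish are not strictly monotonic. Nonlinearity, by contrast, cannot be relaxed: a stack of purely linear layers collapses into a single linear model.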

# Different Activation Functions in Neural Networks

Activation functions determine the output of a neural network node, which, in turn, affects the learning process. Below are some common activation functions, including their properties and use cases.

## 1. Sigmoid Activation Function

The **Sigmoid** function is shaped like an "S" and squashes the input values between 0 and 1. The formula for the sigmoid function is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

- **Range**: (0, 1)
- **Advantages**: Its output can be interpreted as a probability, which makes it useful for binary classification tasks.
- **Disadvantages**: It suffers from the **vanishing gradient problem**: when inputs are large in magnitude (strongly positive or strongly negative), the function saturates and the gradient becomes almost zero, which slows down learning.
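A quick numerical check makes the saturation visible. The sketch below uses the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ for the derivative; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 when the input is 0 and shrinks toward zero on both sides, so a saturated unit passes almost no learning signal backward.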

## 2. Tanh (Hyperbolic Tangent) Activation Function

The **Tanh** function is similar to the sigmoid function but outputs values between -1 and 1. The formula is:

$$
\tanh(x) = \frac{2}{1 + e^{-2x}} - 1
$$

- **Range**: (-1, 1)
- **Advantages**: Its output is zero-centered, which tends to make optimization easier than with sigmoid. Negative inputs map to negative outputs, and inputs near zero map to outputs near zero.
- **Disadvantages**: Like sigmoid, it saturates and suffers from the vanishing gradient problem for inputs that are large in magnitude.
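As a sanity check on the formula, the sketch below compares it against Python's built-in `math.tanh` and against the equivalent form $2\sigma(2x) - 1$, which shows that tanh is a rescaled and shifted sigmoid. The function name `tanh_from_formula` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x):
    """tanh written exactly as in the formula above."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    # Agrees with the library implementation...
    assert math.isclose(tanh_from_formula(x), math.tanh(x), abs_tol=1e-12)
    # ...and with the "rescaled sigmoid" view of tanh.
    assert math.isclose(tanh_from_formula(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)
    print(f"x={x:5.1f}  tanh={tanh_from_formula(x):.5f}")
```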


## 3. ReLU (Rectified Linear Unit) Activation Function

The **ReLU** function outputs the input if it's positive, and zero otherwise. Its formula is simple:

$$
\text{ReLU}(x) = \max(0, x)
$$

- **Range**: [0, ∞)
- **Advantages**: ReLU is computationally efficient and helps mitigate the vanishing gradient problem, because its gradient is 1 for every positive input. It is one of the most commonly used activation functions in deep learning today.
- **Disadvantages**: It can suffer from the **dying ReLU problem**: the gradient is zero for negative inputs, so a neuron whose weighted input is negative for every training example outputs 0, receives no gradient, and stops learning.
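The sketch below shows ReLU, its gradient, and a toy "dead" neuron. The neuron's weight, bias, and inputs are hypothetical values picked so that every weighted input is negative; the gradient at exactly 0 is set to 0 by convention.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; the value at exactly 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

print(relu(-3.0), relu(2.5))            # negative inputs are clipped to 0
print(relu_grad(-3.0), relu_grad(2.5))  # gradient is 0 or 1, never tiny

# A hypothetical neuron whose bias has drifted very negative.
weight, bias = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 3.0]
dead_gradients = [relu_grad(weight * x + bias) for x in inputs]
print(dead_gradients)  # all zeros: gradient descent cannot update this neuron
```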


## 4. Leaky ReLU Activation Function

The **Leaky ReLU** function is a variation of ReLU that allows a small, non-zero gradient when the input is negative. The formula is:

$$
\text{LeakyReLU}(x) = \max(\alpha x, x)
$$

Where $\alpha$ is a small constant (like 0.01).

- **Range**: (-∞, ∞)
- **Advantages**: It helps avoid the dying ReLU problem by allowing a small negative slope.
- **Disadvantages**: It is still used less often in practice than plain ReLU.
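A minimal sketch of Leaky ReLU and its gradient follows, using the example value of 0.01 for the slope. Note that writing it as a `max` only works when the slope is between 0 and 1.

```python
def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x); valid as written for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Slope is 1 for positive inputs and alpha otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x={x:5.1f}  output={leaky_relu(x):7.3f}  gradient={leaky_relu_grad(x)}")
```

Because the gradient for negative inputs is small but non-zero, a neuron that lands in the negative region can still be nudged back by gradient descent.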

## 5. Softmax Activation Function

The **Softmax** function is typically used in the output layer for classification problems with multiple classes. It converts raw output scores into probabilities that sum to 1. The formula is:

$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

- **Range**: (0, 1) for each class, and the sum of all outputs is 1.
- **Advantages**: Useful for multi-class classification tasks where each output neuron represents a class.
- **Disadvantages**: Like sigmoid, it can saturate and cause gradients to vanish, slowing learning.
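Here is a minimal softmax sketch in plain Python. It subtracts the largest score before exponentiating, a common trick (an addition here, not part of the formula above) that leaves the result unchanged but avoids overflow for large scores. The example scores are arbitrary.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)                       # for numerical stability only
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Without the shift, math.exp(1000.0) would overflow.
print(softmax([1000.0, 1000.0]))
```

The largest score always gets the largest probability, and adding the same constant to every score does not change the output.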


### Summary Table

| Activation Function | Range       | Key Properties                                     | Common Use Cases                             |
|---------------------|-------------|----------------------------------------------------|---------------------------------------------|
| Sigmoid             | (0, 1)      | Smooth, differentiable, but has vanishing gradient | Binary classification tasks                 |
| Tanh                | (-1, 1)     | Strong negative and positive outputs               | RNNs, improves learning over sigmoid        |
| ReLU                | [0, ∞)      | Efficient, mitigates vanishing gradient            | Most deep learning models (CNNs, DNNs)      |
| Leaky ReLU          | (-∞, ∞)     | Fixes the dying ReLU issue                         | Variant of ReLU, used when ReLU isn't ideal |
| Softmax             | (0, 1)      | Outputs probabilities summing to 1                 | Multi-class classification                  |

These activation functions are essential for neural networks to learn patterns in data, and choosing the right one depends on the task and the network architecture.


## How to Choose an Output Function

The choice of an output function depends on two primary factors about what you're trying to predict:
- **Shape**: What form your output takes (single value, multiple values).
- **Range**: What values your output can take (bounded or unbounded).

#### Types of Problems:
Your output function is largely determined by whether you're doing **classification** or **regression**:

1. **Classification**: Predicting categories or labels (e.g., cat vs. dog).
2. **Regression**: Predicting continuous values (e.g., house prices).

#### Common Output Functions:

1. **Sigmoid** for Binary Classification:
   - Used when there are only two possible classes (e.g., yes/no, 0/1).
   - It outputs a value between 0 and 1, which can be interpreted as a probability.
   - Formula:
     $$
     \sigma(x) = \frac{1}{1 + e^{-x}}
     $$

2. **Softmax** for Multi-Class Classification:
   - Used when there are more than two classes.
   - It converts the output scores into probabilities that sum to 1 across the classes.
   - Formula:
     $$
     \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
     $$

3. **Identity** or **ReLU** for Regression:
   - For regression tasks, where you're predicting continuous values, the identity function or ReLU is often used.
   - The **Identity function** just returns the input as the output (no transformation):
     $$
     f(x) = x
     $$
   - **ReLU** can also be used to ensure that your output is non-negative:
     $$
     \text{ReLU}(x) = \max(0, x)
     $$

#### Summary:
- **Binary Classification**: Use **Sigmoid**.
- **Multi-Class Classification**: Use **Softmax**.
- **Regression**: Use **Identity** or **ReLU**.

Choosing the right output function ensures your neural network behaves appropriately based on the task you're solving.
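The selection rule can be written down as a small dispatch function. This is only a sketch: the task names (`"binary"`, `"multiclass"`, `"regression"`, `"nonnegative_regression"`) and the function `apply_output_function` are hypothetical labels invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def apply_output_function(task, raw_output):
    """Map the raw output of the last layer to a prediction.

    The task names are hypothetical labels used only in this sketch.
    """
    if task == "binary":
        return sigmoid(raw_output)        # one score -> probability
    if task == "multiclass":
        return softmax(raw_output)        # list of scores -> probabilities
    if task == "regression":
        return raw_output                 # identity: unbounded value
    if task == "nonnegative_regression":
        return max(0.0, raw_output)       # ReLU: value in [0, inf)
    raise ValueError(f"unknown task: {task!r}")

print(apply_output_function("binary", 0.0))
print(apply_output_function("multiclass", [1.0, 1.0, 1.0]))
print(apply_output_function("regression", -3.2))
print(apply_output_function("nonnegative_regression", -3.2))
```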


![image.png](attachment:b9dc3fe4-19dd-4299-971a-1a25f033f060.png)

### Nonlinear and High Dimensional Decision Boundaries
Our data will dictate what the shape of our decision boundary should be. Neural networks will be able to find decision boundaries that are high-dimensional and nonlinear by combining the decision boundaries of the hidden neurons. Even in cases where we cannot visualize our decision boundary easily, knowing the approximate complexity of our decision boundary will inform how big our model needs to be.
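A classic small illustration is XOR, which no single straight line can separate. The sketch below uses hand-picked (not trained) weights and a step activation: one hidden neuron draws an "OR" line, the other a "NAND" line, and the output neuron keeps only the region where both agree. All weights and names are assumptions chosen for this example.

```python
def step(z):
    return 1 if z >= 0 else 0

def neuron(inputs, weights, bias):
    """A single perceptron with a step activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_network(x1, x2):
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)      # boundary: x1 + x2 = 0.5
    h_nand = neuron((x1, x2), (-1.0, -1.0), 1.5)   # boundary: x1 + x2 = 1.5
    # The output neuron fires only when both hidden neurons fire (logical AND).
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", xor_network(x1, x2))
```

Two hidden neurons are enough here because the positive region is a band between two lines; a more intricate boundary would need more hidden neurons or more layers.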

![image.png](attachment:2daf000e-5864-43fd-987f-2e68f450cbff.png) ![image.png](attachment:ab869463-e2c2-4d3b-a37a-c3396e0bc126.png)
 

### Glossary

![image.png](attachment:a6ac8bc5-8f86-4d01-bf77-4d1c832d02b0.png)