<a href="https://colab.research.google.com/github/lior5egal/Deep-Learning-0512-436201/blob/main/HW1/EX_1_DL_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Theory**
---


## **Question 1**
---
### a. What is the shape of the input $ X $?
The input $ X $ is a batch of size $ m $ with 10 features per sample. Therefore, its shape is:
$ X \in \mathbb{R}^{m \times 10} $

### b. What about the shape of the hidden layer's weight vector $ W_h $, and the shape of its bias vector $ b_h $?
- $ W_h $: This weight matrix maps the 10 input features to 50 hidden layer neurons. Its shape is:
$ W_h \in \mathbb{R}^{10 \times 50} $
- $ b_h $: This bias vector is added to the 50 neurons in the hidden layer. Its shape is:
$ b_h \in \mathbb{R}^{50} $

### c. What is the shape of the output layer's weight vector $ W_o $, and its bias vector $ b_o $?
- $ W_o $: This weight matrix maps the 50 hidden neurons to 3 output neurons. Its shape is:
$ W_o \in \mathbb{R}^{50 \times 3} $
- $ b_o $: This bias vector is added to the 3 output neurons. Its shape is:
$ b_o \in \mathbb{R}^{3} $

### d. What is the shape of the network's output matrix $ Y $?
The output matrix $ Y $ contains the predictions for $ m $ samples, each with 3 output values. Its shape is:
$ Y \in \mathbb{R}^{m \times 3} $
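The shapes in parts a through d can be sanity-checked numerically. The following is a minimal NumPy sketch; the batch size `m = 4` and the all-zero arrays are assumptions made only for illustration and carry no meaning beyond their shapes.

```python
import numpy as np

m = 4  # hypothetical batch size, chosen only for illustration

X = np.zeros((m, 10))     # m samples, 10 features each
W_h = np.zeros((10, 50))  # input features -> hidden neurons
b_h = np.zeros(50)        # one bias per hidden neuron
W_o = np.zeros((50, 3))   # hidden neurons -> output neurons
b_o = np.zeros(3)         # one bias per output neuron

# b_h has shape (50,) and is broadcast across the m rows of X @ W_h.
Z_h = X @ W_h + b_h
# Activations are omitted here because they do not change shapes.
Y = Z_h @ W_o + b_o

print(Z_h.shape)  # (4, 50)
print(Y.shape)    # (4, 3)
```

Because the bias vectors are one-dimensional, NumPy broadcasting adds the same bias to every sample in the batch, which matches the shapes stated above.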

### e. Write the equation that computes the network's output matrix $ Y $ as a function of $ X, W_h, b_h, W_o, $ and $ b_o $.
The computation of the output $ Y $ involves the following steps:
1. Compute the pre-activation for the hidden layer:
   $ Z_h = X W_h + b_h $
2. Apply the ReLU activation function:
   $ H = \text{ReLU}(Z_h) = \max(0, Z_h) $
3. Compute the pre-activation for the output layer:
   $ Z_o = H W_o + b_o $
4. (Optional) Apply a non-linear activation to $ Z_o $, if specified, such as softmax for classification.

In an MLP, the appropriate activation for the output layer depends on the task being solved.

**1. Regression Tasks**

If the goal is to predict real-valued outputs (e.g., house prices, temperatures):
- **Output layer activation**: **No activation function** (linear activation).  
  This allows the network to output any value in the real number range, positive or negative.
- **Loss function**: Mean Squared Error (MSE) or Mean Absolute Error (MAE).

$$
Y = Z_o = H W_o + b_o
$$

**2. Binary Classification (Two Classes)**

If the goal is to classify inputs into one of two classes (e.g., spam vs. not spam):
- **Output layer activation**: **Sigmoid**.  
  This squashes each output into the open interval $(0, 1)$, making it interpretable as the probability of the positive class.
- **Loss function**: Binary Cross-Entropy.

$$
Y = \text{Sigmoid}(Z_o) = \frac{1}{1 + e^{-Z_o}}, \quad Z_o = H W_o + b_o
$$

**3. Multi-Class Classification (More than Two Classes)**

If the goal is to classify inputs into one of multiple categories (e.g., dog, cat, bird):
- **Output layer activation**: **Softmax**.  
  This ensures that the output values are probabilities for each class, summing to 1.
- **Loss function**: Categorical Cross-Entropy.

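$$
Y^{(i)} = \text{Softmax}(Z_o)^{(i)} = \frac{e^{Z_o^{(i)}}}{\sum_{j} e^{Z_o^{(j)}}}
$$

Where $ Z_o $ is the output before activation, $ Y^{(i)} $ is the predicted probability of class $ i $, and $ j $ runs over all classes. For a batch, softmax is applied to each row (sample) of $ Z_o $ independently.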
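As an illustration of the softmax formula, here is a minimal NumPy sketch. The function name `softmax` and the sample logits are assumptions for this example. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the formula itself, and it does not change the result.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over an array of shape (m, num_classes)."""
    # Subtracting the row maximum leaves the result unchanged
    # but avoids overflow in np.exp for large logits.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits: two samples, three classes.
Z_o = np.array([[1.0, 2.0, 3.0],
                [1000.0, 1000.0, 1000.0]])
Y = softmax(Z_o)

print(Y.sum(axis=1))  # each row sums to 1, up to floating-point error
```

The second row shows why the shift helps: exponentiating 1000 directly would overflow, while the shifted version yields equal probabilities for the three classes.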

**4. Multi-Label Classification**

If the goal is to predict multiple independent binary labels for each input (e.g., predicting attributes like "male" and "smiling" for a face image):
- **Output layer activation**: **Sigmoid** for each output neuron.
- **Loss function**: Binary Cross-Entropy for each label.

$$
Y = \text{Sigmoid}(Z_o) = \text{Sigmoid}(H W_o + b_o)
$$
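To show how element-wise sigmoid outputs differ from softmax, here is a small NumPy sketch for the multi-label case. The logits and the 0.5 decision threshold are assumptions chosen for illustration; the threshold is a common default, not something the task prescribes.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up pre-activations for one sample with three independent labels.
Z_o = np.array([[2.0, -1.0, 0.0]])
Y = sigmoid(Z_o)

# Each label is decided on its own; 0.5 is an assumed threshold.
labels = (Y >= 0.5).astype(int)
print(labels)  # [[1 0 1]]
```

Unlike softmax, the entries of `Y` are not constrained to sum to 1, so several labels can be active for the same sample.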

**5. Specialized Tasks**

For other tasks (e.g., energy functions, constrained outputs), you might use custom activation functions or modifications, such as:
- **ReLU** for ensuring non-negative outputs (e.g., counting objects).
- **Tanh** if the outputs must lie in the range $(-1, 1)$.

---

### **General Rule**
- **For the Output Layer**: Choose the activation function based on the task requirements:
  - **None** for regression.
  - **Sigmoid** for binary classification.
  - **Softmax** for multi-class classification.

By tailoring the output layer's activation to the task, the MLP produces outputs appropriate for the problem being solved.

Therefore, the equation for $ Y $ is:


$ Y = \Phi(\text{ReLU}(X W_h + b_h) W_o + b_o) $

Where $ \Phi $ is the activation function of the output layer.
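The final equation can be sketched in NumPy as follows. The function names `relu`, `identity`, and `forward`, the batch size, and the random parameter values are all assumptions for illustration. The output activation is passed in as the argument `phi`, defaulting to the identity (the regression case).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(X, W_h, b_h, W_o, b_o, phi=identity):
    """Compute Y = phi(ReLU(X W_h + b_h) W_o + b_o)."""
    H = relu(X @ W_h + b_h)  # hidden activations, shape (m, 50)
    Z_o = H @ W_o + b_o      # output pre-activations, shape (m, 3)
    return phi(Z_o)

# Hypothetical batch and randomly initialised parameters.
rng = np.random.default_rng(0)
m = 5
X = rng.normal(size=(m, 10))
W_h = rng.normal(size=(10, 50))
b_h = np.zeros(50)
W_o = rng.normal(size=(50, 3))
b_o = np.zeros(3)

Y = forward(X, W_h, b_h, W_o, b_o)
print(Y.shape)  # (5, 3)
```

Swapping the task only requires a different `phi`, for example `forward(X, W_h, b_h, W_o, b_o, phi=np.tanh)` for outputs bounded in $(-1, 1)$.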