# Types of Layers in Neural Networks

---

## **1. Input Layer**
### **Purpose**:
   - Acts as the entry point for the raw input data into the network. It doesn’t perform any computation but forwards the data to the subsequent layers.

### **Structure**:
   - The number of neurons (nodes) corresponds to the number of features in the input data.
   - For example:
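     - In image data, if the input is a 28×28 grayscale image that is flattened for a dense network, the input layer will have $28 \times 28 = 784$ neurons. Convolutional networks usually keep the 2D shape (28×28×1) instead.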
     - For tabular data, if there are 10 features, the input layer will have 10 neurons.
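
The flattening step for the image example can be sketched in plain Python. The `image` variable is a hypothetical placeholder filled with zeros, used only to show the shape arithmetic.

```python
# A hypothetical 28x28 grayscale image: 28 rows of 28 pixel values.
image = [[0.0 for _ in range(28)] for _ in range(28)]

# Flatten row by row into one feature vector, one value per input neuron.
flat = [pixel for row in image for pixel in row]

print(len(flat))  # 784
```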

---

## **2. Dense (Fully Connected) Layer**
### **Purpose**:
   - Connects every neuron in one layer to every neuron in the next layer.
   - Learns one weight per connection and one bias per neuron, then applies an activation function to the weighted sum.

### **Mechanics**:
   - The output of a neuron is calculated as:
     $y = \sigma\left(\sum_{i=1}^n w_i x_i + b\right)$
     where:
     - $w_i$: Weight for the $i$-th input.
     - $x_i$: Value of the $i$-th input.
     - $b$: Bias.
     - $\sigma$: Activation function.
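
As a minimal sketch of the formula above, the following plain-Python code computes a single neuron and then a small dense layer. The function names (`neuron`, `dense_layer`), the choice of sigmoid as $\sigma$, and the example weights are assumptions made for illustration, not part of any framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Compute y = activation(sum_i w_i * x_i + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

def dense_layer(x, weights, biases, activation=sigmoid):
    """One neuron per row of `weights`; every neuron reads the same input x."""
    return [neuron(x, w, b, activation) for w, b in zip(weights, biases)]

x = [1.0, 2.0]
weights = [[0.5, -0.25],   # neuron 1: 0.5*1 - 0.25*2 + 0 = 0
           [1.0, 1.0]]     # neuron 2: 1*1 + 1*2 - 3 = 0
biases = [0.0, -3.0]

out = dense_layer(x, weights, biases)
print(out)  # [0.5, 0.5], since sigmoid(0) = 0.5
```

Real frameworks compute the same thing as a single matrix-vector product rather than looping over neurons.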

### **Applications**:
   - Used in feedforward neural networks for tasks like classification and regression.

---

## **3. Activation Layer**
### **Purpose**:
   - Introduces non-linearity into the network, enabling it to learn complex patterns and relationships in the data.

### **Common Activation Functions**:
1. **ReLU (Rectified Linear Unit)**:
   - Formula: $f(x) = \max(0, x)$.
   - Benefits:
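     - Mitigates the vanishing gradient problem, because the gradient is exactly 1 for all positive inputs.
     - Computationally efficient.
   - Drawback:
     - Can cause "dead" neurons: if a neuron's pre-activation is negative for every input, its output and gradient are both 0, so its weights stop updating.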

2. **Sigmoid**:
   - Formula: $f(x) = \frac{1}{1 + e^{-x}}$.
   - Benefits:
     - Squashes output to the range (0, 1), suitable for probabilities.
   - Drawback:
     - Saturates for inputs of large magnitude (positive or negative), where the gradient approaches zero and causes vanishing gradients.

3. **Tanh (Hyperbolic Tangent)**:
   - Formula: $f(x) = \tanh(x)$.
   - Benefits:
     - Outputs in range (-1, 1), making it zero-centered.
   - Drawback:
     - Saturates in the same way as sigmoid, so it also suffers from vanishing gradients.

4. **Softmax**:
   - Formula: $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$.
   - Benefits:
     - Converts a vector of logits into a probability distribution (non-negative values that sum to 1).
     - Commonly used in the output layer for multi-class classification.
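
The four functions listed above can be written in a few lines of plain Python. This is a sketch for illustration: the branch in `sigmoid` and the max-subtraction in `softmax` are standard numerical-stability tricks added here, and they do not change the mathematical result.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Two algebraically equal forms; each avoids overflow in exp() on its branch.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    # Subtracting the maximum leaves the output unchanged but keeps exp() small.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), sigmoid(0.0), tanh(0.0))  # 0.0 0.5 0.0
print(sum(probs))                            # approximately 1.0
```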

---

## **4. Convolutional Layer**
### **Purpose**:
   - Extracts spatial and hierarchical features from data, particularly effective for image and video processing.

### **Mechanics**:
   - Applies convolution operations using filters (kernels).
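   - A filter slides over the input data, performing element-wise multiplication and summation at each position:
     $(I * K)[i, j] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I[i+m, j+n] \cdot K[m, n]$
     where:
     - $I$: Input matrix.
     - $K$: Kernel matrix.
     - $M, N$: Height and width of the kernel.
   - Strictly speaking, this formula is cross-correlation (the kernel is not flipped), which is what most deep learning frameworks compute under the name "convolution".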
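
The sliding-window sum can be sketched directly as nested loops. This version assumes stride 1, no padding, a single channel, and kernel indices running over $0..M-1$ and $0..N-1$ for an $M \times N$ kernel; the function name `conv2d_valid` and the example values are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    H, W = len(image), len(image[0])
    M, N = len(kernel), len(kernel[0])
    out = []
    for i in range(H - M + 1):          # top-left row of the window
        row = []
        for j in range(W - N + 1):      # top-left column of the window
            s = 0.0
            for m in range(M):
                for n in range(N):
                    s += image[i + m][j + n] * kernel[m][n]
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

result = conv2d_valid(image, kernel)
print(result)  # [[-4.0, -4.0], [-4.0, -4.0]]
```

Each output value here is the top-left pixel of the window minus its bottom-right pixel, which is why a 3×3 input and a 2×2 kernel give a 2×2 output.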

### **Parameters**:
   - **Kernel Size**: Dimensions of the filter (e.g., 3×3, 5×5).
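   - **Stride**: Number of positions the filter moves between successive applications.
   - **Padding**:
     - **Valid**: No padding; the output is smaller than the input.
     - **Same**: Zero-padding is added so that, with stride 1, the output has the same spatial dimensions as the input.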
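
How these parameters determine the output size along one dimension can be sketched as below. The formulas follow one common convention (the one TensorFlow/Keras uses for `valid` and `same` padding); other frameworks specify padding as an explicit number of pixels, so treat the helper `conv_output_size` as an illustrative assumption.

```python
import math

def conv_output_size(n, k, stride=1, padding="valid"):
    """Output length for input length n and kernel length k along one axis."""
    if padding == "valid":
        if k > n:
            raise ValueError("kernel is larger than the input")
        return (n - k) // stride + 1
    if padding == "same":
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(28, 3))                            # 26
print(conv_output_size(28, 3, padding="same"))            # 28
print(conv_output_size(28, 3, stride=2))                  # 13
print(conv_output_size(28, 3, stride=2, padding="same"))  # 14
```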

### **Applications**:
   - Image classification, object detection, and style transfer.

---

## **5. Pooling Layer**
### **Purpose**:
   - Reduces the spatial dimensions of feature maps, which lowers computational cost, adds some tolerance to small shifts in the input, and can help mitigate overfitting.

### **Types**:
1. **Max Pooling**:
   - Extracts the maximum value in a pooling window.
2. **Average Pooling**:
   - Computes the average value in a pooling window.

### **Parameters**:
   - Pooling size (e.g., 2×2, 3×3).
   - Stride.
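
Both pooling types can be sketched with one small function. The name `pool2d`, its `mode` argument, and the example feature map are assumptions for illustration; the sketch handles a single channel and ignores any partial window at the border.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling over size x size windows."""
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, H - size + 1, stride):
        row = []
        for j in range(0, W - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [1, 8, 3, 4]]

print(pool2d(fmap, mode="max"))      # [[6, 4], [8, 9]]
print(pool2d(fmap, mode="average"))  # [[3.75, 2.25], [4.5, 4.0]]
```

With a 2×2 window and stride 2, each spatial dimension is halved, so the 4×4 map becomes 2×2.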

---

## **6. Recurrent Layer**
### **Purpose**:
   - Processes sequential data by retaining information about previous inputs through hidden states.

### **Types**:
1. **Simple RNN**:
   - Computes the hidden state at time $t$ as:
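     $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$
     where $\sigma$ is an activation function (commonly $\tanh$), $h_{t-1}$ is the previous hidden state, and $x_t$ is the current input.
   - Struggles with long-term dependencies, because gradients tend to vanish or explode when propagated back through many time steps.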

2. **LSTM (Long Short-Term Memory)**:
   - Handles long-term dependencies using gates:
     - Forget gate: Decides which information to discard.
     - Input gate: Decides which information to store.
     - Output gate: Decides what to output.

3. **GRU (Gated Recurrent Unit)**:
   - Similar to LSTM but with only two gates (update and reset) and no separate cell state, so it has fewer parameters.
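
The simple RNN update $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$ can be sketched as a loop over time steps. This example assumes $\tanh$ for $\sigma$ and uses small hand-picked matrices; `matvec` and `rnn_step` are hypothetical helper names. LSTM and GRU cells follow the same loop structure but add gating inside the step.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One time step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    rec = matvec(W_h, h_prev)   # contribution of the previous hidden state
    inp = matvec(W_x, x_t)      # contribution of the current input
    return [math.tanh(r + i + bi) for r, i, bi in zip(rec, inp, b)]

W_h = [[0.5, 0.0],
       [0.0, 0.5]]
W_x = [[1.0],
       [-1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]                        # initial hidden state
for x_t in [[1.0], [0.0], [0.0]]:     # a short input sequence
    h = rnn_step(h, x_t, W_h, W_x, b)

print(h)  # the effect of the first input is still present, but has shrunk
```

The input at the first step keeps influencing later hidden states, but its effect decays at every step, which hints at why simple RNNs struggle with long-range dependencies.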

---

## **7. Dropout Layer**
### **Purpose**:
   - Reduces overfitting by randomly deactivating a fraction of neurons during training.

### **Mechanics**:
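   - During training:
     - Each neuron is kept active independently with probability $p$.
     - Otherwise, its output is set to 0.
   - At inference:
     - All neurons are active, and outputs are scaled by $p$ so that their expected value matches what the next layer saw during training.
   - Most modern frameworks use the equivalent "inverted" form instead: kept outputs are scaled by $1/p$ during training, and inference needs no scaling. Note that many APIs take the drop probability $1 - p$ as their rate argument rather than the keep probability.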
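
The training and inference behaviour can be sketched as follows, with $p$ as the keep probability. The function names are hypothetical. The third function shows the "inverted" variant, an equivalent formulation that rescales by $1/p$ during training so that inference needs no scaling; it is included here as an additional illustration.

```python
import random

def dropout_train(x, p, rng):
    """Keep each activation with probability p, otherwise set it to 0."""
    return [xi if rng.random() < p else 0.0 for xi in x]

def dropout_inference(x, p):
    """Scale by p so the expected activation matches training."""
    return [xi * p for xi in x]

def inverted_dropout_train(x, p, rng):
    """Variant: scale kept activations by 1/p; inference then uses x unchanged."""
    return [xi / p if rng.random() < p else 0.0 for xi in x]

rng = random.Random(0)   # seeded so the run is reproducible
x = [1.0] * 10000
p = 0.8

kept = dropout_train(x, p, rng)
kept_fraction = sum(1 for v in kept if v != 0.0) / len(kept)
print(kept_fraction)                      # close to 0.8
print(dropout_inference([2.0, 4.0], 0.5))  # [1.0, 2.0]
```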

---

## **8. Batch Normalisation Layer**
### **Purpose**:
   - Normalises the inputs of a layer over each mini-batch to stabilise training and improve convergence speed.

### **Mechanics**:
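   - Normalises each feature to zero mean and unit variance, using the mean $\mu$ and variance $\sigma^2$ computed over the current mini-batch:
     $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$
     where $\epsilon$ is a small constant that prevents division by zero.
   - Then scales and shifts using learnable parameters $\gamma$ and $\beta$:
     $y = \gamma \hat{x} + \beta$
   - At inference, running averages of $\mu$ and $\sigma^2$ collected during training replace the mini-batch statistics.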
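
The two steps above can be sketched for a single feature across one mini-batch. This is a training-time forward pass only; the function name `batch_norm` and the default `eps=1e-5` are assumptions for illustration, and $\gamma$, $\beta$ are passed in as fixed numbers rather than learned.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n   # variance over the batch
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(out)  # values centred on 0 with variance close to 1

shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
print(sum(shifted) / len(shifted))  # close to 1.0, the value of beta
```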

---

## **9. Attention Layer**
### **Purpose**:
   - Lets the model focus on the most relevant parts of the input by computing importance weights dynamically from the data, then combining the inputs according to those weights.

### **Applications**:
   - Transformer models for NLP and computer vision.
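
One widely used form of attention is the scaled dot-product attention found in Transformers, $\text{softmax}(QK^\top / \sqrt{d_k})\,V$. The sketch below is a plain-Python illustration of that specific variant (single head, no masking, no learned projections); the function names and example vectors are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of query, key and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)   # importance of each input position
        # Weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]              # points in the direction of the first key

outputs, weights = attention(queries, keys, values)
print(weights[0])  # most of the weight falls on the first position
print(outputs[0])  # so the output is close to the first value vector
```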

---

## **10. Embedding Layer**
### **Purpose**:
   - Converts discrete categorical variables into dense continuous representations.

### **Mechanics**:
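   - Maps each word or category to a learned vector whose dimension is much lower than that of a one-hot encoding; the layer is effectively a trainable lookup table.
   - Example (illustrative values):
     - Word embedding: "king" → [0.12, 0.89, 0.33].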
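
The lookup itself can be sketched as indexing into a table. The vocabulary and all vectors other than the one for "king" are made up for illustration; in a real model the table entries are parameters learned during training.

```python
# Hypothetical vocabulary: each token maps to a row index in the table.
vocab = {"king": 0, "queen": 1, "apple": 2}

# Hypothetical embedding table with one 3-dimensional vector per token.
embedding_table = [
    [0.12, 0.89, 0.33],  # "king" (values from the example above)
    [0.15, 0.85, 0.40],  # "queen" (made up)
    [0.90, 0.10, 0.05],  # "apple" (made up)
]

def embed(tokens):
    """Replace each token with its row of the embedding table."""
    return [embedding_table[vocab[t]] for t in tokens]

vectors = embed(["king", "apple"])
print(vectors[0])  # [0.12, 0.89, 0.33]
```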

---

## **11. Residual Layer**
### **Purpose**:
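   - Introduces skip connections in deep networks to alleviate vanishing gradient problems, since gradients can flow directly through the identity path.
   - Output of a residual block:
     $y = F(x) + x$
     where $F(x)$ is the transformation learned by the block's layers and $x$ is the block's input, so both must have the same shape.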
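
A minimal sketch of the skip connection, with the inner transformation passed in as a function. `residual_block`, `zero_branch`, and `halve` are hypothetical names; in a real network the inner function would be a stack of learned layers.

```python
def residual_block(x, f):
    """Return F(x) + x; F must return a vector with the same length as x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape")
    return [a + b for a, b in zip(fx, x)]

def zero_branch(x):
    """A branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]

def halve(x):
    """A stand-in for a learned transformation."""
    return [0.5 * xi for xi in x]

y_identity = residual_block([1.0, 2.0], zero_branch)
y = residual_block([1.0, 2.0], halve)
print(y_identity)  # [1.0, 2.0]
print(y)           # [1.5, 3.0]
```

When the branch outputs zero, the block reduces to the identity, which is one reason very deep residual networks remain trainable.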

---

## **12. Output Layer**
### **Purpose**:
   - Produces the final output of the model.
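   - For regression: One neuron per predicted value (a single neuron for a scalar target), typically with a linear activation.
   - For classification: One neuron per class with a Softmax activation for multi-class problems; binary classification often uses a single neuron with a Sigmoid activation instead.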

---

## **Custom Layers**
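- User-defined layers for operations that the built-in layers do not cover, implemented by subclassing a framework's base class (for example `tf.keras.layers.Layer` in TensorFlow or `torch.nn.Module` in PyTorch).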

---

These layers form the foundation for designing sophisticated neural network architectures tailored for diverse applications like image recognition, natural language processing, and time series prediction.
