# **Neural Network (NN)**

- **ANN** (Artificial Neural Network) → Feedforward / MLP
- **CNN** (Convolutional Neural Network)
- **RNN** (Recurrent Neural Network) → LSTM, GRU
- **Transformer** → BERT, GPT
- **Autoencoder** → Variational Autoencoder (VAE)
- **GAN** (Generative Adversarial Network)

# **ANN → Artificial Neural Network**

**FNN → Feedforward Neural Network**

- Each neuron takes its inputs, multiplies each input by its weight, sums the weighted inputs, and adds a bias.

    **`∑ ( 𝑤𝑖 ⋅ 𝑥𝑖 ) + 𝑏`**

- Apply activation function - get the output

    **`𝑦 = Activation ( ∑ ( 𝑤𝑖 ⋅ 𝑥𝑖 ) + 𝑏 )`**

- The result (y) is passed to the next layer, layer by layer, until the output layer gives the final prediction.

- Calculate loss: Compare prediction with actual value

- Improve accuracy using backpropagation: the error is sent backward through the layers to get the gradient of the loss for each weight, and each weight is adjusted to reduce the error (𝛼 is the learning rate).

    **`𝑤 = 𝑤 − 𝛼 ⋅ ∂ Loss / ∂ 𝑤`**

- Repeat until convergence: train for multiple **epochs** until the loss stops improving and the model **makes accurate predictions**.
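
The steps above can be sketched for a single neuron in plain Python. This is a minimal illustration, not a full network: the sigmoid activation, the squared-error loss, the starting weights, and the learning rate of 0.5 are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(weights, bias, inputs):
    """Weighted sum plus bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def loss(weights, bias, inputs, target):
    """Squared-error loss 0.5 * (y - target)^2 (an assumed choice)."""
    y = neuron_forward(weights, bias, inputs)
    return 0.5 * (y - target) ** 2

def train_step(weights, bias, inputs, target, lr):
    """One update w = w - lr * dLoss/dw for every weight and the bias."""
    y = neuron_forward(weights, bias, inputs)
    # Chain rule: dLoss/dz = (y - target) * sigmoid'(z), where sigmoid'(z) = y * (1 - y)
    delta = (y - target) * y * (1.0 - y)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias

weights, bias = [0.5, -0.5], 0.0      # illustrative starting values
inputs, target = [1.0, 2.0], 1.0      # one made-up training example
initial_loss = loss(weights, bias, inputs, target)
for epoch in range(100):
    weights, bias = train_step(weights, bias, inputs, target, lr=0.5)
final_loss = loss(weights, bias, inputs, target)
```

After the loop, `final_loss` is smaller than `initial_loss`, which is what "adjust weights to reduce error" means in practice.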


# **CNN → Convolutional Neural Networks**

- Start with **input image (or data)** – the network reads it as small overlapping patches.

- Apply **filters (kernels)** – slide over input – perform element-wise multiplication – calculate features (edges, textures, etc.).

     **`Feature Map = ∑ (Filter × Image Patch) + Bias`**

- Apply activation function (usually ReLU) – ReLU sets negative values to zero, so only positive feature responses are kept.

    **`𝑦 = Activation ( ∑ (Filter × Patch) + 𝑏 )`**

- Apply **Pooling (e.g., Max Pooling)** – reduces size – keeps strongest features.

    **`Pooling = Pick maximum or average value from patch`**

- Repeat: Convolution → Activation → Pooling to learn deeper patterns.

- Flatten the final feature map – turn it into a vector.

- Pass to **Fully Connected Layers (like ANN)** to make predictions.

- Calculate Loss: Compare predicted output with actual value.

- Improve Accuracy using **Backpropagation**: Send error backward – update filter and dense layer weights.

    **`𝑤 = 𝑤 − 𝛼 ⋅ ∂ Loss / ∂ 𝑤`**

- Repeat until convergence – train for multiple epochs until the loss stops improving.
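
The Convolution → Activation → Pooling → Flatten steps can be sketched in plain Python. The 5×5 image and the 2×2 edge filter below are made-up values for illustration, and the sketch assumes stride 1, no padding, and a feature map whose sides are divisible by the pool size.

```python
def conv2d(image, kernel, bias=0.0):
    """Slide the kernel over the image (stride 1, no padding).

    Each output value is the sum of element-wise products plus the bias.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw)) + bias
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Set negative values to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]), size)
        ]
        for r in range(0, len(feature_map), size)
    ]

image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

features = relu(conv2d(image, kernel))   # 4x4 feature map
pooled = max_pool(features)              # 2x2 after pooling
flat = [v for row in pooled for v in row]  # vector for the dense layers
```

The filter fires only where the image changes from 0 to 1, so the feature map is nonzero in a single column, and pooling keeps that response while halving each dimension.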

# **RNN → Recurrent Neural Networks**

- Start with **input data (like sequence/text/time series)** – pass one value at a time (step-by-step).

- Each step: Multiply the input by its weight, add the weighted previous hidden state (memory) and a bias, then apply the activation.

    **`ℎₜ = Activation ( 𝑤ₓ ⋅ 𝑥ₜ + 𝑤ₕ ⋅ ℎₜ₋₁ + 𝑏 )`**

- Pass the new hidden state ℎₜ to the next step – this helps the network "remember" past info.

- Final output is compared with actual value – calculate loss.

- Backpropagation Through Time (BPTT): Send error backward through all time steps – update weights.

    **`𝑤 = 𝑤 − 𝛼 ⋅ ∂ Loss / ∂ 𝑤`**

- Repeat until convergence – train over sequences multiple times.
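
The recurrence above can be sketched with scalar values in plain Python. The tanh activation and the weight values are assumptions for the example; real RNNs use vectors and weight matrices.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def rnn_forward(sequence, w_x, w_h, b, h0=0.0):
    """Process the sequence one value at a time, carrying the hidden state."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is nonzero; later states still carry a trace of it.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

The hidden state stays positive after the input drops to zero but shrinks at every step, which is a small-scale view of why plain RNNs struggle to remember over long sequences.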


**LSTM (Long Short-Term Memory)**

Same as RNN but with **extra gates** to **better handle long-term memory**.

- **Forget Gate**: Decides what to throw away from memory.
- **Input Gate**: Decides what new info to add.
- **Output Gate**: Decides what to show as output.

All gates work together to update the cell state (memory) and hidden state.

This reduces the **vanishing gradient** problem in long sequences.
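
A minimal sketch of one LSTM step with scalar values, using the commonly used gate equations. The parameter dictionary `p` and its key names are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state, and cell state.

    p maps each gate name to a (w_x, w_h, b) tuple (illustrative layout).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much memory to show as output
    g = math.tanh(pre("candidate"))   # proposed new content
    c_t = f * c_prev + i * g          # updated cell state (memory)
    h_t = o * math.tanh(c_t)          # updated hidden state
    return h_t, c_t

# Forget gate fully open, input gate closed: the memory passes through unchanged.
params = {
    "forget": (0.0, 0.0, 100.0),
    "input": (0.0, 0.0, -100.0),
    "output": (0.0, 0.0, 100.0),
    "candidate": (1.0, 1.0, 0.0),
}
h_t, c_t = lstm_step(1.0, 0.0, 0.7, params)
```

Because the cell state is updated by addition and gating rather than by repeated squashing, it can carry information across many steps.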


**GRU (Gated Recurrent Unit)**

Similar to LSTM but **simpler** – uses only **2 gates**:

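- **Update Gate**: Controls how much of the previous hidden state to keep and how much to replace with new information.
- **Reset Gate**: Controls how much past info to use when computing the new candidate state.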

Faster and lighter than LSTM, often with similar performance.
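
A minimal sketch of one GRU step with scalar values. The parameter dictionary `p` and its key names are hypothetical. Conventions differ between references on whether the update gate or its complement multiplies the old state; this sketch assumes a gate value near 1 means "replace the old state".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step with scalar input and hidden state.

    p maps "update", "reset", and "candidate" to (w_x, w_h, b) tuples.
    """
    w_x, w_h, b = p["update"]
    z = sigmoid(w_x * x_t + w_h * h_prev + b)   # update gate
    w_x, w_h, b = p["reset"]
    r = sigmoid(w_x * x_t + w_h * h_prev + b)   # reset gate
    w_x, w_h, b = p["candidate"]
    # The reset gate scales how much of the past enters the candidate state.
    h_cand = math.tanh(w_x * x_t + w_h * (r * h_prev) + b)
    # Assumed convention: z near 1 replaces the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand

keep_old = {"update": (0.0, 0.0, -100.0), "reset": (0.0, 0.0, 0.0),
            "candidate": (1.0, 1.0, 0.0)}
replace_all = {"update": (0.0, 0.0, 100.0), "reset": (0.0, 0.0, -100.0),
               "candidate": (1.0, 1.0, 0.0)}

h_keep = gru_step(1.0, 0.3, keep_old)        # stays at the old state
h_replace = gru_step(1.0, 0.3, replace_all)  # ignores the past entirely
```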

# **Transformer**

**Input Preparation**
- Take the **input sequence** (e.g., a sentence), break it into tokens (words), and convert each token to an **embedding vector**. Add **Positional Encoding** so the model knows the word order.
- These position-aware vectors are passed into the first **encoder layer**.

**Encoder Layer**
- Each **encoder layer** has -
  - **Multi-Head Self-Attention (with Add & Normalize)**
    - **self-attention** → each word looks at all other words including itself to determine their importance.
    - Output of self-attention is added to the original input (Add), and the sum is then rescaled with layer normalization (Normalize).
  - **Feedforward Layer (with Add & Normalize)**
    - a small neural network applied to each word's vector separately to refine it. Again, the sub-layer's input is added back and the result is normalized.
- The **Output** is passed to the **next encoder layer**
- After all encoder layers are done, we get a **new vector** for each word. These vectors now understand each word’s **meaning** in context.
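
The "Add & Normalize" step used after each sub-layer can be sketched in plain Python. This simplified version assumes layer normalization without the learnable scale and shift parameters that real implementations include.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one word's vector to zero mean and (roughly) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_normalize(x, sublayer_output):
    """Add: residual connection. Normalize: layer normalization of the sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])

x = [1.0, 2.0, 3.0]                  # made-up word vector
sublayer_output = [0.5, 0.5, 0.5]    # made-up attention or feedforward output
out = add_and_normalize(x, sublayer_output)
```

The residual addition keeps the original signal available to later layers, and the normalization keeps the values in a stable range.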

**Decoder Layer**
- The **decoder** takes the final output from the encoder (which understands the input sentence).
- Each **Decoder Layer** has -
  - **Masked Self-Attention (with Add & Normalize)**
    - each word can only see the previous words, not future ones (prevents cheating)
  - **Encoder-Decoder Attention (with Add & Normalize)**
    - decoder looks at the encoder output to understand the input sentence.
  - **Feedforward Layer (with Add & Normalize)**
- After all decoder layers, we send the **output** to a **linear layer + softmax** to get a probability for each word in the vocabulary.
- Use a **Loss Function** to compare the predicted words with the actual target words.
- Use **Backpropagation** to update the weights.
- Repeat over many epochs until the loss stops improving.

**input preparation**

- Tokenization → ['The', 'dog', 'barked']
- Embedding → Each word becomes a vector (e.g., 512-dimensional)
- Positional Encoding → Each position (0, 1, 2, and so on) gets its own vector, which is added to the word's embedding
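
One common way to turn a position number into a vector is the sinusoidal encoding from the original Transformer design. The sketch below assumes that scheme and an even embedding size; learned position embeddings are another option. The 4-dimensional embedding is a made-up example.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position; assumes d_model is even.

    Even indices use sin, odd indices use cos, with wavelengths that grow
    geometrically across the vector.
    """
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

def add_position(embedding, position):
    """Add the position vector to a word embedding of the same size."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

dog_embedding = [0.2, -0.1, 0.4, 0.3]        # made-up embedding for 'dog'
dog_at_1 = add_position(dog_embedding, 1)    # 'dog' is token 1 in the sentence
```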

**encoder**

- a. Self-Attention → Each word checks the importance of all other words, including itself.
  - Each word is turned into three vectors: Query (Q), Key (K), Value (V)
  - Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, where dₖ is the dimension of the key vectors
- Now each word gets a new, context-aware vector.
- b. Feedforward Layer → Each word vector passes through a small neural network. This helps the model learn abstract features like tense, roles, etc.
  - FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
- Add & Normalize after both a and b steps.
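
The two formulas above can be written out in plain Python. This is a single-head sketch on nested lists; the Q, K, V values and the weight matrices are made up, and real models derive Q, K, and V from the word vectors with learned projections.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # importance of every word for this word
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one word vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# A query that matches both keys equally averages the two value vectors.
context = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
refined = feed_forward([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                       [[2.0], [3.0]], [1.0])
```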

**decoder**

Assume we want to translate:
"The dog barked" → "Le chien a aboyé" (French)

- a. Masked Self-Attention → The decoder only sees the part of the output generated so far. Masking prevents the model from peeking at future words.
  - When generating "Le", it sees nothing (start token only)
  - When generating "chien", it sees ["Le"]
  - When generating "a", it sees ["Le", "chien"]
- b. Encoder-Decoder Attention → While generating each word, the decoder looks at relevant parts of the input.
  - When generating "aboyé", it focuses on "barked"
- c. Feedforward → refines the output vectors.
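
Masking is usually implemented by setting the attention scores of future positions to negative infinity before the softmax, so they receive zero weight. A minimal sketch, with made-up scores:

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a position can always see itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Decoder input positions: 0 = start token, 1 = "Le", 2 = "chien"
mask = causal_mask(3)
weights_at_le = masked_softmax([1.0, 1.0, 1.0], mask[1])
```

Here `weights_at_le` spreads attention over the start token and "Le" only; "chien" gets exactly zero weight because it lies in the future.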

**output**

- Each decoder step outputs a word, one by one:
  - "Le" → "chien" → "a" → "aboyé"

- These come from a Linear layer + Softmax, which gives a probability for every word in the vocabulary; the most likely word is chosen.
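
Choosing the most likely word can be sketched as softmax followed by argmax (greedy decoding). The four-word vocabulary and the logits below are made up for illustration; a real vocabulary has thousands of entries, and other decoding strategies such as sampling or beam search exist.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_word(logits, vocabulary):
    """Turn the linear layer's scores into probabilities and take the largest."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocabulary[best], probs[best]

vocabulary = ["Le", "chien", "a", "aboyé"]
logits = [0.1, 2.5, 0.3, 0.2]   # hypothetical scores after "Le"
word, prob = pick_word(logits, vocabulary)
```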


**training**
- Compare generated output ("Le chien a aboyé") with the actual correct output.

- Use loss function (e.g., Cross-Entropy Loss).

- Apply Backpropagation to adjust model weights.

- Repeat across many sentences and epochs until the model gets better.
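
Cross-entropy loss for a sequence can be sketched as the average negative log-probability that the model assigns to each correct word. The probabilities below are made up.

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one step: -log of the probability given to the correct word."""
    return -math.log(probs[target_index])

def sequence_loss(step_probs, target_indices):
    """Average the per-step losses over the whole output sentence."""
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_indices)]
    return sum(losses) / len(losses)

# Two decoding steps over a two-word vocabulary (hypothetical numbers).
step_probs = [[1.0, 0.0], [0.5, 0.5]]
targets = [0, 1]
loss = sequence_loss(step_probs, targets)
```

A confident correct prediction contributes zero loss, while a 50/50 guess contributes log 2, so the loss falls as the model puts more probability on the right words.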

**BERT (Bidirectional Encoder Representations from Transformers)**

- Uses only the **Encoder** part of the Transformer
- Reads sentence **in both directions (left + right)** for full context
- Pretrained using **Masked Language Modeling** (hide random words and guess them)
- Good for **understanding tasks**: classification, question answering, etc.


**GPT (Generative Pretrained Transformer)**

- Uses only the **Decoder** part of the Transformer (masked self-attention, with no encoder-decoder attention because there is no encoder)
- Reads text **left to right (one direction only)**
- Pretrained to **predict the next word** in sequence
- Great for **generating text**: writing, answering, summarizing, etc.

# **Autoencoder**

- Input data (e.g., image or vector) is passed to the model.
- **Encoder** compresses the input into a smaller representation (**latent vector**).
- **Decoder** tries to **reconstruct the original input** from the latent vector.
- Output is compared with original input → calculate **Reconstruction Loss**.
- Backpropagation is used to adjust weights and **minimize reconstruction error**.
- Trained to **learn only the essential features** of input data.
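
The compress-then-reconstruct idea can be sketched with a 2-D input and a 1-D latent value. To keep the example short, the weights are hand-picked rather than learned (an assumption of this sketch); a trained linear autoencoder would find a similar direction from the data.

```python
def encode(x, direction):
    """Compress a 2-D point into one number: its coordinate along `direction`."""
    return sum(xi * di for xi, di in zip(x, direction))

def decode(z, direction):
    """Rebuild a 2-D point from the single latent number."""
    return [z * di for di in direction]

def reconstruction_loss(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

direction = [0.6, 0.8]   # hand-picked unit vector standing in for learned weights
on_line = [3.0, 4.0]     # lies along the direction: essential feature is kept
off_line = [4.0, -3.0]   # perpendicular to it: information is lost

loss_on = reconstruction_loss(on_line, decode(encode(on_line, direction), direction))
loss_off = reconstruction_loss(off_line, decode(encode(off_line, direction), direction))
```

Inputs that match the pattern captured by the latent vector are reconstructed almost perfectly, while inputs that do not match give a large reconstruction loss.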

**Variational Autoencoder (VAE)**

- Works like autoencoder but **adds randomness** to latent space.
- Encoder doesn't give just one latent vector → it gives a **mean (μ)** and a **variance (σ²)** for each latent dimension.
- From μ and σ → sample a **random point (z)** in latent space: z = μ + σ ⋅ ε, with ε drawn from a standard normal distribution.
- This lets the model **generate new, similar data** (good for generative tasks).
- Loss = Reconstruction Loss + KL Divergence (regularizes latent space)
- Trained to **generate new data** and learn **smooth, continuous features**.
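
The sampling step and the KL term can be sketched in plain Python. This assumes the common setup where the encoder outputs the log of the variance and the prior is a standard normal distribution; the function names are illustrative.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(reconstruction_loss, mu, log_var):
    """Loss = Reconstruction Loss + KL Divergence."""
    return reconstruction_loss + kl_divergence(mu, log_var)

rng = random.Random(0)
z = sample_latent([0.5, -0.2], [0.0, 0.0], rng)   # made-up encoder outputs
total = vae_loss(0.3, [1.0], [0.0])               # 0.3 is a made-up reconstruction loss
```

The KL term is zero only when the encoder's distribution equals the standard normal prior, which is how it pulls the latent space toward a smooth, continuous shape.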

# **Generative Adversarial Network (GAN)**

- GAN has **two models**:  
  - **Generator**: creates fake data  
  - **Discriminator**: checks if data is real or fake  
- Start with random noise as input → **Generator** tries to create realistic data (e.g., fake images).
- **Discriminator** sees both real and fake data → predicts if it's real (1) or fake (0).
- Both models compete:
  - Generator improves to **fool the discriminator**
  - Discriminator improves to **catch fakes**
- Loss:
  - Generator tries to **maximize** the chance of fooling the discriminator
  - Discriminator tries to **minimize** its error in detecting fake vs real
- Use **backpropagation** to update both models
- Trained until **Generator produces realistic data** that Discriminator can’t easily detect
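
The two competing objectives can be sketched as loss functions of the discriminator's outputs. This assumes binary cross-entropy for the discriminator and the widely used "non-saturating" form for the generator; the probabilities are made up.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real data should score 1, generated data 0.

    d_real and d_fake are the discriminator's predicted probabilities of "real".
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Small when the discriminator rates generated data as real."""
    return -math.log(d_fake)

# A sharp discriminator has low loss; the generator's loss is then high.
sharp_d = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_d = discriminator_loss(d_real=0.6, d_fake=0.4)
g_when_caught = generator_loss(d_fake=0.1)
g_when_fooling = generator_loss(d_fake=0.9)
```

Training alternates between lowering `discriminator_loss` with the generator fixed and lowering `generator_loss` with the discriminator fixed.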

# **Deep Reinforcement Learning (DRL)**

- An **agent** interacts with an **environment** (like a game or simulation).
- At each step:
  - Agent observes the **state** of the environment.
  - Chooses an **action** based on current state.
  - Environment returns a **reward** and new **state**.
- Goal: **Learn actions** that maximize total reward over time.
- Uses **Deep Neural Networks** to approximate either the policy (which action to take in each state) or the Q-function (the expected total reward of each action in each state).
- Learns by **trial and error**: good actions get **positive rewards**, bad ones get **penalties**.
- Uses **backpropagation** to update the neural network and improve decisions.
- Trained over many episodes until the agent learns the best strategy.
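
The learning rule behind Q-function methods can be sketched with a lookup table instead of a neural network. This is tabular Q-learning, a simplification swapped in for brevity: in deep RL, the table is replaced by a network and the same error drives backpropagation. The states, actions, and values are made up.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * best future value.

    alpha is the learning rate, gamma the discount on future rewards.
    """
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]

def best_action(q_table, state):
    """Greedy policy: pick the action with the highest Q-value."""
    return max(q_table[state], key=q_table[state].get)

q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
# The agent moved right from s0, got reward 1.0, and landed in s1.
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")
```

After this single update, "right" already has the higher value in `s0`, so the greedy policy prefers it.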