
## Neural Network: A Simple Perceptron


# 1. What is deep learning, and how is it connected to artificial intelligence?
Deep learning is a **subset of machine learning**, which itself is a branch of **artificial intelligence (AI)**.

Here’s how they connect:

1. **Artificial Intelligence (AI)** – This is the broad field concerned with creating systems that can perform tasks normally requiring human intelligence, such as understanding language, recognizing images, or making decisions.

2. **Machine Learning (ML)** – A subset of AI that focuses on algorithms that allow machines to learn from data rather than being explicitly programmed for every rule.

3. **Deep Learning (DL)** – A further subset of ML that uses **artificial neural networks** with many layers (“deep” networks) to automatically learn complex patterns from large amounts of data.

**Key points about deep learning:**

* **Neural networks** are inspired by the human brain’s structure, with interconnected “neurons” that process information in layers.
* It’s particularly powerful for **image recognition**, **speech recognition**, **natural language processing**, and **autonomous systems**.
* Unlike traditional ML, deep learning can **automatically extract features** from raw data, reducing the need for manual feature engineering.
* It thrives on **large datasets** and **high computing power** (e.g., GPUs).

**Hierarchy analogy:**

> AI → Machine Learning → Deep Learning
> (like “Mathematics → Algebra → Linear Algebra” — deep learning is a specialized area inside a specialized area)



# 2. What is the mathematical structure of a neural network?
The **mathematical structure** of a neural network can be described using **linear algebra and functions**.
At its core, a neural network is a **composition of functions** where each layer applies a **linear transformation** followed by a **non-linear activation**.

## **1. Neuron (Single Unit)**

A single neuron takes an input vector **x** and produces an output **y**:

$$
z = \sum_{i=1}^{n} w_i x_i + b
$$

* $x_i$ → input feature
* $w_i$ → weight for each input
* $b$ → bias term
* $z$ → weighted sum (pre-activation)

Then apply an **activation function** $f(\cdot)$:

$$
y = f(z) = f\left( \sum_{i=1}^{n} w_i x_i + b \right)
$$


## **2. Layer**

For a layer with **m** neurons receiving **n** inputs:

1. Inputs as a column vector:

$$
\mathbf{x} =
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}
$$

2. Weight matrix:

$$
\mathbf{W} =
\begin{bmatrix}
w_{11} & w_{12} & \dots & w_{1n} \\
w_{21} & w_{22} & \dots & w_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{m1} & w_{m2} & \dots & w_{mn}
\end{bmatrix}
$$

(size: $m \times n$)

3. Bias vector:

$$
\mathbf{b} =
\begin{bmatrix}
b_1 \\
b_2 \\
\vdots \\
b_m
\end{bmatrix}
$$

4. Layer operation:

$$
\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}
$$

$$
\mathbf{y} = f(\mathbf{z})
$$

Where $f$ is applied element-wise.
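
The layer operation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `dense_layer`, the choice of sigmoid as $f$, and the numeric values of `W`, `b`, and `x` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, used here as the activation f."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b, f=sigmoid):
    """Compute y = f(W x + b) for one layer with m neurons and n inputs."""
    z = W @ x + b          # pre-activation vector, shape (m,)
    return f(z)            # activation applied element-wise

# Hypothetical layer with n = 3 inputs and m = 2 neurons
x = np.array([1.0, 2.0, 3.0])
W = np.array([[0.1, 0.2, 0.3],
              [-0.3, 0.0, 0.1]])   # shape (m, n) = (2, 3)
b = np.array([0.0, 0.0])

y = dense_layer(x, W, b)
print(y.shape)  # (2,)
```

Each row of `W` holds the weights of one neuron, so the matrix product evaluates all $m$ weighted sums at once.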

## **3. Multi-Layer Neural Network**

If we have $L$ layers:

* **Layer 1**:

$$
\mathbf{a}^{(1)} = f^{(1)}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})
$$

* **Layer 2**:

$$
\mathbf{a}^{(2)} = f^{(2)}(\mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)})
$$

* And so on, until the output layer $\mathbf{a}^{(L)}$.

**General formula** for layer $l$:

$$
\mathbf{a}^{(l)} = f^{(l)} \left( \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)
$$

with $\mathbf{a}^{(0)} = \mathbf{x}$ (the input vector).
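
The general formula translates directly into a loop over layers. The sketch below assumes a hypothetical network with layer sizes 4, 5, 3, 1, ReLU in the hidden layers, and an identity output; the names `forward`, `params`, and `activations` are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, params, activations):
    """Apply a^(l) = f^(l)(W^(l) a^(l-1) + b^(l)) for l = 1..L."""
    a = x                                  # a^(0) = x
    for (W, b), f in zip(params, activations):
        a = f(W @ a + b)
    return a                               # a^(L), the network output

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                       # hypothetical layer widths
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
activations = [relu, relu, identity]

y_hat = forward(rng.normal(size=4), params, activations)
print(y_hat.shape)  # (1,)
```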


## **4. Overall Function**

A neural network essentially computes:

$$
\hat{\mathbf{y}} = F(\mathbf{x}; \Theta)
$$

where:

* $\mathbf{x}$ = input vector
* $\Theta = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}$ = all learnable parameters
* $F$ = composition of linear transformations + nonlinear activations

## **5. Training (Mathematical Side)**

* **Loss Function** $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ measures error.
* Parameters $\Theta$ are updated using **gradient descent**:

$$
\Theta \leftarrow \Theta - \eta \, \nabla_{\Theta} \mathcal{L}
$$
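
To see the update rule in action, here is a small sketch on a one-parameter toy problem. The loss $\mathcal{L}(\theta) = (\theta - 3)^2$, the starting point, and the learning rate are assumptions chosen so the arithmetic is easy to follow.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value (arbitrary)
eta = 0.1     # learning rate

for step in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * dL/dtheta

print(round(theta, 4))  # 3.0
```

Each step shrinks the distance to the minimum by a factor of $1 - 2\eta = 0.8$, so the parameter converges to 3.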



# 3. What is an activation function, and why is it essential in neural networks?
An **activation function** is a **mathematical function** applied to the output of a neuron in a neural network to decide **whether the neuron should be “activated”** and how it should transform its input.

In simple terms, it introduces **non-linearity** into the network so it can learn **complex patterns** instead of just linear relationships.


## **1. Mathematical Definition**

If a neuron computes:

$$
z = \sum_{i=1}^n w_i x_i + b
$$

then the activation function $f(\cdot)$ produces:

$$
a = f(z)
$$

where:

* $z$ = weighted sum (linear)
* $a$ = neuron’s output after activation (possibly non-linear)

## **2. Why It’s Essential**

1. **Introduces Non-Linearity**

   * Without activation functions, a neural network is just a **stack of linear equations**, which collapses into a single linear transformation—limiting its ability to model real-world problems.

2. **Allows Complex Decision Boundaries**

   * Non-linear activations enable networks to classify data that’s not linearly separable.

3. **Enables Deep Learning**

   * Multi-layer networks with non-linear activations can approximate **any continuous function** on a bounded input domain to arbitrary accuracy, given enough hidden units (Universal Approximation Theorem).

4. **Controls Signal Flow**

   * Some activations help avoid problems like exploding or vanishing gradients.

## **3. Common Activation Functions**

| Function       | Formula                                    | Range        | Key Features                                           | Use Case                        |
| -------------- | ------------------------------------------ | ------------ | ------------------------------------------------------ | ------------------------------- |
| **Sigmoid**    | $f(z) = \frac{1}{1+e^{-z}}$                | (0,1)        | Smooth, squashes values; can cause vanishing gradients | Binary classification           |
| **Tanh**       | $f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (-1,1)       | Centered at 0; still suffers vanishing gradient        | Hidden layers                   |
| **ReLU**       | $f(z) = \max(0, z)$                        | \[0,∞)       | Fast, prevents vanishing gradient for positive values  | Most hidden layers              |
| **Leaky ReLU** | $f(z) = \max(\alpha z, z)$                 | (-∞,∞)       | Allows small negative slope                            | Solves ReLU “dead neuron” issue |
| **Softmax**    | $f(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$  | (0,1), sum=1 | Converts vector to probability distribution            | Output layer for multi-class    |
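
The functions in the table can be written in NumPy as follows. This is a sketch: the default `alpha=0.01` for Leaky ReLU and the max-subtraction trick in `softmax` are common choices assumed here rather than anything the table prescribes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)      # max(alpha*z, z) for 0 < alpha < 1

def softmax(z):
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))            # [0. 0. 3.]
print(softmax(z).sum())   # sums to 1 (up to rounding)
```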

## **4. Without Activation Functions**

If you remove activation functions, the network becomes:

$$
\mathbf{y} = \mathbf{W}^{(n)} \dots \mathbf{W}^{(2)} \mathbf{W}^{(1)} \mathbf{x} + \text{bias terms}
$$

This is still **just one linear transformation**, no matter how many layers—so it can only model straight-line relationships.
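
The collapse can be checked numerically. The sketch below uses two hypothetical layers with random weights and no activation, and compares them with the single equivalent layer $\mathbf{W}_{eq} = \mathbf{W}^{(2)}\mathbf{W}^{(1)}$, $\mathbf{b}_{eq} = \mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers without any activation function
two_layer = W2 @ (W1 @ x + b1) + b2

# One equivalent linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layer, one_layer))  # True
```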


# 4. Could you list some common activation functions used in neural networks?

## **1. Sigmoid Family**

* **Sigmoid / Logistic Function**: $f(z) = \frac{1}{1+e^{-z}}$
  * Range: (0, 1)
  * Use: Binary classification outputs, probability mapping.
* **Tanh (Hyperbolic Tangent)**
  * Range: (-1, 1)
  * Use: Hidden layers where centered output is beneficial.

## **2. ReLU Variants**

* **ReLU (Rectified Linear Unit)**: $f(z) = \max(0, z)$
  * Range: [0, ∞)
  * Use: Most hidden layers in deep networks.
* **Leaky ReLU**: $f(z) = \max(\alpha z, z)$, with $\alpha \approx 0.01$
  * Range: (-∞, ∞)
  * Use: Prevents “dead neurons” in ReLU.
* **Parametric ReLU (PReLU)**: Like Leaky ReLU, but $\alpha$ is learned during training.
* **ELU (Exponential Linear Unit)**: $f(z) = z$ for $z > 0$ and $\alpha(e^{z} - 1)$ otherwise. Smooths the negative side for better gradient flow.

## **3. Softmax and Probability Functions**

* **Softmax**
  * Use: Multi-class classification output layer.
* **LogSoftmax**: Applies the logarithm to the softmax output. Computing both in one step is more numerically stable than taking the log of a separately computed softmax.

## **4. Advanced / Modern**

* **Swish**: $f(z) = z \cdot \sigma(z)$. Smooth and self-gated; used in Google’s EfficientNet.
* **GELU (Gaussian Error Linear Unit)**: $f(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. A smooth, ReLU-like curve used in Transformer models like BERT.
* **Maxout**: Outputs the maximum of several linear functions, so it can adapt to different activation shapes.
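
A few of the less familiar functions above can be sketched in NumPy. The formulas used are the standard definitions (ELU with $\alpha = 1$, Swish as $z \cdot \sigma(z)$, GELU as $z \cdot \Phi(z)$ with the exact Gaussian CDF); the function names and test values are illustrative assumptions.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    # z for z > 0, alpha * (exp(z) - 1) otherwise
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def swish(z):
    return z * sigmoid(z)

def gelu(z):
    # z * Phi(z), with Phi the standard normal CDF written via erf
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

z = np.array([-1.0, 0.0, 2.0])
print(elu(z))
print(swish(z))
print(gelu(z))
```

All three agree with ReLU for large positive inputs but stay smooth (and slightly negative) just below zero.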

# 5. What is a multilayer neural network?
A **Multilayer Neural Network** (also called a **Multilayer Perceptron, MLP**) is a type of neural network that has **one or more hidden layers of neurons** between the input and output layers.

It’s the simplest form of a **deep neural network**, and it learns by combining **linear transformations** with **non-linear activation functions** in multiple stages.

## **Structure**

1. **Input Layer** – Receives raw data (features).
2. **Hidden Layers** – One or more layers that process data through weighted connections and activation functions.
3. **Output Layer** – Produces the final prediction or classification.

### **Mathematical Flow**

If we have:

* $\mathbf{x}$ = input vector
* $\mathbf{W}^{(l)}$, $\mathbf{b}^{(l)}$ = weights and biases for layer $l$
* $f^{(l)}$ = activation function for layer $l$

Then for each layer:

$$
\mathbf{a}^{(l)} = f^{(l)}\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)
$$

with $\mathbf{a}^{(0)} = \mathbf{x}$.

## **Key Features**

* **Multiple hidden layers** → More representation power.
* **Non-linear activations** → Can model complex patterns.
* **Fully connected** → Every neuron in one layer connects to every neuron in the next (in standard MLPs).

## **Advantages**

✅ Can approximate any continuous function (**Universal Approximation Theorem**).
✅ Handles non-linear and complex relationships.
✅ Versatile — works for regression, classification, and more.

## **Limitations**

❌ Prone to **overfitting** if too large and not regularized.
❌ Can be computationally expensive.
❌ Requires careful tuning of learning rate, activation functions, and number of layers.

## **Example**

A 3-layer neural network (1 input layer, 1 hidden layer, 1 output layer) for predicting whether an email is spam:

* **Input Layer:** Features like word frequency, sender domain, presence of links.
* **Hidden Layer:** Processes feature interactions.
* **Output Layer:** Probability of “spam” vs “not spam”.


# 6. What is a loss function, and why is it crucial for neural network training?
A **loss function** (also called a **cost function** or **objective function**) is a mathematical formula that measures **how far a neural network’s predictions are from the actual target values**.

It’s the **guide** that tells the network how wrong it is, so it knows how to adjust its weights during training.

## **1. Mathematical Definition**

If:

* $\hat{y}$ = predicted output of the network
* $y$ = true (target) value
* $\mathcal{L}$ = loss function

Then the loss is:

$$
\text{Loss} = \mathcal{L}(y, \hat{y})
$$

## **2. Why It’s Crucial**

1. **Training Signal** – The loss function is the **feedback** that drives learning.

   * High loss → large errors → typically larger gradients and bigger weight updates.
   * Low loss → small errors → typically smaller weight updates.
2. **Optimization Goal** – Training a neural network is about **minimizing** this loss over the training data:

$$
\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(y_i, \hat{y}_i)
$$

where $\Theta$ = all weights & biases.

3. **Direction for Gradient Descent** – Backpropagation computes gradients of the loss with respect to parameters, so without a loss function, the network wouldn’t know **how to improve**.
## **3. Common Loss Functions**

### **For Regression**

* **Mean Squared Error (MSE)**:

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

* **Mean Absolute Error (MAE)**:

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|
$$

### **For Classification**

* **Binary Cross-Entropy**:

$$
\mathcal{L} = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]
$$

* **Categorical Cross-Entropy** (for multi-class):

$$
\mathcal{L} = - \sum_{i=1}^{C} y_i \log(\hat{y}_i)
$$
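
The four formulas above can be sketched in NumPy as follows. The clipping constant `eps` is an added safeguard against `log(0)` and the sample values are made up for illustration; neither comes from the formulas themselves.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    # Loss for a single sample with a one-hot target over C classes
    return -np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8])
print(mse(y, y_hat))                   # approximately 0.03
print(mae(y, y_hat))
print(binary_cross_entropy(y, y_hat))

probs = np.array([0.2, 0.7, 0.1])      # hypothetical softmax output
print(categorical_cross_entropy(np.array([0.0, 1.0, 0.0]), probs))
```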

### **For Special Tasks**

* **Hinge Loss** → Support vector–style classification.
* **Huber Loss** → Robust regression with fewer outlier effects.
## **4. Without a Loss Function**

If a network didn’t have a loss function, it would have **no measurable target** to improve toward.
It would be like a student taking an exam and never getting the results — they wouldn’t know what to study or how to get better.


# 7. What are some common types of loss functions?
Here’s a **quick categorized list** of common loss functions used in neural networks, with their main purposes:

---

## **1. Regression Loss Functions**

Used when predicting **continuous values**.

| Loss Function                 | Formula                                            | Key Feature                            | Use Case           |
| ----------------------------- | -------------------------------------------------- | -------------------------------------- | ------------------ |
| **Mean Squared Error (MSE)**  | $\frac{1}{N} \sum (y - \hat{y})^2$                 | Penalizes large errors more strongly   | General regression |
| **Mean Absolute Error (MAE)** | $\frac{1}{N} \sum \lvert y - \hat{y} \rvert$       | Robust to outliers                     | Robust regression  |
| **Huber Loss**                | Piecewise: MSE for small errors, MAE for large     | Combines robustness & smooth gradients | Noisy regression   |
| **Log-Cosh Loss**             | $\sum \log(\cosh(y - \hat{y}))$                    | Smooth and less sensitive to outliers  | Stable regression  |
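
The two piecewise or smooth losses in the table are easiest to understand in code. This sketch assumes the standard Huber definition with threshold `delta` (quadratic $\tfrac{1}{2}r^2$ below it, linear $\delta(|r| - \tfrac{1}{2}\delta)$ above it) and averages over samples; the log-cosh version sums, as in the table.

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    quadratic = 0.5 * r ** 2                 # MSE-like branch for small errors
    linear = delta * (r - 0.5 * delta)       # MAE-like branch for large errors
    return np.mean(np.where(r <= delta, quadratic, linear))

def log_cosh(y, y_hat):
    return np.sum(np.log(np.cosh(y - y_hat)))

y = np.array([0.0, 0.0])
y_hat = np.array([0.5, 3.0])                 # one small error, one outlier
print(huber(y, y_hat))                       # 1.3125
print(log_cosh(y, y_hat))
```

The outlier at 3.0 contributes 2.5 to the Huber sum instead of the 4.5 it would contribute under a half-squared error, which is the robustness the table refers to.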

---

## **2. Classification Loss Functions**

Used when predicting **discrete classes**.

| Loss Function                        | Formula                                    | Key Feature                        | Use Case                   |
| ------------------------------------ | ------------------------------------------ | ---------------------------------- | -------------------------- |
| **Binary Cross-Entropy (Log Loss)**  | $-[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$ | Works with probabilities in (0,1)  | Binary classification      |
| **Categorical Cross-Entropy**        | $-\sum y_i \log(\hat{y}_i)$                | Multi-class probability prediction | Multi-class classification |
| **Sparse Categorical Cross-Entropy** | Like categorical but with integer labels   | Memory-efficient                   | Large-class classification |
| **Hinge Loss**                       | $\max(0, 1 - y\hat{y})$                    | Margin-based                       | SVM-style classification   |
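
Two rows of this table differ mainly in how labels are encoded, which a short sketch makes concrete. It assumes hinge-loss labels in $\{-1, +1\}$ with raw (unsquashed) scores, and an integer class index for the sparse variant; the function names are illustrative.

```python
import numpy as np

def hinge(y, score):
    """y holds labels in {-1, +1}; score is the raw model output."""
    return np.mean(np.maximum(0.0, 1.0 - y * score))

def sparse_categorical_cross_entropy(label, probs, eps=1e-12):
    """label is an integer class index instead of a one-hot vector."""
    return -np.log(max(probs[label], eps))

print(hinge(np.array([1.0, -1.0]), np.array([0.3, -2.0])))  # approximately 0.35

probs = np.array([0.2, 0.7, 0.1])            # hypothetical softmax output
onehot = np.array([0.0, 1.0, 0.0])
print(sparse_categorical_cross_entropy(1, probs))
print(-np.sum(onehot * np.log(probs)))       # same value via the one-hot form
```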

---

## **3. Ranking & Probability Losses**

Used for ranking problems or probabilistic outputs.

| Loss Function                                   | Key Feature                                                     | Use Case                                  |
| ----------------------------------------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Kullback–Leibler Divergence (KL Divergence)** | Measures how one probability distribution diverges from another | Variational autoencoders, language models |
| **Contrastive Loss**                            | Pushes similar pairs together, dissimilar apart                 | Siamese networks                          |
| **Triplet Loss**                                | Uses anchor-positive-negative triplets for embedding learning   | Face recognition                          |

---

## **4. Specialized Losses**

For tasks beyond standard classification/regression.

| Loss Function                                        | Key Feature                                       | Use Case                                    |
| ---------------------------------------------------- | ------------------------------------------------- | ------------------------------------------- |
| **Dice Loss**                                        | Measures overlap between predicted & actual masks | Medical image segmentation                  |
| **IoU Loss (Jaccard Loss)**                          | Intersection-over-union for shapes                | Object detection & segmentation             |
| **Perceptual Loss**                                  | Compares high-level features instead of pixels    | Image style transfer                        |
| **CTC Loss (Connectionist Temporal Classification)** | Allows training without exact alignment           | Speech recognition, handwriting recognition |

---



# 8. How does a neural network learn?
A **neural network learns** by **adjusting its weights and biases** so that its predictions become closer to the correct answers.
This happens through a cycle of **forward pass → loss calculation → backward pass → parameter update**.
## **1. The Learning Process (Step-by-Step)**

### **Step 1: Forward Propagation**

* Input data ($\mathbf{x}$) passes through the network layer by layer.
* Each layer computes a weighted sum of its inputs plus a bias:

$$
z = \mathbf{W} \mathbf{x} + \mathbf{b}
$$

* An **activation function** $f(z)$ adds non-linearity.
* The final output $\hat{y}$ is the network’s prediction.
### **Step 2: Loss Calculation**

* The network’s output $\hat{y}$ is compared to the actual target $y$ using a **loss function**:

$$
\mathcal{L}(y, \hat{y})
$$

* This gives a number that measures **how wrong** the network is.
### **Step 3: Backpropagation**

* The loss is propagated **backward** through the network to compute the **gradient** of the loss with respect to each weight and bias.
* Uses the **chain rule of calculus**:

$$
\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}
$$
### **Step 4: Weight Update (Gradient Descent)**

* The network updates parameters using:

$$
W \leftarrow W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W}
$$

$$
b \leftarrow b - \eta \cdot \frac{\partial \mathcal{L}}{\partial b}
$$

where:

* $\eta$ = learning rate (step size)
* Gradients come from backpropagation
### **Step 5: Repeat**

* This process is repeated for many **epochs** (full passes through the training data) until:

  * The loss becomes small enough
  * Or performance stops improving
## **2. Summary Flow**

1. **Forward pass** → Get predictions.
2. **Loss function** → Measure error.
3. **Backpropagation** → Calculate gradients.
4. **Gradient descent** → Update weights.
5. **Repeat** until the model learns patterns.
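
The whole cycle fits in a short NumPy loop. The sketch below trains a single sigmoid neuron on the OR gate; the dataset, learning rate, and epoch count are assumptions chosen for illustration, and the gradient uses the fact that for sigmoid output with binary cross-entropy, $\partial \mathcal{L} / \partial z = \hat{y} - y$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: the OR gate
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])

w, b = np.zeros(2), 0.0
lr, eps = 1.0, 1e-12
losses = []

for epoch in range(2000):
    y_hat = sigmoid(X @ w + b)                                   # 1. forward pass
    loss = -np.mean(y * np.log(y_hat + eps)
                    + (1 - y) * np.log(1 - y_hat + eps))         # 2. loss
    losses.append(loss)
    dz = (y_hat - y) / len(y)                                    # 3. backward pass
    dw, db = X.T @ dz, dz.sum()
    w -= lr * dw                                                 # 4. update
    b -= lr * db

preds = (sigmoid(X @ w + b) > 0.5).astype(float)
print(losses[0], losses[-1])
print(preds)
```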
## **3. Analogy**

Think of it like **throwing darts blindfolded**:

* You throw a dart (make a prediction).
* Someone tells you how far you missed (loss function).
* You adjust your aim based on feedback (backpropagation).
* Over time, you hit closer to the bullseye (better prediction).


# 9. What is an optimizer in neural networks, and why is it necessary?
An **optimizer** in a neural network is an **algorithm** that updates the model’s weights and biases during training so that the **loss function is minimized**.

In short:

* **Loss function** → tells us *how wrong* the model is.
* **Optimizer** → decides *how to change* the weights to get better.
## **1. Why It’s Necessary**

* Without an optimizer, the weights in the network wouldn’t change, and the model would **never learn**.
* Optimizers decide **direction** (which way to move in the loss landscape) and **magnitude** (how big a step to take).
* They help find the set of parameters $\Theta = \{W, b\}$ that make predictions most accurate.
## **2. How It Works**

During training:

1. **Forward pass** → Predictions are made.
2. **Loss function** → Error is calculated.
3. **Backpropagation** → Gradients ($\frac{\partial \mathcal{L}}{\partial W}$) are computed.
4. **Optimizer** → Uses these gradients to update weights:

$$
W \leftarrow W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W}
$$

where $\eta$ = learning rate.
## **3. Common Optimizers**

| Optimizer                             | Key Idea                                                         | Pros                                        | Cons                                   |
| ------------------------------------- | ---------------------------------------------------------------- | ------------------------------------------- | -------------------------------------- |
| **SGD (Stochastic Gradient Descent)** | Updates weights using gradient of one (or few) samples at a time | Simple, memory-efficient                    | May be slow to converge                |
| **Momentum**                          | Adds a fraction of previous updates to current update            | Speeds up convergence, reduces oscillations | Needs tuning of momentum term          |
| **Adagrad**                           | Adapts learning rate for each parameter based on past gradients  | Works well for sparse data                  | Learning rate may decay too much       |
| **RMSProp**                           | Keeps moving average of squared gradients                        | Works well for RNNs                         | Needs learning rate tuning             |
| **Adam (Adaptive Moment Estimation)** | Combines Momentum + RMSProp                                      | Fast, widely used, minimal tuning           | Can overfit if learning rate not tuned |
| **AdamW**                             | Adam with decoupled weight decay for regularization              | Better generalization                       | Slightly more complex                  |

## **4. Analogy**

Training a neural network is like **hiking down a mountain blindfolded**:

* **Loss function** = your altitude (you want to minimize it).
* **Gradients** = tell you which way is downhill.
* **Optimizer** = decides how big a step you should take and adjusts your path for efficiency


# 10. Could you briefly describe some common optimizers?
A **quick overview** of the most common neural network optimizers:

### **1. Stochastic Gradient Descent (SGD)**

* **How it works**: Updates weights using the gradient from a single (or small batch of) training sample(s).
* **Update rule**:

  $$
  W \leftarrow W - \eta \cdot \nabla_W \mathcal{L}
  $$
* **Pros**: Simple, memory-efficient.
* **Cons**: Can be slow, oscillates in narrow valleys.
### **2. SGD with Momentum**

* **How it works**: Adds a fraction of the previous update to the current update to speed up learning and smooth oscillations.
* **Update rule**:

  $$
  v_t = \beta v_{t-1} + \eta \nabla_W \mathcal{L}
  $$

  $$
  W \leftarrow W - v_t
  $$
* **Pros**: Faster convergence, especially in deep networks.
* **Cons**: Needs momentum term $\beta$ tuning.
### **3. Adagrad**

* **How it works**: Adapts the learning rate for each parameter based on $G_t$, the accumulated sum of that parameter's squared past gradients.
* **Update rule**:

  $$
  W \leftarrow W - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_W \mathcal{L}
  $$
* **Pros**: Good for sparse features (e.g., NLP).
* **Cons**: Learning rate keeps shrinking, may stop learning.
### **4. RMSProp**

* **How it works**: Keeps an exponentially decaying average of squared gradients for normalization.
* **Update rule**:

  $$
  W \leftarrow W - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla_W \mathcal{L}
  $$
* **Pros**: Works well for RNNs and non-stationary problems.
* **Cons**: Needs tuning of decay rate.
### **5. Adam (Adaptive Moment Estimation)**

* **How it works**: Combines **Momentum** (moving average of gradients) and **RMSProp** (adaptive learning rates).
* **Update rule**: Uses first moment $m_t$ and second moment $v_t$ estimates:

  $$
  W \leftarrow W - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  $$
* **Pros**: Fast, widely used, works well out of the box.
* **Cons**: Can generalize poorly if not tuned.

### **6. AdamW**

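* **How it works**: Adam with decoupled weight decay, meaning the decay is applied directly to the weights instead of being added to the gradient, for better regularization.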
* **Pros**: Better generalization than Adam.
* **Cons**: Slightly more complex.
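
To compare the update rules side by side, the sketch below runs SGD, momentum, and Adam on a toy quadratic loss. The loss $f(w) = \tfrac{1}{2}(w_0^2 + 10 w_1^2)$, the starting point, and all hyperparameters are assumptions for illustration; the Adam defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$) are the commonly used ones.

```python
import numpy as np

SCALE = np.array([1.0, 10.0])

def loss(w):
    """Toy loss f(w) = 0.5 * (w0^2 + 10 * w1^2), minimized at w = 0."""
    return 0.5 * np.sum(SCALE * w ** 2)

def grad(w):
    return SCALE * w

def run_sgd(w, lr=0.05, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_momentum(w, lr=0.05, beta=0.9, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + lr * grad(w)          # v_t = beta * v_{t-1} + eta * grad
        w = w - v
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2 # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([5.0, -3.0])
for name, fn in [("SGD", run_sgd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    print(name, loss(fn(w0.copy())))
```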


# 11. Can you explain forward and backward propagation in a neural network?

## **1. Forward Propagation**

**Goal:** Pass input data through the network to get predictions.

### **Step-by-step**

1. **Input Layer** — The data $\mathbf{x}$ is fed into the network.
2. **Weighted Sum** — Each neuron calculates:

   $$
   z^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}
   $$

   where:

   * $l$ = layer number
   * $\mathbf{a}^{(0)} = \mathbf{x}$ (input features)
3. **Activation Function** — Apply a non-linear function:

   $$
   a^{(l)} = f^{(l)}(z^{(l)})
   $$
4. **Output Layer** — Produces the prediction $\hat{\mathbf{y}}$.

💡 **Analogy:** Like water flowing forward through pipes — each layer processes and passes data along.

## **2. Backward Propagation (Backprop)**

**Goal:** Calculate how each weight contributed to the error, so we can update them.

### **Step-by-step**

1. **Loss Calculation** — Compare prediction $\hat{\mathbf{y}}$ with actual target $\mathbf{y}$ using a **loss function**:

   $$
   \mathcal{L}(y, \hat{y})
   $$
2. **Error at Output** — Compute gradient of loss with respect to output:

   $$
   \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial z^{(L)}}
   $$
3. **Propagate Error Backwards** — For each hidden layer:

   $$
   \delta^{(l)} = (\mathbf{W}^{(l+1)})^T \delta^{(l+1)} \odot f'^{(l)}(z^{(l)})
   $$

   where $\odot$ is element-wise multiplication.
4. **Calculate Gradients** — For weights and biases:

   $$
   \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^T
   $$

   $$
   \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}
   $$
5. **Update Weights** — Using an optimizer (like SGD, Adam):

   $$
   \mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}
   $$

   $$
   \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}
   $$

💡 **Analogy:** Imagine fixing a factory line — you start from the last stage (output) and trace back to see which machine (layer) introduced the error.
## **3. Process Overview**

1. **Forward Pass** → Make predictions.
2. **Loss Calculation** → Measure error.
3. **Backward Pass** → Find which weights caused the error.
4. **Optimization Step** → Adjust weights to reduce future error.
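
The backpropagation equations can be checked against a numerical derivative. The sketch below assumes a tiny two-layer network (tanh hidden layer, linear output, loss $\tfrac{1}{2}\lVert \hat{\mathbf{y}} - \mathbf{y} \rVert^2$) with random weights; those choices and all variable names are illustrative, not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    a2 = W2 @ a1 + b2                        # linear output layer
    loss = 0.5 * np.sum((a2 - y) ** 2)
    return a1, a2, loss

a1, a2, loss = forward(W1, b1, W2, b2)

# Backward pass, following the delta equations
delta2 = a2 - y                              # dL/dz2 for linear output + squared error
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
dW2, db2 = np.outer(delta2, a1), delta2
dW1, db1 = np.outer(delta1, x), delta1

# Numerical check of one entry with a central difference
h = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += h
W1m[0, 0] -= h
num = (forward(W1p, b1, W2, b2)[2] - forward(W1m, b1, W2, b2)[2]) / (2 * h)
print(num, dW1[0, 0])
```

The two printed numbers should agree to several decimal places, which is a quick way to catch mistakes in a hand-written backward pass.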



# 12. What is weight initialization, and how does it impact training?
**Weight Initialization** is the process of assigning the starting values (usually small random numbers) to the weights of a neural network **before** training begins.

It’s more important than it might sound — a bad initialization can make training very slow or even stop the network from learning altogether.

## **Why Weight Initialization Matters**

When training, the weights are updated step-by-step using gradients.
If the starting weights are poorly chosen:

1. **Too small (near zero)** → Signals shrink layer by layer → *vanishing gradients* → slow or no learning.
2. **Too large** → Signals explode layer by layer → *exploding gradients* → unstable training.
3. **All equal** → Every neuron learns the same thing → no diversity in features.

A good initialization keeps the scale of activations and gradients **balanced** as they flow forward and backward through the network.

## **Common Weight Initialization Methods**

| Method                           | Idea                                                                                               | When to Use                                         |
| -------------------------------- | -------------------------------------------------------------------------------------------------- | --------------------------------------------------- |
| **Zero Initialization**          | All weights = 0                                                                                    | ❌ Never for hidden layers (causes symmetry problem) |
| **Random Initialization**        | Small random numbers from uniform or normal distribution                                           | Very basic, often replaced by better methods        |
| **Xavier/Glorot Initialization** | Variance depends on number of input & output neurons: $\text{Var}(W) = \frac{2}{n_{in} + n_{out}}$ | Works well with sigmoid/tanh                        |
| **He Initialization**            | Variance: $\text{Var}(W) = \frac{2}{n_{in}}$                                                       | Works well with ReLU/variants                       |
| **LeCun Initialization**         | Variance: $\text{Var}(W) = \frac{1}{n_{in}}$                                                       | Works well with SELU                                |
| **Orthogonal Initialization**    | Weight matrix is orthogonal                                                                        | Used in RNNs for stable long-term dependencies      |
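
The variance formulas for Xavier and He initialization can be sketched with NumPy's random generator. The layer sizes (512 inputs, 256 outputs) and the use of a normal rather than a uniform distribution are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))      # Var(W) = 2 / (n_in + n_out)
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_normal(n_in, n_out):
    std = np.sqrt(2.0 / n_in)                # Var(W) = 2 / n_in
    return rng.normal(0.0, std, size=(n_out, n_in))

W_x = xavier_normal(512, 256)
W_h = he_normal(512, 256)
print(W_x.var(), 2.0 / (512 + 256))          # empirical vs target variance
print(W_h.var(), 2.0 / 512)
```
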
## **Impact on Training**

* **Faster convergence**: Good initialization can reduce the number of epochs needed.
* **Stable gradients**: Prevents vanishing or exploding gradients.
* **Better final accuracy**: Allows the network to learn richer features early on.

💡 **Analogy:**
Think of weight initialization like choosing a starting point for walking down a hill in fog (gradient descent).

* If you start at a terrible spot (bad init), you might get stuck in a ditch (local minimum) or take wild, uncontrolled leaps (exploding gradients).
* If you start at a reasonable spot (good init), you reach the valley floor faster and more reliably.

# 13. What is the vanishing gradient problem in deep learning?
The **vanishing gradient problem** is a common issue in training deep neural networks, where the gradients (partial derivatives of the loss with respect to weights) become **very small** as they are backpropagated through many layers.

Here’s what happens step-by-step:

1. **Backpropagation uses the chain rule**

   * The gradient at each layer is computed by multiplying derivatives from the next layer.
   * In deep networks, this means multiplying many small numbers together.

2. **Small derivatives shrink exponentially**

   * If the activation function’s derivative is less than 1 (e.g., sigmoid or tanh), repeated multiplication across layers makes the gradient approach **zero** for early layers.

3. **Impact**

   * **Early layers** (closer to the input) learn **extremely slowly** because their weights receive almost no update.
   * The network might fail to capture important low-level features.

4. **Example with sigmoid activation**

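   * The sigmoid derivative is at most 0.25.
   * If your network has 10 sigmoid layers, the activation derivatives alone contribute a factor of at most $0.25^{10} \approx 9.5 \times 10^{-7}$, so the gradient reaching the first layer is close to zero unless the weights compensate.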

5. **Why it’s a problem**

   * Training becomes **very slow** or **stalls entirely**.
   * The network might get stuck with poor performance.

**Common solutions**:

* Use activation functions with better gradient flow (e.g., **ReLU**, Leaky ReLU).
* Use **batch normalization** to stabilize activations.
* Use **residual connections** (ResNets) to shorten gradient paths.
* Careful **weight initialization** (e.g., Xavier/He initialization).
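
The shrinkage is easy to quantify. This sketch multiplies the best-case sigmoid derivative (0.25, reached at $z = 0$) across a few hypothetical depths, ignoring the weight terms that also appear in the real chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                     # peaks at 0.25 when z = 0

best_case = sigmoid_grad(0.0)
for depth in [1, 5, 10, 20]:
    # Upper bound on the product of activation derivatives across `depth` layers
    print(depth, best_case ** depth)
```

For comparison, the ReLU derivative is exactly 1 for positive inputs, so the same product does not shrink along active paths.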

# 14. What is the exploding gradient problem?
The **exploding gradient problem** is the opposite of the vanishing gradient problem — instead of gradients becoming tiny during backpropagation, they become **very large** as they pass through many layers.
### How it happens

1. **Backpropagation multiplies derivatives**

   * In deep networks, if the derivatives or weight values are **greater than 1**, multiplying them repeatedly across layers can make gradients grow **exponentially**.

2. **Causes**

   * Poor **weight initialization** (too large values).
   * Activation functions with large derivatives.
   * **Recurrent Neural Networks (RNNs)** with long sequences are especially prone because of repeated weight multiplication over time.
### Impact

* **Unstable training**: The loss oscillates wildly or becomes `NaN` due to numerical overflow.
* **Diverging weights**: The model fails to converge.
* Large updates cause the network to overshoot the optimal weights.
### Example intuition

If the derivative in each layer is about 2 and the network has 10 layers:

$$
2^{10} = 1024
$$

That’s a **thousandfold increase** in the gradient magnitude, causing massive updates.
### Common solutions

* **Gradient clipping**: Limit the gradient’s maximum norm or value (common in RNNs).
* **Careful weight initialization** (e.g., Xavier, He initialization).
* **Lower learning rates**.
* **Batch normalization** to keep activations stable.
* **Residual networks** to shorten gradient paths.
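
The $2^{10}$ intuition above can be played with directly. This is a toy sketch, assuming every layer contributes the same scalar factor to the gradient (real networks multiply matrices, not single numbers); `backprop_scale` is a hypothetical helper name.

```python
def backprop_scale(per_layer_factor, num_layers):
    """Gradient scale after multiplying the same factor once per layer."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

# Factors below 1 vanish, factors above 1 explode, exactly 1 stays stable.
for factor in (0.5, 1.0, 2.0):
    print(f"factor {factor}: scale after 10 layers = {backprop_scale(factor, 10)}")
```

The same loop shows both problems: a factor of 0.5 shrinks the gradient about a thousandfold, while a factor of 2 grows it about a thousandfold.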


# Practical


In [None]:
# 1 How do you create a simple perceptron for basic binary classification?
Alright, let’s go step-by-step and make a **simple perceptron** for **binary classification** in Python.

We’ll use only **NumPy** to show the core logic, with no high-level ML libraries, so you can see exactly how it works.

## **1. What is a Perceptron?**

A perceptron is a single-layer neural network unit that:

1. Takes inputs
2. Multiplies them by weights
3. Sums them up (plus a bias)
4. Passes the result through a step function to output either 0 or 1.

Mathematically:

$$
y = f(w \cdot x + b)
$$

where $f$ is usually the step function:

$$
f(z) = \begin{cases}
1 & \text{if } z \geq 0 \\
0 & \text{otherwise}
\end{cases}
$$
## **2. Code for a Simple Perceptron**

```python
import numpy as np

# Step activation function
def step_function(x):
    return np.where(x >= 0, 1, 0)

# Perceptron training
def perceptron_train(X, y, learning_rate=0.1, epochs=10):
    weights = np.zeros(X.shape[1])
    bias = 0

    for _ in range(epochs):
        for xi, target in zip(X, y):
            linear_output = np.dot(xi, weights) + bias
            prediction = step_function(linear_output)
            update = learning_rate * (target - prediction)
            weights += update * xi
            bias += update

    return weights, bias

# Perceptron prediction
def perceptron_predict(X, weights, bias):
    return step_function(np.dot(X, weights) + bias)

# Example dataset: AND logic gate
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y = np.array([0, 0, 0, 1])

# Train perceptron
weights, bias = perceptron_train(X, y, learning_rate=0.1, epochs=10)

# Test perceptron
predictions = perceptron_predict(X, weights, bias)

print("Weights:", weights)
print("Bias:", bias)
print("Predictions:", predictions)
```
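## **3. Output Example (for AND gate)**

Running the code above should print something close to the following; the exact weight and bias digits depend on floating-point rounding and on how the step function treats $z = 0$:

```
Weights: [0.2 0.1]
Bias: -0.20000000000000004
Predictions: [0 0 0 1]
```

✅ **Key Notes**: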

* This example uses a **step function** so it can only do **linear binary classification** (e.g., AND, OR).
* It won’t work for problems that aren’t linearly separable (e.g., XOR).
* For more complex tasks, you’d use multiple perceptrons (multi-layer) and activation functions like **sigmoid**, **ReLU**, etc.
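
To see the XOR limitation in practice, here is a minimal sketch that trains the same kind of perceptron on XOR. It is rewritten in plain Python (no NumPy) so it runs on its own; the function names are illustrative, not from any library.

```python
def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(samples, targets, learning_rate=0.1, epochs=100):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in zip(samples, targets):
            prediction = step(weights[0] * x1 + weights[1] * x2 + bias)
            update = learning_rate * (target - prediction)
            weights[0] += update * x1
            weights[1] += update * x2
            bias += update
    return weights, bias

def accuracy(samples, targets, weights, bias):
    correct = sum(
        step(weights[0] * x1 + weights[1] * x2 + bias) == t
        for (x1, x2), t in zip(samples, targets)
    )
    return correct / len(targets)

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

w, b = train_perceptron(samples, xor_targets)
print("XOR accuracy:", accuracy(samples, xor_targets, w, b))
```

No single line separates the XOR classes, so the accuracy stays below 1.0 however many epochs you run.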

In [None]:
#2 How can you build a neural network with one hidden layer using Keras?
Alright — let’s go step-by-step and build a **neural network with one hidden layer** using **Keras** (part of TensorFlow).

---

## **1. Basic Idea**

We’ll make a neural network:

* **Input layer** → takes features (X)
* **Hidden layer** → applies weights, bias, and activation function
* **Output layer** → gives final prediction

For **binary classification**, the output layer will have **1 neuron** with a **sigmoid** activation.

---

## **2. Code Example**

```python
# Import libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Example dataset (XOR problem just for demo)
import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([0,1,1,0])

# Build the model
model = Sequential()

# Hidden layer: 4 neurons, relu activation
model.add(Dense(4, input_dim=2, activation='relu'))

# Output layer: 1 neuron, sigmoid activation (binary classification)
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train model
model.fit(X, y, epochs=200, verbose=0)

# Evaluate model
loss, accuracy = model.evaluate(X, y)
print(f"Accuracy: {accuracy*100:.2f}%")

# Predictions
predictions = model.predict(X)
print("Predictions:", predictions)
```

---

## **3. Explanation of Parameters**

* `Dense(4, activation='relu')` → hidden layer with 4 neurons
* `input_dim=2` → number of input features (in our example: two bits)
* `Dense(1, activation='sigmoid')` → output layer for binary classification
* `loss='binary_crossentropy'` → suitable loss for binary classification
* `optimizer='adam'` → adaptive learning rate optimizer
* `epochs=200` → number of passes through the training data

---

## **4. Output Example**

```
Accuracy: 100.00%
Predictions:
[[0.01]
 [0.98]
 [0.98]
 [0.02]]
```

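Here, outputs close to `0` mean class **0**, and outputs close to `1` mean class **1**. These numbers are illustrative: weights start from random values, so results differ between runs, and with only 200 epochs at Adam's default learning rate the network may not yet classify all four XOR points correctly. Training for more epochs usually helps.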
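
To see why one hidden layer is enough for XOR, here is a small sketch with hand-picked weights: two ReLU hidden units and a sigmoid output. These weights are an assumption chosen for illustration; they are not what Keras would learn, and the code uses only the standard library.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_network(x1, x2):
    # Hidden layer: two ReLU units with hand-picked weights and biases.
    h1 = relu(x1 + x2)        # grows with the number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # non-zero only when both inputs are 1
    # Output layer: sigmoid over a weighted sum of the hidden units.
    # The factor 10 only pushes the output closer to 0 or 1.
    return sigmoid(10.0 * (h1 - 2.0 * h2 - 0.5))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_network(x1, x2), 3))
```

The second hidden unit cancels the first one when both inputs are on, which is exactly the bend a single perceptron cannot make.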

In [None]:
#3 How do you initialize weights using the Xavier (Glorot) initialization method in Keras?
In Keras, **Xavier initialization** is already available as **Glorot initialization** (named after Xavier Glorot, who proposed it).

Here’s how you can use it:

## **1. Using Glorot Initialization in Keras**

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform, GlorotNormal

model = Sequential()

# Hidden layer with Xavier (Glorot) uniform initialization
model.add(Dense(
    units=4,
    activation='relu',
    kernel_initializer=GlorotUniform(),  # Xavier uniform
    input_dim=2
))

# Output layer with Xavier (Glorot) normal initialization
model.add(Dense(
    units=1,
    activation='sigmoid',
    kernel_initializer=GlorotNormal()  # Xavier normal
))
```
## **2. Explanation**

* **`GlorotUniform()`** → draws weights from a uniform distribution

  $$
  \text{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
  $$
* **`GlorotNormal()`** → draws weights from a truncated normal distribution centered at 0 with standard deviation

  $$
  \sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}
  $$
* $n_{\text{in}}$ = number of input units feeding the layer (fan-in)
* $n_{\text{out}}$ = number of output units in the layer (fan-out)
## **3. Why Use Xavier Initialization?**

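It keeps the variance of the activations roughly **constant** across layers, which reduces the risk of exploding or vanishing gradients during training, especially for activations like **sigmoid** and **tanh**. Glorot uniform is also the default `kernel_initializer` for Keras `Dense` layers.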
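
The two formulas above are easy to evaluate by hand. This is a minimal sketch in plain Python that computes the Glorot limits for a layer with 2 inputs and 4 units (the sizes used in the Keras example) and draws one weight matrix; the helper names are hypothetical and this is not how Keras implements it internally.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_normal_stddev(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def sample_glorot_uniform(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit)."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
fan_in, fan_out = 2, 4
weights = sample_glorot_uniform(fan_in, fan_out, rng)

print("uniform limit:", glorot_uniform_limit(fan_in, fan_out))
print("normal stddev:", glorot_normal_stddev(fan_in, fan_out))
```

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so both variants target the same variance, $2 / (n_{\text{in}} + n_{\text{out}})$.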

In [None]:
#4 How can you apply different activation functions in a neural network in Keras?
In **Keras**, you can apply **different activation functions** simply by specifying the `activation` parameter in each `Dense` (or other) layer.
You can mix and match per layer depending on your network’s needs.
## **1. Example: Different Activations in Different Layers**

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Build the model
model = Sequential()

# Input → Hidden layer 1 with ReLU
model.add(Dense(8, input_dim=4, activation='relu'))

# Hidden layer 2 with Tanh
model.add(Dense(6, activation='tanh'))

# Hidden layer 3 with LeakyReLU (as a separate layer)
from tensorflow.keras.layers import LeakyReLU
model.add(Dense(4))  # No activation here
# Keras 3 renames this argument to negative_slope (alpha is the older name)
model.add(LeakyReLU(alpha=0.1))

# Output layer with Sigmoid (for binary classification)
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```
## **2. Notes on Activation Choices**

* **ReLU** (`relu`) → the usual first choice for hidden layers; cheap to compute and less prone to vanishing gradients than sigmoid or tanh.
* **Tanh** (`tanh`) → outputs in the range (-1, 1), zero-centered, which suits normalized data.
* **Sigmoid** (`sigmoid`) → outputs in (0, 1), often used for binary classification output layers.
* **Softmax** (`softmax`) → for multi-class classification output layers.
* **LeakyReLU**, **ELU**, etc. → can be added as separate layers for more control.
## **3. Using Activation Layers Explicitly**

Instead of passing `activation='relu'` inside `Dense`, you can also use:

```python
from tensorflow.keras.layers import Activation

model.add(Dense(8, input_dim=4))
model.add(Activation('relu'))
```

This gives more flexibility if you want to apply custom logic before/after activation.
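
To make the activation notes above concrete, here is a minimal sketch of the same functions written in plain Python. These are textbook definitions for single numbers, not the Keras implementations; the `negative_slope` default of 0.1 simply mirrors the LeakyReLU setting used in the example.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, negative_slope=0.1):
    return z if z > 0 else negative_slope * z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    # Subtract the maximum first for numerical stability.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), round(math.tanh(z), 4), round(sigmoid(z), 4))

print("softmax:", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
```

Comparing the rows shows the differences at a glance: ReLU zeroes out negatives, LeakyReLU lets a little through, tanh and sigmoid squash into a fixed range, and softmax turns a list of scores into probabilities that sum to 1.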

In [None]:
# 5 How do you add dropout to a neural network model to prevent overfitting?
In a neural network, **dropout** is a regularization technique that helps prevent overfitting by randomly “dropping” (ignoring) a fraction of neurons during training. This forces the model to not rely too heavily on specific neurons and improves generalization.

Here’s how to add dropout in **Keras** step-by-step:

### **1. Import the required library**

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
```
### **2. Define the model with dropout layers**

```python
model = Sequential()

# First hidden layer (expects 100 input features)
model.add(Dense(64, activation='relu', input_shape=(100,)))

# Dropout layer (drop 30% of neurons)
model.add(Dropout(0.3))

# Hidden layer
model.add(Dense(32, activation='relu'))

# Dropout again
model.add(Dropout(0.2))

# Output layer (binary classification example)
model.add(Dense(1, activation='sigmoid'))
```
### **3. Compile and train**

```python
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# X_train and y_train are your own training arrays (not defined in this snippet)
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```
### **Key Points**

* `Dropout(rate)` → **rate** is the fraction of neurons to drop, e.g., `0.3` means drop **30%** of neurons during training.
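* Dropout is **only applied during training**, where Keras also scales the kept activations by `1 / (1 - rate)` so their expected sum is unchanged. During evaluation or prediction, all neurons are active.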
* It’s usually added **after dense or convolutional layers**, not before the input layer.
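
What the `Dropout` layer does can be sketched in a few lines of plain Python. This is an illustrative "inverted dropout" on a list of activations, assuming `rate` is below 1; the function signature is made up for this example and is not the Keras API.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # inference: pass everything through
    keep_scale = 1.0 / (1.0 - rate)  # rescale survivors to keep the expected sum
    return [a * keep_scale if rng.random() >= rate else 0.0
            for a in activations]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10

print("training: ", dropout(activations, 0.3, True, rng))
print("inference:", dropout(activations, 0.3, False, rng))
```

In training mode each value is either dropped to 0 or scaled up by `1 / 0.7`; in inference mode the input comes back unchanged.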


In [None]:
# 6 How do you manually implement forward propagation in a simple neural network?
Alright — let’s go through **manual forward propagation** in a simple neural network step-by-step, without using Keras, TensorFlow, or PyTorch.

We’ll make a **small example**:

* Input layer: 2 neurons
* One hidden layer: 2 neurons (ReLU activation)
* Output layer: 1 neuron (Sigmoid activation)

## **1. Import dependencies**

```python
import numpy as np
```
## **2. Define input, weights, and biases**

```python
# Example input (1 sample, 2 features)
X = np.array([[0.5, 0.2]])

# Weights and biases (initialized manually)
W1 = np.array([[0.1, 0.4],    # weights for input → hidden
               [0.8, 0.5]])   # shape: (2, 2)
b1 = np.array([[0.1, 0.2]])   # bias for hidden layer

W2 = np.array([[0.3],         # weights for hidden → output
               [0.9]])        # shape: (2, 1)
b2 = np.array([[0.05]])       # bias for output layer
```
## **3. Define activation functions**

```python
def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
## **4. Forward propagation steps**

```python
# Step 1: Input → Hidden layer
z1 = np.dot(X, W1) + b1
a1 = relu(z1)

# Step 2: Hidden → Output layer
z2 = np.dot(a1, W2) + b2
a2 = sigmoid(z2)  # final output
```
## **5. Print results**

```python
print("Hidden layer linear output (z1):", z1)
print("Hidden layer activation (a1):", a1)
print("Output layer linear output (z2):", z2)
print("Final prediction (a2):", a2)
```
### **Flow Recap**

1. **Linear transformation**: $z = XW + b$
2. **Apply activation**: $a = f(z)$
3. Repeat for each layer until the output layer is reached.
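
The "repeat for each layer" step in the recap can be written as a loop. This is a minimal sketch in plain Python that reuses the weights from the example above for a single sample; the helpers `dense` and `forward` are hypothetical names introduced here.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: weights[i][j] connects input i to unit j (z = xW + b)."""
    outputs = []
    for j in range(len(biases)):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        outputs.append(activation(z))
    return outputs

def forward(inputs, layers):
    """Feed the inputs through every (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations

layers = [
    ([[0.1, 0.4], [0.8, 0.5]], [0.1, 0.2], relu),   # input -> hidden
    ([[0.3], [0.9]], [0.05], sigmoid),              # hidden -> output
]

hidden = dense([0.5, 0.2], *layers[0])
output = forward([0.5, 0.2], layers)
print("hidden:", hidden)
print("output:", output)
```

Working it by hand gives the same numbers: the hidden pre-activations are 0.31 and 0.5, the output pre-activation is 0.593, and the sigmoid of that is about 0.644.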

In [None]:
#7 How do you add batch normalization to a neural network model in Keras?
Batch Normalization (**BN**) normalizes a layer's outputs over each mini-batch to a mean of 0 and a variance of 1 during training, then applies a learnable scale and shift. This helps:

* **Speed up convergence**
* **Reduce internal covariate shift**
* **Act as a form of regularization** (sometimes reducing the need for dropout)
## **1. Import the necessary classes**

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, BatchNormalization, Dense, Dropout
```
## **2. Add Batch Normalization to the model**

Here’s a simple example:

```python
model = Sequential()

# First hidden layer (expects 100 input features)
model.add(Dense(64, activation='relu', input_shape=(100,)))

# Batch Normalization
model.add(BatchNormalization())

# Optional Dropout (for extra regularization)
model.add(Dropout(0.3))

# Hidden layer
model.add(Dense(32, activation='relu'))
model.add(BatchNormalization())

# Output layer
model.add(Dense(1, activation='sigmoid'))
```
## **3. Compile and train**

```python
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.2)
```
### **Best Practices**

* A common alternative places **`BatchNormalization()`** *after* the Dense/Conv layer but **before** the activation:

  ```python
  model.add(Dense(64, use_bias=False))  # Bias not needed, BN handles it
  model.add(BatchNormalization())
  model.add(Activation('relu'))
  ```
* BN can sometimes reduce the need for dropout — but using both is possible.
* It works well for deep networks by stabilizing training.
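
The normalization step itself is short. Here is a minimal sketch of the training-time computation for one feature across a mini-batch, in plain Python. The `epsilon` default of `1e-3` mirrors the Keras layer's default; `gamma` and `beta` stand for the learnable scale and shift, and the function name is made up for illustration. At inference time Keras uses moving averages instead of the batch statistics, which this sketch does not cover.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + epsilon) + beta
            for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # one feature, four samples
normalized = batch_norm(batch)
print([round(v, 4) for v in normalized])
```

The output has mean 0 and a variance just under 1 (the small `epsilon` keeps the division safe when a batch has almost no spread).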

In [None]:
#8 How can you visualize the training process with accuracy and loss curves?
To visualize the **training process** of a neural network, you can plot **accuracy** and **loss curves** from the `History` object returned by `model.fit()` in Keras.

## **1. Train the model and store history**

```python
history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2)
```

The returned `history` object has a `history.history` dictionary. With `metrics=['accuracy']` at compile time and a validation split, its keys include:

* `"accuracy"` → training accuracy
* `"val_accuracy"` → validation accuracy
* `"loss"` → training loss
* `"val_loss"` → validation loss
## **2. Plot accuracy and loss curves**

```python
import matplotlib.pyplot as plt

# Plot Accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Plot Loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```
## **3. What to look for**

* **If training loss keeps going down but validation loss goes up** → overfitting.
* **If both losses stagnate at high values** → underfitting.
* **Smooth curves** are good; highly noisy curves might mean a learning rate that’s too high.
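
The overfitting pattern above can also be checked in code rather than by eye. This is a small sketch that works on a dictionary shaped like `history.history`; the loss values are made-up numbers for illustration, not real training results, and both helper functions are hypothetical.

```python
# Made-up numbers shaped like `history.history`; not real training results.
example_history = {
    "loss":     [0.70, 0.55, 0.42, 0.33, 0.25, 0.18],
    "val_loss": [0.72, 0.60, 0.51, 0.49, 0.53, 0.60],
}

def best_epoch(history_dict):
    """Index (0-based) of the epoch with the lowest validation loss."""
    val_losses = history_dict["val_loss"]
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

def overfitting_epochs(history_dict):
    """Epochs where training loss fell while validation loss rose."""
    loss = history_dict["loss"]
    val_loss = history_dict["val_loss"]
    return [i for i in range(1, len(loss))
            if loss[i] < loss[i - 1] and val_loss[i] > val_loss[i - 1]]

print("best epoch:", best_epoch(example_history))
print("overfitting epochs:", overfitting_epochs(example_history))
```

In this made-up run the validation loss bottoms out at epoch index 3 and then climbs while the training loss keeps falling, which is the overfitting signature described above.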


In [None]:
# 9 How can you use gradient clipping in Keras to control the gradient size and prevent exploding gradients?
**Gradient clipping** is a technique that limits (clips) the size of gradients during backpropagation to prevent the **exploding gradient problem** — where gradients become extremely large, causing unstable training and NaN losses.

In **Keras**, you set gradient clipping when you create the optimizer.
There are **two common methods**:
## **1. Clip by Value**

Limits each gradient component to the fixed range `[-clipvalue, clipvalue]`.

```python
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, clipvalue=1.0)  # Clip each gradient value
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

This is like saying: *“No single gradient component should be bigger than 1 or smaller than -1.”*

## **2. Clip by Norm**

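Rescales each weight's gradient tensor so that its **L2 norm** does not exceed a set value. To clip by the norm of all gradients taken together, Keras optimizers also accept `global_clipnorm`.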

```python
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)  # Clip gradient vector norm
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

This is like saying: *“No single weight tensor’s gradient can have a norm bigger than 1.”*
### **When to Use Which**

* **clipvalue** → Caps spikes in individual gradient components, but can change the direction of the gradient.
* **clipnorm** → More common in deep learning; shrinks the gradient without changing its direction.
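
The difference between the two methods shows up clearly on a tiny example. This is a minimal sketch in plain Python operating on a single gradient vector; the function names are illustrative and are not the Keras internals.

```python
import math

def clip_by_value(gradient, clip_value):
    """Clamp every component into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in gradient]

def clip_by_norm(gradient, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        return list(gradient)
    scale = clip_norm / norm
    return [g * scale for g in gradient]

gradient = [3.0, -4.0]  # L2 norm is 5

print("by value:", clip_by_value(gradient, 1.0))
print("by norm: ", clip_by_norm(gradient, 1.0))
```

Clipping by value flattens both components to the same magnitude, so the 3:4 ratio is lost; clipping by norm shrinks the vector to length 1 and keeps that ratio.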



In [None]:
#10 How can you create a custom loss function in Keras?
In **Keras**, you can create a **custom loss function** either as:

1. **A Python function**
2. **A subclass of `tf.keras.losses.Loss`** (for more control)

Let’s go through both.
## **1. Custom Loss as a Function**

The function must take **`y_true`** and **`y_pred`** as arguments and return the loss, either as a scalar or as one value per sample (Keras averages per-sample values for you).
Example: Mean Squared Error (custom version)

```python
import tensorflow as tf

def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
```

**Using it:**

```python
model.compile(optimizer='adam',
              loss=custom_mse,
              metrics=['mae'])
```
## **2. Custom Loss with Extra Parameters**

If you want a parameterized loss (e.g., weighted loss), you can use a closure:

```python
def weighted_mse(weight):
    def loss(y_true, y_pred):
        return tf.reduce_mean(weight * tf.square(y_true - y_pred))
    return loss

model.compile(optimizer='adam',
              loss=weighted_mse(0.5),
              metrics=['mae'])
```
## **3. Custom Loss as a Class**

Subclass `tf.keras.losses.Loss` for advanced cases:

```python
class CustomHuberLoss(tf.keras.losses.Loss):
    def __init__(self, delta=1.0, name="custom_huber_loss"):
        super().__init__(name=name)
        self.delta = delta

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) <= self.delta
        small_error_loss = 0.5 * tf.square(error)
        big_error_loss = self.delta * (tf.abs(error) - 0.5 * self.delta)
        return tf.reduce_mean(tf.where(is_small_error, small_error_loss, big_error_loss))
```

**Using it:**

```python
model.compile(optimizer='adam',
              loss=CustomHuberLoss(delta=1.5),
              metrics=['mae'])
```
✅ **Key Points:**

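* Return either a **scalar tensor** or per-sample losses; Keras reduces per-sample values to a single number.
* The loss should be **differentiable** so TensorFlow can compute gradients.
* Build the loss from TensorFlow ops; NumPy calls are not tracked for differentiation, so gradients cannot flow through them.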
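
Before wiring a custom loss into a model, it helps to check the formula on numbers you can verify by hand. This is a plain-Python sketch of the same Huber formula used in `CustomHuberLoss` above; it is only for sanity-checking values and is not differentiable by TensorFlow. The function names are hypothetical.

```python
def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)

def mean_huber(y_true, y_pred, delta=1.0):
    losses = [huber(t - p, delta) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

print(huber(0.5))   # small error: quadratic branch
print(huber(3.0))   # large error: linear branch
print(mean_huber([0.0, 0.0], [0.5, 3.0]))
```

With `delta=1.0`, an error of 0.5 costs 0.125 and an error of 3.0 costs 2.5, so a large outlier is penalized far less than the 4.5 it would cost under half squared error.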


In [None]:
# 11 How can you visualize the structure of a neural network model in Keras?
In **Keras**, you can visualize a neural network’s structure in a couple of ways — either as a **summary in text form** or as a **graph diagram**.
## **1. View Model Summary (Text)**

```python
model.summary()
```

This shows:

* Layer names
* Output shapes
* Number of parameters

Example output:

```
Layer (type)          Output Shape       Param #
================================================
dense (Dense)         (None, 64)         6464
dropout (Dropout)     (None, 64)         0
dense_1 (Dense)       (None, 32)         2080
dense_2 (Dense)       (None, 1)          33
================================================
Total params: 8,577
Trainable params: 8,577
Non-trainable params: 0
```
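
The parameter counts in the example summary can be reproduced by hand. This is a minimal sketch, assuming the model is a stack of `Dense` layers with 100 input features followed by 64, 32, and 1 units (dropout layers add no parameters); `dense_params` is a hypothetical helper name.

```python
def dense_params(inputs, units, use_bias=True):
    """A Dense layer has one weight per input-unit pair, plus one bias per unit."""
    return inputs * units + (units if use_bias else 0)

layer_sizes = [100, 64, 32, 1]  # input features, then units per Dense layer
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print("per layer:", counts)
print("total:", sum(counts))
```

The per-layer counts come out to 6464, 2080, and 33, for a total of 8,577, matching the summary above.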
## **2. Visualize as a Diagram**

```python
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
```

* `show_shapes=True` → displays output shapes per layer
* `show_layer_names=True` → shows each layer's name (auto-generated if you did not set one)

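**Note:** You need the `pydot` Python package and Graphviz installed. The `pip` command below installs the Python packages; the Graphviz executables themselves usually come from your operating system's package manager: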

```bash
pip install pydot graphviz
```
## **3. In Jupyter Notebook (Inline Display)**

```python
from tensorflow.keras.utils import plot_model
from IPython.display import Image

plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
Image(filename='model.png')
```