![image.png](attachment:image.png)

# Train Data

---

## Definition
- The portion of the dataset used to **teach the model**.
- Contains **inputs (features)** and **outputs (labels)**.
- The model **learns patterns and relationships** from this data.

---

## Daily-Life Analogy
- Imagine **learning to bake a cake**.  
- You try recipes many times, note what works (amount of flour, sugar, baking time).  
- These **practice attempts** are like **train data** — you learn the patterns before baking for guests.

---

## Key Points
- Usually **60-80%** of the total dataset.
- Used to **fit the model parameters**.
- Performance measured on train data alone is **not a reliable estimate**: an overfitted model can score well on data it has already seen.

---

## Mini Example (House Price Prediction)

| House Size (sq.ft) | No. of Bedrooms | Price (₹ in Lakhs) |
|-------------------|-----------------|-------------------|
| 1000              | 2               | 40                |
| 1500              | 3               | 55                |
| 2000              | 3               | 70                |

**Goal:** Model learns relationship:  
House Size & Bedrooms → Price
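
As a rough illustration of "learning the relationship", the sketch below fits a straight line to the three rows in the table using ordinary least squares. It uses only house size as the input (bedrooms are left out to keep it short, which is an assumption of this sketch), and `fit_line` is a helper name made up for the example.

```python
# Training rows from the table above: (house size in sq.ft, price in lakhs).
# Bedrooms are left out to keep the sketch to one feature (an assumption).
train = [(1000, 40), (1500, 55), (2000, 70)]


def fit_line(rows):
    """Ordinary least squares for price = slope * size + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(train)
print(f"price = {slope:.3f} * size + {intercept:.1f}")  # price = 0.030 * size + 10.0
```

The slope and intercept are the **model parameters** that the train data "teaches".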

---


![image.png](attachment:image.png)

# Validation Data

---

## Definition
- A portion of the dataset used to **tune the model**.
- Helps **select the best model and hyperparameters**.
- Helps **detect overfitting** to the train data before the final evaluation.

---

## Daily-Life Analogy
- Imagine **baking a cake for a party** after practicing (train data).  
- You **taste a small piece** before serving it to everyone.  
- This **taste test** is like **validation data** — helps adjust sugar, baking time, or decorations before the final cake.

---

## Key Points
- Usually **10-20%** of the dataset.
- Used **during training** for model tuning.
- **Not used** for final performance evaluation.

---

## Mini Example (House Price Prediction)

| House Size (sq.ft) | No. of Bedrooms | Price (₹ in Lakhs) |
|-------------------|-----------------|-------------------|
| 1800              | 3               | ?                 |
| 2200              | 4               | ?                 |

**Goal:** Predict the `?` prices → compare with the known actual prices → adjust **hyperparameters** (or pick a different model) for better accuracy.
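
A minimal sketch of how validation data can guide one tuning choice. Everything here is an assumption for illustration: the numbers are made up (not taken from the tables above), and a k-nearest-neighbours average stands in for "the model", with `k` as the hyperparameter being tuned.

```python
# Made-up data for illustration only (not from the tables above):
# the "true" price is 40 + 6*i lakhs at size 1000 + 200*i, and the
# training prices carry an alternating +4 / -4 noise term.
train = [(1000 + 200 * i, 40 + 6 * i + (4 if i % 2 == 0 else -4)) for i in range(8)]
validation = [(1050, 41.5), (1450, 53.5), (1850, 65.5), (2250, 77.5)]


def knn_predict(rows, size, k):
    """Average the prices of the k training houses closest in size."""
    nearest = sorted(rows, key=lambda row: abs(row[0] - size))[:k]
    return sum(price for _, price in nearest) / k


def mean_abs_error(rows, data, k):
    """Mean absolute error of the k-NN predictions on `data`."""
    return sum(abs(knn_predict(rows, x, k) - y) for x, y in data) / len(data)


candidates = [1, 2, 4]  # hyperparameter values to compare
val_errors = {k: mean_abs_error(train, validation, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)
print(val_errors, "-> best k:", best_k)  # {1: 2.5, 2: 1.5, 4: 3.75} -> best k: 2
```

The validation rows are never used to make predictions, only to compare the candidate settings.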

---


# Test Data / Testing Model

---

## Definition
- A portion of the dataset used to **evaluate the final model’s performance**.
- Contains **unseen data** (not used in training or validation).
- Objective: Estimate **real-world accuracy** and generalization.

---

## Daily-Life Analogy
- Imagine **baking a cake for guests** after practice (train) and taste test (validation).  
- Serving the cake to your **friends/family** is like **testing** — it shows how well your recipe works in the real world.

---

## Key Points
- Usually **10-20%** of the dataset.
- Model **does not learn** from this data.
- Helps to check **true performance** and **reveal overfitting** that train results can hide.
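
The three portions can be produced with a simple shuffle-and-cut. This is only a sketch: `train_val_test_split` is a hypothetical helper, and the 70/15/15 split is one choice within the ranges quoted in these notes.

```python
import random


def train_val_test_split(rows, train_pct=70, val_pct=15, seed=0):
    """Shuffle rows, then cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> repeatable split
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, val, test


rows = list(range(20))  # stand-in for 20 data rows
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))  # 14 3 3
```

Each row lands in exactly one portion, so the test rows stay unseen during training and tuning.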

---

## Mini Example (House Price Prediction)

| House Size (sq.ft) | No. of Bedrooms | Actual Price (₹ in Lakhs) | Predicted Price |
|-------------------|-----------------|---------------------------|----------------|
| 1900              | 3               | 65                        | 64             |
| 2300              | 4               | 85                        | 87             |

**Goal:** Compare predictions vs actual → evaluate model accuracy.
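
For a price prediction task, "accuracy" is usually expressed as an error measure. The sketch below computes two common ones, mean absolute error (MAE) and root mean squared error (RMSE), on the two rows of the table above; the choice of these two metrics is an assumption, since the notes do not name one.

```python
import math

# Rows from the test table above: (actual price, predicted price) in lakhs.
results = [(65, 64), (85, 87)]

errors = [predicted - actual for actual, predicted in results]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE  = {mae:.2f} lakhs")   # MAE  = 1.50 lakhs
print(f"RMSE = {rmse:.2f} lakhs")  # RMSE = 1.58 lakhs
```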

---


# Model Overfitting

---

## Definition
- **Overfitting** occurs when a model learns the **training data too well**, including noise or irrelevant patterns.
- Result: **High accuracy on train data** but **poor performance on new/unseen data**.

---

## Causes
- Very **complex model** for a small dataset
- **Too many features** relative to number of data points
- Training **too long** without regularization

---

## Mini Example

| House Size (sq.ft) | No. of Bedrooms | Price (₹ in Lakhs) |
|-------------------|-----------------|-------------------|
| 1000              | 2               | 40                |
| 1500              | 3               | 55                |
| 2000              | 3               | 70                |
| 2500              | 4               | 90                |

- **Overfitted model:** Draws a complex curve that passes exactly through every point in the training set  
- **Problem:** Fails to predict new house prices accurately
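
The effect can be reproduced with a small sketch. The data below is made up for illustration (the four table rows are too clean to show it), and a "memorizer" that returns the price of the closest training house stands in for the complex curve: like that curve, it matches every training point exactly.

```python
# Made-up data for illustration only: the "true" price is 40 + 6*i lakhs
# at size 1000 + 200*i, and each training price has +4 or -4 noise added.
train = [(1000 + 200 * i, 40 + 6 * i + (4 if i % 2 == 0 else -4)) for i in range(8)]
unseen = [(1050, 41.5), (1450, 53.5), (1850, 65.5), (2250, 77.5)]  # noise-free


def memorizer(size):
    """Overfitted stand-in: return the price of the closest training house."""
    return min(train, key=lambda row: abs(row[0] - size))[1]


def fit_line(rows):
    """Least-squares line: price = slope * size + intercept."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(train)


def line(size):
    """Simpler model: the fitted straight line."""
    return slope * size + intercept


def mae(model, rows):
    """Mean absolute error of `model` on `rows`."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


for name, model in [("memorizer", memorizer), ("line", line)]:
    print(f"{name:9s} train MAE = {mae(model, train):.2f}, unseen MAE = {mae(model, unseen):.2f}")
```

On this data the memorizer has zero error on the training rows but an error of 2.5 lakhs on the unseen rows, while the straight line is worse on the training rows (about 3.8) and better on the unseen ones (about 0.76).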

---

## Daily-Life Analogy
- **Studying only past exam papers**:  
  - You memorize answers word-for-word (train data)  
  - New questions in the actual exam (test data) confuse you → **overfitting**

---

## Key Points
- Shows **low bias, high variance**
- Avoid by:  
  - Using more data  
  - Simplifying model  
  - Regularization (L1, L2)  
  - Cross-validation

---


![image.png](attachment:image.png)

# Model Underfitting

---

## Definition
- **Underfitting** occurs when a model is **too simple** to capture patterns in the data.
- Result: **Poor performance on both training and test data**.

---

## Causes
- Model is too **simple** (low complexity)
- **Insufficient features**
- Not trained enough (too few epochs or iterations)

---

## Mini Example

| House Size (sq.ft) | No. of Bedrooms | Price (₹ in Lakhs) |
|-------------------|-----------------|-------------------|
| 1000              | 2               | 40                |
| 1500              | 3               | 55                |
| 2000              | 3               | 70                |
| 2500              | 4               | 90                |

- **Underfitted model:** Draws a nearly flat line (about the same price for every house) that **misses the trend** of the data  
- **Problem:** Cannot predict train or test house prices accurately
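
A minimal sketch of an underfitted model on the four rows above: a "flat" model that ignores house size and always predicts the average price. The name `flat_model` and the use of mean absolute error (MAE) are choices made for this example.

```python
# Rows from the table above: (house size in sq.ft, price in lakhs).
rows = [(1000, 40), (1500, 55), (2000, 70), (2500, 90)]

mean_price = sum(price for _, price in rows) / len(rows)


def flat_model(size):
    """Underfitted stand-in: ignores size and always predicts the mean price."""
    return mean_price


def mae(model, data):
    """Mean absolute error of `model` on `data`."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


print(f"flat model predicts {mean_price} for every house")  # 63.75
print(f"train MAE = {mae(flat_model, rows)} lakhs")          # 16.25
```

The error is large even on the data the model was built from, which is the signature of underfitting. For comparison, a least-squares line through the same four rows has a train MAE of 1.25 lakhs.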

---

## Daily-Life Analogy
- **Learning just the basics of a recipe**:  
  - You know only flour + sugar → cake often turns out wrong  
  - Need more understanding to bake well → **underfitting**

---

## Key Points
- Shows **high bias, low variance**
- Avoid by:  
  - Using a **more complex model**  
  - Adding **more relevant features**  
  - Training **longer**  

---


# Generalized Model

---

## Definition
- A **generalized model** is one that performs well on **both training and unseen test data**.
- Strikes the **right balance** between underfitting and overfitting.
- Goal: **Good accuracy and ability to generalize to new data**.

---

## Key Points
- Captures **true patterns** without memorizing noise
- Has **low bias and low variance** (well-balanced)
- Achieved by:
  - Proper model complexity
  - Enough training data
  - Regularization and tuning
  - Cross-validation

---

## Mini Example (House Price Prediction)

| House Size (sq.ft) | No. of Bedrooms | Price (₹ in Lakhs) |
|-------------------|-----------------|-------------------|
| 1000              | 2               | 40                |
| 1500              | 3               | 55                |
| 2000              | 3               | 70                |
| 2500              | 4               | 90                |
| 1800              | 3               | 60                |

- Model learns **general relationship**: House Size & Bedrooms → Price  
- Can **predict new house prices** accurately

---

## Daily-Life Analogy
- **Learning to cook a new dish**:  
  - You understand the ingredients, proportions, and techniques  
  - You can **successfully cook for any guest**, not just follow one recipe exactly → **generalization**

---

# Bias and Variance

---

## 1. Bias
- **Definition:** Error due to **over-simplification** of the model.
- **High Bias → Underfitting**
- Model **cannot capture patterns** in the data.

**Daily-Life Analogy:**  
- Studying only the summary notes for an exam → miss important details → wrong answers.  

---

## 2. Variance
- **Definition:** Error due to **over-sensitivity** to training data.
- **High Variance → Overfitting**
- Model **captures noise**, performs poorly on new data.

**Daily-Life Analogy:**  
- Memorizing past exam answers word-for-word → fail on new questions.  

---

## Mini Example (House Price Prediction)

| House Size (sq.ft) | Bedrooms | Price (₹ in Lakhs) |
|-------------------|----------|-------------------|
| 1000              | 2        | 40                |
| 1500              | 3        | 55                |
| 2000              | 3        | 70                |
| 2500              | 4        | 90                |

| Model Type        | Bias  | Variance | Outcome                     |
|------------------|-------|----------|------------------------------|
| Too simple        | High  | Low      | Underfits → poor train/test |
| Too complex       | Low   | High     | Overfits → good train, poor test |
| Balanced          | Low   | Low      | Generalized → good train/test |

---


![image.png](attachment:image.png)

# Bias-Variance Tradeoff

---

## Definition
- **Tradeoff between Bias and Variance** in a model:
  - **High Bias → Underfitting** (too simple, misses patterns)
  - **High Variance → Overfitting** (too complex, captures noise)
- Goal: **Minimize total error** by balancing bias and variance → **generalized model**.

---

## Daily-Life Analogy
- Learning to **play cricket**:  
  - **High Bias:** Only practice hitting straight → fail against new types of bowlers  
  - **High Variance:** Over-practicing fancy shots → score inconsistently  
  - **Balanced:** Learn fundamentals + some advanced shots → perform well against any bowler

---

## Mini Example (House Price Prediction)

| Model Type        | Bias  | Variance | Outcome                     |
|------------------|-------|----------|------------------------------|
| Too simple        | High  | Low      | Underfits → poor train/test |
| Too complex       | Low   | High     | Overfits → good train, poor test |
| Balanced          | Low   | Low      | Generalized → good train/test |

---

## Key Points
- **Total Error = Bias² + Variance + Irreducible Error**  
- **Goal:** Find sweet spot → low bias, low variance  
- Techniques to control tradeoff:
  - Regularization (L1, L2)  
  - More training data  
  - Cross-validation  
  - Decrease or increase model complexity as needed
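
For squared-error loss, the total-error formula in the first key point can be written out as below. The notation is introduced here and is not from the notes: the data is assumed to follow $y = f(x) + \varepsilon$ with noise of mean $0$ and variance $\sigma^2$, $\hat{f}$ is the model fitted on a randomly drawn training set, and the expectation is over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
$$

Only the first two terms depend on the model, which is why tuning trades one against the other while $\sigma^2$ stays fixed.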



---

# Cross Validation (CV)

**Problem:** So far, we have simply been told which points are the Training Data and which are the Testing Data. Usually, however, no one tells us which points to use for training and which for testing. *How do we pick the best points for **Training** and the best for **Testing**?*

![image.png](attachment:image.png)

![image.png](attachment:image.png)
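
One common answer to the question above is k-fold cross-validation: instead of picking a single "best" split, every block of points takes a turn as the testing data while the rest is used for training. The sketch below only generates the index splits; `k_fold_splits` is a hypothetical helper written for this example, and it assumes the rows are already in random order.

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_splits(n_rows=6, k=3):
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

A model would be trained and scored once per pair, and the scores averaged to get a single estimate.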