

---

---

### ✅ **Comparison Table for Quick Recall**

| Concept      | Why Needed?                       | When to Use?               |
| ------------ | --------------------------------- | -------------------------- |
| **Mean**     | Overall average                   | Symmetric data             |
| **Median**   | Robust to outliers                | Skewed data                |
| **Mode**     | Most frequent value               | Categorical data           |
| **Range**    | Quick spread check                | Small datasets             |
| **Variance** | Mathematical measure of spread    | For statistical models     |
| **SD**       | Interpretable spread in same unit | Risk analysis, normal data |
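
As a rough illustration of the table, the sketch below computes all six measures with Python's standard `statistics` module and then repeats the calculation after adding one extreme value. The data values and the `summarize` helper are made up for this example.

```python
import statistics

data = [1, 2, 2, 3, 4]        # illustrative values, not from the notes
with_outlier = data + [100]   # same data plus one extreme value

def summarize(values):
    """Return the six measures from the table for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # population form (divide by N)
        "sd": statistics.pstdev(values),           # square root of the variance
    }

base = summarize(data)
shifted = summarize(with_outlier)
print(base)
print(shifted)
```

The mean, range, variance, and SD all jump once the extreme value is added, while the median and mode barely move, which is the usual argument for preferring the median on skewed data.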

---




### ✅ **Why is sample variance divided by (n - 1) instead of n?**

---

#### **1. Problem Background**

* **Population variance** formula uses **N**:

$$
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
$$

* But when we **don’t know the population mean** ($\mu$), we estimate it using the **sample mean** ($\bar{x}$).
* This introduces **bias** in variance calculation if we still divide by $n$.

---

#### **2. The Bias Problem**

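* The **sample mean** is the value that minimizes the sum of squared deviations for that sample, so $\sum (x_i - \bar{x})^2 \le \sum (x_i - \mu)^2$ always holds. Squared deviations from $\bar{x}$ are therefore **smaller on average** than squared deviations from the true population mean.
* So if we divide by $n$, the estimate is **too small**: on average it equals $\frac{n-1}{n}\sigma^2$, which underestimates the true population variance.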

---

#### **3. Why (n - 1)?**

* **(n - 1)** is the number of **degrees of freedom** left for estimating spread.
* We lose 1 degree of freedom because the sample mean is already computed from the sample → **one constraint**:

$$
\sum (x_i - \bar{x}) = 0
$$

* Dividing by (n - 1) **corrects this bias** and gives an **unbiased estimator** of population variance.
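
One way to check the bias claim without simulation noise is to enumerate every possible sample from a tiny population and average the two estimators exactly. The sketch below assumes a hypothetical population (the six faces of a fair die) and samples of size 3 drawn with replacement; none of these choices come from the notes.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: faces of a fair die
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N   # divide by N

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

n = 3
samples = list(product(population, repeat=n))   # every ordered sample, with replacement

# Average each estimator over all equally likely samples (exact expectation).
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print("population variance:", pop_var)
print("average when dividing by n:", avg_div_n)
print("average when dividing by n - 1:", avg_div_n_minus_1)
```

Under these assumptions the (n - 1) version averages out to the population variance exactly, while the n version comes out smaller by the factor (n - 1)/n.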

---

#### ✅ **In One Line (Interview Answer):**

> We divide by **(n - 1)** instead of **n** to correct the bias introduced by using the sample mean instead of the true population mean, ensuring an **unbiased estimate** of population variance. (This is called **Bessel’s correction**.)




---

### ✅ **Random Variable – Why is it important?**

* In real-world problems, outcomes of experiments (like tossing a coin, rolling a die, measuring height) are **uncertain**.
* We assign **numerical values to these uncertain outcomes** → this is a **Random Variable**.
* It’s the foundation of **probability distributions, statistical inference, and machine learning models**.

---

### ✅ **Definition**

A **Random Variable (RV)** is a variable that takes values based on the outcome of a random experiment.

---

### ✅ **Types of Random Variables**

---

#### **1. Discrete Random Variable**

* **Definition:** Takes **finite or countable** values.
* **Examples:**

  * No. of heads in 3 coin tosses (0,1,2,3)
  * No. of calls in a call center in an hour
* **Probability Distribution:** Probability Mass Function (PMF).

$$
P(X = x) \geq 0,\ \sum P(X = x) = 1
$$

* **Key Point:** The probability of an exact value can be non-zero (it is positive for every value the variable can actually take).
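
The coin-toss example above can be turned into an explicit PMF by listing every outcome. This is a minimal sketch that assumes a fair coin, so all eight sequences are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 sequences of three tosses, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

# X = number of heads in a sequence; count how many sequences give each value.
counts = Counter(seq.count("H") for seq in outcomes)

# PMF: P(X = k) = (sequences with k heads) / (total sequences).
pmf = {k: Fraction(c, len(outcomes)) for k, c in sorted(counts.items())}

for k, p in pmf.items():
    print(f"P(X = {k}) = {p}")
print("total probability:", sum(pmf.values()))
```

Each probability is non-negative and the values sum to 1, which are the two PMF conditions stated above.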

---

#### **2. Continuous Random Variable**

* **Definition:** Takes **uncountably many values within an interval** (any real number in a range).
* **Examples:**

  * Height, weight, time taken to run 100m
* **Probability Distribution:** Probability Density Function (PDF).
* **Key Point:**

$$
P(X = a) = 0,\ \text{but } P(a < X < b) > 0
$$

* We calculate probability using **area under the curve**.
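
To make "area under the curve" concrete, the sketch below models height with a normal distribution using `statistics.NormalDist` from the standard library. The mean of 170 cm and standard deviation of 10 cm are assumed values chosen only for illustration.

```python
from statistics import NormalDist

height = NormalDist(mu=170, sigma=10)   # hypothetical model of height in cm
a, b = 160, 180

# P(a < X < b) is the area under the PDF between a and b,
# which equals the difference of the CDF at the two endpoints.
p_interval = height.cdf(b) - height.cdf(a)

# An "interval" from a to a has zero width, so its area is zero.
p_point = height.cdf(a) - height.cdf(a)

# The density at a is positive, but a density is not a probability.
density_at_a = height.pdf(a)

print(f"P({a} < X < {b}) = {p_interval:.4f}")
print(f"P(X = {a}) = {p_point}")
print(f"pdf({a}) = {density_at_a:.4f}")
```

The PDF value at a point can be positive (and can even exceed 1 for narrow distributions) without contradicting P(X = a) = 0, because probability only comes from integrating the density over a width.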

---

---

### ✅ **Interview-Style Comparison Table**

| Feature           | Discrete RV                | Continuous RV               |
| ----------------- | -------------------------- | --------------------------- |
| Values            | Countable                  | Uncountable (interval)      |
| Examples          | Die roll, No. of students  | Height, Weight              |
| Function Used     | PMF (Probability Mass)     | PDF (Probability Density)   |
| P(X = x)          | Can be > 0                 | = 0                         |
| Probability Calc. | Sum of probabilities       | Area under curve (integral) |

---

---

### ✅ **Common Interview Questions & Short Answers**

**Q1: Why is P(X = a) = 0 for continuous variables?**
**A:** Probability for a continuous variable is the area under the PDF over an interval. A single point is an interval of zero width, so its area, and therefore its probability, is exactly zero.

---

**Q2: Can a variable be both discrete and continuous?**
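**A:** Yes, a **mixed random variable** has both a discrete and a continuous part. Example: waiting time at a service counter. It is exactly 0 with positive probability (served immediately, the discrete part) and otherwise takes a continuous range of positive values.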

---

**Q3: Which one uses PMF and which uses PDF?**

* Discrete → PMF
* Continuous → PDF

---




---

### ✅ **Why Percentiles & Quartiles are Important**

* Central tendency (mean/median) gives the **center**, but we also need to understand **how data is spread across its range**.
* Percentiles and quartiles help in:

  * **Identifying relative position** of a value in the dataset.
  * Detecting **outliers**.
  * Summarizing data for **statistical analysis**.

---

## ✅ **Percentiles**

#### **Definition**

* A **percentile** indicates the value **below which a given percentage of observations fall**.
* Example: 90th percentile = value below which **90% of data lies**.

$$
\text{Percentile Rank of x} = \frac{\text{No. of values ≤ x}}{\text{Total values}} \times 100
$$
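
The percentile-rank formula translates almost directly into code. In this sketch the `percentile_rank` helper and the score list are hypothetical, and ties are counted as "at or below", matching the ≤ in the formula.

```python
def percentile_rank(values, x):
    """Percentage of observations that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]   # hypothetical test scores
rank_of_90 = percentile_rank(scores, 90)
print(rank_of_90)   # 8 of the 10 scores are <= 90
```

Libraries differ on whether they use ≤ or <, and on how they interpolate between points, so small datasets can give slightly different percentile values depending on the tool.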

---

#### **Why important?**

* Used in **standardized tests, health indicators, salary benchmarking**.
* Example: If your score is at the **80th percentile**, you scored as well as or better than about 80% of test takers.

---

#### **Common Interview Questions**

* **Q:** If someone is in the 95th percentile in height, what does it mean?
  **A:** They are taller than 95% of people in the dataset.

---

---

## ✅ **Quartiles**

#### **Definition**

* Quartiles **divide the dataset into 4 equal parts**:

  * **Q1 (25th percentile):** 25% of values are below Q1.
  * **Q2 (50th percentile):** Median.
  * **Q3 (75th percentile):** 75% of values are below Q3.

---

#### **Why important?**

* Used in **Boxplots**, **Interquartile Range (IQR)** for detecting outliers:

$$
\text{IQR} = Q3 - Q1
$$

* Values outside $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ are flagged as **potential outliers** (Tukey's rule of thumb).
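
The 1.5 × IQR rule can be sketched with `statistics.quantiles` from the standard library. The data below is made up, and `method="inclusive"` is one of several quartile conventions, so other tools may report slightly different Q1 and Q3 for the same data.

```python
import statistics

data = [10, 12, 13, 14, 15, 16, 18, 19, 45]   # illustrative values with one extreme point

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("flagged:", outliers)
```

Only the value far above the upper fence is flagged; the fences themselves are computed from the quartiles, so the extreme point has little influence on where they sit.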

---

---

### ✅ **Quick Comparison Table**

| Term           | Meaning                               |
| -------------- | ------------------------------------- |
| **Percentile** | Position based on 100 equal divisions |
| **Quartile**   | Position based on 4 equal divisions   |
| **Q1**         | 25th percentile                       |
| **Q2**         | 50th percentile (Median)              |
| **Q3**         | 75th percentile                       |

---

---

### ✅ **Common Interview Questions & Answers**

**Q1: Why do we use quartiles instead of mean for detecting outliers?**
**A:** Quartiles are **robust to outliers**, while the mean is influenced by extreme values.

---

**Q2: What is the IQR and why is it important?**
**A:** IQR = Q3 – Q1. It measures the **middle 50% spread** and is used to **detect outliers**.

---

**Q3: Are percentiles affected by outliers?**
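**A:** Much less than the mean. Percentiles depend on **position (rank)**, not magnitude, so the median and quartiles barely move when an extreme value is added. Only percentiles near the extremes (e.g., the 99th) are pulled by outliers.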

---


---

## ✅ **1. Mean Absolute Deviation (MAD)**

### **Definition**

* MAD is the **average of absolute deviations from the mean**.

$$
\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}
$$
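
A minimal sketch of the formula, using a made-up five-value dataset and a hypothetical helper name:

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

data = [2, 4, 6, 8, 10]   # mean is 6; absolute deviations are 4, 2, 0, 2, 4
mad = mean_absolute_deviation(data)
print(mad)
```

Be aware that "MAD" is also widely used for the *median* absolute deviation, which is a different and more outlier-resistant statistic; this sketch follows the mean-based definition given here.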

---

### **Why important?**

* Measures **spread without squaring deviations** (unlike variance).
* Less sensitive to **extreme outliers** than variance.
* Used when **robustness** is required (e.g., economics, quality control).

---

### **When to use?**

* When **interpretability in same units** is needed.
* For **non-normal data** with outliers.

---

### **Interview Question:**

**Q:** Why do we use absolute deviations in MAD?
**A:** To avoid positive and negative deviations canceling each other.

---

---

## ✅ **2. Coefficient of Variation (CV)**

### **Definition**

* CV measures **relative variability** (spread compared to the mean).

$$
\text{CV} = \frac{\sigma}{\bar{x}} \times 100\%
$$

where $\sigma$ = Standard Deviation, $\bar{x}$ = Mean.

---

### **Why important?**

* Allows comparison of **variability across datasets with different units or scales**.
* Example:

  * Dataset A: Mean = 100, SD = 10 → CV = 10%
  * Dataset B: Mean = 1000, SD = 50 → CV = 5%
    Even though Dataset B has higher SD, **relative variation** is lower.
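
The two datasets above can be checked with a few lines of code. The helper name is hypothetical, and the guard on the mean reflects the assumption that CV is only used for data with a positive mean.

```python
def coefficient_of_variation(mean, sd):
    """CV as a percentage: spread relative to the mean."""
    if mean <= 0:
        raise ValueError("CV needs a positive mean")
    return sd * 100 / mean

cv_a = coefficient_of_variation(mean=100, sd=10)    # Dataset A
cv_b = coefficient_of_variation(mean=1000, sd=50)   # Dataset B

print(f"CV of A: {cv_a}%")
print(f"CV of B: {cv_b}%")
print("B is relatively less variable:", cv_b < cv_a)
```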

---

### **When to use?**

* Comparing **risk vs return** in finance (stocks, portfolios).
* Comparing **stability of processes** in manufacturing.

---

### **Interview Question:**

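**Q:** Why is CV better than SD for comparing two datasets?
**A:** Because CV is **dimensionless**: it expresses spread relative to the mean, so datasets with different units or scales can be compared directly. It is only meaningful when the mean is positive and not close to zero.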

---

---

## ✅ **3. Quartile Deviation (QD) / Semi-Interquartile Range**

### **Definition**

* Measures **spread of the middle 50% of data**.

$$
\text{QD} = \frac{Q_3 - Q_1}{2}
$$

where $Q_1$ = 25th percentile, $Q_3$ = 75th percentile.

---

### **Why important?**

* **Robust to outliers**, because it ignores extreme values.
* Used in **skewed distributions**.

---

### **When to use?**

* For **income distribution**, **real estate prices**, where extremes distort mean and SD.

---

### **Interview Question:**

**Q:** Why is QD called semi-interquartile range?
**A:** Because it is **half of the interquartile range (IQR)**:

$$
\text{IQR} = Q_3 - Q_1,\ \text{QD} = \frac{\text{IQR}}{2}
$$

---

---

### ✅ **Quick Summary Table**

| Measure | Formula                                      | When to Use                          |
| ------- | -------------------------------------------- | ------------------------------------ |
| **MAD** | $\frac{\sum \lvert x_i - \bar{x} \rvert}{n}$ | Robust to outliers, interpretability |
| **CV**  | $\frac{\sigma}{\bar{x}} \times 100\%$        | Compare variability across datasets  |
| **QD**  | $\frac{Q_3 - Q_1}{2}$                        | Skewed data, outlier presence        |

---


---

# **Descriptive Statistics – Data Science Interview Notes**

---

## **1. Measures of Central Tendency**

| Measure            | Formula                        | When to Use             | Key Points              |
| ------------------ | ------------------------------ | ----------------------- | ----------------------- |
| **Mean (Average)** | $\bar{x} = \frac{\sum x_i}{n}$ | Symmetric data, numeric | Sensitive to outliers   |
| **Median**         | Middle value after sorting     | Skewed data             | Robust to outliers      |
| **Mode**           | Most frequent value            | Categorical data        | Can have multiple modes |

---

## **2. Measures of Dispersion**

| Measure                           | Formula                                                                                                             | When to Use                         | Key Points                       |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------- | ----------------------------------- | -------------------------------- |
| **Range**                         | Max − Min                                                                                                           | Quick check of spread               | Sensitive to outliers            |
| **Variance**                      | $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (population), $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ (sample)         | Measures spread mathematically      | Units squared, hard to interpret |
| **Standard Deviation (SD)**       | $\sigma = \sqrt{\text{Variance}}$                                                                                   | Normal data, risk analysis          | Same unit as data                |
| **Mean Absolute Deviation (MAD)** | $\text{MAD} = \frac{\sum \lvert x_i - \bar{x} \rvert}{n}$                                                           | Robust measure of spread            | Less sensitive to outliers       |
| **Coefficient of Variation (CV)** | $\text{CV} = \frac{\sigma}{\bar{x}} \times 100\%$                                                                   | Compare variability across datasets | Unit-free                        |
| **Quartile Deviation (QD)**       | $\frac{Q_3 - Q_1}{2}$                                                                                               | Skewed data                         | Robust to outliers               |
| **Interquartile Range (IQR)**     | $Q_3 - Q_1$                                                                                                         | Detect outliers                     | Middle 50% spread                |

---

## **3. Position Measures**

| Measure         | Definition                                               | Key Points                                  |
| --------------- | -------------------------------------------------------- | ------------------------------------------- |
| **Percentiles** | Value below which x% of data lies                        | Example: 90th percentile → 90% values below |
| **Quartiles**   | Divide data into 4 parts (Q1=25%, Q2=50%=Median, Q3=75%) | Used in IQR & boxplots                      |
| **Deciles**     | Divide data into 10 equal parts                          | Less commonly used                          |

---

## **4. Shape of Distribution**

| Measure      | Meaning             | Interpretation                                                |
| ------------ | ------------------- | ------------------------------------------------------------- |
| **Skewness** | Symmetry of data    | +ve skew → right tail; -ve skew → left tail                   |
| **Kurtosis** | Tail heaviness (often described as peakedness) | Mesokurtic → normal-like tails; Leptokurtic → heavy tails, sharp peak; Platykurtic → light tails, flat peak |

---

## **5. Relationship Measures**

| Measure                   | Formula                                                           | Key Points                                                         |
| ------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------------ |
| **Covariance**            | $\text{Cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$ | Direction of relationship; unit-dependent; use n − 1 for a sample  |
| **Correlation (Pearson)** | $r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$                   | Unit-free, -1 ≤ r ≤ 1; strength & direction of linear relationship |
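
The difference between "unit-dependent" and "unit-free" is easy to see by rescaling one variable. The sketch below implements both formulas from the table with the divide-by-n form; the `hours` and `score` values and the helper names are invented for illustration.

```python
import math

def covariance(xs, ys):
    """Population covariance (divides by n), matching the table's formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    """Covariance scaled by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))   # Cov(X, X) is the variance of X
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

hours = [1, 2, 3, 4, 5]                  # hypothetical study time in hours
score = [2, 4, 5, 4, 5]                  # hypothetical quiz scores
minutes = [h * 60 for h in hours]        # same quantity, different unit

cov_raw = covariance(hours, score)
cov_scaled = covariance(minutes, score)
r_raw = pearson_r(hours, score)
r_scaled = pearson_r(minutes, score)

print("covariance (hours):", cov_raw)
print("covariance (minutes):", cov_scaled)
print("correlation (hours):", round(r_raw, 4))
print("correlation (minutes):", round(r_scaled, 4))
```

Switching from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged, which is why correlation is the one to quote when comparing strength of relationships.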

---

## **6. Outlier Detection & Standardization**

| Concept          | Formula / Rule                                             | Key Points                                                   |
| ---------------- | ---------------------------------------------------------- | ------------------------------------------------------------ |
| **Z-score**      | $Z = \frac{x - \bar{x}}{\sigma}$                           | Standardizes data; $\lvert Z \rvert > 3$ → potential outlier |
| **Boxplot Rule** | Outlier < $Q1 - 1.5 \times IQR$ or > $Q3 + 1.5 \times IQR$ | Detects extreme values                                       |
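
A small sketch of the Z-score rule, using invented readings and the population standard deviation (`statistics.pstdev`). With that choice the largest possible |Z| in a dataset of n points is √(n − 1), so the rule cannot flag anything in samples of 10 or fewer; the example therefore uses 15 values.

```python
import statistics

# Hypothetical sensor readings: 14 values near 50 and one extreme value.
readings = [48, 49, 50, 50, 50, 51, 51, 52, 49, 50, 50, 51, 49, 50, 120]

mean = statistics.fmean(readings)
sd = statistics.pstdev(readings)          # population SD (divide by n)

# Standardize: how many SDs each reading sits from the mean.
z_scores = [(x - mean) / sd for x in readings]

# Flag readings whose absolute Z-score exceeds 3.
flagged = [x for x, z in zip(readings, z_scores) if abs(z) > 3]

print("mean:", round(mean, 2), "sd:", round(sd, 2))
print("flagged:", flagged)
```

Because the extreme value inflates both the mean and the SD it is measured against, the Z-score rule is less robust than the boxplot rule, which relies on quartiles.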

---

### ✅ **Quick Tips for Interviews**

* Always mention **robustness to outliers** when comparing measures.
* For skewed data → **use median and IQR/QD**.
* For comparisons across datasets → **use CV**.
* Use **correlation vs covariance** correctly; correlation is **dimensionless**.
* Know **Bessel’s correction** (n-1) for sample variance.

---

