# Descriptive Statistics

- Descriptive statistics is the branch of statistics that **summarizes and organizes data** so it can be understood easily.  
- It **describes the main features** of a dataset without making predictions or inferences.  

---

## Common Measures in Descriptive Statistics

1. **Measures of Central Tendency** – Describe the “center” of the data:  
   - **Mean** → Average value  
   - **Median** → Middle value when data is sorted  
   - **Mode** → Most frequent value  

2. **Measures of Dispersion** – Describe the spread or variability:  
   - **Range** → Difference between maximum and minimum  
   - **Variance** → Average squared deviation from the mean  
   - **Standard Deviation** → Square root of variance  

3. **Measures of Position** – Describe the relative location of a value:  
   - Percentiles, Quartiles, Minimum, Maximum  

4. **Data Visualization** – Helps understand patterns visually:  
   - **Histogram** → Frequency distribution  
   - **Boxplot** → Shows median, quartiles, outliers  
   - **Bar Chart / Pie Chart** → For categorical data  
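
As a rough illustration of the dispersion and position measures listed above, the sketch below uses Python's standard `statistics` module. The dataset is a small made-up list chosen for easy arithmetic (an assumption, not data from a real source), and the population forms `pvariance`/`pstdev` are used because they match the "average squared deviation" definition given here.

```python
import statistics

# Hypothetical dataset, chosen only for illustration
data = [5, 8, 12, 15, 20]

# Measures of dispersion
data_range = max(data) - min(data)      # maximum minus minimum
variance = statistics.pvariance(data)   # average squared deviation from the mean
std_dev = statistics.pstdev(data)       # square root of the variance

# Measures of position: the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"Range: {data_range}")                 # Range: 15
print(f"Variance: {variance:.2f}")            # Variance: 27.60
print(f"Standard deviation: {std_dev:.2f}")   # Standard deviation: 5.25
print(f"Median (Q2): {q2}")
```

When the data is a sample rather than a whole population, `statistics.variance` and `statistics.stdev` (which divide by n - 1) are the usual alternatives.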

---

## Example Use Case

Imagine an e-commerce company analyzing customer orders:  
- Mean order value = ₹350 → Average spending per order  
- Median = ₹300 → Middle value of order amounts  
- Mode = ₹200 → Most common order value  
- Standard deviation = ₹75 → Spread of spending  
- Histogram → Shows distribution of order values  


# Measures of Central Tendency

Measures of central tendency describe the **center point** or typical value of a dataset.  
- They help us understand where most of the data values lie.  
- Widely used in EDA (Exploratory Data Analysis), data preparation, and feature engineering.

![image.png](attachment:image.png)

## Mean (Arithmetic Average)

- The **mean** is the sum of all observations divided by the number of observations.  
- It represents the **central or typical value** of a dataset.  

![image.png](attachment:image.png)

- `N` is the number of observations when computing the **population** mean.
- `n` is the number of observations when computing the **sample** mean.
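
For reference, the two cases can be written out explicitly. This assumes the usual textbook notation, which is not spelled out in the text here: \(\mu\) for the population mean, \(\bar{x}\) for the sample mean, and \(x_i\) for the individual observations.

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]

The arithmetic is the same in both cases; only the symbol and the set of values being averaged differ.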

- **Example:**

Dataset: [5, 8, 12, 15, 20]  

\[
\text{Mean} = \frac{5 + 8 + 12 + 15 + 20}{5} = \frac{60}{5} = 12
\]

- **Key Points:**
    - Represents the **overall average** of the data.  
    - **Sensitive to outliers** (extremely high or low values can skew the mean).  
    - Useful for **quantitative data** like marks, income, or age.
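
A minimal sketch of the mean calculation and its sensitivity to outliers. The first list is the example dataset above; the second list, where 20 is replaced by 200, is a hypothetical variation added only to show the effect of one extreme value.

```python
def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

data = [5, 8, 12, 15, 20]           # dataset from the example above
with_outlier = [5, 8, 12, 15, 200]  # hypothetical: one extreme value added

print(mean(data))          # 12.0
print(mean(with_outlier))  # 48.0
```

A single extreme value moves the mean from 12 to 48, even though four of the five observations are unchanged.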

## Median

- The **median** is the **middle value** of a dataset when the values are arranged in **ascending or descending order**.  
- It divides the dataset into **two equal halves**.  

- **How to Find Median:**
    1. Arrange the data in order (smallest to largest).  
    2. If the number of values \(n\) is **odd**, median = middle value.  
    3. If \(n\) is **even**, median = average of the two middle values.  

- **Example 1 (Odd number of values):**
    Dataset: [5, 8, 12, 15, 20]  
    - Ordered: [5, 8, 12, 15, 20]  
    - Middle value = 12 → **Median = 12**  

- **Example 2 (Even number of values):**
    Dataset: [5, 8, 12, 15]  
    - Ordered: [5, 8, 12, 15]  
    - Middle values = 8 and 12  
    - Median = (8 + 12) / 2 = 10  

- **Key Points:**
    - Median is **less affected by outliers** compared to mean.  
    - Useful for **skewed data** or data with extreme values.  
    - Represents the **central location** of data.  
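
The three steps for finding the median can be sketched as a small function. The name `median` is chosen here for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Return the middle value of the data, following the steps listed above."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)           # step 1: arrange in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # step 2: odd n -> the middle value
        return ordered[mid]
    # step 3: even n -> average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([5, 8, 12, 15, 20]))  # 12
print(median([5, 8, 12, 15]))      # 10.0
```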


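**Median as the Positional Midpoint (Middle-Most Element)**

- Suppose five people stand in a row, numbered 1, 2, 3, 4, 5.
- The middle position holds the value 3, so the median is 3.
- If the last value changes from 5 to 100, the middle value is still 3: the median depends on position in the sorted order, not on how extreme the end values are.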

### Use Case Example: Monthly Income of Employees

Suppose a company has 7 employees with monthly salaries (in ₹):

Dataset: [25,000, 28,000, 30,000, 32,000, 35,000, 38,000, 1,50,000]  

Here, ₹1,50,000 is an **outlier** (maybe the CEO).

#### 1. Physical Midpoint
\[
\text{Midpoint} = \frac{\text{Max} + \text{Min}}{2} = \frac{1{,}50{,}000 + 25{,}000}{2} = 87{,}500
\]  
- **Observation:** Only considers the smallest and largest values, so midpoint is **much higher than most salaries**.  

#### 2. Median
- Arrange in order: [25,000, 28,000, 30,000, 32,000, 35,000, 38,000, 1,50,000]  
- Middle value (4th value) = **32,000**  
- **Observation:** Median gives a **better sense of typical salary**, unaffected by the outlier.  

#### 3. Mean
\[
\text{Mean} = \frac{25{,}000 + 28{,}000 + 30{,}000 + 32{,}000 + 35{,}000 + 38{,}000 + 1{,}50{,}000}{7} = \frac{3{,}38{,}000}{7} \approx 48{,}286
\]  
- **Observation:** Mean is **inflated due to the outlier**, not representing the typical employee salary.  

---

#### Key Takeaway
- **Physical Midpoint** → Uses only the two extremes, so it says little about how the data is distributed.  
- **Median** → Best for datasets with outliers.  
- **Mean** → Sensitive to outliers; best for balanced data without extreme values.
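
The salary comparison can be checked with a few lines of Python. This is a sketch using the standard `statistics` module on the seven salaries from the example; the variable names are illustrative.

```python
import statistics

# Monthly salaries (in ₹) from the example above
salaries = [25_000, 28_000, 30_000, 32_000, 35_000, 38_000, 150_000]

midpoint = (max(salaries) + min(salaries)) / 2  # uses only the two extremes
median_salary = statistics.median(salaries)     # middle (4th) value
mean_salary = statistics.mean(salaries)         # pulled upward by the outlier

print(f"Midpoint: {midpoint:,.0f}")      # Midpoint: 87,500
print(f"Median:   {median_salary:,.0f}") # Median:   32,000
print(f"Mean:     {mean_salary:,.0f}")   # Mean:     48,286
```

Six of the seven salaries fall between ₹25,000 and ₹38,000, and only the median lands inside that range.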


## Mode

- The **mode** is the value that **occurs most frequently** in a dataset.  
- A dataset may have:  
  - **No mode** (all values occur once)  
  - **One mode** (unimodal)  
  - **Multiple modes** (bimodal or multimodal)  

- **Example 1 (Single Mode):**
    Dataset: [5, 8, 12, 8, 15]  
    - Most frequent value = 8 → **Mode = 8**  

- **Example 2 (Multiple Modes):**
    Dataset: [5, 8, 12, 8, 12]  
    - Most frequent values = 8 and 12 → **Modes = 8, 12**  

- **Example 3 (No Mode):**
    Dataset: [5, 8, 12, 15, 20]  
    - All values occur once → **No mode**

- **Key Points:**
    - Useful for **categorical/discrete data** (e.g., most purchased product, most common shoe size).  
    - Can also be used for **numerical data**, especially to see the most typical value.  
    - Less affected by outliers compared to mean.
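
The three cases (one mode, several modes, no mode) can be sketched with `collections.Counter`. The function name `find_modes` and the choice to return an empty list for "no mode" are assumptions made for this illustration, following the convention used in these notes.

```python
from collections import Counter

def find_modes(values):
    """Return all most-frequent values, or [] if every value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # "no mode": all values occur once
    return [value for value, count in counts.items() if count == highest]

print(find_modes([5, 8, 12, 8, 15]))   # [8]
print(find_modes([5, 8, 12, 8, 12]))   # [8, 12]
print(find_modes([5, 8, 12, 15, 20]))  # []
```

Note that the standard library's `statistics.multimode` follows a different convention: when every value occurs once, it returns all of the values rather than an empty list.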


### *Which measure applies to which type of data*

- On numerical data
    - Mean
    - Median
    - Mode (mainly for discrete values)
- On categorical data
    - Mode (the mean and median are not defined for unordered categories)

## Missing Value Imputation

- **Missing values** occur when some data points are **not recorded** or are absent in a dataset.  
- **Imputation** is the process of **replacing missing values** with reasonable estimates so that the dataset can be used for analysis or modeling.  

- **Common Methods of Imputation:**

1. **Mean Imputation**  
   - Replace missing values with the **mean** of the available data.  
   - Example: If ages = [25, 30, NaN, 35], missing value → (25+30+35)/3 = 30  

2. **Median Imputation**  
   - Replace missing values with the **median** of the available data.  
   - Useful for **skewed data**, because the median is less affected by outliers than the mean.  

3. **Mode Imputation**  
   - Replace missing values with the **most frequent value**.  
   - Useful for **categorical data**.  
   - Example: Colors = [Red, Blue, Red, NaN, Green] → missing value → Red  

4. **Forward/Backward Fill**  
   - Fill each missing value with the **previous** (forward fill) or **next** (backward fill) observed value.  
   - Common in **time-series data**.  

5. **Prediction-Based Imputation**  
   - Use **regression or machine learning models** to predict missing values based on other variables.  

- **Key Points:**
    - Imputation helps **avoid losing valuable data**.  
    - Choice of method depends on:  
        - Type of data (categorical or numerical)  
        - Distribution of data  
        - Amount of missing values  
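
Forward and backward fill are the least obvious of the methods above, so here is a minimal pure-Python sketch. Missing values are represented as `None`, and the helper names and temperature readings are hypothetical, chosen only for illustration.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value.

    Leading Nones stay as None because there is nothing earlier to copy.
    """
    filled, last_seen = [], None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

def backward_fill(values):
    """Replace each None with the next observed value (trailing Nones stay)."""
    return forward_fill(values[::-1])[::-1]

# Hypothetical time-ordered readings with gaps
readings = [21.0, None, None, 23.5, None, 22.0]

print(forward_fill(readings))   # [21.0, 21.0, 21.0, 23.5, 23.5, 22.0]
print(backward_fill(readings))  # [21.0, 23.5, 23.5, 23.5, 22.0, 22.0]
```

In pandas, the corresponding operations are `Series.ffill()` and `Series.bfill()`, which are usually the more practical choice on real datasets.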



### Example: Missing Value Imputation

Suppose we have a dataset of employees with missing **Age** and **Department** information:

| Employee ID | Age  | Department  |
|------------|------|------------|
| E001       | 25   | Sales      |
| E002       | NaN  | Marketing  |
| E003       | 30   | NaN        |
| E004       | 35   | HR         |
| E005       | NaN  | Sales      |

- **Imputation Methods Applied**

| Employee ID | Age (Mean Imputed) | Age (Median Imputed) | Department (Mode Imputed) |
|------------|------------------|--------------------|----------------------------|
| E001       | 25               | 25                 | Sales                      |
| E002       | 30               | 30                 | Marketing                  |
| E003       | 30               | 30                 | Sales                      |
| E004       | 35               | 35                 | HR                         |
| E005       | 30               | 30                 | Sales                      |

**Explanation:**

- **Age (Mean Imputed)** → Mean of available ages = (25+30+35)/3 = 30  
- **Age (Median Imputed)** → Median of available ages = 30  
- **Department (Mode Imputed)** → Mode of available departments = Sales  
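
The imputed table can be reproduced with a short sketch using only the standard library. Missing entries are represented as `None`, and the helper `impute` is a hypothetical name introduced for this illustration.

```python
import statistics

# Columns from the employee table above; None marks a missing entry
ages = [25, None, 30, 35, None]
departments = ["Sales", "Marketing", None, "HR", "Sales"]

def impute(values, fill_value):
    """Return a copy of values with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]

# Statistics are computed from the observed (non-missing) values only
observed_ages = [a for a in ages if a is not None]
observed_departments = [d for d in departments if d is not None]

age_mean = statistics.mean(observed_ages)                 # (25 + 30 + 35) / 3
age_median = statistics.median(observed_ages)             # middle of 25, 30, 35
department_mode = statistics.mode(observed_departments)   # most frequent value

print(impute(ages, age_mean))                # [25, 30, 30, 35, 30]
print(impute(ages, age_median))              # [25, 30, 30, 35, 30]
print(impute(departments, department_mode))  # ['Sales', 'Marketing', 'Sales', 'HR', 'Sales']
```

Mean and median imputation give the same result here only because the three observed ages are symmetric around 30; on skewed data the two would differ.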
