#  Feature Engineering – Handling Missing Values

In real-world **Data Science / ML projects**, raw data is messy and often contains **missing values**.  
Handling missing values is one of the most important steps in **data preprocessing**.

---

## Why Do Missing Values Occur?
- People skip questions in surveys (e.g., salary, age).  
- Data entry errors.  
- Machine/sensor failures.  
- Lost records (e.g., Titanic passenger details).  

---

##  Types of Missing Data

### 1. **MCAR – Missing Completely At Random**
- Probability of missingness is **unrelated** to observed or unobserved data.  
- Missingness happens purely by **chance or error**.  

Example:  
A survey participant **forgets** to fill in a field.

---

### 2. **MAR – Missing At Random**
- Probability of missingness **depends on other observed data**.  

Example (missingness depends on the respondent's gender, which is observed):  
- Men tend to skip the salary field.  
- Women tend to skip the age field.  

---

### 3. **MNAR – Missing Not At Random**
- Missingness depends on the **value of the missing data itself** or other unmeasured factors.  

Example:  
- Employees who are **dissatisfied** with their jobs don't report income, and job satisfaction is not recorded → missingness depends on an unobserved factor.

---

**Summary:**
- **MCAR** → Missing purely by chance.  
- **MAR** → Missingness related to other observed variables.  
- **MNAR** → Missingness related to the missing value itself or to unobserved factors.
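
The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `survey` frame, its `sex` and `income` columns, and all probabilities are hypothetical values chosen for illustration, not taken from any real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: sex is always observed, income may go missing
survey = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed column (sex), not on income itself
mar_prob = np.where(survey["sex"] == "male", 0.35, 0.05)
mar_mask = rng.random(n) < mar_prob

# MNAR: the chance depends on the very income value that goes missing
mnar_prob = np.where(survey["income"] > 60_000, 0.50, 0.05)
mnar_mask = rng.random(n) < mnar_prob

# Mean income among the rows that remain observed under each mechanism
observed_mean = pd.Series({
    "full data": survey["income"].mean(),
    "MCAR": survey.loc[~mcar_mask, "income"].mean(),
    "MAR": survey.loc[~mar_mask, "income"].mean(),
    "MNAR": survey.loc[~mnar_mask, "income"].mean(),
})
print(observed_mean.round(0))
```

In this setup the MCAR mean stays close to the full-data mean, while the MNAR mean is pulled downward because high incomes are the ones most likely to be missing.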


## Detecting Missing Values

```python
import seaborn as sns
import pandas as pd

# Load Titanic dataset
df = sns.load_dataset("titanic")

# Count missing values per column
df.isnull().sum()
```

```
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```
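
Raw counts are easier to judge as percentages. The sketch below builds a small report with both; the `toy` frame and its values are made up so the example runs without downloading the Titanic data.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical values) so the example runs offline
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the missing fraction
missing_report = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing_report)
```

The same two expressions should work unchanged on `df` from the cell above.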

#  Ways to Handle Missing Data

In machine learning, handling missing values is crucial for building reliable models.  
Here are the most common techniques:

---

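## 1. Dropping Missing Values
- **Row-wise:** Remove rows that contain any missing value → `df.dropna()`  
- **Column-wise:** Remove columns that contain any missing value → `df.dropna(axis=1)` (pass `thresh` to keep columns that still have enough non-missing values)

**Problem:** May result in **huge data loss**.  
**Best used when:**
- **Dropping rows:** only a small fraction of rows is affected and the dataset is **very large**.
- **Dropping columns:** the missing percentage in that column is **very high** (e.g., `deck`, with 688 missing values in the Titanic data).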
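
A minimal sketch of more targeted dropping, assuming a made-up `toy` frame: `subset` limits the row check to the columns that matter, and `thresh` sets the minimum number of non-missing values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'deck' is mostly missing, the other columns only slightly
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "embarked": ["S", "C", None, "S", "Q"],
    "deck": [np.nan, "C", np.nan, np.nan, "E"],
})

# Drop a row only if 'age' or 'embarked' is missing; 'deck' is ignored here
rows_kept = toy.dropna(subset=["age", "embarked"])

# Keep only columns with at least 3 non-missing values out of 5
cols_kept = toy.dropna(axis=1, thresh=3)

print(rows_kept.shape)          # (3, 3)
print(list(cols_kept.columns))  # ['age', 'embarked']
```

The threshold of 3 is arbitrary; in practice it depends on how much information the column is expected to carry.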

---

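## 2. Imputation Techniques

### a) **Mean Imputation**
- Replace missing values with the **mean** of the column.  
- Works well if the data is roughly **normally distributed** (symmetric, without strong outliers).  

```python
# 'col' is a placeholder column name; assigning the result back avoids chained in-place modification
df['col'] = df['col'].fillna(df['col'].mean())
```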


### b) **Median Imputation**
- Replace missing values with the **median** of the column.  
- Works well if the data is **skewed** or has **outliers**.  

```python
df["age_median"] = df["age"].fillna(df["age"].median())
```


### c) **Mode Imputation**
- Replace missing values with the **most frequent value**.  
- Used for **categorical variables**.  

```python
mode_value = df["embarked"].mode()[0]
df["embarked_mode"] = df["embarked"].fillna(mode_value)
```


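### d) **Random Sample Imputation**
- Replace each missing value with a value drawn at random from the observed values of the same column.  
- Keeps the distribution of the column closer to the original than filling in a single constant.  

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the result is reproducible
missing = df["age"].isnull()
# Draw one observed age (with replacement) for every missing entry
df["age_random"] = df["age"].copy()
df.loc[missing, "age_random"] = rng.choice(df["age"].dropna().to_numpy(), size=int(missing.sum()))
```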
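
To see how these strategies affect the spread of a column, the sketch below compares mean, median, and random sample imputation on synthetic data. The skewed `values` series, the 20% missing rate, and the helper `random_sample_impute` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed column with about 20% of values removed at random
values = pd.Series(rng.exponential(scale=30.0, size=1_000))
values[rng.random(len(values)) < 0.20] = np.nan

def random_sample_impute(s: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill each missing entry with a value drawn (with replacement) from the observed ones."""
    filled = s.copy()
    missing = filled.isnull()
    filled[missing] = rng.choice(s.dropna().to_numpy(), size=int(missing.sum()))
    return filled

imputed = pd.DataFrame({
    "mean": values.fillna(values.mean()),
    "median": values.fillna(values.median()),
    "random": random_sample_impute(values, rng),
})

print("observed std:", round(values.std(), 2))
print(imputed.std().round(2))
```

Filling with a single constant (mean or median) shrinks the standard deviation, because every imputed point sits at one spot. Random sample imputation spreads the filled values across the observed range, so its standard deviation is never smaller than that of mean imputation.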



# Feature Engineering – Handling Imbalanced Datasets

---

##  What is an Imbalanced Dataset?

- In classification problems, when one class has **significantly more samples** than another, the dataset is said to be **imbalanced**.  
- Example:  
  - Dataset = 1000 samples  
  - Class **Yes** = 900  
  - Class **No** = 100  
  - Ratio = **9:1 → Imbalanced dataset**

---

##  Why is it a Problem?

- Models tend to become **biased towards the majority class**.  
- They perform poorly when predicting the minority class.  
- Example: With the 900/100 split above, a model that always predicts "Yes" reaches 90% accuracy while never identifying a single "No".
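
The effect can be checked with plain arithmetic. This sketch uses the 900/100 split from the example above and a degenerate stand-in "model" that always predicts the majority class; no real classifier is trained.

```python
import numpy as np

# Labels from the example above: 900 "Yes" (majority) and 100 "No" (minority)
y_true = np.array(["Yes"] * 900 + ["No"] * 100)

# A degenerate "model" that ignores its input and always predicts the majority class
y_pred = np.full(y_true.shape, "Yes")

accuracy = (y_pred == y_true).mean()

# Recall for "No": share of true "No" samples that the model labels "No"
no_mask = y_true == "No"
recall_no = (y_pred[no_mask] == "No").mean()

print(f"accuracy   = {accuracy:.2f}")   # accuracy   = 0.90
print(f"recall(No) = {recall_no:.2f}")  # recall(No) = 0.00
```

Accuracy alone hides the failure, which is why per-class metrics such as recall are worth checking on imbalanced data.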

---

##  Techniques to Handle Imbalanced Data

Two common approaches:  

### 1. Upsampling (Over-sampling minority class)
- Increase the number of **minority class samples** by duplicating existing samples or creating synthetic ones.  
- Example: Increasing **100 "No" samples → 900 "No" samples**.  
- Advantage: Uses the full dataset and balances the classes.  
- Disadvantage: May lead to **overfitting**, since duplicated samples are repeated exactly.



```python
import pandas as pd
from sklearn.utils import resample

# Assumes `df` has a binary `target` column in which 1 is the minority class
# Separate majority and minority classes
df_minority = df[df.target == 1]   # Minority class
df_majority = df[df.target == 0]   # Majority class

# Upsample minority
df_minority_upsampled = resample(df_minority,
                                 replace=True,        # Sample with replacement
                                 n_samples=len(df_majority), # Match majority
                                 random_state=42)

# Combine majority & upsampled minority
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
```


### 2. Downsampling (Under-sampling majority class)
- Reduce the number of **majority class samples**.  
- Example: Reducing **900 "Yes" samples → 100 "Yes" samples**.  
- Advantage: Balanced dataset, faster training.  
- Disadvantage: Loss of data (may remove important information).

```python
# Reuses `resample`, `df_majority`, and `df_minority` from the upsampling example
# Downsample majority
df_majority_downsampled = resample(df_majority,
                                   replace=False,       # No replacement
                                   n_samples=len(df_minority), # Match minority
                                   random_state=42)

# Combine minority & downsampled majority
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

print(df_downsampled['target'].value_counts())
```
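
For a version that runs end to end, here is a self-contained sketch on synthetic data. It swaps `sklearn.utils.resample` for pandas' own `DataFrame.sample`, which should give an equivalent result for this purpose. The `data` frame, its `feature` column, and the 900/100 split are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical dataset: 900 majority rows (target=0) and 100 minority rows (target=1)
data = pd.DataFrame({
    "feature": rng.normal(size=1_000),
    "target": [0] * 900 + [1] * 100,
})

majority = data[data["target"] == 0]
minority = data[data["target"] == 1]

# Upsampling: draw minority rows with replacement until they match the majority
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Downsampling: draw majority rows without replacement to match the minority
downsampled = pd.concat([
    minority,
    majority.sample(n=len(minority), replace=False, random_state=42),
])

# Shuffle so the classes are not stored in contiguous blocks
upsampled = upsampled.sample(frac=1, random_state=42).reset_index(drop=True)
downsampled = downsampled.sample(frac=1, random_state=42).reset_index(drop=True)

print(upsampled["target"].value_counts().to_dict())
print(downsampled["target"].value_counts().to_dict())
```

Both frames end up with equal class counts: 900 rows per class after upsampling and 100 per class after downsampling.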
