# Dealing with Missing Data

**Module 4: Data Cleaning & Transformation**

## Learning Objectives
- Understand why data goes missing and its impact on analysis
- Identify and quantify missing data in datasets
- Apply appropriate strategies to handle missing values
- Document decisions for stakeholder communication

## Business Context
> "Before applying any technique, ask yourself: **Why is this data missing?** The answer determines your strategy."

Missing data is inevitable in real-world datasets. As a Data Analyst, your job is not just to "fix" it, but to **understand it** and **make informed decisions** that you can explain to stakeholders.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("✓ Libraries loaded successfully")

---
## 1. Types of Missing Data

Understanding **why** data is missing is crucial for choosing the right strategy:

| Type | Description | Example | Impact |
|------|-------------|---------|--------|
| **MCAR** | Missing Completely At Random | Survey response lost due to technical error | Dropping or simple imputation does not bias results (dropping still costs sample size) |
| **MAR** | Missing At Random (depends on other observed variables) | Younger users skip "income" field | Can use other variables to impute |
| **MNAR** | Missing Not At Random (depends on the missing value itself) | High earners don't report salary | Problematic: observed data alone cannot correct the bias; may need domain expertise |
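
To make the "Impact" column concrete, here is a minimal sketch on synthetic salaries (the seed, sample size, and missingness rates are illustrative assumptions, not values from the course dataset). It compares the observed mean under MCAR and MNAR against the true mean:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
true_salary = pd.Series(rng.normal(55000, 15000, 10_000))

# MCAR: every value has the same 20% chance of being lost
mcar = true_salary.mask(rng.random(len(true_salary)) < 0.20)

# MNAR: only salaries above 70,000 can go missing (50% chance each)
mnar = true_salary.mask((true_salary > 70000) & (rng.random(len(true_salary)) < 0.50))

print(f"True mean:            {true_salary.mean():,.0f}")
print(f"Observed mean (MCAR): {mcar.mean():,.0f}")
print(f"Observed mean (MNAR): {mnar.mean():,.0f}")
```

Under MCAR the observed mean should stay close to the true mean, while under MNAR it is pulled downward because the largest values are the ones that disappear.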

### 🎯 Key Question for Stakeholders
> "Is there a pattern to why this data is missing? Could the missingness itself tell us something?"

In [None]:
# Create a realistic HR dataset with different types of missing data
np.random.seed(42)
n = 200

# Base data
employee_data = pd.DataFrame({
    'employee_id': range(1001, 1001 + n),
    'name': [f'Employee_{i}' for i in range(n)],
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Marketing', 'Finance'], n),
    # Columns that receive NaN below are stored as floats, so no dtype upcast is needed later
    'age': np.random.randint(22, 60, n).astype(float),
    'salary': np.trunc(np.random.normal(55000, 15000, n)),
    'years_experience': np.random.randint(0, 25, n),
    'performance_score': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.40, 0.30, 0.10]).astype(float),
    'email': [f'emp_{i}@company.com' for i in range(n)],
    'hire_date': pd.date_range('2015-01-01', periods=n, freq='W').tolist()[:n]
})

# Introduce MCAR: Random technical errors (5% of ages missing randomly)
mcar_mask = np.random.random(n) < 0.05
employee_data.loc[mcar_mask, 'age'] = np.nan

# Introduce MAR: Newer employees don't have performance scores yet
mar_mask = employee_data['years_experience'] < 1
employee_data.loc[mar_mask, 'performance_score'] = np.nan

# Introduce MNAR: High earners tend to not disclose salary
mnar_mask = (employee_data['salary'] > 70000) & (np.random.random(n) < 0.4)
employee_data.loc[mnar_mask, 'salary'] = np.nan

# Additional missing: Some emails are missing
email_mask = np.random.random(n) < 0.08
employee_data.loc[email_mask, 'email'] = np.nan

print("Employee Dataset Created")
print(f"Shape: {employee_data.shape}")
employee_data.head(10)

---
## 2. Identifying Missing Data

Before deciding what to do, you need to **understand the extent and pattern** of missing data.

In [None]:
# Quick overview of missing data
def missing_data_summary(df):
    """
    Generate a comprehensive missing data report.
    This is something you'll use in almost every project!
    """
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    
    summary = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct,
        'Data Type': df.dtypes
    })
    
    # Only show columns with missing values
    summary = summary[summary['Missing Count'] > 0].sort_values('Missing %', ascending=False)
    
    return summary

print("=== Missing Data Summary ===")
print(missing_data_summary(employee_data))
print(f"\nTotal rows: {len(employee_data)}")
print(f"Rows with any missing: {employee_data.isnull().any(axis=1).sum()}")
print(f"Complete rows: {len(employee_data.dropna())}")

In [None]:
# Visualize missing data patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of missing percentages
missing_pct = (employee_data.isnull().sum() / len(employee_data) * 100)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=True)

axes[0].barh(missing_pct.index, missing_pct.values, color='coral')
axes[0].set_xlabel('Missing %')
axes[0].set_title('Missing Data by Column')
for i, v in enumerate(missing_pct.values):
    axes[0].text(v + 0.5, i, f'{v:.1f}%', va='center')

# Heatmap of missing data (sample)
sample = employee_data.head(50)
sns.heatmap(sample.isnull(), cbar=True, yticklabels=False, cmap='YlOrRd', ax=axes[1])
axes[1].set_title('Missing Data Pattern (First 50 Rows)')

plt.tight_layout()
plt.show()

### 🔍 Investigating Missing Data Patterns

Let's check if there are relationships between missing values and other variables:

In [None]:
# Is salary missingness related to department?
print("=== Salary Missing by Department ===")
salary_missing_by_dept = employee_data.groupby('department')['salary'].apply(
    lambda x: x.isnull().sum() / len(x) * 100
).round(1)
print(salary_missing_by_dept)

# Is performance_score missingness related to experience?
print("\n=== Performance Score Missing by Experience ===")
employee_data['exp_group'] = pd.cut(employee_data['years_experience'],
                                     bins=[0, 1, 5, 10, 25],
                                     right=False,  # left-closed bins, so 0 years falls in '<1 year'
                                     labels=['<1 year', '1-4 years', '5-9 years', '10+ years'])
perf_missing_by_exp = employee_data.groupby('exp_group', observed=False)['performance_score'].apply(
    lambda x: x.isnull().sum() / len(x) * 100
).round(1)
print(perf_missing_by_exp)

# Clean up temporary column
employee_data.drop('exp_group', axis=1, inplace=True)
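
Another way to look for a pattern is to split the rows by whether a value is missing and compare the other columns across the two groups. The sketch below uses a small synthetic frame (column names mirror the HR dataset, but the data and seed are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
n = 500
df = pd.DataFrame({
    "years_experience": rng.integers(0, 25, n),
    "age": rng.integers(22, 60, n).astype(float),
    "performance_score": rng.integers(1, 6, n).astype(float),
})
# Scores are missing only for employees with under 1 year of experience (MAR)
df.loc[df["years_experience"] < 1, "performance_score"] = np.nan

# Label each row by whether its score is missing, then compare the other columns
status = pd.Series(
    np.where(df["performance_score"].isnull(), "missing", "present"),
    index=df.index,
    name="score_status",
)
comparison = df.groupby(status)[["years_experience", "age"]].mean().round(1)
print(comparison)
```

A large gap in an observed column (here `years_experience`) between the two groups suggests the missingness depends on that column, which points toward MAR rather than MCAR.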

---
## 3. Strategies for Handling Missing Data

### Strategy Decision Framework

```
Is missing data < 5% of column?
├── YES → Consider dropping rows OR simple imputation
└── NO → Is there a pattern (MAR)?
    ├── YES → Use group-based imputation
    └── NO → Is it MNAR?
        ├── YES → Consult domain expert, consider special category
        └── NO (MCAR) → Mean/median imputation is acceptable
```
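
The framework above can also be written as a small helper. This is only a sketch: `suggest_strategy` is a hypothetical function name, and the mechanism label still has to come from your own investigation of the data.

```python
def suggest_strategy(missing_pct, mechanism):
    """Return a suggested approach, following the decision framework.

    missing_pct: percentage of missing values in the column (0-100)
    mechanism:   the analyst's judgment, one of "MCAR", "MAR", "MNAR"
    """
    if mechanism not in {"MCAR", "MAR", "MNAR"}:
        raise ValueError(f"Unknown mechanism: {mechanism!r}")
    if missing_pct < 5:
        return "Consider dropping rows or simple imputation"
    if mechanism == "MAR":
        return "Use group-based imputation"
    if mechanism == "MNAR":
        return "Consult domain expert, consider special category"
    return "Mean/median imputation is acceptable"


print(suggest_strategy(3.0, "MCAR"))
print(suggest_strategy(12.0, "MAR"))
print(suggest_strategy(20.0, "MNAR"))
```

Treat the returned text as a starting point for discussion, not as an automatic rule.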

### 3.1 Dropping Missing Data

**When to use:** 
- Missing percentage is very small (< 5%)
- You have plenty of data
- Rows are missing completely at random

In [None]:
# Make a copy to preserve original
df_work = employee_data.copy()

# Example: Drop rows where 'age' is missing (MCAR, ~5%)
print(f"Before dropping age nulls: {len(df_work)} rows")

df_age_dropped = df_work.dropna(subset=['age'])
print(f"After dropping age nulls: {len(df_age_dropped)} rows")
print(f"Rows lost: {len(df_work) - len(df_age_dropped)} ({(len(df_work) - len(df_age_dropped))/len(df_work)*100:.1f}%)")
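
`dropna()` has a few options that control how aggressive the drop is. The sketch below uses a tiny made-up frame to compare dropping on any missing value, on one column only, and by a minimum count of non-missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, np.nan, 31],
    "salary": [50000, 62000, np.nan, np.nan, 58000],
    "email":  ["a@x.com", "b@x.com", "c@x.com", None, None],
})

drop_any = df.dropna()                # drops every row with at least one missing value
drop_age = df.dropna(subset=["age"])  # drops only rows where 'age' is missing
keep_2 = df.dropna(thresh=2)          # keeps rows with at least 2 non-missing values

print(f"Original: {len(df)} rows")
print(f"dropna(): {len(drop_any)} rows")
print(f"dropna(subset=['age']): {len(drop_age)} rows")
print(f"dropna(thresh=2): {len(keep_2)} rows")
```

Restricting the drop to the columns your analysis actually needs usually keeps far more rows than a blanket `dropna()`.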

### 3.2 Simple Imputation (Mean/Median/Mode)

**When to use:**
- Data is MCAR
- You need to preserve sample size
- Distribution is roughly symmetric (use the mean) or skewed (use the median, which is less sensitive to outliers)

In [None]:
df_work = employee_data.copy()

# Check distribution before deciding mean vs median
print("Age distribution (non-null values):")
print(f"  Mean: {df_work['age'].mean():.1f}")
print(f"  Median: {df_work['age'].median():.1f}")
print(f"  Skewness: {df_work['age'].skew():.2f}")

# Since age is roughly symmetric, mean is fine
age_mean = df_work['age'].mean()
df_work['age_imputed'] = df_work['age'].fillna(age_mean)

print(f"\nImputed {df_work['age'].isnull().sum()} missing ages with mean: {age_mean:.1f}")

# For categorical data, use mode
# Example with a different dataset scenario
print("\n⚠️ Note: For categorical data, use mode (most frequent value)")
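
For the categorical case mentioned in the note, a minimal sketch of mode imputation could look like this (the `department` values are made up for illustration):

```python
import numpy as np
import pandas as pd

dept = pd.Series(
    ["Sales", "IT", np.nan, "Sales", "HR", np.nan, "Sales"],
    name="department",
)

# mode() returns a Series because ties are possible, so take the first entry
most_frequent = dept.mode().iloc[0]
dept_imputed = dept.fillna(most_frequent)

print(f"Most frequent value: {most_frequent}")
print(dept_imputed.value_counts())
```

Keep in mind that mode imputation inflates the share of the most common category, so it is best reserved for small amounts of missing data.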

### 3.3 Group-Based Imputation (For MAR Data)

**When to use:**
- Missing data depends on another variable
- Groups have different distributions

This is usually more accurate than simple imputation when the groups genuinely differ.

In [None]:
df_work = employee_data.copy()

# Performance score is MAR: it is missing only for employees with < 1 year of experience.
# Nobody in that experience group has a score to borrow from, so we impute
# from each employee's department median instead.

print("Performance Score by Department (before imputation):")
print(df_work.groupby('department')['performance_score'].agg(['median', 'count', 
                                                               lambda x: x.isnull().sum()]))

# Group-based imputation using transform
df_work['performance_imputed'] = df_work.groupby('department')['performance_score'].transform(
    lambda x: x.fillna(x.median())
)

print("\n✓ Imputed missing performance scores with department median")
print(f"Missing before: {df_work['performance_score'].isnull().sum()}")
print(f"Missing after: {df_work['performance_imputed'].isnull().sum()}")
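
Group-based imputation leaves gaps when an entire group has no observed values. One possible way to handle that, sketched on made-up data, is to fall back to the overall median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "IT", "IT", "HR"],
    "performance_score": [3.0, np.nan, 5.0, 2.0, np.nan, np.nan],
})

# Median per department, aligned row by row with the original frame
group_median = df.groupby("department")["performance_score"].transform("median")
overall_median = df["performance_score"].median()

df["performance_imputed"] = (
    df["performance_score"]
    .fillna(group_median)    # first choice: median of the same department
    .fillna(overall_median)  # fallback: overall median when a group has no values
)
print(df)
```

Here the HR row has no departmental scores to draw on, so it receives the overall median instead of staying missing.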

### 3.4 Creating a "Missing" Category

**When to use:**
- The missingness itself might be informative (MNAR)
- You want to preserve the information that data was missing

In [None]:
df_work = employee_data.copy()

# For salary (MNAR - high earners don't report), 
# create a flag to track what was missing
df_work['salary_was_missing'] = df_work['salary'].isnull().astype(int)

# Then impute with median (under MNAR this likely understates the true salaries, so keep the flag)
salary_median = df_work['salary'].median()
df_work['salary_imputed'] = df_work['salary'].fillna(salary_median)

print("Salary Imputation with Missing Flag:")
print(df_work[['employee_id', 'salary', 'salary_was_missing', 'salary_imputed']].head(20))

print(f"\n⚠️ Note: The 'salary_was_missing' flag preserves important information!")
print(f"Records where salary was missing: {df_work['salary_was_missing'].sum()}")
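
For a categorical column, the "missing" category can be made literal by filling gaps with an explicit label. A minimal sketch, assuming a hypothetical `income_band` column and the label `"Not disclosed"`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income_band": ["low", np.nan, "high", np.nan, "medium"],
})

# Keep the original column and add a filled version with an explicit label
df["income_band_filled"] = df["income_band"].fillna("Not disclosed")

print(df["income_band_filled"].value_counts())
```

This keeps non-disclosure visible as its own group in charts and group-by summaries instead of blending it into an existing category.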

### 3.5 Forward Fill and Backward Fill (Time Series)

**When to use:**
- Time series or sequential data
- Values are expected to be similar to adjacent values

In [None]:
# Create a time series example
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts_data = pd.DataFrame({
    'date': dates,
    'temperature': [22, 23, np.nan, np.nan, 25, 24, np.nan, 26, 27, 28]
})

print("Original Time Series:")
print(ts_data)

# Forward fill - use previous value
ts_data['temp_ffill'] = ts_data['temperature'].ffill()

# Backward fill - use next value
ts_data['temp_bfill'] = ts_data['temperature'].bfill()

# Interpolation - linear interpolation
ts_data['temp_interpolate'] = ts_data['temperature'].interpolate()

print("\nWith Different Fill Methods:")
print(ts_data)
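
Forward fill will happily carry a value across a long gap, which may not be what you want. The `limit` argument caps how many consecutive missing values get filled; the sketch below uses made-up readings:

```python
import numpy as np
import pandas as pd

temperature = pd.Series(
    [22, 23, np.nan, np.nan, np.nan, 25],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Fill at most one consecutive missing value; longer gaps stay visible
filled = temperature.ffill(limit=1)

print(pd.DataFrame({"original": temperature, "ffill_limit_1": filled}))
```

Leaving the rest of a long gap as missing makes it obvious that the data for those days was never observed.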

---
## 4. Documenting Your Decisions

### 💡 Communication Skill: The Missing Data Report

As a Data Analyst, you need to communicate your decisions to stakeholders. Here's a template:

In [None]:
def generate_missing_data_report(df, decisions):
    """
    Generate a professional report documenting missing data handling.
    
    Parameters:
    - df: Original DataFrame
    - decisions: Dict with column names and decisions made
    """
    report = []
    report.append("=" * 60)
    report.append("MISSING DATA HANDLING REPORT")
    report.append("=" * 60)
    report.append(f"\nDataset: {len(df)} rows × {len(df.columns)} columns")
    report.append(f"Total missing values: {df.isnull().sum().sum()}")
    report.append(f"Rows with any missing: {df.isnull().any(axis=1).sum()}")
    
    report.append("\n" + "-" * 60)
    report.append("DECISIONS BY COLUMN")
    report.append("-" * 60)
    
    for col, decision in decisions.items():
        missing_count = df[col].isnull().sum()
        missing_pct = missing_count / len(df) * 100
        report.append(f"\n📊 {col}")
        report.append(f"   Missing: {missing_count} ({missing_pct:.1f}%)")
        report.append(f"   Type: {decision['type']}")
        report.append(f"   Action: {decision['action']}")
        report.append(f"   Rationale: {decision['rationale']}")
    
    return "\n".join(report)

# Example usage
decisions = {
    'age': {
        'type': 'MCAR',
        'action': f"Imputed with mean ({employee_data['age'].mean():.1f} years)",
        'rationale': 'Random missing pattern, small percentage (5%), symmetric distribution'
    },
    'salary': {
        'type': 'MNAR',
        'action': 'Imputed with median + created missing flag',
        'rationale': 'High earners less likely to report; flag preserves this information'
    },
    'performance_score': {
        'type': 'MAR',
        'action': 'Imputed with department median',
        'rationale': 'Missing only for new employees; department context more relevant'
    },
    'email': {
        'type': 'MCAR',
        'action': 'Kept as missing (not used in analysis)',
        'rationale': 'Not required for current analysis; can be collected later'
    }
}

print(generate_missing_data_report(employee_data, decisions))
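
Before sharing a report, it can help to check that every column with missing values has a documented decision. The helper below is a sketch; `undocumented_columns` is a hypothetical name and the sample frame is made up:

```python
import numpy as np
import pandas as pd


def undocumented_columns(df, decisions):
    """Return columns that contain missing values but have no entry in `decisions`."""
    with_missing = df.columns[df.isnull().any()]
    return [col for col in with_missing if col not in decisions]


df = pd.DataFrame({
    "age": [30, np.nan],
    "salary": [np.nan, 50000],
    "email": ["a@x.com", "b@x.com"],
})
decisions = {
    "age": {"type": "MCAR", "action": "Imputed with mean", "rationale": "Random pattern"},
}

print(undocumented_columns(df, decisions))
```

An empty list means every column with gaps is covered by the decisions dictionary.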

---
## 5. Practical Exercises

### Exercise 1: Analyze Missing Data

Using the customer dataset below, identify the missing data patterns.

In [None]:
# Customer dataset with missing values
np.random.seed(123)
n = 150

customers = pd.DataFrame({
    'customer_id': range(1, n + 1),
    'age': np.random.randint(18, 70, n).astype(float),  # float so NaN can be assigned below
    'income': np.random.normal(50000, 20000, n),
    'gender': np.random.choice(['M', 'F', 'Other'], n),
    'loyalty_score': np.random.randint(1, 100, n).astype(float),
    'last_purchase_days': np.random.randint(1, 365, n)
})

# Introduce missing values
customers.loc[np.random.choice(n, 10, replace=False), 'age'] = np.nan
customers.loc[customers['income'] > 70000, 'income'] = np.where(
    np.random.random(len(customers[customers['income'] > 70000])) < 0.5,
    np.nan, customers.loc[customers['income'] > 70000, 'income']
)
customers.loc[customers['last_purchase_days'] > 300, 'loyalty_score'] = np.nan

print("Customer Dataset:")
print(customers.head(10))

In [None]:
# TODO: Generate a missing data summary
# Hint: Use the missing_data_summary() function we created


In [None]:
# TODO: Investigate if income missingness is related to any other variable
# Is it MCAR, MAR, or MNAR?


In [None]:
# TODO: Investigate the pattern of missing loyalty_score
# Hint: Look at the relationship with last_purchase_days


### Exercise 2: Apply Imputation Strategies

Based on your analysis in Exercise 1, apply appropriate imputation strategies.

In [None]:
# TODO: Impute 'age' using an appropriate method
# Justify your choice


In [None]:
# TODO: Impute 'income' - consider the pattern you discovered
# Should you create a missing flag?


In [None]:
# TODO: Handle 'loyalty_score' appropriately
# Consider: What does it mean when inactive customers don't have a loyalty score?


### Exercise 3: Create a Missing Data Report

Document your decisions from Exercise 2 in a professional report.

In [None]:
# TODO: Create a decisions dictionary and generate a report
# Use the generate_missing_data_report() function

my_decisions = {
    # Fill in your decisions here
}

# print(generate_missing_data_report(customers, my_decisions))

---
## 6. Key Takeaways

### ✅ Best Practices

1. **Always investigate WHY data is missing** before deciding how to handle it
2. **Document your decisions** - you'll need to explain them to stakeholders
3. **Consider the business context** - what makes sense for your analysis?
4. **Preserve information** - use missing flags when missingness might be informative
5. **Use group-based imputation** when data is MAR

### ⚠️ Common Mistakes to Avoid

1. Dropping all rows with any missing value (loses too much data)
2. Using mean imputation without checking distribution
3. Ignoring MNAR patterns (can bias your analysis)
4. Not documenting your decisions
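
One reason to be careful with mean imputation is that it shrinks the spread of a column: every filled value sits exactly at the mean. A minimal sketch on synthetic data (the seed, distribution, and 30% missing rate are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed
values = pd.Series(rng.normal(50, 10, 1000))

# Remove roughly 30% of values completely at random, then fill with the mean
with_gaps = values.mask(rng.random(len(values)) < 0.30)
mean_filled = with_gaps.fillna(with_gaps.mean())

print(f"Std of observed values: {with_gaps.std():.2f}")
print(f"Std after mean fill:    {mean_filled.std():.2f}")
```

The mean is unchanged, but the standard deviation drops, which can make later estimates look more precise than they really are.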

### 📚 Further Reading

- [Pandas Missing Data Documentation](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [Sklearn Imputation](https://scikit-learn.org/stable/modules/impute.html)