# Week 3A: Data Quality Assessment and Missing Data Strategies

## ISM6251: Machine Learning for Business Applications

---

## Introduction

Welcome to Week 3! This week, we dive deep into one of the most critical aspects of machine learning: **data preparation**. No matter how sophisticated your algorithms are, they cannot compensate for poor-quality data.

In this notebook, we'll explore:
- **Data Quality Assessment**: How to systematically evaluate your data
- **Missing Data Analysis**: Understanding patterns and mechanisms
- **Imputation Strategies**: When and how to fill missing values
- **Decision Frameworks**: Making informed choices about data handling

Remember: *"Data preparation is not just a step in the ML pipeline—it's the foundation upon which all successful models are built."*

## Setup and Imports

Let's start by importing the libraries we'll need throughout this notebook:

In [None]:
# Standard library imports
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Missing data visualization
!pip install missingno -q
import missingno as msno

# Machine learning utilities
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer is still experimental: this enabling import must run before it is imported
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)

print("✓ All libraries imported successfully!")

---

## Part 1: Understanding Data Quality

### 1.1 What Makes Data "Good Quality"?

High-quality data exhibits several key characteristics:

1. **Completeness**: Minimal missing values
2. **Consistency**: Uniform formats and conventions
3. **Accuracy**: Values reflect reality
4. **Uniqueness**: No unwanted duplicates
5. **Timeliness**: Data is current and relevant
6. **Validity**: Values conform to defined rules
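
Several of these characteristics can be measured directly with pandas. The sketch below uses a tiny hypothetical table (`toy`, with made-up columns and an assumed plausible age range of 0 to 120) to compute one simple metric each for completeness, uniqueness, consistency, and validity. Accuracy and timeliness usually need an external reference, so they are left out here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the course dataset
toy = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "age": [25, np.nan, np.nan, 150],
    "city": ["Chicago", "chicago ", "chicago ", "Houston"],
})

# Completeness: share of non-missing cells per column
completeness = toy.notna().mean()

# Uniqueness: number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Consistency: distinct city spellings before and after normalization
raw_cities = toy["city"].nunique()
clean_cities = toy["city"].str.strip().str.lower().nunique()

# Validity: observed ages outside the assumed range of 0 to 120
n_invalid_age = int((~toy["age"].between(0, 120) & toy["age"].notna()).sum())

print(completeness)
print(n_duplicates, raw_cities, clean_cities, n_invalid_age)
```

A gap between `raw_cities` and `clean_cities` is a quick signal that the same category is spelled in more than one way.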

Let's create a realistic dataset with various quality issues to explore:

In [None]:
# Create a realistic e-commerce dataset with quality issues
np.random.seed(42)

n_customers = 1000

# Generate base data
customer_data = {
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(18, 80, n_customers),
    'income': np.random.normal(50000, 20000, n_customers),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], 
                            n_customers, p=[0.3, 0.25, 0.2, 0.15, 0.1]),
    'signup_date': pd.date_range('2020-01-01', periods=n_customers, freq='6h'),  # 'H' is deprecated in pandas 2.2+
    'total_purchases': np.random.poisson(5, n_customers),
    'avg_order_value': np.random.exponential(75, n_customers),
    'email_domain': np.random.choice(['gmail', 'yahoo', 'outlook', 'company'], 
                                   n_customers, p=[0.4, 0.2, 0.2, 0.2]),
    'customer_segment': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], 
                                        n_customers, p=[0.4, 0.3, 0.2, 0.1])
}

df_original = pd.DataFrame(customer_data)

# Introduce realistic data quality issues
df = df_original.copy()

# 1. Missing values (MCAR - Missing Completely At Random)
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices, 'income'] = np.nan

# 2. Missing values (MAR - Missing At Random, younger people less likely to report income)
young_indices = df[df['age'] < 30].index
missing_young = np.random.choice(young_indices, size=int(0.15 * len(young_indices)), replace=False)
df.loc[missing_young, 'income'] = np.nan

# 3. Missing values (MNAR - Missing Not At Random, high earners don't report)
high_income_indices = df[df['income'] > 80000].index
missing_high = np.random.choice(high_income_indices, size=int(0.1 * len(high_income_indices)), replace=False)
df.loc[missing_high, 'income'] = np.nan

# 4. Introduce outliers
outlier_indices = np.random.choice(df.index, size=5, replace=False)
df.loc[outlier_indices[0], 'age'] = 999  # Data entry error
df.loc[outlier_indices[1], 'income'] = -5000  # Negative income
df.loc[outlier_indices[2], 'avg_order_value'] = 10000  # Extreme purchase

# 5. Introduce duplicates
duplicate_indices = np.random.choice(df.index, size=10, replace=False)
df = pd.concat([df, df.loc[duplicate_indices]], ignore_index=True)

# 6. Introduce inconsistencies in city names
inconsistent_indices = np.random.choice(df.index, size=20, replace=False)
df.loc[inconsistent_indices[0:5], 'city'] = 'new york'  # lowercase
df.loc[inconsistent_indices[5:10], 'city'] = 'LA'  # abbreviation
df.loc[inconsistent_indices[10:15], 'city'] = 'Chicago, IL'  # with state
df.loc[inconsistent_indices[15:20], 'city'] = 'Houston '  # trailing space

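# 7. Add some missing values to other columns
# (replace=False guarantees that exactly 30 and 15 distinct rows are affected)
df.loc[np.random.choice(df.index, 30, replace=False), 'customer_segment'] = np.nan
df.loc[np.random.choice(df.index, 15, replace=False), 'email_domain'] = np.nan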

print(f"Dataset created with {len(df)} records and various quality issues")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

### 1.2 Comprehensive Data Quality Assessment

Let's create a comprehensive function to assess data quality:

In [None]:
def comprehensive_data_quality_report(df, target_col=None):
    """
    Generate a comprehensive data quality report
    
    Parameters:
    -----------
    df : pd.DataFrame
        The dataframe to analyze
    target_col : str, optional
        The target column for ML (if applicable)
    
    Returns:
    --------
    dict : Quality report with various metrics
    """
    report = {}
    
    # Basic information
    report['basic_info'] = {
        'n_rows': len(df),
        'n_columns': len(df.columns),
        'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
        'duplicated_rows': df.duplicated().sum(),
        'duplicated_rows_pct': (df.duplicated().sum() / len(df)) * 100
    }
    
    # Missing data analysis
    missing_data = pd.DataFrame({
        'column': df.columns,
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().sum() / len(df)) * 100,
        'dtype': df.dtypes
    })
    missing_data = missing_data[missing_data['missing_count'] > 0].sort_values('missing_pct', ascending=False)
    report['missing_data'] = missing_data
    
    # Data type analysis
    report['dtypes'] = {
        'numeric': list(df.select_dtypes(include=[np.number]).columns),
        'categorical': list(df.select_dtypes(include=['object', 'category']).columns),
        'datetime': list(df.select_dtypes(include=['datetime64']).columns)
    }
    
    # Numeric columns analysis
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    numeric_analysis = []
    
    for col in numeric_cols:
        col_data = df[col].dropna()
        if len(col_data) > 0:
            q1, q3 = col_data.quantile([0.25, 0.75])
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            outliers = col_data[(col_data < lower_bound) | (col_data > upper_bound)]
            
            numeric_analysis.append({
                'column': col,
                'mean': col_data.mean(),
                'median': col_data.median(),
                'std': col_data.std(),
                'min': col_data.min(),
                'max': col_data.max(),
                'outliers_count': len(outliers),
                'outliers_pct': (len(outliers) / len(col_data)) * 100,
                'negative_values': (col_data < 0).sum(),
                'zeros': (col_data == 0).sum()
            })
    
    report['numeric_analysis'] = pd.DataFrame(numeric_analysis)
    
    # Categorical columns analysis
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    categorical_analysis = []
    
    for col in categorical_cols:
        categorical_analysis.append({
            'column': col,
            'unique_values': df[col].nunique(),
            'most_frequent': df[col].mode().iloc[0] if len(df[col].mode()) > 0 else None,
            'most_frequent_pct': (df[col].value_counts().iloc[0] / len(df[col].dropna())) * 100 if len(df[col].value_counts()) > 0 else 0
        })
    
    report['categorical_analysis'] = pd.DataFrame(categorical_analysis)
    
    # Data quality score (0-100)
    quality_score = 100
    quality_score -= report['basic_info']['duplicated_rows_pct'] * 2  # Penalize duplicates
    quality_score -= missing_data['missing_pct'].mean() if len(missing_data) > 0 else 0  # Penalize missing data
    quality_score = max(0, quality_score)  # Ensure non-negative
    
    report['quality_score'] = quality_score
    
    return report

# Generate the quality report
quality_report = comprehensive_data_quality_report(df)

print("="*60)
print("DATA QUALITY REPORT")
print("="*60)
print(f"\n📊 Basic Information:")
for key, value in quality_report['basic_info'].items():
    print(f"  {key}: {value:.2f}" if isinstance(value, float) else f"  {key}: {value}")

print(f"\n❌ Missing Data:")
print(quality_report['missing_data'])

print(f"\n📈 Numeric Columns Analysis:")
print(quality_report['numeric_analysis'])

print(f"\n📝 Categorical Columns Analysis:")
print(quality_report['categorical_analysis'])

print(f"\n⭐ Overall Data Quality Score: {quality_report['quality_score']:.1f}/100")
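
The outlier counts in the report come from the 1.5 x IQR rule. As a small worked example of that rule, the sketch below applies it to a handful of hypothetical ages, where 999 mimics a data entry error.

```python
import pandas as pd

# Hypothetical values; 999 mimics a data entry error
ages = pd.Series([22, 25, 31, 38, 44, 52, 999])

# Quartiles and interquartile range
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: [{lower}, {upper}]")
print(f"Flagged: {outliers.tolist()}")
```

Because the bounds are built from quartiles rather than the mean and standard deviation, a single extreme value such as 999 barely moves them.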

### 1.3 Visualizing Data Quality Issues

Visualization helps us understand patterns in data quality issues:

In [None]:
# Create comprehensive visualizations
fig = plt.figure(figsize=(20, 12))

# 1. Missing data matrix
ax1 = plt.subplot(2, 3, 1)
msno.matrix(df, ax=ax1, sparkline=False)
ax1.set_title('Missing Data Pattern Matrix', fontsize=14, fontweight='bold')

# 2. Missing data bar chart
ax2 = plt.subplot(2, 3, 2)
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_counts = missing_counts[missing_counts > 0]
ax2.bar(range(len(missing_counts)), missing_counts.values)
ax2.set_xticks(range(len(missing_counts)))
ax2.set_xticklabels(missing_counts.index, rotation=45, ha='right')
ax2.set_ylabel('Number of Missing Values')
ax2.set_title('Missing Values by Column', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)

# 3. Missing data heatmap
ax3 = plt.subplot(2, 3, 3)
msno.heatmap(df, ax=ax3)
ax3.set_title('Missing Data Correlation Heatmap', fontsize=14, fontweight='bold')

# 4. Distribution of numeric columns with outliers
ax4 = plt.subplot(2, 3, 4)
numeric_cols = df.select_dtypes(include=[np.number]).columns[:4]  # First 4 numeric columns
for i, col in enumerate(numeric_cols):
    data = df[col].dropna()
    ax4.boxplot(data, positions=[i], widths=0.6, patch_artist=True,
                boxprops=dict(facecolor=f'C{i}', alpha=0.5))
ax4.set_xticks(range(len(numeric_cols)))
ax4.set_xticklabels(numeric_cols, rotation=45, ha='right')
ax4.set_title('Outlier Detection (Boxplots)', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)

# 5. Data completeness by row
ax5 = plt.subplot(2, 3, 5)
row_completeness = (df.notna().sum(axis=1) / len(df.columns)) * 100
ax5.hist(row_completeness, bins=20, edgecolor='black', alpha=0.7)
ax5.set_xlabel('Row Completeness (%)')
ax5.set_ylabel('Number of Rows')
ax5.set_title('Distribution of Row Completeness', fontsize=14, fontweight='bold')
ax5.axvline(x=row_completeness.mean(), color='red', linestyle='--', label=f'Mean: {row_completeness.mean():.1f}%')
ax5.legend()
ax5.grid(True, alpha=0.3)

# 6. Duplicate analysis
ax6 = plt.subplot(2, 3, 6)
# Reindex so the counts always line up with the labels, even if one class is absent
duplicate_counts = (df.duplicated(keep=False)
                    .value_counts()
                    .reindex([False, True], fill_value=0))
colors = ['green', 'red']
labels = ['Unique', 'Duplicate']
ax6.pie(duplicate_counts.values, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
ax6.set_title('Duplicate Records Analysis', fontsize=14, fontweight='bold')

plt.suptitle('Comprehensive Data Quality Visualization', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
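
The missing data heatmap is, to our understanding, built on nullity correlation: how strongly missingness in one column goes together with missingness in another. If you want the numbers behind such a plot, a minimal sketch with plain pandas looks like this (the table `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical table: 'b' is missing wherever 'a' is missing, plus one extra row
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, np.nan, 1.0, np.nan, 2.0, 3.0],
    "c": [1, 2, 3, 4, 5, 6],  # fully observed
})

# 1 where a value is missing, 0 where it is present
nullity = toy.isnull().astype(int)

# Drop columns whose indicator never varies (fully observed or fully missing),
# since a constant column has no defined correlation
nullity = nullity.loc[:, nullity.nunique() > 1]

nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

Values near 1 mean two columns tend to be missing together, and values near -1 mean one tends to be present when the other is missing.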

### 🏋️ Practice Exercise 1: Data Quality Assessment

Now it's your turn! Complete the following data quality assessment tasks:

In [None]:
# EXERCISE 1: Identify specific data quality issues

# TODO: Find all customers with age > 100 (likely data entry errors)
age_errors = df[___]  # Fill in the condition

# TODO: Find all records with negative income
negative_income = df[___]  # Fill in the condition

# TODO: Find all city names that might be inconsistent (hint: use str.lower() and value_counts())
city_variations = df['city'].___  # Complete the analysis

# TODO: Calculate the percentage of rows that have at least one missing value
rows_with_missing = ___  # Calculate this percentage

print(f"Customers with age > 100: {len(age_errors)}")
print(f"Records with negative income: {len(negative_income)}")
print(f"\nCity name variations:")
print(city_variations)
print(f"\nPercentage of rows with missing data: {rows_with_missing:.1f}%")

In [None]:
# SOLUTION
age_errors = df[df['age'] > 100]
negative_income = df[df['income'] < 0]
city_variations = df['city'].str.lower().value_counts()
rows_with_missing = (df.isnull().any(axis=1).sum() / len(df)) * 100

print(f"Customers with age > 100: {len(age_errors)}")
print(f"Records with negative income: {len(negative_income)}")
print(f"\nCity name variations:")
print(city_variations.head(10))
print(f"\nPercentage of rows with missing data: {rows_with_missing:.1f}%")
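
Once the city variations are visible, the next step is to map them to one canonical spelling. Here is a minimal sketch, assuming a hand-written alias table (`CITY_ALIASES` and `normalize_city` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Sample values mirroring the inconsistencies introduced earlier
cities = pd.Series(["New York", "new york", "LA", "Chicago, IL", "Houston ", "Phoenix"])

# Hypothetical alias table: normalized spelling -> canonical name
CITY_ALIASES = {
    "la": "Los Angeles",
    "chicago, il": "Chicago",
}

def normalize_city(series: pd.Series) -> pd.Series:
    """Trim whitespace, unify case, then map known aliases to canonical names."""
    cleaned = series.str.strip().str.lower()
    canonical = cleaned.map(CITY_ALIASES)
    # Fall back to title case where no alias is defined
    return canonical.fillna(cleaned.str.title())

print(normalize_city(cities).tolist())
```

Missing city values stay missing under this approach, so the cleanup does not silently invent categories.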

---

## Part 2: Missing Data Strategies

### 2.1 Understanding Missing Data Mechanisms

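Before deciding how to handle missing data, we need to understand *why* it's missing. Statisticians distinguish three mechanisms: **MCAR** (missingness is unrelated to any variable), **MAR** (missingness depends only on other observed variables, such as age), and **MNAR** (missingness depends on the unobserved value itself, such as high earners withholding their income). The cell below checks whether missing income varies by age group: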

In [None]:
# Analyze missing data patterns by age group
def analyze_missing_patterns(df, column, group_by):
    """
    Analyze missing data patterns for a column grouped by another variable
    """
    analysis = df.groupby(group_by, observed=True).agg({
        column: lambda x: x.isnull().sum(),
        'customer_id': 'count'
    }).rename(columns={column: 'missing_count', 'customer_id': 'total_count'})
    
    analysis['missing_pct'] = (analysis['missing_count'] / analysis['total_count']) * 100
    
    return analysis

# Create age groups (right=False makes the first bin [0, 30), so it matches its '<30' label)
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 50, 100, 1000], right=False,
                         labels=['<30', '30-49', '50-99', '100+'])

# Analyze missing income by age group
missing_by_age = analyze_missing_patterns(df, 'income', 'age_group')
print("Missing Income by Age Group:")
print(missing_by_age)
print()

# Visualize the pattern
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Plot 1: Missing percentage by age group
axes[0].bar(missing_by_age.index.astype(str), missing_by_age['missing_pct'])
axes[0].set_xlabel('Age Group')
axes[0].set_ylabel('Missing Income (%)')
axes[0].set_title('Missing Income by Age Group\n(Evidence of MAR - younger people report less)')
axes[0].grid(True, alpha=0.3)

# Plot 2: Income distribution for those who reported
for age_grp in df['age_group'].unique():
    if pd.notna(age_grp):
        income_data = df[df['age_group'] == age_grp]['income'].dropna()
        if len(income_data) > 0:
            axes[1].hist(income_data, alpha=0.5, label=str(age_grp), bins=20)

axes[1].set_xlabel('Income')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Income Distribution by Age Group\n(For non-missing values)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Insight: Missingness that varies with age rules out MCAR and is consistent with MAR "
      "(younger people are less likely to report income). Observed data alone cannot rule out MNAR.")

### 2.2 Decision Framework: Drop vs Impute

Let's implement a systematic decision framework:

In [None]:
def missing_data_decision_framework(df, column):
    """
    Recommend whether to drop or impute based on missing data characteristics
    """
    total_rows = len(df)
    missing_count = df[column].isnull().sum()
    missing_pct = (missing_count / total_rows) * 100
    
    # Decision rules
    decision = {
        'column': column,
        'missing_count': missing_count,
        'missing_pct': missing_pct,
        'recommendation': '',
        'reasoning': []
    }
    
    # Apply decision rules
    if missing_pct > 50:
        decision['recommendation'] = 'DROP COLUMN'
        decision['reasoning'].append(f'More than 50% missing ({missing_pct:.1f}%)')
    elif missing_pct > 30:
        decision['recommendation'] = 'IMPUTE (CAREFULLY)'
        decision['reasoning'].append(f'30-50% missing - significant but manageable')
        decision['reasoning'].append('Consider advanced imputation methods')
    elif missing_pct > 10:
        decision['recommendation'] = 'IMPUTE'
        decision['reasoning'].append(f'10-30% missing - standard imputation appropriate')
    elif missing_pct > 5:
        decision['recommendation'] = 'IMPUTE or DROP ROWS'
        decision['reasoning'].append(f'5-10% missing - either approach viable')
        decision['reasoning'].append(f'Consider dataset size ({total_rows} rows)')
    else:
        decision['recommendation'] = 'DROP ROWS'
        decision['reasoning'].append(f'Less than 5% missing ({missing_pct:.1f}%)')
        decision['reasoning'].append(f'Minimal data loss ({missing_count} rows)')
    
    # Additional considerations
    if df[column].dtype in ['object', 'category']:
        unique_values = df[column].nunique()
        if unique_values > 50:
            decision['reasoning'].append(f'High cardinality ({unique_values} unique values)')
            if decision['recommendation'] != 'DROP COLUMN':
                decision['recommendation'] += ' (or DROP COLUMN)'
    
    return decision

# Apply framework to all columns with missing data
print("="*70)
print("MISSING DATA DECISION FRAMEWORK")
print("="*70)

columns_with_missing = df.columns[df.isnull().any()].tolist()

for col in columns_with_missing:
    decision = missing_data_decision_framework(df, col)
    print(f"\n📌 Column: {decision['column']}")
    print(f"   Missing: {decision['missing_count']} ({decision['missing_pct']:.1f}%)")
    print(f"   ✅ Recommendation: {decision['recommendation']}")
    print(f"   Reasoning:")
    for reason in decision['reasoning']:
        print(f"      • {reason}")
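
Dropping and imputing are not the only options. When missingness itself may carry information, a common third choice is to impute and also keep a flag for which rows were filled. A minimal sketch using scikit-learn's `add_indicator` option follows; the table and the column name `income_was_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical income column with two missing entries
toy = pd.DataFrame({"income": [40000.0, np.nan, 60000.0, np.nan, 80000.0]})

# add_indicator=True appends a 0/1 column marking the originally missing rows
imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(toy[["income"]])

toy_imputed = pd.DataFrame(
    result,
    columns=["income", "income_was_missing"],
    index=toy.index,
)
print(toy_imputed)
```

A downstream model can then learn from the flag if, for example, customers who withhold income behave differently from those who report it.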

### 2.3 Imputation Techniques Comparison

Let's compare different imputation methods and their effectiveness:

In [None]:
# Prepare data for imputation comparison
# We'll use income as our target variable since it has missing values

# Create a version where we know the true values
df_complete = df[df['income'].notna()].copy()

# Artificially create missing values for testing
test_missing_indices = np.random.choice(df_complete.index, 
                                       size=int(0.2 * len(df_complete)), 
                                       replace=False)
df_test = df_complete.copy()
df_test.loc[test_missing_indices, 'income'] = np.nan

# Store true values for comparison
true_values = df_complete.loc[test_missing_indices, 'income']

# Prepare features for imputation (numeric only for this example)
feature_cols = ['age', 'total_purchases', 'avg_order_value']
X = df_test[feature_cols + ['income']].copy()

# Dictionary to store results
imputation_results = {}

# Method 1: Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
X_mean = X.copy()
X_mean['income'] = mean_imputer.fit_transform(X[['income']])
imputation_results['Mean'] = X_mean.loc[test_missing_indices, 'income']

# Method 2: Median imputation
median_imputer = SimpleImputer(strategy='median')
X_median = X.copy()
X_median['income'] = median_imputer.fit_transform(X[['income']])
imputation_results['Median'] = X_median.loc[test_missing_indices, 'income']

# Method 3: KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_knn = pd.DataFrame(
    knn_imputer.fit_transform(X),
    columns=X.columns,
    index=X.index
)
imputation_results['KNN'] = X_knn.loc[test_missing_indices, 'income']

# Method 4: Iterative imputation (MICE-like)
iterative_imputer = IterativeImputer(random_state=42, max_iter=10)
X_iterative = pd.DataFrame(
    iterative_imputer.fit_transform(X),
    columns=X.columns,
    index=X.index
)
imputation_results['Iterative'] = X_iterative.loc[test_missing_indices, 'income']

# Calculate errors
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

comparison_results = []
for method, imputed_values in imputation_results.items():
    mae = mean_absolute_error(true_values, imputed_values)
    rmse = np.sqrt(mean_squared_error(true_values, imputed_values))
    r2 = r2_score(true_values, imputed_values)
    
    comparison_results.append({
        'Method': method,
        'MAE': mae,
        'RMSE': rmse,
        'R²': r2
    })

comparison_df = pd.DataFrame(comparison_results).sort_values('MAE')

print("Imputation Methods Comparison:")
print("="*50)
print(comparison_df.to_string(index=False))
print("\n📊 Lower MAE and RMSE is better, higher R² is better")

# Visualize the comparison on a 2x3 grid:
# one MAE bar chart, one scatter plot per method, and one distribution plot
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

# Plot 1: MAE comparison
axes[0].bar(comparison_df['Method'], comparison_df['MAE'])
axes[0].set_ylabel('Mean Absolute Error')
axes[0].set_title('MAE by Imputation Method')
axes[0].grid(True, alpha=0.3)

# Plots 2-5: true vs imputed values, one scatter plot for each of the four methods
value_range = [true_values.min(), true_values.max()]
for ax, (method, imputed_values) in zip(axes[1:5], imputation_results.items()):
    ax.scatter(true_values, imputed_values, alpha=0.5, s=20)
    ax.plot(value_range, value_range, 'r--', alpha=0.5)  # Perfect-imputation line
    ax.set_xlabel('True Values')
    ax.set_ylabel('Imputed Values')
    ax.set_title(f'{method} Imputation')
    ax.grid(True, alpha=0.3)

# Plot 6: Distribution comparison
axes[5].hist(true_values, alpha=0.5, label='True', bins=20)
axes[5].hist(imputation_results['KNN'], alpha=0.5, label='KNN Imputed', bins=20)
axes[5].set_xlabel('Income')
axes[5].set_ylabel('Frequency')
axes[5].set_title('Distribution: True vs KNN Imputed')
axes[5].legend()
axes[5].grid(True, alpha=0.3)

plt.suptitle('Imputation Methods Performance Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

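print("\n💡 Key Insight: KNN and iterative imputation can beat mean/median when other features")
print("   are informative about the missing one. Here income was generated independently of the")
print("   other columns, so the simple methods may score as well or better. Check the table above.")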
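
The comparison above fits each imputer on the full table, which is fine for benchmarking imputation error. When the imputed data feeds a predictive model, though, the fill values should be learned from the training split only, so that nothing about the test rows leaks into preprocessing. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one numeric feature with 40 of 200 values removed
toy = pd.DataFrame({"income": rng.normal(50000, 20000, 200)})
toy.loc[rng.choice(toy.index, size=40, replace=False), "income"] = np.nan

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the fill value from the training rows only
imputer = SimpleImputer(strategy="median")
imputer.fit(train[["income"]])

# Reuse that same value for both splits
train_filled = imputer.transform(train[["income"]])
test_filled = imputer.transform(test[["income"]])

print(f"Fill value learned from training data: {imputer.statistics_[0]:.0f}")
```

The same fit-then-transform pattern applies to `KNNImputer` and `IterativeImputer`.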

### 2.4 Practical Imputation Pipeline

Let's create a practical imputation pipeline for our dataset:

In [None]:
class SmartImputer:
    """
    A smart imputation class that handles different data types appropriately
    """
    
    def __init__(self, numeric_strategy='knn', categorical_strategy='most_frequent', 
                 missing_threshold=0.5):
        self.numeric_strategy = numeric_strategy
        self.categorical_strategy = categorical_strategy
        self.missing_threshold = missing_threshold
        self.columns_to_drop = []
        self.numeric_imputers = {}
        self.categorical_imputers = {}
    
    def fit(self, df):
        """
        Fit the imputer on the dataframe
        """
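        # Start from a clean state so that refitting does not accumulate earlier results
        self.columns_to_drop = []
        self.numeric_imputers = {}
        self.categorical_imputers = {}

        # Identify columns to drop (too many missing values)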
        for col in df.columns:
            missing_pct = df[col].isnull().sum() / len(df)
            if missing_pct > self.missing_threshold:
                self.columns_to_drop.append(col)
                print(f"Will drop column '{col}': {missing_pct:.1%} missing")
        
        # Prepare imputers for remaining columns
        remaining_cols = [col for col in df.columns if col not in self.columns_to_drop]
        
        for col in remaining_cols:
            if df[col].isnull().any():
                if df[col].dtype in ['object', 'category']:
                    # Categorical imputation
                    imputer = SimpleImputer(strategy=self.categorical_strategy)
                    imputer.fit(df[[col]])
                    self.categorical_imputers[col] = imputer
                else:
                    # Numeric imputation
                    if self.numeric_strategy == 'knn':
                        # For KNN, we need all numeric columns
                        numeric_cols = df.select_dtypes(include=[np.number]).columns
                        numeric_cols = [c for c in numeric_cols if c not in self.columns_to_drop]
                        if col not in self.numeric_imputers:  # Avoid duplicate KNN imputers
                            imputer = KNNImputer(n_neighbors=5)
                            imputer.fit(df[numeric_cols])
                            for nc in numeric_cols:
                                self.numeric_imputers[nc] = (imputer, numeric_cols)
                    else:
                        imputer = SimpleImputer(strategy=self.numeric_strategy)
                        imputer.fit(df[[col]])
                        self.numeric_imputers[col] = (imputer, [col])
        
        return self
    
    def transform(self, df):
        """
        Transform the dataframe by imputing missing values
        """
        df_imputed = df.copy()
        
        # Drop columns with too many missing values
        df_imputed = df_imputed.drop(columns=self.columns_to_drop, errors='ignore')
        
        # Apply categorical imputers
        for col, imputer in self.categorical_imputers.items():
            if col in df_imputed.columns:
                df_imputed[col] = imputer.transform(df_imputed[[col]]).ravel()
        
        # Apply numeric imputers
        if self.numeric_strategy == 'knn' and self.numeric_imputers:
            # For KNN, impute all numeric columns at once
            imputer, numeric_cols = list(self.numeric_imputers.values())[0]
            numeric_cols = [c for c in numeric_cols if c in df_imputed.columns]
            if numeric_cols:
                df_imputed[numeric_cols] = imputer.transform(df_imputed[numeric_cols])
        else:
            # For other strategies, impute column by column
            for col, (imputer, cols) in self.numeric_imputers.items():
                if col in df_imputed.columns:
                    df_imputed[col] = imputer.transform(df_imputed[[col]]).ravel()
        
        return df_imputed
    
    def fit_transform(self, df):
        """
        Fit and transform in one step
        """
        self.fit(df)
        return self.transform(df)

# Apply the smart imputer to our dataset
print("Applying Smart Imputation Pipeline...\n")

smart_imputer = SmartImputer(
    numeric_strategy='knn',
    categorical_strategy='most_frequent',
    missing_threshold=0.5
)

df_imputed = smart_imputer.fit_transform(df)

print("\n" + "="*50)
print("Imputation Complete!")
print("="*50)

# Compare before and after
print(f"\nBefore imputation:")
print(f"  Shape: {df.shape}")
print(f"  Missing values: {df.isnull().sum().sum()}")

print(f"\nAfter imputation:")
print(f"  Shape: {df_imputed.shape}")
print(f"  Missing values: {df_imputed.isnull().sum().sum()}")

# Show remaining missing values by column
remaining_missing = df_imputed.isnull().sum()
remaining_missing = remaining_missing[remaining_missing > 0]
if len(remaining_missing) > 0:
    print(f"\nRemaining missing values by column:")
    print(remaining_missing)
else:
    print(f"\n✅ All missing values have been handled!")
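
For comparison, scikit-learn's own building blocks can express a similar split between numeric and categorical imputation. The sketch below is one possible alternative to a hand-written class, not a drop-in replacement for `SmartImputer`: it has no column-dropping threshold, and the table and column lists are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical mixed-type table
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0],
    "income": [40000.0, np.nan, 52000.0, 75000.0, 68000.0],
    "segment": ["Gold", "Silver", np.nan, "Silver", "Bronze"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["segment"]

# Route each group of columns to its own imputer
preprocess = ColumnTransformer(transformers=[
    ("num", KNNImputer(n_neighbors=2), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# Output columns follow the transformer order: numeric first, then categorical
toy_imputed = pd.DataFrame(
    preprocess.fit_transform(toy),
    columns=numeric_cols + categorical_cols,
    index=toy.index,
)
# The combined array has object dtype, so restore numeric types explicitly
toy_imputed[numeric_cols] = toy_imputed[numeric_cols].astype(float)

print(toy_imputed)
```

A `ColumnTransformer` can also be placed inside a scikit-learn `Pipeline`, so the imputers are refit on the training rows within each cross-validation fold.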

### 🏋️ Practice Exercise 2: Missing Data Handling

Now practice your missing data handling skills:

In [None]:
# EXERCISE 2: Create your own imputation strategy

# Create a test dataset with specific missing patterns
exercise_data = pd.DataFrame({
    'feature_1': [1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10],
    'feature_2': [10, np.nan, 30, 40, np.nan, 60, 70, np.nan, 90, 100],
    'feature_3': ['A', 'B', np.nan, 'A', 'C', 'B', np.nan, 'A', 'B', 'C'],
    'target': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
})

print("Exercise Dataset:")
print(exercise_data)
print(f"\nMissing values per column:")
print(exercise_data.isnull().sum())

# TODO: Implement the following:
# 1. Impute feature_1 using the median
# 2. Impute feature_2 using forward fill
# 3. Impute feature_3 using the mode
# 4. Calculate the correlation between imputed features and target

# Your code here:
exercise_imputed = exercise_data.copy()

# Impute feature_1
# exercise_imputed['feature_1'] = ...

# Impute feature_2
# exercise_imputed['feature_2'] = ...

# Impute feature_3
# exercise_imputed['feature_3'] = ...

# Calculate correlations
# correlations = ...

print("\nYour imputed dataset:")
# print(exercise_imputed)

In [None]:
# SOLUTION
exercise_imputed = exercise_data.copy()

# Impute feature_1 with median
# (assign the result back: fillna(inplace=True) on a selected column is deprecated in pandas 2.x)
exercise_imputed['feature_1'] = exercise_imputed['feature_1'].fillna(exercise_imputed['feature_1'].median())

# Impute feature_2 with forward fill (ffill() replaces the deprecated fillna(method='ffill'))
exercise_imputed['feature_2'] = exercise_imputed['feature_2'].ffill()

# Impute feature_3 with mode
mode_value = exercise_imputed['feature_3'].mode()[0]
exercise_imputed['feature_3'] = exercise_imputed['feature_3'].fillna(mode_value)

# Calculate correlations with target
# First, encode categorical feature_3
exercise_imputed['feature_3_encoded'] = exercise_imputed['feature_3'].map({'A': 0, 'B': 1, 'C': 2})

correlations = exercise_imputed[['feature_1', 'feature_2', 'feature_3_encoded', 'target']].corr()['target'].drop('target')

print("\nImputed dataset:")
print(exercise_imputed)
print("\nCorrelations with target:")
print(correlations)
print("\n✅ Exercise complete!")

---

## Summary and Key Takeaways

### 📚 What We've Learned

1. **Data Quality Assessment**
   - Systematic evaluation of completeness, consistency, accuracy, and uniqueness
   - Visual and statistical methods to identify quality issues
   - Importance of understanding data before modeling

2. **Missing Data Mechanisms**
   - **MCAR**: Truly random missingness
   - **MAR**: Missingness depends on observed variables
   - **MNAR**: Missingness depends on the missing value itself

3. **Decision Framework**
   - **Drop columns**: >50% missing or irrelevant
   - **Drop rows**: <5% affected and large dataset
   - **Impute**: 10-50% missing in important features (use advanced methods above 30%); for 5-10%, either dropping rows or imputing is viable

4. **Imputation Strategies**
   - Simple methods: Mean, median, mode
   - Advanced methods: KNN, iterative imputation
   - Consider data type and missing mechanism
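
The missing data mechanisms can also be probed with a formal test. One option is a chi-square test of independence between a missingness flag and an observed grouping variable. The sketch below simulates a hypothetical MAR setup, where the under-30 group skips the income question more often, and runs the test with SciPy. The group labels and probabilities are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)

# Hypothetical MAR setup: the '<30' group has a higher chance of missing income
n = 2000
toy = pd.DataFrame({"age_group": rng.choice(["<30", "30+"], size=n)})
p_missing = np.where(toy["age_group"] == "<30", 0.30, 0.05)
toy["income_missing"] = rng.random(n) < p_missing

# Contingency table: age group vs. whether income is missing
table = pd.crosstab(toy["age_group"], toy["income_missing"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

A small p-value is evidence against MCAR. It cannot separate MAR from MNAR, because that distinction depends on values we never observe.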

### 🎯 Best Practices

1. **Always assess data quality first** - understand before acting
2. **Document your decisions** - explain why you chose specific strategies
3. **Validate imputation impact** - check if relationships are preserved
4. **Consider domain knowledge** - statistical rules aren't everything
5. **Create reproducible pipelines** - consistency is key
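
To make "validate imputation impact" concrete, the sketch below compares a feature's spread and its correlation with a second variable before and after mean imputation. The data are simulated under assumptions we chose for illustration (a linear relationship and 30% of values removed completely at random).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x plus noise
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 1, 500)
toy = pd.DataFrame({"x": x, "y": y})

# Remove 30% of x completely at random, then fill with the observed mean
masked = toy.copy()
masked.loc[rng.choice(masked.index, size=150, replace=False), "x"] = np.nan
filled = masked.fillna({"x": masked["x"].mean()})

# Compare spread and relationship strength before and after imputation
summary = pd.DataFrame(
    {
        "std_x": [toy["x"].std(), filled["x"].std()],
        "corr_xy": [toy["x"].corr(toy["y"]), filled["x"].corr(filled["y"])],
    },
    index=["original", "mean_imputed"],
)
print(summary.round(3))
```

Filling every gap with the same value always shrinks the standard deviation relative to the observed values, and it typically weakens correlations too. A table like this is a quick way to see how much a given strategy distorts the data.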

### 🚀 Next Steps

In the next notebook (week03b), we'll explore:
- Advanced filtering techniques
- Outlier detection and treatment
- Data grouping and aggregation
- Feature engineering strategies

Remember: **Quality data preparation is the foundation of successful machine learning!**