# Clean the Housing Dataset

In this lab, you'll learn how to clean a realistic, synthetically generated housing dataset using pandas. You'll practice fixing common issues like missing values, outliers, inconsistent text, and logical errors. Data cleaning is a key step in making sure your analysis is accurate and trustworthy.

## Learning Objectives
By the end of this lab, you will be able to:
- Identify and assess common data quality issues in real-world datasets
- Standardize text fields and categorical variables for consistency
- Handle missing values using appropriate imputation strategies
- Detect and correct outliers and impossible values
- Remove duplicate records effectively
- Validate cleaned data to ensure quality standards
- Document data cleaning processes for reproducibility and transparency

## Why Data Cleaning Matters
Real-world data is almost never clean when you first receive it. Understanding how to systematically identify and fix data quality issues is one of the most valuable skills in data science. Poor data quality can lead to incorrect conclusions, failed models, and bad business decisions.

This lab simulates the kinds of data quality problems you'll encounter in professional settings, where data comes from multiple sources, different systems, and various people with different standards.

## Section 1: Understanding the Messy Dataset

Errors, missing values, and inconsistencies usually come from manual entry, differences between systems, and standards that change over time. In this lab, you'll work with a synthetic housing dataset that includes these typical problems and learn how to clean them step by step.

**Key Insight**: The messiness in this dataset isn't random—it follows patterns you'll see in real business data. Understanding these patterns helps you clean data more efficiently and systematically.
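
One way to see such patterns is to count labels before and after a light normalization. The sketch below uses a small, hypothetical list of labels (not the lab's dataset) and only standard pandas string methods; removing internal spaces is an assumption that suits these labels and may be too aggressive for others.

```python
import pandas as pd

# Hypothetical raw labels, similar in spirit to the neighborhood column built below
raw = pd.Series(["Downtown", "downtown", " Downtown ", "DOWNTOWN", "Down Town",
                 "Riverside", "River Side", "riverside"])

# Light normalization: trim whitespace, lowercase, drop internal spaces
normalized = raw.str.strip().str.lower().str.replace(" ", "", regex=False)

normalized_counts = normalized.value_counts()

print(f"Distinct raw labels: {raw.nunique()}")
print(f"Distinct labels after normalization: {normalized.nunique()}")
print(normalized_counts)
```

Most of the apparent variety collapses into a handful of recurring formatting habits (case, padding, split words), which is why a few rules can clean a large share of the column.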

In [None]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducible results
np.random.seed(42)

print("🏠 Creating a realistic 'messy' housing dataset...")
print("This simulates data that might come from multiple sources with different standards")

In [None]:
# Generate base data with realistic variations
n_properties = 600
property_ids = list(range(1001, 1001 + n_properties))

# Create realistic property characteristics
square_footage = np.random.normal(2000, 500, n_properties).astype(int)
square_footage = np.clip(square_footage, 600, 5000)

bedrooms = np.random.choice([1, 2, 3, 4, 5, 6], n_properties, p=[0.05, 0.25, 0.35, 0.25, 0.08, 0.02])
bathrooms = np.random.normal(2.5, 0.8, n_properties)
bathrooms = np.clip(bathrooms, 1, 5)

ages = np.random.exponential(15, n_properties).astype(int)
ages = np.clip(ages, 0, 120)

print(f"✓ Generated base characteristics for {n_properties} properties")
print(f"  Square footage range: {square_footage.min():,} - {square_footage.max():,} sq ft")
print(f"  Bedroom range: {bedrooms.min()} - {bedrooms.max()}")
print(f"  Age range: {ages.min()} - {ages.max()} years")

In [None]:
# Create property types with inconsistent formatting
property_types_messy = [
    'Single Family', 'single family', 'SINGLE FAMILY', 'Single-Family', 'SF', 'SFH',
    'Townhouse', 'townhouse', 'TOWNHOUSE', 'Town House', 'TH', 
    'Condo', 'condo', 'CONDO', 'Condominium', 'Apartment', 'apt',
    'Duplex', 'duplex', 'DUPLEX', '2-unit', 'Two Unit'
]

# Normalize probabilities so they sum to 1
property_types_probs = np.array([0.15, 0.1, 0.05, 0.08, 0.1, 0.02, 
                                0.08, 0.05, 0.03, 0.02, 0.05,
                                0.08, 0.05, 0.02, 0.04, 0.02, 0.02,
                                0.03, 0.02, 0.01, 0.02, 0.01])
property_types_probs = property_types_probs / property_types_probs.sum()

property_types = np.random.choice(property_types_messy, n_properties, 
                                 p=property_types_probs)

# Create neighborhoods with spacing and capitalization issues
neighborhoods_messy = [
    'Downtown', 'downtown', ' Downtown ', 'Down Town', 'DOWNTOWN',
    'Suburban', 'suburban', 'SUBURBAN', ' Suburban', 'Suburban ',
    'Riverside', 'riverside', 'RIVERSIDE', 'River Side', ' Riverside ',
    'Historic District', 'historic district', 'HISTORIC DISTRICT', 
    'Historic Dist', 'Historic_District', ' Historic District ',
    'New Development', 'new development', 'NEW DEVELOPMENT', 
    'New Dev', 'NewDevelopment', ' New Development '
]

# Normalize neighborhood probabilities so they sum to 1
neighborhoods_probs = np.array([
    0.08, 0.05, 0.02, 0.02, 0.03,
    0.15, 0.1, 0.05, 0.02, 0.03,
    0.1, 0.05, 0.02, 0.02, 0.01,
    0.08, 0.03, 0.02, 0.02, 0.01, 0.02,
    0.1, 0.05, 0.03, 0.02, 0.01, 0.02
])
neighborhoods_probs = neighborhoods_probs / neighborhoods_probs.sum()

neighborhoods = np.random.choice(neighborhoods_messy, n_properties,
                                p=neighborhoods_probs)

print(f"\n✓ Created inconsistent categorical data")
print(f"  Property type variations: {len(np.unique(property_types))}")
print(f"  Neighborhood variations: {len(np.unique(neighborhoods))}")

In [None]:
# Calculate realistic prices
base_price = 120 * square_footage
bedroom_bonus = bedrooms * 12000
bathroom_bonus = bathrooms * 8000
age_penalty = ages * 800

prices = base_price + bedroom_bonus + bathroom_bonus - age_penalty
prices = prices * np.random.normal(1.0, 0.2, n_properties)
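# Keep prices and ages as floats so missing values (NaN) can be inserted later
# without relying on implicit dtype upcasting
prices = np.round(prices, -3)

# Create initial DataFrame
housing_data = pd.DataFrame({
    # object dtype: some IDs are replaced with text variants further down
    'property_id': pd.Series(property_ids, dtype='object'),
    'square_feet': square_footage,
    'bedrooms': bedrooms,
    'bathrooms': bathrooms,
    'age_years': ages.astype(float),
    'property_type': property_types,
    'neighborhood': neighborhoods,
    'price': prices
})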

print(f"\n✓ Created initial housing dataset")
print(f"  Shape: {housing_data.shape}")
print(f"\nFirst few rows:")
print(housing_data.head())

In [None]:
# Introduce realistic data quality issues
print("\n🔧 Introducing realistic data quality issues...")

# Missing values
missing_price_indices = np.random.choice(housing_data.index, size=25, replace=False)
housing_data.loc[missing_price_indices, 'price'] = np.nan

missing_bathroom_indices = np.random.choice(housing_data.index, size=15, replace=False)
housing_data.loc[missing_bathroom_indices, 'bathrooms'] = np.nan

missing_age_indices = np.random.choice(housing_data.index, size=20, replace=False)
housing_data.loc[missing_age_indices, 'age_years'] = np.nan

# Duplicates: append copies of five existing rows
duplicate_indices = np.random.choice(housing_data.index[:-10], size=5, replace=False)
housing_data = pd.concat([housing_data, housing_data.loc[duplicate_indices]],
                         ignore_index=True)

# Impossible values (replace=False so each value lands on a distinct row)
housing_data.loc[np.random.choice(housing_data.index, 3, replace=False), 'age_years'] = [-5, -2, -10]
housing_data.loc[np.random.choice(housing_data.index, 2, replace=False), 'bathrooms'] = [0, -1]
housing_data.loc[np.random.choice(housing_data.index, 3, replace=False), 'price'] = [50000, 5000000, 25000]

# Inconsistent property IDs (six rows: three reformatted, three missing)
id_issues = np.random.choice(housing_data.index, 6, replace=False)
housing_data.loc[id_issues[:3], 'property_id'] = ['ID1001', 'prop_1002', 'P1003']
housing_data.loc[id_issues[3:6], 'property_id'] = [None, '', 'MISSING']

print(f"✓ Data quality issues introduced:")
print(f"  • Missing values in multiple columns")
print(f"  • Duplicate records")
print(f"  • Impossible values (negative ages, zero bathrooms)")
print(f"  • Extreme outliers")
print(f"  • Inconsistent text formatting")
print(f"\n📊 Final messy dataset: {len(housing_data)} records")

**Exercise 1.1:** Explore the messy dataset:
1. Examine the first 10 rows and identify obvious problems
2. Check unique values in categorical columns
3. Look at data types and basic statistics
4. Count missing values in each column

In [None]:
# Your exploration code here


## Section 2: Initial Data Quality Assessment

Before cleaning data, it's important to identify and understand all data quality issues. This assessment helps you plan your cleaning steps, prioritize problems, and estimate the effort needed.
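
A compact per-column summary can make this assessment easier to scan. The helper below is a minimal sketch; `quality_summary` and the demo frame are hypothetical names introduced for illustration, not part of the lab's code.

```python
import numpy as np
import pandas as pd

def quality_summary(df):
    """Return one row per column: dtype, missing count/percent, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "distinct": df.nunique(),  # NaN is not counted as a distinct value
    })

# Tiny hypothetical frame used only to demonstrate the helper
demo = pd.DataFrame({
    "price": [250000.0, np.nan, 310000.0, 310000.0],
    "neighborhood": ["Downtown", "downtown", None, "Riverside"],
})
summary = quality_summary(demo)
print(summary)
```

The same call should work on `housing_data`, where the `distinct` column quickly exposes categorical columns with more variants than expected.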

In [None]:
print("=== COMPREHENSIVE DATA QUALITY ASSESSMENT ===")
print(f"Dataset shape: {housing_data.shape}")

# Missing value analysis
print("\n1. MISSING VALUES ANALYSIS")
print("-" * 30)
missing_summary = housing_data.isnull().sum()
missing_percentages = (missing_summary / len(housing_data)) * 100

for column in housing_data.columns:
    missing_count = missing_summary[column]
    missing_pct = missing_percentages[column]
    status = "✓ Clean" if missing_count == 0 else f"⚠ {missing_count} missing ({missing_pct:.1f}%)"
    print(f"{column:15} {status}")

print(f"\nTotal missing values: {missing_summary.sum()}")

In [None]:
# Duplicate analysis
print("\n2. DUPLICATE RECORDS ANALYSIS")
print("-" * 30)
total_duplicates = housing_data.duplicated().sum()
print(f"Complete duplicate rows: {total_duplicates}")

if total_duplicates > 0:
    print("\nSample duplicate records:")
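    # Sort by the string form of the ID: the column mixes numbers and text at this point
    duplicate_rows = housing_data[housing_data.duplicated(keep=False)].sort_values(
        'property_id', key=lambda ids: ids.astype(str))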
    print(duplicate_rows[['property_id', 'square_feet', 'bedrooms', 'price']].head())

# Text formatting issues
print("\n3. TEXT FORMATTING INCONSISTENCIES")
print("-" * 38)

print(f"Property type variations: {housing_data['property_type'].nunique()}")
print(f"Sample variations: {list(housing_data['property_type'].unique())[:8]}")

print(f"\nNeighborhood variations: {housing_data['neighborhood'].nunique()}")
print(f"Sample variations: {list(housing_data['neighborhood'].unique())[:8]}")

In [None]:
# Check for impossible values
print("\n4. IMPOSSIBLE VALUES DETECTION")
print("-" * 31)

negative_ages = (housing_data['age_years'] < 0).sum()
print(f"Properties with negative age: {negative_ages}")
if negative_ages > 0:
    print(f"  Values: {housing_data[housing_data['age_years'] < 0]['age_years'].tolist()}")

impossible_bathrooms = (housing_data['bathrooms'] <= 0).sum()
print(f"Properties with ≤0 bathrooms: {impossible_bathrooms}")

# Price outliers using IQR method
if housing_data['price'].notna().sum() > 0:
    Q1 = housing_data['price'].quantile(0.25)
    Q3 = housing_data['price'].quantile(0.75)
    IQR = Q3 - Q1
    price_outliers = ((housing_data['price'] < Q1 - 3*IQR) | 
                     (housing_data['price'] > Q3 + 3*IQR)).sum()
    print(f"Extreme price outliers: {price_outliers}")

print(f"\n📋 Assessment complete - roadmap for cleaning established!")

**Exercise 2.1:** Based on the assessment, prioritize the issues:
1. Which problems should be fixed first and why?
2. What additional information would help you decide how to handle missing values?
3. How might the severity of these issues depend on your analysis goals?

In [None]:
# Your prioritization analysis here


## Section 3: Cleaning Text Data and Standardizing Categories

Text columns often have inconsistent formatting in real-world data. Standardizing categorical variables is important for accurate analysis. The goal is to automate as much as possible while avoiding losing important differences.
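
An alternative to keyword rules is an explicit lookup table, which never merges a label it has not been told about. The sketch below is one possible shape for that approach; `PROPERTY_TYPE_MAP` and `map_property_type` are hypothetical names, and the table is deliberately incomplete.

```python
import pandas as pd

# Hypothetical lookup table: normalized raw label -> canonical category
PROPERTY_TYPE_MAP = {
    "single family": "Single Family",
    "single-family": "Single Family",
    "sf": "Single Family",
    "sfh": "Single Family",
    "townhouse": "Townhouse",
    "town house": "Townhouse",
    "th": "Townhouse",
}

def map_property_type(values):
    """Map labels through the lookup and report anything that was not recognized."""
    keys = values.str.strip().str.lower()
    mapped = keys.map(PROPERTY_TYPE_MAP)  # unknown keys become NaN
    unmapped = sorted(keys[mapped.isna() & keys.notna()].unique())
    return mapped, unmapped

labels = pd.Series(["SF", " Townhouse ", "single-family", "Loft"])
mapped, unmapped = map_property_type(labels)
print(mapped.tolist())
print(f"Needs review: {unmapped}")
```

Unrecognized labels are surfaced for review instead of being folded into an existing category, at the cost of maintaining the table by hand.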

In [None]:
# Create working copy
housing_clean = housing_data.copy()

print("=== CLEANING TEXT DATA AND STANDARDIZING CATEGORIES ===")
print("🔧 Creating working copy of data for cleaning...")

def standardize_property_type(prop_type):
    """
    Standardize property type values to consistent categories.
    Uses rule-based approach to handle variations in capitalization,
    spacing, and abbreviations.
    """
    if pd.isna(prop_type):
        return prop_type
    
    prop_type_clean = str(prop_type).lower().strip()
    
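    # Match abbreviations exactly: a substring test on 'sf' or 'th' would also
    # fire on unrelated labels (for example, 'th' occurs in 'other')
    if 'single' in prop_type_clean or prop_type_clean in ('sf', 'sfh'):
        return 'Single Family'
    elif 'town' in prop_type_clean or prop_type_clean == 'th':
        return 'Townhouse'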
    elif any(term in prop_type_clean for term in ['condo', 'apartment', 'apt']):
        return 'Condo'
    elif any(term in prop_type_clean for term in ['duplex', '2-unit', 'two unit']):
        return 'Duplex'
    else:
        return 'Other'

# Test the function
test_types = ['Single Family', 'single family', 'SF', 'TOWNHOUSE', 'apt']
print("\nTesting property type standardization:")
for test_type in test_types:
    result = standardize_property_type(test_type)
    print(f"  '{test_type}' → '{result}'")

In [None]:
def standardize_neighborhood(neighborhood):
    """
    Standardize neighborhood values to consistent categories.
    Handles spacing, capitalization, and alternative spellings.
    """
    if pd.isna(neighborhood):
        return neighborhood
    
    neighborhood_clean = str(neighborhood).lower().strip().replace('_', ' ')
    
    if 'downtown' in neighborhood_clean or 'down town' in neighborhood_clean:
        return 'Downtown'
    elif 'suburban' in neighborhood_clean:
        return 'Suburban'
    elif 'riverside' in neighborhood_clean or 'river side' in neighborhood_clean:
        return 'Riverside'
    elif 'historic' in neighborhood_clean:
        return 'Historic District'
    elif 'new' in neighborhood_clean and 'dev' in neighborhood_clean:
        # 'dev' covers 'Dev', 'Development', and 'NewDevelopment'
        return 'New Development'
    else:
        return 'Other'

def clean_property_id(prop_id):
    """
    Extract numeric property ID from various formats.
    Handles mixed formats by extracting the numeric portion.
    """
    if pd.isna(prop_id) or prop_id == '' or str(prop_id).upper() == 'MISSING':
        return np.nan
    
    prop_id_str = str(prop_id)
    digits = re.findall(r'\d+', prop_id_str)
    
    if digits:
        return int(digits[0])
    else:
        return np.nan

# Apply standardization functions
print("\nApplying standardization functions...")
print(f"Before: {housing_clean['property_type'].nunique()} property types, {housing_clean['neighborhood'].nunique()} neighborhoods")

housing_clean['property_type'] = housing_clean['property_type'].apply(standardize_property_type)
housing_clean['neighborhood'] = housing_clean['neighborhood'].apply(standardize_neighborhood)
housing_clean['property_id'] = housing_clean['property_id'].apply(clean_property_id)

print(f"After: {housing_clean['property_type'].nunique()} property types, {housing_clean['neighborhood'].nunique()} neighborhoods")
print("\n✅ Text standardization completed!")

**Exercise 3.1:** Practice creating standardization functions for other data types like phone numbers, email domains, or state abbreviations.
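
As a possible starting point for the phone-number case, the sketch below assumes 10-digit US-style numbers with an optional leading country code `1`; `standardize_phone` is a hypothetical helper, and other formats would need different rules.

```python
import re

def standardize_phone(raw):
    """Return a 10-digit number as 'XXX-XXX-XXXX', or None if it cannot be parsed."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", str(raw))      # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading country code
    if len(digits) != 10:
        return None                           # leave for manual review instead of guessing
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

samples = ["(555) 123-4567", "555.123.4567", "+1 555 123 4567", "12345"]
cleaned = [standardize_phone(s) for s in samples]
print(cleaned)
```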

In [None]:
# Your standardization practice here


## Section 4: Handling Missing Values

Missing values must be handled carefully because the way you fill them can change your results. Choose an imputation method that fits both the data and the reason the values are missing.
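
The group-median idea used in this section can also be written without explicit loops. The following is a minimal sketch on a hypothetical mini-dataset; the `age_imputed` flag column is an addition for illustration, so that imputed rows stay identifiable afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; column names mirror the lab's housing data
df = pd.DataFrame({
    "neighborhood": ["Downtown", "Downtown", "Downtown", "Suburban", "Suburban", "Riverside"],
    "age_years": [10.0, np.nan, 30.0, 5.0, np.nan, np.nan],
})

# Record which values are about to be imputed, before changing anything
df["age_imputed"] = df["age_years"].isna()

# Group median first; groups with no observed ages stay NaN at this point
group_median = df.groupby("neighborhood")["age_years"].transform("median")
df["age_years"] = df["age_years"].fillna(group_median)

# Overall median as a fallback for anything still missing
df["age_years"] = df["age_years"].fillna(df["age_years"].median())
print(df)
```

Here the single Riverside row has no neighbors to borrow from, so it falls through to the overall median, and the flag column records that it was filled.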

In [None]:
print("=== HANDLING MISSING VALUES ===")

# Strategy 1: Remove rows with missing property IDs (essential identifiers)
print("\n1. HANDLING MISSING PROPERTY IDs")
print("Strategy: Remove records (IDs are essential identifiers)")

before_removal = len(housing_clean)
housing_clean = housing_clean.dropna(subset=['property_id'])
after_removal = len(housing_clean)
removed_count = before_removal - after_removal

print(f"Removed {removed_count} rows with missing property IDs")
print(f"Remaining records: {after_removal}")

In [None]:
# Strategy 2: Impute missing ages using neighborhood patterns
print("\n2. HANDLING MISSING AGE VALUES")
print("Strategy: Use neighborhood medians (similar areas built at similar times)")

missing_age_count = housing_clean['age_years'].isna().sum()
print(f"Properties with missing age: {missing_age_count}")

if missing_age_count > 0:
    for neighborhood in housing_clean['neighborhood'].unique():
        if pd.notna(neighborhood):
            neighborhood_mask = housing_clean['neighborhood'] == neighborhood
            neighborhood_ages = housing_clean.loc[neighborhood_mask, 'age_years']
            
            if neighborhood_ages.notna().sum() > 0:
                median_age = neighborhood_ages.median()
                missing_in_neighborhood = neighborhood_ages.isna().sum()
                
                if missing_in_neighborhood > 0:
                    housing_clean.loc[neighborhood_mask & housing_clean['age_years'].isna(), 'age_years'] = median_age
                    print(f"  {neighborhood}: filled {missing_in_neighborhood} values with {median_age:.0f} years")
    
    # Fill any remaining missing values with overall median
    remaining_missing = housing_clean['age_years'].isna().sum()
    if remaining_missing > 0:
        overall_median = housing_clean['age_years'].median()
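        # Assign the result back; in-place fillna on a selected column may not update the DataFrame
        housing_clean['age_years'] = housing_clean['age_years'].fillna(overall_median)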
        print(f"  Filled remaining {remaining_missing} values with overall median: {overall_median:.0f} years")

print("✅ Age imputation completed")

In [None]:
# Strategy 3: Impute bathrooms based on bedrooms and property type
print("\n3. HANDLING MISSING BATHROOM VALUES")
print("Strategy: Use bedroom and property type patterns")

missing_bathroom_count = housing_clean['bathrooms'].isna().sum()
print(f"Properties with missing bathrooms: {missing_bathroom_count}")

if missing_bathroom_count > 0:
    for prop_type in housing_clean['property_type'].unique():
        if pd.notna(prop_type):
            # Loop variable is named n_bedrooms so it does not overwrite the
            # `bedrooms` array created in Section 1
            for n_bedrooms in sorted(housing_clean['bedrooms'].unique()):
                if pd.notna(n_bedrooms):
                    combined_mask = ((housing_clean['property_type'] == prop_type) & 
                                   (housing_clean['bedrooms'] == n_bedrooms))
                    reference_bathrooms = housing_clean.loc[combined_mask, 'bathrooms']
                    
                    if reference_bathrooms.notna().sum() >= 3:
                        median_bathrooms = reference_bathrooms.median()
                        missing_mask = combined_mask & housing_clean['bathrooms'].isna()
                        missing_count = missing_mask.sum()
                        
                        if missing_count > 0:
                            housing_clean.loc[missing_mask, 'bathrooms'] = median_bathrooms
                            print(f"  {prop_type}, {n_bedrooms} bed: filled {missing_count} values with {median_bathrooms:.1f} bathrooms")
    
    # Apply fallback rule for remaining missing values
    remaining_missing = housing_clean['bathrooms'].isna().sum()
    if remaining_missing > 0:
        for idx in housing_clean[housing_clean['bathrooms'].isna()].index:
            n_bedrooms = housing_clean.loc[idx, 'bedrooms']
            estimated_bathrooms = max(1.0, n_bedrooms * 0.75 + 0.5)
            housing_clean.loc[idx, 'bathrooms'] = round(estimated_bathrooms * 2) / 2
        print(f"  Applied bedroom-based estimation for remaining {remaining_missing} values")

print("✅ Bathroom imputation completed")

In [None]:
# Strategy 4: Impute prices using similar properties
print("\n4. HANDLING MISSING PRICE VALUES")
print("Strategy: Find similar properties and use their median price")

missing_price_count = housing_clean['price'].isna().sum()
print(f"Properties with missing prices: {missing_price_count}")

if missing_price_count > 0:
    has_price = housing_clean['price'].notna()
    imputation_count = 0
    
    for idx in housing_clean[housing_clean['price'].isna()].index:
        current_property = housing_clean.loc[idx]
        
        # Find similar properties (size, bedrooms, type)
        similar_mask = (
            (abs(housing_clean['square_feet'] - current_property['square_feet']) <= 200) &
            (housing_clean['bedrooms'] == current_property['bedrooms']) &
            (housing_clean['property_type'] == current_property['property_type']) &
            has_price
        )
        
        similar_properties = housing_clean[similar_mask]
        
        if len(similar_properties) >= 3:
            estimated_price = similar_properties['price'].median()
            housing_clean.loc[idx, 'price'] = estimated_price
            imputation_count += 1
            if imputation_count <= 3:
                print(f"  Property {int(current_property['property_id'])}: ${estimated_price:,.0f} (based on {len(similar_properties)} similar properties)")
        else:
            # Fallback: neighborhood and property type median
            fallback_mask = (
                (housing_clean['neighborhood'] == current_property['neighborhood']) &
                (housing_clean['property_type'] == current_property['property_type']) &
                has_price
            )
            fallback_properties = housing_clean[fallback_mask]
            
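            if len(fallback_properties) > 0:
                estimated_price = fallback_properties['price'].median()
            else:
                # Last resort: overall median of the observed prices
                estimated_price = housing_clean.loc[has_price, 'price'].median()
            housing_clean.loc[idx, 'price'] = estimated_price
            imputation_count += 1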
    
    if imputation_count > 3:
        print(f"  ... and {imputation_count - 3} other imputations")
    
    print(f"✅ Successfully imputed {imputation_count} missing prices")

# Final check
final_missing = housing_clean.isnull().sum().sum()
print(f"\n📊 Final missing values: {final_missing}")
if final_missing == 0:
    print("🎉 All missing values successfully handled!")

**Exercise 4.1:** Evaluate the missing value strategies and consider alternatives for different scenarios.

In [None]:
# Your missing value evaluation here


## Section 5: Removing Duplicates and Handling Outliers

Duplicates and outliers can distort your analysis. We need to distinguish between errors (clearly wrong values) and legitimate extreme values.
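
Statistical fences alone do not always separate the two. The sketch below uses a hypothetical price series and an assumed domain floor of $50,000 to show how the IQR factor changes what gets flagged; `iqr_bounds` and `PRICE_FLOOR` are illustrative names.

```python
import pandas as pd

def iqr_bounds(series, factor):
    """Return (lower, upper) fences at `factor` times the IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Hypothetical prices: one suspiciously low, one suspiciously high
prices = pd.Series([25_000, 210_000, 240_000, 260_000, 280_000, 300_000, 330_000, 5_000_000])

flagged_by_factor = {}
for factor in (1.5, 3):
    lower, upper = iqr_bounds(prices, factor)
    flagged = prices[(prices < lower) | (prices > upper)]
    flagged_by_factor[factor] = flagged.tolist()
    print(f"factor={factor}: fences [{lower:,.0f}, {upper:,.0f}] flag {flagged.tolist()}")

# A domain rule (assumed floor) catches the low value that the wide fence misses
PRICE_FLOOR = 50_000
below_floor = prices[prices < PRICE_FLOOR].tolist()
print(f"Below assumed floor: {below_floor}")
```

On this toy series the wider fence (factor 3) flags only the very high price, because the lower fence sits below the suspicious low value. Combining a statistical rule with a domain rule covers both cases.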

In [None]:
print("=== REMOVING DUPLICATES ===")

duplicates_before = housing_clean.duplicated().sum()
print(f"Duplicate rows found: {duplicates_before}")

if duplicates_before > 0:
    housing_clean = housing_clean.drop_duplicates()
    duplicates_after = housing_clean.duplicated().sum()
    removed = duplicates_before - duplicates_after
    print(f"✅ Removed {removed} duplicate records")

# Check for duplicate property IDs
id_duplicates = housing_clean['property_id'].duplicated().sum()
if id_duplicates > 0:
    housing_clean = housing_clean.drop_duplicates(subset=['property_id'])
    print(f"✅ Removed {id_duplicates} duplicate property IDs")

print(f"Final dataset shape: {housing_clean.shape}")

In [None]:
print("\n=== HANDLING IMPOSSIBLE VALUES ===")

# Fix negative ages
negative_ages = (housing_clean['age_years'] < 0).sum()
if negative_ages > 0:
    print(f"Found {negative_ages} negative ages - converting to positive")
    # Assumes the magnitude is right and only the sign was entered incorrectly
    negative_mask = housing_clean['age_years'] < 0
    housing_clean.loc[negative_mask, 'age_years'] = housing_clean.loc[negative_mask, 'age_years'].abs()
    print("✅ Negative ages corrected")

# Fix impossible bathroom counts
impossible_bathrooms = (housing_clean['bathrooms'] <= 0).sum()
if impossible_bathrooms > 0:
    print(f"Found {impossible_bathrooms} properties with ≤0 bathrooms - setting to 1.0")
    housing_clean.loc[housing_clean['bathrooms'] <= 0, 'bathrooms'] = 1.0
    print("✅ Impossible bathroom counts corrected")

if negative_ages == 0 and impossible_bathrooms == 0:
    print("✅ No impossible values found")

In [None]:
print("\n=== HANDLING EXTREME OUTLIERS ===")

def detect_outliers_iqr(data, column, factor=3):
    """Detect outliers using IQR method with specified factor."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR
    
    outlier_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    
    return {
        'count': outlier_mask.sum(),
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'mask': outlier_mask
    }

# Analyze price outliers
price_outliers = detect_outliers_iqr(housing_clean, 'price', factor=3)
print(f"Price outliers found: {price_outliers['count']}")
print(f"Normal range: ${price_outliers['lower_bound']:,.0f} - ${price_outliers['upper_bound']:,.0f}")

if price_outliers['count'] > 0:
    extreme_outliers = housing_clean.loc[price_outliers['mask']]
    corrections_made = 0
    
    for idx, row in extreme_outliers.iterrows():
        price = row['price']
        
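        # Only implausibly low prices are treated as errors here; high-price
        # outliers are flagged above but left unchanged for manual review
        if price < 100000: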
            # Find similar properties for correction
            similar_properties = housing_clean[
                (housing_clean['square_feet'].between(row['square_feet'] - 300, row['square_feet'] + 300)) &
                (housing_clean['neighborhood'] == row['neighborhood']) &
                (housing_clean['property_type'] == row['property_type']) &
                (~price_outliers['mask'])
            ]
            
            if len(similar_properties) > 0:
                corrected_price = similar_properties['price'].median()
                housing_clean.loc[idx, 'price'] = corrected_price
                corrections_made += 1
                if corrections_made <= 3:
                    print(f"  Corrected ${price:,.0f} → ${corrected_price:,.0f} (Property {int(row['property_id'])})")
    
    if corrections_made > 3:
        print(f"  ... and {corrections_made - 3} other corrections")
    
    print(f"✅ Corrected {corrections_made} obvious price errors")
else:
    print("✅ No extreme price outliers found")

print("\n✅ Outlier handling completed")

**Exercise 5.1:** Practice outlier detection with different IQR factors and visualize the results.

In [None]:
# Your outlier analysis practice here


## Section 6: Validating the Cleaned Dataset

After cleaning, validate that your dataset is accurate and ready for analysis. Think of validation as final quality control.
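
One way to keep such checks reusable is a function that returns a list of violations instead of printing as it goes. This is a minimal sketch; `check_ranges`, the demo data, and the limits are all hypothetical.

```python
import pandas as pd

def check_ranges(df, rules):
    """Return a list of readable range violations (an empty list means all passed)."""
    failures = []
    for column, (low, high) in rules.items():
        out_of_range = df[(df[column] < low) | (df[column] > high)]
        if not out_of_range.empty:
            failures.append(f"{column}: {len(out_of_range)} value(s) outside [{low}, {high}]")
    return failures

# Hypothetical data and limits, chosen only for illustration
demo = pd.DataFrame({"bedrooms": [2, 3, 9], "age_years": [12.0, 40.0, 7.0]})
rules = {"bedrooms": (1, 6), "age_years": (0, 150)}

failures = check_ranges(demo, rules)
print(failures or "All range checks passed")
```

Note that missing values pass this check silently, because comparisons with NaN are false; a separate missing-value check is still needed.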

In [None]:
print("=== VALIDATING THE CLEANED DATASET ===")

print(f"📊 Final dataset shape: {housing_clean.shape}")
print(f"📊 Data retention rate: {(len(housing_clean) / len(housing_data)) * 100:.1f}%")

# Check for remaining issues
remaining_missing = housing_clean.isnull().sum().sum()
remaining_duplicates = housing_clean.duplicated().sum()

print(f"\n✅ Remaining missing values: {remaining_missing}")
print(f"✅ Remaining duplicates: {remaining_duplicates}")

# Validate value ranges
print("\n📏 VALUE RANGE VALIDATION:")
validation_rules = {
    'property_id': {'min': 1000, 'max': 2000},
    'square_feet': {'min': 400, 'max': 5000},
    'bedrooms': {'min': 1, 'max': 6},
    'bathrooms': {'min': 0.5, 'max': 5},
    'age_years': {'min': 0, 'max': 150},
    'price': {'min': 50000, 'max': 3000000}
}

all_ranges_valid = True
for column, rules in validation_rules.items():
    min_val = housing_clean[column].min()
    max_val = housing_clean[column].max()
    
    min_valid = min_val >= rules['min']
    max_valid = max_val <= rules['max']
    range_valid = min_valid and max_valid
    
    if not range_valid:
        all_ranges_valid = False
    
    status = "✅" if range_valid else "❌"
    print(f"  {column:15} {status} [{min_val:8.1f} - {max_val:8.1f}]")

print(f"\n📊 Range validation: {'✅ PASSED' if all_ranges_valid else '❌ FAILED'}")

In [None]:
# Validate categorical consistency
print("\n🏷️ CATEGORICAL VALIDATION:")

expected_property_types = {'Single Family', 'Townhouse', 'Condo', 'Duplex', 'Other'}
actual_property_types = set(housing_clean['property_type'].unique())
property_type_valid = actual_property_types.issubset(expected_property_types)

expected_neighborhoods = {'Downtown', 'Suburban', 'Riverside', 'Historic District', 'New Development', 'Other'}
actual_neighborhoods = set(housing_clean['neighborhood'].unique())
neighborhood_valid = actual_neighborhoods.issubset(expected_neighborhoods)

print(f"  Property types: {'✅' if property_type_valid else '❌'} {actual_property_types}")
print(f"  Neighborhoods: {'✅' if neighborhood_valid else '❌'} {actual_neighborhoods}")

categorical_valid = property_type_valid and neighborhood_valid

# Validate logical relationships
print("\n🔗 LOGICAL RELATIONSHIP VALIDATION:")

reasonable_bathroom_ratio = ((housing_clean['bathrooms'] / housing_clean['bedrooms']) <= 2).all()
# Computed as a local Series so validation does not add a column to the dataset
price_per_sqft = housing_clean['price'] / housing_clean['square_feet']
reasonable_price_per_sqft = ((price_per_sqft >= 50) & 
                           (price_per_sqft <= 600)).all()
size_price_correlation = housing_clean['square_feet'].corr(housing_clean['price'])
reasonable_correlation = size_price_correlation > 0.5
unique_ids = housing_clean['property_id'].nunique() == len(housing_clean)

print(f"  Bathroom/bedroom ratios: {'✅' if reasonable_bathroom_ratio else '❌'}")
print(f"  Price per sqft range: {'✅' if reasonable_price_per_sqft else '❌'}")
print(f"  Size-price correlation: {'✅' if reasonable_correlation else '❌'} (r = {size_price_correlation:.3f})")
print(f"  Unique property IDs: {'✅' if unique_ids else '❌'}")

logical_valid = reasonable_bathroom_ratio and reasonable_price_per_sqft and reasonable_correlation and unique_ids

In [None]:
# Final quality score
quality_checks = [
    remaining_missing == 0,
    remaining_duplicates == 0,
    all_ranges_valid,
    categorical_valid,
    logical_valid
]

quality_score = sum(quality_checks)
total_checks = len(quality_checks)

print(f"\n🎯 FINAL DATA QUALITY SCORE: {quality_score}/{total_checks} ({(quality_score/total_checks)*100:.0f}%)")

if quality_score == total_checks:
    print("🎉 EXCELLENT! Dataset passed all quality checks.")
    print("   ✅ Data is ready for analysis and modeling.")
elif quality_score >= total_checks * 0.8:
    print("👍 GOOD! Dataset passed most quality checks.")
    print("   ⚠️ Minor issues may need attention.")
else:
    print("⚠️ WARNING! Significant quality issues remain.")

# Display key statistics
print(f"\n📈 CLEANED DATASET STATISTICS:")
print(f"   • Total properties: {len(housing_clean):,}")
print(f"   • Average price: ${housing_clean['price'].mean():,.0f}")
print(f"   • Price range: ${housing_clean['price'].min():,.0f} - ${housing_clean['price'].max():,.0f}")
print(f"   • Average size: {housing_clean['square_feet'].mean():.0f} sq ft")
print(f"   • Most common type: {housing_clean['property_type'].mode()[0]}")
print(f"   • Most common neighborhood: {housing_clean['neighborhood'].mode()[0]}")

**Exercise 6.1:** Design additional validation checks specific to real estate data or other domains.

In [None]:
# Your validation enhancement practice here


## Section 7: Documenting the Cleaning Process

Clear documentation ensures others can understand, reproduce, and trust your work. Document what was done, why it was done, the impact, and any limitations.
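
Part of this documentation can be collected while the cleaning runs rather than written afterwards. The sketch below records row counts and missing values around each step; `log_step` and `cleaning_log` are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

cleaning_log = []  # one dict per cleaning step

def log_step(df_before, df_after, step, reason):
    """Append a record of what a step did to row counts and missing values."""
    cleaning_log.append({
        "step": step,
        "reason": reason,
        "rows_before": len(df_before),
        "rows_after": len(df_after),
        "missing_before": int(df_before.isna().sum().sum()),
        "missing_after": int(df_after.isna().sum().sum()),
    })
    return df_after

# Hypothetical two-step cleaning run on a toy frame
raw = pd.DataFrame({"property_id": [1, 1, 2, 3], "price": [100.0, 100.0, None, 300.0]})
deduped = log_step(raw, raw.drop_duplicates(), "drop_duplicates", "exact duplicate rows")
filled = log_step(deduped, deduped.fillna({"price": deduped["price"].median()}),
                  "fill_price", "median imputation for missing price")

report = pd.DataFrame(cleaning_log)
print(report)
```

The resulting table states what each step did and why, and it can be saved alongside the cleaned dataset.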

In [None]:
print("=== DATA CLEANING DOCUMENTATION ===")

current_time = pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')

print("📋 DATA CLEANING REPORT")
print("=" * 50)
print(f"Date: {current_time}")
print(f"Dataset: Housing Market Data")
print(f"Purpose: Prepare housing data for market analysis")

print(f"\n📊 TRANSFORMATION SUMMARY:")
print(f"├─ Original: {housing_data.shape[0]:,} properties, {housing_data.shape[1]} features")
print(f"├─ Final: {housing_clean.shape[0]:,} properties, {housing_clean.shape[1]} features")
print(f"├─ Retention: {(len(housing_clean) / len(housing_data)) * 100:.1f}%")
print(f"└─ Removed: {len(housing_data) - len(housing_clean):,} records")

# Calculate improvement metrics
original_missing = housing_data.isnull().sum().sum()
final_missing = housing_clean.isnull().sum().sum()
original_duplicates = housing_data.duplicated().sum()
final_duplicates = housing_clean.duplicated().sum()
original_prop_types = housing_data['property_type'].nunique()
final_prop_types = housing_clean['property_type'].nunique()

print(f"\n🔧 CLEANING ACTIONS PERFORMED:")
print(f"\n1️⃣ TEXT STANDARDIZATION:")
print(f"   • Property types: {original_prop_types} → {final_prop_types} categories")
print(f"   • Neighborhoods: Standardized to consistent format")
print(f"   • Property IDs: Converted to numeric format")

print(f"\n2️⃣ MISSING VALUE TREATMENT:")
print(f"   • Total missing: {original_missing} → {final_missing}")
print(f"   • Ages: Imputed using neighborhood medians")
print(f"   • Bathrooms: Imputed using bedroom/type patterns")
print(f"   • Prices: Imputed using similar properties")

print(f"\n3️⃣ QUALITY IMPROVEMENTS:")
print(f"   • Duplicate rows: {original_duplicates} → {final_duplicates}")
print(f"   • Impossible values corrected")
print(f"   • Extreme outliers investigated and corrected")
print(f"   • Final quality score: {quality_score}/{total_checks} ({(quality_score/total_checks)*100:.0f}%)")

print(f"\n✅ BUSINESS IMPACT:")
print(f"   • Dataset ready for market analysis and modeling")
print(f"   • Consistent categories enable accurate comparisons")
print(f"   • Clean price data suitable for valuation models")
print(f"   • No missing values to interfere with algorithms")

print(f"\n⚠️ LIMITATIONS:")
print(f"   • Imputed values may introduce slight bias")
print(f"   • Outlier corrections based on statistical rules")
print(f"   • Synthetic dataset may not capture all real-world complexity")

print(f"\n📋 RECOMMENDATIONS:")
print(f"   1. Use cleaned dataset for market analysis")
print(f"   2. Consider flagging imputed values in sensitive analyses")
print(f"   3. Validate results against domain expertise")
print(f"   4. Monitor data quality in future updates")

print(f"\n✅ DATA CLEANING COMPLETED SUCCESSFULLY!")
print(f"   📊 {housing_clean.shape[0]:,} clean properties ready for analysis")

## Lab Summary and Key Takeaways

Congratulations! You’ve completed a realistic data cleaning lab, tackling the kinds of messy data challenges you’ll face in real projects.

### Key Skills Practiced

- **Systematic assessment:** You learned to identify, categorize, and prioritize data quality issues before cleaning.
- **Text standardization:** You created functions to clean up inconsistent categorical data, a common real-world problem.
- **Missing value imputation:** You used context-aware strategies, applying domain knowledge and data patterns to fill gaps.
- **Outlier handling:** You practiced distinguishing between errors and legitimate extreme values, using both statistics and business logic.
- **Validation and documentation:** You validated your results and documented your process for transparency and reproducibility.

### Why This Matters

Preparing and cleaning data often takes up a large share of a data science project. Clean data is essential for trustworthy analysis and modeling; without it, results are unreliable.

### Real-World Relevance

These techniques apply across industries: finance, healthcare, retail, manufacturing, and more. The skills you’ve built are foundational for any data-driven role.

### Best Practices

- Always work on a copy of your data.
- Document every cleaning decision.
- Combine statistical methods with domain knowledge.
- Validate your results from multiple angles.
- Be transparent about limitations and assumptions.
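
The first practice can be built into the structure of the code: if every cleaning step returns a new DataFrame, the original is never modified. A minimal sketch with hypothetical helper functions and toy data:

```python
import pandas as pd

def strip_text(df, column):
    """Return a new frame with surrounding whitespace removed from one text column."""
    out = df.copy()
    out[column] = out[column].str.strip()
    return out

def drop_exact_duplicates(df):
    """Return a new frame without exact duplicate rows."""
    return df.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({"neighborhood": [" Downtown", "Downtown", "Downtown"],
                    "price": [1, 1, 2]})

# Each step returns a new frame, so `raw` is left untouched
clean = raw.pipe(strip_text, "neighborhood").pipe(drop_exact_duplicates)

print(len(raw), len(clean))
```

Chaining steps with `pipe` also makes the order of operations explicit, which helps when documenting the process.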

### Moving Forward

You’re now equipped to approach any data cleaning task methodically: assess, plan, clean, validate, and document. Continue practicing on new datasets, try advanced imputation, and always seek to understand the root causes of data issues.

Clean data is the foundation of all good data science—your work here sets you up for success in any analysis or modeling project.
