# ML Data Preparation Pipeline

## Purpose
This notebook cleans the healthcare claims dataset in `outputs/claims_with_policy_rules.csv` and writes an ML-ready dataset containing a fixed set of 25 features with standardized column names.

## Key Objectives
1. Load and validate the input dataset
2. Select relevant features for ML modeling
3. Clean and standardize column names
4. Handle missing values and data type corrections
5. Save the final ML-ready dataset to `outputs/ml_outputs/`

## Input/Output Structure
- **Input**: `outputs/claims_with_policy_rules.csv`
- **Output**: `outputs/ml_outputs/final_ml_ready_claims.csv`
- **Working Directory**: `notebooks-02/`

---

## Step 1: Import Required Libraries

In [41]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pathlib import Path
import os
import warnings
import json
from datetime import datetime

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.3


## Step 2: Setup Directory Structure and Paths

In [42]:
# Define project paths
project_root = Path().resolve().parent  # Go up from notebooks-02 to project root
input_file = project_root / "outputs" / "claims_with_policy_rules.csv"
ml_outputs_dir = project_root / "outputs" / "ml_outputs"
output_file = ml_outputs_dir / "final_ml_ready_claims.csv"

# Create ml_outputs directory if it doesn't exist
ml_outputs_dir.mkdir(parents=True, exist_ok=True)

# Verify paths
print(f"Project root: {project_root}")
print(f"Input file: {input_file}")
print(f"ML outputs directory: {ml_outputs_dir}")
print(f"Output file: {output_file}")
print(f"\nInput file exists: {input_file.exists()}")
print(f"ML outputs directory exists: {ml_outputs_dir.exists()}")

Project root: /Users/kxshrx/asylum/healix
Input file: /Users/kxshrx/asylum/healix/outputs/claims_with_policy_rules.csv
ML outputs directory: /Users/kxshrx/asylum/healix/outputs/ml_outputs
Output file: /Users/kxshrx/asylum/healix/outputs/ml_outputs/final_ml_ready_claims.csv

Input file exists: True
ML outputs directory exists: True
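
The path setup above relies on the notebook being launched from `notebooks-02/`, since `Path().resolve().parent` is relative to the current working directory. As a minimal sketch of a more defensive lookup, the helper below walks upward until it finds a directory that contains `outputs/`. The name `find_project_root` and the choice of `outputs` as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "outputs") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker directory name is an assumption based on
    the paths used in this notebook.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

With this in place, `project_root = find_project_root(Path.cwd())` would give the same result whether the notebook is started from the project root or from `notebooks-02/`.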


## Step 3: Load Input Dataset with Error Handling

In [43]:
# Load the input dataset with proper error handling
try:
    print(f"Loading dataset from: {input_file}")
    df_raw = pd.read_csv(input_file)
    print(f"Successfully loaded dataset with shape: {df_raw.shape}")
    
except FileNotFoundError:
    print(f"Error: Input file not found at {input_file}")
    print("Please ensure the claims_with_policy_rules.csv file exists in the outputs directory.")
    raise
    
except pd.errors.EmptyDataError:
    print("Error: The input file appears to be empty.")
    raise
    
except Exception as e:
    print(f"Unexpected error loading data: {str(e)}")
    raise

Loading dataset from: /Users/kxshrx/asylum/healix/outputs/claims_with_policy_rules.csv
Successfully loaded dataset with shape: (222000, 33)
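
Loading without an exception does not guarantee that the file has the expected schema. A minimal sketch of a fail-fast check that could follow the load is shown here; `validate_columns` is a hypothetical helper, and the toy frame stands in for `df_raw`.

```python
import pandas as pd


def validate_columns(df: pd.DataFrame, required: list[str]) -> None:
    """Raise ValueError if the frame is empty or lacks any required column."""
    if df.empty:
        raise ValueError("Dataset has no rows")
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Toy frame standing in for df_raw (values are illustrative)
toy = pd.DataFrame({"claim_id": [1, 2], "age": [30, 62]})
validate_columns(toy, ["claim_id", "age"])  # passes silently
```

Raising early keeps a schema problem from surfacing later as a confusing `KeyError` during feature selection.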


## Step 4: Initial Dataset Investigation

In [44]:
# Print comprehensive dataset information
print("=" * 60)
print("INITIAL DATASET INVESTIGATION")
print("=" * 60)

print(f"\nDataset Shape: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")

print("\nColumn Names:")
for i, col in enumerate(df_raw.columns, 1):
    print(f"{i:2d}. {col}")

print("\nData Types:")
print(df_raw.dtypes)

print("\nMissing Data Summary:")
missing_summary = df_raw.isnull().sum()
missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)

if len(missing_summary) > 0:
    print(missing_summary)
    print(f"\nTotal missing values: {df_raw.isnull().sum().sum():,}")
    print(f"Percentage of missing data: {(df_raw.isnull().sum().sum() / (df_raw.shape[0] * df_raw.shape[1]) * 100):.2f}%")
else:
    print("No missing values found!")

INITIAL DATASET INVESTIGATION

Dataset Shape: 222,000 rows × 33 columns

Column Names:
 1. claim_id
 2. patient_hash
 3. age
 4. gender
 5. blood_type
 6. medical_condition
 7. admission_year_month
 8. admission_type
 9. length_of_stay_days
10. discharge_date
11. medication
12. test_results
13. insurance_provider
14. billing_amount
15. created_at
16. provider_id
17. plan_type
18. coverage_percentage
19. max_coverage_amount
20. copay_percentage
21. deductible_amount
22. annual_out_of_pocket_max
23. excluded_conditions
24. medication_coverage
25. diagnostic_test_coverage
26. admission_type_rules
27. waiting_period
28. pre_existing_condition_coverage
29. network_coverage
30. emergency_coverage
31. preventive_care_coverage
32. data_source
33. policy_id

Data Types:
claim_id                             int64
patient_hash                        object
age                                  int64
gender                              object
blood_type                          object
medical_condi
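
The missing-data summary in this step prints raw counts only. As a sketch of a per-column view that also shows percentages and dtypes, the hypothetical helper `missing_report` below builds a small report frame. The toy data is invented for illustration and is not taken from the claims dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns that actually have gaps
    return report[report["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


toy = pd.DataFrame({
    "age": [30, np.nan, 76, 28],
    "gender": ["Male", None, None, "Female"],
    "billing_amount": [18856.28, 33643.33, 27955.10, 37909.78],
})
report = missing_report(toy)
print(report)
```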

## Step 5: Define ML Feature Column List

In [45]:
# Define the final ML feature column list (refined set)
ml_feature_columns = [
    # Patient demographics and medical info
    'age',
    'gender', 
    'blood_type',
    'medical_condition',
    
    # Admission and stay details
    'admission_type',
    'length_of_stay_days',
    
    # Medical treatment
    'medication',
    'test_results',
    
    # Insurance and billing
    'insurance_provider',
    'billing_amount',
    
    # Policy details
    'plan_type',
    'coverage_percentage',
    'max_coverage_amount',
    'copay_percentage',
    'deductible_amount',
    'annual_out_of_pocket_max',
    
    # Coverage rules
    'excluded_conditions',
    'medication_coverage',
    'diagnostic_test_coverage',
    'admission_type_rules',
    'waiting_period',
    'pre_existing_condition_coverage',
    'network_coverage',
    'emergency_coverage',
    'preventive_care_coverage'
]

print(f"Defined {len(ml_feature_columns)} ML feature columns:")
for i, col in enumerate(ml_feature_columns, 1):
    print(f"{i:2d}. {col}")

# Check which columns are available in the dataset
available_columns = [col for col in ml_feature_columns if col in df_raw.columns]
missing_columns = [col for col in ml_feature_columns if col not in df_raw.columns]

print(f"\nAvailable columns ({len(available_columns)}): {available_columns}")
if missing_columns:
    print(f"\nMissing columns ({len(missing_columns)}): {missing_columns}")
else:
    print(f"\nAll {len(ml_feature_columns)} feature columns are available in the dataset!")

Defined 25 ML feature columns:
 1. age
 2. gender
 3. blood_type
 4. medical_condition
 5. admission_type
 6. length_of_stay_days
 7. medication
 8. test_results
 9. insurance_provider
10. billing_amount
11. plan_type
12. coverage_percentage
13. max_coverage_amount
14. copay_percentage
15. deductible_amount
16. annual_out_of_pocket_max
17. excluded_conditions
18. medication_coverage
19. diagnostic_test_coverage
20. admission_type_rules
21. waiting_period
22. pre_existing_condition_coverage
23. network_coverage
24. emergency_coverage
25. preventive_care_coverage

Available columns (25): ['age', 'gender', 'blood_type', 'medical_condition', 'admission_type', 'length_of_stay_days', 'medication', 'test_results', 'insurance_provider', 'billing_amount', 'plan_type', 'coverage_percentage', 'max_coverage_amount', 'copay_percentage', 'deductible_amount', 'annual_out_of_pocket_max', 'excluded_conditions', 'medication_coverage', 'diagnostic_test_coverage', 'admission_type_rules', 'waiting_period',

## Step 6: Drop Non-Essential Columns

In [46]:
# Identify columns to drop (everything not in the ML feature list)
columns_to_drop = [col for col in df_raw.columns if col not in ml_feature_columns]

print(f"Original dataset: {df_raw.shape[1]} columns")
print(f"Target ML features: {len(available_columns)} columns")
print(f"Columns to drop: {len(columns_to_drop)} columns")

if columns_to_drop:
    print("\nDropping the following columns:")
    for i, col in enumerate(columns_to_drop, 1):
        print(f"{i:2d}. {col}")
    
    # Create a copy with only the required columns
    df_ml = df_raw[available_columns].copy()
    
    print(f"\nSuccessfully dropped {len(columns_to_drop)} columns")
    print(f"New dataset shape: {df_ml.shape}")
    
else:
    print("\nNo columns need to be dropped - dataset already contains only ML features")
    df_ml = df_raw.copy()

Original dataset: 33 columns
Target ML features: 25 columns
Columns to drop: 8 columns

Dropping the following columns:
 1. claim_id
 2. patient_hash
 3. admission_year_month
 4. discharge_date
 5. created_at
 6. provider_id
 7. data_source
 8. policy_id

Successfully dropped 8 columns
New dataset shape: (222000, 25)


## Step 7: Clean Column Names and Ensure Consistency

In [47]:
# Function to convert column names to snake_case
def to_snake_case(column_name):
    """
    Convert column name to snake_case format.
    """
    import re
    
    # Convert to lowercase and replace spaces/special chars with underscores
    snake_case = re.sub(r'[\s\-\.]+', '_', column_name.strip().lower())
    
    # Remove any double underscores
    snake_case = re.sub(r'_+', '_', snake_case)
    
    # Remove leading/trailing underscores
    snake_case = snake_case.strip('_')
    
    return snake_case

# Store original column names for reference
original_columns = df_ml.columns.tolist()

# Apply snake_case conversion
new_columns = [to_snake_case(col) for col in df_ml.columns]

# Check if any changes are needed
column_changes = [(orig, new) for orig, new in zip(original_columns, new_columns) if orig != new]

if column_changes:
    print(f"Converting {len(column_changes)} column names to snake_case:")
    for orig, new in column_changes:
        print(f"   '{orig}' → '{new}'")
    
    # Apply the changes
    df_ml.columns = new_columns
    print("\nColumn names standardized to snake_case")
else:
    print("All column names are already in proper snake_case format")

print(f"\nFinal column names ({len(df_ml.columns)}):")
for i, col in enumerate(df_ml.columns, 1):
    print(f"{i:2d}. {col}")

All column names are already in proper snake_case format

Final column names (25):
 1. age
 2. gender
 3. blood_type
 4. medical_condition
 5. admission_type
 6. length_of_stay_days
 7. medication
 8. test_results
 9. insurance_provider
10. billing_amount
11. plan_type
12. coverage_percentage
13. max_coverage_amount
14. copay_percentage
15. deductible_amount
16. annual_out_of_pocket_max
17. excluded_conditions
18. medication_coverage
19. diagnostic_test_coverage
20. admission_type_rules
21. waiting_period
22. pre_existing_condition_coverage
23. network_coverage
24. emergency_coverage
25. preventive_care_coverage
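
The `to_snake_case` function above replaces whitespace, hyphens, and dots, which was sufficient here because every column was already in snake_case. If a future input used camelCase or other punctuation, a broader variant might be needed. The following is a sketch of one such variant; `to_snake_case_extended` is a hypothetical name, and the sample column names are invented.

```python
import re


def to_snake_case_extended(column_name: str) -> str:
    """Variant of to_snake_case that also splits camelCase boundaries."""
    # Insert an underscore between a lowercase letter/digit and an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", column_name.strip())
    # Replace runs of any non-alphanumeric characters with a single underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


for raw in ["Billing Amount", "lengthOfStay-Days", " plan.type ", "copay (%)"]:
    print(f"{raw!r} -> {to_snake_case_extended(raw)!r}")
```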


## Step 8: Data Cleaning and Type Correction

In [48]:
# Comprehensive data cleaning
print("STARTING DATA CLEANING PROCESS")
print("=" * 50)

# Store initial state for comparison
initial_shape = df_ml.shape
cleaning_log = []

# 1. Handle missing values
print("\n1. Handling Missing Values:")
missing_before = df_ml.isnull().sum().sum()

if missing_before > 0:
    print(f"   Found {missing_before:,} missing values")
    
    # Strategy for different column types
    for col in df_ml.columns:
        missing_count = df_ml[col].isnull().sum()
        if missing_count > 0:
            col_type = df_ml[col].dtype
            
            if pd.api.types.is_object_dtype(col_type) or pd.api.types.is_string_dtype(col_type):
                # For categorical/text columns, fill with 'Unknown'
                df_ml[col] = df_ml[col].fillna('Unknown')
                cleaning_log.append(f"Filled {missing_count} missing values in '{col}' with 'Unknown'")
                
            elif pd.api.types.is_numeric_dtype(col_type):
                # For numerical columns, fill with median
                median_val = df_ml[col].median()
                df_ml[col] = df_ml[col].fillna(median_val)
                cleaning_log.append(f"Filled {missing_count} missing values in '{col}' with median: {median_val}")
    
    missing_after = df_ml.isnull().sum().sum()
    print(f"   Reduced missing values from {missing_before:,} to {missing_after:,}")
else:
    print("   No missing values found")

# 2. Data type optimization
print("\n2. Optimizing Data Types:")
memory_before = df_ml.memory_usage(deep=True).sum() / 1024**2

# Convert appropriate columns to categorical
categorical_candidates = ['gender', 'blood_type', 'medical_condition', 'admission_type', 
                         'medication', 'test_results', 'insurance_provider', 'plan_type']

for col in categorical_candidates:
    if col in df_ml.columns and df_ml[col].dtype == 'object':
        unique_count = df_ml[col].nunique()
        total_count = len(df_ml[col])
        
        # Convert to categorical if it has reasonable number of unique values
        if unique_count < total_count * 0.5:  # Less than 50% unique values
            df_ml[col] = df_ml[col].astype('category')
            cleaning_log.append(f"Converted '{col}' to categorical ({unique_count} unique values)")

memory_after = df_ml.memory_usage(deep=True).sum() / 1024**2
print(f"   Memory usage: {memory_before:.2f} MB → {memory_after:.2f} MB (saved {memory_before-memory_after:.2f} MB)")

# 3. Remove any potential duplicates
print("\n3. Checking for Duplicates:")
duplicates_count = df_ml.duplicated().sum()
if duplicates_count > 0:
    df_ml = df_ml.drop_duplicates()
    cleaning_log.append(f"Removed {duplicates_count} duplicate rows")
    print(f"   Removed {duplicates_count} duplicate rows")
else:
    print("   No duplicate rows found")

final_shape = df_ml.shape
print(f"\nCleaning Summary:")
print(f"   Initial shape: {initial_shape}")
print(f"   Final shape: {final_shape}")
print(f"   Rows removed: {initial_shape[0] - final_shape[0]:,}")

if cleaning_log:
    print(f"\nCleaning Operations Performed:")
    for i, operation in enumerate(cleaning_log, 1):
        print(f"   {i}. {operation}")

STARTING DATA CLEANING PROCESS

1. Handling Missing Values:
   No missing values found

2. Optimizing Data Types:
   Memory usage: 255.60 MB → 158.81 MB (saved 96.80 MB)

3. Checking for Duplicates:
   Removed 167034 duplicate rows

Cleaning Summary:
   Initial shape: (222000, 25)
   Final shape: (54966, 25)
   Rows removed: 167,034

Cleaning Operations Performed:
   1. Converted 'gender' to categorical (2 unique values)
   2. Converted 'blood_type' to categorical (8 unique values)
   3. Converted 'medical_condition' to categorical (6 unique values)
   4. Converted 'admission_type' to categorical (3 unique values)
   5. Converted 'medication' to categorical (5 unique values)
   6. Converted 'test_results' to categorical (3 unique values)
   7. Converted 'insurance_provider' to categorical (5 unique values)
   8. Converted 'plan_type' to categorical (5 unique values)
   9. Removed 167034 duplicate rows
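
The duplicate check in this step runs after the identifier columns were dropped, so two distinct claims with identical feature values are counted as duplicates. Whether that is desirable depends on the modeling goal. The sketch below uses a toy frame (invented values) to show how duplicate counts can differ when judged on `claim_id` versus on the feature columns alone.

```python
import pandas as pd

# Toy data (invented for illustration): three distinct claims, two of which
# share identical feature values.
toy_raw = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "age": [30, 30, 62],
    "medical_condition": ["Cancer", "Cancer", "Obesity"],
})

feature_cols = ["age", "medical_condition"]

# Duplicates judged on the identifier: repeated claim records
dupes_by_id = toy_raw.duplicated(subset=["claim_id"]).sum()

# Duplicates judged on features only: distinct claims that look alike
dupes_by_features = toy_raw.duplicated(subset=feature_cols).sum()

print(f"Duplicate claim IDs: {dupes_by_id}")
print(f"Rows with repeated feature values: {dupes_by_features}")
```

In this notebook, the analogous check would be `df_raw.duplicated(subset=['claim_id'])` on the raw frame before Step 6, which would indicate how many of the removed rows were repeated records rather than look-alike claims.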


## Step 9: Save ML-Ready Dataset

In [49]:
# Save the cleaned ML-ready dataset
try:
    print(f"Saving ML-ready dataset to: {output_file}")
    
    # Save the main dataset
    df_ml.to_csv(output_file, index=False)
    
    # Verify the saved file
    if output_file.exists():
        file_size = output_file.stat().st_size / 1024**2  # Size in MB
        print(f"Successfully saved dataset!")
        print(f"   File size: {file_size:.2f} MB")
        print(f"   Location: {output_file}")
        
        # Quick verification by reading back a few rows
        verification_df = pd.read_csv(output_file, nrows=5)
        print(f"   Verification: Successfully read back {len(verification_df)} rows")
        
    else:
        print("Error: File was not created successfully")
        
except Exception as e:
    print(f"Error saving file: {str(e)}")
    raise

# Create a metadata file with processing information
metadata = {
    'processing_timestamp': datetime.now().isoformat(),
    'input_file': str(input_file),
    'output_file': str(output_file),
    'original_shape': df_raw.shape,  # raw input shape (initial_shape was captured after the column drop)
    'final_shape': final_shape,
    'columns_dropped': columns_to_drop,
    'final_columns': df_ml.columns.tolist(),
    'cleaning_operations': cleaning_log,
    'memory_usage_mb': memory_after
}

metadata_file = ml_outputs_dir / 'ml_data_preparation_metadata.json'
with open(metadata_file, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\nMetadata saved to: {metadata_file}")

Saving ML-ready dataset to: /Users/kxshrx/asylum/healix/outputs/ml_outputs/final_ml_ready_claims.csv
Successfully saved dataset!
   File size: 24.59 MB
   Location: /Users/kxshrx/asylum/healix/outputs/ml_outputs/final_ml_ready_claims.csv
   Verification: Successfully read back 5 rows

Metadata saved to: /Users/kxshrx/asylum/healix/outputs/ml_outputs/ml_data_preparation_metadata.json
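
One caveat about the saved file: CSV stores no dtype information, so the `category` conversions from Step 8 are not preserved when the file is read back. The sketch below demonstrates this on a toy frame using an in-memory buffer, and shows the `dtype` argument of `pd.read_csv` as one way to restore them. The toy values are illustrative.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "gender": pd.Categorical(["Male", "Female", "Male"]),
    "age": [30, 62, 76],
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the categorical column comes back as a plain text column
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read with an explicit dtype mapping: the categorical dtype is restored
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"gender": "category"})

print(plain.dtypes)
print(restored.dtypes)
```

A downstream notebook could build the mapping from the eight categorical column names listed in the cleaning log, or a format that stores dtypes (such as Parquet) could be used instead, assuming the extra dependency is acceptable.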


## Step 10: Final Summary and Sample Display

In [50]:
# Print comprehensive final summary
print("ML DATA PREPARATION COMPLETE")
print("=" * 60)

print(f"\nFINAL DATASET SUMMARY:")
print(f"   Shape: {df_ml.shape[0]:,} rows × {df_ml.shape[1]} columns")
print(f"   Memory usage: {df_ml.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"   Output location: {output_file}")

print(f"\nDROPPED COLUMNS ({len(columns_to_drop)}):")
if columns_to_drop:
    # Group dropped columns by type for better readability
    id_columns = [col for col in columns_to_drop if 'id' in col.lower()]
    date_columns = [col for col in columns_to_drop if any(word in col.lower() for word in ['date', 'time', 'created'])]
    other_columns = [col for col in columns_to_drop if col not in id_columns and col not in date_columns]
    
    if id_columns:
        print(f"   ID columns: {', '.join(id_columns)}")
    if date_columns:
        print(f"   Date/Time columns: {', '.join(date_columns)}")
    if other_columns:
        print(f"   Other columns: {', '.join(other_columns)}")
else:
    print("   No columns were dropped")

print(f"\nRETAINED ML FEATURES ({len(df_ml.columns)}):")
feature_groups = {
    'Demographics': ['age', 'gender', 'blood_type'],
    'Medical': ['medical_condition', 'medication', 'test_results'],
    'Admission': ['admission_type', 'length_of_stay_days'],
    'Insurance': ['insurance_provider', 'billing_amount', 'plan_type'],
    'Coverage': [col for col in df_ml.columns if 'coverage' in col or 'percentage' in col or 'amount' in col or 'deductible' in col]
}

for group_name, group_cols in feature_groups.items():
    present_cols = [col for col in group_cols if col in df_ml.columns]
    if present_cols:
        print(f"   {group_name}: {', '.join(present_cols)}")

print(f"\nDATA QUALITY METRICS:")
print(f"   Missing values: {df_ml.isnull().sum().sum():,} ({(df_ml.isnull().sum().sum() / df_ml.size * 100):.2f}%)")
print(f"   Duplicate rows: {df_ml.duplicated().sum():,}")
print(f"   Categorical columns: {len([col for col in df_ml.columns if df_ml[col].dtype.name == 'category'])}")
print(f"   Numerical columns: {len([col for col in df_ml.columns if df_ml[col].dtype.kind in 'biufc'])}")

print(f"\nSAMPLE DATA (First 5 rows):")
display(df_ml.head())

print(f"\nCOLUMN DATA TYPES:")
dtype_info = df_ml.dtypes.to_frame('Data Type')
dtype_info['Non-Null Count'] = df_ml.count()
dtype_info['Unique Values'] = df_ml.nunique()
display(dtype_info)

print(f"\nDataset is now ready for ML modeling!")
print(f"All outputs saved in: {ml_outputs_dir}")

ML DATA PREPARATION COMPLETE

FINAL DATASET SUMMARY:
   Shape: 54,966 rows × 25 columns
   Memory usage: 39.74 MB
   Output location: /Users/kxshrx/asylum/healix/outputs/ml_outputs/final_ml_ready_claims.csv

DROPPED COLUMNS (8):
   ID columns: claim_id, provider_id, policy_id
   Date/Time columns: discharge_date, created_at
   Other columns: patient_hash, admission_year_month, data_source

RETAINED ML FEATURES (25):
   Demographics: age, gender, blood_type
   Medical: medical_condition, medication, test_results
   Admission: admission_type, length_of_stay_days
   Insurance: insurance_provider, billing_amount, plan_type
   Coverage: billing_amount, coverage_percentage, max_coverage_amount, copay_percentage, deductible_amount, medication_coverage, diagnostic_test_coverage, pre_existing_condition_coverage, network_coverage, emergency_coverage, preventive_care_coverage

DATA QUALITY METRICS:
   Missing values: 0 (0.00%)
   Duplicate rows: 0
   Categorical columns: 8
   Numerical columns: 1

| | age | gender | blood_type | medical_condition | admission_type | length_of_stay_days | medication | test_results | insurance_provider | billing_amount | plan_type | coverage_percentage | max_coverage_amount | copay_percentage | deductible_amount | annual_out_of_pocket_max | excluded_conditions | medication_coverage | diagnostic_test_coverage | admission_type_rules | waiting_period | pre_existing_condition_coverage | network_coverage | emergency_coverage | preventive_care_coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30 | Male | B- | Cancer | Urgent | 2 | Paracetamol | Normal | Blue Cross | 18856.281306 | PPO Standard | 80.0 | Unlimited | 20.0 | 1500.0 | 8000.0 | Cosmetic surgery, Self-inflicted injuries, Exp... | Generic: $7.50 copay, Preferred brand: 30% coi... | 80.0 | Precertification required for inpatient stays,... | 0 | 0 | Nationwide PPO network with extensive provider... | Covered in and out of network, standard copays... | 100.0 |
| 1 | 62 | Male | A+ | Obesity | Emergency | 6 | Ibuprofen | Inconclusive | Medicare | 33643.327287 | Original Medicare (Parts A & B) | 80.0 | Unlimited | 20.0 | 1676.0 | No limit | Cosmetic surgery, Routine dental/vision/hearin... | Part D separate - varies by plan, $2000 OOP ma... | 80.0 | Part A: $1676 deductible per benefit period, t... | 0 | 0 | Any Medicare-accepting provider nationwide | Covered nationwide and limited international | 100.0 |
| 2 | 76 | Female | A- | Obesity | Emergency | 15 | Aspirin | Normal | Aetna | 27955.096079 | Choice POS II Standard | 80.0 | Unlimited | 20.0 | 750.0 | 6500.0 | Cosmetic treatments, Self-inflicted injuries, ... | Formulary-based tiered copays, Generic preferr... | 100.0 | Precertification required, Hospital copay per ... | 0 | 0 | POS with large provider network, optional PCP | Covered in and out of network with standard co... | 100.0 |
| 3 | 28 | Female | O+ | Diabetes | Elective | 30 | Ibuprofen | Abnormal | Medicare | 37909.78241 | Original Medicare (Parts A & B) | 80.0 | Unlimited | 20.0 | 1676.0 | No limit | Cosmetic surgery, Routine dental/vision/hearin... | Part D separate - varies by plan, $2000 OOP ma... | 80.0 | Part A: $1676 deductible per benefit period, t... | 0 | 0 | Any Medicare-accepting provider nationwide | Covered nationwide and limited international | 100.0 |
| 4 | 43 | Female | AB+ | Cancer | Urgent | 20 | Penicillin | Abnormal | Aetna | 14238.317814 | Choice POS II Standard | 80.0 | Unlimited | 20.0 | 750.0 | 6500.0 | Cosmetic treatments, Self-inflicted injuries, ... | Formulary-based tiered copays, Generic preferr... | 100.0 | Precertification required, Hospital copay per ... | 0 | 0 | POS with large provider network, optional PCP | Covered in and out of network with standard co... | 100.0 |



COLUMN DATA TYPES:


| Column | Data Type | Non-Null Count | Unique Values |
| --- | --- | --- | --- |
| age | int64 | 54966 | 77 |
| gender | category | 54966 | 2 |
| blood_type | category | 54966 | 8 |
| medical_condition | category | 54966 | 6 |
| admission_type | category | 54966 | 3 |
| length_of_stay_days | int64 | 54966 | 30 |
| medication | category | 54966 | 5 |
| test_results | category | 54966 | 3 |
| insurance_provider | category | 54966 | 5 |
| billing_amount | float64 | 54966 | 50000 |



Dataset is now ready for ML modeling!
All outputs saved in: /Users/kxshrx/asylum/healix/outputs/ml_outputs
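
The sample rows above show `max_coverage_amount` and `annual_out_of_pocket_max` holding text values such as `Unlimited` and `No limit`, which keeps those columns from being treated as numeric. One possible treatment, sketched below on a toy frame, is to coerce them with `pd.to_numeric` and keep the "no cap" information in a separate flag. The toy values (including `500000`) and the `_is_unlimited` suffix are assumptions made for illustration; whether NaN plus a flag is appropriate depends on the model used later.

```python
import pandas as pd

# Toy frame (illustrative values) mimicking the mixed text/number columns
toy = pd.DataFrame({
    "max_coverage_amount": ["Unlimited", "Unlimited", "500000"],
    "annual_out_of_pocket_max": ["8000.0", "No limit", "6500.0"],
})

for col in ["max_coverage_amount", "annual_out_of_pocket_max"]:
    # Non-numeric entries such as "Unlimited" become NaN
    numeric = pd.to_numeric(toy[col], errors="coerce")
    # Keep the "no cap" information as an explicit boolean flag
    toy[f"{col}_is_unlimited"] = numeric.isna()
    toy[col] = numeric

print(toy)
```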


---

## Usage Notes and Next Steps

### What This Notebook Accomplished

1. **Data Loading**: Loaded `outputs/claims_with_policy_rules.csv` (222,000 rows × 33 columns) with explicit handling for missing-file and empty-file errors

2. **Feature Selection**: Retained 25 ML features covering:
   - Patient demographics (age, gender, blood_type)
   - Medical information (medical_condition, medication, test_results)
   - Admission details (admission_type, length_of_stay_days)
   - Insurance and billing (insurance_provider, billing_amount, plan_type)
   - Policy and coverage rules (14 features, from coverage_percentage through preventive_care_coverage)

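3. **Data Cleaning**: Applied the following checks and fixes:
   - Verified that column names are in snake_case (no renames were needed)
   - Checked for missing values (none were found, so no imputation was applied)
   - Converted 8 low-cardinality text columns to `category`, reducing memory from 255.60 MB to 158.81 MB
   - Removed 167,034 duplicate rows, leaving 54,966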

4. **Output Management**: Saved clean dataset to `outputs/ml_outputs/final_ml_ready_claims.csv`
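
Because no missing values were found, the imputation branch in Step 8 never ran on this dataset. As a sketch of what that strategy does when gaps exist (median for numeric columns, `'Unknown'` for text columns), here is a toy example with invented values.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one gap in a numeric column and one in a text column
toy = pd.DataFrame({
    "age": [30.0, np.nan, 76.0, 28.0],
    "gender": ["Male", None, "Female", "Female"],
})

for col in toy.columns:
    if not toy[col].isnull().any():
        continue
    if is_numeric_dtype(toy[col]):
        # Numeric column: fill with the median of the observed values
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Text column: fill with an explicit placeholder label
        toy[col] = toy[col].fillna("Unknown")

print(toy)
```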

### Files Created
- `outputs/ml_outputs/final_ml_ready_claims.csv` - Main ML-ready dataset
- `outputs/ml_outputs/ml_data_preparation_metadata.json` - Processing metadata and logs

### Recommended Next Steps

1. **Feature Engineering** (create `notebooks-02/feature_engineering.ipynb`):
   - Create polynomial features
   - Generate interaction terms
   - Apply scaling/normalization
   - Encode categorical variables

2. **Exploratory Data Analysis** (create `notebooks-02/ml_eda.ipynb`):
   - Statistical summaries of all features
   - Correlation analysis
   - Distribution plots
   - Outlier detection

3. **Model Training** (create `notebooks-02/model_training.ipynb`):
   - Train/validation/test split
   - Multiple algorithm comparison
   - Cross-validation
   - Hyperparameter tuning

4. **Model Evaluation** (create `notebooks-02/model_evaluation.ipynb`):
   - Performance metrics
   - Feature importance analysis
   - Model interpretability
   - Prediction validation
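
As a rough starting point for the encoding and splitting steps listed above, the sketch below one-hot encodes a categorical column and produces a 60/20/20 train/validation/test split using pandas only. The toy values, the split ratios, and the random seed are assumptions for illustration; the notebook does not specify a target variable or a split strategy.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "age": [30, 62, 76, 28, 43, 55, 39, 47, 68, 51],
    "admission_type": ["Urgent", "Emergency", "Emergency", "Elective", "Urgent",
                       "Elective", "Urgent", "Emergency", "Elective", "Urgent"],
    "billing_amount": [18856.28, 33643.33, 27955.10, 37909.78, 14238.32,
                       22010.50, 9875.00, 41230.75, 30500.10, 12750.40],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["admission_type"])

# Shuffle once with a fixed seed, then slice into 60/20/20 partitions
shuffled = encoded.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
train_end, val_end = int(n * 0.6), int(n * 0.8)

train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(val), len(test))
```

A dedicated library such as scikit-learn offers stratified splitting and encoders that can be fit on the training partition only, which is usually preferable once a target column has been chosen.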

### Data Quality Assurance
- All transformations are logged and reproducible  
- Original data structure preserved in metadata  
- Memory-optimized data types applied  
- Consistent naming conventions enforced  
- Missing value strategies documented  

---

*This notebook is part of the Healix ML pipeline. For questions or issues, refer to the metadata file or re-run this notebook with updated parameters.*