# AutoPrepML Demo Notebook

This notebook demonstrates the core functionality of AutoPrepML using a synthetic, Titanic-like dataset.

## Features Demonstrated:
- Data loading and inspection
- Automatic issue detection
- Data cleaning and preprocessing
- Report generation
- Visualizations (in the generated HTML report)

In [None]:
# Install AutoPrepML (if not already installed)
# !pip install autoprepml

import pandas as pd
import numpy as np
from autoprepml import AutoPrepML
import warnings

# Suppress all warnings to keep the demo output readable; remove this line when debugging
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

## 1. Load and Prepare Dataset

We'll create a sample Titanic-like dataset with common data quality issues.

In [None]:
# Create a sample Titanic-like dataset
np.random.seed(42)

data = {
    'PassengerId': range(1, 101),
    'Survived': np.random.choice([0, 1], 100, p=[0.6, 0.4]),
    'Pclass': np.random.choice([1, 2, 3], 100, p=[0.2, 0.3, 0.5]),
    'Age': np.random.normal(30, 15, 100).clip(min=0),  # clip so no sampled age is negative
    'SibSp': np.random.poisson(0.5, 100),
    'Parch': np.random.poisson(0.3, 100),
    'Fare': np.random.lognormal(3, 1, 100),
    'Sex': np.random.choice(['male', 'female'], 100, p=[0.65, 0.35]),
    'Embarked': np.random.choice(['S', 'C', 'Q'], 100, p=[0.7, 0.2, 0.1])
}

df = pd.DataFrame(data)

# Add missing values
df.loc[df.sample(frac=0.2).index, 'Age'] = np.nan
df.loc[df.sample(frac=0.15).index, 'Embarked'] = np.nan
df.loc[df.sample(frac=0.05).index, 'Fare'] = np.nan

# Add outliers
df.loc[0, 'Fare'] = 1000.0
df.loc[1, 'Age'] = 150.0

print("Dataset created!")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
df.head()
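
The cell above injects gaps with `df.sample(frac=...)`, which draws from NumPy's global random state. As an optional variation, here is a minimal sketch of a helper that takes an explicit seed, so the affected rows do not depend on what ran earlier in the session. The name `inject_missing` and its parameters are hypothetical and not part of AutoPrepML.

```python
import numpy as np
import pandas as pd


def inject_missing(df: pd.DataFrame, column: str, frac: float, seed: int) -> pd.DataFrame:
    """Return a copy of df with a fraction `frac` of `column` set to NaN."""
    out = df.copy()
    # An explicit random_state makes the row choice independent of global NumPy state.
    rows = out.sample(frac=frac, random_state=seed).index
    out.loc[rows, column] = np.nan
    return out


toy = pd.DataFrame({"Age": np.arange(20, 40, dtype=float)})  # 20 rows, no gaps
holed = inject_missing(toy, "Age", frac=0.2, seed=0)
print(holed["Age"].isna().sum())  # 4 of the 20 rows are now missing
```

Because the helper works on a copy, the original frame stays intact and can be compared against the version with gaps.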

## 2. Initialize AutoPrepML

Create an AutoPrepML instance with your dataset.

In [None]:
# Initialize AutoPrepML
prep = AutoPrepML(df)

# Get a quick summary
summary = prep.summary()

print("📊 Dataset Summary")
print("=" * 50)
print(f"Shape: {summary['shape']}")
print(f"Numeric columns: {summary['numeric_columns']}")
print(f"Categorical columns: {summary['categorical_columns']}")
print("\nMissing values:")
for col, info in summary['missing_values'].items():
    print(f"  - {col}: {info['count']} ({info['percent']}%)")
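
To cross-check the missing-value figures, the same kind of count/percent table can be computed with plain pandas. This is a sketch only: it mirrors the `count` and `percent` keys printed above, but how AutoPrepML rounds percentages or whether it lists columns without gaps is an assumption here. The function name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> dict:
    """Per-column missing count and percentage, for columns that have gaps."""
    counts = df.isna().sum()
    return {
        col: {"count": int(n), "percent": round(float(100 * n / len(df)), 2)}
        for col, n in counts.items()
        if n > 0
    }


toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
print(missing_summary(toy))
# {'Age': {'count': 2, 'percent': 50.0}, 'Fare': {'count': 1, 'percent': 25.0}}
```

Running `missing_summary(df)` on the demo frame should report gaps in `Age`, `Embarked`, and `Fare`, the three columns where missing values were injected.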

## 3. Detect Data Issues

Run detection to identify missing values, outliers, and class imbalance.

In [None]:
# Run detection
detection_results = prep.detect(target_col='Survived')

print("🔍 Detection Results")
print("=" * 50)

# Missing values
print(f"\n📌 Missing Values: {len(detection_results['missing_values'])} columns affected")
for col, info in detection_results['missing_values'].items():
    print(f"  - {col}: {info['count']} missing ({info['percent']}%)")

# Outliers
outliers = detection_results['outliers']
print(f"\n📌 Outliers: {outliers['outlier_count']} detected using {outliers['method']}")

# Class imbalance
if 'class_imbalance' in detection_results:
    imbalance = detection_results['class_imbalance']
    print("\n📌 Class Balance:")
    print(f"  - Target: {imbalance['target_column']}")
    print(f"  - Is imbalanced: {imbalance['is_imbalanced']}")
    print(f"  - Distribution: {imbalance['class_distribution']}")
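
For intuition about what such checks involve, here is a pandas-only sketch of two common rules: the 1.5 × IQR rule for outliers and a minimum-share rule for class imbalance. Both are assumptions chosen for illustration. The notebook does not state which outlier method or imbalance threshold AutoPrepML uses, and the function names and the `min_share=0.3` default are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def class_shares(target: pd.Series) -> dict:
    """Map each class label to its share of the non-missing rows."""
    return target.value_counts(normalize=True).to_dict()


def is_imbalanced(target: pd.Series, min_share: float = 0.3) -> bool:
    """Hypothetical rule: imbalanced if the rarest class falls below min_share."""
    return bool(min(class_shares(target).values()) < min_share)


fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 1000.0])
print(fares[iqr_outlier_mask(fares)].tolist())  # [1000.0]

labels = pd.Series([0, 0, 0, 1])
print(class_shares(labels))   # {0: 0.75, 1: 0.25}
print(is_imbalanced(labels))  # True
```

On the demo data, `iqr_outlier_mask(df['Fare'])` should flag the injected fare of 1000.0, though a long-tailed lognormal column can produce additional flags.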

## 4. Clean the Data

Apply automatic cleaning based on detected issues.

In [None]:
# Clean the dataset
clean_df, report = prep.clean(task='classification', target_col='Survived')

print("🧹 Cleaning Complete!")
print("=" * 50)
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {clean_df.shape}")
print(f"Missing values remaining: {clean_df.isnull().sum().sum()}")

# Show first few rows of cleaned data
print("\n✅ Cleaned Data Sample:")
clean_df.head()
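
The notebook does not spell out which strategies `clean` applies. As a point of comparison, the sketch below shows one simple baseline: median imputation for numeric columns and most-frequent-value imputation for everything else. It is an illustrative assumption, not a description of AutoPrepML's behavior, and `simple_impute` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def simple_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the median and other gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # an all-missing column has no mode; leave it alone
                out[col] = out[col].fillna(modes.iloc[0])
    return out


toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})
filled = simple_impute(toy)
print(filled["Age"].tolist())       # [20.0, 30.0, 40.0]
print(filled["Embarked"].tolist())  # ['S', 'S', 'S']
```

Comparing such a baseline against `clean_df` is one way to see how far the library's choices differ from the simplest possible approach.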

## 5. Review the Report

The report contains detailed information about all preprocessing steps.

In [None]:
# Examine the report
print("📄 Preprocessing Report")
print("=" * 50)
print(f"Timestamp: {report['timestamp']}")
print(f"Original shape: {report['original_shape']}")
print(f"Cleaned shape: {report['cleaned_shape']}")

print("\n🔧 Processing Steps:")
for i, log_entry in enumerate(report['logs'][-5:], 1):  # Last 5 steps
    print(f"{i}. {log_entry['action']}: {log_entry['details']}")
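
The cell above prints only the last five log entries. When the log is long, a count per action gives a quicker overview. The sketch below assumes only the structure already used in this notebook (a `logs` list whose entries have an `action` key). The sample report and its action names are made up for illustration.

```python
from collections import Counter


def count_actions(report: dict) -> Counter:
    """Count how often each action appears in the report's log entries."""
    return Counter(entry["action"] for entry in report["logs"])


# Hypothetical stand-in for a real report; the action names are invented.
sample_report = {
    "logs": [
        {"action": "impute", "details": "Age"},
        {"action": "impute", "details": "Fare"},
        {"action": "encode", "details": "Sex"},
    ]
}

for action, n in count_actions(sample_report).most_common():
    print(f"{action}: {n}")
```

In the notebook itself, the call would be `count_actions(report)`.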

## 6. Save Results

Save the cleaned dataset and generate an HTML report.

In [None]:
# Save cleaned data
clean_df.to_csv('titanic_cleaned.csv', index=False)

# Save HTML report
prep.save_report('titanic_report.html')

print("✅ Results saved!")
print("  - Cleaned data: titanic_cleaned.csv")
print("  - Report: titanic_report.html")
print("\n💡 Open titanic_report.html in a browser to see visualizations!")
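
After saving, it can be worth reloading the CSV to confirm that it round-trips as expected. The sketch below uses only pandas and a temporary file. `check_saved_csv` is a hypothetical helper name, and the tiny demo frame stands in for `clean_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_saved_csv(path, expected_rows: int) -> dict:
    """Reload a CSV and report whether the row count matches and how many cells are empty."""
    reloaded = pd.read_csv(path)
    return {
        "rows_match": len(reloaded) == expected_rows,
        "missing_cells": int(reloaded.isna().sum().sum()),
    }


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_cleaned.csv"
    demo = pd.DataFrame({"Age": [22.0, 30.0], "Sex": ["male", "female"]})
    demo.to_csv(demo_path, index=False)
    result = check_saved_csv(demo_path, expected_rows=2)

print(result)  # {'rows_match': True, 'missing_cells': 0}
```

For the file written above, the equivalent call would be `check_saved_csv('titanic_cleaned.csv', len(clean_df))`.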

## Summary

In this notebook, we:

1. ✅ Created a sample dataset with missing values and outliers
2. ✅ Initialized AutoPrepML
3. ✅ Detected data quality issues
4. ✅ Automatically cleaned the data
5. ✅ Reviewed the preprocessing report
6. ✅ Saved results for further analysis

### Next Steps

- Try AutoPrepML on your own datasets
- Customize preprocessing with config files
- Explore the HTML report visualizations
- Use the CLI for batch processing