# Automated EDA with Helper Tools

This notebook shows how to use the `auto_eda.py` module from the helper_scripts repository to run a comprehensive exploratory data analysis (EDA) with minimal code.

## Learning Objectives
- Learn to use the GeneralEDA class for automated analysis
- Generate comprehensive data reports automatically
- Understand advanced EDA techniques and their applications
- Build reusable EDA workflows

## 1. Setup and Import

First, let's import the necessary libraries and the auto_eda module:

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import the GeneralEDA class from the repository
import sys
sys.path.append('../../')  # Add path to access the auto_eda module
from auto_eda import GeneralEDA

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ All imports successful!")
print("📊 Ready to perform automated EDA!")

## 2. Load Sample Dataset

Let's create a sample dataset to demonstrate the EDA capabilities:

In [None]:
# Create a realistic sample dataset for demonstration
np.random.seed(42)
n_samples = 1000

# Generate synthetic customer data
data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.random.normal(35, 12, n_samples).clip(18, 80).astype(int),  # clip to a plausible adult range (avoids negative ages)
    'income': np.random.exponential(50000, n_samples),
    'spending_score': np.random.beta(2, 5, n_samples) * 100,
    'education_level': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 
                                       n_samples, p=[0.4, 0.35, 0.2, 0.05]),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], 
                            n_samples, p=[0.3, 0.25, 0.2, 0.15, 0.1]),
    'purchase_frequency': np.random.poisson(3, n_samples),
    'satisfaction_rating': np.random.choice([1, 2, 3, 4, 5], n_samples, p=[0.05, 0.1, 0.2, 0.4, 0.25]),
    'signup_date': pd.date_range('2020-01-01', periods=n_samples, freq='D')
}

# Create DataFrame
df = pd.DataFrame(data)

# Add some missing values to make it realistic
missing_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
df.loc[missing_indices, 'income'] = np.nan

missing_indices = np.random.choice(df.index, size=int(0.03 * len(df)), replace=False)
df.loc[missing_indices, 'education_level'] = np.nan

# Add some outliers
outlier_indices = np.random.choice(df.index, size=20, replace=False)
df.loc[outlier_indices, 'income'] = df.loc[outlier_indices, 'income'] * 10

print(f"Dataset created with {len(df)} rows and {len(df.columns)} columns")
print("\nFirst few rows:")
df.head()

## 3. Initialize GeneralEDA Class

Now let's initialize the GeneralEDA class and start our automated analysis:

In [None]:
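# Initialize the GeneralEDA class with a copy, so the original `df` stays
# unchanged for the before/after comparisons later in the notebook
eda = GeneralEDA(df.copy())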

print("🔧 GeneralEDA initialized successfully!")
print("📋 Available methods:")

# List available methods
methods = [method for method in dir(eda) if not method.startswith('_')]
for i, method in enumerate(methods, 1):
    print(f"{i:2d}. {method}")

## 4. Data Integrity Validation

Let's start with validating the integrity of our data:

In [None]:
# Validate data integrity
print("🔍 Validating data integrity...\n")
eda.validate_data_integrity()

print("\n" + "="*50)
print("Data integrity validation completed!")
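
The internals of `validate_data_integrity()` are not shown in this notebook. As a rough idea of what an integrity check can cover, the sketch below uses plain pandas. The helper name `basic_integrity_report` and the toy frame are hypothetical and are not part of `auto_eda`.

```python
import numpy as np
import pandas as pd


def basic_integrity_report(frame: pd.DataFrame) -> dict:
    """Collect a few simple integrity indicators (hypothetical helper, not auto_eda's API)."""
    return {
        "n_rows": len(frame),
        "n_duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isnull().sum().to_dict(),
        # Columns holding a single distinct value carry no information
        "constant_columns": [
            c for c in frame.columns if frame[c].nunique(dropna=False) <= 1
        ],
    }


# Toy data: one exact duplicate row, one missing income, one constant column
toy = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "income": [100.0, 200.0, 200.0, np.nan],
    "country": ["US", "US", "US", "US"],
})
report = basic_integrity_report(toy)
print(report)
```

A report like this is useful to run both before and after cleaning, so the effect of each step can be compared.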

## 5. Basic Data Information

Get comprehensive information about the dataset:

In [None]:
# Get basic data information
print("📊 Basic Data Information:\n")
eda.data_info()

print("\n" + "="*50)
print("Basic data information completed!")

## 6. Statistical Summary

Generate comprehensive statistical summaries:

In [None]:
# Generate statistical summary
print("📈 Statistical Summary:\n")
eda.statistical_summary()

print("\n" + "="*50)
print("Statistical summary completed!")

## 7. Handle Missing Values

Automatically detect and handle missing values:

In [None]:
# Check missing values before handling
print("❌ Missing values before handling:")
print(eda.df.isnull().sum())

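# Handle missing values
# Note: the mean is only defined for numerical columns; check the output below to
# see whether categorical columns such as education_level still contain missing values
print("\n🔧 Handling missing values...")
eda.handle_missing_values(strategy='mean')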

# Check missing values after handling
print("\n✅ Missing values after handling:")
print(eda.df.isnull().sum())

print("\n" + "="*50)
print("Missing value handling completed!")
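
How `handle_missing_values` treats non-numerical columns is not shown here. Since a mean cannot be computed for a column like `education_level`, one common approach is to fill numerical columns with the mean and categorical columns with the most frequent value. The sketch below illustrates that idea with plain pandas; `impute_simple` is a hypothetical helper, not the `auto_eda` implementation.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean and other columns with the mode.

    Hypothetical helper for illustration only. Returns a new DataFrame.
    """
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


toy = pd.DataFrame({
    "income": [10.0, 20.0, np.nan, 30.0],
    "education_level": ["Bachelor", None, "Bachelor", "Master"],
})
filled = impute_simple(toy)
print(filled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so for skewed columns such as income the median is often a safer choice.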

## 8. Handle Duplicates

Detect and remove duplicate rows:

In [None]:
# Check for duplicates
print(f"📋 Rows before duplicate handling: {len(eda.df)}")

# Handle duplicates
eda.handle_duplicates()

print(f"📋 Rows after duplicate handling: {len(eda.df)}")

print("\n" + "="*50)
print("Duplicate handling completed!")

## 9. Outlier Detection and Handling

Automatically detect and handle outliers:

In [None]:
# Handle outliers using Z-score method
print("🎯 Handling outliers using Z-score method...\n")
eda.handle_outliers(method='zscore', threshold=3)

print("\n" + "="*50)
print("Outlier handling completed!")
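
The Z-score rule flags a value when it lies more than `threshold` standard deviations from the column mean. The sketch below shows that rule in isolation with pandas and NumPy. The function `zscore_outlier_mask` and the synthetic values are assumptions for illustration, and what `handle_outliers` does with flagged rows (drop, cap, or replace) is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def zscore_outlier_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return True where |z| exceeds the threshold (hypothetical helper)."""
    std = series.std(ddof=0)
    if std == 0 or np.isnan(std):
        # A constant or empty column has no outliers under this rule
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    return z.abs() > threshold


# 200 typical incomes plus one value ten times larger
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(50_000, 5_000, 200), 500_000.0))
mask = zscore_outlier_mask(values, threshold=3)
print(f"Flagged {int(mask.sum())} of {len(values)} values")
```

Because the mean and standard deviation are themselves pulled by extreme values, the Z-score rule can miss outliers in heavily skewed columns; an IQR-based rule is a common alternative in that case.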

## 10. Feature Engineering

Automatically create new features:

In [None]:
# Record the column count before feature engineering
n_cols_before = len(eda.df.columns)
print(f"📊 Columns before feature engineering: {n_cols_before}")
print(f"Columns: {list(eda.df.columns)}")

# Perform feature engineering
print("\n🔨 Performing automated feature engineering...")
eda.feature_engineering()

# Compare against the recorded count rather than `df`, which may have been
# modified in place if GeneralEDA holds a reference to it
print(f"\n📊 Columns after feature engineering: {len(eda.df.columns)}")
print(f"New columns added: {len(eda.df.columns) - n_cols_before}")

print("\n" + "="*50)
print("Feature engineering completed!")

## 11. Feature Scaling

Apply feature scaling to numerical columns:

In [None]:
# Apply feature scaling
print("⚖️ Applying feature scaling (standard scaler)...")
eda.feature_scaling(method='standard')

# Show summary statistics after scaling
print("\n📊 Summary statistics after scaling:")
numerical_cols = eda.df.select_dtypes(include=[np.number]).columns
print(eda.df[numerical_cols].describe())

print("\n" + "="*50)
print("Feature scaling completed!")
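
Standard scaling subtracts each column's mean and divides by its standard deviation, so every scaled column ends up with mean 0 and standard deviation 1. The sketch below reproduces this by hand with pandas. The `standardize` helper and its `exclude` parameter are hypothetical; whether `feature_scaling` skips identifier columns such as `customer_id` is not shown in this notebook.

```python
import pandas as pd


def standardize(frame: pd.DataFrame, exclude=()) -> pd.DataFrame:
    """Z-score numeric columns, leaving excluded and non-numeric columns untouched.

    Hypothetical helper; uses the population standard deviation (ddof=0).
    """
    out = frame.copy()
    numeric_cols = [
        c for c in out.select_dtypes(include="number").columns if c not in exclude
    ]
    for col in numeric_cols:
        std = out[col].std(ddof=0)
        if std > 0:  # skip constant columns to avoid division by zero
            out[col] = (out[col] - out[col].mean()) / std
    return out


toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [20, 30, 40, 50],
    "city": ["A", "B", "A", "B"],
})
scaled = standardize(toy, exclude=("customer_id",))
print(scaled)
```

Excluding identifiers matters because a scaled ID has no meaning and can no longer be used to join results back to the source records.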

## 12. Anomaly Detection

Detect anomalies in the dataset using Isolation Forest:

In [None]:
# Detect and visualize anomalies
print("🔍 Detecting anomalies using Isolation Forest...")
eda.detect_and_visualize_anomalies(contamination=0.05)

print("\n" + "="*50)
print("Anomaly detection completed!")
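
For readers who want to see the underlying technique, the sketch below runs scikit-learn's `IsolationForest` directly on a small synthetic frame. It assumes scikit-learn is installed and is not the `auto_eda` implementation; the column names and the planted row are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 typical rows (values are illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "income": rng.normal(50_000, 5_000, 200),
    "spending_score": rng.normal(30, 5, 200),
})
# Plant one obviously unusual row at the end
toy.loc[len(toy)] = [500_000.0, 95.0]

# contamination is the expected share of anomalies in the data
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(toy)  # -1 = anomaly, 1 = normal

toy["is_anomaly"] = labels == -1
print(f"{int(toy['is_anomaly'].sum())} of {len(toy)} rows flagged")
```

The `contamination` value fixes roughly what fraction of rows gets flagged, so some ordinary rows are labeled as anomalies whenever the true anomaly rate is lower than the chosen value.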

## 13. Generate Comprehensive Reports

Create automated reports for the analysis:

In [None]:
# Generate pandas profiling report (commented out to avoid large output)
# This would create a comprehensive HTML report
print("📋 Generating comprehensive data profiling report...")
print("(This would normally create an HTML report with detailed analysis)")

# Uncomment the following lines to generate actual report:
# eda.generate_pandas_profiling_report(
#     output_file="customer_data_profile.html",
#     sample_fraction=0.1,  # Use 10% sample for faster processing
#     title="Customer Data Analysis Report"
# )

print("✅ Report generation configured (uncomment to run)")

## 14. Save Analysis Report

Save a text summary of the analysis:

In [None]:
# Save comprehensive text report
print("💾 Saving comprehensive EDA report...")
eda.save_report(filepath="customer_eda_report.txt")

print("✅ Report saved successfully!")
print("📁 Check 'customer_eda_report.txt' for detailed analysis")

## 15. Get Transformed Dataset

Retrieve the final processed dataset:

In [None]:
# Get the transformed dataset
transformed_df = eda.get_dataframe()

print("📊 Final Dataset Summary:")
print(f"   Original shape: {df.shape}")
print(f"   Transformed shape: {transformed_df.shape}")
print(f"   Columns added: {transformed_df.shape[1] - df.shape[1]}")
print(f"   Missing values: {transformed_df.isnull().sum().sum()}")

print("\n📋 Final columns:")
for i, col in enumerate(transformed_df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n✅ Dataset ready for machine learning!")

## 16. Custom Analysis Functions

Let's create some custom analysis functions that extend the GeneralEDA capabilities:

In [None]:
from scipy.stats import chi2_contingency

def analyze_categorical_relationships(df, target_col=None):
    """Analyze pairwise relationships between categorical variables"""
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    
    if len(categorical_cols) < 2:
        print("Need at least 2 categorical columns for relationship analysis")
        return
    
    print("🔗 Categorical Variable Relationships:")
    
    for i, col1 in enumerate(categorical_cols):
        for col2 in categorical_cols[i+1:]:
            # Create contingency table (rows with missing values are excluded)
            contingency = pd.crosstab(df[col1], df[col2])
            
            # Calculate Cramér's V (measure of association: 0 = none, 1 = perfect)
            chi2 = chi2_contingency(contingency)[0]
            n = contingency.to_numpy().sum()
            min_dim = min(contingency.shape) - 1
            cramers_v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan
            
            print(f"\n{col1} vs {col2}:")
            print(contingency)
            print(f"Cramér's V: {cramers_v:.3f}")

def create_feature_interaction_plot(df, feature1, feature2, target=None):
    """Create interaction plots between features"""
    plt.figure(figsize=(12, 4))
    
    # Scatter plot
    plt.subplot(1, 3, 1)
    plt.scatter(df[feature1], df[feature2], alpha=0.6)
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    plt.title(f'{feature1} vs {feature2}')
    plt.grid(True, alpha=0.3)
    
    # Distribution plots
    plt.subplot(1, 3, 2)
    plt.hist(df[feature1], bins=30, alpha=0.7, label=feature1)
    plt.xlabel(feature1)
    plt.ylabel('Frequency')
    plt.title(f'Distribution of {feature1}')
    plt.grid(True, alpha=0.3)
    
    plt.subplot(1, 3, 3)
    plt.hist(df[feature2], bins=30, alpha=0.7, label=feature2, color='orange')
    plt.xlabel(feature2)
    plt.ylabel('Frequency')
    plt.title(f'Distribution of {feature2}')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Apply custom analysis
print("🎨 Applying custom analysis functions...")

# Analyze categorical relationships
analyze_categorical_relationships(df)

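# Create feature interaction plots for two numerical features
# (skip customer_id, which is an identifier rather than a measurement)
numerical_features = df.select_dtypes(include=[np.number]).columns.drop('customer_id', errors='ignore')[:2]
if len(numerical_features) >= 2:
    # Drop rows with missing values in the plotted columns before plotting
    plot_df = df.dropna(subset=list(numerical_features))
    create_feature_interaction_plot(plot_df, numerical_features[0], numerical_features[1])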

print("\n✅ Custom analysis completed!")
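
For reference, Cramér's V, the association measure named in the comment above, is computed from the chi-squared statistic $\chi^2$ of an $r \times c$ contingency table with $n$ observations:

$$
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
$$

It ranges from 0 (no association) to 1 (perfect association). Because `education_level` and `city` are sampled independently in this synthetic dataset, a value close to 0 is expected here.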

## 17. Summary and Next Steps

Let's summarize what we've accomplished with the automated EDA:

In [None]:
print("🎉 AUTOMATED EDA COMPLETED!")
print("=" * 50)

print("\n✅ Tasks Completed:")
tasks = [
    "Data integrity validation",
    "Basic data information analysis",
    "Statistical summary generation",
    "Missing value handling",
    "Duplicate detection and removal",
    "Outlier detection and treatment",
    "Automated feature engineering",
    "Feature scaling and normalization",
    "Anomaly detection with visualization",
    "Comprehensive report generation",
    "Custom analysis functions"
]

for i, task in enumerate(tasks, 1):
    print(f"{i:2d}. {task}")

print("\n📊 Dataset Transformation Summary:")
print(f"   • Original columns: {len(df.columns)}")
print(f"   • Final columns: {len(transformed_df.columns)}")
print(f"   • Features added: {len(transformed_df.columns) - len(df.columns)}")
print(f"   • Missing values eliminated: {df.isnull().sum().sum()} → {transformed_df.isnull().sum().sum()}")
print(f"   • Dataset ready for ML: ✅")

print("\n🎯 Next Steps:")
print("   1. Use the transformed dataset for machine learning models")
print("   2. Apply the same EDA pipeline to new datasets")
print("   3. Customize the analysis based on specific domain requirements")
print("   4. Explore advanced feature engineering techniques")
print("   5. Move to Module 3: Supervised Learning")

print("\n🚀 You're now ready to build machine learning models with clean, well-prepared data!")

## Key Takeaways

### What You've Learned:

1. **Automated Analysis**: How to use existing tools to quickly analyze datasets
2. **Data Quality**: Importance of data validation and integrity checks
3. **Missing Value Strategies**: Different approaches for different data types
4. **Outlier Detection**: Multiple methods for identifying unusual data points
5. **Feature Engineering**: Automated creation of new meaningful features
6. **Scaling and Normalization**: Preparing data for machine learning algorithms
7. **Anomaly Detection**: Using unsupervised methods to find unusual patterns
8. **Report Generation**: Creating comprehensive documentation of analysis

### Best Practices:

- Always validate data integrity before analysis
- Handle missing values appropriately for each data type
- Visualize distributions and relationships
- Document all preprocessing steps
- Save intermediate results for reproducibility
- Use automation for routine tasks, but understand what's happening

### Exercise for Practice:

Try applying this automated EDA workflow to different types of datasets:
- Time series data (stock prices, sensor readings)
- Text data (customer reviews, social media posts)
- Image metadata (if working with computer vision datasets)
- Mixed data types (combination of numerical, categorical, and text)

The GeneralEDA class provides a solid foundation that you can extend and customize for specific use cases!
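
One way to make a workflow like this reusable across datasets is to write each cleaning step as a small function that takes and returns a DataFrame, then chain the steps with `DataFrame.pipe`. The sketch below uses plain pandas and hypothetical step names; it does not use `GeneralEDA`, and the two steps shown are only examples.

```python
import numpy as np
import pandas as pd


def drop_duplicate_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicate rows and renumber the index."""
    return frame.drop_duplicates().reset_index(drop=True)


def fill_numeric_with_mean(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in numeric columns with the column mean."""
    out = frame.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


def run_cleaning_pipeline(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in a fixed, documented order."""
    return frame.pipe(drop_duplicate_rows).pipe(fill_numeric_with_mean)


toy = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 3.0],
    "b": ["x", "x", "y", "z"],
})
clean = run_cleaning_pipeline(toy)
print(clean)
```

Keeping each step as a separate function makes the order of operations explicit and lets individual steps be swapped or tested on their own.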