# 🏠 Automated Machine Learning for Regression - Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hasanmisaii/Automated-Machine-Learning-Auto-ML-/blob/main/regression_automl_colab.ipynb)

This notebook demonstrates **Automated Machine Learning (AutoML)** for **regression tasks** using open-source libraries that run in Google Colab.

## 🎯 What You'll Learn:
- **Regression fundamentals** and when to use regression
- **AutoML concepts** and benefits
- **Hands-on implementation** using auto-sklearn and TPOT
- **Model evaluation** and interpretation
- **Real-world applications** and best practices

## 📊 What is Regression?
Regression is a machine learning task that **predicts continuous numerical values**. Perfect for:
- 🏠 **House price prediction** (our example today)
- 📈 **Stock price forecasting**
- 🌡️ **Temperature prediction**
- 💰 **Sales revenue estimation**
- ⚡ **Energy consumption forecasting**
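
To make "predicting a continuous value" concrete, here is a minimal sketch that fits a straight line to five made-up houses using ordinary least squares. The numbers and the `fit_line` helper are illustrative assumptions, not part of the dataset built later in this notebook.

```python
# Toy data (made up for illustration): square footage -> price in $k
sizes = [800, 1200, 1600, 2000, 2400]
prices = [150, 210, 270, 330, 390]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# The prediction is a number on a continuous scale, not a class label
predicted = slope * 1800 + intercept
print(f"slope={slope:.3f}, intercept={intercept:.1f}, price(1800 sq ft)={predicted:.1f}k")
```

The toy prices were chosen to lie exactly on a line, so the fit recovers it; real housing data is noisier and depends on many features at once.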

## 🤖 What is AutoML?
Depending on the tool, AutoML automatically:
- **Selects the best algorithms**
- **Optimizes hyperparameters**
- **Engineers features**
- **Handles preprocessing**
- **Provides model explanations** (in some tools, or through add-ons such as SHAP)
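
At its core, the selection and tuning part is a search loop: try candidate models and settings, score each on held-out data, keep the best. The sketch below shows that loop on synthetic 1-D data with two hand-written model families. Everything in it (the data, `knn_predict`, `mean_predict`, the candidate list) is a simplified assumption for illustration; real AutoML libraries search far larger spaces with smarter strategies.

```python
import random

random.seed(0)

# Synthetic 1-D data (made up): y = 3x + noise
xs = [i / 10 for i in range(60)]
data = [(x, 3 * x + random.gauss(0, 0.5)) for x in xs]
random.shuffle(data)
train, valid = data[:45], data[45:]


def knn_predict(train_set, x, k):
    """Average the targets of the k nearest training points."""
    nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_predict(train_set, x, _unused=None):
    """Ignore x and always predict the training mean."""
    return sum(y for _, y in train_set) / len(train_set)


def validation_mse(predict, param):
    """Mean squared error of one candidate on the held-out split."""
    errors = [(predict(train, x, param) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)


# Search space: (model name, prediction function, hyperparameter value)
candidates = [("mean", mean_predict, None)]
candidates += [("knn", knn_predict, k) for k in (1, 3, 5, 9, 15)]

scores = [(validation_mse(fn, param), name, param) for name, fn, param in candidates]
best_mse, best_name, best_param = min(scores, key=lambda s: s[0])
print(f"best: {best_name} (param={best_param}), validation MSE={best_mse:.3f}")
```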

---

**🚀 Let's get started!**

## 📦 Step 1: Install Required Libraries

We'll use **auto-sklearn** and **TPOT** - two popular open-source AutoML libraries.

In [None]:
# Install AutoML libraries
print("🔧 Installing AutoML libraries...")
!pip install auto-sklearn==0.15.0 -q
!pip install tpot -q
!pip install shap -q

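# Standard data science libraries
# NOTE: auto-sklearn 0.15.0 declares its own scikit-learn requirement; if pip
# reports a dependency conflict for this pin, drop the pin and keep the version
# that auto-sklearn installed.
!pip install scikit-learn==1.1.3 -q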
!pip install pandas numpy matplotlib seaborn plotly -q

print("✅ Installation complete!")

In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# Machine Learning libraries
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# AutoML libraries
import autosklearn.regression
import autosklearn.metrics  # referenced explicitly when configuring the regressor
from tpot import TPOTRegressor

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("📚 All libraries imported successfully!")
print(f"🕐 Notebook started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 🏗️ Step 2: Create Realistic House Price Dataset

We'll create a synthetic but realistic dataset for house price prediction that mimics real-world scenarios.
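
The cell below turns raw generated columns into plausible ranges with min-max scaling. As a standalone sketch of that one step (the `rescale` helper and the sample values are illustrative assumptions), the smallest input maps to the lower bound, the largest to the upper bound, and everything else is interpolated linearly:

```python
def rescale(values, new_min, new_max):
    """Min-max scaling: map min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


raw = [-2.0, -1.0, 0.0, 2.0]  # e.g. draws centered around zero
square_footage = rescale(raw, 800, 4000)
print(square_footage)  # [800.0, 1600.0, 2400.0, 4000.0]
```

One consequence worth knowing: the bounds are always hit exactly, and a single extreme draw compresses all other values toward one end of the range.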

In [None]:
# Create realistic house price dataset
print("🏠 Creating realistic house price dataset...")

# Generate base features using make_regression
# (only the feature matrix is kept; the target is rebuilt from a price formula below)
X_base, _ = make_regression(
    n_samples=2000,
    n_features=15,
    n_informative=12,
    noise=0.1,
    random_state=42
)

# Create meaningful feature names
feature_names = [
    'square_footage', 'bedrooms', 'bathrooms', 'age_years', 'lot_size_acres',
    'garage_spaces', 'neighborhood_score', 'school_rating', 'crime_rate', 
    'distance_to_downtown_miles', 'property_tax_rate', 'walkability_score',
    'num_floors', 'fireplace_count', 'pool_present'
]

# Convert to DataFrame
df = pd.DataFrame(X_base, columns=feature_names)

# Transform features to realistic ranges
def normalize_to_range(series, min_val, max_val):
    return ((series - series.min()) / (series.max() - series.min()) * (max_val - min_val) + min_val)

# Apply realistic transformations
df['square_footage'] = normalize_to_range(df['square_footage'], 800, 4000).round(0)
df['bedrooms'] = normalize_to_range(df['bedrooms'], 1, 6).round(0)
df['bathrooms'] = normalize_to_range(df['bathrooms'], 1, 4).round(1)
df['age_years'] = normalize_to_range(df['age_years'], 0, 100).round(0)
df['lot_size_acres'] = normalize_to_range(df['lot_size_acres'], 0.1, 2.0).round(2)
df['garage_spaces'] = normalize_to_range(df['garage_spaces'], 0, 3).round(0)
df['neighborhood_score'] = normalize_to_range(df['neighborhood_score'], 1, 10).round(1)
df['school_rating'] = normalize_to_range(df['school_rating'], 1, 10).round(1)
df['crime_rate'] = normalize_to_range(df['crime_rate'], 0.1, 5.0).round(2)
df['distance_to_downtown_miles'] = normalize_to_range(df['distance_to_downtown_miles'], 0.5, 50).round(1)
df['property_tax_rate'] = normalize_to_range(df['property_tax_rate'], 0.5, 3.0).round(3)
df['walkability_score'] = normalize_to_range(df['walkability_score'], 1, 100).round(0)
df['num_floors'] = normalize_to_range(df['num_floors'], 1, 3).round(0)
df['fireplace_count'] = normalize_to_range(df['fireplace_count'], 0, 3).round(0)
df['pool_present'] = (normalize_to_range(df['pool_present'], 0, 1) > 0.7).astype(int)

# Create realistic target variable (house prices in thousands)
# Base price calculation with realistic factors
base_price = (
    df['square_footage'] * 150 +  # $150 per sq ft
    df['bedrooms'] * 10000 +      # $10k per bedroom
    df['bathrooms'] * 8000 +      # $8k per bathroom
    df['garage_spaces'] * 5000 +  # $5k per garage space
    df['neighborhood_score'] * 15000 +  # Neighborhood premium
    df['pool_present'] * 25000 +  # Pool adds $25k
    df['fireplace_count'] * 3000  # $3k per fireplace
)

# Apply negative factors
price_adjustments = -(
    df['age_years'] * 500 +     # Depreciation
    df['crime_rate'] * 8000 +   # Crime reduces value
    df['distance_to_downtown_miles'] * 1000  # Distance penalty
)

# Final price with some noise
y_realistic = (base_price + price_adjustments + np.random.normal(0, 15000, len(df))) / 1000
y_realistic = np.maximum(y_realistic, 50)  # Minimum $50k
df['price_thousands'] = y_realistic.round(1)

print(f"📊 Dataset created with {df.shape[0]} houses and {df.shape[1]-1} features")
print(f"💰 Price range: ${df['price_thousands'].min():.0f}k - ${df['price_thousands'].max():.0f}k")
print(f"📍 Average price: ${df['price_thousands'].mean():.0f}k")

# Display sample data
df.head()

## 📊 Step 3: Exploratory Data Analysis (EDA)

Let's explore our dataset to understand the relationships between features and house prices.
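
The correlation matrix used in this step comes from `df.corr()`, which computes the Pearson correlation coefficient by default. A hand-written version on made-up numbers shows what that coefficient measures; the `pearson` helper and the toy lists are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


sqft = [1000, 1500, 2000, 2500, 3000]
price = [200, 260, 330, 380, 450]          # rises with size
discount = [-p for p in price]             # falls with size

r_up = pearson(sqft, price)
r_down = pearson(sqft, discount)
print(f"sqft vs price:    {r_up:.3f}")     # close to +1
print(f"sqft vs discount: {r_down:.3f}")   # close to -1
```

Pearson correlation only captures linear association, so a feature with a strong but non-linear effect on price can still show a coefficient near zero.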

In [None]:
# Basic dataset statistics
print("📈 DATASET OVERVIEW")
print("=" * 50)
print(f"Number of houses: {len(df):,}")
print(f"Number of features: {len(df.columns)-1}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicated rows: {df.duplicated().sum()}")

print("\n💰 PRICE STATISTICS")
print("=" * 50)
price_stats = df['price_thousands'].describe()
for stat, value in price_stats.items():
    print(f"{stat.capitalize()}: ${value:.1f}k")

# Display data types
print("\n🔧 FEATURE TYPES")
print("=" * 50)
print(df.dtypes)

In [None]:
# Interactive correlation analysis with Plotly
# Calculate correlation matrix
correlation_matrix = df.corr()
price_correlations = correlation_matrix['price_thousands'].drop('price_thousands').sort_values(key=abs, ascending=False)

# Create interactive correlation heatmap
fig = px.imshow(
    correlation_matrix,
    text_auto=True,
    aspect="auto",
    title="🔥 Feature Correlation Heatmap",
    color_continuous_scale="RdBu_r",
    width=800,
    height=700
)
fig.update_layout(title_x=0.5)
fig.show()

# Top correlations with price
print("🎯 TOP FEATURES CORRELATED WITH PRICE")
print("=" * 50)
for feature, corr in price_correlations.head(8).items():
    direction = "📈" if corr > 0 else "📉"
    print(f"{direction} {feature}: {corr:.3f}")

In [None]:
# Interactive price distribution analysis
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Price Distribution", 
        "Price vs Square Footage",
        "Price vs Neighborhood Score", 
        "Price by Number of Bedrooms"
    ),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# Price histogram
fig.add_trace(
    go.Histogram(x=df['price_thousands'], nbinsx=50, name="Price Distribution"),
    row=1, col=1
)

# Scatter: Price vs Square Footage
fig.add_trace(
    go.Scatter(
        x=df['square_footage'], 
        y=df['price_thousands'], 
        mode='markers',
        name="Price vs Sq Ft",
        opacity=0.6
    ),
    row=1, col=2
)

# Scatter: Price vs Neighborhood Score
fig.add_trace(
    go.Scatter(
        x=df['neighborhood_score'], 
        y=df['price_thousands'], 
        mode='markers',
        name="Price vs Neighborhood",
        opacity=0.6
    ),
    row=2, col=1
)

# Box plot: Price by Bedrooms
for bedroom_count in sorted(df['bedrooms'].unique()):
    bedroom_data = df[df['bedrooms'] == bedroom_count]['price_thousands']
    fig.add_trace(
        go.Box(y=bedroom_data, name=f"{int(bedroom_count)} BR"),
        row=2, col=2
    )

fig.update_layout(
    height=800, 
    title_text="📊 House Price Analysis Dashboard",
    title_x=0.5
)
fig.show()

## 🔧 Step 4: Data Preparation for AutoML

Let's prepare our data for AutoML by splitting it into training and testing sets.
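
This step splits twice: 20% is held out for testing, then 25% of the remainder for validation. The arithmetic below shows why that gives a 60/20/20 split of the 2,000 rows. The `split_sizes` helper is a hypothetical stand-in that assumes the held-out share is rounded up, which is how `train_test_split` treats fractional sizes.

```python
import math


def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out


n_total = 2000
n_train_full, n_test = split_sizes(n_total, 0.20)   # first split
n_train, n_val = split_sizes(n_train_full, 0.25)    # second split on the rest

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:10} {n:5d} rows ({n / n_total:.0%})")
```

The second fraction applies to the 1,600 remaining rows, not to the full dataset, which is why 0.25 rather than 0.20 yields a validation set the same size as the test set.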

In [None]:
# Prepare features and target
X = df.drop('price_thousands', axis=1)
y = df['price_thousands']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

# Further split training data for AutoML validation
X_train_automl, X_val_automl, y_train_automl, y_val_automl = train_test_split(
    X_train, y_train, 
    test_size=0.25, 
    random_state=42
)

print("📊 DATA SPLIT SUMMARY")
print("=" * 50)
print(f"🎯 Total dataset: {len(df):,} samples")
print(f"🏋️ Training set: {len(X_train_automl):,} samples ({len(X_train_automl)/len(df)*100:.1f}%)")
print(f"✅ Validation set: {len(X_val_automl):,} samples ({len(X_val_automl)/len(df)*100:.1f}%)")
print(f"🧪 Test set: {len(X_test):,} samples ({len(X_test)/len(df)*100:.1f}%)")

print(f"\n📈 FEATURE INFORMATION")
print("=" * 50)
print(f"Number of features: {X_train.shape[1]}")
print(f"Feature names: {list(X_train.columns)}")

print(f"\n💰 TARGET VARIABLE STATS")
print("=" * 50)
print(f"Training target range: ${y_train_automl.min():.1f}k - ${y_train_automl.max():.1f}k")
print(f"Training target mean: ${y_train_automl.mean():.1f}k")
print(f"Training target std: ${y_train_automl.std():.1f}k")

## 🤖 Step 5: AutoML with auto-sklearn

**auto-sklearn** is an automated machine learning toolkit built on top of scikit-learn. It automatically finds the best algorithm and hyperparameters for your dataset.
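
The configuration in the next cell asks for 3-fold cross-validation. As a rough sketch of what that means (not auto-sklearn's internal implementation, and without the shuffling a real splitter would usually apply), the data is cut into three folds and each fold takes one turn as the validation set. The `k_fold_indices` helper is an assumption for illustration.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(10, 3))
for fold_number, (train_idx, valid_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx} valid={valid_idx}")
```

Each candidate model is trained three times under this scheme, so cross-validation gives a steadier score than a single split but uses more of the time budget per candidate.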

In [None]:
# Configure and train auto-sklearn
print("🚀 Starting auto-sklearn training...")
print("⏰ This may take 5-10 minutes in Colab")

# Create auto-sklearn regressor
automl_sklearn = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,  # 5 minutes total
    per_run_time_limit=30,        # 30 seconds per model
    n_jobs=1,                     # Use single core in Colab
    memory_limit=3072,            # 3GB memory limit
    seed=42,
    metric=autosklearn.metrics.mean_squared_error,
    resampling_strategy='cv',     # Cross-validation
    resampling_strategy_arguments={'folds': 3}
)

# Train the model
start_time = datetime.now()
automl_sklearn.fit(X_train_automl, y_train_automl)
training_time = datetime.now() - start_time

print(f"✅ auto-sklearn training completed in {training_time}")
# By default, leaderboard() lists the models kept in the final ensemble
print(f"🎯 Models in final ensemble: {len(automl_sklearn.leaderboard())}")

In [None]:
# Evaluate auto-sklearn performance
print("📊 AUTO-SKLEARN RESULTS")
print("=" * 50)

# Make predictions
y_pred_sklearn_val = automl_sklearn.predict(X_val_automl)
y_pred_sklearn_test = automl_sklearn.predict(X_test)

# Calculate metrics
val_rmse = np.sqrt(mean_squared_error(y_val_automl, y_pred_sklearn_val))
val_mae = mean_absolute_error(y_val_automl, y_pred_sklearn_val)
val_r2 = r2_score(y_val_automl, y_pred_sklearn_val)

test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_sklearn_test))
test_mae = mean_absolute_error(y_test, y_pred_sklearn_test)
test_r2 = r2_score(y_test, y_pred_sklearn_test)

print(f"📈 Validation Performance:")
print(f"   RMSE: ${val_rmse:.2f}k")
print(f"   MAE:  ${val_mae:.2f}k")
print(f"   R²:   {val_r2:.3f}")

print(f"\n🧪 Test Performance:")
print(f"   RMSE: ${test_rmse:.2f}k")
print(f"   MAE:  ${test_mae:.2f}k")
print(f"   R²:   {test_r2:.3f}")

# Show model statistics
print(f"\n🏆 MODEL LEADERBOARD")
print("=" * 50)
leaderboard = automl_sklearn.leaderboard()
print(leaderboard.head())

# Show run statistics for the search
print(f"\n🥇 SEARCH RUN STATISTICS")
print("=" * 50)
print(automl_sklearn.sprint_statistics())

## 🧬 Step 6: AutoML with TPOT

**TPOT** (Tree-based Pipeline Optimization Tool) uses genetic programming to automatically design and optimize machine learning pipelines.
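
To give a feel for the evolutionary idea, here is a deliberately tiny sketch: a population of candidate solutions is scored, the better half survives, and mutated copies fill the next generation. It evolves a single number against a made-up `loss` function, whereas TPOT evolves whole pipelines and also recombines them, so treat this as an analogy under those assumptions rather than TPOT's algorithm.

```python
import random


def loss(x):
    """Stand-in for a pipeline's cross-validated error (hypothetical objective)."""
    return (x - 3.0) ** 2


def evolve(population_size=20, generations=5, seed=42):
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(population_size)]
    initial_best = min(population, key=loss)
    for _ in range(generations):
        # Selection: keep the better half, so the best candidate is never lost
        survivors = sorted(population, key=loss)[: population_size // 2]
        # Variation: each survivor produces one randomly mutated child
        children = [parent + rng.gauss(0, 1.0) for parent in survivors]
        population = survivors + children
    return initial_best, min(population, key=loss)


initial_best, final_best = evolve()
print(f"best loss before evolution: {loss(initial_best):.4f}")
print(f"best loss after evolution:  {loss(final_best):.4f}")
```

The `generations` and `population_size` arguments play the same role here as in the TPOT configuration: more of either means more candidates evaluated and a longer run.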

In [None]:
# Configure and train TPOT
print("🧬 Starting TPOT training...")
print("⏰ This may take 3-5 minutes in Colab")

# Create TPOT regressor
automl_tpot = TPOTRegressor(
    generations=5,           # Number of iterations
    population_size=20,      # Number of individuals per generation
    cv=3,                    # Cross-validation folds
    scoring='neg_mean_squared_error',
    max_time_mins=3,         # Maximum time in minutes
    max_eval_time_mins=0.5,  # Maximum time per pipeline
    random_state=42,
    n_jobs=1,                # Single core for Colab
    verbosity=2
)

# Train the model
start_time = datetime.now()
automl_tpot.fit(X_train_automl, y_train_automl)
training_time = datetime.now() - start_time

print(f"✅ TPOT training completed in {training_time}")
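# score() uses the configured scoring function, here negative MSE (closer to 0 is better)
print(f"🏆 Validation score (negative MSE): {automl_tpot.score(X_val_automl, y_val_automl):.3f}")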

In [None]:
# Evaluate TPOT performance
print("📊 TPOT RESULTS")
print("=" * 50)

# Make predictions
y_pred_tpot_val = automl_tpot.predict(X_val_automl)
y_pred_tpot_test = automl_tpot.predict(X_test)

# Calculate metrics
val_rmse_tpot = np.sqrt(mean_squared_error(y_val_automl, y_pred_tpot_val))
val_mae_tpot = mean_absolute_error(y_val_automl, y_pred_tpot_val)
val_r2_tpot = r2_score(y_val_automl, y_pred_tpot_val)

test_rmse_tpot = np.sqrt(mean_squared_error(y_test, y_pred_tpot_test))
test_mae_tpot = mean_absolute_error(y_test, y_pred_tpot_test)
test_r2_tpot = r2_score(y_test, y_pred_tpot_test)

print(f"📈 Validation Performance:")
print(f"   RMSE: ${val_rmse_tpot:.2f}k")
print(f"   MAE:  ${val_mae_tpot:.2f}k")
print(f"   R²:   {val_r2_tpot:.3f}")

print(f"\n🧪 Test Performance:")
print(f"   RMSE: ${test_rmse_tpot:.2f}k")
print(f"   MAE:  ${test_mae_tpot:.2f}k")
print(f"   R²:   {test_r2_tpot:.3f}")

# Show the best pipeline
print(f"\n🏆 BEST PIPELINE DISCOVERED")
print("=" * 50)
print(automl_tpot.fitted_pipeline_)

# Export the pipeline code
print(f"\n💾 Exporting optimized pipeline code...")
automl_tpot.export('tpot_best_pipeline.py')
print("✅ Pipeline exported as 'tpot_best_pipeline.py'")

## 📊 Step 7: Baseline Comparison

Let's compare our AutoML results with traditional machine learning models to see whether automation actually helps on this dataset.
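
A useful reference point for any baseline is the simplest model of all: always predict the mean. R² is defined so that this model scores exactly 0, and anything better scores above 0. The sketch below works that through on made-up prices; `r2_manual` follows the standard definition 1 - SS_res / SS_tot and the sample values are illustrative assumptions.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [200.0, 250.0, 300.0, 350.0, 400.0]
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)  # always predicts 300
closer_model = [210.0, 240.0, 305.0, 345.0, 395.0]

r2_mean = r2_manual(y_true, mean_baseline)
r2_closer = r2_manual(y_true, closer_model)
print(f"mean baseline R²: {r2_mean:.3f}")
print(f"closer model R²:  {r2_closer:.3f}")
```

Because the synthetic prices in this notebook are built from a linear formula plus noise, a plain linear regression is a strong baseline here, so a small gap between AutoML and the baselines would not be surprising.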

In [None]:
# Train baseline models for comparison
print("🏁 Training baseline models for comparison...")

# Scale features for linear regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_automl)
X_val_scaled = scaler.transform(X_val_automl)
X_test_scaled = scaler.transform(X_test)

# Define baseline models
baseline_models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

baseline_results = {}

for name, model in baseline_models.items():
    print(f"Training {name}...")
    
    # Use scaled data for linear regression, original for tree-based
    if name == 'Linear Regression':
        model.fit(X_train_scaled, y_train_automl)
        y_pred_val = model.predict(X_val_scaled)
        y_pred_test = model.predict(X_test_scaled)
    else:
        model.fit(X_train_automl, y_train_automl)
        y_pred_val = model.predict(X_val_automl)
        y_pred_test = model.predict(X_test)
    
    # Calculate metrics
    baseline_results[name] = {
        'val_rmse': np.sqrt(mean_squared_error(y_val_automl, y_pred_val)),
        'val_mae': mean_absolute_error(y_val_automl, y_pred_val),
        'val_r2': r2_score(y_val_automl, y_pred_val),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'test_mae': mean_absolute_error(y_test, y_pred_test),
        'test_r2': r2_score(y_test, y_pred_test),
        'predictions_test': y_pred_test
    }

print("✅ Baseline models trained successfully!")

In [None]:
# Comprehensive model comparison
print("🏆 COMPREHENSIVE MODEL COMPARISON")
print("=" * 80)

# Compile all results
all_results = {
    'auto-sklearn': {
        'val_rmse': val_rmse, 'val_mae': val_mae, 'val_r2': val_r2,
        'test_rmse': test_rmse, 'test_mae': test_mae, 'test_r2': test_r2,
        'predictions_test': y_pred_sklearn_test
    },
    'TPOT': {
        'val_rmse': val_rmse_tpot, 'val_mae': val_mae_tpot, 'val_r2': val_r2_tpot,
        'test_rmse': test_rmse_tpot, 'test_mae': test_mae_tpot, 'test_r2': test_r2_tpot,
        'predictions_test': y_pred_tpot_test
    }
}
all_results.update(baseline_results)

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Model': list(all_results.keys()),
    'Test_RMSE': [all_results[model]['test_rmse'] for model in all_results.keys()],
    'Test_MAE': [all_results[model]['test_mae'] for model in all_results.keys()],
    'Test_R2': [all_results[model]['test_r2'] for model in all_results.keys()],
    'Val_RMSE': [all_results[model]['val_rmse'] for model in all_results.keys()],
    'Val_R2': [all_results[model]['val_r2'] for model in all_results.keys()]
})

# Sort by test R² score
comparison_df = comparison_df.sort_values('Test_R2', ascending=False)

print("📊 PERFORMANCE RANKINGS (by Test R²)")
print("=" * 80)
for rank, (_, row) in enumerate(comparison_df.iterrows(), start=1):
    print(f"{rank}. {row['Model']:15} | R²: {row['Test_R2']:.3f} | RMSE: ${row['Test_RMSE']:.1f}k | MAE: ${row['Test_MAE']:.1f}k")

# Find best model
best_model = comparison_df.iloc[0]['Model']
print(f"\n🥇 WINNER: {best_model}")
print(f"🎯 Best Test R²: {comparison_df.iloc[0]['Test_R2']:.3f}")
print(f"📉 Best Test RMSE: ${comparison_df.iloc[0]['Test_RMSE']:.1f}k")

# Display the comparison table
comparison_df

## 📊 Step 8: Advanced Visualizations

Let's create comprehensive visualizations to understand model performance and predictions.
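
The residual plot and the accuracy bands in this step both rest on one convention: residual = actual - predicted. A small worked example on made-up numbers (all values here are illustrative assumptions) shows how to read the sign and how the "within 10%" share is computed.

```python
actual = [200.0, 300.0, 400.0, 500.0]      # true prices in $k
predicted = [210.0, 285.0, 400.0, 560.0]   # model output in $k

# Negative residual: the model predicted too high (over-prediction)
# Positive residual: the model predicted too low (under-prediction)
residuals = [a - p for a, p in zip(actual, predicted)]
pct_errors = [r / a * 100 for r, a in zip(residuals, actual)]

within_10 = sum(abs(p) <= 10 for p in pct_errors) / len(pct_errors)

for a, p, r, pct in zip(actual, predicted, residuals, pct_errors):
    kind = "over" if r < 0 else "under" if r > 0 else "exact"
    print(f"actual={a:.0f}k predicted={p:.0f}k residual={r:+.0f}k ({pct:+.1f}%) {kind}")
print(f"share within 10%: {within_10:.0%}")
```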

In [None]:
# Interactive model performance comparison
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Model Performance Comparison (R²)",
        "RMSE Comparison",
        "Actual vs Predicted (Best Model)",
        "Residuals Analysis (Best Model)"
    )
)

# Performance bar charts
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

# R² comparison
fig.add_trace(
    go.Bar(
        x=comparison_df['Model'], 
        y=comparison_df['Test_R2'],
        name="Test R²",
        marker_color=colors
    ),
    row=1, col=1
)

# RMSE comparison
fig.add_trace(
    go.Bar(
        x=comparison_df['Model'], 
        y=comparison_df['Test_RMSE'],
        name="Test RMSE",
        marker_color=colors
    ),
    row=1, col=2
)

# Best model predictions
best_predictions = all_results[best_model]['predictions_test']
fig.add_trace(
    go.Scatter(
        x=y_test, 
        y=best_predictions,
        mode='markers',
        name=f"{best_model} Predictions",
        opacity=0.7
    ),
    row=2, col=1
)

# Perfect prediction line
min_val, max_val = y_test.min(), y_test.max()
fig.add_trace(
    go.Scatter(
        x=[min_val, max_val], 
        y=[min_val, max_val],
        mode='lines',
        name="Perfect Prediction",
        line=dict(dash='dash', color='red')
    ),
    row=2, col=1
)

# Residuals plot
residuals = y_test - best_predictions
fig.add_trace(
    go.Scatter(
        x=best_predictions, 
        y=residuals,
        mode='markers',
        name="Residuals",
        opacity=0.7
    ),
    row=2, col=2
)

# Zero line for residuals
fig.add_hline(y=0, line_dash="dash", line_color="red", row=2, col=2)

fig.update_layout(
    height=800,
    title_text=f"🎯 Model Performance Dashboard - Winner: {best_model}",
    title_x=0.5,
    showlegend=True
)

fig.show()

In [None]:
# Detailed prediction analysis for the best model
print(f"🔍 DETAILED ANALYSIS: {best_model}")
print("=" * 60)

# Calculate prediction accuracy ranges
residuals = y_test - best_predictions
percentage_errors = (residuals / y_test) * 100

# Accuracy metrics
within_5_percent = np.abs(percentage_errors) <= 5
within_10_percent = np.abs(percentage_errors) <= 10
within_15_percent = np.abs(percentage_errors) <= 15

print(f"📊 PREDICTION ACCURACY")
print(f"   Within 5%:  {within_5_percent.mean()*100:.1f}% of predictions")
print(f"   Within 10%: {within_10_percent.mean()*100:.1f}% of predictions")
print(f"   Within 15%: {within_15_percent.mean()*100:.1f}% of predictions")

print(f"\n📏 ERROR STATISTICS (residual = actual - predicted)")
print(f"   Mean Error: ${residuals.mean():.2f}k")
print(f"   Std Error:  ${residuals.std():.2f}k")
print(f"   Max Under-prediction: ${residuals.max():.2f}k")
print(f"   Max Over-prediction:  ${abs(residuals.min()):.2f}k")

# Business impact analysis
print(f"\n💼 BUSINESS IMPACT")
print(f"   Average house price: ${y_test.mean():.0f}k")
print(f"   RMSE as % of avg price: {(all_results[best_model]['test_rmse']/y_test.mean())*100:.1f}%")
print(f"   MAE as % of avg price:  {(all_results[best_model]['test_mae']/y_test.mean())*100:.1f}%")

# Sample predictions
print(f"\n🏠 SAMPLE PREDICTIONS")
print("=" * 60)
sample_indices = np.random.choice(len(y_test), 5, replace=False)
for i, idx in enumerate(sample_indices):
    actual = y_test.iloc[idx]
    predicted = best_predictions[idx]
    error = predicted - actual  # positive = over-prediction (opposite sign to residuals)
    error_pct = (error / actual) * 100
    print(f"House {i+1}: Actual=${actual:.0f}k, Predicted=${predicted:.0f}k, Error={error:+.1f}k ({error_pct:+.1f}%)")

## 🎓 Step 9: Key Insights and Learning

Let's summarize what we've learned about AutoML for regression tasks.

In [None]:
# Generate comprehensive insights
print("🎓 KEY LEARNING INSIGHTS")
print("=" * 80)

print("🤖 AUTOML BENEFITS DEMONSTRATED:")
print(f"   • {best_model} achieved the best performance with R² = {comparison_df.iloc[0]['Test_R2']:.3f}")
print(f"   • AutoML searched many algorithms and pipelines within a fixed time budget")
print(f"   • Automated feature engineering and hyperparameter tuning")
print(f"   • No manual algorithm selection required")

print(f"\n📊 REGRESSION METRICS EXPLAINED:")
print(f"   • R² (Coefficient of Determination): {comparison_df.iloc[0]['Test_R2']:.3f}")
print(f"     → Explains {comparison_df.iloc[0]['Test_R2']*100:.1f}% of price variance")
print(f"   • RMSE (Root Mean Squared Error): ${comparison_df.iloc[0]['Test_RMSE']:.1f}k")
print(f"     → Typical error size; large misses are penalized more heavily")
print(f"   • MAE (Mean Absolute Error): ${comparison_df.iloc[0]['Test_MAE']:.1f}k")
print(f"     → Average absolute prediction error")

print(f"\n🏆 MODEL COMPARISON INSIGHTS:")
automl_models = ['auto-sklearn', 'TPOT']
baseline_models_list = ['Linear Regression', 'Random Forest']

best_automl = comparison_df[comparison_df['Model'].isin(automl_models)].iloc[0]
best_baseline = comparison_df[comparison_df['Model'].isin(baseline_models_list)].iloc[0]

improvement = ((best_automl['Test_R2'] - best_baseline['Test_R2']) / best_baseline['Test_R2']) * 100
print(f"   • Best AutoML: {best_automl['Model']} (R² = {best_automl['Test_R2']:.3f})")
print(f"   • Best Baseline: {best_baseline['Model']} (R² = {best_baseline['Test_R2']:.3f})")
print(f"   • Best AutoML vs. best baseline: {improvement:+.1f}% relative change in R²")

print(f"\n🎯 PRACTICAL APPLICATIONS:")
print(f"   • Real Estate: Automated property valuation")
print(f"   • Finance: Credit scoring and risk assessment")
print(f"   • Manufacturing: Quality control and defect prediction")
print(f"   • Healthcare: Treatment outcome prediction")
print(f"   • Marketing: Customer lifetime value estimation")

print(f"\n💡 NEXT STEPS FOR PRODUCTION:")
print(f"   • Feature engineering: Create domain-specific features")
print(f"   • Data quality: Handle missing values and outliers")
print(f"   • Model monitoring: Track performance over time")
print(f"   • A/B testing: Compare models in production")
print(f"   • Explainability: Use SHAP for model interpretability")

print(f"\n🔗 AUTOML LIBRARIES COMPARISON:")
sklearn_r2 = all_results['auto-sklearn']['test_r2']
tpot_r2 = all_results['TPOT']['test_r2']
print(f"   • auto-sklearn: R² = {sklearn_r2:.3f} | Focus: Robust, ensemble methods")
print(f"   • TPOT: R² = {tpot_r2:.3f} | Focus: Genetic programming, pipeline optimization")
print(f"   • Both excel at different aspects of AutoML")

print(f"\n🎊 CONGRATULATIONS!")
print(f"You've successfully implemented AutoML for regression and achieved:")
print(f"🏆 Best Model: {best_model}")
print(f"📈 R² Score: {comparison_df.iloc[0]['Test_R2']:.3f}")
print(f"💰 Average Error: ${comparison_df.iloc[0]['Test_MAE']:.1f}k")
print(f"✨ {within_10_percent.mean()*100:.1f}% of predictions within 10% of the actual price!")

## 🎉 Conclusion

### What We Accomplished

In this notebook, we successfully:

1. **🏗️ Created a realistic dataset** with 15 features for house price prediction
2. **📊 Performed comprehensive EDA** to understand data relationships
3. **🤖 Implemented two AutoML approaches**: auto-sklearn and TPOT
4. **📈 Compared AutoML vs traditional models** and quantified the difference
5. **🔍 Analyzed predictions** with detailed performance metrics
6. **📊 Created interactive visualizations** for better insights

### Key Takeaways

- **AutoML democratizes machine learning** by automating complex tasks
- **Different AutoML tools excel in different scenarios** - experiment with multiple approaches
- **Evaluation metrics matter** - R², RMSE, and MAE each tell different stories
- **Visualization is crucial** for understanding model behavior
- **Real-world applications are vast** - from real estate to healthcare
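
As a quick illustration of how RMSE and MAE can tell different stories, the sketch below compares two made-up sets of errors with the same MAE. The `mae` and `rmse` helpers and the error lists are assumptions for illustration.

```python
import math


def mae(errors):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error: squaring makes large misses count extra."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [10, -10, 10, -10]   # every prediction off by $10k
one_big_miss = [0, 0, 0, 40]       # three perfect predictions, one $40k miss

for name, errors in [("even errors", even_errors), ("one big miss", one_big_miss)]:
    print(f"{name:13} MAE={mae(errors):.1f}k  RMSE={rmse(errors):.1f}k")
```

Both sets have an MAE of 10, but the RMSE doubles to 20 when the error is concentrated in a single house. A large gap between the two metrics is a hint that a few predictions are badly off.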

### Next Steps

1. **Try with your own data**: Upload a CSV and adapt this notebook
2. **Experiment with hyperparameters**: Adjust time limits and population sizes
3. **Add feature engineering**: Create polynomial features or domain-specific transformations
4. **Explore model interpretability**: Use SHAP or LIME for explainable AI
5. **Deploy your model**: Create a web app or API endpoint

---

**🚀 Happy AutoML Learning!**

Feel free to modify this notebook for your own regression problems. AutoML makes machine learning accessible to everyone!