# Automated Machine Learning for Regression Tasks

This notebook demonstrates how to use Azure Automated Machine Learning (AutoML) for regression tasks. We'll create a complete end-to-end example that includes:

* Data preparation and exploration
* Setting up Azure ML workspace
* Configuring AutoML for regression
* Training and evaluating models
* Analyzing regression-specific metrics
* Model interpretation and insights

## What is Regression?
Regression is a machine learning task that predicts continuous numerical values. Common examples include:
- Predicting house prices
- Forecasting sales revenue
- Estimating energy consumption
- Calculating insurance premiums

## Prerequisites

Please ensure you have the following installed:
- Azure Machine Learning Python SDK v2
- Required data science libraries

```bash
pip install azure-ai-ml azure-identity
pip install pandas numpy scikit-learn matplotlib seaborn
```
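
Before running the cells below, it can help to confirm that these packages are visible to the active kernel. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is illustrative and not part of any SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names taken from the pip commands above
required = ["azure-ai-ml", "azure-identity", "pandas", "numpy",
            "scikit-learn", "matplotlib", "seaborn"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```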

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 1. Data Preparation and Exploration

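For this demo, we'll create a synthetic regression dataset with house-price-style feature names. Because the data is generated locally, the notebook needs no external data source. The relationships between the features and the price are synthetic and carry no real-world meaning.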
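
The data preparation cell applies the same min-max formula to several columns. As a sketch of that formula in isolation, the hypothetical helper `rescale` below maps a column into a chosen range and rounds it; it assumes the column is not constant, since a zero span would divide by zero.

```python
import numpy as np
import pandas as pd


def rescale(series, low, high, decimals=0):
    """Min-max scale a numeric Series into [low, high] and round the result."""
    span = series.max() - series.min()
    unit = (series - series.min()) / span  # values now lie in [0, 1]
    return (unit * (high - low) + low).round(decimals)


# Toy column standing in for one standardized feature
raw = pd.Series(np.random.default_rng(0).normal(size=1000))
sqft = rescale(raw, 800, 2800)
print(sqft.min(), sqft.max())
```

Scaling is an affine transformation, so it preserves the linear relationship with the target; only the rounding step adds a small amount of noise.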

In [None]:
# Create a synthetic regression dataset
# This simulates a house price prediction scenario
np.random.seed(42)

# Generate base features
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    noise=0.1,
    random_state=42
)

# Create meaningful feature names for house price prediction
feature_names = [
    'square_footage', 'bedrooms', 'bathrooms', 'age', 'lot_size',
    'garage_size', 'neighborhood_score', 'school_rating', 'distance_to_city', 'crime_rate'
]

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)

# Scale features to realistic ranges
df['square_footage'] = ((df['square_footage'] - df['square_footage'].min()) / 
                       (df['square_footage'].max() - df['square_footage'].min()) * 2000 + 800).round(0)
df['bedrooms'] = ((df['bedrooms'] - df['bedrooms'].min()) / 
                 (df['bedrooms'].max() - df['bedrooms'].min()) * 4 + 1).round(0)
df['bathrooms'] = ((df['bathrooms'] - df['bathrooms'].min()) / 
                  (df['bathrooms'].max() - df['bathrooms'].min()) * 3 + 1).round(1)
df['age'] = ((df['age'] - df['age'].min()) / 
            (df['age'].max() - df['age'].min()) * 50).round(0)
df['lot_size'] = ((df['lot_size'] - df['lot_size'].min()) / 
                 (df['lot_size'].max() - df['lot_size'].min()) * 1 + 0.1).round(2)

# Scale target to realistic house prices (in thousands)
y_scaled = ((y - y.min()) / (y.max() - y.min()) * 400 + 150).round(1)
df['price'] = y_scaled

print(f"Dataset shape: {df.shape}")
print(f"Target variable (price) range: ${df['price'].min():.1f}k - ${df['price'].max():.1f}k")
df.head()

In [None]:
# Data exploration and visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Distribution of target variable
axes[0, 0].hist(df['price'], bins=30, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Distribution of House Prices')
axes[0, 0].set_xlabel('Price (thousands $)')
axes[0, 0].set_ylabel('Frequency')

# Correlation heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            ax=axes[0, 1], fmt='.2f')
axes[0, 1].set_title('Feature Correlation Matrix')

# Scatter plot: Square footage vs Price
axes[1, 0].scatter(df['square_footage'], df['price'], alpha=0.6, color='coral')
axes[1, 0].set_xlabel('Square Footage')
axes[1, 0].set_ylabel('Price (thousands $)')
axes[1, 0].set_title('Square Footage vs Price')

# Feature correlation with target (a rough proxy for importance)
feature_corr = df.corr()['price'].drop('price').sort_values(key=abs, ascending=False)
axes[1, 1].barh(range(len(feature_corr)), feature_corr.values, color='lightgreen')
axes[1, 1].set_yticks(range(len(feature_corr)))
axes[1, 1].set_yticklabels(feature_corr.index)
axes[1, 1].set_xlabel('Correlation with Price')
axes[1, 1].set_title('Feature Correlation with Target')

plt.tight_layout()
plt.show()

# Summary statistics
print("\nDataset Summary Statistics:")
print(df.describe())

## 2. Connect to Azure Machine Learning Workspace

Initialize connection to your Azure ML workspace. Update the subscription, resource group, and workspace name according to your Azure setup.
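
To avoid hard-coding identifiers in the notebook, the three values could be read from environment variables instead. This is a sketch only: the variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZURE_ML_WORKSPACE` and the helper `workspace_settings_from_env` are assumptions made for illustration, not names the Azure SDK requires.

```python
import os


def workspace_settings_from_env(environ=None):
    """Collect workspace identifiers from environment variables.

    The variable names are assumptions for this sketch, not SDK requirements.
    """
    environ = os.environ if environ is None else environ
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZURE_ML_WORKSPACE",
    }
    missing = [var for var in names.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[var] for key, var in names.items()}
```

The dictionary keys match the keyword arguments of `MLClient`, so the result could be passed as `MLClient(DefaultAzureCredential(), **workspace_settings_from_env())`.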

In [None]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Azure ML workspace configuration
# Update these values with your Azure subscription details
subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"

# Authenticate and create ML client
try:
    credential = DefaultAzureCredential()
    ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)
    workspace = ml_client.workspaces.get(name=ml_client.workspace_name)
    print(f"Connected to workspace: {ml_client.workspace_name}")
    print(f"Resource group: {workspace.resource_group}")
    print(f"Location: {workspace.location}")
    print(f"Subscription: {ml_client.subscription_id}")
except Exception as e:
    print(f"Authentication failed: {e}")
    print("Please ensure you're logged in to Azure CLI or have proper credentials configured")
    print("You can continue with the local demonstration without Azure ML connection")

## 3. Prepare Data for AutoML

Split the data into training and test sets and save both as CSV files for use in the later steps.

In [None]:
# Split the data into training and testing sets
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create training dataset
train_data = X_train.copy()
train_data['price'] = y_train

# Create test dataset for evaluation
test_data = X_test.copy()
test_data['price'] = y_test

print(f"Training set shape: {train_data.shape}")
print(f"Test set shape: {test_data.shape}")

# Save datasets locally
import os
os.makedirs('./data', exist_ok=True)
train_data.to_csv('./data/train_regression_data.csv', index=False)
test_data.to_csv('./data/test_regression_data.csv', index=False)

print("\nDatasets saved locally in ./data/ directory")
print("Files created:")
print("- train_regression_data.csv")
print("- test_regression_data.csv")

## 4. Configure Azure AutoML for Regression

Set up AutoML configuration specifically for regression tasks. This includes specifying the target column, task type, and optimization metrics.
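
The primary metric used in this notebook, normalized root mean squared error, divides RMSE by the range of the target so that scores are comparable across targets with different units. The sketch below follows that description (RMSE divided by `max - min` of the true values); it illustrates the calculation and is not the service's own implementation.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


# Prices in thousands of dollars: every prediction is off by exactly 10
actual = [150.0, 250.0, 350.0, 550.0]
predicted = [160.0, 240.0, 360.0, 540.0]
print(f"NRMSE: {normalized_rmse(actual, predicted):.3f}")
```

Here the RMSE is 10 and the target range is 400, so the normalized value is 0.025. Lower is better.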

In [None]:
import os

from azure.ai.ml import automl
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import Input

# AutoML in SDK v2 reads tabular data as an MLTable: a folder holding the data
# file plus an `MLTable` YAML file that describes how to load it
mltable_dir = './data/train-mltable'
os.makedirs(mltable_dir, exist_ok=True)
train_data.to_csv(f'{mltable_dir}/train_regression_data.csv', index=False)
with open(f'{mltable_dir}/MLTable', 'w') as f:
    f.write(
        "paths:\n"
        "  - file: ./train_regression_data.csv\n"
        "transformations:\n"
        "  - read_delimited:\n"
        "        delimiter: ','\n"
        "        encoding: 'ascii'\n"
    )

# Create data input for AutoML
# A local folder is uploaded to the workspace when the job is submitted
training_data_input = Input(
    type=AssetTypes.MLTABLE,
    path=mltable_dir
)

# Configure AutoML job for regression
regression_job = automl.regression(
    # Data configuration
    training_data=training_data_input,
    target_column_name="price",
    
    # Primary metric for optimization
    primary_metric="normalized_root_mean_squared_error",
    
    # Experiment settings
    experiment_name="house-price-regression-automl",
    
    # Generate explanations for the best model
    enable_model_explainability=True,
    
    # Cross-validation
    n_cross_validations=3,
    
    # Tags for organization
    tags={"task": "regression", "dataset": "synthetic_house_prices"}
)

# In SDK v2, timeout, trial, and early-termination limits are set separately
regression_job.set_limits(
    timeout_minutes=30,  # 30 minutes for demo
    max_trials=10,
    max_concurrent_trials=2,
    enable_early_termination=True
)

# Featurization is automatic by default, so no extra setting is needed here

print("AutoML regression job configured successfully!")
print(f"Target column: {regression_job.target_column_name}")
print(f"Primary metric: {regression_job.primary_metric}")
print(f"Max trials: {regression_job.limits.max_trials}")
print(f"Experiment timeout: {regression_job.limits.timeout_minutes} minutes")

## 5. Set Up Compute Target

Create or use existing compute target for running the AutoML experiment.

In [None]:
from azure.ai.ml.entities import AmlCompute

# Define compute cluster name
cluster_name = "regression-compute"

try:
    # Check if compute target already exists
    compute_target = ml_client.compute.get(cluster_name)
    print(f"Found existing compute target: {cluster_name}")
    print(f"VM size: {compute_target.size}")
    print(f"Max instances: {compute_target.max_instances}")
    
except Exception:
    print(f"Creating new compute target: {cluster_name}")
    
    # Create new compute target
    compute_config = AmlCompute(
        name=cluster_name,
        size="Standard_DS3_v2",  # Good for AutoML workloads
        max_instances=4,
        min_instances=0,
        idle_time_before_scale_down=120,  # Scale down after 2 minutes
        tier="Dedicated"
    )
    
    try:
        compute_target = ml_client.compute.begin_create_or_update(compute_config)
        print("Compute target creation initiated...")
        print("This may take a few minutes to complete.")
    except Exception as e:
        print(f"Failed to create compute target: {e}")
        print("You can run this experiment on any available compute target")

# Set compute target for the job
if 'ml_client' in locals():
    regression_job.compute = cluster_name

## 6. Submit AutoML Experiment

Submit the regression experiment to Azure AutoML. This will train multiple models and select the best one based on the specified metric.
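
Submission returns immediately, so a notebook that needs the results has to wait for the job to finish. The sketch below shows one way to poll for a final status. The helper `wait_for_terminal_status` and the set of statuses it treats as final are assumptions made for illustration, and a fake status source stands in for a real job so the example runs without Azure.

```python
import time

# Statuses treated as final in this sketch; an assumption, not an exhaustive list
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(get_status, poll_seconds=30.0, max_polls=120):
    """Call get_status() until it returns a terminal status or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job still in status {status!r} after {max_polls} polls")


# Stand-in for a real job: reports "Running" twice, then "Completed"
fake_statuses = iter(["Running", "Running", "Completed"])
final = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

With a live connection, the callable could be `lambda: ml_client.jobs.get(returned_job.name).status`. Alternatively, `ml_client.jobs.stream(returned_job.name)` blocks and prints logs until the job ends.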

In [None]:
# Submit the AutoML job
if 'ml_client' in locals():
    try:
        returned_job = ml_client.jobs.create_or_update(regression_job)
        
        print("AutoML job submitted successfully!")
        print(f"Job name: {returned_job.name}")
        print(f"Job status: {returned_job.status}")
        print(f"Studio URL: {returned_job.studio_url}")
        
        # You can monitor the job in Azure ML Studio using the URL above
        print("\n" + "="*50)
        print("IMPORTANT: Monitor your job progress at:")
        print(returned_job.studio_url)
        print("="*50)
        
    except Exception as e:
        print(f"Failed to submit job: {e}")
        print("Please check your Azure ML configuration and permissions")
else:
    print("Azure ML client not configured. Skipping job submission.")
    print("To run this experiment, please configure your Azure ML workspace credentials.")

## 7. Local Regression Model Demonstration

While the Azure AutoML job runs, let's demonstrate regression concepts with a local model.
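
The AutoML job above is configured with 3-fold cross-validation, while the local comparison in this section uses a single train/test split. As a sketch of how a local baseline could be cross-validated in the same spirit, the example below wraps scaling and the model in one scikit-learn pipeline so the scaler is refit on each training fold. It uses its own small synthetic dataset (`X_demo`, `y_demo`) so that it runs independently of the notebook state.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem so the sketch runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn reports negated RMSE so that higher is always better
neg_rmse = cross_val_score(
    pipeline, X_demo, y_demo, cv=3, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse
print(f"RMSE per fold: {cv_rmse.round(2)}")
print(f"Mean RMSE: {cv_rmse.mean():.2f}")
```

Averaging over folds gives a less split-dependent estimate than a single hold-out score, at the cost of fitting the model several times.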

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Train a simple local model for demonstration
print("Training local regression models for comparison...")

# Scale features for linear regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

results = {}

for name, model in models.items():
    if name == 'Linear Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    # Calculate regression metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'predictions': y_pred
    }
    
    print(f"\n{name} Results:")
    print(f"  RMSE: ${rmse:.2f}k")
    print(f"  MAE:  ${mae:.2f}k")
    print(f"  R²:   {r2:.3f}")

# Create comparison DataFrame
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'RMSE': [results[model]['RMSE'] for model in results.keys()],
    'MAE': [results[model]['MAE'] for model in results.keys()],
    'R²': [results[model]['R²'] for model in results.keys()]
})

print("\nModel Comparison:")
print(results_df.round(3))

## 8. Regression Metrics Visualization

Visualize model performance with regression-specific plots.

In [None]:
# Create comprehensive regression visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Model comparison bar chart
x_pos = np.arange(len(results_df))
axes[0, 0].bar(x_pos, results_df['RMSE'], color=['skyblue', 'lightcoral'])
axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('RMSE (thousands $)')
axes[0, 0].set_title('Model Comparison - RMSE')
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(results_df['Model'])

# R² comparison
axes[0, 1].bar(x_pos, results_df['R²'], color=['lightgreen', 'orange'])
axes[0, 1].set_xlabel('Models')
axes[0, 1].set_ylabel('R² Score')
axes[0, 1].set_title('Model Comparison - R² Score')
axes[0, 1].set_xticks(x_pos)
axes[0, 1].set_xticklabels(results_df['Model'])
axes[0, 1].set_ylim(0, 1)

# Predicted vs Actual scatter plot (Random Forest)
rf_predictions = results['Random Forest']['predictions']
axes[0, 2].scatter(y_test, rf_predictions, alpha=0.6, color='purple')
axes[0, 2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0, 2].set_xlabel('Actual Price (thousands $)')
axes[0, 2].set_ylabel('Predicted Price (thousands $)')
axes[0, 2].set_title('Random Forest: Predicted vs Actual')

# Residuals plot
residuals = y_test - rf_predictions
axes[1, 0].scatter(rf_predictions, residuals, alpha=0.6, color='red')
axes[1, 0].axhline(y=0, color='black', linestyle='--')
axes[1, 0].set_xlabel('Predicted Price (thousands $)')
axes[1, 0].set_ylabel('Residuals')
axes[1, 0].set_title('Residuals Plot (Random Forest)')

# Distribution of residuals
axes[1, 1].hist(residuals, bins=20, alpha=0.7, color='cyan')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Residuals')

# Feature importance (Random Forest)
if hasattr(models['Random Forest'], 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': models['Random Forest'].feature_importances_
    }).sort_values('importance', ascending=True)
    
    axes[1, 2].barh(range(len(feature_importance)), feature_importance['importance'], color='gold')
    axes[1, 2].set_yticks(range(len(feature_importance)))
    axes[1, 2].set_yticklabels(feature_importance['feature'])
    axes[1, 2].set_xlabel('Feature Importance')
    axes[1, 2].set_title('Random Forest Feature Importance')

plt.tight_layout()
plt.show()

# Print key insights
print("\n" + "="*50)
print("KEY REGRESSION INSIGHTS")
print("="*50)
print(f"Best performing model: {results_df.loc[results_df['R²'].idxmax(), 'Model']}")
print(f"Highest R² score: {results_df['R²'].max():.3f}")
print(f"Lowest RMSE: ${results_df['RMSE'].min():.2f}k")
print("\nMost important features (Random Forest):")
if 'feature_importance' in locals():
    top_features = feature_importance.tail(3)
    for _, row in top_features.iterrows():
        print(f"  {row['feature']}: {row['importance']:.3f}")

## 9. Understanding Regression Metrics

Let's understand what each regression metric tells us about model performance.
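
To make the definitions concrete, the sketch below computes RMSE, MAE, and R² from first principles on two tiny hand-checkable examples. The function name `regression_metrics` is illustrative; in practice the scikit-learn functions imported at the top of the notebook give the same values.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Errors are [1, 0, -1, -1]: MAE = 3/4, MSE = 3/4, R² = 1 - 3/20
good = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(good)

# Predictions worse than always guessing the mean give a negative R²
bad = regression_metrics([1, 2, 3], [3, 2, 1])
print(bad["r2"])
```

The second example shows that R² is not bounded below by zero: its residual sum of squares (8) exceeds the total sum of squares (2), giving 1 - 8/2 = -3.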

In [None]:
print("REGRESSION METRICS EXPLAINED")
print("=" * 40)
print()
print("📊 R² (R-squared): Coefficient of Determination")
print("   • Range: at most 1, higher is better (can be negative on test data when the model is worse than predicting the mean)")
print("   • Measures how much variance in target is explained by features")
print(f"   • Our best model explains {results_df['R²'].max()*100:.1f}% of price variance")
print()
print("📏 RMSE (Root Mean Squared Error):")
print("   • Units: Same as target variable (thousands $)")
print("   • Penalizes large errors more heavily")
print(f"   • Lowest RMSE across our models: ${results_df['RMSE'].min():.1f}k")
print()
print("📐 MAE (Mean Absolute Error):")
print("   • Units: Same as target variable (thousands $)")
print("   • Average absolute difference between predicted and actual")
print(f"   • Lowest MAE across our models: ${results_df['MAE'].min():.1f}k")
print()
print("🎯 Business Impact:")
print(f"   • Price range: ${df['price'].min():.0f}k - ${df['price'].max():.0f}k")
print(f"   • Average price: ${df['price'].mean():.0f}k")
print(f"   • Lowest RMSE as % of average price: {(results_df['RMSE'].min()/df['price'].mean())*100:.1f}%")

## 10. AutoML Results Analysis (if connected to Azure)

If you submitted the AutoML job, you can retrieve and analyze the results here.

In [None]:
# Check AutoML job status and retrieve results
if 'ml_client' in locals() and 'returned_job' in locals():
    try:
        # Get job details
        job_details = ml_client.jobs.get(returned_job.name)
        print(f"Job Status: {job_details.status}")
        print(f"Job Name: {job_details.name}")
        
        if job_details.status == "Completed":
            print("\n🎉 AutoML job completed successfully!")
            
            # You can retrieve the best model and its metrics here
            # This would require additional code to download and analyze the model
            print("\nTo analyze the AutoML results:")
            print("1. Visit the Studio URL provided earlier")
            print("2. Review the model leaderboard")
            print("3. Examine feature importance and explanations")
            print("4. Download the best model for deployment")
            
        elif job_details.status == "Running":
            print("\n⏳ AutoML job is still running...")
            print("You can monitor progress in Azure ML Studio")
            
        else:
            print(f"\n⚠️ Job status: {job_details.status}")
            
    except Exception as e:
        print(f"Error retrieving job status: {e}")
else:
    print("AutoML job was not submitted. To run AutoML:")
    print("1. Configure Azure ML workspace credentials")
    print("2. Re-run the AutoML submission cells")
    print("3. Monitor progress in Azure ML Studio")

## 11. Next Steps and Best Practices

This notebook demonstrated the fundamentals of regression with AutoML. Here are recommended next steps:

In [None]:
print("🚀 NEXT STEPS FOR PRODUCTION REGRESSION MODELS")
print("=" * 55)
print()
print("📊 Data Quality:")
print("   • Handle missing values appropriately")
print("   • Remove or transform outliers")
print("   • Ensure feature distributions are reasonable")
print("   • Check for data leakage")
print()
print("🔧 Feature Engineering:")
print("   • Create polynomial features for non-linear relationships")
print("   • Apply domain-specific transformations")
print("   • Use feature selection techniques")
print("   • Handle categorical variables properly")
print()
print("🎯 Model Selection:")
print("   • Try ensemble methods (Random Forest, Gradient Boosting)")
print("   • Consider neural networks for complex patterns")
print("   • Use cross-validation for robust evaluation")
print("   • Compare multiple metrics, not just one")
print()
print("📈 Model Validation:")
print("   • Use time-based splits for time series data")
print("   • Validate on out-of-sample data")
print("   • Check residual patterns")
print("   • Test model stability over time")
print()
print("🚀 Deployment:")
print("   • Set up model monitoring")
print("   • Plan for model retraining")
print("   • Implement A/B testing")
print("   • Document model limitations and assumptions")
print()
print("💡 AutoML Benefits:")
print("   • Automatically tries multiple algorithms")
print("   • Handles feature engineering")
print("   • Provides model explanations")
print("   • Optimizes hyperparameters")
print("   • Reduces time to production")

## Summary

This notebook provided a comprehensive introduction to regression tasks using Automated Machine Learning. We covered:

1. **Data Preparation**: Created and explored a synthetic house-price regression dataset
2. **Azure ML Setup**: Connected to an Azure ML workspace and configured compute
3. **AutoML Configuration**: Set up AutoML specifically for regression tasks
4. **Model Training**: Submitted an AutoML job and trained local models for comparison
5. **Evaluation**: Analyzed regression metrics (RMSE, MAE, R²) and visualizations
6. **Insights**: Interpreted model performance and feature importance

### Key Takeaways:
- Regression predicts continuous numerical values
- AutoML automates model selection and hyperparameter tuning
- Multiple metrics provide different perspectives on model performance
- Feature importance helps understand what drives predictions
- Azure ML records AutoML jobs, metrics, and trained models in the workspace, where they can be reviewed in Studio

For production use, always validate models thoroughly and consider business context when interpreting results.