# AutoML: Automated Machine Learning for Defect Prediction

**Databricks AutoML** automatically builds machine learning models with minimal code. It tries multiple algorithms, tunes hyperparameters, and provides a leaderboard of the best models.

## What You'll Learn

✅ Understand what AutoML does and when to use it  
✅ Run classification experiments to predict defects using the UI  
✅ Review model performance and metrics  
✅ Deploy the best model for predictions  
✅ Understand feature importance  

---

## Use Case: Predicting Device Defects

We'll use AutoML to predict whether a device will have defects based on sensor readings. This can help with:
- **Preventive maintenance**: Identify devices at risk before failure
- **Quality control**: Catch issues early in production
- **Cost reduction**: Minimize downtime and repairs

---

## Table of Contents

1. Understanding AutoML
2. Running Classification Experiment
3. Reviewing Results
4. Using the Model
5. Advanced: Regression Example

---

**References:**
- [AutoML Overview](https://docs.databricks.com/aws/en/machine-learning/automl/)
- [Classification](https://docs.databricks.com/aws/en/machine-learning/automl/classification)
- [Regression](https://docs.databricks.com/aws/en/machine-learning/automl/regression)


In [0]:
# Configuration
CATALOG = 'default'
SCHEMA = 'db_crash_course'

print(f"Using: {CATALOG}.{SCHEMA}")

## 1. Understanding AutoML <a id="understanding"></a>

### What is Databricks AutoML?

AutoML automates the machine learning workflow:
1. **Data preprocessing** - Handles missing values, encoding, scaling
2. **Feature engineering** - Creates derived features automatically
3. **Model selection** - Tries multiple algorithms (Random Forest, XGBoost, LightGBM, etc.)
4. **Hyperparameter tuning** - Optimizes model parameters
5. **Model evaluation** - Compares models using relevant metrics
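
The selection, tuning, and evaluation steps above can be pictured as a search loop: score each candidate configuration on held-out data, then rank the results. The sketch below is a deliberately tiny, pure-Python illustration of that idea and is not how Databricks AutoML is implemented. The "models" are hypothetical threshold rules on a single made-up temperature reading, and the validation data is invented for the example.

```python
# Toy illustration of a model search with a leaderboard.
# All data and "models" here are invented for the example.

def f1_score(y_true, y_pred):
    """F1 for the positive class (1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation set: (temperature reading, defect label)
validation = [(61, 0), (64, 0), (70, 0), (72, 1), (75, 0), (78, 1), (83, 1), (90, 1)]
readings = [x for x, _ in validation]
labels = [y for _, y in validation]

# Candidate "hyperparameters": predict a defect when temperature >= threshold
candidate_thresholds = [65, 71, 76, 85]

leaderboard = []
for threshold in candidate_thresholds:
    predictions = [1 if x >= threshold else 0 for x in readings]
    leaderboard.append({"threshold": threshold, "val_f1": f1_score(labels, predictions)})

# Rank trials by the primary metric, best first
leaderboard.sort(key=lambda trial: trial["val_f1"], reverse=True)
best_trial = leaderboard[0]

for trial in leaderboard:
    print(f"threshold={trial['threshold']:>3}  val_f1={trial['val_f1']:.3f}")
```

A real AutoML run searches over far richer model families and parameters, but the pattern is the same: many trials, one primary metric, and a ranked list at the end.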

### Supported Problem Types:

- **Classification**: Predict categories (e.g., defect/no defect)
- **Regression**: Predict continuous values (e.g., temperature)
- **Forecasting**: Predict time series values (e.g., future sensor readings)

### Benefits:

✅ **Fast experimentation** - Get results in minutes  
✅ **Best practices built-in** - Follows ML best practices automatically  
✅ **Transparency** - Generates notebooks you can review and modify  
✅ **Production-ready** - Models are registered and ready to deploy  

### When to Use AutoML:

- Quick proof-of-concept
- Baseline models for comparison
- When you need results fast
- Learning ML best practices

### When Not to Use AutoML:

- Highly specialized models needed
- Custom architectures required
- Deep learning for images/text
- Fine control over every step


## 2. Running Classification with AutoML <a id="classification"></a>

### Option 1: Using the UI (Recommended)

We'll use the `inspection_silver` table we created earlier with the Lakeflow Designer. AutoML will automatically handle data preprocessing, feature engineering, and model selection.

**Steps:**

1. Click **Machine Learning** in the left sidebar
2. Click **AutoML** (or **Experiments** → **Create AutoML Experiment**)
3. Configure the experiment:
   - **Problem type**: Classification
   - **Dataset**: Browse and select `default.db_crash_course.inspection_silver` table
   - **Target column**: `defect`
   - **Evaluation metric**: F1 Score (good for imbalanced classes)
   - **Training framework**: LightGBM, XGBoost, sklearn (select all)
   - **Advanced settings** → **Excluded columns**: Add `device_id` (we don't want to train on IDs)
   - **Timeout**: 30 minutes
4. Click **Start AutoML**
5. Wait for the experiment to complete

AutoML will:
- Automatically handle missing values and feature encoding
- Try multiple algorithms (Random Forest, XGBoost, LightGBM)
- Tune hyperparameters for each algorithm
- Generate a leaderboard showing the best models
- Create notebooks for each model so you can see exactly what it did


### Option 2: Using Python API (Optional)

If you prefer a programmatic approach, you can run the same experiment with the AutoML Python API:

In [0]:
from databricks import automl

# Run AutoML classification on the inspection_silver table
training_table = f"{CATALOG}.{SCHEMA}.inspection_silver"

summary = automl.classify(
    dataset=training_table,
    target_col="defect",
    primary_metric="f1",
    timeout_minutes=30,
    exclude_cols=["device_id"],  # Exclude ID columns from features
    experiment_name=f"/Users/{spark.sql('SELECT current_user()').collect()[0][0]}/automl_defect_prediction"
)

print(f"Best trial F1 Score: {summary.best_trial.metrics['val_f1_score']:.4f}")
print(f"Best trial Run ID: {summary.best_trial.mlflow_run_id}")


## 3. Reviewing Results <a id="results"></a>

### Understanding the Leaderboard

After AutoML completes, you'll see a leaderboard with:

- **Model type**: Algorithm used (XGBoost, LightGBM, Random Forest, etc.)
- **F1 Score**: Harmonic mean of precision and recall
- **Precision**: Accuracy of positive predictions
- **Recall**: Coverage of actual positives
- **Accuracy**: Overall correctness
- **AUC**: Area under ROC curve

### Key Metrics for Classification:

**F1 Score**: A good default for imbalanced datasets (we have more non-defects than defects), where accuracy alone can be misleading
- Range: 0 to 1 (higher is better)
- Balances precision and recall

**Precision**: Of predicted defects, how many are actually defects?
- High precision = fewer false alarms

**Recall**: Of actual defects, how many did we catch?
- High recall = catch more defects, but may have false alarms
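
To make these definitions concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical (1,000 devices, 50 of them defective) and the helper name `classification_metrics` is made up for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model predicts no positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical model A: never predicts a defect
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)

# Hypothetical model B: catches 40 of 50 defects with 20 false alarms
useful_model = classification_metrics(tp=40, fp=20, fn=10, tn=930)

for name, metrics in [("always_negative", always_negative), ("useful_model", useful_model)]:
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Model A reaches 95% accuracy while catching no defects at all, so its F1 is 0. That gap is why F1 is a more informative metric than accuracy when defects are rare.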

**AUC**: Model's ability to distinguish between classes
- Range: 0 to 1.0; 0.5 corresponds to random guessing and 1.0 to perfect separation
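
One way to read AUC is as the probability that a randomly chosen defect receives a higher score than a randomly chosen non-defect. The sketch below computes it directly from that definition by comparing every positive/negative pair. It is an illustrative implementation on made-up scores, not the routine AutoML uses, and it is quadratic in the number of rows, so it only suits small examples.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical labels (1 = defect) and predicted defect scores
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.60, 0.80, 0.90]

auc = roc_auc(labels, scores)
perfect_auc = roc_auc([0, 1], [0.2, 0.9])   # every defect ranked above every non-defect
inverted_auc = roc_auc([0, 1], [0.9, 0.2])  # ranking is exactly backwards

print(f"AUC: {auc:.3f}, perfect: {perfect_auc:.1f}, inverted: {inverted_auc:.1f}")
```

Values below 0.5 are possible and indicate a ranking that is worse than random, which usually points to a labeling or pipeline bug rather than a merely weak model.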

### Reviewing the Best Model:

1. Click on the best model in the leaderboard
2. Review the **Model notebook** generated by AutoML
3. Check **Feature importance** - which features matter most?
4. Review **Confusion matrix** - where does the model make mistakes?
5. Check **ROC curve** and **PR curve**


### View Feature Importance (After AutoML Completes)

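The notebook generated for the best trial reports feature importance. The cell below prints an illustrative example of the kind of ranking you might see; your actual results will depend on your data: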


In [0]:
# After AutoML completes, access the generated notebooks
# The best model notebook will show feature importance

# Example of what you'll see:
print("""
Top Features for Predicting Defects (typical results):
1. sensor_temperature - High temps correlate with defects
2. sensor_rotation_speed_EMA_5 - Smoothed rotation speed pattern
3. sensor_delay - Operational delays indicate issues  
4. sensor_density - Material density affects performance
5. sensor_air_pressure - Low pressure indicates problems

These features help the model predict which devices will have defects!
""")


## 4. Using the Model for Predictions <a id="using-model"></a>

### Register the Model

1. In the AutoML UI, click on your best model
2. Click **Register model**
3. Choose a name: `iot_defect_predictor`
4. Add description and tags
5. Click **Register**

### Make Predictions

Once registered, you can use the model to predict on new data:


In [0]:
# Example: Load model and make predictions
import mlflow

# Name of the registered model (the name you chose during registration)
model_name = "iot_defect_predictor"

# This is an example - update with your actual model URI after registration
model_uri = f"models:/{model_name}/Production"

# Load new data for prediction
new_sensor_data = spark.table(f"{CATALOG}.{SCHEMA}.sensor_bronze").limit(100)

# Make predictions (example - actual code depends on your registered model)
predictions = mlflow.pyfunc.load_model(model_uri).predict(new_sensor_data.toPandas())

print("""
After registration, you can:
1. Load the model using MLflow
2. Apply it to new sensor data
3. Predict which devices will have defects
4. Take preventive action before failures occur

Example use cases:
- Real-time monitoring: Flag devices predicted to fail
- Preventive maintenance: Schedule inspections for high-risk devices
- Quality control: Identify problematic batches early
""")


### Batch Scoring with SQL

You can also apply the model directly in SQL:


In [0]:
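# After registering the model and deploying it to a Model Serving endpoint,
# you can wrap the endpoint in a SQL function.
# Example SQL for batch inference (the endpoint name is a placeholder):

sql_example = """
-- Wrap the model serving endpoint in a SQL function
CREATE OR REPLACE FUNCTION predict_defect(
    temperature DOUBLE,
    density FLOAT,
    delay FLOAT,
    rotation_speed DOUBLE,
    air_pressure FLOAT,
    airflow_rate DOUBLE,
    rotation_speed_ema DOUBLE
)
RETURNS DOUBLE
RETURN ai_query(
    endpoint => 'iot_defect_predictor',
    -- Field names below are assumptions; they must match your model's input schema
    request => named_struct(
        'sensor_temperature', temperature,
        'sensor_density', density,
        'sensor_delay', delay,
        'sensor_rotation_speed', rotation_speed,
        'sensor_air_pressure', air_pressure,
        'sensor_airflow_rate', airflow_rate,
        'sensor_rotation_speed_EMA_5', rotation_speed_ema
    ),
    returnType => 'DOUBLE'
);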

-- Use the function to score data
SELECT 
    device_id,
    timestamp,
    temperature,
    predict_defect(
        sensor_temperature,
        sensor_density, 
        sensor_delay,
        sensor_rotation_speed,
        sensor_air_pressure,
        sensor_airflow_rate,
        sensor_rotation_speed_EMA_5
    ) as defect_probability
FROM sensor_bronze
WHERE timestamp > current_timestamp() - INTERVAL 1 DAY
ORDER BY defect_probability DESC
LIMIT 20;
"""

print("SQL Batch Scoring Example:")
print(sql_example)


## 5. Advanced: Regression Example <a id="regression"></a>

### Prepare Data for Regression

Now let's try AutoML for a regression task - predicting temperature. This is an optional advanced example for those who want to explore more.


In [0]:
# Prepare data for regression (predicting temperature)
regression_data = spark.table(f"{CATALOG}.{SCHEMA}.sensor_bronze").select(
    "temperature",  # Target variable
    "rotation_speed",
    "air_pressure", 
    "delay",
    "density",
    "airflow_rate",
    "factory_id",
    "model_id"
).na.drop()

# Save regression training data
regression_table = f"{CATALOG}.{SCHEMA}.temperature_prediction_training"
regression_data.write.format("delta").mode("overwrite").saveAsTable(regression_table)

print(f"✅ Regression data saved to: {regression_table}")
print(f"   Total records: {regression_data.count():,}")


### Run AutoML Regression

**Using the UI:**
1. Create new AutoML experiment
2. Select **Regression** as problem type
3. Choose `temperature_prediction_training` table
4. Target column: `temperature`
5. Metric: RMSE (Root Mean Squared Error)
6. Start AutoML

**Using Python API:**


In [0]:
# Run AutoML for regression
from databricks import automl  # Needed if you skipped the optional Python API cell earlier
summary_regression = automl.regress(
    dataset=regression_table,
    target_col="temperature",
    primary_metric="rmse",
    timeout_minutes=20,
    experiment_name=f"/Users/{spark.sql('SELECT current_user()').collect()[0][0]}/automl_temperature_prediction"
)

print(f"Best trial RMSE: {summary_regression.best_trial.metrics['val_rmse']:.4f}")
print(f"Best trial R²: {summary_regression.best_trial.metrics.get('val_r2_score', 'N/A')}")


### Regression Metrics Explained

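**RMSE (Root Mean Squared Error)**: 
- Square root of the average squared prediction error, in the same units as the target
- Lower is better; large errors are penalized more heavily than small ones
- Example: RMSE of 5.2 means predictions are typically off by roughly 5.2 degrees

**R² Score**:
- Proportion of variance in the target explained by the model
- Usually between 0 and 1 (higher is better); it can be negative when the model performs worse than always predicting the mean
- 0.8 = Model explains 80% of variance in temperature

**MAE (Mean Absolute Error)**:
- Average absolute difference between predicted and actual
- Less sensitive to large errors than RMSE and often easier to interpret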
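
The three regression metrics can be computed by hand in a few lines. The sketch below uses invented temperature values with one deliberately large miss, and the helper name `regression_metrics` is hypothetical; it is meant to show how the metrics relate, not to replace the values AutoML reports.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for paired lists of actual and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    # R² is undefined when the target is constant
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical temperatures; the fourth prediction misses by 10 degrees
actual = [70.0, 72.0, 75.0, 80.0, 78.0]
predicted = [71.0, 72.0, 74.0, 70.0, 78.0]

metrics = regression_metrics(actual, predicted)
print(f"RMSE={metrics['rmse']:.2f}  MAE={metrics['mae']:.2f}  R²={metrics['r2']:.2f}")
```

Here the single 10-degree miss pushes RMSE well above MAE and drives R² below zero, which shows how RMSE weights large errors and that R² is not bounded below by 0.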

### Use Cases for Temperature Prediction:

- **Anomaly detection**: Flag when actual temps deviate from predictions
- **Capacity planning**: Predict cooling needs
- **Energy optimization**: Forecast temperature changes
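
For the anomaly detection use case, one simple approach is to flag readings whose actual temperature differs from the model's prediction by more than a chosen tolerance. The sketch below assumes you already have paired actual and predicted values; the numbers, the `flag_anomalies` helper, and the 5-degree threshold are all hypothetical. In practice the threshold might be derived from the validation RMSE.

```python
def flag_anomalies(actual, predicted, threshold):
    """Return the indices where |actual - predicted| exceeds the threshold."""
    return [
        index
        for index, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > threshold
    ]

# Hypothetical readings and model predictions for five time steps
actual_temps = [70.2, 71.0, 69.8, 85.5, 70.5]
predicted_temps = [70.0, 70.8, 70.1, 71.0, 70.4]

# Hypothetical tolerance in degrees
anomalies = flag_anomalies(actual_temps, predicted_temps, threshold=5.0)
print(f"Anomalous reading indices: {anomalies}")
```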


## Summary

In this notebook, you learned:

✅ **What is AutoML** - Automated machine learning workflow  
✅ **Run classification** - Predict defects using sensor data with the UI  
✅ **Review results** - Understand metrics and feature importance  
✅ **Register models** - Save models for production use  

### Key Takeaways:

1. **AutoML automates** data preprocessing, feature engineering, model selection, and tuning
2. **Classification** predicts categories (defect/no defect)
3. **F1 Score** is a good primary metric for imbalanced classification problems
4. **Feature importance** shows which sensor readings matter most
5. **Model registration** makes models available for production use
6. **Upcoming notebooks** cover custom model training with MLflow (Notebook 5) and model deployment and inference (Notebook 6)

### Best Practices:

**Data Preparation:**
- Remove nulls and outliers
- Check target variable distribution
- Exclude ID columns from features
- Include domain-relevant features

**Model Evaluation:**
- Use appropriate metrics for your problem
- Check confusion matrix for classification
- Review feature importance
- Validate on holdout data

**Model Management:**
- Register best models in MLflow
- Document model purpose and metrics
- Version your models systematically
- Prepare for deployment (covered in Notebook 6)

### Real-World Applications:

**Preventive Maintenance:**
- Predict equipment failures before they occur
- Schedule maintenance proactively
- Reduce downtime and repair costs

**Quality Control:**
- Identify defective products early
- Improve manufacturing processes
- Reduce waste and rework

**Anomaly Detection:**
- Flag unusual sensor patterns
- Detect cyber-attacks or tampering
- Ensure operational safety

---

### What's Next?

Now that you've trained a defect prediction model with AutoML:
- **Next Notebook (5 MLflow and MLOps)**: Learn to train custom models and track experiments
- **Notebook 6 (ML and AI Inference)**: Deploy your model for batch, streaming, and real-time predictions

---

## Try This Out

Want more practice? Try these exercises:

### 1. Try Different Model Types

In the AutoML UI, experiment with different algorithms:
- Disable XGBoost and try only LightGBM
- Compare model performance across algorithms
- Check which algorithm trains fastest

### 2. Feature Engineering

Try adding new features to improve model performance:
- **Temperature difference**: `abs(temperature - LAG(temperature))`
- **Rolling averages**: Moving average of rotation_speed
- **Time-based features**: Hour of day, day of week

Use Databricks Assistant to help generate feature engineering code!
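
Before porting these ideas to Spark, it can help to see the calculations on plain Python lists. The sketch below is an assumption-laden illustration: the helper names and sample values are made up, and the readings are assumed to belong to a single device and to be sorted by time. In Spark you would typically express the same logic with window functions partitioned by device and ordered by timestamp.

```python
from datetime import datetime

def lag_abs_diff(values):
    """abs(value - previous value); None for the first row, which has no predecessor."""
    if not values:
        return []
    return [None] + [abs(curr - prev) for prev, curr in zip(values, values[1:])]

def rolling_mean(values, window):
    """Trailing moving average over up to `window` most recent values."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

def time_features(timestamp):
    """Hour of day (0-23) and day of week (Monday = 0) for a datetime."""
    return {"hour_of_day": timestamp.hour, "day_of_week": timestamp.weekday()}

# Hypothetical readings for one device, ordered by time
temperatures = [70.0, 72.5, 71.0, 75.0]
rotation_speeds = [100, 110, 120, 130]

temperature_diff = lag_abs_diff(temperatures)
rotation_speed_avg = rolling_mean(rotation_speeds, window=3)
reading_time = time_features(datetime(2024, 1, 15, 14, 30))

print(temperature_diff)
print(rotation_speed_avg)
print(reading_time)
```

After adding features like these to the training table, rerun AutoML and compare the leaderboard against your original experiment to see whether they help.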

### 3. Train a Regression Model

Try AutoML for regression - predicting temperature instead of defects:

**Quick Setup:**
```python
# Prepare regression data
regression_data = spark.table(f"{CATALOG}.{SCHEMA}.sensor_bronze").select(
    "temperature",  # Target
    "rotation_speed", "air_pressure", "delay",
    "density", "airflow_rate", "factory_id", "model_id"
).na.drop()

# Save for AutoML
regression_data.write.format("delta").mode("overwrite") \
    .saveAsTable(f"{CATALOG}.{SCHEMA}.temperature_prediction")
```

**Then in AutoML UI:**
- Problem type: **Regression**
- Target: `temperature`
- Metric: **RMSE**

**Questions to explore:**
- What RMSE do you achieve?
- Which features are most important for temperature prediction?
- How does this compare to classification?

### 4. Try XGBoost with Databricks Assistant

Ask Databricks Assistant to help you train an XGBoost model from scratch (we'll cover this more in the next notebook):

**Prompt:** "Train an XGBoost classifier on inspection_silver to predict defects, using the same features as AutoML"

Compare your custom model to AutoML's results!

---

**Additional Resources:**
- [AutoML Documentation](https://docs.databricks.com/aws/en/machine-learning/automl/)
- [MLflow Model Registry](https://docs.databricks.com/aws/en/mlflow/model-registry)
- [Model Serving](https://docs.databricks.com/aws/en/machine-learning/model-serving/)
