# AutoML: Automated Machine Learning for Defect Prediction

**Databricks AutoML** automatically builds machine learning models with minimal code. It tries multiple algorithms, tunes hyperparameters, and provides a leaderboard of the best models.

## What You'll Learn

✅ Understand what AutoML does and when to use it  
✅ Run classification experiments to predict defects using the UI  
✅ Review model performance and metrics  
✅ Deploy the best model for predictions  
✅ Understand feature importance  

---

## Use Case: Predicting Device Defects

We'll use AutoML to predict whether a device will have defects based on sensor readings. This can help:
- **Preventive maintenance**: Identify devices at risk before failure
- **Quality control**: Catch issues early in production
- **Cost reduction**: Minimize downtime and repairs

---

## Table of Contents

1. Understanding AutoML
2. Running Classification Experiment
3. Reviewing Results
4. Using the Model
5. Advanced: Regression Example

---

**References:**
- [AutoML Overview](https://docs.databricks.com/aws/en/machine-learning/automl/)
- [Classification](https://docs.databricks.com/aws/en/machine-learning/automl/classification)
- [Regression](https://docs.databricks.com/aws/en/machine-learning/automl/regression)


```python
# Configuration
import re

CATALOG = 'dwx_express_insights_platform_dev_working'
READ_SCHEMA = 'db_crash_course'  # Shared schema (read-only)
username = spark.sql("SELECT current_user()").collect()[0][0]
username_base = username.split('@')[0]  # Extract username before @ symbol
WRITE_SCHEMA = re.sub(r'[^a-zA-Z0-9_]', '_', username_base)  # Replace special chars with _

# Create personal schema for any writes
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{WRITE_SCHEMA}")

print(f"✅ Using catalog: {CATALOG}")
print(f"📖 Reading from schema: {READ_SCHEMA} (shared)")
print(f"✍️  Writing to schema: {WRITE_SCHEMA} (your personal schema)")
```
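
To make the schema naming rule concrete, here is a small standalone sketch of the same two steps (take the part before `@`, then replace anything that is not a letter, digit, or underscore). The helper name `schema_from_user` and the sample address are hypothetical and only serve the illustration.

```python
import re


def schema_from_user(username: str) -> str:
    """Derive a schema-safe name from a login such as an email address."""
    base = username.split("@")[0]  # keep the part before the @ symbol
    return re.sub(r"[^a-zA-Z0-9_]", "_", base)  # replace special chars with _


# Hypothetical login, not a real account
print(schema_from_user("jane.doe-smith@example.com"))  # jane_doe_smith
```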

## 1. Understanding AutoML <a id="understanding"></a>

### What is Databricks AutoML?

AutoML automates the machine learning workflow:
1. **Data preprocessing** - Handles missing values, encoding, scaling
2. **Feature engineering** - Creates derived features automatically
3. **Model selection** - Tries multiple algorithms (Random Forest, XGBoost, LightGBM, etc.)
4. **Hyperparameter tuning** - Optimizes model parameters
5. **Model evaluation** - Compares models using relevant metrics

### Supported Problem Types:

- **Classification**: Predict categories (e.g., defect/no defect)
- **Regression**: Predict continuous values (e.g., temperature)
- **Forecasting**: Predict time series values (e.g., future sensor readings)

### Benefits:

✅ **Fast experimentation** - Get results in minutes  
✅ **Best practices built-in** - Follows ML best practices automatically  
✅ **Transparency** - Generates notebooks you can review and modify  
✅ **Production-ready** - Models are registered and ready to deploy  

### When to Use AutoML:

- Quick proof-of-concept
- Baseline models for comparison
- When you need results fast
- Learning ML best practices

### When Not to Use AutoML:

- Highly specialized models needed
- Custom architectures required
- Deep learning for images/text
- Fine control over every step


## 2. Running Classification with AutoML <a id="classification"></a>

We'll use the `inspection_silver` table we created earlier with the Lakeflow Designer. AutoML will automatically handle data preprocessing, feature engineering, and model selection.

**Steps:**

1. Click **Experiments** in the left sidebar
2. Click **Classification**
3. Configure the experiment:
   - **Cluster**: Any Machine Learning enabled cluster
   - **Dataset**: Browse and select `dwx_express_insights_platform_dev_working.db_crash_course.inspection_silver` table
   - **Prediction target**: `defect`
   - **Experiment name**: Pick a unique name for this experiment
   - **Features** → **Excluded columns**: Uncheck `device_id` so that it is excluded from training
   - **Advanced Configuration**: Set timeout to 45 minutes
4. Click **Start AutoML**
5. Move on to the next exercise while you wait for the experiment to complete

AutoML will:
- Automatically handle missing values and feature encoding
- Try multiple algorithms (Random Forest, XGBoost, LightGBM)
- Tune hyperparameters for each algorithm
- Generate a leaderboard showing the best models
- Create notebooks for each model so you can see exactly what it did
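
The same experiment can also be started from a notebook through the AutoML Python API. The sketch below is one possible way to do it, assuming a Databricks ML runtime where `databricks.automl` is installed and a `spark` session exists. The helper names `classification_settings` and `run_classification` are hypothetical, and `primary_metric="f1"` is an assumption chosen to match the metric discussion in this notebook.

```python
def classification_settings(timeout_minutes: int = 45) -> dict:
    """Keyword arguments that mirror the UI configuration for this experiment."""
    return {
        "target_col": "defect",          # prediction target
        "exclude_cols": ["device_id"],   # ID column, not a useful feature
        "primary_metric": "f1",          # assumption: rank trials by F1
        "timeout_minutes": timeout_minutes,
    }


def run_classification(spark, table_name: str):
    """Start an AutoML classification run (works only on a Databricks ML runtime)."""
    # Imported lazily because the package is only present on Databricks ML runtimes
    from databricks import automl

    df = spark.table(table_name)
    return automl.classify(dataset=df, **classification_settings())


# Usage on Databricks (not run here):
# summary = run_classification(spark, f"{CATALOG}.{READ_SCHEMA}.inspection_silver")
# print(summary.best_trial.metrics)
```

The UI and the API create the same kind of experiment, so the leaderboard and generated notebooks can be reviewed in the same way.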


## 3. Reviewing Results <a id="results"></a>

### Understanding the Leaderboard

After AutoML completes, you'll see a leaderboard with:

- **Model type**: Algorithm used (XGBoost, LightGBM, Random Forest, etc.)
- **F1 Score**: Harmonic mean of precision and recall
- **Precision**: Accuracy of positive predictions
- **Recall**: Coverage of actual positives
- **Accuracy**: Overall correctness
- **AUC**: Area under ROC curve

### Key Metrics for Classification:

**F1 Score**: Well suited to imbalanced datasets (we have more non-defects than defects), where accuracy alone can look high even if most defects are missed
- Range: 0 to 1 (higher is better)
- Balances precision and recall

**Precision**: Of predicted defects, how many are actually defects?
- High precision = fewer false alarms

**Recall**: Of actual defects, how many did we catch?
- High recall = catch more defects, but may have false alarms

**AUC**: Model's ability to distinguish between classes
- Range: 0 to 1, where 0.5 is equivalent to random guessing and 1.0 is perfect
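
To see how these metrics relate, the following sketch computes them from confusion-matrix counts. The counts are made up for illustration (1,000 inspections, 50 of them real defects) and do not come from the `inspection_silver` table.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 30 defects caught, 10 false alarms, 20 defects missed
m = classification_metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.97 even though recall is only 0.60 (precision is 0.75 and F1 is about 0.667), which is why accuracy alone can be misleading on imbalanced data.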

### Reviewing the Best Model:

1. Click on the best model in the leaderboard
2. Review the **Model notebook** generated by AutoML
3. Check **Feature importance** - which features matter most?
4. Review **Confusion matrix** - where does the model make mistakes?
5. Check **ROC curve** and **PR curve**


## 5. Regression Example <a id="regression"></a>

### Prepare Data for Regression

Now let's try AutoML for a regression task - predicting temperature. This time we'll write ML features to our own schema.


```python
# Prepare data for regression (predicting temperature)
regression_data = spark.table(f"{CATALOG}.{READ_SCHEMA}.sensor_bronze").select(
    "temperature",  # Target variable
    "rotation_speed",
    "air_pressure",
    "delay",
    "density",
    "airflow_rate",
    "factory_id",
    "model_id"
).na.drop()

# Save regression training data to YOUR personal schema
regression_table = f"{CATALOG}.{WRITE_SCHEMA}.temperature_prediction_training"
regression_data.write.format("delta").mode("overwrite").saveAsTable(regression_table)

print(f"✅ Regression data saved to: {regression_table}")
print(f"   Total records: {regression_data.count():,}")
```

### Run AutoML Regression

**Using the UI:**
1. Open the **Experiments** tab (under AI/ML)
2. Select **Regression** as the problem type
3. Choose the `temperature_prediction_training` table from your personal schema
4. Set the target column to `temperature`
5. Set the evaluation metric to RMSE (Root Mean Squared Error)
6. Click **Start AutoML**


### Regression Metrics Explained

**RMSE (Root Mean Squared Error)**:
- Square root of the average squared prediction error, in the same units as the target
- Lower is better
- Example: an RMSE of 5.2 means predictions are typically off by roughly 5.2 degrees, with large errors weighted more heavily

**R² Score**:
- Fraction of variance in the target explained by the model
- Typically between 0 and 1 (higher is better); it can be negative when a model fits worse than always predicting the mean
- 0.8 = Model explains 80% of variance in temperature

**MAE (Mean Absolute Error)**:
- Average absolute difference between predicted and actual
- Easier to interpret than RMSE and less sensitive to outliers
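
The three regression metrics can be computed by hand on a few values, which helps when reading the leaderboard. This is a minimal sketch in plain Python; the temperature readings are invented for illustration.

```python
import math


def regression_metrics(actual: list, predicted: list) -> dict:
    """Compute RMSE, MAE, and R² for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when every actual value is identical
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Hypothetical temperatures: three close predictions and one large miss
actual = [20.0, 22.0, 24.0, 26.0]
predicted = [21.0, 22.0, 23.0, 30.0]
m = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in m.items()})
```

Here MAE is 1.5 while RMSE is about 2.121, because the single 4-degree miss is squared before averaging. R² is only 0.1, showing that one large error can wipe out most of the explained variance on a small sample.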

### Use Cases for Temperature Prediction:

- **Anomaly detection**: Flag when actual temps deviate from predictions
- **Capacity planning**: Predict cooling needs
- **Energy optimization**: Forecast temperature changes


## Summary

In this notebook, you learned:

✅ **What is AutoML** - Automated machine learning workflow  
✅ **Run classification** - Predict defects using sensor data with the UI  
✅ **Review results** - Understand metrics and feature importance  
✅ **Register models** - Save models for production use  

### Key Takeaways:

1. **AutoML automates** data preprocessing, feature engineering, model selection, and tuning
2. **Classification** predicts categories (defect/no defect)
3. **F1 Score** is best for imbalanced classification problems
4. **Feature importance** shows which sensor readings matter most
5. **Model registration** makes models available for production use
6. **Next notebook** will cover model deployment and inference

### Best Practices:

**Data Preparation:**
- Remove nulls and outliers
- Check target variable distribution
- Exclude ID columns from features
- Include domain-relevant features

**Model Evaluation:**
- Use appropriate metrics for your problem
- Check confusion matrix for classification
- Review feature importance
- Validate on holdout data

**Model Management:**
- Register best models in MLflow
- Document model purpose and metrics
- Version your models systematically
- Prepare for deployment (covered in next notebook)

### Real-World Applications:

**Preventive Maintenance:**
- Predict equipment failures before they occur
- Schedule maintenance proactively
- Reduce downtime and repair costs

**Quality Control:**
- Identify defective products early
- Improve manufacturing processes
- Reduce waste and rework

**Anomaly Detection:**
- Flag unusual sensor patterns
- Detect cyber-attacks or tampering
- Ensure operational safety

---

### What's Next?

Now that you've trained a defect prediction model with AutoML:
- **Next Notebook (5 MLflow and MLOps)**: Learn to train custom models and track experiments
- **Notebook 6 (ML and AI Inference)**: Deploy your model for batch, streaming, and real-time predictions

---

## Try This Out

Want more practice? Try these exercises:

### 1. Try Different Model Types

In the AutoML UI, experiment with different algorithms:
- Disable XGBoost and try only LightGBM
- Compare model performance across algorithms
- Check which algorithm trains fastest

### 2. Feature Engineering

Try adding new features to improve model performance:
- **Temperature difference**: `abs(temperature - LAG(temperature))`, computed over a window ordered by reading time
- **Rolling averages**: Moving average of rotation_speed
- **Time-based features**: Hour of day, day of week

Use Databricks Assistant to help generate feature engineering code!
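
As a starting point, the three feature ideas above could be expressed as a single Spark SQL query. The sketch below only builds the query string. The column names `device_id` and `timestamp` are assumptions about the sensor table and should be replaced with the real key and time columns; the 5-row rolling window is also an arbitrary choice, and `feature_query` is a hypothetical helper.

```python
def feature_query(
    source_table: str,
    key_col: str = "device_id",   # assumption: column identifying each device
    ts_col: str = "timestamp",    # assumption: column with the reading time
) -> str:
    """Build a Spark SQL query that adds lag, rolling, and time-based features."""
    window = f"PARTITION BY {key_col} ORDER BY {ts_col}"
    return f"""
SELECT
  *,
  ABS(temperature - LAG(temperature) OVER ({window})) AS temperature_diff,
  AVG(rotation_speed) OVER ({window} ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rotation_speed_avg_5,
  HOUR({ts_col}) AS hour_of_day,
  DAYOFWEEK({ts_col}) AS day_of_week
FROM {source_table}
""".strip()


# Placeholder table name, used only to show the generated SQL
print(feature_query("my_catalog.my_schema.sensor_bronze"))

# Usage on Databricks (not run here):
# features = spark.sql(feature_query(f"{CATALOG}.{READ_SCHEMA}.sensor_bronze"))
```

The first reading of each device has no previous row, so `temperature_diff` is null there; drop or impute those rows before training.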

### 3. Train a Regression Model

Try AutoML for regression - predicting temperature instead of defects:

**Quick Setup:**
```python
# Prepare regression data (read from the shared schema)
regression_data = spark.table(f"{CATALOG}.{READ_SCHEMA}.sensor_bronze").select(
    "temperature",  # Target
    "rotation_speed", "air_pressure", "delay",
    "density", "airflow_rate", "factory_id", "model_id"
).na.drop()

# Save for AutoML (write to your personal schema)
regression_data.write.format("delta").mode("overwrite") \
    .saveAsTable(f"{CATALOG}.{WRITE_SCHEMA}.temperature_prediction")
```

**Then in AutoML UI:**
- Problem type: **Regression**
- Target: `temperature`
- Metric: **RMSE**

**Questions to explore:**
- What RMSE do you achieve?
- Which features are most important for temperature prediction?
- How does this compare to classification?

### 4. Try XGBoost with Databricks Assistant

Ask Databricks Assistant to help you train an XGBoost model from scratch (we'll cover this more in the next notebook):

**Prompt:** "Train an XGBoost classifier on inspection_silver to predict defects, using the same features as AutoML"

Compare your custom model to AutoML's results!

---

**Additional Resources:**
- [AutoML Documentation](https://docs.databricks.com/aws/en/machine-learning/automl/)
- [MLflow Model Registry](https://docs.databricks.com/aws/en/mlflow/model-registry)
- [Model Serving](https://docs.databricks.com/aws/en/machine-learning/model-serving/)
