# 🏡 Linear Regression Architecture Workshop

## Introduction

Welcome to the **Linear Regression Architecture Workshop**.  
This workshop is designed for college-level students learning both:

1. **Univariate Linear Regression** – a foundational algorithm in Machine Learning, focusing on predicting continuous values from a single feature.  
2. **Machine Learning Operations (MLOps)** – design patterns and architectural considerations that make machine learning experiments reproducible, scalable, and production-ready.  

We will use **real-world housing price data** from **California (USA)** and **Ontario (Canada)** as our case study.  
The goal is to not only understand how Linear Regression works, but also how to **design and implement a machine learning project** from sourcing data → building models → structuring code → preparing for deployment.  

The workshop will be completed in **two 2-hour sessions**, with **homework assignments** to be completed before each class.  

---

## Workshop Structure

### 📚 Session 1 – Univariate Linear Regression
- **Lecture focus**: Mathematical intuition, model formulation, gradient descent, cost function, evaluation metrics.  
- **Practical focus**: Implementing Univariate Linear Regression from scratch + using `scikit-learn`.  
- **Homework before class**: Data sourcing (from CSV, APIs, and relational databases).  

### ⚙️ Session 2 – Machine Learning Operations (MLOps)
- **Lecture focus**: Code modularity, reproducibility, experiment tracking, design patterns in ML architecture.  
- **Practical focus**: Architecting the project with pipelines, config management, and modular scripts.  
- **Homework before class**: Refactor previous Linear Regression code into modular, production-ready format.  

---

## Instructions for Students

### 🔹 Before Session 1: Data Sourcing

Your first task is to collect **housing price data** for California and Ontario.  
You must experiment with **at least three different types of data sources**:

1. **CSV Files**  
   - Find open housing datasets (e.g., Kaggle, UCI ML Repository, government portals).  
   - Example: [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).  
   - Save datasets in `data/raw/` folder.  

2. **Web Services (APIs)**  
   - Explore free APIs offering housing, rental, or real-estate data.  
   - Example APIs:  
     - Zillow (unofficial APIs exist, check docs)  
     - Realtor.ca data endpoints  
     - [City of Toronto Open Data API](https://open.toronto.ca/)  
     - [California State Open Data Portal](https://data.ca.gov/).  
   - Use Python packages like `requests` or `httpx` to fetch data.  
   - Save results into structured JSON or convert to DataFrames.  

3. **Relational Databases**  
   - Connect to a **PostgreSQL** or **MySQL** demo database.  
   - Option 1: Use hosted databases with sample housing/economic data.  
   - Option 2: Load CSVs into a local database (e.g., PostgreSQL with `psql` or SQLite for portability).  
   - Connect from Python using `sqlalchemy` or `psycopg2`.  
   - Run SQL queries to filter/select data.  

💡 **Deliverable before Session 1**:  
- A Jupyter Notebook that loads housing price data from all three sources (CSV, API, Database) and explores it with basic descriptive statistics and plots.  
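
As a minimal sketch of Option 2 in the relational database task above, the snippet below loads CSV text into an in-memory SQLite database and filters it with SQL. The table name `housing`, the column names, and the three sample rows are hypothetical placeholders, not real housing data; in practice you would read a file from `data/raw/` instead.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for a file in data/raw/ (not real data).
SAMPLE_CSV = """region,median_income,median_price
A,3.5,250000
B,5.1,410000
C,2.2,180000
"""

def load_csv_into_sqlite(csv_text, conn):
    """Create a `housing` table and insert every CSV row; return the row count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        "CREATE TABLE housing (region TEXT, median_income REAL, median_price REAL)"
    )
    # Named placeholders are filled from each row dictionary.
    conn.executemany(
        "INSERT INTO housing VALUES (:region, :median_income, :median_price)",
        rows,
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
n_rows = load_csv_into_sqlite(SAMPLE_CSV, conn)

# Filter with SQL: regions whose median income exceeds 3.0.
selected = conn.execute(
    "SELECT region, median_price FROM housing "
    "WHERE median_income > ? ORDER BY region",
    (3.0,),
).fetchall()
print(n_rows, selected)
```

SQLite keeps the example portable; the same query pattern should carry over to PostgreSQL through `sqlalchemy` or `psycopg2`, although the placeholder syntax differs between drivers.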

---

### 🔹 During Session 1: Univariate Linear Regression Experiment

1. **Define the Problem**  
   - Select one feature (e.g., median income, number of rooms, lot size) to predict housing price.  

2. **Preprocess Data**  
   - Handle missing values.  
   - Normalize/standardize features.  
   - Split data into **train/test sets**.  

3. **Model Implementation**  
   - Implement Linear Regression **from scratch**:  
     - Hypothesis function $ h_\theta(x) = \theta_0 + \theta_1 x $  
     - Cost function (MSE)  
     - Gradient descent update rule  
   - Implement Linear Regression **using scikit-learn** for comparison.  

4. **Model Evaluation**  
   - Compute RMSE, MAE, and $ R^2 $ score.  
   - Visualize regression line vs. data points.  

💡 **Deliverable during Session 1**:  
- A working notebook with both a manual and `scikit-learn` Linear Regression implementation.  
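
The from-scratch pieces listed in step 3 above can be sketched with NumPy as follows. This is an illustration, not a reference solution: it assumes the common cost convention $ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2 $, uses synthetic noise-free data, and the function names, learning rate, and iteration count are hypothetical choices. `np.polyfit` stands in for the `scikit-learn` comparison so the block needs no extra dependency.

```python
import numpy as np

def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Half mean squared error (the 1/(2m) convention is an assumption)."""
    residual = predict(theta0, theta1, x) - y
    return float(np.mean(residual ** 2) / 2.0)

def gradient_descent(x, y, learning_rate=0.5, iterations=2000):
    """Batch gradient descent with simultaneous updates of both parameters."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        residual = predict(theta0, theta1, x) - y
        grad0 = np.mean(residual)        # dJ/dtheta0
        grad1 = np.mean(residual * x)    # dJ/dtheta1
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Synthetic, noise-free data generated from y = 2 + 3x (hypothetical).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 3.0 * x

theta0, theta1 = gradient_descent(x, y)
final_cost = cost(theta0, theta1, x, y)

# Closed-form least-squares fit as a reference; polyfit returns [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x, y, 1)
print(theta0, theta1, final_cost)
```

The feature here already lies in [0, 1]. With unscaled features the same learning rate can diverge, which is one reason step 2 asks for normalization.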

---

# Feature Selection

The work-period index (`work_period`) was selected as the independent variable, and the mean and peak force values (`mean_value`, `peak_value`) as the dependent variables.
The number of rooms is a fundamental housing characteristic with a clear and interpretable relationship to price. Using this feature supports the assumptions of univariate linear regression and keeps the model simple and easy to explain.

## Import Data
Import the data from the Neon PostgreSQL database.

Connect String: postgresql://neondb_owner:npg_Sh8bV3HjZvkd@ep-plain-scene-ahmzh8by-pooler.c-3.us-east-1.aws.neon.tech/neondb?sslmode=require&channel_binding=require

Table: robot_data

In [2]:
import os
import yaml

# If the notebook is inside notebooks/, go back to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")


# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

from src.db_export import export_postgres_table_to_csv

# Read database config
connstr = config["database"]["connstr"]
table_name = config["database"]["source_table"]

# Read output path from config
output_csv = config["paths"]["raw_csv"]

# Export data from Neon PostgreSQL to CSV
export_postgres_table_to_csv(
    connstr=connstr,
    table_name=table_name,
    output_csv=output_csv,
)

✅ Exported table 'robot_data' to data/raw/RMBR4-2_export_test.csv


# Preprocess Data

Using the force values measured on each robot axis, I analyze the robot's time-series data, which consists of alternating work periods and rest periods. The goal is to segment the data into work and rest periods based on the force signal. Each identified work period is assigned a sequential index, and this work-period number is used as the independent variable. For each work period, the mean force and peak force within that period are computed as the dependent variables.

Next, I group the work periods into detection intervals, where each interval contains 10 consecutive work periods. Within each detection interval, I regress each dependent variable on the work-period index:

- If the mean force shows a statistically significant upward trend, it suggests the robot may be experiencing system aging or gradual degradation.
- If the peak force shows a statistically significant upward trend, it suggests the robot may be at risk of an urgent or imminent failure.

The raw dataset is located at:
data/raw/RMBR4-2_export_test.csv

I will transform it into a summarized table with the following fields:

- `work_period`
- `mean_value`
- `peak_value`
- `period_start_time` (start time of the work period)
- `period_end_time` (end time of the work period)

Finally, I will export the resulting table to:
data/raw/RMBR4-2_export_test_1.csv
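
The segmentation step described in this section could look like the sketch below. It assumes that a fixed force threshold separates work from rest, which may differ from what `src/preprocessing.py` actually does; the function name, the threshold, and the sample signal are hypothetical.

```python
import numpy as np

def summarize_work_periods(force, threshold):
    """Treat each contiguous run with force > threshold as one work period.

    Returns a list of (work_period, mean_value, peak_value) tuples,
    with work_period numbered from 1 in time order.
    """
    force = np.asarray(force, dtype=float)
    active = force > threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], active, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # first index of each run
    ends = np.flatnonzero(edges == -1)     # one past the last index of each run
    summary = []
    for work_period, (start, end) in enumerate(zip(starts, ends), start=1):
        segment = force[start:end]
        summary.append((work_period, float(segment.mean()), float(segment.max())))
    return summary

# Hypothetical force samples: two bursts of work separated by rest.
force_signal = [0.1, 0.2, 3.0, 5.0, 4.0, 0.1, 0.0, 2.0, 6.0, 0.2]
periods = summarize_work_periods(force_signal, threshold=1.0)
print(periods)
```

A real signal is noisier, so a production version would likely smooth the signal or require a minimum run length before accepting a work period.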

In [9]:
# ------------------------------------------------------------
# Step 1: Build work-period summary from raw time-series data
# raw_csv -> period_csv
# ------------------------------------------------------------
from src.preprocessing import run_preprocessing_pipeline
period_df, _preprocessed_df = run_preprocessing_pipeline(config)

print("✅ Step 1 done. Saved period summary to:", config["paths"]["period_csv"])
print("Period rows:", len(period_df))

# ---- Inspect result ----
print("\n📌 Period CSV columns:")
print(period_df.columns.tolist())

print("\n📌 First 5 rows of period summary:")
display(period_df.head())

✅ Step 1 done. Saved period summary to: data/raw/RMBR4-2_export_test_1.csv
Period rows: 858

📌 Period CSV columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time']

📌 First 5 rows of period summary:

| | work_period | mean_value | peak_value | period_start_time | period_end_time |
|---|---|---|---|---|---|
| 0 | 1 | 3.302866 | 8.908955 | 2022-10-17 12:19:22.005000+00:00 | 2022-10-17 12:19:55.136000+00:00 |
| 1 | 2 | 3.449832 | 6.300471 | 2022-10-17 12:20:56.771000+00:00 | 2022-10-17 12:21:31.056000+00:00 |
| 2 | 3 | 3.491427 | 6.707684 | 2022-10-17 12:21:46.588000+00:00 | 2022-10-17 12:22:21.050000+00:00 |
| 3 | 4 | 3.255471 | 5.87491 | 2022-10-17 12:22:45.266000+00:00 | 2022-10-17 12:23:17.894000+00:00 |
| 4 | 5 | 2.881057 | 6.632019 | 2022-10-17 12:23:49.839000+00:00 | 2022-10-17 12:24:22.332000+00:00 |


In [4]:
# ------------------------------------------------------------
# Step 2: Split the period-level table into TRAIN/TEST
# period_csv -> preprocessed_train_csv & preprocessed_test_csv
# (Optional) creates mean_value_z / peak_value_z using TRAIN stats
# ------------------------------------------------------------
import pandas as pd

from src.splitter import split_period_csv_to_train_test

out = split_period_csv_to_train_test(config)

print("✅ Step 2 done.")
print(" - TRAIN:", out["preprocessed_train_csv"])
print(" - TEST :", out["preprocessed_test_csv"])

train_df = pd.read_csv(out["preprocessed_train_csv"])
test_df  = pd.read_csv(out["preprocessed_test_csv"])

# ---- Inspect TRAIN ----
print("\n📌 TRAIN columns:")
print(train_df.columns.tolist())

print("\n📌 TRAIN first 5 rows:")
display(train_df.head())

# ---- Inspect TEST ----
print("\n📌 TEST columns:")
print(test_df.columns.tolist())

print("\n📌 TEST first 5 rows:")
display(test_df.head())

✅ Step 2 done.
 - TRAIN: data/preprocessed/RMBR4-2_export_preprocessed_train.csv
 - TEST : data/preprocessed/RMBR4-2_export_preprocessed_test.csv

📌 TRAIN columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time', 'interval_id', 'mean_value_z', 'peak_value_z']

📌 TRAIN first 5 rows:

| | work_period | mean_value | peak_value | period_start_time | period_end_time | interval_id | mean_value_z | peak_value_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 3.586541 | 8.039739 | 2022-10-17 12:35:51.838000+00:00 | 2022-10-17 12:36:25.716000+00:00 | 2 | 0.979374 | 0.741561 |
| 1 | 12 | 3.764583 | 7.634454 | 2022-10-17 12:36:48.315000+00:00 | 2022-10-17 12:37:23.086000+00:00 | 2 | 1.565188 | 0.409189 |
| 2 | 13 | 2.908206 | 6.261834 | 2022-10-17 12:38:03.164000+00:00 | 2022-10-17 12:38:32.099000+00:00 | 2 | -1.252553 | -0.716489 |
| 3 | 14 | 3.188224 | 6.185474 | 2022-10-17 12:44:50.740000+00:00 | 2022-10-17 12:45:25.360000+00:00 | 2 | -0.331209 | -0.779112 |
| 4 | 15 | 3.479667 | 6.995066 | 2022-10-17 12:45:46.216000+00:00 | 2022-10-17 12:46:19.107000+00:00 | 2 | 0.627726 | -0.11517 |

📌 TEST columns:
['work_period', 'mean_value', 'peak_value', 'period_start_time', 'period_end_time', 'interval_id', 'mean_value_z', 'peak_value_z']

📌 TEST first 5 rows:

| | work_period | mean_value | peak_value | period_start_time | period_end_time | interval_id | mean_value_z | peak_value_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3.302866 | 8.908955 | 2022-10-17 12:19:22.005000+00:00 | 2022-10-17 12:19:55.136000+00:00 | 1 | 0.046 | 1.4544 |
| 1 | 2 | 3.449832 | 6.300471 | 2022-10-17 12:20:56.771000+00:00 | 2022-10-17 12:21:31.056000+00:00 | 1 | 0.52956 | -0.684803 |
| 2 | 3 | 3.491427 | 6.707684 | 2022-10-17 12:21:46.588000+00:00 | 2022-10-17 12:22:21.050000+00:00 | 1 | 0.666422 | -0.35085 |
| 3 | 4 | 3.255471 | 5.87491 | 2022-10-17 12:22:45.266000+00:00 | 2022-10-17 12:23:17.894000+00:00 | 1 | -0.109944 | -1.033803 |
| 4 | 5 | 2.881057 | 6.632019 | 2022-10-17 12:23:49.839000+00:00 | 2022-10-17 12:24:22.332000+00:00 | 1 | -1.341881 | -0.412903 |
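
The outputs above suggest that each interval spans 10 consecutive work periods (periods 1 to 5 carry `interval_id` 1, periods 11 to 15 carry `interval_id` 2) and that the `_z` columns are standardized with TRAIN statistics. A minimal sketch of both steps follows. The helper names, the population standard deviation (`ddof=0`), and the sample values are assumptions; the actual `src/splitter.py` may differ.

```python
import numpy as np

def assign_interval_id(work_period, periods_per_interval=10):
    """Map 1-based work-period numbers to 1-based interval ids."""
    return (np.asarray(work_period) - 1) // periods_per_interval + 1

def fit_standardizer(train_values):
    """Return the mean and standard deviation of the TRAIN values only."""
    values = np.asarray(train_values, dtype=float)
    return float(values.mean()), float(values.std(ddof=0))

def standardize(values, mean, std):
    """Apply a previously fitted mean and std to any split (TRAIN or TEST)."""
    return (np.asarray(values, dtype=float) - mean) / std

interval_ids = assign_interval_id([1, 10, 11, 20, 21])

# Hypothetical force values, chosen only to keep the arithmetic easy to follow.
train_mean_values = [2.0, 4.0, 6.0]
test_mean_values = [4.0, 6.0]

train_mean, train_std = fit_standardizer(train_mean_values)
train_z = standardize(train_mean_values, train_mean, train_std)
test_z = standardize(test_mean_values, train_mean, train_std)  # reuses TRAIN stats
print(interval_ids, train_z, test_z)
```

Fitting the statistics on TRAIN and reusing them on TEST keeps information from the test split out of the preprocessing.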


# Model Implementation

Input: data/preprocessed/RMBR4-2_export_preprocessed_train.csv

For each detection interval (`interval_id`), perform linear regression using:

- Independent variable (X): `work_period`
- Dependent variables (y): `mean_value_z` and `peak_value_z`

Apply two implementations for comparison:

- From scratch (gradient descent): estimate `theta0` and `theta1`
- scikit-learn: estimate `theta0` and `theta1`

Aggregate the results into a new table keyed by `interval_id`, with one row per interval.

Output: data/models/interval_theta_table.csv

Note: Each interval is fit for two targets (mean and peak) by two implementations, each yielding two parameters, which produces eight parameter columns:

- `scratch_mean_theta0`, `scratch_mean_theta1`
- `scratch_peak_theta0`, `scratch_peak_theta1`
- `sklearn_mean_theta0`, `sklearn_mean_theta1`
- `sklearn_peak_theta0`, `sklearn_peak_theta1`
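
A sketch of the per-interval fit described above, under stated assumptions. With raw `work_period` values in the tens or hundreds, plain gradient descent at a learning rate such as 0.05 diverges, so this sketch standardizes X internally and maps the parameters back to the raw scale; whether `src/model.py` does the same is an assumption. `np.polyfit` is used as a stand-in for scikit-learn, and the function names, dictionary keys, and synthetic targets are hypothetical.

```python
import numpy as np

def fit_gd_scaled(x, y, learning_rate=0.05, iterations=3000):
    """Gradient descent on standardized x; returns (theta0, theta1) on raw x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, sigma = x.mean(), x.std()
    xs = (x - mu) / sigma                 # zero mean, unit variance
    a, b = 0.0, 0.0                       # model in scaled space: y = a + b * xs
    for _ in range(iterations):
        residual = a + b * xs - y
        a -= learning_rate * residual.mean()
        b -= learning_rate * (residual * xs).mean()
    theta1 = b / sigma                    # undo the scaling of the slope
    theta0 = a - theta1 * mu              # undo the centering of the intercept
    return float(theta0), float(theta1)

def fit_by_interval(interval_id, x, y):
    """Fit one line per interval and keep both estimates side by side."""
    interval_id = np.asarray(interval_id)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    table = {}
    for iid in np.unique(interval_id):
        mask = interval_id == iid
        theta0, theta1 = fit_gd_scaled(x[mask], y[mask])
        slope, intercept = np.polyfit(x[mask], y[mask], 1)   # reference fit
        table[int(iid)] = {
            "scratch_theta0": theta0, "scratch_theta1": theta1,
            "reference_theta0": float(intercept), "reference_theta1": float(slope),
        }
    return table

# Synthetic example: two intervals of 10 work periods with known lines.
work_period = np.concatenate([np.arange(11, 21), np.arange(31, 41)])
interval_id = np.repeat([2, 4], 10)
target = np.where(interval_id == 2, -1.0 + 0.1 * work_period, 3.5 - 0.09 * work_period)

table = fit_by_interval(interval_id, work_period, target)
print(table)
```

Standardizing X makes the two gradient directions independent, so the iteration converges at the same rate for every interval regardless of where its work-period numbers start.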

In [5]:
# ------------------------------------------------------------
# Step 3: Model Implementation (TRAIN only)
# For each interval_id, fit linear regression:
#   X = work_period
#   y = mean_value_z and peak_value_z
# Two implementations:
#   (1) From scratch (Gradient Descent) -> theta0, theta1
#   (2) scikit-learn LinearRegression   -> theta0, theta1
# Output:
#   data/models/interval_theta_table.csv
# ------------------------------------------------------------

import os
import yaml
import pandas as pd

# If the notebook is inside /notebooks, go one level up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# IMPORTANT: Train-only input
# Your model.py reads config["paths"]["preprocessed_train_csv"]
# and writes config["paths"]["theta_table_csv"]
from src.model import build_interval_theta_table

theta_df = build_interval_theta_table(config)

print("✅ Step 3 done (TRAIN only).")
print(" - Input :", config["paths"]["preprocessed_train_csv"])
print(" - Output:", config["paths"]["theta_table_csv"])

# Inspect result
print("\n📌 interval_theta_table columns:")
print(theta_df.columns.tolist())

print("\n📌 interval_theta_table first 5 rows:")
display(theta_df.head())


✅ Step 3 done (TRAIN only).
 - Input : data/preprocessed/RMBR4-2_export_preprocessed_train.csv
 - Output: data/models/interval_theta_table.csv

📌 interval_theta_table columns:
['interval_id', 'start_work_period', 'end_work_period', 'n_periods', 'scratch_mean_theta0', 'scratch_mean_theta1', 'scratch_peak_theta0', 'scratch_peak_theta1', 'sklearn_mean_theta0', 'sklearn_mean_theta1', 'sklearn_peak_theta0', 'sklearn_peak_theta1', 'learning_rate', 'iterations', 'target_space']

📌 interval_theta_table first 5 rows:

| | interval_id | start_work_period | end_work_period | n_periods | scratch_mean_theta0 | scratch_mean_theta1 | scratch_peak_theta0 | scratch_peak_theta1 | sklearn_mean_theta0 | sklearn_mean_theta1 | sklearn_peak_theta0 | sklearn_peak_theta1 | learning_rate | iterations | target_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 11 | 20 | 10 | -0.890499 | 0.10011 | -0.975986 | 0.07542 | -0.890499 | 0.10011 | -0.975986 | 0.07542 | 0.05 | 3000 | z |
| 1 | 4 | 31 | 40 | 10 | 3.687289 | -0.092567 | 5.146264 | -0.12838 | 3.687289 | -0.092567 | 5.146264 | -0.12838 | 0.05 | 3000 | z |
| 2 | 6 | 51 | 60 | 10 | -7.934994 | 0.149576 | -6.466034 | 0.124693 | -7.934994 | 0.149576 | -6.466034 | 0.124693 | 0.05 | 3000 | z |
| 3 | 7 | 61 | 70 | 10 | -1.38133 | 0.026113 | -3.10991 | 0.051936 | -1.38133 | 0.026113 | -3.10991 | 0.051936 | 0.05 | 3000 | z |
| 4 | 8 | 71 | 80 | 10 | 8.836954 | -0.118015 | 15.307592 | -0.196919 | 8.836954 | -0.118015 | 15.307592 | -0.196919 | 0.05 | 3000 | z |


# Model Evaluation



In this step, the performance of the linear regression models is evaluated **using the training dataset only**, in order to assess how well the models fit the historical data without introducing information from the test set.

For each detection interval (`interval_id`), linear regression models are evaluated with:

- **Independent variable (X):** work period index (`work_period`)
- **Dependent variables (y):**
  - `mean_value_z` (average force, indicating long-term system aging)
  - `peak_value_z` (peak force, indicating potential imminent failure)

Two regression approaches are compared:

1. **From-scratch linear regression**, implemented using gradient descent.
2. **scikit-learn linear regression**, used as a reference implementation.

### Evaluation Metrics

For each interval and each dependent variable, the following metrics are computed:

- **Root Mean Squared Error (RMSE):** measures the overall prediction error magnitude.
- **Mean Absolute Error (MAE):** measures the average absolute deviation between predictions and true values.
- **Coefficient of Determination (R²):** measures the proportion of variance explained by the model.

These metrics provide a quantitative comparison between the manually implemented model and the scikit-learn implementation.
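
The three metrics above follow directly from the residuals. The helper below is a minimal NumPy sketch using the standard definitions; the function name and the sample values are hypothetical, and `src/evaluation.py` may compute the metrics differently (for example through `sklearn.metrics`).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    mae = float(np.mean(np.abs(residual)))
    ss_res = float(np.sum(residual ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical values: three exact predictions and one that misses by 2.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)
```

In this example the single large miss gives an RMSE of 1.0 against an MAE of 0.5, which illustrates how RMSE penalizes large errors more heavily than MAE.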

### Visualization

To further support the evaluation, regression results are visualized for each detection interval:

- The original data points are plotted as scatter points.
- The regression line obtained from the **from-scratch implementation** is plotted as a solid line.
- The regression line obtained from the **scikit-learn implementation** is plotted as a dashed line.

These plots allow for a qualitative comparison of the two models and help verify that the gradient descent implementation converges to a similar solution as the scikit-learn model.

### Output Artifacts

The evaluation produces the following outputs:

- An evaluated parameter table containing all regression coefficients and evaluation metrics:

data/models/interval_theta_table_evaluated.csv

- Regression plots for each interval and each target variable:

data/models/plots/

This evaluation step confirms the correctness of the manual linear regression implementation and provides a reliable baseline for subsequent fault detection using the test dataset.

In [6]:
# ------------------------------------------------------------
# Step 4: Model Evaluation (TRAIN only)
#
# For each detection interval (interval_id):
#   - Compute RMSE, MAE, R^2
#   - Compare:
#       * From-scratch Linear Regression
#       * scikit-learn Linear Regression
#   - Visualize regression line vs. data points
#
# Input:
#   data/preprocessed/RMBR4-2_export_preprocessed_train.csv
#   data/models/interval_theta_table.csv
#
# Output:
#   data/models/interval_theta_table_evaluated.csv
#   data/models/plots/*.png
# ------------------------------------------------------------

import os
import yaml
import pandas as pd

# If notebook is in /notebooks, move to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")

# Load config
with open("configs/experiment_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

from src.evaluation import evaluate_all_intervals

# Run evaluation (TRAIN only)
evaluated_df = evaluate_all_intervals(config)

print("✅ Step 4 done (TRAIN only).")
print(" - Evaluated theta table saved to:")
print("   ", config["paths"]["evaluated_csv"])
print(" - Regression plots saved to:")
print("   ", os.path.join(config["paths"]["models_dir"], "plots"))

# Inspect results
print("\n📌 Evaluated table columns:")
print(evaluated_df.columns.tolist())

print("\n📌 Evaluated table first 5 rows:")
display(evaluated_df.head())

✅ Step 4 done (TRAIN only).
 - Evaluated theta table saved to:
    data/models/interval_theta_table_evaluated.csv
 - Regression plots saved to:
    data/models\plots

📌 Evaluated table columns:
['interval_id', 'start_work_period', 'end_work_period', 'n_periods', 'scratch_mean_theta0', 'scratch_mean_theta1', 'scratch_peak_theta0', 'scratch_peak_theta1', 'sklearn_mean_theta0', 'sklearn_mean_theta1', 'sklearn_peak_theta0', 'sklearn_peak_theta1', 'learning_rate', 'iterations', 'target_space', 'scratch_mean_rmse', 'scratch_mean_mae', 'scratch_mean_r2', 'sklearn_mean_rmse', 'sklearn_mean_mae', 'sklearn_mean_r2', 'scratch_peak_rmse', 'scratch_peak_mae', 'scratch_peak_r2', 'sklearn_peak_rmse', 'sklearn_peak_mae', 'sklearn_peak_r2']

📌 Evaluated table first 5 rows:

| | interval_id | start_work_period | end_work_period | n_periods | scratch_mean_theta0 | scratch_mean_theta1 | scratch_peak_theta0 | scratch_peak_theta1 | sklearn_mean_theta0 | sklearn_mean_theta1 | ... | scratch_mean_r2 | sklearn_mean_rmse | sklearn_mean_mae | sklearn_mean_r2 | scratch_peak_rmse | scratch_peak_mae | scratch_peak_r2 | sklearn_peak_rmse | sklearn_peak_mae | sklearn_peak_r2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 11 | 20 | 10 | -0.890499 | 0.10011 | -0.975986 | 0.07542 | -0.890499 | 0.10011 | ... | 0.110413 | 0.816184 | 0.623497 | 0.110413 | 0.52705 | 0.438581 | 0.144521 | 0.52705 | 0.438581 | 0.144521 |
| 1 | 4 | 31 | 40 | 10 | 3.687289 | -0.092567 | 5.146264 | -0.12838 | 3.687289 | -0.092567 | ... | 0.066141 | 0.999061 | 0.799798 | 0.066141 | 1.339267 | 1.074924 | 0.070466 | 1.339267 | 1.074924 | 0.070466 |
| 2 | 6 | 51 | 60 | 10 | -7.934994 | 0.149576 | -6.466034 | 0.124693 | -7.934994 | 0.149576 | ... | 0.229123 | 0.788036 | 0.709481 | 0.229123 | 0.71132 | 0.507041 | 0.202244 | 0.71132 | 0.507041 | 0.202244 |
| 3 | 7 | 61 | 70 | 10 | -1.38133 | 0.026113 | -3.10991 | 0.051936 | -1.38133 | 0.026113 | ... | 0.014724 | 0.613534 | 0.532065 | 0.014724 | 0.834637 | 0.739091 | 0.030956 | 0.834637 | 0.739091 | 0.030956 |
| 4 | 8 | 71 | 80 | 10 | 8.836954 | -0.118015 | 15.307592 | -0.196919 | 8.836954 | -0.118015 | ... | 0.119751 | 0.919026 | 0.699048 | 0.119751 | 1.255306 | 1.008789 | 0.168756 | 1.255306 | 1.008789 | 0.168756 |


### 🔹 After Session 1 (Homework)

- Refactor your notebook into **modular Python scripts**:  
  - `data_loader.py` – functions to load data from CSV, API, and DB.  
  - `preprocessing.py` – cleaning, normalization, train/test split.  
  - `model.py` – regression model implementations.  
  - `evaluation.py` – metrics, plots, reporting.  
- Ensure each module can run independently.  

💡 This will prepare you for **Session 2 (MLOps)**.  

---

### 🔹 Before Session 2: Preparing for MLOps

- Replicate the structure, files and resources that you developed during the **DataStreamVisualization_Workshop**
- Use it to organize this project into a folder structure like:

```txt
linear_regression_project/
│── data/
│   ├── raw/
│   ├── processed/
│── notebooks/
│   ├── EDA.ipynb
│   ├── linear_regression.ipynb
│── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── model.py
│   ├── evaluation.py
│── configs/
│   ├── experiment_config.yaml
│── experiments/
│   ├── results.csv
│── requirements.txt
│── README.md
```

* Create a **YAML config file** with parameters:

  * Data source path/API endpoint/DB connection string
  * Learning rate, iterations, train/test split ratio
  * Feature to use as predictor

* Document how to run your scripts step-by-step.
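
A config-driven project benefits from failing early when a key is missing. The sketch below checks a loaded config dictionary (the kind of object `yaml.safe_load` returns) against a list of required keys. The `database` and `paths` keys mirror those read by the notebook cells in this document; the helper name, the placeholder connection string, and the idea of validating at startup are assumptions rather than part of the original project.

```python
# Keys read by the notebook cells in this document, grouped by section.
REQUIRED_KEYS = {
    "database": ["connstr", "source_table"],
    "paths": [
        "raw_csv", "period_csv", "preprocessed_train_csv",
        "preprocessed_test_csv", "theta_table_csv", "evaluated_csv", "models_dir",
    ],
}

def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return dotted names such as 'paths.raw_csv' for every absent key."""
    missing = []
    for section, keys in required.items():
        block = config.get(section) or {}
        for key in keys:
            if key not in block:
                missing.append(f"{section}.{key}")
    return missing

# Stand-in for the parsed YAML file; the connection string is a placeholder.
complete_config = {
    "database": {
        "connstr": "postgresql://user:password@host/dbname",
        "source_table": "robot_data",
    },
    "paths": {
        "raw_csv": "data/raw/RMBR4-2_export_test.csv",
        "period_csv": "data/raw/RMBR4-2_export_test_1.csv",
        "preprocessed_train_csv": "data/preprocessed/RMBR4-2_export_preprocessed_train.csv",
        "preprocessed_test_csv": "data/preprocessed/RMBR4-2_export_preprocessed_test.csv",
        "theta_table_csv": "data/models/interval_theta_table.csv",
        "evaluated_csv": "data/models/interval_theta_table_evaluated.csv",
        "models_dir": "data/models",
    },
}

# Same config with one path removed, to show what the check reports.
incomplete_config = {
    "database": dict(complete_config["database"]),
    "paths": {k: v for k, v in complete_config["paths"].items() if k != "theta_table_csv"},
}

print(find_missing_keys(complete_config))
print(find_missing_keys(incomplete_config))
```

Keeping credentials such as the connection string out of version control, for example by reading them from an environment variable, is a common complement to this pattern.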

---

### 🔹 During Session 2: MLOps Architecture

* Apply the **Robot PM MLOps design patterns**:

  * **Separation of concerns**: Each module is independent.
  * **Configuration-driven**: Experiments are parameterized by configs, not hard-coded values.
  * **Experiment tracking**: Save model performance metrics in `experiments/results.csv`.
  * **Reproducibility**: Ensure anyone can re-run your experiment with the same results.

* Discuss:

  * Why modularity matters for ML projects.
  * How config management avoids errors in scaling ML experiments.
  * How this workflow connects to real-world ML pipelines.

💡 **Deliverable during Session 2**:

* A structured project with modular code, configs, and experiment tracking.
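
One simple way to realize the experiment-tracking pattern is to append one row per run to a CSV file. The sketch below is an assumption about how that could be done, not the project's actual tracker: the function name, the column names, and the metric values are hypothetical, and it writes to a temporary directory rather than the real `experiments/results.csv`.

```python
import csv
import os
import tempfile

def append_result(path, row, fieldnames):
    """Append one run to a CSV log, writing the header only for a new file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

FIELDS = ["run_id", "learning_rate", "iterations", "rmse"]

with tempfile.TemporaryDirectory() as tmp:
    results_path = os.path.join(tmp, "experiments", "results.csv")
    # Hypothetical runs; the metric values are placeholders, not real results.
    append_result(results_path, {"run_id": "run-1", "learning_rate": 0.05,
                                 "iterations": 3000, "rmse": 0.9}, FIELDS)
    append_result(results_path, {"run_id": "run-2", "learning_rate": 0.01,
                                 "iterations": 3000, "rmse": 1.1}, FIELDS)
    with open(results_path, newline="", encoding="utf-8") as f:
        logged = list(csv.DictReader(f))

print(logged)
```

Logging the hyperparameters next to the metrics is what makes a row reproducible: the same config values can be fed back into the pipeline to re-run that experiment.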

---

## Alert Detection on Test Data

In this step, alert thresholds are derived from the training data, and the trained regression model is applied to the test data to identify abnormal detection intervals.

The alerting process is divided into two stages:

1. Threshold Estimation (from TRAIN data)
2. Alert Generation (on TEST data)

### Derive Alert Thresholds from Training Data

We first compute robust alert thresholds using the training period-level data and the interval-level regression slopes obtained during model training.

The following thresholds are estimated:

- Mean force level threshold
- Peak force level threshold
- Mean force slope threshold
- Peak force slope threshold

All thresholds are computed using median + k × MAD, ensuring robustness to outliers.
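
The threshold rule can be sketched in a few lines. This version assumes that MAD means the unscaled median absolute deviation and uses `k = 2.5` as a default; the function name and the sample values are hypothetical, and `src/thresholds.py` may scale or compute the MAD differently.

```python
import numpy as np

def robust_threshold(values, k=2.5):
    """Return median + k * MAD, with MAD = median(|x - median(x)|)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return float(median + k * mad)

# Hypothetical values with one extreme outlier.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
threshold = robust_threshold(sample)
naive = float(np.mean(sample) + 2.5 * np.std(sample))
print(threshold, naive)
```

Here the median is 3 and the MAD is 1, so the threshold is 5.5 and the outlier at 100 is flagged. A mean-plus-standard-deviation rule on the same data is pulled upward by the outlier it is supposed to detect.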

In [7]:
from src.thresholds import fit_thresholds_on_train
import pandas as pd

# ---------------------------------------------------------
# STEP 1: (Optional sanity check)
# Load the trained interval-level theta table.
# This table was generated during model training (TRAIN only)
# and contains regression slopes for each interval.
# ---------------------------------------------------------
theta_df = pd.read_csv(config["paths"]["theta_table_csv"])

# ---------------------------------------------------------
# STEP 2: Derive robust alert thresholds using TRAIN data only
#
# This function:
#   - Loads the preprocessed TRAIN period-level CSV
#   - Loads the TRAIN interval-level regression slopes (theta table)
#   - Computes robust thresholds using:
#         threshold = median + k * MAD
#   - Saves all thresholds to a single-row CSV file
#   - Returns the thresholds as a dictionary
#
# IMPORTANT:
#   - Only TRAIN data is used (prevents data leakage)
#   - The same thresholds will later be applied to TEST data
# ---------------------------------------------------------
thresholds = fit_thresholds_on_train(config)

# Display the computed thresholds
thresholds


{'mean_alert_threshold': 4.007271646191534,
 'peak_alert_threshold': 9.800767286249997,
 'mean_slope_threshold': 0.22846773206773197,
 'peak_slope_threshold': 0.23366951831779592,
 'slope_source': 'sklearn',
 'k': 2.5,
 'mean_slope_col': 'sklearn_mean_theta1',
 'peak_slope_col': 'sklearn_peak_theta1',
 'train_csv': 'data/preprocessed/RMBR4-2_export_preprocessed_train.csv',
 'theta_csv': 'data/models/interval_theta_table.csv'}

### Generate Alerts on Test Data

Next, the alert logic is applied to the test dataset.
Only detection intervals that satisfy both of the following conditions are reported as anomalies:

- Trend-based condition (slope exceeds threshold)
- Level-based condition (mean or peak exceeds threshold)

If no abnormal intervals are detected, an empty results file is generated and the system reports that no faults were detected.
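
The two-condition rule could be expressed as follows. The threshold keys match the dictionary returned by `fit_thresholds_on_train` earlier in this document, with values rounded from its output. How the mean and peak signals are combined inside each condition is an assumption of this sketch (either signal may trigger a condition), and the interval values and function name are hypothetical.

```python
# Thresholds rounded from the fit_thresholds_on_train output shown earlier.
THRESHOLDS = {
    "mean_alert_threshold": 4.007,
    "peak_alert_threshold": 9.801,
    "mean_slope_threshold": 0.228,
    "peak_slope_threshold": 0.234,
}

def is_anomalous(interval, thresholds=THRESHOLDS):
    """Flag an interval only when a trend condition AND a level condition hold."""
    trend = (
        interval["mean_slope"] > thresholds["mean_slope_threshold"]
        or interval["peak_slope"] > thresholds["peak_slope_threshold"]
    )
    level = (
        interval["mean_level"] > thresholds["mean_alert_threshold"]
        or interval["peak_level"] > thresholds["peak_alert_threshold"]
    )
    return trend and level

# Hypothetical interval summaries.
rising_but_low = {"mean_slope": 0.30, "peak_slope": 0.10,
                  "mean_level": 3.3, "peak_level": 8.0}
rising_and_high = {"mean_slope": 0.30, "peak_slope": 0.10,
                   "mean_level": 4.5, "peak_level": 8.0}

print(is_anomalous(rising_but_low), is_anomalous(rising_and_high))
```

Requiring both conditions trades sensitivity for fewer false alarms: a steep slope inside a normal force range, or a high but stable level, does not raise an alert on its own.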

In [8]:
from src.alerts import detect_alerts_on_test

# ---------------------------------------------------------
# STEP 2: Generate alerts on TEST data
#
# In this step, the alert detection logic is applied to
# the TEST dataset only.
#
# IMPORTANT:
#   - Thresholds were derived from TRAIN data in Step 1
#   - No statistics are re-computed on TEST data
#   - This prevents data leakage
# ---------------------------------------------------------

# ---------------------------------------------------------
# Ensure the TEST preprocessed period-level CSV is configured
# ---------------------------------------------------------
assert "preprocessed_test_csv" in config["paths"], "TEST CSV path missing from config"

# ---------------------------------------------------------
# Run alert detection on TEST data
#
# This function:
#   - Loads TEST period-level data
#   - Aggregates periods into interval-level summaries
#   - Loads TRAIN-derived alert thresholds
#   - Loads TRAIN regression slopes (theta table)
#   - Applies alert logic:
#       * Trend condition: |slope| > slope threshold
#       * Level condition: mean or peak > level threshold
#   - Reports only anomalous intervals
#
# If no anomalies are detected:
#   - An empty CSV file (headers only) is written
#   - A message is printed indicating no faults were found
# ---------------------------------------------------------
results_df = detect_alerts_on_test(config)

# Display detected anomalies (may be empty)
results_df


No anomalies detected. Wrote empty results file: experiments/results.csv


| | interval_id | start_work_period | end_work_period | interval_start_time | interval_end_time | predicted_failure_time | failure_type | alert_reason |
|---|---|---|---|---|---|---|---|---|


### 🔹 After Session 2: Extension & Homework

0. **Submission Format**  
   - This activity is **to be submitted individually**. Each student must create and manage their own project repository.

1. **Workshop Replication**  
   - This workshop is modeled on the structure, files, and resources used in the **DataStreamVisualization_Workshop**.  
   - Your submission must replicate this style of organization and completeness.  

2. **Repository Submission Instructions**  
   - Create a **remote Git repository** named:  
     ```
     LinearRegressionArchitecture_Workshop
     ```
   - Once your repository is ready, send your instructor an email with the subject line:  
     ```
     Linear Regression Architecture Workshop
     ```
   - In the body of the email, paste the **full URL of your repository**, making sure it ends with the `.git` extension.  
     - ✅ Correct example: `https://github.com/username/LinearRegressionArchitecture_Workshop.git`  
     - ❌ Incorrect example: `https://github.com/username/LinearRegressionArchitecture_Workshop`

3. **Repository Requirements**  
   Your repository must contain:  
   - A **frozen version of the codebase** (no further modifications after submission).  
   - A `requirements.txt` file that lists all dependencies required to run your project.  
   - A `README.md` file that:  
     - Displays the title: **Linear Regression Architecture Workshop**.  
     - Describes the work completed in the workshop.  
     - Summarizes key design decisions.  

4. **Notebook Updates (RobotPM_MLOps.ipynb)**  
   - Open the notebook `RobotPM_MLOps.ipynb`.  
   - Update it so that it highlights all changes made to the original project architecture and files.  
   - Specifically, reference the lists provided in the notebook:  
     - **Recommended Additions**  
     - **Recommended Enhancements**  
     - **Breakdown examples** (from both design breakdown sections).  

5. **Expectations for Notebook Updates**  
   - You are **not required to fully implement** the changes and updates at this stage.  
   - Instead, create all **placeholders, stubs, and structure** needed to prepare the project for a future code review.  
   - Think of this as **project scaffolding** for the upcoming implementation sprint cycle, which will be executed in a future project.  

💡 **Final Deliverable**:  
- A complete GitHub repository named `LinearRegressionArchitecture_Workshop` with the required structure, files, and documentation.  
- An updated `RobotPM_MLOps.ipynb` notebook showing how the project architecture was extended and prepared for enhancements.  
- Email submission to the instructor containing the `.git` repository URL.  
