# 🏡 Linear Regression Architecture Workshop

## Introduction

Welcome to the **Linear Regression Architecture Workshop**.  
This workshop is designed for college-level students learning both:

1. **Univariate Linear Regression** – a foundational algorithm in Machine Learning, focusing on predicting continuous values from a single feature.  
2. **Machine Learning Operations (MLOps)** – design patterns and architectural considerations that make machine learning experiments reproducible, scalable, and production-ready.  

We will use **robot performance data** from the **Data Streaming Visualization Workshop** as our case study.  
The goal is not only to understand how Linear Regression works, but also to learn how to **design and implement a machine learning project**: sourcing data → building models → structuring code → preparing for deployment.  

The workshop will be completed in **two 2-hour sessions**, with **homework assignments** to be completed before each class.  

---

## Workshop Structure

### 📚 Session 1 – Univariate Linear Regression
- **Lecture focus**: Mathematical intuition, model formulation, gradient descent, cost function, evaluation metrics.  
- **Practical focus**: Implementing Univariate Linear Regression from scratch + using `scikit-learn`.  
- **Homework before class**: Data sourcing (from CSV, APIs, and relational databases).  

### ⚙️ Session 2 – Machine Learning Operations (MLOps)
- **Lecture focus**: Code modularity, reproducibility, experiment tracking, design patterns in ML architecture.  
- **Practical focus**: Architecting the project with pipelines, config management, and modular scripts.  
- **Homework before class**: Refactor previous Linear Regression code into modular, production-ready format.  

---

## Instructions for Students

### 🔹 Before Session 1: Data Sourcing

Your first task is to reuse the code and data sets from the **Data Streaming Visualization Workshop**.
- Start with the provided CSV file and stream it into memory as a pandas DataFrame.
- Initialize a database on a cloud-based service like https://neon.tech/.
- Populate the database with the contents of the pandas DataFrame.
- Open a connection to the remote database.
- Stream the data into a dashboard and display the live stream.
- Add data sources from **at least three different robots**.

---

```python
# Install the required packages
%pip install -r requirements.txt

import os

import pandas as pd
from sqlalchemy import create_engine

# Database connection string. Do not commit credentials to a notebook or repository:
# read them from the environment instead. NEON_DATABASE_URL is an assumed variable
# name; its value has the form
# 'postgresql://<user>:<password>@<host>/neondb?sslmode=require'
conn_str = os.environ['NEON_DATABASE_URL']

# Read the CSV file
file_path = r'data/RMBR4-2_1.csv'
df = pd.read_csv(file_path)

# Data preprocessing: normalize column names and drop the unused axes
df.columns = [col.lower().replace(' ', '_').replace('#', '') for col in df.columns]
df = df.rename(columns={'time': 'recorded_at'})
cols_to_drop = ['axis_9', 'axis_10', 'axis_11', 'axis_12', 'axis_13', 'axis_14']
df = df.drop(columns=[col for col in cols_to_drop if col in df.columns])

# Connect to the database engine and upload the data
try:
    engine = create_engine(conn_str)

    print("Connecting to Neon and uploading data, please wait...")

    # Data ingestion
    # if_exists='replace': drops the table if it exists and creates a new one.
    # if_exists='append': adds new rows to the existing table.
    # Use 'append' if you have manually created the table with a specific schema.
    df.to_sql('robot_1', engine, if_exists='replace', index=False)

    print("Success! Data has been imported in Neon.")

except Exception as e:
    print(f"Error occurred: {e}")
```


```python
import pandas as pd
import time
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import matplotlib.dates as mdates


class StreamingSimulator:
    """
    Simulates streaming robot telemetry data from a CSV file.

    Each call to nextDataPoint():
        1) Opens a database connection
        2) Inserts one record into a database table
        3) Closes the database connection
        4) Refreshes the chart
    """

    def __init__(self, csv_file, db_conn_str, table_name="robot_stream", delay=2, max_xticks=50):
        """
        Parameters:
        csv_file (str): Relative path to the CSV file
        db_conn_str (str): SQLAlchemy PostgreSQL connection string
        table_name (str): Database table name to insert streaming data
        delay (int/float): Seconds to wait between records (stream rate)
        max_xticks (int): Maximum number of time labels on the X-axis
        """
        # Load CSV into memory
        self.df = pd.read_csv(csv_file)

        # Standardize column names: lowercase, spaces to underscores, remove '#'
        self.df.columns = [
            col.lower().strip().replace(" ", "_").replace("#", "")
            for col in self.df.columns
        ]

        # Rename time column to recorded_at if needed
        if "time" in self.df.columns:
            self.df.rename(columns={"time": "recorded_at"}, inplace=True)

        # Convert timestamp column to datetime for better plotting
        self.df["recorded_at"] = pd.to_datetime(self.df["recorded_at"])

        # Detect axis columns (axis_1 to axis_8 only)
        self.axis_cols = [
            col for col in self.df.columns
            if col.startswith("axis_") and col.split("_")[1].isdigit() and 1 <= int(col.split("_")[1]) <= 8
        ]
        # Sort numerically (axis_1, axis_2, ..., axis_8)
        self.axis_cols = sorted(self.axis_cols, key=lambda x: int(x.split("_")[1]))

        # Store configs
        self.db_conn_str = db_conn_str
        self.table_name = table_name
        self.delay = delay
        self.max_xticks = max_xticks
        self.current_index = 0

        # Create SQLAlchemy engine (a connection is opened and closed per record)
        self.engine = create_engine(self.db_conn_str)

        # Plot initialization
        plt.ion()
        self.fig, self.ax = plt.subplots(figsize=(12, 7))
        self.x_data = []
        self.y_data_dict = {col: [] for col in self.axis_cols}

        print(f"Loaded CSV: {csv_file}")
        print(f"Detected Y-axis columns: {self.axis_cols}")

    def nextDataPoint(self):
        """
        Takes the next record from the in-memory DataFrame,
        inserts it into the database, and refreshes the chart.
        Returns the row, or None once every record has been streamed.
        """
        # Stop condition
        if self.current_index >= len(self.df):
            print("All data points have been streamed.")
            return None

        # Read the next row as a one-row DataFrame
        row = self.df.iloc[[self.current_index]]

        # 1) Open connection, 2) insert record, 3) close connection on leaving the `with` block
        try:
            with self.engine.connect() as conn:
                row.to_sql(self.table_name, conn, if_exists="append", index=False)
        except Exception as e:
            print(f"Database insert failed at index {self.current_index}: {e}")

        # 4) Update plot
        ts = row["recorded_at"].values[0]
        self.x_data.append(ts)

        for col in self.axis_cols:
            self.y_data_dict[col].append(row[col].values[0])

        self.ax.clear()

        # Plot all axes
        for col in self.axis_cols:
            self.ax.plot(self.x_data, self.y_data_dict[col], label=col, linewidth=1)

        # Chart formatting
        self.ax.set_title(f"Streaming Robot Axis Data ({self.current_index + 1}/{len(self.df)})")
        self.ax.set_xlabel("recorded_at")
        self.ax.set_ylabel("Axis Values")

        # Format time display
        self.ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))

        # Limit the number of X-axis time labels
        if len(self.x_data) > self.max_xticks:
            step = max(1, len(self.x_data) // self.max_xticks)
            self.ax.set_xticks(self.x_data[::step])

        # Legend outside the plot area
        self.ax.legend(loc="center left", bbox_to_anchor=(1, 0.5), fontsize="small")

        plt.xticks(rotation=45)
        plt.grid(True, linestyle="--", alpha=0.6)
        plt.tight_layout()

        # Refresh the figure
        self.fig.canvas.draw()
        self.fig.canvas.flush_events()
        plt.pause(0.05)

        # Move to the next row and simulate streaming delay
        self.current_index += 1
        time.sleep(self.delay)

        return row
```

```python
%matplotlib qt

# Instantiate the simulator
ss = StreamingSimulator(
    csv_file='data/RMBR4-2_export_test.csv',
    db_conn_str=conn_str,
    table_name='robot_1_stream',
    delay=2
)

# Stream one point
ss.nextDataPoint()

# Stream all remaining points
while True:
    result = ss.nextDataPoint()
    if result is None:
        break
```


```txt
Connecting to Neon and uploading data, please wait...
Success! Data has been imported in Neon.
Loaded CSV: data/RMBR4-2_export_test.csv
Detected Y-axis columns: ['axis_1', 'axis_2', 'axis_3', 'axis_4', 'axis_5', 'axis_6', 'axis_7', 'axis_8']

KeyboardInterrupt:
```

### 🔹 During Session 1: Univariate Linear Regression Experiment

1. **Define the Problem**  
   - Select one feature from the robot data (e.g., a single axis reading such as `axis_1`) as the predictor, and one continuous value to predict.  

2. **Preprocess Data**  
   - Handle missing values.  
   - Normalize/standardize features.  
   - Split data into **train/test sets**.  

3. **Model Implementation**  
   - Implement Linear Regression **from scratch**:  
     - Hypothesis function $h_\theta(x) = \theta_0 + \theta_1 x$  
     - Cost function (MSE)  
     - Gradient descent update rule  
   - Implement Linear Regression **using scikit-learn** for comparison.  

4. **Model Evaluation**  
   - Compute RMSE, MAE, and $R^2$ score.  
   - Visualize regression line vs. data points.  
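
The steps above can be sketched in plain Python. This is a minimal illustration on synthetic data, not the workshop's reference solution: the function names, learning rate, iteration count, and generated data are all assumptions. It takes the cost to be $J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$ and repeatedly applies $\theta_j \leftarrow \theta_j - \alpha \, \partial J / \partial \theta_j$, where $\alpha$ is the learning rate.

```python
import random


def train_test_split(xs, ys, test_ratio=0.2, seed=42):
    """Shuffle indices with a fixed seed and split into train and test lists."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(xs) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([xs[i] for i in train_idx], [ys[i] for i in train_idx],
            [xs[i] for i in test_idx], [ys[i] for i in test_idx])


def standardize(values, mean=None, std=None):
    """Scale to zero mean and unit variance; pass train statistics to scale test data."""
    if mean is None:
        mean = sum(values) / len(values)
    if std is None:
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [(v - mean) / std for v in values], mean, std


def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def mse(theta0, theta1, xs, ys):
    """Mean squared error cost J(theta0, theta1)."""
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.1, iterations=500):
    """Batch gradient descent on the MSE cost, starting from theta = (0, 0)."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = (2 / m) * sum(errors)
        grad1 = (2 / m) * sum(e * x for e, x in zip(errors, xs))
        # Both gradients use the old parameters, so the update is simultaneous
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Synthetic data (hypothetical): y = 3 + 2x plus small uniform noise
rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [3.0 + 2.0 * x + rng.uniform(-0.5, 0.5) for x in xs]

x_train, y_train, x_test, y_test = train_test_split(xs, ys)
x_train_s, mu, sigma = standardize(x_train)
x_test_s, _, _ = standardize(x_test, mu, sigma)

theta0, theta1 = gradient_descent(x_train_s, y_train)
slope = theta1 / sigma            # slope expressed in the original units of x
intercept = theta0 - slope * mu   # intercept expressed in the original units of x
test_mse = mse(theta0, theta1, x_test_s, y_test)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test MSE={test_mse:.4f}")
```

Computing the standardization statistics on the training set only keeps information about the test set out of training, and it lets one learning rate work regardless of the feature's original scale. The recovered slope and intercept can be cross-checked against `sklearn.linear_model.LinearRegression` fitted on the same training data.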

💡 **Deliverable during Session 1**:  
- A working notebook with both a manual and `scikit-learn` Linear Regression implementation.  

---

### 🔹 After Session 1 (Homework)

- Refactor your notebook into **modular Python scripts**:  
  - `data_loader.py` – functions to load data from CSV, API, and DB.  
  - `preprocessing.py` – cleaning, normalization, train/test split.  
  - `model.py` – regression model implementations.  
  - `evaluation.py` – metrics, plots, reporting.  
- Ensure each module can run independently.  
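
One way to make a module runnable on its own is a `main()` function behind an `if __name__ == "__main__":` guard. The sketch below shows what a metrics-only `evaluation.py` might look like; the function names and sample values are illustrative assumptions, and plotting and reporting are left out.

```python
"""evaluation.py: regression metrics (a sketch; function names are illustrative)."""
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (undefined if y_true is constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def main():
    """Smoke test so the module can be run directly: python evaluation.py"""
    y_true = [1.0, 2.0, 3.0, 4.0]   # made-up values for the smoke test
    y_pred = [1.5, 2.0, 2.5, 4.0]
    print(f"MAE={mae(y_true, y_pred):.3f}")
    print(f"RMSE={rmse(y_true, y_pred):.3f}")
    print(f"R2={r2_score(y_true, y_pred):.3f}")


if __name__ == "__main__":
    main()
```

Because the guard only fires when the file is executed directly, other modules can still import these functions without triggering the smoke test.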

💡 This will prepare you for **Session 2 (MLOps)**.  

---

### 🔹 Before Session 2: Preparing for MLOps

Confirm that the code and data sets from the **Data Streaming Visualization Workshop** still work:
- Use the provided CSV file and stream it into memory as a pandas DataFrame.
- Initialize a database on a cloud-based service like https://neon.tech/.
- Populate the database with the contents of the pandas DataFrame.
- Open a connection to the remote database.
- Stream the data into a dashboard and display the live stream.
- Add data sources from **at least three different robots**.
- Replicate the structure, files, and resources that you developed during the **DataStreamVisualization_Workshop**.
- Use that structure to organize this project into a folder layout like:

```txt
linear_regression_project/
│── data/
│   ├── raw/
│   ├── processed/
│── notebooks/
│   ├── EDA.ipynb
│   ├── linear_regression.ipynb
│── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── model.py
│   ├── evaluation.py
│── configs/
│   ├── experiment_config.yaml
│── experiments/
│   ├── results.csv
│── requirements.txt
│── README.md
```

Design an experiment that uses an **independent variable** taken from the dashboard data (saved in a new DataFrame) and a **dependent variable** that helps predict when a robot will fail.

* Create a **YAML config file** with parameters:

  * Data source path/API endpoint/DB connection string
  * Learning rate, iterations, train/test split ratio
  * Feature to use as predictor

* Document how to run your scripts step-by-step.
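
The parameters listed above could be captured in a config along these lines. Every key name and value here is an assumption chosen for illustration, including the `data/raw/` location of the CSV and the environment variable name; adapt them to your own project.

```yaml
# configs/experiment_config.yaml (illustrative; all keys and values are assumptions)
data:
  source: csv                       # one of: csv, api, db
  csv_path: data/raw/RMBR4-2_1.csv  # assumed location under data/raw/
  db_url_env_var: NEON_DATABASE_URL # name of the environment variable holding the connection string
model:
  feature: axis_1                   # column used as the predictor
  learning_rate: 0.01
  iterations: 1000
split:
  test_ratio: 0.2
  random_seed: 42
```

Storing only the name of the environment variable, rather than the connection string itself, keeps credentials out of the repository.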

---

### 🔹 During Session 2: MLOps Architecture

* Implement the Linear Regression model to predict when the robot(s) will fail.
  * **Use the prediction** to issue an alert 2 weeks before the expected failure.

* Apply the **Robot PM MLOps design patterns**:

  * **Separation of concerns**: Each module is independent.
  * **Configuration-driven**: Experiments are parameterized by configs, not hard-coded values.
  * **Experiment tracking**: Save model performance metrics in `experiments/results.csv`.
  * **Reproducibility**: Ensure anyone can re-run your experiment with the same results.

* Discuss:

  * Why modularity matters for ML projects.
  * How config management avoids errors in scaling ML experiments.
  * How this workflow connects to real-world ML pipelines.
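
For the alert step, one possible approach is to extrapolate the fitted line until it crosses a failure threshold and then subtract the two-week lead time. The sketch below assumes time is measured in days from a reference date and that failure means the predicted reading reaching a fixed threshold; the threshold, the coefficients, the dates, and the function names are hypothetical and are not taken from the workshop data.

```python
from datetime import datetime, timedelta

ALERT_LEAD_TIME = timedelta(weeks=2)


def predicted_failure_date(theta0, theta1, threshold, start):
    """Solve theta0 + theta1 * t = threshold for t, in days after `start`.

    Returns None when the trend is flat or the crossing would lie in the past.
    """
    if theta1 == 0:
        return None
    days_to_failure = (threshold - theta0) / theta1
    if days_to_failure < 0:
        return None
    return start + timedelta(days=days_to_failure)


def should_alert(now, failure_date):
    """True once `now` is within the two-week lead time of the predicted failure."""
    return failure_date is not None and now >= failure_date - ALERT_LEAD_TIME


# Hypothetical fitted trend: the reading starts at 10.0 and rises by 0.5 per day.
# The failure threshold of 40.0 is an assumed value for illustration only.
start = datetime(2025, 1, 1)
failure = predicted_failure_date(theta0=10.0, theta1=0.5, threshold=40.0, start=start)
alert_from = failure - ALERT_LEAD_TIME
print(f"Predicted failure: {failure:%Y-%m-%d}, alert from: {alert_from:%Y-%m-%d}")
```

In a real pipeline, the threshold and lead time would more naturally live in the experiment config than in the code.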

💡 **Deliverable during Session 2**:

* A structured project with modular code, configs, and **Linear Regression** experiment tracking.
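
Experiment tracking in a CSV file can be as simple as appending one row per run. The following is a rough sketch using only the standard library; the `log_experiment` helper, its column names, and the metric values are placeholders, not real results. The demo writes to a temporary directory so that it leaves no files behind; in the project the path would be `experiments/results.csv`.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(results_path, params, metrics):
    """Append one row of parameters and metrics, writing the header on first use."""
    path = Path(results_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    row = {"timestamp": datetime.now(timezone.utc).isoformat(), **params, **metrics}
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row


# Demo with placeholder numbers (not real results)
with tempfile.TemporaryDirectory() as tmp:
    results_file = Path(tmp) / "experiments" / "results.csv"
    log_experiment(results_file,
                   {"feature": "axis_1", "learning_rate": 0.01},
                   {"rmse": 1.25, "r2": 0.80})
    log_experiment(results_file,
                   {"feature": "axis_1", "learning_rate": 0.1},
                   {"rmse": 1.10, "r2": 0.85})
    with results_file.open(newline="") as f:
        logged = list(csv.DictReader(f))

print(f"{len(logged)} runs logged with columns {list(logged[0])}")
```

The header is fixed by the first row written, so every run should log the same set of keys.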

---

### 🔹 After Session 2: Extension & Homework

0. **Submission Format**  
   - This activity is **to be submitted as teams**. One team member will create and manage the team's project repository.

1. **Workshop Replication**  
   - This workshop is modeled on the structure, files, and resources used in the **DataStreamVisualization_Workshop**.  
   - Your submission must replicate this style of organization and completeness.  

2. **Repository Submission Instructions**  
   - Create a **remote Git repository** named:  
     ```
     LinearRegressionArchitecture_Workshop
     ```
   - Once your repository is ready, send your instructor an email with the subject line:  
     ```
     Linear Regression Architecture Workshop
     ```
   - In the body of the email, paste the **full URL of your repository**, making sure it ends with the `.git` extension.  
     - ✅ Correct example: `https://github.com/username/LinearRegressionArchitecture_Workshop.git`  
     - ❌ Incorrect example: `https://github.com/username/LinearRegressionArchitecture_Workshop`

3. **Repository Requirements**  
   Your repository must contain:  
   - A **frozen version of the codebase** (no further modifications after submission).  
   - A `requirements.txt` file that lists all dependencies required to run your project.  
   - A `README.md` file that:  
     - Displays the title: **Linear Regression Architecture Workshop**.  
     - Describes the work completed in the workshop.  
     - Summarizes key design decisions.  

4. **Notebook Updates (RobotPM_MLOps.ipynb)**  
   - Open the notebook `RobotPM_MLOps.ipynb`.  
   - Update it so that it highlights all changes made to the original project architecture and files.  
   - Specifically, reference the lists provided in the notebook:  
     - **Recommended Additions**  
     - **Recommended Enhancements**  
     - **Breakdown examples** (from both design breakdown sections).  


💡 **Final Deliverable**:  
- A complete GitHub repository named `LinearRegressionArchitecture_Workshop` with the required structure, files, and documentation.  
- An updated `RobotPM_MLOps.ipynb` notebook showing how the project architecture was extended and prepared for enhancements.  
- Email submission to the instructor containing the `.git` repository URL.  