# 🏡 Linear Regression Architecture Workshop

## Introduction

Welcome to the **Linear Regression Architecture Workshop**.  
This workshop is designed for college-level students learning both:

1. **Univariate Linear Regression** – a foundational algorithm in Machine Learning, focusing on predicting continuous values from a single feature.  
2. **Machine Learning Operations (MLOps)** – design patterns and architectural considerations that make machine learning experiments reproducible, scalable, and production-ready.  

We will use **real-world housing price data** from **California (USA)** and **Ontario (Canada)** as our case study.  
The goal is not only to understand how Linear Regression works, but also to learn how to **design and implement a machine learning project**, from sourcing data → building models → structuring code → preparing for deployment.  

The workshop will be completed in **two 2-hour sessions**, with **homework assignments** to be completed before each class.  

---

## Workshop Structure

### 📚 Session 1 – Univariate Linear Regression
- **Lecture focus**: Mathematical intuition, model formulation, gradient descent, cost function, evaluation metrics.  
- **Practical focus**: Implementing Univariate Linear Regression from scratch + using `scikit-learn`.  
- **Homework before class**: Data sourcing (from CSV, APIs, and relational databases).  

### ⚙️ Session 2 – Machine Learning Operations (MLOps)
- **Lecture focus**: Code modularity, reproducibility, experiment tracking, design patterns in ML architecture.  
- **Practical focus**: Architecting the project with pipelines, config management, and modular scripts.  
- **Homework before class**: Refactor previous Linear Regression code into modular, production-ready format.  

---

## Instructions for Students

### 🔹 Before Session 1: Data Sourcing

Your first task is to collect **housing price data** for California and Ontario.  
You must load data from **all three of the following types of data source**:

1. **CSV Files**  
   - Find open housing datasets (e.g., Kaggle, UCI ML Repository, government portals).  
   - Example: [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).  
   - Save datasets in `data/raw/` folder.  

2. **Web Services (APIs)**  
   - Explore free APIs offering housing, rental, or real-estate data.  
   - Example APIs:  
     - Zillow (unofficial APIs exist; check the docs)  
     - Realtor.ca data endpoints  
     - [City of Toronto Open Data API](https://open.toronto.ca/)  
     - [California State Open Data Portal](https://data.ca.gov/)  
   - Use Python packages like `requests` or `httpx` to fetch data.  
   - Save results into structured JSON or convert to DataFrames.  

3. **Relational Databases**  
   - Connect to a **PostgreSQL** or **MySQL** demo database.  
   - Option 1: Use hosted databases with sample housing/economic data.  
   - Option 2: Load CSVs into a local database (e.g., PostgreSQL with `psql` or SQLite for portability).  
   - Connect from Python using `sqlalchemy` or `psycopg2`.  
   - Run SQL queries to filter/select data.  
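
As one way to approach Option 2 of the relational database task, the sketch below loads CSV rows into SQLite and filters them with SQL, using only the Python standard library. The sample rows, the column names, and the `housing` table name are hypothetical placeholders, not a real dataset; with `sqlalchemy` or `psycopg2` the connection step would differ.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows standing in for a file under data/raw/;
# the column names are illustrative, not taken from a real dataset.
SAMPLE_CSV = """region,median_income,median_house_value
California,8.3,452600
California,3.1,158700
Ontario,5.6,341300
Ontario,4.0,226000
"""


def load_csv_rows(text_stream):
    """Parse CSV text into a list of dicts with numeric columns converted."""
    rows = []
    for record in csv.DictReader(text_stream):
        rows.append(
            {
                "region": record["region"],
                "median_income": float(record["median_income"]),
                "median_house_value": float(record["median_house_value"]),
            }
        )
    return rows


def rows_to_sqlite(rows, connection):
    """Create a `housing` table and bulk-insert the parsed rows."""
    connection.execute(
        "CREATE TABLE housing ("
        "region TEXT, median_income REAL, median_house_value REAL)"
    )
    connection.executemany(
        "INSERT INTO housing VALUES "
        "(:region, :median_income, :median_house_value)",
        rows,
    )
    connection.commit()


def query_region(connection, region):
    """Return (income, price) pairs for one region, ordered by income."""
    cursor = connection.execute(
        "SELECT median_income, median_house_value FROM housing "
        "WHERE region = ? ORDER BY median_income",
        (region,),
    )
    return cursor.fetchall()


# An in-memory database keeps the sketch portable; pass a file path to
# sqlite3.connect() instead if the database should persist on disk.
conn = sqlite3.connect(":memory:")
rows_to_sqlite(load_csv_rows(io.StringIO(SAMPLE_CSV)), conn)
ontario = query_region(conn, "Ontario")
print(ontario)
```

To read an actual file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle, for example `open(path, newline="")`. The parameterized `?` placeholder keeps the region value out of the SQL string itself.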

💡 **Deliverable before Session 1**:  
- A Jupyter Notebook that loads housing price data from all three sources (CSV, API, Database) and explores it with basic descriptive statistics and plots.  

---

### 🔹 During Session 1: Univariate Linear Regression Experiment

1. **Define the Problem**  
   - Select one feature (e.g., median income, number of rooms, lot size) to predict housing price.  

2. **Preprocess Data**  
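   - Handle missing values.  
   - Split data into **train/test sets**.  
   - Normalize/standardize features, fitting the scaling statistics on the training set only so that no test-set information leaks into training.  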

3. **Model Implementation**  
   - Implement Linear Regression **from scratch**:  
     - Hypothesis function $h_\theta(x) = \theta_0 + \theta_1 x$  
     - Cost function (MSE)  
     - Gradient descent update rule  
   - Implement Linear Regression **using scikit-learn** for comparison.  

4. **Model Evaluation**  
   - Compute RMSE, MAE, and $R^2$ score.  
   - Visualize regression line vs. data points.  
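
One possible formalization of the cost function and gradient descent update rule from step 3 is given below. It assumes $m$ training examples $(x^{(i)}, y^{(i)})$ and a learning rate $\alpha$; this notation is introduced here for illustration and is not defined elsewhere in the workshop.

$$
J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial \theta_0} = \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right),
\qquad
\frac{\partial J}{\partial \theta_1} = \frac{2}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
$$

$$
\theta_j := \theta_j - \alpha \, \frac{\partial J}{\partial \theta_j}, \qquad j \in \{0, 1\}
$$

Both parameters are updated simultaneously in each iteration. Some textbooks scale the cost by $\frac{1}{2m}$ instead of $\frac{1}{m}$, which cancels the factor of 2 in the gradients and amounts to rescaling $\alpha$.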

💡 **Deliverable during Session 1**:  
- A working notebook with both a manual and `scikit-learn` Linear Regression implementation.  
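
A minimal sketch of the from-scratch part of this deliverable follows, assuming synthetic data and hand-picked hyperparameters rather than the housing dataset. To stay dependency-free it checks gradient descent against the closed-form least squares solution instead of `scikit-learn`, and it evaluates on the training data for brevity; in your notebook, compare against `sklearn.linear_model.LinearRegression` and report metrics on the test set.

```python
import math
import random

# Synthetic data (an assumption for illustration): y = 3 + 2x plus small noise.
random.seed(42)
xs = [i / 10 for i in range(50)]
ys = [3.0 + 2.0 * x + random.gauss(0, 0.1) for x in xs]


def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Batch gradient descent on the MSE cost, starting from zero."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = 2 / m * sum(errors)
        grad1 = 2 / m * sum(e * x for e, x in zip(errors, xs))
        # Both gradients are computed before either parameter changes,
        # so this is a simultaneous update.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


def closed_form(xs, ys):
    """Ordinary least squares solution, used here as a reference."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    return mean_y - theta1 * mean_x, theta1


def evaluate(theta0, theta1, xs, ys):
    """Return RMSE, MAE, and R^2 for the fitted line."""
    m = len(xs)
    residuals = [y - predict(theta0, theta1, x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    mean_y = sum(ys) / m
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "rmse": math.sqrt(ss_res / m),
        "mae": sum(abs(r) for r in residuals) / m,
        "r2": 1 - ss_res / ss_tot,
    }


gd_theta = gradient_descent(xs, ys)
ols_theta = closed_form(xs, ys)
metrics = evaluate(*gd_theta, xs, ys)
print("gradient descent:", gd_theta)
print("closed form:     ", ols_theta)
print("metrics:         ", metrics)
```

The learning rate that converges depends on the scale of the feature, which is one reason the preprocessing step standardizes it. On unscaled housing data, the values used here would likely need adjusting.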

---

### 🔹 After Session 1 (Homework)

- Refactor your notebook into **modular Python scripts**:  
  - `data_loader.py` – functions to load data from CSV, API, and DB.  
  - `preprocessing.py` – cleaning, normalization, train/test split.  
  - `model.py` – regression model implementations.  
  - `evaluation.py` – metrics, plots, reporting.  
- Ensure each module can be imported and run on its own (for example, `python preprocessing.py`), without depending on notebook state.  
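
As a rough illustration of a module that works both as an import and as a standalone script, here is one possible shape for `preprocessing.py`. The function names, the pure-Python implementation, and the demo data are assumptions made for this sketch; your own module may well use `pandas` or `scikit-learn` instead.

```python
"""preprocessing.py: a hypothetical sketch of one standalone module."""
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows with a fixed seed, then split them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(values):
    """Compute the mean and population standard deviation of a feature."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


def standardize(values, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return [(v - mean) / std for v in values]


def main():
    """Small demo so the file can be run directly as a script."""
    rows = [(float(i), 2.0 * i + 1.0) for i in range(10)]  # made-up (x, y) pairs
    train, test = train_test_split(rows, test_ratio=0.2, seed=0)
    # Fit the scaler on the training feature only, then reuse it for the test set.
    mean, std = fit_standardizer([x for x, _ in train])
    train_x = standardize([x for x, _ in train], mean, std)
    test_x = standardize([x for x, _ in test], mean, std)
    print(f"train={len(train_x)} test={len(test_x)} mean={mean:.2f} std={std:.2f}")


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__":` guard means the demo runs only when the file is executed directly, so other modules can import its functions without side effects.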

💡 This will prepare you for **Session 2 (MLOps)**.  

---

### 🔹 Before Session 2: Preparing for MLOps

- Replicate the structure, files, and resources that you developed during the **DataStreamVisualization_Workshop**.
- Use that structure to organize this project into folders like:

```txt
linear_regression_project/
│── data/
│   ├── raw/
│   ├── processed/
│── notebooks/
│   ├── EDA.ipynb
│   ├── linear_regression.ipynb
│── src/
│   ├── data_loader.py
│   ├── preprocessing.py
│   ├── model.py
│   ├── evaluation.py
│── configs/
│   ├── experiment_config.yaml
│── experiments/
│   ├── results.csv
│── requirements.txt
│── README.md
```

* Create a **YAML config file** with parameters:

  * Data source path/API endpoint/DB connection string
  * Learning rate, iterations, train/test split ratio
  * Feature to use as predictor

* Document how to run your scripts step-by-step.
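
The parameters listed above might be laid out in `configs/experiment_config.yaml` roughly as follows. Every key name and value here is an illustrative assumption, including the file names and the SQLite connection string; adapt them to your own data sources.

```yaml
# configs/experiment_config.yaml (illustrative keys and values only)
data:
  source: csv                     # one of: csv, api, db
  path: data/raw/housing.csv      # hypothetical file name
  api_endpoint: null              # fill in when source is api
  db_connection: sqlite:///data/processed/housing.db   # hypothetical
  feature: median_income          # single predictor for univariate regression
  target: median_house_value
training:
  learning_rate: 0.01
  iterations: 1000
  test_ratio: 0.2
  random_seed: 42
```

Keeping the seed in the config alongside the hyperparameters makes each recorded experiment easier to re-run.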

---

### 🔹 During Session 2: MLOps Architecture

* Apply the **Robot PM MLOps design patterns**:

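  * **Separation of concerns**: Each module has a single responsibility (loading, preprocessing, modeling, or evaluation) and can be changed without touching the others.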
  * **Configuration-driven**: Experiments are parameterized by configs, not hard-coded values.
  * **Experiment tracking**: Save model performance metrics in `experiments/results.csv`.
  * **Reproducibility**: Fix random seeds, pin dependency versions in `requirements.txt`, and record the config used, so that anyone can re-run your experiment and get the same results.

* Discuss:

  * Why modularity matters for ML projects.
  * How config management reduces errors when scaling up ML experiments.
  * How this workflow connects to real-world ML pipelines.

💡 **Deliverable during Session 2**:

* A structured project with modular code, configs, and experiment tracking.
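
For the experiment tracking pattern, a minimal sketch of appending one row per run to `experiments/results.csv` could look like the following. The column set and the function names `log_experiment` and `read_experiments` are assumptions for this example, not part of the Robot PM materials.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column layout for experiments/results.csv.
FIELDNAMES = [
    "timestamp", "feature", "learning_rate", "iterations", "rmse", "mae", "r2",
]


def log_experiment(results_path, params, metrics):
    """Append one run (parameters plus metrics) as a row in a CSV file."""
    directory = os.path.dirname(results_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Write the header only when the file is being created.
    write_header = not os.path.exists(results_path)
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **params,
        **metrics,
    }
    with open(results_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def read_experiments(results_path):
    """Load all logged runs as a list of dicts (values are strings)."""
    with open(results_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

A call such as `log_experiment("experiments/results.csv", {"feature": "median_income", "learning_rate": 0.01, "iterations": 1000}, {"rmse": 1.5, "mae": 1.0, "r2": 0.9})` adds one row. Appending rather than overwriting keeps the history of earlier runs available for comparison.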

---

### 🔹 After Session 2: Extension & Homework

0. **Submission Format**  
   - This activity is **to be submitted individually**. Each student must create and manage their own project repository.

1. **Workshop Replication**  
   - This workshop is modeled on the structure, files, and resources used in the **DataStreamVisualization_Workshop**.  
   - Your submission must replicate this style of organization and completeness.  

2. **Repository Submission Instructions**  
   - Create a **remote Git repository** named:  
     ```
     LinearRegressionArchitecture_Workshop
     ```
   - Once your repository is ready, send your instructor an email with the subject line:  
     ```
     Linear Regression Architecture Workshop
     ```
   - In the body of the email, paste the **full URL of your repository**, making sure it ends with the `.git` extension.  
     - ✅ Correct example: `https://github.com/username/LinearRegressionArchitecture_Workshop.git`  
     - ❌ Incorrect example: `https://github.com/username/LinearRegressionArchitecture_Workshop`

3. **Repository Requirements**  
   Your repository must contain:  
   - A **frozen version of the codebase** (no further modifications after submission).  
   - A `requirements.txt` file that lists all dependencies required to run your project.  
   - A `README.md` file that:  
     - Displays the title: **Linear Regression Architecture Workshop**.  
     - Describes the work completed in the workshop.  
     - Summarizes key design decisions.  

4. **Notebook Updates (RobotPM_MLOps.ipynb)**  
   - Open the notebook `RobotPM_MLOps.ipynb`.  
   - Update it so that it highlights all changes made to the original project architecture and files.  
   - Specifically, reference the lists provided in the notebook:  
     - **Recommended Additions**  
     - **Recommended Enhancements**  
     - **Breakdown examples** (from both design breakdown sections).  

5. **Expectations for Notebook Updates**  
   - You are **not required to fully implement** the changes and updates at this stage.  
   - Instead, create all **placeholders, stubs, and structure** needed to prepare the project for a future code review.  
   - Think of this as **project scaffolding** for the upcoming implementation sprint cycle, which will be executed in a future project.  

💡 **Final Deliverable**:  
- A complete GitHub repository named `LinearRegressionArchitecture_Workshop` with the required structure, files, and documentation.  
- An updated `RobotPM_MLOps.ipynb` notebook showing how the project architecture was extended and prepared for enhancements.  
- Email submission to the instructor containing the `.git` repository URL.  
