
A professional, educational Python library implementing linear regression and essential ML utilities from the ground up. Includes custom data preprocessing, train/test splitting, and a clean, modular API - ideal for learning, experimentation, and research.


illoonego/linear-regression-from-scratch


Linear Regression from Scratch

A production-quality, educational implementation of linear regression algorithms built from scratch using NumPy. This library provides clean, well-documented implementations for learning the mathematical foundations of linear regression while maintaining professional-grade code quality.

🎯 Project Overview

This project implements linear regression algorithms from first principles without using high-level ML libraries like scikit-learn. It's designed as both an educational tool and a functional library that demonstrates professional Python package development practices.

🌟 What Makes This Special

  • 📚 Educational Focus: Understand the mathematics behind linear regression
  • 🏗️ Production Quality: Professional package structure ready for PyPI
  • 🔬 From Scratch: Only NumPy used for mathematical operations
  • 🧪 Fully Tested: Comprehensive test suite with edge case handling
  • 📦 Complete Package: Installable via pip with proper dependency management

📁 Project Architecture

See the full project architecture in DEVELOPMENT.md.

📐 Mathematical Background

For a detailed explanation of the mathematical foundations behind linear regression, see mathematical_background.md.
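As a quick orientation, the two fitting methods this README refers to can be summarized as follows. This is a sketch under a common convention (a design matrix $X$ with a leading column of ones, $m$ samples, learning rate $\alpha$, and a $\tfrac{1}{2m}$ cost scaling); the exact notation and scaling used by the library are assumptions here, and mathematical_background.md remains the reference.

```latex
\begin{align}
  \hat{\mathbf{y}} &= X\mathbf{w}
    && \text{(prediction)} \\
  J(\mathbf{w}) &= \frac{1}{2m}\,\lVert X\mathbf{w} - \mathbf{y} \rVert^2
    && \text{(cost, assumed scaling)} \\
  \nabla J(\mathbf{w}) &= \frac{1}{m}\,X^\top (X\mathbf{w} - \mathbf{y})
    && \text{(gradient)} \\
  \mathbf{w} &\leftarrow \mathbf{w} - \alpha\,\nabla J(\mathbf{w})
    && \text{(gradient descent step)} \\
  \mathbf{w}^\ast &= (X^\top X)^{-1} X^\top \mathbf{y}
    && \text{(normal equation, when } X^\top X \text{ is invertible)}
\end{align}
```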

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip package manager

Installation

Option 1: Install from PyPI (recommended)

pip install linreg-from-scratch

Option 2: Clone & Setup for development

git clone https://github.com/illoonego/linear-regression-from-scratch.git
cd linear-regression-from-scratch

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

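# Install the package in editable mode with development dependencies (declared in pyproject.toml, PEP 621)
# The quotes keep shells such as zsh from expanding the square brackets
pip install -e ".[dev]"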
# For optional dependencies (notebooks, docs):
pip install -e ".[notebooks,docs]"

Note: All dependencies are managed via pyproject.toml. The legacy requirements.txt file has been removed for clarity and modern Python packaging best practices.

Running Examples

# Run all examples
python examples/basic_linear_regression.py

# Run specific examples
python examples/basic_linear_regression.py 1d    # Simple regression
python examples/basic_linear_regression.py 2d    # Multiple regression

Basic Usage

Simple Linear Regression

import numpy as np
from linear_regression import LinearRegression, StandardScaler, r2_score

# Create sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])  # y ≈ 2x with noise

# Option 1: Direct usage
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X, y, method='gradient_descent')
predictions = model.predict(X)
print(f"Weights: {model.weights_}")
print(f"R² Score: {r2_score(y, predictions):.4f}")

# Option 2: With preprocessing
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)
model.fit(X_scaled, y)
predictions_scaled = model.predict(X_scaled)
print(f"Weights (scaled): {model.weights_}")
print(f"R² Score (scaled): {r2_score(y, predictions_scaled):.4f}")
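To show what a gradient descent fit does in principle, here is a minimal NumPy-only sketch on the same data. It is not the library's implementation: `fit_gradient_descent` is a hypothetical helper, the cost is assumed to be the mean squared error scaled by 1/2, and the learning rate and iteration count were chosen so this tiny example converges fully.

```python
import numpy as np


def fit_gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """Fit y ≈ b + X @ w by batch gradient descent (hypothetical helper)."""
    m, n = X.shape
    # Prepend a column of ones so that weights[0] plays the role of the intercept.
    Xb = np.column_stack((np.ones(m), X))
    weights = np.zeros(n + 1)
    for _ in range(n_iterations):
        residuals = Xb @ weights - y        # prediction error for every sample
        gradient = Xb.T @ residuals / m     # gradient of (1/2m) * sum of squared errors
        weights -= learning_rate * gradient
    return weights


X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

weights = fit_gradient_descent(X, y, learning_rate=0.05, n_iterations=5000)
print(weights)  # [intercept, slope], approximately [0.09, 1.97]
```

With fewer iterations or a smaller learning rate the intercept is the last value to settle, which is one reason to check the cost curve rather than rely on a fixed iteration count.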

Multiple Linear Regression

import numpy as np
from linear_regression import LinearRegression, r2_score

# House price prediction example
np.random.seed(42)
size_sqft = np.random.uniform(800, 2500, 100)
bedrooms = np.random.randint(1, 5, 100)
X = np.column_stack((size_sqft, bedrooms))

# True relationship: price = 150*size + 10000*bedrooms + 20000 + noise
price = 150 * size_sqft + 10000 * bedrooms + 20000 + np.random.randn(100) * 10000

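# Unscaled features force a very small learning rate; standardizing X first (see StandardScaler above) usually permits a much larger one
model = LinearRegression(learning_rate=1e-7, n_iterations=5000)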
model.fit(X, price)
predictions = model.predict(X)

print(f"Learned coefficients: {model.weights_[1:]}")  # [size_coef, bedroom_coef]
print(f"Intercept: {model.weights_[0]}")
print(f"R² Score: {r2_score(price, predictions):.4f}")
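Because square footage and bedroom count live on very different scales, gradient descent on the raw features is slow to settle the intercept and the bedroom coefficient. The sketch below, written in plain NumPy rather than with the library's classes, standardizes the features, runs gradient descent, and maps the weights back to the original units. The learning rate, iteration count, and the `np.linalg.lstsq` reference fit are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
size_sqft = rng.uniform(800, 2500, 100)
bedrooms = rng.integers(1, 5, 100).astype(float)
X = np.column_stack((size_sqft, bedrooms))
price = 150 * size_sqft + 10000 * bedrooms + 20000 + rng.normal(0, 10000, 100)

# Standardize each column to zero mean and unit variance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Batch gradient descent on the standardized features (intercept in w[0]).
Xb = np.column_stack((np.ones(len(X_std)), X_std))
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    w -= 0.1 * (Xb.T @ (Xb @ w - price)) / len(price)

# Map the weights back to the original feature units.
coef = w[1:] / sigma
intercept = w[0] - np.sum(coef * mu)

# Closed-form least squares on the raw features, used only as a reference.
X_raw = np.column_stack((np.ones(len(X)), X))
reference, *_ = np.linalg.lstsq(X_raw, price, rcond=None)

print("GD (rescaled):", intercept, coef)
print("Reference:    ", reference)  # [intercept, size_coef, bedroom_coef]
```

After standardization a learning rate of 0.1 is stable, whereas the raw features need a rate many orders of magnitude smaller.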

📊 Current Features

✅ Implemented & Tested

  • LinearRegression: Complete implementation with both gradient descent and normal equation (closed-form solution)
  • StandardScaler: Feature standardization with robust validation
  • Examples: Working 1D and 2D regression demonstrations
  • Error Handling: Comprehensive input validation and edge case management
  • Verbose Training Output: Control progress printing with the verbose flag
  • Professional Structure: PyPI-ready package with proper metadata
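To make the standardization and scoring steps concrete, here is a small NumPy sketch of what such utilities typically compute. The function names `standardize` and `r2` are hypothetical and the validation shown is only an assumption about what "robust validation" might involve; the library's own StandardScaler and r2_score may behave differently.

```python
import numpy as np


def standardize(X):
    """Return (X_scaled, mean, scale) with zero-mean, unit-variance columns."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"expected a 2D array, got {X.ndim}D")
    mean = X.mean(axis=0)
    scale = X.std(axis=0)
    # Constant columns have zero spread; use scale 1 to avoid dividing by zero.
    scale[scale == 0.0] = 1.0
    return (X - mean) / scale, mean, scale


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (undefined for constant y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


X_scaled, mean, scale = standardize([[1.0, 5.0], [3.0, 5.0]])
print(X_scaled)                      # first column becomes [-1, 1]; the constant column becomes [0, 0]
print(r2([1, 2, 3], [1, 2, 3]))      # perfect predictions give 1.0
print(r2([1, 2, 3], [2, 2, 2]))      # predicting the mean gives 0.0
```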

🚧 Planned Features

See DEVELOPMENT.md for the full roadmap and planned features.

🧪 Testing & Development

Run Tests & Coverage

# Run all tests
pytest tests/

# Run with coverage (see missing lines in terminal)
pytest --cov=src/linear_regression --cov-report=term-missing

# Run specific test file
pytest tests/test_linear_regression.py -v

Continuous Integration & Delivery (CI/CD)

This project uses GitHub Actions for:

  • CI: Automatic tests, linting (ruff), formatting checks (black), and coverage reporting on every push and pull request. See .github/workflows/python-ci.yml.
  • CD: Automated publishing to PyPI on new version tags. See .github/workflows/python-cd.yml.

How releases work:

  • When a new version tag (e.g., v1.0.0) is pushed, the CD workflow builds and publishes the package to PyPI using secure repository secrets.
  • See DEVELOPMENT.md for more on the release workflow.

Code Quality

# Format code
black src/ tests/ examples/

# Sort imports  
isort src/ tests/ examples/

# Lint code
flake8 src/ tests/ examples/

Development Installation

# Install with development dependencies
pip install -e ".[dev,notebooks,docs]"

🎯 Example Output

$ python examples/basic_linear_regression.py 2d

2D Multiple Linear Regression Example
----------------------------------------

Generating synthetic data...
Data points: 100
True weights: size coefficient=150, bedroom coefficient=10000, intercept=20000

Training model with Gradient Descent...
Iteration 0: Cost = 1250000000.0000
Iteration 500: Cost = 125678923.4567  
Iteration 1000: Cost = 89234567.1234

Training completed!

Results:
Learned weights: size coefficient=149.87, bedroom coefficient=9989.23, intercept=20145.67
R² Score:        0.9234
MSE:             89234567.12

Comparison with True Values:
True:    size=150.00, bedroom=10000.00, intercept=20000.00
Learned: size=149.87, bedroom=9989.23, intercept=20145.67
Error:   size=0.13, bedroom=10.77, intercept=145.67

Comparing Gradient Descent and the Normal Equation

# Option 1: Gradient Descent
model_gd = LinearRegression(learning_rate=0.01, n_iterations=1000, verbose=True)
model_gd.fit(X, y, method='gradient_descent')
predictions_gd = model_gd.predict(X)
print(f"GD Weights: {model_gd.weights_}")
print(f"GD R² Score: {r2_score(y, predictions_gd):.4f}")

# Option 2: Normal Equation (closed-form)
model_ne = LinearRegression(verbose=False)
model_ne.fit(X, y, method='normal_equation')
predictions_ne = model_ne.predict(X)
print(f"NE Weights: {model_ne.weights_}")
print(f"NE R² Score: {r2_score(y, predictions_ne):.4f}")
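For reference, the closed-form fit selected with method='normal_equation' can be sketched in a few lines of NumPy. This is an illustration, not the library's code: `fit_normal_equation` is a hypothetical helper, and solving the linear system with `np.linalg.solve` instead of forming an explicit inverse is a design choice assumed here.

```python
import numpy as np


def fit_normal_equation(X, y):
    """Solve (Xb^T Xb) w = Xb^T y for w, with the intercept in w[0] (hypothetical helper)."""
    m = X.shape[0]
    # Prepend a column of ones so the intercept is learned as an ordinary weight.
    Xb = np.column_stack((np.ones(m), X))
    # Solving the system is cheaper and numerically safer than inverting Xb^T Xb.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)


X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

weights = fit_normal_equation(X, y)
print(weights)  # [intercept, slope], approximately [0.09, 1.97]
```

The closed form needs no learning rate or iteration count, but it requires the matrix Xb^T Xb to be invertible, so perfectly collinear features make `np.linalg.solve` raise an error.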

🎓 Educational Value

This project demonstrates:

  • Mathematical Understanding: Implement algorithms from equations
  • Software Engineering: Professional Python package development
  • Machine Learning: Core concepts without library abstractions
  • Numerical Computing: Efficient NumPy vectorized operations
  • Testing: Comprehensive test coverage with edge cases
  • Documentation: Clear code documentation and user guides

🤝 Contributing

We welcome contributions! Please see:

🙏 Acknowledgments

  • Built for educational purposes to understand ML fundamentals
  • Mathematical foundations from "The Elements of Statistical Learning"
  • Inspired by the need for transparent, understandable ML implementations

Note: This is primarily an educational project. For production ML workflows, consider using established libraries like scikit-learn, though this implementation is production-quality and could be used in real applications.
