## Baseline: Linear Regression with scikit-learn

This notebook establishes a **reference baseline** using scikit-learn's linear regression implementation.

The goal is not exploration or optimization, but to:
- Define expected coefficients and error metrics
- Provide a numerical reference for later implementations
- Make it possible to tell implementation bugs apart from genuine methodological differences

This baseline will be treated as a **frozen oracle** for correctness.

## Experimental setup

We fix the random seed so that data generation and the train/validation split
are deterministic. Re-running the notebook from top to bottom should reproduce
every result; the last few digits may still vary across NumPy, scikit-learn,
or BLAS versions.

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
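
As a quick illustration of what seeding guarantees, the sketch below draws the same sample twice from separately seeded generators and checks that the values match. The helper name `draw_sample` is hypothetical and not part of the notebook. It uses a local `np.random.RandomState` rather than `np.random.seed`, so running it would not disturb the global generator that the following cells rely on.

```python
import numpy as np

RANDOM_SEED = 42


def draw_sample(seed, n=5):
    """Draw n uniform values from a fresh, locally seeded generator."""
    rng = np.random.RandomState(seed)  # local state; global generator untouched
    return rng.uniform(-5, 5, size=n)


first = draw_sample(RANDOM_SEED)
second = draw_sample(RANDOM_SEED)

# Same seed, same sequence of draws
print(np.array_equal(first, second))
```

A different seed gives a different sequence, which is why the seed itself belongs to the frozen setup.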

## Synthetic data generation

We generate a simple one-dimensional dataset with a known linear relationship
and additive Gaussian noise. Because the true slope and intercept are known, we
can check that the learned coefficients recover them up to the sampling error
expected from a finite, noisy sample.

In [2]:
# Ground truth parameters
TRUE_SLOPE = 3.0
TRUE_INTERCEPT = 2.0
NOISE_STD = 0.5

# Generate data
n_samples = 200
X = np.random.uniform(-5, 5, size=n_samples)
noise = np.random.normal(0.0, NOISE_STD, size=n_samples)

y = TRUE_SLOPE * X + TRUE_INTERCEPT + noise

# Reshape for scikit-learn
X = X.reshape(-1, 1)

## Train / validation split

Even for a simple baseline, we separate training and validation data.
This ensures that evaluation reflects generalization rather than memorization.

In [3]:
# Reuse the notebook-wide seed so the split stays tied to RANDOM_SEED
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=RANDOM_SEED,
)

# Sanity check
X_train.shape, X_val.shape

((150, 1), (50, 1))

## Baseline model: scikit-learn LinearRegression

We fit a standard linear regression model using scikit-learn.
This model serves as a numerical reference for later implementations.

In [4]:
# Fit ordinary least squares on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# Single feature, so coef_ holds exactly one slope
coef = model.coef_[0]
intercept = model.intercept_

coef, intercept

(np.float64(3.0030532941128323), np.float64(2.0545769300718177))
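
To put a number on "close enough to the ground truth", one option is to compare each deviation with its standard error. The sketch below is not part of the original notebook. It rebuilds the same data with a local generator, assuming the constants from the cells above, and applies the textbook standard-error formulas for simple linear regression. The names `slope_se`, `intercept_se`, `slope_z` and `intercept_z` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Rebuild the notebook's data so this block runs on its own (assumed constants)
rng = np.random.RandomState(42)
X = rng.uniform(-5, 5, size=200)
y = 3.0 * X + 2.0 + rng.normal(0.0, 0.5, size=200)
X = X.reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

x = X_train[:, 0]
n = x.size
residuals = y_train - model.predict(X_train)

# Unbiased residual variance: two parameters were estimated
s2 = np.sum(residuals ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

slope_se = np.sqrt(s2 / sxx)
intercept_se = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))

# Deviation from the ground truth, in units of standard error
slope_z = (model.coef_[0] - 3.0) / slope_se
intercept_z = (model.intercept_ - 2.0) / intercept_se
print(f"slope: {slope_z:+.2f} SE, intercept: {intercept_z:+.2f} SE")
```

Deviations of a few standard errors or less are consistent with sampling noise; much larger ones would point to a problem in data generation or fitting.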

## Baseline predictions

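We generate predictions on both training and validation splits.
At this stage, we only check that the outputs are plausible: for X in [-5, 5],
the noise-free targets range from -13 to 17, so predictions should stay close
to that interval.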

In [5]:
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

y_train_pred[:5], y_val_pred[:5]

(array([ 6.06076283, -6.1589007 , -8.27609096, 11.60472983, -7.35789124]),
 array([  1.86825538,  -7.45295437,   5.28420615,  -5.82430745,
        -12.75191336]))

## Baseline evaluation: mean squared error

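We evaluate the baseline model using mean squared error (MSE).
Since the noise has standard deviation 0.5, a correct fit should give an MSE
near NOISE_STD**2 = 0.25 on both splits. The same metric will later be used to
verify the correctness of custom implementations.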

In [6]:
# MSE on each split; mean_squared_error was imported in the first cell
train_mse = mean_squared_error(y_train, y_train_pred)
val_mse = mean_squared_error(y_val, y_val_pred)

train_mse, val_mse

(0.22903972169937115, 0.2500437189227639)
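
Since custom implementations will later be scored with this metric, it helps to pin down what it computes: the mean of the squared residuals. A minimal sketch follows, using a hypothetical helper `manual_mse` and a small hand-checkable example rather than the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def manual_mse(y_true, y_pred):
    """Mean of squared residuals, computed directly with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

# Residuals: -0.5, 0.0, 1.0, -0.5 -> squares sum to 1.5 -> mean is 0.375
print(manual_mse(y_true, y_pred))
print(np.isclose(manual_mse(y_true, y_pred), mean_squared_error(y_true, y_pred)))
```

Agreement between the two on a toy input rules out a mismatch in the metric itself before model outputs are compared.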

## Baseline invariants

At this point, the scikit-learn baseline is considered frozen.

The following quantities are treated as invariants for the rest of this module:
- Learned coefficient and intercept
- Training and validation MSE under the current split
- Data preprocessing and splitting strategy

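Any future implementation (closed-form, gradient descent, or refactored versions)
must reproduce these values within an explicitly stated numerical tolerance:
close to floating-point precision for closed-form solutions, and no looser than
the convergence threshold for iterative ones.

If discrepancies exceed that tolerance, treat them as implementation or
data-handling bugs, not model differences.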
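
One way such a check could look is sketched below. It is not part of the original notebook: `fit_closed_form` and `check_against_baseline` are hypothetical helpers, the tolerances are placeholder assumptions, and the data is rebuilt from the constants used above so the block runs on its own. The closed-form candidate uses `np.linalg.lstsq` on a design matrix with an explicit bias column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def fit_closed_form(X, y):
    """Least-squares fit with an explicit bias column; returns (slope, intercept)."""
    design = np.column_stack([X[:, 0], np.ones(X.shape[0])])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params[0], params[1]


def check_against_baseline(candidate, reference, rtol=1e-8, atol=1e-10):
    """Raise AssertionError if any quantity drifts from the reference (assumed tolerances)."""
    for name, ref_value in reference.items():
        np.testing.assert_allclose(
            candidate[name], ref_value, rtol=rtol, atol=atol, err_msg=name
        )


# Rebuild the notebook's data (assumed constants from the cells above)
rng = np.random.RandomState(42)
X = rng.uniform(-5, 5, size=200)
y = 3.0 * X + 2.0 + rng.normal(0.0, 0.5, size=200)
X = X.reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Reference values from the scikit-learn baseline
model = LinearRegression().fit(X_train, y_train)
reference = {
    "coef": model.coef_[0],
    "intercept": model.intercept_,
    "val_mse": mean_squared_error(y_val, model.predict(X_val)),
}

# Candidate values from the closed-form fit
slope, bias = fit_closed_form(X_train, y_train)
candidate = {
    "coef": slope,
    "intercept": bias,
    "val_mse": float(np.mean((y_val - (slope * X_val[:, 0] + bias)) ** 2)),
}

check_against_baseline(candidate, reference)  # silent when everything matches
```

For an iterative method such as gradient descent, the same helper could be called with a looser `rtol` matched to the optimizer's stopping criterion.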