# Lesson 5.3: Linear Regression

## Predicting Numbers

Linear regression finds the **best-fit line** through your data: the line that minimizes the sum of squared differences between the predicted and actual values.

Formula: `y = mx + b` for a single feature, where `m` is the slope and `b` is the intercept (or `y = w1*x1 + w2*x2 + ... + bias` for several features)
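As a quick illustration of the multi-feature form, a prediction is just a weighted sum of the inputs plus a bias. The weights, bias, and feature values below are made-up numbers chosen for easy arithmetic, not fitted values.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration
weights = np.array([0.25, 1.5])   # w1, w2
bias = 30.0

# One sample with two features: x1 = 100, x2 = 4
x = np.array([100.0, 4.0])

# y = w1*x1 + w2*x2 + bias, written as a dot product
y_hat = np.dot(weights, x) + bias
print(y_hat)  # 0.25*100 + 1.5*4 + 30 = 61.0
```

With a single feature this reduces to `y = mx + b`, where `m` is the only weight.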

**Use case**: Predict TDS (total dissolved solids) output based on filter age, usage, etc.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

%matplotlib inline

In [None]:
# Generate data: filter age → TDS output
np.random.seed(42)
age = np.random.randint(10, 365, 100)
tds = 30 + age * 0.25 + np.random.randn(100) * 10  # Linear relationship + noise

X = age.reshape(-1, 1)  # sklearn needs 2D input
y = tds

# Visualize first!
plt.figure(figsize=(8, 5))
plt.scatter(age, tds, alpha=0.5)
plt.xlabel('Filter Age (days)')
plt.ylabel('TDS Output (ppm)')
plt.title('Filter Age vs TDS - Can we draw a line through this?')
plt.show()
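The `reshape(-1, 1)` call in the cell above is easy to overlook. A minimal sketch of what it does, using three made-up ages: scikit-learn estimators expect `X` with shape `(n_samples, n_features)`, so a flat array of ages has to become a single column.

```python
import numpy as np

# Made-up values, for illustration only
age_1d = np.array([10, 50, 200])   # shape (3,): a flat array
age_2d = age_1d.reshape(-1, 1)     # shape (3, 1): 3 samples, 1 feature

print(age_1d.shape)
print(age_2d.shape)
```

The `-1` tells NumPy to infer the number of rows from the array length, so the same line works for any number of samples.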

In [None]:
# Step 1: Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Train
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Slope (coefficient): {model.coef_[0]:.4f}")  # How much TDS increases per day
print(f"Intercept (bias): {model.intercept_:.2f}")     # TDS when age=0
print(f"\nMeaning: TDS increases by ~{model.coef_[0]:.2f} ppm per day of filter age")
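To see where a slope and intercept like these come from, here is a minimal sketch of the closed-form least-squares solution for one feature, cross-checked against `np.polyfit`. It regenerates the lesson's synthetic data so it runs on its own, and it fits on all 100 points rather than the training split, so the numbers will be close to, but not identical to, the model's.

```python
import numpy as np

# Same synthetic data recipe as the lesson, regenerated so this runs alone
np.random.seed(42)
age = np.random.randint(10, 365, 100)
tds = 30 + age * 0.25 + np.random.randn(100) * 10

# Closed-form least squares for one feature:
#   slope     = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
x_mean, y_mean = age.mean(), tds.mean()
slope = np.sum((age - x_mean) * (tds - y_mean)) / np.sum((age - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against NumPy's degree-1 polynomial fit
poly_slope, poly_intercept = np.polyfit(age, tds, 1)

print(f"By hand: slope={slope:.4f}, intercept={intercept:.2f}")
print(f"polyfit: slope={poly_slope:.4f}, intercept={poly_intercept:.2f}")
```

Both lines should print the same values, and the slope should land near the 0.25 used to generate the data.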

In [None]:
# Step 3: Predict & Evaluate
y_pred = model.predict(X_test)

print("Evaluation Metrics:")
print(f"  R² Score: {r2_score(y_test, y_pred):.3f}")  # 1.0 = perfect, 0 = no better than predicting the mean (can be negative)
print(f"  MAE: {mean_absolute_error(y_test, y_pred):.1f} ppm")  # Average absolute error
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.1f} ppm")  # Penalizes big errors
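The three metrics above can also be computed by hand, which makes their definitions concrete. The sketch below uses four made-up actual/predicted pairs (not the lesson's data), and the `_demo` variable names are only there to avoid overwriting the notebook's `y_test` and `y_pred`.

```python
import numpy as np

# Tiny made-up example values, for illustration only
y_actual_demo = np.array([50.0, 60.0, 70.0, 80.0])
y_pred_demo = np.array([52.0, 58.0, 73.0, 77.0])

errors = y_actual_demo - y_pred_demo          # [-2, 2, -3, 3]

mae = np.mean(np.abs(errors))                 # (2 + 2 + 3 + 3) / 4 = 2.5
rmse = np.sqrt(np.mean(errors ** 2))          # sqrt((4 + 4 + 9 + 9) / 4) = sqrt(6.5)

ss_res = np.sum(errors ** 2)                  # residual sum of squares = 26
ss_tot = np.sum((y_actual_demo - y_actual_demo.mean()) ** 2)  # total sum of squares = 500
r2 = 1 - ss_res / ss_tot                      # 1 - 26/500 = 0.948

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

RMSE (about 2.55 here) is at least as large as MAE because squaring gives the larger errors more weight.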

In [None]:
# Visualize the prediction line
plt.figure(figsize=(8, 5))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual')
X_line = np.sort(X_test, axis=0)  # Sort ages so the line is drawn left to right
plt.plot(X_line, model.predict(X_line), color='red', linewidth=2, label='Prediction')
plt.xlabel('Filter Age (days)')
plt.ylabel('TDS Output (ppm)')
plt.title('Linear Regression: Age → TDS')
plt.legend()
plt.show()

## Exercise

1. Predict TDS for a filter that is 200 days old
2. Try adding more features (flow_rate, pressure) - does R² improve?
3. Plot actual vs predicted values (the points should fall close to the diagonal line)

In [None]:
# YOUR CODE HERE