# Lesson 5.4: Logistic Regression

## Classification Despite the Name!

Despite being called "regression", logistic regression predicts **categories** (yes/no, spam/not spam).

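Rather than outputting a label directly, it predicts a **probability** between 0 and 1, then compares that probability to a threshold (0.5 by default) to choose the category: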
- "This filter has a 92% chance of needing maintenance"

### PHP Parallel
A validation rule such as `'email' => 'required|email'` returns only true or false. Logistic regression also reports how confident it is: "this email is 87% likely to be spam."
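To make the contrast concrete, here is a minimal sketch of how a probability becomes a yes/no answer. The helper `classify` and the label strings are hypothetical names used only for illustration; they are not part of scikit-learn.

```python
def classify(probability, threshold=0.5):
    """Turn a predicted probability into a label (hypothetical helper)."""
    return "spam" if probability > threshold else "not spam"

# The same probability can produce different labels
# depending on how strict the threshold is.
print(classify(0.87))                  # default threshold of 0.5
print(classify(0.87, threshold=0.9))   # stricter threshold
```

The probability stays the same in both calls; only the cut-off changes. That is the extra information a true/false validation rule cannot give you.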

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns

%matplotlib inline

In [None]:
# The sigmoid function maps any real number to a value between 0 and 1
x = np.linspace(-10, 10, 100)
sigmoid = 1 / (1 + np.exp(-x))

plt.figure(figsize=(8, 4))
plt.plot(x, sigmoid, 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='Decision boundary (0.5)')
plt.xlabel('Input')
plt.ylabel('Probability')
plt.title('Sigmoid Function: Converts anything to 0-1')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
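A few point values can make the curve easier to read. The sketch below evaluates the same formula with the standard library only; the name `sigmoid_fn` is an illustrative choice (picked so it does not clash with the `sigmoid` array plotted above).

```python
import math

def sigmoid_fn(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 lands exactly on the 0.5 threshold.
for z in (-5, -1, 0, 1, 5):
    print(f"sigmoid({z:+d}) = {sigmoid_fn(z):.3f}")
```

Note the symmetry: `sigmoid_fn(-z)` equals `1 - sigmoid_fn(z)`, so an input of -1 is as far below 0.5 as an input of +1 is above it.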

In [None]:
# Water filter classification
np.random.seed(42)
n = 200

age = np.random.randint(10, 365, n)
tds = 30 + age * 0.25 + np.random.randn(n) * 15
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3

X = pd.DataFrame({'tds_output': tds, 'flow_rate': flow, 'age_days': age})
y = ((tds > 80) | (flow < 1.0)).astype(int)  # 1 = needs maintenance

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train
# Features are unscaled, so allow extra iterations for the solver to converge
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)  # Columns: [P(OK), P(maintenance)]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print("\nSample predictions with probabilities:")
for i in range(5):
    print(f"  Filter: TDS={X_test.iloc[i]['tds_output']:.0f}, "
          f"Prob(maintenance)={y_proba[i, 1]:.1%}, "
          f"Prediction={'Needs Maintenance' if y_pred[i] else 'OK'}")
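Where do those probabilities come from? For a two-class model, one way to see it is to rebuild them by hand: take a weighted sum of the features, add the intercept, and pass the result through the sigmoid. The sketch below uses a small made-up dataset and separate names (`X_demo`, `y_demo`, `demo_model`) so it does not overwrite the filter model above. These names and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration only: two features, binary label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.5, size=100)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Step 1: linear score z = w1*x1 + w2*x2 + b
z = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]

# Step 2: squash the score into a probability with the sigmoid
manual_proba = 1 / (1 + np.exp(-z))

# Compare with what scikit-learn reports for class 1
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print("Manual and predict_proba agree:", np.allclose(manual_proba, sklearn_proba))
```

The learned weights live in `coef_` and `intercept_`. A positive weight means larger values of that feature push the probability toward class 1.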

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['OK', 'Maintenance'], yticklabels=['OK', 'Maintenance'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

print(classification_report(y_test, y_pred, target_names=['OK', 'Maintenance']))
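The precision and recall columns in the report are derived from the four counts in the confusion matrix. As a rough sketch, the example below unpacks those counts for a tiny hand-written set of labels (made up for illustration, not taken from the filter data) and recomputes both metrics.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only (1 = needs maintenance)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# scikit-learn puts actual classes in rows and predicted classes in
# columns, so ravel() yields the counts in this order for labels 0 and 1.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the filters flagged, how many really needed work
recall = tp / (tp + fn)     # of the filters needing work, how many were flagged

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}")
```

The same unpacking works on the `cm` computed above, which may help when comparing the heatmap with the report.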

## Exercise

1. What does each quadrant of the confusion matrix mean?
2. Change the decision threshold from 0.5 to 0.3 so the model flags more maintenance cases. How does the confusion matrix change?
3. Which metric matters more for water filters: precision or recall? Why?

In [None]:
# YOUR CODE HERE
# Hint for #2: y_pred_custom = (y_proba[:, 1] > 0.3).astype(int)