# Module 03: Logistic Regression

**CS229 Aligned Curriculum** | *Gold Standard Edition*

## 📖 0. Definition & When to Use

### What is Logistic Regression?

Alright, let's start with the basics. **Logistic Regression** is a classification algorithm for **binary outcomes** (0 or 1).

**Formula:** $P(y=1|x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

**Output:** A probability between 0 and 1

**Decision Rule:** Predict 1 if $P(y=1|x) \geq$ threshold (default 0.5)
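To make the formula and the decision rule concrete, here is a minimal NumPy sketch. The parameter vector `theta` and the example `x` are made-up illustration values, not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one example; x[0] = 1 is the bias term.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, -0.4])

z = theta @ x            # linear score theta^T x
p = sigmoid(z)           # P(y=1 | x)
label = int(p >= 0.5)    # decision rule with the default threshold

print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}, predicted label = {label}")
```

Changing the threshold in the last step changes the predicted label, not the probability itself.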

---

### 🎯 When to Use Logistic Regression?

| Scenario | ✅ Use It | ❌ Don't Use It |
|----------|---------|----------------|
| **Binary classification** | ✅ Spam/Not, Pass/Fail, Fraud/Legit | ❌ Multi-class (use Softmax) |
| **Probability needed** | ✅ Risk scoring, confidence levels | ❌ Just hard labels (use Perceptron) |
| **Interpretable coefficients** | ✅ Medical diagnosis explanation | ❌ Black-box acceptable (use NN) |
| **Linear decision boundary OK** | ✅ Linearly separable data | ❌ Complex boundaries (use SVM/NN) |

---

### 🌍 Real-World Examples:

- 💳 **Fraud Detection**: P(Fraud) > 0.8 → Block transaction
- 🏥 **Medical Diagnosis**: P(Disease) based on symptoms, lab results
- 📧 **Email Filtering**: Spam vs Not Spam classification
- 🎓 **Student Admission**: Predict acceptance probability based on GPA, SAT

---

**💡 Rule of Thumb:**
- Linear models (Linear/Logistic/Softmax) = **interpretable, fast, good baseline**
- Non-linear models (SVM/Neural Nets) = **often higher accuracy on complex data, less interpretable**
- Always start simple → Add complexity if needed!

## 📖 1. Introduction: Linear vs Logistic Regression

### 🎯 When to Use Linear vs Logistic?

Here's a quick comparison to help you choose:

| Aspect | Linear Regression | Logistic Regression |
|-------|-------------------|---------------------|
| **Output** | Continuous (numbers) | Binary/Categorical (class) |
| **Range** | $(-\infty, +\infty)$ | $(0, 1)$ (probability) |
| **Example** | Predict house price | Spam vs Not Spam |
| **Loss Function** | MSE (Mean Squared Error) | Cross-Entropy |
| **Decision** | Direct prediction | Threshold (e.g., > 0.5) |

---

### 🎭 Beginner's Analogy: Email Spam Filter

**Scenario:** You want to filter spam emails.

**❌ WRONG: Using Linear Regression**
```
Model output: 0.3, 1.5, -0.2, 0.95
Problem: What does 1.5 mean? "Super spam"? -0.2 = "Anti-spam"?
```

**✅ CORRECT: Using Logistic Regression**
```
Model output: 0.3, 0.95, 0.05, 0.87
Interpretation: P(Spam) - probability between 0-1!
Decision: If P(Spam) > 0.5 → Spam ✅
```

**💡 Key Insight:** Logistic transforms linear output into probability using the **Sigmoid Function**.

---

### 🌟 Level 2: Doctor Diagnosis Analogy

Think about it this way. A doctor sees patient symptoms: fever (X₁), cough (X₂), shortness of breath (X₃).

**Linear Regression:** Output = 1.2 (what does that mean? 120% sick?)

**Logistic Regression:** Output = 0.85 = **85% probability** patient has the disease.

**Threshold:**
- P > 0.8 → Positive diagnosis, prescribe treatment
- P < 0.2 → Negative diagnosis, patient is healthy
- 0.2 < P < 0.8 → Gray area, further testing needed

**💡 Trade-off:**
- **Low threshold (0.3):** Sensitive, detects more sick patients (but high false alarm rate)
- **High threshold (0.9):** Specific, only diagnoses when certain (but misses some cases)
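
To see the trade-off in numbers, here is a small sketch with made-up probabilities and labels (all values are illustrative assumptions, not real patient data). It computes sensitivity (recall on the sick) and specificity (recall on the healthy) at three thresholds.

```python
# Hypothetical predicted probabilities and true labels (1 = sick, 0 = healthy).
probs  = [0.95, 0.85, 0.60, 0.40, 0.35, 0.25, 0.10, 0.05]
labels = [1,    1,    1,    1,    0,    0,    0,    0]

def sensitivity_specificity(probs, labels, threshold):
    """Return (sensitivity, specificity) for a given decision threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for yp, yt in zip(preds, labels) if yp == 1 and yt == 1)
    fn = sum(1 for yp, yt in zip(preds, labels) if yp == 0 and yt == 1)
    tn = sum(1 for yp, yt in zip(preds, labels) if yp == 0 and yt == 0)
    fp = sum(1 for yp, yt in zip(preds, labels) if yp == 1 and yt == 0)
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.3, 0.5, 0.9):
    sens, spec = sensitivity_specificity(probs, labels, t)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data the low threshold catches every sick patient but raises one false alarm, while the high threshold raises none and misses most of the sick patients.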

## 📐 1.5 Mathematical Derivation: MLE → Cross-Entropy Loss

### 🎯 Why Cross-Entropy?

Alright, math time, but don't worry: we'll walk through it step by step.

**Step 1: Likelihood for 1 Data Point**

For binary classification, we assume:
$$P(y|x) = \begin{cases} h_\theta(x) & \text{if } y = 1 \\ 1 - h_\theta(x) & \text{if } y = 0 \end{cases}$$

We can write this more compactly:
$$P(y|x; \theta) = h_\theta(x)^y \cdot (1 - h_\theta(x))^{1-y}$$

**Step 2: Likelihood for All Data**

$$\mathcal{L}(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \cdot (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}$$

**Step 3: Log-Likelihood (for convenience)**

$$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))$$

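**Step 4: Maximize → Minimize Negative**

Maximizing $\ell(\theta)$ is the same as minimizing its negative. Dividing by $m$ turns the sum into an average, which does not change the minimizer:

$$J(\theta) = -\frac{1}{m}\ell(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right]$$

**✅ And there it is: CROSS-ENTROPY LOSS!**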

This is one of those beautiful moments in ML where probability theory and optimization align perfectly.
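
The gradient descent code later in this module updates $\theta$ with the gradient $\nabla J(\theta) = \frac{1}{m} X^T (h - y)$. As a sanity check on the derivation, the sketch below compares that closed form with a finite-difference approximation of $J$. The random data, the seed, and the helper names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))                    # hypothetical features
y = rng.integers(0, 2, size=m).astype(float)   # hypothetical 0/1 labels
theta = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta):
    """Cross-entropy J(theta): the negative average log-likelihood."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_grad(theta):
    """Closed-form gradient (1/m) * X^T (h - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / m

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric_grad = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

max_diff = np.max(np.abs(numeric_grad - analytic_grad(theta)))
print(f"largest difference between the two gradients: {max_diff:.2e}")
```

A tiny difference suggests the closed-form gradient matches the cost function it is supposed to differentiate.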

---

### 🔍 Intuition: Why Log?

**Analogy: Escalating Penalty**

| Prediction | Actual | Loss | Interpretation |
|----------|--------|------|--------------|
| **0.99** | 1 | 0.01 | ✅ Great, small penalty |
| **0.5** | 1 | 0.69 | ⚠️ Uncertain |
| **0.01** | 1 | 4.61 | ❌ Completely wrong, HUGE penalty |

**💡 Key Insight:** Cross-Entropy **heavily penalizes** wrong predictions made with high confidence!
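
The loss values in the table can be reproduced with a few lines of Python. This is a minimal sketch using the natural logarithm; `cross_entropy` is a helper name chosen here for illustration.

```python
import math

def cross_entropy(p, y):
    """Per-example cross-entropy for predicted probability p and true label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for p in (0.99, 0.5, 0.01):
    print(f"prediction={p:<4}  actual=1  loss={cross_entropy(p, 1):.2f}")
```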

## 🧪 2. Statistical Framework: Maximum Likelihood

### Why Cross-Entropy Loss?

Alright, here's where the stats come in. We model $P(y=1|x) = \sigma(\theta^T x)$, where $\sigma$ is the sigmoid function.

For each data point $(x, y)$, writing $h = h_\theta(x)$ for short, the likelihood is:
$$P(y|x) = h^y (1-h)^{1-y}$$

Log-likelihood:
$$\ell(\theta) = \sum [y \log h + (1-y) \log(1-h)]$$

**Maximize log-likelihood = Minimize Cross-Entropy Loss!** See how elegant that is?

### Why Sigmoid?

Now here's what makes sigmoid so clever:

1. Output is always between 0 and 1 (valid probability)
2. Easy derivative: σ'(z) = σ(z)(1 - σ(z))
3. Arises naturally as the response function of a GLM with a Bernoulli distribution
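
Property 2 is a short consequence of the chain rule. Starting only from the definition $\sigma(z) = (1 + e^{-z})^{-1}$:

$$\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\,(1-\sigma(z))$$

The last step uses $1 - \sigma(z) = \frac{e^{-z}}{1+e^{-z}}$. The derivative is largest at $z = 0$, where $\sigma'(0) = 0.25$.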

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# 1. SETUP DATA (2 Classes)
np.random.seed(0)
m = 100
# We have 2 features: x1 (Exam Score 1), x2 (Exam Score 2)
X = np.random.randn(m, 2) 
# Label y: 1 if Exam1 + Exam2 > 0, else 0 (Diagonal boundary)
y = (X[:, 0] + X[:, 1] > 0).astype(int).reshape(-1, 1)

# Bias Trick (x0 = 1)
X_b = np.c_[np.ones((m, 1)), X] 

print("Shape Data:", X_b.shape)
print("Shape Label:", y.shape)

In [None]:
# 2. MANUAL FUNCTIONS

def sigmoid(z):
    # Formula: 1 / (1 + e^-z)
    return 1 / (1 + np.exp(-z))

def hypothesis(X, theta):
    z = X.dot(theta)
    return sigmoid(z)

def compute_cost(X, y, theta):
    m = len(y)
    h = hypothesis(X, theta)
    # Prevent log(0) error with small epsilon
    epsilon = 1e-5 
    
    # Log Likelihood formula (Negative because we want to minimize cost)
    # J = -1/m * sum( y*log(h) + (1-y)*log(1-h) )
    term1 = y * np.log(h + epsilon)
    term2 = (1 - y) * np.log(1 - h + epsilon)
    cost = -(1/m) * np.sum(term1 + term2)
    return cost

def gradient_descent(X, y, theta, alpha, n_iterations):
    m = len(y)
    cost_history = []
    
    for i in range(n_iterations):
        # 1. Forward Pass
        preds = hypothesis(X, theta)
        
        # 2. Error Vector (Predictions - Actual)
        error = preds - y
        
        # 3. Gradient Calculation
        # Same structure as Linear Regression!
        gradients = (1/m) * X.T.dot(error)
        
        # 4. Update Theta
        theta = theta - alpha * gradients
        
        cost_history.append(compute_cost(X, y, theta))
        
    return theta, cost_history

In [None]:
# 3. TRAINING
theta_init = np.zeros((3, 1)) # Start from zeros
alpha = 0.5
iterations = 1000

theta_final, history = gradient_descent(X_b, y, theta_init, alpha, iterations)

print("Final Theta:", theta_final.ravel())
print("Direction should be roughly [0, 1, 1]: theta0 near 0 and theta1 close to theta2 (the rule is x1 + x2 > 0)")

In [None]:
# 4. VISUALIZE DECISION BOUNDARY
plt.figure(figsize=(10,4))

# Cost History
plt.subplot(1,2,1)
plt.plot(history)
plt.title("Cost History (Log Loss)")

# Decision Boundary
plt.subplot(1,2,2)
# Scatter plot (Blue=0, Red=1)
plt.scatter(X[y.ravel()==0, 0], X[y.ravel()==0, 1], color='blue', label='Class 0')
plt.scatter(X[y.ravel()==1, 0], X[y.ravel()==1, 1], color='red', label='Class 1')

# Draw decision boundary: theta0 + theta1*x1 + theta2*x2 = 0
# x2 = -(theta0 + theta1*x1) / theta2
x1_vals = np.array([-3, 3])
x2_vals = -(theta_final[0] + theta_final[1]*x1_vals) / theta_final[2]
plt.plot(x1_vals, x2_vals, "k--", label="Decision Boundary")
plt.legend()
plt.title("Classification Results")

plt.show()

### 🔍 How to Read Output & Graphs

**Training Output:**
- Final Theta [θ₀, θ₁, θ₂]: Trained model coefficients
- θ₀ = bias/intercept (should be close to 0 here)
- θ₁ ≈ θ₂ > 0 because data was generated with rule x₁ + x₂ > 0. Only the direction is pinned down: the classes are perfectly separable, so the magnitudes keep growing as training continues

**Cost History Graph:**
- Y-axis = Log Loss (Cross Entropy)
- ✅ **Good:** Curve decreases and stabilizes
- ⚠️ **Warning:** If it oscillates → learning rate too high

**Decision Boundary Graph:**
- Dashed line = decision boundary (h(x) = 0.5)
- Above line → Predict 1 (red)
- Below line → Predict 0 (blue)

**Logistic vs Linear Regression:**

| Aspect | Linear | Logistic |
|-------|--------|----------|
| Output | Any number | 0-1 (probability) |
| Loss | MSE | Cross Entropy |
| Use case | Regression | Classification |

**💡 Key Takeaway:** Logistic = Linear Regression + Sigmoid + Cross Entropy!

---

# 🏭 Industry Examples: Logistic Regression

## 1. 💰 Finance: Loan Default Prediction
- Features: Income, credit score, debt ratio
- Output: P(default) → Approve if < threshold
- Interpretable: Each coefficient is the change in log-odds per unit of its feature (its exponential is an odds ratio)

## 2. 🏥 Healthcare: Disease Screening
- Features: Lab results, symptoms
- Output: P(disease | symptoms)
- Threshold tuning based on cost of false negative

## 3. 📱 Tech: Click-Through Rate
- Features: User features, ad features
- Output: P(click | user, ad)
- Real-time serving for ad bidding

---

## 📊 3. Evaluation Metrics (Beyond Accuracy)

Alright, so here's the thing: accuracy alone won't cut it! This is especially true with **imbalanced data** (think cancer detection: 99% healthy, 1% sick).

Check this out: If we just predict "everyone is healthy", accuracy is 99%, but the model COMPLETELY FAILS to detect the disease. Pretty bad, right?

That's why we need:
1. **Confusion Matrix**: See True Positive, False Positive, etc.
2. **Precision & Recall**: Trade-off between "correct predictions" and "catching all cases".
3. **F1-Score**: Harmonic mean of Precision & Recall.
4. **ROC & AUC**: How good is the model at distinguishing class 0 from class 1?
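
The "everyone is healthy" trap is easy to reproduce. The sketch below uses a hypothetical population of 1000 people with the 99%/1% split from the text and a "model" that always predicts healthy.

```python
# Hypothetical screening population: 990 healthy (0) and 10 sick (1).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000          # always predict "healthy"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)      # share of sick patients that were detected

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Precision is undefined here because the model never predicts the positive class (TP + FP = 0), which is itself a warning sign.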

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import seaborn as sns
import matplotlib.pyplot as plt

def evaluate_classification(y_true, y_pred, y_prob=None, title="Model Evaluation"):
    # 1. Classification Report & Confusion Matrix
    print(f"🔹 {title} Report:")
    print(classification_report(y_true, y_pred))
    
    plt.figure(figsize=(12, 5))
    
    # Plot Confusion Matrix
    plt.subplot(1, 2, 1)
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'{title} - Confusion Matrix')
    
    # 2. ROC Curve (if probs provided)
    if y_prob is not None:
        plt.subplot(1, 2, 2)
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        roc_auc = auc(fpr, tpr)
        
        plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'{title} - ROC Curve')
        plt.legend(loc="lower right")
    
    plt.tight_layout()
    plt.show()

# Example usage (make sure y_test, y_pred, and y_prob exist from previous cells)
# evaluate_classification(y_test, y_pred, y_prob, title="Logistic Regression (Scratch)")

### 🔍 How to Read Confusion Matrix:

```
              Predicted
           |  0  |  1  |
Actual  0  | TN  | FP  |
        1  | FN  | TP  |
```

**Explanation:**
- **TN (True Negative):** Correctly predicted 0 ✅
- **TP (True Positive):** Correctly predicted 1 ✅
- **FP (False Positive):** Wrong, predicted 1 but was 0 ❌ (Type I Error)
- **FN (False Negative):** Wrong, predicted 0 but was 1 ❌ (Type II Error)

**Watch for:**
- **High FP:** Model is too "anxious" (lots of false alarms)
- **High FN:** Model is too "complacent" (missing many cases)

**💡 Trade-off:** Lower threshold → FP goes up, FN goes down

---
## 🚀 4. Production Quality & Feature Engineering

Alright, now let's talk about taking our model to production. We need **Modularity** and **Automation** to make this work in the real world.

### 4.1 Interaction Features
Sometimes the effect of one variable depends on the value of another (for example, the effect of 'Age' might differ based on 'Gender'). When that happens, we create interaction features: new columns formed by multiplying the original ones.
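
A minimal sketch of an interaction feature built by hand. The `age` and `gender` columns are made-up values; with two input columns, `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)` should produce the same three columns.

```python
import numpy as np

# Hypothetical columns: age in years and a 0/1 gender indicator.
age = np.array([25.0, 40.0, 60.0])
gender = np.array([0.0, 1.0, 1.0])

# The interaction term lets the model learn a different age effect per gender.
interaction = age * gender

X_aug = np.column_stack([age, gender, interaction])
print(X_aug)
```

A logistic model fitted on `X_aug` gets one extra coefficient for the product column, so the slope for age is no longer forced to be the same for both groups.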

### 4.2 Pipeline Automation
Instead of doing scaling and encoding manually every single time, we wrap everything in a **Pipeline** to ensure consistency between training and serving. This prevents the classic "works on my machine" problem!

In [None]:
# ═══════════════════════════════════════════════════════════════
# 🚀 4. PRODUCTION QUALITY & FEATURE ENGINEERING
# ═══════════════════════════════════════════════════════════════
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Create sample data if not already available
if 'X_train' not in dir() or 'X_test' not in dir():
    print("⚠️ X_train/X_test not found. Creating sample data...")
    from sklearn.datasets import make_classification
    X_sample, y_sample = make_classification(
        n_samples=200, 
        n_features=10, 
        n_informative=8,
        n_redundant=2,
        random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X_sample, y_sample, test_size=0.3, random_state=42
    )
    print(f"✅ Created sample data: {X_train.shape}")

# Scale the features first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Original features: {X_train.shape[1]}")

# 1. Feature Interaction Demo
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X_train_scaled)
print(f"Features with Interaction: {X_interact.shape[1]}")
print(f"Increase: {X_interact.shape[1] - X_train_scaled.shape[1]} new features")

# 2. Scikit-Learn Pipeline Automation
from sklearn.linear_model import LogisticRegression

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('interaction', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('logistic', LogisticRegression(class_weight='balanced', max_iter=1000))
])

# Fit pipeline (uses ORIGINAL X_train, not scaled - pipeline handles it!)
full_pipeline.fit(X_train, y_train)

# Evaluate
train_score = full_pipeline.score(X_train, y_train)
test_score = full_pipeline.score(X_test, y_test)

print(f"\n📊 Pipeline Performance:")
print(f"Train Score: {train_score:.4f}")
print(f"Test Score: {test_score:.4f}")

if train_score - test_score > 0.1:
    print("⚠️ Warning: Possible overfitting (train >> test)")
else:
    print("✅ Good generalization")


### 🔍 How to Read ROC-AUC:

**ROC Curve:** Plots TPR (Recall) vs FPR at various thresholds.

**AUC (Area Under Curve):** A single-number summary of the ROC curve, from 0.5 (random ranking) to 1.0 (perfect ranking).

The bands below are a common rule of thumb; what counts as good enough depends on the application.

| AUC Value | Quality | Interpretation |
|-----------|----------|---------------|
| **1.0** | Perfect | Flawless ranking |
| **0.9 - 1.0** | Excellent | Strong separation |
| **0.8 - 0.9** | Good | Usually acceptable |
| **0.7 - 0.8** | Fair | Needs improvement |
| **0.5 - 0.7** | Poor | Weak separation |
| **0.5** | Random | No better than coin flip! |

**If ROC curve approaches top-left:** Good classifier (high TPR with low FPR)

**If ROC curve = diagonal:** Random guess; the scores carry no information about the class.
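
One way to make AUC concrete: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting as half. The sketch below counts those pairs on made-up scores; the result should agree with `sklearn.metrics.roc_auc_score` on the same inputs.

```python
import numpy as np

# Hypothetical true labels and model scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.90, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
auc_rank = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"correctly ranked pairs: {np.sum(diff > 0)} of {diff.size}")
print(f"AUC from pair counting: {auc_rank:.3f}")
```

Here one negative (score 0.80) outranks two of the positives, so 10 of the 12 pairs are ordered correctly.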

---

## 🎬 5. Animated Visualizations

Now here's the fun part: let's see gradient descent in action!

### 5.1 Decision Boundary Evolution

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

np.random.seed(42)

# Generate 2-class data
X0 = np.random.randn(50, 2) + np.array([-2, -2])
X1 = np.random.randn(50, 2) + np.array([2, 2])
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(50), np.ones(50)])

# Add bias term
X_b = np.c_[np.ones(100), X]

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Track gradient descent
theta = np.zeros(3)
theta_history = [theta.copy()]
alpha = 0.1

for _ in range(50):
    h = sigmoid(X_b @ theta)
    gradient = X_b.T @ (h - y) / len(y)
    theta = theta - alpha * gradient
    theta_history.append(theta.copy())

# Animation
fig, ax = plt.subplots(figsize=(10, 8))
xx = np.linspace(-5, 5, 100)

def animate(frame):
    ax.clear()
    theta = theta_history[min(frame, len(theta_history)-1)]
    
    ax.scatter(X[y==0, 0], X[y==0, 1], c='blue', s=50, label='Class 0', alpha=0.7)
    ax.scatter(X[y==1, 0], X[y==1, 1], c='red', s=50, label='Class 1', alpha=0.7)
    
    if abs(theta[2]) > 0.001:
        yy = -(theta[0] + theta[1] * xx) / theta[2]
        ax.plot(xx, yy, 'g-', lw=3, label='Decision Boundary')
        ax.fill_between(xx, yy, 5, alpha=0.1, color='red')
        ax.fill_between(xx, -5, yy, alpha=0.1, color='blue')
    
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title(f'Logistic Regression - Iteration {frame}')
    ax.legend(loc='upper left')
    ax.grid(True, alpha=0.3)
    return []

# blit=False because animate() clears and redraws the whole axes on every frame
anim = FuncAnimation(fig, animate, frames=len(theta_history), interval=200, blit=False)
plt.close()
HTML(anim.to_jshtml())

---

## 📝 9. Exercises with Solutions

Let's test your understanding with some practice problems!

### Exercise 1: Sigmoid Properties

**Q:** Prove that σ(-z) = 1 - σ(z)

<details>
<summary>🔑 Solution</summary>

$$\sigma(-z) = \frac{1}{1+e^{z}} = \frac{1}{1+e^z} \cdot \frac{e^{-z}}{e^{-z}} = \frac{e^{-z}}{e^{-z}+1}$$

$$1 - \sigma(z) = 1 - \frac{1}{1+e^{-z}} = \frac{1+e^{-z}-1}{1+e^{-z}} = \frac{e^{-z}}{1+e^{-z}}$$

They're the same! ✅
</details>

---

### Exercise 2: Threshold Selection

**Q:** For medical screening, is a threshold of 0.5 good? Why or why not?

<details>
<summary>🔑 Solution</summary>

No! Cost of FN (missing disease) >> Cost of FP (extra test).

Use a lower threshold (e.g., 0.2) to:
- Achieve High Recall (catch most positives)
- Accept lower Precision (more false positives)

Better safe than sorry!
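
One way to motivate a specific number: if the predicted probabilities are well calibrated and correct decisions cost nothing, predicting positive has lower expected cost whenever p > C_FP / (C_FP + C_FN). The sketch below applies that rule; the cost values and the helper name `cost_optimal_threshold` are hypothetical.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold that minimizes expected cost for calibrated probabilities.

    Predict positive when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# Equal costs recover the default threshold.
print(cost_optimal_threshold(cost_fp=1, cost_fn=1))

# Assumption: a missed disease is 4x as costly as an unnecessary extra test.
print(cost_optimal_threshold(cost_fp=1, cost_fn=4))
```

Under the assumed 4-to-1 cost ratio, the rule lands on the 0.2 threshold suggested above.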
</details>

---

### Exercise 3: Coefficient Interpretation

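**Q:** If θ_income = 0.05, what does it mean?

<details>
<summary>🔑 Solution</summary>

For every 1-unit increase in income (holding the other features fixed):
- Log-odds ↑ by 0.05
- Odds are multiplied by e^0.05 ≈ 1.05 (about a 5% increase in the odds, not in the probability)
- Probability of positive class ↑ by at most θ/4 ≈ 0.0125 (about 1.25 percentage points, reached near a baseline of p = 0.5); the exact change depends on the baseline

Note: The odds-ratio interpretation holds at every baseline, while the change in probability does not!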
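
A minimal numeric check of this interpretation, using the coefficient from the question. The baseline probabilities are arbitrary illustration values.

```python
import math

theta_income = 0.05
odds_ratio = math.exp(theta_income)   # multiplicative change in the odds

def prob_after_unit_increase(p):
    """Probability after a 1-unit income increase, starting from baseline p."""
    odds = p / (1 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

print(f"odds ratio = {odds_ratio:.4f}")
for p in (0.1, 0.5, 0.9):
    change = prob_after_unit_increase(p) - p
    print(f"baseline p={p:.1f} -> change in probability = {change:+.4f}")
```

The odds ratio is the same for every baseline, while the change in probability is largest near p = 0.5 and shrinks toward the extremes.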
</details>

---


## 🚀 10. Deployment Case Study: Fraud Detection API

Now let's see how this works in the real world!

### Architecture
```
Transaction → Feature Extraction → Logistic Model → Risk Score → Decision
```

### FastAPI Code
```python
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('fraud_detector.pkl')
scaler = joblib.load('scaler.pkl')

@app.post('/predict_fraud')
def predict(amount: float, hour: int, distance_km: float):
    X = np.array([[amount, hour, distance_km]])
    X_scaled = scaler.transform(X)
    prob = model.predict_proba(X_scaled)[0, 1]
    
    decision = 'BLOCK' if prob > 0.7 else 'REVIEW' if prob > 0.3 else 'APPROVE'
    # Cast to a built-in float so the response contains a plain JSON number
    return {'fraud_probability': round(float(prob), 3), 'decision': decision}
```

### Monitoring Metrics
| Metric | Target | Alert |
|--------|--------|-------|
| Precision | > 0.8 | < 0.6 |
| Recall | > 0.9 | < 0.7 |
| Latency (p99) | < 50ms | > 100ms |
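
A minimal sketch of how the alert column could be checked in code. The rule names, the `check_alerts` helper, and the sample metrics are hypothetical; only the alert limits come from the table.

```python
# Alert limits copied from the table; the metric keys are hypothetical names.
ALERT_RULES = {
    "precision": ("below", 0.6),
    "recall": ("below", 0.7),
    "latency_p99_ms": ("above", 100.0),
}

def check_alerts(metrics):
    """Return the names of all metrics that crossed their alert limit."""
    alerts = []
    for name, (direction, limit) in ALERT_RULES.items():
        value = metrics[name]
        if direction == "below" and value < limit:
            alerts.append(name)
        elif direction == "above" and value > limit:
            alerts.append(name)
    return alerts

# Made-up snapshot of one monitoring window.
snapshot = {"precision": 0.85, "recall": 0.65, "latency_p99_ms": 40.0}
print(check_alerts(snapshot))
```

In this snapshot only recall has dropped under its alert limit, so it is the only metric reported.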

---

In [None]:
import numpy as np
try:
    from numba import njit
    HAS_NUMBA = True
except ImportError:
    HAS_NUMBA = False
    # Mock njit decorator if not available
    def njit(func):
        return func

@njit
def _sigmoid_fast(z):
    """Sigmoid function with Numba acceleration."""
    return 1 / (1 + np.exp(-z))

class LogisticRegressionScratch:
    """
    Optimized Logistic Regression implementation from scratch.
    Supports: Mini-batch Gradient Descent, Early Stopping, and Class Weighting.
    """
    def __init__(self, learning_rate=0.01, n_iterations=1000, batch_size=None, 
                 tol=1e-4, patience=10, class_weight=None):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.batch_size = batch_size # If None, use full batch
        self.tol = tol # Tolerance for Early Stopping
        self.patience = patience # Wait epochs before stopping
        self.class_weight = class_weight # 'balanced' or dict
        self.theta = None
        self.cost_history = []

    def fit(self, X, y):
        """
        Train the model using Gradient Descent.
        Includes shape validation and early stopping logic.
        """
        try:
            m, n = X.shape
            self.theta = np.zeros(n)
            
            # Compute weights for handling imbalance
            weights = self._get_weights(y, m)
            
            best_cost = np.inf
            wait_count = 0
            
            batch_size = self.batch_size if self.batch_size else m
            
            for i in range(self.n_iterations):
                # Shuffle for Mini-batch / Stochastic behavior
                indices = np.random.permutation(m)
                X_shuffled = X[indices]
                y_shuffled = y[indices]
                w_shuffled = weights[indices]
                
                for j in range(0, m, batch_size):
                    X_batch = X_shuffled[j:j+batch_size]
                    y_batch = y_shuffled[j:j+batch_size]
                    w_batch = w_shuffled[j:j+batch_size]
                    
                    # Vectorized forward pass and gradient update
                    z = np.dot(X_batch, self.theta)
                    h = _sigmoid_fast(z)
                    error = h - y_batch
                    gradient = (1/len(y_batch)) * np.dot(X_batch.T, error * w_batch)
                    self.theta -= self.lr * gradient
                
                # Calculate cost on full training set for monitoring
                full_h = _sigmoid_fast(np.dot(X, self.theta))
                current_cost = self._compute_cost(y, full_h, weights, m)
                self.cost_history.append(current_cost)
                
                # Early Stopping Logic
                if current_cost < best_cost - self.tol:
                    best_cost = current_cost
                    wait_count = 0
                else:
                    wait_count += 1
                    if wait_count >= self.patience:
                        print(f"Early stopping triggered at iteration {i}")
                        break
                        
        except Exception as e:
            # Report the problem, then re-raise so callers never continue with a half-fitted model
            print(f"Error during fitting: {e}")
            raise

    def _get_weights(self, y, m):
        if self.class_weight == 'balanced':
            n0 = np.sum(y == 0)
            n1 = np.sum(y == 1)
            w0 = m / (2 * n0)
            w1 = m / (2 * n1)
            return np.where(y == 0, w0, w1)
        elif isinstance(self.class_weight, dict):
            return np.array([self.class_weight[label] for label in y])
        return np.ones(m)

    def _compute_cost(self, y, h, weights, m):
        epsilon = 1e-15
        return -(1/m) * np.sum(weights * (y * np.log(h+epsilon) + (1-y) * np.log(1-h+epsilon)))

    def predict_proba(self, X):
        return _sigmoid_fast(np.dot(X, self.theta))

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)


---
## 🚢 6. Real World Case Study - Titanic Survival

Now let's apply Logistic Regression to the legendary **Titanic** dataset!
Target: Predict `Survived` (1 = Survived, 0 = Did not survive).

Real-World Challenges:
- **Categorical** data (Sex, Embarked) → Need Encoding.
- **Missing** data (Age, Cabin) → Need Imputation.
- **Imbalanced** data (mildly: more passengers died than survived).

In [None]:
# 1. Load Data
import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

print("Shape:", df.shape)
df.head()

In [None]:
# 2. Preprocessing & Feature Engineering

# Handle Missing Values
# Assign the result back: chained fillna(..., inplace=True) triggers a warning
# in recent pandas versions and may not modify df at all
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Select Features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
y = df['Survived']

# Manual binary encoding of Sex (for understanding)
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})

# Check class imbalance
print("Class Distribution:\n", y.value_counts(normalize=True))

# Train Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling (Required for Gradient Descent!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Add Bias Term manually for our scratch model
X_train_b = np.c_[np.ones((len(X_train_scaled), 1)), X_train_scaled]
X_test_b = np.c_[np.ones((len(X_test_scaled), 1)), X_test_scaled]

In [None]:
# 3. Train Models (Scratch vs Sklearn)

# A. Scratch Model (Balanced Weight)
model_scratch = LogisticRegressionScratch(learning_rate=0.1, n_iterations=3000, class_weight='balanced')
model_scratch.fit(X_train_b, y_train.values)

# B. Sklearn Model
clf_sklearn = LogisticRegression(class_weight='balanced', random_state=42)
clf_sklearn.fit(X_train_scaled, y_train)

# Plot Loss History Scratch
plt.figure(figsize=(8, 4))
plt.plot(model_scratch.cost_history)
plt.title('Training Loss (Scratch Model)')
plt.xlabel('Epochs')
plt.ylabel('Log Loss')
plt.show()

# Predictions
y_pred_scratch = model_scratch.predict(X_test_b)
y_prob_scratch = model_scratch.predict_proba(X_test_b)

y_pred_sklearn = clf_sklearn.predict(X_test_scaled)
y_prob_sklearn = clf_sklearn.predict_proba(X_test_scaled)[:, 1]

In [None]:
# 4. Evaluate SKLEARN
evaluate_classification(y_test, y_pred_sklearn, y_prob_sklearn, title="Titanic - Sklearn")

## 🔧 7. Advanced Tuning & Diagnostics

Alright, after training comes the real work. Here's what we need to do:
1.  **VIF Analysis**: Check if features are highly correlated with each other (Multicollinearity).
2.  **Threshold Tuning**: That default 0.5 threshold? Not always optimal, especially with imbalanced data. We'll find the threshold that maximizes F1-Score.
3.  **Cross Validation**: Manually validate our scratch model using K-Fold.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import f1_score

def calculate_vif(X, feature_names):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = feature_names
    vif_data["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
    return vif_data

# 1. Check Multicollinearity
print("Checking VIF (Variance Inflation Factor)...")
vif_df = calculate_vif(X_train_scaled, features)
print(vif_df)

# 2. Threshold Tuning (Optimize F1-Score)
# Note: for simplicity this tunes on the test set, which makes the reported F1 optimistic.
# In practice, tune the threshold on a separate validation split.
thresholds = np.arange(0.1, 1.0, 0.05)
f1_scores = []
for t in thresholds:
    y_p = model_scratch.predict(X_test_b, threshold=t)
    f1_scores.append(f1_score(y_test, y_p))

best_thresh = thresholds[np.argmax(f1_scores)]
print(f"\nOptimal Threshold for F1-Score: {best_thresh:.2f}")

plt.figure(figsize=(8, 4))
plt.plot(thresholds, f1_scores, marker='o')
plt.axvline(best_thresh, color='r', linestyle='--', label=f'Best: {best_thresh:.2f}')
plt.title('Threshold vs F1-Score')
plt.xlabel('Threshold')
plt.ylabel('F1 Score')
plt.legend()
plt.show()

# 3. Manual Cross Validation (Bonus)
print("\nRunning Manual 5-Fold CV for Scratch Model...")
indices = np.arange(len(X_train_b))
fold_size = len(X_train_b) // 5
scores = []

for i in range(5):
    # Fold i is the validation set; the remaining rows are used for training
    val_idx = indices[i*fold_size : (i+1)*fold_size]
    train_idx = np.concatenate([indices[:i*fold_size], indices[(i+1)*fold_size:]])

    X_tr_cv, y_tr_cv = X_train_b[train_idx], y_train.values[train_idx]
    X_val_cv, y_val_cv = X_train_b[val_idx], y_train.values[val_idx]

    m_cv = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
    m_cv.fit(X_tr_cv, y_tr_cv)

    # Simple Accuracy
    acc = np.mean(m_cv.predict(X_val_cv) == y_val_cv)
    scores.append(acc)

print(f"CV Scores: {scores}")
print(f"Average Accuracy: {np.mean(scores):.4f}")



### 🔍 How to Read Precision, Recall, F1:

**Precision** = $\frac{TP}{TP + FP}$ = "Of those predicted positive, how many were actually correct?"

**Recall** = $\frac{TP}{TP + FN}$ = "Of those actually positive, how many were detected?"

**F1-Score** = $2 \times \frac{Precision \times Recall}{Precision + Recall}$ = Harmonic mean

| Metric | Low | High | When to Prioritize |
|--------|--------|--------|------------------|
| **Precision** | Many false alarms | Few false alarms | Spam detection (false alarms are annoying) |
| **Recall** | Missing many cases | Detecting most cases | Cancer screening (don't miss any!) |
| **F1** | Precision or recall (or both) is poor | Precision and recall are both high | General use case |

**If Precision is high, Recall is low:** Model is conservative (only predicts positive when confident)

**If Recall is high, Precision is low:** Model is aggressive (detects many but makes many mistakes)
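
The three formulas are short enough to compute by hand. The sketch below contrasts a conservative and an aggressive model; the confusion-matrix counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for two models on the same 50 actual positives.
conservative = precision_recall_f1(tp=20, fp=1, fn=30)   # rarely predicts positive
aggressive = precision_recall_f1(tp=45, fp=40, fn=5)     # predicts positive often

for name, (p, r, f1) in (("conservative", conservative), ("aggressive", aggressive)):
    print(f"{name:<12} precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why neither extreme scores well here.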

---
## 🛡️ 8. Regularization (L2 / Ridge)

When our model is **overfitting** (memorizing training data but failing on new data), we need **regularization**.

The idea is simple: we add a "penalty" to the cost function if $\theta$ values get too large.

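$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 $$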
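
A minimal sketch of the penalized cost and its gradient. It assumes the common convention that the bias term `theta[0]` is not penalized; the tiny dataset, `lam`, and the function name `cost_and_grad_l2` are illustration values, not part of the module's scratch class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad_l2(theta, X, y, lam):
    """Cross-entropy plus an L2 penalty; theta[0] (bias) is left unpenalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]   # derivative of the penalty term
    return data_cost + penalty, grad

# Hypothetical data: first column is the bias feature x0 = 1.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.2, 1.0])

cost_plain, grad_plain = cost_and_grad_l2(theta, X, y, lam=0.0)
cost_reg, grad_reg = cost_and_grad_l2(theta, X, y, lam=2.0)
print(f"cost without penalty: {cost_plain:.4f}, with penalty: {cost_reg:.4f}")
print(f"gradient without penalty: {grad_plain}, with penalty: {grad_reg}")
```

The penalty adds a pull toward zero on every non-bias coefficient, so each gradient step shrinks large weights unless the data term pushes back.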

In Sklearn, this is controlled by parameter `C` (Inverse Regularization Strength), which plays the role of $1/\lambda$:
- Small `C` → STRONG Regularization ($\lambda$ large) → Simple model (might underfit)
- Large `C` → WEAK Regularization ($\lambda$ small) → Complex model (might overfit)

Let's see the effect of `C` on test accuracy with a quick experiment.

In [None]:
from sklearn.linear_model import LogisticRegression

# Compare test accuracy across C values (conceptual code)
C_values = [0.001, 1, 100]

print("Experimenting with different C values...")
for c in C_values:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train_scaled, y_train)
    acc = model.score(X_test_scaled, y_test)
    print(f"C={c}: Test Accuracy = {acc:.4f}")
    # Smaller C typically means coefficients shrink toward zero

---

## 🎓 11. CAPSTONE PROJECT: Customer Churn Predictor

### Objective
Build an end-to-end Logistic Regression model to predict customer churn.

### Requirements (100 pts)

#### Part 1: EDA (20 pts)
- [ ] Load Telco Churn dataset
- [ ] Class distribution analysis
- [ ] Feature correlation

#### Part 2: Model (30 pts)
- [ ] Implement from scratch
- [ ] Compare with sklearn
- [ ] Regularization tuning

#### Part 3: Evaluation (25 pts)
- [ ] Confusion matrix
- [ ] ROC curve & AUC
- [ ] Threshold optimization
- [ ] Coefficient interpretation

#### Part 4: Deployment (25 pts)
- [ ] FastAPI endpoint
- [ ] Docker container
- [ ] Documentation

---

In [None]:
# 🎓 CAPSTONE START CODE
# Dataset: Telco Customer Churn
# Task: Predict who will Churn (switch to competitor)

url_churn = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
# df_churn = pd.read_csv(url_churn)
# df_churn.head()

# 1. EDA & Cleaning (Handle 'TotalCharges' which is string, missing values)
# 2. Encoding (Gender, InternetService, etc)
# 3. Train Model (Logistic Regression)
# 4. Evaluation (Confusion Matrix, Recall is important here!)
#    (Why Recall? Because we want to catch ALL potential churners)