# From Linear to Logistic Regression: A Comprehensive Hands-on Tutorial

## Learning Objectives
- Master advanced linear regression techniques including regularization
- Understand the transition from regression to classification
- Implement logistic regression from scratch and using scikit-learn
- Apply multiclass classification strategies

## Structure
1. **Setup and Imports**
2. **Advanced Linear Regression** (Exercises 1-3)
3. **Transition to Classification** (Exercises 4-6)
4. **Logistic Regression Implementation** (Exercises 7-8)
5. **Multiclass Classification** (Exercises 9-10)

---

## Part 1: Setup and Imports
Import all necessary libraries for our regression journey.

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    LogisticRegression, Perceptron
)
from sklearn.metrics import (
    mean_squared_error, r2_score, mean_absolute_error,
    accuracy_score, confusion_matrix, classification_report,
    roc_curve, roc_auc_score
)
from sklearn.datasets import make_regression, make_classification, load_iris
from sklearn.multiclass import OneVsRestClassifier

# Interactive plotting
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print('Setup complete! All libraries loaded successfully.')
print(f'NumPy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')
print(f'Scikit-learn version: {sklearn.__version__}')

---
## Exercise 1: Polynomial Regression and Basis Expansion

### Concept
Linear models can fit non-linear relationships by transforming features. We transform $x$ into $x, x^2, x^3, ...$ to create polynomial features.

**Key Insight**: The model remains linear in parameters (coefficients) even though it's non-linear in the original features.
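
To make the key insight concrete, here is a minimal sketch (the names `x_demo`, `y_demo`, `poly_demo`, and `lin_demo` are illustrative and not part of the exercise). It fits ordinary least squares on the expanded features of a noise-free cubic. Because the model is linear in its coefficients, the fit should recover them up to floating-point error.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Noise-free cubic: y = 3 + 1*x - 2*x^2 + 0.5*x^3
x_demo = np.linspace(-3, 3, 50).reshape(-1, 1)
y_demo = 3 + x_demo[:, 0] - 2 * x_demo[:, 0] ** 2 + 0.5 * x_demo[:, 0] ** 3

# Expand the single column x into the three columns [x, x^2, x^3]
poly_demo = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = poly_demo.fit_transform(x_demo)

# An ordinary linear model on the expanded columns
lin_demo = LinearRegression().fit(X_expanded, y_demo)

print(X_expanded.shape)
print(np.round(lin_demo.coef_, 3), round(float(lin_demo.intercept_), 3))
```

The non-linearity lives entirely in the feature columns; the estimator itself is unchanged.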

### Implementation

In [None]:
# Generate non-linear synthetic data
np.random.seed(42)
n_samples = 100
X = np.sort(np.random.uniform(-3, 3, n_samples))
y_true = 0.5 * X**3 - 2 * X**2 + X + 3
y = y_true + np.random.normal(0, 3, n_samples)  # Add noise

# Reshape for sklearn
X_reshape = X.reshape(-1, 1)

# Create polynomial features of different degrees
degrees = [1, 3, 5, 10]
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Transform features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X_reshape)
    
    # Fit model
    model = LinearRegression()
    model.fit(X_poly, y)
    
    # Predictions for smooth curve
    X_test = np.linspace(-3, 3, 300).reshape(-1, 1)
    X_test_poly = poly.transform(X_test)
    y_pred = model.predict(X_test_poly)
    
    # Calculate R² score
    train_score = model.score(X_poly, y)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.6, s=30, label='Data')
    axes[idx].plot(X_test, y_pred, 'r-', linewidth=2, label=f'Degree {degree}')
    axes[idx].plot(X, y_true, 'g--', alpha=0.5, label='True function')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].set_title(f'Polynomial Degree {degree}\nR² = {train_score:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("- Degree 1: Underfitting (can't capture the non-linear pattern)")
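print("- Degree 3: Good fit (matches the cubic true function)")
print("- Degree 5: Usually similar to degree 3 (the extra terms add little)")
print("- Degree 10: Overfitting risk (extra flexibility starts to fit noise, most visibly near the edges)")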

### Your Turn
1. Generate a dataset with a different non-linear relationship (e.g., sine wave)
2. Use cross-validation to find the optimal polynomial degree
3. Plot validation curves showing train and test scores vs degree
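
One possible starting point for items 1 and 2 is sketched below. It assumes a noisy sine wave, shuffled 5-fold cross-validation, and degrees 1 through 8; all of these choices (and the names `x_sin`, `y_sin`, `cv_means`) are illustrative rather than prescribed by the exercise.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed dataset: a sine wave with Gaussian noise
rng = np.random.default_rng(0)
x_sin = rng.uniform(-3, 3, 100).reshape(-1, 1)
y_sin = np.sin(x_sin[:, 0]) + rng.normal(0, 0.2, 100)

# Shuffle the folds so each one covers the whole x range
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_means = {}
for degree in range(1, 9):
    pipe = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(pipe, x_sin, y_sin, cv=cv, scoring='r2')
    cv_means[degree] = scores.mean()

# Degree with the highest mean cross-validated R^2
best_degree = max(cv_means, key=cv_means.get)
print(best_degree, round(cv_means[best_degree], 3))
```

Wrapping the expansion and the regression in one pipeline keeps the feature transformation inside each fold, so the validation fold never influences the fit.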

---
## Exercise 2: Ridge vs Lasso vs Elastic Net Regularization

### Concept
Regularization prevents overfitting by adding penalty terms:
- **Ridge (L2)**: $\text{Loss} + \lambda\sum\beta_i^2$ - Shrinks coefficients smoothly
- **Lasso (L1)**: $\text{Loss} + \lambda\sum|\beta_i|$ - Can zero out coefficients (feature selection)
- **Elastic Net**: $\text{Loss} + \lambda_1\sum|\beta_i| + \lambda_2\sum\beta_i^2$ - Combines L1 and L2 penalties
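
The shrinkage effect of the L2 penalty can be checked directly. The sketch below assumes no intercept and takes the loss to be the sum of squared residuals, which is the objective scikit-learn's `Ridge` minimizes with `alpha` playing the role of $\lambda$. Under those assumptions the ridge solution has the closed form $(X^\top X + \lambda I)^{-1} X^\top y$. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=50)

lam = 5.0

# Ordinary least squares (no penalty)
beta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

# Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(3), X_demo.T @ y_demo)

# Same problem solved by scikit-learn (intercept disabled to match the formula)
sk_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo)

print("OLS:  ", np.round(beta_ols, 3))
print("Ridge:", np.round(beta_ridge, 3))
print("Norms:", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The ridge coefficient vector has a smaller norm than the OLS one, but none of its entries is pushed exactly to zero; that behavior is specific to the L1 penalty.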

### Implementation

In [None]:
# Generate high-dimensional sparse data
n_samples, n_features = 100, 20
n_informative = 5  # Only 5 features are actually useful

X, y = make_regression(n_samples=n_samples, n_features=n_features,
                      n_informative=n_informative, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different regularization strengths
alphas = np.logspace(-3, 1, 20)

# Store results
results = {
    'Ridge': {'train_scores': [], 'test_scores': [], 'n_nonzero': []},
    'Lasso': {'train_scores': [], 'test_scores': [], 'n_nonzero': []},
    'ElasticNet': {'train_scores': [], 'test_scores': [], 'n_nonzero': []}
}

# Fit models with different alphas
for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    results['Ridge']['train_scores'].append(ridge.score(X_train_scaled, y_train))
    results['Ridge']['test_scores'].append(ridge.score(X_test_scaled, y_test))
    results['Ridge']['n_nonzero'].append(np.sum(np.abs(ridge.coef_) > 0.01))
    
    # Lasso
    lasso = Lasso(alpha=alpha, max_iter=1000)
    lasso.fit(X_train_scaled, y_train)
    results['Lasso']['train_scores'].append(lasso.score(X_train_scaled, y_train))
    results['Lasso']['test_scores'].append(lasso.score(X_test_scaled, y_test))
    results['Lasso']['n_nonzero'].append(np.sum(lasso.coef_ != 0))
    
    # Elastic Net
    elastic = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=1000)
    elastic.fit(X_train_scaled, y_train)
    results['ElasticNet']['train_scores'].append(elastic.score(X_train_scaled, y_train))
    results['ElasticNet']['test_scores'].append(elastic.score(X_test_scaled, y_test))
    results['ElasticNet']['n_nonzero'].append(np.sum(elastic.coef_ != 0))

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Test scores vs alpha
for method, color in zip(['Ridge', 'Lasso', 'ElasticNet'], ['blue', 'green', 'red']):
    axes[0].semilogx(alphas, results[method]['test_scores'], '-o', 
                     label=method, color=color, markersize=4)
axes[0].set_xlabel('Regularization strength (α)')
axes[0].set_ylabel('Test R² Score')
axes[0].set_title('Model Performance vs Regularization')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Number of non-zero coefficients
for method, color in zip(['Ridge', 'Lasso', 'ElasticNet'], ['blue', 'green', 'red']):
    axes[1].semilogx(alphas, results[method]['n_nonzero'], '-o', 
                     label=method, color=color, markersize=4)
axes[1].set_xlabel('Regularization strength (α)')
axes[1].set_ylabel('Number of non-zero coefficients')
axes[1].set_title('Feature Selection Effect')
axes[1].axhline(y=n_informative, color='black', linestyle='--', 
                label=f'True informative features ({n_informative})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Coefficient paths for Lasso
lasso_coefs = []
for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=1000)
    lasso.fit(X_train_scaled, y_train)
    lasso_coefs.append(lasso.coef_)

lasso_coefs = np.array(lasso_coefs)
for i in range(n_features):
    axes[2].semilogx(alphas, lasso_coefs[:, i], alpha=0.7)
axes[2].set_xlabel('Regularization strength (α)')
axes[2].set_ylabel('Coefficient value')
axes[2].set_title('Lasso Coefficient Paths')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("✓ Ridge: Shrinks all coefficients but keeps all features")
print("✓ Lasso: Performs automatic feature selection (sparse solution)")
print("✓ Elastic Net: Balance between Ridge and Lasso")
print(f"\nTrue number of informative features: {n_informative}")
print("Compare with the middle plot: at a suitable alpha, Lasso's non-zero count should approach this number.")

---
## Exercise 3: Feature Selection and Importance

### Concept
Understanding which features are important helps with:
- Model interpretability
- Dimensionality reduction
- Identifying key drivers

### Implementation

In [None]:
# Create a dataset with named features
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
housing = fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit different models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01)
}

# Store feature importances
feature_importance_df = pd.DataFrame()

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    feature_importance_df[name] = np.abs(model.coef_)

feature_importance_df.index = X_housing.columns

# Create interactive visualization with Plotly
fig = go.Figure()

for column in feature_importance_df.columns:
    fig.add_trace(go.Bar(
        name=column,
        x=feature_importance_df.index,
        y=feature_importance_df[column],
        text=feature_importance_df[column].round(3),
        textposition='auto',
    ))

fig.update_layout(
    title='Feature Importance Comparison Across Models',
    xaxis_title='Features',
    yaxis_title='Absolute Coefficient Value',
    barmode='group',
    height=500,
    hovermode='x unified'
)

fig.show()

# Summary statistics
print("\nFeature Importance Summary:")
print("=" * 60)
print(feature_importance_df.round(3))

print("\n\nTop 3 Most Important Features by Model:")
print("=" * 60)
for model_name in feature_importance_df.columns:
    top_features = feature_importance_df[model_name].nlargest(3)
    print(f"\n{model_name}:")
    for feat, importance in top_features.items():
        print(f"  - {feat}: {importance:.3f}")

### Your Turn
1. Implement Recursive Feature Elimination (RFE) to select optimal features
2. Compare feature importance from different methods
3. Use permutation importance as an alternative method
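
As a hedged starting point for item 3, the sketch below uses `sklearn.inspection.permutation_importance` on a synthetic dataset where only the first feature affects the target. The data, the model choice, and the names `X_pi`, `y_pi`, and `model_pi` are assumptions made for illustration; for the housing data you would pass the fitted model and a held-out split instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_pi = rng.normal(size=(300, 3))
# Only feature 0 drives the target; features 1 and 2 are pure noise
y_pi = 4.0 * X_pi[:, 0] + rng.normal(scale=0.5, size=300)

model_pi = LinearRegression().fit(X_pi, y_pi)

# Shuffle each column in turn and measure how much the R^2 score drops
result_pi = permutation_importance(
    model_pi, X_pi, y_pi, n_repeats=10, random_state=0
)

print(np.round(result_pi.importances_mean, 3))
```

Unlike absolute coefficients, this measure does not depend on feature scaling and works with any fitted estimator.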

---
## Exercise 4: Why Linear Regression Fails for Classification

### Concept
Linear regression predicts continuous values, but classification needs:
- Output bounded to [0,1] for probabilities
- Discrete class predictions
- Appropriate loss function for categorical outcomes
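
The last point can be illustrated with a few numbers. The sketch below (values chosen arbitrarily for illustration) compares squared error with log loss for a sample whose true label is 1, as the predicted probability moves from confident-and-right to confident-and-wrong.

```python
import numpy as np

# Predicted probability of class 1 when the true label is 1
p_hat = np.array([0.99, 0.6, 0.1, 0.001])

squared_error = (1 - p_hat) ** 2   # never exceeds 1 for p_hat in [0, 1]
log_loss_vals = -np.log(p_hat)     # grows without bound as p_hat -> 0

for p, se, ll in zip(p_hat, squared_error, log_loss_vals):
    print(f"p = {p:<6} squared error = {se:.3f}   log loss = {ll:.3f}")
```

Squared error saturates near 1 for a confidently wrong prediction, while log loss keeps growing, so it penalizes such mistakes much more strongly.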

### Implementation: Demonstrating the Problem

In [None]:
# Generate binary classification data
np.random.seed(42)
n_samples = 200

# Create two classes split by a linear boundary (labels are noise-free)
X_class = np.random.randn(n_samples, 2)
y_class = (X_class[:, 0] + 0.5 * X_class[:, 1] > 0.5).astype(int)

# Add some outliers
X_class[0] = [5, 5]
y_class[0] = 1
X_class[1] = [-5, -5]
y_class[1] = 0

# Fit both linear regression and logistic regression
linear_reg = LinearRegression()
logistic_reg = LogisticRegression()

linear_reg.fit(X_class, y_class)
logistic_reg.fit(X_class, y_class)

# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_class[:, 0].min() - 1, X_class[:, 0].max() + 1
y_min, y_max = X_class[:, 1].min() - 1, X_class[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predictions
Z_linear = linear_reg.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
Z_logistic = logistic_reg.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Linear Regression Predictions
im1 = axes[0].contourf(xx, yy, Z_linear, levels=20, cmap='RdBu', alpha=0.6)
axes[0].scatter(X_class[y_class == 0, 0], X_class[y_class == 0, 1], 
                c='blue', edgecolor='black', s=50, label='Class 0')
axes[0].scatter(X_class[y_class == 1, 0], X_class[y_class == 1, 1], 
                c='red', edgecolor='black', s=50, label='Class 1')
axes[0].contour(xx, yy, Z_linear, levels=[0.5], colors='green', linewidths=2)
axes[0].set_title('Linear Regression\n(Unbounded predictions)')
axes[0].legend()
plt.colorbar(im1, ax=axes[0])

# Plot 2: Logistic Regression Probabilities
im2 = axes[1].contourf(xx, yy, Z_logistic, levels=20, cmap='RdBu', alpha=0.6)
axes[1].scatter(X_class[y_class == 0, 0], X_class[y_class == 0, 1], 
                c='blue', edgecolor='black', s=50, label='Class 0')
axes[1].scatter(X_class[y_class == 1, 0], X_class[y_class == 1, 1], 
                c='red', edgecolor='black', s=50, label='Class 1')
axes[1].contour(xx, yy, Z_logistic, levels=[0.5], colors='green', linewidths=2)
axes[1].set_title('Logistic Regression\n(Probabilities in [0,1])')
axes[1].legend()
plt.colorbar(im2, ax=axes[1])

# Plot 3: Histogram of predictions
linear_preds = linear_reg.predict(X_class)
logistic_preds = logistic_reg.predict_proba(X_class)[:, 1]

axes[2].hist(linear_preds, bins=30, alpha=0.5, label='Linear Reg', color='blue')
axes[2].hist(logistic_preds, bins=30, alpha=0.5, label='Logistic Reg', color='red')
axes[2].axvline(0, color='black', linestyle='--', alpha=0.5)
axes[2].axvline(1, color='black', linestyle='--', alpha=0.5)
axes[2].set_xlabel('Predicted Value')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Distribution of Predictions')
axes[2].legend()

plt.tight_layout()
plt.show()

print("\nProblems with Linear Regression for Classification:")
print(f"1. Predictions outside [0,1]: {np.sum((linear_preds < 0) | (linear_preds > 1))} out of {len(linear_preds)}")
print(f"   Min prediction: {linear_preds.min():.3f}")
print(f"   Max prediction: {linear_preds.max():.3f}")
print("\n2. Sensitive to outliers (see how outliers affect the decision boundary)")
print("3. Inappropriate loss function (squared error penalizes even correct predictions that exceed 1)")

---
## Exercise 5: Understanding Sigmoid Function and Odds

### Concept
The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real number to (0,1):
- **Odds**: $\frac{p}{1-p}$ (ratio of probability to its complement)
- **Log-odds (logit)**: $\log\left(\frac{p}{1-p}\right)$ (can take any real value)
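
These definitions imply that the logit is the inverse of the sigmoid: if $p = \sigma(z)$, then $\log\left(\frac{p}{1-p}\right) = z$. A short numerical check, using illustrative helper names `sigmoid_fn` and `logit_fn`:

```python
import numpy as np

def sigmoid_fn(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit_fn(p):
    """Map a probability in (0, 1) back to log-odds."""
    return np.log(p / (1.0 - p))

z_vals = np.array([-4.0, -1.0, 0.0, 2.5])
p_vals = sigmoid_fn(z_vals)   # probabilities
z_back = logit_fn(p_vals)     # should reproduce z_vals

odds_075 = 0.75 / (1 - 0.75)  # a probability of 0.75 means odds of 3 to 1

print(np.round(p_vals, 4))
print(np.round(z_back, 4))
print(odds_075)
```

This is why logistic regression can be read as a linear model for the log-odds.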

### Implementation

In [None]:
def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid function"""
    s = sigmoid(z)
    return s * (1 - s)

def odds(p):
    """Calculate odds from probability"""
    return p / (1 - p)

def log_odds(p):
    """Calculate log-odds from probability"""
    return np.log(odds(p))

# Create interactive visualization
z = np.linspace(-10, 10, 1000)

# Create subplots
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Sigmoid Function', 'Sigmoid Derivative', 
                   'Probability → Odds', 'Probability → Log-Odds')
)

# Sigmoid function
fig.add_trace(
    go.Scatter(x=z, y=sigmoid(z), name='σ(z)', line=dict(color='blue', width=3)),
    row=1, col=1
)
fig.add_hline(y=0.5, line_dash="dash", line_color="gray", row=1, col=1)
fig.add_vline(x=0, line_dash="dash", line_color="gray", row=1, col=1)

# Sigmoid derivative
fig.add_trace(
    go.Scatter(x=z, y=sigmoid_derivative(z), name="σ'(z)", 
               line=dict(color='green', width=3)),
    row=1, col=2
)

# Odds transformation
p_range = np.linspace(0.01, 0.99, 100)
fig.add_trace(
    go.Scatter(x=p_range, y=odds(p_range), name='Odds', 
               line=dict(color='orange', width=3)),
    row=2, col=1
)

# Log-odds transformation
fig.add_trace(
    go.Scatter(x=p_range, y=log_odds(p_range), name='Log-Odds', 
               line=dict(color='red', width=3)),
    row=2, col=2
)

# Update layout
fig.update_xaxes(title_text="z", row=1, col=1)
fig.update_xaxes(title_text="z", row=1, col=2)
fig.update_xaxes(title_text="Probability", row=2, col=1)
fig.update_xaxes(title_text="Probability", row=2, col=2)

fig.update_yaxes(title_text="σ(z)", row=1, col=1)
fig.update_yaxes(title_text="σ'(z)", row=1, col=2)
fig.update_yaxes(title_text="Odds", row=2, col=1)
fig.update_yaxes(title_text="Log-Odds", row=2, col=2)

fig.update_layout(height=700, showlegend=False,
                 title_text="Sigmoid Function and Related Transformations")
fig.show()

# Key properties table
print("\nKey Properties of Sigmoid Function:")
print("=" * 50)
properties = pd.DataFrame([
    ['σ(0)', sigmoid(0)],
    ['σ(-∞)', 0],
    ['σ(+∞)', 1],
    ['σ(z) + σ(-z)', 1],
    ["Max of σ'(z)", sigmoid_derivative(0)],
    ["Occurs at z=", 0]
], columns=['Property', 'Value'])
print(properties.to_string(index=False))

print("\n\nProbability ↔ Odds ↔ Log-Odds Examples:")
print("=" * 50)
for p in [0.1, 0.5, 0.75, 0.9, 0.99]:
    print(f"P = {p:.2f} → Odds = {odds(p):.3f} → Log-Odds = {log_odds(p):.3f}")

---
## Exercise 6: Perceptron Algorithm - The Simplest Linear Classifier

### Concept
The Perceptron (1957) is the foundation of neural networks:
- Update rule: $w \leftarrow w + \eta(y - \hat{y})x$, so the weights change only when the prediction $\hat{y}$ differs from the label $y$
- Converges if data is linearly separable
- No convergence guarantee for non-separable data
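
A single step of the update rule can be traced by hand. The sketch below assumes labels in $\{0, 1\}$, a step activation that outputs 1 when $w \cdot x + b > 0$, and a bias updated like a weight on a constant input of 1 (the bias handling is an assumption, since the rule above only mentions $w$). The helper `perceptron_step` is hypothetical.

```python
import numpy as np

def perceptron_step(w, b, x, y, eta=0.1):
    """Apply w <- w + eta * (y - y_hat) * x once, with labels in {0, 1}."""
    y_hat = 1 if np.dot(w, x) + b > 0 else 0
    w = w + eta * (y - y_hat) * x
    b = b + eta * (y - y_hat)
    return w, b, y_hat

w0, b0 = np.zeros(2), 0.0
x1, y1 = np.array([1.0, 2.0]), 1

# First pass: the all-zero weights predict 0, so the sample is misclassified
w1, b1, pred_before = perceptron_step(w0, b0, x1, y1)

# Second pass: the sample is now on the correct side, so nothing changes
w2, b2, pred_after = perceptron_step(w1, b1, x1, y1)

print(pred_before, w1, b1)
print(pred_after, w2, b2)
```

After one correction the weights move toward the misclassified point; once the prediction is right, $(y - \hat{y}) = 0$ and the update vanishes.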

### Implementation

In [None]:
class SimplePerceptron:
    """Perceptron implementation from scratch"""
    
    def __init__(self, learning_rate=0.01, n_iterations=100):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.errors_per_iteration = []
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Convert labels to -1, 1
        y_converted = np.where(y <= 0, -1, 1)
        
        # Training loop
        for iteration in range(self.n_iterations):
            errors = 0
            
            for idx, x_i in enumerate(X):
                # Linear output
                linear_output = np.dot(x_i, self.weights) + self.bias
                # Prediction
                y_pred = np.sign(linear_output)
                
                # Update if misclassified
                if y_converted[idx] != y_pred:
                    update = self.learning_rate * y_converted[idx]
                    self.weights += update * x_i
                    self.bias += update
                    errors += 1
            
            self.errors_per_iteration.append(errors)
            
            # Stop if no errors
            if errors == 0:
                print(f"Converged at iteration {iteration + 1}")
                break
    
    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        return np.where(np.sign(linear_output) <= 0, 0, 1)

# Generate linearly separable and non-separable datasets
np.random.seed(42)

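# Dataset 1: Intended to be linearly separable
# (make_classification does not guarantee exact separability; its default flip_y=0.01 can flip a label)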
X_sep, y_sep = make_classification(n_samples=100, n_features=2, n_redundant=0,
                                   n_informative=2, random_state=1,
                                   n_clusters_per_class=1)

# Dataset 2: Non-linearly separable (XOR-like)
X_nonsep = np.random.randn(200, 2)
y_nonsep = np.logical_xor(X_nonsep[:, 0] > 0, X_nonsep[:, 1] > 0).astype(int)

# Train perceptrons
perceptron_sep = SimplePerceptron(learning_rate=0.1, n_iterations=100)
perceptron_nonsep = SimplePerceptron(learning_rate=0.1, n_iterations=100)

perceptron_sep.fit(X_sep, y_sep)
perceptron_nonsep.fit(X_nonsep, y_nonsep)

# Compare with sklearn's Perceptron
sklearn_perceptron_sep = Perceptron(random_state=42)
sklearn_perceptron_sep.fit(X_sep, y_sep)

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Helper function to plot decision boundary
def plot_decision_boundary(ax, X, y, model, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    ax.scatter(X[y == 0, 0], X[y == 0, 1], c='blue', s=50, edgecolor='black')
    ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', s=50, edgecolor='black')
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

# Row 1: Linearly separable data
plot_decision_boundary(axes[0, 0], X_sep, y_sep, perceptron_sep, 
                      'Linearly Separable\n(Our Perceptron)')
plot_decision_boundary(axes[0, 1], X_sep, y_sep, sklearn_perceptron_sep, 
                      'Linearly Separable\n(Sklearn Perceptron)')

# Convergence plot
axes[0, 2].plot(perceptron_sep.errors_per_iteration, 'b-o', markersize=4)
axes[0, 2].set_xlabel('Iteration')
axes[0, 2].set_ylabel('Number of Errors')
axes[0, 2].set_title('Convergence (Linearly Separable)')
axes[0, 2].grid(True, alpha=0.3)

# Row 2: Non-linearly separable data
plot_decision_boundary(axes[1, 0], X_nonsep, y_nonsep, perceptron_nonsep, 
                      'XOR Problem\n(Non-separable)')

# Show actual XOR pattern
axes[1, 1].scatter(X_nonsep[y_nonsep == 0, 0], X_nonsep[y_nonsep == 0, 1], 
                   c='blue', s=20, alpha=0.5, label='Class 0')
axes[1, 1].scatter(X_nonsep[y_nonsep == 1, 0], X_nonsep[y_nonsep == 1, 1], 
                   c='red', s=20, alpha=0.5, label='Class 1')
axes[1, 1].set_title('XOR Pattern\n(Cannot be separated linearly)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# Non-convergence plot
axes[1, 2].plot(perceptron_nonsep.errors_per_iteration, 'r-o', markersize=4)
axes[1, 2].set_xlabel('Iteration')
axes[1, 2].set_ylabel('Number of Errors')
axes[1, 2].set_title('No Convergence (Non-separable)')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("1. Perceptron converges perfectly on linearly separable data")
print("2. Cannot learn XOR pattern (non-linearly separable)")
print("3. This limitation led to the development of multi-layer perceptrons (neural networks)")

---
## Exercise 7: Implementing Logistic Regression from Scratch

### Concept
Logistic Regression uses:
- Maximum Likelihood Estimation (MLE)
- Binary Cross-Entropy Loss: $L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$
- Gradient Descent optimization
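
Gradient descent needs the gradient of the loss. For the cross-entropy averaged over $n$ samples, the gradient with respect to the weights works out to $\frac{1}{n}X^\top(\hat{y} - y)$. The sketch below checks that expression against a central finite-difference estimate; it omits the bias term for brevity and uses random data, and the function names are illustrative.

```python
import numpy as np

def bce_loss(w, X, y):
    """Mean binary cross-entropy for weights w (no bias term)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(w, X, y):
    """Analytic gradient: X^T (y_hat - y) / n."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X_gc = rng.normal(size=(20, 3))
y_gc = (rng.random(20) > 0.5).astype(float)
w_gc = rng.normal(size=3)

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (bce_loss(w_gc + eps * e, X_gc, y_gc) - bce_loss(w_gc - eps * e, X_gc, y_gc)) / (2 * eps)
    for e in np.eye(3)
])
analytic = bce_grad(w_gc, X_gc, y_gc)

print(np.round(numeric, 6))
print(np.round(analytic, 6))
```

A check like this is a cheap way to catch sign or scaling mistakes before running a full training loop.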

### Implementation

In [None]:
class LogisticRegressionFromScratch:
    """Logistic Regression implementation using gradient descent"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000, regularization=None, lambda_reg=0.01):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.regularization = regularization  # None, 'l2', or 'l1'
        self.lambda_reg = lambda_reg
        self.weights = None
        self.bias = None
        self.losses = []
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def binary_cross_entropy(self, y_true, y_pred):
        """Binary cross-entropy loss"""
        # Add small epsilon to prevent log(0)
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        
        # Add regularization term
        if self.regularization == 'l2':
            loss += self.lambda_reg * np.sum(self.weights ** 2) / (2 * len(y_true))
        elif self.regularization == 'l1':
            loss += self.lambda_reg * np.sum(np.abs(self.weights)) / len(y_true)
        
        return loss
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)
            
            # Calculate loss
            loss = self.binary_cross_entropy(y, y_pred)
            self.losses.append(loss)
            
            # Backward pass (gradients)
            dw = np.dot(X.T, (y_pred - y)) / n_samples
            db = np.sum(y_pred - y) / n_samples
            
            # Add regularization gradient
            if self.regularization == 'l2':
                dw += self.lambda_reg * self.weights / n_samples
            elif self.regularization == 'l1':
                dw += self.lambda_reg * np.sign(self.weights) / n_samples
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Print progress
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')
    
    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

# Generate dataset
X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                          n_informative=2, random_state=42,
                          n_clusters_per_class=1, flip_y=0.1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train models with different regularization
models = {
    'No Regularization': LogisticRegressionFromScratch(learning_rate=0.1, n_iterations=500),
    'L2 Regularization': LogisticRegressionFromScratch(learning_rate=0.1, n_iterations=500, 
                                                       regularization='l2', lambda_reg=0.1),
    'L1 Regularization': LogisticRegressionFromScratch(learning_rate=0.1, n_iterations=500, 
                                                       regularization='l1', lambda_reg=0.1)
}

# Train and evaluate
results = {}
for name, model in models.items():
    print(f"\n{name}:")
    model.fit(X_train_scaled, y_train)
    
    # Predictions
    train_pred = model.predict(X_train_scaled)
    test_pred = model.predict(X_test_scaled)
    
    # Store results
    results[name] = {
        'model': model,
        'train_acc': accuracy_score(y_train, train_pred),
        'test_acc': accuracy_score(y_test, test_pred)
    }

# Visualization
fig = plt.figure(figsize=(15, 10))

# Plot 1: Loss curves
ax1 = plt.subplot(2, 3, 1)
for name, data in results.items():
    ax1.plot(data['model'].losses, label=name, linewidth=2)
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Curves')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plots 2-4: Decision boundaries
for idx, (name, data) in enumerate(results.items()):
    ax = plt.subplot(2, 3, idx + 2)
    
    # Create mesh
    h = 0.02
    x_min, x_max = X_test_scaled[:, 0].min() - 1, X_test_scaled[:, 0].max() + 1
    y_min, y_max = X_test_scaled[:, 1].min() - 1, X_test_scaled[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = data['model'].predict_proba(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot
    ax.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
    ax.scatter(X_test_scaled[y_test == 0, 0], X_test_scaled[y_test == 0, 1], 
              c='blue', edgecolor='black', s=50)
    ax.scatter(X_test_scaled[y_test == 1, 0], X_test_scaled[y_test == 1, 1], 
              c='red', edgecolor='black', s=50)
    ax.contour(xx, yy, Z, levels=[0.5], colors='green', linewidths=2)
    ax.set_title(f'{name}\nTest Acc: {data["test_acc"]:.3f}')

# Plot 5: Coefficient comparison
ax5 = plt.subplot(2, 3, 5)
width = 0.25
x = np.arange(len(results[list(results.keys())[0]]['model'].weights))

for idx, (name, data) in enumerate(results.items()):
    ax5.bar(x + idx * width, np.abs(data['model'].weights), width, label=name)

ax5.set_xlabel('Feature Index')
ax5.set_ylabel('|Coefficient|')
ax5.set_title('Coefficient Magnitudes')
ax5.legend()
ax5.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary
print("\n" + "="*60)
print("Model Performance Summary:")
print("="*60)
for name, data in results.items():
    print(f"\n{name}:")
    print(f"  Train Accuracy: {data['train_acc']:.3f}")
    print(f"  Test Accuracy:  {data['test_acc']:.3f}")
    print(f"  Weights: {data['model'].weights}")
    print(f"  Bias: {data['model'].bias:.3f}")

---
## Exercise 8: Model Evaluation and Threshold Optimization

### Concept
Classification metrics:
- **ROC Curve**: True Positive Rate vs False Positive Rate
- **AUC**: Area Under the Curve (higher is better)
- **Threshold tuning**: Balance precision and recall
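
A four-sample toy example (values chosen for illustration) shows how these quantities relate. Each threshold gives one (FPR, TPR) point on the ROC curve, while AUC summarizes the ranking quality across all thresholds. The helper `rates_at` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (TPR, FPR) when predicting class 1 for scores >= threshold."""
    pred = (scores_toy >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_toy, pred).ravel()
    return float(tp / (tp + fn)), float(fp / (fp + tn))

tpr_05, fpr_05 = rates_at(0.5)   # strict threshold
tpr_03, fpr_03 = rates_at(0.3)   # lenient threshold

# AUC: fraction of (positive, negative) pairs ranked in the right order
auc_toy = roc_auc_score(y_toy, scores_toy)

print(tpr_05, fpr_05)
print(tpr_03, fpr_03)
print(auc_toy)
```

Lowering the threshold from 0.5 to 0.3 catches both positives but also raises the false positive rate, which is the trade-off that threshold tuning balances.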

### Implementation

In [None]:
# Generate imbalanced dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                          n_redundant=5, weights=[0.9, 0.1], flip_y=0.05,
                          random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                   random_state=42, stratify=y)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Get predicted probabilities
y_train_proba = log_reg.predict_proba(X_train_scaled)[:, 1]
y_test_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Calculate ROC curves
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, y_train_proba)
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_test_proba)

auc_train = roc_auc_score(y_train, y_train_proba)
auc_test = roc_auc_score(y_test, y_test_proba)

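# Find optimal threshold using Youden's J statistic (J = TPR - FPR)
# Note: selecting the threshold on the test set is for illustration only;
# in practice, tune it on a separate validation split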
j_scores = tpr_test - fpr_test
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds_test[optimal_idx]

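# Create interactive evaluation figure
# (confusion matrices are drawn separately with Matplotlib further down)
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{}, {}], [{"colspan": 2}, None]],
    subplot_titles=('ROC Curves', 'Precision-Recall Curve',
                   'Threshold Effects')
)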

# ROC Curves
fig.add_trace(
    go.Scatter(x=fpr_train, y=tpr_train, 
               name=f'Train (AUC={auc_train:.3f})',
               line=dict(color='blue', width=2)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=fpr_test, y=tpr_test, 
               name=f'Test (AUC={auc_test:.3f})',
               line=dict(color='red', width=2)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=[0, 1], y=[0, 1], 
               name='Random',
               line=dict(color='gray', width=1, dash='dash')),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=[fpr_test[optimal_idx]], y=[tpr_test[optimal_idx]],
               mode='markers', name=f'Optimal (t={optimal_threshold:.3f})',
               marker=dict(size=12, color='green')),
    row=1, col=1
)

# Precision-Recall Curve
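# precision_score, recall_score and f1_score are used in the threshold loop below
from sklearn.metrics import (
    precision_recall_curve, precision_score, recall_score, f1_score
)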
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_test_proba)

fig.add_trace(
    go.Scatter(x=recall, y=precision, 
               name='PR Curve',
               line=dict(color='purple', width=2)),
    row=1, col=2
)

# Threshold effects
thresholds_to_test = np.linspace(0, 1, 50)
accuracies = []
precisions = []
recalls = []
f1_scores = []

for t in thresholds_to_test:
    y_pred_t = (y_test_proba >= t).astype(int)
    
    accuracies.append(accuracy_score(y_test, y_pred_t))
    
    # zero_division=0 yields 0 (without a warning) when no positives are predicted
    precisions.append(precision_score(y_test, y_pred_t, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_t, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred_t, zero_division=0))

fig.add_trace(
    go.Scatter(x=thresholds_to_test, y=accuracies, 
               name='Accuracy', line=dict(color='blue', width=2)),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=thresholds_to_test, y=precisions, 
               name='Precision', line=dict(color='green', width=2)),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=thresholds_to_test, y=recalls, 
               name='Recall', line=dict(color='red', width=2)),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(x=thresholds_to_test, y=f1_scores, 
               name='F1-Score', line=dict(color='purple', width=2)),
    row=2, col=1
)

# Update layout
fig.update_xaxes(title_text="False Positive Rate", row=1, col=1)
fig.update_yaxes(title_text="True Positive Rate", row=1, col=1)
fig.update_xaxes(title_text="Recall", row=1, col=2)
fig.update_yaxes(title_text="Precision", row=1, col=2)
fig.update_xaxes(title_text="Threshold", row=2, col=1)
fig.update_yaxes(title_text="Score", row=2, col=1)

fig.update_layout(height=800, title_text="Comprehensive Model Evaluation")
fig.show()

# Confusion matrices for different thresholds
fig2, axes = plt.subplots(1, 3, figsize=(15, 5))

thresholds_compare = [0.3, 0.5, optimal_threshold]
titles = ['Low Threshold (0.3)', 'Default (0.5)', f'Optimal ({optimal_threshold:.3f})']

for ax, t, title in zip(axes, thresholds_compare, titles):
    y_pred = (y_test_proba >= t).astype(int)
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, cbar=False)
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    
    # Add metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    ax.text(0.5, -0.15, f'Acc: {acc:.3f}, Prec: {prec:.3f}\nRec: {rec:.3f}, F1: {f1:.3f}',
           transform=ax.transAxes, ha='center')

plt.tight_layout()
plt.show()

print("\nClass Distribution:")
print(f"Training set: {np.bincount(y_train)}")
print(f"Test set: {np.bincount(y_test)}")
print(f"\nClass imbalance ratio: {np.bincount(y_test)[0]/np.bincount(y_test)[1]:.2f}:1")
print(f"\nOptimal threshold (Youden's J): {optimal_threshold:.3f}")
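
The cell above picks the threshold on the test set, which makes the reported test metrics somewhat optimistic. In practice the threshold is usually chosen on a separate validation split. Below is a minimal NumPy-only sketch of the same Youden's J selection; the function name `youden_threshold` and the arrays `val_labels` and `val_scores` are hypothetical, and the numbers are made up for illustration:

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Return the candidate threshold that maximizes J = TPR - FPR."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    best_t, best_j = None, -np.inf
    # Every distinct score is a candidate cut-off (predict 1 when score >= t)
    for t in np.unique(y_score):
        y_pred = y_score >= t
        tpr = np.sum(y_pred & (y_true == 1)) / n_pos
        fpr = np.sum(y_pred & (y_true == 0)) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j

# Toy validation scores (illustrative values, not from the tutorial's dataset)
val_labels = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
val_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90])

t_star, j_star = youden_threshold(val_labels, val_scores)
print(f"threshold={t_star:.2f}, J={j_star:.2f}")
```

On this toy data the selected cut-off is 0.40 with J ≈ 0.83. The threshold found this way would then be applied, unchanged, to the test probabilities.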

---
## Exercise 9: Multiclass Classification - OvR vs Softmax

### Concept
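Two main strategies for multiclass classification with K classes:
1. **One-vs-Rest (OvR)**: Train K binary classifiers, one per class against all the others, and predict the class whose classifier scores highest
2. **Softmax (Multinomial)**: Train a single model that outputs one probability distribution over all K classes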

Softmax function: $P(y=k|x) = \frac{e^{w_k^Tx}}{\sum_j e^{w_j^Tx}}$
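
To make the formula concrete, here is a small NumPy sketch. The scores in `z` are arbitrary illustrative values standing in for $w_k^Tx$. It also checks that adding a constant to every score leaves the probabilities unchanged, which is why implementations can subtract the maximum score for numerical stability:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max does not change the result."""
    z = np.atleast_2d(z)
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

z = np.array([[2.0, 1.0, 0.1]])   # one sample, three class scores
p = softmax(z)
p_shifted = softmax(z + 100.0)     # same scores shifted by a constant

print(p.round(3))                  # probabilities, largest for the largest score
print(np.allclose(p, p_shifted))   # shift invariance
```

The three probabilities are roughly 0.659, 0.242, and 0.099, and they sum to 1.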

### Implementation

In [None]:
# Load Iris dataset for multiclass
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Use only 2 features for visualization
X_iris_2d = X_iris[:, [0, 2]]  # Sepal length and petal length

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_iris_2d, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

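# Train different multiclass strategies
# Note: the multi_class argument is deprecated since scikit-learn 1.5. With the default
# lbfgs solver, LogisticRegression already fits a multinomial (softmax) model on
# multiclass targets, so 'Softmax (Multinomial)' and 'Auto (Default)' are the same model.
models = {
    'One-vs-Rest': OneVsRestClassifier(LogisticRegression(random_state=42)),
    'Softmax (Multinomial)': LogisticRegression(random_state=42),
    'Auto (Default)': LogisticRegression(random_state=42)
}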

# Train models
for name, model in models.items():
    model.fit(X_train_scaled, y_train)

# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Helper function for decision regions
def plot_decision_regions_multi(ax, X, y, model, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    
    # Plot data points
    colors = ['blue', 'red', 'green']
    markers = ['o', 's', '^']
    for i in range(3):
        idx = y == i
        ax.scatter(X[idx, 0], X[idx, 1], c=colors[i], 
                  marker=markers[i], s=50, edgecolor='black',
                  label=iris.target_names[i])
    
    ax.set_title(title)
    ax.set_xlabel('Sepal Length (scaled)')
    ax.set_ylabel('Petal Length (scaled)')
    ax.legend()

# Row 1: Decision boundaries
for idx, (name, model) in enumerate(models.items()):
    plot_decision_regions_multi(axes[0, idx], X_test_scaled, y_test, model, 
                               f'{name}\nTest Acc: {model.score(X_test_scaled, y_test):.3f}')

# Row 2: Probability contours for Softmax model
softmax_model = models['Softmax (Multinomial)']

h = 0.02
x_min, x_max = X_test_scaled[:, 0].min() - 1, X_test_scaled[:, 0].max() + 1
y_min, y_max = X_test_scaled[:, 1].min() - 1, X_test_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Get probabilities for each class
Z_proba = softmax_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])

for i in range(3):
    Z_class = Z_proba[:, i].reshape(xx.shape)
    
    im = axes[1, i].contourf(xx, yy, Z_class, levels=20, cmap='RdBu_r', alpha=0.7)
    axes[1, i].scatter(X_test_scaled[y_test == i, 0], 
                      X_test_scaled[y_test == i, 1],
                      c='black', s=50, edgecolor='white')
    axes[1, i].set_title(f'P(y={iris.target_names[i]}|x)')
    axes[1, i].set_xlabel('Sepal Length (scaled)')
    axes[1, i].set_ylabel('Petal Length (scaled)')
    plt.colorbar(im, ax=axes[1, i])

plt.tight_layout()
plt.show()

# Compare predictions and probabilities
print("\nSample Predictions Comparison:")
print("=" * 70)

sample_idx = [0, 10, 20, 30, 40]
X_samples = X_test_scaled[sample_idx]
y_true_samples = y_test[sample_idx]

for name, model in models.items():
    print(f"\n{name}:")
    predictions = model.predict(X_samples)
    probabilities = model.predict_proba(X_samples)
    
    for i in range(len(sample_idx)):
        print(f"  Test sample {sample_idx[i]}: True={iris.target_names[y_true_samples[i]]}, "
              f"Pred={iris.target_names[predictions[i]]}, "
              f"Probs={probabilities[i].round(3)}")

# Classification reports
print("\n" + "="*70)
print("Classification Reports:")
print("="*70)

for name, model in models.items():
    print(f"\n{name}:")
    y_pred = model.predict(X_test_scaled)
    print(classification_report(y_test, y_pred, target_names=iris.target_names))

---
## Exercise 10: Softmax Regression Implementation

### Concept
Softmax regression generalizes logistic regression to K classes:
- Uses categorical cross-entropy loss
- Outputs probability distribution over all classes
- Each class has its own weight vector
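
For one-hot labels $Y$ and predicted probabilities $P$, the gradient of the mean cross-entropy with respect to the weight matrix is $X^T(P - Y)/n$. The sketch below checks that identity numerically on random data. It is a standalone illustration without bias or regularization, and the helper names `loss_fn` and `analytic_grad` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 3, 4                      # samples, features, classes
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
Y = np.eye(k)[y]                       # one-hot labels, shape (n, k)
W = rng.normal(scale=0.1, size=(d, k))

def softmax(z):
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def loss_fn(W):
    """Mean categorical cross-entropy (no regularization)."""
    P = softmax(X @ W)
    return -np.sum(Y * np.log(P)) / n

def analytic_grad(W):
    """Closed-form gradient: X^T (P - Y) / n."""
    return X.T @ (softmax(X @ W) - Y) / n

# Central finite differences, one weight at a time
eps = 1e-6
numeric_grad = np.zeros_like(W)
for i in range(d):
    for j in range(k):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric_grad[i, j] = (loss_fn(W + step) - loss_fn(W - step)) / (2 * eps)

max_diff = np.max(np.abs(numeric_grad - analytic_grad(W)))
print(f"max |numeric - analytic| = {max_diff:.2e}")
```

A very small difference indicates the closed-form gradient is correct. The same kind of check is a useful debugging tool whenever you write a backward pass by hand.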

### Implementation

In [None]:
class SoftmaxRegression:
    """Softmax Regression (Multinomial Logistic Regression) from scratch"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000, reg_lambda=0.01):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.reg_lambda = reg_lambda
        self.weights = None
        self.bias = None
        self.losses = []
    
    def softmax(self, z):
        """Softmax function for multi-class probabilities"""
        # Subtract max for numerical stability
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)
    
    def one_hot_encode(self, y, n_classes):
        """Convert labels to one-hot encoding"""
        one_hot = np.zeros((len(y), n_classes))
        one_hot[np.arange(len(y)), y] = 1
        return one_hot
    
    def categorical_cross_entropy(self, y_true, y_pred):
        """Categorical cross-entropy loss"""
        n_samples = len(y_true)
        # Add small epsilon to prevent log(0)
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        
        # Calculate loss
        loss = -np.sum(y_true * np.log(y_pred)) / n_samples
        
        # Add L2 regularization
        loss += self.reg_lambda * np.sum(self.weights ** 2) / (2 * n_samples)
        
        return loss
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.n_classes = len(np.unique(y))
        
        # Initialize weights and bias
        self.weights = np.random.randn(n_features, self.n_classes) * 0.01
        self.bias = np.zeros(self.n_classes)
        
        # One-hot encode labels
        y_one_hot = self.one_hot_encode(y, self.n_classes)
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.softmax(z)
            
            # Calculate loss
            loss = self.categorical_cross_entropy(y_one_hot, y_pred)
            self.losses.append(loss)
            
            # Backward pass
            dz = (y_pred - y_one_hot) / n_samples
            dw = np.dot(X.T, dz) + self.reg_lambda * self.weights / n_samples
            db = np.sum(dz, axis=0)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            if i % 100 == 0:
                print(f'Iteration {i}, Loss: {loss:.4f}')
    
    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.softmax(z)
    
    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Create synthetic multiclass dataset
X, y = make_classification(n_samples=600, n_features=20, n_informative=15,
                          n_redundant=5, n_classes=4, random_state=42)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                   random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train our softmax regression
print("Training Softmax Regression from scratch...")
softmax_reg = SoftmaxRegression(learning_rate=0.1, n_iterations=500, reg_lambda=0.01)
softmax_reg.fit(X_train_scaled, y_train)

# Compare with sklearn
print("\nTraining sklearn's Logistic Regression (multinomial)...")
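# The default lbfgs solver fits a multinomial (softmax) model on multiclass targets;
# the explicit multi_class argument is deprecated since scikit-learn 1.5.
sklearn_softmax = LogisticRegression(max_iter=500)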
sklearn_softmax.fit(X_train_scaled, y_train)

# Evaluate models
our_train_acc = softmax_reg.score(X_train_scaled, y_train)
our_test_acc = softmax_reg.score(X_test_scaled, y_test)
sklearn_train_acc = sklearn_softmax.score(X_train_scaled, y_train)
sklearn_test_acc = sklearn_softmax.score(X_test_scaled, y_test)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Loss curve
axes[0, 0].plot(softmax_reg.losses, 'b-', linewidth=2)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training Loss (Categorical Cross-Entropy)')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Confusion matrix (our implementation)
y_pred_our = softmax_reg.predict(X_test_scaled)
cm_our = confusion_matrix(y_test, y_pred_our)
sns.heatmap(cm_our, annot=True, fmt='d', cmap='Blues', ax=axes[0, 1])
axes[0, 1].set_title(f'Our Implementation\nTest Acc: {our_test_acc:.3f}')
axes[0, 1].set_xlabel('Predicted')
axes[0, 1].set_ylabel('Actual')

# Plot 3: Confusion matrix (sklearn)
y_pred_sklearn = sklearn_softmax.predict(X_test_scaled)
cm_sklearn = confusion_matrix(y_test, y_pred_sklearn)
sns.heatmap(cm_sklearn, annot=True, fmt='d', cmap='Greens', ax=axes[1, 0])
axes[1, 0].set_title(f'Sklearn Implementation\nTest Acc: {sklearn_test_acc:.3f}')
axes[1, 0].set_xlabel('Predicted')
axes[1, 0].set_ylabel('Actual')

# Plot 4: Probability distributions for sample points
n_samples_plot = 10
sample_probs = softmax_reg.predict_proba(X_test_scaled[:n_samples_plot])

x_pos = np.arange(n_samples_plot)
width = 0.2
colors = ['blue', 'green', 'red', 'orange']

for i in range(4):
    axes[1, 1].bar(x_pos + i * width, sample_probs[:, i], width,
                   label=f'Class {i}', color=colors[i])

axes[1, 1].set_xlabel('Sample Index')
axes[1, 1].set_ylabel('Probability')
axes[1, 1].set_title('Predicted Probability Distribution\n(First 10 test samples)')
axes[1, 1].set_xticks(x_pos + 1.5 * width)
axes[1, 1].set_xticklabels([f'S{i}' for i in range(n_samples_plot)])
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nModel Comparison:")
print("=" * 50)
print("Our Implementation:")
print(f"  Train Accuracy: {our_train_acc:.3f}")
print(f"  Test Accuracy:  {our_test_acc:.3f}")
print("\nSklearn Implementation:")
print(f"  Train Accuracy: {sklearn_train_acc:.3f}")
print(f"  Test Accuracy:  {sklearn_test_acc:.3f}")
print(f"\nTest accuracy gap (ours - sklearn): {our_test_acc - sklearn_test_acc:+.3f}")

### Your Turn
1. Implement early stopping based on validation loss
2. Add dropout regularization to prevent overfitting
3. Compare different optimization algorithms (SGD, Adam, etc.)
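
As a possible starting point for item 1, here is a minimal patience-based sketch. The `EarlyStopping` class, its parameters, and the loss values are all hypothetical illustrations, not part of the `SoftmaxRegression` class above:

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses: improvement stalls after the fourth value
val_losses = [1.20, 0.90, 0.75, 0.70, 0.71, 0.72, 0.70, 0.73]
stopper = EarlyStopping(patience=3)
stopped_at = None
for iteration, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = iteration
        break

print(stopped_at, stopper.best_loss)
```

To use this idea inside `fit`, you would compute the loss on a held-out validation set each iteration, call `step`, and break out of the loop when it returns `True`. Keeping a copy of the weights from the best iteration is a common refinement.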

---
## Congratulations! 🎉

You've completed a comprehensive journey from Linear to Logistic Regression!

### What you've learned:
✅ Advanced linear regression techniques (polynomial, regularization)  
✅ Why and how to transition to classification  
✅ Logistic regression theory and implementation  
✅ Multiclass classification strategies  
✅ Model evaluation and optimization  

### Next Steps:
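1. **Deep Learning**: Neural networks extend these concepts; a softmax output layer trained with cross-entropy is softmax regression applied to learned features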
2. **Tree-based Methods**: Decision trees, Random Forests, XGBoost
3. **Support Vector Machines**: Another approach to classification
4. **Ensemble Methods**: Combining multiple models

Happy modeling! 🚀