# Logistic Regression - Complete ML Pipeline

This notebook demonstrates a complete machine learning pipeline for logistic regression including:
- Data loading and exploration
- Data preprocessing (label encoding, one-hot encoding, feature standardization)
- Model training with hyperparameter optimization
- Model evaluation

Dataset: Iris (150 samples, 4 numeric features, 3 classes), bundled with scikit-learn and loaded via `load_iris`

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)

## 2. Load and Explore Data

In [None]:
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df['species'] = df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
df.head()

In [None]:
# Basic statistics
print("Dataset statistics:")
df.describe()

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Check class distribution
print("\nClass distribution:")
print(df['species'].value_counts())

In [None]:
# Visualize data distribution
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, col in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    for species in df['species'].unique():
        data = df[df['species'] == species][col]
        ax.hist(data, alpha=0.5, label=species)
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax.legend()
plt.tight_layout()
plt.show()

## 3. Data Preprocessing

### 3.1 Categorical Encoding

In [None]:
# For demonstration, we'll show both Label Encoding and One-Hot Encoding

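# Label Encoding for target variable
# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels,
# which here matches the original iris.target coding (0, 1, 2)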
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['species'])

print("Original species:", df['species'].unique())
print("Encoded labels:", np.unique(y))
print("Mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))
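
As a quick check on the encoding above, the following standalone sketch rebuilds the species labels from `load_iris` and shows that `LabelEncoder` assigns codes in sorted order and that `inverse_transform` recovers the original strings. The variable names (`species`, `encoder`, `codes`, `decoded`) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
# Array of species strings, one per sample
species = iris.target_names[iris.target]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted unique labels; index i is the integer code of that label
print(list(encoder.classes_))

# Round trip: integer codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(np.array_equal(decoded, species))
```

Keeping the fitted encoder around is what makes it possible to report predictions as species names rather than integers later on.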

In [None]:
# One-Hot Encoding example (for categorical features if any)
# In this dataset, we don't have categorical features, but we'll demonstrate the technique

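# Create a sample categorical feature for demonstration
# (bins=3 splits the sepal-length range into three equal-width intervals)
df['size_category'] = pd.cut(df['sepal length (cm)'], bins=3, labels=['small', 'medium', 'large'])

# One-Hot Encoding
# drop='first' removes one redundant column to avoid multicollinearity.
# sparse_output requires scikit-learn >= 1.2 (older releases use sparse=False).
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')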
size_encoded = one_hot_encoder.fit_transform(df[['size_category']])
size_encoded_df = pd.DataFrame(size_encoded, columns=one_hot_encoder.get_feature_names_out(['size_category']))

print("One-Hot Encoded size_category:")
print(size_encoded_df.head())

### 3.2 Feature Selection and Train-Test Split

In [None]:
# Select features: only the four original numeric columns
# (the demo size_category column is not used for modeling)
X = df[iris.feature_names].values

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
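
The `stratify=y` argument is what keeps the class proportions equal in both splits. A standalone sketch of how one might verify this, using the same split settings as above:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count samples per class in each split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train counts:", train_counts)
print("Test counts:", test_counts)
```

Because Iris has 50 samples per class, an 80/20 stratified split leaves 40 of each class for training and 10 of each for testing.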

### 3.3 Feature Normalization

In [None]:
# Standardize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Original feature means:", X_train.mean(axis=0))
print("Original feature stds:", X_train.std(axis=0))
print("\nScaled feature means:", X_train_scaled.mean(axis=0))
print("Scaled feature stds:", X_train_scaled.std(axis=0))
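
Under the hood, standardization is z = (x - mean) / std, with the mean and standard deviation taken from the training set only. The sketch below reproduces `StandardScaler` by hand to show that the test set is transformed with training statistics; the names `mu`, `sigma`, and `X_test_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# Per-feature statistics from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the training statistics to the test data
X_test_manual = (X_test - mu) / sigma
matches = np.allclose(X_test_manual, scaler.transform(X_test))
print("Manual scaling matches StandardScaler:", matches)
```

The scaled test features will generally not have mean exactly 0 or standard deviation exactly 1, which is expected: those properties hold only for the data the scaler was fitted on.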

## 4. Model Training with Hyperparameter Optimization

In [None]:
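# Define hyperparameter grid
# Note: both solvers support the l1 and l2 penalties, but they treat the three-class
# target differently (liblinear fits one-vs-rest models, saga a multinomial model).
# Recent scikit-learn releases warn about liblinear on multiclass targets, so drop it
# from the grid if a FutureWarning or error appears.
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],   # inverse regularization strength
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]            # low values may trigger ConvergenceWarning with saga
}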

# Create base model
base_model = LogisticRegression(random_state=42)

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(
    base_model,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("Starting hyperparameter optimization...")
grid_search.fit(X_train_scaled, y_train)
print("\nOptimization complete!")
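
To get a feel for the cost of the search, one can count the candidates with `ParameterGrid`. This sketch copies the grid from the cell above; `n_candidates` and `n_cv_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
n_cv_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
```

The grid grows multiplicatively with each added parameter, so 6 x 2 x 2 x 3 = 72 candidates become 360 fits under 5-fold cross-validation, plus one final refit of the best candidate on the full training set.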

In [None]:
# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Get the best model
best_model = grid_search.best_estimator_

In [None]:
# Display top 5 parameter combinations
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df.sort_values('rank_test_score')
print("\nTop 5 parameter combinations:")
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head())
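
One possible refinement, not used in this notebook, is to place the scaler inside a `Pipeline` so that it is refitted on the training portion of every cross-validation fold instead of once on the whole training set. The sketch below is a standalone variant under stated assumptions: the step names `scaler` and `clf`, the reduced grid over `C`, and `max_iter=1000` with the default solver are choices made for brevity, not the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling and classification are tuned and validated as one estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {'clf__C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # raw features: the pipeline scales them itself

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
```

On a small, clean dataset like Iris the difference is likely negligible, but the pattern avoids validation folds influencing the preprocessing statistics.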

## 5. Model Evaluation

In [None]:
# Make predictions
y_train_pred = best_model.predict(X_train_scaled)
y_test_pred = best_model.predict(X_test_scaled)

# Calculate accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Training Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)

In [None]:
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred, target_names=iris.target_names))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
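
The per-class precision and recall in the classification report can be read directly off a confusion matrix. The sketch below uses a hypothetical 3-class matrix (the numbers are invented and are not this notebook's results) to show the arithmetic.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [10, 0, 0],
    [0, 9, 1],
    [0, 2, 8],
])

true_positives = np.diag(cm)

# Recall: correct predictions divided by the number of true samples (row sums)
recall = true_positives / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions (column sums)
precision = true_positives / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all samples
accuracy = true_positives.sum() / cm.sum()

print("Recall:", recall)
print("Precision:", precision)
print("Accuracy:", accuracy)
```

In this made-up example the second class has recall 9/10 but precision 9/11, because two samples of the third class were predicted as the second.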

In [None]:
# Get prediction probabilities
y_test_proba = best_model.predict_proba(X_test_scaled)

# Visualize the distribution of predicted probabilities for each class
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, species in enumerate(iris.target_names):
    axes[i].hist(y_test_proba[:, i], bins=20, alpha=0.7)
    axes[i].set_title(f'Prediction Probability for {species}')
    axes[i].set_xlabel('Probability')
    axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
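
Two properties of `predict_proba` are worth checking when reading these histograms: each row sums to 1, and `predict` returns the class with the highest probability. A standalone sketch, assuming a plain `LogisticRegression(max_iter=1000)` rather than the tuned `best_model`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# One row per test sample, one column per class (ordered as clf.classes_)
proba = clf.predict_proba(scaler.transform(X_test))
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# Map the most probable column back to its class label
predicted = clf.classes_[proba.argmax(axis=1)]
same_as_predict = np.array_equal(predicted, clf.predict(scaler.transform(X_test)))

print(proba.shape, rows_sum_to_one, same_as_predict)
```

Samples whose largest probability is well below 1 are the ones the model is least sure about, which is usually where misclassifications concentrate.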

## 6. Model Coefficients Analysis

In [None]:
# Display model coefficients
coef_df = pd.DataFrame(
    best_model.coef_,
    columns=iris.feature_names,
    index=iris.target_names
)

print("Model Coefficients:")
print(coef_df)

# Visualize coefficients
plt.figure(figsize=(10, 6))
sns.heatmap(coef_df, annot=True, cmap='coolwarm', center=0)
plt.title('Logistic Regression Coefficients')
plt.show()
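
Coefficients are easier to interpret in the binary case, where exponentiating one gives an odds ratio. The sketch below is an illustration on an assumed two-class sub-problem (versicolor vs. virginica), not the three-class model above, where coefficients describe each class relative to the others. It also scales on the full subset for simplicity, since nothing is being evaluated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Keep versicolor (1) and virginica (2); the positive class is virginica
mask = y != 0
X_bin = StandardScaler().fit_transform(X[mask])
y_bin = (y[mask] == 2).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_bin, y_bin)

# exp(coefficient) = multiplicative change in odds per +1 standard deviation
odds_ratios = np.exp(clf.coef_[0])


def odds(x):
    """Odds of the positive class for a single standardized sample."""
    p = clf.predict_proba(x.reshape(1, -1))[0, 1]
    return p / (1 - p)


x0 = np.zeros(4)   # a sample at the mean of every feature
x1 = x0.copy()
x1[2] += 1.0       # petal length raised by one standard deviation

ratio_check = np.isclose(odds(x1) / odds(x0), odds_ratios[2])
print("Odds ratios:", odds_ratios)
print("Matches exp(coef) for petal length:", ratio_check)
```

Because the features are standardized, the odds ratios are comparable across features; on raw features they would depend on each feature's units.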

## Summary

This notebook demonstrated a complete machine learning pipeline for logistic regression:

1. **Data Loading and Exploration**: Loaded the Iris dataset and performed exploratory data analysis
2. **Data Preprocessing**:
   - Label encoding for target variable
   - One-hot encoding demonstration for categorical features
   - Feature normalization using StandardScaler
3. **Model Training**: Used GridSearchCV for hyperparameter optimization
4. **Model Evaluation**: Assessed model performance using accuracy, classification report, and confusion matrix
5. **Model Interpretation**: Analyzed model coefficients to understand feature importance

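Key takeaways:
- Standardize features before fitting a regularized logistic regression, so the penalty treats all features on the same scale and iterative solvers such as saga converge faster
- Fit the scaler on the training data only and reuse it to transform the test data
- Use cross-validation for hyperparameter tuning, and keep the test set untouched until the final evaluation
- Evaluate models using multiple metrics (accuracy, per-class precision and recall, confusion matrix)
- Interpret coefficients on standardized features to compare feature contributions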