<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_13_sklearn_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scikit-Learn Pipelines

To keep a pipeline flexible, especially when you want to swap resampling methods or add new features, build it from scikit-learn pipelines. A pipeline chains preprocessing, feature engineering, resampling, and model training into a single estimator that is fitted and applied as one object. Here’s a structured way to set this up:

1. **Use Scikit-Learn Pipelines for Preprocessing and Feature Engineering**: Define pipelines for common preprocessing tasks and feature engineering. These pipelines can be easily modified to include new steps as needed.

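2. **Create Configurable Resampling Pipelines**: Set up your pipelines so that resampling techniques can be swapped easily. Note that scikit-learn's own `Pipeline` does not accept sampler steps; resamplers such as those from imbalanced-learn need that library's `Pipeline` class, which can still wrap a scikit-learn `ColumnTransformer`.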

3. **Modular Functions for Flexibility**: Write modular functions to create and configure pipelines, making it easy to switch out different components.

4. **Parameterize the Pipeline**: Use function parameters to pass different resampling methods or feature engineering steps to your pipeline functions.
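
The four steps above can be sketched as a single factory function. This is a minimal sketch, not the notebook's own code: `build_preprocessor`, `build_pipeline`, their parameters, and the toy DataFrame are hypothetical names introduced for illustration. Because scikit-learn's `Pipeline` does not accept sampler steps, the sketch assumes the optional `resampler` comes from imbalanced-learn and switches to that library's `Pipeline` class only when one is passed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_features, categorical_features):
    """Impute and scale numeric columns; impute and one-hot encode categorical ones."""
    numeric = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])
    categorical = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(drop='first')),
    ])
    return ColumnTransformer(transformers=[
        ('num', numeric, numeric_features),
        ('cat', categorical, categorical_features),
    ])


def build_pipeline(model, numeric_features, categorical_features, resampler=None):
    """Hypothetical factory: preprocessing, optional resampling, then the classifier."""
    steps = [('preprocessor', build_preprocessor(numeric_features, categorical_features))]
    if resampler is None:
        steps.append(('classifier', model))
        return Pipeline(steps=steps)
    # Assumption: samplers (e.g. SMOTE) require imbalanced-learn's Pipeline class.
    from imblearn.pipeline import Pipeline as ImbPipeline
    steps.append(('resampler', resampler))
    steps.append(('classifier', model))
    return ImbPipeline(steps=steps)


# Toy data, invented purely to show the call pattern.
toy_X = pd.DataFrame({
    'limit_bal': [20000.0, 120000.0, 90000.0, 50000.0, 50000.0, 30000.0],
    'age': [24, 26, 34, 37, 57, 29],
    'sex': ['1', '2', '2', '2', '1', '1'],
})
toy_y = pd.Series([1, 1, 0, 0, 0, 1])

toy_pipeline = build_pipeline(
    LogisticRegression(max_iter=1000),
    numeric_features=['limit_bal', 'age'],
    categorical_features=['sex'],
)
toy_pipeline.fit(toy_X, toy_y)
toy_predictions = toy_pipeline.predict(toy_X)
```

Passing a sampler, for example `resampler=SMOTE(random_state=42)` from `imblearn.over_sampling`, would place it between preprocessing and the classifier, so resampling is applied only while fitting and never at prediction time.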

### Advantages of This Approach

1. **Flexibility**: Easily switch between different resampling methods, feature selection techniques, or models by simply changing the function parameters.
2. **Modularity**: The code is organized into modular functions, making it easier to maintain and extend.
3. **Readability**: Using scikit-learn pipelines keeps the workflow clear and concise, improving code readability.
4. **Reusability**: The same pipeline structure can be reused for different datasets or experiments with minimal changes.

By parameterizing the pipeline creation functions and leveraging scikit-learn’s pipeline capabilities, you can create a robust, flexible, and reusable framework for machine learning experiments.
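
As one illustration of the flexibility and reusability points, a scikit-learn `Pipeline` lets a named step be replaced through `set_params`, so the same structure can be reused with a different model. The sketch below is a toy under stated assumptions: the step names `scaler` and `classifier` and the synthetic data from `make_classification` are chosen for illustration only and are not part of the notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

base_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# clone() returns an unfitted copy, so the original pipeline is left untouched.
forest_pipeline = clone(base_pipeline).set_params(
    classifier=RandomForestClassifier(n_estimators=50, random_state=42)
)

base_pipeline.fit(X_demo, y_demo)
forest_pipeline.fit(X_demo, y_demo)

swapped_step_name = type(forest_pipeline.named_steps['classifier']).__name__
original_step_name = type(base_pipeline.named_steps['classifier']).__name__
```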

### Load, Preprocess, Train & Evaluate Multiple Models

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score
from xgboost import XGBClassifier
import logging
from loan_data_utils import load_and_preprocess_data

# Parameters
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
categorical_columns = ['sex', 'education', 'marriage']
target = 'default_payment_next_month'

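# Load and preprocess data (the helper returns (None, None) if loading fails)
X, y = load_and_preprocess_data(url, categorical_columns, target)
if X is None:
    raise RuntimeError("Failed to load the dataset; check the URL and network access.")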

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include='number').columns.tolist()  # any numeric dtype, not only int64/float64
categorical_features = X.select_dtypes(include=['category']).columns.tolist()

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first'))
        ]), categorical_features)
    ])

# Define the models to compare
models = [
    ('Logistic Regression', LogisticRegression(random_state=42, max_iter=1000)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('Support Vector Machine', SVC(random_state=42)),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('XGBoost', XGBClassifier(random_state=42, eval_metric='logloss'))  # use_label_encoder is deprecated and omitted
]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate each model
results = {}
for name, model in models:
    # Create the pipeline for the current model
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Fit the pipeline to the training data
    pipeline.fit(X_train, y_train)

    # Predict the test data
    y_pred = pipeline.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='macro')
    results[name] = {'Accuracy': accuracy, 'F1 Score': f1}

    # Print the classification report
    print(f"Model: {name}")
    print(classification_report(y_test, y_pred))
    print("\n")

# Print the comparison of models
print("Model Comparison:")
for name, metrics in results.items():
    print(f"{name}: Accuracy = {metrics['Accuracy']:.4f}, F1 Score = {metrics['F1 Score']:.4f}")


Model: Logistic Regression
              precision    recall  f1-score   support

           0       0.82      0.97      0.89      4687
           1       0.70      0.24      0.36      1313

    accuracy                           0.81      6000
   macro avg       0.76      0.61      0.62      6000
weighted avg       0.79      0.81      0.77      6000



Model: Random Forest
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      4687
           1       0.63      0.37      0.46      1313

    accuracy                           0.81      6000
   macro avg       0.74      0.65      0.68      6000
weighted avg       0.80      0.81      0.80      6000



Model: Support Vector Machine
              precision    recall  f1-score   support

           0       0.84      0.96      0.89      4687
           1       0.68      0.34      0.45      1313

    accuracy                           0.82      6000
   macro avg       0.76      0.65      0.67   
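
For a side-by-side view, the per-model metrics can be collected into a pandas DataFrame and sorted. The sketch below is illustrative only: it hard-codes the rounded accuracy and macro-average F1 values from the three reports shown above rather than re-running the models, and `results_demo` is a hypothetical stand-in for the `results` dictionary built in the loop.

```python
import pandas as pd

# Rounded values copied from the classification reports above (hypothetical stand-in for `results`).
results_demo = {
    'Logistic Regression': {'Accuracy': 0.81, 'F1 Score': 0.62},
    'Random Forest': {'Accuracy': 0.81, 'F1 Score': 0.68},
    'Support Vector Machine': {'Accuracy': 0.82, 'F1 Score': 0.67},
}

# orient='index' turns each model name into a row label.
comparison = (
    pd.DataFrame.from_dict(results_demo, orient='index')
    .sort_values('F1 Score', ascending=False)
)
print(comparison)
```

Sorting by macro F1 rather than accuracy gives more weight to the minority class, where the reports show much lower recall than for the majority class.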

### Write Loan Data Utils Script

In [2]:
script_content=r'''
import pandas as pd
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

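def load_data_from_url(url):
    """Read the Excel file at `url`; return a DataFrame, or None if loading fails."""
    try:
        # header=1 uses the second spreadsheet row as column names.
        # Reading .xls files requires the xlrd package.
        df = pd.read_excel(url, header=1)
        logging.info("Data loaded successfully from URL.")
    except Exception as e:
        logging.error(f"Error loading data from URL: {e}")
        return None
    return df

def clean_column_names(df):
    """Lower-case column names and replace spaces with underscores (modifies df in place)."""
    df.columns = [col.lower().replace(' ', '_') for col in df.columns]
    return df

def remove_id_column(df):
    """Drop the 'id' column if it is present."""
    if 'id' in df.columns:
        df = df.drop(columns=['id'])
    return df

def rename_columns(df):
    """Rename 'pay_0' to 'pay_1' so the pay_* columns are numbered consistently."""
    rename_dict = {'pay_0': 'pay_1'}
    df = df.rename(columns=rename_dict)
    return df

def convert_categorical(df, categorical_columns):
    """Cast the listed columns to the pandas 'category' dtype."""
    df[categorical_columns] = df[categorical_columns].astype('category')
    return df

def split_features_target(df, target):
    """Split df into the feature matrix X and the target series y."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

def load_and_preprocess_data(url, categorical_columns, target):
    """Run every step above; return (X, y), or (None, None) if loading fails."""
    df = load_data_from_url(url)
    if df is not None:
        df = clean_column_names(df)
        df = remove_id_column(df)
        df = rename_columns(df)
        df = convert_categorical(df, categorical_columns)
        X, y = split_features_target(df, target)
        return X, y
    return None, None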



'''

# Write the script to a file
with open("loan_data_utils.py", "w") as file:
    file.write(script_content)

print("Script successfully written to loan_data_utils.py")
# Reload script to make functions available for use
import importlib
import loan_data_utils
importlib.reload(loan_data_utils)

from loan_data_utils import *


Script successfully written to loan_data_utils.py
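
To see what the cleaning helpers do without downloading the dataset, the same steps can be chained on a small in-memory DataFrame. This is a sketch under stated assumptions: the three helpers are re-declared inline so the block is self-contained (in the notebook they would be imported from `loan_data_utils`), and the toy rows are invented, with column names chosen to resemble the raw spreadsheet headers.

```python
import pandas as pd


def clean_column_names(df):
    # Copy first so the caller's DataFrame is not modified (a small deviation from the script).
    df = df.copy()
    df.columns = [col.lower().replace(' ', '_') for col in df.columns]
    return df


def remove_id_column(df):
    return df.drop(columns=['id']) if 'id' in df.columns else df


def rename_columns(df):
    return df.rename(columns={'pay_0': 'pay_1'})


# Invented rows; only the header style mimics the raw file.
raw = pd.DataFrame({
    'ID': [1, 2],
    'SEX': [2, 1],
    'PAY_0': [2, -1],
    'default payment next month': [1, 0],
})

# DataFrame.pipe applies each helper in order, passing the result along.
cleaned = (
    raw.pipe(clean_column_names)
       .pipe(remove_id_column)
       .pipe(rename_columns)
)
print(cleaned.columns.tolist())
```

Chaining with `pipe` keeps each helper small and independently testable, which matches the modular structure of the script.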
