<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_00_what_are_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What are sklearn pipelines?

Scikit-learn (sklearn) pipelines chain a sequence of data processing steps and a final estimator (such as a classifier or regressor) into a single object. Calling `fit` or `predict` on that object runs every step in order, so you can build, evaluate, and reuse the entire workflow as a single unit.
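As a minimal sketch of this idea, the block below chains one transformer and one estimator. The toy data and the names `X_toy`, `y_toy`, and `toy_pipeline` are illustrative assumptions, not part of the Titanic example used later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 samples, 3 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# One transformer followed by a final estimator, wrapped in one object
toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# fit() fits and applies the scaler, then fits the classifier on the scaled data
toy_pipeline.fit(X_toy, y_toy)

# predict() applies the already-fitted scaler, then calls the classifier's predict
toy_predictions = toy_pipeline.predict(X_toy[:5])
print(toy_predictions.shape)
```

Each step is a `(name, estimator)` tuple; the names are what later allow individual steps to be inspected or tuned.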

### What do sklearn pipelines do?

Pipelines in sklearn help you:

1. **Streamline Workflows**: Combine multiple steps (e.g., preprocessing, feature extraction, model training) into one cohesive workflow.
2. **Ensure Consistency**: Guarantee that the same transformations are applied to the training data and any new data (e.g., during prediction).
3. **Reduce Errors**: Minimize the risk of data leakage and inconsistencies by encapsulating the steps in a fixed order.
4. **Improve Code Maintainability**: Simplify the process of modifying or extending your workflow by having a single object to manage.
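The consistency point can be checked directly. In this sketch (tiny hand-made arrays, all names hypothetical), the scaler inside the pipeline learns its mean and standard deviation from the training data, and the same values are reused when new data passes through.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature training set and a single new observation
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train_toy = np.array([0, 0, 1, 1])
X_new_toy = np.array([[10.0]])

consistency_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
consistency_pipeline.fit(X_train_toy, y_train_toy)

# The fitted scaler stores statistics computed from the training data only
fitted_scaler = consistency_pipeline.named_steps['scaler']
print(fitted_scaler.mean_)  # [2.5]

# Slicing off the final estimator gives a pipeline of the preprocessing steps;
# the new value is scaled with the training mean and standard deviation
scaled_new = consistency_pipeline[:-1].transform(X_new_toy)
print(scaled_new)
```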

### Why are sklearn pipelines important?

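1. **Prevent Data Leakage**: A pipeline fits its transformers only on the data passed to `fit`, so statistics such as scaling means or imputation medians are learned from the training data alone and then reused on the test data. The same holds during cross-validation, where the whole pipeline is refit on each training fold.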
2. **Simplify Code**: They make the code more readable and modular, reducing redundancy.
3. **Facilitate Hyperparameter Tuning**: Pipelines integrate smoothly with sklearn's hyperparameter tuning tools like `GridSearchCV` or `RandomizedSearchCV`, enabling the tuning of parameters across multiple steps of the workflow.
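A sketch of the leakage point, assuming synthetic data and hypothetical names: passing the whole pipeline to `cross_val_score` means the scaler is refit on the training portion of every fold, rather than once on all of the data before splitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a large offset, so scaling matters
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(120, 3))
y_toy = (X_toy[:, 0] > 50.0).astype(int)

cv_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# For each of the 5 folds, a clone of the pipeline is fit on the training
# portion and scored on the held-out portion, so the scaler never sees
# the held-out rows when computing its mean and standard deviation
fold_scores = cross_val_score(cv_pipeline, X_toy, y_toy, cv=5)
print(fold_scores.mean())
```

Because `cross_val_score` works on clones, `cv_pipeline` itself is still unfitted afterwards and needs its own call to `fit` before it can predict.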

### Role of sklearn pipelines in Machine Learning

In machine learning, pipelines play a crucial role in:

1. **Preprocessing Data**: Transforming raw data into a suitable format for model training (e.g., scaling features, encoding categorical variables).
2. **Feature Engineering**: Creating new features or modifying existing ones to improve model performance.
3. **Model Training and Evaluation**: Simplifying the process of training models and evaluating their performance on test data.
4. **Productionizing Models**: Making it easier to deploy models in a consistent and reliable manner.
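One way to picture the deployment point is to serialize a fitted pipeline and restore it. This is a sketch under stated assumptions: it uses the standard-library `pickle` module and keeps the bytes in memory, whereas a real deployment would write them to a file; the data and names are hypothetical.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(80, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

deploy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X_toy, y_toy)

# Preprocessing and model travel together in one serialized object
serialized = pickle.dumps(deploy_pipeline)
restored_pipeline = pickle.loads(serialized)

# The restored pipeline reproduces the original predictions on raw inputs
same_predictions = np.array_equal(
    deploy_pipeline.predict(X_toy), restored_pipeline.predict(X_toy))
print(same_predictions)
```

A pickled estimator should be loaded with the same scikit-learn version that created it, and only from a trusted source, since unpickling can execute arbitrary code.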

## Combining Multiple Preprocessing Steps

In [None]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select features and target
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = df['survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing for numeric columns (impute missing values and scale)
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns (impute missing values and one-hot encode)
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with a classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # Alternative final step: ('classifier', LogisticRegression(max_iter=300))
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.82      0.82       105
           1       0.74      0.74      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179



#### Explanation of the Code

1. **Data Loading and Splitting**: Load the Titanic dataset and split it into training and testing sets.
2. **Preprocessing Pipelines**:
   - **Numeric Features**: Impute missing values with the median and scale the features.
   - **Categorical Features**: Impute missing values with 'missing' and one-hot encode the features.
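3. **ColumnTransformer**: Combine the preprocessing steps for numeric and categorical features. Columns not listed in either group (here `sibsp` and `parch`) are dropped, because `remainder` defaults to `'drop'`.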
4. **Pipeline**: Chain the preprocessor and the classifier into one pipeline.
5. **Training and Evaluation**: Fit the pipeline on the training data, make predictions on the test data, and evaluate the model.

This example demonstrates how sklearn pipelines can make your machine learning workflow more efficient, consistent, and easy to manage.
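To see what a `ColumnTransformer` produces, it can help to run one in isolation on a few rows. The sketch below uses a hypothetical four-row DataFrame (`toy_df`) with Titanic-like column names; the values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature dataset with Titanic-like columns
toy_df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.93, 53.10],
    'sex': ['male', 'female', 'female', 'male'],
    'sibsp': [1, 1, 0, 1],
})

toy_transformers = [
    ('num', StandardScaler(), ['age', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
]

# Default remainder='drop': only the listed columns reach the output
dropping = ColumnTransformer(transformers=toy_transformers)
dropped_matrix = dropping.fit_transform(toy_df)
print(dropped_matrix.shape)  # (4, 4): 2 scaled columns + 2 one-hot columns

# remainder='passthrough' keeps unlisted columns such as sibsp unchanged
keeping = ColumnTransformer(transformers=toy_transformers,
                            remainder='passthrough')
kept_matrix = keeping.fit_transform(toy_df)
print(kept_matrix.shape)  # (4, 5): sibsp is appended
```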

## Using Custom Transformers

In [None]:
import seaborn as sns
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

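# Custom transformer that adds a family_size feature
class FamilySizeAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self
    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        # Family size = siblings/spouses + parents/children + the passenger
        X['family_size'] = X['sibsp'] + X['parch'] + 1
        return X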

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select features and target
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = df['survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing for numeric columns (impute missing values and scale)
numeric_features = ['age', 'fare', 'family_size']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns (impute missing values and one-hot encode)
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with a custom transformer and a classifier
pipeline = Pipeline(steps=[
    ('family_size_adder', FamilySizeAdder()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.86      0.85      0.85       105
           1       0.79      0.80      0.79        74

    accuracy                           0.83       179
   macro avg       0.82      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179
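For a stateless transformation such as the family-size feature, `FunctionTransformer` is one alternative to writing a class. The sketch below assumes a hypothetical helper `add_family_size` and a three-row DataFrame; it is not the approach used in the cell above, only an equivalent way to express the same step.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_family_size(frame):
    """Return a new DataFrame with a family_size column added."""
    # assign() returns a copy, so the input frame is left untouched
    return frame.assign(family_size=frame['sibsp'] + frame['parch'] + 1)

# Wrap the plain function so it can be used as a pipeline step
family_size_step = FunctionTransformer(add_family_size)

# Hypothetical passengers
toy_passengers = pd.DataFrame({'sibsp': [1, 0, 3], 'parch': [0, 0, 2]})
with_family = family_size_step.fit_transform(toy_passengers)
print(with_family['family_size'].tolist())  # [2, 1, 6]
```

The resulting object could take the place of `FamilySizeAdder()` as the first step of a `Pipeline`. A class remains the better fit when the transformer has to learn something in `fit`.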



## Integrating Grid Search for Hyperparameter Tuning

In [None]:
import seaborn as sns
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

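# Custom transformer that adds a family_size feature
class FamilySizeAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self
    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        # Family size = siblings/spouses + parents/children + the passenger
        X['family_size'] = X['sibsp'] + X['parch'] + 1
        return X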

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select features and target
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = df['survived']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

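# Define preprocessing for numeric columns (impute missing values and scale)
# Note: 'family_size' is not listed here, so the ColumnTransformer drops the
# column added by FamilySizeAdder; add it to this list to feed it to the model
numeric_features = ['age', 'fare']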
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns (impute missing values and one-hot encode)
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create the full pipeline with a custom transformer and a classifier
pipeline = Pipeline(steps=[
    ('family_size_adder', FamilySizeAdder()),
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))])

# Define parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_features': ['sqrt', 'log2'],
    'classifier__max_depth': [4, 6, 8, 10]
}

# Apply grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Predict and evaluate
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
print("Best parameters:", grid_search.best_params_)


              precision    recall  f1-score   support

           0       0.80      0.91      0.85       105
           1       0.85      0.68      0.75        74

    accuracy                           0.82       179
   macro avg       0.82      0.79      0.80       179
weighted avg       0.82      0.82      0.81       179

Best parameters: {'classifier__max_depth': 6, 'classifier__max_features': 'sqrt', 'classifier__n_estimators': 200}
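The grid above tunes only the classifier, but the same double-underscore naming reaches into nested preprocessing steps. The following sketch, on synthetic data with hypothetical names, tunes an imputer inside a `ColumnTransformer` alongside the classifier and then ranks the candidates through `cv_results_`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic data with Titanic-like column names
rng = np.random.default_rng(42)
n_rows = 150
toy_df = pd.DataFrame({
    'age': rng.normal(30.0, 10.0, n_rows),
    'fare': rng.exponential(30.0, n_rows),
    'sex': rng.choice(['male', 'female'], n_rows),
})
toy_target = ((toy_df['sex'] == 'female') | (toy_df['fare'] > 40.0)).astype(int)
toy_df.loc[rng.random(n_rows) < 0.1, 'age'] = np.nan  # add missing ages

toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', SimpleImputer(), ['age', 'fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ])),
    ('classifier', RandomForestClassifier(n_estimators=25, random_state=42)),
])

# Keys follow <step>__<nested name>__<parameter>
toy_grid = {
    'preprocessor__num__strategy': ['mean', 'median'],
    'classifier__max_depth': [2, 4],
}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=3)
toy_search.fit(toy_df, toy_target)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(toy_search.cv_results_)
ranked = results.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score']]
print(ranked)

# best_estimator_ is the whole pipeline, refit on all of the training data
best_forest = toy_search.best_estimator_.named_steps['classifier']
print(best_forest.max_depth)
```

Looking at `mean_test_score` together with `std_test_score` shows whether the top candidate is clearly ahead or within noise of the others.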
