# Scikit-learn pipelines

> Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. 

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
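
To make the quoted definition concrete, here is a minimal sketch of a pipeline with one intermediate transform and a final estimator. The synthetic dataset, the step names `scale` and `model`, and the choice of `StandardScaler` with `LogisticRegression` are illustrative assumptions, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used purely for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = Pipeline(steps=[
    ("scale", StandardScaler()),      # intermediate step: implements fit and transform
    ("model", LogisticRegression()),  # final estimator: only needs fit
])

# fit() runs fit_transform on the scaler, then fits the model on the scaled data.
clf.fit(X, y)
predictions = clf.predict(X)
```

Calling `clf.predict` applies the scaler's `transform` before the model's `predict`, so the preprocessing cannot be skipped by accident.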

## 1. Setup

In [1]:
# Standard library imports
import re

# Third party imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

## 2. Pipelines

### Pipeline Components

In [2]:
def remove_special_characters(X: str, **fit_params) -> str:
    """Remove special characters from a string.

    Args:
        X (str): String to remove special characters from.

    Returns:
        str: String with special characters removed.
    """
    # Remove all characters that are not a-z, A-Z or 0-9.
    regex_pattern = "[^a-zA-Z0-9]"

    # Run removal
    output = re.sub(pattern=regex_pattern, repl='', string=X)

    return output

In [3]:
class CaseFixer:
    """Class to apply title-casing."""

    def fit(self, X: str, y=None, **fit_params) -> "CaseFixer":
        """No-op fit: title-casing is stateless, so there is nothing to learn."""
        return self

    def transform(self, X: str, y=None, **fit_params) -> str:
        """Apply title casing."""
        return X.title()

    def fit_transform(self, X, y=None):
        """Fit then transform."""
        self.fit(X, y)

        return self.transform(X, y)
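
As an alternative sketch, the same transformer can inherit from scikit-learn's base classes: `TransformerMixin` supplies `fit_transform` and `BaseEstimator` supplies `get_params`/`set_params`, so neither has to be written by hand. The class name `MixinCaseFixer` is hypothetical and only used here for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class MixinCaseFixer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CaseFixer built on scikit-learn's base classes."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # Title-case a single string, as CaseFixer does.
        return X.title()


fixer = MixinCaseFixer()
# fit_transform is inherited from TransformerMixin.
result = fixer.fit_transform("hello world")
print(result)
```

Inheriting from `BaseEstimator` also makes any constructor arguments visible to `Pipeline.set_params`, which matters once the transformer has options to tune.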

In [4]:
def print_kwargs(X, **kwargs):
    """Print the keyword arguments received, then return X unchanged."""
    print(f"Keyword arguments are: {kwargs}")

    return X
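
`FunctionTransformer` forwards whatever is stored in its `kw_args` parameter to the wrapped function on every `transform` call. The sketch below illustrates this with a hypothetical helper, `repeat`, that is not part of this notebook.

```python
from sklearn.preprocessing import FunctionTransformer


def repeat(X, times=1):
    """Hypothetical helper: repeat a string `times` times."""
    return X * times


# Without kw_args, the function runs with its own defaults.
plain = FunctionTransformer(func=repeat)
# With kw_args, the dictionary is unpacked into the function call.
configured = FunctionTransformer(func=repeat, kw_args={"times": 2})

plain_result = plain.transform("ab")
configured_result = configured.transform("ab")
print(plain_result, configured_result)
```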

### Pipeline

In [5]:
pipe = Pipeline(steps=[
    ("test-keyword", FunctionTransformer(func=print_kwargs)),  # Keyword arguments passed in later via the arg_grid
    ("test-keyword-manual", FunctionTransformer(func=print_kwargs, kw_args={"text": "Manual"})),  # Pass keyword arguments directly here
    ("remove-special-characters", FunctionTransformer(func=remove_special_characters)),
    ("test-plain", "passthrough"),  # Generic pass-through
    ("fix-casing", CaseFixer()),
])

### Implementation

In [6]:
# Map "<step name>__<parameter name>" keys to the values to set on each pipeline step
arg_grid = {
    "test-keyword__kw_args": {"Success": True},  # Pass keyword arguments to test-keyword
}

In [7]:
# Pass arguments to scikit-learn pipeline
pipe.set_params(**arg_grid)
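
The `<step name>__<parameter name>` convention used by `set_params` can be checked with `get_params`. The following self-contained sketch assumes a hypothetical one-step pipeline (`demo_pipe`, with a step named `append` wrapping a helper `add_suffix`) rather than the pipeline defined above.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_suffix(X, suffix=""):
    """Hypothetical helper: append `suffix` to every string in a list."""
    return [text + suffix for text in X]


demo_pipe = Pipeline(steps=[("append", FunctionTransformer(func=add_suffix))])

# "append" is the step name and "kw_args" is a FunctionTransformer parameter.
demo_pipe.set_params(append__kw_args={"suffix": "!"})

# get_params exposes the nested parameter under the same double-underscore key.
stored = demo_pipe.get_params()["append__kw_args"]
result = demo_pipe.fit_transform(["hello"])
print(stored, result)
```

The same keys work in a `GridSearchCV` parameter grid, which is one reason to prefer them over mutating step objects directly.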

In [8]:
example_input = ["!HeL.lO", "G%ooDby!.E"]

In [9]:
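# Fit and transform in one call; every step is stateless, so fitting learns nothing
# but leaves the pipeline in a fitted state.
output = [pipe.fit_transform(x) for x in example_input]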

Keyword arguments are: {'Success': True}
Keyword arguments are: {'text': 'Manual'}
Keyword arguments are: {'Success': True}
Keyword arguments are: {'text': 'Manual'}


In [10]:
print(output)

['Hello', 'Goodbye']
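
The steps above each operate on a single string, which is why the pipeline is called once per input. One possible variation, sketched here under the assumption that each step receives the whole list, lets a single call process every input. The names `remove_special_characters_batch`, `title_case_batch`, and `batch_pipe` are hypothetical.

```python
import re

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def remove_special_characters_batch(X):
    """Strip every character that is not a-z, A-Z or 0-9 from each string."""
    return [re.sub(r"[^a-zA-Z0-9]", "", text) for text in X]


def title_case_batch(X):
    """Title-case each string in the list."""
    return [text.title() for text in X]


batch_pipe = Pipeline(steps=[
    ("remove-special-characters", FunctionTransformer(func=remove_special_characters_batch)),
    ("fix-casing", FunctionTransformer(func=title_case_batch)),
])

# One call handles the whole list instead of one call per string.
batch_output = batch_pipe.fit_transform(["!HeL.lO", "G%ooDby!.E"])
print(batch_output)
```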
