# <span style = "color:rebeccapurple"> The Pipeline class

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

OK, it is time to talk about pipelines. A pipeline is an **abstraction** representing the whole process through which we implement a machine learning solution. A simplified pipeline may look like this:

1. State the problem.
2. Gather data.
3. Split training and testing data.
4. Preprocess the data.
5. Train the model.
6. Evaluate and optimize the model.
7. Draw awesome figures.

Having such a guide is a useful starting point. However:
- Not all machine learning tasks follow the recipe above, so we must remain flexible, and
- No realistic process is linear.

A realistic process would be more of a dynamic web of relationships. However, this model will do for now.

As it turns out, `scikit-learn` makes our life even easier by letting us specify our own pipeline. This is done through the `Pipeline` class, which we must import.

## <span style = "color:darkorchid"> Imports and data

In [1]:
# :: IMPORTS ::

# Scikit-learn specifics:
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import linear_model

# Helper modules
import pandas as pd

In [2]:
# :: Data ::

try:
    import google.colab
    !wget https://raw.githubusercontent.com/nuitrcs/scikit-learn-workshop-july2025/refs/heads/main/data/penguins.csv
    !wget https://raw.githubusercontent.com/nuitrcs/scikit-learn-workshop-july2025/refs/heads/main/data/fish.csv
    penguins_directory = "penguins.csv"
    fish_directory = "fish.csv"
    print("Successfully loaded files to Colab. Check folder on left column.")
except ModuleNotFoundError:
    penguins_directory = "data/penguins.csv"
    fish_directory = "data/fish.csv"
    print("Data should be in your local directory. Under the 'data' folder.")

Data should be in your local directory. Under the 'data' folder.


## <span style = "color:darkorchid"> Our first pipeline object

Each pipeline object requires a list of *steps* in the pipeline, which we specify as a list of tuples. The tuples consist of two elements: the name we want to give a given step (for example, "scaler", "regressor", etc.), and then the actual object that will realize that step of the pipeline. It looks a lot like the `ColumnTransformer`!

In [3]:
# Create a list of steps:
pipe_list = [
    ("step1_sd_scaler", preprocessing.StandardScaler()),
    ("step2_lm", linear_model.LinearRegression())
]

In [4]:
# Create the pipeline
my_pipe = Pipeline(pipe_list)
my_pipe

Nice! That right there is a pipeline object. Notice that its last step is the `LinearRegression` object, so the output of the pipeline is the output of that final step.

What's great about this is we can now just fit the whole pipeline to our dataset and it will magically work:

In [29]:
# Let's get some penguin data
num_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
cat_features = ["sex"]
pred_features = num_features + cat_features
target_feature = 'body_mass_g'

penguins_X_num = pd.read_csv(penguins_directory)[num_features]      # <-- Only numerical features (for now)
penguins_y = pd.read_csv(penguins_directory)[target_feature]

In [30]:
# Fit the pipeline:
my_pipe.fit(penguins_X_num, penguins_y)

So what did it do? It first scaled our data, and then it took the output of that data and used it to fit a linear regression model.
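To see that there is no magic involved, here is a minimal sketch comparing the pipeline with doing both steps by hand. It uses synthetic data as a stand-in (an assumption, so that it runs without the penguins file); the step names match the ones used above.

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not the penguins dataset)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

# The pipeline way: a single call to fit()
pipe = Pipeline([
    ("step1_sd_scaler", preprocessing.StandardScaler()),
    ("step2_lm", linear_model.LinearRegression()),
])
pipe.fit(X, y)

# The manual way: fit and apply the scaler, then fit the model on its output
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)
lm = linear_model.LinearRegression()
lm.fit(X_scaled, y)

# Both routes should agree up to floating point error
same_coefs = np.allclose(pipe["step2_lm"].coef_, lm.coef_)
same_score = np.isclose(pipe.score(X, y), lm.score(X_scaled, y))
print(same_coefs, same_score)
```

The pipeline simply saves us from carrying `X_scaled` around ourselves.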

Notice we're only using the numerical features; we'll add the categorical ones later on.

The pipeline can calculate your $R^2$:

In [31]:
my_pipe.score(penguins_X_num, penguins_y)

0.7639366781169293

But it can't get you the regression coefficients directly. For that, you'll need to access the linear regression object. You can access each step with Python indexing:

In [32]:
# Access first step
my_pipe[0]

In [33]:
# Access second step
my_pipe[1]

Now we can get the regression coefficients:

In [34]:
my_pipe[1].coef_

array([ 17.98051439,  35.07127531, 710.40104637])

We can also access the steps by name, as in a dictionary:

In [35]:
# Access linear regression object, and get coefficients from there.
my_pipe["step2_lm"].coef_

array([ 17.98051439,  35.07127531, 710.40104637])
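If you forget what you named your steps, the `named_steps` attribute of the pipeline lists them. The short sketch below (on a fresh, unfitted pipeline built just for illustration) also checks that indexing by position and by name hand back the very same object:

```python
from sklearn import linear_model, preprocessing
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("step1_sd_scaler", preprocessing.StandardScaler()),
    ("step2_lm", linear_model.LinearRegression()),
])

# Step names, in pipeline order
step_names = list(pipe.named_steps)
print(step_names)

# Three ways of reaching the linear regression step
by_index = pipe[1]
by_name = pipe["step2_lm"]
by_attr = pipe.named_steps["step2_lm"]

# All three refer to the same object, not to copies
all_same = (by_index is by_name) and (by_name is by_attr)
print(all_same)
```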

### <span style = "color:teal"> Don't forget about the train test split!

OK, the above helped us get familiar with pipelines, but we didn't split our data. That's a sin! Let's atone:

In [14]:
# --- Get data ---
num_features = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]
cat_features = ["sex"]
pred_features = num_features + cat_features
target_feature = 'body_mass_g'

penguins_X_num = pd.read_csv(penguins_directory)[num_features]
penguins_y = pd.read_csv(penguins_directory)[target_feature]

# --- Split data right away ---
seed = 42
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X_num, penguins_y, test_size = .3, random_state=seed)

# --- Create pipeline ---
penguins_pipeline = Pipeline(
    [
        ("step1_sd_scaler", preprocessing.StandardScaler()),
        ("step2_lm", linear_model.LinearRegression())
    ]
)

# --- Fit pipeline ---
penguins_pipeline.fit(pX_train, py_train)

# --- Get score on test dataset ---
penguins_pipeline.score(pX_test, py_test)

0.7380897777018107

Now hold on a sec, didn't we have to preprocess the testing data also? Don't worry, the pipeline does it for us!

When we use the `fit()` method, the entire pipeline is fitted with the given data. However, when we use methods like `score()` and `predict()`, the pipeline does not refit anything: it uses the parameters learned during fitting (for example, the scaler's training means and standard deviations) to preprocess the given data, and then performs the scoring or prediction on the preprocessed data.
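Here is a minimal sketch of that behavior, again on synthetic stand-in data (an assumption, so that it runs on its own). It scores the test set through the pipeline, then repeats the same thing by hand using only `transform()` on the test data:

```python
import numpy as np
from sklearn import linear_model, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; not the penguins dataset)
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=5.0, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ("step1_sd_scaler", preprocessing.StandardScaler()),
    ("step2_lm", linear_model.LinearRegression()),
])
pipe.fit(X_train, y_train)

# Score through the pipeline: preprocessing happens internally
pipe_score = pipe.score(X_test, y_test)

# Same thing by hand: transform() only, the scaler is never refit on test data
scaler = pipe["step1_sd_scaler"]
lm = pipe["step2_lm"]
manual_score = lm.score(scaler.transform(X_test), y_test)

# The scaler's statistics come from the training data alone
means_from_train = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(np.isclose(pipe_score, manual_score), means_from_train)
```

Because the test data never influences the scaler, the pipeline helps us avoid leaking information from the test set into preprocessing.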

How cool!

## <span style = "color:darkorchid"> Implementing ColumnTransformer

Above we only used the numerical data in our pipeline. But what if we also have categorical data, and we want a one-hot encoder together with a standard scaler? Well, no sweat: we can take the `ColumnTransformer` approach we learned earlier and use it inside our pipeline!

<b>Load the data

In [15]:
# Let's get all penguin data
penguins_X = pd.read_csv(penguins_directory)[pred_features]    # <-- Using all predictive features
penguins_y = pd.read_csv(penguins_directory)[target_feature]

In [16]:
penguins_X.head()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | sex |
|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | male |
| 1 | 39.5 | 17.4 | 186.0 | female |
| 2 | 40.3 | 18.0 | 195.0 | female |
| 3 | 36.7 | 19.3 | 193.0 | female |
| 4 | 39.3 | 20.6 | 190.0 | male |


<b>Split the data!

In [17]:
seed = 42
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = .3, random_state=seed)
pX_train.head()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | sex |
|---|---|---|---|---|
| 22 | 40.5 | 17.9 | 187.0 | female |
| 284 | 49.2 | 18.2 | 195.0 | male |
| 294 | 52.8 | 20.0 | 205.0 | male |
| 56 | 37.6 | 17.0 | 185.0 | female |
| 175 | 47.3 | 15.3 | 222.0 | male |


<b>Create a ColumnTransformer:

In [18]:
# Column Transformer
col_trans = ColumnTransformer([
    ("cat", preprocessing.OneHotEncoder(sparse_output = False, drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)
])
col_trans

<b> Make a pipeline

In [19]:
# Make pipeline
penguins_pipeline = Pipeline([
    ("col_trans", col_trans),
    ("linear_model", linear_model.LinearRegression())
])
penguins_pipeline

<b> Now we fit and score:

In [20]:
# Fit pipeline
penguins_pipeline.fit(pX_train, py_train)
penguins_pipeline

In [21]:
# Calculate Score:
penguins_pipeline.score(pX_test, py_test)

0.7774200374725311

### Note

Just as we can make a column transformer a part of a pipeline, we can make a pipeline a part of a column transformer. For example, if each group of columns requires several preprocessing steps before they are merged, you would create a preprocessing pipeline for each group, join them with a column transformer, and then embed that column transformer into a larger pipeline.

We don't have time to do this, but it's useful info for the future.

## <span style = "color:darkorchid"> Alice in Antarctica - Redux

Time to put everything we've done together:

In [43]:
# Step 1 - Load the data
penguins_X = pd.read_csv(penguins_directory)[pred_features]
penguins_y = pd.read_csv(penguins_directory)[target_feature]

# Step 2 - Split the data
seed = 42
test_fraction = .3
pX_train, pX_test, py_train, py_test = train_test_split(penguins_X, penguins_y, test_size = test_fraction, random_state=seed)

# Step 3 - Create transformers and pipelines
col_trans = ColumnTransformer([
    ("cat", preprocessing.OneHotEncoder(drop = "if_binary"), cat_features),
    ("num", preprocessing.StandardScaler(), num_features)
])

penguins_pipeline = Pipeline([
    ("col_trans", col_trans),
    ("linear_model", linear_model.LinearRegression())
])

# Step 4 - Fit full pipeline
penguins_pipeline.fit(pX_train, py_train)

# Step 5 - Evaluate
coeffs = penguins_pipeline['linear_model'].coef_
r2 = penguins_pipeline.score(pX_test, py_test)

print(f"Coefficients: {coeffs}\nR-squared: {r2:.2f}")

Coefficients: [ 585.21414061   30.16274125 -189.53804076  506.64167501]
R-squared: 0.78


Notice that we keep "abstracting away" and taking a perspective at higher and higher levels. This is the art of good object-oriented programming, and also of managing a complicated machine learning or data science project.

## <span style = "color:red"> Bo's fishy quest - Redux

Create a full regression pipeline, as above, for the fish data Bo is using.
- Include a `ColumnTransformer` for the preprocessing steps.
- Feel free to change the `test_fraction` parameter if you wish.
- Discuss the coefficients and $R^2$ with your neighbor.