# A Machine Learning Pipeline for Classification with scikit-learn

In this example we build a simple predictive model using machine learning. We revisit the Titanic passenger list data set and use it to train a classifier that tries to determine whether a passenger survived the disaster, based on the person's attributes in the passenger list. This is an educational example using small data, but a similar sequence of steps can be applied to real-world predictive analytics tasks on large amounts of data.

## Preamble

In [None]:
import pandas
import sklearn
import numpy
import matplotlib.pyplot as plt
import seaborn

In [None]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style(dark=True)

## Loading the Data

In [None]:
data_path = "../.assets/data/titanic/titanic.csv"

In [None]:
!head {data_path}

... and always keep the documentation close for reference:

In [None]:
!cat ../.assets/data/titanic/titanic-documentation.txt

In [None]:
data = pandas.read_csv(data_path)

In [None]:
data.head()

## Machine Learning Building Blocks

A machine learning pipeline is a sequence of processing steps or stages that leads from the raw data to the desired result, e.g. a trained model or a prediction. **[scikit-learn](https://scikit-learn.org/)** provides an API to map this concept to code.

![](graphics/sklearn-elements.png)

**Estimator**

An `Estimator` is a learning algorithm: any algorithm that trains on data in order to generate predictions. An estimator implements a method `fit(X, y)`, which accepts data points or features `X` to learn from, as well as labels `y` for these data points. Some estimators also support unsupervised learning: in this case, one can omit the labels `y`.

An estimator also implements a `predict(X)` method that accepts data points and returns predictions.
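
To illustrate the unsupervised case mentioned above, here is a minimal sketch using `sklearn.cluster.KMeans`. The choice of `KMeans` and the toy values are assumptions made for illustration only; they are not part of the Titanic workflow.

```python
import numpy
from sklearn.cluster import KMeans

# Two well-separated groups of one-dimensional points (made-up values)
X_toy = numpy.array([[1.0], [1.2], [0.8], [9.0], [9.3], [8.7]])

# Unsupervised estimator: fit() receives only the features X, no labels y
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# predict() assigns each data point to one of the two learned clusters
cluster_labels = kmeans.predict(X_toy)
print(cluster_labels)
```

The numeric cluster ids are arbitrary; what matters is that the first three points share one id and the last three share the other.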


**Examples**

- [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html): fits a simple linear model to the data to perform predictions

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
toy_data = pandas.DataFrame(
    {
        "x": numpy.linspace(0, 100, 100),
        "y": 2 * numpy.linspace(0, 100, 100) + numpy.random.normal(scale=25, size=100),
    }
)
toy_data["y_predict"] = LinearRegression()\
    .fit(
        X=toy_data[["x"]], 
        y=toy_data["y"]
    )\
    .predict(toy_data[["x"]])

ax = toy_data.plot(kind="scatter", x="x", y="y")
toy_data.plot(ax=ax, color="r", kind="scatter", x="x", y="y_predict")

**Transformer**

A `Transformer` implements a method `transform(X)` which converts a tabular dataset (`pandas.DataFrame`, `numpy.ndarray`) into another, and a method `fit(X)` to learn from data how to perform the transformation. This is typically used for preprocessing steps.
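
The separation between `fit` and `transform` matters because the transformation is learned once and can then be applied to other data. The following sketch, using made-up numbers, fits a `MinMaxScaler` on one array and applies it to another:

```python
import numpy
from sklearn.preprocessing import MinMaxScaler

# Made-up values: the scaler learns the range from this array only
X_fit_toy = numpy.array([[0.0], [5.0], [10.0]])
X_new_toy = numpy.array([[2.5], [20.0]])

# fit() learns the minimum (0) and maximum (10) of the fitted data
toy_scaler = MinMaxScaler().fit(X_fit_toy)

# transform() reuses the learned range, so values outside of it
# end up outside the interval [0, 1]
X_new_scaled = toy_scaler.transform(X_new_toy)
print(X_new_scaled)  # [[0.25], [2.0]]
```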

**Examples**

- [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler): scales each feature to a given interval
- [sklearn.preprocessing.OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html): encodes categorical variables by assigning an integer label to each category
- [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html): performs **one-hot-encoding**, i.e. transforms a value of a categorical variable into a binary indicator vector
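
The `OrdinalEncoder` from the list above can be sketched as follows. The port codes are made-up toy values, chosen to resemble the `Embarked` column:

```python
import numpy
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column with three distinct values
ports_toy = numpy.array([["S"], ["C"], ["Q"], ["S"]])

# fit() collects the categories in sorted order: C, Q, S
ordinal_encoder = OrdinalEncoder().fit(ports_toy)

# transform() replaces each category by its position in that order
ports_encoded = ordinal_encoder.transform(ports_toy)
print(ordinal_encoder.categories_)
print(ports_encoded)  # [[2.], [0.], [1.], [2.]]
```

Note that the integer labels imply an order (C < Q < S) that may have no meaning for the data. This is one reason why one-hot-encoding is often preferred for unordered categories.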

In [None]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

In [None]:
numeric_attributes_scaled = pandas.DataFrame(
    MinMaxScaler(
        feature_range=(0,1)
    )\
        .fit(data[["Age", "Fare"]])\
        .transform(data[["Age", "Fare"]]),
    columns=["Age scaled", "Fare scaled"]
)


In [None]:
numeric_attributes_scaled.head()

In [None]:
categorial_attributes_onehot = pandas.DataFrame(
    # note: `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
    OneHotEncoder(sparse_output=False)\
        .fit(data[["Sex"]].dropna())\
        .transform(data[["Sex"]].dropna()),
)


In [None]:
categorial_attributes_onehot.head()

**Pipeline**

A `Pipeline` is a sequence of steps or stages, typically one or more `Transformer`s followed by a final `Estimator`. A `Pipeline` itself behaves like an `Estimator`: it implements `fit(X, y)` and `predict(X)`.

Pipelines make machine learning workflows - including the preprocessing steps - explicit as Python objects. Preprocessing steps can influence model performance, so putting them into the pipeline means they are fitted and applied together with the model. This also improves reproducibility.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
example_pipeline = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("regressor", LinearRegression())
    ]
)

In [None]:
pandas.Series(
    example_pipeline.fit(X=toy_data[["x"]], y=toy_data["y"]).predict(toy_data[["x"]])
).plot()
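
To make concrete what a `Pipeline` does internally, the sketch below compares a pipeline with the same steps applied by hand. The data is synthetic and the variable names are chosen for this illustration only:

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic noisy line, similar in spirit to the toy data above
rng = numpy.random.default_rng(0)
X_demo = numpy.linspace(0, 100, 50).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + rng.normal(scale=25, size=50)

# Variant 1: scaler and regressor combined in a pipeline
pipeline_pred = Pipeline(
    steps=[("scaler", MinMaxScaler()), ("regressor", LinearRegression())]
).fit(X_demo, y_demo).predict(X_demo)

# Variant 2: the same two steps applied manually
demo_scaler = MinMaxScaler().fit(X_demo)
X_demo_scaled = demo_scaler.transform(X_demo)
manual_pred = LinearRegression().fit(X_demo_scaled, y_demo).predict(X_demo_scaled)

# Both variants should yield the same predictions
predictions_match = numpy.allclose(pipeline_pred, manual_pred)
print(predictions_match)
```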

## Data Preprocessing

We want to train a classifier that predicts the target variable `Survived` - whether the passenger survived the Titanic disaster - depending on the input columns `Age`, `Fare`, `Sex` and `Embarked`. `Age` and `Fare` contain numeric values, `Sex` and `Embarked` contain categorical values in the form of strings.

In [None]:
# select the columns used in this example
data = data[["Survived", "Age", "Fare", "Sex", "Embarked"]]

We note that there are a few missing values in some of the columns:

In [None]:
for col in data.columns:
    # count the missing values per column
    print(col, " : ", data[col].isna().sum())

### Dealing with Missing Values

There are several strategies to deal with missing values in machine learning, including replacement with suitable **imputed values**, or simply dropping the affected rows.

We notice that the age is missing for a significant fraction of passengers. Since we don't want to throw away so many passenger records, we will try to estimate a replacement value. On the other hand, throwing away the few rows where the `Embarked` info is missing will probably not make a difference.

In [None]:
data = data[~data["Embarked"].isnull()]

Ultimately, choosing the right strategy for replacing missing values depends on domain knowledge. A simple and common strategy is to replace each missing value with the mean of the observed values in the same column.

In [None]:
# `distplot` is deprecated in recent seaborn versions; `histplot` is its replacement
seaborn.histplot(data["Age"].dropna(), kde=True)

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
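# `distplot` is deprecated in recent seaborn versions; `histplot` is its replacement
seaborn.histplot(
    SimpleImputer(strategy="mean")\
        .fit(data[["Age"]])\
        .transform(data[["Age"]])\
        .ravel(),  # flatten the (n, 1) array to one dimension for plotting
    kde=True
)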
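
To see what the imputer does on individual values, here is a small sketch on a made-up age column. The `"median"` strategy is shown as one possible alternative and is not used elsewhere in this notebook:

```python
import numpy
from sklearn.impute import SimpleImputer

# Made-up ages with one missing value
ages_toy = numpy.array([[20.0], [numpy.nan], [30.0], [70.0]])

# fit() computes the replacement value from the observed entries
mean_imputer = SimpleImputer(strategy="mean").fit(ages_toy)
median_imputer = SimpleImputer(strategy="median").fit(ages_toy)
print(mean_imputer.statistics_)    # [40.]
print(median_imputer.statistics_)  # [30.]

# transform() fills the gap with the learned value
ages_mean_imputed = mean_imputer.transform(ages_toy)
ages_median_imputed = median_imputer.transform(ages_toy)
```

The median is less sensitive to outliers such as the single high age in this toy column, which is why the two replacement values differ.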

### Encoding Categorical Attributes

Categorical attributes in the form of strings, such as `Embarked`, need to be encoded numerically before a machine learning algorithm can process them. Among the different strategies available for this task, a common one is **one-hot-encoding**: the categorical value is replaced by a binary indicator vector.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# note: `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False).fit(data[["Sex"]])

In [None]:
encoder.categories_

In [None]:
pandas.DataFrame(
    encoder.transform(data[["Sex"]]),
    columns=encoder.categories_[0]  # categories of the single encoded column
).head()


## Training the Classifier

We can now go on to the training phase in which a machine learning algorithm ingests the training data to build a predictive model - here, a classifier that predicts yes or no for survival.

For this **supervised learning problem**, we split the data into a **feature matrix** $X$ and a **label vector** $y$.

In [None]:
X, y = data[data.columns.difference(["Survived"])], data["Survived"]

In [None]:
X.head()

In [None]:
y.head()

Many types of classification algorithms exist, each with its own strengths and weaknesses, whose discussion goes beyond the scope of this example. A simple choice is building a single **decision tree**:

In [None]:
from sklearn.tree import DecisionTreeClassifier

The classifier is an `Estimator` that expects a numeric feature matrix and labels as inputs to `fit`, and then features as input to `predict`:

In [None]:
data_simple = data[["Age", "Fare", "Survived"]].dropna()
X_simple, y_simple = data_simple[data_simple.columns.difference(["Survived"])], data_simple["Survived"]

In [None]:
y_pred = DecisionTreeClassifier()\
    .fit(X_simple, y_simple)\
    .predict(X_simple)

In [None]:
y_pred[:10]

## Evaluation

As discussed in **[📓 ML for Classification](ml-classification-intro.ipynb)**, we can use a **train-test-split** and the **precision** and **recall** metrics for evaluation.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
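
The split above is random, so results vary from run to run. A possible refinement, sketched here on made-up data, is to fix `random_state` for reproducibility and to pass `stratify` so that both parts keep the class proportions. Whether this is appropriate for a given analysis is a judgment call:

```python
import numpy
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features, balanced binary labels
X_split_toy = numpy.arange(20).reshape(10, 2)
y_split_toy = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_split_toy,
    y_split_toy,
    test_size=0.2,         # 20% of the samples go to the test set
    random_state=0,        # fixed seed: the same split on every run
    stratify=y_split_toy,  # keep the class proportions in both parts
)
print(len(X_tr), len(X_te), sorted(y_te))
```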

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
from sklearn.dummy import DummyClassifier

# TODO: replace this with your ML pipeline
pipeline = Pipeline(
    [
        ("dummy", DummyClassifier())
    ]
)

In [None]:
y_pred = pipeline.fit(X_train, y_train).predict(X_test)

In [None]:
precision_score(y_test, y_pred)

In [None]:
recall_score(y_test, y_pred)
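
As a reminder of what the two metrics measure: precision is $TP / (TP + FP)$ and recall is $TP / (TP + FN)$, where $TP$, $FP$ and $FN$ count the true positives, false positives and false negatives. A small worked example with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 actual positives, 2 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0]

# TP = 2 (first two samples), FP = 1 (fifth sample), FN = 2 (third and fourth)
toy_precision = precision_score(y_true_toy, y_pred_toy)  # 2 / (2 + 1)
toy_recall = recall_score(y_true_toy, y_pred_toy)        # 2 / (2 + 2)
print(toy_precision, toy_recall)
```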

## Exercise: Assembling the Full Pipeline

Let's revisit the model training workflow and implement it again, ideally as a single `Pipeline` that starts from the raw data and outputs a trained model.

In [None]:
# Your turn - implement the model training 




**Hint: Consider using the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to apply different preprocessing steps to different columns**
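
The following sketch shows the mechanics of `ColumnTransformer` on a made-up table. The column names `height` and `color` are hypothetical and deliberately unrelated to the Titanic data, so the exercise itself remains open:

```python
import numpy
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up table with one numeric and one categorical column
shapes_toy = pandas.DataFrame(
    {
        "height": [150.0, numpy.nan, 170.0, 190.0],
        "color": ["red", "blue", "red", "green"],
    }
)

preprocessor = ColumnTransformer(
    transformers=[
        # numeric column: impute missing values, then scale to [0, 1]
        (
            "numeric",
            Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]),
            ["height"],
        ),
        # categorical column: one-hot-encode
        ("categorical", OneHotEncoder(), ["color"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: scaled height, then indicators for blue, green, red
toy_features = preprocessor.fit_transform(shapes_toy)
print(toy_features)
```

A preprocessor like this can itself be used as a step in a `Pipeline`, placed before the final classifier.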

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_