# A Simple Machine Learning Pipeline with Spark

In this example we are going to build a simple predictive model using machine learning. We are going to revisit the Titanic passenger list data set, and use it to train a classifier that tries to determine whether a passenger survived the disaster, based on the person's attributes in the passenger list. This is obviously an educational example using small data, but a similar sequence of steps can be applied to solve real-world predictive analytics tasks on large amounts of distributed data. 

## Preamble

In [1]:
import findspark
findspark.init()
import pyspark

## Loading the Data

In [2]:
data_path = "../.assets/data/titanic/titanic.csv"

In [3]:
!head {data_path}

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S


... and always keep the documentation close for reference:

In [4]:
!cat ../.assets/data/titanic/titanic-documentation.txt

Data Dictionary

Variable	Definition	Key
survival 	Survival 	0 = No, 1 = Yes
pclass 	    Ticket class 	1 = 1st, 2 = 2nd, 3 = 3rd
sex 	    Sex 	
Age 	    Age in years 	
sibsp 	    # of siblings / spouses aboard the Titanic 	
parch 	    # of parents / children aboard the Titanic 	
ticket 	    Ticket number 	
fare 	    Passenger fare 	
cabin 	    Cabin number 	
embarked 	Port of Embarkation 	C = Cherbourg, Q = Queenstown, S = Southampton


Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for t



After creating a `SparkSession`, we read the contents of the .csv file into a DataFrame. For that we also need to define its schema.

In [5]:
spark = pyspark.sql.SparkSession \
    .builder \
    .appName("TitanicClassifier") \
    .getOrCreate()




In [7]:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

schema = StructType([
             StructField('PassengerId', StringType()),
             StructField('Survived', IntegerType()),
             StructField('Pclass', IntegerType()),
             StructField('Name', StringType()),
             StructField('Sex', StringType()),
             StructField('Age', IntegerType()),
             StructField('SibSp', IntegerType()),
             StructField('Parch', IntegerType()),
             StructField('Ticket', StringType()),
             StructField('Fare', DoubleType()),
             StructField('Cabin', StringType()),
             StructField('Embarked', StringType())
        ])


In [8]:
data = spark.read.csv(data_path, header=True, schema=schema)


In [9]:
data.show(5)

+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex|Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male| 22|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female| 38|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female| 26|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female| 35|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male| 35|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+---+-----+-----+----------------+-------+-----+--------+


## Machine Learning Building Blocks

A machine learning pipeline is a sequence of processing steps or stages that leads from the raw data to the desired result, e.g. a trained model or a prediction. The [`pyspark.ml` module](https://spark.apache.org/docs/latest/ml-pipeline.html) provides an API to map this concept to code.

In [None]:
from pyspark.ml import Transformer, Estimator, Pipeline, PipelineModel

**Transformer**

A `Transformer` implements a method `transform()` which converts one DataFrame into another, generally by appending one or more columns. That could mean extracting features from a dataset, or performing prediction based on the given data.

**Estimator**

An `Estimator` is a learning algorithm, any algorithm that fits or trains on data. An Estimator implements a method `fit()`, which accepts a DataFrame and produces a `Model`, which is also a `Transformer`.


**Pipeline**

A `Pipeline` is a sequence of `PipelineStage`s, each of which is either a `Transformer` or an `Estimator`. A `Pipeline` itself behaves like an `Estimator`: calling its `fit()` method outputs a `PipelineModel`. When you call `fit()` on a pipeline, the stages are run in order. For each `Estimator` stage, its `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`), and that Transformer’s `transform()` method is then called on the DataFrame to produce the input for the next stage. For each `Transformer` stage, `transform()` is called directly.
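
The control flow of `fit()` can be mimicked in a few lines of plain Python. This is only an analogy that runs without Spark: all class names below (`AddOne`, `MeanCenterer`, `ToyPipeline`, ...) are hypothetical, lists of numbers stand in for DataFrames, and the toy version transforms the data after every stage for simplicity.

```python
# Plain-Python analogy of Transformer / Estimator / Pipeline (not the pyspark.ml API).

class AddOne:
    """Transformer-like stage: only offers transform()."""
    def transform(self, rows):
        return [x + 1 for x in rows]


class MeanCenterer:
    """Estimator-like stage: fit() learns the mean and returns a fitted model."""
    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


class MeanCenterModel:
    """Fitted model: a transformer that applies what was learned in fit()."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted transformer
            fitted.append(stage)
            rows = stage.transform(rows)  # output feeds the next stage
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Holds only transformers; transform() applies them in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = ToyPipeline([AddOne(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
result = model.transform([1.0, 2.0, 3.0])
print(result)
```

Note that the fitted model contains a `MeanCenterModel` in place of the `MeanCenterer`: the estimator has been replaced by the transformer it produced.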

## Data Preprocessing

We want to train a classifier that predicts the target variable `Survived` - whether the passenger survived the Titanic disaster - depending on the input columns `Age`, `Fare`, `Sex` and `Embarked`. `Age` and `Fare` contain numeric values, `Sex` and `Embarked` contain categorical values in the form of strings.

In [None]:
# select the columns used in this example
data = data[["PassengerId", "Survived", "Age", "Fare", "Sex", "Embarked"]]

We note that there are a few missing values in some of the columns:

In [None]:
for col in data.columns:
    print(col, " : ", data.filter(f"{col} is NULL").count())

There are several strategies to deal with missing values in machine learning, including replacement with dummy values, but for simplicity we drop every row that contains a missing value. There are multiple ways of dropping these rows from the DataFrame. We would like to do this as a stage in a `Pipeline`, which gives us the chance to learn how to write our own custom `Transformer`s.
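
To make the difference between the two strategies concrete, here is a small plain-Python sketch that runs without Spark. The three rows reuse `Age` and `Fare` values from the file preview above. The helper names `drop_missing` and `fill_with_mean` are hypothetical, and the mean is just one possible choice of replacement value.

```python
# Three passengers from the preview; passenger 6 has no recorded age.
rows = [
    {"PassengerId": "1", "Age": 22, "Fare": 7.25},
    {"PassengerId": "6", "Age": None, "Fare": 8.4583},
    {"PassengerId": "2", "Age": 38, "Fare": 71.2833},
]


def drop_missing(rows, cols):
    """Keep only rows in which none of the given columns is missing."""
    return [r for r in rows if all(r[c] is not None for c in cols)]


def fill_with_mean(rows, col):
    """Replace missing values in one column by the mean of the known values."""
    known = [r[col] for r in rows if r[col] is not None]
    mean = sum(known) / len(known)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]


dropped = drop_missing(rows, ["Age", "Fare"])
imputed = fill_with_mean(rows, "Age")
print(len(dropped), imputed[1]["Age"])
```

Dropping loses one of the three rows, while filling keeps all rows at the price of an invented age for passenger 6.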

### Writing Custom Transformers for Data Cleanup

To drop rows with missing values as part of a pipeline, we write a custom transformer that performs this step. We need to subclass the `Transformer` class, and also implement a few expected attributes. (For this simple example, we don't need them to function, they just need to be there, so we set them to constant dummy values.)

In [None]:
class NaDropper(Transformer):
    """
    Drops rows that have a null or NaN value in any of the given columns
    """

    # lazy workaround - a transformer needs to have these attributes
    # TODO: replace if needed
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()
    uid = 0

    def __init__(self, cols=None):
        # columns to check for missing values (None means all columns)
        self.cols = cols

    def _transform(self, data):
        # dropna removes every row with a null or NaN in one of the subset columns
        return data.dropna(subset=self.cols)

    def __repr__(self):
        """Show a proper string representation when printing the pipeline stage"""
        return str(type(self))


We test our transformer by using it as a stage in a pipeline, which we first `fit` to the data and then use to `transform` the data.

In [None]:
prepro_pipeline = Pipeline(stages=[NaDropper(cols=data.columns)])

In [None]:
data_clean = prepro_pipeline.fit(data).transform(data)

In [None]:
for col in data_clean.columns:
    print(col, " : ", data_clean.filter(f"{col} is NULL").count())

**Exercise: Using SQLTransformer**

As so often, there is more than one way to perform a task such as dropping missing values. The `SQLTransformer` executes arbitrary SQL statements on the DataFrame. Try applying it.

In [None]:
from pyspark.ml.feature import SQLTransformer
# Your turn: Use SQLTransformer to drop rows with missing values 



### Encoding Categorical Attributes

Categorical attributes in the form of strings, such as `Embarked`, need to be encoded numerically before the machine learning algorithm can read them. Among the different strategies available for this task, one of the simplest is to assign a numeric index to each categorical value. This is what the `StringIndexer` estimator does.

In [None]:
from pyspark.ml.feature import StringIndexer

In [None]:
enc_stages = []
enc_stages.append(StringIndexer(inputCol="Embarked", outputCol="Embarked_encoded"))
enc_stages.append(StringIndexer(inputCol="Sex", outputCol="Sex_encoded"))

In [None]:
data_encoded = Pipeline(stages=enc_stages).fit(data_clean).transform(data_clean)

In [None]:
data_encoded.show()
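
The index assignment can be pictured with a plain-Python sketch that runs without Spark. It assumes the default ordering of `StringIndexer`, in which the most frequent value receives index 0. The alphabetical tie-breaking, the helper name `fit_string_index` and the short list of port codes are assumptions made for illustration only.

```python
from collections import Counter


def fit_string_index(values):
    """Learn a value -> index mapping, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumed here).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Made-up sample of port codes, not the real column.
embarked = ["S", "C", "S", "S", "Q", "C"]

mapping = fit_string_index(embarked)        # the "fit" step
encoded = [mapping[v] for v in embarked]    # the "transform" step
print(mapping)
print(encoded)
```

The split into a learning step and an application step is why `StringIndexer` is an `Estimator` and not a plain `Transformer`: the mapping has to be learned from the data first.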

## Training the Classifier

We can now go on to the training phase in which a machine learning algorithm ingests the training data to build a predictive model - here, a classifier that predicts yes or no for survival.

Many types of classification algorithms exist, each with their own strengths and weaknesses, whose discussion goes beyond the scope of this example. A simple choice is building a single **decision tree**:

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

By default, a classifier expects the target column to be named `label`, so we are going to rename `Survived` accordingly:

In [None]:
data_encoded = data_encoded.withColumnRenamed("Survived", "label")

In order to evaluate classifier performance in a reliable way, we need to split our available data into a training and a test set. The latter is put aside and will be used for evaluation after training.

In [None]:
data_encoded.show()

In [None]:
splitRatio = 0.8
data_training, data_test = data_encoded.randomSplit([splitRatio, 1-splitRatio])
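
Two properties of such a random split are worth keeping in mind: the resulting sizes only approximate the requested ratio, and the result changes from run to run unless a seed is fixed (`randomSplit` accepts an optional `seed` argument for this). The plain-Python sketch below illustrates the idea by assigning each row independently according to a random draw. It is a conceptual picture under that assumption, not Spark's actual implementation, and the function name `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability ~ weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
train_again, test_again = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))
```

With the same seed the two calls produce identical splits, which makes evaluation results reproducible.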

In [None]:
data_training.show()

In [None]:
data_test.show()

**Assembling Features & Training**

In `pyspark.ml`, the learning algorithm expects all features used for training to be combined into a single column of **feature vectors**:

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
assemble_features = VectorAssembler(inputCols=["Age", "Fare", "Sex_encoded", "Embarked_encoded"], 
                                    outputCol="features")
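
As a rough plain-Python picture of what the assembler does to a single row: it reads the input columns in the given order and packs their values into one vector. The `Age` and `Fare` values below are those of passenger 1. The two encoded values are hypothetical, since they depend on the fitted indexers, and the helper `assemble` is not part of any library.

```python
input_cols = ["Age", "Fare", "Sex_encoded", "Embarked_encoded"]

# One row as a dict; the encoded values are placeholders for illustration.
row = {
    "PassengerId": "1",
    "Age": 22,
    "Fare": 7.25,
    "Sex_encoded": 0.0,
    "Embarked_encoded": 0.0,
}


def assemble(row, input_cols):
    """Collect the selected columns, in order, into one list of floats."""
    return [float(row[col]) for col in input_cols]


features = assemble(row, input_cols)
print(features)
```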

The last stage of the training is the ML algorithm itself. After this, we can trigger the training by calling `fit`.

In [None]:
classifier = DecisionTreeClassifier()

In [None]:
training_stages = [assemble_features, classifier]

In [None]:
model = Pipeline(stages=training_stages).fit(data_training)

This yields a fitted model. In order to perform classification, we call the `transform` method of the model:

In [None]:
predictions = model.transform(data_test)

In [None]:
predictions

The transformation has added three new columns to the DataFrame:

- prediction: the predicted label
- rawPrediction: the direct output of the classification algorithm - interpretation may vary among algorithms
- probability: the probability of each label
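
How the three columns relate can be sketched in plain Python. The sketch assumes that, for a decision tree, `rawPrediction` holds the number of training rows of each class that ended up in the leaf a passenger falls into. Please treat this as an assumption to check against the Spark documentation. The counts below are made up.

```python
def normalize_counts(raw_prediction):
    """Turn per-class counts into per-class probabilities."""
    total = sum(raw_prediction)
    return [count / total for count in raw_prediction]


# Hypothetical leaf: 30 training rows with label 0, 10 rows with label 1.
raw_prediction = [30.0, 10.0]

probability = normalize_counts(raw_prediction)
# The predicted label is the index of the most probable class.
prediction = float(max(range(len(probability)), key=probability.__getitem__))
print(probability, prediction)
```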

In [None]:
predictions[["PassengerId", "prediction", "probability", "rawPrediction"]].show()

## Evaluation

`pyspark.mllib.evaluation.MulticlassMetrics` implements a number of standard metrics to evaluate the performance of a classifier.

In [None]:
import pandas
from pyspark.mllib.evaluation import MulticlassMetrics

In [None]:
# MulticlassMetrics expects label to be of type double
predictions = predictions.withColumn("label", predictions["label"].cast("double"))

# MulticlassMetrics expects an RDD of (prediction, label) pairs, in that order
mm = MulticlassMetrics(predictions.select(["prediction", "label"]).rdd)
labels = sorted(predictions.select("prediction").rdd.distinct().map(lambda r: r[0]).collect())

metrics = pandas.DataFrame(
    [(label, mm.precision(label=label), mm.recall(label=label), mm.fMeasure(label=label)) for label in labels],
    columns=["label", "Precision", "Recall", "F1"])

In [None]:
metrics
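
For reference, the three per-label metrics can be computed by hand from (prediction, label) pairs. The following plain-Python sketch uses the standard textbook definitions. The function name `per_label_metrics` and the six example pairs are made up for illustration and are not results from the model above.

```python
def per_label_metrics(pairs, label):
    """Precision, recall and F1 for one label from (prediction, label) pairs."""
    tp = sum(1 for pred, true in pairs if pred == label and true == label)
    fp = sum(1 for pred, true in pairs if pred == label and true != label)
    fn = sum(1 for pred, true in pairs if pred != label and true == label)

    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted hits are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many true cases were found
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up (prediction, label) pairs.
pairs = [(1.0, 1.0), (1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

precision, recall, f1 = per_label_metrics(pairs, label=1.0)
print(precision, recall, f1)
```

Because precision and recall differ only in whether false positives or false negatives enter the denominator, swapping the order of the pairs would silently swap the two metrics.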

## Exercise: Assembling the Full Pipeline

Let's revisit the model training workflow and implement it as a single `Pipeline` that starts from the raw data and outputs a trained model.

In [None]:
data_raw = spark.read.csv(data_path, header=True, schema=schema)
stages = []
# Your turn - implement the model training as a single Pipeline




---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_