# Model Selection with PySpark ML

This notebook walks through an example workflow for model selection with PySpark ML. The goal is to evaluate several model engineering choices (e.g. the choice of algorithm and of hyperparameters).

## Preamble

In [None]:
import findspark
findspark.init()
import pyspark

## Example Dataset

In [None]:
data_path = "../.assets/data/titanic/titanic.csv"

In [None]:
spark = pyspark.sql.SparkSession \
    .builder \
    .appName("TitanicClassifier") \
    .getOrCreate()


In [None]:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

schema = StructType([
             StructField('PassengerId', StringType()),
             StructField('Survived', IntegerType()),
             StructField('Pclass', IntegerType()),
             StructField('Name', StringType()),
             StructField('Sex', StringType()),
             StructField('Age', DoubleType()),  # ages in the Titanic data can be fractional (e.g. 0.42)
             StructField('SibSp', IntegerType()),
             StructField('Parch', IntegerType()),
             StructField('Ticket', StringType()),
             StructField('Fare', DoubleType()),
             StructField('Cabin', StringType()),
             StructField('Embarked', StringType())
        ])


In [None]:
data = spark.read.csv(data_path, header=True, schema=schema)


In [None]:
data.show(5)

## Tuning Hyperparameters with Parameter Search

Even a relatively simple ML algorithm, such as the *decision tree algorithm*, has a large number of parameters that we set before it sees the training data. These are its *hyperparameters*, and all of them can potentially influence the performance of the learned model. Which ones to tweak is a matter of understanding both the algorithm and the data.

Remembering the section on **model complexity**, we conclude that the **depth of a decision tree** (i.e. the maximum number of steps from the root to a leaf) is an important parameter: the shallower the tree, the fewer criteria it can check before arriving at a prediction, which risks _underfitting_. On the other hand, the deeper the tree, the higher the risk of _overfitting_.
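
To get a feeling for how quickly model complexity grows with depth, the following minimal sketch computes an upper bound on the number of leaves. It assumes binary splits (which is how Spark's decision trees partition the data); the helper name `max_leaves` is made up for this illustration.

```python
def max_leaves(depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree of the given depth."""
    # Every additional level can at most double the number of leaves.
    return 2 ** depth


for depth in (0, 1, 2, 5, 10):
    print(f"depth={depth:>2}  max leaves={max_leaves(depth)}")
```

A tree of depth 0 is a single leaf that always predicts the same class, while a tree of depth 10 can in principle distinguish more than a thousand regions of the feature space.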



![](https://upload.wikimedia.org/wikipedia/commons/f/ff/Decision_tree_model.png)

_The **depth** of a decision tree is an important hyperparameter for a decision tree learning algorithm. Here, the depth is 2._  [_Source_](https://commons.wikimedia.org/wiki/File:Decision_tree_model.png)

Let's have a look at the parameters of the `DecisionTreeClassifier` provided in `pyspark.ml`:

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

In [None]:
DecisionTreeClassifier?

The maximum depth that the decision tree can grow to is controlled by the `maxDepth` constructor parameter. What is the best choice for this parameter?


There is only one way to really know the optimal depth: **experiment with different parameter values and measure performance**. Fortunately, PySpark ML has [**helpful tools**](https://spark.apache.org/docs/latest/ml-tuning.html) that make this possible in a few lines of code.

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

The `ParamGridBuilder` class builds a grid of parameter values to search for an optimum. Each point in the grid is one parameter combination, and the model is evaluated once for each of them.
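
Conceptually, such a grid is the Cartesian product of the value lists given for each parameter. The plain-Python sketch below illustrates the idea without Spark; `build_grid` is a hypothetical helper, and the values for `maxDepth` and `maxBins` are arbitrary examples, not recommendations.

```python
from itertools import product


def build_grid(param_values):
    """Expand {name: [values]} into a list of {name: value} dicts, one per grid point."""
    names = list(param_values)
    value_lists = [param_values[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Two parameters with 3 and 2 candidate values give 3 * 2 = 6 grid points.
grid = build_grid({"maxDepth": [2, 4, 6], "maxBins": [16, 32]})
print(len(grid))
for point in grid:
    print(point)
```

Because the number of grid points is the product of the list lengths, every additional parameter multiplies the number of models that have to be evaluated.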

In [None]:
dt_classifier = DecisionTreeClassifier()

In [None]:
param_grid = ParamGridBuilder()\
    .addGrid(dt_classifier.maxDepth, list(range(10)))\
    .build()  # build() returns the list of parameter maps that CrossValidator expects

We can now pass this parameter grid to classes built for model evaluation, such as the `CrossValidator`, which performs _$k$-fold cross-validation_. (See [this notebook](../ml/ml-classification-intro.ipynb) to recap the idea behind cross-validation.)
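
As a quick reminder of the mechanics, here is a minimal plain-Python sketch of how $k$-fold cross-validation partitions row indices. It is a simplification: rows are assigned to folds round-robin here, whereas Spark's `CrossValidator` assigns them randomly. The function name `k_fold_indices` is made up for this illustration.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        # Every k-th row, starting at `fold`, is held out for validation.
        validation = indices[fold::k]
        held_out = set(validation)
        # All remaining rows are used for training in this round.
        train = [i for i in indices if i not in held_out]
        yield train, validation


folds = list(k_fold_indices(n_rows=9, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each row ends up in the validation set exactly once and in the training set $k - 1$ times.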

In [None]:
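from pyspark.ml.evaluation import BinaryClassificationEvaluator

cv = CrossValidator(
    estimator=dt_classifier,
    estimatorParamMaps=param_grid,
    # An evaluator is needed for fit(); this one uses area under ROC by default.
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,
)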
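
The selection logic behind such a search can be sketched in plain Python: score every grid point on every fold, average the scores per grid point, and keep the point with the best average. Everything below is illustrative. `select_best` is a hypothetical helper, and `toy_score` is a made-up stand-in for "train on $k - 1$ folds, evaluate on the held-out fold"; it does not represent real measurements. The sketch also assumes that a larger metric value is better.

```python
from statistics import mean


def select_best(grid, n_folds, fit_and_score):
    """Return (best_point, mean_scores) where mean_scores[i] is the average score of grid[i]."""
    mean_scores = [
        mean(fit_and_score(point, fold) for fold in range(n_folds))
        for point in grid
    ]
    # Pick the grid point with the highest average score across folds.
    best_index = max(range(len(grid)), key=mean_scores.__getitem__)
    return grid[best_index], mean_scores


def toy_score(point, fold):
    # Made-up score that peaks at maxDepth == 4; NOT a real evaluation result.
    return 1.0 - 0.01 * (point["maxDepth"] - 4) ** 2 - 0.001 * fold


grid = [{"maxDepth": depth} for depth in range(10)]
best, scores = select_best(grid, n_folds=3, fit_and_score=toy_score)
print("best grid point:", best)
print("model fits needed:", len(grid) * 3)
```

With 10 candidate depths and 3 folds, the search trains 30 models. `CrossValidator` then refits the estimator once more on the entire dataset using the best parameter map, so the cost of a search grows quickly with the size of the grid.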

### Exercise: Decision Tree Depth Tuning on Titanic Data

**Build a simple survival classification model on the Titanic data set, and use the `CrossValidator` to determine the optimal `maxDepth` for the `DecisionTreeClassifier`!**

In [None]:
# Your code here

### Exercise: Algorithm Search

**Rather than tuning the parameters of one algorithm, we can also use the search tools to try out different types of algorithms. This can be done using a `Pipeline`. For this, we treat the name of a pipeline stage as a parameter. Try it out!**

In [None]:
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression, NaiveBayes

In [None]:
classifiers = {
    "Decision Tree": DecisionTreeClassifier,
    "Random Forest": RandomForestClassifier,
    "Gradient-boosted Trees": GBTClassifier,
    "Logistic Regression": LogisticRegression,
    "Naive Bayes": NaiveBayes
}

In [None]:
# Your code here

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_