# Building Custom ML Transformers & Estimators

We will see how to create and use custom transformers and estimators in this chapter. Because of how similar transformers and estimators are, we start with in-depth coverage of the transformer and its fundamental building block, the Param. We will also see how to integrate custom transformers and estimators in an ML pipeline.

### Creating your own transformer

We implement a `ScalarNAFiller` transformer that fills the `null` values of a column with a scalar value, rather than with the `mean` or `median` as the `Imputer` does. With it, our dessert pipeline from chapter 13 gains a `ScalarNAFiller` stage that we can reconfigure when running different scenarios (when optimizing hyperparameters, for instance) without changing the code itself. This improves the flexibility and robustness of our ML experiments.

Our blueprint for this section follows this plan:
1. Design our transformer: Params, inputs, and outputs.
2. Create the Params, inheriting some preconfigured ones as necessary.
3. Create the necessary getters and setters.
4. Create the initialization function to instantiate our transformer.
5. Create the transformation function.

The PySpark Transformer class (`pyspark.ml.Transformer`) provides many of the methods we used in chapter 13, such as `explainParams()` and `copy()`, plus a handful of other methods that will prove useful for implementing our own transformers. By sub-classing `Transformer`, we inherit all of this functionality for free, like we do in the following listing.


In [1]:
from pyspark.ml import Transformer

class ScalarNAFiller(Transformer):
    pass

<img src="images/custom_scalarna_filter.png">

#### Designing a transformer: Thinking in terms of Params and transformation

In chapters 12 and 13, we saw that a transformer (and, by extension, an estimator) is configured through a collection of Params. The `transform()` function always takes a data frame as an input and returns a transformed data frame. We want to stay consistent with our design to avoid problems at use time.

When designing a custom transformer, we always start by implementing a function that reproduces the behavior of our transformer. For the `ScalarNAFiller`, we leverage the `fillna()` function. 

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import Column, DataFrame

spark = (
    SparkSession.builder.appName("Custom Transformers")
    .getOrCreate()
)

test_df = spark.createDataFrame(
    [
        [1, 2, 4, 1], 
        [3, 6, 5, 4], 
        [9, 4, None, 9], 
        [11, 17, None, 3]
    ],
    ["one", "two", "three", "four"],
)

def scalarNAFillerFunction(df: DataFrame, inputCol: Column, outputCol: str, filler: float = 0.0):
    return df.withColumn(outputCol, inputCol).fillna(
        filler, subset=outputCol
    )

scalarNAFillerFunction(test_df, F.col("three"), "five", -99.0).show()
# The nulls from column 'three' are replaced by -99, our filler value, in column 'five'


+---+---+-----+----+----+
|one|two|three|four|five|
+---+---+-----+----+----+
|  1|  2|    4|   1|   4|
|  3|  6|    5|   4|   5|
|  9|  4| null|   9| -99|
| 11| 17| null|   3| -99|
+---+---+-----+----+----+



We immediately see that we need three Params in our `ScalarNAFiller`:
- `inputCol` and `outputCol` are for the input and output columns, following the same behavior as the other transformers and estimators we’ve encountered thus far.
- `filler` contains a floating-point number for the value that `null` will be replaced with during the `transform()` method.

The data frame (`df` in the code above) is passed as an argument to the `transform()` method. Mapped onto the transformer blueprint introduced in chapter 13, it looks like the figure below.

<img src="images/scalarna_filler_blueprint.png">

#### Creating the Params of a transformer

In this section, we create the three Params (`inputCol`, `outputCol`, `filler`) for the `ScalarNAFiller` transformer. We learn how to define a Param from scratch that will play well with other Params. Params drive the behavior of the transformer and estimator, and allow for easy customization when running a pipeline.

First, we create a custom Param for our filling value, `filler`. To create a custom Param, PySpark provides a `Param` class with four attributes:
- A `parent`, which carries the value of the transformer once the transformer is instantiated.
- A `name`, which is the name of our Param. By convention, we set it to the same name as the variable that holds the Param.
- A `doc`, which is the documentation of our Param. This allows us to embed documentation that users can read when the transformer is used.
- A `typeConverter`, which governs the type of the Param. This provides a standardized way to convert an input value to the right type. It also gives a relevant error message if, for example, you expect a floating-point number, but the user of the transformer provides a string.

Every custom Param we create needs to have `Params._dummy()` as a parent; this ensures that PySpark will be able to copy and change transformers' Params when you use or change them, for instance, during cross-validation (chapter 13). The name and doc are self-explanatory, so let’s spend a little more time on the `typeConverter`. 

Type converters are the way we instruct the Param about the type of value it should expect. Think of them like type annotations in Python, but with the option to try to convert the value. In the case of the filler, we want a floating-point number, so we use `TypeConverters.toFloat`.

In [3]:
from pyspark.ml.param import Param, Params, TypeConverters

filler = Param(
    Params._dummy(),
    "filler",
    "Value we want to replace our null values with.",
    typeConverter=TypeConverters.toFloat,
)

filler

Param(parent='undefined', name='filler', doc='Value we want to replace our null values with.')

Commonly used Params are defined in special classes called Mixins under the `pyspark.ml.param.shared` module. The `HasInputCols` Mixin, for example, is defined along these lines:

In [4]:
class HasInputCols(Params):
    """Mixin for param inputCols: input column names."""
    inputCols = Param(
        Params._dummy(),
        "inputCols", "input column names.",
        typeConverter=TypeConverters.toListString,
    )

    def __init__(self):
        super(HasInputCols, self).__init__()

    def getInputCols(self):
        """Gets the value of inputCols or its default value. """
        return self.getOrDefault(self.inputCols)

To use these accelerated Param definitions, we simply have to sub-class them in our transformer class definition. Our updated class definition now has all three Params defined: two of them through a Mixin (`inputCol`, `outputCol`), and one custom (`filler`).

In [5]:
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

class ScalarNAFiller(Transformer, HasInputCol, HasOutputCol):
    filler = Param(
        Params._dummy(),
        "filler",
        "Value we want to replace our null values with.",
        typeConverter=TypeConverters.toFloat,
    )

    pass

#### Getters and setters: Being a nice PySpark citizen

Based on the design of every PySpark transformer we have used so far, the simplest way to create setters is as follows: we first create a general method, `setParams()`, that allows us to change multiple parameters passed as keyword arguments. Then, the setter for any individual Param simply calls `setParams()` with the relevant keyword argument. The `setParams()` method is difficult to get right at first; it needs to accept any Params our transformer has and then update only those we are passing as arguments.


In [10]:
from pyspark import keyword_only

@keyword_only
def setParams(self, *, inputCol=None, outputCol=None, filler=None):
    kwargs = self._input_kwargs
    return self._set(**kwargs)

With `setParams()` out of the way, it’s time to create the individual setters. That couldn’t be easier: simply call `setParams()` with the appropriate argument! In the Mixin source code shown earlier, we saw that the getter is provided but the setter is not, because providing one would imply creating a generic `setParams()` that we’d override anyway.

In [6]:
def setFiller(self, new_filler):
    return self.setParams(filler=new_filler)

def setInputCol(self, new_inputCol):
    return self.setParams(inputCol=new_inputCol)

def setOutputCol(self, new_outputCol):
    return self.setParams(outputCol=new_outputCol)

The setters are done! Now it’s time for the getters. Unlike setters, getters for the Mixin Params are already provided, so we only have to create `getFiller()`. We also do not have to create a generic `getParams()`, since the `Transformer` class provides `explainParam()` and `explainParams()` instead.

In [7]:
def getFiller(self):
    return self.getOrDefault(self.filler)

Putting it all together, our class definition so far looks like this:

In [11]:
class ScalarNAFiller(Transformer, HasInputCol, HasOutputCol):
    filler = [...] # elided for terseness
    
    @keyword_only
    def setParams(self, *, inputCol=None, outputCol=None, filler=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)
        
    def setFiller(self, new_filler):
        return self.setParams(filler=new_filler)
    
    def getFiller(self):
        return self.getOrDefault(self.filler)
    
    def setInputCol(self, new_inputCol):
        return self.setParams(inputCol=new_inputCol)
    
    def setOutputCol(self, new_outputCol):
        return self.setParams(outputCol=new_outputCol)