# Pipeline

A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
When Pipeline.fit() is called, the stages are executed in order. 

If a stage is an Estimator, its Estimator.fit() method is called on the input dataset to fit a model.
The fitted model, which is a Transformer, is then used to transform the dataset into the input for the next stage.

If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage.

The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages.

If stages is an empty list, the pipeline acts as an identity transformer.
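
The fitting procedure described above can be sketched in plain Python. This is only an illustration of the control flow, not Spark's implementation: `ToyTransformer`, `ToyMeanCenterer`, `fit_pipeline`, and `transform_pipeline` are hypothetical names invented here, and a "dataset" is just a list of floats.

```python
class ToyTransformer:
    """Hypothetical stand-in for a Transformer: maps a dataset to a new dataset."""

    def __init__(self, fn):
        self.fn = fn

    def transform(self, rows):
        return [self.fn(r) for r in rows]


class ToyMeanCenterer:
    """Hypothetical stand-in for an Estimator: fit() learns the mean and
    returns a Transformer that subtracts it."""

    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return ToyTransformer(lambda x: x - mean)


def fit_pipeline(stages, rows):
    """Run the stages in order and return the fitted transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            model = stage.fit(rows)   # Estimator: fit a model on the current data
        else:
            model = stage             # Transformer: used as-is
        fitted.append(model)
        rows = model.transform(rows)  # the output becomes the next stage's input
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted transformer in order (the role of a PipelineModel)."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


data = [1.0, 2.0, 3.0]
stages = [ToyTransformer(lambda x: x * 2), ToyMeanCenterer()]

fitted = fit_pipeline(stages, data)
result = transform_pipeline(fitted, data)
print(result)    # [-2.0, 0.0, 2.0]

# With no stages, the data passes through unchanged (identity transformer).
identity = transform_pipeline(fit_pipeline([], data), data)
print(identity)  # [1.0, 2.0, 3.0]
```

The mean-centering stage sees the doubled values `[2.0, 4.0, 6.0]`, not the raw input, which is why the order of the stages matters.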

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-ml-pipeline").master("local[*]").getOrCreate()

In [6]:
df = spark.createDataFrame([
    (1, 'CS', 'MS'),
    (2, 'MATH', 'PHD'),
    (3, 'MATH', 'MS'),
    (4, 'CS', 'MS'),
    (5, 'CS', 'PHD'),
    (6, 'ECON', 'BS'),
    (7, 'ECON', 'BS'),
], ['id', 'dept', 'education'])

In [2]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

# Stages

In [3]:
# Stage-1 : index the string column `dept` as numeric labels
stage_1 = StringIndexer(inputCol='dept', outputCol='dept_index')

# Stage-2 : index the string column `education` as numeric labels
stage_2 = StringIndexer(inputCol='education', outputCol='education_index')

# Stage-3 : one-hot encode the numeric column `education_index`
stage_3 = OneHotEncoder(inputCols=['education_index'], outputCols=['education_OHE'])

# Set up the pipeline: glue the stages together


In [4]:
pipeline = Pipeline(stages=[stage_1, stage_2, stage_3])

# fit the pipeline model and transform the data as defined


In [7]:
pipeline_model = pipeline.fit(df)

# view the transformed data


In [8]:
final_df = pipeline_model.transform(df)
final_df.show(truncate=False)

+---+----+---------+----------+---------------+-------------+
|id |dept|education|dept_index|education_index|education_OHE|
+---+----+---------+----------+---------------+-------------+
|1  |CS  |MS       |0.0       |0.0            |(2,[0],[1.0])|
|2  |MATH|PHD      |2.0       |2.0            |(2,[],[])    |
|3  |MATH|MS       |2.0       |0.0            |(2,[0],[1.0])|
|4  |CS  |MS       |0.0       |0.0            |(2,[0],[1.0])|
|5  |CS  |PHD      |0.0       |2.0            |(2,[],[])    |
|6  |ECON|BS       |1.0       |1.0            |(2,[1],[1.0])|
|7  |ECON|BS       |1.0       |1.0            |(2,[1],[1.0])|
+---+----+---------+----------+---------------+-------------+
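
To see where the numbers in the `education_index` and `education_OHE` columns come from, here is a rough plain-Python sketch. It assumes labels are indexed by descending frequency with ties broken alphabetically, and that the encoder drops the last category so that it becomes an all-zero vector. Both assumptions are consistent with the output above, but `index_labels` and `one_hot_drop_last` are hypothetical helpers, not Spark APIs, and the vectors are shown as dense lists rather than in Spark's sparse notation.

```python
from collections import Counter


def index_labels(values):
    """Map each label to an index: most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Dense one-hot vector of length num_categories - 1; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vec[int(index)] = 1.0
    return vec


education = ['MS', 'PHD', 'MS', 'MS', 'PHD', 'BS', 'BS']

mapping = index_labels(education)
print(mapping)  # {'MS': 0.0, 'BS': 1.0, 'PHD': 2.0}

encoded = [one_hot_drop_last(mapping[e], len(mapping)) for e in education]
for label, vec in zip(education, encoded):
    print(label, mapping[label], vec)
```

Under these assumptions `PHD` gets index 2.0 and an all-zero vector, which corresponds to the `(2,[],[])` entries in the table: a vector of size 2 with no non-zero positions.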



# Binarizer
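Binarizing data means setting each feature value to 0 or 1 according to a threshold: values greater than the threshold become 1.0, and values equal to or less than the threshold become 0.0.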

In [9]:
from pyspark.ml.feature import Binarizer

raw_df = spark.createDataFrame([
    (1, 0.1),
    (2, 0.2),
    (3, 0.5),
    (4, 0.8),
    (5, 0.9),
    (6, 1.1)
], ["id", "feature"])

In [10]:
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")

In [13]:
print("Binarizer output with Threshold = %f" % binarizer.getThreshold())

Binarizer output with Threshold = 0.500000


In [None]:
binarized_df = binarizer.transform(raw_df)

In [14]:
binarized_df.show(truncate=False)

+---+-------+-----------------+
|id |feature|binarized_feature|
+---+-------+-----------------+
|1  |0.1    |0.0              |
|2  |0.2    |0.0              |
|3  |0.5    |0.0              |
|4  |0.8    |1.0              |
|5  |0.9    |1.0              |
|6  |1.1    |1.0              |
+---+-------+-----------------+
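
The thresholding rule can be checked with a few lines of plain Python. This is a minimal sketch of the comparison only, using a hypothetical `binarize` helper rather than the Spark `Binarizer`; it assumes a strict "greater than" comparison, which matches the row where `feature` is 0.5 above.

```python
def binarize(values, threshold):
    """Return 1.0 for values strictly greater than the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


features = [0.1, 0.2, 0.5, 0.8, 0.9, 1.1]
binarized = binarize(features, threshold=0.5)
print(binarized)  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Note that the value 0.5 is equal to the threshold, so it maps to 0.0.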

