# 6a. MLlib Introduction

MLlib is Spark's machine learning library. It provides tools such as:

* ML Algorithms
* Featurization: feature extraction, transformation, ...
* Pipelines
* Persistence: saving and loading algorithms
* Utilities: linear algebra, statistics, data handling

_Note: The RDD-based APIs are in maintenance mode. The DataFrame-based API is the primary API._

## Dependencies

MLlib uses linear algebra packages for optimised numerical processing. These packages may call native acceleration libraries if they are available on the system. However, native acceleration libraries cannot be distributed together with Spark, so they have to be installed separately. If they are not in use, a warning message is shown and a pure JVM implementation is used instead.

For Python, you just need NumPy >= 1.4.

---

# Basic Statistics

MLlib can compute basic (and more complex) statistics over large datasets.

## Correlation

`Correlation` provides the flexibility to calculate pairwise correlations among many series. It supports Pearson's and Spearman's correlation.

The output is the correlation matrix for the input Dataset of Vectors, computed using the specified method.

In [3]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName("mllib") \
  .getOrCreate()

data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]), ),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]), ),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]), ),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]), ),
        ]
df = spark.createDataFrame(data, ["features"])
r1 = Correlation.corr(df, 'features').head()
print(f"Pearson correlation matrix:\n {str(r1[0])}")

r2 = Correlation.corr(df, 'features', 'spearman').head()
print(f"Spearman correlation matrix:\n {str(r2[0])}")

Pearson correlation matrix:
 DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
Spearman correlation matrix:
 DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
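
As a cross-check on the matrices above, the following plain-Python sketch recomputes two of the entries by hand. It is not how MLlib computes correlations, and the helper names `pearson`, `average_ranks` and `spearman` are made up for illustration. It also shows why the third row and column are `nan`: that feature is 0.0 in every row, so its variance is zero and the correlation is undefined.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; nan if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    ss_x = sum(d * d for d in dx)
    ss_y = sum(d * d for d in dy)
    if ss_x == 0 or ss_y == 0:
        return float("nan")  # zero variance: the correlation is undefined
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# First three columns of the four feature vectors above (sparse vectors written out densely)
col0 = [1.0, 4.0, 6.0, 9.0]
col1 = [0.0, 5.0, 7.0, 0.0]
col2 = [0.0, 0.0, 0.0, 0.0]

print(pearson(col0, col1))   # compare with entry [0][1] of the Pearson matrix
print(spearman(col0, col1))  # compare with entry [0][1] of the Spearman matrix
print(pearson(col0, col2))   # nan, because col2 is constant
```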


## Hypothesis Testing

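Hypothesis testing determines whether a result is statistically significant, that is, whether it is unlikely to have occurred by chance. The DataFrame-based API described here supports Pearson's Chi-squared (χ²) tests for independence.

`ChiSquareTest` conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix, from which the Chi-squared statistic is computed. All label and feature values must be categorical.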

In [5]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ['label', 'features'])

r = ChiSquareTest.test(df, 'features', 'label').head()

print("pValues: " + str(r.pValues))
print(f'degrees of freedom {r.degreesOfFreedom}')
print(f"Statistics: {str(r.statistics)}")

pValues: [0.6872892787909721,0.6822703303362126]
degrees of freedom [2, 3]
Statistics: [0.75,1.5]
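
To make the contingency-matrix step concrete, here is a plain-Python sketch that recomputes the statistics and degrees of freedom printed above. The function name `chi_squared` is hypothetical and this is not MLlib's implementation. The p-values are left out because they need the Chi-squared distribution function, which the standard library does not provide.

```python
from collections import Counter

def chi_squared(feature, label):
    """Pearson's chi-squared statistic and degrees of freedom for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # cells of the contingency matrix
    row_totals = Counter(feature)
    col_totals = Counter(label)
    statistic = 0.0
    for feat, row_total in row_totals.items():
        for lab, col_total in col_totals.items():
            expected = row_total * col_total / n  # expected count if feature and label are independent
            statistic += (observed[(feat, lab)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# The label column and the two feature columns of the DataFrame above
labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

stat0, dof0 = chi_squared(feature0, labels)
stat1, dof1 = chi_squared(feature1, labels)
print(stat0, dof0)
print(stat1, dof1)
```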


## Summarizer

Provides summary statistics for a `Vector` column: column-wise max, min, mean, sum, variance, std, and number of non-zeros, as well as the total count.

In [6]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer
from pyspark.sql import Row

# Reuse the SparkContext of the existing SparkSession.
# Only one SparkContext can run at a time, so creating a second one raises a ValueError
sc = spark.sparkContext

df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
                     Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()

# Create summarizer for multiple metrics 'mean' and 'count'
summarizer = Summarizer.metrics('mean', 'count')

# Compute statistics for multiple metrics with weight
df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)

# Compute statistics for multiple metrics without weight
df.select(summarizer.summary(df.features)).show(truncate=False)

# Compute statistics for single metric 'mean' with weight
# (single-metric helpers are called on the Summarizer class itself)
df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)

# Compute statistics for single metric 'mean' without weight
df.select(Summarizer.mean(df.features)).show(truncate=False)
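
To show what the weight column does to the mean, here is a small plain-Python sketch that uses the same two rows as the cell above. It assumes the usual definition of a weighted mean and is not MLlib's implementation; `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(vectors, weights=None):
    """Column-wise mean of equal-length vectors; each row counts in proportion to its weight."""
    if weights is None:
        weights = [1.0] * len(vectors)  # no weights given: every row counts equally
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]

features = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
weights = [1.0, 0.0]

with_weight = weighted_mean(features, weights)  # the second row has weight 0, so it is ignored
without_weight = weighted_mean(features)        # both rows count equally
print(with_weight)     # [1.0, 1.0, 1.0]
print(without_weight)  # [1.0, 1.5, 2.0]
```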

---

# Data Source

This section shows how to use data sources in ML to load data. Besides general-purpose data sources such as Parquet, CSV and JSON, there are 2 data sources specific to ML:

1. Image data source
2. LIBSVM data source

## Image data source

Loads image files from a directory. It can load compressed images (e.g. JPEG, PNG) into a raw image representation. The loaded DataFrame has 1 `StructType` column, "image", which holds the image data in the following schema:

* origin: StringType (represents the file path of the image)
* height: IntegerType (height of the image)
* width: IntegerType (width of the image)
* nChannels: IntegerType (number of image channels)
* mode: IntegerType (OpenCV-compatible type)
* data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)

In PySpark, images are loaded as a DataFrame through the Spark SQL data source API:

In [None]:
df = spark.read.format('image').option('dropInvalid', True).load('data/mllib/images/origin/kittens')
df.select('image.origin', 'image.width', 'image.height').show(truncate=False)

## LIBSVM data source

Loads files in LIBSVM format from a directory. The loaded DataFrame has 2 columns:

* label: DoubleType (represents the instance label)
* features: VectorUDT (represents the feature vector)

In [8]:
df = spark.read.format('libsvm').option('numFeatures', '780').load('data/mllib/sample_libsvm_data.txt')
df.show(10)
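
For orientation, each line of a LIBSVM text file has the form `<label> <index>:<value> <index>:<value> ...`, with one-based feature indices and only the non-zero features listed. The sketch below parses one such line in plain Python. The sample line and the helper `parse_libsvm_line` are made up for illustration, and Spark's own loader (which yields a sparse vector with zero-based indices) is not implemented this way.

```python
def parse_libsvm_line(line, num_features):
    """Parse '<label> <index>:<value> ...' into (label, dense list of feature values)."""
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features  # features not listed in the line are zero
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # file indices are one-based
    return label, features

# Hypothetical line: label 1, feature 2 = 0.5, feature 4 = -1.25
label, features = parse_libsvm_line("1 2:0.5 4:-1.25", num_features=5)
print(label, features)  # 1.0 [0.0, 0.5, 0.0, -1.25, 0.0]
```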




---

# Pipelines

Provide a uniform set of high-level APIs built on top of DataFrames to help users create and tune practical machine learning pipelines.

## Main Concept

Pipelines make it easier to combine multiple algorithms into a single workflow. The pipeline concept is mostly inspired by the scikit-learn project.

* `DataFrame`: The ML dataset is a `DataFrame` from Spark SQL, which can hold a variety of data types
* `Transformer`: Transforms one `DataFrame` into another (e.g. a model transforms a `DataFrame` with features into one with predictions)
* `Estimator`: Algorithm that is fit on a `DataFrame` to produce a `Transformer` (e.g. a learning algorithm is an `Estimator` that trains on a `DataFrame` and produces a model)
* `Pipeline`: Chains multiple `Transformer`s and `Estimator`s together into an ML workflow
* `Parameter`: All `Transformer`s and `Estimator`s share a common API for specifying parameters

### DataFrame

* `DataFrame` supports many different basic and structured data types from Spark SQL. 
* Can also use `Vector` type. 
* Can be created implicitly or explicitly from a regular RDD. 
* Columns are also named

## Pipeline Components

### Transformers

Abstraction that includes feature transformation and learned models. Generally it implements the method `transform()` that appends new column(s). For example:

* Read a column, map it to a new column, and output a new `DataFrame` with the mapped column appended
* Read the column containing feature vectors, predict the label for each feature vector, and output a new `DataFrame` with the predicted labels appended as a column

### Estimators

Abstracts the concept of a learning algorithm or any algorithm that fits or trains data. Implements a method `fit()` which accepts data and produces a model (that is a `Transformer`)
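
The relationship between the two abstractions can be sketched in plain Python, with a list of dicts standing in for a `DataFrame`. The classes `MeanCenterer` and `MeanCenterModel` are invented for this illustration and are not part of MLlib. The point is only the shape of the contract: `fit()` learns something from the data and returns a model, and the model's `transform()` appends a column.

```python
class MeanCenterModel:
    """A Transformer: holds a learned mean and appends a centered column."""

    def __init__(self, mean, input_col, output_col):
        self.mean = mean
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with an extra column; the input rows are left untouched
        return [{**row, self.output_col: row[self.input_col] - self.mean} for row in rows]


class MeanCenterer:
    """An Estimator: fit() learns the column mean and returns a model (a Transformer)."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return MeanCenterModel(mean, self.input_col, self.output_col)


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = MeanCenterer(input_col="x", output_col="x_centered").fit(rows)  # Estimator -> Transformer
centered = model.transform(rows)                                        # appends 'x_centered'
print(centered)
```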

## Properties of pipeline components

`Transformer.transform()` and `Estimator.fit()` are both stateless. Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful when specifying parameters.

## Pipeline

A `Pipeline` is MLlib's representation of a workflow: a sequence of `PipelineStage`s to be run in order.

__How it works__

* Each stage is either a `Transformer` or an `Estimator`
* Input data is transformed as it passes through each stage
	- `Transformer` stage: `Transformer.transform()` is called on the `DataFrame`
	- `Estimator` stage: `Estimator.fit()` is called to create a model (a `Transformer`), then that model's `transform()` is called on the `DataFrame`
* Fitting a `Pipeline` produces a `PipelineModel`, in which every `Estimator` of the original `Pipeline` has become a `Transformer`
* Each stage's `transform()` method passes the updated dataset on to the next stage
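
The stage-by-stage behaviour described in this list can be sketched as a short loop in plain Python. Everything here (`Lowercase`, `VocabularyEstimator`, `VocabularyModel`, `fit_pipeline`, `transform_pipeline`) is a hypothetical stand-in, with lists of dicts in place of `DataFrame`s. It is a simplified picture of the idea, not Spark's `Pipeline` code.

```python
class Lowercase:
    """Transformer stage: appends a lower-cased copy of the text column."""

    def transform(self, rows):
        return [{**r, "lower": r["text"].lower()} for r in rows]


class VocabularyModel:
    """Transformer produced by VocabularyEstimator: keeps only the words seen during fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [{**r, "known": [w for w in r["lower"].split() if w in self.vocab]} for r in rows]


class VocabularyEstimator:
    """Estimator stage: learns the set of words present in the training rows."""

    def fit(self, rows):
        return VocabularyModel({w for r in rows for w in r["lower"].split()})


def fit_pipeline(stages, rows):
    """Run the stages in order and return the fitted stages, all of them Transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):     # Estimator: fit it and keep the resulting model
            stage = stage.fit(rows)
        rows = stage.transform(rows)  # pass the transformed data on to the next stage
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage's transform() in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [{"text": "Spark ML"}, {"text": "Hadoop MapReduce"}]
fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], train)
result = transform_pipeline(fitted, [{"text": "Apache SPARK"}])
print(result)
```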

_DAG Pipelines_: A `Pipeline` does not have to be linear. Its stages can form a Directed Acyclic Graph (DAG), as long as they are specified in topological order.

_Runtime checking_: Pipelines do not use compile-time type checking. Instead, they do runtime checking before actually running the pipeline, using the `DataFrame` schema (a description of the data types of its columns) to make sure the operations are valid.

_Unique Pipeline stages_: Each stage must be a unique instance. The same instance cannot be inserted into a Pipeline twice, because stages must have unique IDs. Two different instances of the same type are fine.

## Parameters

There are 2 ways to pass parameters to an algorithm:

1. Set parameters on an instance (e.g. `lr.setMaxIter(10)` makes `lr.fit()` use at most 10 iterations)
2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` override parameters previously set by the first method.

Parameters belong to specific instances of `Estimator`s and `Transformer`s, so a single `ParamMap` can hold parameters for several different instances at once.
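
The precedence between the two methods can be pictured as a dictionary merge in which later sources win. This is only a plain-Python analogy: `resolve_params` is a hypothetical helper, the default values are illustrative, and Spark resolves parameters per instance through `Param` objects rather than plain strings.

```python
def resolve_params(defaults, instance_params, fit_param_map=None):
    """Effective parameters: later sources override earlier ones."""
    return {**defaults, **instance_params, **(fit_param_map or {})}

defaults = {"maxIter": 100, "regParam": 0.0}      # illustrative default values
instance_params = {"maxIter": 10}                 # like lr.setMaxIter(10)
fit_param_map = {"maxIter": 30, "regParam": 0.1}  # like a ParamMap passed to fit()

without_map = resolve_params(defaults, instance_params)
with_map = resolve_params(defaults, instance_params, fit_param_map)
print(without_map)  # {'maxIter': 10, 'regParam': 0.0}
print(with_map)     # {'maxIter': 30, 'regParam': 0.1}
```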

## Example: Model training and testing

In [2]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('ml example') \
  .getOrCreate()

# Prepare training data from a list of (label, features) tuples
training = spark.createDataFrame(
    [
        (1.0, Vectors.dense([0.0, 1.1, 0.1])),
        (0.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5]))
    ],
    ['label', 'features']
)

# Create a LogisticRegression instance. This instance is an Estimator
lr = LogisticRegression(maxIter=10, regParam=0.01)
print(f"LogisticRegression parameters: \n {lr.explainParams()} \n")

# Learn the LogisticRegression model
model1 = lr.fit(training)

# model1 is a Model (i.e. a Transformer produced by an Estimator), so we can view the parameters used during fit()
print(f"Model 1 was fit using parameters:\n {model1.extractParamMap()}")

LogisticRegression parameters: 
 aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The b

In [3]:
# specify parameters using Python dictionary in paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30 # override the first one
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55}) # update with multiple Params

# Combine paramMaps, which are Python dictionaries
paramMap2 = {lr.probabilityCol: 'myProbability'}
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Learn a new model using the paramMapCombined parameters
model2 = lr.fit(training, paramMapCombined)
print(f"Model 2 was fitted using the following parameters:\n {model2.extractParamMap()}")

Model 2 was fitted using the following parameters:
 {Param(parent='LogisticRegression_857d80aa7eaf', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_857d80aa7eaf', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_857d80aa7eaf', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_857d80aa7eaf', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_857d80aa7eaf', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_857d80aa7eaf', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_857d80aa7eaf', name='maxB

In [5]:
# Prepare some test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# make predictions on test data using Transformer.transform() method
# model2 will output a "myProbability" column since we changed the name previously
prediction = model2.transform(test)
result = prediction.select('features', 'label', 'myProbability', 'prediction').collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

features=[-1.0,1.5,1.3], label=1.0 -> prob=[0.0570730499357254,0.9429269500642746], prediction=1.0
features=[3.0,2.0,-0.1], label=0.0 -> prob=[0.9238521956443227,0.07614780435567725], prediction=0.0
features=[0.0,2.2,-1.5], label=1.0 -> prob=[0.10972780286187782,0.8902721971381222], prediction=1.0


## Example: Model Pipeline

In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from list of (id, text, label) tuples
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline with 3 stages:
# tokenizer, hashingTF and lr
tokenizer = Tokenizer(inputCol='text', outputCol='words')
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='features')
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit pipeline to training documents
model = pipeline.fit(training)

# Prepare test documents, unlabeled (id, text) tuples
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print column of interest
prediction = model.transform(test)
selected = prediction.select('id', 'text', 'probability', 'prediction').collect()
for row in selected:
    print("( %d, %s ) -> prob=%s, prediction=%f"
          % (row.id, row.text, row.probability, row.prediction))

( 4, spark i j k ) -> prob=[0.6292098489668488,0.37079015103315116], prediction=0.000000
( 5, l m n ) -> prob=[0.984770006762304,0.015229993237696027], prediction=0.000000
( 6, spark hadoop spark ) -> prob=[0.13412348342566147,0.8658765165743385], prediction=1.000000
( 7, apache hadoop ) -> prob=[0.9955732114398529,0.00442678856014711], prediction=0.000000


__Note: Subsequent sections of tutorial explain in detail the different algorithms available in Spark__

---

# Model Tuning: Selection and Hyperparameter tuning

This section describes how to use MLlib's tooling for tuning ML algorithms and Pipelines. Built-in cross-validation and other tools let you optimize hyperparameters in algorithms and Pipelines.

## Model Selection

Tuning can be done for an individual `Estimator` or for an entire `Pipeline` that includes multiple algorithms, featurization and other steps. An entire `Pipeline` can be tuned at once, rather than tuning each of its elements separately.

Tools like `CrossValidator` and `TrainValidationSplit` require the following:

* `Estimator`: algorithm or Pipeline to tune
* Set of `ParamMap`s: parameters to choose from ("parameter grid" to search over)
* `Evaluator`: metric to measure how well a fitted model does on held-out test data

These tools work as follows:

1. Split the input data into separate training and test datasets
2. For each (training, test) pair, iterate through the set of `ParamMap`s: fit the `Estimator` with those parameters, then evaluate the fitted model on the test set using the `Evaluator`
3. Select the model produced by the best-performing set of parameters
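
These three steps can be sketched end to end in plain Python with a deliberately tiny "model". Everything here is a toy written for this illustration (`fit`, `evaluate`, `k_fold_pairs` and `cross_validate` are hypothetical names, and the constant predictor with a shrinkage parameter `lam` is invented). It mirrors the control flow of model selection, not Spark's implementation.

```python
def fit(train, lam):
    """Toy 'Estimator': a constant predictor, shrunk toward 0 as lam grows."""
    return sum(train) / (len(train) + lam)

def evaluate(prediction, test):
    """Toy 'Evaluator': mean squared error on held-out data (lower is better)."""
    return sum((y - prediction) ** 2 for y in test) / len(test)

def k_fold_pairs(data, k):
    """Step 1: yield k (train, test) pairs; fold i holds out every k-th item starting at i."""
    for i in range(k):
        test = data[i::k]
        train = [y for j, y in enumerate(data) if j % k != i]
        yield train, test

def cross_validate(data, param_grid, k):
    # Step 2: for every parameter setting, fit on each training set and score on its test set
    avg_errors = []
    for lam in param_grid:
        errors = [evaluate(fit(train, lam), test) for train, test in k_fold_pairs(data, k)]
        avg_errors.append(sum(errors) / k)
    # Step 3: pick the best-performing setting, then refit on all the data with it
    best_index = min(range(len(param_grid)), key=lambda i: avg_errors[i])
    best_lam = param_grid[best_index]
    return best_lam, fit(data, best_lam), avg_errors

data = [4.0, 6.0, 5.0, 5.0, 3.0, 7.0]
best_lam, best_model, avg_errors = cross_validate(data, param_grid=[0.0, 1.0, 10.0], k=2)
print(best_lam, best_model, avg_errors)
```

With an evaluator where larger is better (for example an accuracy-style metric), step 3 would take the maximum instead of the minimum.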

__Evaluators__

Each type of problem has a corresponding `Evaluator`, e.g. `RegressionEvaluator` for regression problems, `BinaryClassificationEvaluator` for binary data, `MulticlassClassificationEvaluator` for multiclass problems, and `MultilabelClassificationEvaluator` for multi-label classification. Each evaluator has a default metric, which can be changed with its `setMetricName` method.

__Parameter Grid__

Use the `ParamGridBuilder` utility to create the parameter grid. By default, the parameter settings are evaluated in serial. They can be evaluated in parallel by setting `parallelism` to a value of 2 or more before running model selection with `CrossValidator` or `TrainValidationSplit`. Choose the value carefully to maximize parallelism without exceeding cluster resources; larger values do not necessarily improve performance. A value of up to 10 should be sufficient for most clusters.
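
As a rough mental model, a parameter grid is the Cartesian product of the value lists, and `parallelism` bounds how many settings are evaluated at once. The plain-Python sketch below uses `itertools.product` and a thread pool to show both ideas. `build_grid` and `score` are hypothetical, and `score` is a made-up formula standing in for "fit and evaluate one setting".

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def build_grid(options):
    """Expand {name: [values]} into one dict per combination (Cartesian product)."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]

def score(params):
    """Stand-in for 'fit and evaluate one setting'; a made-up formula, not a real metric."""
    return params["numFeatures"] * params["regParam"]

grid = build_grid({"numFeatures": [10, 100, 1000], "regParam": [0.1, 0.01]})
print(len(grid))  # 6 settings: 3 values x 2 values

parallelism = 2  # at most 2 settings are evaluated at the same time
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    scores = list(pool.map(score, grid))  # results come back in grid order
```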


## Cross-Validation

In [9]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labelled
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure the ML pipeline
tokenizer = Tokenizer(inputCol='text', outputCol='words')
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='features')
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Use ParamGridBuilder to construct a grid of parameters to search over
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# the grid has 3 x 2 = 6 parameter settings for CrossValidator to choose from
paramGrid = ParamGridBuilder() \
  .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
  .addGrid(lr.regParam, [0.1, 0.01]) \
  .build()

cross_val = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=2
)

# Run cross-validation and choose best set of parameters
cv_model = cross_val.fit(training)

# Prepare test documents, which are unlabelled
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make prediction
prediction = cv_model.transform(test)
selected_cols = prediction.select('id', 'text', 'probability', 'prediction').collect()
for row in selected_cols:
  print(row)

Row(id=4, text='spark i j k', probability=DenseVector([0.3407, 0.6593]), prediction=1.0)
Row(id=5, text='l m n', probability=DenseVector([0.9432, 0.0568]), prediction=0.0)
Row(id=6, text='mapreduce spark', probability=DenseVector([0.3449, 0.6551]), prediction=1.0)
Row(id=7, text='apache hadoop', probability=DenseVector([0.9563, 0.0437]), prediction=0.0)


## Train-Validation Split

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and testing data
data = spark.read.format('libsvm').load('data/mllib/sample_linear_regression_data.txt')
train, test = data.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# Create the paramGrid that will host all the different hyperparameters
paramGrid = ParamGridBuilder() \
  .addGrid(lr.regParam, [0.1, 0.01]) \
  .addGrid(lr.fitIntercept, [False, True]) \
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
  .build()

tvs = TrainValidationSplit(
    estimator=lr,
    estimatorParamMaps=paramGrid,
    evaluator=RegressionEvaluator(),
    trainRatio=0.8  # 80% to train, 20% to validate
)

# Run and choose the best set of parameters
model = tvs.fit(train)

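# Make predictions on test data. model is the model with the best-performing combination of parameters
# The libsvm data source names its columns 'label' and 'features'
model.transform(test) \
  .select('features', 'label', 'prediction') \
  .show()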


