<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Load-the-libraries" data-toc-modified-id="Load-the-libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the libraries</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Random-Forest" data-toc-modified-id="Random-Forest-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Random Forest</a></span></li><li><span><a href="#Lightgbm" data-toc-modified-id="Lightgbm-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Lightgbm</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Logistic Regression</a></span></li></ul></li><li><span><a href="#Different-models" data-toc-modified-id="Different-models-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Different models</a></span></li></ul></div>

# Description
GitHub repository: https://github.com/Azure/mmlspark  

LightGBM on Spark documentation: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md  


Regression:
```python
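from mmlspark.lightgbm import LightGBMRegressor
# In mmlspark 1.0.0-rc2 the parameter is `objective`; older releases called it `application`.
model = LightGBMRegressor(objective='quantile',
                          alpha=0.3,  # target quantile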
                          learningRate=0.3,
                          numIterations=100,
                          numLeaves=31).fit(train)
```


Classification:
```python
from mmlspark.lightgbm import LightGBMClassifier
model = LightGBMClassifier(learningRate=0.3,
                           numIterations=100,
                           numLeaves=31).fit(train)
```
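
The regression snippet above selects quantile regression with `alpha=0.3`. In LightGBM's quantile objective, `alpha` is the target quantile and the model minimizes the pinball (quantile) loss. The following is a minimal pure-Python sketch of that loss, not MMLSpark's implementation; the function name `pinball_loss` and the sample numbers are made up for illustration.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a target quantile `alpha` in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) is weighted by alpha,
        # over-prediction (diff < 0) by (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)


# Hypothetical targets and two constant predictions.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
low = pinball_loss(y, [2.0] * 5, alpha=0.3)   # prediction near the lower end
high = pinball_loss(y, [4.0] * 5, alpha=0.3)  # prediction near the upper end
print(low, high)
```

With `alpha=0.3`, over-prediction is penalized more heavily than under-prediction, so the lower constant prediction gets the smaller loss.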

# Load the libraries

In [1]:
import numpy as np
import pandas as pd
import os
HOME = os.path.expanduser('~')

import findspark
# findspark.init(HOME + "/Softwares/Spark/spark-3.0.0-bin-hadoop2.7")

# mmlspark_2.11:1.0.0-rc2 is built for Scala 2.11, so use Spark 2.4.x here
# (the prebuilt Spark 3.0.0 binaries use Scala 2.12).
findspark.init(HOME + "/Softwares/Spark/spark-2.4.6-bin-hadoop2.7")

import pyspark
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

spark = (SparkSession.builder.appName("MyApp")
    # Pull the MMLSpark package and its Maven repository
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc2")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
    )
import mmlspark

SEED = 100

df_eval = pd.DataFrame({
    "Model": [],
    "Description": [],
    "Accuracy": [],
    "Precision": [],
    "AUC": []
})

from pyspark.ml.classification import RandomForestClassifier
from mmlspark.lightgbm import LightGBMClassifier

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

print(f'pyspark version: {pyspark.__version__}')

pyspark version: 2.4.6


# Load the data

In [2]:
sdf = spark.read.csv('affairs.csv', inferSchema=True, header=True)
print((sdf.count(), len(sdf.columns)))

# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
sdf.printSchema()
sdf.show(5)

(6366, 6)
root
 |-- rate_marriage: integer (nullable = true)
 |-- age: double (nullable = true)
 |-- yrs_married: double (nullable = true)
 |-- children: double (nullable = true)
 |-- religious: integer (nullable = true)
 |-- affairs: integer (nullable = true)

+-------------+----+-----------+--------+---------+-------+
|rate_marriage| age|yrs_married|children|religious|affairs|
+-------------+----+-----------+--------+---------+-------+
|            5|32.0|        6.0|     1.0|        3|      0|
|            4|22.0|        2.5|     0.0|        2|      0|
|            3|32.0|        9.0|     3.0|        3|      1|
|            3|27.0|       13.0|     3.0|        1|      1|
|            4|22.0|        2.5|     0.0|        1|      1|
+-------------+----+-----------+--------+---------+-------+
only showing top 5 rows



# Data Preparation

In [3]:
from pyspark.ml.feature import VectorAssembler

In [4]:
inputCols = ['rate_marriage', 'age', 'yrs_married', 'children', 'religious']
assembler = VectorAssembler(inputCols=inputCols, outputCol="features")

sdf = assembler.transform(sdf)

In [5]:
train, test = sdf.select(['features', 'affairs']).randomSplit([0.75, 0.25], seed=SEED)

In [6]:
train.count()

4803

In [7]:
train.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1| 1561|
|      0| 3242|
+-------+-----+



In [8]:
test.groupBy('affairs').count().show()

+-------+-----+
|affairs|count|
+-------+-----+
|      1|  492|
|      0| 1071|
+-------+-----+
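
Both splits are imbalanced: roughly two thirds of the rows have `affairs = 0`. A quick way to put the later accuracy figures in context is the accuracy of always predicting the majority class. The sketch below only redoes that arithmetic from the counts printed above; the variable and function names are illustrative.

```python
# Class counts copied from the groupBy outputs above.
train_counts = {1: 1561, 0: 3242}
test_counts = {1: 492, 0: 1071}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


train_pos_rate = train_counts[1] / sum(train_counts.values())
test_baseline = majority_baseline(test_counts)

print(f"train positive rate: {train_pos_rate:.4f}")
print(f"test majority-class accuracy: {test_baseline:.4f}")
```

On the test split this baseline is about 0.69, so a model has to beat that figure before its accuracy says anything useful.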



# Modelling

## Random Forest

```python
RandomForestClassifier(
    featuresCol='features',
    labelCol='label',
    predictionCol='prediction',
    probabilityCol='probability',
    rawPredictionCol='rawPrediction',
    maxDepth=5,
    maxBins=32,
    minInstancesPerNode=1,
    minInfoGain=0.0,
    maxMemoryInMB=256,
    cacheNodeIds=False,
    checkpointInterval=10,
    impurity='gini',
    numTrees=20,
    featureSubsetStrategy='auto',
    seed=None,
    subsamplingRate=1.0,
)
Docstring:     
`Random Forest <http://en.wikipedia.org/wiki/Random_forest>`_
learning algorithm for classification.
It supports both binary and multiclass labels, as well as both continuous and categorical
features.

>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> numpy.argmax(result.probability)
0
>>> numpy.argmax(result.rawPrediction)
0
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> model.trees
[DecisionTreeClassificationModel (uid=...) of depth..., DecisionTreeClassificationModel...]
>>> rfc_path = temp_path + "/rfc"
>>> rf.save(rfc_path)
>>> rf2 = RandomForestClassifier.load(rfc_path)
>>> rf2.getNumTrees()
3
>>> model_path = temp_path + "/rfc_model"
>>> model.save(model_path)
>>> model2 = RandomForestClassificationModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
```

In [9]:
from pyspark.ml.classification import RandomForestClassifier

model = RandomForestClassifier(labelCol='affairs',seed=SEED)

# Spark ML estimators are saved to a directory ('rf_model' is only an example path):
# model.save('rf_model')
# model = RandomForestClassifier.load('rf_model')

model = model.fit(train)
test_preds = model.transform(test)

acc = MulticlassClassificationEvaluator(
    labelCol='affairs',
    metricName='accuracy'
    ).evaluate(test_preds)

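# Note: this is the class-frequency-weighted precision, not positive-class precision
precision = MulticlassClassificationEvaluator(
    labelCol='affairs',
    metricName='weightedPrecision'
    ).evaluate(test_preds)

# areaUnderROC is the evaluator's default metric; it is spelled out here for clarity
auc = BinaryClassificationEvaluator(
    labelCol='affairs',
    metricName='areaUnderROC'
    ).evaluate(test_preds)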


row = ["rf",'default',acc,precision,auc]

df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates()
df_eval

| | Model | Description | Accuracy | Precision | AUC |
|---|---|---|---|---|---|
| 0 | rf | default | 0.713372 | 0.691098 | 0.736099 |


## Lightgbm

```python
LightGBMClassifier(
    baggingFraction=1.0,
    baggingFreq=0,
    baggingSeed=3,
    binSampleCount=200000,
    boostFromAverage=True,
    boostingType='gbdt',
    categoricalSlotIndexes=[],
    categoricalSlotNames=[],
    defaultListenPort=12400,
    driverListenPort=0,
    earlyStoppingRound=0,
    featureFraction=1.0,
    featuresCol='features',
    featuresShapCol='',
    improvementTolerance=0.0,
    initScoreCol=None,
    isProvideTrainingMetric=False,
    isUnbalance=False,
    labelCol='label',
    lambdaL1=0.0,
    lambdaL2=0.0,
    leafPredictionCol='',
    learningRate=0.1,
    maxBin=255,
    maxBinByFeature=[],
    maxDeltaStep=0.0,
    maxDepth=-1,
    metric='',
    minDataInLeaf=20,
    minGainToSplit=0.0,
    minSumHessianInLeaf=0.001,
    modelString='',
    negBaggingFraction=1.0,
    numBatches=0,
    numIterations=100,
    numLeaves=31,
    numTasks=0,
    objective='binary',
    parallelism='data_parallel',
    posBaggingFraction=1.0,
    predictionCol='prediction',
    probabilityCol='probability',
    rawPredictionCol='rawPrediction',
    repartitionByGroupingColumn=True,
    slotNames=[],
    thresholds=None,
    timeout=1200.0,
    topK=20,
    useBarrierExecutionMode=False,
    validationIndicatorCol=None,
    verbosity=1,
    weightCol=None,
)
Docstring:     
Args:

    baggingFraction (double): Bagging fraction (default: 1.0)
    baggingFreq (int): Bagging frequency (default: 0)
    baggingSeed (int): Bagging seed (default: 3)
    binSampleCount (int): Number of samples considered at computing histogram bins (default: 200000)
    boostFromAverage (bool): Adjusts initial score to the mean of labels for faster convergence (default: true)
    boostingType (str): Default gbdt = traditional Gradient Boosting Decision Tree. Options are: gbdt, gbrt, rf (Random Forest), random_forest, dart (Dropouts meet Multiple Additive Regression Trees), goss (Gradient-based One-Side Sampling).  (default: gbdt)
    categoricalSlotIndexes (list): List of categorical column indexes, the slot index in the features column (default: [I@1d15cacb)
    categoricalSlotNames (list): List of categorical column slot names, the slot name in the features column (default: [Ljava.lang.String;@518b5988)
    defaultListenPort (int): The default listen port on executors, used for testing (default: 12400)
    driverListenPort (int): The listen port on a driver. Default value is 0 (random) (default: 0)
    earlyStoppingRound (int): Early stopping round (default: 0)
    featureFraction (double): Feature fraction (default: 1.0)
    featuresCol (str): features column name (default: features)
    featuresShapCol (str): Output SHAP vector column name after prediction containing the feature contribution values (default: )
    improvementTolerance (double): Tolerance to consider improvement in metric (default: 0.0)
    initScoreCol (str): The name of the initial score column, used for continued training
    isProvideTrainingMetric (bool): Whether output metric result over training dataset. (default: false)
    isUnbalance (bool): Set to true if training data is unbalanced in binary classification scenario (default: false)
    labelCol (str): label column name (default: label)
    lambdaL1 (double): L1 regularization (default: 0.0)
    lambdaL2 (double): L2 regularization (default: 0.0)
    leafPredictionCol (str): Predicted leaf indices's column name (default: )
    learningRate (double): Learning rate or shrinkage rate (default: 0.1)
    maxBin (int): Max bin (default: 255)
    maxBinByFeature (list): Max number of bins for each feature (default: [I@3ca10949)
    maxDeltaStep (double): Used to limit the max output of tree leaves (default: 0.0)
    maxDepth (int): Max depth (default: -1)
    metric (str): Metrics to be evaluated on the evaluation data.  Options are: empty string or not specified means that metric corresponding to specified objective will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added). None (string, not a None value) means that no metric will be registered, aliases: na, null, custom. l1, absolute loss, aliases: mean_absolute_error, mae, regression_l1. l2, square loss, aliases: mean_squared_error, mse, regression_l2, regression. rmse, root square loss, aliases: root_mean_squared_error, l2_root. quantile, Quantile regression. mape, MAPE loss, aliases: mean_absolute_percentage_error. huber, Huber loss. fair, Fair loss. poisson, negative log-likelihood for Poisson regression. gamma, negative log-likelihood for Gamma regression. gamma_deviance, residual deviance for Gamma regression. tweedie, negative log-likelihood for Tweedie regression. ndcg, NDCG, aliases: lambdarank. map, MAP, aliases: mean_average_precision. auc, AUC. binary_logloss, log loss, aliases: binary. binary_error, for one sample: 0 for correct classification, 1 for error classification. multi_logloss, log loss for multi-class classification, aliases: multiclass, softmax, multiclassova, multiclass_ova, ova, ovr. multi_error, error rate for multi-class classification. cross_entropy, cross-entropy (with optional linear weights), aliases: xentropy. cross_entropy_lambda, intensity-weighted cross-entropy, aliases: xentlambda. kullback_leibler, Kullback-Leibler divergence, aliases: kldiv.  (default: )
    minDataInLeaf (int): Minimal number of data in one leaf. Can be used to deal with over-fitting. (default: 20)
    minGainToSplit (double): The minimal gain to perform split (default: 0.0)
    minSumHessianInLeaf (double): Minimal sum hessian in one leaf (default: 0.001)
    modelString (str): LightGBM model to retrain (default: )
    negBaggingFraction (double): Negative Bagging fraction (default: 1.0)
    numBatches (int): If greater than 0, splits data into separate batches during training (default: 0)
    numIterations (int): Number of iterations, LightGBM constructs num_class * num_iterations trees (default: 100)
    numLeaves (int): Number of leaves (default: 31)
    numTasks (int): Advanced parameter to specify the number of tasks.  MMLSpark tries to guess this based on cluster configuration, but this parameter can be used to override. (default: 0)
    objective (str): The Objective. For regression applications, this can be: regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. For classification applications, this can be: binary, multiclass, or multiclassova.  (default: binary)
    parallelism (str): Tree learner parallelism, can be set to data_parallel or voting_parallel (default: data_parallel)
    posBaggingFraction (double): Positive Bagging fraction (default: 1.0)
    predictionCol (str): prediction column name (default: prediction)
    probabilityCol (str): Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
    rawPredictionCol (str): raw prediction (a.k.a. confidence) column name (default: rawPrediction)
    repartitionByGroupingColumn (bool): Repartition training data according to grouping column, on by default. (default: true)
    slotNames (list): List of slot names in the features column (default: [Ljava.lang.String;@17330338)
    thresholds (list): Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold
    timeout (double): Timeout in seconds (default: 1200.0)
    topK (int): The top_k value used in Voting parallel, set this to larger value for more accurate result, but it will slow down the training speed. It should be greater than 0 (default: 20)
    useBarrierExecutionMode (bool): Use new barrier execution mode in Beta testing, off by default. (default: false)
    validationIndicatorCol (str): Indicates whether the row is for training or validation
    verbosity (int): Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug (default: 1)
    weightCol (str): The name of the weight column
```

In [10]:
from mmlspark.lightgbm import LightGBMClassifier

model = LightGBMClassifier(labelCol='affairs')

# model.save('lgb.pkl')
# model = LightGBMClassifier.load('lgb.pkl')

model = model.fit(train)
test_preds = model.transform(test)

acc = MulticlassClassificationEvaluator(
    labelCol='affairs',
    metricName='accuracy'
    ).evaluate(test_preds)

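# Note: this is the class-frequency-weighted precision, not positive-class precision
precision = MulticlassClassificationEvaluator(
    labelCol='affairs',
    metricName='weightedPrecision'
    ).evaluate(test_preds)

# areaUnderROC is the evaluator's default metric; it is spelled out here for clarity
auc = BinaryClassificationEvaluator(
    labelCol='affairs',
    metricName='areaUnderROC'
    ).evaluate(test_preds)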


row = ["lgb",'default',acc,precision,auc]
df_eval.loc[len(df_eval)] = row
df_eval = df_eval.drop_duplicates()
df_eval

| | Model | Description | Accuracy | Precision | AUC |
|---|---|---|---|---|---|
| 0 | rf | default | 0.713372 | 0.691098 | 0.736099 |
| 1 | lgb | default | 0.71849 | 0.702969 | 0.724033 |
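
The two accuracies differ by about 0.005 on a test set of 1,563 rows (492 + 1,071 from the split above). As a rough sanity check, and assuming the test rows are independent, the binomial standard error of an accuracy estimate can be computed as below. This is a back-of-the-envelope sketch, not a formal significance test, and the helper name is made up.

```python
import math


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n independent rows."""
    return math.sqrt(acc * (1.0 - acc) / n)


n_test = 492 + 1071                  # test-set size from the split above
acc_rf, acc_lgb = 0.713372, 0.71849  # accuracies from the table above

se = accuracy_std_error(acc_rf, n_test)
diff = acc_lgb - acc_rf
print(f"standard error ~ {se:.4f}, difference = {diff:.4f}")
```

The difference is smaller than one standard error, so this single split does not clearly separate the two default models.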


## Logistic Regression

In [11]:
from mmlspark.train import TrainClassifier
from pyspark.ml.classification import LogisticRegression


model = TrainClassifier(model=LogisticRegression(), labelCol="affairs", numFeatures=256).fit(train)

In [16]:
from mmlspark.train import ComputeModelStatistics

prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
df_metrics = metrics.toPandas()

df_metrics

| | evaluation_type | confusion_matrix | accuracy | precision | recall | AUC |
|---|---|---|---|---|---|---|
| 0 | Classification | `DenseMatrix([[943., 128.],\n [323....` | 0.711452 | 0.569024 | 0.343496 | 0.744358 |


In [18]:
df_metrics['confusion_matrix'][0]

DenseMatrix(2, 2, [943.0, 323.0, 128.0, 169.0], False)
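
Spark's `DenseMatrix(numRows, numCols, values, isTransposed)` stores `values` in column-major order when `isTransposed` is `False`, so the flat list above corresponds to the rows `[943, 128]` and `[323, 169]`. Assuming rows are actual classes and columns are predicted classes (the row sums 1,071 and 492 match the test-set class counts), the reported metrics can be recomputed by hand. The sketch below is plain Python with illustrative variable names; it also computes the class-weighted precision, which is what `weightedPrecision` measures for the other two models.

```python
# Flat values copied from the DenseMatrix output above (column-major order).
n_rows, n_cols = 2, 2
values = [943.0, 323.0, 128.0, 169.0]

# Rebuild the row-wise matrix: element (i, j) lives at values[j * n_rows + i].
cm = [[values[j * n_rows + i] for j in range(n_cols)] for i in range(n_rows)]

tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # positive-class precision
recall = tp / (tp + fn)

# Class-weighted precision: per-class precision weighted by each class's actual share.
precision_0 = tn / (tn + fn)
weighted_precision = (precision_0 * (tn + fp) + precision * (fn + tp)) / total

print(cm)
print(round(accuracy, 6), round(precision, 6), round(recall, 6))
print(round(weighted_precision, 6))
```

The recomputed accuracy, precision and recall match the `ComputeModelStatistics` row. The weighted precision (about 0.69) is the figure comparable to the `Precision` column recorded for the random forest and LightGBM models, since the 0.569 reported here is the positive-class precision.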