# LightGBM

[LightGBM](https://github.com/Microsoft/LightGBM) is an open-source,
distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or
MART) framework. This framework specializes in creating high-quality and
GPU-enabled decision tree algorithms for ranking, classification, and
many other machine learning tasks. LightGBM is part of Microsoft's
[DMTK](http://github.com/microsoft/dmtk) project.

### Advantages of LightGBM

-   **Composability**: LightGBM models can be incorporated into existing
    SparkML Pipelines, and used for batch, streaming, and serving
    workloads.
-   **Performance**: LightGBM on Spark is 10-30% faster than SparkML on
    the Higgs dataset, and achieves a 15% increase in AUC. [Parallel
    experiments](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#parallel-experiment)
    have verified that LightGBM can achieve a linear speed-up by using
    multiple machines for training in specific settings.
-   **Functionality**: LightGBM offers a wide array of [tunable
    parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst)
    that one can use to customize their decision tree system. LightGBM on
    Spark also supports new types of problems such as quantile regression.
-   **Cross-platform**: LightGBM on Spark is available on Spark, PySpark, and SparklyR.

### LightGBM Usage

- LightGBMClassifier
- LightGBMRegressor

## Bankruptcy Prediction with LightGBM Classifier

<img src="https://mmlspark.blob.core.windows.net/graphics/Documentation/bankruptcy image.png" width="800" style="float: center;"/>

In this example, we use LightGBM to build a classification model that predicts bankruptcy.

#### Read dataset

In [None]:
dataset = spark.read.format("csv")\
  .option("header", True)\
  .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv")
# print dataset size
print("records read: " + str(dataset.count()))

In [None]:
# cast every column (features and label) to double type
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
for colName in dataset.columns:
  dataset = dataset.withColumn(colName, col(colName).cast(DoubleType()))
print("Schema: ")
dataset.printSchema()

In [None]:
dataset.show(n=3, truncate=False, vertical=True)

#### Split the dataset into train and test

In [None]:
train, test = dataset.randomSplit([0.85, 0.15], seed=1)

#### Add featurizer to convert features to vector

In [None]:
from pyspark.ml.feature import VectorAssembler
# every column except the first is a feature (assumes the label 'Bankrupt?' comes first)
feature_cols = dataset.columns[1:]
featurizer = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features'
)
train_data = featurizer.transform(train).select('Bankrupt?', 'features')
test_data = featurizer.transform(test).select('Bankrupt?', 'features')

In [None]:
train_data.show(10)

#### Check if the data is unbalanced

In [None]:
train_data.groupBy("Bankrupt?").count().show()
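
The per-class counts printed above can be condensed into a single majority-to-minority ratio, which is one rough way to judge whether setting `isUnbalance=True` is worth trying. The following is a minimal sketch in plain Python. The helper `imbalance_ratio` and the example counts are hypothetical and are not taken from the bankruptcy dataset.

```python
def imbalance_ratio(class_counts):
    """Return the majority-class count divided by the minority-class count.

    `class_counts` maps each label value to its number of rows.
    """
    counts = list(class_counts.values())
    if len(counts) < 2:
        raise ValueError("need counts for at least two classes")
    if min(counts) <= 0:
        raise ValueError("every class needs at least one row")
    return max(counts) / min(counts)


# Hypothetical counts for illustration only (not the real dataset).
example_counts = {0.0: 950, 1.0: 50}
ratio = imbalance_ratio(example_counts)
print("majority/minority ratio:", ratio)
```

A ratio close to 1 suggests balanced classes. A much larger ratio suggests that the minority class may need special handling during training and evaluation.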

#### Model Training

In [None]:
from mmlspark.lightgbm import LightGBMClassifier
model = LightGBMClassifier(objective="binary", featuresCol="features", labelCol="Bankrupt?", isUnbalance=True)

In [None]:
model = model.fit(train_data)

In [None]:
from mmlspark.lightgbm import LightGBMClassificationModel
# save the trained model in LightGBM's native format, then load it back
model.saveNativeModel("/lgbmcmodel")
model = LightGBMClassificationModel.loadNativeModelFromFile("/lgbmcmodel")

In [None]:
print(model.getFeatureImportances())

#### Model Prediction

In [None]:
predictions = model.transform(test_data)
predictions.limit(10).toPandas()

In [None]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric="classification", labelCol='Bankrupt?', scoredLabelsCol='prediction').transform(predictions)
display(metrics)
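
On unbalanced data, accuracy alone can look good even when the model never finds the minority class, so it helps to read precision and recall alongside it. The sketch below computes these three quantities from plain Python lists using their textbook definitions. The helper `classification_summary` and the toy labels are hypothetical, and its output is not claimed to match the columns that `ComputeModelStatistics` produces.

```python
def classification_summary(labels, predictions, positive=1.0):
    """Compute accuracy, precision, and recall for a binary problem."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == positive and y == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif y == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Fall back to 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy data for illustration only: 18 negatives, 2 positives,
# scored by a "model" that always predicts the negative class.
toy_labels = [0.0] * 18 + [1.0] * 2
always_negative = [0.0] * 20
summary = classification_summary(toy_labels, always_negative)
print(summary)
```

In this toy case the accuracy is high while the recall is zero, which is the failure mode to watch for when bankruptcies are rare.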

## Quantile Regression for Drug Discovery with LightGBMRegressor

<img src="https://mmlspark.blob.core.windows.net/graphics/Documentation/drug.png" width="800" style="float: center;"/>

In this example, we show how to use LightGBM to build a quantile regression model.

#### Read dataset

In [None]:
triazines = spark.read.format("libsvm")\
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight")

In [None]:
# print some basic info
print("records read: " + str(triazines.count()))
print("Schema: ")
triazines.printSchema()
triazines.limit(10).toPandas()

#### Split dataset into train and test

In [None]:
train, test = triazines.randomSplit([0.85, 0.15], seed=1)

#### Model Training

In [None]:
from mmlspark.lightgbm import LightGBMRegressor
model = LightGBMRegressor(objective='quantile',
                          alpha=0.2,
                          learningRate=0.3,
                          numLeaves=31).fit(train)
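
With `objective='quantile'`, `alpha` selects the quantile being estimated, so `alpha=0.2` targets the 20th percentile of the label. Quantile regression is usually described in terms of the pinball (quantile) loss, which penalizes under-prediction with weight `alpha` and over-prediction with weight `1 - alpha`. The sketch below is a plain Python illustration of that loss. The helper `pinball_loss` and the toy targets are hypothetical and are not LightGBM's internal implementation.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball loss for quantile level `alpha` (0 < alpha < 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        if diff >= 0:
            # Prediction is at or below the target: weight alpha.
            total += alpha * diff
        else:
            # Prediction is above the target: weight (1 - alpha).
            total += (alpha - 1.0) * diff
    return total / len(y_true)


# Toy targets for illustration only.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
low_guess = pinball_loss(targets, [1.5] * 5, alpha=0.2)
high_guess = pinball_loss(targets, [4.5] * 5, alpha=0.2)
print("loss for a low constant prediction: ", low_guess)
print("loss for a high constant prediction:", high_guess)
```

At `alpha=0.2` the low constant prediction scores better than the high one, because over-prediction is penalized four times as heavily as under-prediction.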

In [None]:
print(model.getFeatureImportances())

#### Model Prediction

In [None]:
scoredData = model.transform(test)
scoredData.limit(10).toPandas()

In [None]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric='regression',
                                 labelCol='label',
                                 scoresCol='prediction') \
            .transform(scoredData)
metrics.toPandas()
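
For reference, the common regression metrics can be computed by hand from the label and prediction columns. The sketch below uses textbook definitions of mean squared error, root mean squared error, mean absolute error, and R^2 on plain Python lists. The helper `regression_summary` and the toy values are hypothetical, and the metric names are not claimed to match the output columns of `ComputeModelStatistics`.

```python
import math


def regression_summary(labels, predictions):
    """Compute MSE, RMSE, MAE, and R^2 from two equal-length lists."""
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    # R^2 is undefined when every label is identical.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Toy values for illustration only.
toy_labels = [1.0, 2.0, 3.0, 4.0]
toy_predictions = [1.5, 2.0, 2.5, 4.0]
toy_metrics = regression_summary(toy_labels, toy_predictions)
print(toy_metrics)
```

Because the model above was trained for the 20th percentile rather than the mean, its predictions are expected to sit below many labels, so squared-error metrics should be read with that in mind.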