## 106 - Quantile Regression with LightGBM

This sample fits a LightGBM quantile regressor to the Triazines dataset
and evaluates its predictions with ComputeModelStatistics.


This sample demonstrates how to use the following APIs:
- [`TrainRegressor`](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- [`LightGBMRegressor`](http://mmlspark.azureedge.net/docs/pyspark/LightGBMRegressor.html)
- [`ComputeModelStatistics`](http://mmlspark.azureedge.net/docs/pyspark/ComputeModelStatistics.html)

In [None]:
dataFile = "triazines.scale.svmlight"
import os, urllib.request
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/"+dataFile, dataFile)
triazines = spark.read.format("libsvm").load(dataFile)

In [None]:
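%%local
# Runs on the notebook server: stage the dataset in HDFS if it is not there yet.
dataFile = "triazines.scale.svmlight"
dataFilePath = "/datasets/"+dataFile
tmpLocalPath = "/tmp/"+dataFile
import subprocess
# `hdfs dfs -test -d` exits with a non-zero status when the directory is missing.
if subprocess.call(["hdfs", "dfs", "-test", "-d", dataFilePath]):
    try:
        from urllib.request import urlretrieve  # Python 3
    except ImportError:
        from urllib import urlretrieve  # Python 2
    urlretrieve("https://mmlspark.azureedge.net/datasets/"+dataFile, tmpLocalPath)
    # Create the target directory, then copy the downloaded file into it.
    print(subprocess.check_output(
            ["hdfs", "dfs", "-mkdir", "-p", dataFilePath],
            stderr=subprocess.STDOUT).decode())
    print(subprocess.check_output(
            ["hdfs", "dfs", "-copyFromLocal", "-f", tmpLocalPath, dataFilePath],
            stderr=subprocess.STDOUT).decode())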

In [None]:
dataFile = "triazines.scale.svmlight"
dataFilePath = "/datasets/"+dataFile
triazines = spark.read.format("libsvm").load(dataFilePath)
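
The `libsvm` reader used above expects one record per line in the form `label index:value index:value ...`, with one-based feature indices in the file that Spark converts to zero-based vector positions. The following plain-Python sketch only illustrates that layout. The helper `parse_libsvm_line` and the sample line are hypothetical, and this is not how Spark parses the file internally.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        # Indices in the file are one-based; shift them to zero-based positions.
        features[int(index) - 1] = float(value)
    return label, features

# Made-up record, not taken from the Triazines file.
label, features = parse_libsvm_line("0.75 1:-0.2 3:0.5 60:1")
print(label, features)
```

Features that are absent from a line are implicitly zero, which is why the loaded `features` column is a sparse vector.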

In [None]:
# Print the row count and schema, then preview the first ten rows.
print("records read: " + str(triazines.count()))
print("Schema: ")
triazines.printSchema()
triazines.limit(10).toPandas()

Split the dataset into training and test sets.

In [None]:
train, test = triazines.randomSplit([0.85, 0.15], seed=1)
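
Note that `randomSplit` treats the weights as probabilities rather than exact proportions, so the resulting split sizes are only approximately 85% and 15%. The sketch below shows the idea in plain Python, assuming each row is assigned by one uniform random draw. The function `random_split` is a hypothetical stand-in and does not reproduce Spark's sampler or its seed behavior.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to exactly one split using a single uniform draw."""
    total = float(sum(weights))
    # Cumulative upper bounds for each split, e.g. [0.85, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if u < bound:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.85, 0.15], seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable across runs of the same code on the same data.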

Train the quantile regressor on the training data.

In [None]:
from mmlspark import LightGBMRegressor
model = LightGBMRegressor(application='quantile',
                          alpha=0.2,
                          learningRate=0.3,
                          numLeaves=31).fit(train)
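
With the quantile objective, `alpha=0.2` asks the model for the conditional 20th percentile of the label rather than its mean. The objective behind this is the pinball (quantile) loss, which charges `alpha` per unit of under-prediction and `1 - alpha` per unit of over-prediction. The following is a plain-Python illustration on toy data, not LightGBM's implementation; `pinball_loss` and the toy labels are assumptions made for the example.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a quantile level alpha in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs alpha per unit;
        # over-prediction (diff < 0) costs (1 - alpha) per unit.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / float(len(y_true))

# Toy labels 1..10: search for the constant prediction with the lowest loss.
labels = [float(v) for v in range(1, 11)]
candidates = [c / 2.0 for c in range(2, 21)]  # 1.0, 1.5, ..., 10.0
best = min(candidates,
           key=lambda c: pinball_loss(labels, [c] * len(labels), 0.2))
print(best)
```

The best constant falls between the second and third smallest labels, which is where the 20th percentile of ten values lies. This is the sense in which minimizing the pinball loss estimates a quantile.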

Score the regressor on the test data.

In [None]:
scoredData = model.transform(test)
scoredData.limit(10).toPandas()
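
One quick sanity check for a quantile model is its empirical coverage: the fraction of test rows whose label is at or below the prediction should be roughly the requested quantile level (0.2 here), although on a small test set it can deviate noticeably. The sketch below works on plain Python lists. The `pairs` values are made up for illustration and `empirical_coverage` is a hypothetical helper.

```python
def empirical_coverage(labels, predictions):
    """Fraction of rows whose label is at or below the predicted quantile."""
    hits = sum(1 for y, p in zip(labels, predictions) if y <= p)
    return hits / float(len(labels))

# Made-up (label, prediction) pairs standing in for rows of scoredData.
pairs = [(0.50, 0.45), (0.62, 0.70), (0.71, 0.60), (0.80, 0.66), (0.90, 0.72)]
labels = [y for y, _ in pairs]
predictions = [p for _, p in pairs]
print(empirical_coverage(labels, predictions))
```

In the notebook, the pairs could be obtained with `scoredData.select("label", "prediction").collect()`, since the test set is small enough to bring to the driver.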

Compute regression metrics with ComputeModelStatistics.

In [None]:
from mmlspark import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric='regression',
                                 labelCol='label',
                                 scoresCol='prediction') \
            .transform(scoredData)
metrics.toPandas()
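
The regression metrics reported here are typically the mean squared error, root mean squared error, R^2, and mean absolute error. As a reference for reading the output, this plain-Python sketch computes the standard definitions on toy values. It assumes the usual textbook formulas and is not taken from the ComputeModelStatistics source; `regression_metrics` is a hypothetical helper.

```python
import math

def regression_metrics(labels, predictions):
    """Standard MSE, RMSE, R^2, and MAE for paired labels and predictions."""
    n = float(len(labels))
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # Total sum of squares around the label mean.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2, "mae": mae}

# Toy data: every prediction is 0.5 below its label.
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5])
print(m)
```

These metrics reward predictions close to the conditional mean. A model trained for the 0.2 quantile deliberately predicts low, so it tends to score worse on them than a mean regressor would, and the pinball loss or the empirical coverage is usually a fairer yardstick.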