## 102 - Training Regression Algorithms with the L-BFGS Solver

In this example, we train a linear regression model on the *Flight Delay* dataset to predict arrival delay times.

We demonstrate how to use the `MultiColumnAdapter`, `TrainRegressor` and `ComputePerInstanceStatistics` APIs.

First, import the packages.

In [None]:
import numpy as np
import mmlspark

Next, import the CSV dataset.

In [None]:
# load raw data from small-sized 30 MB CSV file (trimmed to contain just what we use)
dataFilePath = "On_Time_Performance_2012_9.csv"
import os, urllib.request  # urllib.request must be imported explicitly for urlretrieve
if not os.path.isfile(dataFilePath):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFilePath,
                               dataFilePath)

from pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([StructField("Quarter", DoubleType(), False),
    StructField("Month", DoubleType(), False),
    StructField("DayofMonth", DoubleType(), False),
    StructField("DayOfWeek", DoubleType(), False),
    StructField("Carrier", StringType(), False),
    StructField("OriginAirportID", StringType(), False),
    StructField("DestAirportID", StringType(), False),
    StructField("CRSDepTime", DoubleType(), False),
    StructField("DepTimeBlk", StringType(), False),
    StructField("CRSArrTime", DoubleType(), False),
    StructField("ArrDelay", DoubleType(), False),
    StructField("ArrTimeBlk", StringType(), False),
    StructField("Diverted", StringType(), False)])

flightDelay = spark.read.option("header", "true").csv(dataFilePath,
    schema=schema)

# Print information on the dataset we loaded
print("Records read: " + str(flightDelay.count()))
print("Schema:")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

Split the dataset into train and test sets.

In [None]:
train, test = flightDelay.randomSplit([0.75, 0.25])
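
To make the split above more concrete, here is a minimal pure-Python sketch of a weighted random split. It assumes that each row is assigned independently by a single uniform draw, which is how `randomSplit` is generally understood to behave, so the resulting sizes are only approximately 75% and 25%. The helper `random_split` and the integer "rows" are hypothetical and are not part of Spark or MMLSpark.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent uniform draw (illustrative only)."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split after normalizing the weights.
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

demo_rows = list(range(1000))
train_rows, test_rows = random_split(demo_rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

Because the assignment is random, rerunning the notebook can give a different train/test partition. `DataFrame.randomSplit` accepts an optional `seed` argument if a repeatable split is needed.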

Train a regressor on the training set using the `l-bfgs` solver.

In [None]:
from mmlspark import MultiColumnAdapter, SelectColumns, TrainRegressor, TrainedRegressorModel
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
# Convert columns to categorical
catCols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]
catColsOut = list(map(lambda s: s + "Out", catCols))
siModel = MultiColumnAdapter(baseStage=StringIndexer(handleInvalid="skip"),
                             inputCols=catCols,
                             outputCols=catColsOut)
numericColumns = catColsOut + ["Quarter", "DayofMonth", "DayOfWeek","OriginAirportID",
                               "DestAirportID","CRSDepTime", "CRSArrTime", "ArrDelay"]
selectNums = SelectColumns(cols=numericColumns)
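# Linear regression fitted with the L-BFGS solver and elastic-net regularization
lr = LinearRegression().setSolver("l-bfgs").setRegParam(0.1).setElasticNetParam(0.3)
trModel = TrainRegressor(model=lr, labelCol="ArrDelay")
# Fit the whole pipeline on the training set, then save the fitted model
model = Pipeline(stages=[siModel, selectNums, trModel]).fit(train)
modelName = "flightDelayModel.mml"
model.write().overwrite().save(modelName)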
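
The categorical conversion in the cell above can be pictured with a small pure-Python sketch. It assumes that `StringIndexer` assigns indices by descending label frequency (its default ordering) and that `MultiColumnAdapter` simply applies the base stage once per input/output column pair. The helpers `fit_string_index` and `index_columns`, the alphabetical tie-break, and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that tie-break is an assumption.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}

def index_columns(rows, input_cols, output_cols):
    """Fit one index per input column and write it to the paired output column."""
    mappings = {c: fit_string_index([r[c] for r in rows]) for c in input_cols}
    indexed = []
    for row in rows:
        new_row = dict(row)
        for src, dst in zip(input_cols, output_cols):
            new_row[dst] = mappings[src][row[src]]
        indexed.append(new_row)
    return indexed

# Made-up rows, for illustration only.
sample_rows = [
    {"Carrier": "AA", "DepTimeBlk": "0800-0859"},
    {"Carrier": "DL", "DepTimeBlk": "0800-0859"},
    {"Carrier": "AA", "DepTimeBlk": "1700-1759"},
]
sample_cat_cols = ["Carrier", "DepTimeBlk"]
indexed_rows = index_columns(sample_rows, sample_cat_cols,
                             [c + "Out" for c in sample_cat_cols])
print(indexed_rows[1])
```

Each `...Out` column holds a numeric index in place of the original string, which is what lets the categorical columns be passed on to the regressor alongside the numeric ones.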

Score the regressor on the test data.

In [None]:
flightDelayModel = PipelineModel.load(modelName)
scoredData = flightDelayModel.transform(test)
scoredData.limit(10).toPandas()

Compute model metrics against the entire scored dataset.

In [None]:
from mmlspark import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData)
metrics.toPandas()
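
For intuition about what dataset-level regression metrics measure, the sketch below computes the usual ones by hand: mean squared error, root mean squared error, mean absolute error, and R². This assumes `ComputeModelStatistics` reports metrics of this kind for a regressor. The helper `regression_metrics`, its dictionary keys, and the sample numbers are hypothetical and do not mirror the library's output column names.

```python
import math

def regression_metrics(labels, scores):
    """Compute standard regression metrics over all (label, score) pairs."""
    n = len(labels)
    errors = [y - s for y, s in zip(labels, scores)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # Total sum of squares around the mean label, used as the R^2 baseline.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Made-up delays (minutes) and predictions, for illustration only.
demo_metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(demo_metrics)
```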

Finally, compute and show per-instance statistics, demonstrating the usage
of `ComputePerInstanceStatistics`.

In [None]:
from mmlspark import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss").limit(10).toPandas()
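
The two loss columns selected above can be read as the absolute error and the squared error of each row. A minimal sketch, assuming `L1_loss` is `|label - score|` and `L2_loss` is `(label - score)^2`; the helper `per_instance_losses` and the sample values are hypothetical.

```python
def per_instance_losses(labels, scores):
    """Return the absolute (L1) and squared (L2) error for every row."""
    return [
        {"label": y, "score": s, "L1_loss": abs(y - s), "L2_loss": (y - s) ** 2}
        for y, s in zip(labels, scores)
    ]

# Made-up delays (minutes) and predictions, for illustration only.
demo_losses = per_instance_losses([10.0, -5.0], [12.5, -5.0])
for row in demo_losses:
    print(row)
```

Unlike the dataset-level metrics, these values are computed per row, so they can be sorted or filtered to find the flights the model predicts worst.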