## 102 - Training Regression Algorithms with the L-BFGS Solver

In this example, we run linear regression on the *Flight Delay* dataset to predict flight arrival delays (the `ArrDelay` column).

We demonstrate how to use the `TrainRegressor` and the `ComputePerInstanceStatistics` APIs.

First, import the packages.

In [None]:
import numpy as np
import pandas as pd
import mmlspark

Next, import the CSV dataset.

In [None]:
# load raw data from small-sized 30 MB CSV file (trimmed to contain just what we use)
dataFilePath = "On_Time_Performance_2012_9.csv"
import os, urllib.request  # urlretrieve lives in the urllib.request submodule
if not os.path.isfile(dataFilePath):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFilePath,
                               dataFilePath)
flightDelay = spark.createDataFrame(
    pd.read_csv(dataFilePath,
                dtype={"Month": np.float64, "Quarter": np.float64,
                       "DayofMonth": np.float64, "DayOfWeek": np.float64,
                       "OriginAirportID": np.float64, "DestAirportID": np.float64,
                       "CRSDepTime": np.float64, "CRSArrTime": np.float64}))
# Print information on the dataset we loaded
print("Records read: " + str(flightDelay.count()))
print("Schema:")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()
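
The cell above downloads the CSV only when it is not already on disk. A minimal sketch of that caching pattern as a reusable helper follows. The name `fetch_if_missing` and its `fetch` parameter are hypothetical, introduced here for illustration, and are not part of mmlspark. Note that `urllib.request` has to be imported explicitly; `import urllib` alone does not guarantee that the submodule is loaded.

```python
import os
import urllib.request


def fetch_if_missing(url, path, fetch=urllib.request.urlretrieve):
    """Download `url` to `path` unless `path` already exists.

    Hypothetical helper for illustration. `fetch` is injectable so the
    caching logic can be exercised without network access.
    Returns True if a download happened, False if the cached file was used.
    """
    if os.path.isfile(path):
        return False
    fetch(url, path)
    return True
```

With this helper, the download step in the cell could be written as `fetch_if_missing("https://mmlspark.azureedge.net/datasets/" + dataFilePath, dataFilePath)`.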

Split the dataset into train and test sets.

In [None]:
train, test = flightDelay.randomSplit([0.75, 0.25])
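
Because `randomSplit` assigns rows at random, the resulting sizes only approximate the 75/25 weights, and the split changes from run to run unless a seed is passed (PySpark's `randomSplit` accepts an optional `seed` argument). The pure-Python sketch below illustrates the idea of weighted random assignment. It is not Spark's implementation, and `random_split` is a hypothetical name.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability
    proportional to its weight. Illustrative only; not Spark's code."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.75, 0.25] -> [0.75, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any rounding slack in the bounds
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(range(1000), [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))  # roughly 750 and 250, not exactly
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.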

Train a regressor on the training set using the `l-bfgs` solver.

In [None]:
from mmlspark import MultiColumnAdapter, RenameColumn, TrainRegressor
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
# Convert columns to categorical
catCols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]
tmpTag = "Tmp"
catColsTmp = list(map(lambda colName: colName + tmpTag, catCols))
siModel = MultiColumnAdapter(StringIndexer(), inputCols=catCols, outputCol=catColsTmp)
renameColumns = MultiColumnAdapter(RenameColumn(), inputCols=catColsTmp, outputCol=catCols)

lr = LinearRegression().setSolver("l-bfgs").setRegParam(0.1).setElasticNetParam(0.3)
lrModel = TrainRegressor(model=lr, labelCol="ArrDelay")

# Use the TrainRegressor wrapper (lrModel), not the bare LinearRegression
model = Pipeline(stages=[siModel, renameColumns, lrModel]).fit(train)
model.write().overwrite().save("flightDelayModel.mml")
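
The `regParam` and `elasticNetParam` settings in the cell above control an elastic-net penalty that is added to the squared-error loss. As a rough sketch, assuming the formulation given in the Spark ML documentation, the penalty is `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The function name and the weight vector below are hypothetical and chosen only for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and (halved, squared) L2 norms.

    elastic_net_param = 0 gives a pure L2 penalty, 1 gives a pure L1 penalty.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix


# Hypothetical coefficients, with the same settings as the cell above
penalty = elastic_net_penalty([2.0, -1.0, 0.0], reg_param=0.1, elastic_net_param=0.3)
print(round(penalty, 3))  # 0.265
```

With `elasticNetParam=0.3`, most of the regularization weight falls on the L2 term, while the L1 term still pushes some coefficients toward zero.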

Score the regressor on the test data.

In [None]:
flightDelayModel = PipelineModel.load("flightDelayModel.mml")
scoredData = flightDelayModel.transform(test)
scoredData.limit(10).toPandas()

Compute model metrics over the entire scored dataset.

In [None]:
from mmlspark import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData)
metrics.toPandas()
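
To make the aggregate metrics concrete, here is a small pure-Python sketch of the statistics commonly reported for regression: mean squared error, root mean squared error, mean absolute error, and R². The function name, dictionary keys, and sample values are hypothetical; the column names that `ComputeModelStatistics` actually emits may differ.

```python
import math


def regression_metrics(labels, predictions):
    """Compute common regression metrics over paired labels and predictions.

    Assumes the labels are not all identical (otherwise R^2 is undefined).
    """
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_total = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - (mse * n) / ss_total
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical arrival delays (minutes) and model predictions
stats = regression_metrics([10.0, 0.0, -5.0, 20.0], [8.0, 1.0, -5.0, 16.0])
print(stats)
```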

Finally, compute and show per-instance statistics, demonstrating the usage
of `ComputePerInstanceStatistics`.

In [None]:
from mmlspark import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss").limit(10).toPandas()
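
The `L1_loss` and `L2_loss` columns hold one value per row. A minimal sketch, assuming the usual definitions (absolute error and squared error between label and score), shows what those values represent. The function name and sample data are hypothetical.

```python
def per_instance_losses(labels, scores):
    """Return one record per row with absolute (L1) and squared (L2) error."""
    return [
        {"label": y, "score": s, "L1_loss": abs(y - s), "L2_loss": (y - s) ** 2}
        for y, s in zip(labels, scores)
    ]


# Hypothetical arrival delays (minutes) and model scores
rows = per_instance_losses([10.0, 0.0, -5.0], [8.0, 1.0, -5.0])
for row in rows:
    print(row)
```

Sorting rows by `L2_loss` in descending order is a quick way to find the flights the model predicts worst.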