## 102 - Training Regression Algorithms with the L-BFGS Solver

In this example, we run a linear regression on the *Flight Delay* dataset to predict the delay times.

We demonstrate how to use the `TrainRegressor` and the `ComputePerInstanceStatistics` APIs.

First, import the packages.

In [0]:
import numpy as np
import pandas as pd
import mmlspark

Next, load the dataset from a Parquet file.

In [0]:
flightDelay = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/On_Time_Performance_2012_9.parquet")
# print some basic info
print("records read: " + str(flightDelay.count()))
print("Schema: ")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

Unnamed: 0,Quarter,Month,DayofMonth,DayOfWeek,Carrier,OriginAirportID,DestAirportID,CRSDepTime,DepTimeBlk,CRSArrTime,ArrDelay,ArrTimeBlk,Diverted
0,3,9,14,5,UA,12266,13495,1909,1900-1959,2022,-6.0,2000-2059,0.0
1,3,9,14,5,UA,14679,12266,1001,1000-1059,1518,-17.0,1500-1559,0.0
2,3,9,14,5,UA,11697,12266,1146,1100-1159,1335,-22.0,1300-1359,0.0
3,3,9,14,5,UA,12266,14747,1819,1800-1859,2102,-7.0,2100-2159,0.0
4,3,9,14,5,UA,12889,12266,818,0800-0859,1325,-13.0,1300-1359,0.0
5,3,9,14,5,UA,14908,11292,1110,1100-1159,1430,21.0,1400-1459,0.0
6,3,9,14,5,UA,11042,11697,850,0800-0859,1141,-14.0,1100-1159,0.0
7,3,9,14,5,UA,12266,14107,1420,1400-1459,1513,-2.0,1500-1559,0.0
8,3,9,14,5,UA,15304,12266,1112,1100-1159,1230,-8.0,1200-1259,0.0
9,3,9,14,5,UA,12266,12339,1903,1900-1959,2230,5.0,2200-2259,0.0


Split the dataset into train and test sets.

In [0]:
train,test = flightDelay.randomSplit([0.75, 0.25])
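
A note on reproducibility: `randomSplit` also accepts an optional `seed` argument, and because the call above omits it, the rows in `train` and `test` change from run to run. The sketch below is a pure-Python illustration of a seeded probabilistic split, not Spark's implementation. The helper `random_split` is hypothetical, and the per-row uniform draw is our assumption about how such a split behaves. It shows why the 75/25 ratio is only approximate.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to exactly one split using one uniform draw per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share, e.g. [0.75, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding slack
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))  # sizes are close to 750 and 250, not exact
```

Running the helper twice with the same seed returns identical splits, which is what makes later metric comparisons meaningful.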

Train a regressor on the dataset with `l-bfgs`.

# Notes
We expect the data in this form:

In [0]:
from mmlspark.train import TrainRegressor, TrainedRegressorModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
# Convert columns to categorical
catCols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]
trainCat = train
testCat = test

`StringIndexer` is the analogue of `LabelEncoder` in scikit-learn.

In [0]:

for catCol in catCols:
    simodel = StringIndexer(inputCol=catCol, outputCol=catCol + "Tmp").fit(train) # fit the StringIndexer on the training set only
    trainCat = simodel.transform(trainCat).drop(catCol).withColumnRenamed(catCol + "Tmp", catCol)
    testCat = simodel.transform(testCat).drop(catCol).withColumnRenamed(catCol + "Tmp", catCol)
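
To make the indexing step concrete, here is a minimal pure-Python sketch of the idea: learn a label-to-index mapping on the training values, ordered by descending frequency, then apply the same mapping to the test values. The helpers `fit_string_index` and `transform_string_index` and the sample carrier lists are hypothetical. The alphabetical tie-break is an assumption and may differ from the behavior of your Spark version.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_string_index(mapping, values):
    """Replace labels by their learned indices; an unseen label raises KeyError."""
    return [mapping[v] for v in values]

# Made-up sample values, for illustration only
train_carriers = ["UA", "DL", "UA", "AA", "DL", "UA"]
test_carriers = ["DL", "UA"]

carrier_index = fit_string_index(train_carriers)
print(carrier_index)  # {'UA': 0.0, 'DL': 1.0, 'AA': 2.0}
print(transform_string_index(carrier_index, test_carriers))  # [1.0, 0.0]
```

Because the mapping is learned on the training set only, a label that appears only in the test set has no index. In Spark, the `handleInvalid` parameter of `StringIndexer` controls what happens in that case.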

In [0]:
trainCat.limit(10).toPandas()

Unnamed: 0,Quarter,Month,DayofMonth,DayOfWeek,OriginAirportID,DestAirportID,CRSDepTime,CRSArrTime,ArrDelay,Diverted,Carrier,DepTimeBlk,ArrTimeBlk
0,3,9,1,6,11042,13303,1715,2015,-1.0,0.0,6.0,2.0,5.0
1,3,9,1,6,11057,13303,1340,1550,-9.0,0.0,6.0,5.0,6.0
2,3,9,1,6,11066,12953,1510,1655,-22.0,0.0,6.0,11.0,0.0
3,3,9,1,6,11066,13303,1815,2100,-7.0,0.0,6.0,13.0,12.0
4,3,9,1,6,11193,13303,1815,2100,-26.0,0.0,6.0,13.0,12.0
5,3,9,1,6,11278,12478,905,1015,-7.0,0.0,6.0,10.0,3.0
6,3,9,1,6,11278,12478,1925,2055,-10.0,0.0,6.0,12.0,5.0
7,3,9,1,6,11278,14492,1830,1935,-17.0,0.0,6.0,13.0,9.0
8,3,9,1,6,11298,10781,2000,2115,-7.0,0.0,6.0,14.0,12.0
9,3,9,1,6,11298,11995,745,1115,2.0,0.0,6.0,4.0,8.0


In [0]:
# LinearRegression()

In [0]:
# Select the L-BFGS solver explicitly; the default solver is "auto"
lr = LinearRegression().setSolver("l-bfgs").setRegParam(0.1).setElasticNetParam(0.3)
model = TrainRegressor(model=lr, labelCol="ArrDelay").fit(trainCat)
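
The two regularization settings are easier to read against the elastic net penalty. In the objective described in the Spark ML documentation, `regParam` is the overall strength λ and `elasticNetParam` is the mixing weight α between the L1 and L2 terms. The sketch below computes only the penalty term λ[(1 − α)/2 · ‖w‖₂² + α · ‖w‖₁] for a made-up weight vector. The helper `elastic_net_penalty` is hypothetical, and the sketch ignores the feature standardization that Spark applies by default.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * ((1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * ((1.0 - elastic_net_param) / 2.0 * l2_squared
                        + elastic_net_param * l1)

# Made-up weights; reg_param and elastic_net_param match the cell above
penalty = elastic_net_penalty([1.0, -2.0], reg_param=0.1, elastic_net_param=0.3)
print(round(penalty, 6))  # 0.265
```

With α = 0.3 the penalty is mostly ridge-like (L2), with a smaller lasso (L1) share that can push some coefficients to zero.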

Save the trained model, load it back, and score the test data.

In [0]:
import random
model_name = "flightDelayModel_{}.mml".format(random.randint(1, 25))
model.write().overwrite().save(model_name)
flightDelayModel = TrainedRegressorModel.load(model_name)

scoredData = flightDelayModel.transform(testCat) # the `scores` column is of type double
scoredData.limit(10).toPandas()

Unnamed: 0,Quarter,Month,DayofMonth,DayOfWeek,OriginAirportID,DestAirportID,CRSDepTime,CRSArrTime,ArrDelay,Diverted,Carrier,DepTimeBlk,ArrTimeBlk,scores
0,3,9,1,6,10397,13303,1640,1845,-19.0,0.0,6.0,9.0,1.0,4.743162
1,3,9,1,6,11140,11298,600,720,-3.0,0.0,6.0,3.0,15.0,-4.950821
2,3,9,1,6,11278,12478,1030,1140,-16.0,0.0,6.0,1.0,8.0,-1.422851
3,3,9,1,6,11298,14108,955,1145,-6.0,0.0,6.0,10.0,8.0,-2.322877
4,3,9,1,6,12191,11298,1445,1555,7.0,0.0,6.0,8.0,6.0,3.543184
5,3,9,1,6,13303,10397,1410,1605,-13.0,0.0,6.0,8.0,0.0,3.618645
6,3,9,1,6,13303,11057,1105,1305,1.0,0.0,6.0,7.0,4.0,-1.058956
7,3,9,1,6,13303,12451,1745,1905,-13.0,0.0,6.0,2.0,9.0,5.574811
8,3,9,1,6,13303,14122,800,1045,56.0,0.0,6.0,0.0,3.0,-2.425202
9,3,9,1,6,10140,14107,1540,1556,-5.0,0.0,3.0,11.0,6.0,5.883286


Compute model metrics against the entire scored dataset.

In [0]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData) # which metrics get computed is decided by convention
metrics.toPandas()

Unnamed: 0,mean_squared_error,root_mean_squared_error,R^2,mean_absolute_error
0,1097.90065,33.134584,0.045032,17.499942
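
For readers who want to check what the four columns mean, here is a minimal pure-Python sketch of the standard definitions of these metrics. The helper `regression_metrics` is hypothetical and the three sample points are made up. We assume, but have not verified, that `ComputeModelStatistics` uses the same textbook formulas.

```python
import math

def regression_metrics(labels, scores):
    """Standard regression metrics for paired true labels and predictions."""
    n = len(labels)
    errors = [y - s for y, s in zip(labels, scores)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    mse = ss_res / n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - ss_res / ss_tot,
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
    }

# Made-up labels and predictions, for illustration only
metrics_example = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics_example["R^2"])  # 0.5
```

Read this way, the table above is internally consistent (33.134584 squared is about 1097.90), and an R^2 of about 0.045 means the model explains only around 4.5% of the variance in `ArrDelay`.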


Finally, compute and show per-instance statistics, demonstrating the usage
of `ComputePerInstanceStatistics`.

In [0]:
from mmlspark.train import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss").limit(10).toPandas()

Unnamed: 0,ArrDelay,Scores,L1_loss,L2_loss
0,-19.0,4.743162,23.743162,563.737764
1,-3.0,-4.950821,1.950821,3.805702
2,-16.0,-1.422851,14.577149,212.493263
3,-6.0,-2.322877,3.677123,13.521236
4,7.0,3.543184,3.456816,11.949578
5,-13.0,3.618645,16.618645,276.179366
6,1.0,-1.058956,2.058956,4.239298
7,-13.0,5.574811,18.574811,345.0236
8,56.0,-2.425202,58.425202,3413.504183
9,-5.0,5.883286,10.883286,118.445922
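
The two loss columns can be checked by hand: in the rows shown above, `L1_loss` matches the absolute error and `L2_loss` matches the squared error between `ArrDelay` and `Scores`. The sketch below recomputes row 1 in plain Python. The helper `per_instance_losses` is hypothetical, and small differences are expected because the displayed scores are rounded.

```python
def per_instance_losses(label, score):
    """Return (absolute error, squared error) for a single prediction."""
    residual = label - score
    return abs(residual), residual * residual

# Row 1 of the output above: ArrDelay = -3.0, Scores = -4.950821
l1_loss, l2_loss = per_instance_losses(-3.0, -4.950821)
print(round(l1_loss, 6), round(l2_loss, 4))
```

Sorting by `L2_loss` is a quick way to find the flights the model predicts worst, such as row 8, where a 56-minute delay was scored as about -2.4.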
