# 105 - Training Regressions 

This example notebook is similar to "102 - Training Regression Algorithms with the L-BFGS Solver". In this example, we convert the listed data columns to double with `DataConversion()` instead of making them categorical columns.

This sample demonstrates how to use the following APIs:
- `TrainRegressor`: [TrainRegressor](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- `ComputePerInstanceStatistics`: [ComputePerInstanceStatistics](http://mmlspark.azureedge.net/docs/pyspark/ComputePerInstanceStatistics.html)
- `DataConversion`: [DataConversion](http://mmlspark.azureedge.net/docs/pyspark/DataConversion.html)

First, import the required packages.

In [None]:
import numpy as np
import pandas as pd
import mmlspark

Next, download the CSV dataset if it is not already present, and load it into a Spark DataFrame.

In [None]:
dataFile = "On_Time_Performance_2012_9.csv"
import os
import urllib.request  # the request submodule must be imported explicitly
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/"+dataFile, dataFile)
flightDelay = spark.createDataFrame(pd.read_csv(dataFile))
# Print some basic info
print("records read: " + str(flightDelay.count()))
print("Schema: ")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

Use the `DataConversion` API to convert the columns listed to double.

In [None]:
from mmlspark import DataConversion
flightDelay = DataConversion(col="Quarter,Month,DayofMonth,DayOfWeek,OriginAirportID,DestAirportID,CRSDepTime,CRSArrTime",
                             convertTo="double").transform(flightDelay)
flightDelay.printSchema()
flightDelay.limit(10).toPandas()
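
The effect of this conversion can be illustrated without Spark. The sketch below is plain Python, not MMLSpark's implementation: `convert_columns` is a hypothetical helper, and the sample row is made up. It assumes the column list is passed as one comma-separated string, as in the cell above, and casts only the named columns.

```python
def convert_columns(rows, col_spec, cast=float):
    """Return copies of the rows with the named columns cast (hypothetical helper)."""
    # Split the comma-separated specification into individual column names.
    cols = [name.strip() for name in col_spec.split(",")]
    converted = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name in cols:
            if new_row[name] is not None:  # keep missing values as they are
                new_row[name] = cast(new_row[name])
        converted.append(new_row)
    return converted


# Made-up sample row; "Carrier" is not listed, so it stays a string.
sample_rows = [{"Quarter": 3, "Month": 9, "Carrier": "AA"}]
converted_rows = convert_columns(sample_rows, "Quarter,Month")
print(converted_rows)
```

Columns that are not listed keep their original type, which is why the string columns still need separate handling before training.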

Split the dataset into train and test sets.

In [None]:
train, test = flightDelay.randomSplit([0.75, 0.25])
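
A weighted random split assigns each row to a bucket independently, so the resulting sizes only approximate the 75/25 weights. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper written for this illustration, not Spark's implementation. Passing a seed makes the split repeatable, and `DataFrame.randomSplit` also accepts an optional `seed` argument for the same purpose.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one bucket with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed gives a reproducible split
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket catches any floating-point rounding in the bounds.
            if draw < upper or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets


train_rows, test_rows = random_split(range(1000), [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```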

Train a linear regressor on the training set using the L-BFGS solver.

In [None]:
from mmlspark import TrainRegressor, TrainedRegressorModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer

# Convert columns to categorical, fitting each indexer on the training set only
catCols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]
trainCat = train
testCat = test
for catCol in catCols:
    siModel = StringIndexer(inputCol=catCol, outputCol=catCol+"Tmp").fit(train)
    # Chain the transforms so that every column in catCols ends up indexed
    trainCat = siModel.transform(trainCat).drop(catCol).withColumnRenamed(catCol+"Tmp", catCol)
    testCat = siModel.transform(testCat).drop(catCol).withColumnRenamed(catCol+"Tmp", catCol)
lr = LinearRegression().setSolver("l-bfgs").setRegParam(0.1).setElasticNetParam(0.3)
model = TrainRegressor(model=lr, labelCol="ArrDelay").fit(trainCat)
model.write().overwrite().save("flightDelayModel.mml")
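
The indexing step in the cell above fits a label-to-index mapping on the training data and then applies that same mapping to both splits, so a given carrier gets the same index everywhere. A minimal plain-Python sketch of the idea follows. It assumes the default `StringIndexer` ordering (most frequent label gets index 0); `fit_string_index` and the carrier values are hypothetical, and the alphabetical tie-break is an assumption made here to keep the output deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up training values: "AA" appears 3 times, "DL" twice, "UA" once.
train_carriers = ["AA", "DL", "AA", "UA", "DL", "AA"]
mapping = fit_string_index(train_carriers)

# Apply the mapping learned on the training values to unseen test values.
test_carriers = ["UA", "AA"]
indexed_test = [mapping[carrier] for carrier in test_carriers]
print(mapping, indexed_test)
```

In this sketch a label that appears only in the test values would raise a `KeyError`, which is one reason to check how unseen labels are handled before scoring.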

Score the regressor on the test data.

In [None]:
flightDelayModel = TrainedRegressorModel.load("flightDelayModel.mml")
scoredData = flightDelayModel.transform(testCat)
scoredData.limit(10).toPandas()

Compute model metrics against the entire scored dataset.

In [None]:
from mmlspark import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData)
metrics.toPandas()
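
For reference, the standard whole-dataset regression metrics can be computed by hand. The sketch below is plain Python on made-up labels and predictions. The dictionary keys are chosen for illustration and are an assumption, not necessarily the column names that `ComputeModelStatistics` returns.

```python
import math


def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, MAE and R^2 for paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    mse = ss_res / n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "R^2": 1.0 - ss_res / ss_tot,
    }


# Made-up arrival delays (minutes) and model predictions.
example_labels = [10.0, 0.0, -5.0, 15.0]
example_predictions = [8.0, 1.0, -3.0, 12.0]
example_metrics = regression_metrics(example_labels, example_predictions)
print(example_metrics)
```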

Finally, compute and show per-instance statistics with `ComputePerInstanceStatistics`.

In [None]:
from mmlspark import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss").limit(10).toPandas()
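
The two loss columns selected above can be read as per-row errors. The sketch below assumes the usual definitions, absolute error for `L1_loss` and squared error for `L2_loss`, and uses made-up values; `per_instance_losses` is a hypothetical helper, not part of MMLSpark.

```python
def per_instance_losses(labels, scores):
    """Return one record per row with its absolute (L1) and squared (L2) error."""
    records = []
    for label, score in zip(labels, scores):
        diff = label - score
        records.append({
            "label": label,
            "score": score,
            "L1_loss": abs(diff),   # absolute error for this row
            "L2_loss": diff * diff,  # squared error for this row
        })
    return records


# Made-up arrival delays (minutes) and model scores.
loss_records = per_instance_losses([10.0, 0.0, -5.0], [8.0, 1.0, -3.0])
for record in loss_records:
    print(record)
```

Averaging `L1_loss` over all rows gives the mean absolute error, and averaging `L2_loss` gives the mean squared error, which ties these per-row values back to the dataset-level metrics.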