# 104 - Train, Test, Evaluate for Regression with Auto Imports Dataset

This sample notebook is based on the Gallery experiment "Sample 6: Train, Test, Evaluate for Regression: Auto Imports Dataset" for AzureML Studio. It demonstrates how to build a regression model that predicts an automobile's price. The process includes training, testing, and evaluating the model on the Auto Imports dataset.

This sample demonstrates the use of several members of the mmlspark library:
- TrainRegressor
- SummarizeData
- CleanMissingData
- ComputeModelStatistics
- FindBestModel

First, import the required packages.

In [None]:
import numpy as np
import pandas as pd
import mmlspark

Declare the schema for the data that will be read in. Allow all fields to be nullable, so that missing values can be handled appropriately, such as replacing them with the mean or median value for that column.

In [None]:
from pyspark.sql.types import LongType, StringType, DoubleType, StructType, StructField

tableSchema = StructType([StructField("symboling",         LongType(),   True),
                          StructField("normalized-losses", DoubleType(), True),
                          StructField("make",              StringType(), True),
                          StructField("fuel-type",         StringType(), True),
                          StructField("aspiration",        StringType(), True),
                          StructField("body-style",        StringType(), True),
                          StructField("drive-wheels",      StringType(), True),
                          StructField("engine-location",   StringType(), True),
                          StructField("wheel-base",        DoubleType(), True),
                          StructField("length",            DoubleType(), True),
                          StructField("width",             DoubleType(), True),
                          StructField("height",            DoubleType(), True),
                          StructField("curb-weight",       LongType(),   True),
                          StructField("engine-type",       StringType(), True),
                          StructField("num-of-cylinders",  StringType(), True),
                          StructField("engine-size",       LongType(),   True),
                          StructField("fuel-system",       StringType(), True),
                          StructField("bore",              DoubleType(), True),
                          StructField("stroke",            DoubleType(), True),
                          StructField("compression-ratio", DoubleType(), True),
                          StructField("horsepower",        DoubleType(), True),
                          StructField("peak-rpm",          DoubleType(), True),
                          StructField("city-mpg",          LongType(),   True),
                          StructField("highway-mpg",       LongType(),   True),
                          StructField("price",             DoubleType(), True)])


Read the data from the AutomobilePriceRaw.csv file into a pandas dataframe, then convert it to a Spark DataFrame using the schema declared above. Specify the possible representations of missing values, and drop the 'num-of-doors' column as the data is read in.

In [None]:
dataFile = "AutomobilePriceRaw.csv"
import os, urllib.request
if not os.path.isfile(dataFile):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/"+dataFile, dataFile)
data = spark.createDataFrame(pd.read_csv(dataFile,
                                         na_values=['',' ','?'],
                                         usecols=lambda x: x not in ['num-of-doors']), tableSchema)

Summarize the data using `SummarizeData` and print the summary. Note that several columns have missing values (normalized-losses, bore, stroke, horsepower, peak-rpm, price).

In [None]:
from mmlspark import SummarizeData
summary = SummarizeData().transform(data)
summary.toPandas()

Now use the `CleanMissingData` API to replace the missing values with something more meaningful. In this case, we replace missing values in the listed numeric columns with the median value of each column. Then summarize again and note the differences: every column now has 205 values, and none of the rows contains missing values. Also notice that the boundaries of the quartile bins have shifted slightly because of the missing value replacement.

In [None]:
from mmlspark import CleanMissingData
cols = ["normalized-losses", "stroke", "bore", "horsepower", "peak-rpm", "price"]
cleanModel = CleanMissingData(cleaningMode="Median", inputCols=cols, outputCols=cols).fit(data)
data = cleanModel.transform(data)
summary = SummarizeData().transform(data)
summary.toPandas()
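To make the idea of median replacement concrete, here is a minimal plain-Python sketch. It illustrates the concept only and is not how `CleanMissingData` is implemented; the helper `fill_missing_with_median` and the toy rows are hypothetical and are not taken from the Auto Imports data.

```python
from statistics import median

def fill_missing_with_median(rows, columns):
    """Return copies of `rows` with None in `columns` replaced by that column's median."""
    # Compute each column's median from the observed (non-missing) values only.
    medians = {
        col: median(row[col] for row in rows if row[col] is not None)
        for col in columns
    }
    # Build new row dictionaries so the input rows are left unchanged.
    return [
        {key: (medians[key] if key in medians and value is None else value)
         for key, value in row.items()}
        for row in rows
    ]

# Hypothetical toy rows, used only for illustration.
toy_rows = [
    {"make": "a", "horsepower": 100.0},
    {"make": "b", "horsepower": None},
    {"make": "c", "horsepower": 150.0},
    {"make": "d", "horsepower": 110.0},
]
cleaned = fill_missing_with_median(toy_rows, ["horsepower"])
print([row["horsepower"] for row in cleaned])
```

Because the median is computed from the observed values only, a single extreme value has less influence on the replacement than it would with the mean.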

Split the dataset into train and test datasets.

In [None]:
#split the data into training and testing datasets
train, test = data.randomSplit([0.6, 0.4], seed=123)
train.limit(10).toPandas()

Create a Poisson regression model using the `GeneralizedLinearRegression` API from Spark and train it on the train dataset.

In [None]:
#train Poisson Regression Model
from pyspark.ml.regression import GeneralizedLinearRegression
from mmlspark import TrainRegressor

glr = GeneralizedLinearRegression(family="poisson", link="log")
poissonModel = TrainRegressor(model=glr, labelCol="price", numFeatures=256).fit(train)
poissonPrediction = poissonModel.transform(test)
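With a log link, a generalized linear model predicts the mean of the label as the exponential of a linear combination of the features, so predictions are always positive, which suits a quantity such as price. The sketch below shows only that mapping; the function name, coefficients, and feature values are hypothetical and are not taken from the trained model.

```python
import math

def predict_with_log_link(features, weights, intercept):
    """Map a linear predictor to a positive mean via the inverse of the log link."""
    # Linear predictor: intercept plus the weighted sum of the features.
    linear_predictor = intercept + sum(w * x for w, x in zip(weights, features))
    # Inverse of the log link is the exponential function.
    return math.exp(linear_predictor)

# Hypothetical coefficients and feature values, chosen only for illustration.
weights = [0.5, -0.25]
intercept = 1.0
prediction = predict_with_log_link([2.0, 4.0], weights, intercept)
print(prediction)
```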

Next, create a Random Forest regression model using the `RandomForestRegressor` API from Spark and train it on the train dataset.

In [None]:
#train Random Forest regression on the same training data:
from pyspark.ml.regression import RandomForestRegressor

rfr = RandomForestRegressor(maxDepth=30, maxBins=128, numTrees=8, minInstancesPerNode=1)
randomForestModel = TrainRegressor(model=rfr, labelCol="price", numFeatures=256).fit(train)
randomForestPrediction = randomForestModel.transform(test)

Compute basic statistics for the Poisson model using `ComputeModelStatistics`.

In [None]:
# Use ComputeModelStatistics to evaluate the Poisson regression model:
from mmlspark import ComputeModelStatistics
poissonMetrics = ComputeModelStatistics().transform(poissonPrediction)
print("Poisson Metrics")
poissonMetrics.toPandas()

Similarly, compute the statistics for the Random Forest model.

In [None]:
# Use ComputeModelStatistics to evaluate the Random Forest regression model:
randomForestMetrics = ComputeModelStatistics().transform(randomForestPrediction)
print("Random Forest Metrics")
randomForestMetrics.toPandas()
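As a reference for reading these tables, the following plain-Python sketch computes the standard regression metrics (mean squared error, root mean squared error, R², and mean absolute error) from their textbook definitions. It is not the `ComputeModelStatistics` implementation, and the dictionary keys are illustrative names that may differ from the column names in the output above. The toy labels and predictions are hypothetical.

```python
import math

def regression_metrics(labels, predictions):
    """Compute common regression metrics from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    # Mean squared error and mean absolute error over all rows.
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # R^2 compares the residual sum of squares with the variation around the label mean.
    mean_label = sum(labels) / n
    total_ss = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - (mse * n) / total_ss
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2, "mae": mae}

# Hypothetical toy values, used only for illustration.
toy_labels = [10.0, 20.0, 30.0, 40.0]
toy_predictions = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(toy_labels, toy_predictions)
print(metrics)
```

Lower error values and an R² closer to 1 indicate a better fit on the evaluated data.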

Determine which model is better, using `FindBestModel` with the `evaluationMetric` set to "r2", and compute the model statistics for that model.

In [None]:
# for a given metric, find the better model
from mmlspark import FindBestModel
rModels = [poissonModel, randomForestModel]
bestModel = FindBestModel(models=rModels, evaluationMetric="r2").fit(test)
prediction = bestModel.transform(test)
statistics = ComputeModelStatistics().transform(prediction)
statistics.toPandas()
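Conceptually, choosing by "r2" means scoring each candidate on the same evaluation data and keeping the one with the highest R². The sketch below illustrates that selection in plain Python under that assumption; it does not reproduce the internals of `FindBestModel`, and the helper names and toy predictions are hypothetical.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_label = sum(labels) / len(labels)
    residual_ss = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    total_ss = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - residual_ss / total_ss

def pick_best_by_r2(labels, candidates):
    """Return the name of the candidate with the highest R^2, plus all scores."""
    scores = {name: r_squared(labels, preds) for name, preds in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores

# Hypothetical labels and per-model predictions, used only for illustration.
toy_labels = [10.0, 20.0, 30.0, 40.0]
toy_candidates = {
    "candidate_a": [12.0, 18.0, 33.0, 37.0],
    "candidate_b": [11.0, 19.0, 31.0, 39.0],
}
best_name, scores = pick_best_by_r2(toy_labels, toy_candidates)
print(best_name, scores)
```

Note that the notebook selects the model on the same test data it then reports statistics for, so the reported metrics describe the winner on the data used to choose it.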

Save the best model for use in scoring.

In [None]:
bestModel.write().overwrite().save("flightDelayBestModel.mml")