In [1]:
%load_ext sparksql_magic

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()

## Evaluating a Regression Model

In this exercise, you will create a pipeline for a linear regression model, and then test and evaluate the model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [2]:
# Import Spark SQL and Spark ML libraries
# (import only what is used: a wildcard import from pyspark.sql.functions
# shadows Python built-ins such as sum and max)
from pyspark.sql.functions import col

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Load the source data
csv = spark.read.csv('../data/flights.csv', inferSchema=True, header=True)

# Select features and label
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay", col("ArrDelay").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
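
To make the split less of a black box, here is a minimal pure-Python sketch of the idea behind a weighted random split: each row independently draws a random number and lands in the first split with probability 0.7. This is an illustration of the concept, not Spark's implementation; the helper name `random_split` and its `seed` argument are assumptions made for this sketch. It shows why the split sizes are only approximately 70/30, and why fixing a seed makes the split repeatable.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent weighted random draw.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    total = sum(weights)
    # Cumulative upper bounds in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))  # close to, but rarely exactly, 700 and 300
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`, which should make the split repeatable as long as the input data and its partitioning stay the same.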

### Define the Pipeline and Train the Model
Now define a pipeline that assembles the feature columns into a single vector and trains a linear regression model:

In [3]:
# Define the pipeline
assembler = VectorAssembler(
    inputCols=["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"],
    outputCol="features",
)
lr = LinearRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.3)
pipeline = Pipeline(stages=[assembler, lr])

# Train the model
pipelineModel = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data. Transforming the test data adds a **prediction** column alongside the existing columns.

In [4]:
prediction = pipelineModel.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show()

+--------------------+-------------------+---------+
|            features|         prediction|trueLabel|
+--------------------+-------------------+---------+
|[1.0,1.0,10140.0,...|-3.5792773856765576|      -12|
|[1.0,1.0,10140.0,...|-3.5792773856765576|       -9|
|[1.0,1.0,10140.0,...| -8.762752216152363|      -14|
|[1.0,1.0,10140.0,...|-3.7747128464894155|      -12|
|[1.0,1.0,10140.0,...|-12.760665579872075|       -9|
|[1.0,1.0,10140.0,...|-11.763057705939486|       -6|
|[1.0,1.0,10140.0,...|-5.7774104623439495|       -9|
|[1.0,1.0,10140.0,...|-5.7774104623439495|       -5|
|[1.0,1.0,10140.0,...| -3.782194714478771|       -5|
|[1.0,1.0,10140.0,...|-1.7869789666135916|       -1|
|[1.0,1.0,10140.0,...|  34.12690449495963|       55|
|[1.0,1.0,10140.0,...|-13.759633793439095|      -15|
|[1.0,1.0,10140.0,...| -9.769202297708736|      -25|
|[1.0,1.0,10140.0,...| -4.781162928045788|       -9|
|[1.0,1.0,10140.0,...|-2.7859471801806093|       -3|
|[1.0,1.0,10140.0,...| -4.983626810303192|    
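
Each value in the **prediction** column is a weighted sum of the feature vector plus an intercept. The sketch below shows that calculation in plain Python. The coefficient and intercept values are made-up placeholders, not the values learned by the model above; the last two feature values are also invented, since the output above truncates the vector; and `predict` is a hypothetical helper name.

```python
def predict(features, coefficients, intercept):
    """Linear regression prediction: intercept + sum(coefficient_i * feature_i)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Feature order matches the assembler's inputCols:
# DayofMonth, DayOfWeek, OriginAirportID, DestAirportID, DepDelay
# (example row; the last two values are invented for illustration)
features = [1.0, 1.0, 10140.0, 10397.0, -5.0]

# Placeholder parameters chosen for illustration only (NOT the fitted values)
coefficients = [0.0, 0.0, 0.0, 0.0, 1.0]
intercept = -3.0

print(predict(features, coefficients, intercept))
```

For a fitted pipeline like this one, the last stage is a `LinearRegressionModel`, whose `coefficients` and `intercept` attributes hold the learned parameters.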

### Examine the Predicted and Actual Values
You can plot the predicted values against the actual values to see how accurate the model's predictions are. For a perfect model, the scatter plot would form a straight diagonal line, with every predicted value identical to the actual value; in practice, some scatter around that line is to be expected.
Run the cells below to create a temporary view from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. If your notebook environment has built-in charting, you can then display the results as a scatter plot, specifying **-** as the aggregation function so that the unaggregated values are shown.

In [5]:
predicted.createOrReplaceTempView("regressionPredictions")

In [6]:
%%sparksql
SELECT trueLabel, prediction FROM regressionPredictions

only showing top 20 row(s)


| trueLabel | prediction |
|-----------|------------|
| -12 | -3.5792773856765576 |
| -9 | -3.5792773856765576 |
| -14 | -8.762752216152363 |
| -12 | -3.7747128464894155 |
| -9 | -12.760665579872075 |
| -6 | -11.763057705939486 |
| -9 | -5.7774104623439495 |
| -5 | -5.7774104623439495 |
| -5 | -3.782194714478771 |


### Retrieve the Root Mean Square Error (RMSE)
Several metrics measure the difference between predicted and actual values. Of these, the root mean square error (RMSE) is commonly used because it is expressed in the same units as the label, so in this case it is a number of minutes. RMSE is the square root of the mean squared difference between the predicted and actual flight delays; it can be read as a typical prediction error, although large errors count for more than they would in a simple average. You can use the **RegressionEvaluator** class to retrieve the RMSE.


In [7]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)
print("Root Mean Square Error (RMSE):", rmse)

Root Mean Square Error (RMSE): 13.26915556694366
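
To see what the evaluator computes, the sketch below calculates RMSE by hand in plain Python for the nine rows listed in the SQL output earlier, alongside the mean absolute error (MAE) for comparison. Because it uses only those nine rows rather than the full test set, its result will not match the evaluator's figure above. The helper names `rmse_by_hand` and `mae_by_hand` are assumptions made for this sketch.

```python
import math

def rmse_by_hand(actual, predicted):
    """Root mean square error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae_by_hand(actual, predicted):
    """Mean absolute error: mean(|actual - predicted|)."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

# The nine (trueLabel, prediction) pairs shown in the SQL output
true_labels = [-12, -9, -14, -12, -9, -6, -9, -5, -5]
predictions = [
    -3.5792773856765576, -3.5792773856765576, -8.762752216152363,
    -3.7747128464894155, -12.760665579872075, -11.763057705939486,
    -5.7774104623439495, -5.7774104623439495, -3.782194714478771,
]

# RMSE is never smaller than MAE, because squaring gives large errors more weight
print("RMSE (9 rows):", rmse_by_hand(true_labels, predictions))
print("MAE  (9 rows):", mae_by_hand(true_labels, predictions))
```

`RegressionEvaluator` can report other metrics through its `metricName` parameter, including `"mae"`, `"mse"` and `"r2"`, which is a convenient way to cross-check a model against more than one measure.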
