# Linear Regression: Improving our model

In this notebook we will add more features to our model and discuss how to handle categorical features.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - One Hot Encode categorical variables
 - Use the Pipeline API
 - Save and load models

#### Importing modules and disabling MLflow autologging

In [0]:
import os
import mlflow
mlflow.autolog(disable=True)

### Setting the default database and user name  
##### Replace "renato" with your name in the `username` variable.

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[2]: DataFrame[key: string, value: string]

In [0]:
deltaPath = os.path.join("/", "tmp", username)    # os.path and os.mkdir below act on the driver's local filesystem, not on DBFS
if not os.path.exists(deltaPath):
    os.mkdir(deltaPath)
    
print(deltaPath)

airbnbDF = spark.read.format("delta").load(deltaPath)

/tmp/renato


In [0]:
airbnbDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,6.0,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,1.0,2.0,1.0,14.0,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,3.0,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,1.0,3.0,5.0,350.0,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,19.0,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,2.0,4.0,1.0,181.0,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,100.0,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,3.0,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,1.0,2.0,5.0,347.0,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,1.0,1.0,3.0,52.0,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,4.0,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,117.0,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,2.0,1.0,3.0,69.0,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,t,1.0,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,2.0,2.0,3.0,50.0,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,2.0,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4.0,1.0,2.0,5.0,178.0,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Checking Null Values in the Spark DataFrame

In [0]:
# Count the null values in each column (this runs one Spark job per column)
for col in airbnbDF.columns:
    print(col, ":", airbnbDF.filter(f"{col} is NULL").count())

host_is_superhost : 0
instant_bookable : 0
host_total_listings_count : 2
neighbourhood_cleansed : 0
latitude : 0
longitude : 0
property_type : 0
room_type : 0
accommodates : 0
bedrooms : 0
beds : 0
minimum_nights : 0
number_of_reviews : 0
review_scores_rating : 0
review_scores_accuracy : 0
review_scores_cleanliness : 0
review_scores_checkin : 0
review_scores_communication : 0
review_scores_location : 0
review_scores_value : 0
price : 0
bedrooms_na : 0
beds_na : 0
review_scores_rating_na : 0
review_scores_accuracy_na : 0
review_scores_cleanliness_na : 0
review_scores_checkin_na : 0
review_scores_communication_na : 0
review_scores_location_na : 0
review_scores_value_na : 0


#### Imputing Null Values  
[Python](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.fillna.html)

In [0]:
airbnbDF.filter("host_total_listings_count is NULL").display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,,Mariahilf,48.19336,16.34596,Entire rental unit,Entire home/apt,3.0,1.0,2.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,,Leopoldstadt,48.22447,16.38696,Entire rental unit,Entire home/apt,2.0,1.0,1.0,10.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,80.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
# Replace nulls in numeric columns with 0 (string columns are left unchanged)
airbnbDF = airbnbDF.fillna(0)

In [0]:
airbnbDF.filter("host_total_listings_count is NULL").display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na


## Train/Test Split

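Let's use the same 80/20 split with the same seed as in the previous notebook, so we can compare our results like for like. Note that the split is only reproducible if the cluster configuration is unchanged, because `randomSplit` depends on how the data is partitioned.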

In [0]:
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)
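
To illustrate why the seed alone does not guarantee the same split, here is a minimal single-machine sketch in plain Python. It is an analogy, not Spark's implementation: `random_split` is a hypothetical helper that assigns each row with an independent seeded draw, and reversing the row order stands in for a change in partitioning.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test using an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))

# Same seed and same row order: the split is identical on every run.
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)
same_split = (train_a == train_b) and (test_a == test_b)

# Same seed but a different row order (a stand-in for different partitioning):
# each row now meets a different random draw, so membership changes.
train_c, test_c = random_split(list(reversed(rows)), 0.8, seed=42)
```

Note also that the resulting sizes are only approximately 80% and 20%, because every row is assigned independently.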

## Categorical Variables

There are a few ways to handle categorical features:
* Assign them a numeric value
* Create "dummy" variables (also known as One Hot Encoding)
* Generate embeddings (mainly used for textual data)

### One Hot Encoder
Here, we are going to One Hot Encode (OHE) our categorical variables.
Spark doesn't have a `dummies` function, so OHE is a two-step process.  
First, we need to use `StringIndexer` to map a string column of labels to an ML column of label indices.  
[API Python](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html)  

Then, we can apply the `OneHotEncoder` to the output of the StringIndexer  
[API Python](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html)
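
The two steps can be sketched in plain Python. This is only an illustration of the idea, not Spark's code: `fit_string_indexer` and `one_hot` are hypothetical helpers, and the sketch assumes the default behaviours of ordering labels by descending frequency and dropping the last category from the encoded vector.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

room_types = [
    "Entire home/apt", "Private room", "Entire home/apt",
    "Hotel room", "Entire home/apt", "Private room",
]

# Step 1: string labels -> indices
label_to_index = fit_string_indexer(room_types)

# Step 2: indices -> one-hot vectors
encoded = [one_hot(label_to_index[v], len(label_to_index)) for v in room_types]
```

With three categories and the last one dropped, each vector has two entries, and the least frequent label ("Hotel room" here) is represented by all zeros.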

In [0]:
categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
categoricalCols

Out[10]: ['host_is_superhost',
 'instant_bookable',
 'neighbourhood_cleansed',
 'property_type',
 'room_type']

In [0]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]
oheOutputCols = [x + "OHE" for x in categoricalCols]

# handleInvalid="skip" drops rows whose category was not seen when the indexer was fit
stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")
oheEncoder = OneHotEncoder(inputCols=indexOutputCols, outputCols=oheOutputCols)

In [0]:
numericCols = [field for (field, dataType) in trainDF.dtypes if dataType == "double" and field != "price"]

## Vector Assembler

Now we can combine our OHE categorical features with our numeric features.

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
assemblerInputs = oheOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
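
Conceptually, the assembler concatenates the values of its input columns, in order, into a single feature vector per row. A minimal plain-Python sketch of that idea, where `assemble` and the example row are hypothetical and only for illustration:

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(value)          # vector column, e.g. a one-hot encoding
        else:
            features.append(float(value))   # plain numeric column
    return features

row = {"room_typeOHE": [1.0, 0.0], "accommodates": 3.0, "bedrooms": 1.0}
features = assemble(row, ["room_typeOHE", "accommodates", "bedrooms"])
```

The order of `inputCols` determines the position of each value in the resulting vector, which is why the one-hot columns come first in `assemblerInputs`.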

## Linear Regression

Now that we have all of our features, let's build a linear regression model.

In [0]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol="price", featuresCol="features")

## Pipeline

Let's put all these stages in a Pipeline. A `Pipeline` is a way of organizing all of our transformers and estimators [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Pipeline)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.Pipeline).

This way, we don't have to remember the order of the transformations when we apply them to our test dataset.

In [0]:
from pyspark.ml import Pipeline

stages = [stringIndexer, oheEncoder, vecAssembler, lr]
pipeline = Pipeline(stages=stages)

pipelineModel = pipeline.fit(trainDF)
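
What fitting a pipeline does can be sketched with toy classes in plain Python. This is a simplified illustration under our own assumptions, not the Spark implementation: `AddOne`, `Scaler`, `fit_pipeline` and `transform_pipeline` are hypothetical names. Each estimator is fit on the output of the previous stages, and the fitted stages are then replayed in the same order on new data.

```python
class AddOne:
    """Toy transformer: nothing to learn, just adds 1 to every value."""
    def transform(self, data):
        return [x + 1 for x in data]

class ScalerModel:
    """Toy fitted model: divides by the maximum learned from the training data."""
    def __init__(self, maximum):
        self.maximum = maximum
    def transform(self, data):
        return [x / self.maximum for x in data]

class Scaler:
    """Toy estimator: fit() learns the maximum and returns a ScalerModel."""
    def fit(self, data):
        return ScalerModel(max(data))

def fit_pipeline(stages, data):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, data):
    """Apply the fitted stages to new data in the same order."""
    for model in fitted:
        data = model.transform(data)
    return data

train = [1.0, 2.0, 4.0]
test = [2.0, 8.0]

fitted_stages = fit_pipeline([AddOne(), Scaler()], train)
test_out = transform_pipeline(fitted_stages, test)
```

The maximum is learned from the training data only (5.0 after adding one), and the test data is scaled with that same value rather than with its own maximum.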

## Saving Models

We can save our models to persistent storage (e.g. DBFS) in case our cluster goes down so we don't have to recompute our results.

In [0]:
userhome = os.path.join("/", "tmp", username)    # os.path and os.mkdir below act on the driver's local filesystem, not on DBFS
if not os.path.exists(userhome):
    os.mkdir(userhome)
    
print(userhome)

/tmp/renato


In [0]:
pipelinePath = userhome + "/machine-learning-p/lr_pipeline_model"
pipelineModel.write().overwrite().save(pipelinePath)

## Loading models

When you load in models, you need to know the type of model you are loading back in (was it a linear regression or logistic regression model?).

For this reason, we recommend you always put your transformers/estimators into a Pipeline, so you can always load the generic PipelineModel back in.

In [0]:
from pyspark.ml import PipelineModel

savedPipelineModel = PipelineModel.load(pipelinePath)

## Apply model to test set

In [0]:
predDF = savedPipelineModel.transform(testDF)

display(predDF.select("features", "price", "prediction"))

features,price,prediction
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 24, 75, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.2148, 16.3617, 4.0, 1.0, 2.0, 2.0, 121.0, 4.72, 4.87, 4.84, 4.84, 4.64, 4.9, 4.68))",54.0,78.45643475274517
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 25, 76, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21607, 16.36432, 1.0, 1.0, 1.0, 1.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",40.0,43.351186223739376
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 25, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21616, 16.34503, 3.0, 1.0, 2.0, 2.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 5.0))",23.0,68.59730190986102
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 25, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21667, 16.35667, 2.0, 1.0, 1.0, 2.0, 4.0, 5.0, 5.0, 4.67, 5.0, 4.67, 5.0, 5.0))",29.0,37.91575156422613
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 24, 75, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21782, 16.36079, 2.0, 1.0, 1.0, 4.0, 5.0, 5.0, 5.0, 4.8, 4.8, 5.0, 5.0, 5.0))",58.0,54.7537595361639
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 25, 76, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21831, 16.36243, 3.0, 1.0, 3.0, 1.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",100.0,72.40822163148061
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 24, 75, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.21949, 16.35994, 4.0, 1.0, 1.0, 7.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",80.0,103.5162333874905
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 25, 76, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.22014, 16.3672, 1.0, 1.0, 1.0, 2.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",17.0,43.52047369019516
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 24, 75, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.22184, 16.34624, 4.0, 2.0, 2.0, 4.0, 3.0, 5.0, 5.0, 4.67, 5.0, 5.0, 5.0, 5.0))",30.0,103.97967899545132
"Map(vectorType -> sparse, length -> 102, indices -> List(0, 7, 24, 75, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 97, 98, 99, 100, 101), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 48.22185, 16.36678, 2.0, 1.0, 1.0, 15.0, 4.83, 4.89, 4.83, 4.93, 4.93, 4.81, 4.76, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",60.0,79.85512996671719


## Evaluate model

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regressionEvaluator.evaluate(predDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 77.83741103350475
R2 is 0.29736750156634084


Compared with the model from the previous notebook, which did not use one-hot encoded categorical features, our RMSE decreased and our R2 increased!
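
For reference, both metrics can be computed by hand. The plain-Python sketch below uses the standard definitions of RMSE and R2 on made-up toy values (not the Airbnb data); `rmse` and `r2` are hypothetical helper names.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2(labels, predictions):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

labels = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 4.0]

toy_rmse = rmse(labels, predictions)   # sqrt(1/3)
toy_r2 = r2(labels, predictions)       # 1 - 1/2 = 0.5
```

RMSE is expressed in the units of the label (here, price), so lower is better, while R2 compares the model against always predicting the mean label, so higher is better.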

Code modified and enhanced from 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>