# Regression: Predicting Rental Price

In this notebook, we will use the dataset we cleansed in the previous lab to predict Airbnb rental prices in Vienna.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Use the SparkML API to build a linear regression model
 - Identify the differences between estimators and transformers

In [0]:
import os

### Setting the default database and user name  
##### Replace "renato" with your name in the `username` variable.

In [0]:
## Put your name here
username = "renato"

dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dsacademy_embedded_wave3_{username}")
spark.sql(f"USE dsacademy_embedded_wave3_{username}")
spark.conf.set("spark.sql.shuffle.partitions", 40)

spark.sql("SET spark.databricks.delta.formatCheck.enabled = false")
spark.sql("SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true")

Out[2]: DataFrame[key: string, value: string]

In [0]:
deltaPath = os.path.join("/", "tmp", username)    # os.path and os.mkdir act on the driver's local file system, not on DBFS
if not os.path.exists(deltaPath):
    os.mkdir(deltaPath)
    
print(deltaPath)

airbnbDF = spark.read.format("delta").load(deltaPath)

/tmp/renato


In [0]:
airbnbDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,6.0,Donaustadt,48.24262,16.42767,Room in bed and breakfast,Hotel room,3.0,1.0,2.0,1.0,14.0,4.71,4.86,4.93,4.93,4.86,4.71,4.5,110.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,3.0,Leopoldstadt,48.21924,16.37831,Entire rental unit,Entire home/apt,5.0,1.0,3.0,5.0,350.0,4.75,4.8,4.65,4.91,4.93,4.75,4.69,69.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,t,19.0,Rudolfsheim-Fnfhaus,48.18434,16.32701,Entire rental unit,Entire home/apt,6.0,2.0,4.0,1.0,181.0,4.83,4.9,4.88,4.89,4.93,4.59,4.7,145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21496,16.37161,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,100.0,4.64,4.73,4.55,4.8,4.91,4.89,4.59,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,3.0,Leopoldstadt,48.21778,16.37847,Entire rental unit,Entire home/apt,3.0,1.0,2.0,5.0,347.0,4.65,4.77,4.51,4.93,4.95,4.86,4.58,68.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21351,16.37282,Entire rental unit,Entire home/apt,2.0,1.0,1.0,3.0,52.0,4.63,4.67,4.35,4.69,4.75,4.88,4.56,99.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,4.0,Leopoldstadt,48.2176,16.38018,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,117.0,4.77,4.74,4.68,4.8,4.75,4.81,4.71,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,6.0,Innere Stadt,48.21318,16.37486,Entire rental unit,Entire home/apt,4.0,2.0,1.0,3.0,69.0,4.58,4.8,4.76,4.83,4.92,4.85,4.73,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,t,1.0,Ottakring,48.22207,16.31594,Entire rental unit,Entire home/apt,4.0,2.0,2.0,3.0,50.0,4.87,4.94,4.71,4.94,4.96,4.4,4.73,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,f,2.0,Favoriten,48.17437,16.39339,Entire condo,Entire home/apt,4.0,1.0,2.0,5.0,178.0,4.77,4.87,4.67,4.88,4.87,3.98,4.66,87.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Train/Test Split

![](https://files.training.databricks.com/images/301/TrainTestSplit.png)

**Question**: Why is it necessary to set a seed? What happens if I change my cluster configuration?

In [0]:
trainDF, testDF = airbnbDF.randomSplit([.8, .2], seed=42)
print(trainDF.cache().count())

9504


Let's change the number of partitions (to simulate a different cluster configuration) and see whether we get the same number of data points in our training set.

In [0]:
trainRepartitionDF, testRepartitionDF = (airbnbDF
                                         .repartition(24)
                                         .randomSplit([.8, .2], seed=42))

print(trainRepartitionDF.count())

9469
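
The difference between the two counts can be reproduced outside Spark. The sketch below is a simplified, plain-Python model of per-partition sampling. It assumes each partition draws from its own random generator seeded with the seed plus the partition index; `partitioned_split` and the round-robin layout are hypothetical stand-ins, not Spark's actual implementation.

```python
import random

def partitioned_split(rows, num_partitions, train_fraction, seed):
    """Split rows into train/test, drawing random numbers per partition."""
    # Round-robin assignment stands in for however the data is laid out.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    train, test = [], []
    for index, partition in enumerate(partitions):
        # Assumption: each partition gets its own generator derived from the seed.
        rng = random.Random(seed + index)
        for row in partition:
            (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(10_000))
train_a, test_a = partitioned_split(rows, num_partitions=8, train_fraction=0.8, seed=42)
train_b, test_b = partitioned_split(rows, num_partitions=8, train_fraction=0.8, seed=42)
train_c, test_c = partitioned_split(rows, num_partitions=24, train_fraction=0.8, seed=42)

# Same seed and same layout reproduce the split; a different layout need not.
print(len(train_a), len(train_b), len(train_c))
```

The seed only makes the split reproducible for a fixed data layout, which is why the same seed can yield a different training set after `repartition(24)`.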


## Linear Regression

We are going to build a very simple model predicting `price` just given the number of `bedrooms`.

**Question**: What are some assumptions of the linear regression model?

In [0]:
display(trainDF.select("price", "bedrooms"))

price,bedrooms
80.0,1.0
62.0,1.0
20.0,1.0
38.0,1.0
87.0,1.0
30.0,1.0
60.0,1.0
20.0,1.0
35.0,1.0
70.0,1.0


In [0]:
display(trainDF.select("price", "bedrooms").summary())

summary,price,bedrooms
count,9504.0,9504.0
mean,96.1851851851852,1.3324915824915824
stddev,211.65808789688143,0.8700305947203256
min,9.0,1.0
25%,47.0,1.0
50%,71.0,1.0
75%,104.0,1.0
max,9270.0,19.0


In [0]:
display(trainDF)

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,f,,Leopoldstadt,48.22447,16.38696,Entire rental unit,Entire home/apt,2.0,1.0,1.0,10.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,80.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,f,,Mariahilf,48.19336,16.34596,Entire rental unit,Entire home/apt,3.0,1.0,2.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21569,16.34861,Private room in rental unit,Private room,2.0,1.0,1.0,2.0,6.0,5.0,4.83,5.0,5.0,4.83,5.0,5.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21569,16.35775,Entire rental unit,Entire home/apt,4.0,1.0,3.0,3.0,11.0,4.91,5.0,4.91,5.0,5.0,4.91,4.82,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21571,16.35701,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,87.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,f,1.0,Alsergrund,48.21613,16.34589,Private room in rental unit,Private room,2.0,1.0,1.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21622,16.3627,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,27.0,4.85,4.96,4.89,4.89,4.93,5.0,4.81,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21626,16.34287,Private room in rental unit,Private room,2.0,1.0,1.0,2.0,3.0,4.0,4.0,3.0,5.0,5.0,4.0,4.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.21641,16.34797,Entire rental unit,Entire home/apt,2.0,1.0,1.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,f,1.0,Alsergrund,48.216629533198855,16.36352280240944,Private room in rental unit,Private room,2.0,1.0,1.0,5.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,70.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


There appear to be some outliers in the price column of our dataset (more than 9,000 a night??). Just keep this in mind when we are building our models :).
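
One rough way to quantify "outlier" is Tukey's 1.5 × IQR rule. The notebook itself does not use this rule; it is shown here only as a sanity check on the price quartiles and maximum reported by `summary()` above, which Spark computes approximately.

```python
# Values copied from the summary() output for the price column.
q1, q3 = 47.0, 104.0        # 25% and 75% quantiles
max_price = 9270.0

iqr = q3 - q1                       # interquartile range
upper_fence = q3 + 1.5 * iqr        # Tukey's upper fence (a convention, not a law)

print(f"upper fence: {upper_fence}, max exceeds it: {max_price > upper_fence}")
```

Under this convention, any price above roughly 190 a night would be flagged, so the maximum is far outside the bulk of the data.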

We will use `LinearRegression` to build our first model  
[Python](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html)

The cell below will [fail](https://stackoverflow.com/questions/61056160/illegalargumentexception-column-must-be-of-type-structtypetinyint-sizeint-in) because the Linear Regression estimator expects a vector of values as input.  We will fix that with VectorAssembler below.

In [0]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="bedrooms", labelCol="price")

# Uncomment when running
lrModel = lr.fit(trainDF)

---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<command-74902913649601> in <cell line: 6>()
      4 
      5 # Uncomment when running
----> 6 lrModel = lr.fit(trainDF)

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_pyspark.py in patched_method(self, *args, **kwargs)
     28             call_succeeded = False
     29             try:
---> 30                 result = original_method(self,

## Vector Assembler

What went wrong? It turns out that the Linear Regression **estimator** (`.fit()`) expects a column of Vector type as input.

We can easily get the values from the `bedrooms` column into a single vector using `VectorAssembler`  
[Python](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html).  
VectorAssembler is an example of a **transformer**.  
Transformers take in a DataFrame, and return a new DataFrame with one or more columns appended to it.  
They do not learn from your data, but apply rule-based transformations.

You can see an example of how to use VectorAssembler in the [ML Programming Guide](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler).
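
To make the estimator/transformer distinction concrete, here is a toy, plain-Python imitation of the two contracts. The classes `ToyVectorAssembler`, `ToyMeanEstimator`, and `ToyMeanModel` are hypothetical and operate on lists of dicts rather than DataFrames; they only mirror the shape of the SparkML API.

```python
class ToyVectorAssembler:
    """Transformer: applies a fixed rule and learns nothing from the data."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with one extra column; the input is left untouched.
        return [{**row, self.output_col: [row[c] for c in self.input_cols]}
                for row in rows]


class ToyMeanModel:
    """Fitted model: itself a transformer that appends a prediction column."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**row, "prediction": self.mean} for row in rows]


class ToyMeanEstimator:
    """Estimator: fit() learns a parameter from the data and returns a model."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        mean = sum(row[self.label_col] for row in rows) / len(rows)
        return ToyMeanModel(mean)


rows = [{"bedrooms": 1.0, "price": 80.0}, {"bedrooms": 2.0, "price": 120.0}]

assembled = ToyVectorAssembler(["bedrooms"], "features").transform(rows)
model = ToyMeanEstimator("price").fit(assembled)   # learning happens here
predicted = model.transform(assembled)             # applying needs no learning

print(predicted[0])
```

The key point is that only `fit()` looks at the data to learn something; every `transform()` call just applies a rule that is already fixed.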

In [0]:
from pyspark.ml.feature import VectorAssembler

vecAssembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")
vecTrainDF = vecAssembler.transform(trainDF)

In [0]:
vecTrainDF.limit(10).display()

host_is_superhost,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na,features
f,f,,Leopoldstadt,48.22447,16.38696,Entire rental unit,Entire home/apt,2.0,1.0,1.0,10.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,80.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,,Mariahilf,48.19336,16.34596,Entire rental unit,Entire home/apt,3.0,1.0,2.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21569,16.34861,Private room in rental unit,Private room,2.0,1.0,1.0,2.0,6.0,5.0,4.83,5.0,5.0,4.83,5.0,5.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21569,16.35775,Entire rental unit,Entire home/apt,4.0,1.0,3.0,3.0,11.0,4.91,5.0,4.91,5.0,5.0,4.91,4.82,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21571,16.35701,Entire rental unit,Entire home/apt,2.0,1.0,1.0,2.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,87.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21613,16.34589,Private room in rental unit,Private room,2.0,1.0,1.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21622,16.3627,Private room in rental unit,Private room,2.0,1.0,2.0,2.0,27.0,4.85,4.96,4.89,4.89,4.93,5.0,4.81,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21626,16.34287,Private room in rental unit,Private room,2.0,1.0,1.0,2.0,3.0,4.0,4.0,3.0,5.0,5.0,4.0,4.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.21641,16.34797,Entire rental unit,Entire home/apt,2.0,1.0,1.0,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"
f,f,1.0,Alsergrund,48.216629533198855,16.36352280240944,Private room in rental unit,Private room,2.0,1.0,1.0,5.0,0.0,4.83,4.89,4.83,4.93,4.93,4.81,4.76,70.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,"Map(vectorType -> dense, length -> 1, values -> List(1.0))"


In [0]:
lr = LinearRegression(featuresCol="features", labelCol="price")
lrModel = lr.fit(vecTrainDF)

## Inspect the model

In [0]:
m = lrModel.coefficients[0]
b = lrModel.intercept

print(f"The formula for the linear regression line is y = {m:.2f}x + {b:.2f}")

The formula for the linear regression line is y = 37.43x + 46.31
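
For a single feature, an unregularized least-squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) minus the slope times mean(x). The sketch below applies it to made-up data; `fit_line` is a hypothetical helper, and Spark's `LinearRegression` may use a different solver internally even when it arrives at the same line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature, via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```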


## Apply model to test set

In [0]:
vecTestDF = vecAssembler.transform(testDF)

predDF = lrModel.transform(vecTestDF)

predDF.select("bedrooms", "features", "price", "prediction").show()

+--------+--------+-----+------------------+
|bedrooms|features|price|        prediction|
+--------+--------+-----+------------------+
|     1.0|   [1.0]| 54.0| 83.73893439860554|
|     1.0|   [1.0]| 40.0| 83.73893439860554|
|     1.0|   [1.0]| 23.0| 83.73893439860554|
|     1.0|   [1.0]| 29.0| 83.73893439860554|
|     1.0|   [1.0]| 58.0| 83.73893439860554|
|     1.0|   [1.0]|100.0| 83.73893439860554|
|     1.0|   [1.0]| 80.0| 83.73893439860554|
|     1.0|   [1.0]| 17.0| 83.73893439860554|
|     2.0|   [2.0]| 30.0|121.17221524533119|
|     1.0|   [1.0]| 60.0| 83.73893439860554|
|     1.0|   [1.0]| 20.0| 83.73893439860554|
|     1.0|   [1.0]| 85.0| 83.73893439860554|
|     1.0|   [1.0]| 25.0| 83.73893439860554|
|     1.0|   [1.0]| 70.0| 83.73893439860554|
|     1.0|   [1.0]| 55.0| 83.73893439860554|
|     1.0|   [1.0]|120.0| 83.73893439860554|
|     1.0|   [1.0]| 30.0| 83.73893439860554|
|     1.0|   [1.0]| 44.0| 83.73893439860554|
|     1.0|   [1.0]| 27.0| 83.73893439860554|
|     1.0|

## Evaluate Model

Let's see how our linear regression model with just one variable does.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regressionEvaluator.evaluate(predDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)

print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 86.25336450843317
R2 is 0.13717072526575846
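
As a reminder of what these two metrics measure, here is a minimal plain-Python sketch using the standard definitions: RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The helpers `toy_rmse` and `toy_r2` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

def toy_rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def toy_r2(labels, predictions):
    """R2: share of label variance explained, relative to predicting the mean."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [2.0, 4.0, 6.0]
predictions = [3.0, 4.0, 5.0]

print(toy_rmse(labels, predictions), toy_r2(labels, predictions))
```

Read this way, an R2 of about 0.14 means the one-variable model explains only around 14% of the variance in `price` on the test set.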


#### The model is still not that great. Let's see how we can further decrease the RMSE in the next notebook.

Code modified and enhanced from 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>