
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Regression: Predicting Rental Price

In this notebook, we will use the dataset we cleansed in the previous lab to predict Airbnb rental prices in San Francisco.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Use the SparkML API to build a linear regression model
 - Identify the differences between estimators and transformers

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
filePath = "dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/"
airbnbDF = spark.read.parquet(filePath)

## Train/Test Split

![](https://files.training.databricks.com/images/301/TrainTestSplit.png)

**Question**: Why is it necessary to set a seed? What happens if I change my cluster configuration?

In [0]:
trainDF, testDF = airbnbDF.randomSplit([.8, .2], seed=42)
print(trainDF.cache().count())

Let's change the number of partitions (to simulate a different cluster configuration) and see whether we get the same number of data points in our training set.

In [0]:
trainRepartitionDF, testRepartitionDF = (airbnbDF
                                         .repartition(24)
                                         .randomSplit([.8, .2], seed=42))

print(trainRepartitionDF.count())
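
To build intuition for why the partitioning matters, here is a minimal pure-Python sketch. It is a toy model, not Spark's implementation: the helper `toy_random_split` is hypothetical, and it assumes each partition draws from its own generator seeded with the seed plus the partition index.

```python
import random


def toy_random_split(rows, num_partitions, train_fraction=0.8, seed=42):
    """Toy model (hypothetical helper): chunk rows into partitions, then
    sample each partition with a generator seeded from seed + partition index."""
    train, test = [], []
    # Contiguous chunks stand in for partitions (an assumption of this sketch).
    chunk = -(-len(rows) // num_partitions)  # ceiling division
    for index in range(num_partitions):
        rng = random.Random(seed + index)
        for row in rows[index * chunk:(index + 1) * chunk]:
            (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))

# Same seed and same partitioning: the split is reproducible.
train_a, test_a = toy_random_split(rows, num_partitions=4)
train_b, test_b = toy_random_split(rows, num_partitions=4)

# Same seed but a different partitioning: rows meet different random draws.
train_c, test_c = toy_random_split(rows, num_partitions=24)

print(train_a == train_b)
print(train_a == train_c)
print(len(train_a), len(train_c))
```

In this model the seed only pins down the random draws, not which row receives which draw. If a split has to stay identical across cluster configurations, one common approach is to perform the split once, write the training and test sets out, and read them back in later runs.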

## Linear Regression

We are going to build a very simple model that predicts `price` from the number of `bedrooms` alone.

**Question**: What are some assumptions of the linear regression model?

In [0]:
display(trainDF.select("price", "bedrooms"))

price,bedrooms
200.0,1.0
130.0,1.0
95.0,1.0
250.0,1.0
250.0,3.0
115.0,1.0
105.0,1.0
86.0,1.0
100.0,1.0
220.0,2.0


In [0]:
display(trainDF.select("price", "bedrooms").summary())

summary,price,bedrooms
count,5780.0,5780.0
mean,214.47249134948095,1.35
stddev,325.8499109968376,0.9396893597086264
min,10.0,0.0
25%,100.0,1.0
50%,150.0,1.0
75%,240.0,2.0
max,10000.0,14.0


In [0]:
display(trainDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,flexible,f,0.0,Diamond Heights,37.7431,-122.44509,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.71178,-122.38762,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,Real Bed,90.0,13.0,88.0,10.0,9.0,8.0,9.0,10.0,10.0,130.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.72241,-122.39829,Guest suite,Entire home/apt,4.0,1.0,1.0,3.0,Real Bed,1.0,12.0,98.0,10.0,10.0,10.0,10.0,9.0,10.0,95.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.72979,-122.37094,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,180.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.73072,-122.38907,House,Entire home/apt,6.0,3.0,3.0,3.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Bayview,37.7352,-122.38566,House,Private room,2.0,1.0,1.0,1.0,Real Bed,2.0,100.0,96.0,10.0,9.0,10.0,10.0,9.0,10.0,115.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.7326,-122.41423,Condominium,Private room,2.0,1.5,1.0,1.0,Real Bed,2.0,36.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.73615,-122.41245,House,Private room,2.0,1.0,1.0,2.0,Real Bed,1.0,194.0,91.0,9.0,9.0,10.0,10.0,9.0,9.0,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.73765,-122.41247,Apartment,Entire home/apt,4.0,1.0,1.0,2.0,Real Bed,2.0,4.0,95.0,10.0,10.0,10.0,9.0,9.0,10.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.73826,-122.41693,House,Entire home/apt,4.0,1.0,2.0,2.0,Real Bed,4.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,220.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


There do appear to be some outliers in the price column of our dataset ($10,000 a night??). Just keep this in mind when we are building our models :).

We will use `LinearRegression` to build our first model [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.regression.LinearRegression).

The cell below will fail because the Linear Regression estimator expects a vector of values as input. We will fix that with VectorAssembler below.

In [0]:
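from pyspark.ml.regression import LinearRegression

# featuresCol points at the numeric `bedrooms` column rather than a Vector column,
# so calling fit() on this estimator will fail
lr = LinearRegression(featuresCol="bedrooms", labelCol="price")

# Uncomment when running
# lrModel = lr.fit(trainDF)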

## Vector Assembler

What went wrong? It turns out that the Linear Regression **estimator** (`.fit()`) expected a column of Vector type as input.

We can easily get the values from the `bedrooms` column into a single vector using `VectorAssembler` [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.feature.VectorAssembler). VectorAssembler is an example of a **transformer**. Transformers take in a DataFrame and return a new DataFrame with one or more columns appended to it. They do not learn from your data; they apply rule-based transformations.

You can see an example of how to use VectorAssembler in the [ML Programming Guide](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler).
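
The estimator/transformer distinction can be sketched in plain Python. The classes below (`ToyAssembler`, `ToyCenterer`, `ToyCentererModel`) are hypothetical stand-ins that only mimic the `fit()`/`transform()` pattern on lists of dictionaries; they are not part of SparkML.

```python
class ToyAssembler:
    """Transformer with no fit(): a fixed rule that packs columns into a list."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are left unchanged.
        return [{**row, self.output_col: [row[c] for c in self.input_cols]}
                for row in rows]


class ToyCentererModel:
    """Transformer produced by fitting: applies a value learned from data."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**row, "bedrooms_centered": row["bedrooms"] - self.mean}
                for row in rows]


class ToyCenterer:
    """Estimator: fit() learns from the data and returns a transformer."""

    def fit(self, rows):
        mean = sum(row["bedrooms"] for row in rows) / len(rows)
        return ToyCentererModel(mean)


rows = [{"bedrooms": 1.0, "price": 200.0}, {"bedrooms": 3.0, "price": 250.0}]

# A transformer can be applied directly, with no training step.
assembled = ToyAssembler(["bedrooms"], "features").transform(rows)

# An estimator must see data first; fit() returns a transformer (the model).
centerer_model = ToyCenterer().fit(rows)
centered = centerer_model.transform(rows)

print(assembled[0])
print(centered[1])
```

The same shape appears in this notebook: `VectorAssembler` is used through `transform()` alone, while `LinearRegression` needs `fit()` before the resulting model can `transform()` anything.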

In [0]:
from pyspark.ml.feature import VectorAssembler

vecAssembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")

vecTrainDF = vecAssembler.transform(trainDF)

In [0]:
display(vecTrainDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na,features
f,flexible,f,0.0,Diamond Heights,37.7431,-122.44509,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bayview,37.71178,-122.38762,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,Real Bed,90.0,13.0,88.0,10.0,9.0,8.0,9.0,10.0,10.0,130.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bayview,37.72241,-122.39829,Guest suite,Entire home/apt,4.0,1.0,1.0,3.0,Real Bed,1.0,12.0,98.0,10.0,10.0,10.0,10.0,9.0,10.0,95.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bayview,37.72979,-122.37094,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,180.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bayview,37.73072,-122.38907,House,Entire home/apt,6.0,3.0,3.0,3.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,"List(1, 1, List(), List(3.0))"
f,flexible,f,1.0,Bayview,37.7352,-122.38566,House,Private room,2.0,1.0,1.0,1.0,Real Bed,2.0,100.0,96.0,10.0,9.0,10.0,10.0,9.0,10.0,115.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bernal Heights,37.7326,-122.41423,Condominium,Private room,2.0,1.5,1.0,1.0,Real Bed,2.0,36.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bernal Heights,37.73615,-122.41245,House,Private room,2.0,1.0,1.0,2.0,Real Bed,1.0,194.0,91.0,9.0,9.0,10.0,10.0,9.0,9.0,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bernal Heights,37.73765,-122.41247,Apartment,Entire home/apt,4.0,1.0,1.0,2.0,Real Bed,2.0,4.0,95.0,10.0,10.0,10.0,9.0,9.0,10.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(1.0))"
f,flexible,f,1.0,Bernal Heights,37.73826,-122.41693,House,Entire home/apt,4.0,1.0,2.0,2.0,Real Bed,4.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,220.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"List(1, 1, List(), List(2.0))"


In [0]:
lr = LinearRegression(featuresCol="features", labelCol="price")
lrModel = lr.fit(vecTrainDF)

## Inspect the model

In [0]:
m = lrModel.coefficients[0]
b = lrModel.intercept

print(f"The formula for the linear regression line is y = {m:.2f}x + {b:.2f}")
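
As a rough illustration of where a slope and intercept like these come from, the sketch below computes the closed-form least-squares line for a single feature in plain Python. It assumes an unregularized fit and uses only the ten preview rows displayed earlier, so its numbers will not match `lrModel`, which was trained on the full training set. The function name `fit_simple_line` is hypothetical.

```python
def fit_simple_line(xs, ys):
    """Closed-form ordinary least squares for one feature: y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The ten (bedrooms, price) rows shown in the preview of trainDF above.
bedrooms = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0]
prices = [200.0, 130.0, 95.0, 250.0, 250.0, 115.0, 105.0, 86.0, 100.0, 220.0]

slope, intercept = fit_simple_line(bedrooms, prices)
print(f"y = {slope:.2f}x + {intercept:.2f}")

# With an intercept in the model, least-squares residuals sum to (nearly) zero.
residuals = [y - (slope * x + intercept) for x, y in zip(bedrooms, prices)]
print(abs(sum(residuals)) < 1e-6)
```

Read this way, `m` is the estimated change in price per additional bedroom and `b` is the predicted price for a listing with zero bedrooms.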

## Apply model to test set

In [0]:
vecTestDF = vecAssembler.transform(testDF)

predDF = lrModel.transform(vecTestDF)

predDF.select("bedrooms", "features", "price", "prediction").show()

## Evaluate Model

Let's see how our linear regression model with just one variable does. Does it beat our baseline model?

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regressionEvaluator.evaluate(predDF)
print(f"RMSE is {rmse}")

Wahoo! Our RMSE is lower than that of our baseline model. However, it's still not that great. Let's see how we can further decrease it in future notebooks.
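
For reference, RMSE is the square root of the mean squared difference between labels and predictions, so it is expressed in the same units as `price`. Here is a minimal plain-Python sketch of that calculation. The constant-mean predictor is an assumption made for illustration only, since the baseline from the earlier notebook is not shown here, and the prices are four rows from the preview above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(label - pred) ** 2 for label, pred in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Four prices taken from the preview rows shown earlier in the notebook.
prices = [200.0, 130.0, 95.0, 250.0]

# Illustrative constant predictor (an assumption): always predict the mean price.
mean_price = sum(prices) / len(prices)
constant_predictions = [mean_price] * len(prices)

print(f"RMSE of the constant predictor: {rmse(prices, constant_predictions):.2f}")
```

Because the errors are squared before averaging, a few very expensive listings can dominate the metric, which is worth remembering given the price outliers noted earlier.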

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>