
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Regression: Predicting Rental Price

In this notebook, we will use the dataset we cleansed in the previous lab to predict Airbnb rental prices in San Francisco.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:
 - **Use the SparkML API to build a linear regression model**
 - **Identify the differences between estimators and transformers**

In [0]:
%run "./Includes/Classroom-Setup"

In [0]:
file_path = f"{datasets_dir}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)

## Train/Test Split

![](https://files.training.databricks.com/images/301/TrainTestSplit.png)

**Question**: Why is it necessary to set a seed? What happens if I change my cluster configuration?

In [0]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)
print(train_df.cache().count())

Let's change the number of partitions (to simulate a different cluster configuration) and see whether we get the same number of data points in our training set.

In [0]:
train_repartition_df, test_repartition_df = (airbnb_df
                                             .repartition(24)
                                             .randomSplit([.8, .2], seed=42))

print(train_repartition_df.count())
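
To build intuition for why the same seed can give a different split after repartitioning, here is a toy simulation in plain Python. It is a sketch, not Spark's implementation: the function `simulated_random_split` is hypothetical, and it assumes each partition draws from its own random generator derived from the shared seed and the partition index.

```python
import random


def simulated_random_split(rows, num_partitions, train_fraction=0.8, seed=42):
    """Toy stand-in for a partition-wise random split (not Spark's code)."""
    # Deal rows round-robin into partitions, loosely like a repartition.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    train, test = [], []
    for index, partition in enumerate(partitions):
        # Assumption: each partition gets its own generator, seeded from
        # the shared seed plus the partition index.
        rng = random.Random(seed + index)
        for row in partition:
            (train if rng.random() < train_fraction else test).append(row)
    return sorted(train), sorted(test)


rows = list(range(1000))
train_4, test_4 = simulated_random_split(rows, num_partitions=4)
train_4_again, _ = simulated_random_split(rows, num_partitions=4)
train_24, test_24 = simulated_random_split(rows, num_partitions=24)

print(train_4 == train_4_again)  # same seed, same partition layout
print(train_4 == train_24)       # same seed, different partition layout
print(len(train_4), len(train_24))
```

In this sketch the seed only guarantees reproducibility while the partition layout stays fixed. Once rows land in different partitions, they meet different random draws, so the membership of the training set changes.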

## Linear Regression

We are going to build a very simple model that predicts **`price`** given only the number of **`bedrooms`**.

**Question**: What are some assumptions of the linear regression model?

In [0]:
display(train_df.select("price", "bedrooms"))

| price | bedrooms |
|-------|----------|
| 85.0  | 1.0 |
| 45.0  | 1.0 |
| 128.0 | 1.0 |
| 100.0 | 1.0 |
| 250.0 | 1.0 |
| 250.0 | 2.0 |
| 125.0 | 0.0 |
| 80.0  | 1.0 |
| 72.0  | 1.0 |
| 150.0 | 2.0 |


In [0]:
display(train_df.select("price", "bedrooms").summary())

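| summary | price | bedrooms |
|---------|-------|----------|
| count   | 5786.0 | 5786.0 |
| mean    | 215.2701348081576 | 1.3370203940546146 |
| stddev  | 335.00495198272256 | 0.9336511382658126 |
| min     | 10.0 | 0.0 |
| 25%     | 100.0 | 1.0 |
| 50%     | 150.0 | 1.0 |
| 75%     | 235.0 | 2.0 |
| max     | 10000.0 | 14.0 |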


In [0]:
display(train_df)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,flexible,f,1.0,Bayview,37.72001,-122.39249,House,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,2.0,128.0,97.0,10.0,10.0,10.0,10.0,9.0,10.0,85.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.7325,-122.39221,House,Private room,1.0,1.0,1.0,1.0,Real Bed,31.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,45.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Bernal Heights,37.73905,-122.41269,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,1.0,80.0,10.0,8.0,10.0,10.0,8.0,10.0,128.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.7422,-122.42091,Guest suite,Private room,4.0,1.0,1.0,3.0,Real Bed,3.0,49.0,95.0,10.0,10.0,10.0,10.0,10.0,9.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.74552,-122.41195,Apartment,Entire home/apt,2.0,2.0,1.0,1.0,Real Bed,2.0,4.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Financial District,37.7842,-122.39925,Apartment,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,183.0,3.0,74.0,6.0,6.0,4.0,10.0,10.0,8.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Glen Park,37.74185,-122.42977,Apartment,Entire home/apt,3.0,1.0,0.0,2.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,125.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Haight Ashbury,37.76637,-122.4467,House,Private room,2.0,1.0,1.0,1.0,Real Bed,7.0,50.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Haight Ashbury,37.77407,-122.44556,Condominium,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Inner Richmond,37.77777,-122.45531,House,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,30.0,74.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


There do appear to be some outliers in our dataset for the price ($10,000 a night??). Keep this in mind when we are building our models.
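
As a rough illustration of why outliers matter, the sketch below takes the ten prices displayed in the sample above and appends the $10,000 maximum from the summary. It uses only Python's `statistics` module. The ten-row sample is far too small to represent the dataset, so treat the numbers as illustrative only.

```python
from statistics import mean, median, stdev

# The ten prices shown in the training sample above.
prices = [85.0, 45.0, 128.0, 100.0, 250.0, 250.0, 125.0, 80.0, 72.0, 150.0]
# Append the maximum price reported by summary().
with_outlier = prices + [10000.0]

for label, values in [("without outlier", prices), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={mean(values):.2f}, "
        f"median={median(values):.2f}, stddev={stdev(values):.2f}"
    )
```

The median barely moves, while the mean and standard deviation jump. A model fit by minimizing squared error is sensitive to such points in a similar way.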

We will use <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html?highlight=linearregression#pyspark.ml.regression.LinearRegression" target="_blank">LinearRegression</a> to build our first model.

The cell below will fail because the Linear Regression estimator expects a vector of values as input. We will fix that with VectorAssembler below.

In [0]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="bedrooms", labelCol="price")

# Uncomment the next line to see the failure: featuresCol must be a Vector column
# lr_model = lr.fit(train_df)

## Vector Assembler

What went wrong? Turns out that the Linear Regression **estimator** (**`.fit()`**) expected a column of Vector type as input.

We can easily get the values from the **`bedrooms`** column into a single vector using <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html?highlight=vectorassembler#pyspark.ml.feature.VectorAssembler" target="_blank">VectorAssembler</a>. VectorAssembler is an example of a **transformer**. Transformers take in a DataFrame and return a new DataFrame with one or more columns appended to it. They do not learn from your data; they apply rule-based transformations.

You can see an example of how to use VectorAssembler in the <a href="https://spark.apache.org/docs/latest/ml-features.html#vectorassembler" target="_blank">ML Programming Guide</a>.

In [0]:
from pyspark.ml.feature import VectorAssembler

vec_assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")

vec_train_df = vec_assembler.transform(train_df)

In [0]:
lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(vec_train_df)
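
To make the estimator/transformer distinction concrete, here is a minimal plain-Python sketch. The classes `ToyAssembler`, `ToyMeanCenterer`, and `ToyMeanCentererModel` are hypothetical and are not part of Spark; they only mirror the pattern in which a transformer exposes `transform()`, while an estimator exposes `fit()` and returns a model that is itself a transformer.

```python
class ToyAssembler:
    """Hypothetical transformer: applies a fixed rule, learns nothing."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        # Append a new column that packs the input columns into one list.
        return [
            {**row, self.output_col: [row[c] for c in self.input_cols]}
            for row in rows
        ]


class ToyMeanCentererModel:
    """Hypothetical fitted model: a transformer that carries learned state."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        return [
            {**row, self.output_col: row[self.input_col] - self.mean}
            for row in rows
        ]


class ToyMeanCenterer:
    """Hypothetical estimator: fit() learns a value from the data."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        learned_mean = sum(values) / len(values)
        # fit() returns a model; the model is what transforms data.
        return ToyMeanCentererModel(self.input_col, self.output_col, learned_mean)


rows = [{"bedrooms": 1.0}, {"bedrooms": 2.0}, {"bedrooms": 0.0}]

assembled = ToyAssembler(["bedrooms"], "features").transform(rows)
centerer_model = ToyMeanCenterer("bedrooms", "centered").fit(rows)
centered = centerer_model.transform(rows)

print(assembled[0])
print(centerer_model.mean, centered[0])
```

The assembler would produce the same output column for a given row regardless of what else is in the data, whereas the centerer's output depends on the mean it learned during `fit()`.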

## Inspect the model

In [0]:
m = lr_model.coefficients[0]
b = lr_model.intercept

print(f"The formula for the linear regression line is y = {m:.2f}x + {b:.2f}")
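
For a single feature, the least-squares slope and intercept have a closed form, sketched below in plain Python on the ten sample rows displayed earlier. The helper `fit_simple_ols` is hypothetical. Spark fits on the full training set with its own solvers, so the values printed here will not match `lr_model`.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The ten (bedrooms, price) rows shown in the training sample above.
bedrooms = [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 0.0, 1.0, 1.0, 2.0]
prices = [85.0, 45.0, 128.0, 100.0, 250.0, 250.0, 125.0, 80.0, 72.0, 150.0]

slope, intercept = fit_simple_ols(bedrooms, prices)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The slope reads as the estimated change in price for each additional bedroom, and the intercept as the predicted price of a listing with zero bedrooms.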

## Apply model to test set

In [0]:
vec_test_df = vec_assembler.transform(test_df)

pred_df = lr_model.transform(vec_test_df)

pred_df.select("bedrooms", "features", "price", "prediction").show(5)

## Evaluate Model

Let's see how our linear regression model with just one variable does. Does it beat our baseline model?

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(pred_df)
print(f"RMSE is {rmse}")

Wahoo! Our RMSE is lower than the baseline model's. However, it's still not that great. Let's see how we can decrease it further in future notebooks.
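
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The sketch below computes it in plain Python and compares a model against a constant-prediction baseline. All numbers are hypothetical, including the assumed baseline value of 175.0; none of them come from the Airbnb dataset.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions, for illustration only.
labels = [100.0, 150.0, 200.0, 250.0]
model_predictions = [110.0, 140.0, 210.0, 240.0]

# Assumed baseline: predict one constant (e.g. a training-set average price)
# for every row. The value 175.0 is made up for this example.
baseline_predictions = [175.0] * len(labels)

model_rmse = rmse(labels, model_predictions)
baseline_rmse = rmse(labels, baseline_predictions)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

Because errors are squared before averaging, a few large misses (such as a $10,000 listing) can dominate the metric.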

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>