<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Linear Regression II Lab

Alright! We're making progress. The RMSE and R2 are still not great, but they are better than both the baseline and the single-feature model.

In this lab, you will improve the model's performance even further.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Use **RFormula to simplify the process of using StringIndexer, OneHotEncoder, and VectorAssembler**
 - **Transform the price into log(price), predict, and exponentiate the result for a lower RMSE**

In [0]:
%run "../Includes/Classroom-Setup"

In [0]:
file_path = f"{datasets_dir}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

## RFormula

#### Instead of manually specifying which columns are categorical to the StringIndexer and OneHotEncoder, <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.RFormula.html?highlight=rformula#pyspark.ml.feature.RFormula" target="_blank">RFormula</a> can do that automatically for you.

- RFormula treats every column of type String as a categorical feature: it string indexes and one-hot encodes the column for us. All other columns are left as they are. It then combines the one-hot encoded features and the numeric features into a single vector called **`features`**.

- You can see a detailed example of how to use RFormula <a href="https://spark.apache.org/docs/latest/ml-features.html#rformula" target="_blank">here</a>.
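
To make the steps that RFormula automates more concrete, here is a minimal pure-Python sketch of the same idea (it is not Spark code and does not call the RFormula API). The rows are illustrative, and the helper names `fit_string_index`, `one_hot`, and `assemble` are hypothetical. The sketch assumes RFormula's documented defaults of ordering categories by descending frequency and dropping the last category when one-hot encoding; treat it as an approximation of the behavior rather than a description of the implementation.

```python
from collections import Counter

# Illustrative rows only; the column names mirror the Airbnb dataset.
rows = [
    {"room_type": "Entire home/apt", "beds": 1.0},
    {"room_type": "Private room", "beds": 1.0},
    {"room_type": "Private room", "beds": 2.0},
]


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, n_categories):
    """One-hot encode an index; the last category is dropped and becomes all zeros."""
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[index] = 1.0
    return vec


def assemble(row, string_indexes):
    """Concatenate encoded string columns and numeric columns into one feature list."""
    features = []
    for name, value in row.items():
        if isinstance(value, str):
            index = string_indexes[name]
            features.extend(one_hot(index[value], len(index)))
        else:
            features.append(float(value))
    return features


string_indexes = {"room_type": fit_string_index(r["room_type"] for r in rows)}
features = [assemble(r, string_indexes) for r in rows]
print(features)
```

With only two room types, the categorical column collapses to a single 0/1 slot, which is why the assembled vector is shorter than the number of distinct category values plus numeric columns.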

In [0]:
# TODO
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# "price ~ ." means: predict price from every other column
r_formula = RFormula(formula="price ~ .", featuresCol="features", labelCol="price", handleInvalid="skip")
lr = LinearRegression(labelCol="price", featuresCol="features")

# Chain feature preparation and the model so both are fit on the training data only
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)

regression_evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")

rmse = regression_evaluator.setMetricName("rmse").evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

## Log Scale

Now that we have verified that RFormula gives the same result as above, we are going to improve upon our model. Recall that our dependent variable, price, appears to be log-normally distributed, so we are going to try to predict it on the log scale.

Let's convert the price to the log scale and have the linear regression model predict the log price.
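
As a rough illustration of what "fit on the log scale" means, the sketch below fits a one-feature least-squares line to log(price) in plain Python and then exponentiates to get back to price. It is not the lab's Spark pipeline: the five (accommodates, price) pairs are a small illustrative sample from this dataset, and `predict_price` is a hypothetical helper.

```python
import math

# (accommodates, price) pairs; a tiny illustrative sample, not the full dataset.
data = [(2.0, 250.0), (1.0, 70.0), (2.0, 86.0), (1.0, 60.0), (3.0, 135.0)]

xs = [x for x, _ in data]
log_ys = [math.log(y) for _, y in data]  # the model is fit on log(price)

# Closed-form simple linear regression: log(price) = intercept + slope * accommodates
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(log_ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, log_ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean


def predict_price(accommodates):
    """Predict on the log scale, then exponentiate back to the price scale."""
    return math.exp(intercept + slope * accommodates)


# On the price scale, each additional guest multiplies the prediction by exp(slope).
print(predict_price(3.0) / predict_price(2.0), math.exp(slope))
```

One consequence of this design is that predictions are always positive and that feature effects become multiplicative rather than additive on the price scale.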

In [0]:
from pyspark.sql.functions import log

display(train_df.select(log("price")))

ln(price)
4.442651256490317
3.80666248977032
4.852030263919617
4.605170185988092
5.521460917862246
5.521460917862246
4.8283137373023015
4.382026634673881
4.276666119016055
5.010635294096256


In [0]:
# ANSWER
from pyspark.sql.functions import col, log

log_train_df = train_df.withColumn("log_price", log(col("price")))
log_test_df = test_df.withColumn("log_price", log(col("price")))

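# "log_price ~ . - price": use every column as a feature except price itself,
# which would otherwise leak the label into the features
r_formula = RFormula(formula="log_price ~ . - price", featuresCol="features", labelCol="log_price", handleInvalid="skip")

# Reuse the existing estimator, pointing it at the log label and a new prediction column
lr.setLabelCol("log_price").setPredictionCol("log_pred")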
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(log_train_df)
pred_df = pipeline_model.transform(log_test_df)

## Exponentiate

To interpret the RMSE in the original price units, we need to convert our predictions back from the log scale.

In [0]:
display(pred_df)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na,log_price,features,log_pred
f,flexible,f,1.0,Bayview,37.72979,-122.37094,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,180.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.521460917862246,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 21, 43, 44, 45, 70, 72, 73, 74, 75, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.72979, -122.37094, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 180.0, 1.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",4.079482408607532
f,flexible,f,1.0,Bayview,37.73555,-122.39779,House,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,70.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.248495242049359,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 21, 43, 44, 46, 71, 72, 73, 74, 75, 76, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.73555, -122.39779, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 30.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",4.252669092247771
f,flexible,f,1.0,Bernal Heights,37.73615,-122.41245,House,Private room,2.0,1.0,1.0,2.0,Real Bed,1.0,194.0,91.0,9.0,9.0,10.0,10.0,9.0,9.0,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.454347296253507,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 13, 43, 44, 46, 71, 72, 73, 74, 75, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.73615, -122.41245, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 194.0, 91.0, 9.0, 9.0, 10.0, 10.0, 9.0, 9.0))",4.368422670767188
f,flexible,f,1.0,Bernal Heights,37.74552,-122.41195,Apartment,Entire home/apt,2.0,2.0,1.0,1.0,Real Bed,2.0,4.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.521460917862246,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 13, 43, 44, 45, 70, 72, 73, 74, 75, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.74552, -122.41195, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 4.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",5.002795952982154
f,flexible,f,1.0,Downtown/Civic Center,37.7797,-122.42109,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,31.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,60.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0943445622221,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 11, 43, 44, 45, 71, 72, 73, 74, 75, 76, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.7797, -122.42109, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 31.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",4.442041437052694
f,flexible,f,1.0,Financial District,37.78424,-122.39925,Apartment,Private room,2.0,1.0,1.0,1.0,Real Bed,180.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,100.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.605170185988092,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 29, 43, 44, 45, 71, 72, 73, 74, 75, 76, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.78424, -122.39925, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 180.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",4.235412660302188
f,flexible,f,1.0,Haight Ashbury,37.77407,-122.44556,Condominium,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.276666119016055,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 14, 43, 44, 47, 71, 72, 73, 74, 75, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.77407, -122.44556, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 100.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0))",4.898269545998119
f,flexible,f,1.0,Marina,37.79876,-122.43327,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,Real Bed,100.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,135.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.90527477843843,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 19, 43, 44, 45, 70, 72, 73, 74, 75, 76, 80, 82, 83, 84, 85, 86, 87, 88, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.79876, -122.43327, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 100.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",5.047700602117686
f,flexible,f,1.0,Noe Valley,37.74683,-122.43746,Guest suite,Private room,2.0,1.0,1.0,1.0,Real Bed,2.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,120.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.787491742782046,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 15, 43, 44, 48, 71, 72, 73, 74, 75, 76, 80, 82, 83, 84, 85, 86, 87, 88, 91, 92, 93, 94, 95, 96, 97, 98), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.74683, -122.43746, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 98.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))",4.4553587274045015
f,flexible,f,1.0,Noe Valley,37.74802,-122.43521,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,12.0,90.0,9.0,8.0,9.0,9.0,9.0,8.0,89.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.48863636973214,"Map(vectorType -> sparse, length -> 99, indices -> List(0, 3, 6, 7, 15, 43, 44, 45, 71, 72, 73, 74, 75, 76, 80, 81, 82, 83, 84, 85, 86, 87, 88), values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 37.74802, -122.43521, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 30.0, 12.0, 90.0, 9.0, 8.0, 9.0, 9.0, 9.0, 8.0))",4.373512960089244
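
To see why the back-transformation matters, the following plain-Python sketch computes the RMSE on both scales for the first three rows displayed above (their `price` and `log_pred` values are copied verbatim). The `rmse` helper is hypothetical and only mirrors the usual root-mean-squared-error definition; three rows are far too few to say anything about the model's real performance.

```python
import math

# (price, log_pred) pairs copied from the first three displayed rows of pred_df.
rows = [
    (250.0, 4.079482408607532),
    (70.0, 4.252669092247771),
    (86.0, 4.368422670767188),
]


def rmse(actuals, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


prices = [price for price, _ in rows]
log_preds = [log_pred for _, log_pred in rows]

# Error measured in log units: small numbers that are hard to relate to a price.
rmse_log = rmse([math.log(p) for p in prices], log_preds)

# Error measured in price units, after exponentiating the predictions.
rmse_price = rmse(prices, [math.exp(lp) for lp in log_preds])

print(f"RMSE on the log scale: {rmse_log}")
print(f"RMSE on the price scale: {rmse_price}")
```

The two numbers are not comparable with each other; only the price-scale RMSE can be compared with the RMSE of the earlier model that predicted `price` directly.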


In [0]:
from pyspark.sql.functions import col, exp

# Exponentiate the log-scale prediction so it is comparable with the price label
exp_df = pred_df.withColumn("prediction", exp(col("log_pred")))

rmse = regression_evaluator.setMetricName("rmse").evaluate(exp_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(exp_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

#### Nice job! You have increased the R2 and dropped the RMSE significantly in comparison to the previous model.

In the next few notebooks, we will see how we can reduce the RMSE even more.

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>