<i18n value="45bb1181-9fe0-4255-b0f0-b42637fc9591"/>



# Linear Regression Lab

In the previous lesson, we predicted price using just one variable: **`bedrooms`**. Now we want to predict price from several features.

Steps:
1. Use the features **`bedrooms`**, **`bathrooms`**, **`bathrooms_na`**, **`minimum_nights`**, and **`number_of_reviews`** as input to your **`VectorAssembler`**.
1. Build a linear regression model and fit it on the training set.
1. Evaluate the **`RMSE`** and the **`R2`** on the test set.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Build a linear regression model with multiple features
 - Compute various metrics to evaluate goodness of fit

In [0]:
%run "../Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| dropping the schema "lpalum_y9gq_da_sml"...(0 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/scalable-machine-learning-with-apache-spark"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(5 seconds)
| completed (5 seconds total)

Creating & using the schema "lpalum_y9gq_da_sml"...(0 seconds)
Predefined tables in "lpalum_y9gq_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (7 seconds)


In [0]:
file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

In [0]:
# ANSWER
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

# Combine the five input columns into a single vector column named "features"
vec_assembler = VectorAssembler(
    inputCols=["bedrooms", "bathrooms", "bathrooms_na", "minimum_nights", "number_of_reviews"],
    outputCol="features",
)

vec_train_df = vec_assembler.transform(train_df)
vec_test_df = vec_assembler.transform(test_df)

# Fit the model on the training set only
lr_model = LinearRegression(featuresCol="features", labelCol="price").fit(vec_train_df)

# Generate predictions for the held-out test set
pred_df = lr_model.transform(vec_test_df)

# Evaluate RMSE first, then reuse the same evaluator for R2
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 146.66557395182022
R2 is 0.32567578612556003
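
To make the two metrics concrete, here is a minimal plain-Python sketch of the standard RMSE and R2 formulas. The labels and predictions are made-up toy values, not Airbnb data, and the names `toy_rmse` and `toy_r2` are chosen here only to avoid clashing with the notebook's variables. It is meant to illustrate what the numbers measure, not how **`RegressionEvaluator`** computes them internally.

```python
import math

# Hypothetical labels and predictions (illustrative values only)
labels = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 320.0]

n = len(labels)

# Sum of squared residuals: how far the predictions are from the labels
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))

# Total sum of squares: how far the labels are from their own mean
mean_label = sum(labels) / n
ss_tot = sum((y - mean_label) ** 2 for y in labels)

# RMSE is in the same units as the label (here, price)
toy_rmse = math.sqrt(ss_res / n)

# R2 compares the model against always predicting the mean label
toy_r2 = 1 - ss_res / ss_tot

print(f"RMSE is {toy_rmse}")
print(f"R2 is {toy_r2}")
```

An R2 of 1 would mean perfect predictions, while an R2 near 0 means the model does little better than always predicting the average price.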


<i18n value="25a260af-8d6e-4897-8228-80074c4f1d64"/>



Examine the coefficients for each of the variables.

In [0]:
for col, coef in zip(vec_assembler.getInputCols(), lr_model.coefficients):
    print(col, coef)
  
print(f"intercept: {lr_model.intercept}")

bedrooms 115.67218110629409
bathrooms 15.32773278579743
bathrooms_na -59.66329665713672
minimum_nights -0.5012697007580986
number_of_reviews -0.29570073989207096
intercept: 61.14012549013641
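
To see how these numbers turn into a prediction, the sketch below plugs a single listing into the linear formula: the intercept plus each coefficient times its feature value. The coefficients and intercept are copied from the output above; the listing itself is hypothetical and chosen only for illustration.

```python
# Coefficients and intercept copied from the cell output above
coefficients = {
    "bedrooms": 115.67218110629409,
    "bathrooms": 15.32773278579743,
    "bathrooms_na": -59.66329665713672,
    "minimum_nights": -0.5012697007580986,
    "number_of_reviews": -0.29570073989207096,
}
intercept = 61.14012549013641

# Hypothetical listing (values made up for illustration)
listing = {
    "bedrooms": 2,
    "bathrooms": 1,
    "bathrooms_na": 0,
    "minimum_nights": 2,
    "number_of_reviews": 10,
}

# Linear model: intercept + sum of (coefficient * feature value)
predicted_price = intercept + sum(
    coefficients[name] * listing[name] for name in coefficients
)
print(f"Predicted price: {predicted_price:.2f}")
```

For this made-up listing the formula gives a price of about 303.85. Each coefficient is the change in predicted price for a one-unit increase in that feature with the others held fixed.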


<i18n value="218d51b8-7453-4f6a-8965-5a60e8c80eaf"/>



## Distributed Setting

Although we can solve for the parameters quickly when the data is small, the closed-form solution doesn't scale well to large datasets.
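
For reference, the closed-form solution is usually written using the normal equations. The notation here is standard textbook notation rather than anything defined in this notebook: X is the n × d matrix of features (with a column of ones for the intercept), y is the vector of prices, and ŵ is the vector of fitted parameters.

```latex
\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}
```

Forming X^T X takes on the order of n·d² operations and solving the resulting d × d system takes on the order of d³, so the cost grows quickly with the number of features.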

Spark uses the following approach to solve a linear regression problem:

* First, Spark tries to solve the linear regression problem directly using matrix decomposition.
* If that fails, Spark falls back to <a href="https://spark.apache.org/docs/latest/ml-advanced.html#limited-memory-bfgs-l-bfgs" target="_blank">L-BFGS</a> to solve for the parameters. L-BFGS is a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables. <a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm" target="_blank">BFGS</a> belongs to the family of <a href="https://en.wikipedia.org/wiki/Quasi-Newton_method" target="_blank">quasi-Newton methods</a>, which iteratively find zeros or local maxima and minima of functions.
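
One way to picture the matrix-based approach in a distributed setting is sketched below in plain Python. Each "partition" computes its own partial sums of X^T X and X^T y, the partial sums are added together, and the small resulting system is solved on one machine. This is an illustrative assumption about the general idea, not Spark's actual implementation; the helper names (`partial_stats`, `merge_stats`, `solve`) and the toy data are hypothetical.

```python
# Illustrative sketch only: NOT Spark's implementation.
# Each "partition" is a plain Python list of (features, label) rows.

def partial_stats(rows, n_features):
    """Accumulate X^T X and X^T y for one partition (a leading 1 models the intercept)."""
    d = n_features + 1
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for features, label in rows:
        x = [1.0] + list(features)
        for i in range(d):
            xty[i] += x[i] * label
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def merge_stats(a, b):
    """Add the partial sums from two partitions element by element."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty

def solve(a, b):
    """Solve a small linear system with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

# Hypothetical toy data generated from price = 50 + 100 * bedrooms,
# split across two "partitions"
partitions = [
    [((1.0,), 150.0), ((2.0,), 250.0)],
    [((3.0,), 350.0), ((4.0,), 450.0)],
]

# "Map" step: per-partition statistics; "reduce" step: add them up
stats = [partial_stats(p, n_features=1) for p in partitions]
total = stats[0]
for s in stats[1:]:
    total = merge_stats(total, s)

# The merged system is only (d + 1) x (d + 1), so it is solved locally
toy_intercept, toy_slope = solve(*total)
print(toy_intercept, toy_slope)
```

The merged statistics depend only on the number of features, not on the number of rows, which is why this style of solution copes well with many rows but becomes expensive as the number of features grows.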

If you are interested in how linear regression is implemented in the distributed setting and where its bottlenecks are, check out these lecture slides:
* <a href="https://files.training.databricks.com/static/docs/distributed-linear-regression-1.pdf" target="_blank">distributed-linear-regression-1</a>
* <a href="https://files.training.databricks.com/static/docs/distributed-linear-regression-2.pdf" target="_blank">distributed-linear-regression-2</a>

<i18n value="f3e00d9e-3b02-44cf-87b7-20b54ba350c9"/>



### Next Steps

Yikes! We built a pretty bad model: an R2 of about 0.33 means it explains only about a third of the variance in price on the test set. In the next notebook, we will see how we can improve upon our model.