
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Linear Regression Lab

In the previous lesson, we predicted price using just one variable: bedrooms. Now, we want to predict price given a few other features.

Steps:
0. Use the features `bedrooms`, `bathrooms`, `bathrooms_na`, `minimum_nights`, and `number_of_reviews` as input to your `VectorAssembler`.
0. Build a linear regression model.
0. Evaluate the `RMSE` and the `R2` on the test set.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Build a linear regression model with multiple features
 - Compute various metrics to evaluate goodness of fit

In [0]:
%run "../Includes/Classroom-Setup"

In [0]:
filePath = "dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnbDF = spark.read.format("delta").load(filePath)
(trainDF, testDF) = airbnbDF.randomSplit([.8, .2], seed=42)

In [0]:
# TODO

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

vecAssembler = VectorAssembler(
    inputCols = ["bedrooms", "bathrooms", "bathrooms_na", "minimum_nights", "number_of_reviews"],
    outputCol = "features"
)

vecTrainDF = vecAssembler.transform(trainDF)
vecTestDF = vecAssembler.transform(testDF)

lrModel = LinearRegression(featuresCol="features", labelCol="price").fit(vecTrainDF)

predDF = lrModel.transform(vecTestDF)

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regressionEvaluator.evaluate(predDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")
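
To make the two metrics above concrete, here is a minimal plain-Python sketch of the usual definitions: RMSE as the square root of the mean squared residual, and R2 as one minus the ratio of the residual sum of squares to the total sum of squares. The helper names `rmse` and `r2` and the toy numbers are assumptions made for illustration; they are not part of the lab and are not taken from the Airbnb data.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt of the mean squared residual."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions, chosen only to keep the arithmetic simple.
prices = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print(f"RMSE is {rmse(prices, predicted)}")
print(f"R2 is {r2(prices, predicted)}")
```

RMSE is expressed in the same units as the label (here, price), while R2 is unitless and compares the model against always predicting the mean label.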

Examine the coefficients for each of the variables.

In [0]:
# Pair each input column with its learned coefficient, reusing the assembler's column list.
for feature, coef in zip(vecAssembler.getInputCols(), lrModel.coefficients):
    print(feature, coef)

print(f"intercept: {lrModel.intercept}")
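
As a rough illustration of how to read these numbers: a linear regression prediction is the intercept plus the sum of each feature value times its coefficient. The sketch below uses made-up coefficients and a made-up listing (they are not the values this model learns), and the helper `predict` is a hypothetical name introduced only for this example.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: intercept + sum of feature * coefficient."""
    return intercept + sum(x * w for x, w in zip(features, coefficients))


# Hypothetical coefficients, in the same column order as the VectorAssembler inputs.
feature_names = ["bedrooms", "bathrooms", "bathrooms_na", "minimum_nights", "number_of_reviews"]
coefficients = [50.0, 30.0, -10.0, -0.5, -0.1]
intercept = 40.0

# Hypothetical listing: 2 bedrooms, 1 bathroom, bathrooms not missing,
# 3 minimum nights, 20 reviews.
listing = [2.0, 1.0, 0.0, 3.0, 20.0]

price = predict(listing, coefficients, intercept)
print(f"predicted price: {price}")
```

Under this reading, each coefficient is the change in predicted price for a one-unit increase in that feature while the other features are held fixed.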

## Distributed Setting

Although we can quickly solve for the parameters when the data is small, the closed-form solution doesn't scale well to large datasets.
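
For intuition, the closed-form solution can be sketched on a single machine by solving the normal equations `(X^T X) w = X^T y`. The code below is a minimal plain-Python illustration under that assumption, not Spark's implementation; the function names and the toy data are hypothetical. Building `X^T X` takes work proportional to the number of rows times the square of the number of features, and solving the system grows with the cube of the number of features, which is one reason this approach becomes expensive as the problem grows.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_normal_equations(rows, labels):
    """Return [intercept, w1, ..., wd] solving (X^T X) w = X^T y."""
    X = [[1.0] + list(row) for row in rows]  # leading 1.0 column models the intercept
    d = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(X, labels)) for i in range(d)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data generated from price = 20 + 50 * bedrooms + 30 * bathrooms.
rows = [[1, 1], [2, 1], [3, 2], [4, 3], [2, 2]]
labels = [20 + 50 * bedrooms + 30 * bathrooms for bedrooms, bathrooms in rows]

weights = fit_normal_equations(rows, labels)
print(weights)
```

Because the toy labels are an exact linear function of the features, the recovered weights should match the generating intercept and coefficients up to floating-point error.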

Spark uses the following approach to solve a linear regression problem:

* First, Spark tries to use matrix decomposition to solve the linear regression problem.
* If that fails, Spark uses [L-BFGS](https://spark.apache.org/docs/latest/ml-advanced.html#limited-memory-bfgs-l-bfgs) to solve for the parameters. L-BFGS is a limited-memory version of BFGS that is particularly suited to problems with a very large number of variables. The [BFGS](https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm) method belongs to the family of [quasi-Newton methods](https://en.wikipedia.org/wiki/Quasi-Newton_method), which iteratively find zeroes or local maxima and minima of functions.
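
To give a feel for what "solving iteratively" means, the sketch below fits a linear model with plain gradient descent on the mean squared error. This is deliberately a simpler stand-in and is not L-BFGS: L-BFGS also builds an approximation of curvature from recent gradients to choose better steps. The function name, learning rate, iteration count, and toy data are all assumptions made for illustration. What the two methods share is that each iteration only needs the gradient, which is a sum over rows and can therefore be computed per partition and then aggregated.

```python
def fit_gradient_descent(rows, labels, learning_rate=0.1, num_iterations=2000):
    """Fit weights and intercept by gradient descent on half the mean squared error."""
    n, d = len(rows), len(rows[0])
    weights = [0.0] * d
    intercept = 0.0
    for _ in range(num_iterations):
        # Residual (prediction - label) for every row under the current parameters.
        residuals = [
            intercept + sum(w * x for w, x in zip(weights, row)) - y
            for row, y in zip(rows, labels)
        ]
        # Gradients are averages over rows, so they decompose into per-row sums.
        grad_w = [sum(r * row[j] for r, row in zip(residuals, rows)) / n for j in range(d)]
        grad_b = sum(residuals) / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        intercept -= learning_rate * grad_b
    return weights, intercept


# Hypothetical toy data generated from y = 1 + 2 * x.
rows = [[0.0], [1.0], [2.0], [3.0], [4.0]]
labels = [1.0 + 2.0 * row[0] for row in rows]

weights, intercept = fit_gradient_descent(rows, labels)
print(weights, intercept)
```

The fixed learning rate here is tuned to this tiny dataset; on real data with unscaled features it would likely need adjusting, which is part of why more sophisticated optimizers are used in practice.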

If you are interested in how linear regression is implemented in the distributed setting and where its bottlenecks lie, check out these lecture slides:
* [distributed-linear-regression-1](https://files.training.databricks.com/static/docs/distributed-linear-regression-1.pdf)
* [distributed-linear-regression-2](https://files.training.databricks.com/static/docs/distributed-linear-regression-2.pdf)

### Next Steps

Yikes! We built a pretty bad model. In the next notebook, we will see how we can further improve upon our model.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>