
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Classification: Logistic Regression

Up until this point, we have only examined regression use cases. Now let's take a look at how to handle classification.

For this lab, we will use the same Airbnb dataset for San Francisco, but instead of predicting price, we will predict whether a host is a <a href="https://www.airbnb.com/superhost" target="_blank">superhost</a>.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - **Build a Logistic Regression model**
 - **Use various metrics to evaluate model performance**

In [0]:
%run "../Includes/Classroom-Setup"

In [0]:
file_path = f"{datasets_dir}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)

## Baseline Model

Before we build any machine learning models, we want a baseline model to compare against. We are going to start by predicting whether a host is a <a href="https://www.airbnb.com/superhost" target="_blank">superhost</a>.

For our baseline model, we are going to predict that no one is a superhost and evaluate the accuracy of that prediction. We will examine other metrics later as we build more complex models.

0. Convert our **`host_is_superhost`** column (t/f) into 1/0 and call the resulting column **`label`**. Drop the **`host_is_superhost`** column afterwards so that it cannot leak the answer into the features later.
0. Add a column to the resulting DataFrame called **`prediction`** which contains the literal value **`0.0`**. We will make a constant prediction that no one is a superhost.

After these two steps, we can evaluate the "model" accuracy.

Some helpful functions:
* <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html#pyspark.sql.functions.when" target="_blank">when()</a>
* <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn" target="_blank">withColumn()</a>
* <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.lit.html?highlight=lit#pyspark.sql.functions.lit" target="_blank">lit()</a>
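The two steps above can be pictured without Spark. The following is a minimal plain-Python sketch rather than the lab solution; the `rows` list and its values are made up for illustration, and `add_label_and_prediction` is a hypothetical helper name.

```python
# Hypothetical rows standing in for the host_is_superhost column (t/f strings).
rows = [{"host_is_superhost": v} for v in ["t", "f", "f", "t", "f"]]

def add_label_and_prediction(row):
    """Map t/f to a 1.0/0.0 label, drop the original column, add a constant prediction."""
    out = {k: v for k, v in row.items() if k != "host_is_superhost"}
    out["label"] = 1.0 if row["host_is_superhost"] == "t" else 0.0
    out["prediction"] = 0.0  # baseline: nobody is a superhost
    return out

baseline_rows = [add_label_and_prediction(r) for r in rows]

# Accuracy is the share of rows where the prediction equals the label.
accuracy = sum(r["prediction"] == r["label"] for r in baseline_rows) / len(baseline_rows)
print(f"Baseline accuracy: {accuracy:.2f}")
```

Because the prediction is always 0.0, the accuracy of this baseline is simply the share of hosts who are not superhosts, which is why accuracy alone can look strong on imbalanced data.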

In [0]:
# TODO
from pyspark.sql.functions import col, lit, when

# Encode the t/f flag as a 1.0/0.0 label, then drop the original column
# so it cannot leak the answer into the features later.
label_df = (airbnb_df
            .select(when(col("host_is_superhost") == "t", 1.0).otherwise(0.0).alias("label"), "*")
            .drop("host_is_superhost"))

# Baseline: predict that no one is a superhost.
pred_df = label_df.withColumn("prediction", lit(0.0))

display(pred_df)

label,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na,prediction
1.0,moderate,f,2.0,South of Market,37.77818,-122.41444,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,1.0,14.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,199.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,strict_14_with_grace_period,t,6.0,Western Addition,37.77905,-122.43992,House,Private room,6.0,1.0,2.0,3.0,Real Bed,1.0,36.0,93.0,9.0,10.0,10.0,10.0,9.0,8.0,425.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,strict_14_with_grace_period,f,2.0,West of Twin Peaks,37.73313,-122.46054,House,Private room,4.0,1.0,2.0,3.0,Real Bed,3.0,207.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,129.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,flexible,t,14.0,Western Addition,37.78172,-122.43861,House,Private room,2.0,1.0,1.0,1.0,Real Bed,30.0,1.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,55.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,strict_14_with_grace_period,f,44.0,Mission,37.75825,-122.41512,Apartment,Entire home/apt,4.0,1.0,2.0,2.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,156.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
0.0,strict_14_with_grace_period,t,1.0,Inner Sunset,37.7491,-122.46782,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,Real Bed,1.0,29.0,95.0,10.0,10.0,10.0,10.0,9.0,9.0,125.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,moderate,f,2.0,Castro/Upper Market,37.76925,-122.43077,Apartment,Entire home/apt,5.0,1.0,2.0,2.0,Real Bed,3.0,17.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,290.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,super_strict_60,f,16.0,Financial District,37.78931,-122.40231,Serviced apartment,Entire home/apt,6.0,2.5,2.0,3.0,Real Bed,1.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,9.0,1300.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,moderate,t,2.0,Castro/Upper Market,37.76077,-122.43786,Apartment,Entire home/apt,3.0,1.0,1.0,1.0,Real Bed,30.0,10.0,98.0,10.0,9.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0.0,moderate,t,1.0,Outer Sunset,37.75359,-122.48137,Apartment,Entire home/apt,4.0,1.0,2.0,3.0,Real Bed,4.0,38.0,99.0,10.0,10.0,10.0,10.0,9.0,10.0,180.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Evaluate model

For now, let's use accuracy as our metric. This is available from <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=multiclassclassificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator" target="_blank">MulticlassClassificationEvaluator</a>.

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

mc_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print(f"The accuracy is {100*mc_evaluator.evaluate(pred_df):.2f}%")

## Train-Test Split

Alright! Now we have built a baseline model. The next step is to split our data into training and test sets.

In [0]:
train_df, test_df = label_df.randomSplit([.8, .2], seed=42)
print(train_df.cache().count())
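`randomSplit` treats the weights as target proportions rather than exact sizes, so the resulting counts depend on the seed (and, on a cluster, typically also on how the data is partitioned). The plain-Python sketch below illustrates that idea with independent per-row draws. It is an analogy under that assumption, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # hypothetical row ids
train, test = random_split(rows, train_fraction=0.8, seed=42)

# The sizes land near 800/200 but are not guaranteed to match exactly.
print(len(train), len(test))
```

Reusing the same seed reproduces the same split, which is what makes later metric comparisons meaningful.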

## Visualize

Let's look at the relationship between **`review_scores_rating`** and **`label`** in our training dataset.

In [0]:
display(train_df.select("review_scores_rating", "label"))

review_scores_rating,label
97.0,0.0
98.0,0.0
80.0,0.0
95.0,0.0
100.0,0.0
74.0,0.0
98.0,0.0
96.0,0.0
100.0,0.0
96.0,0.0


## Logistic Regression

Now build a <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html?highlight=logisticregression#pyspark.ml.classification.LogisticRegression" target="_blank">logistic regression model</a> using all of the features (HINT: use RFormula). Put the pre-processing step and the Logistic Regression Model into a Pipeline.

In [0]:
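# TODO
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression

# "label ~ ." uses every column except label as a feature; RFormula
# indexes and one-hot encodes the string columns automatically.
r_formula = RFormula(formula="label ~ .",
                     featuresCol="features",
                     labelCol="label",
                     handleInvalid="skip")
lr = LogisticRegression(labelCol="label", featuresCol="features")

# Fit the pre-processing and the model on the training set only,
# then score the held-out test set.
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)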
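
For intuition about what the fitted model computes: logistic regression passes a weighted sum of the features through the sigmoid function to get a probability, then compares it to a threshold (0.5 is the default in Spark's `LogisticRegression`). A minimal sketch in plain Python follows; the weights, intercept, and feature values are made up for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one example."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, 1.0 if probability > threshold else 0.0

# Hypothetical coefficients for two features.
weights, intercept = [0.8, -0.5], -1.0
probability, label = predict([3.0, 1.0], weights, intercept)
print(f"probability={probability:.3f}, label={label}")
```

Raising the threshold trades recall for precision, which is one reason to look at threshold-independent metrics as well as accuracy.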

## Evaluate

What is AUROC useful for? Try adding additional evaluation metrics, like Area Under PR Curve.

In [0]:
# ANSWER
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

mc_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print(f"The accuracy is {100*mc_evaluator.evaluate(pred_df):.2f}%")

bc_evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(f"The area under the ROC curve: {bc_evaluator.evaluate(pred_df):.2f}")

bc_evaluator.setMetricName("areaUnderPR")
print(f"The area under the PR curve: {bc_evaluator.evaluate(pred_df):.2f}")
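
One way to read AUROC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. That makes it independent of any single decision threshold. A minimal plain-Python sketch of that pairwise definition follows, with made-up labels and scores; it is quadratic in the number of rows and meant only for illustration.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(f"AUROC: {auroc(labels, scores):.3f}")
```

Under this definition, a model that gives every row the same score gets 0.5, no matter how imbalanced the classes are.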

## Add Hyperparameter Tuning

Try changing the hyperparameters of the logistic regression model using the cross-validator. By how much can you improve your metrics?

In [0]:
# TODO
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Candidate hyperparameter values for the logistic regression model.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.1, 0.2])
              .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
              .build())

# Select the best combination by 3-fold cross-validated accuracy.
cv = CrossValidator(estimator=lr,
                    evaluator=mc_evaluator,
                    estimatorParamMaps=param_grid,
                    numFolds=3,
                    parallelism=4,
                    seed=42)

# Run RFormula once, then cross-validate only the model stage.
pipeline = Pipeline(stages=[r_formula, cv])
pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)
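
It helps to know how much work a grid implies before running it. The sketch below enumerates the same grid values in plain Python. The fit count assumes one fit per combination per fold plus a final refit of the best combination on the full training set, which is how `CrossValidator` is documented to behave.

```python
from itertools import product

reg_params = [0.1, 0.2]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 3

# Every pairing of the two hyperparameters is one candidate model.
grid = [{"regParam": r, "elasticNetParam": e}
        for r, e in product(reg_params, elastic_net_params)]

cv_fits = len(grid) * num_folds
total_fits = cv_fits + 1  # plus the final refit on all training data

print(f"{len(grid)} combinations, {cv_fits} cross-validation fits, {total_fits} fits in total")
```

Adding one more value to either list grows the number of fits multiplicatively, so it is usually worth widening the grid gradually.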

## Evaluate again

In [0]:
mc_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print(f"The accuracy is {100*mc_evaluator.evaluate(pred_df):.2f}%")

bc_evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(f"The area under the ROC curve: {bc_evaluator.evaluate(pred_df):.2f}")

## Super Bonus

#### Try using MLflow to track your experiments!

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>