
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Data Exploration

In this notebook, we will use the dataset we cleansed in the previous lab to do some Exploratory Data Analysis (EDA).

This will help us better understand our data to make a better model.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Identify log-normal distributions
 - Build a baseline model and evaluate it

In [0]:
%run "../Includes/Classroom-Setup"

Let's keep 80% for the training set and set aside 20% of our data for the test set. We will use the `randomSplit` method [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Dataset).

We will discuss the train-test split in more detail later, but throughout this notebook, do your data exploration on `trainDF`.

In [0]:
filePath = "dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnbDF = spark.read.format("delta").load(filePath)
trainDF, testDF = airbnbDF.randomSplit([.8, .2], seed=42)
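
One thing worth knowing about `randomSplit` is that the weights are target proportions, not exact sizes, so the training set will hold roughly (not exactly) 80% of the rows. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and the helper name `random_split` is made up for this example.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent, seeded random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), e.g. [0.8, 1.0]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # First split whose upper bound exceeds the draw; fall back to the last split
        idx = next((i for i, b in enumerate(bounds) if x < b), len(weights) - 1)
        splits[idx].append(row)
    return splits

rows = list(range(10_000))
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train) / len(rows))  # close to 0.8, but usually not exactly 0.8
```

Fixing the seed makes the assignment repeatable for the same input, which is why the notebook passes `seed=42`.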

Let's make a histogram of the price column to explore it (change the number of bins to 300).

In [0]:
display(trainDF.select("price"))

price
85.0
45.0
128.0
100.0
250.0
250.0
125.0
80.0
72.0
150.0


Is this a <a href="https://en.wikipedia.org/wiki/Log-normal_distribution" target="_blank">log-normal</a> distribution? Take the `log` of price and check the histogram. Keep this in mind for later :).

In [0]:
# TODO
from pyspark.sql.functions import log

display(trainDF.select(log("price")))

ln(price)
4.442651256490317
3.80666248977032
4.852030263919617
4.605170185988092
5.521460917862246
5.521460917862246
4.8283137373023015
4.382026634673881
4.276666119016055
5.010635294096256
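
If you want a numeric check to go with the histogram, one option is to compare skewness before and after the log: a log-normal variable is strongly right-skewed, while its log is roughly symmetric. The sketch below uses synthetic data with made-up parameters (`mu=5.0`, `sigma=0.6`), not the Airbnb prices, and the `skewness` helper is written here for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
# Synthetic log-normal "prices"; the parameters are assumptions for illustration
synthetic_prices = [rng.lognormvariate(5.0, 0.6) for _ in range(10_000)]
log_prices = [math.log(p) for p in synthetic_prices]

raw_skew = skewness(synthetic_prices)
log_skew = skewness(log_prices)
print(f"skewness of raw values: {raw_skew:.2f}")  # clearly positive (long right tail)
print(f"skewness of log values: {log_skew:.2f}")  # near zero (roughly symmetric)
```

A similar pattern in the real `price` column (long right tail that disappears after taking the log) would suggest it is approximately log-normal.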


Now take a look at how `price` depends on some of the variables:
* Plot `price` vs `bedrooms`
* Plot `price` vs `accommodates`

Make sure to change the aggregation to `AVG`.

In [0]:
display(trainDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,flexible,f,1.0,Bayview,37.72001,-122.39249,House,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,2.0,128.0,97.0,10.0,10.0,10.0,10.0,9.0,10.0,85.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bayview,37.7325,-122.39221,House,Private room,1.0,1.0,1.0,1.0,Real Bed,31.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,45.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Bernal Heights,37.73905,-122.41269,Apartment,Private room,1.0,1.0,1.0,1.0,Real Bed,30.0,1.0,80.0,10.0,8.0,10.0,10.0,8.0,10.0,128.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.7422,-122.42091,Guest suite,Private room,4.0,1.0,1.0,3.0,Real Bed,3.0,49.0,95.0,10.0,10.0,10.0,10.0,10.0,9.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Bernal Heights,37.74552,-122.41195,Apartment,Entire home/apt,2.0,2.0,1.0,1.0,Real Bed,2.0,4.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Financial District,37.7842,-122.39925,Apartment,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,183.0,3.0,74.0,6.0,6.0,4.0,10.0,10.0,8.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Glen Park,37.74185,-122.42977,Apartment,Entire home/apt,3.0,1.0,0.0,2.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,125.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,f,1.0,Haight Ashbury,37.76637,-122.4467,House,Private room,2.0,1.0,1.0,1.0,Real Bed,7.0,50.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Haight Ashbury,37.77407,-122.44556,Condominium,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,2.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Inner Richmond,37.77777,-122.45531,House,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,30.0,74.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
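
To make the `AVG` aggregation concrete, here is a small plain-Python sketch of what the plot computes for `price` vs `bedrooms`: group the rows by the x-axis value and average the prices in each group. It uses only the ten sample rows shown above, so the numbers are illustrative and not representative of the full `trainDF`.

```python
from collections import defaultdict

# (bedrooms, price) pairs copied from the ten sample rows displayed above
sample = [(1, 85.0), (1, 45.0), (1, 128.0), (1, 100.0), (1, 250.0),
          (2, 250.0), (0, 125.0), (1, 80.0), (1, 72.0), (2, 150.0)]

# Collect the prices for each bedroom count
prices_by_bedrooms = defaultdict(list)
for bedrooms, price in sample:
    prices_by_bedrooms[bedrooms].append(price)

# Average price per bedroom count, ordered by bedroom count
avg_price_by_bedrooms = {
    bedrooms: sum(prices) / len(prices)
    for bedrooms, prices in sorted(prices_by_bedrooms.items())
}

for bedrooms, avg_price in avg_price_by_bedrooms.items():
    print(bedrooms, round(avg_price, 2))
```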


Let's take a look at the distribution of some of our categorical features.

In [0]:
display(trainDF.groupBy("room_type").count())

room_type,count
Shared room,145
Entire home/apt,3520
Private room,2121


Which neighbourhoods have the highest number of rentals? Display the neighbourhoods and their associated count in descending order.

In [0]:
# TODO
from pyspark.sql.functions import col

display(trainDF.groupBy("neighbourhood_cleansed").count().orderBy(col("count").desc()))

neighbourhood_cleansed,count
Mission,572
South of Market,501
Western Addition,478
Downtown/Civic Center,445
Castro/Upper Market,330
Bernal Heights,291
Haight Ashbury,289
Noe Valley,256
Outer Sunset,222
Potrero Hill,176


#### How much does the price depend on the location?

In [0]:
trainDF.createOrReplaceTempView("trainDF")

We can use `displayHTML` to render any HTML, CSS, or JavaScript code.
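
The cell that follows builds the heat-layer data by formatting each row into a string and joining the pieces by hand. As a possible alternative, `json.dumps` can produce the same kind of JavaScript array literal in one call; it also writes missing values as `null`, whereas `str.format` would emit Python's `None`, which is not valid JavaScript. This is only a sketch: the `points` list below stands in for the collected rows and reuses three latitude/longitude/price values from the sample output above.

```python
import json

# Stand-in for the collected (latitude, longitude, price / 600) rows
points = [
    (37.72001, -122.39249, 85.0 / 600),
    (37.7325, -122.39221, 45.0 / 600),
    (37.73905, -122.41269, 128.0 / 600),
]

# Serialize to a JavaScript-compatible array of [lat, lng, intensity] triples
heat_points_js = json.dumps([list(p) for p in points])
print(heat_points_js)
```

Note that `json.dumps` already emits the outer brackets, so a string built this way would be passed to `L.heatLayer(...)` directly instead of being wrapped in `[` and `]` again.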

In [0]:
%python

from pyspark.sql.functions import col

trainDF = spark.table("trainDF")

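# Each heat point is [latitude, longitude, intensity]; price is scaled down by 600 to serve as the intensity
lat_long_price_values = trainDF.select(col("latitude"),col("longitude"),col("price")/600).collect()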

lat_long_price_strings = [
  "[{}, {}, {}]".format(lat, long, price) 
  for lat, long, price in lat_long_price_values
]

v = ",\n".join(lat_long_price_strings)

# DO NOT worry about what this HTML code is doing! We took it from Stack Overflow :-)
displayHTML("""
<html>
<head>
 <link rel="stylesheet" href="https://unpkg.com/leaflet@1.3.1/dist/leaflet.css"
   integrity="sha512-Rksm5RenBEKSKFjgI3a41vrjkw4EVPlJ3+OiI65vTjIdo9brlAacEuKOiQ5OFh7cOI1bkDwLqdLw3Zg0cRJAAQ=="
   crossorigin=""/>
 <script src="https://unpkg.com/leaflet@1.3.1/dist/leaflet.js"
   integrity="sha512-/Nsx9X4HebavoBvEBuyp3I7od5tA0UzAxs+j83KgC8PU0kgB4XiK4Lfe4y4cgBtaRJQEIFCW+oC506aPT2L1zw=="
   crossorigin=""></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/leaflet.heat/0.2.0/leaflet-heat.js"></script>
</head>
<body>
    <div id="mapid" style="width:700px; height:500px"></div>
  <script>
  var mymap = L.map('mapid').setView([37.7587,-122.4486], 12);
  var tiles = L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png', {
    attribution: '&copy; <a href="http://osm.org/copyright">OpenStreetMap</a> contributors',
}).addTo(mymap);
  var heat = L.heatLayer([""" + v + """], {radius: 25}).addTo(mymap);
  </script>
  </body>
  </html>
""")

## Baseline Model

Before we build any Machine Learning models, we want to build a baseline model to compare to. We also want to determine a metric to evaluate our model. Let's use RMSE here.

For this dataset, let's build a baseline model that always predicts the average price and one that always predicts the [median](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.approxQuantile) price, and see how we do. Do this in two separate steps:

0. `trainDF`: Extract the average and median price from `trainDF`, and store them in the variables `avgPrice` and `medianPrice`, respectively.
0. `testDF`: Create two additional columns called `avgPrediction` and `medianPrediction` with the average and median price from `trainDF`, respectively. Call the resulting DataFrame `predDF`. 

Some useful functions:
* avg() [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.avg)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$)
* col() [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.col)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$)
* lit() [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lit)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.functions$)
* approxQuantile() [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.approxQuantile)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrameStatFunctions) [**HINT**: There is no median function, so you will need to use approxQuantile]
* withColumn() [Python](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Dataset)

In [0]:
# TODO
from pyspark.sql.functions import avg, lit

avgPrice = trainDF.select(avg("price")).first()[0]
medianPrice = trainDF.approxQuantile("price", probabilities=[0.5], relativeError=0.01)[0]

predDF = (testDF
          .withColumn("avgPrediction", lit(avgPrice))
          .withColumn("medianPrediction", lit(medianPrice)))
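
For intuition about how the two baselines differ, here is a plain-Python sketch on the ten sample prices displayed earlier (not the full `trainDF`). With right-skewed prices, a few expensive listings pull the mean above the median. The variable names are illustrative.

```python
import statistics

# Prices copied from the ten sample rows of trainDF displayed earlier
sample_prices = [85.0, 45.0, 128.0, 100.0, 250.0, 250.0, 125.0, 80.0, 72.0, 150.0]

avg_price = statistics.mean(sample_prices)
median_price = statistics.median(sample_prices)

# A baseline "model" is just a constant: every listing gets the same prediction
avg_predictions = [avg_price] * len(sample_prices)
median_predictions = [median_price] * len(sample_prices)

print(avg_price, median_price)  # 128.5 112.5
```

`statistics.median` averages the two middle values here, whereas `approxQuantile` returns an approximate quantile picked from the data itself, so the two can differ slightly.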

## Evaluate model

We are going to use SparkML's `RegressionEvaluator` to compute the [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation) for our average price and median price predictions [Python](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator)/[Scala](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator). We will dig into evaluators in more detail in the next notebook.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionMeanEvaluator = RegressionEvaluator(predictionCol="avgPrediction", labelCol="price", metricName="rmse")
print(f"The RMSE for predicting the average price is: {regressionMeanEvaluator.evaluate(predDF)}")

regressionMedianEvaluator = RegressionEvaluator(predictionCol="medianPrediction", labelCol="price", metricName="rmse")
print(f"The RMSE for predicting the median price is: {regressionMedianEvaluator.evaluate(predDF)}")
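
As a reminder of what the `rmse` metric measures, the sketch below computes it by hand: square each error, average, and take the square root. The `labels` values are hypothetical and are not taken from the dataset, and the `rmse` helper is written for this example rather than being part of SparkML.

```python
import math
import statistics

def rmse(labels, predictions):
    """Root-mean-square error between true values and predictions."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out prices, chosen only for illustration
labels = [100.0, 150.0, 200.0]

# Constant baseline: predict the mean for every row
constant = statistics.mean(labels)
baseline_rmse = rmse(labels, [constant] * len(labels))
print(baseline_rmse)
```

When the constant is the mean of the very labels being scored, the RMSE equals their population standard deviation. In this notebook the mean comes from `trainDF` and is scored on `testDF`, so the baseline RMSE should land near the spread of test prices without matching it exactly.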

Wow! We can see that always predicting the median or the mean doesn't do too well for our dataset. Let's see if we can improve this with a machine learning model!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>