
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Lab 6: Reading from Cosmos vs Reading from Parquet

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
 - Configure and use an Azure Cosmos database
 - Read from a Parquet file
 - Compare the runtimes of reading from Cosmos and from Parquet

In [3]:
%run "./../Includes/Classroom-Setup"

## Configure and Use an Azure Cosmos Database

Run the following cell to configure the connection to the Cosmos database that holds our Airbnb predictions.

In [6]:
PrimaryRead = "SidFEUCE2RA5qMYdUonLdwp4bAYW6Xj4R5Xdw4Bo7F4fkG9anp7IhjwEZc9wOvM4FJBU84efcupWZTrabFlinA==" # Read only keys
Endpoint = "https://airbnbpredictions.documents.azure.com:443/"
CosmosDatabase = "predictions"
CosmosCollection = "predictions"

if not PrimaryRead:
  raise Exception("Don't forget to specify the cosmos keys in this cell.")

cosmosConfig = {
  "Endpoint": Endpoint,
  "Masterkey": PrimaryRead,
  "Database": CosmosDatabase,
  "Collection": CosmosCollection
}
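
Hard-coding the key is convenient in a classroom, but elsewhere you may prefer to keep it out of the notebook. Below is a minimal sketch, assuming the key is supplied through an environment variable. The variable name `COSMOS_READ_KEY` and the helper `build_cosmos_config` are hypothetical and not part of this lab.

```python
import os

def build_cosmos_config(endpoint, database, collection, key_env_var="COSMOS_READ_KEY"):
    """Build the options dict for the Cosmos reader, taking the key from the environment.

    `key_env_var` is a hypothetical variable name chosen for illustration.
    """
    key = os.environ.get(key_env_var)
    if not key:
        # Same guard as the cell above, but naming the variable that is missing.
        raise ValueError(f"Set the {key_env_var} environment variable to the Cosmos read key.")
    return {
        "Endpoint": endpoint,
        "Masterkey": key,
        "Database": database,
        "Collection": collection,
    }
```

With the variable set, `cosmosConfig = build_cosmos_config(Endpoint, CosmosDatabase, CosmosCollection)` would produce the same dictionary shape as the cell above.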

Create a Spark DataFrame that reads from Cosmos DB, then display its contents.

In [8]:
cosmos_prediction_df = (spark.read
  .format("com.microsoft.azure.cosmosdb.spark")
  .options(**cosmosConfig)
  .load()
)

display(cosmos_prediction_df)

This is the standard Airbnb dataset with some additional metadata.

Based on the code above that creates a DataFrame from Cosmos DB, fill in the `predict_cosmos(id)` function so that it reads from Cosmos DB and returns the predicted price (a float) for the row whose `id` column matches the argument.

In [10]:
# ANSWER
from pyspark.sql.functions import col

def predict_cosmos(id):
  # Read from Cosmos, keep only the row with the matching id, and take its prediction
  prediction = (spark.read
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**cosmosConfig)
    .load()
    .filter(col("id") == id)
    .select("prediction")
    .first()
  )[0]
  return prediction

id_predict = "7b22b1d3-c634-4bad-a854-17e0669aa685"
p = predict_cosmos(id_predict)

print(p)
assert isinstance(p, float), "`predict_cosmos` should return a single float"
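
Note that `first()` returns `None` when no row matches the filter, so indexing its result with `[0]` raises a `TypeError` for an unknown `id`. Below is a minimal sketch of a guard, assuming you would rather raise a descriptive error. `extract_prediction` is a hypothetical helper, and a plain tuple stands in for a Spark `Row` so that the snippet runs without a cluster.

```python
def extract_prediction(row, row_id):
    """Return the prediction from the result of `.first()`, or fail with a clear message.

    `row` is whatever `.select("prediction").first()` returned: a Row (indexable
    like a tuple) or None when no row matched the filter.
    """
    if row is None:
        raise KeyError(f"No prediction found for id {row_id!r}")
    return float(row[0])

# A tuple stands in for a Spark Row object here.
print(extract_prediction((123.45,), "known-id"))  # 123.45
```

Inside `predict_cosmos`, this would mean passing the result of `.first()` to the helper instead of indexing it directly.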

## Reading from a Parquet File

As in the previous exercise, fill in the `predict_parquet(id)` function to return the prediction for row `id` from the Parquet file stored at path `/mnt/training/airbnb/sf-listings/prediction.parquet`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Load a Parquet file from its path using `spark.read.parquet`.

In [13]:
# ANSWER

def predict_parquet(id):
  prediction = (spark.read
    .parquet("/mnt/training/airbnb/sf-listings/prediction.parquet")
    .filter(col("id") == id)
    .select("prediction")
    .first()
  )[0]
  return prediction

p = predict_parquet(id_predict)
print(p)
assert isinstance(p, float), "`predict_parquet` should return a single float"

## Compare Runtimes of Cosmos and Parquet

Run the following cells to compare the average time it takes to execute 5 different queries against Cosmos versus against a Parquet file.

In [15]:
num_queries = 5

id_list = cosmos_prediction_df.limit(num_queries).select("id").collect()
ids = [i[0] for i in id_list]

ids

In [16]:
%timeit -n3 [predict_cosmos(id) for id in ids]

In [17]:
%timeit -n3 [predict_parquet(id) for id in ids]
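
`%timeit` is an IPython magic, so it is only available inside a notebook. Outside of one, the standard-library `timeit` module gives a comparable measurement. Below is a minimal sketch, assuming stand-in data. `fake_predict` and `sample_ids` are hypothetical placeholders for `predict_cosmos` / `predict_parquet` and the ids collected above, so the snippet runs without Spark.

```python
import statistics
import timeit

# Hypothetical stand-in for predict_cosmos / predict_parquet.
_fake_store = {"a": 100.0, "b": 150.0, "c": 80.0}

def fake_predict(row_id):
    return _fake_store[row_id]

sample_ids = list(_fake_store)

def average_seconds_per_loop(predict, ids, loops=3, repeats=5):
    """Run the full list of queries `loops` times per measurement, take
    `repeats` measurements, and return the mean time of a single loop."""
    timings = timeit.repeat(
        lambda: [predict(i) for i in ids],
        number=loops,    # plays the role of -n3 in the cells above
        repeat=repeats,
    )
    return statistics.mean(timings) / loops

print(f"{average_seconds_per_loop(fake_predict, sample_ids):.6f} s per loop")
```

Passing the real `predict_cosmos` or `predict_parquet` together with `ids` would time the same list comprehension that the two cells above measure.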

Which performed better?  What are the trade-offs between using the two solutions in production?

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> These results will depend in part on the load on Cosmos generated by the class.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>