<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Activity by Traffic Lab
Process streaming data to display total active users by traffic source.

##### Objectives
1. Read data stream
2. Get active users by traffic source
3. Execute query with display() and plot results
4. Execute the same streaming query with DataStreamWriter
5. View results being updated in the query table
6. List and stop all active streams

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=datastreamreader#pyspark.sql.streaming.DataStreamReader" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=datastreamwriter#pyspark.sql.streaming.DataStreamWriter" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery" target="_blank">StreamingQuery</a>
- <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streamingquerymanager#pyspark.sql.streaming.StreamingQueryManager" target="_blank">StreamingQueryManager</a>

### Setup
Run the cells below to generate data and create the **`schema`** string needed for this lab.

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
schema = "device STRING, ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT, unique_items: BIGINT>, event_name STRING, event_previous_timestamp BIGINT, event_timestamp BIGINT, geo STRUCT<city: STRING, state: STRING>, items ARRAY<STRUCT<coupon: STRING, item_id: STRING, item_name: STRING, item_revenue_in_usd: DOUBLE, price_in_usd: DOUBLE, quantity: BIGINT>>, traffic_source STRING, user_first_touch_timestamp BIGINT, user_id STRING"
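
The DDL string above nests `STRUCT` and `ARRAY` types, so a typo in it is easy to miss. The following is a minimal sketch, not part of the lab, of one way to list the top-level column names of such a string in plain Python before handing it to Spark. The helper `top_level_columns` and the abbreviated `sample_schema` are illustrative assumptions; the helper only tracks `<`/`>` nesting and does not validate types.

```python
from typing import List


def top_level_columns(ddl: str) -> List[str]:
    """Return the top-level column names of a DDL-style schema string."""
    fields, current, depth = [], [], 0
    for ch in ddl:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside any <...> ends a top-level field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    # The column name is the first token of each field
    return [field.split()[0] for field in fields if field.strip()]


# Abbreviated, hypothetical version of the lab schema
sample_schema = (
    "device STRING, "
    "ecommerce STRUCT<purchase_revenue_in_usd: DOUBLE, total_item_quantity: BIGINT>, "
    "geo STRUCT<city: STRING, state: STRING>, "
    "items ARRAY<STRUCT<coupon: STRING, quantity: BIGINT>>, "
    "user_id STRING"
)
print(top_level_columns(sample_schema))
```

Applied to the full **`schema`** string, the result should match the column list asserted in step 1.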

### 1. Read data stream
- Use schema stored in **`schema`**
- Set to process 1 file per trigger
- Read from parquet with filepath stored in **`eventsPath`**

Assign the resulting DataFrame to **`df`**.

In [0]:
# TODO
df = (spark
     .readStream
     .schema(schema)
     .option("maxFilesPerTrigger", 1)
     .parquet(eventsPath)
     )
df.isStreaming
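
To build intuition for `maxFilesPerTrigger`, the sketch below simulates in plain Python how a fixed set of files would be divided across triggers. It is a simplification under stated assumptions: the file names are hypothetical, all files are already present, and they are consumed in the order given, whereas the real file source also picks up files that arrive later.

```python
from typing import Iterator, List


def micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield the group of files that each successive trigger would process."""
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, for illustration only
files = ["part-0000.parquet", "part-0001.parquet", "part-0002.parquet"]

for batch_id, batch in enumerate(micro_batches(files, 1)):
    print(batch_id, batch)
```

With a limit of 1, each trigger handles a single file, which is why the aggregated counts in this lab grow gradually rather than appearing all at once.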

**CHECK YOUR WORK**

In [0]:
assert df.isStreaming
assert df.columns == ["device", "ecommerce", "event_name", "event_previous_timestamp", "event_timestamp", "geo", "items", "traffic_source", "user_first_touch_timestamp", "user_id"]

### 2. Get active users by traffic source
- Set the default number of shuffle partitions to the number of cores on your cluster (optional, but the query runs faster)
- Group by **`traffic_source`**
  - Aggregate the approximate count of distinct users and alias with "active_users"
- Sort by **`traffic_source`**

In [0]:
from pyspark.sql.functions import col, approx_count_distinct

# Optional: match shuffle partitions to the number of cores on the cluster
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

trafficDF = (df
             .groupBy(col('traffic_source'))
             .agg(approx_count_distinct('user_id').alias('active_users'))
             .sort('traffic_source')
            )

display(trafficDF)

traffic_source,active_users
direct,438886
email,281525
facebook,956769
google,1781961
instagram,530050
youtube,253321
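
As a point of reference for what this aggregation computes, here is a minimal plain-Python sketch that counts distinct users per traffic source exactly, using sets. The sample events and the helper name `active_users_by_source` are hypothetical. Note the difference from the lab: `approx_count_distinct` returns an estimate (its optional `rsd` argument bounds the relative error), which lets Spark keep a small fixed-size state per group instead of a full set of user IDs.

```python
from collections import defaultdict

# Hypothetical (traffic_source, user_id) events
events = [
    ("google", "u1"),
    ("google", "u2"),
    ("google", "u1"),  # repeat visit, must not be counted twice
    ("email", "u3"),
    ("direct", "u2"),
]


def active_users_by_source(rows):
    """Exact distinct user count per traffic source, sorted by source."""
    users = defaultdict(set)
    for source, user_id in rows:
        users[source].add(user_id)
    return sorted((source, len(ids)) for source, ids in users.items())


print(active_users_by_source(events))
# → [('direct', 1), ('email', 1), ('google', 2)]
```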


**CHECK YOUR WORK**

In [0]:
assert str(trafficDF.schema) == "StructType(List(StructField(traffic_source,StringType,true),StructField(active_users,LongType,false)))"

### 3. Execute query with display() and plot results
- Execute the **`trafficDF`** query using display()
- Plot the streaming query results as a bar graph

In [0]:
# TODO
display(trafficDF)

traffic_source,active_users
direct,438886
email,281525
facebook,956769
google,1781961
instagram,530050
youtube,253321


**CHECK YOUR WORK**
- Your bar chart should plot `traffic_source` on the x-axis and `active_users` on the y-axis.
- The top three traffic sources in descending order should be `google`, `facebook`, and `instagram`.

### 4. Execute the same streaming query with DataStreamWriter
- Name the query "active_users_by_traffic"
- Set to "memory" format and "complete" output mode
- Set a trigger interval of 1 second

In [0]:
# TODO
trafficQuery = (trafficDF
                .writeStream
                .queryName('active_users_by_traffic')
                .format('memory')
                .outputMode('complete')
                .trigger(processingTime='1 second')
                .start()
)
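
The "complete" output mode matters here because the query sorts an aggregation, and Structured Streaming only supports sorting a streaming DataFrame after an aggregation and in complete mode. The sketch below is a plain-Python simulation, not Spark code, of what complete mode does: aggregation state persists across micro-batches, and the entire result table is rewritten after each one. The batch contents and the name `run_complete_mode` are assumptions for illustration, and exact sets stand in for the approximate count.

```python
from collections import defaultdict


def run_complete_mode(batches):
    """Return the full result table as it would look after each micro-batch."""
    state = defaultdict(set)  # running aggregation state, kept across batches
    snapshots = []
    for batch in batches:
        for source, user_id in batch:
            state[source].add(user_id)
        # Complete mode: emit every group, not only the ones that changed
        snapshots.append({source: len(ids) for source, ids in sorted(state.items())})
    return snapshots


# Hypothetical micro-batches of (traffic_source, user_id) events
batches = [
    [("google", "u1"), ("email", "u2")],
    [("google", "u3"), ("google", "u1")],
]

for batch_id, table in enumerate(run_complete_mode(batches)):
    print(batch_id, table)
```

The `email` row reappears in the second snapshot even though no new `email` events arrived, which is the behavior that distinguishes complete mode from a mode that emits only changed rows.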

**CHECK YOUR WORK**

In [0]:
untilStreamIsReady("active_users_by_traffic")
assert trafficQuery.isActive
assert "active_users_by_traffic" in trafficQuery.name
assert trafficQuery.lastProgress["sink"]["description"] == "MemorySink"

### 5. View results being updated in the query table
Run a query in a SQL cell to display the results from the **`active_users_by_traffic`** table.

In [0]:
%sql
select * from active_users_by_traffic

traffic_source,active_users
direct,438886
email,281525
facebook,956769
google,1781961
instagram,530050
youtube,253321
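
Because the memory sink table is refreshed on every trigger, re-running the query shows the counts growing. One way to observe this from Python is sketched below. The helper `poll_table` is hypothetical and not part of the lab; it assumes an object with a `sql(...)` method that returns something with `collect()`, as a SparkSession does, and a `traffic_source` column to order by.

```python
import time


def poll_table(spark, table_name, polls=3, interval_s=1.0):
    """Collect a few successive snapshots of a table that a stream keeps updating."""
    snapshots = []
    for _ in range(polls):
        rows = spark.sql(
            f"SELECT * FROM {table_name} ORDER BY traffic_source"
        ).collect()
        # Convert each row to a plain tuple so snapshots are easy to compare
        snapshots.append([tuple(row) for row in rows])
        time.sleep(interval_s)
    return snapshots


# In the lab notebook, usage might look like:
# for snapshot in poll_table(spark, "active_users_by_traffic"):
#     print(snapshot)
```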


**CHECK YOUR WORK**  
Your query should eventually result in the following values.

|traffic_source|active_users|
|---|---|
|direct|438886|
|email|281525|
|facebook|956769|
|google|1781961|
|instagram|530050|
|youtube|253321|

### 6. List and stop all active streams
- Use SparkSession to get list of all active streams
- Iterate over the list and stop each query

In [0]:
# TODO
# ANSWER
for s in spark.streams.active:
    print(s.name)
    s.stop()
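
The loop above can also be wrapped in a small reusable helper. This is a sketch under assumptions rather than part of the lab: the name `stop_all_streams` is hypothetical, and it relies only on `spark.streams.active`, `query.name`, and `query.stop()`. A query started without `queryName` may report `None` as its name.

```python
def stop_all_streams(spark):
    """Stop every active streaming query and return the names that were stopped."""
    stopped = []
    for query in spark.streams.active:
        query.stop()
        stopped.append(query.name)
    return stopped


# In the lab notebook, usage might look like:
# print(stop_all_streams(spark))
```

Returning the names makes it easy to confirm afterwards that `active_users_by_traffic` was among the stopped queries.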

**CHECK YOUR WORK**

In [0]:
assert not trafficQuery.isActive

### Classroom Cleanup
Run the cell below to clean up resources.

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>