# Activity by Traffic Lab
Process streaming data to display total active users by traffic source.

##### Objectives
1. Read data stream
2. Get active users by traffic source
3. Execute query with display() and plot results
4. Execute the same streaming query with DataStreamWriter
5. View results being updated in the query table
6. List and stop all active streams

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>

### Setup
Run the cells below to generate data and create the **`schema`** string needed for this lab.

In [0]:
%run ../Includes/Classroom-Setup-5.1c

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"


Validating the locally installed datasets...(3 seconds)

Creating & using the schema "da_lpalum_7163_asp"...(0 seconds)

Predefined tables in "da_lpalum_7163_asp":
  -none-

Predefined paths variables:
  DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
  DA.paths.user_db:     dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks/database.db
  DA.paths.working_dir: dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks
  DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks/_checkpoints
  DA.paths.sales:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03/ecommerce/sales/sales.delta
  DA.paths.users:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-d

### 1. Read data stream
- Set to process 1 file per trigger
- Read from Delta with filepath stored in **`DA.paths.events`**

Assign the resulting streaming DataFrame to **`df`**.

In [0]:
# ANSWER
df = (spark.readStream
           .option("maxFilesPerTrigger", 1)
           .format("delta")
           .load(DA.paths.events))
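
The **`maxFilesPerTrigger`** option caps how many new files the source considers in each micro-batch. As a rough mental model only (this is not how Spark plans batches internally), the plain-Python sketch below splits a list of files into per-trigger groups. The function name and the file names are hypothetical and exist only for illustration.

```python
from typing import Iterator, List


def plan_micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield the files each trigger would pick up, in order (simplified model)."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, for illustration only.
files = ["part-0000.parquet", "part-0001.parquet", "part-0002.parquet"]
batches = list(plan_micro_batches(files, max_files_per_trigger=1))
print(len(batches))  # 3: one micro-batch per file
```

With a limit of 1, every file becomes its own micro-batch, which is why the results in this lab update gradually instead of appearing all at once.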

**1.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_1_1(df)

Points,Test,Result
1,The query is streaming,
1,DataFrame contains all 10 columns,


### 2. Get active users by traffic source
- Set default shuffle partitions to number of cores on your cluster (not required, but runs faster)
- Group by **`traffic_source`**
  - Aggregate the approximate count of distinct users and alias with "active_users"
- Sort by **`traffic_source`**

In [0]:
# ANSWER
from pyspark.sql.functions import approx_count_distinct

spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

traffic_df = (df
              .groupBy("traffic_source")
              .agg(approx_count_distinct("user_id").alias("active_users"))
              .sort("traffic_source")
             )
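
To make the aggregation concrete, here is a plain-Python sketch of what the query computes, run on a handful of made-up events. It counts distinct users exactly with sets, whereas **`approx_count_distinct`** returns an estimate, so Spark's numbers can differ slightly from an exact count. The helper name and sample data are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def active_users_by_source(events: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Return (traffic_source, distinct user count) pairs sorted by source."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for traffic_source, user_id in events:
        users[traffic_source].add(user_id)
    return sorted((source, len(ids)) for source, ids in users.items())


# Made-up events: (traffic_source, user_id)
events = [
    ("google", "u1"), ("google", "u2"), ("google", "u1"),
    ("email", "u3"), ("direct", "u1"),
]
result = active_users_by_source(events)
print(result)  # [('direct', 1), ('email', 1), ('google', 2)]
```

Note that user `u1` appears twice under `google` but is counted once, which is the "distinct" part of the aggregation.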

**2.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_2_1(traffic_df.schema)

Points,Test,Result
1,Schema is of type StructType,
1,Schema contains two fields,
1,"Schema contains ""traffic_source"" of type StringType (nullable=None)",
1,"Schema contains ""active_users"" of type LongType (nullable=None)",


### 3. Execute query with display() and plot results
- Run the **`traffic_df`** query using display()
- Plot the streaming query results as a bar graph

In [0]:
# ANSWER
display(traffic_df)

traffic_source,active_users
direct,186007
email,153110
facebook,375096
google,698092
instagram,252908
youtube,117543


**3.1: CHECK YOUR WORK**
- Your bar chart should plot **`traffic_source`** on the x-axis and **`active_users`** on the y-axis
- The top three traffic sources in descending order should be **`google`**, **`facebook`**, and **`instagram`**.
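
If you prefer to confirm the ranking without reading it off the chart, a small plain-Python check works. The sketch below uses the counts from the display() output shown earlier in this section; your own counts will depend on how many batches have been processed when you look.

```python
# Values copied from the display() output shown above.
active_users = {
    "direct": 186007,
    "email": 153110,
    "facebook": 375096,
    "google": 698092,
    "instagram": 252908,
    "youtube": 117543,
}

# Sort the sources by their user count, largest first, and keep three.
top_three = sorted(active_users, key=active_users.get, reverse=True)[:3]
print(top_three)  # ['google', 'facebook', 'instagram']
```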

### 4. Execute the same streaming query with DataStreamWriter
- Name the query "active_users_by_traffic"
- Set to "memory" format and "complete" output mode
- Set a trigger interval of 1 second

In [0]:
# ANSWER
traffic_query = (traffic_df
                 .writeStream
                 .queryName("active_users_by_traffic")
                 .format("memory")
                 .outputMode("complete")
                 .trigger(processingTime="1 second")
                 .start())

DA.block_until_stream_is_ready("active_users_by_traffic")

Processed 0 of 2 batches...
Processed 0 of 2 batches...
Processed 0 of 2 batches...
Processed 1 of 2 batches...
Processed 2 of 2 batches...
The stream is now active with 2 batches having been processed.
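
In "complete" output mode, the entire aggregated result is written to the sink after every trigger, so the in-memory table always holds the totals seen so far. The sketch below is a toy plain-Python model of that behavior, not Spark's implementation; the class name and events are made up for illustration, and it uses exact sets in place of Spark's approximate count.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Event = Tuple[str, str]  # (traffic_source, user_id)


class CompleteModeTable:
    """Toy stand-in for a memory sink fed by a complete-mode aggregation."""

    def __init__(self) -> None:
        # Running state kept across micro-batches.
        self._state: Dict[str, Set[str]] = defaultdict(set)
        # What a SELECT * on the table would return right now.
        self.rows: List[Tuple[str, int]] = []

    def process_batch(self, batch: List[Event]) -> None:
        for source, user_id in batch:
            self._state[source].add(user_id)
        # Complete mode: replace the whole result, not just the changed rows.
        self.rows = sorted((s, len(u)) for s, u in self._state.items())


table = CompleteModeTable()
table.process_batch([("google", "u1"), ("email", "u2")])
first = list(table.rows)
table.process_batch([("google", "u3"), ("google", "u1")])
second = list(table.rows)
print(first)   # [('email', 1), ('google', 1)]
print(second)  # [('email', 1), ('google', 2)]
```

This is why querying the table at different moments returns different counts: each query sees the full result as of the most recent trigger.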


**4.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_4_1(traffic_query)

Points,Test,Result
1,The query is active,
1,"The query name is ""active_users_by_traffic"".",
1,"The format is ""MemorySink"".",


### 5. View results being updated in the query table
Run a query in a SQL cell to display the results from the **`active_users_by_traffic`** table.

In [0]:
%sql
-- ANSWER
SELECT * FROM active_users_by_traffic

traffic_source,active_users
direct,313780
email,219383
facebook,596595
google,1154893
instagram,390978
youtube,177061


**5.1: CHECK YOUR WORK**

Your query should eventually result in the following values.

|traffic_source|active_users|
|---|---|
|direct|438886|
|email|281525|
|facebook|956769|
|google|1781961|
|instagram|530050|
|youtube|253321|

### 6. List and stop all active streams
- Use the SparkSession to get a list of all active streams
- Iterate over the list and stop each query

In [0]:
# ANSWER
for s in spark.streams.active:
    print(s.name)
    s.stop()

display_query_2
active_users_by_traffic
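
The same loop can be wrapped in a reusable helper. The sketch below is one possible shape, assuming only the **`spark.streams.active`**, **`stop()`**, **`awaitTermination()`**, and **`name`** members of the PySpark streaming API; the function name and the timeout default are hypothetical choices.

```python
from typing import List, Optional


def stop_all_streams(spark, timeout_seconds: int = 30) -> List[Optional[str]]:
    """Stop every active streaming query and return the names that were stopped.

    `spark` is expected to be a SparkSession. Unnamed queries report None.
    """
    stopped = []
    # Take a snapshot of the active queries before stopping any of them.
    for query in list(spark.streams.active):
        query.stop()
        # Returns promptly once the query has terminated.
        query.awaitTermination(timeout_seconds)
        stopped.append(query.name)
    return stopped
```

In a notebook, calling `stop_all_streams(spark)` and then checking that `spark.streams.active` is empty is a quick way to confirm nothing is still running. The stream started by display() shows up in the list too, which is why `display_query_2` appears in the output above.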


**6.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_6_1(traffic_query)

Points,Test,Result
1,The query has been stopped,


### Classroom Cleanup
Run the cell below to clean up resources.

In [0]:
DA.cleanup()

Resetting the learning environment...
...dropping the schema "da_lpalum_7163_asp"...(0 seconds)
...removing the working directory "dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks"...(0 seconds)


Validating the locally installed datasets...(4 seconds)
