#### Streaming Log Processing

This is a simple exercise in log processing.  The log files come from various servers at various time points.
Each record in a log file is of the form

```serverID,severity,timestamp```

* serverID is a string unique to the server
* severity is one of 2 (no error, just a service call), 1 (minor error), or 0 (fatal/severe error)
* timestamp is an integer starting at 1 (bigger numbers mean later)
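
To make the record layout concrete, here is a minimal plain-Python sketch that parses one record into typed fields. The names `LogRecord` and `parse_record`, and the sample line `s1,2,3`, are made up for illustration and are not part of the exercise data.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    server_id: str
    severity: int   # 2 = service call, 1 = minor error, 0 = fatal/severe error
    timestamp: int  # integer starting at 1; bigger numbers mean later


def parse_record(line: str) -> LogRecord:
    """Split one 'serverID,severity,timestamp' line into typed fields."""
    server_id, severity, timestamp = line.strip().split(",")
    return LogRecord(server_id, int(severity), int(timestamp))


# Hypothetical sample line, not taken from the real log files.
record = parse_record("s1,2,3")
print(record)
```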

For this exercise these log files will be "delivered" by being placed in a directory, for example ```/FileStore/tables/logdata-live```.
The log files for this small example cover two servers, s1 and s2, with log records for times 1 through 10.
Each file holds the records for one server over five time units.  For example, the file s115.csv has records for server s1 for times 1 through 5.

You want to process these new records incrementally, and you are interested in two "reports":

* The *volume report* gives, for each server, the number of SEV2 events divided by the number of time units, where the number of time units is (max(timestamp) - min(timestamp)) + 1.  This report is not append-only -- every time new log data comes in, the mapping from server to SEV2 volume is updated in place.
* The *sev0 log* -- this is a sequence of records of the form ```serverID timestamp``` recording a SEV0 event reported by the server.  This report grows over time -- each time a new log file is processed, new records are appended to the end.
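
As a batch-mode illustration of what the two reports contain, the following plain-Python sketch computes both from a handful of made-up records. It assumes that max(timestamp) and min(timestamp) are taken over all of a server's records, not only its SEV2 records; the exercise text does not say which, so treat that as an assumption. The function names are hypothetical.

```python
from collections import defaultdict

# Made-up (serverID, severity, timestamp) records, not the real log data.
records = [
    ("s1", 2, 1), ("s1", 2, 2), ("s1", 0, 3), ("s1", 2, 5),
    ("s2", 1, 1), ("s2", 2, 4), ("s2", 0, 5),
]


def volume_report(rows):
    """Map each server to (#SEV2 events) / ((max ts - min ts) + 1)."""
    sev2_counts = defaultdict(int)
    timestamps = defaultdict(list)
    for server_id, severity, timestamp in rows:
        # Assumption: the time span uses every record seen for the server.
        timestamps[server_id].append(timestamp)
        if severity == 2:
            sev2_counts[server_id] += 1
    return {
        server_id: sev2_counts[server_id] / (max(ts) - min(ts) + 1)
        for server_id, ts in timestamps.items()
    }


def sev0_log(rows):
    """Keep only (serverID, timestamp) for SEV0 events, in arrival order."""
    return [(server_id, timestamp)
            for server_id, severity, timestamp in rows
            if severity == 0]


print(volume_report(records))
print(sev0_log(records))
```

Recomputing `volume_report` over all records seen so far replaces the earlier mapping, whereas new `sev0_log` entries would simply be added to the end of the existing list.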

Your final result should be two streaming queries:
* One that *modifies* the volume report, which is stored in memory
* One that *appends to* the sev0 log, which is stored as a Parquet file
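
One possible shape for these two queries in PySpark Structured Streaming is sketched below. It is not the reference solution. It assumes an existing `SparkSession` is passed in as `spark`, that the CSV files have no header row, and that the time span is computed over all of a server's records. The view name `logs`, the query name `sev2_volume`, and all directory arguments are placeholders chosen for illustration.

```python
# Sketch only: every name below (view, query name, directories) is a placeholder.
LOG_SCHEMA = "serverID STRING, severity INT, timestamp INT"  # DDL-style schema string


def read_logs(spark, live_dir):
    """Streaming DataFrame over CSV files as they appear in live_dir."""
    return spark.readStream.schema(LOG_SCHEMA).csv(live_dir)


def sev2_volume(spark, logs):
    """Per-server SEV2 count divided by (max(timestamp) - min(timestamp)) + 1."""
    logs.createOrReplaceTempView("logs")
    return spark.sql("""
        SELECT serverID,
               SUM(CASE WHEN severity = 2 THEN 1 ELSE 0 END)
                   / (MAX(timestamp) - MIN(timestamp) + 1) AS avgVolume
        FROM logs
        GROUP BY serverID
    """)


def start_volume_query(volume_df):
    """In-memory sink; 'complete' mode rewrites the whole table on each trigger."""
    return (volume_df.writeStream
            .format("memory")
            .queryName("sev2_volume")
            .outputMode("complete")
            .start())


def sev0_events(logs):
    """Keep only serverID and timestamp for SEV0 records."""
    return logs.where("severity = 0").select("serverID", "timestamp")


def start_sev0_query(sev0_df, output_dir, checkpoint_dir):
    """Parquet file sink; 'append' mode only adds new rows to the output."""
    return (sev0_df.writeStream
            .format("parquet")
            .option("path", output_dir)
            .option("checkpointLocation", checkpoint_dir)
            .outputMode("append")
            .start())
```

With the memory sink, the query name doubles as a table name, so under these assumptions the volume report could be inspected with `spark.sql("SELECT * FROM sev2_volume").show()`. The file sink needs a checkpoint location so that it can track which input files have already been written out.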

In [None]:
# Verify that there are four log files in your staging directory
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Create the directory that will hold your "live streaming files"
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
#  Create the schema for the log files


In [None]:
# Create the streaming DataFrame (readStream) on your log directory, using the schema you just created


In [None]:
# Use the data frame you just created to create another data frame with the 
# sev2 volume report.  It should have columns 'serverID' and 'avgVolume'


In [None]:
# Create and start a query (writeStream) that generates the sev2 report, using an in-memory sink.


In [None]:
# Write a (very simple) Spark SQL query to show the contents of the in-memory table.  It should initially be empty


In [None]:
# Copy two files into your 'live data' directory, one per server, covering times 1 through 5
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Rerun the query to show that the sev2 volume report has been updated


In [None]:
# Now copy the log files for times 6 to 10
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Run the query again to verify that the report was updated. Wait a little while first
# so the streaming query has time to pick up the new files.


#### The SEV0 log

In [None]:
# Delete all files from your "live" directory
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Create a data frame on top of your original data frame that holds the raw data.
# This data frame for the sev0 report contains just <serverID> <timestamp>


In [None]:
# Create a query on your sev0 data frame that writes the table to a Parquet file,
# appending new records to the file
# https://stackoverflow.com/questions/55859868/pyspark-structured-streaming-write-to-parquet-in-batches


In [None]:
# Display the query content by reading the parquet file (it should be empty)


In [None]:
# Copy in the files for timestamp 1 through 5
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Display the query again by reading the parquet file.  Are there new records?


In [None]:
# Copy in the files for timestamp 6 through 10
# If using EMR notebook, do this at the AWS console and leave this cell blank

In [None]:
# Display the query again by reading the parquet file.  Are there new records?


In [None]:
# Be tidy, stop all your streaming queries!


In [None]:
# Verify that there are no active streams
