# Lab 8: Spark Streaming For Log Processing

This lab is a simple exercise in log processing. The log files come from several servers and cover different time periods.
Each record in a log file has the form ```serverID,severity,timestamp```, where
- `serverID` is a string unique to the server
- `severity` is 2 (`SEV2`: no error, just a service call), 1 (`SEV1`: a minor error), or 0 (`SEV0`: a fatal/severe error)
- `timestamp` is an integer starting at 1 (bigger numbers mean later)
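As a quick illustration of the record format above, here is a minimal plain-Python sketch that parses one record into typed fields. The `LogRecord` type, the `parse_record` helper, and the sample line are hypothetical names and data introduced only for this illustration; they are not part of the lab files.

```python
from typing import NamedTuple

# Severity codes as described above.
SEVERITY_NAMES = {0: "SEV0", 1: "SEV1", 2: "SEV2"}


class LogRecord(NamedTuple):
    server_id: str
    severity: int
    timestamp: int


def parse_record(line: str) -> LogRecord:
    """Parse one 'serverID,severity,timestamp' line (hypothetical helper)."""
    server_id, severity, timestamp = line.strip().split(",")
    record = LogRecord(server_id, int(severity), int(timestamp))
    if record.severity not in SEVERITY_NAMES:
        raise ValueError(f"unknown severity: {record.severity}")
    return record


example = parse_record("s1,2,3")  # made-up sample record
print(example, SEVERITY_NAMES[example.severity])
```

In the lab itself the parsing is done by Spark through the schema you define, so this sketch only shows which type each field is expected to have.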

For this lab, the four log files (on Canvas and Teams) will be "delivered" by being placed in an S3 bucket, for example `s3://spark-bucket-week9/LogDataLive/`.
There are two servers in the log files, `s1` and `s2`, and the log records range from `t1` to `t10`.  
Each file holds the records of one server for five consecutive time units. For example, the file `s115.csv` has the records for server `s1` from `t1` to `t5`.
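The file names appear to follow the pattern server ID, first time unit, last time unit, so `s115.csv` reads as `s1`, `1`, `5`. Below is a small sketch of that naming convention, assuming it holds for all four files; the helper name `log_file_name` is hypothetical.

```python
def log_file_name(server_id: str, first_t: int, last_t: int) -> str:
    """Build a log file name such as 's115.csv' (assumed naming pattern)."""
    return f"{server_id}{first_t}{last_t}.csv"


# The four files implied by two servers and two five-unit time windows.
file_names = [
    log_file_name(server, first_t, last_t)
    for first_t, last_t in [(1, 5), (6, 10)]
    for server in ["s1", "s2"]
]
print(file_names)
```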

You want to process these new records incrementally, and are interested in these two "reports":

1. The *volume report*: for each server, the number of `SEV2` events divided by the number of time units. For our purposes, the number of time units is `max(timestamp) - min(timestamp) + 1`. This report is not cumulative, i.e., every time new log data comes in, the existing mapping from each server to its `SEV2` volume is updated.
2. The *SEV0 log*: this is a sequence of records of the form ```serverID timestamp``` recording the timestamp of a `SEV0` event reported by a server. This report grows over time, i.e., each time a new log file is processed, new records are appended to the end.
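To make the two report definitions concrete, here is a plain-Python sketch (not Spark code) that computes both on a handful of made-up records. It assumes that `min` and `max` are taken over all of a server's records rather than only its `SEV2` records, which the description does not state explicitly; the function names and the sample data are hypothetical.

```python
from collections import defaultdict

# Made-up records as (serverID, severity, timestamp); not taken from the lab files.
records = [
    ("s1", 2, 1), ("s1", 2, 2), ("s1", 0, 3), ("s1", 2, 5),
    ("s2", 1, 1), ("s2", 2, 2), ("s2", 0, 2), ("s2", 2, 4),
]


def volume_report(records):
    """SEV2 count per server divided by max(timestamp) - min(timestamp) + 1.

    Assumption: min and max are taken over all of a server's records,
    not only its SEV2 records.
    """
    sev2_counts = defaultdict(int)
    timestamps = defaultdict(list)
    for server_id, severity, timestamp in records:
        timestamps[server_id].append(timestamp)
        if severity == 2:
            sev2_counts[server_id] += 1
    return {
        server_id: sev2_counts[server_id] / (max(ts) - min(ts) + 1)
        for server_id, ts in timestamps.items()
    }


def sev0_log(records):
    """(serverID, timestamp) pairs for SEV0 events, by timestamp, then server ID."""
    rows = [(s, t) for s, sev, t in records if sev == 0]
    return sorted(rows, key=lambda row: (row[1], row[0]))


print(volume_report(records))
print(sev0_log(records))
```

With this sample data, `s1` has three `SEV2` events over five time units and `s2` has two over four, so the numbers can be checked by hand against the formula.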

Your final results should be produced by two streaming queries:
1. One that *modifies* the `SEV2` volume report, which is stored in memory
2. One that *appends* to the `SEV0` log report, which is stored as a CSV file in your S3 bucket
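The difference between a result that is modified and one that is appended to can be sketched in plain Python (again, not Spark code). One report lives in a dictionary that is overwritten after every batch, the other is written row by row to CSV text. The batches, the `process_batch` function, and the use of `io.StringIO` as a stand-in for a CSV file in S3 are all assumptions made for illustration, as is the choice to compute the volume over all data seen so far.

```python
import csv
import io

volume_table = {}         # in-memory report, overwritten after every batch
sev0_csv = io.StringIO()  # stand-in for the CSV output in S3
sev0_writer = csv.writer(sev0_csv, lineterminator="\n")

# Running state per server: [sev2_count, min_timestamp, max_timestamp]
state = {}


def process_batch(batch):
    """Fold one batch of (serverID, severity, timestamp) records into both reports."""
    for server_id, severity, timestamp in batch:
        entry = state.setdefault(server_id, [0, timestamp, timestamp])
        if severity == 2:
            entry[0] += 1
        entry[1] = min(entry[1], timestamp)
        entry[2] = max(entry[2], timestamp)
        if severity == 0:
            # Appended in arrival order; earlier rows are never rewritten.
            sev0_writer.writerow([server_id, timestamp])
    # Assumption: the volume is recomputed over all data seen so far.
    for server_id, (count, first_t, last_t) in state.items():
        volume_table[server_id] = count / (last_t - first_t + 1)


# Made-up batches for a single server.
process_batch([("s1", 2, 1), ("s1", 0, 2), ("s1", 2, 5)])
first = dict(volume_table)
process_batch([("s1", 2, 6), ("s1", 2, 7), ("s1", 0, 9), ("s1", 2, 10)])

print(first, volume_table)
print(sev0_csv.getvalue())
```

After the second batch the dictionary still has one entry per server with a new value, while the CSV text has gained a row.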

### Submission
Submit two separate files (not a zip file) for this lab:
1. A retrospective report in a file `retrospective.pdf`: a reflection on the assignment, with the following components   
    a. Your name   
    b. How much time you spent on the assignment   
    c. Were there aspects of the assignment that were particularly challenging? Particularly confusing?     
    d. What were the main learning takeaways from this lab – that is, did it introduce particular concepts or techniques that might help you as an analyst or engineer in the future?   
2. This notebook file `lab8-YOURNAME.ipynb`   
    a. Make sure the output is also saved when you save and download your notebook  
    b. Make sure your results are copied to the last four cells of this notebook  

In [None]:
# Make sure (1) you have uploaded the 4 log files to your S3 folder 'LogData'
#           (2) you have created an empty S3 folder 'LogDataLive' (for simulating log streams)
#           (3) you have created an empty S3 folder 'Lab9Output' (for saving your results)
#           (4) you have replaced the following URIs with yours
s3_log_data_uri      = #'s3://spark-bucket-week9/LogData/'
s3_log_data_live_uri = #'s3://spark-bucket-week9/LogDataLive/'
s3_lab9_output_uri   = #'s3://spark-bucket-week9/Lab9Output/'

In [None]:
print('Did you check the comment above?')

In [None]:
# Create the schema for the log files based on the above description of the data 
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

logSchema = ??

In [None]:
# Create the streaming DataFrame (readStream) on your log directory, using the schema you just created
streamingLogData = spark.readStream.??

## Part 1: Get the SEV2 volume report

In [None]:
# Use the data frame you just created to create another data frame with the 
# sev2 volume report.  It should have columns 'serverID' and 'avgVolume'

from pyspark.sql.functions import ??  # import the functions you need

volumeReportDataFrame = streamingLogData.??

In [None]:
# Create and start a query (writeStream) that generates the sev2 report; it should write to an in-memory sink.
volumeReportQuery = volumeReportDataFrame.??

In [None]:
# Write a (very simple) Spark SQL query to show the contents of the in-memory table produced by your query. It should initially be empty.
spark.sql??

In [None]:
## DO NOT EDIT THIS CELL

# Helper functions for moving files in S3

import boto

# delete all the files in a folder
def empty_s3_folder(bucket_name = 'spark-bucket-week9', folder_path = 'LogDataLive/'):
    # establish a connection to S3
    conn = boto.connect_s3(host='s3.amazonaws.com')
    s3_bucket = conn.get_bucket(bucket_name)

    # iterate through the objects in the folder
    for key in s3_bucket.list(prefix=folder_path):
        if not str(key).endswith('/>'):
            key.delete()
        
    print(f'All files in the {folder_path} folder are removed')
    

# copy file from one folder to live folder; simulating live data stream
def copy_s3_file_to_live_folder(log_file, bucket_name='spark-bucket-week9'):
    # establish a connection to S3
    conn = boto.connect_s3(host='s3.amazonaws.com')
    
    # remember to pass in your bucket name
    my_bucket = conn.get_bucket(bucket_name) 
    # make sure you have these folders
    src_folder = 'LogData/'
    dst_folder = 'LogDataLive/'
    
    # copy from one folder to another folder of the same bucket
    my_bucket.copy_key(dst_folder + log_file, bucket_name, src_folder  + log_file)
    print(f'Copied {log_file} to {dst_folder}')


In [None]:
# First make sure the "live data" folder is empty

In [None]:
# Copy the two log files covering t1 to t5 (one per server) into your 'LogDataLive' folder
copy_s3_file_to_live_folder(??)
copy_s3_file_to_live_folder(??)

In [None]:
# Rerun the query to show that the sev2 volume report has been updated (wait a while)
spark.sql??

In [None]:
# Copy the two log files covering t6 to t10 (one per server) into your 'LogDataLive' folder
copy_s3_file_to_live_folder(??)
copy_s3_file_to_live_folder(??)

In [None]:
# Run the query again to verify that the report was updated. 
# Be sure to wait for a little while to make sure the query is updated.
spark.sql??

## Part 2: Get the SEV0 log report

In [None]:
# Delete all files from your "live" directory before working on this part
empty_s3_folder()

In [None]:
# Create a data frame for the sev0 report on top of your original data frame that holds the raw data.
# It contains just <serverID> <timestamp>, ordered by timestamp
# and by server ID within each timestamp.

sev0 = streamingLogData.??

In [None]:
# Create a query on your sev0 data frame that writes the table to a CSV file in your S3 bucket
sev0SaveQuery = sev0.??

In [None]:
# Copy the two log files covering t1 to t5 (one per server) into your 'LogDataLive' folder
copy_s3_file_to_live_folder(??)
copy_s3_file_to_live_folder(??)

In [None]:
# The above log stream will cause our streaming job to write some results to our S3 bucket

In [None]:
# Copy the two log files covering t6 to t10 (one per server) into your 'LogDataLive' folder
copy_s3_file_to_live_folder(??)
copy_s3_file_to_live_folder(??)

In [None]:
# The above log stream will *again* cause our streaming job to write some results to our S3 bucket

In [None]:
# Now you're done with the lab
# clean up / stop all running streaming jobs


## Put Your Results Here
The two parts above should produce four groups of numbers. To make grading easier, copy your results into the following cells, even though we will also run your notebook.

**Important Note**: Make sure your notebook can be executed from beginning to end without error. You should check that before you hand it in. Simply putting results into a non-working notebook will not be considered a valid submission.

In Part 1, after you streamed the first two log files, `s115.csv` and `s215.csv`, what is the `SEV2` volume reported for each server? Put your output in this Markdown cell and wrap it in a `<pre>` tag.

In Part 1, after you streamed the last two log files, `s1610.csv` and `s2610.csv`, what is the `SEV2` volume reported for each server? Put your output in this Markdown cell and wrap it in a `<pre>` tag.

In Part 2, after you streamed the first two log files, `s115.csv` and `s215.csv`, what is the current `SEV0` log? Put your output in this Markdown cell and wrap it in a `<pre>` tag.

In Part 2, after you streamed the last two log files, `s1610.csv` and `s2610.csv`, what is the current `SEV0` log? Put your output in this Markdown cell and wrap it in a `<pre>` tag.