# Final Exam Part 2

## Background

In previous semesters, I used a single attendance quiz to track attendance in each course. Students took multiple attempts of the same quiz, one at the start of each class. Consequently, the number of attempts a student made on this quiz represents the number of class sessions that student attended.

In some, but not all, of my courses I also provide practice quizzes that students can use to prepare for actual quizzes and tests. These quizzes pull questions randomly from a bank of questions, allow students unlimited attempts, and do not count toward the student's grade.

For this part of the exam, you will collect simulated data from mock classes into one table, then create a summary table.

## Assessment Overview

When evaluating your work, here is what I will generally be looking for at each grade level.

* **A-level work.** Data is processed using the functional tools from `composable` in 1-2 pipes with quality names. Complex code and expressions are refactored.
* **B-level work.** Data is processed using list comprehensions and the `reduce` function (if necessary). All code uses quality names.
* **C-level work.** Data is processed using `for` loops with quality names.
* **D-level work.** Tasks are completed, but rely on a brute-force approach.
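
To make the difference between the loop, comprehension, and `reduce` styles concrete, here is a minimal sketch that combines several per-course tables in each style. It is only an illustration: the tables are hypothetical lists of row dictionaries standing in for real data, the user names are made up, and the `composable` pipe style is not shown.

```python
from functools import reduce

# Hypothetical stand-ins for per-course tables: each is a list of row dicts.
course_tables = [
    [{"UserName": "a1"}, {"UserName": "b2"}],
    [{"UserName": "c3"}],
    [{"UserName": "d4"}, {"UserName": "e5"}],
]

# Loop style: accumulate rows with explicit for loops.
combined_loop = []
for table in course_tables:
    for row in table:
        combined_loop.append(row)

# Comprehension style: the same flattening in one expression.
combined_comprehension = [row for table in course_tables for row in table]

# Reduce style: fold the list of tables into one table, two at a time.
combined_reduce = reduce(lambda left, right: left + right, course_tables)
```

All three produce the same combined table. With Spark data frames, the same fold would presumably use `left.union(right)` in place of `left + right`.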

## File structure

The files in the `attendance_example` folder contain (made-up and random) examples of the D2L files that I use to summarize my attendance quizzes and practice quizzes. Note that there is important information that you need to extract from the file path.
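
As an illustration of pulling that information out of a path, the sketch below parses a course folder name such as `stat491s1` into its program, course, and section. It assumes every course folder follows the pattern letters, digits, then `s` plus digits, and that the export file sits directly inside the course folder; the helper name `parse_course_folder` is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed folder-name shape: letters (program), digits (course), "s" + digits (section).
COURSE_FOLDER = re.compile(r"^(?P<Program>[a-z]+)(?P<Course>\d+)(?P<Section>s\d+)$")

def parse_course_folder(path):
    """Return Program, Course, and Section from the folder that holds a D2L export."""
    folder = PurePosixPath(path).parent.name
    match = COURSE_FOLDER.match(folder)
    if match is None:
        raise ValueError(f"Unexpected course folder name: {folder!r}")
    return match.groupdict()

example = parse_course_folder(
    "./attendance_example/stat491s1/Attendance Quiz - User Attempts.csv"
)
```

Raising on an unexpected folder name is a design choice for the sketch: it surfaces a badly named folder immediately instead of silently producing empty columns.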

<img src="./img/attendance_example_tree.png" width ="600">

### Task 1 - Combine the attendance data

<img src="./img/tracking_attendance.png" width="600">

Your first task is to combine the attendance data for all courses into a table with the following columns.
* `Program` (e.g., `stat`)
* `Course` (e.g., `491`)
* `Section` (e.g., `s1`)
* `UserName`
* `FirstName`
* `LastName`
* `TotalAttendance` (number of classes attended)
* `OutOf` (total possible number of classes, assume this is the same as the max)
* `PercentAttendance`
    
Write this table to a parquet file that is partitioned by program and course number.
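
The arithmetic behind the last three columns can be sketched in plain Python before translating it to Spark. The rows below are hypothetical (the second user name is made up), and the sketch assumes that `OutOf` is computed per section as the largest attendance count in that section.

```python
from collections import Counter

# Hypothetical attempt rows for one course section: one row per quiz attempt.
attempts = [
    {"UserName": "au9747cp", "Attempt #": 1},
    {"UserName": "au9747cp", "Attempt #": 2},
    {"UserName": "au9747cp", "Attempt #": 3},
    {"UserName": "au9747cp", "Attempt #": 4},
    {"UserName": "zz0000zz", "Attempt #": 1},  # made-up second student
    {"UserName": "zz0000zz", "Attempt #": 2},
]

# TotalAttendance: the number of attempts each student made.
total_attendance = Counter(row["UserName"] for row in attempts)

# OutOf: assumed equal to the largest attendance count in the section.
out_of = max(total_attendance.values())

summary = [
    {
        "UserName": user,
        "TotalAttendance": total,
        "OutOf": out_of,
        "PercentAttendance": 100 * total / out_of,
    }
    for user, total in sorted(total_attendance.items())
]
```

Here the first student attended 4 of 4 sessions and the second attended 2 of 4, so the percentages are 100 and 50.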

In [2]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.executor.memory", '8g')
         .config("spark.driver.memory", '8g')
         .appName('Ops')
         .getOrCreate())


In [91]:
from composable.strict import map, filter, sorted
from composable.sequence import reduce, to_list
from composable.glob import glob
from more_pyspark import to_pandas
import re
from pyspark.sql.functions import lit, col, sum, count

In [79]:
attendance_paths = [
    './attendance_example/dsci494s7/Attendance Quiz - User Attempts.csv',
    './attendance_example/stat180s18/Attendance Quiz - User Attempts.csv',
    './attendance_example/stat491s1/Attendance Quiz - User Attempts.csv',
]


dsci494s7 = spark.read.csv(attendance_paths[0], sep=',', header = True)

stat180s18 = spark.read.csv(attendance_paths[1], sep=',', header = True)

stat491s1 = spark.read.csv(attendance_paths[2], sep=',', header = True)

In [83]:
dsci494s7_added_columns = (dsci494s7
                           .withColumn("Program", lit("dsci"))
                           .withColumn("Course", lit("494"))
                           .withColumn("Section", lit("s7"))
)
stat180s18_added_columns = (stat180s18
                           .withColumn("Program", lit("stat"))
                           .withColumn("Course", lit("180"))
                           .withColumn("Section", lit("s18"))
)
stat491s1_added_columns = (stat491s1
                           .withColumn("Program", lit("stat"))
                           .withColumn("Course", lit("491"))
                           .withColumn("Section", lit("s1"))
)

In [85]:
union_data_frame = (dsci494s7_added_columns
.union(stat180s18_added_columns)
.union(stat491s1_added_columns)
)

In [92]:
(union_data_frame
.groupBy(col("UserName"))
# Alias the aggregated column, not the column passed into count
.agg(count(col("UserName")).alias("TotalAttendance")))

DataFrame[UserName: string, TotalAttendance: bigint]

In [86]:
union_data_frame.take(2) >> to_pandas

| Unnamed: 0 | Org Defined ID | UserName | FirstName | LastName | Attempt # | Score | Out Of | Attempt_Start | Attempt_End | Percent | Program | Course | Section |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14460432 | au9747cp | Jericho | Greer | 1 | 1 | 1 | 2019-01-14 14:00:00 | 2019-01-14 14:06:00 | 100 % | dsci | 494 | s7 |
| 1 | 14460432 | au9747cp | Jericho | Greer | 2 | 1 | 1 | 2019-01-16 14:00:00 | 2019-01-16 14:08:00 | 100 % | dsci | 494 | s7 |


### Task 2 - Combine the practice quiz data

<img src="./img/tracking_practice_attempts.png" width="600">

Some of the classes in `attendance_example.zip` include information about attempts on practice quizzes for four modules. We want to create a table for each class that summarizes the practice quiz attempts. This table should contain the following columns:
* `Program` (e.g., `stat`)
* `Course` (e.g., `491`)
* `Section` (e.g., `s1`)
* `UserName`
* `FirstName`
* `LastName`
* `Module 1 Attempts`, 
* `Module 2 Attempts`, 
* `Module 3 Attempts`, 
* `Module 4 Attempts`, and 
* `Total Attempts`.  

Note that, for example, `Module 1 Attempts` contains the total number of attempts each student made on the corresponding quiz and `Total Attempts` contains the total number of attempts on all four quizzes.
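
The reshaping described above, from one row per attempt to one row per student with a column per module, can be sketched in plain Python as follows. The row shape, the `Module` field, and the second user name are assumptions made for illustration, and the sketch treats a module with no attempts as 0.

```python
from collections import Counter, defaultdict

MODULES = [1, 2, 3, 4]

# Hypothetical attempt rows: one row per practice-quiz attempt, tagged with its module.
attempts = [
    {"UserName": "au9747cp", "Module": 1},
    {"UserName": "au9747cp", "Module": 1},
    {"UserName": "au9747cp", "Module": 3},
    {"UserName": "zz0000zz", "Module": 2},  # made-up second student
]

# Count attempts per (student, module).
counts = defaultdict(Counter)
for row in attempts:
    counts[row["UserName"]][row["Module"]] += 1

def attempts_row(user, module_counts):
    """Build one wide row; modules with no attempts count as 0."""
    row = {"UserName": user}
    for module in MODULES:
        row[f"Module {module} Attempts"] = module_counts[module]
    row["Total Attempts"] = sum(module_counts[module] for module in MODULES)
    return row

wide_table = [
    attempts_row(user, module_counts)
    for user, module_counts in sorted(counts.items())
]
```

Whether a module with no attempts should appear as 0 or as a missing value is a judgment call; the sketch picks 0 so that `Total Attempts` is always the sum of the four module columns.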

Write this table to a parquet file that is partitioned by program and course number.
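
In PySpark, a partitioned write is typically done with `DataFrame.write.partitionBy("Program", "Course").parquet(path)`, which places each combination of partition values in its own `column=value` directory. The sketch below only computes the directory names that such a write would be expected to produce; the output name `practice_attempts.parquet` and the helper `partition_directory` are hypothetical.

```python
from pathlib import PurePosixPath

def partition_directory(root, row, partition_columns=("Program", "Course")):
    """Return the Hive-style directory that would hold `row` under `root`."""
    path = PurePosixPath(root)
    for column in partition_columns:
        path = path / f"{column}={row[column]}"
    return str(path)

rows = [
    {"Program": "stat", "Course": "491", "Section": "s1"},
    {"Program": "stat", "Course": "180", "Section": "s18"},
    {"Program": "dsci", "Course": "494", "Section": "s7"},
]

# One directory per distinct (Program, Course) pair.
directories = sorted(
    {partition_directory("practice_attempts.parquet", row) for row in rows}
)
```

Because the partition values live in the directory names, Spark may infer their types when the data is read back, so `Course` could come back as a number rather than a string.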

In [1]:
# Your code here

### Task 3 - Summarize attendance and practice quiz attempts

Finally, you need to create an overall summary of the attendance and practice quiz attempts. This table will have one row per course and include the following summaries.

* `Program` (e.g., `stat`)
* `Course` (e.g., `491`)
* `Section` (e.g., `s1`)
* `Min(Attendance)`
* `Mean(Attendance)`
* `Max(Attendance)`
* `Mean(Module 1 Attempts)`, 
* `Mean(Module 2 Attempts)`, 
* `Mean(Module 3 Attempts)`, 
* `Mean(Module 4 Attempts)`, and 
* `Mean(Total Attempts)`.  

Write this table to a parquet file that is partitioned by program and course number.
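
As a rough sketch of the per-course summaries, the plain-Python example below computes a few of the listed statistics for one section. The per-student rows and their values are hypothetical, only one of the attempt columns is shown, and the sketch assumes the attendance and practice quiz tables have already been joined per student.

```python
from statistics import mean

# Hypothetical per-student rows for one section, after joining both tables.
students = [
    {"TotalAttendance": 28, "Total Attempts": 5},
    {"TotalAttendance": 30, "Total Attempts": 9},
    {"TotalAttendance": 26, "Total Attempts": 1},
]

def summarize_section(rows):
    """Return min/mean/max attendance and the mean total attempts for one section."""
    attendance = [row["TotalAttendance"] for row in rows]
    return {
        "Min(Attendance)": min(attendance),
        "Mean(Attendance)": mean(attendance),
        "Max(Attendance)": max(attendance),
        "Mean(Total Attempts)": mean(row["Total Attempts"] for row in rows),
    }

section_summary = summarize_section(students)
```

Since only some courses have practice quizzes, the real join has to decide how to represent a course with no practice data; the sketch sidesteps that by using a section that has both.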

In [1]:
# Your code here

## Deliverables

1. You should commit and push your work in this notebook along with each of the files from the last three tasks.  
2. Submit a Word document on D2L containing a link to your repository.