In [3]:
!pip install more-pyspark

Collecting more-pyspark
  Downloading more_pyspark-0.1.4-py3-none-any.whl (3.5 kB)
Collecting composable>=0.4.0
  Downloading composable-0.4.0-py3-none-any.whl (5.1 kB)
Installing collected packages: composable, more-pyspark
  Attempting uninstall: composable
    Found existing installation: composable 0.2.5
    Uninstalling composable-0.2.5:
      Successfully uninstalled composable-0.2.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
more-dfply 0.2.10 requires composable<0.3.0,>=0.2.5, but you have composable 0.4.0 which is incompatible.
Successfully installed composable-0.4.0 more-pyspark-0.1.4


# Using `reduce` in data management.

There are two common tasks that can be solved using `reduce`.

1. Dot-chaining/piping similar actions.
2. Any many-to-one operation like UNION or JOIN on many files.
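
As a minimal sketch of the idea in plain Python, the accumulator loop and the `reduce` call below compute the same result. It uses the standard library's `functools.reduce` rather than the `composable` version imported later in this notebook, and the `add_sqrt_name` helper and the list of column names are illustrative stand-ins for a real data frame.

```python
from functools import reduce

# Toy stand-in for a table: just a list of column names.
columns = ['KPH', 'Sn']

# Hypothetical helper: returns a NEW list with one extra derived name,
# much like withColumn returns a new DataFrame.
def add_sqrt_name(cols, c):
    return cols + ['sqrt_' + c]

# Accumulator pattern: start with a value and update it once per item.
acc = columns
for c in ['KPH', 'Sn']:
    acc = add_sqrt_name(acc, c)

# The same computation with reduce: (function, items, starting value).
reduced = reduce(add_sqrt_name, ['KPH', 'Sn'], columns)

print(acc == reduced)
print(reduced)
```

The function passed to `reduce` always takes the accumulated value first and the next item second, which is the same shape as the body of the loop.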

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Ops').getOrCreate()

22/11/04 15:01:49 WARN Utils: Your hostname, jt7372wd222 resolves to a loopback address: 127.0.1.1; using 172.30.75.123 instead (on interface eth0)
22/11/04 15:01:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/04 15:01:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Example 1 - Transforming the eagle data

In a previous activity, we had to perform similar transformations on many columns. In `pyspark`, this can be accomplished by chaining many similar `withColumn` calls.

In [2]:
from more_pyspark import to_pandas

eagle = spark.read.csv('./data/bald_eagle_subsample.csv', header=True, inferSchema=True)

eagle.take(2) >> to_pandas

AnalysisException: Path does not exist: file:/home/fahad/module-6-lectures-nameer1811/data/bald_eagle_subsample.csv


#### Applying the `sqrt` transform with many `withColumn` calls

In [37]:
from pyspark.sql.functions import col, sqrt

(eagle
.withColumn('sqrt_KPH', sqrt(col('KPH')))
.withColumn('sqrt_Sn', sqrt(col('Sn')))
.withColumn('sqrt_AGL0', sqrt(col('AGL0')))
.withColumn('sqrt_abs_angle', sqrt(col('abs_angle')))
.withColumn('sqrt_absVR', sqrt(col('absVR')))
).take(2) >> to_pandas

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR,sqrt_KPH,sqrt_Sn,sqrt_AGL0,sqrt_abs_angle,sqrt_absVR
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167,5.728001,2.624881,0.141421,0.079229,0.046548
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.0,-0.12,0.57,0.12,5.443345,2.791057,0.0,0.754983,0.34641


#### Rewritten using the accumulator pattern

In [38]:
from more_pyspark import cols_from
from composable.strict import filter

measurements = eagle.columns >> cols_from('KPH')
# Use `c` as the parameter name so it does not shadow pyspark's `col` function.
sqrt_cols = measurements >> filter(lambda c: c != 'VerticalRate')

df = eagle
for c in sqrt_cols:
    df = df.withColumn('sqrt_' + c, sqrt(col(c)))
df.take(2) >> to_pandas

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR,sqrt_KPH,sqrt_Sn,sqrt_AGL0,sqrt_abs_angle,sqrt_absVR
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167,5.728001,2.624881,0.141421,0.079229,0.046548
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.0,-0.12,0.57,0.12,5.443345,2.791057,0.0,0.754983,0.34641


#### Refactored using `reduce`

In [39]:
from composable.sequence import reduce

add_sqrt = lambda df, c: df.withColumn('sqrt_' + c, sqrt(col(c)))

eagle_w_sqrt = reduce(add_sqrt, sqrt_cols, eagle)

eagle_w_sqrt.take(2) >> to_pandas

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR,sqrt_KPH,sqrt_Sn,sqrt_AGL0,sqrt_abs_angle,sqrt_absVR
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167,5.728001,2.624881,0.141421,0.079229,0.046548
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.0,-0.12,0.57,0.12,5.443345,2.791057,0.0,0.754983,0.34641


## Example 2 - Performing a UNION on more than 2 files.

The other common task solved by `reduce` is combining many files using verbs such as UNION or JOIN. We will illustrate by combining the `./data/uber*.csv` files, which are samples of the [538 Uber TLC FOIL data](https://github.com/fivethirtyeight/uber-tlc-foil-response).

Furthermore, we will illustrate using a pipe to perform the steps.

#### Step 1 - Make `pipeable`/helper functions

#### A pipeable glob

In [6]:
from glob import glob as original_glob
from composable import pipeable

glob = pipeable(original_glob)

('./data/uber*.csv' 
 >> glob
)

['./data/uber-raw-data-jun14-sample.csv',
 './data/uber-raw-data-may14-sample.csv',
 './data/uber-raw-data-aug14-sample.csv',
 './data/uber-raw-data-sep14-sample.csv',
 './data/uber-raw-data-apr14-sample.csv',
 './data/uber-raw-data-jul14-sample.csv']
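
To see why `'./data/uber*.csv' >> glob` works at all, here is a minimal sketch of how a pipeable wrapper could be built with Python's reflected right-shift method. This is an assumption about the general technique, not `composable`'s actual implementation, and the `Pipeable` class and the `shout` and `count` names are hypothetical.

```python
class Pipeable:
    """Wrap a one-argument function so that `value >> wrapped` calls it."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Python falls back to this method for `left >> self` when `left`
        # does not implement `>>` itself (strings and lists do not).
        return self.func(left)

    def __call__(self, *args, **kwargs):
        # Keep the wrapper usable as an ordinary function too.
        return self.func(*args, **kwargs)


shout = Pipeable(str.upper)
count = Pipeable(len)

single = 'uber' >> shout            # same as shout('uber')
chained = 'uber' >> shout >> count  # same as count(shout('uber'))

print(single)
print(chained)
```

Because `>>` is evaluated left to right, each step receives the output of the previous step, which is what makes the pipes in this notebook read top to bottom.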

#### A `read_csv` helper

In [7]:
from more_pyspark import pprint_schema
from uber_schema import uber_datetime_format, uber_schema

read_uber_csv = lambda path: spark.read.csv(path, header=True, schema=uber_schema, timestampFormat=uber_datetime_format)

read_uber_csv('./data/uber-raw-data-jun14-sample.csv').take(2) >> to_pandas

Unnamed: 0,Date/Time,Lat,Lon,Base
0,2014-06-19 16:49:00,40.7568,-73.9701,B02682
1,2014-06-12 21:25:00,40.6463,-73.7768,B02598


In [8]:
read_uber_csv('./data/uber-raw-data-jun14-sample.csv') >> pprint_schema

StructType([StructField('Date/Time', TimestampType(), True),
            StructField('Lat', DoubleType(), True),
            StructField('Lon', DoubleType(), True),
            StructField('Base', StringType(), True)])


In [9]:
from composable.strict import map

uber_dfs = ('./data/uber*.csv' 
             >> glob
             >> map(read_uber_csv)
            )

uber_dfs

[DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string],
 DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string],
 DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string],
 DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string],
 DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string],
 DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string]]

### Brute-force solution

In [10]:
(uber_dfs[0]
 .union(uber_dfs[1])
 .union(uber_dfs[2])
 .union(uber_dfs[3])
 .union(uber_dfs[4])
 .union(uber_dfs[5])
).take(2) >> to_pandas

Unnamed: 0,Date/Time,Lat,Lon,Base
0,2014-06-19 16:49:00,40.7568,-73.9701,B02682
1,2014-06-12 21:25:00,40.6463,-73.7768,B02598


### Using the accumulator pattern

In [11]:
output_df = uber_dfs[0]
for df in uber_dfs[1:]:
    # union returns a new DataFrame, so the result must be reassigned.
    output_df = output_df.union(df)
output_df.take(2) >> to_pandas

Unnamed: 0,Date/Time,Lat,Lon,Base
0,2014-06-19 16:49:00,40.7568,-73.9701,B02682
1,2014-06-12 21:25:00,40.6463,-73.7768,B02598


### Refactored using `reduce`

In [12]:
(uber_dfs
 >> reduce(lambda out_df, df: out_df.union(df))
)

DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string]
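
The call above passes no starting value, so the first data frame plays the role of `uber_dfs[0]` in the accumulator loop. The sketch below illustrates that behavior with the standard library's `functools.reduce`, using lists of made-up rows as stand-ins for data frames and `+` as a stand-in for `.union`. Whether `composable`'s `reduce` treats an empty input the same way is not shown in this notebook, so treat the empty-input part as a statement about `functools.reduce` only.

```python
from functools import reduce

# Made-up toy rows; each inner list stands in for one month's DataFrame.
monthly_rows = [
    [('2014-04-01', 'B02512')],
    [('2014-05-01', 'B02598')],
    [('2014-06-01', 'B02682')],
]

# Stand-in for `out_df.union(df)`: returns a new combined list.
union = lambda out_rows, rows: out_rows + rows

# No starting value: the first element seeds the accumulator.
all_rows = reduce(union, monthly_rows)

# With functools.reduce, an empty input and no starting value is an error,
# which is what would happen if a search pattern matched no files.
try:
    reduce(union, [])
    empty_result = 'no error'
except TypeError:
    empty_result = 'TypeError'

print(len(all_rows))
print(empty_result)
```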

#### Click

In [13]:
('./data/uber*.csv' 
 >> glob
 >> map(read_uber_csv)
 >> reduce(lambda out_df, df: out_df.union(df))
)

DataFrame[Date/Time: timestamp, Lat: double, Lon: double, Base: string]

## <font color="red"> Exercise 1 </font>

Use `reduce` to mean-center and standardize the `sqrt` columns, as well as `VerticalRate`, in the eagle data.
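
For reference, mean-centering and standardizing a column means replacing each value $x$ with $z = (x - \bar{x}) / s$. The sketch below works this out in plain Python on made-up numbers, using the sample standard deviation (dividing by $n - 1$), which is also what Spark's `stddev` computes.

```python
from statistics import mean, stdev

values = [2.0, 4.0, 6.0]  # made-up numbers, not from the eagle data

m = mean(values)   # center of the column
s = stdev(values)  # sample standard deviation (n - 1 in the denominator)

# Subtract the mean, then divide by the standard deviation.
z_scores = [(x - m) / s for x in values]

print(m, s)
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, so columns measured on different scales become directly comparable.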

In [45]:
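# Your code here

from pyspark.sql.window import Window
# Explicit imports avoid a wildcard import shadowing names such as `filter`.
from pyspark.sql.functions import col, mean, stddev
from more_pyspark import cols_from, to_pandas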

w = Window.partitionBy()
col_mean = lambda c: mean(col(c)).over(w)
std_dev = lambda c: stddev(col(c)).over(w)
z_score = lambda c: (col(c) - col_mean(c))/std_dev(c)
z_score_acc = lambda df, c: df.withColumn('z_score_'+c, z_score(c))

z_score_cols = eagle_w_sqrt.columns >> cols_from('sqrt_KPH')
z_score_cols.append('VerticalRate')

eagle_standardized = reduce(z_score_acc, z_score_cols, eagle_w_sqrt)
eagle_standardized.take(2) >> to_pandas


22/11/03 12:37:49 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR,...,sqrt_Sn,sqrt_AGL0,sqrt_abs_angle,sqrt_absVR,z_score_sqrt_KPH,z_score_sqrt_Sn,z_score_sqrt_AGL0,z_score_sqrt_abs_angle,z_score_sqrt_absVR,z_score_VerticalRate
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167,...,2.624881,0.141421,0.079229,0.046548,-0.672224,-0.957445,-1.874878,-1.697594,-2.083444,0.024918
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.0,-0.12,0.57,0.12,...,2.791057,0.0,0.754983,0.34641,-0.920741,-0.712885,-1.895064,-0.356132,-1.394379,-0.057365


## <font color="red"> Exercise 2 </font>

In all of my classes, I use an attendance quiz to track student attendance. In previous semesters, I reused the same quiz each day, and students made one attempt per class session, so the number of attempts a student makes on this quiz represents the number of class sessions that student attended.

In some, but not all, of my courses I also provide practice quizzes that students can use to prepare for actual quizzes and tests.  **In this example, you should ignore these CSV files.** 

In this exercise, you will combine (simulated) attendance data from my (mock) classes into one summary table.

#### Tasks 

The files found in the `./data/attendance_example` sub-folders contain (made-up and random) examples of the D2L files that I use to summarize my attendance quizzes and practice quizzes.

1. Use `glob` to find the path to all *attendance* CSV files.
2. Write a helper function that takes a path and uses regular expressions to extract the class name and the module number, combining and returning both in a single output string.
3. Write a function that takes a path, reads in the corresponding CSV, and adds a `Class/Section` column containing the relevant entry for that table. Be sure to test this on one of the paths found in **1.**.
4. Write a pipe that 
    1. Starts with the `glob` search string.
    2. Uses `glob` to find all paths.
    3. Maps your function from **3.** onto all the paths.
    4. Uses `reduce` to UNION the files into one master data frame.
5. Create a summary table that shows the 10 worst students in terms of attendance.
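
As a starting point for task **2.**, here is a hedged sketch of a regular-expression helper. The notebook does not show the actual file names, so the path layout below (a class folder followed by a file name containing `Module <number>`), the `STAT 101` class name, and the `class_and_module` function name are all hypothetical and will need adjusting to the real files.

```python
import re

# Assumed layout: .../attendance_example/<class name>/<...Module N...>.csv
PATH_PATTERN = re.compile(
    r'attendance_example/(?P<class_name>[^/]+)/.*?Module\s*(?P<module>\d+)'
)

def class_and_module(path):
    """Return '<class name> - Module <n>' extracted from a file path."""
    match = PATH_PATTERN.search(path)
    if match is None:
        # Fail loudly so unexpected file names are noticed early.
        raise ValueError(f'Unrecognized path layout: {path}')
    return f"{match.group('class_name')} - Module {match.group('module')}"

# Hypothetical path, for illustration only.
example_path = './data/attendance_example/STAT 101/Module 3 Attendance Quiz.csv'
label = class_and_module(example_path)

print(label)
```

Named groups keep the two extracted pieces readable, and raising an error on a non-matching path makes it easier to spot files that do not follow the expected naming scheme before they reach the UNION step.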

In [None]:
# Your code here