# Activity 5.3 - Returning to the Nest

## Processing Dr. Bergen's Eagle Data in `pyspark`

In a previous homework, you performed a data management task for Dr. Bergen, Director of the WSU Statistical Consulting Center.  The associated data can be found in the `data` folder of this repository.  Below, you will find the instructions for the original task.

>    Dr. Bergen had the following to say about the data.
>
>     - One row = one GPS measurement.  
>     - Subsample of 10K GPS points from a couple bald eagles in Iowa. 
>     - **Context.** We need to use the flight characteristics to perform $k$-means clustering of the flight points.  
>
>    Variables to be used for clustering include
>
>    - `KPH` (km per hour; an instantaneous measure of speed; measured by the GPS device);
>    - `Sn` (an average speed, computed from the GPS locations at two time points);
>    - `AGL0` (meters above ground level);
>    - `VerticalRate` (change in AGL between two time points; large negative if descending quickly; large positive if ascending quickly);
>    - `absVR` (absolute value of VerticalRate); and
>    - `abs_angle` (absolute value of turn angle, in radians; larger values equal more “tortuous”, i.e. twisty flight)
>
>    All variables except for `VerticalRate` are skewed and all variables need to be mean-centered and standardized prior to clustering.
>
>    <img src="./img/summary_of_features.png"/>
>
>    Note that the data are
>
>    - *mean-centered* by subtracting the mean of the column from each entry.
>    - *standardized* by dividing each entry by the standard deviation of the column.
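
As a minimal sketch of the two definitions above, the following uses only Python's standard library on a small made-up list; the numbers are hypothetical and are not taken from the eagle data. It assumes the sample standard deviation (the $n - 1$ denominator), which is one common choice.

```python
from statistics import mean, stdev

# Hypothetical values for one column; not taken from the eagle data.
values = [2.0, 4.0, 6.0, 8.0]

col_mean = mean(values)
col_sd = stdev(values)  # sample standard deviation (n - 1 denominator)

# Mean-center: subtract the column mean from each entry.
centered = [v - col_mean for v in values]

# Standardize: divide each centered entry by the column standard deviation.
standardized = [c / col_sd for c in centered]
```

After both steps the column has mean 0 and standard deviation 1, which puts variables measured in different units on a comparable scale before clustering.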

### Tasks

In this activity, you will redo the following tasks in `pyspark` using the STACK-TRANSFORM-UNSTACK trick.

- Read the data into `pyspark` and ensure that the columns have the correct type.  Define a schema as needed.
- Apply a `sqrt` transform to `KPH`, `Sn`, `AGL0`, `absVR`, and `abs_angle`.  
- Mean-center and standardize the transformed variables from above, as well as `VerticalRate`.
- Visualize the transformed features.  
    - Because `pyspark` lacks visualization tools, you should convert the results back to a `pandas.DataFrame` then use a [seaborn multi-plot grid](https://seaborn.pydata.org/tutorial/axis_grids.html) to plot all the variables on the same panel.  **HINT.** To make this work, you will need to stack all of the transformed features.
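
To illustrate the general shape of a stack-transform-unstack workflow, here is a small `pandas` analog on toy data. It is only a sketch of the idea, not the `pyspark` solution the activity asks for: the data values, the `row_id` column, and the `sqrt_cols` list are all hypothetical, and only two of the six variables are shown.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not Dr. Bergen's measurements).
wide = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "KPH": [4.0, 9.0, 16.0, 25.0],
    "VerticalRate": [-1.0, 0.0, 1.0, 2.0],
})
sqrt_cols = ["KPH"]  # hypothetical list of columns that get the sqrt transform

# STACK: one row per (row_id, variable) pair.
stacked = wide.melt(id_vars="row_id", var_name="Variables", value_name="Values")

# TRANSFORM, step 1: take the square root of the skewed variables only.
needs_sqrt = stacked["Variables"].isin(sqrt_cols)
stacked.loc[needs_sqrt, "Values"] = np.sqrt(stacked.loc[needs_sqrt, "Values"])

# TRANSFORM, step 2: mean-center and standardize within each variable.
grouped = stacked.groupby("Variables")["Values"]
stacked["Values"] = (
    (stacked["Values"] - grouped.transform("mean")) / grouped.transform("std")
)

# UNSTACK: back to one column per variable.
tidy = stacked.pivot(index="row_id", columns="Variables", values="Values")
```

The stacked (long) form is also the layout that a seaborn multi-plot grid expects, with one facet per value of `Variables`, so in this sketch the plot would be drawn from `stacked` rather than from `tidy`.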

**Deliverables.** You should keep any code cells you used to test/figure-out the solution, but the end result should be three cells:

1. A cell loading spark and reading in the data frame.
2. A second cell containing all the data management code in one dot chain, along with any other objects used in the pipe.
3. A third cell containing all the code needed to convert the data frame back to pandas and create your visualization.

Note that these three cells should work independently of the rest of your code: If I restart the kernel and run only these cells, everything should work.



In [26]:
# Hint 1.  pyspark includes sqrt, mean, and stddev functions.
from pyspark.sql import SparkSession
from more_pyspark import *
from dfply import *
from more_dfply import *
from composable import pipeable
# Import the pyspark functions last so that the star imports above
# (dfply also exports a `mean`) do not shadow them.
from pyspark.sql.functions import (array, col, collect_list, explode, lit,
                                   mean, sqrt, stddev, struct)

spark = SparkSession.builder.appName('Ops').getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [2]:
# Hint 2.  The Apache Arrow library allows fast conversion of data frames back to pandas.  
!pip install pyarrow

# The `toPandas` method effectively replaces `collect`. 
# Example:
# pandas_df = spark_df.toPandas() # <== uses pyarrow when Arrow conversion is enabled


Collecting pyarrow
  Downloading pyarrow-9.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.3 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0


In [32]:
# Your code here

eagles = spark.read.csv('./data/bald_eagle_subsample.csv', header=True, inferSchema=True, nullValue='-')

eagles.toPandas()

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.00,-0.120000,0.570000,0.120000
2,106,F,Fledgling,7/6/19 7:02,35.42,8.58,13.15,0.490000,2.010000,0.490000
3,106,F,Fledgling,7/6/19 7:02,32.87,9.13,10.88,-0.450000,1.100000,0.450000
4,106,F,Fledgling,7/6/19 7:02,35.37,10.01,7.28,-0.720000,0.370000,0.720000
...,...,...,...,...,...,...,...,...,...,...
9995,106,F,Juvenile,12/27/19 11:33,28.39,7.98,159.85,0.140000,0.120000,0.140000
9996,106,F,Juvenile,12/27/19 11:33,34.15,8.49,154.67,-0.860000,0.470000,0.860000
9997,106,F,Juvenile,12/27/19 11:33,30.15,8.43,152.39,-0.370000,0.960000,0.370000
9998,106,F,Juvenile,12/27/19 11:33,55.43,11.30,142.03,-1.720000,0.050000,1.720000


In [27]:
@pipeable
def gather(var_lbl, val_lbl, cols_to_stack, df):
    make_array = lambda var_name, val_name, cols: (array(*(struct(lit(c).alias(var_name), 
                                                                  col(c).alias(val_name))
                                                           for c in cols)))
    return (df
            .withColumn('var_val_array', 
                        make_array(var_lbl, 
                                   val_lbl, 
                                   cols_to_stack))
            .withColumn("vars_and_vals", explode(col('var_val_array')))
            .withColumn(var_lbl, col("vars_and_vals").getItem(var_lbl))
            .withColumn(val_lbl, col("vars_and_vals").getItem(val_lbl))
            .drop(*(cols_to_stack + ['var_val_array', "vars_and_vals"])))

In [40]:
cols = ['KPH','Sn','AGL0','VerticalRate','abs_angle','absVR']


eagles_stacked = (eagles
                  >> gather('Variables', 'Values',
                            [name for name in cols if name != 'VerticalRate']))

eagles_stacked.toPandas()

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,VerticalRate,Variables,Values
0,105,F,Fledgling,7/4/19 9:01,-0.002167,KPH,32.810000
1,105,F,Fledgling,7/4/19 9:01,-0.002167,Sn,6.890000
2,105,F,Fledgling,7/4/19 9:01,-0.002167,AGL0,0.020000
3,105,F,Fledgling,7/4/19 9:01,-0.002167,abs_angle,0.006277
4,105,F,Fledgling,7/4/19 9:01,-0.002167,absVR,0.002167
...,...,...,...,...,...,...,...
49995,106,F,Juvenile,12/27/19 11:34,0.560000,KPH,45.810000
49996,106,F,Juvenile,12/27/19 11:34,0.560000,Sn,13.520000
49997,106,F,Juvenile,12/27/19 11:34,0.560000,AGL0,145.420000
49998,106,F,Juvenile,12/27/19 11:34,0.560000,abs_angle,0.260000


In [None]:
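import numpy as np

def standardize(X, column):
    # Mean-center, then scale, one column of the pandas data frame X.
    col_mean = np.mean(X[column])
    col_std = np.std(X[column])
    return (X[column] - col_mean) / col_std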