# Module 3 Homework

## Processing Dr. Bergen's Eagle Data

Dr. Bergen, Director of the WSU Statistical Consulting Center, has a data processing task for you.  The associated data can be found in the `data` folder of this repository.  

Dr. Bergen had the following to say about the data.

 - One row = one GPS measurement.  
 - Subsample of 10K GPS points from a couple of bald eagles in Iowa. 
 - **Context.** We need to use the flight characteristics to perform $k$-means clustering of the flight points.  
 
Variables to be used for clustering include

- `KPH` (km per hour; an instantaneous measure of speed; measured by the GPS device);
- `Sn` (an average speed, computed from the locations at two time points);
- `AGL0` (meters above ground level);
- `VerticalRate` (change in AGL between two time points; large negative if descending quickly; large positive if ascending quickly);
- `absVR` (absolute value of VerticalRate); and
- `abs_angle` (absolute value of turn angle, in radians; larger values indicate more “tortuous”, i.e. twisty, flight).

All variables except `VerticalRate` are skewed, and all variables need to be mean-centered and standardized prior to clustering.

<img src="./img/summary_of_features.png"/>

Note that data is 

- *mean-centered* by subtracting the mean of the column from each entry.
- *standardized* by dividing each entry by the standard deviation of the column.
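
Applied one after the other, the two steps give the usual z-score, $z_i = (x_i - \bar{x})/s$. The following is a minimal sketch on made-up numbers, not the eagle data. The helper name `standardize` is hypothetical, and it assumes the sample standard deviation (the pandas default, `ddof=1`) is the one intended.

```python
import pandas as pd

def standardize(col: pd.Series) -> pd.Series:
    """Mean-center a column, then divide by its sample standard deviation."""
    centered = col - col.mean()   # mean-centering
    return centered / col.std()   # standardizing (pandas uses ddof=1 by default)

toy_speed = pd.Series([2.0, 4.0, 6.0])  # made-up values: mean 4, standard deviation 2
z = standardize(toy_speed)
print(z.tolist())  # [-1.0, 0.0, 1.0]
```

After this step, every column has mean 0 and standard deviation 1, so no single feature dominates the distances used by $k$-means.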

### Tasks

You need to use the techniques from this module's lectures to perform the following tasks.

- Apply the `sqrt` transform to `KPH`, `Sn`, `AGL0`, `absVR`, and `abs_angle`.
- Mean-center and standardize the transformed variables from above, as well as `VerticalRate`.
- Visualize the transformed features.  Use a [seaborn multi-plot grid](https://seaborn.pydata.org/tutorial/axis_grids.html) to plot all the variables on the same panel.  **HINT.** To make this work, you will need to stack all of the transformed features.

Because you are applying the same transformations multiple times, you will perform the task twice, once for each of the methods covered in Activity 3.2: (A) `dict` unpacking and (B) stack, transform, unstack.

#### Problem 1

First, complete the task using the `dict` unpacking techniques from Lecture 3.5.
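
As a rough illustration of the idea, `dict` unpacking builds all of the new columns in one comprehension and then passes them to a single call as keyword arguments. The sketch below uses plain pandas `assign` as a stand-in for the lecture's approach, which is not shown in this document. The frame and the `sqrt_` prefix are made up for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"KPH": [0.0, 4.0, 16.0], "AGL0": [1.0, 9.0, 25.0]})  # made-up values
columns_to_sqrt = ["KPH", "AGL0"]

# Build {new_column_name: transformed_column} once with a comprehension ...
sqrt_columns = {f"sqrt_{c}": np.sqrt(toy[c]) for c in columns_to_sqrt}

# ... then unpack it into keyword arguments: assign(sqrt_KPH=..., sqrt_AGL0=...).
transformed = toy.assign(**sqrt_columns)
print(transformed)
```

The transformation is written once, and adding another feature only means adding its name to `columns_to_sqrt`.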

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from math import log, e, sqrt
from dfply import *
from more_dfply import ifelse, case_when
%matplotlib inline

# Load the subsample from the repository's data folder (relative path, so the cell runs on any machine).
eagles = pd.read_csv('./data/bald_eagle_subsample.csv')
eagles.head()

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,KPH,Sn,AGL0,VerticalRate,abs_angle,absVR
0,105,F,Fledgling,7/4/19 9:01,32.81,6.89,0.02,-0.002167,0.006277,0.002167
1,105,F,Fledgling,7/4/19 9:01,29.63,7.79,0.0,-0.12,0.57,0.12
2,106,F,Fledgling,7/6/19 7:02,35.42,8.58,13.15,0.49,2.01,0.49
3,106,F,Fledgling,7/6/19 7:02,32.87,9.13,10.88,-0.45,1.1,0.45
4,106,F,Fledgling,7/6/19 7:02,35.37,10.01,7.28,-0.72,0.37,0.72


In [2]:
columns_to_sqrt = ['KPH', 'Sn', 'AGL0', 'absVR', 'abs_angle']
(eagles
 >> gather("Measure", "Value", columns_to_sqrt)  # stack the five skewed features
 >> mutate(MeanSqrt = X.Value.apply(sqrt))       # square-root transform of each value
 >> group_by(X.Measure)
 >> mutate(Centered = X.Value - X.Value.mean())  # centers the raw Value within each Measure
)

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,VerticalRate,Measure,Value,MeanSqrt,Centered
0,105,F,Fledgling,7/4/19 9:01,-0.002167,KPH,32.81,5.728001,-10.725584
1,105,F,Fledgling,7/4/19 9:01,-0.120000,KPH,29.63,5.443345,-13.905584
2,106,F,Fledgling,7/6/19 7:02,0.490000,KPH,35.42,5.951470,-8.115584
3,106,F,Fledgling,7/6/19 7:02,-0.450000,KPH,32.87,5.733236,-10.665584
4,106,F,Fledgling,7/6/19 7:02,-0.720000,KPH,35.37,5.947268,-8.165584
...,...,...,...,...,...,...,...,...,...
49995,106,F,Juvenile,12/27/19 11:33,0.140000,abs_angle,0.12,0.346410,-1.006806
49996,106,F,Juvenile,12/27/19 11:33,-0.860000,abs_angle,0.47,0.685565,-0.656806
49997,106,F,Juvenile,12/27/19 11:33,-0.370000,abs_angle,0.96,0.979796,-0.166806
49998,106,F,Juvenile,12/27/19 11:33,-1.720000,abs_angle,0.05,0.223607,-1.076806


#### Problem 2

Now redo the problem, this time using the $Stack\rightarrow Transform\rightarrow Unstack$ technique. 
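
For orientation, the round trip can be sketched in plain pandas on a made-up frame. This is a sketch under stated assumptions, not the expected solution: `melt`, `groupby(...).transform`, and `pivot` stand in for the course's stacking and unstacking tools, and the `row_id`, `Sqrt`, and `Z` column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"KPH": [1.0, 4.0, 9.0], "Sn": [4.0, 16.0, 36.0]})  # made-up values
measures = ["KPH", "Sn"]

# Stack: keep a row id so each measurement can be matched up again later.
stacked = (toy.rename_axis("row_id")
              .reset_index()
              .melt(id_vars="row_id", value_vars=measures,
                    var_name="Measure", value_name="Value"))

# Transform: one sqrt, then center and scale within each measure.
stacked["Sqrt"] = np.sqrt(stacked["Value"])
by_measure = stacked.groupby("Measure")["Sqrt"]
stacked["Z"] = (stacked["Sqrt"] - by_measure.transform("mean")) / by_measure.transform("std")

# Unstack: back to one column per measure.
wide = stacked.pivot(index="row_id", columns="Measure", values="Z")
print(wide)
```

Each transformation is written once and applied to every measure, and the `row_id` column is what lets the final step rebuild one row per observation.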

In [4]:
measures = ['KPH', 'Sn', 'AGL0', 'absVR', 'abs_angle']
measures_stacked = (eagles
                    >> gather("Measures", "Value", measures)         # stack
                    >> mutate(MeanSqrt = X.Value.apply(sqrt))        # square-root transform
                    >> group_by(X.Measures)
                    >> mutate(Centered = X.Value - X.Value.mean()))  # centers the raw Value within each measure
measures_stacked.head()

Unnamed: 0,Animal_ID,Sex,Age2,LocalTime,VerticalRate,Measures,Value,MeanSqrt,Centered
0,105,F,Fledgling,7/4/19 9:01,-0.002167,KPH,32.81,5.728001,-10.725584
1,105,F,Fledgling,7/4/19 9:01,-0.12,KPH,29.63,5.443345,-13.905584
2,106,F,Fledgling,7/6/19 7:02,0.49,KPH,35.42,5.95147,-8.115584
3,106,F,Fledgling,7/6/19 7:02,-0.45,KPH,32.87,5.733236,-10.665584
4,106,F,Fledgling,7/6/19 7:02,-0.72,KPH,35.37,5.947268,-8.165584


In [91]:
(measures_stacked
>> spread(X.Measures, X.Value)
)

ValueError: Duplicate identifiers
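
A likely reading of this error is that, after stacking, the remaining columns no longer identify a single GPS point, so the unstacking step cannot tell which values belong on the same row. That is an assumption about how `spread` behaves, not something confirmed here. The sketch below reproduces the situation in plain pandas on a reduced toy frame (values taken from the first two rows shown earlier, with most columns dropped) and shows one possible repair using a hypothetical `row_id` column.

```python
import pandas as pd

# Toy stacked data: two GPS points share every identifying value.
stacked = pd.DataFrame({
    "Animal_ID": [105, 105, 105, 105],
    "LocalTime": ["7/4/19 9:01"] * 4,
    "Measure": ["KPH", "KPH", "Sn", "Sn"],
    "Value": [32.81, 29.63, 6.89, 7.79],
})
id_cols = ["Animal_ID", "LocalTime"]

# The identifiers plus the key do not pin down a single row ...
has_duplicates = stacked.duplicated(id_cols + ["Measure"]).any()
print(has_duplicates)

# ... so number the points within each measure to restore uniqueness.
stacked["row_id"] = stacked.groupby("Measure").cumcount()
wide = stacked.pivot(index=id_cols + ["row_id"], columns="Measure", values="Value")
print(wide)
```

Numbering within each measure relies on the rows keeping their original order after stacking. Adding the row id before stacking avoids that assumption.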

**Deliverables.** You should keep any code cells you used to test or figure out the solution, but the end result should be three cells:

1. A cell containing all necessary import statements
2. A second cell containing all the code and data management in one pipe, along with all other objects used in the pipe.
3. A third cell containing all the code needed to create your visualization.

Note that these three cells should work independently of the rest of your code: if I restart the kernel and run only these cells, everything should work.