In [None]:
import pandas as pd
import numpy as np

# Multiindex

If you set the index to more than one column, you create a multi-index, also called a hierarchical index. This makes it much easier to ask questions based on the index, and also opens up the possibility of working with multidimensional data. 

We'll use the example sourced from [here](https://chrisalbon.com/python/pandas_hierarchical_data.html). 

In [None]:
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

In [None]:
df_1_ind = df.set_index('regiment')
df_1_ind

In [None]:
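# How do we get the average scores, based on the regiment? 
# Group on the index level; numeric_only=True skips the text columns.
# (The older df.mean(level=...) form was removed in pandas 2.0.)
df_1_ind.groupby(level='regiment').mean(numeric_only=True)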

In [None]:
# What if you want the mean scores by company, regardless of regiment? 

# Set the hierarchical index to be by regiment, and then by company
df_2_ind = df.set_index(['regiment', 'company'])
df_2_ind

<div class="alert alert-block alert-info">
<p>
Having multiple index levels gives you an easy way to model data with more than two dimensions using DataFrames, which are by default two-dimensional data structures. 
</p>
<p>
For the above example, you can imagine each regiment is a two-dimensional array giving details about the company, names and the scores, and they are stacked one below the other. 
</p>
</div>
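
To make the "stacked two-dimensional arrays" picture concrete, here is a minimal sketch on a toy frame. The `score` column and all values are made up for illustration and are not part of the regiment dataset. It shows three common ways of reaching into a two-level index: by outer label, by a full tuple, and by inner label only.

```python
import pandas as pd

# Toy data (hypothetical values, not the regiment dataset above)
toy = pd.DataFrame({
    'regiment': ['Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'score': [10, 20, 30, 40],
}).set_index(['regiment', 'company'])

# Outer level: everything for one regiment (one two-dimensional "slice")
dragoons = toy.loc['Dragoons']

# Both levels: a single value, addressed by a tuple of labels
scouts_2nd_score = toy.loc[('Scouts', '2nd'), 'score']

# Inner level only: one company across all regiments
first_companies = toy.xs('1st', level='company')

print(dragoons)
print(scouts_2nd_score)
print(first_companies)
```

Selecting by the outer label drops that level from the result, which is why `dragoons` is indexed by `company` alone.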

In [None]:
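# Only the last expression in a cell is displayed; run the lines one at a time to compare.
df_2_ind.groupby(level='company').mean(numeric_only=True)
df_2_ind.groupby(level='regiment').mean(numeric_only=True)
df_2_ind.groupby(level=['regiment', 'company']).mean(numeric_only=True)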

# Pandas Aggregation


In [None]:
# We'll be using our college scorecard dataset in this tutorial.
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

### The `describe()` method
The `describe()` method is available on both **`Series`** and **`DataFrame`** objects and outputs a variety of aggregations that are very useful in getting the general "sense" of a dataset.
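
As a small illustration, the sketch below runs `describe()` on a toy frame and compares the default output with `include='all'`. The `state` column name and all values are made up for this example; they are not taken from the scorecard file.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'state': ['CA', 'CA', 'NY'],
    'sat_average': [1200.0, 1100.0, 1300.0],
})

# Default: only numeric columns are summarized
numeric_summary = toy.describe()

# include='all': text columns are added, with rows such as unique/top/freq;
# statistics that do not apply to a column are filled with NaN
full_summary = toy.describe(include='all')

print(numeric_summary)
print(full_summary)
```

In the second table, the `mean` of `state` is `NaN` because an average is not defined for text.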


In [None]:
# You can specify include='all' to force Pandas
# to evaluate all columns.  It will inject NaN where
# a calculation cannot be done.
college_scorecard.describe()
# college_scorecard.describe(include='all')

## Airline Data

We will be using a sample of the flight schedule data that is available on Kaggle [here](https://www.kaggle.com/usdot/flight-delays).

This is only a sample of the original data. You will use the original data in your Group Project!

In [None]:
flights = pd.read_csv('./data/flight_sample.csv')
flights.sample(10)
# flights.describe(include="all")

## Activity


### Selection Without using GroupBy

**NOTE**: The following three questions do not involve any `groupby`.

1. Returning to the `flights` dataframe, extract only the flights operated by American Airlines (AA) using a mask. 
2. What are the median DISTANCE, TAXI_IN and TAXI_OUT times across all flights? 
3. What are the median DISTANCE, TAXI_IN and TAXI_OUT times for United Airlines (UA)? 


In [None]:
# Question 1
# aa_flights = get all flights using masking
# aa_flights

In [None]:
# Question 2


In [None]:
# Question 3


# Pandas Grouping

## The `groupby()` Method

So far, all the calculations that we've done on **`DataFrame`** objects have looked at the values of columns as a whole.

The `groupby()` method allows you to move into deeper forms of analysis by splitting the rows of a dataset into groups according to the values in the specified column(s). You can think of this in some ways as putting rows into buckets for evaluation.
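
The bucket idea can be sketched on a toy frame. The column names mirror the flights data, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the flights data (hypothetical values)
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'UA', 'AA', 'UA', 'AA'],
    'DISTANCE': [100, 200, 300, 400, 500],
})

# Each distinct value of AIRLINE becomes one bucket of rows
for airline, rows in toy.groupby('AIRLINE'):
    print(airline, list(rows['DISTANCE']))

# Number of rows that landed in each bucket
bucket_sizes = toy.groupby('AIRLINE').size()
print(bucket_sizes)
```

Every row lands in exactly one bucket, so the bucket sizes add up to the number of rows in the original frame.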

### Specifying how to Split your Dataset into Groups
Of course, before we can perform evaluations on groups, we have to create them from an existing dataframe. 

Let's explore how **`groupby()`** provides a variety of ways to split up your datasets. We'll explore some of these here, starting with the simplest.

#### Single Column Grouping

In [None]:
flights_by_airline = flights.groupby('AIRLINE')
# print(flights_by_airline.groups)
# flights_by_airline.head( )

The **`groupby()`** method returns an object of type **`DataFrameGroupBy`**. We will explore it in more depth shortly, but for now just know that it has an attribute called **`groups`** which provides a *`dict`* object with the **labels** of each group and the **corresponding index values** in the original dataframe that belong to that group.

If you print the `groups` attribute in the cell above, you can see there is a group labelled 'AA' with index values [2,   19,   43,   55,   59,   64,   71,   74,   82,   92, ...].

You can think of this as a record of all the groups that we will perform calculations on later.
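
Here is a minimal sketch of what `groups` holds, again on a toy frame with made-up values. It also uses `get_group()`, a `DataFrameGroupBy` method that returns the rows of a single group.

```python
import pandas as pd

# Toy stand-in for the flights data (hypothetical values)
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'UA', 'AA', 'UA', 'AA'],
    'DISTANCE': [100, 200, 300, 400, 500],
})
by_airline = toy.groupby('AIRLINE')

# Maps each group label to the index labels of its rows
aa_index = list(by_airline.groups['AA'])
print(aa_index)

# Pulls the rows of one group back out as a regular DataFrame
aa_rows = by_airline.get_group('AA')
print(aa_rows)
```

The index labels stored under `'AA'` are exactly the row labels of `aa_rows`, which is what lets pandas find each group's rows later without copying the data up front.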

#### Multi Column Grouping

You can specify multiple columns if you wish to split your data up in multiple levels:

In [None]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
flights_by_airline_month.groups

### Aggregations after GroupBy

For example, say you want to find the average distance traveled by each airline. You can do that using the following aggregate function:

In [None]:
flights.head()

In [None]:
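flights_by_airline = flights.groupby('AIRLINE')

In [None]:
# numeric_only=True skips text columns, which recent pandas versions refuse to average
flights_by_airline.mean(numeric_only=True)[:10]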

In [None]:
avg_by_airline = flights_by_airline[['DISTANCE', 'TAXI_OUT', 'TAXI_IN']].mean()

**NOTE**: The double [[ ]] used when computing the summary statistics. The outer pair [] indexes into the `DataFrameGroupBy` object; the inner pair [] is a list of all the columns for which you want summary statistics. 
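
The difference between single and double brackets shows up in the type of the result. The sketch below uses a toy frame with made-up values to compare the two.

```python
import pandas as pd

# Toy stand-in for the flights data (hypothetical values)
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'UA', 'AA', 'UA'],
    'DISTANCE': [100, 200, 300, 400],
    'TAXI_OUT': [10, 20, 30, 40],
})
by_airline = toy.groupby('AIRLINE')

one_col = by_airline['DISTANCE'].mean()                 # single brackets: a Series
one_col_df = by_airline[['DISTANCE']].mean()            # double brackets: a one-column DataFrame
two_cols = by_airline[['DISTANCE', 'TAXI_OUT']].mean()  # double brackets: a DataFrame

print(type(one_col).__name__, type(one_col_df).__name__)
print(two_cols)
```

A list inside the brackets always gives a DataFrame back, even when the list names only one column.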

In [None]:
avg_by_airline

## Activity
### Generalizing using GroupBy

4. Instead of doing this for each airline separately, can you do this for all airlines at the same time using `groupby`?
5. Extract the median DISTANCE for SouthWest airlines (WN) and assign it to a variable `median_distance_WN`. 
6. What is the median DISTANCE, TAXI_IN times and TAXI_OUT times per airline per month? Hint: Notice that we want to group by two different columns.
7. Extract the median TAXI_OUT for SouthWest airlines (WN) in December (12) and assign it to a variable `median_distance_WN_12`. 

In [None]:
# Question 4
# get median 'DISTANCE', 'TAXI_IN', 'TAXI_OUT' for all airlines


In [None]:
# Question 5
# Extract the median DISTANCE for SouthWest airlines (WN) using loc[]
# Assign it to a variable median_distance_WN


In [None]:
# Question 6
# What is the median DISTANCE, TAXI_IN times and TAXI_OUT times per airline per month? 


In [None]:
# Question 7: 
# Extract the median TAXI_OUT for SouthWest airlines (WN) in December (12) 
# Assign it to a variable median_distance_WN_12


### Understanding the Aggregation After GroupBy: Method Dispatching

Let us now look at how aggregations on `DataFrameGroupBy` objects work. In a **`DataFrameGroupBy`** object, any method not found on the object itself is forwarded ("**dispatched**") to all the groups that it contains.

That is why we were able to ask for the *`median`* of a **`flights_by_airline`** object above and get something back: it is (1) "dispatching" the *`median`* method call to each group (that is each airline), (2) collecting the results and (3) presenting them to us.
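
Conceptually, those three steps can be written out by hand. The sketch below uses a toy frame with made-up values and compares the dispatched call with an explicit loop. It is only an illustration of the idea, not how pandas implements it internally.

```python
import pandas as pd

# Toy stand-in for the flights data (hypothetical values)
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'UA', 'AA', 'UA', 'AA'],
    'DISTANCE': [100, 200, 300, 600, 500],
})
distance_by_airline = toy.groupby('AIRLINE')['DISTANCE']

# What the single call on the grouped object returns
dispatched = distance_by_airline.median()

# The same three steps by hand:
# (1) call median() on each group, (2) collect the results, (3) present them as a Series
by_hand = pd.Series(
    {airline: values.median() for airline, values in distance_by_airline},
    name='DISTANCE',
)
by_hand.index.name = 'AIRLINE'

print(dispatched)
print(by_hand)
```

Both Series hold one median per airline, which is why the one-line call is all you normally need.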

In [None]:
flights_by_airline = flights.groupby('AIRLINE')

In [None]:
# numeric_only=True skips text columns, for which a median is not defined
flights_by_airline.median(numeric_only=True)

In [None]:
# Compute the median for the entire DataFrameGroupBy object and then select the 'DISTANCE' column 
flights_by_airline.median(numeric_only=True)[['DISTANCE']]

In [None]:
# Select the 'DISTANCE' Column as a dataframe and then compute the median
flights_by_airline[ ['DISTANCE'] ].median()

In [None]:
# Select the 'DISTANCE' column as a SeriesGroupBy object and then compute the median (the result is a Series)
flights_by_airline['DISTANCE'].median()

**Question**: Which of the above methods should be preferred? 

### Methods of `DataFrameGroupBy` Objects
Now we will look at the various operations built into the `DataFrameGroupBy` object type.

#### The `aggregate()` Method
At first, the `aggregate()` method appears to be quite similar to what we just covered when we talked about method dispatching. It performs aggregations on the groups in a **`DataFrameGroupBy`** object.

In [None]:
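# Select numeric columns first: recent pandas versions raise an error
# when asked for the mean of a text column.
flights_by_airline[['DISTANCE', 'TAXI_OUT', 'TAXI_IN']].aggregate('mean')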

The difference is that the **`aggregate()`** method gives you some additional options that are not available if you rely on method dispatching as shown above.

In [None]:
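# You can pass multiple aggregates as a list.
# Here we get various aggregates for a few numeric
# columns of our flights_by_airline object.
# (Recent pandas versions raise an error when 'mean' meets a text column.)
flights_by_airline[['DISTANCE', 'TAXI_OUT', 'TAXI_IN']].aggregate([np.mean, 'min', 'max'])[:5]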

<div class="alert alert-block alert-warning">
<p>
It is important to notice that you are able to pass both strings and functions to the <code>aggregate()</code> method. It is probably best to choose one approach and stick with it rather than mixing and matching like I've done here.
</p>
</div>

In [None]:
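# The same aggregates, this time using one style (strings) throughout
flights_by_airline[['DISTANCE', 'TAXI_OUT', 'TAXI_IN']].aggregate(['mean', 'min', 'max']).head(5)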

Your textbook also talks about using a dict to apply labels to the aggregation columns so that they can have user friendly names like 'Longest Distance' rather than just 'max'.

This sort of functionality was, however, deprecated in Pandas and has been removed from recent versions, where it raises an error.

To accomplish the same thing, we should instead append a `rename()` method after our `aggregate()` method like so:

In [None]:
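# Using `rename()` to apply friendly labels to output columns.
# String names are used so the output labels are reliably 'mean', 'min' and 'max'
# (the labels produced for np.min / np.max differ between library versions).
flights_by_airline[['DISTANCE','TAXI_OUT']].aggregate(
    ['mean', 'min', 'max']).rename(
        columns={'mean': 'Avg. Distance', 
                 'min': 'Shortest Distance', 
                 'max': 'Longest Distance'})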

<div class="alert alert-block alert-danger">
<p>
Note, there are three main things happening in the above statement. 
<ul>
<li> <code>flights_by_airline[['DISTANCE','TAXI_OUT']]</code> selects the DISTANCE and TAXI_OUT columns for analysis</li>
<li> <code>.aggregate(...)</code> computes the average, minimum and maximum of each selected column</li>
<li> Finally, <code>.rename()</code> renames the resulting columns according to the dictionary we have given</li>
</ul>
</p>
</div>
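
As an aside, pandas (0.25 and later) also supports "named aggregation", where each output column is given as `label=(column, aggregation)`. This is offered as an alternative sketch rather than the approach this notebook relies on. The toy values and the output labels (`avg_distance` and so on) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the flights data (hypothetical values)
toy = pd.DataFrame({
    'AIRLINE': ['AA', 'UA', 'AA', 'UA'],
    'DISTANCE': [100, 200, 300, 600],
})

# Each keyword names an output column; the tuple says which
# input column to aggregate and which aggregation to apply
summary = toy.groupby('AIRLINE').agg(
    avg_distance=('DISTANCE', 'mean'),
    shortest_distance=('DISTANCE', 'min'),
    longest_distance=('DISTANCE', 'max'),
)
print(summary)
```

The result has flat, already-labelled columns, so no separate `rename()` step is needed.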

The recommended way of using a **`dict`** with the **`aggregate()`** method is actually to specify which aggregation(s) to perform on what columns. You can use it to specify different aggregation(s) on a per-column basis.

Here I'll use it to get the high/low values for DISTANCE and the mean for TAXI_IN on our *`flights_by_airline_month`* object.

In [None]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])

# Notice how using this style automatically filters
# out all columns you don't specify.
flights_by_airline_month.aggregate(
        {'DISTANCE': ['min', 'max'], 
         'TAXI_IN': 'mean'}
).tail(20)

## Activity: 

We will work again on the `college-loan-default-rates.csv` and `college-scorecard-data-scrubbed.csv` datasets. 

Use the `aggregate()` method to produce:

1. The average, minimum and maximum `full_time_retention_rate_4_year` per state using the `college-scorecard-data-scrubbed.csv` dataset. 
    * After producing the above summary statistics, make sure you rename your columns for average, minimum and maximum as `Avg. Retention`, `Low Retention`, and `High Retention` respectively. 
2. For each state and city, the minimum and maximum of the `sat_average` column and the average of the `full_time_retention_rate_4_year` column. 

3. Which state has the highest average four year retention rate (`full_time_retention_rate_4_year`)? Which has the lowest average? 


In [None]:
# For this tutorial, we will need both of our datasets.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv')

college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

college_scorecard.head()

In [None]:
# Question 1
# The average, minimum and maximum full_time_retention_rate_4_year 
# per state using college-scorecard-data-scrubbed.csv dataset.
# After producing the above summary statistics, 
# make sure you rename your columns for 
# average, minimum and maximum as Avg. Retention, Low Retention, and High Retention respectively.

# scores_by_state = 
# scores_by_state_summary = 


In [None]:
# Question 2
# Produce per state and city, minimum and maximum for the sat_average column 
# and average for the full_time_retention_rate_4_year column

# scores_by_state_city = 
# scores_by_state_city_summary =  

In [None]:
# Question 3
# Return to the summary of state scores from first step...
# Which state has the highest average four year retention rate (full_time_retention_rate_4_year)? 
# Which has the lowest average?



<div class="alert alert-block alert-warning">
<h3> Important Notes</h3>
<p>
When producing summary statistics using group by, you can assign the intermediate results to variables. In the section above, I have mostly displayed the results directly to show them to you. However, you can assign a result to a variable to use it later. <strong>See the example below.</strong>
</p>
</div>

In [None]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
# String names give the output columns predictable labels ('min', 'max', 'mean')
aggregation = {
                'DISTANCE': ['min', 'max'], 
                'TAXI_IN': 'mean'
              }
column_names = {'min': 'Minimum', 'max': 'Maximum', 'mean': 'Average'}

summary_distance_taxi_in = flights_by_airline_month.aggregate(aggregation).rename(columns=column_names)

In [None]:
summary_distance_taxi_in.head()

In [None]:
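# Remember from the last class that we can do aggregations at multiple levels using a hierarchical index. 
# (groupby(level=...) replaces the older mean(level=...) form, which was removed in pandas 2.0.)
summary_distance_taxi_in.groupby(level='AIRLINE').mean()

In [None]:
summary_distance_taxi_in.groupby(level='MONTH').mean()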