In [None]:
import pandas as pd
import numpy as np

# Using apply() and groupby() to create your own groups

Here we will show a simple use of the `apply()` and `groupby()` methods that can be very useful. Note that this is a *far simpler* use of `apply()` than the one shown in the advanced topics of the Lecture 18 notebook. 

Let us say you want to group flights not by the individual 'AIRLINE' but by the alliance each airline belongs to. How can you achieve this? The following lists show which airlines are in each alliance:

Star Alliance:
* UA - United Airlines
* OO - Skywest Airlines

Oneworld Alliance:
* AA - American Airlines
* US - US Airways
* MQ - American Eagle Airlines Inc. 

SkyTeam Alliance:
* DL - Delta Airlines
* EV - Atlantic Southeast Airlines
* VX - Virgin America

NoAlliance; Not in any alliance:

* F9 - Frontier
* B6 - JetBlue
* NK - Spirit
* WN - Southwest
* HA - Hawaiian
* AS - Alaska 


In [None]:
flights = pd.read_csv('./data/flight_sample.csv')
flights.head()

In [None]:
def get_alliance(airline):
    if airline in ['UA','OO']:
        return 'Star'
    elif airline in ['AA', 'US', 'MQ']:
        return 'Oneworld'
    elif airline in ['DL','EV', 'VX']:
        return 'SkyTeam'
    elif airline in ['AS', 'F9', 'B6', 'NK', 'WN', 'HA']:
        return 'NoAlliance'

In [None]:
# Test the function
get_alliance('OO')

In [None]:
# CREATING a new column called 'Alliance' and assigning the alliance based on the function
flights['Alliance'] = flights['AIRLINE'].apply(get_alliance)
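
For a fixed lookup like this one, a dictionary combined with `Series.map()` is a common alternative to `apply()`. The sketch below uses a small made-up Series rather than the flights data, and the names `alliance_lookup` and `sample_airlines` are illustrative only. Codes that are missing from the dictionary become `NaN`, which `fillna()` then replaces with a label of your choice.

```python
import pandas as pd

# Hypothetical lookup table built from the alliance lists above
alliance_lookup = {
    'UA': 'Star', 'OO': 'Star',
    'AA': 'Oneworld', 'US': 'Oneworld', 'MQ': 'Oneworld',
    'DL': 'SkyTeam', 'EV': 'SkyTeam', 'VX': 'SkyTeam',
}

# Made-up airline codes standing in for flights['AIRLINE']
sample_airlines = pd.Series(['UA', 'DL', 'WN', 'MQ'])

# map() looks each code up in the dictionary; codes that are not
# keys become NaN, which we then label as 'NoAlliance'
sample_alliances = sample_airlines.map(alliance_lookup).fillna('NoAlliance')
print(sample_alliances.tolist())
```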

In [None]:
flights.sample(5)

In [None]:
flights_by_alliance = flights.groupby(['Alliance'])

In [None]:
type(flights_by_alliance)

In [None]:
# Selecting several columns from a grouped dataframe requires a list (double brackets)
df_means = flights_by_alliance[['DISTANCE','TAXI_IN']].mean()

In [None]:
df_means

In [None]:
type(df_means)

In [None]:
df_means.loc['Star', 'DISTANCE']

### Detour:  `sort_values()`, a method to sort rows based on a column

In [None]:
df_means

In [None]:
df_means.sort_values(['DISTANCE'], inplace=True, ascending=False)
df_means
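
If you would rather leave the original dataframe untouched, calling `sort_values()` without `inplace=True` returns a sorted copy instead. Here is a minimal sketch on made-up numbers; `toy_means` is a hypothetical stand-in for `df_means`, not the real result.

```python
import pandas as pd

# Made-up group means standing in for df_means
toy_means = pd.DataFrame(
    {'DISTANCE': [800.0, 1200.0, 950.0], 'TAXI_IN': [7.5, 6.0, 8.2]},
    index=['Oneworld', 'Star', 'SkyTeam'])

# Without inplace=True, sort_values() returns a new, sorted DataFrame
sorted_means = toy_means.sort_values('DISTANCE', ascending=False)

print(sorted_means.index.tolist())  # sorted by DISTANCE, largest first
print(toy_means.index.tolist())     # the original row order is unchanged
```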

### Activity:

1. Drop all rows in `college_scorecard_small` that have any missing values

2. Add another column `sat_avg_level` to the `college_scorecard_small` DataFrame. It is assigned the following values based on the values in `sat_average`. **You need to write a function and use ``apply()`` method**
   * Lower_sat
       - sat_average <= 973 
   * Below_avg_sat
       - 973 < sat_average <= 1039
   * Abv_avg_sat
       - 1039 < sat_average <= 1120
   * Higher_sat
       - sat_average> 1120
   
3. Is there a relationship between `sat_avg_level` and `pell_grant_receipents`? How about its relationship with `full_time_retention_rate_4_year`? 
   * Group by `sat_avg_level`, find the average of the other two columns, and interpret the result. 

In [None]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

# I'm extracting only three columns and creating a copy for this analysis. 
college_scorecard_small = college_scorecard[['sat_average', 'pell_grant_receipents','full_time_retention_rate_4_year']].copy()
college_scorecard_small.head()

In [None]:
# drop any rows with missing data and make sure it stays that way


In [None]:
# create a function to categorize the sat_average values


In [None]:
# test function


In [None]:
# add a sat_avg_level column using the function above


In [None]:
# group by our new column

# get the averages of 'pell_grant_receipents','full_time_retention_rate_4_year'
# display the results

# how would you get the 
# * min and max of the sat averages 
# * median pell grant recipients 
# * average retention rate?

# Pivot Tables: Two-dimensional GroupBy

We have seen how the ``GroupBy`` abstraction lets us explore relationships within a dataset.
A *pivot table* is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data.
The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
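
To make the idea concrete before using the real data, here is a minimal sketch on a made-up five-row table (`toy_flights` is hypothetical). One column supplies the row labels, another supplies the column labels, and each cell holds the aggregated value, which is the mean by default.

```python
import pandas as pd

# Made-up data: two days of the week, two alliances
toy_flights = pd.DataFrame({
    'DAY_OF_WEEK': [1, 1, 1, 2, 2],
    'Alliance':    ['Star', 'Star', 'SkyTeam', 'Star', 'SkyTeam'],
    'DISTANCE':    [100, 300, 500, 400, 700],
})

# Rows come from DAY_OF_WEEK, columns from Alliance, and each cell
# holds the mean DISTANCE for that (day, alliance) combination
toy_pvt = toy_flights.pivot_table(values='DISTANCE',
                                  index='DAY_OF_WEEK',
                                  columns='Alliance')
print(toy_pvt)
```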

In [None]:
flights.head()

In [None]:
# index is the row grouping and columns is the column grouping, 
# the first parameter is the one that is aggregated

flight_pvt = flights.pivot_table(
        'DISTANCE',
        index='DAY_OF_WEEK', 
        columns = 'Alliance'
        )

flight_pvt

### Modifying the default behavior with `aggfunc` keyword argument

In [None]:
flight_pvt = flights.pivot_table('DISTANCE',index='DAY_OF_WEEK', 
                                 columns = 'Alliance', 
                                 aggfunc = 'sum')
flight_pvt

### You can get the totals using `margins` keyword argument

In [None]:
flight_pvt = flights.pivot_table('DISTANCE',index='DAY_OF_WEEK', 
                                 columns = 'Alliance', 
                                 aggfunc = 'sum', 
                                 margins=True)
flight_pvt

In [None]:
# get the total for the second day of the week for 'NoAlliance'
flight_pvt.loc[2, 'NoAlliance']

In [None]:
flights.groupby(['DAY_OF_WEEK','Alliance'])[['DISTANCE']].sum()
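
The two-key `groupby` above holds the same numbers as the pivot table, just in "long" form. As a rough sketch on made-up data (`toy_flights` is hypothetical), `unstack()` moves the inner index level into the columns and reproduces the pivot-table layout:

```python
import pandas as pd

# Made-up data: two days of the week, two alliances
toy_flights = pd.DataFrame({
    'DAY_OF_WEEK': [1, 1, 1, 2, 2],
    'Alliance':    ['Star', 'Star', 'SkyTeam', 'Star', 'SkyTeam'],
    'DISTANCE':    [100, 300, 500, 400, 700],
})

# A two-key groupby gives a Series with a two-level index ...
grouped = toy_flights.groupby(['DAY_OF_WEEK', 'Alliance'])['DISTANCE'].sum()

# ... and unstack() moves the inner level (Alliance) into the columns
wide = grouped.unstack()

# The same table built directly as a pivot table
pvt = toy_flights.pivot_table(values='DISTANCE', index='DAY_OF_WEEK',
                              columns='Alliance', aggfunc='sum')

print(wide)
print(pvt)
```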

## Activity: Birthrate Data

As a more interesting example, let's take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC).
This data can be found at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
(this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, [this blog post](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/)):

1. Create a column called `decade` in the births_df dataframe loaded below. 
    * Use the column called `year` to create the `decade`. For example, if you were born in 1969 it should say your decade is 1960
2. Create a pivot table that totals the number of `births` in each decade, split by whether they were male or female

In [None]:
births_df = pd.read_csv('./data/births.csv')

In [None]:
births_df.sample(4)

In [None]:
(1969//10)*10

In [None]:
# add a 'decade' column and use the formula (births_df['year'] // 10 ) * 10


In [None]:
# Create a pivot table that counts number of births in each decade and also based on whether they were male or female


<div class="alert alert-block alert-danger">
<h3> CAUTION AHEAD </h3>
<p> </p>
<p> The topics discussed ahead are advanced and you need to absolutely make sure you understand everything discussed in the previous classes to move forward.</p>
</div>

### Advanced Topics: filter() and transform()

These methods give you a lot more flexibility with `DataFrameGroupBy` objects and are discussed below. They are advanced topics; however, I **strongly encourage** you to read through them, because you can use them to find very interesting patterns in the data. 

In [None]:
# Starter code for the Advanced Topics, you will need to run this before you use them further. 
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv')

college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')



#### The `filter()` Method

You can use the `filter()` method to generate a new dataframe after filtering out groups that don't pass a given criterion. It allows you to answer questions like this: *which rows in the college scorecard belong to states whose average SAT score (for the state) is at least 1150?*

To use this method, you must pass in a function that takes a single parameter, which is the group to evaluate. The function must return `True` or `False` depending on whether the *rows of the group* should be kept or discarded in the new dataframe.

So, with this in mind, let's define a `size_filter` function so that groups with average SAT scores of less than 1150 are dropped from consideration.

In [None]:
colleges_by_state = college_scorecard.groupby(['state'])

In [None]:
def size_filter(group):
    # Keep a group only if its mean SAT score is at least 1150
    return group['sat_average'].mean() >= 1150

And now let's use it to see which rows remain in the new dataframe after applying the filter:

In [None]:
college_scorecard[['state','city', 'sat_average']].sample(10)

In [None]:
# this doesn't work because the dataframe is not grouped:
# DataFrame.filter() selects row/column labels, it does not evaluate groups
# college_scorecard[['state','city', 'sat_average']].filter(size_filter)

In [None]:
# Just to reduce the complexity here, I'm only going to display the
# `institution_name`, `state`, `city`, and `sat_average` fields 
filter_results = colleges_by_state[['institution_name','state','city', 'sat_average']].filter(size_filter)
filter_results

There are a couple of ***really*** important things to notice here:
1. Unlike the **`aggregate`** method, the data returned here is not grouped by state as you probably expected it to be. The filter is used on a grouped dataframe, but it returns a new "normal" dataframe.
2. Notice that we have a bunch of rows for Washington DC and Rhode Island, but nothing else. If we've done things correctly, this would mean that the colleges in those two states have average SAT scores of at least 1150. 
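
Both points can be checked on a tiny made-up table. In this sketch the state codes, scores, and the name `high_sat_state` are all hypothetical. The result is an ordinary dataframe, and a group is kept or dropped as a whole, so the college scoring 1200 in state 'BB' disappears because its state's mean is below the cutoff.

```python
import pandas as pd

# Made-up colleges in three made-up states
toy_colleges = pd.DataFrame({
    'state':       ['AA', 'AA', 'BB', 'BB', 'CC'],
    'sat_average': [1200, 1180, 1000, 1200, 1160],
})

def high_sat_state(group):
    # Keep the whole group only if its mean SAT score is at least 1150
    return group['sat_average'].mean() >= 1150

# State means: AA = 1190 (kept), BB = 1100 (dropped), CC = 1160 (kept)
kept = toy_colleges.groupby('state').filter(high_sat_state)
print(type(kept))
print(kept)
```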

#### The `transform()` Method

You use the **`transform()`** method to generate a new dataframe that modifies/transforms the values of the grouped dataframe's columns.

That probably just confused the heck out of you. So we will start with a practical example.

Let's say that we wanted to center the data for the *`year_1_default_rate`* column of our **`college_loan_defaults_by_state`** grouped dataframe, that is, subtract each state's mean from every value in that state. (The same approach would work for *`year_2_default_rate`*.) 

Let's step through how we could do that with **`transform()`**.

Just like with the **`filter()`** method, we have to create a function that we will pass to the **`transform`** method, but this time the function will evaluate each series (column) of each group, rather than the groups as a whole.
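
Before applying this to the real data, here is a minimal sketch on four made-up rows (`toy_rates` and `center` are illustrative names only). The function receives one group's series at a time, and the result has one value per original row, in the original order.

```python
import pandas as pd

# Made-up default rates for two made-up states
toy_rates = pd.DataFrame({
    'state': ['AA', 'AA', 'BB', 'BB'],
    'year_1_default_rate': [10.0, 20.0, 5.0, 7.0],
})

def center(series):
    # Called once per group: subtract that group's own mean
    return series - series.mean()

# State means are AA = 15 and BB = 6, so each value is shifted
# by the mean of its own state
centered = toy_rates.groupby('state')['year_1_default_rate'].transform(center)
print(centered.tolist())
```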

In [None]:
# Just extracting three columns for this analysis
college_loan_defaults_subset = college_loan_defaults[['name', 'state', 'year_1_default_rate']]
college_loan_defaults_subset.head()

In [None]:
college_loan_defaults_by_state = college_loan_defaults_subset.groupby('state')

In [None]:
#the mean for each state
college_loan_defaults_by_state['year_1_default_rate'].mean()[:5]

In [None]:
# This function will be called on each 
# series of each group in your DataFrameGroupBy object
def center_default_rate(series):
    return series - series.mean()

In [None]:
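# Select the numeric column explicitly: recent pandas versions may raise an
# error instead of dropping string columns such as 'name'.
# We'll also use the rename() method to apply a friendlier column name.
transformed_default_rates = college_loan_defaults_by_state[['year_1_default_rate']].transform(
    center_default_rate).rename(
        columns={'year_1_default_rate': 'centered_year_1_default_rate'})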

transformed_default_rates.head()

<div class="alert alert-block alert-info">
<p>
The dataframe behind `college_loan_defaults_by_state` included three columns: name, state, and year_1_default_rate.
</p> 
<p>But here in the returned dataframe we only have `centered_year_1_default_rate`. The `state` column is the grouping key, so it is not transformed, and `name` holds strings, for which you can't calculate a mean.
</p>
<p>
Older versions of Pandas silently dropped such columns from the dataframe returned by the `transform` method. Newer versions may raise an error instead, so it is safest to select the numeric columns before calling `transform`.
</p>
</div>

So now we have our centered rates in a new dataframe. Let's merge together the result of our **`transform`** method and our *`college_loan_defaults_subset`* dataframe. 

In [None]:
# Make sure to specify the indices as the join keys: after the rename the two
# dataframes share no column names, so a default merge has nothing to join on.
pd.merge(college_loan_defaults_subset, transformed_default_rates, 
         left_index=True, 
         right_index=True)[:5]
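
Because the output of `transform()` keeps the index of the original rows, a merge is not the only option. As a sketch on made-up data (`toy_rates` is hypothetical), the centered values can also be attached directly as a new column, and pandas lines them up by index:

```python
import pandas as pd

# Made-up colleges in two made-up states
toy_rates = pd.DataFrame({
    'name':  ['College A', 'College B', 'College C'],
    'state': ['AA', 'AA', 'BB'],
    'year_1_default_rate': [10.0, 20.0, 5.0],
})

# Center each rate on its own state's mean
centered = (toy_rates.groupby('state')['year_1_default_rate']
            .transform(lambda s: s - s.mean()))

# assign() returns a copy with the extra column, matched by index
with_centered = toy_rates.assign(centered_year_1_default_rate=centered)
print(with_centered)
```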