# Groupby 

files needed = ('Most-Recent-Cohorts-Scorecard-Elements.csv')

We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have? 

Pandas provides the `groupby( )` method to ease computing statistics by group ([docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as 

> split-apply-combine

Here is the canonical [illustration](https://www.oreilly.com/library/view/learning-pandas/9781783985128/ch09s02.html). The big idea is to 
1. **Split** the data up into groups. The groups are defined by *key* variables.
2. **Apply** some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
3. **Combine** the results of the 'apply' from each group into a new data structure.
  
  
Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here. 
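To make the three steps concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and its column names are assumptions for illustration, not part of the Scorecard file). It performs split, apply, and combine by hand and then compares the result with `groupby( )`.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'c'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: build one sub-DataFrame per distinct key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
means = {k: piece['value'].mean() for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series
by_hand = pd.Series(means, name='value').sort_index()

# groupby() does all three steps for us
with_groupby = toy.groupby('key')['value'].mean()

print(by_hand)
print(with_groupby)
```

The two printed Series should hold the same group means.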

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.precision', 3)   # print 3 decimal places when we display a DataFrame

## College Scorecard
Let's take this opportunity to learn about a new dataset: [The College Scorecard](https://collegescorecard.ed.gov/data/). The data are compiled by the Dept. of Education to help students evaluate higher education institutions. The data are very well documented and include such juicy variables as prices, post-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographic variables. 

We will be working off of the 'most recent data' file. It is in our shared folder, but you can also get it from [here](https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-Scorecard-Elements.csv). 

\[For extra practice, you can try to open the dataset directly from the url, rather than downloading it first.\] 

In [None]:
colscd = pd.read_csv('Most-Recent-Cohorts-Scorecard-Elements.csv')
colscd.head()

This dataset is too big for our needs. Let's rename the variables to something easier to understand and keep just a few variables that look interesting. 

In [None]:
colscd = colscd.rename(columns = {'CONTROL':'ownership', 'INSTNM':'name', 'STABBR':'state', 'PREDDEG':'type', 'SATVRMID':'sat_read_med', 
                      'SATMTMID':'sat_math_med', 'SATWRMID':'sat_write_med', 'PCIP24':'sh_las', 'PCIP51':'sh_bus',
                     'PCIP11':'sh_cs', 'MD_EARN_WNE_P10':'earn_10', 'GRAD_DEBT_MDN_SUPP':'debt_at_grad'})

In [None]:
cols_to_keep = ['name', 'state', 'ownership', 'type','sat_read_med',  'sat_math_med', 'sat_write_med',
                'sh_las', 'sh_bus', 'sh_cs', 'earn_10', 'debt_at_grad']

colscd = colscd[cols_to_keep]

colscd.head()

The ownership and type variables are coded as integers. I would rather they were easy to understand. 

In [None]:
type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)

own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()

Set the index to the university name. 
How do we look?

In [None]:
colscd.set_index('name', inplace=True)
colscd.loc['University of Wisconsin-Madison']

Unless I read the documentation wrong (or made some other mistake), this says UW didn't give out any liberal arts degrees. I doubt that is true...

One last check before we get to work. 

In [None]:
colscd.dtypes

Doh! It looks like the earnings and debt columns came in as objects instead of floats.

The culprit is the 'PrivacySuppressed' flag. We could have told `read_csv` about this if we knew in advance. I found this problem by using `colscd['earn_10'].unique()`.
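For reference, here is a small sketch of how `read_csv` could be told about the flag. It uses the `na_values` parameter on a made-up, in-memory CSV (the school names and numbers are invented; only the 'PrivacySuppressed' flag comes from the real file).

```python
import io
import pandas as pd

# A tiny stand-in for the real file (values are made up)
csv_text = """name,earn_10
School A,40000
School B,PrivacySuppressed
School C,52000
"""

# na_values tells read_csv to treat the flag as missing data
toy = pd.read_csv(io.StringIO(csv_text), na_values=['PrivacySuppressed'])

print(toy.dtypes)   # earn_10 should now be numeric rather than object
print(toy)
```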

Instead, let's practice `to_numeric( )`, which tries to convert a column to numeric values. I pass the parameter `errors='coerce'` to tell the function to set anything it cannot convert to NaN.  


In [None]:
colscd['earn_10'] = pd.to_numeric(colscd['earn_10'], errors='coerce')
colscd['debt_at_grad'] = pd.to_numeric(colscd['debt_at_grad'], errors='coerce')
colscd.dtypes

## 1. Split: groupby( )
We pass groupby a 'key' which tells the method which variable to, well, group by. This is the **split** step.

What is `colscd_grouped`?

In [None]:
colscd_grouped = colscd.groupby('state')
print(type(colscd_grouped))

A DataFrameGroupBy object. This is basically a DataFrame + the grouping information. 

What does it look like? A DataFrameGroupBy is an iterable object. It returns subsets of the original DataFrame by group. In our case, the groups are defined by state. 

The `.get_group()` method returns the DataFrame for a single group. 

In [None]:
colscd_grouped.get_group('WI').sort_index()
#for g in colscd_grouped:
#    print(g)
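Iterating over a grouped object yields one `(key, DataFrame)` pair per group. Here is a small sketch on made-up data (the `toy` DataFrame is an assumption for illustration) that unpacks each pair.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'state': ['WI', 'MN', 'WI', 'IL'],
                    'earn_10': [40.0, 45.0, 50.0, 55.0]})

group_sizes = {}
for key, group in toy.groupby('state'):
    # key is the group label; group is a DataFrame holding that group's rows
    group_sizes[key] = len(group)
    print(key, group.shape)
```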

## 2. + 3. Apply and combine
A major use of groupby is to perform some kind of aggregation. This is the **apply** and **combine** step. Let's take the grouped data and compute some means. 

In [None]:
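# Apply the mean to the numeric columns of each group.
# Recent pandas versions raise an error on non-numeric columns unless numeric_only=True.
all_means = colscd_grouped.mean(numeric_only=True)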

print(type(all_means))             # what do we get back?

In [None]:
# Ah, a DataFrame. We know what to do with that. 
all_means.head(10)

When we used mean() on the grouped data, it **applied** the mean method to each group, which creates one number per group (for each column). It then **combined** the means into a DataFrame, one number per group per column. Nice.  

Notice that the non-numeric columns (ownership and type) have been dropped. The name is the index of the original DataFrame, and state is now the index of the result.

Here we can see the result of the `pd.set_option()` call at the top of the notebook: the output is limited to 3 decimal places. 

## 1. + 2. + 3. Split-apply-combine

Computing the grouped data first helped us understand what was happening, but we can do the whole split-apply-combine in one step. One simple line of code.

In [None]:
all_means = colscd.groupby('state').mean(numeric_only=True)
all_means.head(10)

### Aggregation methods

Some common aggregation methods include `.mean()`, `.sum()`, `.std()`, `.min()`, and `.max()`, but there are many more. Any function that reduces a group to a scalar will work. The `.describe()` method is also handy: it returns several summary statistics per group at once. 

### groupby( ) on a subset of columns
We may not care about all the columns in our dataset for a particular groupby. We can subset our DataFrame as usual and compute a groupby. 

Let's focus on the median SAT scores. We will group by the 'ownership' variable.

In [None]:
# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
sat_medians_1 = colscd[['sat_read_med', 'sat_math_med', 'sat_write_med', 'ownership']].groupby('ownership').median()
sat_medians_1
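An equivalent pattern is to group first and select the columns afterwards, which means the key does not have to be carried along by hand. Here is a sketch on made-up data (the `toy` DataFrame and its column names are assumptions for illustration) that checks the two approaches agree.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'ownership': ['Public', 'Public', 'Private', 'Private'],
                    'sat_read': [500.0, 520.0, 600.0, 640.0],
                    'sat_math': [510.0, 530.0, 610.0, 650.0],
                    'state': ['WI', 'MN', 'WI', 'IL']})

# Subset the columns first (keeping the key), then group
before = toy[['sat_read', 'sat_math', 'ownership']].groupby('ownership').median()

# Group first, then select the columns to aggregate
after = toy.groupby('ownership')[['sat_read', 'sat_math']].median()

print(before.equals(after))
```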

## Practice

1. Create a dataset with only public institutions. Name it `pub`

The `quantile( )` method computes quantiles from the data (e.g., `quantile(0.5)` computes the median, or the 50th percentile).

2. Let's look at a measure of the earnings spread for different **institution types**

    a. Compute the 75th quantile for 'earn_10' for each 'type'.
    
    b. Compute the 50th quantile for 'earn_10' for each 'type'.
    
    c. Compute the 25th quantile for 'earn_10' for each 'type'.
    
You should have three new DataFrames, each containing one of the quantile statistics. 

2d. For each type, compute the difference between the 75th percentile and the 25th percentile and divide it by the median. 

This is sometimes called the *quartile-based coefficient of variation*. It is a measure of the variability of a variable. It is less sensitive to outliers than the coefficient of variation, which is the standard deviation divided by the mean. 
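To see the outlier claim in action, here is a small sketch on a plain Series with no groupby. The helper names `quartile_cv` and `cv` and the numbers are made up for illustration. Replacing the largest value with an extreme one leaves the quartile-based measure unchanged, while the ordinary coefficient of variation jumps.

```python
import pandas as pd

def quartile_cv(x):
    """(75th percentile - 25th percentile) / median."""
    return (x.quantile(0.75) - x.quantile(0.25)) / x.quantile(0.5)

def cv(x):
    """Standard deviation divided by the mean."""
    return x.std() / x.mean()

# Made-up data: the second Series swaps the largest value for an outlier
base = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
with_outlier = pd.Series([10.0, 20.0, 30.0, 40.0, 500.0])

print('quartile-based:', quartile_cv(base), quartile_cv(with_outlier))
print('std / mean:    ', cv(base), cv(with_outlier))
```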

Wow, a lot of dispersion in the grad_only group. Let's practice some more. 



3. How do reading and writing scores correlate?

    a. Compute the median SAT reading and writing scores by **state**. 


3b. Create a scatter plot with the median reading score on the x axis and writing score on the y axis.  

If you finished early try:
1. Adding the 45-degree line to the plot. 
2. Replacing the data marker with the two-letter state abbreviation at each point

### Working with quantiles
We learned how to use `.cut()` and `.qcut()` to create discrete variables or 'bins'. Let's cut the data by the share of business degrees.  


In [None]:
# Make a copy of the original data so we don't change it.
# Plain assignment would only create a second name for the same DataFrame.
colscd_degrees = colscd.copy()

# Cut into 3 bins
colscd_degrees['bus_rank'] = pd.cut(colscd_degrees['sh_bus'], 3, right=False)
colscd_degrees.head(2)

Remember, cut returns a Categorical object. We can use this object as our key variable in a groupby.

In [None]:
earn_bus = colscd_degrees.groupby('bus_rank')['earn_10']

In [None]:
earn_bus.count()
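As a reminder of how the two binning functions differ, here is a sketch on made-up shares (the `shares` Series is an assumption for illustration). `cut` makes bins of equal width, so skewed data can leave some bins nearly empty; `qcut` chooses the edges so each bin holds roughly the same number of observations.

```python
import pandas as pd

# Made-up, skewed data: most values are small, one is large
shares = pd.Series([0.0, 0.05, 0.1, 0.15, 0.2, 0.9])

equal_width = pd.cut(shares, 3, right=False)   # three bins of equal width
equal_count = pd.qcut(shares, 3)               # three bins with similar counts

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```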

### Several statistics at once
Once we have grouped our data, we have been hitting it with methods to compute statistics: mean(), count(),...

We now introduce the `agg( )` method, which lets us compute several statistics at once. You can even pass it a user-defined function. 

In [None]:
# This is the same as earn_bus.count()
earn_bus.agg('count')

In [None]:
# But agg() lets us compute many stats at once
earn_bus.agg(['count', 'mean', 'median', 'std', 'max'])

Schools that focus on business degrees don't seem to offer greater earnings opportunities. 
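Here is a sketch of mixing built-in statistic names with a user-defined function in one `agg( )` call. The `spread` helper and the `toy` data are made up for illustration. The function receives each group's values as a Series and must return a single number.

```python
import pandas as pd

def spread(x):
    """Hypothetical helper: the range of a Series (max minus min)."""
    return x.max() - x.min()

# Made-up data, for illustration only
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1.0, 4.0, 2.0, 6.0, 10.0]})

# Strings name built-in statistics; a function object is called once per group
stats = toy.groupby('group')['value'].agg(['mean', 'max', spread])
print(stats)
```

The result has one column per statistic, and the column for the custom function takes the function's name.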

## Practice
1. Write a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame). Name the function 'avg5'.

The input, name it `x`,  will be a column of a DataFrame. The output is a single number. 

2. Test your function on column 'a' of the DataFrame defined below. The answer should be 8.

```python
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10] })
```

Now return to `colscd`

3. Drop any observation that has 'debt_at_grad' == NaN
4. Compute the mean, median, and avg5 of 'debt_at_grad' by **'ownership'**. Compute them all at once using `.agg()`.

### groupby( ) with many keys
Can we group by several keys? You know we can. Let's compute the medians this time.

In [None]:
# numeric_only=True skips the string columns (e.g., state), which have no median
sat_medians = colscd.groupby(['ownership','type']).median(numeric_only=True)
sat_medians

Now we have a MultiIndexed DataFrame of summary statistics, this time the median. We only care about the SAT scores, so let's select those columns after grouping. 

In [None]:
sat_medians = colscd.groupby(['ownership','type'])[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
sat_medians

The three ownership types all have institutions that predominantly offer bachelor's degrees. Let's grab that set of statistics. 

In [None]:
bach_sat_med = sat_medians.xs('bach', level='type')         # xs() indexes data from a MultiIndex
print(bach_sat_med)
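Here is a sketch of two ways to pull data out of a MultiIndexed result, using made-up data (the `toy` DataFrame is an assumption for illustration). `xs( )` takes a cross-section on one level and drops that level from the result, while a tuple passed to `.loc` picks a single combination of keys.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'ownership': ['Public', 'Public', 'Private', 'Private'],
                    'type': ['bach', 'asc', 'bach', 'asc'],
                    'sat_math': [550.0, 480.0, 600.0, 470.0]})

med = toy.groupby(['ownership', 'type'])[['sat_math']].median()

# xs() on the inner level keeps only 'bach' rows and drops the 'type' level
bach = med.xs('bach', level='type')

# A tuple in .loc picks one (ownership, type) combination
public_bach = med.loc[('Public', 'bach'), 'sat_math']

print(bach)
print(public_bach)
```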

How do the median SAT scores compare across public and private institutions? 

There are a few new plotting tricks here...

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(21,6))

# Set up the color scheme. This makes it easier to fiddle with.
bar_color = 'red'
bar_alpha = 0.35

# Plot one SAT variable on each axes
ax[0].bar(bach_sat_med.index, bach_sat_med['sat_read_med'], color=bar_color, alpha=bar_alpha)
ax[1].bar(bach_sat_med.index, bach_sat_med['sat_math_med'], color=bar_color, alpha=bar_alpha)
ax[2].bar(bach_sat_med.index, bach_sat_med['sat_write_med'],color=bar_color, alpha=bar_alpha)

# Titles!
ax[0].set_title('SAT reading')
ax[1].set_title('SAT math')
ax[2].set_title('SAT writing')

# I am only setting the ylabel on the left-most. Save some non-data ink.
ax[0].set_ylabel('Median score')

# Set these common parameters by looping over the axes.
for a in ax:
    a.spines['top'].set_visible(False)
    a.spines['right'].set_visible(False)
    a.grid(axis='y', color='white')                # Still experimenting with this...
    a.xaxis.set_tick_params(length=0)              # Kill the xaxis ticks
    a.yaxis.set_tick_params(length=0)              # Kill the yaxis ticks
    
plt.show()


Interesting. Private for-profit institutions seem to have about the same writing scores, slightly lower math scores, and substantially lower reading scores. Once we fire up some stats model packages, we can do formal tests to see if they are significantly different.  

## Extra practice

If you want to practice some more, try writing three functions: one returns the 25th percentile, one returns the 50th percentile, and one returns the 75th percentile. 


Then redo question 2 from the first practice, but using only one groupby and the `.agg()` method. 

2. Let's look at a measure of the earnings spread for different institution types
   1. Compute the 75th quantile for 'earn_10' for each 'type'.
   2. Compute the 50th quantile for 'earn_10' for each 'type'.
   3. Compute the 25th quantile for 'earn_10' for each 'type'.

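In the first practice, this gave three new DataFrames, each containing one of the quantile statistics. With a single groupby and `.agg()`, you should instead get one result with a column for each quantile statistic.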


2d. For each type, compute the difference between the 75th percentile and the 25th percentile and divide it by the median. 

This is sometimes called the *quartile-based coefficient of variation*. It is a measure of the variability of a variable. It is less sensitive to outliers than the coefficient of variation, which is the standard deviation divided by the mean. 