# Filter and Transform with Groupby

All of the groupby chapters thus far have focused on aggregation, which is the most common operation to perform. However, there are many more calculations we can perform on our groups besides returning a single value. In this chapter, we cover the groupby `filter` method, which keeps or drops entire groups from a DataFrame and is similar to boolean selection. We'll also cover the groupby `transform` method, which performs an operation on each group and returns a Series or DataFrame the same length as the original.


## The groupby `filter` method

The groupby `filter` method does boolean selection for entire groups. The entire group is kept or rejected as a whole. A DataFrame with the same number of columns is returned. An example with a small fake dataset can help us learn how it works.

In [3]:
import pandas as pd
item = ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D']
quantity = [2, 10, 3, 7, 6, 5, 2, 10, 12]
data = {'item': item, 'quantity': quantity}
df = pd.DataFrame(data)
df

Unnamed: 0,item,quantity
0,A,2
1,A,10
2,B,3
3,B,7
4,B,6
5,C,5
6,C,2
7,D,10
8,D,12


### Review boolean selection

Before we filter by group, let's review boolean selection from earlier in the book. With boolean selection, we create a boolean Series (usually by using one of the comparison operators) and then pass this filter to *just the brackets*.  Here, we select all the rows with quantity greater than 4.

In [4]:
filt = df['quantity'] > 4
df[filt]

Unnamed: 0,item,quantity
1,A,10
3,B,7
4,B,6
5,C,5
7,D,10
8,D,12


### Filter by group total

Instead of filtering by each individual row, we can filter entire groups. Let's say we want to keep the groups with a total quantity greater than 15. We could start by finding the total quantity using a basic groupby aggregation.

In [5]:
total = (df.groupby('item')
           .agg(total_quantity=('quantity', 'sum'))
           .reset_index())
total

Unnamed: 0,item,total_quantity
0,A,12
1,B,16
2,C,7
3,D,22


We can use normal boolean selection to filter this aggregated DataFrame down to just the items that meet our criteria.

In [6]:
filt = total['total_quantity'] > 15
total[filt]

Unnamed: 0,item,total_quantity
1,B,16
3,D,22


Let's get just the items that meet these criteria as a Series.

In [7]:
items = total.loc[filt, 'item']
items

1    B
3    D
Name: item, dtype: object

From here we can use the `isin` method on our original DataFrame to get the desired result.

In [8]:
filt2 = df['item'].isin(items)
df[filt2]

Unnamed: 0,item,quantity
2,B,3
3,B,7
4,B,6
7,D,10
8,D,12


### Shortcut with the groupby `filter`

The groupby `filter` method handles this procedure in a more direct manner. It is a somewhat complicated method so it will take some time to understand. You first must create a function that returns a single boolean value. pandas will implicitly pass this function a DataFrame consisting of just the rows of the current group.

Take a look at the `find_total` function below. It gets called once per group. It receives the current group as a DataFrame and assigns it to the variable `sub_df`. You can call any normal DataFrame methods on `sub_df`. Here, we select the quantity column and sum it. We then compare this sum against 15 and return a boolean.

In [1]:
def find_total(sub_df):
    return sub_df['quantity'].sum() > 15

We pass this function to the groupby `filter` method to complete the selection.

In [9]:
df.groupby('item').filter(find_total)

Unnamed: 0,item,quantity
2,B,3
3,B,7
4,B,6
7,D,10
8,D,12
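
The cutoff of 15 does not have to be hard-coded inside the function. The groupby `filter` method forwards any extra keyword arguments to the custom function, so the threshold can be supplied at call time. The following is a minimal sketch of that idea; the helper name `total_above` and its `threshold` parameter are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

def total_above(sub_df, threshold):
    # hypothetical helper: True when the group's total quantity exceeds threshold
    return sub_df['quantity'].sum() > threshold

# extra keyword arguments given to filter are passed along to total_above
kept_15 = df.groupby('item').filter(total_above, threshold=15)
kept_10 = df.groupby('item').filter(total_above, threshold=10)
print(kept_15)
print(kept_10)
```

With a threshold of 15 only groups B and D should remain, while a threshold of 10 should also keep group A.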


### Viewing each "Sub-DataFrame"

The variable name `sub_df` was chosen to signify that the object being passed to `find_total` was indeed a DataFrame. Let's print out each sub-DataFrame during each call to `find_total` to inspect what is happening.

In [10]:
def find_total2(sub_df):
    print(sub_df, end='\n\n')
    return sub_df['quantity'].sum() > 15

This function will be called four times, once for each group, and print out the current sub-DataFrame and then return a boolean.

In [11]:
df.groupby('item').filter(find_total2)

  item  quantity
0    A         2
1    A        10

  item  quantity
2    B         3
3    B         7
4    B         6

  item  quantity
5    C         5
6    C         2

  item  quantity
7    D        10
8    D        12



Unnamed: 0,item,quantity
2,B,3
3,B,7
4,B,6
7,D,10
8,D,12
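
Another way to inspect the sub-DataFrames, without writing a custom function at all, is to iterate over the groupby object directly. Each iteration yields the group label and the DataFrame of that group's rows. This is a small sketch on the same fake dataset; the `group_sizes` dictionary is just an illustrative way to record what was seen.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

group_sizes = {}
for name, sub_df in df.groupby('item'):
    # name is the group label; sub_df holds only the rows of that group
    group_sizes[name] = len(sub_df)
    print(name, sub_df['quantity'].tolist())
```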


### Getting a nicer display

Instead of printing to the screen, we can use the `display_html` function from the `IPython.display` module to get the same HTML output that we are accustomed to. This can be quite helpful when debugging. Below, a decorator function is created that outputs the styled DataFrame HTML to the screen inside a div element using the CSS flexbox layout (displays the DataFrames horizontally). pandas adds a `name` attribute to each sub-DataFrame that stores the current group, which is used as a caption for the DataFrame output.

In [None]:
from IPython import display
def display_wrapper(func):
    def wrapper(sub_df, data=None, width=900, margin=50, max_ct=8, max_rows=10):
        """
        Parameters
        ----------
        sub_df: sub-DataFrame of group passed from pandas groupby
        
        data: dictionary holding the html string and the current group number 
                 {'html': '', 'ct': 0}
                 
        width: pixel width of output area
        
        margin: pixels between DataFrames
        
        max_ct: the maximum number of DataFrames to output to the screen

        max_rows: the maximum number of rows to render for each DataFrame
        """
        if data['ct'] < max_ct:
            data['ct'] += 1
            caption = f'Group {sub_df.name}'
            if isinstance(sub_df, pd.Series):
                sub_df = sub_df.to_frame()
            df_styled = sub_df.style.set_caption(caption)
            data['html'] += df_styled.to_html(max_rows=max_rows)
            style = f'style="width:{width}px; display:flex; flex-wrap:wrap; gap:30px"'
            final_html = f'<div {style}>{data["html"]}</div>'
            display.clear_output()
            display.display_html(final_html, raw=True)
        return func(sub_df)
    return wrapper
    
@display_wrapper
def find_total3(sub_df):
    return sub_df['quantity'].sum() > 15

When we call the `filter` method now, we pass it a dictionary that will continue collecting the HTML of each sub-DataFrame as a string and the count of the group, which is limited by `max_ct`.

In [None]:
df.groupby('item').filter(find_total3, data={'html': '', 'ct': 0})

### Using an anonymous function

If the custom function can be written in a single line, you may use an anonymous function. The same sub-DataFrame is passed to it as above.

In [12]:
df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)

Unnamed: 0,item,quantity
2,B,3
3,B,7
4,B,6
7,D,10
8,D,12


### Summary of the groupby `filter` method

* Must write a custom function
* The custom function implicitly gets passed a DataFrame of just that group
* The custom function must return a single boolean value
* Each group is either kept or dropped based on the returned boolean value
* The end result is the original DataFrame (same number of columns) with the rows of groups that met the criteria
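
By default the rows of rejected groups are dropped. The `filter` method also has a `dropna` parameter; setting it to `False` should keep the original number of rows and fill the rows of rejected groups with missing values instead. The sketch below assumes that documented behavior and uses the fake dataset from earlier in the chapter.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

# dropna=False keeps every row; rows from rejected groups become missing values
same_shape = df.groupby('item').filter(
    lambda sub_df: sub_df['quantity'].sum() > 15, dropna=False)
print(same_shape)
```

This form can be handy when the result needs to stay aligned with the original DataFrame, at the cost of integer columns being converted to floats to hold the missing values.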

## Finding actors that appear in at least 25 movies

Let's complete a more practical example with the movie dataset by filtering for actors that have appeared in at least 25 movies. Only a few of the columns are read.

In [13]:
cols = ['title', 'year', 'content_rating', 'director_name', 
        'actor1', 'num_reviews', 'imdb_score']
movie = pd.read_csv('../data/movie.csv', usecols=cols)
movie.head(3)

Unnamed: 0,title,year,content_rating,director_name,actor1,num_reviews,imdb_score
0,Avatar,2009.0,PG-13,James Cameron,CCH Pounder,723.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Gore Verbinski,Johnny Depp,302.0,7.1
2,Spectre,2015.0,PG-13,Sam Mendes,Christoph Waltz,602.0,6.8


### Create a custom function

Our custom function is very simple. We merely need to check if the number of rows of the implicitly passed DataFrame is 25 or more.

In [14]:
movie_top_actor = (movie.groupby('actor1')
                        .filter(lambda sub_df: len(sub_df) >= 25))
movie_top_actor.head()

Unnamed: 0,title,year,content_rating,director_name,actor1,num_reviews,imdb_score
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Gore Verbinski,Johnny Depp,302.0,7.1
6,Spider-Man 3,2007.0,PG-13,Sam Raimi,J.K. Simmons,392.0,6.2
13,Pirates of the Caribbean: Dead Man's Chest,2006.0,PG-13,Gore Verbinski,Johnny Depp,313.0,7.3
14,The Lone Ranger,2013.0,PG-13,Gore Verbinski,Johnny Depp,450.0,6.5
18,Pirates of the Caribbean: On Stranger Tides,2011.0,PG-13,Rob Marshall,Johnny Depp,448.0,6.7


In [15]:
movie_top_actor.shape

(416, 7)

Let's verify the results by returning the frequency of occurrence for each `actor1` of the returned DataFrame.

In [16]:
movie_top_actor['actor1'].value_counts()

actor1
Robert De Niro       48
Johnny Depp          36
Nicolas Cage         32
Matt Damon           29
Denzel Washington    29
J.K. Simmons         29
Bruce Willis         28
Harrison Ford        27
Steve Buscemi        27
Liam Neeson          27
Robin Williams       27
Robert Downey Jr.    26
Bill Murray          26
Jason Statham        25
Name: count, dtype: int64
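
The same pattern can be tried without the movie file. The sketch below applies a size-based filter to the small fake dataset from the start of the chapter (with a cutoff of 3 rows rather than 25, an arbitrary choice for illustration) and then verifies it with `value_counts`.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

# keep only the groups that have at least 3 rows
frequent = df.groupby('item').filter(lambda sub_df: len(sub_df) >= 3)

# verify: every remaining item should appear at least 3 times
counts = frequent['item'].value_counts()
print(counts)
```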

## Multiple conditions

The custom function you create to filter your data can test as many conditions as you desire as long as it returns a single boolean value. Let's return all movies that have an actor1 with 25 or more appearances along with an average IMDB score greater than 7. We define a function that evaluates each condition.

In [17]:
def top_actor_score(sub_df):
    return len(sub_df) >= 25 and sub_df['imdb_score'].mean() > 7

Pass this function to the groupby `filter` method to get the result.

In [18]:
movie_top_actor_score = movie.groupby('actor1').filter(top_actor_score)
movie_top_actor_score.shape

(56, 7)

Only 56 rows remain in this filtered DataFrame, far fewer than the 416 in the previous one. Let's verify that each actor1 left meets both criteria.

In [19]:
(movie_top_actor_score.groupby('actor1')
                      .agg(num_movies=('actor1', 'size'),
                           mean_imdb_score=('imdb_score', 'mean')))

Unnamed: 0_level_0,num_movies,mean_imdb_score
actor1,Unnamed: 1_level_1,Unnamed: 2_level_1
Denzel Washington,29,7.055172
Harrison Ford,27,7.159259


## The groupby `transform` method

The groupby `transform` method performs a calculation on each group just like `agg`, but returns the same number of values as there are rows in the group.

### Aggregation with `transform`

The groupby `transform` method can perform an aggregation just like the `agg` method, but returns the aggregated value for each row in the group. Let's review the groupby `agg` method on the example dataset to sum the quantity of each item.

In [20]:
df.groupby('item').agg(total_quantity=('quantity', 'sum'))

Unnamed: 0_level_0,total_quantity
item,Unnamed: 1_level_1
A,12
B,16
C,7
D,22


We can perform the same aggregation with `transform`, but it returns the same number of rows as the original. The syntax for `transform` differs from that of `agg`. The aggregating column (quantity) is selected in brackets following the call to `groupby`, and then the `transform` method is called with the string name of the aggregation. A Series is returned.

In [21]:
df.groupby('item')['quantity'].transform('sum')

0    12
1    12
2    16
3    16
4    16
5     7
6     7
7    22
8    22
Name: quantity, dtype: int64
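
One way to think about this result is that `transform` computes the aggregation once per group and then looks up that value for every row. The sketch below makes that relationship explicit by building the same values by hand with a plain aggregation followed by `map`; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

totals = df.groupby('item')['quantity'].sum()   # one value per group
by_hand = df['item'].map(totals)                # look up each row's group total

via_transform = df.groupby('item')['quantity'].transform('sum')
print(by_hand.tolist())
print(via_transform.tolist())
```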

### Can append result to the original DataFrame

Since `transform` always returns an object the same length as the original DataFrame, it is common to append the result to the original DataFrame. 

In [22]:
df2 = df.copy()
df2['group total'] = df.groupby('item')['quantity'].transform('sum')
df2

Unnamed: 0,item,quantity,group total
0,A,2,12
1,A,10,12
2,B,3,16
3,B,7,16
4,B,6,16
5,C,5,7
6,C,2,7
7,D,10,22
8,D,12,22
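
Because the group total is now available for every row, the group-level selection from the `filter` section can also be written as ordinary boolean selection. The following sketch compares the two approaches on the fake dataset; it is offered as an alternative formulation, not a replacement for `filter`.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

# boolean Series aligned with df: True for rows whose group total exceeds 15
group_total = df.groupby('item')['quantity'].transform('sum')
via_mask = df[group_total > 15]

# the same selection with the groupby filter method
via_filter = df.groupby('item').filter(lambda sub_df: sub_df['quantity'].sum() > 15)
print(via_mask)
```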


### `transform` second use case - return a new value for each row in the group

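You can also use `transform` to apply a specific transformation to each value in the group. For instance, we can divide each value in the group by the total of that specific group. For this, we need a custom function. Note that, despite its name, the `divide_max` function below divides by the group sum, not the group maximum.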

In [23]:
def divide_max(sub_series):
    return sub_series / sub_series.sum()

The function passed to `transform` must return either a single value or a sequence of values the same length as the group. In this instance, it returns a Series the same length as the group.

In [24]:
df2['perc_of_total'] = df.groupby('item')['quantity'].transform(divide_max).round(2)
df2

Unnamed: 0,item,quantity,group total,perc_of_total
0,A,2,12,0.17
1,A,10,12,0.83
2,B,3,16,0.19
3,B,7,16,0.44
4,B,6,16,0.38
5,C,5,7,0.71
6,C,2,7,0.29
7,D,10,22,0.45
8,D,12,22,0.55
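
The same percentages can also be computed without a custom function by dividing the original column by the group totals that `transform('sum')` returns. This is a sketch of that alternative; whether it is preferable depends on taste, though string-named aggregations tend to run faster than custom functions on large data.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

# group total repeated for every row, then ordinary element-wise division
group_total = df.groupby('item')['quantity'].transform('sum')
perc = (df['quantity'] / group_total).round(2)
print(perc.tolist())
```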


### Implicitly passed a Series

The `transform` method is different than `filter` in that it implicitly passes just a Series of data to the custom function. You only have access to that one Series inside of the custom function and not all of the columns like you do with `filter`. It can be instructive to print out everything that is happening within the custom function. Here, we print out both the implicitly passed original Series and the returned transformed Series for each group.

In [25]:
def divide_max2(sub_series):
    result = sub_series / sub_series.sum()
    print("Original", sub_series, sep='\n', end='\n\n')
    print("Transformed", result, sep='\n', end='\n\n\n')
    return result

df.groupby('item')['quantity'].transform(divide_max2)

Original
0     2
1    10
Name: A, dtype: int64

Transformed
0    0.166667
1    0.833333
Name: A, dtype: float64


Original
2    3
3    7
4    6
Name: B, dtype: int64

Transformed
2    0.1875
3    0.4375
4    0.3750
Name: B, dtype: float64


Original
5    5
6    2
Name: C, dtype: int64

Transformed
5    0.714286
6    0.285714
Name: C, dtype: float64


Original
7    10
8    12
Name: D, dtype: int64

Transformed
7    0.454545
8    0.545455
Name: D, dtype: float64




0    0.166667
1    0.833333
2    0.187500
3    0.437500
4    0.375000
5    0.714286
6    0.285714
7    0.454545
8    0.545455
Name: quantity, dtype: float64

### `transform` must return either a single value or a Series the same length as the group

The custom function that you use with `transform` must return either a single value or a Series the same exact length as the group. Our first use case returned an aggregation (a single value), while our second returned the Series divided by the total of each group.
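
Both kinds of return value can be seen side by side in the following sketch on the fake dataset. The first lambda returns one number per group (the range), which is repeated for each row of the group; the second returns a Series the same length as the group (the distance from the group minimum). The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

grouped = df.groupby('item')['quantity']

# single value per group: repeated for every row in that group
group_range = grouped.transform(lambda s: s.max() - s.min())

# Series per group: one new value for each row in that group
above_min = grouped.transform(lambda s: s - s.min())

print(group_range.tolist())
print(above_min.tolist())
```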

### Find difference from the mean

Let's read in the City of Houston employee dataset and transform each salary so that it shows the difference between it and the mean salary of that employee's department.

In [26]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


We define a custom function that subtracts the mean of that group from all the values in the group.

In [27]:
def sub_mean(s):
    return (s - s.mean()).round(-3)

We call the `transform` method with this function and create a new column which informs us how much more or less each employee is making relative to the mean of their department.

In [28]:
emp['salary_diff_mean'] = emp.groupby('dept')['salary'].transform(sub_mean)
emp.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race,salary_diff_mean
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White,21000.0
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic,21000.0
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black,-2000.0
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.1,Male,Hispanic,9000.0
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White,3000.0
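
The calculation can be checked on a tiny example where the group means are easy to work out by hand. The DataFrame below is made up for illustration (it is not the Houston data), and the sketch compares the custom-function approach with subtracting the result of `transform('mean')` directly.

```python
import pandas as pd

# hypothetical toy data, not taken from the employee dataset
toy = pd.DataFrame({'dept': ['Police', 'Police', 'Fire', 'Fire', 'Fire'],
                    'salary': [60000, 80000, 50000, 55000, 75000]})

# custom function: subtract the group mean from each value in the group
diff_custom = toy.groupby('dept')['salary'].transform(lambda s: s - s.mean())

# equivalent: broadcast the group mean with transform, then subtract
diff_vector = toy['salary'] - toy.groupby('dept')['salary'].transform('mean')

print(diff_custom.tolist())
```

Here the Police mean is 70,000 and the Fire mean is 60,000, so each difference can be verified by inspection.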


## Transforming multiple columns

It's possible to use the `transform` method on multiple columns instead of just one, as we've been doing so far. We begin by reading in a few columns of the college dataset.

In [29]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, index_col='instnm')
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


Place all of the columns you want to pass to the `transform` method in a list inside the brackets following the call to `groupby`. The following takes the mean SAT verbal and SAT math scores for each state.

In [30]:
mean_sat = (college.groupby('stabbr')[['satvrmid', 'satmtmid']]
                   .transform('mean')
                   .round(0))
mean_sat.head(3)

Unnamed: 0_level_0,satvrmid,satmtmid
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,508.0,504.0
University of Alabama at Birmingham,508.0,504.0
Amridge University,508.0,504.0


These columns can then be appended to the original DataFrame.

In [31]:
college[['sat_verbal_mean', 'sat_math_mean']] = mean_sat
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds,sat_verbal_mean,sat_math_mean
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0,508.0,504.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0,508.0,504.0
Amridge University,AL,1,,,291.0,508.0,504.0


Let's filter for a different state (Texas) to verify that the mean scores are different.

In [32]:
college.query('stabbr == "TX"').head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds,sat_verbal_mean,sat_math_mean
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Abilene Christian University,TX,1,530.0,545.0,3572.0,511.0,523.0
Alvin Community College,TX,0,,,4682.0,511.0,523.0
Amarillo College,TX,0,,,9346.0,511.0,523.0


### Standardization

A common transformation for numeric columns is to subtract the mean and divide by the standard deviation. This is called **standardization** and is often completed before performing machine learning. It provides a relative metric of how many standard deviations away from the mean each value is. This metric is also known as the **z-score**. Let's define a custom function to produce the calculation.

In [34]:
def standardize(s):
    return (s - s.mean()) / s.std()

Let's standardize the SAT score and undergraduate population columns by state.

In [35]:
(college.groupby('stabbr')[['satvrmid', 'satmtmid', 'ugds']]
        .transform(standardize)
        .round(2)
        .head(3))

Unnamed: 0_level_0,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Alabama A & M University,-1.55,-1.43,0.3
University of Alabama at Birmingham,1.13,1.03,1.84
Amridge University,,,-0.54
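
A quick sanity check for a group-wise standardization is that, within each group, the standardized values have a mean of 0 and a standard deviation of 1. The sketch below runs that check on the small fake dataset from the start of the chapter rather than the college data. It relies on the pandas `std` method using the sample standard deviation by default, both inside `standardize` and in the check.

```python
import pandas as pd

df = pd.DataFrame({'item': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'quantity': [2, 10, 3, 7, 6, 5, 2, 10, 12]})

def standardize(s):
    return (s - s.mean()) / s.std()

z = df.groupby('item')['quantity'].transform(standardize)

# within each group, the z-scores should have mean 0 and standard deviation 1
check = z.groupby(df['item']).agg(['mean', 'std'])
print(check.round(6))
```

A group with a single row would produce a missing value, since the sample standard deviation of one observation is undefined.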


### Transforming all columns

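If no columns are selected after the call to the `groupby` method, then all columns other than the grouping column are transformed. In older versions of pandas, a column that could not be transformed (such as a string column when taking the mean) was silently dropped; pandas 2.0 and later raise an error instead, so select only the suitable columns in that case. Here, every remaining column is numeric, so we transform all of them by immediately calling the `transform` method after grouping.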

In [37]:
college.groupby('stabbr').transform('mean').head(3)

Unnamed: 0_level_0,relaffil,satvrmid,satmtmid,ugds,sat_verbal_mean,sat_math_mean
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alabama A & M University,0.25,508.47619,504.285714,2789.865169,508.0,504.0
University of Alabama at Birmingham,0.25,508.47619,504.285714,2789.865169,508.0,504.0
Amridge University,0.25,508.47619,504.285714,2789.865169,508.0,504.0
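
When a DataFrame mixes numeric and string columns, one defensive option is to pick out the numeric columns explicitly before transforming. The sketch below uses a made-up DataFrame (the `city` column and all values are hypothetical) and `select_dtypes` to build the column list.

```python
import pandas as pd

# hypothetical toy data with one string column that cannot be averaged
toy = pd.DataFrame({'state': ['AL', 'AL', 'TX', 'TX'],
                    'city': ['Huntsville', 'Mobile', 'Austin', 'Dallas'],
                    'sat': [500.0, 520.0, 540.0, 560.0],
                    'ugds': [1000, 3000, 2000, 6000]})

# keep only the numeric columns, then transform just those
numeric_cols = toy.select_dtypes('number').columns.tolist()
state_means = toy.groupby('state')[numeric_cols].transform('mean')
print(state_means)
```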


### Summary of the groupby `transform` method

* Syntax - `df.groupby('grouping col')['transformed col'].transform(func)`
* The function accepts a pandas Series of all the values in the group
* The function must return either a single value or a Series the same length as the group
* Define either a custom function or use a string name of a pandas aggregation function
* If a single value is returned from the custom function, then that value is repeated for the length of the group
* The final pandas object returned always has the same number of values as the original

## Exercises

Execute the cell below to reread the college dataset and use it for the exercises below.

In [38]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, index_col='instnm')
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


### Exercise 1

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have more than 500,000 total undergraduate students. Can you verify your results?</span>

In [45]:
college.groupby('stabbr').filter(lambda sub_df: sub_df['ugds'].sum() > 500_000)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Prince Institute-Southeast,IL,0,,,84.0
Everest College-Phoenix,AZ,1,,,4102.0
Collins College,AZ,0,,,83.0
Empire Beauty School-Paradise Valley,AZ,1,,,25.0
Empire Beauty School-Tucson,AZ,0,,,126.0
...,...,...,...,...,...
Vantage College,TX,1,,,
SAE Institute of Technology San Francisco,CA,1,,,
National Personal Training Institute of Cleveland,OH,1,,,
Bay Area Medical Academy - San Jose Satellite Location,CA,1,,,


In [46]:
college.groupby('stabbr').filter(lambda sub_df: sub_df['ugds'].sum() > 500_000)['stabbr'].value_counts()

stabbr
CA    773
TX    472
NY    459
FL    436
PA    394
OH    352
IL    300
AZ    133
Name: count, dtype: int64

In [53]:
pop_state = college.groupby('stabbr').agg(total_ugds=('ugds','sum'))

pop_state_500 = pop_state['total_ugds'] > 500_000


pop_state[pop_state_500].sort_values(by='total_ugds', ascending=False)

Unnamed: 0_level_0,total_ugds
stabbr,Unnamed: 1_level_1
CA,2304492.0
TX,1277374.0
NY,993623.0
FL,959753.0
PA,604942.0
IL,599816.0
OH,537638.0
AZ,520439.0


### Exercise 2

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have an average undergraduate student population greater than 2,500 and have more than 30 religiously affiliated schools. Can you verify your results?</span>

In [54]:
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


In [None]:
def df_filt(sub_df):
    #print(sub_df, end='\n\n')
    return (sub_df['ugds'].mean() > 2500) and (sub_df['relaffil'].sum() > 30)

In [61]:
c2 = college.groupby('stabbr').filter(df_filt)

c2

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Academy of Art University,CA,0,,,9885.0
ITT Technical Institute-Rancho Cordova,CA,0,,,500.0
Academy of Chinese Culture and Health Sciences,CA,0,,,
The Academy of Radio and TV Broadcasting,CA,0,,,14.0
Avalon School of Cosmetology-Alameda,CA,0,,,253.0
...,...,...,...,...,...
WestMed College - Merced,CA,1,,,
Vantage College,TX,1,,,
SAE Institute of Technology San Francisco,CA,1,,,
Bay Area Medical Academy - San Jose Satellite Location,CA,1,,,


In [62]:
c2.groupby('stabbr').agg(mean_ugds=('ugds','mean'),total_relaffil=('relaffil','sum'))

Unnamed: 0_level_0,mean_ugds,total_relaffil
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
CA,3518.308397,164
GA,2642.571429,37
IN,2653.559055,62
MI,2643.016043,48
TX,2998.530516,96
VA,2694.9,44


### Exercise 3

<span style="color:green; font-size:16px">The maximum SAT score for each test is 800. Create a new column in the college dataset that shows each school's percentage of maximum for each SAT score.</span>

In [63]:
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


In [104]:
def sat_perc(s):
    return (s / 800).round(2) * 100


In [105]:
sat_perc_score = college.groupby('instnm')[['satvrmid','satmtmid']].transform(sat_perc)

college3 = college.copy()

college3[['satvrmid_perc','satmtmid_perc']] =  sat_perc_score

college3

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds,satvrmid_perc,satmtmid_perc
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0,53.0,52.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0,71.0,71.0
Amridge University,AL,1,,,291.0,,
University of Alabama in Huntsville,AL,0,595.0,590.0,5451.0,74.0,74.0
Alabama State University,AL,0,425.0,430.0,4811.0,53.0,54.0
...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,CA,1,,,,,
Rasmussen College - Overland Park,KS,1,,,,,
National Personal Training Institute of Cleveland,OH,1,,,,,
Bay Area Medical Academy - San Jose Satellite Location,CA,1,,,,,


### Use the City of Houston dataset

Execute the following cell to read in the City of Houston employee dataset and then use it for the following exercises.

In [70]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### Exercise 4

<span style="color:green; font-size:16px">Filter the employee DataFrame so that only position titles with an average salary greater than 100,000 remain. Can you verify your results?</span>

In [71]:
emp.groupby('title').filter(lambda x: x['salary'].mean() > 100_000)

Unnamed: 0,dept,title,hire_date,salary,sex,race
16,Other,ASSOCIATE JUDGE OF MUNICIPAL COURTS,2005-11-09,107744.00,Male,Hispanic
17,Police,POLICE COMMANDER,1983-02-07,115821.42,Male,White
19,Other,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),2002-05-28,95783.00,Female,Hispanic
39,Houston Airport System,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,2017-08-15,112270.00,Male,Black
48,Fire,ASSISTANT FIRE CHIEF,1994-11-07,115835.98,Male,Hispanic
...,...,...,...,...,...,...
24159,Fire,"PHYSICIAN,MD",2017-01-09,342784.00,Male,Asian
24219,Other,ERP BUSINESS SYSTEMS CONSULTANT,2001-07-09,92104.00,Female,White
24238,Other,DEPUTY CIO - IT INFRASTRUCTURE (EXE LVL),2006-12-04,162915.00,Male,White
24267,Houston Public Works,SUPERVISING ENGINEER,2011-03-21,98703.00,Female,Asian


In [73]:
emp.groupby('title').filter(lambda x: x['salary'].mean() > 100_000)['title'].value_counts()

title
ASSISTANT DIRECTOR (EXECUTIVE LEVEL)        78
SUPERVISING ENGINEER                        69
DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV    66
SENIOR ASSISTANT CITY ATTORNEY II           51
POLICE COMMANDER                            45
                                            ..
IT ARCHITECT - APPLICATIONS                  1
GENERAL SERVICES DIRECTOR                    1
CHIEF OF STAFF-MAYOR'S OFFICE (EXECUTIVE     1
DIR MAYOR'S OFFICE SPECIAL EVENTS EX LEV     1
DEPUTY CIO - IT INFRASTRUCTURE (EXE LVL)     1
Name: count, Length: 107, dtype: int64

In [76]:
emp1 = emp.groupby('title').agg(avg_sal=('salary','mean'))

emp1_filt = emp1['avg_sal'] > 100_000

emp1[emp1_filt]

Unnamed: 0_level_0,avg_sal
title,Unnamed: 1_level_1
ADMINISTRATION & REGULATORY AFFAIRS DIR,180000.000000
ADMINISTRATIVE JUDGE OF MUNICIPAL COURTS,135176.000000
AIRPORT BUSINESS DEVELOPMENT COORDINATOR,101518.714286
ASSISTANT AIRPORT MANAGER,103567.500000
ASSISTANT CHIEF POLICY OFFICER (EXECUTIV,110424.000000
...,...
SOLID WASTE DIRECTOR,195000.000000
STAFF PSYCHOLOGIST,104880.375000
"STAFF VETERINARIAN,DVM",118453.000000
SUPERVISING ENGINEER,102287.000000


### Exercise 5

<span style="color:green; font-size:16px">Filter the employee dataset so that only position titles with at least 5 employees and an average salary of at least 80,000 remain. Can you verify the results?</span>

In [77]:
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


In [106]:
def func2(df):
    return  len(df) >= 5 and df['salary'].mean() >= 80_000

In [107]:
emp5 = emp.groupby('title').filter(func2)

emp5

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic
16,Other,ASSOCIATE JUDGE OF MUNICIPAL COURTS,2005-11-09,107744.00,Male,Hispanic
17,Police,POLICE COMMANDER,1983-02-07,115821.42,Male,White
19,Other,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),2002-05-28,95783.00,Female,Hispanic
...,...,...,...,...,...,...
24271,Other,DIVISION MANAGER,1989-10-30,85372.00,Male,Black
24276,Other,DIVISION MANAGER,1993-09-21,89623.00,Male,Black
24288,Fire,DISTRICT CHIEF,1982-06-28,89590.02,Male,White
24292,Houston Airport System,SENIOR STAFF ANALYST (EXECUTIVE LEVEL),2018-11-19,95004.00,Male,Black


### Exercise 6

<span style="color:green; font-size:16px">Add a column to the DataFrame that contains the median salary based on department, sex, and race.</span>

In [111]:
emp6 = emp.copy()

emp6['med_sal_dep_sex_race']  = emp.groupby(['dept','sex','race'])['salary'].transform('median')

emp6

Unnamed: 0,dept,title,hire_date,salary,sex,race,med_sal_dep_sex_race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White,73479.00
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic,47445.00
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black,38813.00
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic,68116.62
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White,73479.00
...,...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black,68116.62
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.00,Female,Black,52915.00
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.00,Male,Black,38813.00
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.00,Female,Black,50773.00


### Exercise 7

<span  style="color:green; font-size:16px">Add a new column, `pct_max_dept_sex`, to the employee DataFrame that holds the employee's percentage of the maximum salary for each department and sex. For instance, if a male HPD employee makes 80,000 and the maximum male HPD salary is 120,000, then the value for this employee would be 80,000/120,000 or 0.667. Verify this value for the first employee.</span>

In [116]:
emp7 = emp.copy()

emp7['pct_max_dept_sex'] =  emp.groupby(['dept','sex'])['salary'].transform(lambda x: (x / x.max()).round(2) * 100)

emp7

Unnamed: 0,dept,title,hire_date,salary,sex,race,pct_max_dept_sex
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White,31.0
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic,30.0
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black,23.0
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic,27.0
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White,25.0
...,...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black,27.0
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.00,Female,Black,37.0
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.00,Male,Black,16.0
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.00,Female,Black,36.0


In [115]:
emp.groupby(['dept','sex']).agg(max_salary=('salary','max'))

Unnamed: 0_level_0,Unnamed: 1_level_0,max_salary
dept,sex,Unnamed: 2_level_1
Fire,Female,342784.0
Fire,Male,342784.0
Health & Human Services,Female,186685.0
Health & Human Services,Male,186685.0
Houston Airport System,Female,180250.0
Houston Airport System,Male,275000.0
Houston Public Works,Female,275000.0
Houston Public Works,Male,216300.0
Library,Female,170000.0
Library,Male,115315.0
