### Codio Activity 3.4: Aggregation Operations

**Expected Time: 60 Minutes**

**Total Points: 10**

This activity focuses on using a DataFrame's `groupby` method. As a reminder, this method splits the rows into groups based on a category or condition in the DataFrame so that each group can be aggregated. The dataset for this activity is again the gapminder dataset that ships with `plotly`.
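As a quick refresher, the split-apply-combine idea behind `groupby` can be sketched on a tiny table. The column names and values below are made up for illustration and are not part of gapminder.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from gapminder)
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Split the rows by region, apply mean() to each group,
# and combine the per-group results into one DataFrame
mean_sales = df.groupby("region")[["sales"]].mean()
print(mean_sales)
```

The result has one row per distinct `region` value, and the grouping column becomes the index.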

## Index:

- [Problem 1](#Problem-1:-Average-Life-Expectancy-by-Year)
- [Problem 2](#Problem-2:-GDP-by-Continent)
- [Problem 3](#Problem-3:-Aggregating-with-multiple-functions)
- [Problem 4](#Problem-4:-Grouping-on-Numeric-Conditions)
- [Problem 5](#Problem-5:-Multiple-Grouping)

In [1]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
gapminder = px.data.gapminder()

In [3]:
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
 6   iso_alpha  1704 non-null   object 
 7   iso_num    1704 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 106.6+ KB


In [4]:
gapminder.head()

,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


[Back to top](#Index:) 

### Problem 1: Average Life Expectancy by Year

**2 Points**

Use the `groupby` method to determine the average life expectancy by year.  Assign your results as a DataFrame object to `ans1` below.
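Whether the result is a DataFrame or a Series depends on how the column is selected after grouping. The sketch below uses a made-up table (the names `team` and `score` are assumptions for illustration) to show the difference between single and double brackets.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1.0, 3.0, 5.0]})

# Single brackets select one column, so the aggregate is a Series
as_series = df.groupby("team")["score"].mean()

# Double brackets select a list of columns, so the aggregate is a DataFrame
as_frame = df.groupby("team")[["score"]].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)

# A Series can also be converted to a one-column DataFrame afterwards
converted = as_series.to_frame()
```

Either route works when a DataFrame is required; the double-bracket form simply avoids the extra conversion step.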

In [5]:
### GRADED

ans1 = None

### BEGIN SOLUTION
ans1 = gapminder.groupby('year')[['lifeExp']].mean()
type(ans1)
### END SOLUTION

# Answer check
print(type(ans1))
ans1

<class 'pandas.core.frame.DataFrame'>


year,lifeExp
1952,49.05762
1957,51.507401
1962,53.609249
1967,55.67829
1972,57.647386
1977,59.570157
1982,61.533197
1987,63.212613
1992,64.160338
1997,65.014676


In [6]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans1_ = gapminder_.groupby('year')[['lifeExp']].mean()
#
#
#
assert type(ans1_) == type(ans1)
pd.testing.assert_frame_equal(ans1, ans1_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 2: GDP by Continent

**2 Points**

Use the `groupby` method to determine the median GDP per capita for each continent.  Assign your response as a DataFrame to `ans2` below.

In [7]:
### GRADED

ans2 = None

### BEGIN SOLUTION
ans2 = gapminder.groupby('continent')[['gdpPercap']].median()
type(ans2)
### END SOLUTION

# Answer check
print(type(ans2))
ans2

<class 'pandas.core.frame.DataFrame'>


continent,gdpPercap
Africa,1192.138217
Americas,5465.509853
Asia,2646.786844
Europe,12081.749115
Oceania,17983.303955


In [8]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans2_ = gapminder_.groupby('continent')[['gdpPercap']].median()
#
#
#
assert type(ans2_) == type(ans2)
pd.testing.assert_frame_equal(ans2, ans2_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 3: Aggregating with multiple functions

**2 Points**

Use the `groupby.agg()` method to return a DataFrame with multiple summary values. Group the data by continent and return the mean, median, and standard deviation of `gdpPercap` (the aggregation names `'mean'`, `'median'`, and `'std'`). Assign your solution to `ans3` below.
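Passing a list of function names to `agg` produces one output column per function, nested under the original column name. A minimal sketch on made-up data (the names `group` and `value` are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# One row per group, one column per aggregation function
summary = df.groupby("group")[["value"]].agg(["mean", "median", "std"])
print(summary)

# The columns form a two-level MultiIndex, so a single cell
# is addressed with a (column, function) tuple
x_mean = summary.loc["x", ("value", "mean")]
print(x_mean)
```

Note that `'std'` here is the sample standard deviation (pandas uses `ddof=1` by default).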

In [9]:
### GRADED

ans3 = None

### BEGIN SOLUTION
ans3 = gapminder.groupby('continent')[['gdpPercap']].agg(['mean', 'median', 'std'])
type(ans3)
### END SOLUTION

# Answer check
print(type(ans3))
ans3

<class 'pandas.core.frame.DataFrame'>


,gdpPercap,gdpPercap,gdpPercap
continent,mean,median,std
Africa,2193.754578,1192.138217,2827.929863
Americas,7136.110356,5465.509853,6396.764112
Asia,7902.150428,2646.786844,14045.373112
Europe,14469.475533,12081.749115,9355.213498
Oceania,18621.609223,17983.303955,6358.983321


In [10]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans3_ = gapminder_.groupby('continent')[['gdpPercap']].agg(['mean', 'median', 'std'])
#
#
#
assert type(ans3_) == type(ans3)
pd.testing.assert_frame_equal(ans3, ans3_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 4: Grouping on Numeric Conditions

**2 Points**

Besides grouping on categorical features, a numeric condition can also be used to split the data. For example, group the rows based on population and calculate the average life expectancy for each group. Create two groups: one for rows with a population greater than 300,000,000 and another for rows with a population less than or equal to 300,000,000. Assign the resulting DataFrame to `ans4` below.

NOTE: This question should return two groups based on whether the population is greater than 300,000,000 (`True`) or not (`False`). In the next question, you will group by multiple features.
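Grouping on a condition works because `groupby` accepts a boolean Series aligned with the DataFrame's rows. The sketch below illustrates this on made-up data; the column names and the threshold of 200 are assumptions chosen for the example.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "population": [50, 500, 120, 900],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# The comparison yields a boolean Series with one entry per row
is_large = df["population"] > 200

# Rows are split into a False group and a True group, then averaged
by_size = df.groupby(is_large)[["score"]].mean()
print(by_size)
```

The index of the result holds the two boolean values and inherits the name of the Series used in the comparison (`population` in this sketch).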

In [11]:
### GRADED

ans4 = None

### BEGIN SOLUTION
ans4 = gapminder.groupby(gapminder['pop'] > 300_000_000)[['lifeExp']].mean()
type(ans4)
### END SOLUTION

# Answer check
print(type(ans4))
ans4

<class 'pandas.core.frame.DataFrame'>


pop,lifeExp
False,59.491833
True,58.306267


In [12]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans4_ = gapminder_.groupby(gapminder_['pop'] > 300_000_000)[['lifeExp']].mean()
#
#
#
assert type(ans4_) == type(ans4)
pd.testing.assert_frame_equal(ans4, ans4_)
### END HIDDEN TESTS

[Back to top](#Index:) 

### Problem 5: Multiple Grouping

**2 Points**

Finally, a groupby can be called on a hierarchy of conditions. The list of group categories should be passed in order of the grouping hierarchy. For example:

```python
gapminder.groupby(['continent', 'country'])
```

would first group on the continent and then each country within the continent. 
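To see what a hierarchical grouping returns, here is a small sketch on made-up data (the names `dept`, `team`, and `hours` are assumptions for illustration). The result is indexed by every combination of keys that occurs in the data.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "dept": ["eng", "eng", "eng", "ops", "ops"],
    "team": ["api", "api", "web", "it", "it"],
    "hours": [4.0, 6.0, 8.0, 2.0, 4.0],
})

# Group by dept first, then by team within each dept
nested = df.groupby(["dept", "team"])[["hours"]].mean()
print(nested)

# Selecting an outer key returns the inner level for that key only
eng_only = nested.loc["eng"]

# reset_index() turns the index levels back into ordinary columns
flat = nested.reset_index()
```

The order of the list matters: `['dept', 'team']` makes `dept` the outer index level and `team` the inner one.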

Below, subset the data to the Americas and Europe and save your subset to `ans5a`. This creates a DataFrame containing only rows from the Americas and Europe. Then, group on the continent and the country, in that order, and determine the average life expectancy. Save the resulting DataFrame to `ans5b` below. Your final result should be the average life expectancy for each country in the Americas or Europe.
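One common way to keep only rows whose category is in a given set is a boolean mask built with `Series.isin`. The sketch below uses made-up data (the names `color` and `n` are assumptions for illustration) and is only one of several ways to write such a filter.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "n": [1, 2, 3, 4],
})

# isin() marks each row True when its value appears in the list
mask = df["color"].isin(["red", "green"])

# Boolean indexing keeps the True rows and their original row labels
subset = df[mask]
print(subset)
```

Because the original row labels are kept, the filtered DataFrame's index may have gaps.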

In [13]:
### GRADED

ans5a = None
ans5b = None

### BEGIN SOLUTION
ans5a = gapminder[gapminder['continent'].isin(['Americas', 'Europe'])]
ans5b = ans5a.groupby(['continent', 'country'])[['lifeExp']].mean()
type(ans5a)
### END SOLUTION

# Answer check
print(type(ans5a))
print(type(ans5b))
ans5b.head()

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


continent,country,lifeExp
Americas,Argentina,69.060417
Americas,Bolivia,52.504583
Americas,Brazil,62.2395
Americas,Canada,74.90275
Americas,Chile,67.430917


In [14]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans5a_ = gapminder_[gapminder_['continent'].isin(['Americas', 'Europe'])]
ans5b_ = ans5a_.groupby(['continent', 'country'])[['lifeExp']].mean()
#
#
#
assert type(ans5a_) == type(ans5a)
pd.testing.assert_frame_equal(ans5a, ans5a_)
pd.testing.assert_frame_equal(ans5b, ans5b_)
### END HIDDEN TESTS