### Codio Activity 3.4: Aggregation Operations

**Expected Time: 60 Minutes**

**Total Points: 10**

This activity focuses on using a DataFrame's `groupby` method. As a reminder, this method aggregates rows by a category or condition in the DataFrame. The dataset for this activity is again the gapminder dataset that ships with `plotly`.
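As a quick refresher, the split-apply-combine idea behind `groupby` can be sketched on a tiny hand-built DataFrame. The `toy` data below is hypothetical and is not taken from gapminder; only the column names are borrowed.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 72.0, 78.0, 75.0],
})

# Split the rows by continent, apply mean() to each group,
# and combine the per-group results into a single object.
by_continent = toy.groupby("continent")["lifeExp"].mean()
print(by_continent)
```

Each distinct value in the grouping column becomes one entry in the index of the result.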

## Index:

- [Problem 1](#Problem-1:-Average-Life-Expectancy-by-Year)
- [Problem 2](#Problem-2:-GDP-by-Continent)
- [Problem 3](#Problem-3:-Aggregating-with-multiple-functions)
- [Problem 4](#Problem-4:-Grouping-on-Numeric-Conditions)
- [Problem 5](#Problem-5:-Multiple-Grouping)

In [2]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
gapminder = px.data.gapminder()

In [4]:
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
 6   iso_alpha  1704 non-null   object 
 7   iso_num    1704 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 106.6+ KB


In [8]:
gapminder.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


[Back to top](#Index:) 

### Problem 1: Average Life Expectancy by Year

**2 Points**

Use the `groupby` method on the `gapminder` DataFrame to group by the values in the `year` column. Use double square bracket notation to select the `lifeExp` column. Then use the `mean()` method to compute the average value for each group.

Assign your results as a DataFrame object to `ans1` below.

In [26]:
### GRADED

ans1 = None

# YOUR CODE HERE
ans1 = gapminder.groupby('year')[['lifeExp']].mean()

# Answer check
print(type(ans1))
ans1

<class 'pandas.core.frame.DataFrame'>


| year | lifeExp |
|---|---|
| 1952 | 49.05762 |
| 1957 | 51.507401 |
| 1962 | 53.609249 |
| 1967 | 55.67829 |
| 1972 | 57.647386 |
| 1977 | 59.570157 |
| 1982 | 61.533197 |
| 1987 | 63.212613 |
| 1992 | 64.160338 |
| 1997 | 65.014676 |
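The reason Problem 1 asks for double square brackets can be sketched as follows: selecting with a list of column names keeps the result a DataFrame, while a single label yields a Series. The `toy` data here is hypothetical and only mimics the gapminder column names.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 50.0, 44.0, 56.0],
})

# Single brackets select one column, so the aggregate is a Series.
as_series = toy.groupby("year")["lifeExp"].mean()

# Double brackets pass a list of columns, so the aggregate is a DataFrame.
as_frame = toy.groupby("year")[["lifeExp"]].mean()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Both objects hold the same numbers; only the container type differs, which matters when the answer check expects a DataFrame.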


[Back to top](#Index:) 

### Problem 2: GDP by Continent

**2 Points**

Use the `groupby` method on the `gapminder` DataFrame to group by the values in the `continent` column. Use double square bracket notation to select the `gdpPercap` column. Then use the `median()` method to compute the median value for each group.


Assign your response as a DataFrame to `ans2` below.

In [28]:
### GRADED

ans2 = None

# YOUR CODE HERE
ans2 = gapminder.groupby('continent')[['gdpPercap']].median()

# Answer check
print(type(ans2))
ans2

<class 'pandas.core.frame.DataFrame'>


| continent | gdpPercap |
|---|---|
| Africa | 1192.138217 |
| Americas | 5465.509853 |
| Asia | 2646.786844 |
| Europe | 12081.749115 |
| Oceania | 17983.303955 |


[Back to top](#Index:) 

### Problem 3: Aggregating with multiple functions

**2 Points**

Use the `groupby` method on the `gapminder` DataFrame to group by the values in the `continent` column. Use double square bracket notation to select the `gdpPercap` column. Then call the `agg()` method with the argument `['mean', 'median', 'std']` to compute the mean, the median, and the standard deviation for each group.


Assign your solution to `ans3` below.

In [17]:
### GRADED

ans3 = None

# YOUR CODE HERE
ans3 = gapminder.groupby('continent')[['gdpPercap']].agg(['mean', 'median', 'std'])

# Answer check
print(type(ans3))
ans3

<class 'pandas.core.frame.DataFrame'>


| continent | gdpPercap (mean) | gdpPercap (median) | gdpPercap (std) |
|---|---|---|---|
| Africa | 2193.754578 | 1192.138217 | 2827.929863 |
| Americas | 7136.110356 | 5465.509853 | 6396.764112 |
| Asia | 7902.150428 | 2646.786844 | 14045.373112 |
| Europe | 14469.475533 | 12081.749115 | 9355.213498 |
| Oceania | 18621.609223 | 17983.303955 | 6358.983321 |
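Passing a list of function names to `agg()` produces two-level column labels, which can be awkward to index. The sketch below, on hypothetical toy data, shows one way to read a single statistic from that result and, as an alternative not required by the activity, how pandas named aggregation yields flat column names. The names `gdp_mean` and `gdp_median` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "gdpPercap": [1000.0, 2000.0, 9000.0, 10000.0, 12000.0, 14000.0],
})

# A list of functions gives columns labelled by (column, statistic) tuples.
stats = toy.groupby("continent")[["gdpPercap"]].agg(["mean", "median", "std"])

# Read one cell by passing the full column tuple.
asia_median = stats.loc["Asia", ("gdpPercap", "median")]
print(asia_median)  # 2000.0

# Named aggregation: each keyword becomes a flat output column name.
flat = toy.groupby("continent").agg(
    gdp_mean=("gdpPercap", "mean"),
    gdp_median=("gdpPercap", "median"),
)
print(flat)
```

In the toy data the Asia mean (4000.0) sits well above its median (2000.0) because of one large value, the same kind of skew visible in the Asia row of the gapminder output.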


[Back to top](#Index:) 

### Problem 4: Grouping on Numeric Conditions

**2 Points**

Besides grouping on categorical features, a numeric condition can also be used to split the data. For example, you can group the entries by population size and calculate the average life expectancy for each group.

Use the `groupby` method on the `gapminder` DataFrame to group the entries by whether the value in the `pop` column is greater than `300_000_000`. Use double square bracket notation to select the `lifeExp` column. Then use the `mean()` method to compute the average value for each group.

Assign your result as a DataFrame to `ans4` below.

NOTE: This question should return two groups based on whether the population is greater than 300,000,000 (`True`) or not (`False`). In the next question, you will group by multiple features.

In [30]:
### GRADED

ans4 = None

# YOUR CODE HERE
ans4 = gapminder.groupby(gapminder['pop'] > 300_000_000)[['lifeExp']].mean()

# Answer check
print(type(ans4))
ans4

<class 'pandas.core.frame.DataFrame'>


| pop | lifeExp |
|---|---|
| False | 59.491833 |
| True | 58.306267 |
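Grouping on a condition works because `groupby` accepts a boolean Series aligned with the DataFrame's index. The sketch below uses hypothetical toy data and an illustrative name, `pop_over_300M`, to give the resulting index a clearer label than the default `pop`.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "pop": [5_000_000, 400_000_000, 20_000_000, 900_000_000],
    "lifeExp": [70.0, 60.0, 80.0, 64.0],
})

# The comparison returns one True/False value per row.
is_large = toy["pop"] > 300_000_000

# Renaming the Series controls the name of the index in the result.
is_large = is_large.rename("pop_over_300M")

# Rows are split into a False group and a True group.
result = toy.groupby(is_large)[["lifeExp"]].mean()
print(result)
```

Note that the column is accessed as `toy["pop"]` rather than `toy.pop`, because `pop` is also the name of a DataFrame method.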


[Back to top](#Index:) 

### Problem 5: Multiple Grouping

**2 Points**

Finally, a groupby can be called on a hierarchy of conditions. The list of group categories should be passed in the order of the grouping hierarchy. For example:

```python
gapminder.groupby(['continent', 'country'])
```

would first group on the continent and then each country within the continent. 
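A minimal sketch of what such a two-level grouping returns, using hypothetical toy data rather than gapminder: the result carries a two-level index, which can be sliced by the outer level, read by a full key pair, or flattened back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "continent": ["Americas", "Americas", "Americas", "Europe", "Europe"],
    "country": ["Chile", "Chile", "Peru", "Spain", "Spain"],
    "lifeExp": [70.0, 74.0, 66.0, 78.0, 80.0],
})

# Group first by continent, then by country within each continent.
nested = toy.groupby(["continent", "country"])[["lifeExp"]].mean()

# Select every country under one continent (outer index level).
americas = nested.loc["Americas"]

# Select a single (continent, country) pair.
chile = nested.loc[("Americas", "Chile"), "lifeExp"]
print(chile)  # 72.0

# Turn the index levels back into regular columns.
flat = nested.reset_index()
print(flat)
```

The order of the list passed to `groupby` sets the order of the index levels, so swapping the two names would nest continents inside countries instead.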

Select the rows from the `gapminder` DataFrame where the value in the `continent` column is either `Americas` or `Europe`. For this step, use the `.isin()` method to create a boolean mask. Assign this result to `ans5a`.

Use the `groupby` method on the `ans5a` DataFrame to group by the `continent` and `country` columns. Use double square bracket notation to select the `lifeExp` column. Then use the `mean()` method to compute the average value for each group. Assign this result to `ans5b`.

In [36]:
### GRADED

ans5a = None

# YOUR CODE HERE
# Step 1: Select rows where the continent is either Americas or Europe
ans5a = gapminder[gapminder['continent'].isin(['Americas', 'Europe'])]

# Step 2: Group by 'continent' and 'country'
ans5b = ans5a.groupby(['continent', 'country'])[['lifeExp']].mean()

# Answer check
print(type(ans5a))
print(type(ans5b))
ans5b.head()

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


| continent | country | lifeExp |
|---|---|---|
| Americas | Argentina | 69.060417 |
| Americas | Bolivia | 52.504583 |
| Americas | Brazil | 62.2395 |
| Americas | Canada | 74.90275 |
| Americas | Chile | 67.430917 |
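The `.isin()` filtering step used in Problem 5 can be sketched on its own. The toy data below is hypothetical; the example shows that the mask is a boolean Series and that it matches the longer form built from two equality tests joined with `|`.

```python
import pandas as pd

# Hypothetical toy data (not the gapminder dataset).
toy = pd.DataFrame({
    "continent": ["Africa", "Americas", "Asia", "Europe", "Americas"],
    "country": ["Kenya", "Chile", "Japan", "Spain", "Peru"],
})

# True where the continent is one of the listed values.
mask = toy["continent"].isin(["Americas", "Europe"])

# Indexing with the mask keeps only the rows marked True.
subset = toy[mask]
print(subset)

# Equivalent mask written with explicit comparisons.
manual = (toy["continent"] == "Americas") | (toy["continent"] == "Europe")
print(mask.equals(manual))  # True
```

The filtered rows keep their original index labels, so `subset` here has index values 1, 3, and 4.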
