### Codio Activity 3.7: Filtering

**Expected Time**: 60 Minutes

**Total Points**: 10

This activity focuses on the `filter` method available on pandas groupby objects.  `filter` takes a function that returns `True` or `False` for each group and returns the rows of every group for which the function is `True`.  The gapminder dataset from plotly continues as our example dataset.

#### Index:

- [Problem 1](#Problem-1:-Counting-the-Original-Group-Size)
- [Problem 2](#Problem-2:-Filtering-by-Population)
- [Problem 3](#Problem-3:-What-continents-have-average-population-over-20M?)
- [Problem 4](#Problem-4:-What-countries-have-an-average-life-expectancy-over-60?)
- [Problem 5](#Problem-5:-Determining-the-percent-by-of-countries-with-average-life-expectancy-over-60-by-continent.)

In [17]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [18]:
gapminder = px.data.gapminder()

In [19]:
gapminder.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


### Example Usage

The example below separates creating the pandas `groupby` object from applying the `filter` operation.  The result of grouping on the `continent` column is bound to the `groups` variable.  A filter is then applied to keep only the continents whose mean `lifeExp` is greater than 70.  The result is a DataFrame containing every original row from the qualifying continents; in this example only Europe and Oceania remain.

In [20]:
#there are 1704 rows to begin with
gapminder.shape

(1704, 8)

In [21]:
#create groupby object
groups = gapminder.groupby('continent')

In [22]:
#apply the filtering operation
filtered_lifeExp = groups.filter(lambda x: x['lifeExp'].mean() > 70)
filtered_lifeExp.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 12 | Albania | Europe | 1952 | 55.23 | 1282697 | 1601.056136 | ALB | 8 |
| 13 | Albania | Europe | 1957 | 59.28 | 1476505 | 1942.284244 | ALB | 8 |
| 14 | Albania | Europe | 1962 | 64.82 | 1728137 | 2312.888958 | ALB | 8 |
| 15 | Albania | Europe | 1967 | 66.22 | 1984060 | 2760.196931 | ALB | 8 |
| 16 | Albania | Europe | 1972 | 67.69 | 2263554 | 3313.422188 | ALB | 8 |


In [23]:
#only Europe and Oceania remain
filtered_lifeExp.continent.unique()

array(['Europe', 'Oceania'], dtype=object)
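
One way to see what `filter` does is to compare it with a boolean mask built from `transform`. The sketch below uses a small made-up DataFrame (the continent labels and values are hypothetical, not gapminder data) and assumes the same kind of condition as above, a group mean compared to a threshold.

```python
import pandas as pd

# Hypothetical toy data, not taken from gapminder
df = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y"],
    "lifeExp": [60.0, 70.0, 72.0, 80.0],
})

# filter: the lambda sees each group's rows and returns one bool per group
via_filter = df.groupby("continent").filter(lambda g: g["lifeExp"].mean() > 70)

# transform: broadcast each group's mean back onto its rows, then mask
group_mean = df.groupby("continent")["lifeExp"].transform("mean")
via_mask = df[group_mean > 70]

print(via_filter.equals(via_mask))
```

Both approaches keep whole groups and preserve the original row index. The mask version avoids a Python-level lambda, which may matter on larger data.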

[Back to top](#Index:) 


### Problem 1: Counting the Original Group Size

**2 Points**

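Apply the `size` method to the continent groups to count the rows in each continent.  Because each country appears once per year in the data, these counts are country-year observations rather than distinct countries.  Save your result as a Series to `ans1` below.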

In [24]:

### GRADED

#ans1 = gapminder.drop_duplicates(subset=["country"]).groupby('continent').size()
#ans1 = gapminder.groupby('country').first().reset_index().groupby('continent').size()
#display(ans1)
ans1 = groups.size()

# Answer check
print(ans1)
print(type(ans1))

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64
<class 'pandas.core.series.Series'>
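
The counts printed above are row counts: `size` counts every country-year row, so a country observed in several years contributes once per year to its continent. If distinct countries were wanted instead, `nunique` on the `country` column is one option. The sketch below illustrates the difference on a made-up panel (all names and values are hypothetical).

```python
import pandas as pd

# Hypothetical toy panel: two yearly rows per country
df = pd.DataFrame({
    "country":   ["A", "A", "B", "B", "C", "C"],
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "year":      [2000, 2005, 2000, 2005, 2000, 2005],
})

# size counts rows in each group
rows_per_continent = df.groupby("continent").size()

# nunique counts distinct country names in each group
countries_per_continent = df.groupby("continent")["country"].nunique()

print(rows_per_continent)
print(countries_per_continent)
```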


[Back to top](#Index:) 

### Problem 2: Filtering by Population

**2 Points**

Use the `filter` method to limit the data to countries with an average population greater than 20,000,000.  Assign the resulting DataFrame to `ans2` below.

In [25]:
### GRADED

ans2 = gapminder.groupby('country').filter(lambda x: x['pop'].mean() > 20e6)
#ansx = gapminder.groupby('country')[["pop"]].mean()
#ansx["pop_millions"] = ansx["pop"]/1e6
#ansx = ansx.query("pop_millions > 20")

# Answer check
#print(ansx)
print(ans2)
print(type(ans2))

        country continent  year  lifeExp       pop    gdpPercap iso_alpha  \
48    Argentina  Americas  1952   62.485  17876956  5911.315053       ARG   
49    Argentina  Americas  1957   64.399  19610538  6856.856212       ARG   
50    Argentina  Americas  1962   65.142  21283783  7133.166023       ARG   
51    Argentina  Americas  1967   65.634  22934225  8052.953021       ARG   
52    Argentina  Americas  1972   67.065  24779799  9443.038526       ARG   
...         ...       ...   ...      ...       ...          ...       ...   
1651    Vietnam      Asia  1987   62.820  62826491   820.799445       VNM   
1652    Vietnam      Asia  1992   67.662  69940728   989.023149       VNM   
1653    Vietnam      Asia  1997   70.672  76048996  1385.896769       VNM   
1654    Vietnam      Asia  2002   73.017  80908147  1764.456677       VNM   
1655    Vietnam      Asia  2007   74.249  85262356  2441.576404       VNM   

      iso_num  
48         32  
49         32  
50         32  
51         

[Back to top](#Index:) 


### Problem 3: What continents have average population over 20M?

**2 Points**

Use the result of the filtering operation (a DataFrame) to group by continent and get the size of each group.  Which continents had an average population greater than 20,000,000?  Assign your answer as a list of continent-name strings to `ans3` below.

In [26]:
### GRADED

ans3 = list(gapminder.groupby('continent').filter(lambda x: x['pop'].mean() > 20e6).continent.unique())

# Answer check
print(ans3)
print(type(ans3))

['Asia', 'Americas']
<class 'list'>
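
When only the group names are needed, an alternative to filtering rows is to aggregate first and index the resulting Series. The sketch below uses made-up data (labels, values, and the `threshold` name are hypothetical) and also shows one difference worth knowing: `unique()` on filtered rows returns names in order of appearance, while the aggregated index comes back sorted by group key.

```python
import pandas as pd

# Hypothetical toy data, not taken from gapminder
df = pd.DataFrame({
    "continent": ["Y", "Y", "X", "X", "Z"],
    "pop":       [30, 50, 40, 60, 5],
})
threshold = 20  # hypothetical cutoff for the toy data

# Approach 1: filter rows, then read off the surviving group labels
filtered = df.groupby("continent").filter(lambda g: g["pop"].mean() > threshold)
from_filter = list(filtered["continent"].unique())

# Approach 2: compute one mean per group, then keep labels above the cutoff
means = df.groupby("continent")["pop"].mean()
from_means = list(means[means > threshold].index)

print(from_filter)
print(from_means)
```

The two lists contain the same names, so comparing them as sets is safer than comparing them element by element.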


[Back to top](#Index:) 


### Problem 4: What countries have an average life expectancy over 60?

**2 Points**

Use the `filter` method to limit the data to countries whose average life expectancy is over the age of 60.  Assign your solution as a DataFrame to `ans4` below.  

In [27]:
### GRADED

#ans4 = groups.filter(lambda x: x['lifeExp'].mean() > 60)
ans4 = gapminder.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60)

# Answer check
print(ans4)
print(type(ans4))

                 country continent  year  lifeExp      pop    gdpPercap  \
12               Albania    Europe  1952   55.230  1282697  1601.056136   
13               Albania    Europe  1957   59.280  1476505  1942.284244   
14               Albania    Europe  1962   64.820  1728137  2312.888958   
15               Albania    Europe  1967   66.220  1984060  2760.196931   
16               Albania    Europe  1972   67.690  2263554  3313.422188   
...                  ...       ...   ...      ...      ...          ...   
1663  West Bank and Gaza      Asia  1987   67.046  1691210  5107.197384   
1664  West Bank and Gaza      Asia  1992   69.718  2104779  6017.654756   
1665  West Bank and Gaza      Asia  1997   71.096  2826046  7110.667619   
1666  West Bank and Gaza      Asia  2002   72.370  3389578  4515.487575   
1667  West Bank and Gaza      Asia  2007   73.422  4018332  3025.349798   

     iso_alpha  iso_num  
12         ALB        8  
13         ALB        8  
14         ALB       

[Back to top](#Index:) 


### Problem 5: Determining the percent by of countries with average life expectancy over 60 by continent.

**2 Points**

Determine the share of countries with average life expectancy over 60 as follows:

- Count the countries per continent in the original dataset using the `groupby` operation and save your result as `ans5a` below.
- Count the countries per continent in the filtered data (`ans4`) using the `groupby` operation and save your result as `ans5b` below.
- Divide `ans5b` by `ans5a` and assign your answer to `ans5c` below.  The result is a fraction between 0 and 1; multiply by 100 to express it as a percent.
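
The steps above can be sketched on a small made-up panel (all names and values below are hypothetical). The sketch computes the ratio two ways, from row counts with `size` and from distinct countries with `nunique`, and assumes every country has the same number of rows, in which case the two ratios agree.

```python
import pandas as pd

# Hypothetical toy panel: two rows per country
df = pd.DataFrame({
    "country":   ["A", "A", "B", "B", "C", "C"],
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "lifeExp":   [55.0, 59.0, 62.0, 66.0, 70.0, 74.0],
})

# Keep countries whose mean lifeExp is above 60 (drops country A)
filtered = df.groupby("country").filter(lambda g: g["lifeExp"].mean() > 60)

# Ratio of row counts per continent
row_share = filtered.groupby("continent").size() / df.groupby("continent").size()

# Ratio of distinct-country counts per continent
country_share = (
    filtered.groupby("continent")["country"].nunique()
    / df.groupby("continent")["country"].nunique()
)

print(row_share)
print(country_share)
```

In a sketch like this, a continent with no surviving rows would be absent from the numerator and show up as `NaN` after the division, since pandas aligns the two Series on their index.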

In [28]:
# rows per continent in the original data
ans5a = gapminder.groupby('continent').size()
# rows per continent among countries with mean lifeExp > 60 (reuses ans4)
ans5b = ans4.groupby('continent').size()
ans5c = ans5b / ans5a

# Answer check
print("\nans5a: # countries per continent")
print(ans5a)

print("\nans5b: # countries with average life expectancy over 60 per continent")
print(ans5b)

print("\nans5c: percent of countries with avg. exp > 60 per continent")
print(ans5c)


ans5a: # countries per continent
continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64

ans5b: # countries with average life expectancy over 60 per continent
continent
Africa       36
Americas    216
Asia        204
Europe      348
Oceania      24
dtype: int64

ans5c: percent of countries with avg. exp > 60 per continent
continent
Africa      0.057692
Americas    0.720000
Asia        0.515152
Europe      0.966667
Oceania     1.000000
dtype: float64
