### Codio Activity 3.7: Filtering

**Expected Time**: 60 Minutes

**Total Points**: 10

This activity focuses on the `filter` method available on pandas `groupby` objects.  `filter` takes a function that is called once per group and returns `True` or `False`; the result contains the rows of every group for which the function returned `True`.  The gapminder dataset from plotly remains our example.

#### Index:

- [Problem 1](#Problem-1:-Counting-the-Original-Group-Size)
- [Problem 2](#Problem-2:-Filtering-by-Population)
- [Problem 3](#Problem-3:-What-continents-have-average-population-over-20M?)
- [Problem 4](#Problem-4:-What-countries-have-an-average-life-expectancy-over-60?)
- [Problem 5](#Problem-5:-Determining-the-percent-by-of-countries-with-average-life-expectancy-over-60-by-continent.)

In [1]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
gapminder = px.data.gapminder()

In [3]:
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


### Example Usage

The example below separates the pandas `groupby` object from the `filter` operation.  The result of grouping on the `continent` column is bound to the `groups` variable below.  Then, a filter is applied to keep only the continents whose mean `lifeExp` is greater than 70.  The result is a DataFrame containing the original rows of the surviving groups; in this example only Europe and Oceania remain.
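
To make the mechanics concrete before running it on gapminder, here is a minimal sketch on a toy DataFrame. The data and the names `df` and `kept` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from gapminder)
df = pd.DataFrame({
    "continent": ["A", "A", "B", "B", "C"],
    "lifeExp": [72.0, 74.0, 60.0, 65.0, 80.0],
})

# filter calls the function once per group; each call must return True or False.
# Rows of groups that return True are kept, with their original index labels.
kept = df.groupby("continent").filter(lambda g: g["lifeExp"].mean() > 70)

print(kept)
```

Group `B` has a mean of 62.5, so both of its rows are dropped, while every row of `A` and `C` is returned unchanged.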

In [5]:
#there are 1704 rows to begin with
gapminder.shape
#gapminder.info()

(1704, 8)

In [34]:
#create groupby object
groups = gapminder.groupby('continent')
#display the data sorted by year (this does not modify gapminder or groups)
gapminder.sort_values("year")

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
528,France,Europe,1952,67.410,42459667,7029.809327,FRA,250
540,Gabon,Africa,1952,37.003,420702,4293.476475,GAB,266
1656,West Bank and Gaza,Asia,1952,43.160,1030585,1515.592329,PSE,275
552,Gambia,Africa,1952,30.000,284320,485.230659,GMB,270
...,...,...,...,...,...,...,...,...
1127,Niger,Africa,2007,56.867,12894865,619.676892,NER,562
1139,Nigeria,Africa,2007,46.859,135031164,2013.977305,NGA,566
1151,Norway,Europe,2007,80.196,4627926,49357.190170,NOR,578
1175,Pakistan,Asia,2007,65.483,169270617,2605.947580,PAK,586


In [9]:
def mle_gt_70(s):
    #s is the sub-DataFrame for one group; return True to keep that group
    mean_life_exp = s["lifeExp"].mean()
    return mean_life_exp > 70

In [14]:
#apply the filtering operation
#filtered_lifeExp = groups.filter(lambda x: x['lifeExp'].mean() > 70)
filtered_lifeExp = groups.filter(mle_gt_70)
filtered_lifeExp.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
12,Albania,Europe,1952,55.23,1282697,1601.056136,ALB,8
13,Albania,Europe,1957,59.28,1476505,1942.284244,ALB,8
14,Albania,Europe,1962,64.82,1728137,2312.888958,ALB,8
15,Albania,Europe,1967,66.22,1984060,2760.196931,ALB,8
16,Albania,Europe,1972,67.69,2263554,3313.422188,ALB,8


In [13]:
#only Europe and Oceania remain
filtered_lifeExp.continent.unique()

array(['Europe', 'Oceania'], dtype=object)
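
The predicate `mle_gt_70` hard-codes both the column and the threshold. As a sketch of a more reusable variant, the hypothetical helper `mean_above` below takes them as parameters; it relies on `filter` forwarding extra keyword arguments to the function, as described in the pandas documentation. The toy data is again illustrative only.

```python
import pandas as pd

def mean_above(group, column, threshold):
    """Return True when the group's mean of `column` exceeds `threshold`."""
    return group[column].mean() > threshold

# Toy data (hypothetical values, not taken from gapminder)
toy = pd.DataFrame({
    "continent": ["A", "A", "B", "B", "C"],
    "lifeExp": [72.0, 74.0, 60.0, 65.0, 80.0],
})
toy_groups = toy.groupby("continent")

# Extra keyword arguments are passed through to mean_above for each group
kept_70 = toy_groups.filter(mean_above, column="lifeExp", threshold=70)
kept_60 = toy_groups.filter(mean_above, column="lifeExp", threshold=60)

print(kept_70["continent"].unique())
print(kept_60["continent"].unique())
```

Reusing one groupby object with different thresholds avoids writing a separate function for each cutoff.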

[Back to top](#Index:) 


### Problem 1: Counting the Original Group Size

**2 Points**

Use the `groupby` method and the `size` method to count the rows in each continent.  Each row is one country in one survey year, so these counts are country-year observations rather than distinct countries.  Save your result as a Series to `ans1` below.
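
The difference between counting rows and counting distinct countries can be sketched on a toy frame. The data and variable names below are hypothetical; `nunique` is shown only as a contrast and is not what the graded answer expects.

```python
import pandas as pd

# Toy data: two countries in continent A and one in B, each observed in two years
toy = pd.DataFrame({
    "continent": ["A", "A", "A", "A", "B", "B"],
    "country": ["x", "x", "y", "y", "z", "z"],
    "year": [1952, 1957, 1952, 1957, 1952, 1957],
})

# size() counts rows per group
rows_per_continent = toy.groupby("continent").size()

# nunique() counts distinct countries per group
countries_per_continent = toy.groupby("continent")["country"].nunique()

print(rows_per_continent)
print(countries_per_continent)
```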

In [8]:
### GRADED

ans1 = None

### BEGIN SOLUTION
ans1 = gapminder.groupby('continent').size()
### END SOLUTION

# Answer check
print(ans1)
print(type(ans1))

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64
<class 'pandas.core.series.Series'>


In [9]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans1_ = gapminder_.groupby('continent').size()
#
#
#
assert type(ans1_) == type(ans1)
pd.testing.assert_series_equal(ans1, ans1_)
### END HIDDEN TESTS

[Back to top](#Index:) 


### Problem 2: Filtering by Population

**2 Points**

Now, we use the `filter` method to limit the data to countries with average population greater than 20,000,000.  Assign the resulting DataFrame to `ans2` below.
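
One way to sanity-check a `filter` result is to rebuild it with a boolean mask from `transform('mean')`, which broadcasts each group's mean back onto its rows. The following is a sketch on hypothetical toy data; `toy`, `threshold`, `via_filter`, and `via_mask` are illustrative names.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from gapminder)
toy = pd.DataFrame({
    "country": ["x", "x", "y", "y"],
    "pop": [10, 30, 1, 3],
})
threshold = 15  # hypothetical cutoff for the toy data

# Approach 1: groupby + filter with a per-group predicate
via_filter = toy.groupby("country").filter(lambda g: g["pop"].mean() > threshold)

# Approach 2: compute each row's group mean, then index with a boolean mask
group_mean = toy.groupby("country")["pop"].transform("mean")
via_mask = toy[group_mean > threshold]

print(via_filter.equals(via_mask))
```

Both approaches keep the same rows here; the mask version avoids calling a Python function per group, which may matter on large data.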

In [27]:
### GRADED

ans2 = None

### BEGIN SOLUTION
ans2 = gapminder.groupby('country').filter(lambda x: x['pop'].mean() > 20_000_000)
### END SOLUTION
# Answer check: list the countries that remain, without overwriting ans2
ans2_countries = ans2.groupby('country').size().index.tolist()
print(ans2_countries)
#print(type(ans2))

['Argentina', 'Bangladesh', 'Brazil', 'Canada', 'China', 'Colombia', 'Congo, Dem. Rep.', 'Egypt', 'Ethiopia', 'France', 'Germany', 'India', 'Indonesia', 'Iran', 'Italy', 'Japan', 'Korea, Rep.', 'Mexico', 'Morocco', 'Myanmar', 'Nigeria', 'Pakistan', 'Philippines', 'Poland', 'Romania', 'South Africa', 'Spain', 'Sudan', 'Tanzania', 'Thailand', 'Turkey', 'United Kingdom', 'United States', 'Vietnam']


In [11]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans2_ = gapminder_.groupby('country').filter(lambda x: x['pop'].mean() > 20_000_000)
#
#
#
assert type(ans2_) == type(ans2)
pd.testing.assert_frame_equal(ans2, ans2_)
### END HIDDEN TESTS

[Back to top](#Index:) 


### Problem 3: What continents have average population over 20M?

**2 Points**

Group the data by `continent` and use `filter` to keep the continents whose average population is greater than 20,000,000.  Then group the resulting DataFrame by continent and take the size of each group to see which continents remain.  Assign your answer as a list of strings of continent names to `ans3` below.

In [30]:
### GRADED

ans3 = None

### BEGIN SOLUTION
ans3 = gapminder.groupby('continent').filter(lambda x: x['pop'].mean() > 20_000_000)
ans3 = ans3.groupby('continent').size().index.tolist()
### END SOLUTION

# Answer check
print(ans3)
print(type(ans3))

['Americas', 'Asia']
<class 'list'>


In [13]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans3_ = gapminder_.groupby('continent').filter(lambda x: x['pop'].mean() > 20_000_000).groupby('continent').size().index.tolist()
#
#
#
assert type(ans3_) == type(ans3)
assert set(ans3) == set(ans3_)
### END HIDDEN TESTS

[Back to top](#Index:) 


### Problem 4: What countries have an average life expectancy over 60?

**2 Points**

Use the `filter` method to limit the data to countries whose average life expectancy is over the age of 60.  Assign your solution as a DataFrame to `ans4` below.  

In [31]:
### GRADED

ans4 = None

### BEGIN SOLUTION
ans4 = gapminder.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60)
### END SOLUTION

# Answer check
print(ans4)
print(type(ans4))

                 country continent  year  lifeExp      pop    gdpPercap  \
12               Albania    Europe  1952   55.230  1282697  1601.056136   
13               Albania    Europe  1957   59.280  1476505  1942.284244   
14               Albania    Europe  1962   64.820  1728137  2312.888958   
15               Albania    Europe  1967   66.220  1984060  2760.196931   
16               Albania    Europe  1972   67.690  2263554  3313.422188   
...                  ...       ...   ...      ...      ...          ...   
1663  West Bank and Gaza      Asia  1987   67.046  1691210  5107.197384   
1664  West Bank and Gaza      Asia  1992   69.718  2104779  6017.654756   
1665  West Bank and Gaza      Asia  1997   71.096  2826046  7110.667619   
1666  West Bank and Gaza      Asia  2002   72.370  3389578  4515.487575   
1667  West Bank and Gaza      Asia  2007   73.422  4018332  3025.349798   

     iso_alpha  iso_num  
12         ALB        8  
13         ALB        8  
14         ALB       

In [15]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans4_ = gapminder_.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60)
#
#
#
assert type(ans4_) == type(ans4)
pd.testing.assert_frame_equal(ans4, ans4_)
### END HIDDEN TESTS

[Back to top](#Index:) 


### Problem 5: Determining the percent by of countries with average life expectancy over 60 by continent.

**2 Points**

Determine the share of countries with average life expectancy over 60 as follows:

- Count the rows per continent in the original dataset using the `groupby` operation and save your result as `ans5a` below.
- Count the rows per continent in the filtered data (`ans4`) using the `groupby` operation and save your result as `ans5b` below.
- Divide `ans5b` by `ans5a` and assign your answer to `ans5c` below.  Every country contributes the same number of rows, so this ratio equals the proportion of countries; it is a value between 0 and 1, and multiplying by 100 gives the percent.
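
Dividing two Series aligns them on their index labels, which matters when a group disappears entirely after filtering. The sketch below uses hypothetical counts (`total`, `passing`) to show the resulting `NaN` and one possible way to turn it into 0 with `reindex`; the gapminder answer above does not need this because every continent keeps at least one country.

```python
import pandas as pd

# Hypothetical row counts per group before and after filtering
total = pd.Series({"A": 4, "B": 2, "C": 6})
passing = pd.Series({"A": 2, "C": 6})  # group B was filtered out completely

# Division aligns on index labels; B is missing from `passing`, so it becomes NaN
share = passing / total

# Optional: treat a missing group as a count of zero before dividing
share_filled = passing.reindex(total.index, fill_value=0) / total

print(share)
print(share_filled)
```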

In [32]:
### GRADED

ans5a = None
ans5b = None
ans5c = None

### BEGIN SOLUTION
ans5a = gapminder.groupby('continent')[['country']].size()
ans5b = gapminder.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60).groupby('continent')[['country']].size()
ans5c = ans5b/ans5a
### END SOLUTION

# Answer check
print(ans5c)
print(type(ans5c))

continent
Africa      0.057692
Americas    0.720000
Asia        0.515152
Europe      0.966667
Oceania     1.000000
dtype: float64
<class 'pandas.core.series.Series'>


In [17]:
### BEGIN HIDDEN TESTS
gapminder_ = px.data.gapminder()
ans5a_ = gapminder_.groupby('continent')[['country']].size()
ans5b_ = gapminder_.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60).groupby('continent')[['country']].size()
ans5c_ = ans5b_/ans5a_



pd.testing.assert_series_equal(ans5a, ans5a_)
pd.testing.assert_series_equal(ans5b, ans5b_)
pd.testing.assert_series_equal(ans5c, ans5c_)
### END HIDDEN TESTS