<a href="https://colab.research.google.com/github/jaidatta71/Chatbot/blob/main/self_study_codlab_activity_3_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Self-Study Colab Activity 3.3: Filtering

**Expected Time**: 60 Minutes



This activity focuses on the `filter` method of pandas groupby objects. `filter` takes a function that is called once per group and returns `True` or `False`; the result is a DataFrame containing only the rows of the groups for which the function returned `True`. Gapminder from plotly continues as our example dataset.
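
As a minimal sketch of this behavior, the snippet below applies `filter` to a small made-up DataFrame (the `toy` frame and its values are hypothetical, not taken from Gapminder). Every row of a group that passes the test is kept, with its original index.

```python
import pandas as pd

# Hypothetical toy data: two groups, "a" and "b"
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 3, 10, 20],
})

# The lambda receives each group as a DataFrame and must return one boolean.
# Group "a" has mean 2 and is dropped; group "b" has mean 15 and is kept.
kept = toy.groupby("group").filter(lambda g: g["value"].mean() > 5)
print(kept)
```

Note that the rows themselves are not aggregated: `kept` still holds both rows of group `b`.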

#### Index:

- [Problem 1](#Problem-1:-Counting-the-Original-Group-Size)
- [Problem 2](#Problem-2:-Filtering-by-Population)
- [Problem 3](#Problem-3:-What-continents-have-average-population-over-20M?)
- [Problem 4](#Problem-4:-What-countries-have-an-average-life-expectancy-over-60?)
- [Problem 5](#Problem-5:-Determining-the-percent-by-of-countries-with-average-life-expectancy-over-60-by-continent.)

In [3]:
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [4]:
gapminder = px.data.gapminder()

In [5]:
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,AFG,4
1,Afghanistan,Asia,1957,30.332,9240934,820.85303,AFG,4
2,Afghanistan,Asia,1962,31.997,10267083,853.10071,AFG,4
3,Afghanistan,Asia,1967,34.02,11537966,836.197138,AFG,4
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,AFG,4


### Example Usage

The example below separates the pandas `groupby` object from the `filter` operation.  The result of grouping on the `continent` column is bound to the `groups` variable below.  Then, a filter is applied to limit the continents to those with a mean life expectancy greater than 70.  The result is a DataFrame, and in this example, only Europe and Oceania remain.

In [6]:
#there are 1704 rows to begin with
gapminder.shape

(1704, 8)

In [7]:
#create groupby object
groups = gapminder.groupby('continent')

In [8]:
#apply the filtering operation
filtered_lifeExp = groups.filter(lambda x: x['lifeExp'].mean() > 70)
filtered_lifeExp.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
12,Albania,Europe,1952,55.23,1282697,1601.056136,ALB,8
13,Albania,Europe,1957,59.28,1476505,1942.284244,ALB,8
14,Albania,Europe,1962,64.82,1728137,2312.888958,ALB,8
15,Albania,Europe,1967,66.22,1984060,2760.196931,ALB,8
16,Albania,Europe,1972,67.69,2263554,3313.422188,ALB,8


In [9]:
#only Europe and Oceania remain
filtered_lifeExp.continent.unique()

array(['Europe', 'Oceania'], dtype=object)

[Back to top](#Index:)


### Problem 1: Counting the Original Group Size



Use the `groupby` method on the `gapminder` DataFrame to group by the `continent` column. Next, use the `size()` method on these groups to count the rows (one per country per year) in each continent. Save your result as a series to `ans1` below.

In [10]:


ans1 = gapminder.groupby('continent').size()



# Answer check
print(ans1)
print(type(ans1))

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64
<class 'pandas.core.series.Series'>
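
The numbers above are row counts, since `size()` counts every row in a group and each country appears once per year. If the number of distinct countries is wanted instead, `nunique()` on the `country` column is one option. The sketch below contrasts the two on a hypothetical toy frame (the names and values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: each country appears once per year
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "country": ["X", "X", "Y", "Y", "Z", "Z"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
})

# size() counts rows per group
rows_per_continent = toy.groupby("continent").size()

# nunique() counts distinct countries per group
countries_per_continent = toy.groupby("continent")["country"].nunique()

print(rows_per_continent)
print(countries_per_continent)
```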


[Back to top](#Index:)

### Problem 2: Filtering by Population



Use the `groupby` method on the `gapminder` DataFrame to group by the `country` column.  Next, use the `filter` method to limit the data to countries with an average population greater than 20,000,000.  Assign the resulting DataFrame to `ans2` below.

In [11]:
def pop_grt_20m(s):
    # True when the group's average population exceeds 20 million
    return s["pop"].mean() > 20000000

# The function must be defined before it is passed to filter
ans2 = gapminder.groupby('country').filter(pop_grt_20m)
# Equivalent: gapminder.groupby('country').filter(lambda x: x['pop'].mean() > 20000000)


# Answer check
print(ans2)
print(type(ans2))

[Back to top](#Index:)


### Problem 3: What continents have average population over 20M?



Use the `groupby` method on the `gapminder` DataFrame to group by the `continent` column. Next, use the `filter` method to limit the data to continents with an average population greater than 20,000,000. Then apply `groupby` again with `continent` as the argument, and chain `size().index.tolist()` to get a list of continent names.

Assign your answer to `ans3` below.

In [12]:
def pop_grt_20m(s):
    return s["pop"].mean() > 20000000

ans3 = gapminder.groupby('continent').filter(pop_grt_20m).groupby('continent').size().index.tolist()

# Answer check
print(ans3)
print(type(ans3))

['Americas', 'Asia']
<class 'list'>
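
When only the group names are needed, an alternative to the filter-then-regroup chain is to aggregate first and apply a boolean mask to the result. The sketch below compares both routes on a hypothetical toy frame (continent labels `A` and `B` and the population values are made up); it is an illustration, not the activity's required solution.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "continent": ["A", "A", "B", "B"],
    "pop": [30_000_000, 50_000_000, 1_000_000, 2_000_000],
})

def pop_grt_20m(s):
    # True when the group's average population exceeds 20 million
    return s["pop"].mean() > 20_000_000

# Route 1: filter the rows, regroup, and read the group names
via_filter = (
    toy.groupby("continent")
    .filter(pop_grt_20m)
    .groupby("continent")
    .size()
    .index.tolist()
)

# Route 2: compute one mean per continent, then mask the means
means = toy.groupby("continent")["pop"].mean()
via_mean = means[means > 20_000_000].index.tolist()

print(via_filter, via_mean)
```

Route 1 keeps the full rows available for further work; route 2 is shorter when the names are all that is required.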


[Back to top](#Index:)


### Problem 4: What countries have an average life expectancy over 60?



Use the `groupby` method on the `gapminder` DataFrame to group by the `country` column. Next, use the `filter` method to limit the data to countries with average life expectancy greater than 60.

Assign your solution as a DataFrame to `ans4` below.  

In [13]:
def life_60(s):
    # True when the group's average life expectancy exceeds 60
    return s["lifeExp"].mean() > 60


ans4 = gapminder.groupby('country').filter(life_60)
# Equivalent: gapminder.groupby('country').filter(lambda x: x['lifeExp'].mean() > 60)



# Answer check
print(ans4)
print(type(ans4))

                 country continent  year  lifeExp      pop    gdpPercap  \
12               Albania    Europe  1952   55.230  1282697  1601.056136   
13               Albania    Europe  1957   59.280  1476505  1942.284244   
14               Albania    Europe  1962   64.820  1728137  2312.888958   
15               Albania    Europe  1967   66.220  1984060  2760.196931   
16               Albania    Europe  1972   67.690  2263554  3313.422188   
...                  ...       ...   ...      ...      ...          ...   
1663  West Bank and Gaza      Asia  1987   67.046  1691210  5107.197384   
1664  West Bank and Gaza      Asia  1992   69.718  2104779  6017.654756   
1665  West Bank and Gaza      Asia  1997   71.096  2826046  7110.667619   
1666  West Bank and Gaza      Asia  2002   72.370  3389578  4515.487575   
1667  West Bank and Gaza      Asia  2007   73.422  4018332  3025.349798   

     iso_alpha  iso_num  
12         ALB        8  
13         ALB        8  
14         ALB       

[Back to top](#Index:)


### Problem 5: Determining the percent by of countries with average life expectancy over 60 by continent.



Determine the percent of countries with life expectancy over 60 as follows:

- Use the `groupby` method on the `gapminder` DataFrame to group by the `continent` column. Use double square bracket notation to select the `country` column from the groups. Next, use the `size` method to count the rows in each continent. Assign your result to `ans5a` below.


- Use the `groupby` method on the `gapminder` DataFrame to group by the `country` column. Next, use the `filter` method to select the countries for which the average life expectancy is greater than 60. Chain `.groupby('continent')[['country']].size()` to count the remaining rows in each continent. Save your result as `ans5b` below.


- Divide `ans5b` by `ans5a` to get the share of countries in each continent, expressed as a fraction between 0 and 1, and assign your answer to `ans5c` below. Because every country contributes the same number of rows, the ratio of row counts equals the ratio of country counts.

In [31]:

ans5a = gapminder.groupby('continent')[['country']].size()
ans5b = gapminder.groupby('country').filter(life_60).groupby('continent')[['country']].size()
ans5c = ans5b / ans5a

# Answer check
print(ans5c)
print(type(ans5c))

continent
Africa      0.057692
Americas    0.720000
Asia        0.515152
Europe      0.966667
Oceania     1.000000
dtype: float64
<class 'pandas.core.series.Series'>
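
The values above are fractions, so multiplying by 100 gives percentages. A caveat with plain division is that a continent with no country passing the filter would be missing from the numerator and produce `NaN`. The sketch below, on a hypothetical toy frame with made-up values, counts distinct countries with `nunique()` and uses `Series.div` with `fill_value=0` to guard against that case; it is one possible variant, not the activity's required solution.

```python
import pandas as pd

# Hypothetical toy data: continent A has countries X and Y, continent B has Z
toy = pd.DataFrame({
    "continent": ["A", "A", "A", "A", "B", "B"],
    "country": ["X", "X", "Y", "Y", "Z", "Z"],
    "lifeExp": [65.0, 70.0, 50.0, 55.0, 61.0, 63.0],
})

def life_60(s):
    # True when the group's average life expectancy exceeds 60
    return s["lifeExp"].mean() > 60

# Distinct countries per continent
total = toy.groupby("continent")["country"].nunique()

# Distinct countries per continent among those passing the filter
over_60 = (
    toy.groupby("country")
    .filter(life_60)
    .groupby("continent")["country"]
    .nunique()
)

# fill_value=0 treats a continent absent from over_60 as zero countries
share = over_60.div(total, fill_value=0)
percent = (share * 100).round(1)
print(percent)
```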
