# Filtering, grouping and binning

## Import pandas

In [1]:
import pandas as pd

## Import data

In [2]:
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/height_clean_cols.csv"
df = pd.read_csv(URL)

## Filter 

Use the values of a single column to filter rows (boolean indexing):

In [3]:
df[df["height"] >= 167]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
17,Tom,18,167,166.2,male,42,1.67,63.91,22.92,2022-10-08
18,Steven,19,167,167.3,male,42,1.67,75.71,27.15,2022-10-08
19,Emanuel,20,168,168.5,male,42,1.68,79.22,28.07,2022-10-08
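
To see what happens inside the square brackets, it can help to look at the comparison on its own. The following is a minimal sketch on a small made-up DataFrame called `toy` (hypothetical values, not the course dataset): the comparison yields a boolean Series, and indexing with that Series keeps the rows marked `True`.

```python
import pandas as pd

# Toy data invented for illustration (not the course dataset)
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "height": [160, 167, 170, 165],
    "weight": [55.0, 80.0, 72.5, 78.0],
})

# The comparison returns a boolean Series with one True/False per row
mask = toy["height"] >= 167

# Indexing with the mask keeps only the rows where it is True
tall = toy[mask]

# .loc accepts the same mask and can select columns at the same time
tall_weights = toy.loc[mask, ["name", "weight"]]
```

Here `mask` is `False, True, True, False`, so `tall` contains the rows for Ben and Cara.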


Combine conditions on two or more columns with `&` (and). Wrap each condition in its own parentheses:

In [4]:
df[(df["height"] >= 167) & (df["weight"] < 74) ]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
17,Tom,18,167,166.2,male,42,1.67,63.91,22.92,2022-10-08


You can also combine conditions on two or more columns with `|` (or):

In [5]:
df[(df["height"] >= 167) | (df["weight"] < 74) ]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
1,Peter,2,163,163.5,male,42,1.63,70.57,26.56,2022-10-08
5,Sophia,6,164,164.4,female,42,1.64,58.06,21.59,2022-10-08
11,Mila,12,165,167.4,female,42,1.65,68.03,24.99,2022-10-08
12,Fin,13,165,165.5,male,42,1.65,68.01,24.98,2022-10-08
15,Marc,16,166,166.5,male,42,1.66,63.15,22.92,2022-10-08
16,Ralph,17,166,166.6,male,42,1.66,62.02,22.51,2022-10-08
17,Tom,18,167,166.2,male,42,1.67,63.91,22.92,2022-10-08
18,Steven,19,167,167.3,male,42,1.67,75.71,27.15,2022-10-08
19,Emanuel,20,168,168.5,male,42,1.68,79.22,28.07,2022-10-08
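
Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators such as `>=` in Python. One way to sidestep this, sketched below on a made-up DataFrame `toy` (hypothetical values, not the course dataset), is to store each condition in a named mask first. The sketch also shows `~`, which negates a mask.

```python
import pandas as pd

# Toy data invented for illustration (not the course dataset)
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "height": [160, 167, 170, 165],
    "weight": [55.0, 80.0, 72.5, 78.0],
})

# Named masks: no extra parentheses needed when combining them
tall = toy["height"] >= 167
light = toy["weight"] < 74

both = toy[tall & light]         # rows meeting both conditions
either = toy[tall | light]       # rows meeting at least one condition
neither = toy[~(tall | light)]   # ~ negates the combined mask
```

With these values, `both` holds Cara, `either` holds Ann, Ben and Cara, and `neither` holds Dan.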


- Filter people with a weight greater than 84 and save the result as `df_weight_greater_84`

In [6]:
### BEGIN SOLUTION
df_weight_greater_84 = df[df["weight"] > 84]
### END SOLUTION

In [7]:
"""Check if your code returns the correct output"""
assert len(df_weight_greater_84) == 2
assert df_weight_greater_84.iloc[0, 0] == "Stefanie"
assert df_weight_greater_84.iloc[1, 0] == "Eric"

Use the [isin()](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html#pandas.Series.isin) method to match several values within one column instead of chaining multiple `|` conditions:

In [8]:
df[df["name"].isin(["Tom", "Steven"])]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
17,Tom,18,167,166.2,male,42,1.67,63.91,22.92,2022-10-08
18,Steven,19,167,167.3,male,42,1.67,75.71,27.15,2022-10-08
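
As a rough illustration of what `isin()` replaces, the sketch below compares it with the equivalent chain of `|` conditions on a made-up DataFrame `toy` (hypothetical values, not the course dataset). It also shows that `~` in front of `isin()` selects all rows that are not in the list.

```python
import pandas as pd

# Toy data invented for illustration (not the course dataset)
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "height": [160, 167, 170, 165],
})

names = ["Ben", "Dan"]

# isin() checks membership in the list for every row
with_isin = toy[toy["name"].isin(names)]

# The same filter written as chained "or" conditions
with_or = toy[(toy["name"] == "Ben") | (toy["name"] == "Dan")]

# Negating the mask keeps everyone who is NOT in the list
others = toy[~toy["name"].isin(names)]
```

Both `with_isin` and `with_or` contain the rows for Ben and Dan, while `others` contains Ann and Cara.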


## Grouping

By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria

- **Applying** a function to each group independently

- **Combining** the results into a data structure

In the following example, we group the data by `gender` and apply the `mean()` function to each group. We then round the results to two decimals and transpose the table.

In [9]:
df.groupby("gender").mean(numeric_only=True).round(2).T

gender,female,male
id,7.82,13.78
height,164.36,165.78
average_height_parents,164.86,165.94
number,42.0,42.0
height_m,1.64,1.66
weight,75.34,72.66
bmi,27.9,26.45
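
The three steps (split, apply, combine) can also be written out by hand. The following is a minimal sketch on a made-up DataFrame `toy` (hypothetical values, not the course dataset); it is meant to illustrate the idea, not how pandas implements `groupby` internally.

```python
import pandas as pd

# Toy data invented for illustration (not the course dataset)
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "height": [160, 167, 170, 165],
    "weight": [55.0, 80.0, 72.5, 78.0],
})

# Split: iterating over a groupby yields (group key, sub-DataFrame) pairs
# Apply: compute the column means within each sub-DataFrame
pieces = {
    key: group[["height", "weight"]].mean()
    for key, group in toy.groupby("gender")
}

# Combine: assemble the per-group results into one DataFrame
manual = pd.DataFrame(pieces).T
manual.index.name = "gender"

# The one-liner performs the same three steps in a single call
direct = toy.groupby("gender").mean(numeric_only=True)
```

Both `manual` and `direct` have one row per gender and one column per numeric variable.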


## Segment data into bins

Use the function [cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. 

In our example, we create a body mass index category. The standard weight status categories associated with BMI ranges for adults are shown in the following table:

BMI	| Weight Status
---| ---
Below 18.5 |	Underweight
18.5 - 24.9 |	Normal or Healthy Weight
25.0 - 29.9 |	Overweight
30.0 and Above |	Obese

Source: [U.S. Department of Health & Human Services](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html)

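In the following code, we discretize the variable `bmi` into four bins according to the table above:

- The bins `[0, 18.5, 25, 30, float('inf')]` define the intervals (0, 18.5], (18.5, 25], (25, 30] and (30, inf]. By default, `pd.cut` includes the right edge of each interval, so a BMI of exactly 25 is labeled `normal`. Pass `right=False` if the boundary values should fall into the higher category, as in the table.
- `float('inf')` represents positive infinity and serves as an open-ended upper edge for the last bin.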

In [10]:
df['bmi_category'] = pd.cut(df['bmi'], 
                            bins=[0, 18.5, 25, 30, float('inf')], 
                            labels=['underweight', 'normal', 'overweight', "obese"])

In [11]:
df['bmi_category'].head(7)

0         obese
1    overweight
2    overweight
3    overweight
4         obese
5        normal
6    overweight
Name: bmi_category, dtype: category
Categories (4, object): ['underweight' < 'normal' < 'overweight' < 'obese']
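
The handling of values that sit exactly on a bin edge depends on the `right` parameter of `pd.cut`. The sketch below uses a handful of made-up BMI values (chosen to hit the edges, not taken from the course dataset) to compare the default `right=True` with `right=False`, which follows the table's convention that 25.0 already counts as overweight.

```python
import pandas as pd

# Made-up BMI values that sit on or next to the bin edges
bmi = pd.Series([18.4, 18.5, 24.9, 25.0, 30.0])
edges = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# Default: intervals are closed on the right, e.g. (18.5, 25]
right_closed = pd.cut(bmi, bins=edges, labels=labels)

# right=False: intervals are closed on the left, e.g. [18.5, 25)
left_closed = pd.cut(bmi, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"bmi": bmi, "right=True": right_closed, "right=False": left_closed}))
```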

Example of how to discretize into four equal-width bins (each bin spans the same range of BMI values, not the same number of rows):

In [13]:
df['bmi_category_2'] = pd.cut(df['bmi'], 
                            bins=4, 
                            labels=['group1', 'group2', 'group3', "group4"])

In [14]:
df['bmi_category_2'].head(7)

0    group4
1    group2
2    group3
3    group3
4    group4
5    group1
6    group3
Name: bmi_category_2, dtype: category
Categories (4, object): ['group1' < 'group2' < 'group3' < 'group4']
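
With an integer `bins`, `pd.cut` splits the range of the values into intervals of equal width, which does not guarantee a similar number of rows per bin. The sketch below, on made-up numbers with one outlier, returns the computed edges via `retbins=True` and contrasts the result with `pd.qcut`, a related pandas function (not used elsewhere in this notebook) that bins by quantiles instead.

```python
import pandas as pd

# Made-up values with one outlier (not the course dataset)
values = pd.Series([10, 11, 12, 13, 14, 50])

# Equal-width bins: the range 10..50 is split into 4 intervals of width 10
by_width, width_edges = pd.cut(values, bins=4, labels=False, retbins=True)

# Quantile-based bins: each bin receives roughly the same number of rows
by_count = pd.qcut(values, q=3, labels=False)

print(width_edges)                      # the edges pd.cut computed
print(by_width.value_counts().sort_index())
print(by_count.value_counts().sort_index())
```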

Example of how to discretize into four equal-width bins if you don't need labels. With `labels=False`, `pd.cut` returns the integer position of each bin, starting at 0:

In [15]:
df['bmi_category_3'] = pd.cut(df['bmi'], 
                            bins=4, 
                            labels=False)

In [16]:
df['bmi_category_3'].head(7)

0    3
1    1
2    2
3    2
4    3
5    0
6    2
Name: bmi_category_3, dtype: int64

- Use the variable `height` to create a new variable called `height_category` with the following three bins and labels:

  - 0 to 165 (label it `group1`)
  - 166 to 167 (label it `group2`)
  - 168 and taller (label it `group3`)

In [17]:
### BEGIN SOLUTION
df['height_category'] = pd.cut(df['height'], 
                            bins=[0, 165, 167, float('inf')], 
                            labels=['group1', 'group2', 'group3'])
### END SOLUTION

In [18]:
"""Check if your code returns the correct output"""
assert df['height_category'].value_counts().group1 == 13
assert df['height_category'].value_counts().group2 == 6
assert df['height_category'].value_counts().group3 == 1
