# Filtering, grouping and binning 

## Import pandas

In [1]:
import pandas as pd

## Import data

In [2]:
# URL of data
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/height_clean_cols.csv"

In [3]:
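# Load the CSV file into a DataFrame
df = pd.read_csv(URL)

# Store gender as a categorical variable and the id as a string
df["gender"] = df["gender"].astype("category")
df["id"] = df["id"].astype(str)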

## Filter (boolean indexing)

Use the values of a single column to select rows. Here the condition matches no rows in the data, so only the column header is returned:

In [35]:
df[df["height"] > 180]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
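
Several conditions can be combined in one boolean mask. The following is a minimal sketch on a small, made-up DataFrame (`toy` and its values are hypothetical and not part of the height dataset). It uses `&` for element-wise AND and `|` for element-wise OR; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cleo", "Dan"],
    "height": [158, 172, 181, 165],
    "gender": ["female", "male", "female", "male"],
})

# AND: rows where both conditions hold
tall_women = toy[(toy["height"] > 160) & (toy["gender"] == "female")]

# OR: rows where at least one condition holds
short_or_tall = toy[(toy["height"] < 160) | (toy["height"] > 180)]

print(tall_women)
print(short_or_tall)
```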


Using the [isin()](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html#pandas.Series.isin) method for filtering:

In [36]:
df[df["name"].isin(["Tom", "Lisa"])]

Unnamed: 0,name,id,height,average_height_parents,gender,number,height_m,weight,bmi,date
17,Tom,18,167,166.2,male,42,1.67,63.91,22.92,2022-10-08
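
Note that `isin()` does not raise an error when a listed value is absent from the column; it simply matches nothing, which is why only Tom appears above. The sketch below, again on hypothetical toy data, shows this behaviour and how `~` inverts the mask to exclude the listed values.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cleo", "Dan"],
    "height": [158, 172, 181, 165],
})

# "Zoe" is not in the data: no error, it just contributes no rows
selected = toy[toy["name"].isin(["Ben", "Zoe"])]

# ~ negates the boolean mask, keeping every row NOT in the list
excluded = toy[~toy["name"].isin(["Ben", "Dan"])]

print(selected)
print(excluded)
```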


## Grouping

By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria

- **Applying** a function to each group independently

- **Combining** the results into a data structure
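
The three steps above can be written out explicitly. This is only an illustrative sketch on hypothetical toy data; in practice, `groupby` followed by an aggregation performs all three steps in one call.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "height": [160, 170, 164, 180],
})

group_means = {}
# Splitting: iterating over a groupby yields (key, sub-DataFrame) pairs
for key, group in toy.groupby("gender"):
    # Applying: compute the mean height within this group
    group_means[key] = group["height"].mean()

# Combining: collect the per-group results into one Series
manual = pd.Series(group_means, name="height")

# The same result in a single call
direct = toy.groupby("gender")["height"].mean()

print(manual)
print(direct)
```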

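Group the data by `gender` and apply the `mean()` function to each group. `numeric_only=True` restricts the calculation to numeric columns, and `.T` transposes the result so that each group becomes a column: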

In [55]:
df.groupby("gender").mean(numeric_only=True).round(2).T

gender,female,male
height,164.36,165.78
average_height_parents,164.86,165.94
number,42.0,42.0
height_m,1.64,1.66
weight,75.34,72.66
bmi,27.9,26.45


## Segment data into bins

Use the function [cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. 

In our example, we create a body mass index category. The standard weight status categories associated with BMI ranges for adults are shown in the following table:

| BMI | Weight Status |
| --- | --- |
| Below 18.5 | Underweight |
| 18.5 - 24.9 | Normal or Healthy Weight |
| 25.0 - 29.9 | Overweight |
| 30.0 and Above | Obese |

Source: [U.S. Department of Health & Human Services](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html)

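In the call to `pd.cut` below, we discretize the variable `bmi` into four bins according to the table above:

- The bin edges `[0, 18.5, 25, 30, float('inf')]` define the intervals (0, 18.5], (18.5, 25], (25, 30] and (30, inf]. Intervals are closed on the right by default (`right=True`), so a BMI of exactly 25.0 is labelled `normal`, whereas the table places it in the overweight range.
- `float('inf')` is positive infinity; it serves as an open-ended upper edge for the last bin.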

In [38]:
df['bmi_category'] = pd.cut(df['bmi'], 
                            bins=[0, 18.5, 25, 30, float('inf')], 
                            labels=['underweight', 'normal', 'overweight', "obese"])

In [39]:
df['bmi_category']

0          obese
1     overweight
2     overweight
3     overweight
4          obese
5         normal
6     overweight
7          obese
8     overweight
9     overweight
10    overweight
11        normal
12        normal
13         obese
14    overweight
15        normal
16        normal
17        normal
18    overweight
19    overweight
Name: bmi_category, dtype: category
Categories (4, object): ['underweight' < 'normal' < 'overweight' < 'obese']
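
With the default right-closed intervals, values that sit exactly on a bin edge fall into the lower category. If the bins should instead follow the table literally (18.5 counts as normal, 25.0 as overweight, 30.0 as obese), one option is to pass `right=False`, which makes the intervals closed on the left. The sketch below compares both settings on a few hypothetical boundary values that are not taken from the dataset.

```python
import pandas as pd

# Hypothetical BMI values chosen to sit on or near the bin edges
bmi = pd.Series([18.5, 24.9, 25.0, 30.0])

bins = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# Default: intervals are (0, 18.5], (18.5, 25], (25, 30], (30, inf]
right_closed = pd.cut(bmi, bins=bins, labels=labels)

# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), [30, inf)
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"bmi": bmi,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

For the data shown above, the two settings differ only for rows whose `bmi` equals one of the edges exactly.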