**Important notes:**

- Use your **HdM ID** (e.g. the xy123 in ***xy123***@hdm-stuttgart.de) as **NAME**


- Don't change the name of the file and don't delete any cells.


- Make sure you fill in any place that says  <font color='green'> \# YOUR CODE HERE </font> or "YOUR ANSWER HERE".


- The line `raise NotImplementedError()` prevents you from handing in tasks with empty cells. Simply delete this line when you start working on a cell that contains it.


- Before you turn this problem in (i.e., after you have completed all tasks), make sure everything runs as expected: restart the kernel and run all cells. If you use:
  - *Visual Studio Code*: select "Restart" and then "Run All" 
  - *Colab*: in the menubar, select `Runtime` and click on `Restart and run all`
  - *Jupyter Notebook*: in the menubar, select `Kernel` and click on `Restart & Run All`


Good luck!

In [None]:
NAME = ""

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Filter, grouping and binning 

## Import pandas

In [None]:
import pandas as pd

## Import data

In [None]:
URL = "https://raw.githubusercontent.com/kirenz/datasets/master/height_clean_cols.csv"
df = pd.read_csv(URL)

## Filter 

Using a single column’s values to filter data (boolean indexing).

In [None]:
df[df["height"] >= 167]
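
To see what boolean indexing does step by step, here is a minimal sketch on a small made-up DataFrame (the names and values in `toy` are hypothetical and not taken from the course dataset). The comparison first produces a boolean Series, and that Series then selects the rows:

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cleo"],
    "height": [160, 167, 172],
})

# The comparison yields a boolean Series with one entry per row
mask = toy["height"] >= 167

# Passing the mask to [] keeps only the rows where it is True
tall = toy[mask]

print(mask.tolist())
print(tall)
```

Writing `toy[toy["height"] >= 167]` in one line does the same thing without storing the mask.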

Combine filtering for two columns or more with `&` (and)

In [None]:
df[(df["height"] >= 167) & (df["weight"] < 74) ]

You can also combine filtering for two columns or more with `|` (or)

In [None]:
df[(df["height"] >= 167) | (df["weight"] < 74) ]
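
Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `>=`; without them Python would try to evaluate `167 & df["weight"]` first. The following sketch, using a hypothetical `toy` DataFrame with invented values, stores each condition in its own variable, which avoids the nested parentheses:

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "height": [160, 167, 172, 158],
    "weight": [70, 80, 65, 90],
})

# One boolean Series per condition
is_tall = toy["height"] >= 167
is_light = toy["weight"] < 74

both = toy[is_tall & is_light]     # rows that meet both conditions
either = toy[is_tall | is_light]   # rows that meet at least one condition

print(len(both), len(either))
```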

- Filter people with a weight greater than 84 and save the result as `df_weight_greater_84`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check if your code returns the correct output"""
assert len(df_weight_greater_84) == 2
assert df_weight_greater_84.iloc[0, 0] == "Stefanie"
assert df_weight_greater_84.iloc[1, 0] == "Eric"

Use the [isin()](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html#pandas.Series.isin) method when you want to match one column against several values:

In [None]:
df[df["name"].isin(["Tom", "Steven"])]
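
As a rough illustration of why `isin()` is convenient, the sketch below compares it with the equivalent chain of `==` conditions joined by `|`, and shows how `~` inverts the selection. The `toy` DataFrame and its names are made up for this example:

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({"name": ["Tom", "Ann", "Steven", "Eva"]})

wanted = ["Tom", "Steven"]

# isin() checks each value against the whole list at once
with_isin = toy[toy["name"].isin(wanted)]

# Equivalent, but longer: one comparison per value, joined with |
with_or = toy[(toy["name"] == "Tom") | (toy["name"] == "Steven")]

# ~ negates the mask and selects everyone who is NOT in the list
others = toy[~toy["name"].isin(wanted)]

print(with_isin.equals(with_or))
```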

## Grouping

By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria

- **Applying** a function to each group independently

- **Combining** the results into a data structure

Here we group by `gender`, apply the `mean()` function to the numeric columns of each resulting group, round the results to two decimals, and transpose the table:

In [None]:
df.groupby("gender").mean(numeric_only=True).round(2).T
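
The three steps (split, apply, combine) can also be written out by hand, which may help to see what `groupby` does internally. This is only a sketch on a hypothetical `toy` DataFrame with invented values; in practice the one-line version is preferable:

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "height": [160.0, 180.0, 170.0, 176.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
# Apply: compute the mean height within each sub-DataFrame
means = {key: group["height"].mean() for key, group in toy.groupby("gender")}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name="height")

# The built-in one-liner performs the same three steps
built_in = toy.groupby("gender")["height"].mean()

print(manual)
print(built_in)
```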

## Segment data into bins

Use the function [cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. 

In our example, we create a body mass index category. The standard weight status categories associated with BMI ranges for adults are shown in the following table:

BMI | Weight Status
--- | ---
Below 18.5 | Underweight
18.5 - 24.9 | Normal or Healthy Weight
25.0 - 29.9 | Overweight
30.0 and Above | Obese

Source: [U.S. Department of Health & Human Services](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html)

In our function, we discretize the variable `bmi` into four bins according to the table above:

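- The bins `[0, 18.5, 25, 30, float('inf')]` define the intervals (0, 18.5], (18.5, 25], (25, 30] and (30, inf]. By default each interval includes its right edge, so a `bmi` of exactly 25 falls into (18.5, 25].
- `float('inf')` represents positive infinity, so the last bin has no upper limit.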

In [None]:
df['bmi_category'] = pd.cut(df['bmi'], 
                            bins=[0, 18.5, 25, 30, float('inf')], 
                            labels=['underweight', 'normal', 'overweight', "obese"])

In [None]:
df['bmi_category'].head(7)
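
In the table above, the values 18.5, 25.0 and 30.0 each mark the lower end of a category, whereas `cut` by default assigns a value that sits exactly on an edge to the lower bin. The sketch below, on a hypothetical Series that contains only these edge values, shows how the `right=False` argument of `pd.cut` changes the assignment. Whether this matters depends on whether such exact values occur in your data:

```python
import pandas as pd

edges = [0, 18.5, 25, 30, float("inf")]
labels = ["underweight", "normal", "overweight", "obese"]

# Hypothetical values that sit exactly on the bin edges
boundary = pd.Series([18.5, 25.0, 30.0])

# Default (right=True): intervals are (0, 18.5], (18.5, 25], ...
right_closed = pd.cut(boundary, bins=edges, labels=labels)

# right=False: intervals are [0, 18.5), [18.5, 25), ...
left_closed = pd.cut(boundary, bins=edges, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```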

Example of how to discretize into four equal-width bins (each bin spans the same range of `bmi` values, not the same number of rows):

In [None]:
df['bmi_category_2'] = pd.cut(df['bmi'], 
                            bins=4, 
                            labels=['group1', 'group2', 'group3', "group4"])

In [None]:
df['bmi_category_2'].head(7)

Example of how to discretize into four equal-width bins if you don't need labels. With `labels=False`, `cut` returns the integer index of the bin (0 to 3) instead:

In [None]:
df['bmi_category_3'] = pd.cut(df['bmi'], 
                            bins=4, 
                            labels=False)

In [None]:
df['bmi_category_3'].head(7)
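
To check where `cut` placed the edges when only the number of bins is given, you can pass `retbins=True`, which makes `pd.cut` return the computed edges alongside the result. A minimal sketch on a hypothetical Series (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical values, not the course dataset
values = pd.Series([18.0, 20.0, 22.0, 24.0, 26.0, 30.0, 34.0])

# labels=False returns integer bin indices; retbins=True also returns the edges
codes, bin_edges = pd.cut(values, bins=4, labels=False, retbins=True)

print(codes.tolist())
print(bin_edges)

# Equal width does not mean equal counts: the bins hold different numbers of rows
print(codes.value_counts().sort_index().tolist())
```

Four bins need five edges. The lowest edge is placed slightly below the minimum so that the smallest value is still included in the first bin.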

- Use the variable height to create a new variable called `height_category` with three bins and labels:

  - 0 to 164 (label it `group1`)
  - 165 to 166 (label it `group2`)
  - 167 and taller (label it `group3`)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check if your code returns the correct output"""
assert df['height_category'].value_counts().group1 == 13
assert df['height_category'].value_counts().group2 == 6
assert df['height_category'].value_counts().group3 == 1
