# Introduction to grouping

In [1]:
import pandas as pd

names = ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Gromek", 
         "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Kubiak", 
         "Kit Ching", "Dog Woof"]
ages = [22, 50, 23, 29, 44, 30, 25, 71, 35, 2]
nations = ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"]
siblings = [2, 0, 4, 1, 1, 2, 3, 7, 0, 9]
colours = ["Red", "Yellow", "Yellow", "Blue", "Red", "Yellow", "Blue", "Blue", "Red", "Gray"]



people = pd.DataFrame({"name": names,
                       "age": ages,
                       "country": nations,
                       "siblings": siblings,
                       "favourite_colour": colours
                      })

people.head()

## 1.&nbsp;`.value_counts()`

For a categorical column, `.value_counts()` is an easy way to find out how many rows fall into each category.

Have a look at the `.value_counts()` documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).

In [None]:
# Find the number of people from each country.
people["country"].value_counts()

ES    2
PL    2
IN    2
DE    1
FR    1
UK    1
XX    1
Name: country, dtype: int64
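
As a small, hedged illustration of two common variations: the sketch below rebuilds only the `country` column as a standalone Series (an assumption made to keep the example self-contained) and shows the `normalize` parameter of `.value_counts()` together with `.sort_index()`. The variable names `shares` and `counts_by_label` are made up for this example.

```python
import pandas as pd

# Same values as the "country" column of `people`, rebuilt here
# so that the example runs on its own.
countries = pd.Series(
    ["DE", "ES", "ES", "PL", "IN", "FR", "IN", "PL", "UK", "XX"],
    name="country",
)

# Relative frequencies (fractions of all rows) instead of absolute counts.
shares = countries.value_counts(normalize=True)

# Absolute counts, ordered by category label instead of by frequency.
counts_by_label = countries.value_counts().sort_index()

print(shares)
print(counts_by_label)
```

With `normalize=True` the values sum to 1, which is handy when you care about proportions rather than raw counts.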

As with `GROUP BY` in SQL, you can split the rows into groups and apply an aggregation function to each group. The grouping is done with `.groupby()`.

Some of the possible aggregation functions are
- `.mean()`,
- `.sum()`,
- `.count()`,
- `.max()`,
- ...

Have a look at the `.groupby()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).

In [None]:
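# Get the average age and average number of siblings per favourite colour.
# numeric_only=True skips the text columns ("name", "country"),
# which would otherwise raise a TypeError in pandas 2.0 and later.
people.groupby("favourite_colour").mean(numeric_only=True)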

| favourite_colour | age | siblings |
|---|---|---|
| Blue | 41.666667 | 3.666667 |
| Gray | 2.0 | 9.0 |
| Red | 33.666667 | 1.0 |
| Yellow | 34.333333 | 2.0 |
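
To make the split-then-aggregate idea more concrete, here is a minimal sketch. It assumes a reduced copy of `people` with only the two columns it needs, and the names `grouped`, `group_sizes` and `mean_age` are chosen just for this example.

```python
import pandas as pd

# Reduced copy of `people`, rebuilt so that the example runs on its own.
people = pd.DataFrame({
    "age": [22, 50, 23, 29, 44, 30, 25, 71, 35, 2],
    "favourite_colour": ["Red", "Yellow", "Yellow", "Blue", "Red",
                         "Yellow", "Blue", "Blue", "Red", "Gray"],
})

# Step 1: split the rows into groups. Nothing is computed yet.
grouped = people.groupby("favourite_colour")

# Step 2a: number of rows per group, comparable to
# SELECT favourite_colour, COUNT(*) ... GROUP BY favourite_colour in SQL.
group_sizes = grouped.size()

# Step 2b: select a single column before aggregating to get a Series back.
mean_age = grouped["age"].mean()

print(group_sizes)
print(mean_age)
```

Selecting the column first also sidesteps the question of how non-numeric columns should be averaged.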


## 2.&nbsp;`.agg()`

`.agg()` aggregates data. It can work on several columns at once and can apply several aggregation functions in a single call.

> `.aggregate()` works the same way as `.agg()`, which is simply its shorter alias. Using `.agg()` is recommended.

Have a look at the `.agg()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html).

In [None]:
# Get the sum and min values for the "age" and "siblings" columns.
people[["age", "siblings"]].agg(["sum", "min"])

| | age | siblings |
|---|---|---|
| sum | 331 | 29 |
| min | 2 | 0 |
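
`.agg()` also accepts a dictionary that maps each column to its own aggregation, without any grouping. The sketch below assumes a reduced copy of `people` with only the numeric columns. The names `summary` and `same_result` are illustrative.

```python
import pandas as pd

# Numeric columns of `people`, rebuilt so that the example runs on its own.
people = pd.DataFrame({
    "age": [22, 50, 23, 29, 44, 30, 25, 71, 35, 2],
    "siblings": [2, 0, 4, 1, 1, 2, 3, 7, 0, 9],
})

# One aggregation per column: the result is a Series indexed by column name.
summary = people.agg({"age": "max", "siblings": "sum"})

# .aggregate() is the long form of .agg() and gives the same result.
same_result = people.aggregate({"age": "max", "siblings": "sum"}).equals(summary)

print(summary)
print(same_result)
```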


`.agg()` is frequently used in combination with `.groupby()` too.

In [None]:
# Group by colour.
# For each colour, get the average age and the total number of siblings.
people.groupby("favourite_colour").agg({"age": "mean", "siblings": "sum"})

| favourite_colour | age | siblings |
|---|---|---|
| Blue | 41.666667 | 11 |
| Gray | 2.0 | 9 |
| Red | 33.666667 | 3 |
| Yellow | 34.333333 | 6 |
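
One possible refinement is pandas' named aggregation, where each keyword argument to `.agg()` names an output column and is given a `(column, function)` pair. The sketch below assumes a reduced copy of `people`. The output column names `mean_age` and `total_siblings` are made up for this example.

```python
import pandas as pd

# Reduced copy of `people`, rebuilt so that the example runs on its own.
people = pd.DataFrame({
    "age": [22, 50, 23, 29, 44, 30, 25, 71, 35, 2],
    "siblings": [2, 0, 4, 1, 1, 2, 3, 7, 0, 9],
    "favourite_colour": ["Red", "Yellow", "Yellow", "Blue", "Red",
                         "Yellow", "Blue", "Blue", "Red", "Gray"],
})

# Named aggregation: output_column=(input_column, aggregation function).
summary = people.groupby("favourite_colour").agg(
    mean_age=("age", "mean"),
    total_siblings=("siblings", "sum"),
)

print(summary)
```

The resulting column names say which aggregation was applied, which the dictionary form does not show.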


## 3.&nbsp;Challenges

### Exercise 1

For each nationality, compute the maximum age.

In [None]:
# Your code here.

In [None]:
people.groupby("country").agg({"age": "max"})

| country | age |
|---|---|
| DE | 22 |
| ES | 50 |
| FR | 30 |
| IN | 44 |
| PL | 71 |
| UK | 35 |
| XX | 2 |


### Exercise 2

For each combination of favourite colour and nationality, compute the minimum, average and maximum number of siblings.

In [None]:
# Your code here.

In [None]:
people.groupby(["favourite_colour", "country"]).agg({"siblings":["min", "mean", "max"]})

| favourite_colour | country | siblings (min) | siblings (mean) | siblings (max) |
|---|---|---|---|---|
| Blue | IN | 3 | 3.0 | 3 |
| Blue | PL | 1 | 4.0 | 7 |
| Gray | XX | 9 | 9.0 | 9 |
| Red | DE | 2 | 2.0 | 2 |
| Red | IN | 1 | 1.0 | 1 |
| Red | UK | 0 | 0.0 | 0 |
| Yellow | ES | 0 | 2.0 | 4 |
| Yellow | FR | 2 | 2.0 | 2 |
