In [1]:
import pandas as pd

In [6]:
# Create the example DataFrame

# Data
data = {
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, 150, 100, 250, 300, 120],
    'year': [2023, 2023, 2023, 2024, 2024, 2024]
}

# dataframe
df = pd.DataFrame(data)

# GroupBy

In **Pandas**, the `groupby()` method splits the rows of a `DataFrame` into groups based on the values of one or more columns (a column is called a `Series` in **Pandas**). It is very useful because it can be combined with aggregation functions such as *sum*, *mean*, and *count*, among others.

## Example 1: Use GroupBy with an aggregation function

This is one of the most common applications of the `groupby()` method: group the rows that share a value in a column and apply an aggregation function such as `mean()`, `count()`, or `sum()`.

Group by the `name` column and sum the sales.

In [8]:
gr_name_sum = df.groupby('name')['sales'].sum()
gr_name_sum

name
Ana     450
Bob     450
Mary    220
Name: sales, dtype: int64
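
The same grouping works with the other aggregation functions mentioned above. The following is a minimal sketch, assuming a DataFrame rebuilt from the example data; the variable names `mean_by_name` and `count_by_name` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same names and sales as the example DataFrame
df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'sales': [200, 150, 100, 250, 300, 120],
})

# Average sale per name
mean_by_name = df.groupby('name')['sales'].mean()

# Number of non-null sales entries per name
count_by_name = df.groupby('name')['sales'].count()

print(mean_by_name)
print(count_by_name)
```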

## Example 2: Group multiple columns

You can group by multiple columns at once for a more detailed analysis.

In the next example, group by `name` and `city` and calculate the average sales.

In [9]:
gr_name_city_mean = df.groupby(['name', 'city'])['sales'].mean()
gr_name_city_mean

name  city     
Ana   Madrid       225.0
Bob   Barcelona    225.0
Mary  Sevilla      110.0
Name: sales, dtype: float64
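
Grouping by several columns produces a result with a multi-level index. As a rough sketch of how such a result might be queried or reshaped (the names `ana_madrid` and `wide` are assumptions made for illustration), a single entry can be selected with a tuple, and `unstack()` can move one index level into the columns:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, 150, 100, 250, 300, 120],
})

mean_by_name_city = df.groupby(['name', 'city'])['sales'].mean()

# Look up one (name, city) pair with a tuple
ana_madrid = mean_by_name_city.loc[('Ana', 'Madrid')]

# Move the 'city' level into the columns; missing pairs become NaN
wide = mean_by_name_city.unstack('city')

print(ana_madrid)
print(wide)
```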

## Example 3: Apply multiple aggregation functions

You can apply multiple aggregation functions using the method `agg()`.

In the next example, group by city and apply multiple aggregation functions.

In [11]:
gr_city_agg1 = df.groupby('city')['sales'].agg(['sum', 'mean', 'min', 'max'])
gr_city_agg1

           sum   mean  min  max
city                           
Barcelona  450  225.0  150  300
Madrid     450  225.0  200  250
Sevilla    220  110.0  100  120
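
If the default column names are not descriptive enough, `agg()` also accepts keyword arguments that name each output column. This is a small sketch, not part of the original notebook; the column names `total`, `average`, `lowest`, and `highest` are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, 150, 100, 250, 300, 120],
})

# Each keyword becomes a column name; each value is the aggregation to apply
summary = df.groupby('city')['sales'].agg(
    total='sum',
    average='mean',
    lowest='min',
    highest='max',
)

print(summary)
```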


## Example 4: Filter results after grouping

You can filter groups after grouping with the `filter()` method. It evaluates a condition once per group and keeps all the rows of the groups for which the condition is true.

In the next example, keep only the rows of the cities whose total sales are below 300.

In [13]:
gr_filter_300 = df.groupby('city').filter(lambda x: x['sales'].sum() < 300)
gr_filter_300

   name     city  sales  year
2  Mary  Sevilla    100  2023
5  Mary  Sevilla    120  2024
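
The same selection can also be written as a boolean mask built with `transform('sum')`, which broadcasts each group's total back to its rows. The sketch below is an alternative under that assumption, with illustrative variable names, and compares both approaches:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, 150, 100, 250, 300, 120],
})

# Total sales of each row's city, aligned with the original rows
city_totals = df.groupby('city')['sales'].transform('sum')

# Keep the rows whose city total is below 300
low_sales_rows = df[city_totals < 300]

# Equivalent selection with filter()
filtered = df.groupby('city').filter(lambda g: g['sales'].sum() < 300)

print(low_sales_rows)
```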


## Example 5: Grouping and counting

To count the number of rows in each group, use `size()`. To count the non-null values of each column in each group, use `count()`.

In this example, count the number of sales per city:

In [14]:
gr_counting_sales = df.groupby('city').size()
gr_counting_sales

city
Barcelona    2
Madrid       2
Sevilla      2
dtype: int64
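
The difference between `size()` and `count()` only shows up when a group contains missing values. The example data has none, so this sketch uses a small hypothetical DataFrame with one missing sale:

```python
import pandas as pd

# Hypothetical data with one missing sales value (not from the example above)
df_nan = pd.DataFrame({
    'city': ['Madrid', 'Madrid', 'Sevilla', 'Sevilla'],
    'sales': [200, None, 100, 250],
})

# size() counts every row in the group, including rows with NaN
rows_per_city = df_nan.groupby('city').size()

# count() counts only the non-null values
sales_per_city = df_nan.groupby('city')['sales'].count()

print(rows_per_city)
print(sales_per_city)
```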

## Example 6: Iterate over groups

You can iterate over the groups created by `groupby()` with a `for` loop. Each iteration yields the name of the group and its sub-DataFrame.

In this example, iterate over the groups created from the `name` column:

In [19]:
gr_iterate = df.groupby('name')

for name, group in gr_iterate:
    print('Group Name:', name)
    print('-')
    print(group)
    print(20*'*')

Group Name: Ana
-
  name    city  sales  year
0  Ana  Madrid    200  2023
3  Ana  Madrid    250  2024
********************
Group Name: Bob
-
  name       city  sales  year
1  Bob  Barcelona    150  2023
4  Bob  Barcelona    300  2024
********************
Group Name: Mary
-
   name     city  sales  year
2  Mary  Sevilla    100  2023
5  Mary  Sevilla    120  2024
********************
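
Iteration is also handy for building your own structures from the groups, and `get_group()` retrieves a single group without a loop. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'sales': [200, 150, 100, 250, 300, 120],
})

groups = df.groupby('name')

# Build a plain dictionary of total sales per name by iterating over the groups
totals = {name: group['sales'].sum() for name, group in groups}

# Fetch the sub-DataFrame of one group directly
ana_rows = groups.get_group('Ana')

print(totals)
print(ana_rows)
```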


## Example 7: Group and fill missing values

If you have missing values, you can combine `groupby()` with `transform()` to apply a function to each group and fill those values.

In the next example, fill the missing values with the group mean. The example DataFrame contains no missing values, so the output is the same as the original column:

In [20]:
df['sales'] = df.groupby('city')['sales'].transform(lambda x: x.fillna(x.mean()))
df['sales']

0    200
1    150
2    100
3    250
4    300
5    120
Name: sales, dtype: int64
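
To see the fill take effect, the column needs actual gaps. The following sketch uses a hypothetical copy of the data in which two sales values are replaced with `NaN`; `df_missing` is an assumed name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the example data with two missing sales values
df_missing = pd.DataFrame({
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, np.nan, 100, 250, 300, np.nan],
})

# Replace each NaN with the mean of the other sales in the same city
df_missing['sales'] = (
    df_missing.groupby('city')['sales']
    .transform(lambda x: x.fillna(x.mean()))
)

print(df_missing)
```

Each city here has only one remaining value, so the missing Barcelona sale becomes 300 and the missing Sevilla sale becomes 100.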

## Example 8: Group and apply different operations to several columns

If you need to aggregate several columns differently, pass a dictionary to the `agg()` method (introduced in *Example 3*) that maps each column to the aggregation function or functions to apply to it.

In this example, group by **year** and apply different functions to the columns `sales` and `year`:

In [23]:
gr_multiple_aggr = df.groupby('year').agg({
    'sales': ['sum', 'mean'],
    'year': 'count',
})

gr_multiple_aggr

     sales              year
       sum        mean count
year                        
2023   450       150.0     3
2024   670  223.333333     3
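
A dictionary with several functions per column yields two-level column labels, which can be awkward to work with. One possible way to flatten them is to join the two levels with an underscore. In this sketch the `name` column is counted instead of `year`, which is an assumption made to keep the example independent of how the grouping key is handled:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'sales': [200, 150, 100, 250, 300, 120],
    'year': [2023, 2023, 2023, 2024, 2024, 2024],
})

result = df.groupby('year').agg({
    'sales': ['sum', 'mean'],
    'name': 'count',
})

# Each column label is a (column, function) tuple; join it into one string
result.columns = ['_'.join(col) for col in result.columns]

print(result)
```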


## Example 9: Reset the index after `groupby()`

After `groupby()`, the grouping keys become the index of the result. If you would rather have them as regular columns with a default integer index, call `reset_index()`.

In this example, group by **name** and reset the index:

In [24]:
gr_name_sum_reset = df.groupby('name')['sales'].sum().reset_index()
gr_name_sum_reset

   name  sales
0   Ana    450
1   Bob    450
2  Mary    220


In [26]:
# For comparison, the result from Example 1
gr_name_sum

name
Ana     450
Bob     450
Mary    220
Name: sales, dtype: int64
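
An alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby()`, which keeps the grouping key as a regular column from the start. A short sketch comparing both, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ana', 'Bob', 'Mary', 'Ana', 'Bob', 'Mary'],
    'sales': [200, 150, 100, 250, 300, 120],
})

# Grouping key stays a column instead of becoming the index
flat = df.groupby('name', as_index=False)['sales'].sum()

# Same result obtained by resetting the index afterwards
reset = df.groupby('name')['sales'].sum().reset_index()

print(flat)
```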

## Example 10: Apply a custom function with `apply()`

The `apply()` method lets you run a custom function on each group.

In the next example, add a 10% tax to each sale using `apply()`. The rows are grouped by `sales`, so the result is indexed by the sale value and the original row label:

In [30]:
def add_taxes(group):
    return (group['sales'] + group['sales']*0.1)

sales_plus_taxes = df.groupby('sales').apply(add_taxes)
sales_plus_taxes


sales   
100    2    110.0
120    5    132.0
150    1    165.0
200    0    220.0
250    3    275.0
300    4    330.0
Name: sales, dtype: float64
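
`apply()` is most useful when the custom function needs to see a whole group at once. As a hedged sketch of that use, the hypothetical helper `sales_range` below computes the spread between the highest and lowest sale in each city:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Madrid', 'Barcelona', 'Sevilla', 'Madrid', 'Barcelona', 'Sevilla'],
    'sales': [200, 150, 100, 250, 300, 120],
})

def sales_range(sales):
    """Return the difference between the largest and smallest sale in a group."""
    return sales.max() - sales.min()

# Selecting the 'sales' column first passes each group's Series to the function
range_by_city = df.groupby('city')['sales'].apply(sales_range)

print(range_by_city)
```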