# Aggregations and other groupwise operations - Intro

In this chapter we consider hierarchical (or group) structures in the data. These can be used to perform various data preparation steps, in particular:
* **Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:
  * Compute group sums or means.
  * Compute group sizes / counts.
* **Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:
  * Standardize data (zscore) within a group.
  * Fill NAs within groups with a value derived from each group.

For more details, see [https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)



# Preparations

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("../../data/raw/financial_data_intro.csv")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,False
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,False
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,False
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,43026.854,4386.107,False


# Aggregations over the whole dataset (without `groupby()`)

Before we continue with groupwise operations, let us calculate some aggregate statistics for the whole dataset, without considering any hierarchies in the data.

## the `describe()` method revisited

You have already seen the `describe()` method, which exists for both `pd.DataFrame` and `pd.Series` (a single column). It returns summary statistics of the numeric columns for the whole dataset.

In [2]:
# describe a whole DataFrame
df.describe()

Unnamed: 0,u_company_name_id,u_year,cb_naics,cb_at,cb_ni
count,824.0,824.0,824.0,824.0,824.0
mean,45233.387136,2012.11165,379140.90534,113685.9,2191.241188
std,32869.292595,4.166015,207892.935774,296348.9,5047.784304
min,2172.0,2005.0,325.0,0.0,-50119.0
25%,14651.0,2009.0,324110.0,763.492,-0.31275
50%,34617.0,2012.0,336111.0,16471.87,484.862
75%,77954.0,2016.0,523930.0,84863.73,3470.27075
max,109031.0,2019.0,999977.0,2261780.0,50778.396


In [3]:
# describe a single column (i.e., a pd.Series)
df["u_year"].describe()

count     824.000000
mean     2012.111650
std         4.166015
min      2005.000000
25%      2009.000000
50%      2012.000000
75%      2016.000000
max      2019.000000
Name: u_year, dtype: float64

## `mean()`, `sum()`, and other built-in aggregation methods

Both `pd.DataFrame` and `pd.Series` provide many built-in aggregation methods such as `min`, `max`, `mean`, `sum`, `std`, `count`, and `quantile`. Here are some examples.

Note that these aggregation methods return a `pd.Series` of aggregated values when used with a `pd.DataFrame`, and a scalar when used with a `pd.Series`!

In [4]:
df.mean(numeric_only=True)

u_company_name_id         45233.387136
u_year                     2012.111650
cb_naics                 379140.905340
cb_at                    113685.916945
cb_ni                      2191.241188
cb_financial_industry         0.158981
dtype: float64

In [5]:
df["cb_at"].sum()

np.float64(93677195.56300001)

In [6]:
df.max()

u_company_name_id             109031
u_year                          2019
u_company_name           voxeljet AG
cb_naics                      999977
u_iso3                           USA
u_fye                     2019-12-31
cb_cusip                   G39108108
cb_at                      2261780.0
cb_ni                      50778.396
cb_financial_industry           True
dtype: object

In [7]:
# the 20th percentile (only for numeric variables)
df.select_dtypes(["int", "float"]).quantile(q=0.2)

u_company_name_id     11568.0000
u_year                 2008.0000
cb_naics             311612.0000
cb_at                   399.4678
cb_ni                    -7.9550
Name: 0.2, dtype: float64

The following cell confirms the return types described above: a `pd.Series` for a `pd.DataFrame` and a scalar for a `pd.Series`.

In [8]:
print(f"type(df.max()): {type(df.max())}")
print(f"type(df['cb_at'].max()): {type(df['cb_at'].max())}")

type(df.max()): <class 'pandas.core.series.Series'>
type(df['cb_at'].max()): <class 'numpy.float64'>



## `DataFrame.agg()`

`aggregate()`, or simply `agg()`, gives more control over which columns are aggregated with which functions. In contrast, `describe()` produces a fixed set of aggregations, and the other methods presented above, such as `mean()` and `sum()`, produce only one aggregation each.

[https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)


In [9]:
# specifying several specific aggregation functions:
df.select_dtypes(["int", "float"]).agg(["count", "mean", "std"])

Unnamed: 0,u_company_name_id,u_year,cb_naics,cb_at,cb_ni
count,824.0,824.0,824.0,824.0,824.0
mean,45233.387136,2012.11165,379140.90534,113685.916945,2191.241188
std,32869.292595,4.166015,207892.935774,296348.893884,5047.784304


In [10]:
# specifying different aggregations per column
df.agg(
    {
        "u_company_name": "count",
        "cb_at": ["count", "mean", "std"],
        "cb_ni": ["count", "mean", "std", "max"],
    }
)

Unnamed: 0,u_company_name,cb_at,cb_ni
count,824.0,824.0,824.0
mean,,113685.916945,2191.241188
std,,296348.893884,5047.784304
max,,,50778.396
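
Besides the names of built-in aggregations, `agg()` also accepts your own functions. The following is a minimal sketch on a small made-up DataFrame (the values and the helper `value_range` are hypothetical, chosen only for illustration); it mixes two built-in aggregations with a custom one.

```python
import pandas as pd

# Hypothetical toy data standing in for the financial dataset
toy = pd.DataFrame(
    {
        "cb_at": [100.0, 250.0, 40.0, 310.0],
        "cb_ni": [10.0, -5.0, 2.0, 30.0],
    }
)


def value_range(col: pd.Series) -> float:
    """Spread between the largest and the smallest value of a column."""
    return col.max() - col.min()


# Built-in aggregations are given as strings, the custom one as a callable
summary = toy.agg(["min", "max", value_range])
print(summary)
```

The row label of the custom aggregation is taken from the function name, which is one reason to prefer a named function over a `lambda` here.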


## Exercise 1

1. Load the first sheet of the Excel file "wdi_reduced.xlsx" into a pandas DataFrame (see [here](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for help with `pandas.read_excel()`)
2. Run the `describe()` method for the whole DataFrame.
3. Calculate the mean of all numeric columns over the whole DataFrame.
4. Reproduce the result from `describe()` using the `agg()` method and the appropriate built-in aggregation functions (you may skip the percentiles; including those is a **BONUS!**).
5. BONUS: Produce an aggregated DataFrame that counts the values for all columns (including the non-numerical columns) and also includes the mean and median for the numeric columns.
6. BONUS: Define your own function that counts the string values starting with the (capital) letter 'E'. Apply it using the `agg()` method for all string columns.

# Groupwise operations with the `groupby()` method

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Calling `groupby()` on a DataFrame returns a `DataFrameGroupBy` object, which behaves much like a DataFrame. In particular, it offers the usual aggregation methods such as `mean` and `std`, as well as `describe`, `agg`, and `transform`.

When calling one of the aggregation methods, the result is again a `DataFrame`!

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
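
To make the split, apply, and combine steps concrete, the sketch below performs them by hand on a small made-up DataFrame (hypothetical values) and compares the outcome with the `groupby()` one-liner. It is meant as an illustration of the idea, not as a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "u_year": [2005, 2005, 2006, 2006, 2006],
        "cb_at": [10.0, 30.0, 20.0, 40.0, 60.0],
    }
)

# Split: iterating over a groupby object yields (key, sub-DataFrame) pairs
pieces = {}
for year, group in toy.groupby("u_year"):
    # Apply: compute one statistic for each piece
    pieces[year] = group["cb_at"].mean()

# Combine: assemble the per-group results into a single Series
manual = pd.Series(pieces, name="cb_at")
manual.index.name = "u_year"

# The one-liner covers all three steps at once
builtin = toy.groupby("u_year")["cb_at"].mean()

print(manual)
print(builtin)
```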



## Basic examples of `groupby`

In [11]:
# Calculating the max() per group:
df.groupby("u_year").max()

Unnamed: 0_level_0,u_company_name_id,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
u_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2005,103775,Waddell & Reed Financial Inc.,999977,USA,2006-04-30,D1668R123,1181690.68,22341.0,True
2006,103775,Waddell & Reed Financial Inc.,999977,USA,2007-04-30,D1668R123,1389942.352,22315.0,True
2007,103775,Waddell & Reed Financial Inc.,999977,USA,2008-04-30,D1668R123,1549595.885,20845.0,True
2008,103775,Waddell & Reed Financial Inc.,999977,USA,2009-04-30,D1668R123,1330066.234,21157.0,True
2009,103775,Waddell & Reed Financial Inc.,999977,USA,2010-04-30,D1668R123,1015066.766,16578.0,True
2010,103775,Waddell & Reed Financial Inc.,999977,USA,2011-04-30,D1668R123,2261780.0,14026.66,True
2011,103775,Waddell & Reed Financial Inc.,999977,USA,2012-04-30,D1668R123,2147216.0,25700.0,True
2012,103775,Waddell & Reed Financial Inc.,999977,USA,2013-04-30,D1668R123,1989856.0,28636.036,True
2013,109031,voxeljet AG,999977,USA,2014-04-30,D1668R123,1966061.0,48668.0,True
2014,109031,voxeljet AG,999977,USA,2015-04-30,D1668R123,1945539.0,13291.739,True


In [12]:
# mean() works the same way (restricted to numeric columns here)
df.groupby("u_year").mean(numeric_only=True)

Unnamed: 0_level_0,u_company_name_id,cb_naics,cb_at,cb_ni,cb_financial_industry
u_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2005,45184.96,390886.72,92792.31748,2387.22698,0.16
2006,45184.96,390886.72,106853.25054,2749.50428,0.16
2007,45833.137255,393495.215686,113722.617196,3140.256235,0.176471
2008,45316.886792,384827.943396,107798.376208,898.588113,0.169811
2009,45316.886792,384827.943396,102147.096245,1508.802094,0.169811
2010,45511.207547,379922.396226,127645.853887,1840.888472,0.169811
2011,44000.981818,380972.872727,125152.4358,2703.733673,0.163636
2012,44072.175439,371318.421053,122610.937509,2604.871965,0.175439
2013,45192.155172,370661.965517,123250.022828,3017.393483,0.172414
2014,43336.389831,369895.067797,117476.362678,1989.857288,0.169492


In [13]:
# You can limit the computation to one specific column (note that we request a Series here!):
df.groupby("u_year")["cb_at"].mean()

u_year
2005     92792.317480
2006    106853.250540
2007    113722.617196
2008    107798.376208
2009    102147.096245
2010    127645.853887
2011    125152.435800
2012    122610.937509
2013    123250.022828
2014    117476.362678
2015    107103.081841
2016    107586.286635
2017    115585.384969
2018    107645.230423
2019    146505.115783
Name: cb_at, dtype: float64

In [14]:
# The result is a Series in that case:
type(df.groupby("u_year")["cb_at"].mean())

pandas.core.series.Series

In [15]:
# But if you request a DataFrame (with one column), you get a DataFrame after the aggregation:
print(type(df.groupby("u_year")[["cb_at"]].mean()))
df.groupby("u_year")[["cb_at"]].mean()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,cb_at
u_year,Unnamed: 1_level_1
2005,92792.31748
2006,106853.25054
2007,113722.617196
2008,107798.376208
2009,102147.096245
2010,127645.853887
2011,125152.4358
2012,122610.937509
2013,123250.022828
2014,117476.362678


In [16]:
# naturally, you can apply aggregations to several selected columns
df.groupby("u_year")[["cb_at", "cb_ni"]].mean()

Unnamed: 0_level_0,cb_at,cb_ni
u_year,Unnamed: 1_level_1,Unnamed: 2_level_1
2005,92792.31748,2387.22698
2006,106853.25054,2749.50428
2007,113722.617196,3140.256235
2008,107798.376208,898.588113
2009,102147.096245,1508.802094
2010,127645.853887,1840.888472
2011,125152.4358,2703.733673
2012,122610.937509,2604.871965
2013,123250.022828,3017.393483
2014,117476.362678,1989.857288


In [17]:
# describe works with groupby()!
df.groupby("u_year").describe()

Unnamed: 0_level_0,u_company_name_id,u_company_name_id,u_company_name_id,u_company_name_id,u_company_name_id,u_company_name_id,u_company_name_id,u_company_name_id,cb_naics,cb_naics,...,cb_at,cb_at,cb_ni,cb_ni,cb_ni,cb_ni,cb_ni,cb_ni,cb_ni,cb_ni
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
u_year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2005,50.0,45184.96,33361.330543,3307.0,14766.75,32787.5,75883.25,103775.0,50.0,390886.72,...,88652.2835,1181690.68,50.0,2387.22698,4129.885145,-679.731,52.91325,818.6375,2703.00675,22341.0
2006,50.0,45184.96,33361.330543,3307.0,14766.75,32787.5,75883.25,103775.0,50.0,390886.72,...,95407.7115,1389942.352,50.0,2749.50428,4278.303827,-340.012,71.30275,1125.671,4094.63,22315.0
2007,51.0,45833.137255,33348.8475,3307.0,14882.5,36437.0,78098.0,103775.0,51.0,393495.215686,...,90055.7975,1549595.885,51.0,3140.256235,4850.566411,-3094.0,51.7945,717.13,5611.1425,20845.0
2008,53.0,45316.886792,32807.218554,3307.0,15114.0,32797.0,77954.0,103775.0,53.0,384827.943396,...,100198.705,1330066.234,53.0,898.588113,8232.999961,-50119.0,2.223,411.487,2436.919,21157.0
2009,53.0,45316.886792,32807.218554,3307.0,15114.0,32797.0,77954.0,103775.0,53.0,384827.943396,...,114726.22,1015066.766,53.0,1508.802094,4977.229698,-21553.0,-5.18,242.991,2505.234,16578.0
2010,53.0,45511.207547,32583.341391,3417.0,15114.0,32797.0,77954.0,103775.0,53.0,379922.396226,...,113136.802,2261780.0,53.0,1840.888472,3831.506998,-14025.0,13.929,519.594,3647.648,14026.66
2011,55.0,44000.981818,32923.086672,3307.0,14186.5,31508.0,73812.5,103775.0,55.0,380972.872727,...,99782.664,2147216.0,55.0,2703.733673,5201.771273,-5266.0,6.0685,649.711,3638.927,25700.0
2012,57.0,44072.175439,32332.02626,3307.0,14651.0,32797.0,69671.0,103775.0,57.0,371318.421053,...,84821.582,1989856.0,57.0,2604.871965,4984.122135,-6929.243,3.84,570.279,3486.794,28636.036
2013,58.0,45192.155172,33162.823181,3307.0,14766.75,34617.0,75883.25,109031.0,58.0,370661.965517,...,88371.5485,1966061.0,58.0,3017.393483,7746.191249,-12799.313,-2.31275,458.344,4327.77475,48668.0
2014,59.0,43336.389831,32350.546055,3307.0,14186.5,31508.0,68033.5,109031.0,59.0,369895.067797,...,85670.8445,1945539.0,59.0,1989.857288,2998.112052,-3823.916,-0.9785,644.35,3659.166,13291.739


## `agg()` and `groupby()`

In [18]:
# specific columns and specific aggregations:
df.groupby("u_year").agg(
    {
        "cb_at": ["mean", "std", "count"],
        "cb_ni": ["mean", "min", "max"],
    }
)

Unnamed: 0_level_0,cb_at,cb_at,cb_at,cb_ni,cb_ni,cb_ni
Unnamed: 0_level_1,mean,std,count,mean,min,max
u_year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2005,92792.31748,218413.106805,50,2387.22698,-679.731,22341.0
2006,106853.25054,257332.407008,50,2749.50428,-340.012,22315.0
2007,113722.617196,276558.935322,51,3140.256235,-3094.0,20845.0
2008,107798.376208,245992.84157,53,898.588113,-50119.0,21157.0
2009,102147.096245,211057.444052,53,1508.802094,-21553.0,16578.0
2010,127645.853887,348643.43338,53,1840.888472,-14025.0,14026.66
2011,125152.4358,329421.767613,55,2703.733673,-5266.0,25700.0
2012,122610.937509,314499.593318,57,2604.871965,-6929.243,28636.036
2013,123250.022828,314976.449801,58,3017.393483,-12799.313,48668.0
2014,117476.362678,308157.084485,59,1989.857288,-3823.916,13291.739
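
Passing a dictionary of lists yields a two-level column header, as in the output above. If flat column names are more convenient, pandas also supports "named aggregation", where each keyword argument is the new column name and its value is a `(column, function)` pair. A minimal sketch on made-up data (the values and the output names such as `at_mean` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "u_year": [2005, 2005, 2006, 2006],
        "cb_at": [10.0, 30.0, 20.0, 40.0],
        "cb_ni": [1.0, 3.0, -2.0, 4.0],
    }
)

# keyword = (source column, aggregation) gives one flat output column each
flat = toy.groupby("u_year").agg(
    at_mean=("cb_at", "mean"),
    at_count=("cb_at", "count"),
    ni_min=("cb_ni", "min"),
    ni_max=("cb_ni", "max"),
)
print(flat)
```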


In [19]:
# Selecting by dtype first only works if the grouping column is among the selected columns!
df.select_dtypes(["int", "float"]).groupby("u_year").agg(["count", "mean"])

Unnamed: 0_level_0,u_company_name_id,u_company_name_id,cb_naics,cb_naics,cb_at,cb_at,cb_ni,cb_ni
Unnamed: 0_level_1,count,mean,count,mean,count,mean,count,mean
u_year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
2005,50,45184.96,50,390886.72,50,92792.31748,50,2387.22698
2006,50,45184.96,50,390886.72,50,106853.25054,50,2749.50428
2007,51,45833.137255,51,393495.215686,51,113722.617196,51,3140.256235
2008,53,45316.886792,53,384827.943396,53,107798.376208,53,898.588113
2009,53,45316.886792,53,384827.943396,53,102147.096245,53,1508.802094
2010,53,45511.207547,53,379922.396226,53,127645.853887,53,1840.888472
2011,55,44000.981818,55,380972.872727,55,125152.4358,55,2703.733673
2012,57,44072.175439,57,371318.421053,57,122610.937509,57,2604.871965
2013,58,45192.155172,58,370661.965517,58,123250.022828,58,3017.393483
2014,59,43336.389831,59,369895.067797,59,117476.362678,59,1989.857288


In [20]:
# This would not work:
try:
    df.select_dtypes(["int", "float"]).groupby("u_iso3").agg(["count", "mean"])
except KeyError as e:
    print(
        f"An error of class {type(e).__name__} occurs, indicating the key (column) {e} is not found"
    )

An error of class KeyError occurs, indicating the key (column) 'u_iso3' is not found
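
One possible way around this `KeyError` is to pass the grouping column as a Series instead of a column name; pandas then groups by its values, aligned on the index. The sketch below uses a small made-up DataFrame (hypothetical values) to show the idea.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "u_iso3": ["DEU", "DEU", "USA", "USA"],
        "u_year": [2005, 2006, 2005, 2006],
        "cb_at": [10.0, 30.0, 20.0, 40.0],
    }
)

# The string column u_iso3 is dropped by the dtype selection ...
numeric = toy.select_dtypes("number")

# ... but the original column can still serve as the grouping key
by_country = numeric.groupby(toy["u_iso3"])["cb_at"].agg(["count", "mean"])
print(by_country)
```

An alternative with the same effect is to group first and select the columns of interest afterwards, for example `toy.groupby("u_iso3")[["cb_at"]]`.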


## `groupby` and `transform()`

`transform()` calculates aggregated values, but in contrast to `agg()` it returns an object of the same length as the input, with each group's value repeated for every row of that group.

When combined with `groupby()`, `transform()` can be used, for example, to standardize data within groups or to replace missing values with the mean of the group.

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html)
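
The two use cases just mentioned can be sketched as follows. The DataFrame is made up (hypothetical values, including one missing entry), and the new column names `cb_at_filled` and `cb_at_z` are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in 2006
toy = pd.DataFrame(
    {
        "u_year": [2005, 2005, 2005, 2006, 2006, 2006],
        "cb_at": [10.0, 20.0, 30.0, 100.0, np.nan, 300.0],
    }
)

grouped = toy.groupby("u_year")["cb_at"]
group_mean = grouped.transform("mean")  # one value per row, NaN ignored
group_std = grouped.transform("std")    # sample standard deviation per group

# Replace each missing value with the mean of its own group
toy["cb_at_filled"] = toy["cb_at"].fillna(group_mean)

# z-score within each group: (value - group mean) / group std
toy["cb_at_z"] = (toy["cb_at"] - group_mean) / group_std
print(toy)
```

Because `transform()` returns a like-indexed result, `group_mean` and `group_std` line up row by row with the original column, so ordinary element-wise arithmetic is enough.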

In [21]:
# an aggregation instead of transform will return one row per u_year:
df.groupby("u_year")["cb_at"].median()

u_year
2005    12445.5165
2006    14256.5570
2007    15184.4700
2008    16507.9340
2009    18244.6360
2010    19825.2120
2011    24838.7970
2012    22325.9980
2013    23348.5765
2014    24959.5230
2015    15515.4580
2016    15363.3980
2017    17398.3040
2018     9129.4130
2019    14294.8000
Name: cb_at, dtype: float64

In [22]:
# transform returns one entry per row
df.groupby("u_year")["cb_at"].transform("median")

0      12445.5165
1      14256.5570
2      15184.4700
3      16507.9340
4      18244.6360
          ...    
819    17398.3040
820     9129.4130
821    15363.3980
822    17398.3040
823     9129.4130
Name: cb_at, Length: 824, dtype: float64

In [23]:
# we can then compare the row-wise total assets to the respective yearly benchmark
df["cb_at"] > df.groupby("u_year")["cb_at"].transform("median")

0       True
1       True
2       True
3       True
4       True
       ...  
819    False
820    False
821    False
822    False
823    False
Name: cb_at, Length: 824, dtype: bool

In [24]:
df["big_company"] = df["cb_at"] > df.groupby("u_year")["cb_at"].transform("median")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,big_company
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False,True
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,False,True
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,False,True
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,False,True
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,43026.854,4386.107,False,True


## Exercise 2

1. Load the first sheet of the Excel file "wdi_reduced.xlsx" into a pandas DataFrame (see [here](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for help with `pandas.read_excel()`)
2. `describe()` the DataFrame, grouped by *region*.
3. Using the `agg()` method, summarize the data by *region*. We are interested in the number of values for all columns, and the mean and standard deviation of the numeric columns.
4. Calculate a new column containing the ratio of the population (*SP_POP_TOTL*) compared to the maximum in the respective *year*.
5. BONUS: Using the `agg()` method, summarize the data by *region*. We are interested in the 5% quantile and the 95% quantile. **Hint**: you could define your own functions to calculate the specific quantiles.
6. BONUS: Calculate the number of NaN values in the columns *NY_GDP_MKTP_CD*, *NY_GDP_MKTP_KD_ZG*, and *SP_POP_TOTL*, grouped by *region*.

---
---
---

## Side note: Categorical data and `groupby()`

Pandas supports a special data type called `Categorical` that can be used to represent categorical data. This is useful for grouping operations, as it allows for efficient memory usage and faster computations.

It behaves a little differently from regular object types in that it has a fixed number of possible values (categories). For instance, if a category generally exists but is not present in the current DataFrame, it may still appear in the results. This is decided by the `observed` argument in the `groupby()` method.

> **Note:**
The current default behaviour, `observed=False`, is deprecated and will be changed to `True` in a future version. This is why you may see warnings when working with categorical data.
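
The memory argument made above can be checked with a small experiment. This is only a sketch on a made-up column (four country codes repeated many times); the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: four country codes repeated 10,000 times each
codes = pd.Series(["DEU", "FRA", "GBR", "USA"] * 10_000)
as_category = codes.astype("category")

# deep=True also counts the memory of the stored strings themselves
plain_bytes = codes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain strings: {plain_bytes} bytes")
print(f"categorical:   {category_bytes} bytes")
```

The saving comes from storing each distinct label once and keeping only a small integer code per row, so it is largest for long columns with few distinct values.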


### Transforming a column to `Categorical`

In [25]:
df["country_cat"] = df["u_iso3"].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   u_company_name_id      824 non-null    int64   
 1   u_year                 824 non-null    int64   
 2   u_company_name         824 non-null    object  
 3   cb_naics               824 non-null    int64   
 4   u_iso3                 824 non-null    object  
 5   u_fye                  824 non-null    object  
 6   cb_cusip               824 non-null    object  
 7   cb_at                  824 non-null    float64 
 8   cb_ni                  824 non-null    float64 
 9   cb_financial_industry  824 non-null    bool    
 10  big_company            824 non-null    bool    
 11  country_cat            824 non-null    category
dtypes: bool(2), category(1), float64(2), int64(3), object(4)
memory usage: 60.7+ KB


In [26]:
# finding out more about the categorical column
print(df["country_cat"].cat.categories)
print(df["country_cat"].cat.ordered)

Index(['DEU', 'FRA', 'GBR', 'USA'], dtype='object')
False


In [27]:
# in the background, categorical data is represented as integer codes:
pd.crosstab(df["country_cat"].cat.codes.unique(), df["country_cat"].cat.categories)

col_0,DEU,FRA,GBR,USA
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0,0,0,1
1,0,0,1,0
2,1,0,0,0
3,0,1,0,0
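
A caveat when reading such a table: `.cat.codes.unique()` lists the codes in order of first appearance, whereas `.cat.categories` is sorted, so lining the two up by position need not reproduce the actual code-to-label mapping. A more direct way to read the mapping is sketched below on a small made-up Series (hypothetical values): the label for code `i` is simply the entry at position `i` of `.cat.categories`.

```python
import pandas as pd

# Hypothetical toy column; categories are inferred and sorted alphabetically
countries = pd.Series(["GBR", "USA", "DEU", "FRA", "GBR"], dtype="category")

# Position i in .cat.categories is the label behind integer code i
code_to_label = dict(enumerate(countries.cat.categories))
print(code_to_label)

# Row-wise view: every value next to its integer code
pairs = pd.DataFrame({"label": countries, "code": countries.cat.codes})
print(pairs.drop_duplicates().sort_values("code"))
```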


### Demonstrating the effect of `observed=False` or `observed=True` in `groupby()`

In [28]:
# with the default observed=False, pandas emits a FutureWarning here: the default is going to be changed to True
df.groupby("country_cat").count()


Unnamed: 0_level_0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,big_company
country_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
DEU,237,237,237,237,237,237,237,237,237,237,237
FRA,230,230,230,230,230,230,230,230,230,230,230
GBR,136,136,136,136,136,136,136,136,136,136,136
USA,221,221,221,221,221,221,221,221,221,221,221


In [29]:
# what happens if a country is removed?
df[df["country_cat"] != "USA"].groupby("country_cat").count()


Unnamed: 0_level_0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,big_company
country_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
DEU,237,237,237,237,237,237,237,237,237,237,237
FRA,230,230,230,230,230,230,230,230,230,230,230
GBR,136,136,136,136,136,136,136,136,136,136,136
USA,0,0,0,0,0,0,0,0,0,0,0


In [30]:
# The new default is going to be:
df[df["country_cat"] != "USA"].groupby("country_cat", observed=True).count()

Unnamed: 0_level_0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,big_company
country_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
DEU,237,237,237,237,237,237,237,237,237,237,237
FRA,230,230,230,230,230,230,230,230,230,230,230
GBR,136,136,136,136,136,136,136,136,136,136,136


### Acknowledge the existence of missing categories

You can make sure that pandas "knows" the missing categories (other ISO3 country codes).

In [31]:
df["country_cat"] = df["country_cat"].cat.add_categories(["CAN", "MEX", "JPN", "AUS"])
df[df["country_cat"] != "USA"].groupby("country_cat", observed=False).count()

Unnamed: 0_level_0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,big_company
country_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
DEU,237,237,237,237,237,237,237,237,237,237,237
FRA,230,230,230,230,230,230,230,230,230,230,230
GBR,136,136,136,136,136,136,136,136,136,136,136
USA,0,0,0,0,0,0,0,0,0,0,0
CAN,0,0,0,0,0,0,0,0,0,0,0
MEX,0,0,0,0,0,0,0,0,0,0,0
JPN,0,0,0,0,0,0,0,0,0,0,0
AUS,0,0,0,0,0,0,0,0,0,0,0


# Side note: The `apply()` method

The `apply()` method is very flexible and can be used with whole DataFrames and with `groupby()`. Among other things, it can be used to aggregate data as we would do with `agg()` or `transform()`. However, `apply()` is usually much slower than the more specialized methods available in pandas. It is therefore best reserved as a last resort.

For a detailed discussion and further examples, there is an excellent post on Stack Overflow that discusses the uses and misuses of `apply()` and its performance relative to the (usually available) faster alternatives:
[https://stackoverflow.com/a/54432584](https://stackoverflow.com/a/54432584)

Here is only a quick demonstration of the relative speed of `apply()` as opposed to `max()`:

In [32]:
# First, the result of the built-in `max()` aggregation:
df.select_dtypes(["int", "float"]).groupby("u_year").max().head(5)

Unnamed: 0_level_0,u_company_name_id,cb_naics,cb_at,cb_ni
u_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2005,103775,999977,1181690.68,22341.0
2006,103775,999977,1389942.352,22315.0
2007,103775,999977,1549595.885,20845.0
2008,103775,999977,1330066.234,21157.0
2009,103775,999977,1015066.766,16578.0


In [33]:
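# Now with `apply()`; note that np.max here collapses each group to a single overall maximum (see the output), not one value per column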
df.select_dtypes(["int", "float"]).groupby("u_year").apply(np.max, include_groups=False).head(5)

u_year
2005    1181690.680
2006    1389942.352
2007    1549595.885
2008    1330066.234
2009    1015066.766
dtype: float64

In [34]:
# using max() is faster than apply()!
%timeit df.select_dtypes(["int","float"]).groupby("u_year").max()
%timeit df.select_dtypes(["int","float"]).groupby("u_year").apply(np.max, include_groups=False)

569 μs ± 173 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


1.81 ms ± 337 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
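
`%timeit` is only available in IPython and Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. The example below uses randomly generated data (hypothetical, not the financial dataset) and a like-for-like `apply()` call that reduces every column of each group; absolute timings vary by machine and pandas version, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random data: 1,000 rows spread over the years 2005 to 2019
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "u_year": rng.integers(2005, 2020, size=1_000),
        "cb_at": rng.random(1_000),
        "cb_ni": rng.random(1_000),
    }
)
grouped = toy.groupby("u_year")[["cb_at", "cb_ni"]]


def with_builtin():
    return grouped.max()


def with_apply():
    # g.max() reduces each column of the group, mirroring the built-in
    return grouped.apply(lambda g: g.max())


t_builtin = timeit.timeit(with_builtin, number=50)
t_apply = timeit.timeit(with_apply, number=50)
print(f"max():   {t_builtin:.4f} s for 50 runs")
print(f"apply(): {t_apply:.4f} s for 50 runs")
```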
