In this notebook we look at one of the most important methods in pandas: `groupby`. It lets us split a dataset into groups and compute summaries for each group, which gives quick insight into many kinds of data. This notebook closely follows the tutorial from [RealPython](https://realpython.com/pandas-groupby/) and supplements the content we discussed in the live session.


In [4]:
import numpy as np
import pandas as pd

In [5]:
import requests

download_url = "https://raw.githubusercontent.com/rashida048/Datasets/master/titanic_data.csv"
target_csv_path = "titanic_data.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

Download ready.


In [7]:
df = pd.read_csv("titanic_data.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We again see the passengers as rows and attributes such as age, survival and sex as columns (features). What is the average age of male and female passengers on the ship, respectively? Enter [groupby](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html):

In [9]:
age_sex = df.groupby("Sex")["Age"].mean()
age_sex
# as_index=False more closely resembles SQL: the group labels stay a regular column and the result is a DataFrame with a RangeIndex; with the default as_index=True the group labels become the index (a MultiIndex when grouping by several keys)
# .count() excludes possible NaN values; .size() includes them

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
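
The two comments in the cell above can be checked on a small example. The following is a minimal sketch on a hypothetical toy DataFrame (not the Titanic file) in which one age is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second male passenger has no recorded age
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female"],
    "Age": [30.0, 20.0, np.nan, 40.0],
})

grouped = toy.groupby("Sex")["Age"]

mean_series = grouped.mean()        # Series indexed by the group labels; NaN is skipped
count_per_group = grouped.count()   # counts non-missing ages only
size_per_group = grouped.size()     # counts all rows, including missing ages

# as_index=False keeps "Sex" as a regular column and returns a DataFrame
mean_frame = toy.groupby("Sex", as_index=False)["Age"].mean()
print(mean_frame)
```

Here `count_per_group` and `size_per_group` differ only for the group that contains the missing value.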

In [11]:
age_sex = df.groupby("Sex")
age_sex
print(age_sex)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2616ffcc90>


Groupby operations are often split into three steps:

1. **Split**: Split your data into different categories based on some criteria

2. **Apply**: Apply some aggregation operations (such as sum, mean, ...) to the different groups

3. **Combine**: Combine the per-group results into a new data structure (a Series or DataFrame) holding the final result

<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/06_groupby.svg?raw=1", width=500>
</div>

The split operation is lazily implemented in pandas; the groupby operation itself does not do anything we can directly see, but prepares a groupby object which we can further use. We can for example see the different categories by iterating over the groupby object:

In [12]:
for sex, frame in age_sex:
    print(f"First 2 entries for {sex!r}")
    print("---------------")
    print(frame.head(2), end="\n\n")

First 2 entries for 'female'
---------------
   PassengerId  Survived  Pclass  \
1            2         1       1   
2            3         1       3   

                                                Name     Sex   Age  SibSp  \
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  

First 2 entries for 'male'
---------------
   PassengerId  Survived  Pclass                      Name   Sex   Age  SibSp  \
0            1         0       3   Braund, Mr. Owen Harris  male  22.0      1   
4            5         0       3  Allen, Mr. William Henry  male  35.0      0   

   Parch     Ticket  Fare Cabin Embarked  
0      0  A/5 21171  7.25   NaN        S  
4      0     373450  8.05   NaN        S  



The groupby object also behaves somewhat like a dictionary: its `.groups` attribute maps each group name (key) to the index labels of the rows belonging to that group (value). So we can index `.groups` like a normal dictionary:

In [13]:
dir(age_sex)

['Age',
 'Cabin',
 'Embarked',
 'Fare',
 'Name',
 'Parch',
 'PassengerId',
 'Pclass',
 'Sex',
 'SibSp',
 'Survived',
 'Ticket',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__orig_bases__',
 '__parameters__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_agg_examples_doc',
 '_agg_general',
 '_agg_py_fallback',
 '_aggregate_frame',
 '_aggregate_item_by_item',
 '_aggregate_with_numba',
 '_apply_allowlist',
 '_apply_filter',
 '_apply_to_column_groupbys',
 '_bool_agg',
 '_can_use_transform_fast',
 '_choose_path',
 '_concat_objects',
 '_constructor',
 '_cumcount_array',
 '_cython_agg_ge

As you see, the groupby object has a lot of methods! In general, you can categorize them into one of the following:

1. **Aggregation/reduction methods**: Squash many data points into a single number per group (e.g. mean, median, sum, count)

2. **Filter methods** (including .filter()): Give back a subset of the original DataFrame, either columns or rows, based on some boolean expression

3. **Transformation methods**: Transform data in the DataFrame (e.g. unit conversion), but leave the shape of the DataFrame untouched

4. **Meta methods**: Do not change the original DataFrame, but give you information about the outcome of the grouping/splitting process (e.g. .groups, .get_group(), ...)

5. **Plotting methods**: Similar to normal pandas plotting, but give back multiple subplots based on the grouping.

Here is a visual intuition for those categories:

<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/Groupby_methods.png?raw=1", width=800>

</div>
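
To make the first four categories concrete, here is a small sketch on a hypothetical toy DataFrame (the column values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: two female and three male passengers
toy = pd.DataFrame({
    "Sex": ["female", "male", "male", "female", "male"],
    "Fare": [10.0, 20.0, 40.0, 30.0, 60.0],
})
by_sex = toy.groupby("Sex")

# Aggregation: one value per group
mean_fare = by_sex["Fare"].mean()

# Filter: keep only the rows of groups that satisfy a condition
big_groups = by_sex.filter(lambda g: len(g) >= 3)

# Transformation: result has one value per input row
fare_minus_group_mean = toy["Fare"] - by_sex["Fare"].transform("mean")

# Meta: information about the grouping itself
n_groups = by_sex.ngroups
```

Note how the aggregation shrinks the data to one row per group, the filter returns a subset of the original rows, and the transformation keeps the original length.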


In [14]:
age_sex.groups["female"]

Int64Index([  1,   2,   3,   8,   9,  10,  11,  14,  15,  18,
            ...
            866, 871, 874, 875, 879, 880, 882, 885, 887, 888],
           dtype='int64', length=314)

In [15]:
age_sex.get_group("male")  # same as df.loc[df["Sex"] == "male"]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
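
The relation between `.groups`, `.get_group()` and a boolean mask can be verified on a small example. This is a sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with a default RangeIndex (0..3)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "female"],
})
by_sex = toy.groupby("Sex")

# .groups maps each group name to the row labels belonging to it
row_labels = {name: list(labels) for name, labels in by_sex.groups.items()}

# .get_group() returns the rows of one group ...
males_via_group = by_sex.get_group("male")
# ... which matches selecting them with a boolean mask
males_via_mask = toy.loc[toy["Sex"] == "male"]

print(row_labels)
print(males_via_group.equals(males_via_mask))
```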


In the apply step, we choose some specific operation (aggregation, transformation, filtration) and apply it to every data subset that we produced during splitting:

In [16]:
#get first tuple from iterator groupby object:
sex, frame = next(iter(age_sex))
sex
frame.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [20]:
#apply stage applied to one of the DataFrames containing a data subset:
frame["Age"].mean()

27.915708812260537

The combine step then simply collects the results from the different subsets into a single result. Here is the visual intuition from the beginning again, together with the corresponding commands we used (example from the [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html)):

<div>
<img src="https://github.com/kdidi99/Python_for_Biochemists/blob/main/images/06_groupby_select_detail.jpg?raw=1", width=900>
</div>
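
The three steps can also be written out by hand. The sketch below uses a hypothetical toy DataFrame and only illustrates the idea; it does not show how pandas implements `groupby` internally:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [38.0, 22.0, 26.0, 35.0],
})

# Split: iterate over one sub-frame per group
# Apply: compute the mean age of each sub-frame
partial_results = {sex: frame["Age"].mean() for sex, frame in toy.groupby("Sex")}

# Combine: collect the per-group results into one Series
manual = pd.Series(partial_results, name="Age").rename_axis("Sex")

# The one-liner yields the same result
direct = toy.groupby("Sex")["Age"].mean()
print(manual.equals(direct))
```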

So far, we grouped based on a single column name given as a string, e.g. `df.groupby("Sex")`. But we could also pass other arguments, for example a list of column names, a dict, a pandas Series, a NumPy array or a pandas Index (or array-like objects similar to the last two).
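
As a quick illustration of these key types, here is a sketch on a hypothetical toy DataFrame; the labels `"upper"`, `"lower"`, `"a"`, `"b"`, `"first_half"` and `"second_half"` are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 3, 3, 3],
    "Fare": [70.0, 7.0, 8.0, 9.0],
})

# A list of column names: one group per (Sex, Pclass) combination
by_two_columns = toy.groupby(["Sex", "Pclass"])["Fare"].mean()

# A Series aligned with the rows (here derived from an existing column)
deck = toy["Pclass"].map({1: "upper", 3: "lower"})
by_series = toy.groupby(deck)["Fare"].mean()

# A NumPy array with one label per row
labels = np.array(["a", "b", "a", "b"])
by_array = toy.groupby(labels)["Fare"].mean()

# A dict mapping index labels to group names
halves = {0: "first_half", 1: "first_half", 2: "second_half", 3: "second_half"}
by_dict = toy.groupby(halves)["Fare"].mean()
```

In every case the key boils down to one label per row, which is all `groupby` needs for the split.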

In [None]:
df_air = pd.read_csv(
    "../additional_data/airqual.csv",
    parse_dates=[["Date", "Time"]],
    na_values=[-200],
    usecols=["Date", "Time", "CO(GT)", "T", "RH", "AH"]
).rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    }
).set_index("tstamp")
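
Depending on the installed pandas version, combining columns via a nested list in `parse_dates` may emit a deprecation warning. A possible alternative is to build the timestamp explicitly with `pd.to_datetime`. The sketch below uses a hypothetical two-row excerpt; the column layout and the date format `%m/%d/%Y` are assumptions and may differ from the actual `airqual.csv`:

```python
import io
import pandas as pd

# Hypothetical excerpt; layout and date format are assumptions
csv_text = (
    "Date,Time,CO(GT),T\n"
    "3/10/2004,18:00:00,2.6,13.6\n"
    "3/10/2004,19:00:00,-200,13.3\n"
)

raw = pd.read_csv(io.StringIO(csv_text), na_values=[-200])

# Build the timestamp explicitly from the two text columns
tstamp = pd.to_datetime(raw["Date"] + " " + raw["Time"], format="%m/%d/%Y %H:%M:%S")

df_small = (
    raw.drop(columns=["Date", "Time"])
    .rename(columns={"CO(GT)": "co", "T": "temp_c"})
    .set_index(tstamp.rename("tstamp"))
)
print(df_small)
```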

In [None]:
df_air.head()

tstamp,co,temp_c,rel_hum,abs_hum
2004-03-10 18:00:00,2.6,13.6,48.9,0.758
2004-03-10 19:00:00,2.0,13.3,47.7,0.726
2004-03-10 20:00:00,2.2,11.9,54.0,0.75
2004-03-10 21:00:00,2.2,11.0,60.0,0.787
2004-03-10 22:00:00,1.6,11.2,59.6,0.789


We could, for example, use the timestamp index to get the corresponding weekdays via the .day_name() method, creating a pandas Index containing strings (an "array-like object" which we can use with groupby):

In [None]:
weekdays = df_air.index.day_name()
type(weekdays)
weekdays

Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
       'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday',
       ...
       'Monday', 'Monday', 'Monday', 'Monday', 'Monday', 'Monday', 'Monday',
       'Monday', 'Monday', 'Monday'],
      dtype='object', name='tstamp', length=9357)

In [None]:
df_air.groupby(weekdays)["co"].mean() 
# splitting based on an artificial column created by us

tstamp
Friday       2.543
Monday       2.017
Saturday     1.861
Sunday       1.438
Thursday     2.456
Tuesday      2.382
Wednesday    2.401
Name: co, dtype: float64
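
The groups above are sorted alphabetically, not in calendar order. One way to restore the calendar order is `.reindex()` with an explicit list of weekday names. This is a sketch on hypothetical daily toy data, not on the air-quality file:

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days starting on Monday, 2004-03-08
idx = pd.date_range("2004-03-08", periods=14, freq="D")
toy = pd.DataFrame({"co": range(14)}, index=idx)

# Group by weekday name; the result is sorted alphabetically (Friday first)
by_day = toy.groupby(toy.index.day_name())["co"].mean()

# Reorder the result into calendar order
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_day_ordered = by_day.reindex(order)
print(by_day_ordered)
```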

The same can be done for grouping by hour of the day for each weekday, enabling a finer split:

In [None]:
hr = df_air.index.hour
type(hr) #Pandas Int64 Index object
hr
df_air.groupby([weekdays, hr])["co"].mean().rename_axis(["dow", "hr"])

dow        hr
Friday     0     1.936
           1     1.609
           2     1.172
           3     0.887
           4     0.823
                 ...  
Wednesday  19    4.147
           20    3.845
           21    2.898
           22    2.102
           23    1.938
Name: co, Length: 168, dtype: float64

In [None]:
#turn numerical into categorical variable with cut
#then use the resulting Series object as argument for groupby
bins = pd.cut(df_air["temp_c"], bins=3, labels = ("cool", "warm", "hot"))
bins
type(bins)
df_air.groupby(bins).agg("mean")
df_air.groupby(bins).agg(["mean", "median"])
df_air[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])

temp_c,rel_hum mean,rel_hum median,abs_hum mean,abs_hum median
cool,57.651,59.2,0.666,0.658
warm,49.383,49.3,1.183,1.145
hot,24.994,24.1,1.293,1.274
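
To see what `pd.cut` does with `bins=3`, here is a minimal sketch on hypothetical temperatures. The observed range is divided into three intervals of equal width, and each value receives the label of its interval:

```python
import pandas as pd

# Hypothetical toy data: temperatures between 0 and 30 degrees Celsius
hum = pd.DataFrame({
    "temp_c": [0.0, 5.0, 15.0, 25.0, 30.0],
    "rel_hum": [80.0, 70.0, 50.0, 30.0, 20.0],
})

# bins=3 splits the observed range (0 to 30) into three equal-width intervals
labels = pd.cut(hum["temp_c"], bins=3, labels=("cool", "warm", "hot"))

# observed=False keeps all three categories, even if one of them were empty
mean_hum = hum.groupby(labels, observed=False)["rel_hum"].mean()
print(mean_hum)
```

Because the intervals have equal width rather than equal counts, the groups can contain very different numbers of rows.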


*Add-on*: Some advanced features of groupby using a news dataset

In [None]:
import datetime as dt
import pandas as pd

def parse_millisecond_timestamp(ts: int) -> dt.datetime:
    """Convert ms since Unix epoch to UTC datetime instance."""
    return dt.datetime.fromtimestamp(ts / 1000, tz=dt.timezone.utc)

df_news = pd.read_csv(
    "../additional_data/news.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)
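
In pandas 2.x the `date_parser` argument is, to my knowledge, deprecated. A possible alternative is to read the column as plain integers and convert it afterwards with `pd.to_datetime(..., unit="ms")`. The sketch below uses hypothetical millisecond values; the first one corresponds to the timestamp shown for the first row of the news data:

```python
import pandas as pd

# Hypothetical raw values: milliseconds since the Unix epoch
ms = pd.Series([1394470370698, 1394470371207])

# Vectorized conversion to timezone-aware UTC timestamps
tstamp = pd.to_datetime(ms, unit="ms", utc=True)
print(tstamp)
```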

In [None]:
df_news.iloc[0]

title       Fed official says weak data caused by weather,...
url         http://www.latimes.com/business/money/la-fi-mo...
outlet                                      Los Angeles Times
category                                                    b
cluster                         ddUyU0VZz0BRneMioxUPQVP6sIxvM
host                                          www.latimes.com
tstamp                       2014-03-10 16:52:50.698000+00:00
Name: 1, dtype: object

With lambda functions, we can answer more sophisticated questions based not only on numerical data but also on textual information. For example: which outlets mention the Senate or Wall Street most often in their headlines?

In [None]:
# count, per outlet, the titles that contain "Senate"
df_news.groupby("outlet", sort=False)["title"].apply(
    lambda select: select.str.contains("Senate").sum()).nlargest(5)

outlet
Reuters                    11
GlobalPost                  8
Law360 (subscription)       7
Businessweek                6
Los Angeles Times           5
Name: title, dtype: int64

In [None]:
# count, per outlet, the titles that contain "Wall Street"
df_news.groupby("outlet", sort=False)["title"].apply(
    lambda select: select.str.contains("Wall Street").sum()).nlargest(5)

outlet
Proactive Investors USA & Canada     69
Economic Times                       36
RTT News                             34
Business Standard                    32
Independent Online                   32
Name: title, dtype: int64

To look at what is going on, we can again iterate over the pandas GroupBy object and look at a single tuple corresponding to the first group:

In [None]:
# first (group name, group data) tuple: the outlet name and its titles
outlet, sele = next(iter(df_news.groupby("outlet", sort=False)["title"]))
outlet
sele
type(sele)  # Series object, not DataFrame, since we selected only one column

pandas.core.series.Series

In [None]:
#construct boolean mask by invoking string methods
sele.str.contains("Senate")

1         False
486       False
1124      False
1146      False
1237      False
          ...  
421547    False
421584    False
421972    False
422226    False
422905    False
Name: title, Length: 1976, dtype: bool

In [None]:
sele.str.contains("Senate").sum() #False=0, True=1

5
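
The same count can also be obtained without a lambda by building the boolean mask once for all titles and then grouping the mask by outlet. This is a sketch on a hypothetical toy DataFrame with invented outlets and titles:

```python
import pandas as pd

# Hypothetical toy data
news = pd.DataFrame({
    "outlet": ["A", "A", "B", "B", "B"],
    "title": ["Senate votes", "Weather", "Senate bill", "Senate hearing", "Markets"],
})

# Variant used above: apply a lambda to each group of titles
via_apply = news.groupby("outlet", sort=False)["title"].apply(
    lambda titles: titles.str.contains("Senate").sum()
)

# Alternative: one boolean mask for all rows, then group the mask by outlet
mentions = news["title"].str.contains("Senate")
via_mask = mentions.groupby(news["outlet"], sort=False).sum()

print(via_apply)
print(via_mask)
```

On large datasets the second variant is usually faster, because the string matching runs once over the whole column instead of once per group.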

**Conclusion**: .groupby() works by the split-apply-combine paradigm and can use many different input types as long as they can be read as a sequence of labels to perform the grouping/splitting on.