[ENH] Compatibility with pandas.Grouper() for datar.all.group_by() #215

@Longistamina

Description

Feature Type

  • Adding new functionality to datar

  • Changing existing functionality in datar

  • Removing existing functionality in datar

Problem Description

Hi Mr. Pwwang, I hope everything is going well on your journey.

Recently I discovered another small issue with the group_by() function in your library, so I am raising another request for your help. I hope you will consider it.

The problem is that the group_by() function does not work well with pandas.Grouper() for more complicated grouping keys. Below I use the air quality data for illustration. Here is a link to the dataset: https://github.com/pandas-dev/pandas/blob/main/doc/data/air_quality_no2_long.csv
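As a quick refresher (on synthetic data, not the issue's dataset), pd.Grouper lets a plain DataFrame.groupby() bin a datetime column into fixed-frequency windows; all names below are made up for illustration:

```python
import pandas as pd

# Synthetic data: four daily observations.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02",
                          "2024-01-03", "2024-01-04"]),
    "value": [1.0, 3.0, 5.0, 7.0],
})

# pd.Grouper bins the "ts" column into 2-day windows, so groupby
# aggregates one row per window instead of per unique timestamp.
out = (
    df.groupby(pd.Grouper(key="ts", freq="2D"))["value"]
    .mean()
    .reset_index()
)
print(out)  # two rows: one per 2-day window
```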

#----------------------------------------------#
#--------------- Data preparation -------------#
#----------------------------------------------#

import datar.all as dr
from datar import f
import pandas as pd

from pipda import register_verb
dr.filter = register_verb(func = dr.filter_)

# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")

df_aq = (
    pd.read_csv("05_Pandas_DataR_dataframe/data/air_quality_no2_long.csv")
    .rename(columns={"date.utc": "date"})
    .assign(date = lambda df: pd.to_datetime(df["date"], format="%Y-%m-%d %H:%M:%S%z"))
)

print(df_aq.head(3))
#       city  country                      date location parameter     value     unit
#   <object> <object>     <datetime64[ns, UTC]> <object>  <object> <float64> <object>
# 0    Paris       FR 2019-06-21 00:00:00+00:00  FR04014       no2      20.0    µg/m³
# 1    Paris       FR 2019-06-20 23:00:00+00:00  FR04014       no2      21.8    µg/m³
# 2    Paris       FR 2019-06-20 22:00:00+00:00  FR04014       no2      26.5    µg/m³

#-----------------------------------------------#
#-------------- Try df.groupby() ---------------#
#-----------------------------------------------#

print(
    df_aq
    .groupby(pd.Grouper(key="date", freq="5D"))
    .agg(value_mean = ("value", "mean")) # Calculate the mean of "value" column every 5 days
    .reset_index()
)
#                        date  value_mean
#       <datetime64[ns, UTC]>   <float64>
# 0 2019-05-07 00:00:00+00:00   30.286017
# 1 2019-05-12 00:00:00+00:00   24.975304
# 2 2019-05-17 00:00:00+00:00   30.772917
# 3 2019-05-22 00:00:00+00:00   32.298340
# 4 2019-05-27 00:00:00+00:00   20.337705
# 5 2019-06-01 00:00:00+00:00   25.743933
# 6 2019-06-06 00:00:00+00:00   19.717273
# 7 2019-06-11 00:00:00+00:00   25.300855
# 8 2019-06-16 00:00:00+00:00   25.027119
# 9 2019-06-21 00:00:00+00:00   20.000000
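For reference, grouping by pd.Grouper(key=..., freq=...) should be equivalent to resample(..., on=...) when there is a single datetime key; a minimal check on synthetic data (the names df, a, b are illustrative, not from the issue):

```python
import pandas as pd

# Synthetic data (illustrative only).
df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02",
                          "2024-01-03", "2024-01-04"]),
    "value": [1.0, 3.0, 5.0, 7.0],
})

# Grouping by a Grouper on a key column ...
a = df.groupby(pd.Grouper(key="ts", freq="2D"))["value"].mean()

# ... should produce the same bins as resampling on that column.
b = df.resample("2D", on="ts")["value"].mean()

print(a.equals(b))
```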

#-----------------------------------------------#
#----------- Try with dr.group_by() ------------#
#-----------------------------------------------#

print(
    df_aq
    >> dr.group_by(pd.Grouper(key="date", freq="5D"))
    >> dr.summarize(value_mean = f.value.mean()) # Calculate the mean of "value" column every 5 days
)

#                                                  ...  value_mean
#                                             <object>   <float64>
# 0  TimeGrouper(key='date', freq=<5 * Days>, axis=...   26.261847

'''Something goes wrong here: the Grouper object itself shows up as the (single) group key, so the entire frame is aggregated into one group instead of 5-day bins.'''
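Until group_by() understands Grouper objects, one possible workaround is to materialize the time bin as an ordinary column and group on that. A caveat: dt.floor() aligns bins to the Unix epoch, while Grouper's default origin is the start of the first observed day, so the bin edges can differ. This is only a sketch, and the column name ts_bin is made up for illustration:

```python
import pandas as pd

# Synthetic data (illustrative only).
df = pd.DataFrame({
    "ts": pd.to_datetime(["1970-01-01", "1970-01-04", "1970-01-07"]),
    "value": [1.0, 2.0, 3.0],
})

# Materialize the bin as a plain column: dt.floor("5D") snaps each
# timestamp down to a multiple of 5 days counted from the Unix epoch.
df["ts_bin"] = df["ts"].dt.floor("5D")

# A plain column works with any grouping API, including datar, e.g.
#   df >> dr.group_by(f.ts_bin) >> dr.summarize(value_mean=f.value.mean())
out = df.groupby("ts_bin")["value"].mean().reset_index()
print(out)
```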

Feature Description

Ideally, dr.group_by() would accept a pandas.Grouper() and delegate it to pandas, so the same pipeline produces the 5-day bins:

print(
    df_aq
    >> dr.group_by(pd.Grouper(key="date", freq="5D"))
    >> dr.summarize(value_mean = f.value.mean()) # Calculate the mean of "value" column every 5 days
)

#                        date  value_mean
#       <datetime64[ns, UTC]>   <float64>
# 0 2019-05-07 00:00:00+00:00   30.286017
# 1 2019-05-12 00:00:00+00:00   24.975304
# 2 2019-05-17 00:00:00+00:00   30.772917
# 3 2019-05-22 00:00:00+00:00   32.298340
# 4 2019-05-27 00:00:00+00:00   20.337705
# 5 2019-06-01 00:00:00+00:00   25.743933
# 6 2019-06-06 00:00:00+00:00   19.717273
# 7 2019-06-11 00:00:00+00:00   25.300855
# 8 2019-06-16 00:00:00+00:00   25.027119
# 9 2019-06-21 00:00:00+00:00   20.000000

Additional Context

No response

Metadata

Labels: enhancement (New feature or request)