ENH: Add convenience API to summarize null counts grouped by dtype (e.g. df.dtype_nulls.summary()) #62833

@Princu1999

Description


Feature Type

- [x] Adding new functionality to pandas

- [ ] Changing existing functionality in pandas

- [ ] Removing existing functionality in pandas

Problem Description

Add a small convenience API to provide a quick, per-dtype view of missing values in a DataFrame. The utility should list columns grouped by dtype with null counts and optional null percentages, and return both a one-row-per-dtype summary and a per-dtype detail table (columns + null counts).

This is a diagnostic convenience (similar in spirit to df.info(show_counts=True) but grouped by dtype and returning programmatic output).
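For reference, the closest ad-hoc equivalent today is to group per-column null counts by each column's dtype string (a minimal sketch, not an existing pandas API; the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": ["x", None, "z"],
})

# Per-column null counts, grouped by each column's dtype string.
# df.isna().sum() is indexed by column name, so it aligns with df.dtypes.
nulls_by_dtype = df.isna().sum().groupby(df.dtypes.astype(str)).sum()
```

This yields the per-dtype totals but not the per-dtype detail tables or percentages that the proposed accessor would return.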

Feature Description

Add a DataFrame accessor that provides a compact, programmatic summary of missing values grouped by column dtype.

```python
@pd.api.extensions.register_dataframe_accessor("dtype_nulls")
class DtypeNullsAccessor:
    def __init__(self, df):
        self._df = df

    def summary(self, include_pct: bool = True, sort_desc: bool = True):
        """
        Return (summary_df, detail_dict).

        Parameters
        ----------
        include_pct : bool, default True
            Include null_pct columns (percentage of nulls relative to len(df)).
        sort_desc : bool, default True
            Sort per-dtype detail tables by null_count descending when True.

        Returns
        -------
        summary_df : pd.DataFrame
            One row per dtype with columns:
              - dtype : str (dtype string, e.g., 'float64', 'object')
              - n_columns : int
              - cols_with_nulls : int
              - total_nulls : int
              - avg_null_pct : float (if include_pct)
        detail_dict : dict[str, pd.DataFrame]
            Mapping dtype string -> DataFrame listing columns of that dtype
            with columns ['column', 'null_count', 'null_pct'?] (null_pct
            present if include_pct).
        """
```

Implementation sketch / pseudocode:

```python
# Body of summary():
df = self._df
nrows = len(df)
per_col = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum().values,
})
if include_pct:
    per_col["null_pct"] = per_col["null_count"] / (nrows if nrows else 1) * 100

detail = {
    dtype: g.sort_values("null_count", ascending=not sort_desc).reset_index(drop=True)
    for dtype, g in per_col.groupby("dtype")
}

agg = per_col.groupby("dtype").agg(
    n_columns=("column", "count"),
    cols_with_nulls=("null_count", lambda s: (s > 0).sum()),
    total_nulls=("null_count", "sum"),
).reset_index()

if include_pct:
    agg["avg_null_pct"] = per_col.groupby("dtype")["null_pct"].mean().values

return agg, detail
```

Expected behaviour / examples:

```python
df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 2.0],
    "c": ["x", "y", None],
    "d": [True, False, True],
})
summary, detail = df.dtype_nulls.summary()
```

- `summary`: rows for `'float64'`, `'object'`, and `'bool'` with counts and percentages
- `detail['float64']` lists columns `'b'` and `'a'` with `null_count` and `null_pct`
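For completeness, the signature, sketch, and example above can be combined into one runnable script (illustrative only; the `dtype_nulls` accessor name and output columns are the proposal, not an existing pandas API):

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("dtype_nulls")
class DtypeNullsAccessor:
    def __init__(self, df):
        self._df = df

    def summary(self, include_pct=True, sort_desc=True):
        df = self._df
        nrows = len(df)
        # One row per column: name, dtype string, null count.
        per_col = pd.DataFrame({
            "column": df.columns,
            "dtype": df.dtypes.astype(str),
            "null_count": df.isna().sum().values,
        })
        if include_pct:
            per_col["null_pct"] = per_col["null_count"] / (nrows or 1) * 100
        # Per-dtype detail tables, sorted by null_count.
        detail = {
            dtype: g.sort_values("null_count", ascending=not sort_desc)
                    .reset_index(drop=True)
            for dtype, g in per_col.groupby("dtype")
        }
        # One row per dtype.
        agg = per_col.groupby("dtype").agg(
            n_columns=("column", "count"),
            cols_with_nulls=("null_count", lambda s: int((s > 0).sum())),
            total_nulls=("null_count", "sum"),
        ).reset_index()
        if include_pct:
            # groupby keys are sorted identically in both aggregations.
            agg["avg_null_pct"] = per_col.groupby("dtype")["null_pct"].mean().values
        return agg, detail


df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 2.0],
    "c": ["x", "y", None],
    "d": [True, False, True],
})
summary, detail = df.dtype_nulls.summary()
```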

Alternative Solutions

One-liner / ad-hoc: Users can already compute this with a short snippet:

```python
(pd.DataFrame({"dtype": df.dtypes.astype(str), "nulls": df.isna().sum()})
 .reset_index()
 .groupby("dtype")[["index", "nulls"]])
```
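Carried through to an aggregated result, this ad-hoc approach might look like the following (a self-contained sketch; `per_col` and `agg` are illustrative names, not part of any proposed API):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, None, 3],
    "b": [None, None, 2.0],
    "c": ["x", "y", None],
})

# Per-column null counts alongside each column's dtype string.
per_col = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
})

# Aggregate to one row per dtype: column count and total nulls.
agg = per_col.groupby("dtype").agg(
    n_columns=("nulls", "size"),
    total_nulls=("nulls", "sum"),
)
```

This shows the per-dtype summary is already achievable, but users must rebuild it (and the detail tables) by hand each time, which is what the proposed accessor would package up.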

Additional Context

Related design rationale:

This feature is a convenience diagnostic that complements df.info() and profiling packages; it returns programmatic data structures (DataFrames and dict) so downstream tooling and tests can consume results.
