
Add DataFrame level metadata #5117

Open
freeformstu opened this issue Oct 5, 2022 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@freeformstu

Problem description

I would like to be able to track DataFrame-specific metadata through processing, serialization, and deserialization.

A common use case for DataFrame metadata is to record how the DataFrame was generated, or to describe the data contained within its columns.

Below are some examples of existing libraries and formats which support dataframe-level metadata. I am definitely open to putting this metadata elsewhere if there's a better place for it.

Arrow

With PyArrow, you can add metadata to a Schema using its with_metadata method.
https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.with_metadata
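
For illustration, a minimal sketch of that API; the keys and values are stored as bytes, and with_metadata returns a new Schema rather than mutating the original:

import pyarrow as pa

# Attach key/value metadata to a pyarrow Schema.
schema = pa.schema([pa.field("a", pa.int64())])
schema_with_meta = schema.with_metadata({"source": "sensor-42"})
print(schema_with_meta.metadata)  # {b'source': b'sensor-42'}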

IPC

Arrow's IPC format can store file-level metadata.
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
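
As a small sketch with pyarrow (the key/value pairs here are just examples), schema-level metadata survives a round trip through the IPC file format:

import pyarrow as pa

# Write a table whose schema carries custom metadata to an IPC file.
table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"source": "sensor-42"})
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back; the metadata is still attached to the schema.
with pa.OSFile("data.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
print(loaded.schema.metadata)  # {b'source': b'sensor-42'}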

Parquet

Parquet has file- and column-level metadata. Per-column metadata may be useful for some use cases, but for the purposes of this issue I'd like to scope the metadata to the file level.
https://parquet.apache.org/docs/file-format/metadata/
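
A minimal sketch with pyarrow (the generated_by key is just an example): schema-level key/value pairs are written into the Parquet footer and can be read back without loading the column data:

import pyarrow as pa
import pyarrow.parquet as pq

# File-level key/value metadata travels in the Parquet footer.
table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"generated_by": "pipeline-v2"})
pq.write_table(table, "data.parquet")

# read_metadata inspects only the footer, not the column data.
file_meta = pq.read_metadata("data.parquet").metadata
print(file_meta[b"generated_by"])  # b'pipeline-v2'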

Avro

Avro supports file-level metadata.
https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files
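
For example, assuming the fastavro library, whose writer accepts a metadata dict that is stored in the object container file's header:

import fastavro

schema = {
    "type": "record",
    "name": "Row",
    "fields": [{"name": "a", "type": "long"}],
}
records = [{"a": 1}, {"a": 2}]

# The metadata dict is written into the Avro container file header.
with open("data.avro", "wb") as out:
    fastavro.writer(out, schema, records, metadata={"source": "pipeline-v2"})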

Pandas

Pandas has an attrs attribute for the same purpose.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
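
A minimal sketch; note that attrs is documented as experimental, and it is not preserved by most I/O formats:

import pandas as pd

# attrs is a plain dict hanging off the DataFrame object.
df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["source"] = "sensor-42"
print(df.attrs)  # {'source': 'sensor-42'}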

@freeformstu freeformstu added the enhancement New feature or an improvement of an existing feature label Oct 5, 2022
@kylebarron

kylebarron commented Oct 7, 2022

This is something I've thought about a little for my geospatial use case, where it's helpful to be able to store extra (non-columnar) information along with the data, like its coordinate reference system. In my case, I think I'd opt for Arrow extension data type support (which allows for column-level metadata) instead of dataframe-level metadata, but I can see how that wouldn't fit every use case.

@ritchie46
Member

I think you should use a new-type pattern for this if you want the DataFrame to have some extra data.

With regard to extension types: this is something we want to support. We first need FixedSizeList, and then we can work on extension types.
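
For illustration, a minimal sketch of the new-type pattern in Python, wrapping the DataFrame by composition; the AnnotatedFrame name and the delegation strategy are hypothetical, not an official API:

import polars as pl

class AnnotatedFrame:
    """Hypothetical new-type: a DataFrame plus arbitrary metadata."""

    def __init__(self, df: pl.DataFrame, meta: dict | None = None):
        self.df = df
        self.meta = meta or {}

    def __getattr__(self, name):
        # Delegate anything we don't define to the wrapped DataFrame.
        return getattr(self.df, name)

af = AnnotatedFrame(pl.DataFrame({"a": [1, 2, 3]}), meta={"crs": "EPSG:4326"})
print(af.meta["crs"], af.shape)  # EPSG:4326 (3, 1)

The trade-off of this design is that any Polars operation still returns a plain DataFrame, so the metadata has to be re-attached explicitly after each transformation.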

@universalmind303
Collaborator

I think it could make a lot of sense to store some metadata on the schema itself; arrow2 is already doing this. I can think of many scenarios where it could be used. For example, with the new binary dtype, it would be helpful to store some metadata about the encoding of the binary data.
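
pyarrow already exposes this at the field level; a small sketch (the encoding key is only an example):

import pyarrow as pa

# Column-level metadata attached to an individual field of the schema.
field = pa.field("payload", pa.binary(), metadata={"encoding": "utf-8"})
schema = pa.schema([field])
print(schema.field("payload").metadata)  # {b'encoding': b'utf-8'}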

@markdoerr

I am also looking for a persistent way to store additional metadata on Polars columns, as @freeformstu suggested.
This would be very useful for attaching semantic information to a column, e.g. for machine learning purposes. An IRI pointing into an ontology would enable the data to be interpreted autonomously: AI/ML algorithms could "understand" the meaning of a given data column, e.g. column "time" -> "EMMO:time". Once a connection to an ontology is made, a lot of information (units, relations, etc.) can be extracted. I hope this example makes it clear why metadata is needed (it would also be sad to lose information when moving from Parquet or PyArrow to Polars).

@Insighttful

Insighttful commented Feb 2, 2024

@freeformstu I think you could accomplish this by subclassing the polars.DataFrame type as a MetaDataFrame and using a SimpleNamespace for the meta, so you retain dot-accessor access:

import meta_dataframe as mdf
from meta_dataframe import MetaDataFrame

df = MetaDataFrame({"a": [1, 2, 3]})
df.meta.name = "my_df"
df.meta.crs = "EPSG:4326"
df.meta.crs = "EPSG:3857"  # overwriting works as expected
print(df.meta.crs)

>>> EPSG:3857

By leveraging the Arrow IPC specification, you can provide additional functions to write and read the DataFrame while also managing the metadata.

filepath = f"{df.meta.name}.ipc"  # use a fully qualified path if desired
df.write_ipc_with_meta(filepath)
loaded_df = mdf.read_ipc_with_meta(filepath)
print(loaded_df.meta.crs)

>>> EPSG:3857

As I answered on Stack Overflow, here is the full MetaDataFrame module example:
# meta_dataframe.py

"""Provides functionality for handling Polars DataFrames with custom metadata.

This module enables the serialization and deserialization of Polars DataFrames
along with associated metadata, utilizing the IPC format for data interchange
and `orjson` for fast JSON processing. Metadata management is facilitated through
the use of the `DfMeta` class, a flexible container for arbitrary metadata fields.
Key functions include `write_ipc_with_meta` and `read_ipc_with_meta`, which allow
for the persistence of metadata across storage cycles, enhancing data context
retention and utility in analytical workflows.

Note:
    This module was not written for efficiency or performance, but to solve the
    use case of persisting metadata with Polars DataFrames. It is not recommended for
    production use, but rather as a starting point for more robust metadata management.

Classes:
    DfMeta: A simple namespace for metadata management.
    MetaDataFrame: An extension of Polars DataFrame to include metadata.

Functions:
    write_ipc_with_meta(df, filepath, meta): Serialize DataFrame and metadata to IPC.
    read_ipc_with_meta(filepath): Deserialize DataFrame and metadata from IPC.
"""


# Standard Library
from typing import Any
from types import SimpleNamespace

# Third Party
import orjson
import polars as pl
import pyarrow as pa


class DfMeta(SimpleNamespace):
    """A simple namespace for storing MetaDataFrame metadata.

    Usage:
        meta = DfMeta(
            name="checkins",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
    """

    # Generate a string representation of metadata keys
    def __repr__(self) -> str:
        keys = ", ".join(self.__dict__.keys())
        return f"DfMeta({keys})"

    # Alias __str__ to __repr__ for consistent string representation
    def __str__(self) -> str:
        return self.__repr__()


class MetaDataFrame(pl.DataFrame):
    """A Polars DataFrame extended to include custom metadata.

    Attributes:
        meta (DfMeta): A simple namespace for storing metadata.

    Usage:

        # Create MetaDataFrame with metadata
        meta = DfMeta(
            name="my_df",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326"
        )
        df = MetaDataFrame({"a": [1, 2, 3]}, meta=meta)

        # Create MetaDataFrame then add metadata
        df = MetaDataFrame({"a": [1, 2, 3]})
        df.meta.name = "my_df"
        df.meta.db_name = "my_db"
        df.meta.tz_name = "America/New_York"
        df.meta.crs = "EPSG:4326"

        # Overwrite metadata
        df.meta.crs = "EPSG:3857"

        # Write MetaDataFrame to IPC with metadata
        df.write_ipc_with_meta("my_df.ipc")

        # Read MetaDataFrame from IPC with metadata
        loaded_df = read_ipc_with_meta("my_df.ipc")

        # Access metadata
        print(loaded_df.meta.name)
        print(loaded_df.meta_as_dict())
    """

    # Initialize the DataFrame with a `meta` attribute (a DfMeta namespace)
    def __init__(self, data: Any = None, *args, meta: DfMeta | None = None, **kwargs):
        super().__init__(data, *args, **kwargs)
        self.meta = meta if meta is not None else DfMeta()

    def meta_as_dict(self) -> dict[str, Any]:
        """Returns the metadata as a dictionary.

        Returns:
            dict[str, Any]: A dictionary representation of the metadata.
        """
        return vars(self.meta)

    def write_ipc_with_meta(self, filepath: str) -> None:
        """Serialize MetaDataFrame and metadata stored in `meta` attr to an IPC file.

        Args:
            filepath (str): The path to the IPC file.

        Returns:
            None
        """
        # Convert Polars DataFrame to Arrow Table
        arrow_table = self.to_arrow()

        # Serialize metadata to JSON (orjson returns bytes)
        meta_json = orjson.dumps(vars(self.meta))

        # Embed the metadata into the Arrow schema's key/value metadata
        arrow_table_with_meta = arrow_table.replace_schema_metadata({"meta": meta_json})

        # Write Arrow table with metadata to IPC file
        with pa.OSFile(filepath, "wb") as sink:
            with pa.RecordBatchStreamWriter(
                sink, arrow_table_with_meta.schema
            ) as writer:
                writer.write_table(arrow_table_with_meta)


def read_ipc_with_meta(filepath: str) -> MetaDataFrame:
    """Deserialize DataFrame and metadata from an IPC file.

    Args:
        filepath (str): The path to the IPC file.

    Returns:
        MetaDataFrame: The deserialized DataFrame with metadata stored in `meta` attr.
    """
    # Read Arrow table from IPC file
    with pa.OSFile(filepath, "rb") as source:
        reader = pa.ipc.open_stream(source)
        table = reader.read_all()

    # Extract and deserialize metadata from the Arrow schema (it may be absent)
    schema_metadata = table.schema.metadata or {}
    meta_json = schema_metadata.get(b"meta")
    if meta_json:
        meta_dict = orjson.loads(meta_json)
        meta = DfMeta(**meta_dict)
    else:
        meta = DfMeta()

    # Construct the MetaDataFrame directly from the Arrow table and attach metadata
    return MetaDataFrame(table, meta=meta)

@AlexanderNenninger

AlexanderNenninger commented Apr 14, 2024

I recently ran into the same issue with sensor data. I'd really like to preserve units, orientations, sampling frequency, etc. through processing, as it helps with catching bad data.
Maybe this could be a Polars extension, though, or people could roll their own implementation on a per-project basis.

During my experiments, I found that adding support for custom data in the (de-)serialization methods where it makes sense (it could be as simple as accepting an additional Dict[bytes, bytes]) would simplify the implementation dramatically and could make it more robust.

E.g. currently there's really no good way of storing a Categorical(ordering="lexical") column in Parquet through PyArrow. Hive partitioning also has a few pitfalls w.r.t. data types.
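
To make that concrete, a purely hypothetical sketch of what such an opt-in parameter could look like; custom_metadata does not exist in Polars and is invented here only for illustration:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Hypothetical signature: forward key/value pairs into the target format's
# file-level metadata (Parquet footer, IPC schema metadata, etc.).
df.write_parquet(
    "data.parquet",
    custom_metadata={b"crs": b"EPSG:4326"},  # not a real parameter
)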

Is it bad style to link my own repo? In any case, if someone needs a temporary solution with a lot of pitfalls already ironed out: https://github.com/AlexanderNenninger/parquet_data_classes/tree/main
