
Add DataFrame level metadata #5117

Open
freeformstu opened this issue Oct 5, 2022 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@freeformstu

Problem description

I would like to be able to track DataFrame-specific metadata through processing, serialization, and deserialization.

A common use case for DataFrame metadata is to record how the DataFrame was generated, or to describe the data contained within its columns.

Below are some examples of existing libraries and formats which support dataframe-level metadata. I am definitely open to putting this metadata elsewhere if there's a better place for it.

Arrow

With PyArrow, you can add metadata to a Schema using its with_metadata method.
https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.with_metadata
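
For illustration, a minimal sketch of that API; the keys and values are stored as bytes, and with_metadata returns a new Schema rather than mutating the original:

import pyarrow as pa

# Attach key/value metadata to a pyarrow Schema.
schema = pa.schema([pa.field("a", pa.int64())])
schema_with_meta = schema.with_metadata({"source": "sensor-42"})
print(schema_with_meta.metadata)  # {b'source': b'sensor-42'}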

IPC

Arrow's IPC format can store file-level metadata.
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
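
As a small sketch with pyarrow (the key/value pairs here are just examples), schema-level metadata survives a round trip through the IPC file format:

import pyarrow as pa

# Write a table whose schema carries custom metadata to an IPC file.
table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"source": "sensor-42"})
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back; the metadata is still attached to the schema.
with pa.OSFile("data.arrow", "rb") as source:
    loaded = pa.ipc.open_file(source).read_all()
print(loaded.schema.metadata)  # {b'source': b'sensor-42'}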

Parquet

Parquet has file- and column-level metadata. Per-column metadata may be useful for some use cases, but for the purposes of this issue I'd like to scope the metadata to the file level.
https://parquet.apache.org/docs/file-format/metadata/
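
A minimal sketch with pyarrow (the generated_by key is just an example): schema-level key/value pairs are written into the Parquet footer and can be read back without loading the column data:

import pyarrow as pa
import pyarrow.parquet as pq

# File-level key/value metadata travels in the Parquet footer.
table = pa.table({"a": [1, 2, 3]}).replace_schema_metadata({"generated_by": "pipeline-v2"})
pq.write_table(table, "data.parquet")

# read_metadata inspects only the footer, not the column data.
file_meta = pq.read_metadata("data.parquet").metadata
print(file_meta[b"generated_by"])  # b'pipeline-v2'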

Avro

Avro supports file-level metadata.
https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files
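
For example, assuming the fastavro library, whose writer accepts a metadata dict that is stored in the object container file's header:

import fastavro

schema = {
    "type": "record",
    "name": "Row",
    "fields": [{"name": "a", "type": "long"}],
}
records = [{"a": 1}, {"a": 2}]

# The metadata dict is written into the Avro container file header.
with open("data.avro", "wb") as out:
    fastavro.writer(out, schema, records, metadata={"source": "pipeline-v2"})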

Pandas

Pandas has an attrs attribute for the same purpose.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
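
A minimal sketch; note that attrs is documented as experimental, and it is not preserved by most I/O formats:

import pandas as pd

# attrs is a plain dict hanging off the DataFrame object.
df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["source"] = "sensor-42"
print(df.attrs)  # {'source': 'sensor-42'}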

@freeformstu freeformstu added the enhancement New feature or an improvement of an existing feature label Oct 5, 2022
@kylebarron

kylebarron commented Oct 7, 2022

This is something I've thought about a little for my geospatial use case, where it's helpful to be able to store extra (non-columnar) information along with the data, like its coordinate reference system. In my case, I think I'd opt for Arrow extension data type support (which allows for column-level metadata) instead of dataframe-level metadata, but I can see how that wouldn't fit every use case.

@ritchie46
Member

I think you should use a new-type pattern for this if you want the DataFrame to have some extra data.

With regard to extension types: this is something we want to support. We first need FixedSizeList, and then we can work on extension types.
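
For illustration, a minimal sketch of the new-type pattern in Python, wrapping the DataFrame by composition; the AnnotatedFrame name and the delegation strategy are hypothetical, not an official API:

import polars as pl

class AnnotatedFrame:
    """Hypothetical new-type: a DataFrame plus arbitrary metadata."""

    def __init__(self, df: pl.DataFrame, meta: dict | None = None):
        self.df = df
        self.meta = meta or {}

    def __getattr__(self, name):
        # Delegate anything we don't define to the wrapped DataFrame.
        return getattr(self.df, name)

af = AnnotatedFrame(pl.DataFrame({"a": [1, 2, 3]}), meta={"crs": "EPSG:4326"})
print(af.meta["crs"], af.shape)  # EPSG:4326 (3, 1)

The trade-off of this design is that any Polars operation still returns a plain DataFrame, so the metadata has to be re-attached explicitly after each transformation.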

@universalmind303
Collaborator

I think it could make a lot of sense to store some metadata on the schema itself; arrow2 is already doing this. I can think of many scenarios where it could be used. For example, with the new binary dtype, it would be helpful to store some metadata about the encoding of the binary data.
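
pyarrow already exposes this at the field level; a small sketch (the encoding key is only an example):

import pyarrow as pa

# Column-level metadata attached to an individual field of the schema.
field = pa.field("payload", pa.binary(), metadata={"encoding": "utf-8"})
schema = pa.schema([field])
print(schema.field("payload").metadata)  # {b'encoding': b'utf-8'}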

@markdoerr

I am also looking for a persistent way to store additional metadata on Polars columns, as @freeformstu suggested.
This would be very useful for attaching semantic information to a column, e.g. for machine learning purposes. An IRI pointing into an ontology would enable the data to be interpreted autonomously: AI/ML algorithms could "understand" the meaning of a given data column, e.g. column "time" -> "EMMO:time". Once a connection to an ontology is made, a lot of information (units, relations, etc.) can be extracted. I hope this example makes it clear why metadata is needed (it would also be sad to lose information when moving from Parquet or PyArrow to Polars).

@Insighttful

Insighttful commented Feb 2, 2024

@freeformstu I think you could accomplish this by subclassing the polars.DataFrame type as a MetaDataFrame and using a SimpleNamespace for the meta, so you retain dot-accessor access:

import meta_dataframe as mdf
from meta_dataframe import MetaDataFrame

df = MetaDataFrame({"a": [1, 2, 3]})
df.meta.name = "my_df"
df.meta.crs = "EPSG:4326"
df.meta.crs = "EPSG:3857"  # overwriting works as expected
print(df.meta.crs)

>>> EPSG:3857

By leveraging the Arrow IPC specification, you can provide additional functions to write and read the DataFrame while also managing the metadata.

filepath = f"{df.meta.name}.ipc"  # use a fully qualified path if desired
df.write_ipc_with_meta(filepath)
loaded_df = mdf.read_ipc_with_meta(filepath)
print(loaded_df.meta.crs)

>>> EPSG:3857

As I answered on Stack Overflow, here is the full MetaDataFrame module example:
# meta_dataframe.py

"""Provides functionality for handling Polars DataFrames with custom metadata.

This module enables the serialization and deserialization of Polars DataFrames
along with associated metadata, utilizing the IPC format for data interchange
and `orjson` for fast JSON processing. Metadata management is facilitated through
the use of the `DfMeta` class, a flexible container for arbitrary metadata fields.
Key functions include `write_ipc_with_meta` and `read_ipc_with_meta`, which allow
for the persistence of metadata across storage cycles, enhancing data context
retention and utility in analytical workflows.

Note:
    This module was not written for efficiency or performance, but to solve the
    use case of persisting metadata with Polars DataFrames. It is not recommended for
    production use, but rather as a starting point for more robust metadata management.

Classes:
    DfMeta: A simple namespace for metadata management.
    MetaDataFrame: An extension of Polars DataFrame to include metadata.

Functions:
    write_ipc_with_meta(df, filepath, meta): Serialize DataFrame and metadata to IPC.
    read_ipc_with_meta(filepath): Deserialize DataFrame and metadata from IPC.
"""


# Standard Library
from typing import Any
from types import SimpleNamespace

# Third Party
import orjson
import polars as pl
import pyarrow as pa


class DfMeta(SimpleNamespace):
    """A simple namespace for storing MetaDataFrame metadata.

    Usage:
        meta = DfMeta(
            name="checkins",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
    """

    # Generate a string representation of metadata keys
    def __repr__(self) -> str:
        keys = ", ".join(self.__dict__.keys())
        return f"DfMeta({keys})"

    # Alias __str__ to __repr__ for consistent string representation
    def __str__(self) -> str:
        return self.__repr__()


class MetaDataFrame(pl.DataFrame):
    """A Polars DataFrame extended to include custom metadata.

    Attributes:
        meta (DfMeta): A simple namespace for storing metadata.

    Usage:

        # Create MetaDataFrame with metadata
        meta = DfMeta(
            name="my_df",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326"
        )
        df = MetaDataFrame({"a": [1, 2, 3]}, meta=meta)

        # Create MetaDataFrame then add metadata
        df = MetaDataFrame({"a": [1, 2, 3]})
        df.meta.name = "my_df"
        df.meta.db_name = "my_db"
        df.meta.tz_name = "America/New_York"
        df.meta.crs = "EPSG:4326"

        # Overwrite metadata
        df.meta.crs = "EPSG:3857"

        # Write MetaDataFrame to IPC with metadata
        df.write_ipc_with_meta("my_df.ipc")

        # Read MetaDataFrame from IPC with metadata
        loaded_df = read_ipc_with_meta("my_df.ipc")

        # Access metadata
        print(loaded_df.meta.name)
        print(loaded_df.meta_as_dict())
    """

    # Initialize the DataFrame with a `meta` attribute (a DfMeta namespace)
    def __init__(self, data: Any = None, *args, meta: DfMeta | None = None, **kwargs):
        super().__init__(data, *args, **kwargs)
        self.meta = meta if meta is not None else DfMeta()

    def meta_as_dict(self) -> dict[str, Any]:
        """Returns the metadata as a dictionary.

        Returns:
            dict[str, Any]: A dictionary representation of the metadata.
        """
        return vars(self.meta)

    def write_ipc_with_meta(self, filepath: str) -> None:
        """Serialize MetaDataFrame and metadata stored in `meta` attr to an IPC file.

        Args:
            filepath (str): The path to the IPC file.

        Returns:
            None
        """
        # Convert Polars DataFrame to Arrow Table
        arrow_table = self.to_arrow()

        # Serialize metadata to JSON (orjson returns bytes)
        meta_json = orjson.dumps(vars(self.meta))

        # Embed the metadata into the Arrow schema's key/value metadata
        arrow_table_with_meta = arrow_table.replace_schema_metadata({"meta": meta_json})

        # Write Arrow table with metadata to IPC file
        with pa.OSFile(filepath, "wb") as sink:
            with pa.RecordBatchStreamWriter(
                sink, arrow_table_with_meta.schema
            ) as writer:
                writer.write_table(arrow_table_with_meta)


def read_ipc_with_meta(filepath: str) -> MetaDataFrame:
    """Deserialize DataFrame and metadata from an IPC file.

    Args:
        filepath (str): The path to the IPC file.

    Returns:
        MetaDataFrame: The deserialized DataFrame with metadata stored in `meta` attr.
    """
    # Read Arrow table from IPC file
    with pa.OSFile(filepath, "rb") as source:
        reader = pa.ipc.open_stream(source)
        table = reader.read_all()

    # Extract and deserialize metadata from the Arrow schema (it may be absent)
    schema_metadata = table.schema.metadata or {}
    meta_json = schema_metadata.get(b"meta")
    if meta_json:
        meta_dict = orjson.loads(meta_json)
        meta = DfMeta(**meta_dict)
    else:
        meta = DfMeta()

    # Construct the MetaDataFrame directly from the Arrow table and attach metadata
    return MetaDataFrame(table, meta=meta)

@AlexanderNenninger

AlexanderNenninger commented Apr 14, 2024

I recently ran into the same issue with sensor data. I'd really like to preserve units, orientations, sampling frequency, etc. through processing, as it helps with catching bad data.
Maybe this could be a Polars extension, though, or people could roll their own implementation on a per-project basis.

During my experiments, I found that adding support for custom data in the (de-)serialization methods where it makes sense (it could be as simple as accepting an additional Dict[bytes, bytes]) would simplify the implementation dramatically and could make it more robust.

E.g. currently there's really no good way of storing a Categorical(ordering="lexical") column in Parquet through PyArrow. Hive partitioning also has a few pitfalls w.r.t. data types.
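
To make that concrete, a purely hypothetical sketch of what such an opt-in parameter could look like; custom_metadata does not exist in Polars and is invented here only for illustration:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

# Hypothetical signature: forward key/value pairs into the target format's
# file-level metadata (Parquet footer, IPC schema metadata, etc.).
df.write_parquet(
    "data.parquet",
    custom_metadata={b"crs": b"EPSG:4326"},  # not a real parameter
)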

Is it bad style to link my own repo? In any case, if someone needs a temporary solution with a lot of pitfalls already ironed out: https://github.com/AlexanderNenninger/parquet_data_classes/tree/main
