Add DataFrame level metadata #5117
This is something I've thought about a little for my geospatial use case, where it's helpful to be able to store extra (non-columnar) information along with the data, like its coordinate reference system information. In my case, I think I'd opt for Arrow extension data type support (which allows for column-level metadata) instead of dataframe-level metadata, but I can see how that wouldn't fit every use case.
I think you should use a […]. With regard to extension types: this is something we want to support. We first need […].
I think it could make a lot of sense to store some metadata on the schema itself.
I am also looking for a persistent way to store additional metadata on Polars columns, like @freeformstu suggested.
@freeformstu I think you could just accomplish this by subclassing the Polars DataFrame:

```python
import meta_dataframe as mdf

df = mdf.MetaDataFrame({"a": [1, 2, 3]})
df.meta.name = "my_df"
df.meta.crs = "EPSG:4326"
df.meta.crs = "EPSG:3857"  # metadata can be overwritten
print(df.meta.crs)
# EPSG:3857
```

By leveraging the Arrow IPC specification you can provide additional functions to write and read the DataFrame while also managing the metadata:

```python
filepath = f"{df.meta.name}.ipc"  # use a fully qualified path if desired
df.write_ipc_with_meta(filepath)

loaded_df = mdf.read_ipc_with_meta(filepath)
print(loaded_df.meta.crs)
# EPSG:3857
```

<details>
<summary>Click here for MetaDataFrame module example</summary>

```python
# meta_dataframe.py
"""Provides functionality for handling Polars DataFrames with custom metadata.

This module enables the serialization and deserialization of Polars DataFrames
along with associated metadata, utilizing the IPC format for data interchange
and `orjson` for fast JSON processing. Metadata management is facilitated
through the `DfMeta` class, a flexible container for arbitrary metadata fields.
The `write_ipc_with_meta` method and `read_ipc_with_meta` function allow
metadata to persist across storage cycles, enhancing data context retention
and utility in analytical workflows.

Note:
    This module was not written for efficiency or performance, but to solve the
    use case of persisting metadata with Polars DataFrames. It is not
    recommended for production use, but rather as a starting point for more
    robust metadata management.

Classes:
    DfMeta: A simple namespace for metadata management.
    MetaDataFrame: An extension of Polars DataFrame to include metadata,
        with a `write_ipc_with_meta` method for serialization to IPC.

Functions:
    read_ipc_with_meta(filepath): Deserialize DataFrame and metadata from IPC.
"""
from __future__ import annotations

# Standard Library
from types import SimpleNamespace
from typing import Any

# Third Party
import orjson
import polars as pl
import pyarrow as pa


class DfMeta(SimpleNamespace):
    """A simple namespace for storing MetaDataFrame metadata.

    Usage:
        meta = DfMeta(
            name="checkins",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
    """

    def __repr__(self) -> str:
        # Generate a string representation of the metadata keys
        keys = ", ".join(self.__dict__.keys())
        return f"DfMeta({keys})"

    # Alias __str__ to __repr__ for consistent string representation
    __str__ = __repr__


class MetaDataFrame(pl.DataFrame):
    """A Polars DataFrame extended to include custom metadata.

    Attributes:
        meta (DfMeta): A simple namespace for storing metadata.

    Usage:
        # Create a MetaDataFrame with metadata
        meta = DfMeta(
            name="my_df",
            db_name="my_db",
            tz_name="America/New_York",
            crs="EPSG:4326",
        )
        df = MetaDataFrame({"a": [1, 2, 3]}, meta=meta)

        # Create a MetaDataFrame, then add metadata
        df = MetaDataFrame({"a": [1, 2, 3]})
        df.meta.name = "my_df"
        df.meta.db_name = "my_db"
        df.meta.tz_name = "America/New_York"
        df.meta.crs = "EPSG:4326"

        # Overwrite metadata
        df.meta.crs = "EPSG:3857"

        # Write the MetaDataFrame to IPC with metadata
        df.write_ipc_with_meta("my_df.ipc")

        # Read the MetaDataFrame from IPC with metadata
        loaded_df = read_ipc_with_meta("my_df.ipc")

        # Access metadata
        print(loaded_df.meta.name)
        print(loaded_df.meta_as_dict())
    """

    # Initialize the DataFrame with a `meta` attr namespace
    def __init__(self, data: Any = None, *args, meta: DfMeta | None = None, **kwargs):
        super().__init__(data, *args, **kwargs)
        self.meta = meta if meta is not None else DfMeta()

    def meta_as_dict(self) -> dict[str, Any]:
        """Return the metadata as a dictionary.

        Returns:
            dict[str, Any]: A dictionary representation of the metadata.
        """
        return vars(self.meta)

    def write_ipc_with_meta(self, filepath: str) -> None:
        """Serialize the MetaDataFrame and its `meta` attr to an IPC file.

        Args:
            filepath (str): The path to the IPC file.
        """
        # Convert the Polars DataFrame to an Arrow Table
        arrow_table = self.to_arrow()
        # Serialize metadata to JSON and embed it in the Arrow schema
        meta_json = orjson.dumps(vars(self.meta))
        arrow_table_with_meta = arrow_table.replace_schema_metadata({"meta": meta_json})
        # Write the Arrow table with metadata to an IPC stream file
        with pa.OSFile(filepath, "wb") as sink:
            with pa.RecordBatchStreamWriter(
                sink, arrow_table_with_meta.schema
            ) as writer:
                writer.write_table(arrow_table_with_meta)


def read_ipc_with_meta(filepath: str) -> MetaDataFrame:
    """Deserialize a DataFrame and its metadata from an IPC file.

    Args:
        filepath (str): The path to the IPC file.

    Returns:
        MetaDataFrame: The deserialized DataFrame with metadata in its `meta` attr.
    """
    # Read the Arrow table from the IPC stream file
    with pa.OSFile(filepath, "rb") as source:
        table = pa.ipc.open_stream(source).read_all()
    # Extract and deserialize metadata from the Arrow schema, if present
    meta_json = (table.schema.metadata or {}).get(b"meta")
    meta = DfMeta(**orjson.loads(meta_json)) if meta_json else DfMeta()
    # Convert the Arrow table to a Polars DataFrame and attach the metadata
    return MetaDataFrame(pl.from_arrow(table), meta=meta)
```
</details>
I recently ran into the same issue with sensor data. I'd really like to preserve units, orientations, sampling frequency, etc. through processing, as it helps with catching bad data. During my experiments, I found that adding support for custom data could be as simple as being able to pass in an additional […]. E.g. currently there's really no good way of storing a […]. Is it bad style to link my own repo? In any case, if someone needs a temporary solution with a lot of pitfalls already ironed out: https://github.com/AlexanderNenninger/parquet_data_classes/tree/main
Problem description
I would like to be able to track dataframe specific metadata through processing, serialization, and deserialization.
A common use case for dataframe metadata is to store data about how the dataframe was generated or metadata about the data contained within its columns.
Below are some examples of existing libraries and formats which have dataframe level metadata. I am definitely open to putting this metadata elsewhere if there's a better place for it.
Arrow
With PyArrow, you can add metadata to the Schema with `with_metadata`:
https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.with_metadata
IPC
Arrow's IPC format can store File level metadata.
https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
Parquet
Parquet has file- and column-level metadata. Per-column metadata may be useful for some use cases, but for the purposes of this issue I'd like to scope the metadata to the file level.
https://parquet.apache.org/docs/file-format/metadata/
Avro
Avro supports file level metadata.
https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files
Pandas
Pandas
Pandas has an `attrs` attribute for the same purpose:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html
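For comparison, `DataFrame.attrs` is just a plain dict of dataframe-level metadata (the `crs` key here is an illustrative choice; note pandas documents `attrs` as experimental, and propagation through operations varies by version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# attrs is an ordinary dict attached to the DataFrame.
# The "crs" key is illustrative.
df.attrs["crs"] = "EPSG:4326"

print(df.attrs)
# {'crs': 'EPSG:4326'}
```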