DM-36775: Refactor string length checking on dataframe->arrow conversion. #747

Merged
merged 1 commit into from
Oct 28, 2022
2 changes: 2 additions & 0 deletions doc/changes/DM-36775.bugfix.md
@@ -0,0 +1,2 @@
+Fix bug in pandas dataframe to arrow conversion that would crash with some pandas object datatypes.

15 changes: 7 additions & 8 deletions python/lsst/daf/butler/formatters/parquet.py
@@ -357,21 +357,20 @@ def pandas_to_arrow(dataframe: pd.DataFrame, default_length: int = 10) -> pa.Tab
     -------
     arrow_table : `pyarrow.Table`
     """
-    import numpy as np
-    import pandas as pd
-
     arrow_table = pa.Table.from_pandas(dataframe)

     # Update the metadata
     md = arrow_table.schema.metadata

     md[b"lsst::arrow::rowcount"] = str(arrow_table.num_rows)

-    if not isinstance(dataframe.columns, pd.MultiIndex):
-        for name in dataframe.columns:
-            if dataframe[name].dtype.type is np.object_:
-                if len(dataframe[name].values) > 0:
-                    strlen = max(len(row) for row in dataframe[name].values)
+    # We loop through the arrow table columns because the datatypes have
+    # been checked and converted from pandas objects.
+    for name in arrow_table.column_names:
+        if not name.startswith("__"):
+            if arrow_table[name].type == pa.string():
+                if len(arrow_table[name]) > 0:
+                    strlen = max(len(row.as_py()) for row in arrow_table[name])
                 else:
                     strlen = default_length
                 md[f"lsst::arrow::len::{name}".encode("UTF-8")] = str(strlen)