Add support for reading/writing parquet files with pyarrow by huddlej · Pull Request #252 · nextstrain/docker-base

huddlej · 2025-05-22T23:32:03Z

Description of proposed changes

We need support for reading/writing parquet files to prepare submissions to the SARS-CoV-2 variant hub [1]. The pyarrow library is one of two parquet engines supported by pandas [2], along with fastparquet. The pyarrow library provides a more comprehensive set of tools for the Arrow spec [3], while fastparquet is designed to provide a minimal library for the parquet format. I've opted for the larger pyarrow library here, since it will eventually be a required dependency for pandas [4].

[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html
[4] https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html

Related issue(s)

nextstrain/forecasts-ncov#132

Checklist

Checks pass

huddlej · 2025-05-23T18:08:26Z

Closed in favor of #253

We need support for reading/writing parquet files to prepare submissions to the SARS-CoV-2 variant hub [1]. The pyarrow library is one of two parquet engines supported by pandas [2], along with fastparquet. The pyarrow library provides a more comprehensive set of tools for the Arrow spec [3], while fastparquet is designed to provide a minimal library for the parquet format. We need to switch to the larger pyarrow library here, because it supports the parquet DATE data type that we need for our SARS-CoV-2 nowcast submissions.

[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html

victorlin

Looks like it installed successfully in the build logs:

#36 [linux/amd64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#36 0.621 Collecting pyarrow==20.0.0
#36 0.663   Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.3 MB)
#36 0.950      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 MB 86.2 MB/s eta 0:00:00
#36 1.487 Installing collected packages: pyarrow
#36 2.581 Successfully installed pyarrow-20.0.0

#35 [linux/arm64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#35 27.49 Successfully installed pyarrow-20.0.0

victorlin · 2025-07-07T19:21:19Z

Dockerfile

 # Install openpyxl for pandas in GenoFLU
 RUN pip3 install openpyxl==3.1.0

-# Install fastparquet for pandas to support parquet files.
-RUN pip3 install fastparquet==2024.11.0
+# Install pyarrow for pandas to support parquet files.
+RUN pip3 install pyarrow==20.0.0


I was thinking that, given the way this comment is worded, it might make more sense to pip install "pandas[parquet]", which lets pip resolve a version of pyarrow that is compatible with pandas. But then I realized that pandas is not directly installed in the Dockerfile – it's installed as a dependency of TreeTime and Augur.

If/when Augur declares a dependency on pandas[parquet], we can remove the separate command that installs pyarrow.

Similarly, openpyxl is included in pip install "pandas[excel]", but there's an image size argument to be made for installing openpyxl separately since pandas[excel] includes other unneeded dependencies.

huddlej mentioned this pull request May 22, 2025

Add USA-specific model nextstrain/forecasts-ncov#133

Merged

14 tasks

huddlej changed the title ~~Add support for reading/writing parquet files~~ Add support for reading/writing parquet files with pyarrow May 23, 2025

huddlej mentioned this pull request May 23, 2025

Add support for reading/writing parquet files with fastparquet #253

Merged

1 task

huddlej closed this May 23, 2025

huddlej deleted the add-pyarrow branch May 23, 2025 18:08

huddlej restored the add-pyarrow branch July 2, 2025 22:37

huddlej reopened this Jul 2, 2025

huddlej force-pushed the add-pyarrow branch from ddc0aeb to 2904f40 Compare July 2, 2025 22:45

huddlej mentioned this pull request Jul 2, 2025

Cast dates in hub submissions to parquet date type nextstrain/forecasts-ncov#139

Merged

1 task

victorlin approved these changes Jul 2, 2025

huddlej merged commit 9270fb3 into master Jul 3, 2025
61 checks passed

huddlej deleted the add-pyarrow branch July 3, 2025 00:24

victorlin reviewed Jul 7, 2025
