Add support for reading/writing parquet files with pyarrow #252
Closed in favor of #253
We need support for reading/writing parquet files to prepare submissions to the SARS-CoV-2 variant hub [1]. The pyarrow library is one of two parquet engines supported by pandas [2], along with fastparquet. The pyarrow library provides a more comprehensive set of tools for the Arrow spec [3], while fastparquet is designed to provide a minimal library for the parquet format. We need to switch to the larger pyarrow library here because it supports the parquet DATE data type that we need for our SARS-CoV-2 nowcast submissions.
[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html
victorlin
left a comment
Looks like it installed successfully in the build logs:
#36 [linux/amd64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#36 0.621 Collecting pyarrow==20.0.0
#36 0.663 Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.3 MB)
#36 0.950 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 MB 86.2 MB/s eta 0:00:00
#36 1.487 Installing collected packages: pyarrow
#36 2.581 Successfully installed pyarrow-20.0.0
#35 [linux/arm64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#35 27.49 Successfully installed pyarrow-20.0.0
# Install openpyxl for pandas in GenoFLU
RUN pip3 install openpyxl==3.1.0

# Install fastparquet for pandas to support parquet files.
RUN pip3 install fastparquet==2024.11.0
# Install pyarrow for pandas to support parquet files.
RUN pip3 install pyarrow==20.0.0
I was thinking that, given the way this comment is worded, it might make more sense to pip install "pandas[parquet]" which lets pandas resolve a compatible version of pyarrow. But then I realized that pandas is not directly installed in the Dockerfile – it's installed as a dependency of TreeTime and Augur.
If/when Augur declares a dependency on pandas[parquet], we can remove the separate command that installs pyarrow.
Similarly, openpyxl is included in pip install "pandas[excel]", but there's an image size argument to be made for installing openpyxl separately since pandas[excel] includes other unneeded dependencies.
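For reference, the alternative discussed in this comment could look roughly like the fragment below. This is only a sketch and not part of this PR: it assumes pandas is already present in the image (as a dependency of TreeTime and Augur) and that the installed pandas version defines the `parquet` extra. How pip reconciles the extra with the pandas version already installed is not verified here.

```dockerfile
# Hypothetical alternative to pinning pyarrow directly (not in this PR).
# The "parquet" extra asks pip for a pyarrow version that pandas declares
# compatible, instead of a version chosen by hand.
RUN pip3 install "pandas[parquet]"

# What this PR actually does: install a pinned pyarrow on its own.
# RUN pip3 install pyarrow==20.0.0
```

The pinned form trades automatic compatibility resolution for a reproducible, explicitly chosen version, which matches how openpyxl is handled in the same Dockerfile.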
Description of proposed changes
We need support for reading/writing parquet files to prepare submissions to the SARS-CoV-2 variant hub [1]. The pyarrow library is one of two parquet engines supported by pandas [2], along with fastparquet. The pyarrow library provides a more comprehensive set of tools for the Arrow spec [3], while fastparquet is designed to provide a minimal library for the parquet format. I've opted for the larger pyarrow library here, since it will eventually be a required dependency for pandas [4].
[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html
[4] https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
Related issue(s)
nextstrain/forecasts-ncov#132