Support compressed csv in scan_csv #7287

Open
corneliusroemer opened this issue Mar 1, 2023 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@corneliusroemer
corneliusroemer commented Mar 1, 2023

Problem description

It would be nice if polars could load compressed CSV files out of the box, e.g. a zstd-compressed CSV.

I'm not sure what the best workaround is. xopen doesn't seem to work:

import polars as pl
import xopen

# Passing an open text-mode file object instead of a path triggers the error below.
with xopen.xopen("metadata_germany.tsv.zst", "rt") as f:
    pl.scan_csv(f, has_header=True, sep="\t").head(10).collect()

raises:

TypeError: argument 'path': 'TextIOWrapper' object cannot be converted to 'PyString'

Related: #3166
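
The TypeError above suggests that scan_csv wants a path string rather than an open file object. One possible workaround, sketched here under assumptions, is to stream-decompress the file to a temporary location on disk and pass that path along. The helper name `decompress_to_temp` is hypothetical, and `gzip.open` stands in for a zstd opener only because it ships with Python; any callable that returns a binary file object for `(path, "rb")` should fit the same slot.

```python
import gzip
import os
import shutil
import tempfile


def decompress_to_temp(src_path, opener=gzip.open, suffix=".tsv"):
    """Stream-decompress src_path into a temporary file and return its path.

    Hypothetical helper. `opener` is an assumption: any callable returning a
    binary file object for (path, "rb"). gzip.open is only a stand-in for a
    zstd-capable opener.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as dst, opener(src_path, "rb") as src:
        # copyfileobj copies in fixed-size chunks, so memory use stays small
        # regardless of how large the uncompressed data is.
        shutil.copyfileobj(src, dst)
    return tmp_path
```

The returned value is a plain string, so it could then be handed to scan_csv in place of the file object. The trade-off is disk space for the uncompressed copy, and the caller has to delete the temporary file afterwards.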

@corneliusroemer corneliusroemer added the enhancement New feature or an improvement of an existing feature label Mar 1, 2023
@corneliusroemer
Author

I'd now also like zstd support for NDJSON; it would be great to be able to read from compressed files.

@natir
natir commented May 23, 2023

A Rust alternative to xopen is niffler.

Sorry for the self-promotion, but I also need this feature; maybe niffler could help polars.

@corneliusroemer
Author

corneliusroemer commented Jun 6, 2023

It could be that reading in rb mode would work. At least in the case of read_csv, I managed to get it to read from a zst-compressed file with xopen; see https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars

The downside is that the entire uncompressed file/stream is read into memory before parsing, so this doesn't work when the uncompressed file is similar in size to the machine's memory.
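
To illustrate the memory concern, here is a minimal sketch of the alternative: decompress lazily and process rows in fixed-size batches, so only one batch is held in memory at a time. This is not polars functionality. It uses the standard library's `gzip` and `csv` modules as a stand-in for zstd, and the function name `iter_batches` and its parameters are hypothetical.

```python
import csv
import gzip
from itertools import islice


def iter_batches(path, batch_size=10_000, delimiter="\t"):
    """Yield lists of row dicts from a gzip-compressed delimited file.

    Hypothetical helper; gzip is an assumption standing in for zstd.
    """
    # Text mode decompresses lazily as the csv reader pulls lines.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        while True:
            # Pull at most batch_size rows; an empty list means end of file.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

Each batch could be aggregated or written out before the next one is read, which keeps peak memory roughly proportional to `batch_size` rather than to the uncompressed file size.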

@ghuls
Collaborator

ghuls commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to parquet and use pl.scan_parquet on them:
#9283 (comment)


3 participants