
Scan csv.gz tables #17011

Open
USM-CHU-FGuyon opened this issue Jun 17, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@USM-CHU-FGuyon

USM-CHU-FGuyon commented Jun 17, 2024

Description

I'm working on larger-than-RAM csv.gz files that I want to convert to parquet.

Just like in the docs, I would love to be able to do this:

import polars as pl
lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv.gz")
lf.sink_parquet("out.parquet")

Unfortunately, this raises:

ComputeError: cannot scan compressed csv; use `read_csv` for compressed data

and reading these compressed CSVs fully into memory is not feasible.

My current workaround is to use pandas to chunk the file and pyarrow to write the chunks to a single Parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pth_tgt = "out.parquet"  # target Parquet file

# chunksize must be an integer number of rows
df_chunks = pd.read_csv("/path/to/my_larger_than_ram_file.csv.gz", chunksize=10_000_000)

for i, df in enumerate(df_chunks):
    table = pa.Table.from_pandas(df)
    if i == 0:
        # the writer's schema is fixed by the first chunk
        pqwriter = pq.ParquetWriter(pth_tgt, table.schema)
    pqwriter.write_table(table)
pqwriter.close()
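Until `scan_csv` supports compressed input, another workaround (a stdlib-only sketch, assuming enough disk space for the decompressed copy; the sample file and paths here are made up for illustration) is to stream-decompress the `.gz` to a plain `.csv` and scan that lazily:

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Build a tiny sample csv.gz so the sketch is self-contained;
# in practice `src` would be the larger-than-RAM file.
tmpdir = Path(tempfile.mkdtemp())
src = tmpdir / "sample.csv.gz"
with gzip.open(src, "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

# Stream-decompress in fixed-size chunks; nothing is fully loaded into RAM.
dst = tmpdir / "sample.csv"
with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# The decompressed file can then be scanned lazily, e.g.:
#   import polars as pl
#   pl.scan_csv(dst).sink_parquet(tmpdir / "out.parquet")
print(dst.read_text())
```

The obvious cost is the temporary on-disk copy, but it keeps the whole pipeline streaming.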

This is impractical and gets clunky when some chunks contain values of a new datatype (e.g. a single float value in a column of ints).
It would be nice to be able to do this smoothly in Polars!
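For the schema-drift problem specifically, one mitigation (a sketch with a made-up mixed column) is to force the dtype up front so every chunk is inferred identically and a single `ParquetWriter` schema fits all of them:

```python
import io
import pandas as pd

# A tiny CSV where only the last row makes `value` a float; with
# chunksize=2, the first chunk alone would otherwise be inferred as int64.
csv_text = "id,value\n1,1\n2,2\n3,2\n4,2.5\n"

# Passing dtype= pins the column type across all chunks.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2,
                     dtype={"value": "float64"})
dtypes = [str(c["value"].dtype) for c in chunks]
print(dtypes)
```

This avoids the writer rejecting a later chunk whose inferred schema differs from the first one.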

Thank you for this project,
Matthieu

@USM-CHU-FGuyon USM-CHU-FGuyon added the enhancement New feature or an improvement of an existing feature label Jun 17, 2024
@aut0clave
Duplicate of #7287?

@USM-CHU-FGuyon
Author

I'm not familiar with zstd, but #7287 seems to have the same intent of scanning compressed CSVs.
