
Scan csv.gz tables #17011

Open
USM-CHU-FGuyon opened this issue Jun 17, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@USM-CHU-FGuyon

USM-CHU-FGuyon commented Jun 17, 2024

Description

I'm working on larger-than-RAM csv.gz files that I want to convert to parquet.

Just like in the docs, I would love to be able to do this:

import polars as pl
lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv.gz")
lf.sink_parquet("out.parquet")

Unfortunately, this raises:

ComputeError: cannot scan compressed csv; use `read_csv` for compressed data

and reading these compressed CSVs fully into memory is not feasible.

My current workaround is to use pandas to chunk the file and pyarrow to write the chunks to a single Parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pth_tgt = "out.parquet"  # target Parquet file

# chunksize must be an integer number of rows
df_chunks = pd.read_csv("/path/to/my_larger_than_ram_file.csv.gz", chunksize=10_000_000)

for i, df in enumerate(df_chunks):
    table = pa.Table.from_pandas(df)
    if i == 0:
        # the writer's schema is fixed by the first chunk
        pqwriter = pq.ParquetWriter(pth_tgt, table.schema)
    pqwriter.write_table(table)
pqwriter.close()
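Until `scan_csv` supports compressed input, another workaround (a stdlib-only sketch, assuming enough disk space for the decompressed copy; the sample file and paths here are made up for illustration) is to stream-decompress the `.gz` to a plain `.csv` and scan that lazily:

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Build a tiny sample csv.gz so the sketch is self-contained;
# in practice `src` would be the larger-than-RAM file.
tmpdir = Path(tempfile.mkdtemp())
src = tmpdir / "sample.csv.gz"
with gzip.open(src, "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

# Stream-decompress in fixed-size chunks; nothing is fully loaded into RAM.
dst = tmpdir / "sample.csv"
with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# The decompressed file can then be scanned lazily, e.g.:
#   import polars as pl
#   pl.scan_csv(dst).sink_parquet(tmpdir / "out.parquet")
print(dst.read_text())
```

The obvious cost is the temporary on-disk copy, but it keeps the whole pipeline streaming.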

This is impractical and gets clunky when some chunks contain values of a new datatype (e.g. a single float value in a column of ints).
It would be nice to be able to do this smoothly in Polars!
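For the schema-drift problem specifically, one mitigation (a sketch with a made-up mixed column) is to force the dtype up front so every chunk is inferred identically and a single `ParquetWriter` schema fits all of them:

```python
import io
import pandas as pd

# A tiny CSV where only the last row makes `value` a float; with
# chunksize=2, the first chunk alone would otherwise be inferred as int64.
csv_text = "id,value\n1,1\n2,2\n3,2\n4,2.5\n"

# Passing dtype= pins the column type across all chunks.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2,
                     dtype={"value": "float64"})
dtypes = [str(c["value"].dtype) for c in chunks]
print(dtypes)
```

This avoids the writer rejecting a later chunk whose inferred schema differs from the first one.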

Thank you for this project,
Matthieu

@USM-CHU-FGuyon USM-CHU-FGuyon added the enhancement New feature or an improvement of an existing feature label Jun 17, 2024
@aut0clave
Duplicate of #7287?

@USM-CHU-FGuyon
Author

I'm not familiar with zstd, but #7287 seems to have the same intent of scanning compressed CSVs.
