I'm working on larger-than-RAM csv.gz files that I want to convert to parquet.
Just like in the docs, I would love to be able to do this:
import polars as pl
lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv.gz")
lf.sink_parquet("out.parquet")
Unfortunately, this fails with:
ComputeError: cannot scan compressed csv; use `read_csv` for compressed data
and reading these compressed CSVs into memory is not feasible.
My current workaround is to use pandas to chunk the file and pyarrow to write the chunks to a single file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pth_tgt = "out.parquet"  # target parquet file
df_chunks = pd.read_csv("/path/to/my_larger_than_ram_file.csv.gz", chunksize=10_000_000)
for i, df in enumerate(df_chunks):
    table = pa.Table.from_pandas(df)
    if i == 0:
        # create the writer from the schema of the first chunk
        pqwriter = pq.ParquetWriter(pth_tgt, table.schema)
    pqwriter.write_table(table)
pqwriter.close()
This is impractical and gets clunky when some chunks contain values of a new datatype (e.g. a single float value in a column of ints).
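For what it's worth, that chunk-to-chunk dtype drift can sometimes be mitigated by pinning dtypes up front via pandas' dtype argument; this is only a sketch, and the column name below is purely hypothetical:

import pandas as pd

# Hypothetical column name: pin "some_int_col" to float64 so a stray float
# in a later chunk does not change the dtype inferred for that chunk.
df_chunks = pd.read_csv(
    "/path/to/my_larger_than_ram_file.csv.gz",
    chunksize=10_000_000,
    dtype={"some_int_col": "float64"},
)

But that only helps when the problem columns are known in advance, which is rarely the case for these files.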
It would be nice to be able to do this smoothly in Polars!
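In the meantime, a possible alternative (just a sketch, assuming there is enough free disk space for the uncompressed file) is to decompress the gzip first and let the streaming engine handle the plain CSV:

import gzip
import shutil

import polars as pl

# Decompress to a temporary plain-text CSV (needs disk space for the
# uncompressed data), then stream it straight to parquet.
with gzip.open("/path/to/my_larger_than_ram_file.csv.gz", "rb") as src, \
        open("/tmp/my_larger_than_ram_file.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

pl.scan_csv("/tmp/my_larger_than_ram_file.csv").sink_parquet("out.parquet")

Native support would of course avoid the extra disk round trip entirely.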
Thank you for this project,
Matthieu