Unify read and scan functions #13040

Open
6 tasks
stinodego opened this issue Dec 14, 2023 · 2 comments
Labels
A-io Area: reading and writing data accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@stinodego
Member

stinodego commented Dec 14, 2023

read functions should behave exactly like scan functions followed by collect.

There may be some added or removed parameters for functionality that is (not) relevant in eager mode.

We should take a look at our existing scan functions and make sure they conform to these expectations:

  • scan_parquet
  • scan_ipc
  • scan_csv
  • scan_ndjson
  • scan_delta
  • scan_iceberg (has no read equivalent yet)
@Wainberg
Contributor

Wainberg commented Dec 31, 2023

Here are a few open, unaccepted issues that should be addressed during the harmonization of read_csv and scan_csv:

  • scan_csv() doesn’t support compressed CSVs, read_csv() does: #7287
  • scan_csv() doesn’t support column selection, read_csv() does: #5755
  • scan_csv() doesn’t support BytesIO/StringIO, read_csv() does: #4950, #12617
  • read_csv() doesn’t support reading multiple files, scan_csv() does: #10706

@alexander-beedie alexander-beedie added the A-io Area: reading and writing data label Jan 23, 2024
@mkleinbort-ic

👍 I was about to ask why there is no pl.read_iceberg.

On that note, a small documentation recommendation: I think it's time to add sub-headers to the IO functions, grouped by type:

csv: 
- [polars.read_csv](https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html)
- [polars.read_csv_batched](https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv_batched.html)
- [polars.scan_csv](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html)
- [polars.DataFrame.write_csv](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_csv.html)
- [polars.LazyFrame.sink_csv](https://docs.pola.rs/py-polars/html/reference/api/polars.LazyFrame.sink_csv.html)

ipc:
- [polars.read_ipc](https://docs.pola.rs/py-polars/html/reference/api/polars.read_ipc.html)
- [polars.read_ipc_stream](https://docs.pola.rs/py-polars/html/reference/api/polars.read_ipc_stream.html)
- [polars.scan_ipc](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_ipc.html)
- [polars.read_ipc_schema](https://docs.pola.rs/py-polars/html/reference/api/polars.read_ipc_schema.html)
- [polars.DataFrame.write_ipc](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_ipc.html)
- [polars.DataFrame.write_ipc_stream](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_ipc_stream.html)
- [polars.LazyFrame.sink_ipc](https://docs.pola.rs/py-polars/html/reference/api/polars.LazyFrame.sink_ipc.html)

parquet:
- [polars.read_parquet](https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html)
- [polars.scan_parquet](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_parquet.html)
- [polars.read_parquet_schema](https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet_schema.html)
- [polars.DataFrame.write_parquet](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_parquet.html)
- [polars.LazyFrame.sink_parquet](https://docs.pola.rs/py-polars/html/reference/api/polars.LazyFrame.sink_parquet.html)

database:
- [polars.read_database](https://docs.pola.rs/py-polars/html/reference/api/polars.read_database.html)
- [polars.read_database_uri](https://docs.pola.rs/py-polars/html/reference/api/polars.read_database_uri.html)
- [polars.DataFrame.write_database](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_database.html)

json:
- [polars.read_json](https://docs.pola.rs/py-polars/html/reference/api/polars.read_json.html)
- [polars.read_ndjson](https://docs.pola.rs/py-polars/html/reference/api/polars.read_ndjson.html)
- [polars.scan_ndjson](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_ndjson.html)
- [polars.DataFrame.write_json](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_json.html)
- [polars.DataFrame.write_ndjson](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_ndjson.html)
- [polars.LazyFrame.sink_ndjson](https://docs.pola.rs/py-polars/html/reference/api/polars.LazyFrame.sink_ndjson.html)

avro:
- [polars.read_avro](https://docs.pola.rs/py-polars/html/reference/api/polars.read_avro.html)
- [polars.DataFrame.write_avro](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_avro.html)

excel:
- [polars.read_excel](https://docs.pola.rs/py-polars/html/reference/api/polars.read_excel.html)
- [polars.read_ods](https://docs.pola.rs/py-polars/html/reference/api/polars.read_ods.html)
- [polars.DataFrame.write_excel](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_excel.html#)

iceberg:
- [polars.scan_iceberg](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_iceberg.html)

delta:
- [polars.scan_delta](https://docs.pola.rs/py-polars/html/reference/api/polars.scan_delta.html)
- [polars.read_delta](https://docs.pola.rs/py-polars/html/reference/api/polars.read_delta.html)
- [polars.DataFrame.write_delta](https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_delta.html)

...

Projects
Status: Candidate
Development

No branches or pull requests

4 participants