Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Status: Open
CanglongCl opened this issue Apr 2, 2024 · 5 comments
Labels: A-api (changes to the public API), A-io-csv (reading/writing CSV files), enhancement (new feature or an improvement of an existing feature)

CanglongCl (Contributor) commented Apr 2, 2024

There have been various problems with these parameters. I'm willing to contribute, but I think a few things need to be specified before work starts. Here is the proposal.

List of related issues

Current behaviour

schema

  • Both the names and the dtypes in schema are applied in the order of the original dict.
    • Names from the CSV header are ignored (which can be confusing).
    • Dtypes are not inferred.
  • If a dtype is not supported, an error is raised instead of converting.
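A minimal sketch of the behaviour described above (assuming the current read_csv signature; illustrative, not tested against every version):

import io

import polars as pl

data = io.BytesIO(b"a,b\n1,x\n2,y\n")

# The header names "a" and "b" are ignored; the schema's names and dtypes
# are applied positionally, and no inference happens.
df = pl.read_csv(data, schema={"c": pl.Utf8, "d": pl.Utf8})
# Expected columns: ["c", "d"], both Utf8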

dtypes and schema

  • If schema is provided and dtypes is passed as a list of dtypes, dtypes overwrites schema.
  • If schema is provided and dtypes is passed as a mapping, dtypes does nothing (not expected).
  • If schema is NOT provided, the schema is inferred and dtypes overwrites it.
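A sketch of that interaction, based on the behaviour described above:

data = io.BytesIO(b"a,b\n1,x\n2,y\n")
# dtypes as a list overwrites the provided schema positionally:
df = pl.read_csv(data, schema={"c": pl.Utf8, "d": pl.Utf8}, dtypes=[pl.Int64, pl.Utf8])
# "c" ends up Int64, not Utf8

data = io.BytesIO(b"a,b\n1,x\n2,y\n")
# dtypes as a mapping does nothing when schema is set:
df = pl.read_csv(data, schema={"c": pl.Utf8, "d": pl.Utf8}, dtypes={"c": pl.Int64})
# "c" stays Utf8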

new_columns

  • new_columns replaces the column names as a final step.
  • If new_columns is provided, its names are the ones used in dtypes.
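For example (a sketch of the current behaviour):

data = io.BytesIO(b"a,b\n1,x\n2,y\n")
# new_columns renames positionally, and dtypes is keyed on the *new* names:
df = pl.read_csv(data, new_columns=["x_col", "y_col"], dtypes={"x_col": pl.Int64})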

Intention of the parameters

Users' intentions when using these parameters can be:

  • schema and dtypes:
    • Specify the data type for a specific column.
    • Ensure the output schema is as expected.
  • columns:
    • Select some columns of the CSV.
      • Select by original column name
      • Select by column index
    • Select columns in a specific order.
  • new_columns:
    • Rename columns.
    • Avoid ugly column names in other parameters.

Proposal

Changes in parameters

  • Rename new_columns (read_csv) to rename_columns.
    • 'new' usually refers to creating new columns; 'rename' is more accurate.
  • Deprecate with_new_columns in scan_csv and introduce the rename_columns mentioned above.
  • Bring columns to scan_csv.
  • Deprecate schema, since the other parameters can replace it entirely.
    • See how the new parameters address users' intentions below.
  • Allow a single dtype in dtypes, meaning it applies to all columns. (Allow read_csv to set a single dtype for all columns, or all but certain columns #13226)
  • Specify that when Mapping[str, PolarsDataType] is used in dtypes, the key refers to the renamed column name (in the final DataFrame).
  • Allow Mapping[str, str] in rename_columns.
  • Allow Mapping[int, PolarsDataType] / Mapping[int, str] with index keys in dtypes and rename_columns.
  • Allow Callable[[str], str] and Callable[[Sequence[str]], Sequence[str]] in rename_columns.
    • Callable[[str], str] is convenient for tasks like adding a prefix or suffix, or lowercasing names. E.g. lambda x: x + '_suffix'.
    • Callable[[Sequence[str]], Sequence[str]] is the original form of with_new_columns in the current scan_csv. If the user needs the index of each column, this form is better. E.g. lambda cols: [f'column_{i}' for i, _ in enumerate(cols)]
  • If a Sequence or Mapping[int, _] is passed for dtypes or rename_columns, the index refers to the position in the DataFrame before the row index column is inserted (i.e. if columns is provided, it follows the order specified in columns).

So the final function signature would look like:

def read_csv(
	..., 
	columns: Sequence[int] | Sequence[str] | None = None, 
	rename_columns: Sequence[str] | Callable[[str], str] | Callable[[Sequence[str]], Sequence[str]] | Mapping[str, str] | Mapping[int, str] | None = None,
	..., 
	dtypes: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | Mapping[int, PolarsDataType] | PolarsDataType | None = None,
	# No `schema` now
	...
) -> pl.DataFrame: 
	...
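For illustration, calls using the proposed parameters could look like this (hypothetical: none of these forms exist yet, and path is a placeholder):

# Single dtype applied to every column:
pl.read_csv(path, dtypes=pl.Utf8)

# Index-keyed renames and dtypes (indices refer to the order before the
# row index column is inserted):
pl.read_csv(path, rename_columns={0: "id"}, dtypes={0: pl.Int64})

# Callable renames, per name or over the whole sequence:
pl.read_csv(path, rename_columns=lambda name: name + "_suffix")
pl.read_csv(path, rename_columns=lambda cols: [f"column_{i}" for i, _ in enumerate(cols)])

# Select by index, rename, and type the result by the *final* names:
pl.read_csv(path, columns=[1, 0], rename_columns=["aaa", "bbb"], dtypes={"aaa": pl.Utf8})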

New process pipeline

Scan period

  1. Get the CSV column names.

    • If has_header, the column names are read from the first row.
    • If not has_header, the column names will be f'column_{n+1}' (current behavior).
  2. Determine the final column order according to columns, if provided.

    • If columns is provided, resolve the CSV column index of each selected column.
    • If columns is not provided, the original order is kept.

    The intermediate state should look like:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0
      }, 
    ]
  3. Rename the columns according to rename_columns.

  4. Inject dtype information according to dtypes, if present.

    The intermediate state should look like:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1, 
        dtype: String
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0, 
        dtype: None
      }, 
    ]
  5. Infer the dtype from the CSV (according to csv_col_idx) wherever it is None, i.e. not provided in dtypes:

    [
      {
        column_name: "aaa", 
        csv_col_idx: 1, 
        dtype: String
      }, 
      {
        column_name: "bbb", 
        csv_col_idx: 0, 
        dtype: Int32
      }, 
    ]
  6. Insert the row index column if needed.

Now the scan period ends and we have a schema like:

[
  {
    column_name: "aaa", 
    dtype: String
  }, 
  {
    column_name: "bbb", 
    dtype: Int32
  }, 
]

Read Period

  1. For each column, read from the CSV according to csv_col_idx.
  2. If the dtype needs a cast, cast it after reading.
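A rough, pure-Python sketch of the scan-period bookkeeping above (the type names and the inference step are stand-ins, not the real implementation):

from dataclasses import dataclass

@dataclass
class PlannedColumn:
    column_name: str
    csv_col_idx: int
    dtype: str | None  # stand-in for a PolarsDataType

def plan_scan(csv_names, columns=None, rename_columns=None, dtypes=None):
    # Step 2: resolve selection and order against the CSV column names.
    if columns is None:
        plan = [PlannedColumn(name, i, None) for i, name in enumerate(csv_names)]
    else:
        plan = [
            PlannedColumn(c, csv_names.index(c), None) if isinstance(c, str)
            else PlannedColumn(csv_names[c], c, None)
            for c in columns
        ]
    # Step 3: rename (only the Sequence[str] form is sketched here).
    if rename_columns is not None:
        for col, new_name in zip(plan, rename_columns):
            col.column_name = new_name
    # Step 4: inject user dtypes, keyed on the *renamed* names.
    for col in plan:
        col.dtype = (dtypes or {}).get(col.column_name)
    # Step 5: infer anything still missing (stubbed as Int32 here).
    for col in plan:
        if col.dtype is None:
            col.dtype = "Int32"
    return plan

print(plan_scan(["bbb", "aaa"], columns=["aaa", "bbb"], dtypes={"aaa": "String"}))
# [PlannedColumn(column_name='aaa', csv_col_idx=1, dtype='String'),
#  PlannedColumn(column_name='bbb', csv_col_idx=0, dtype='Int32')]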

Other behaviour changes

Stabilize the output schema

In order to stabilize the output schema, raise an error when a column referenced in dtypes, rename_columns or columns does not occur, e.g.

  • an index is out of range
  • a column name is not found at the current stage
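For example, under the proposal each of these would raise rather than be silently ignored (hypothetical calls; path is a placeholder):

pl.read_csv(path, columns=[99])                        # index out of range
pl.read_csv(path, dtypes={"no_such_column": pl.Utf8})  # name not found at this stage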

How the new design fits users' intentions

  • Specify the data type for a specific column.
    • by dtypes
  • Ensure the output schema is as expected.
    • by specifying both dtypes and columns; we don't really need schema here.
  • Select some columns of the CSV.
    • by columns
  • Select columns in a specific order.
    • by columns
  • Rename columns.
    • by rename_columns
  • Avoid ugly column names in other parameters.
    • dtypes now uses the new column names produced by rename_columns
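A hypothetical sketch of how the schema use case maps onto the new parameters:

# Today, to pin the output schema (header names are ignored):
pl.read_csv(path, schema={"aaa": pl.Utf8, "bbb": pl.Int32})

# Under the proposal, the same intent is expressed with the other parameters
# (rename_columns covers the case where the header names should be replaced):
pl.read_csv(path, columns=[0, 1], rename_columns=["aaa", "bbb"],
            dtypes={"aaa": pl.Utf8, "bbb": pl.Int32})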

Other issues

These issues can be handled together.

Related issues:

deanm0000 (Collaborator) commented:

Just a couple of comments.

I think scan_csv ought to stay trim and not have more parameters introduced to it. For example, pl.scan_csv(path).select(my_columns) doesn't seem so onerous that we'd also need pl.scan_csv(path, columns=my_columns).

I think schema shouldn't be deprecated, but maybe it should have a check so that if schema is given and columns or dtypes are also given, it raises an error. Maybe it already does this; I don't know offhand.

I'm skeptical of the value of scanning a BytesIO object. I mean, it's in memory; just read it. If it's taking up enough memory that copying it into DataFrame form makes you go OOM, then you're not going to have much memory for queries anyway; just save a tempfile and scan that. Of course, if someone wants to do it then I'm all for having more features rather than fewer, but it seems like really high-hanging fruit.

CanglongCl (Contributor, Author) commented Apr 2, 2024

> I think schema shouldn't be deprecated, but maybe it should have a check so that if schema is given and columns or dtypes are also given, it raises an error. Maybe it already does this; I don't know offhand.

Currently, schema just means "do not infer the schema, use mine", and the further behavior is undefined. In the new implementation we don't need it, since the new design of dtypes + columns + new_columns can completely replace it without any performance loss. The new design also ensures the output schema is stable (see the "Stabilize the output schema" section).

CanglongCl (Contributor, Author) commented:

> I think scan_csv ought to stay trim and not have more parameters introduced to it. For example, pl.scan_csv(path).select(my_columns) doesn't seem so onerous that we'd also need pl.scan_csv(path, columns=my_columns).

One advantage of adding columns is that we can select via index instead of column names. But you are right: without columns, we can achieve the same result with select. However, making read_csv and scan_csv consistent also helps users switch between the two APIs.
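For context, a sketch of the difference using the current APIs (path and the column names are placeholders):

# read_csv can already select by index:
pl.read_csv(path, columns=[0, 2])

# with scan_csv alone, the equivalent select needs the column names up front:
pl.scan_csv(path).select(["first_col", "third_col"]).collect()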

CanglongCl (Contributor, Author) commented:

@ritchie46 Hi, if you are available, could you please take a look at this proposal? I'm more than happy to help contribute.

caniko commented Apr 27, 2024

> One advantage of adding columns is that we can select via index instead of column names. But you are right: without columns, we can achieve the same result with select. However, making read_csv and scan_csv consistent also helps users switch between the two APIs.

Selecting via index is sometimes the only way to select. Just to stay simple, you'd be ruining the day of the many clients that may need this. Simplicity by omission is evil.

I would want to combine selecting columns by indices with schema, which is not (and should not be) possible with the existing lazy API.
