pd.read_csv header inference #33188

kernc · 2020-03-31T18:12:28Z

Code Sample, a copy-pastable example if possible

>>> pd.read_csv('/tmp/data.csv.txt')  # header='infer' default
   1540  125205  PRODAJA  04-JUN-18    -3  -.625  -.726
0  1540  125205  PRODAJA  29-JUN-18 -1.00  -0.62  -0.75
1  1540  125205  PRODAJA  09-JUL-18 -3.00  -0.62  -0.77
2  1540  125205  PRODAJA  25-JUL-18 -2.00  -0.41  -0.51
3  1540  125205  PRODAJA  01-NOV-18 -3.00  -0.62  -0.74
4  1540  125205  PRODAJA  24-AUG-18 -1.00  -0.21  -0.26

Problem description

Given data like the above, a plain RangeIndex header would be preferred.

As a heuristic, floats are rarely found in a header row, and neither are values equal to the value directly below them in the next row. When six sevenths of the first row's values are such unlikely header candidates, that row is probably data rather than a header.
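The heuristic described above could be sketched roughly as follows. This is only an illustration and not how pandas infers headers: the function name `first_row_looks_like_data`, the 0.5 threshold, the comma delimiter, and the choice to treat any numeric token (not only floats) as an unlikely header value are all assumptions made for the example.

```python
import csv
import io


def _is_number(value: str) -> bool:
    """Return True if the token parses as a number (e.g. '1540', '-.625')."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def first_row_looks_like_data(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical check: does the first CSV row look like data, not a header?

    A first-row value counts as an unlikely header candidate if it is numeric
    or if it equals the value in the same column of the second row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2 or not rows[0]:
        return False  # not enough information to decide
    first, second = rows[0], rows[1]
    unlikely = sum(1 for a, b in zip(first, second) if _is_number(a) or a == b)
    return unlikely / len(first) > threshold


# Hypothetical sample modelled on the rows shown in this report.
sample = (
    "1540,125205,PRODAJA,04-JUN-18,-3,-.625,-.726\n"
    "1540,125205,PRODAJA,29-JUN-18,-1,-.625,-.75\n"
)
print(first_row_looks_like_data(sample))
```

On this sample, six of the seven first-row values count as unlikely header candidates (only the date does not), so the function returns `True`. A file whose first row holds ordinary column names would score zero and return `False`.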

One can override the inference entirely by passing header=None, but the inference itself does seem to be lacking.
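For reference, a minimal sketch of the `header=None` workaround mentioned above. The in-memory sample and its comma delimiter are assumptions, since the contents of the original file are not shown here.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file's delimiter and raw values are assumed.
data = (
    "1540,125205,PRODAJA,04-JUN-18,-3,-.625,-.726\n"
    "1540,125205,PRODAJA,29-JUN-18,-1,-.625,-.75\n"
)

# Default header='infer': the first row is consumed as column names.
inferred = pd.read_csv(io.StringIO(data))

# header=None: every row is kept as data and the columns are numbered 0..6.
explicit = pd.read_csv(io.StringIO(data), header=None)

print(inferred.shape)
print(explicit.shape)
print(list(explicit.columns))
```

With the default, the first record ends up in the column labels and only one data row remains. With `header=None`, both rows are kept and the columns get integer labels.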

Expected Output

>>> pd.read_csv('/tmp/data.csv')
      0       1        2          3     4     5     6
0  1540  125205  PRODAJA  04-JUN-18 -3.00 -0.62 -0.73
1  1540  125205  PRODAJA  29-JUN-18 -1.00 -0.62 -0.75
2  1540  125205  PRODAJA  09-JUL-18 -3.00 -0.62 -0.77
3  1540  125205  PRODAJA  25-JUL-18 -2.00 -0.41 -0.51
4  1540  125205  PRODAJA  01-NOV-18 -3.00 -0.62 -0.74
5  1540  125205  PRODAJA  24-AUG-18 -1.00 -0.21 -0.26

Output of `pd.show_versions()`

1.1.0.dev0+786.gec7734169

jbrockmendel · 2020-04-03T17:31:21Z

You're right about the float heuristic, but in my experience this is a rabbit hole that we really don't want to go down.

jbrockmendel added the IO CSV read_csv, to_csv label Apr 2, 2020

mroeschke added the Enhancement label Jul 31, 2021
