pd.read_csv header inference #33188

Open
kernc opened this issue Mar 31, 2020 · 1 comment
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@kernc
Contributor

kernc commented Mar 31, 2020

Code Sample, a copy-pastable example if possible

With data.csv.txt:

>>> pd.read_csv('/tmp/data.csv.txt')  # header='infer' default
   1540  125205  PRODAJA  04-JUN-18    -3  -.625  -.726
0  1540  125205  PRODAJA  29-JUN-18 -1.00  -0.62  -0.75
1  1540  125205  PRODAJA  09-JUL-18 -3.00  -0.62  -0.77
2  1540  125205  PRODAJA  25-JUL-18 -2.00  -0.41  -0.51
3  1540  125205  PRODAJA  01-NOV-18 -3.00  -0.62  -0.74
4  1540  125205  PRODAJA  24-AUG-18 -1.00  -0.21  -0.26

Problem description

Given data like the above, where the first row is itself a data record, a plain RangeIndex for the column labels would be preferred.

As a heuristic, floats are rarely header labels, and neither are values equal to the value directly below them in the next row. When six sevenths of the first row's values are such unlikely header labels, the row itself is unlikely to be a header.
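To make the heuristic concrete, here is a minimal sketch using only the standard library. It is not how pandas infers headers; the function names, the 0.5 threshold, and the comma delimiter are all assumptions for illustration. The sample rows are reconstructed from the printed output above (the attached file's raw contents are not shown here), so values after the first row are rounded as displayed.

```python
import csv
import io

# Reconstructed from the output shown in this issue; a comma delimiter is assumed.
SAMPLE = """\
1540,125205,PRODAJA,04-JUN-18,-3,-.625,-.726
1540,125205,PRODAJA,29-JUN-18,-1.00,-0.62,-0.75
1540,125205,PRODAJA,09-JUL-18,-3.00,-0.62,-0.77
"""


def looks_like_float(value):
    """True if the string parses as a number and is written with a decimal point."""
    try:
        float(value)
    except ValueError:
        return False
    return "." in value


def unlikely_header_fraction(text):
    """Fraction of first-row values that are floats or equal the value below them."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2 or not rows[0]:
        return 0.0
    first, second = rows[0], rows[1]
    unlikely = sum(
        1 for a, b in zip(first, second) if looks_like_float(a) or a == b
    )
    return unlikely / len(first)


def first_row_is_probably_data(text, threshold=0.5):
    """Hypothetical decision rule: too many unlikely labels means no header."""
    return unlikely_header_fraction(text) > threshold


print(first_row_is_probably_data(SAMPLE))
print(first_row_is_probably_data("date,qty,price\n04-JUN-18,-3,-.625\n"))
```

On the sample, this strict check flags five of the seven first-row values (the `-3` is written without a decimal point, so it is not counted as a float), which is still above the assumed threshold. A file with a genuine text header row scores zero.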

One can override the inference entirely by passing header=None, but the default inference does seem to be lacking.
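For reference, a short sketch of that workaround. The rows are again reconstructed from the printed output above with an assumed comma delimiter, and the column names passed to `names=` are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd

# Reconstructed from the output shown in this issue; values after the first
# row are rounded as displayed, and the comma delimiter is an assumption.
SAMPLE = """\
1540,125205,PRODAJA,04-JUN-18,-3,-.625,-.726
1540,125205,PRODAJA,29-JUN-18,-1.00,-0.62,-0.75
1540,125205,PRODAJA,09-JUL-18,-3.00,-0.62,-0.77
1540,125205,PRODAJA,25-JUL-18,-2.00,-0.41,-0.51
1540,125205,PRODAJA,01-NOV-18,-3.00,-0.62,-0.74
1540,125205,PRODAJA,24-AUG-18,-1.00,-0.21,-0.26
"""

# header=None tells read_csv that the file has no header row, so the first
# line is kept as data and the columns are labelled 0..6.
df = pd.read_csv(io.StringIO(SAMPLE), header=None)
print(df)

# Hypothetical column names; the issue does not say what the columns mean.
COLUMN_NAMES = ["code", "account", "kind", "date", "qty", "a", "b"]
named = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMN_NAMES)
print(named.columns.tolist())
```

Either call keeps all six rows as data; the second simply replaces the integer labels with explicit names.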

Expected Output

>>> pd.read_csv('/tmp/data.csv.txt')
      0       1        2          3     4     5     6
0  1540  125205  PRODAJA  04-JUN-18 -3.00 -0.62 -0.73
1  1540  125205  PRODAJA  29-JUN-18 -1.00 -0.62 -0.75
2  1540  125205  PRODAJA  09-JUL-18 -3.00 -0.62 -0.77
3  1540  125205  PRODAJA  25-JUL-18 -2.00 -0.41 -0.51
4  1540  125205  PRODAJA  01-NOV-18 -3.00 -0.62 -0.74
5  1540  125205  PRODAJA  24-AUG-18 -1.00 -0.21 -0.26

Output of pd.show_versions()

1.1.0.dev0+786.gec7734169

jbrockmendel added the IO CSV read_csv, to_csv label on Apr 2, 2020
@jbrockmendel
Member

You're right about the float heuristic, but in my experience this is a rabbit hole that we really don't want to go down.

3 participants