read_fwf 'infer' where first hundred lines differ from other lines #15138

Closed
adamboche opened this issue Jan 16, 2017 · 5 comments

@adamboche

commented Jan 16, 2017

Code Sample, a copy-pastable example if possible

  1     1   -13.120080   0.229   0.484  -0.378  -0.872
  1     2    -1.902843  -0.090   0.256   1.791   0.967
  1     3   -22.050698  -0.176  -0.394   0.922  -0.454
  1     4   -30.349928   0.081  -0.194  -0.327  -0.981
  1     5   -22.204160  -0.168  -0.197   0.984  -0.266
  1     6   -28.001753  -0.065   0.597  -0.203  -0.802
  1     7   -17.247524   0.108   0.194   0.474   0.774
  1     8   -28.014811   0.017   0.994   0.493   0.112
  1     9   -13.325491   0.259   0.189  -1.275   0.149
  1    10   -10.063621   0.327   0.108  -1.784   0.061
...
115    18     5.697000   0.391  -0.027   0.252   1.000
115    19     8.324000  -0.283   0.132   0.227  -0.216
115    20    48.451000   0.070  -0.041   0.379  -0.082
115    21     0.146000   0.677   0.031  -0.561  -0.149
115    22     1.443000  -0.706  -0.033  -0.222   0.035
115    23     4.595000   0.654  -0.081   0.774   0.997
115    24     0.146000  -0.677   0.031   0.561  -0.149
115    25     4.595000   0.654  -0.081   0.774   0.997
115    26     6.769000  -0.363  -0.093  -0.298   0.996
115    27    24.157000  -0.280  -0.324  -0.142  -0.946

Problem description

I have a long fixed-width file (>100k lines) whose head and tail are shown above. I want to read this file with pandas, and I figure pd.read_fwf is the way to do this. The issue comes up because it infers the column boundaries from the first hundred lines, which start with '  1', and concludes "let's start reading at [2]". The last hundred lines start with 115, so it skips the initial 11 and starts each of those lines at the 5, and I lose data.

A couple of approaches to solving this issue come to mind, though I'm sure there are others:

  • Don't infer until all lines are scanned
  • Take as an argument the number of lines to be scanned before concluding the format, including the option to scan all (e.g. infer_from_all)
  • Take as an argument which direction to scan -- top to bottom or bottom to top
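To make the failure mode and the second option concrete, here is a minimal pure-Python sketch of boundary inference from a sample of rows. It is a simplified illustration, not pandas' actual implementation: the function name infer_colspecs, its nrows argument, and the generated data are all hypothetical, chosen only to mimic the shape of the file above.

```python
def infer_colspecs(lines, nrows=100):
    """Infer (start, end) column spans from the first `nrows` lines.

    nrows=None scans every line. Hypothetical helper for illustration only.
    """
    sample = lines if nrows is None else lines[:nrows]
    width = max(len(line) for line in sample)
    # occupied[i] is True if any sampled line has a non-space character at i.
    # The extra trailing False guarantees the last span gets closed.
    occupied = [False] * (width + 1)
    for line in sample:
        for i, ch in enumerate(line):
            if ch != " ":
                occupied[i] = True
    specs, start = [], None
    for i, flag in enumerate(occupied):
        if flag and start is None:
            start = i                    # a run of occupied positions begins
        elif not flag and start is not None:
            specs.append((start, i))     # the run ends just before i
            start = None
    return specs


# Made-up data: 100 rows in group 1, followed by rows in group 115.
head = [f"{1:>3}{i:>6}{-float(i):>13.6f}" for i in range(1, 101)]
tail = [f"{115:>3}{i:>6}{float(i):>13.6f}" for i in range(18, 28)]
lines = head + tail

print(infer_colspecs(lines))              # [(2, 3), (6, 9), (11, 22)]
print(infer_colspecs(lines, nrows=None))  # [(0, 3), (6, 9), (11, 22)]
```

With only the first 100 rows sampled, the first column is taken to be the single character at position 2, so "115" is cut down to "5". Scanning every row widens that column to positions 0 through 2.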

Output of pd.show_versions()

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

@jreback

Contributor

commented Jan 17, 2017

I think it would be reasonable to pass down a new parameter, maybe infer_nrows with a default of 100 to do this. Then you can control the amount of inference you want. It could accept infer_nrows='all' if you want to infer from all the rows I suppose.
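A sketch of how such a parameter reads in use. It assumes a pandas release whose read_fwf accepts infer_nrows (0.24.0 or later, as far as the released documentation indicates). The 'all' spelling is not assumed here; the total row count is passed instead. The data is made up to mimic the shape of the file in the report.

```python
import io

import pandas as pd

# Made-up data: 100 rows in group 1, followed by rows in group 115.
head = [f"{1:>3}{i:>6}{-float(i):>13.6f}" for i in range(1, 101)]
tail = [f"{115:>3}{i:>6}{float(i):>13.6f}" for i in range(18, 28)]
text = "\n".join(head + tail) + "\n"

# Default: column boundaries are inferred from the first 100 rows only.
default = pd.read_fwf(io.StringIO(text), header=None)

# Widened: sample every row before fixing the boundaries.
widened = pd.read_fwf(
    io.StringIO(text), header=None, infer_nrows=len(head) + len(tail)
)

print(default.iloc[-1, 0])  # 5
print(widened.iloc[-1, 0])  # 115
```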

PRs would be welcome!

@keshavramaswamy

Contributor

commented Jan 21, 2017

I am on it :)

@yakovkeselman

commented Mar 20, 2017

Ran into the same issue. Had to sort the file to have larger numbers on top.

I'd suggest, at least for numeric columns, not to infer the left boundary for the first column but go with 0.

I'd rather see a very conservative approach where nothing gets stripped/truncated initially. Rather, we can trim strings later on, if needed.

Thanks!
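Until inference changes, the conservative behaviour described above can be had by skipping inference and giving read_fwf explicit field widths, so the first column always starts at position 0. The sketch below uses four rows from the sample in the original report. The widths are read off that sample as it appears on this page and are an assumption about the real file's layout.

```python
import io

import pandas as pd

# Two rows from the head and two from the tail of the sample in the report.
sample = (
    "  1     1   -13.120080   0.229   0.484  -0.378  -0.872\n"
    "  1     2    -1.902843  -0.090   0.256   1.791   0.967\n"
    "115    18     5.697000   0.391  -0.027   0.252   1.000\n"
    "115    19     8.324000  -0.283   0.132   0.227  -0.216\n"
)

# Assumed layout: 3 + 6 + 13 + 4 * 8 characters per row.
widths = [3, 6, 13, 8, 8, 8, 8]

# Explicit widths bypass boundary inference, so nothing is truncated
# regardless of which rows come first in the file.
df = pd.read_fwf(io.StringIO(sample), widths=widths, header=None)

print(df[0].tolist())  # [1, 1, 115, 115]
```

Surrounding whitespace in each field is still stripped during parsing, so the fixed left boundary costs nothing for numeric columns.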

@jreback

Contributor

commented Mar 20, 2017

@yakovkeselman contributions welcome! This is pretty straightforward to actually implement.

@jreback modified the milestones: 0.20.0, Next Major Release on Mar 23, 2017

@rdmontgomery

Contributor

commented Oct 19, 2018

I just submitted a PR to address this issue. I left out the infer_nrows='all' option because I couldn't find an easy/efficient manner of knowing the number of rows in the file. I could have read in the whole file and calculated it, but I wasn't sure how that would react to partial reads or partial buffers. Any ideas on that would be welcome. Thanks!
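For a file on disk, one way a caller can approximate "all" is to count lines in a cheap streaming pass and hand that number to infer_nrows. This is only a sketch under stated assumptions: count_lines is a hypothetical helper, not part of pandas, it assumes a pandas release that accepts infer_nrows, and it does not address the partial-read or partial-buffer case raised above, since it needs a path that can be opened twice.

```python
import os
import tempfile

import pandas as pd


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes by streaming the file in binary chunks.

    Hypothetical helper. A file without a trailing newline is
    undercounted by one.
    """
    total = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


# Made-up data: 100 rows in group 1, followed by rows in group 115.
rows = [f"{1:>3}{i:>6}{-float(i):>13.6f}" for i in range(1, 101)]
rows += [f"{115:>3}{i:>6}{float(i):>13.6f}" for i in range(18, 28)]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    path = tmp.name

try:
    n_rows = count_lines(path)
    # Sample every row when inferring the column boundaries.
    df = pd.read_fwf(path, header=None, infer_nrows=n_rows)
finally:
    os.remove(path)

print(n_rows)           # 110
print(df.iloc[-1, 0])   # 115
```

The trade-off is a second pass over the file, plus whatever memory the inference step needs to hold the sampled rows.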
