
Problem reading one-column files with guessed separator #13839

Closed
Mark531 opened this issue Jul 29, 2016 · 9 comments
Labels
IO CSV read_csv, to_csv Usage Question

Comments

@Mark531

Mark531 commented Jul 29, 2016

Code Sample, a copy-pastable example if possible

Create a simple csv file:
----- C:\myfile.csv
name
1
2
3
4
----- C:\myfile.csv
And run:
import pandas as pd
pd.read_csv(r"C:\myfile.csv", encoding="ISO-8859-1", sep=None, engine="python")
This triggers an error:
ValueError: Expected 2 fields in line 2, saw 1
If a specific separator is entered, it works.

Expected Output

The file should be imported correctly.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr_FR

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

@gfyoung
Member

gfyoung commented Jul 31, 2016

When you don't specify a delimiter, the Python engine will try to "sniff" it out from the first line using the Sniffer class from Python's built-in csv library. The reason I think you're getting this error is because the sniffer cannot tell what delimiter you want:

>>> import csv
>>> line = 'name\n'  # that is your first line
>>> print(csv.Sniffer().sniff(line).delimiter)
n

I would not consider this a bug because your data does not provide sufficient information for parsing a delimiter of any kind, nor is the problem originating in the actual pandas implementation. I think as you did, specifying the delimiter, will get around this issue.
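One way to keep the sniffer from picking an ordinary letter is to restrict it to a set of plausible delimiters through the `delimiters` argument of `csv.Sniffer.sniff`. The following is only a minimal sketch: the helper name `sniff_or_none` and the candidate set `",;\t|"` are assumptions made for illustration, not anything pandas does internally.

```python
import csv


def sniff_or_none(sample, candidates=",;\t|"):
    """Sniff a delimiter, considering only the given candidates.

    Returns None when csv.Sniffer cannot settle on any of them,
    which is what happens for a sample with no delimiter at all.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised by the sniffer when no delimiter could be determined.
        return None


if __name__ == "__main__":
    print(sniff_or_none("name\n"))
    print(sniff_or_none("a;b\n1;2\n3;4\n"))
```

A caller can then treat `None` as "single column" and choose a separator explicitly instead of relying on `sep=None`.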

@Mark531
Author

Mark531 commented Aug 1, 2016

I don't agree. To me this is clearly a bug, since the CSV file is perfectly formed. The option "sep=None" should work in this configuration, where no delimiter is required. Specifying a delimiter will not get around the issue, since my code is generic and must be able to parse CSV files with any delimiter.

@gfyoung
Member

gfyoung commented Aug 1, 2016

Perfectly formed is moot. Python's parser has no way of knowing that you have no delimiters in your file, and it has to go with what it is given.

In addition, as I mentioned previously, the issue lies in Python's sniffing, which is out of the control of pandas.

@TomAugspurger
Contributor

Agreed with @gfyoung, since we're just using python's csv module for sniffing here. I don't think we should try to improve on it.

@Mark531 I would look into whether this is a bug in csv.Sniffer (i.e. whether it should be able to infer that there isn't actually a delimiter; it might just not be supported), or come up with your own sniffer based on the data you'll be seeing.
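A hand-rolled sniffer of the kind suggested above could look like the sketch below. It is an illustration under stated assumptions, not a replacement for `csv.Sniffer`: the function name `guess_delimiter` and the candidate list are hypothetical, and the count-based heuristic ignores quoting, so a delimiter inside a quoted field would confuse it.

```python
def guess_delimiter(lines, candidates=(",", ";", "\t", "|")):
    """Return the first candidate that occurs the same nonzero number
    of times on every non-blank line, or None if no candidate does.

    None is the signal that the data looks like a single column.
    """
    rows = [ln for ln in lines if ln.strip()]
    for cand in candidates:
        counts = {ln.count(cand) for ln in rows}
        # Exactly one distinct count, and that count is not zero.
        if len(counts) == 1 and counts != {0}:
            return cand
    return None


if __name__ == "__main__":
    print(guess_delimiter(["name", "1", "2", "3", "4"]))
    print(guess_delimiter(["a;b", "1;2", "3;4"]))
```

When the result is `None`, generic code can fall back to any explicit separator that does not occur in the data; as noted in the original report, reading the file works once a specific separator is given.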

@Mark531
Author

Mark531 commented Aug 1, 2016

No, a correctly-formed CSV file is governed by rules that are met by the file I showed. Any correct tokenizer is able to handle a single token, that's a very basic case. Now, if you rely on another library to tokenize items, I understand that you can't correct this behaviour.

@gfyoung
Member

gfyoung commented Aug 1, 2016

@Mark531: I think you're not fully understanding our point here. Your point about correctness is moot, as you have not answered the question of how the Python parser is supposed to know that your data has no delimiters.

@Mark531
Author

Mark531 commented Aug 1, 2016

The correctness of a CSV file is not moot; it's specified here: https://tools.ietf.org/html/rfc4180. So, to sum up, I try to parse a correct CSV file with a generic call to read_csv (with the separator noted as "guessed") and it fails. So, it's a bug. The option "sep=None" should logically handle the case "no delimiter". If it does not, then it should be corrected.

@gfyoung
Member

gfyoung commented Aug 1, 2016

It is moot because that wasn't related to what I was saying. In addition, as we have made it clear, the "bug" you are insisting upon is not with pandas.

@Mark531
Author

Mark531 commented Aug 1, 2016

I understood, I'll forward it to Python dev then.
