Problem reading one-column files with guessed separator #13839
Comments
When you don't specify a delimiter, the Python engine will try to "sniff" it out from the first line using csv.Sniffer:
>>> import csv
>>> line = 'name\n'  # that is your first line
>>> print(csv.Sniffer().sniff(line).delimiter)
n
I would not consider this a bug because your data does not provide sufficient information for parsing a delimiter of any kind, nor is the problem originating in the actual pandas code.
I don't agree, for me this is clearly a bug since the CSV file is perfectly formed. The option "sep=None" should work in this configuration, where no delimiter is required. Specifying a delimiter will not get around the issue since my code is generic and must be able to parse CSV files with any delimiter.
Perfectly formed is moot. Python's parser has no way of knowing that you have no delimiters in your file, and it has to go with what it is given. In addition, as I mentioned previously, the issue lies in Python's sniffing, which is out of the control of pandas.
Agreed with @gfyoung, since we're just using Python's sniffer. @Mark531, I would look into whether this is a bug in Python's sniffer.
No, a correctly-formed CSV file is governed by rules that are met by the file I showed. Any correct tokenizer is able to handle a single token, that's a very basic case. Now, if you rely on another library to tokenize items, I understand that you can't correct this behaviour.
@Mark531: I think you're not fully understanding our point here. Your point about correctness is moot, as you have not answered the question of how the Python parser is supposed to know that your data has no delimiters.
The correctness of a CSV file is not moot; it is specified here: https://tools.ietf.org/html/rfc4180. So, to sum up, I try to parse a correct CSV file with a generic call to read_csv (with the separator noted as "guessed") and it fails. So, it's a bug. The option "sep=None" should logically handle the case "no delimiter". If it does not, then it should be corrected.
It is moot because that wasn't related to what I was saying. In addition, as we have made it clear, the "bug" you are insisting upon is not with pandas.
I understood, I'll forward it to Python dev then.
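For generic code like that described in the thread, one possible workaround is to run the sniffing step yourself with a restricted set of candidate delimiters and treat a failed guess as "single column". This is only a sketch and not something the participants proposed: the helper name guess_delimiter and the candidate set are assumptions made for illustration. It relies only on the standard library's csv.Sniffer().sniff(sample, delimiters=...), which raises csv.Error when it cannot determine a delimiter.

```python
import csv

# Hypothetical candidate set; extend it to cover the delimiters your files may use.
CANDIDATE_DELIMITERS = ",;\t|"


def guess_delimiter(sample, candidates=CANDIDATE_DELIMITERS):
    """Return the sniffed delimiter, or None if no candidate fits the sample."""
    try:
        # Restricting `delimiters` stops the sniffer from picking a letter.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The sniffer could not settle on any candidate: assume a single column.
        return None


one_column = "name\n1\n2\n3\n4\n"
two_columns = "name;value\n1;10\n2;20\n"

print(guess_delimiter(one_column))
print(guess_delimiter(two_columns))
```

When the helper returns None, the caller could pass any fixed separator (for example ",") to read_csv, which, as the report itself notes, reads the one-column file correctly.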
Code Sample, a copy-pastable example if possible
Create a simple csv file:
----- C:\myfile.csv
name
1
2
3
4
----- C:\myfile.csv
And run:
import pandas as pd
pd.read_csv(r"C:\myfile.csv", encoding="ISO-8859-1", sep=None, engine="python")
This triggers an error:
ValueError: Expected 2 fields in line 2, saw 1
If a specific separator is entered, it works.
Expected Output
The file should be imported correctly.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 8.1
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: fr_FR
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0