BUG: read_csv, sep=None, wrong separator guessing in case of one column csv #53035

Open
2 of 3 tasks
louisgeisler opened this issue May 2, 2023 · 22 comments · Fixed by #53153
Labels: Bug · IO CSV (read_csv, to_csv) · Needs Discussion (Requires discussion from core team before further action)

Comments

@louisgeisler

louisgeisler commented May 2, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv
import pandas as pd

one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""

with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

print(pd.read_csv("dummy.csv", sep=None))
# Bad detection of the separator, but no error

# According to the docs, csv.Sniffer is used in the backend,
# so I also tested it:
with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.read(5000)).delimiter)
    # Throws an error "Could not determine delimiter" (which is better)

Issue Description

When pd.read_csv is used with the parameter sep=None, it fails to automatically detect a single-column csv.

I have already read this conversation (2016): #13839
They had the same problem but came to the conclusion that the error comes from the csv sniffer Python library.

But I think this is no longer relevant, as csv.Sniffer now raises an error while pandas still insists on parsing it wrong.

Expected Behavior

It should automatically detect that it is a single-column file, or at least raise an error.

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-41-generic
Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.1
numpy : 1.23.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.1
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@louisgeisler added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 2, 2023
@DeaMariaLeon added the IO CSV (read_csv, to_csv) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 3, 2023
@DeaMariaLeon
Member

Hi @louisgeisler, you don't see this warning?

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.

So you need to change your print line to print(pd.read_csv("dummy.csv", engine='python')).
I did it with pandas 2.0.1, an older version, and the dev version. They all work.

Thank you for opening the issue but I will close it now.

@louisgeisler
Author

louisgeisler commented May 3, 2023

Hi @DeaMariaLeon, I will make the issue clearer:

When you use this:
print(pd.read_csv("dummy.csv", sep=None, engine="python"))
You get this:

            H  AD   R
0  hzhrzhjtrj NaN NaN  
1      rzthyy NaN NaN  
2      azr231 NaN NaN  
3     zrhtzrh NaN NaN  
4     zrhtzrh NaN NaN

instead of this:

       HEADER
0  hzhrzhjtrj
1      rzthyy
2      azr231
3     zrhtzrh
4     zrhtzrh

I would appreciate it if you reopened the issue...

@DeaMariaLeon reopened this May 3, 2023
@Toroi0610
Contributor

Toroi0610 commented May 7, 2023

Hi, @louisgeisler,
Perhaps the csv part of the sample code you provided does not reproduce the actual behavior of pandas.read_csv. In fact, if you look at this link to python_parser.py, it seems that only the first line is read to guess the delimiter.

import csv

with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.readline()).delimiter)
    # Outputs "E". Therefore, HEADER --> H[sep]AD[sep]R

Therefore, this seems to be a problem with the csv library, not with pandas, as was concluded in #13839.

Furthermore, I found the rules for delimiter analysis in the _guess_delimiter function of the csv library. It seems that the most frequent character is selected, which is why "E" was output this time. The following test outputs "R":

print(csv.Sniffer().sniff("HREARDER").delimiter)
# output : "R"

@asishm
Contributor

asishm commented May 7, 2023

The docs for read_csv does mention that it uses the builtin csv.Sniffer

If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer.

I would say this behaves as expected. A line stating that only the first valid line is passed on to csv.Sniffer could probably be added to the docs.

@Toroi0610
Contributor

Toroi0610 commented May 7, 2023

May I add it?

@louisgeisler
Author

louisgeisler commented May 7, 2023

For me the issue does come from pandas; why is only the first line of the csv passed to the sniffer?

From what I understand of the Sniffer example (https://docs.python.org/3/library/csv.html), they pass more than just the header.

And intuitively, it seems like common sense to me to detect the separator with more than just the header...
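
For reference, the pattern in the docs example looks roughly like this (a paraphrased sketch; "example.csv" is just a placeholder name), sampling the first 1024 bytes rather than a single line:

import csv

with open("example.csv", newline="") as csvfile:
    sample = csvfile.read(1024)  # sample well beyond the header line
    dialect = csv.Sniffer().sniff(sample)
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)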

@asishm
Contributor

asishm commented May 8, 2023

  1. sniffing by itself is more of an art than a science; and indeed the documentation does state it's based on a rough heuristic and can generate false positives and negatives.
  2. The documentation for it says to use "sample text" without any guidance of what constitutes a sample. The example only reads the first 1024 bytes. It's also possible the header line is longer than that. If your suggestion is to feed in more lines to the sample - then what is a sensible limit? one/two/the full file?

Ultimately if you are passing files with unknown delimiters, you should expect cases where the right delimiter is not detected. I'm not a pandas maintainer though - so feel free to discuss further!

@Toroi0610
Contributor

Toroi0610 commented May 8, 2023

My example conforms to lines 212-219 of python_parser.py.

I understand your question. However, the issue about how to find the separator when passing the entire csv is still dependent on the csv library.

As for why only the first line is passed, I think it is an art as @asishm commented.

@louisgeisler
Author

louisgeisler commented May 8, 2023

Oh, ok, I understand.

But in this case, why not always pass the whole csv to the sniffer?

As my example shows, if you pass the whole csv, it will raise an error.

So I think we can conclude that passing the whole csv, or let's say a sample of at least 100 lines (detected via '\n'), would really improve the reliability of read_csv when sep=None, wouldn't it?

Because I think we would agree that passing only the header to the sniffer is probably the least reliable choice, isn't it? 😅

@Toroi0610
Contributor

Passing the entire csv to the sniffer may result in an error, as in the following example:

one_col_csv = """col1, col2, col3
hzh,rzhj,trj
rzth,yy
azr23,1
zr,ht,zrh
zrhtz,rh"""

It seems that csv.Sniffer raises an error if the number of delimiters per line differs (maybe I don't understand it well enough), and that behavior is not what I would expect.

I agree with some of the claims about reliability, but on the other hand, to do that I need a way to determine from the input csv alone that it is a single-column csv, and I haven't come up with a good way to do that.
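
A quick way to check that behavior, as a sketch (reusing the one_col_csv sample defined just above):

import csv

try:
    dialect = csv.Sniffer().sniff(one_col_csv)
    print("detected delimiter:", repr(dialect.delimiter))
except csv.Error as exc:
    # With the sample above, the sniffer fails with "Could not determine delimiter"
    print("sniff failed:", exc)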

@louisgeisler
Copy link
Author

In my opinion, it's a very good thing that it raises an error if the csv isn't well formatted (for example, if the number of separators per line differs) 👍

And about one-column csv, I also think it's better to get an error stating it can't infer the delimiter.
In this case, pandas may keep the error, or show a warning and use the comma separator by default, on the hypothesis that it is a one-column file.

What do you think about it?

@asishm-wk

Passing in the entire csv is also not performant. It's also very easy to come up with a csv that would still cause the sniffer to give you the current results.

For example, the contrived sample below still outputs 'E' as the delimiter:

In [1]: s = '''HEADER
   ...: THEREWAS'''

In [2]: import csv

In [3]: csv.Sniffer().sniff(s).delimiter
Out[3]: 'E'

Hence my suggestion to either

  1. explicitly document the fact that only the first valid line of the csv is passed to csv.Sniffer to determine the delimiter, or
  2. deprecate and eventually remove sep=None as an option - but I don't know if pandas uses this internally anywhere else.

@DeaMariaLeon
Member

Another option would be to add a comment in the docs that the separator found when sep=None might not be correct.

Pinging @phofl for visibility

@DeaMariaLeon added the Needs Discussion (Requires discussion from core team before further action) label May 9, 2023
@phofl
Member

phofl commented May 9, 2023

Not a fan of the sniffing.

That said, exploring passing the first header + 1 lines might be interesting. We should definitely clarify the docs.
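
A rough sketch of that idea (illustrative only, not the actual pandas internals): build the sniffing sample from the header plus the first data line instead of the header alone.

import csv
from itertools import islice

def guess_sep_from_first_lines(path, n_lines=2):
    # Hypothetical helper: sniff the header plus the next line(s)
    # rather than only the header.
    with open(path, newline="") as fh:
        sample = "".join(islice(fh, n_lines))
    return csv.Sniffer().sniff(sample).delimiter

Whether this helps or hurts for single-column files like the original example is exactly what would need exploring.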

@DeaMariaLeon
Member

May I add it?

@Toroi0610 do you still want to improve the docs?

@Toroi0610
Contributor

@DeaMariaLeon yes, I do.

@Toroi0610
Contributor

take

@louisgeisler
Author

I think the best compromise may be something like passing the first 30 lines of the csv file to the sniffer.
Something like this:

import csv
import pandas as pd


one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""


with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

def read_csv(path: str, sep: str, **kwargs) -> pd.DataFrame:
    
    N_LINE_FOR_SNIFFER = 30  # We take the first 30 lines to sniff
    
    if sep is None:
        with open(path) as file:
            line_samples = "\n".join(file.readlines(N_LINE_FOR_SNIFFER))
            try:
                sep = csv.Sniffer().sniff(line_samples).delimiter
            except Exception as e:
                if str(e)=="Could not determine delimiter":
                    sep = "," # Presume it is a single column file, so define a random separator
                else:
                    raise Exception(e)
            finally:
                return pd.read_csv(path, sep=sep, **kwargs)
    else:
        return pd.read_csv(path, sep=sep, **kwargs)
    
#read_csv("dummy.csv", sep=None)

print(pd.read_csv("dummy.csv", sep=None))

It would hugely increase the reliability of the sniffer, while still being fast since we don't read the whole file...
What do you think about this alternative?

(I really think that sep=None is a very convenient feature that should be improved instead of removed...)

mroeschke pushed a commit that referenced this issue May 9, 2023
docs: #53035 clarify the behavior of sep=None.
@phofl
Member

phofl commented May 9, 2023

Reopening to keep track of the sniffer exploration

@phofl reopened this May 9, 2023
@louisgeisler
Author

I discovered this beautiful library, "CleverCSV", that may solve our (my) problem: https://github.com/alan-turing-institute/CleverCSV

It may be interesting to integrate it into pandas directly, or to use it as a backend to read csv instead of the built-in csv library...
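
If I read the CleverCSV README correctly, its Sniffer is meant as a drop-in replacement for csv.Sniffer, so wiring it in manually would look roughly like this (untested sketch; the fallback to "," when nothing is detected is my own assumption):

# pip install clevercsv
import clevercsv
import pandas as pd

with open("dummy.csv", newline="") as fh:
    dialect = clevercsv.Sniffer().sniff(fh.read())

sep = getattr(dialect, "delimiter", "") or ","  # assumption: fall back to "," if no delimiter is found
df = pd.read_csv("dummy.csv", sep=sep, engine="python")
print(df)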

@Toroi0610
Contributor

Toroi0610 commented May 13, 2023

Hi @louisgeisler, I am also beginning to feel that it is important to be able to choose to have no separator in the csv (i.e. read it as a single-column csv) when using read_csv. Let me comment on your code.

file.readlines(N_LINE_FOR_SNIFFER)

This part of the code reads roughly N_LINE_FOR_SNIFFER bytes (the argument is a size hint), not N_LINE_FOR_SNIFFER lines of the file.
Please check https://docs.python.org/3/library/codecs.html?highlight=readlines#codecs.StreamReader.read.

How about the following, for example?

def get_first_n_lines(path, n_lines : int):
    first_n_lines = []
    with open(path, 'r') as file:
        for _ in range(n_lines):
            line = file.readline()
            if not line:
                break
            first_n_lines.append(line)
    return first_n_lines

Also, how about ',\t; :' as a random separator? This is a result of ''.join(csv.Sniffer().preferred). I think that "," alone would be more likely to be included in the csv.

@louisgeisler
Author

louisgeisler commented May 13, 2023

You're totally right about readlines 👍

However, I would disagree with you about the default separator to use. I mean, CSV literally means Comma-Separated Values... So in my opinion, it seems more logical to fall back to a comma as the default value if the sniffer fails, doesn't it?

Rylie-W pushed a commit to Rylie-W/pandas that referenced this issue May 19, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this issue Jul 8, 2023