BUG: read_csv, sep=None, wrong separator guessing in case of one column csv #53035

Open
2 of 3 tasks
louisgeisler opened this issue May 2, 2023 · 22 comments · Fixed by #53153
Labels: Bug · IO CSV (read_csv, to_csv) · Needs Discussion (Requires discussion from core team before further action)

Comments

@louisgeisler

louisgeisler commented May 2, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv
import pandas as pd

one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""

with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

print(pd.read_csv("dummy.csv", sep=None))
# Bad detection of the separator, but no error

# According to the docs, csv.Sniffer is used in the backend,
# so I also tested it:
with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.read(5000)).delimiter)
    # Throws an error "Could not determine delimiter" (which is better)

Issue Description

When pd.read_csv is used with the parameter sep=None, it fails to automatically detect a single-column csv.

I have already read this conversation (2016): #13839
They had the same problem but came to the conclusion that the error comes from the csv sniffer Python library.

But I think this is no longer relevant, as csv.Sniffer now raises an error while pandas still insists on parsing it wrong.

Expected Behavior

It should automatically detect that it is a single-column file, or at least raise an error.

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-41-generic
Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.1
numpy : 1.23.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.1
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@louisgeisler added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 2, 2023
@DeaMariaLeon added the IO CSV (read_csv, to_csv) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 3, 2023
@DeaMariaLeon
Member

Hi @louisgeisler, you don't see this warning?

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.

So you need to change your print line to print(pd.read_csv("dummy.csv", engine='python')).
I did it with pandas 2.0.1, an older version, and the dev version. They all work.

Thank you for opening the issue but I will close it now.

@louisgeisler
Author

louisgeisler commented May 3, 2023

Hi @DeaMariaLeon, I will make the issue clearer:

When you use this:
print(pd.read_csv("dummy.csv", sep=None, engine="python"))
You get this:

            H  AD   R
0  hzhrzhjtrj NaN NaN  
1      rzthyy NaN NaN  
2      azr231 NaN NaN  
3     zrhtzrh NaN NaN  
4     zrhtzrh NaN NaN

instead of this:

       HEADER
0  hzhrzhjtrj
1      rzthyy
2      azr231
3     zrhtzrh
4     zrhtzrh

I would appreciate it if you reopened the issue...

@DeaMariaLeon reopened this May 3, 2023
@Toroi0610
Contributor

Toroi0610 commented May 7, 2023

Hi, @louisgeisler,
Perhaps the csv part of the sample code you provided does not reproduce the actual behavior of pandas.read_csv. In fact, if you look at this link to python_parser.py, it seems that only the first line is read to guess the delimiter.

import csv

with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.readline()).delimiter)
    # Outputs "E". Therefore, HEADER --> H[sep]AD[sep]R

Therefore, this seems to be a problem with the csv library, not with pandas, as was concluded in #13839.

Furthermore, I found the rules for delimiter analysis in the _guess_delimiter function of the csv library. It seems that the most frequent character is selected, which is why "E" was output this time. The following test outputs "R":

print(csv.Sniffer().sniff("HREARDER").delimiter)
# output : "R"

@asishm
Contributor

asishm commented May 7, 2023

The docs for read_csv does mention that it uses the builtin csv.Sniffer

If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer.

I would say this behaves as expected. A line stating that only the first valid line is passed on to csv.Sniffer could probably be added to the docs.

@Toroi0610
Contributor

Toroi0610 commented May 7, 2023

May I add it?

@louisgeisler
Author

louisgeisler commented May 7, 2023

For me the issue does come from pandas; why is only the first line of the csv passed to the sniffer?

From what I understand of the Sniffer example (https://docs.python.org/3/library/csv.html), they pass more than just the header.

And intuitively, it seems like common sense to me to detect the separator with more than just the header...
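
For reference, the pattern in the docs example looks roughly like this (a paraphrased sketch; "example.csv" is just a placeholder name), sampling the first 1024 bytes rather than a single line:

import csv

with open("example.csv", newline="") as csvfile:
    sample = csvfile.read(1024)  # sample well beyond the header line
    dialect = csv.Sniffer().sniff(sample)
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)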

@asishm
Contributor

asishm commented May 8, 2023

  1. sniffing by itself is more of an art than a science; and indeed the documentation does state it's based on a rough heuristic and can generate false positives and negatives.
  2. The documentation for it says to use "sample text" without any guidance of what constitutes a sample. The example only reads the first 1024 bytes. It's also possible the header line is longer than that. If your suggestion is to feed in more lines to the sample - then what is a sensible limit? one/two/the full file?

Ultimately if you are passing files with unknown delimiters, you should expect cases where the right delimiter is not detected. I'm not a pandas maintainer though - so feel free to discuss further!

@Toroi0610
Contributor

Toroi0610 commented May 8, 2023

My example conforms to lines 212-219 of python_parser.py.

I understand your question. However, the issue about how to find the separator when passing the entire csv is still dependent on the csv library.

As for why only the first line is passed, I think it is an art as @asishm commented.

@louisgeisler
Author

louisgeisler commented May 8, 2023

Oh, ok, I understand.

But in this case, why not always pass the whole csv to the sniffer?

As my example shows, if you pass the whole csv, it will raise an error.

So I think we can conclude that passing the whole csv, or let's say a sample of at least 100 lines (detected via '\n'), would really improve the reliability of read_csv when sep=None, wouldn't it?

Because I think we would agree that passing only the header to the sniffer is probably the least reliable choice, isn't it? 😅

@Toroi0610
Contributor

Passing the entire csv to the sniffer may result in an error, as in the following example:

one_col_csv = """col1, col2, col3
hzh,rzhj,trj
rzth,yy
azr23,1
zr,ht,zrh
zrhtz,rh"""

It seems that csv.Sniffer raises an error if the number of delimiters per line differs (maybe I don't understand it well enough), and that behavior is not what I would expect.

I agree with some of the claims about reliability, but on the other hand, to do that I need a way to determine from the input csv alone that it is a single-column csv, and I haven't come up with a good way to do that.
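
A quick way to check that behavior, as a sketch (reusing the one_col_csv sample defined just above):

import csv

try:
    dialect = csv.Sniffer().sniff(one_col_csv)
    print("detected delimiter:", repr(dialect.delimiter))
except csv.Error as exc:
    # With the sample above, the sniffer fails with "Could not determine delimiter"
    print("sniff failed:", exc)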

@louisgeisler
Copy link
Author

In my opinion, it's a very good thing that it raises an error if the csv isn't well formatted (for example, if the number of separators per line differs) 👍

And about one-column csv, I also think it's better to get an error stating it can't infer the delimiter.
In this case, pandas may keep the error, or show a warning and use the comma separator by default, on the hypothesis that it is a one-column file.

What do you think about it?

@asishm-wk

Passing in the entire csv is also not performant. It's also very easy to come up with a csv that would still cause the sniffer to give you the current results.

For example, the contrived sample below still outputs 'E' as the delimiter:

In [1]: s = '''HEADER
   ...: THEREWAS'''

In [2]: import csv

In [3]: csv.Sniffer().sniff(s).delimiter
Out[3]: 'E'

Hence my suggestion to either

  1. explicitly document the fact that only the first valid line of the csv is passed to csv.Sniffer to determine the delimiter, or
  2. deprecate and eventually remove sep=None as an option - but I don't know if pandas uses this internally anywhere else.

@DeaMariaLeon
Member

Another option would be to add a comment in the docs that the separator found when sep=None might not be correct.

Pinging @phofl for visibility

@DeaMariaLeon added the Needs Discussion (Requires discussion from core team before further action) label May 9, 2023
@phofl
Member

phofl commented May 9, 2023

Not a fan of the sniffing.

That said, exploring passing the first header + 1 lines might be interesting. We should definitely clarify the docs.
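
A rough sketch of that idea (illustrative only, not the actual pandas internals): build the sniffing sample from the header plus the first data line instead of the header alone.

import csv
from itertools import islice

def guess_sep_from_first_lines(path, n_lines=2):
    # Hypothetical helper: sniff the header plus the next line(s)
    # rather than only the header.
    with open(path, newline="") as fh:
        sample = "".join(islice(fh, n_lines))
    return csv.Sniffer().sniff(sample).delimiter

Whether this helps or hurts for single-column files like the original example is exactly what would need exploring.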

@DeaMariaLeon
Member

May I add it?

@Toroi0610 do you still want to improve the docs?

@Toroi0610
Contributor

@DeaMariaLeon yes, I do.

@Toroi0610
Contributor

take

@louisgeisler
Author

I think the best compromise may be something like passing the first 30 lines of the csv file to the sniffer.
Something like this:

import csv
import pandas as pd


one_col_csv = """HEADER
hzhrzhjtrj
rzthyy
azr231
zrhtzrh
zrhtzrh"""


with open("dummy.csv", "w") as file:
    file.write(one_col_csv)

def read_csv(path: str, sep: str, **kwargs) -> pd.DataFrame:
    
    N_LINE_FOR_SNIFFER = 30  # We take the first 30 lines to sniff
    
    if sep is None:
        with open(path) as file:
            line_samples = "\n".join(file.readlines(N_LINE_FOR_SNIFFER))
            try:
                sep = csv.Sniffer().sniff(line_samples).delimiter
            except Exception as e:
                if str(e)=="Could not determine delimiter":
                    sep = "," # Presume it is a single column file, so define a random separator
                else:
                    raise Exception(e)
            finally:
                return pd.read_csv(path, sep=sep, **kwargs)
    else:
        return pd.read_csv(path, sep=sep, **kwargs)
    
#read_csv("dummy.csv", sep=None)

print(pd.read_csv("dummy.csv", sep=None))

It would hugely increase the reliability of the sniffer, while still being fast since we don't read the whole file...
What do you think about this alternative?

(I really think that sep=None is a very convenient feature that should be improved instead of removed...)

mroeschke pushed a commit that referenced this issue May 9, 2023
docs: #53035 clarify the behavior of sep=None.
@phofl
Member

phofl commented May 9, 2023

Reopening to keep track of the sniffer exploration

@phofl reopened this May 9, 2023
@louisgeisler
Author

I discovered this beautiful library, "CleverCSV", that may solve our (my) problem: https://github.com/alan-turing-institute/CleverCSV

It may be interesting to integrate it into pandas directly, or to use it as a backend to read csv instead of the built-in csv library...
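
If I read the CleverCSV README correctly, its Sniffer is meant as a drop-in replacement for csv.Sniffer, so wiring it in manually would look roughly like this (untested sketch; the fallback to "," when nothing is detected is my own assumption):

# pip install clevercsv
import clevercsv
import pandas as pd

with open("dummy.csv", newline="") as fh:
    dialect = clevercsv.Sniffer().sniff(fh.read())

sep = getattr(dialect, "delimiter", "") or ","  # assumption: fall back to "," if no delimiter is found
df = pd.read_csv("dummy.csv", sep=sep, engine="python")
print(df)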

@Toroi0610
Contributor

Toroi0610 commented May 13, 2023

Hi @louisgeisler, I am also beginning to feel that it is important to be able to choose to have no separator in the csv (i.e. read it as a single-column csv) when using read_csv. Let me comment on your code.

file.readlines(N_LINE_FOR_SNIFFER)

This part of the code reads roughly N_LINE_FOR_SNIFFER bytes (the argument is a size hint), not N_LINE_FOR_SNIFFER lines of the file.
Please check https://docs.python.org/3/library/codecs.html?highlight=readlines#codecs.StreamReader.read.

How about the following, for example?

def get_first_n_lines(path, n_lines : int):
    first_n_lines = []
    with open(path, 'r') as file:
        for _ in range(n_lines):
            line = file.readline()
            if not line:
                break
            first_n_lines.append(line)
    return first_n_lines

Also, how about ',\t; :' as a random separator? This is a result of ''.join(csv.Sniffer().preferred). I think that "," alone would be more likely to be included in the csv.

@louisgeisler
Author

louisgeisler commented May 13, 2023

You're totally right about readlines 👍

However, I would disagree with you about the default separator to use. I mean, CSV literally means Comma-Separated Values... So in my opinion, it seems more logical to fall back to a comma as the default value if the sniffer fails, doesn't it?

Rylie-W pushed a commit to Rylie-W/pandas that referenced this issue May 19, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this issue Jul 8, 2023