BUG: read_table cannot handle multi-character separators in memory_map mode #34577

emptyVoid · 2020-06-04T16:45:40Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample

import pandas
dataframe = pandas.read_table('file.txt',
                              header=None,
                              sep=' - ',
                              names=['key', 'value'],
                              memory_map=True)

Data Sample

key1 - value1
key2 - value2
key3 - value3

Problem description

I'm getting an exception:

Exception has occurred: TypeError
cannot use a string pattern on a bytes-like object

from this line:

pandas/pandas/io/parsers.py

Line 2472 in 14eda58

yield pat.split(line.strip())

since f.readline() returns a byte-string.

And if I change separator to sep=b' - ', a similar exception:

Exception has occurred: TypeError
cannot use a bytes pattern on a string-like object

gets raised from the following line:

pandas/pandas/io/parsers.py

Line 2475 in 14eda58

yield pat.split(line.strip())

since for line in f: yields normal strings.
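Both exceptions can be reproduced without pandas, using only the `re` module. This is a minimal sketch, assuming (as described above) that the first line reaches the pattern as bytes and later lines reach it as str. The sample lines and variable names are illustrative only.

```python
import re

line_bytes = b"key1 - value1\n"   # what f.readline() is reported to return
line_str = "key2 - value2\n"      # what `for line in f` is reported to yield
errors = []

# A str pattern cannot be applied to a bytes line.
try:
    re.compile(" - ").split(line_bytes.strip())
except TypeError as exc:
    errors.append(str(exc))

# A bytes pattern cannot be applied to a str line.
try:
    re.compile(b" - ").split(line_str.strip())
except TypeError as exc:
    errors.append(str(exc))

# When pattern and line have the same type, the split succeeds.
fields = re.compile(" - ").split(line_str.strip())

for message in errors:
    print(message)
print(fields)
```

So neither separator type can work as long as the two read paths hand back different types.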

Expected Output

No exception should be raised with either sep=' - ' or sep=b' - '.
I'm not sure which of the two is correct, although sep=' - ' works without issues when memory_map=True is not set.

Workaround

Changing lines

pandas/pandas/io/parsers.py

Lines 2474 to 2475 in 14eda58

for line in f:
    yield pat.split(line.strip())

to

while True:
    line = f.readline()
    if not line:
        break
    yield pat.split(line.strip())

fixes read_table for byte-string separators (e.g. sep=b' - ').
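The effect of the loop change can be sketched in isolation. `MixedReader` below is a hypothetical stand-in, not pandas's actual class. It only assumes the behaviour reported above: `readline()` returns bytes while iteration yields str. `split_with_readline` is likewise an illustrative name.

```python
import io
import re


class MixedReader:
    """Hypothetical stand-in: readline() gives bytes, iteration gives str."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def readline(self) -> bytes:
        return self._buf.readline()

    def __iter__(self):
        return self

    def __next__(self) -> str:
        line = self._buf.readline()
        if not line:
            raise StopIteration
        return line.decode("utf-8")


def split_with_readline(f, sep: bytes):
    """Read every line through readline() so all lines are bytes."""
    pat = re.compile(sep)
    while True:
        line = f.readline()
        if not line:
            break
        yield pat.split(line.strip())


reader = MixedReader(b"key1 - value1\nkey2 - value2\n")
rows = list(split_with_readline(reader, b" - "))
print(rows)
```

Note that in this sketch the resulting fields are bytes rather than str, so a fix that decodes each line and keeps a str separator may be preferable.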

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


emptyVoid added the Bug and Needs Triage labels on Jun 4, 2020

fujiaxiang added the IO CSV label and removed the Needs Triage label on Jun 5, 2020
