New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: error in read_html when parsing badly-escaped HTML from an io object #17975

Closed
liamim opened this Issue Oct 25, 2017 · 10 comments

Comments

Projects
None yet
3 participants
@liamim
Contributor

liamim commented Oct 25, 2017

Code Sample, a copy-pastable example if possible

Create test.html, with the contents:

<!doctype html>
<html>
<body>
<table>
	<tr><td>poorly-escaped cell with an & oh noes</td></tr>
</table>
</body>
</html>
>>> import pandas as pd
>>> pandas.__version__
'0.20.3'
>>> f = open('./test.html')
>>> pd.read_html(f)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    pd.read_html(f)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 906, in read_html
    keep_default_na=keep_default_na)
  File "/usr/lib/python3.6/site-packages/pandas/io/html.py", line 743, in _parse
    raise_with_traceback(retained)
  File "/usr/lib/python3.6/site-packages/pandas/compat/__init__.py", line 344, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <_io.TextIOWrapper name='/home/liam/test.html' mode='r' encoding='UTF-8'>

Problem description

Pandas attempts to invoke a series of parsers on HTML documents, returning when one produces a result, and continuing to the next on error. This works fine when passing a path or entire document to read_html(), but when an IO object is passed, the subsequent parsers will be reading from a file whose read cursor is at EOF, producing an inscrutable 'no text parsed from document' error.

This can easily be fixed by rewinding the file with seek(0) before continuing to the next parser (will add PR shortly).

Expected Output

[                                       0
0  poorly-escaped cell with an & oh noes]

Output of pd.show_versions()

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: e1dabf37645f0fcabeed1d845a0ada7b32415606
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.6-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0rc1+36.ge1dabf376.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.27.2
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@liamim

This comment has been minimized.

Contributor

liamim commented Oct 25, 2017

Interestingly, when adding a test for the patch, I noticed that stuffing that test document into a StringIO seems to work? lxml still fails, but the fallback on html5lib/bs4 rewinds properly.

I'll investigate that further tomorrow.

@xflr6

This comment has been minimized.

Contributor

xflr6 commented Oct 28, 2017

Same issue: Cannot read_html() a webpage directly from an urlopen() result when lxml does not like it:

>>> from urllib.request import urlopen
>>> import pandas as pd
>>> url = 'http://en.wikipedia.org/wiki/Matplotlib'
>>> assert pd.read_html(urlopen(url), 'Advantages', 'bs4')  # works with bs4 alone
>>> pd.read_html(urlopen(url), 'Advantages')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    pd.read_html(urlopen(url), 'Advantages')
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
    keep_default_na=keep_default_na)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
    raise_with_traceback(retained)
  File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 367, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No text parsed from document: <http.client.HTTPResponse object at 0x0000000005621358>

Note that one cannot do .seek(0) on the urlopen return value (so the fix needs to be more complex).

I think lxml does something slightly different with StringIOs. So here is a self-contained test case:

>>> import pandas as pd
>>> from mock import Mock
>>> def mock_urlopen(data, url='http://spam'):
        return Mock(**{'geturl.return_value': url, 'read.side_effect': [data, '', '']})

>>> good = mock_urlopen('<table><tr><td>spam<br />eggs</td></tr></table>')
>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> assert pd.read_html(good)
>>> assert pd.read_html(bad, flavor='bs4')
>>> bad.reset_mock()
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='85948960'>
>>> bad.mock_calls
[call.geturl(),
 call.tell(),
 call.read(4000),
 call.decode('ascii', 'strict'),
 call.decode().decode('ascii', 'strict'),
 call.decode().decode().find(':'),
 call.read()]

The second .read()-call is the one where bs4 takes over and fails parsing the empty string.

@xflr6

This comment has been minimized.

Contributor

xflr6 commented Oct 28, 2017

Minimal amendment: reset_mock() does not rewind read.side_effect so here is the same with a fresh mock:

>>> bad = mock_urlopen('<table><tr><td>spam<wbr />eggs</td></tr></table>')
>>> pd.read_html(bad)
Traceback (most recent call last):
...
ValueError: No text parsed from document: <Mock id='50837656'>
>>> bad.mock_calls
[call.geturl(),
 call.tell(),
 call.read(4000),
 call.read(3952),
 call.decode('ascii', 'strict'),
 call.decode().decode('ascii', 'strict'),
 call.decode().decode().find(':'),
 call.read()]

Again, the last .read()-call is from bs4

@liamim

This comment has been minimized.

Contributor

liamim commented Oct 28, 2017

The only way to rewind a urlopen is re-requesting it or buffering it, unfortunately. This becomes a much more complex patch, then 😦

@jreback jreback added the IO HTML label Oct 28, 2017

@jreback

This comment has been minimized.

Contributor

jreback commented Oct 28, 2017

So i suppose that the try next parser should raise if we only have a filehandle (and not a path). would take that as a PR.

@jreback jreback added this to the Next Major Release milestone Oct 28, 2017

@liamim

This comment has been minimized.

Contributor

liamim commented Oct 28, 2017

We can seek for some IO handles, though. I don't see any reason not to add something like

if hasattr(io, 'seek'):
    io.seek(0)

and raise a warning if

hasattr(io, 'read') and not hasattr(io, 'seek')
@xflr6

This comment has been minimized.

Contributor

xflr6 commented Oct 28, 2017

Sounds good to me. I think @jreback means that the raise (possibly after checking for seek) should occur in the branch after the first parser fails, so it makes the current behaviour more official/transparent (give a better error message). The user can then select/try a different flavor (maybe the error message can hint at that) .

@liamim

This comment has been minimized.

Contributor

liamim commented Oct 28, 2017

ah, you're talking about ditching the fallthrough to the next parser entirely?

@xflr6

This comment has been minimized.

Contributor

xflr6 commented Oct 28, 2017

I thought for io handles (possibly only non-seekable ones). Does not occur with file names, right?

@liamim

This comment has been minimized.

Contributor

liamim commented Oct 28, 2017

@liamim liamim referenced this issue Oct 28, 2017

Merged

read_html(): rewinding [wip] #18017

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment