Crash on read_html(url, flavor="bs4") if table has only one column #9178

Closed
boarpig opened this issue Dec 31, 2014 · 5 comments
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap
Milestone

Comments

@boarpig

boarpig commented Dec 31, 2014

I was trying to read a package tracking table from the Finnish post office's website and got:

Traceback (most recent call last):
  File "./posti.py", line 69, in <module>
    dfs = read_html(html)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 851, in read_html
    parse_dates, tupleize_cols, thousands, attrs, encoding)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 721, in _parse
    infer_types, parse_dates, tupleize_cols, thousands))
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 609, in _data_to_frame
    _expand_elements(body)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 586, in _expand_elements
    lens = Series(lmap(len, body))
  File "/usr/lib/python3.4/site-packages/pandas/compat/__init__.py", line 87, in lmap
    return list(map(*args, **kwargs))
TypeError: len() of unsized object

I isolated the offending table into this script:

https://gist.github.com/boarpig/de4044f4188fac700c68

The problem seems to be related to the _parse_raw_thead function:

def _parse_raw_thead(self, table):
    thead = self._parse_thead(table)
    res = []
    if thead:
        res = lmap(self._text_getter, self._parse_th(thead[0]))
    return np.array(res).squeeze() if res and len(res) == 1 else res

Here res contains ['Tapahtumat'], which np.array(res).squeeze() turns into a zero-dimensional array:

array('Tapahtumat', dtype='<U10')

This then produces the error above, because len() cannot be called on a zero-dimensional array.
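
The behaviour described above can be checked outside pandas. The following is a minimal sketch that uses only NumPy; the variable names are illustrative and not taken from the pandas source.

```python
import numpy as np

# One header cell, as in the report.
res = ['Tapahtumat']

# Squeezing a length-1 array removes its only axis, leaving a 0-d array.
squeezed = np.array(res).squeeze()
print(squeezed.shape)  # empty tuple: the array has no dimensions

error_message = None
try:
    len(squeezed)
except TypeError as exc:
    # A 0-d array has no length, so len() raises TypeError.
    error_message = str(exc)
    print(error_message)
```

The same thing happens for any single-element list, so the header text itself does not matter.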

@boarpig
Author

boarpig commented Jan 2, 2015

I've been investigating this problem and have narrowed it down. Basically, if you use BeautifulSoup4 as the backend and the table header has only one column, _parse_raw_thead will cause the aforementioned error. In my case it happened because the table body contained several elements with the same id, which caused lxml to raise an error and read_html to fall back to bs4.

I wonder why the np.array(res) call exists at all. Below is the simplest valid table that causes the error when calling pandas.read_html(html, flavor="bs4"):

<table>
    <thead>
        <tr>
            <th>Header</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>first</td>
        </tr>
    </tbody>
</table>
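
For convenience, here is a sketch of a script that feeds the table above to read_html. It assumes pandas, beautifulsoup4 and html5lib are installed; the names MINIMAL_TABLE and reproduce are made up for this example. On an affected pandas version the call should raise the TypeError from the traceback; versions that include a fix would presumably return a list of DataFrames instead.

```python
import io

# The minimal one-column table from the comment above.
MINIMAL_TABLE = """
<table>
    <thead>
        <tr>
            <th>Header</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>first</td>
        </tr>
    </tbody>
</table>
"""


def reproduce(html=MINIMAL_TABLE):
    """Parse the table with the bs4 flavor (hypothetical helper)."""
    # Imported lazily so the constant above can be used without pandas.
    import pandas as pd

    # Wrapping the markup in StringIO passes it as a file-like object.
    return pd.read_html(io.StringIO(html), flavor="bs4")


if __name__ == "__main__":
    try:
        print(reproduce())
    except (ImportError, TypeError) as exc:
        # ImportError: a parser dependency is missing.
        # TypeError: the bug described in this issue.
        print(type(exc).__name__, exc)
```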

@boarpig boarpig changed the title from "len() of unsized object" while using read_html(url) to Crash on read_html(url, flavor="bs4") if table has only one column Jan 2, 2015
@jreback jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label Jan 2, 2015
@jreback
Contributor

jreback commented Jan 2, 2015

looks like a bug. care to do a pull-request?

@jreback jreback added the Bug label Jan 2, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
@boarpig
Author

boarpig commented Jan 3, 2015

I would, but I'm not familiar enough with the code to fix this. The author obviously had a reason to write

return np.array(res).squeeze() if res and len(res) == 1 else res

instead of simply returning res. Perhaps the assumption was that if len(res) is 1 it must be a nested list like [['first', 'second']] and you want ['first', 'second']. Perhaps someone with more insight can help with this. Using np.array().squeeze() seems like an odd way to flatten a list.
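
To see what that return expression does with each kind of input, it can be wrapped in a small stand-alone function. This is only a sketch: original_return is a hypothetical name, and it says nothing about what the original author intended.

```python
import numpy as np


def original_return(res):
    # Same expression as the return line quoted above.
    return np.array(res).squeeze() if res and len(res) == 1 else res


# A nested single-row list is flattened to a 1-d array of two strings.
nested = original_return([['first', 'second']])

# A flat single-element list collapses to a 0-d array (the crash case).
single = original_return(['Header'])

# Lists with more than one element are returned unchanged.
multi = original_return(['a', 'b'])

print(nested.shape, single.shape, multi)
```

So the expression does flatten the nested case, but it cannot tell that case apart from a header that really has just one cell.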

@cpcloud
Member

cpcloud commented Jan 3, 2015

this could be fixed by adding np.atleast_1d after the squeeze call
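
A rough sketch of that suggestion, written as a stand-alone function rather than as a patch to pandas. The name squeeze_header is hypothetical, and this is not necessarily the change that was eventually merged.

```python
import numpy as np


def squeeze_header(res):
    """Squeeze a single-row header but never return a 0-d array."""
    if res and len(res) == 1:
        # atleast_1d turns a 0-d result back into a length-1 array
        # and leaves arrays with one or more dimensions untouched.
        return np.atleast_1d(np.array(res).squeeze())
    return res


one_column = squeeze_header(['Header'])
nested_row = squeeze_header([['first', 'second']])

print(len(one_column))   # len() works again for the one-column case
print(len(nested_row))   # the nested case is still flattened
```

With this guard the result always has a length, so a later len() call on each header no longer fails for one-column tables.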

@boarpig
Author

boarpig commented Jan 3, 2015

Here's a pull request to fix this
#9194

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 25, 2016

3 participants