read_html does not correctly parse table cells with commas #5029

Closed
cancan101 opened this issue Sep 29, 2013 · 15 comments · Fixed by #4770

Comments

@cancan101
Contributor

commented Sep 29, 2013

read_html finds the correct table and parses its structure (including row and header labels), but does not parse the data:

tables = pd.read_html("http://www.camacau.com/changeLang?lang=en_US&url=/statistic_list")

In [119]: tables[7]
Out[119]: 
                     0     1     2     3     4     5     6
0                  NaT  2013  2012  2011  2010  2009  2008
1  2013-01-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
2  2013-02-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
3  2013-03-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
4  2013-04-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
5  2013-05-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
6  2013-06-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
7  2013-07-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
8  2013-08-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
9  2013-09-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
10 2013-10-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
11 2013-11-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
12 2013-12-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
13                 NaT   NaN   NaN   NaN   NaN   NaN   NaN
@cpcloud

Member

commented Sep 29, 2013

For now, pass infer_types=False and manually parse the results.

Seems to be an issue with comma parsing.
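A minimal sketch of the manual cleanup, assuming the cells come back as comma-formatted strings once type inference is turned off:

```python
import pandas as pd

# Values as they appear in the raw table cells, with comma thousands
# separators (the same shape as the "3,925" cells in this issue).
raw = pd.Series(["3,925", "3,632", "23,482"])

# Strip the separators by hand and convert to integers.
clean = raw.str.replace(",", "", regex=False).astype(int)
```

Later pandas versions grew a `thousands` keyword on `read_html` (defaulting to `','`) that handles this case directly.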

@cancan101

Contributor Author

commented Sep 29, 2013

@cpcloud It also looks like type inference is doing something weird to the row headings.

Not only do the values look better with infer_types=False, but so do the row headings:

            0       1       2       3       4       5       6
0                2013    2012    2011    2010    2009    2008
1     January   3,925   3,463   3,289   3,184   3,488   4,568
2    February   3,632   2,983   2,902   3,053   3,347   4,527
3       March   3,909   3,166   3,217   3,175   3,636   4,594
4       April   3,903   3,258   3,146   3,023   3,709   4,574
5         May   4,075   3,234   3,266   3,033   3,603   4,511
6        June   4,038   3,272   3,316   2,909   3,057   4,081
7        July           3,661   3,359   3,062   3,354   4,215
8      August           3,942   3,417   3,077   3,395   4,139
9   September           3,703   3,169   3,095   3,100   3,752
10    October           3,727   3,469   3,179   3,375   3,874
11   November           3,722   3,145   3,159   3,213   3,567
12   December           3,866   3,251   3,199   3,324   3,362
13      Total  23,482  41,997  38,946  37,148  40,601  49,764
@cancan101

Contributor Author

commented Sep 29, 2013

Also the ordering of the tables seems somewhat arbitrary. Using the page above as an example, the html for tables[17] comes before tables[16].

@cancan101

Contributor Author

commented Oct 1, 2013

@cpcloud any idea about this table ordering issue?

@cpcloud

Member

commented Oct 1, 2013

Not sure what the issue is. What type of ordering are you expecting? I can't really think of a way to generally say "this table should come before this other one" other than the obvious "this one comes before this other one in the parse tree"

@cancan101

Contributor Author

commented Oct 1, 2013

Okay. Is this consistent then with the parse tree used?

@cancan101

Contributor Author

commented Oct 1, 2013

I took a look at what lxml parses, and the ordering still seems wrong:

tree_tr = tree.findall(".//tr")

# Cell from table[16]
In [153]: [i for i, y in enumerate([x.text_content() for x in tree_tr]) if "536" in y]
Out[153]: [155]

# Cell from table[17]
In [151]: [i for i, y in enumerate([x.text_content() for x in tree_tr]) if "37,148" in y]
Out[151]: [114]
@cpcloud

Member

commented Oct 1, 2013

@cancan101

You can't depend on the ordering, mostly because of invalid HTML (there might be other reasons that I can't think of right now). I'm not exactly sure how the parsers I use here "fix" invalid markup.

I don't follow how your example demonstrates that there's an issue with the order of the tables in the page. Can you be more explicit about what the expected input/output is?

@cancan101

Contributor Author

commented Oct 1, 2013

In this case I use lxml so I would imagine the page is relatively valid html.

The example shows the index of the table containing the cell I am searching for.
Pandas returns those two tables as numbers 16 and 17.

I should have searched for "table" rather than "tr". When looking at all tr in the document, the tr in table 17 comes before the tr in table 16.
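Searching over `table` elements instead of `tr` can be sketched with the standard library's ElementTree on a well-formed snippet (lxml's `tree.findall(".//table")` behaves the same way on real pages; the cell values here just mirror the ones above):

```python
import xml.etree.ElementTree as ET

# Two tables in document order, each holding one of the cells
# searched for above.
html = """<html><body>
<table><tr><td>37,148</td></tr></table>
<table><tr><td>536</td></tr></table>
</body></html>"""

tree = ET.fromstring(html)
tables = tree.findall(".//table")

def table_index(needle):
    # Index of the first table (in document order) whose cell text
    # contains the needle.
    for i, t in enumerate(tables):
        if any(needle in (td.text or "") for td in t.iter("td")):
            return i

first = table_index("37,148")  # found in the first table
second = table_index("536")    # found in the second table
```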

@cpcloud

Member

commented Oct 1, 2013

@cancan101 Few things:

  1. Never assume your HTML is valid. In fact, it would be reasonable to assume that it's invalid. google.com has invalid markup. Here is an interesting read on the validity of web pages: only 4.13% of the pages validated passed the W3C's validator.
  2. lxml doesn't behave in a sane way, for all cases, when it comes across invalid markup. For example, it will sometimes remove a node instead of trying to keep it. That is, IMHO, a bad solution. html5lib on the other hand, tries very hard to keep everything.
  3. There isn't a meaningful way to assign an order to arbitrary HTML tables. You should define one if that's what you're interested in doing, but read_html's result should essentially be treated as a set. There is an ordering but it depends on the order in which the underlying parser returns tables. That may or may not be consistent across parsers.

So, I don't see the ordering as a problem, but I'd be happy to document this.

@cancan101

Contributor Author

commented Oct 1, 2013

Okay. At the very least, the fact that the ordering is unreliable should be documented. Perhaps the return value should even be changed to a set?

This issue makes #4469 more interesting.

In the case that I do not specify a parser, is it possible to see what parser was actually used?

@cpcloud

Member

commented Oct 1, 2013

In the case that I do not specify a parser, is it possible to see what parser was actually used?

No. By default it tries to use lxml, but makes lxml use strict validation. If that raises an exception, bs4 + html5lib is tried.
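That try-strict-then-fall-back strategy can be sketched as follows (every name here is an illustrative stand-in, not pandas internals):

```python
# Hypothetical sketch of "try a strict parser first, fall back to a
# lenient one". The parser functions are made up for illustration.
def parse_with_fallback(html, parsers):
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(html)
        except ValueError as exc:  # strict parser rejected the markup
            errors.append((name, exc))
    raise ValueError(f"no parser could handle the document: {errors}")

def strict_parser(html):
    # Stand-in for lxml with strict validation: reject bad markup.
    if "</table>" not in html:
        raise ValueError("invalid markup")
    return ["parsed with strict rules"]

def lenient_parser(html):
    # Stand-in for bs4 + html5lib: accept almost anything.
    return ["parsed leniently"]

flavor, tables = parse_with_fallback(
    "<table><tr><td>1</td>",  # unclosed table: strict parsing fails
    [("lxml-strict", strict_parser), ("bs4+html5lib", lenient_parser)],
)
```

Since the caller only gets the tables back, which branch ran is invisible, which is why there is no way to see which parser was actually used.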

@cancan101

Contributor Author

commented Oct 2, 2013

Okay. It appears that lxml (with recover=False) is unable to parse that page, so I guess it falls back to an alternative.

@cpcloud

Member

commented Oct 2, 2013

Ah. I've figured it out! I convert to a set when parsing bs4 tables ... thus the different ordering. Thanks @cancan101 for pointing this out; that's actually a buglet that I'll fix.
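The buglet is easy to picture: Python's `set` does not preserve insertion order, so round-tripping the parsed tables through one can scramble document order. An order-preserving dedup (with illustrative stand-in values, not pandas code) looks like:

```python
# Tables as found in document order, with one duplicate node.
found = ["table_a", "table_b", "table_a", "table_c"]

# Deduplicate while keeping first-seen (document) order, instead of
# converting to an unordered set and back.
seen = set()
in_order = []
for tbl in found:
    if tbl not in seen:
        seen.add(tbl)
        in_order.append(tbl)
```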

@cancan101

Contributor Author

commented Oct 2, 2013

@cpcloud That is good to hear. It makes extracting a fixed table from a given page much easier.
