ENH: read-html fixes #3616

Merged
merged 1 commit into from May 20, 2013

Conversation

Projects
None yet
2 participants
Member

cpcloud commented May 15, 2013

Some updates and bug fixes. See release notes for more details.

  • vbench stuff sort of pointless right now since we don't really have control over the speed of the parsing library
  • Figure out why lxml chooses to ignore things reported a bug w/ example to lxml people
  • Figure out why bs4's thead.find_all(['th', 'td']) parses differently than lxml's thead.xpath('.//thead//th|.//thead//td') even when lxml is the bs4 backend. same as above
Contributor

jreback commented May 16, 2013

let me know when you need merging on any PR's

Contributor

jreback commented May 19, 2013

this closes #3606, right?

Member

cpcloud commented May 19, 2013

yep, that is fixed already. i might be able to get to the rest of this today, i know the 0.11.1 rls is due today...the annoyances of the parsing may have to wait tho or i might just open up the flavor argument to allow one of 'lxml', 'bs4', 'html5lib' since html5lib is a bit more WYSIWIG than the others even if it's drastically slower.

Contributor

jreback commented May 19, 2013

the main issue is the import errors.....

Member

cpcloud commented May 19, 2013

that is also fixed in this.

Member

cpcloud commented May 19, 2013

u r talking about #3605/#3607 right?

Contributor

jreback commented May 19, 2013

yep...

Contributor

jreback commented May 19, 2013

take your time...btw

Member

cpcloud commented May 19, 2013

ok thanks. i'm working on a cmdline interface to store neurophys data and it's due tmrw so pandas may have to wait...

Member

cpcloud commented May 20, 2013

@jreback @y-p what do u think about removing the pure lxml option (then options would then be lxml and html5lib which correspond to the backend of bs4 to use)? i think i was a little gung-ho about lxml being fast, but in reality the best is bs4 + html5lib. Even though it's very slow, it gives correct output where lxml does not.

Contributor

jreback commented May 20, 2013

@cpcloud I would rather see correct and slow then wrong but fast!

let's see premature optimization is evil

Can always add it back in 0.12 (or after) if you discover how to fix it. And you have the flavor option, so sort of 'easy' to add it. (course have to edit stuff to take it out...docs,install docs,docstrings...)

of course if there are cases where lxml can do better (and is correct), but bombs on other cases, then you could always raise on those (but that may be more trouble than its worth)

Member

cpcloud commented May 20, 2013

i think the xpath implementation of lxml might be broken... :(

@jreback can i leave the code for lxml/bs4 + lxml in html.py? i would just make only html5lib available until i either figure out the issue or decide that it's not worth it...

Contributor

jreback commented May 20, 2013

ok

Member

cpcloud commented May 20, 2013

@jreback this ready 2 go as soon as travis passes.

Member

cpcloud commented May 20, 2013

that is odd. travis is not running arg

Member

cpcloud commented May 20, 2013

ah there we go

@jreback jreback merged commit a8723a4 into pandas-dev:master May 20, 2013

Contributor

jreback commented May 20, 2013

@cpcloud thanks...this is great...

I edited the v0.11.1 a bit (as this is new, just announcing it). I think an example is warranted. Maybe take a df, df.to_html, them read it in ? just to give an example of how to do it (i do this with read_csv a little lower in the file)

do a separate PR

Contributor

jreback commented May 20, 2013

see this:

https://www.travis-ci.org/pydata/pandas/jobs/7320947

I don't think travis was actually testing html5lib stuff....(I just added it in)
its taken out right now

add in ci/install.sh (right after bs4)....and test

Contributor

jreback commented May 20, 2013

going to put in a separate issue

Member

cpcloud commented May 20, 2013

ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment