Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLTableSet #60

Merged
merged 11 commits into from Jun 13, 2013
Merged

HTMLTableSet #60

merged 11 commits into from Jun 13, 2013

Conversation

scraperdragon
Copy link

Hi, here's a HTML Table Set importer for messytables.

It's not fantastic yet; but it's a pretty good start

  • Supports rowspan/colspan - currently by inserting blank cells.
  • Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.
  • Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.
  • Not enormously well tested, but seems to work on the tables I've fed it so far.
  • Requires lxml.

It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.

@@ -43,7 +43,9 @@ def any_tableset(fileobj, mimetype=None, extension=None):
if mimetype in ('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',) \
or (extension and extension.lower() in ('xlsx',)):
return XLSXTableSet(fileobj)

if mimetype in ('text/html',) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but could you put the expression into () instead of using \?

@domoritz
Copy link
Contributor

domoritz commented Jun 4, 2013

@scraperdragon I'm really keen on getting this into messytables because it would be really useful, I guess. However, I really want it to be stable and test a few real world examples before merging this.

@scraperdragon
Copy link
Author

Updated the two concrete requests.

Regarding wanting it stable, etc: That's totally understandable; will try
to get a few more examples! I'm also likely to change the behaviour of
nested tables, attempting to ignore the nested table.

Dave.

On Tue, Jun 4, 2013 at 10:30 AM, Dominik Moritz notifications@github.comwrote:

@scraperdragon https://github.com/scraperdragon I'm really keen on
getting this into messytables because it would be really useful, I guess.
However, I really want it to be stable and test a few real world examples
before merging this.


Reply to this email directly or view it on GitHubhttps://github.com//pull/60#issuecomment-18898290
.

@domoritz
Copy link
Contributor

domoritz commented Jun 4, 2013

@scraperdragon Thanks. I'll also do some testing myself. I'll probably merge this before it's 100% stable so that more people can work on this.

@frabcus
Copy link
Contributor

frabcus commented Jun 11, 2013

Anything you need for merging this?

@domoritz
Copy link
Contributor

I had a think about this and think we should use beautifulsoup instead of lxml. For this kind of parsing, it is much easier to use and we could potentially avoid the insert_blank_cells.

@frabcus or @scraperdragon If could have a go with beautifulsoup, that would be awesome.

@scraperdragon This pr needs merging master in because it is currently incompatible. I'm sorry for taking so long with the review. I tried a few files and the worked quite well but I would like to see whether beautifulsoup makes the code easier and more readable.

@rossjones
Copy link
Contributor

The none-lxml parsers for beautifulsoup are either very slow or not very forgiving, you should stick with lxml.

@domoritz
Copy link
Contributor

@rossjones Thanks for the input. I have done some parsing before with beautifulsoup and was really happy but my examples were small. In this case we should stick with lxml.

@scraperdragon Can you merge master into your branch so that we can merge it? You will have to move the tests because the one test file became multiple files in the meantime.

@frabcus I don't have much time at the moment. If you could have a look at the pr and review it, that would be great. At the moment, I don't feel comfortable with merging this because I'm not confident that I have understood the code well enough.

@pudo
Copy link
Contributor

pudo commented Jun 12, 2013

Strong +1 on LXML over native Soup.

@domoritz is there anything that speaks against giving these gentlemen push?

@domoritz
Copy link
Contributor

@pudo No. These gentlemen are very welcome.

@frabcus @scraperdragon You guys have now push access to this repo.

scraperdragon pushed a commit that referenced this pull request Jun 13, 2013
@scraperdragon scraperdragon merged commit 0a94ceb into okfn:master Jun 13, 2013
@scraperdragon scraperdragon deleted the htmltableset branch June 13, 2013 08:31
@domoritz
Copy link
Contributor

Hey @scraperdragon you missed the lxml requirement in setup.py ;-)

@scraperdragon
Copy link
Author

Thank you!

Dave.

On Thu, Jun 13, 2013 at 10:38 AM, Dominik Moritz
notifications@github.comwrote:

Hey @scraperdragon https://github.com/scraperdragon you missed the lxml
requirement in setup.py ;-)


Reply to this email directly or view it on GitHubhttps://github.com//pull/60#issuecomment-19381646
.

@frabcus
Copy link
Contributor

frabcus commented Jun 14, 2013

Dragon - in case it is useful, I've merged your pull request with
master in scraperwiki/messytables (including moving the tests which
was the conflict).

You can probably pull that into your pull request branch and use it.

Dominik - OK, will look!

On Wed, Jun 12, 2013 at 07:27:20AM -0700, Dominik Moritz wrote:

@rossjones Thanks for the input. I have done some parsing before with beautifulsoup and was really happy but my examples were small. In this case we should stick with lxml.

@scraperdragon Can you merge master into your branch so that we can merge it? You will have to move the tests because the one test file became multiple files in the meantime.

@frabcus I don't have much time at the moment. If you could have a look at the pr and review it, that would be great. At the moment, I don't feel comfortable with merging this because I'm not confident that I have understood the code well enough.


Reply to this email directly or view it on GitHub:
#60 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants