HTMLTableSet #60

scraperdragon · 2013-06-03T16:02:35Z

Hi, here's an HTML table set importer for messytables.

It's not fantastic yet, but it's a pretty good start.

Supports rowspan/colspan - currently by inserting blank cells.
Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.
Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.
Not enormously well tested, but seems to work on the tables I've fed it so far.
Requires lxml.
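
To illustrate the rowspan/colspan point above, here is a minimal sketch of the "insert blank cells" idea. It is not the code from this pull request: the `expand_spans` helper and its input shape (rows of `(text, rowspan, colspan)` tuples) are assumptions made for illustration, and it uses only the standard library rather than lxml.

```python
def expand_spans(rows, blank=""):
    """Expand spanned cells into a rectangular grid.

    `rows` is a list of rows; each row is a list of
    (text, rowspan, colspan) tuples (a hypothetical input shape).
    The spanned cell keeps its text in the top-left slot and every
    other slot it covers is filled with `blank`.
    """
    occupied = {}  # (row, col) -> cell value
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in occupied:
                c += 1
            # Reserve every slot the cell covers with a blank...
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied[(r + dr, c + dc)] = blank
            # ...then put the text in the top-left slot only.
            occupied[(r, c)] = text
            c += colspan
    if not occupied:
        return []
    n_rows = max(r for r, _ in occupied) + 1
    n_cols = max(c for _, c in occupied) + 1
    return [[occupied.get((r, c), blank) for c in range(n_cols)]
            for r in range(n_rows)]


grid = expand_spans([
    [("A", 2, 1), ("B", 1, 2)],
    [("C", 1, 1), ("D", 1, 1)],
])
```

Here "A" spans two rows and "B" spans two columns, so the second row starts with a blank cell under "A" and the first row ends with a blank cell after "B".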

It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.

domoritz · 2013-06-04T09:21:27Z

messytables/any.py

@@ -43,7 +43,9 @@ def any_tableset(fileobj, mimetype=None, extension=None):
    if mimetype in ('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',) \
            or (extension and extension.lower() in ('xlsx',)):
        return XLSXTableSet(fileobj)
-
+    if mimetype in ('text/html',) \


Minor, but could you wrap the expression in parentheses instead of using a backslash continuation?
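
As a rough sketch of what is being asked for: a condition split across lines can be wrapped in parentheses so that no backslash is needed. The `is_html` helper and the `('html', 'htm')` extension tuple below are assumptions for illustration (the diff above is cut off before the extension check), not the code that was actually committed.

```python
def is_html(mimetype=None, extension=None):
    """Return True if the mimetype or extension looks like HTML.

    Hypothetical helper; the extension list is an assumption.
    """
    # Parentheses let the expression continue over several lines
    # without a trailing backslash.
    return bool(
        mimetype in ('text/html',)
        or (extension and extension.lower() in ('html', 'htm'))
    )
```

The same parenthesised form would apply to the `if` statement in `any_tableset`.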

domoritz · 2013-06-04T09:29:57Z

@scraperdragon I'm really keen on getting this into messytables because I think it would be really useful. However, I want it to be stable, and I'd like to test a few real-world examples before merging this.

scraperdragon · 2013-06-04T10:00:03Z

Updated the two concrete requests.

Regarding wanting it stable, etc: That's totally understandable; will try
to get a few more examples! I'm also likely to change the behaviour of
nested tables, attempting to ignore the nested table.

Dave.
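
One possible way to ignore nested tables, sketched here with the standard library's `html.parser` rather than lxml, is to track the nesting depth of TABLE elements and only collect content at depth 1. The `TopLevelTableText` class is hypothetical and only gathers cell text; it is not the approach this pull request ended up with.

```python
from html.parser import HTMLParser


class TopLevelTableText(HTMLParser):
    """Collect text from top-level tables, skipping nested ones."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current TABLE nesting depth
        self.tables = []  # one list of text fragments per top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.tables.append([])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text at depth 2 or deeper belongs to a nested table and is ignored.
        if self.depth == 1 and data.strip():
            self.tables[-1].append(data.strip())


parser = TopLevelTableText()
parser.feed(
    "<table><tr><td>a</td><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
    "<table><tr><td>b</td></tr></table>"
)
tables = parser.tables
```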


domoritz · 2013-06-04T10:04:40Z

@scraperdragon Thanks. I'll also do some testing myself. I'll probably merge this before it's 100% stable so that more people can work on this.

frabcus · 2013-06-11T23:26:56Z

Anything you need for merging this?

domoritz · 2013-06-12T14:15:20Z

I had a think about this and think we should use beautifulsoup instead of lxml. For this kind of parsing, it is much easier to use and we could potentially avoid the insert_blank_cells.

@frabcus or @scraperdragon If you could have a go with beautifulsoup, that would be awesome.

@scraperdragon This PR needs master merged in because it is currently incompatible. I'm sorry for taking so long with the review. I tried a few files and they worked quite well, but I would like to see whether beautifulsoup makes the code easier and more readable.

rossjones · 2013-06-12T14:17:36Z

The non-lxml parsers for beautifulsoup are either very slow or not very forgiving; you should stick with lxml.

domoritz · 2013-06-12T14:27:18Z

@rossjones Thanks for the input. I have done some parsing before with beautifulsoup and was really happy but my examples were small. In this case we should stick with lxml.

@scraperdragon Can you merge master into your branch so that we can merge it? You will have to move the tests because the one test file became multiple files in the meantime.

@frabcus I don't have much time at the moment. If you could have a look at the pr and review it, that would be great. At the moment, I don't feel comfortable with merging this because I'm not confident that I have understood the code well enough.

pudo · 2013-06-12T16:31:06Z

Strong +1 on LXML over native Soup.

@domoritz is there anything that speaks against giving these gentlemen push access?

domoritz · 2013-06-12T16:35:52Z

@pudo No. These gentlemen are very welcome.

@frabcus @scraperdragon You guys now have push access to this repo.

domoritz · 2013-06-13T09:38:54Z

Hey @scraperdragon you missed the lxml requirement in setup.py ;-)

scraperdragon · 2013-06-13T12:57:08Z

Thank you!

Dave.


frabcus · 2013-06-14T09:31:29Z

Dragon - in case it is useful, I've merged your pull request with
master in scraperwiki/messytables (including moving the tests which
was the conflict).

You can probably pull that into your pull request branch and use it.

Dominik - OK, will look!


Dragon Dave added 8 commits May 29, 2013 16:36

First draft of HTML parser

578602c

tidy and give better name

ecb2ad5

BROKEN: tried to implement colspan/rowspan

2e95e45

Seems to be working

b23dd78

row/colspan working

119a8c1

post code review

f2ebe07

moved files; added to __init__

61580c3

Add HTMLTableSet to Any

9a0db2a

domoritz reviewed Jun 4, 2013

Changes requested for pull request okfn#60

99fdb22

Dragon Dave added 2 commits June 13, 2013 09:25

upstream merge

f649077

no duplicate horror

f662726

scraperdragon pushed a commit that referenced this pull request Jun 13, 2013

Merge pull request #60 from scraperdragon/htmltableset

0a94ceb

HTMLTableSet

scraperdragon merged commit 0a94ceb into okfn:master Jun 13, 2013

scraperdragon deleted the htmltableset branch June 13, 2013 08:31
