HTMLTableSet #60
@@ -43,7 +43,9 @@ def any_tableset(fileobj, mimetype=None, extension=None):
    if mimetype in ('application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',) \
            or (extension and extension.lower() in ('xlsx',)):
        return XLSXTableSet(fileobj)

    if mimetype in ('text/html',) \
Minor, but could you put the expression into () instead of using \ ?
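A minimal sketch of what that request could look like, assuming the new branch mirrors the XLSX check above. The helper name `is_html` and the extension tuple `('html', 'htm')` are hypothetical and not taken from the PR; only the `'text/html'` mimetype check appears in the diff.

```python
def is_html(mimetype=None, extension=None):
    """Hypothetical helper: decide whether the input looks like HTML."""
    # Wrapping the whole condition in parentheses lets it span several
    # lines without a trailing backslash.
    return bool(
        mimetype in ('text/html',)
        # Assumed extension list; the PR's actual tuple is not shown.
        or (extension and extension.lower() in ('html', 'htm'))
    )
```

Inside `any_tableset`, the same parenthesised expression would sit directly in the `if`, so a stray space after a backslash can no longer break the line continuation.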
@scraperdragon I'm really keen on getting this into messytables because it would be really useful, I guess. However, I really want it to be stable, and I want to test it on a few real-world examples before merging this.
Updated the two concrete requests. Regarding wanting it stable, etc.: that's totally understandable; will try.
Dave.
On Tue, Jun 4, 2013 at 10:30 AM, Dominik Moritz notifications@github.com wrote:
@scraperdragon Thanks. I'll also do some testing myself. I'll probably merge this before it's 100% stable so that more people can work on this.
Anything you need for merging this?
I had a think about this and I believe we should use beautifulsoup instead of lxml. For this kind of parsing, it is much easier to use and we could potentially avoid the
@frabcus or @scraperdragon If you could have a go with beautifulsoup, that would be awesome.
@scraperdragon This PR needs master merged in because it is currently incompatible.
I'm sorry for taking so long with the review. I tried a few files and they worked quite well, but I would like to see whether beautifulsoup makes the code easier and more readable.
The non-lxml parsers for beautifulsoup are either very slow or not very forgiving; you should stick with lxml.
@rossjones Thanks for the input. I have done some parsing before with beautifulsoup and was really happy, but my examples were small. In this case we should stick with lxml.
@scraperdragon Can you merge master into your branch so that we can merge it? You will have to move the tests because the one test file became multiple files in the meantime.
@frabcus I don't have much time at the moment. If you could have a look at the PR and review it, that would be great. At the moment, I don't feel comfortable with merging this because I'm not confident that I have understood the code well enough.
Strong +1 on LXML over native Soup. @domoritz is there anything that speaks against giving these gentlemen push access?
@pudo No. These gentlemen are very welcome. @frabcus @scraperdragon You guys now have push access to this repo.
Hey @scraperdragon you missed the lxml requirement in
Thank you!
Dave.
On Thu, Jun 13, 2013 at 10:38 AM, Dominik Moritz
Dragon - in case it is useful, I've merged your pull request with
You can probably pull that into your pull request branch and use it.
Dominik - OK, will look!
On Wed, Jun 12, 2013 at 07:27:20AM -0700, Dominik Moritz wrote:
Hi, here's an HTML Table Set importer for messytables.
It's not fantastic yet, but it's a pretty good start.
It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.