Add support for UL, OL, PRE, other non-P elements #42

Open
migurski opened this issue Sep 2, 2013 · 3 comments

Comments

migurski commented Sep 2, 2013

I’m testing Goose and finding that elements other than paragraphs are unavailable in cleaned_text or top_node.

What I did:

from goose import Goose

url = 'https://github.com/grangier/python-goose'
g = Goose()
a = g.extract(url=url)  # then inspect a.cleaned_text and a.top_node

I expected to find the single-item list with “Xavier Grangier” and all the code samples in the output, but they were not there. I would be interested to see an additional property in the output, something like source_node that made the non-cleaned element tree of the original content available.

migurski commented Sep 4, 2013

“Q” is another tag that gets dropped in the HTML cleaning process, which should be kept.

@psilva261

@migurski Right, all tags get dropped, so the only formatting that survives is the '\n' line breaks.

Another option would be some attribute cleaned_html that only contains basic html.
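
To make the cleaned_html idea concrete, here is a minimal sketch using only the Python standard library. It is not Goose code: the names `BASIC_TAGS`, `BasicHTMLCleaner`, and the `cleaned_html` function are hypothetical, and the choice of which tags count as "basic" is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist of "basic" tags to keep; not part of Goose.
BASIC_TAGS = {"p", "ul", "ol", "li", "pre", "q", "blockquote"}
# Elements whose content is dropped entirely.
SKIPPED_TAGS = {"script", "style"}


class BasicHTMLCleaner(HTMLParser):
    """Keep whitelisted tags (without attributes) and all visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BASIC_TAGS and not self.skip_depth:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BASIC_TAGS and not self.skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text outside script/style is kept and re-escaped.
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def cleaned_html(raw_html):
    """Return raw_html reduced to whitelisted tags plus text."""
    parser = BasicHTMLCleaner()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)
```

The whitelist approach keeps list and code structure while still discarding layout wrappers and attributes, which is roughly what "only contains basic html" suggests.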

migurski commented Sep 4, 2013

It’s not just the tags, it’s also the content. In the example above, the content of the lists is not included in the cleaned text or in the top_node. It’s still in the raw_doc but not unambiguously findable based on xpath or whatever.
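
As a rough illustration of the content in question, the sketch below collects the text of UL, OL, PRE and Q elements from a raw HTML string using only the standard library. It does not touch Goose internals: `NON_P_TAGS`, `NonParagraphCollector`, and `find_non_paragraph_blocks` are hypothetical names, and how the raw markup is obtained from an extracted article is left open.

```python
from html.parser import HTMLParser

# Tags named in this thread as being dropped; the exact set is an assumption.
NON_P_TAGS = {"ul", "ol", "pre", "q"}


class NonParagraphCollector(HTMLParser):
    """Collect the text content of each element in NON_P_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []   # (tag, text) pairs, appended as each element closes
        self._open = []   # currently open tracked elements: [tag, [chunks]]

    def handle_starttag(self, tag, attrs):
        if tag in NON_P_TAGS:
            self._open.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every tracked element that is currently open.
        for entry in self._open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            name, chunks = self._open.pop()
            self.found.append((name, "".join(chunks).strip()))


def find_non_paragraph_blocks(raw_html):
    """Return (tag, text) pairs for the tracked elements in raw_html."""
    collector = NonParagraphCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.found
```

Comparing this output with cleaned_text for the same page would show exactly which list items and code samples go missing.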

@grangier grangier closed this as completed May 8, 2014
@grangier grangier reopened this May 10, 2014
grangier pushed a commit that referenced this issue May 10, 2014