Create a built in HTML parser #37

Open
philss opened this issue Oct 29, 2015 · 16 comments

8 participants
@philss
Owner

commented Oct 29, 2015

Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This would give us more flexibility and better control over the parsing step.

The parser goals are:

  • support HTML5;
  • support HTML snippets;
  • be able to parse large files (around 15MB, for example);
  • easy to traverse;
  • be somewhat tolerant of errors, such as missing closing tags.

@philss philss added the Feature label Oct 29, 2015

@philss philss added this to the 1.0 milestone Oct 29, 2015

@philss

Owner Author

commented Dec 9, 2015

Here is a test case with an example of an error that Floki does not handle today: henrik/sipper@49a4c09

Thanks @henrik for the example!

@gmile

Contributor

commented Jun 7, 2016

@philss creating an HTML parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as gumbo-parser (https://github.com/google/gumbo-parser)?

@philss

Owner Author

commented Jun 9, 2016

@gmile yeah, I thought about that, but what I want is to avoid depending on an external library.
This came from a bit of frustration with the Nokogiri Ruby gem. It uses libxml2 and FFI to make the bridge. It failed to compile for me so many times that I didn't like the experience.

But, this is not discarded. I also think Servo's HTML parser is a good option.

@gmile

Contributor

commented Jun 9, 2016

But, this is not discarded

@philss that said, are you specifically looking forward to Servo's HTML implementation? Otherwise, I could play with a gumbo-parser integration and see how it goes.

@philss

Owner Author

commented Jun 9, 2016

@gmile I'm not looking into this right now. So, please go for it. 👍

@baron


commented Jul 12, 2016

I was wondering what the expected behavior of a native HTML parser would be. Right now :mochiweb_html.parse always returns empty lists in either the middle or the end of each tuple (depending on the level of nesting in the HTML). I'm not sure if this is a bug or a feature, but it was confusing when I first started using the library because I was hoping for some kind of "to_hash"-like function, as in Ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement function recreate this behavior for backwards compatibility, or would it break the API?
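
For readers wondering about the shape of that output: each element is a `{tag, attributes, children}` tuple, so an empty list in the middle means "no attributes" and an empty list at the end means "no children". A minimal sketch of a "to_hash"-like conversion follows. The module `TreeToMap` and its function `to_map/1` are hypothetical names used only for illustration; they are not part of Floki or mochiweb.

```elixir
defmodule TreeToMap do
  # Hypothetical helper (an assumption, not a Floki or mochiweb API).
  # Converts a {tag, attributes, children} tuple tree into nested maps.

  # Element node: the guards keep other 3-tuples from matching by accident.
  def to_map({tag, attrs, children})
      when is_binary(tag) and is_list(attrs) and is_list(children) do
    %{
      tag: tag,
      # Attributes arrive as a list of {name, value} pairs.
      attrs: Map.new(attrs),
      children: Enum.map(children, &to_map/1)
    }
  end

  # Comment node.
  def to_map({:comment, text}), do: %{comment: text}

  # Text nodes (and anything unrecognized) are returned unchanged.
  def to_map(other), do: other
end

tree = {"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
IO.inspect(TreeToMap.to_map(tree))
```

The tuple form is compact and pattern-matches well; a map form like this one trades that for named fields.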

BTW, thanks for the awesome library!

@Eiji7


commented Dec 20, 2016

It would be awesome to have something like this:

%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking of something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{:comment, "content"}
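
To make the comparison concrete, here is a rough sketch of how such structs could be defined and built from the current tuple form. Everything under the `Sketch` namespace is hypothetical; the struct and field names simply mirror the proposal above and do not exist in Floki.

```elixir
defmodule Sketch.Leaf do
  # Hypothetical structs mirroring the proposal; not part of Floki.
  defmodule Comment do
    defstruct content: ""
  end

  defmodule TextNode do
    defstruct content: ""
  end

  defmodule Node do
    # events and styles are optional in the proposal, so they default to [].
    defstruct name: nil, attributes: [], children: [], events: [], styles: []
  end

  # Build structs from the tuple representation used today.
  def from_tuple({:comment, content}), do: %Comment{content: content}

  def from_tuple(text) when is_binary(text), do: %TextNode{content: text}

  def from_tuple({name, attributes, children}) when is_binary(name) do
    %Node{
      name: name,
      attributes: attributes,
      children: Enum.map(children, &from_tuple/1)
    }
  end
end

IO.inspect(Sketch.Leaf.from_tuple({"p", [], ["content", {:comment, "note"}]}))
```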

I was also thinking about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

  • support all CSS3 (CSS4?) selectors
  • support XPath
  • log warnings when parsing + add option to raise on warning
  • add option to strip blank text node (default false)
  • add option to strip comment content (default true)
  • use Stream when possible
  • tag names and attribute names are always lower case like: "my-custom-tag" and "my-custom-data"
  • support encoding detection
  • allow a validation-only mode
  • support fetching parent(s) and sibling(s) from leaf struct ...
  • debug logs - for example: "missing title", "missing favicon" ...
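
A few of the options in this list can be sketched as a post-processing pass over the tuple tree. The module `Sketch.Normalize` and the option names `:strip_blank_text` and `:strip_comments` are hypothetical; in particular, treating "strip comment content" as "drop comment nodes" is an assumption about what the bullet means.

```elixir
defmodule Sketch.Normalize do
  # Hypothetical normalization pass; not a Floki API.
  # Options (assumed names, defaults taken from the list above):
  #   :strip_blank_text - drop whitespace-only text nodes (default false)
  #   :strip_comments   - drop comment nodes (default true)
  def normalize(nodes, opts \\ []) when is_list(nodes) do
    Enum.flat_map(nodes, &normalize_node(&1, opts))
  end

  # Text node: optionally dropped when it contains only whitespace.
  defp normalize_node(text, opts) when is_binary(text) do
    if Keyword.get(opts, :strip_blank_text, false) and String.trim(text) == "" do
      []
    else
      [text]
    end
  end

  # Comment node: dropped by default.
  defp normalize_node({:comment, _content} = comment, opts) do
    if Keyword.get(opts, :strip_comments, true), do: [], else: [comment]
  end

  # Element node: tag and attribute names are lower-cased, children recurse.
  defp normalize_node({name, attrs, children}, opts) when is_binary(name) do
    attrs = Enum.map(attrs, fn {key, value} -> {String.downcase(key), value} end)
    [{String.downcase(name), attrs, normalize(children, opts)}]
  end

  # Anything else is passed through untouched.
  defp normalize_node(other, _opts), do: [other]
end

tree = [{"UL", [{"Class", "menu"}], ["\n", {"LI", [], ["a"]}, {:comment, "c"}]}]
IO.inspect(Sketch.Normalize.normalize(tree, strip_blank_text: true))
```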

Optional features:

  • method to collect styles for node (with priority, source file, line ...)
  • method to collect events for node
  • extra jQuery selectors, see docs
  • CSS validator with warnings/errors, for example for an invalid property such as:
<div style='fontt-color: white;'></div>

@mhsjlw


commented Jan 14, 2017

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath expressions that grab specific tags, which would save me a lot of pattern matching...

As for html5ever, check out https://github.com/hansihe/Rustler

@philss

Owner Author

commented Mar 14, 2017

@mhsjlw I agree. Please follow this issue for more details: #94 (sorry for the delay 😅 ).

@philss

Owner Author

commented Mar 14, 2017

@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser
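
As a rough idea of what those instructions involve: the html5ever package is added as a dependency and Floki is told to use it through application config. The fragment below is a sketch from memory of the README of that period, so treat the config key and module name as assumptions and check the linked instructions for the exact names and versions.

```elixir
# config/config.exs
# Assumption: key and module name as documented in the Floki README at the time;
# verify against the linked instructions before relying on this.
use Mix.Config

config :floki, :html_parser, Floki.HTMLParser.Html5ever
```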

@gmile

Contributor

commented Mar 14, 2017

@philss wow, that's awesome! Thanks!

@liveresume


commented Mar 21, 2017

Rust NIFs anyone?

https://github.com/servo/html5ever

;)

@mhsjlw


commented Mar 22, 2017

@liveresume this was mentioned, twice, see #37 (comment) and #37 (comment)

@f34nk


commented Feb 21, 2018

Please have a look at:
https://github.com/Overbryd/myhtmlex

Because it is based on Alexander Borisov’s myhtml, this binding is HTML-spec compliant and very fast. https://github.com/lexborisov/myhtml

@Overbryd gave a talk about it in Berlin
I would love to see this coming together!

@Overbryd


commented Feb 21, 2018

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex, which can be used to provide a safe execution (C-)node that gives the best of both performance and safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.

@philss

Owner Author

commented Feb 22, 2018

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could for sure write an adapter like we did for the html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or if this is even needed for the adapter. We can work out more ideas on that.
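
A minimal sketch of what such an adapter layer could look like, assuming a behaviour with a single `parse/1` callback and an adapter chosen per call or through application config. All names here (`Sketch.HTMLParser`, the `:sketch` application, the `:adapter` option, and the stub adapter) are hypothetical and are not Floki's actual implementation.

```elixir
defmodule Sketch.HTMLParser do
  # Hypothetical adapter contract; names are assumptions, not Floki's API.
  @callback parse(html :: binary()) :: list()

  # The adapter is taken from the :adapter option, then from application
  # config, and finally falls back to the stub defined below.
  def parse(html, opts \\ []) when is_binary(html) do
    default = Application.get_env(:sketch, :html_parser, Sketch.HTMLParser.Stub)
    adapter = Keyword.get(opts, :adapter, default)
    adapter.parse(html)
  end
end

defmodule Sketch.HTMLParser.Stub do
  # Stand-in adapter: returns the input as a single text node.
  # A real adapter would delegate to a NIF or to a C-Node here.
  @behaviour Sketch.HTMLParser

  @impl true
  def parse(html) when is_binary(html), do: [html]
end

IO.inspect(Sketch.HTMLParser.parse("<p>hi</p>"))
```

With this shape, a NIF-backed adapter and a C-Node-backed adapter could be two separate modules implementing the same callback, so the choice between them stays a configuration detail.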

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?
