Create a built in HTML parser #37

Open
5 tasks
philss opened this issue Oct 29, 2015 · 16 comments
Comments

Owner

@philss philss commented Oct 29, 2015

Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This will allow more flexibility and better control over the parsing step.

The parser goals are:

  • support HTML5;
  • support HTML snippets;
  • be able to parse large files, like 15MB;
  • easy to traverse;
  • be somewhat tolerant of errors, such as missing closing tags.
@philss philss added the Feature label Oct 29, 2015
@philss philss added this to the 1.0 milestone Oct 29, 2015
Owner Author

@philss philss commented Dec 9, 2015

Here is a test case with an example of an error that Floki does not handle today: henrik/sipper@49a4c09

Thanks @henrik for the example!

Contributor

@gmile gmile commented Jun 7, 2016

@philss creating an html parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as this one https://github.com/google/gumbo-parser?

Owner Author

@philss philss commented Jun 9, 2016

@gmile yeah, I thought about that, but what I want is to not depend on an external dependency.
This came from a bit of frustration with the Nokogiri Ruby gem. It uses libxml2 and FFI to make the bridge. It failed to compile for me so many times that I didn't like the experience.

But, this is not discarded. I also think Servo's HTML parser is a good option.

Contributor

@gmile gmile commented Jun 9, 2016

But, this is not discarded

@philss that said, are you specifically looking at Servo's HTML implementation? Otherwise, I could play with a gumbo-parser integration and see how it goes.

Owner Author

@philss philss commented Jun 9, 2016

@gmile I'm not looking into this right now. So, please go for it. 👍


@baron baron commented Jul 12, 2016

I was wondering what the expected behavior of a native HTML parser would be. Right now mochiweb_html.parse always returns empty lists in either the middle or the end (depending on what level of nesting the HTML has). I'm not sure if this is a bug or a feature, but it was confusing when I first started using the library because I was hoping for some kind of "to_hash"-like function, as in Ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement function recreate this behavior for backwards compatibility or break the api?

BTW, thanks for the awesome library!
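
For readers with the same question: the empty lists are the attribute lists of each `{tag, attributes, children}` tuple. A "to_hash"-style view can be layered on top of that shape. The following is only a minimal sketch; `TreeToMapSketch` and `to_map/1` are hypothetical names, not part of Floki or mochiweb, and doctype nodes are not handled.

```elixir
defmodule TreeToMapSketch do
  # Hypothetical helper (an assumption for illustration, not a Floki API).
  # Converts {tag, attributes, children} tuples into nested maps.

  # Element node: attributes arrive as a list of {name, value} pairs.
  def to_map({tag, attrs, children}) when is_binary(tag) and is_list(children) do
    %{tag: tag, attributes: Map.new(attrs), children: Enum.map(children, &to_map/1)}
  end

  # Comment node.
  def to_map({:comment, text}), do: %{comment: text}

  # Text node: already a plain binary, returned unchanged.
  def to_map(text) when is_binary(text), do: text
end

tree = {"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
IO.inspect(TreeToMapSketch.to_map(tree))
```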


@Eiji7 Eiji7 commented Dec 20, 2016

It would be awesome to have something like this:

%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking of something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{:comment, "content"}
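
To make the comparison concrete, here is a rough sketch of how the tuple form could be mapped onto structs like the ones proposed. The `LeafSketch` modules are hypothetical stand-ins for the suggested `Floki.Leaf.*` names (which do not exist in Floki), and the optional `events` and `styles` fields are left out.

```elixir
defmodule LeafSketch.Text do
  defstruct content: ""
end

defmodule LeafSketch.Comment do
  defstruct content: ""
end

defmodule LeafSketch.Node do
  defstruct name: nil, attributes: [], children: []
end

defmodule LeafSketch do
  # Hypothetical converter (an assumption for illustration, not a Floki API).
  alias LeafSketch.{Comment, Node, Text}

  # Element tuple -> Node struct, converting children recursively.
  def from_tuple({name, attributes, children}) when is_binary(name) do
    %Node{name: name, attributes: attributes, children: Enum.map(children, &from_tuple/1)}
  end

  # Comment tuple -> Comment struct.
  def from_tuple({:comment, content}), do: %Comment{content: content}

  # Bare binary -> Text struct.
  def from_tuple(text) when is_binary(text), do: %Text{content: text}
end

IO.inspect(LeafSketch.from_tuple({"p", [], ["content", {:comment, "content"}]}))
```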

I was also thinking about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

  • support all CSS3 (CSS4?) selectors
  • support XPath
  • log warnings when parsing + add option to raise on warning
  • add option to strip blank text node (default false)
  • add option to strip comment content (default true)
  • use Stream when possible
  • tag names and attribute names are always lower case like: "my-custom-tag" and "my-custom-data"
  • support encoding detection
  • allow validation only
  • support fetching parent(s) and sibling(s) from leaf struct ...
  • debug logs - for example: "missing title", "missing favicon" ...

Optional features:

  • method to collect styles for node (with priority, source file, line ...)
  • method to collect events for node
  • extra jQuery selectors, see docs
  • CSS validator with warnings/errors
<div style='fontt-color: white;'></div>
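
A few of the items above (stripping blank text nodes, stripping comments, lower-casing tag and attribute names) can be sketched as a post-processing pass over the existing tuple tree. This is an illustration only: `TreeCleanupSketch` and its option names are hypothetical, and "strip comment content" is read here as dropping comment nodes entirely.

```elixir
defmodule TreeCleanupSketch do
  # Hypothetical options named after the feature list; not a Floki API.
  @defaults [strip_blank_text: false, strip_comments: true]

  # Cleans a list of nodes in the {tag, attributes, children} tuple shape.
  def clean(nodes, opts \\ []) when is_list(nodes) do
    opts = Keyword.merge(@defaults, opts)
    Enum.flat_map(nodes, &clean_node(&1, opts))
  end

  # Comment nodes are dropped when :strip_comments is set.
  defp clean_node({:comment, _} = comment, opts) do
    if opts[:strip_comments], do: [], else: [comment]
  end

  # Whitespace-only text nodes are dropped when :strip_blank_text is set.
  defp clean_node(text, opts) when is_binary(text) do
    if opts[:strip_blank_text] && String.trim(text) == "", do: [], else: [text]
  end

  # Elements get lower-cased tag and attribute names; children are cleaned recursively.
  defp clean_node({name, attrs, children}, opts) do
    attrs = Enum.map(attrs, fn {key, value} -> {String.downcase(key), value} end)
    [{String.downcase(name), attrs, Enum.flat_map(children, &clean_node(&1, opts))}]
  end
end
```

Calling `TreeCleanupSketch.clean(tree, strip_blank_text: true)` on a parsed tree would then remove both comments and whitespace-only text nodes.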


@mhsjlw mhsjlw commented Jan 14, 2017

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath paths for you to specifically grab tags which would save me a lot of pattern matching...

As for html5ever, check out https://github.com/hansihe/Rustler

Owner Author

@philss philss commented Mar 14, 2017

@mhsjlw I agree. Please follow this issue for more details: #94 (sorry for the delay 😅 ).

Owner Author

@philss philss commented Mar 14, 2017

@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser
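
In short, the linked section amounts to adding the html5ever package as a dependency and pointing Floki at its adapter. A minimal sketch of the configuration line, assuming a standard Mix project; the dependency entry and its version are deliberately left out, so check the README for the current ones:

```elixir
# config/config.exs
# Assumes the html5ever package is already listed in the deps of mix.exs.
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```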

Contributor

@gmile gmile commented Mar 14, 2017

@philss wow, that's awesome! Thanks!


@liveresume liveresume commented Mar 21, 2017

Rust NIFs anyone?

https://github.com/servo/html5ever

;)


@mhsjlw mhsjlw commented Mar 22, 2017

@liveresume this was mentioned, twice, see #37 (comment) and #37 (comment)


@f34nk f34nk commented Feb 21, 2018

Please have a look at:
https://github.com/Overbryd/myhtmlex

Based on Alexander Borisov’s myhtml, this binding gains the properties of being html-spec compliant and very fast. https://github.com/lexborisov/myhtml

@Overbryd gave a talk about it in Berlin
I would love to see this coming together!


@Overbryd Overbryd commented Feb 21, 2018

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex that can be used to provide a safe execution (c-)node to give the best in performance/safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.

Owner Author

@philss philss commented Feb 22, 2018

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could for sure write an adapter like we did for the html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or if this is needed for the adapter. We can elaborate more ideas on that.

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?
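
As a starting point for that discussion, the adapter idea can be sketched as a behaviour with interchangeable backends, so that a NIF-based and a C-Node-based parser could sit behind the same call. Everything below is hypothetical: `ParserAdapterSketch` and its toy `Echo` backend are illustrative names, not Floki's or myhtmlex's actual modules.

```elixir
defmodule ParserAdapterSketch do
  # Hypothetical behaviour: every parser backend exposes the same parse/1.
  @callback parse(html :: binary()) :: {:ok, list()} | {:error, term()}

  # The backend is chosen per call; it defaults to the toy backend below.
  def parse(html, backend \\ ParserAdapterSketch.Echo) do
    backend.parse(html)
  end
end

defmodule ParserAdapterSketch.Echo do
  @behaviour ParserAdapterSketch

  # Toy backend: treats the whole input as a single text node.
  @impl true
  def parse(html) when is_binary(html), do: {:ok, [html]}
end

IO.inspect(ParserAdapterSketch.parse("<p>hi</p>"))
```

A real NIF or C-Node backend would implement the same callback, which keeps the crash-safety trade-off a per-application choice.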


philss added a commit that referenced this issue Jun 12, 2021
This is part of a bigger effort to write a compliant HTML parser in
Elixir.

The implementation follows the WHATWG specification, which is the living
standard of HTML, but parts of the tokenizer are still missing, such as
the handling of parse errors and some states. Those missing parts are not
essential for most documents.

You can see details about the HTML specification here:
https://html.spec.whatwg.org/multipage/

This commit contains a lot of files. The most important one is
`lib/floki/html/tokenizer.ex`. We added a lot of test files that were
generated according to html5lib-tests - a project that aims to provide
test cases based on WHATWG specs.
See: https://github.com/html5lib/html5lib-tests

This tokenizer was written based on the specs as seen around September
2019. Most of the parser development progress is being tracked at
https://github.com/philss/floki/projects/2

For now it will remain "private" and no other module is using it.

This is related to #37 :)
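
To give a feel for what a state-machine tokenizer looks like in Elixir, here is a toy sketch. It is not taken from the commit and covers only a tiny subset: text, start tags and end tags. Attributes, comments, doctypes, character references and parse errors are all ignored, and `TokenizerSketch` is a hypothetical name.

```elixir
defmodule TokenizerSketch do
  # Toy subset of a state-machine tokenizer (an illustration, not Floki's code).
  # Each private function plays the role of one tokenizer state.

  def tokenize(html) when is_binary(html), do: data(html, "", [])

  # Data state: "</" followed by a letter starts an end tag.
  defp data(<<"</", c, rest::binary>>, text, tokens) when c in ?a..?z or c in ?A..?Z,
    do: tag_name(<<c, rest::binary>>, :end_tag, "", flush(text, tokens))

  # Data state: "<" followed by a letter starts a start tag.
  defp data(<<"<", c, rest::binary>>, text, tokens) when c in ?a..?z or c in ?A..?Z,
    do: tag_name(<<c, rest::binary>>, :start_tag, "", flush(text, tokens))

  # Data state: any other byte is accumulated as text.
  defp data(<<c, rest::binary>>, text, tokens),
    do: data(rest, <<text::binary, c>>, tokens)

  # End of input: emit pending text and return tokens in document order.
  defp data("", text, tokens), do: Enum.reverse(flush(text, tokens))

  # Tag name state: ">" ends the tag; names are lower-cased.
  defp tag_name(<<">", rest::binary>>, kind, name, tokens),
    do: data(rest, "", [{kind, String.downcase(name)} | tokens])

  # Tag name state: everything up to ">" counts as the name (no attribute support).
  defp tag_name(<<c, rest::binary>>, kind, name, tokens),
    do: tag_name(rest, kind, <<name::binary, c>>, tokens)

  # End of input inside a tag: the unfinished tag is dropped (a simplification).
  defp tag_name("", _kind, _name, tokens), do: Enum.reverse(tokens)

  # Emits a text token only when some text was accumulated.
  defp flush("", tokens), do: tokens
  defp flush(text, tokens), do: [{:text, text} | tokens]
end

IO.inspect(TokenizerSketch.tokenize("<ul><li>fooo</li></ul>"))
```

The specification defines many more states than this; the sketch only shows how binary pattern matching and one function per state fit that model.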
8 participants