Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
initial import
0 parents
commit 5cd0217
Showing 110 changed files with 30,004 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
/.bundle/ | ||
/.yardoc | ||
/_yardoc/ | ||
/coverage/ | ||
/doc/ | ||
/pkg/ | ||
/spec/reports/ | ||
/tmp/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
--- | ||
language: ruby | ||
cache: bundler | ||
rvm: | ||
- 2.7.0 | ||
before_install: gem install bundler -v 2.1.2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
source 'https://rubygems.org' | ||
|
||
# Specify your gem's dependencies in gammo.gemspec | ||
gemspec | ||
|
||
gem 'yard' | ||
gem 'rake', '~> 12.0' | ||
gem 'test-unit', '~> 3.3.5' | ||
gem 'erubi' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
PATH | ||
  remote: . | ||
  specs: | ||
    gammo (0.1.0) | ||
|
||
GEM | ||
  remote: https://rubygems.org/ | ||
  specs: | ||
    erubi (1.9.0) | ||
    power_assert (1.1.5) | ||
    rake (12.3.3) | ||
    test-unit (3.3.5) | ||
      power_assert | ||
    yard (0.9.20) | ||
|
||
PLATFORMS | ||
  ruby | ||
|
||
DEPENDENCIES | ||
  erubi | ||
  gammo! | ||
  rake (~> 12.0) | ||
  test-unit (~> 3.3.5) | ||
  yard | ||
|
||
BUNDLED WITH | ||
   2.0.2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
The MIT License (MIT) | ||
|
||
Copyright (c) 2020 namusyaka | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,175 @@ | ||
# Gammo - A pure-Ruby HTML5 parser | ||
|
||
Gammo is an implementation of the HTML5 parsing algorithm that conforms to [the WHATWG specification](https://html.spec.whatwg.org/multipage/parsing.html), without any dependencies. Given an HTML string, Gammo parses it and builds a DOM tree based on the tokenization and tree-construction algorithms defined in the WHATWG parsing algorithm. | ||
|
||
The name Gammo is inspired by [Gumbo](https://github.com/google/gumbo-parser). Gammo itself, though, is a fried tofu fritter made with vegetables. | ||
|
||
```ruby | ||
require 'gammo' | ||
require 'open-uri' | ||
|
||
parser = Gammo.new(URI.open('https://google.com').read) | ||
parser.parse #=> #<Gammo::Node::Document> | ||
``` | ||
|
||
## Overview | ||
|
||
### Features | ||
|
||
- [Tokenization](#tokenization): Gammo has a tokenizer that implements [the tokenization algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tokenization). | ||
- [Parsing](#parsing): Gammo provides a parser that implements the parsing algorithm on top of the above tokenization and [the tree-construction algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction). | ||
- [Node](#node): Gammo provides nodes that partially implement [the WHATWG DOM specification](https://dom.spec.whatwg.org/). | ||
- [Performance](#performance): Gammo does not prioritize performance; the Performance section describes which workloads it suits. | ||
|
||
## Tokenization | ||
|
||
`Gammo::Tokenizer` implements the tokenization algorithm defined by WHATWG. You can get tokens in order by calling `Gammo::Tokenizer#next_token`. | ||
|
||
Here is a simple example that runs only the tokenizer. | ||
|
||
```ruby | ||
require 'gammo' | ||
|
||
def dump_for(token) | ||
  puts "data: #{token.data}, class: #{token.class}" | ||
end | ||
|
||
tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>') | ||
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken | ||
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken | ||
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken | ||
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken | ||
``` | ||
|
||
The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm to the tokens in the order the tokenizer emits them. | ||
|
||
### Token types | ||
|
||
Each token generated by the tokenizer belongs to one of the following types: | ||
|
||
<table> | ||
<thead> | ||
<tr> | ||
<th>Token type</th> | ||
<th>Description</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::ErrorToken</code></td> | ||
<td>Represents an error token; it usually means end-of-string.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::TextToken</code></td> | ||
<td>Represents a text token like "foo", which is the inner text of an element.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::StartTagToken</code></td> | ||
<td>Represents a start tag token like <code>&lt;a&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::EndTagToken</code></td> | ||
<td>Represents an end tag token like <code>&lt;/a&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::SelfClosingTagToken</code></td> | ||
<td>Represents a self-closing tag token like <code>&lt;img /&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::CommentToken</code></td> | ||
<td>Represents a comment token like <code>&lt;!-- comment --&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Tokenizer::DoctypeToken</code></td> | ||
<td>Represents a doctype token like <code>&lt;!doctype html&gt;</code>.</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
|
||
## Parsing | ||
|
||
`Gammo::Parser` implements processing in [the tree-construction stage](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction) based on the tokenization described above. | ||
|
||
After a successful parse, the parser exposes the root document through its `document` accessor (this is the same object as the return value of `Gammo::Parser#parse`). From the `document` accessor, you can traverse the DOM tree constructed by the parser. | ||
|
||
```ruby | ||
require 'gammo' | ||
require 'pp' | ||
|
||
document = Gammo.new('<!doctype html><input type="button">').parse | ||
|
||
# Appends node (as a hash) to strm, nesting its children under :children. | ||
def dump_for(node, strm) | ||
  strm << node.to_h | ||
  child = node.first_child | ||
  while child | ||
    dump_for(child, (strm.last[:children] ||= [])) | ||
    child = child.next_sibling | ||
  end | ||
  strm # always return the stream, even for a node without children | ||
end | ||
|
||
pp dump_for(document, []) | ||
``` | ||
|
||
### Notes | ||
|
||
Currently, it's not possible to traverse the DOM tree with CSS selectors or XPath, as [Nokogiri](https://nokogiri.org/) allows. | ||
However, Gammo plans to implement these features in the future. | ||
|
||
## Node | ||
|
||
Each node generated by the parser belongs to one of the following types: | ||
|
||
<table> | ||
<thead> | ||
<tr> | ||
<th>Node type</th> | ||
<th>Description</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td><code>Gammo::Node::Error</code></td> | ||
<td>Represents an error node; it usually means end-of-string.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Node::Text</code></td> | ||
<td>Represents a text node like "foo", which is the inner text of an element.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Node::Document</code></td> | ||
<td>Represents the root document type. It's always returned by <code>Gammo::Parser#document</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Node::Element</code></td> | ||
<td>Represents any HTML element, like <code>&lt;p&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Node::Comment</code></td> | ||
<td>Represents a comment like <code>&lt;!-- foo --&gt;</code>.</td> | ||
</tr> | ||
<tr> | ||
<td><code>Gammo::Node::Doctype</code></td> | ||
<td>Represents a doctype like <code>&lt;!doctype html&gt;</code>.</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
|
||
Some nodes, such as `Gammo::Node::Element` and `Gammo::Node::Document`, hold pointers to related nodes, exposed through accessors such as `Gammo::Node#next_sibling` and `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child`, which perform operations defined in the DOM Living Standard, are also provided. | ||
|
||
## Performance | ||
|
||
As mentioned in the features at the beginning, Gammo doesn't prioritize performance. | ||
Thus Gammo is not suitable for very performance-sensitive applications (e.g. parsing synchronously while handling an incoming request from an end user). | ||
Instead, the goal is to work well with batch processing such as crawlers. | ||
Gammo places the highest priority on making HTML easy to parse, without depending on native extensions or external gems. | ||
|
||
## References | ||
|
||
Gammo was developed with reference to the following projects. | ||
|
||
- [x/net/html](https://godoc.org/golang.org/x/net/html): I've been working on this package, and it gave me a strong reason to make this happen. | ||
- [Blink](https://www.chromium.org/blink): Blink left a great impression on me regarding tree construction. | ||
- [html5lib-tests](https://github.com/html5lib/html5lib-tests): Gammo relies on this test suite. | ||
|
||
## License | ||
|
||
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT). |
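The README's Node section describes nodes that point at each other through `first_child` and `next_sibling`. The following is a minimal sketch of that pointer layout, assuming a stand-in `SketchNode` class and a `names_in_order` helper; both are hypothetical names for illustration, not Gammo's actual classes or API.

```ruby
# Hypothetical stand-in for a DOM node; this is NOT Gammo's Node class.
class SketchNode
  attr_reader :name, :first_child, :last_child, :next_sibling, :parent

  def initialize(name)
    @name = name
  end

  # Appends a child and wires up the parent and sibling pointers.
  def append_child(child)
    child.parent = self
    if @last_child
      @last_child.next_sibling = child
    else
      @first_child = child
    end
    @last_child = child
    child
  end

  # Yields each direct child by following next_sibling pointers.
  def each_child
    return enum_for(:each_child) unless block_given?

    child = first_child
    while child
      yield child
      child = child.next_sibling
    end
  end

  protected

  attr_writer :next_sibling, :parent
end

# Depth-first walk that collects node names in document order.
def names_in_order(node, acc = [])
  acc << node.name
  node.each_child { |child| names_in_order(child, acc) }
  acc
end

html = SketchNode.new('html')
html.append_child(SketchNode.new('head'))
body = html.append_child(SketchNode.new('body'))
body.append_child(SketchNode.new('input'))

p names_in_order(html) # prints ["html", "head", "body", "input"]
```

The walk has the same shape as the `dump_for` example in the README: visit a node, then follow `first_child` once and `next_sibling` repeatedly.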
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
require "bundler/gem_tasks" | ||
require "rake/testtask" | ||
require 'yaml' | ||
require 'erubi' | ||
|
||
Rake::TestTask.new(:test) do |t| | ||
  t.libs << "test" | ||
  t.libs << "lib" | ||
  t.test_files = FileList["test/**/*_test.rb"] | ||
end | ||
|
||
task default: :test | ||
|
||
def camelize(str) | ||
  str.sub(/^[a-z\d]*/) { $&.capitalize }.sub(/\-[a-z]*/) { $&.slice(1..-1).capitalize } | ||
end | ||
|
||
task :generate do | ||
  data = YAML.load(File.read('misc/html.yaml'), symbolize_names: true) | ||
  @tags = data.each_value.inject(:+).uniq | ||
  table = eval(Erubi::Engine.new(File.read('misc/table.erubi')).src, binding) | ||
  File.write('lib/gammo/tags/table.rb', table) | ||
end |
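The `:generate` task flattens the parsed YAML into a unique tag list, and the Rakefile also defines a `camelize` helper. As a rough illustration of what those two steps produce, here is a sketch that reuses `camelize` verbatim and runs the same flattening expression on a made-up hash. `SAMPLE_DATA` and `unique_tags` are hypothetical; the real contents of `misc/html.yaml` are not shown in this diff.

```ruby
# Copied from the Rakefile above.
def camelize(str)
  str.sub(/^[a-z\d]*/) { $&.capitalize }.sub(/\-[a-z]*/) { $&.slice(1..-1).capitalize }
end

# Hypothetical sample standing in for the parsed misc/html.yaml;
# the real file's keys and values are assumptions here.
SAMPLE_DATA = {
  html: %w[a div h1],
  svg: %w[a font-face]
}.freeze

# Same expression as the :generate task: concatenate every list, drop duplicates.
def unique_tags(data)
  data.each_value.inject(:+).uniq
end

p unique_tags(SAMPLE_DATA)
# prints ["a", "div", "h1", "font-face"]
p unique_tags(SAMPLE_DATA).map { |tag| camelize(tag) }
# prints ["A", "Div", "H1", "FontFace"]
```

Because `camelize` uses `sub` rather than `gsub`, only the first hyphenated segment is folded in: `camelize('a-b-c')` yields `"AB-c"`. That is fine as long as no tag name contains more than one hyphen.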
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
require_relative 'lib/gammo/version' | ||
|
||
Gem::Specification.new do |spec| | ||
  spec.name = "gammo" | ||
  spec.version = Gammo::VERSION | ||
  spec.authors = ["namusyaka"] | ||
  spec.email = ["namusyaka@gmail.com"] | ||
|
||
  spec.summary = %q{An HTML parser which implements the WHATWG parsing algorithm.} | ||
  spec.description = %q{Gammo is an implementation of the HTML5 parsing algorithm, written in pure Ruby, which conforms to the WHATWG specification.} | ||
  spec.homepage = "https://github.com/namusyaka/gammo" | ||
  spec.license = "MIT" | ||
  spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0") | ||
|
||
  spec.metadata["homepage_uri"] = spec.homepage | ||
  spec.metadata["source_code_uri"] = "https://github.com/namusyaka/gammo" | ||
  spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do | ||
    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) } | ||
  end | ||
  spec.bindir = "exe" | ||
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) } | ||
  spec.require_paths = ["lib"] | ||
end |
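The gemspec derives `spec.files` and `spec.executables` from the output of `git ls-files -z`. To make the filtering concrete, here is a small sketch that applies the same `split`/`reject`/`grep` expressions to a made-up file list. `SAMPLE_LS_FILES`, `packaged_files` and `executables` are hypothetical names, and the sample paths are assumptions rather than the repository's actual contents.

```ruby
# Hypothetical stand-in for the NUL-separated output of `git ls-files -z`.
SAMPLE_LS_FILES = %w[
  README.md
  lib/gammo.rb
  exe/gammo
  test/parser_test.rb
  spec/foo_spec.rb
].join("\0").freeze

# Same filter as the gemspec: drop anything under test/, spec/ or features/.
def packaged_files(ls_files_output)
  ls_files_output.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
end

# Same discovery as the gemspec: basenames of the files under exe/.
def executables(files)
  files.grep(%r{^exe/}) { |f| File.basename(f) }
end

files = packaged_files(SAMPLE_LS_FILES)
p files              # prints ["README.md", "lib/gammo.rb", "exe/gammo"]
p executables(files) # prints ["gammo"]
```

The pattern is anchored at the start of the path and requires a trailing slash, so a directory such as `tests/` or a file such as `lib/spec_helper.rb` would still be packaged.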
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
require "gammo/version" | ||
require "gammo/parser" | ||
require "gammo/fragment_parser" | ||
|
||
module Gammo | ||
  # Constructs a parser based on the input. | ||
  # | ||
  # @param [String] input | ||
  # @param [Boolean] fragment whether to build a FragmentParser instead of a Parser | ||
  # @param [Hash] options | ||
  # @return [Gammo::Parser] | ||
  def self.new(input, fragment: false, **options) | ||
    (fragment ? FragmentParser : Parser).new(input, **options) | ||
  end | ||
end |
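`Gammo.new` is a small factory: the `fragment` keyword picks the class and every other keyword is forwarded to its constructor. The sketch below reproduces that dispatch with stand-in classes inside a hypothetical `GammoSketch` module. The real `Gammo::Parser` and `Gammo::FragmentParser` are not shown in this diff, so their constructors, the subclass relationship and the `debug:` option used here are assumptions.

```ruby
# Stand-in module for illustration only; not the real Gammo implementation.
module GammoSketch
  class Parser
    attr_reader :input, :options

    def initialize(input, **options)
      @input = input
      @options = options
    end
  end

  # Assumption: modeled as a subclass so both share one constructor.
  class FragmentParser < Parser; end

  # Mirrors Gammo.new: `fragment` selects the class, and the remaining
  # keywords are forwarded untouched.
  def self.new(input, fragment: false, **options)
    (fragment ? FragmentParser : Parser).new(input, **options)
  end
end

p GammoSketch.new('<p>hi</p>').class                   # prints GammoSketch::Parser
p GammoSketch.new('<li>hi</li>', fragment: true).class # prints GammoSketch::FragmentParser

# `fragment` is consumed by the factory; `debug` (a hypothetical option) passes through.
parser = GammoSketch.new('<p>hi</p>', fragment: false, debug: true)
p parser.options.keys # prints [:debug]
```

Keeping the choice behind one entry point lets callers switch between document and fragment parsing with a keyword instead of naming a class.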
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
module Gammo | ||
  # Class for representing an attribute. | ||
  class Attribute | ||
    attr_accessor :key, :value, :namespace | ||
|
||
    # Constructs an attribute with the key-value pair. | ||
    # @param [String] key | ||
    # @param [String] value | ||
    # @param [String] namespace | ||
    # @return [Attribute] | ||
    def initialize(key:, value:, namespace: nil) | ||
      @key = key | ||
      @value = value | ||
      @namespace = namespace | ||
    end | ||
  end | ||
end |
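`Attribute` takes `key:` and `value:` as required keywords and `namespace:` as an optional one. A short usage sketch follows, with a local copy of the class so it runs without the gem. The `attribute_value` lookup helper and the sample attribute values are hypothetical and not part of Gammo.

```ruby
# Local copy of the class from the diff above, so the sketch is self-contained.
class Attribute
  attr_accessor :key, :value, :namespace

  def initialize(key:, value:, namespace: nil)
    @key = key
    @value = value
    @namespace = namespace
  end
end

# Hypothetical helper (not part of Gammo): value of the first attribute
# stored under `key`, or nil when there is none.
def attribute_value(attributes, key)
  found = attributes.find { |attribute| attribute.key == key }
  found && found.value
end

attrs = [
  Attribute.new(key: 'type', value: 'button'),
  Attribute.new(key: 'href', value: '#top', namespace: 'xlink')
]

p attribute_value(attrs, 'type')  # prints "button"
p attribute_value(attrs, 'class') # prints nil
p attrs.last.namespace            # prints "xlink"
```

Since `key:` and `value:` have no defaults, omitting either raises `ArgumentError` at construction time.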