initial import
namusyaka committed Feb 11, 2020
0 parents commit 5cd0217
Showing 110 changed files with 30,004 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
/.bundle/
/.yardoc
/_yardoc/
/coverage/
/doc/
/pkg/
/spec/reports/
/tmp/
6 changes: 6 additions & 0 deletions .travis.yml
@@ -0,0 +1,6 @@
---
language: ruby
cache: bundler
rvm:
- 2.7.0
before_install: gem install bundler -v 2.1.2
9 changes: 9 additions & 0 deletions Gemfile
@@ -0,0 +1,9 @@
source 'https://rubygems.org'

# Specify your gem's dependencies in gammo.gemspec
gemspec

gem 'yard'
gem 'rake', '~> 12.0'
gem 'test-unit', '~> 3.3.5'
gem 'erubi'
27 changes: 27 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,27 @@
PATH
  remote: .
  specs:
    gammo (0.1.0)

GEM
  remote: https://rubygems.org/
  specs:
    erubi (1.9.0)
    power_assert (1.1.5)
    rake (12.3.3)
    test-unit (3.3.5)
      power_assert
    yard (0.9.20)

PLATFORMS
  ruby

DEPENDENCIES
  erubi
  gammo!
  rake (~> 12.0)
  test-unit (~> 3.3.5)
  yard

BUNDLED WITH
   2.0.2
21 changes: 21 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2020 namusyaka

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
175 changes: 175 additions & 0 deletions README.md
@@ -0,0 +1,175 @@
# Gammo - A pure-Ruby HTML5 parser

Gammo is an implementation of the HTML5 parsing algorithm that conforms to [the WHATWG specification](https://html.spec.whatwg.org/multipage/parsing.html), without any dependencies. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined in the WHATWG parsing specification.

The name Gammo is inspired by [Gumbo](https://github.com/google/gumbo-parser). Gammo itself, however, is a fried tofu fritter made with vegetables.

```ruby
require 'gammo'
require 'open-uri'

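# Gammo.new expects a String, so read the response body first.
# URI.open needs Ruby 2.5+; on older Rubies use Kernel#open instead.
parser = Gammo.new(URI.open('https://google.com').read)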
parser.parse #=> #<Gammo::Node::Document>
```

## Overview

### Features

- [Tokenization](#tokenization): Gammo has a tokenizer that implements [the tokenization algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tokenization).
- [Parsing](#parsing): Gammo provides a parser that implements the parsing algorithm on top of the tokenizer above and [the tree-construction algorithm](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction).
- [Node](#node): Gammo provides nodes that partially implement the [WHATWG DOM specification](https://dom.spec.whatwg.org/).
- [Performance](#performance): Gammo does not prioritize performance, and there are a few performance caveats to be aware of.

## Tokenization

`Gammo::Tokenizer` implements the tokenization algorithm defined by WHATWG. You can get tokens in order by calling `Gammo::Tokenizer#next_token`.

Here is a simple example that runs only the tokenizer.

```ruby
def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken
```

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm, in order, to the tokens that the tokenizer produces.

### Token types

Each token generated by the tokenizer is one of the following types:

<table>
<thead>
<tr>
<th>Token type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Gammo::Tokenizer::ErrorToken</code></td>
<td>Represents an error token; it usually means end-of-string.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::TextToken</code></td>
<td>Represents a text token such as "foo", the inner text of an element.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::StartTagToken</code></td>
<td>Represents a start tag token like <code>&lt;a&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::EndTagToken</code></td>
<td>Represents an end tag token like <code>&lt;/a&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::SelfClosingTagToken</code></td>
<td>Represents a self-closing tag token like <code>&lt;img /&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::CommentToken</code></td>
<td>Represents a comment token like <code>&lt;!-- comment --&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Tokenizer::DoctypeToken</code></td>
<td>Represents a doctype token like <code>&lt;!doctype html&gt;</code>.</td>
</tr>
</tbody>
</table>
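
Because `Gammo::Tokenizer::ErrorToken` usually marks end-of-string, a loop that drains a tokenizer can stop when it sees one. The following is a minimal sketch rather than part of Gammo: `collect_tokens`, `StubTokenizer`, `StubToken`, and `StubErrorToken` are hypothetical names, introduced only so that the snippet runs without the gem installed. It assumes nothing beyond a tokenizer that responds to `next_token`.

```ruby
# Hypothetical stand-ins so the sketch runs without Gammo installed.
StubToken      = Struct.new(:data)
StubErrorToken = Class.new(StubToken)

class StubTokenizer
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns queued tokens in order, then an error token once they run out.
  def next_token
    @tokens.shift || StubErrorToken.new('end of string')
  end
end

# Calls next_token repeatedly and collects tokens until an instance of
# stop_class (the end-of-input marker) appears.
def collect_tokens(tokenizer, stop_class)
  tokens = []
  loop do
    token = tokenizer.next_token
    break if token.is_a?(stop_class)
    tokens << token
  end
  tokens
end

stub = StubTokenizer.new([StubToken.new('html'), StubToken.new('input')])
collected = collect_tokens(stub, StubErrorToken)
puts collected.map(&:data).join(', ')
```

With Gammo itself, the equivalent call would presumably be `collect_tokens(Gammo::Tokenizer.new(html), Gammo::Tokenizer::ErrorToken)`. Note that the loop stops on any error token, not only on end-of-string.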

## Parsing

`Gammo::Parser` implements processing in [the tree-construction stage](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction) based on the tokenization described above.

After a successful parse, the parser exposes the root document through its `document` accessor (this is the same object that `Gammo::Parser#parse` returns). From the `document` accessor, you can traverse the DOM tree constructed by the parser.

```ruby
require 'gammo'
require 'pp'

document = Gammo.new('<!doctype html><input type="button">').parse

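# Appends node.to_h to strm, then recurses into the node's children,
# nesting each child's hash under the parent's :children key.
def dump_for(node, strm)
  strm << node.to_h
  return strm unless (child = node.first_child)
  while child
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end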

pp dump_for(document, [])
```

### Notes

Currently, it's not possible to traverse the DOM tree with CSS selectors or XPath as you can with [Nokogiri](https://nokogiri.org/).
However, Gammo plans to implement these features in the future.
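
Until then, the tree can be walked by hand through `first_child` and `next_sibling`. The sketch below is an illustration under stated assumptions, not a Gammo API: `each_node` is a hypothetical helper, and `StubNode` is a stand-in struct so the snippet runs without the gem. It visits nodes depth-first in document order.

```ruby
# Yields node and every descendant in document order (depth-first, pre-order).
# Relies only on #first_child and #next_sibling.
def each_node(node, &block)
  yield node
  child = node.first_child
  while child
    each_node(child, &block)
    child = child.next_sibling
  end
end

# Hypothetical stand-in for parsed nodes; real nodes would come from a parser.
StubNode = Struct.new(:name, :first_child, :next_sibling)

input = StubNode.new('input')
body  = StubNode.new('body', input)
head  = StubNode.new('head', nil, body)
html  = StubNode.new('html', head)

names = []
each_node(html) { |node| names << node.name }
puts names.inspect
```

Passing a parsed document instead of `html` should work the same way, assuming its nodes expose the two accessors used here; filtering inside the block (for example by node class) then stands in for a simple selector.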

## Node

Each node generated by the parser is one of the following types:

<table>
<thead>
<tr>
<th>Node type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Gammo::Node::Error</code></td>
<td>Represents an error node; it usually means end-of-string.</td>
</tr>
<tr>
<td><code>Gammo::Node::Text</code></td>
<td>Represents a text node such as "foo", the inner text of an element.</td>
</tr>
<tr>
<td><code>Gammo::Node::Document</code></td>
<td>Represents the root document type. It's always returned by <code>Gammo::Parser#document</code>.</td>
</tr>
<tr>
<td><code>Gammo::Node::Element</code></td>
<td>Represents any HTML element, such as <code>&lt;p&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Node::Comment</code></td>
<td>Represents a comment like <code>&lt;!-- foo --&gt;</code>.</td>
</tr>
<tr>
<td><code>Gammo::Node::Doctype</code></td>
<td>Represents a doctype like <code>&lt;!doctype html&gt;</code>.</td>
</tr>
</tbody>
</table>

Some nodes, such as `Gammo::Node::Element` and `Gammo::Node::Document`, hold pointers to related nodes, which you can reach through accessors such as `Gammo::Node#next_sibling` or `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child`, which perform operations defined in the DOM Living Standard, are also provided.
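
To make the pointer bookkeeping concrete, here is a toy node class that keeps parent, child, and sibling links consistent across `append_child` and `remove_child`. This is a sketch and not Gammo's implementation: `SketchNode` and its `parent`, `last_child`, and `previous_sibling` accessors are assumptions made for illustration, and raising on an already-attached child is a choice of this sketch (the DOM Living Standard would move the node instead).

```ruby
# Toy node, NOT Gammo's implementation: it only illustrates how
# parent/child/sibling pointers stay consistent.
class SketchNode
  attr_reader :name
  attr_accessor :parent, :first_child, :last_child,
                :previous_sibling, :next_sibling

  def initialize(name)
    @name = name
  end

  # Links child in as the new last child and returns it.
  def append_child(child)
    raise ArgumentError, 'child is already attached' if child.parent

    if last_child
      last_child.next_sibling = child
      child.previous_sibling = last_child
    else
      self.first_child = child
    end
    self.last_child = child
    child.parent = self
    child
  end

  # Unlinks child, repairs the neighbours' pointers, and returns it.
  def remove_child(child)
    raise ArgumentError, 'not a child of this node' unless child.parent.equal?(self)

    if child.previous_sibling
      child.previous_sibling.next_sibling = child.next_sibling
    else
      self.first_child = child.next_sibling
    end
    if child.next_sibling
      child.next_sibling.previous_sibling = child.previous_sibling
    else
      self.last_child = child.previous_sibling
    end
    child.parent = child.previous_sibling = child.next_sibling = nil
    child
  end
end

list = SketchNode.new('ul')
a, b, c = %w[a b c].map { |n| list.append_child(SketchNode.new(n)) }
list.remove_child(b)
puts [list.first_child.name, a.next_sibling.name, c.previous_sibling.name].inspect
```

After `b` is removed, `a` and `c` point at each other directly and `b` is fully detached, so it could be appended elsewhere.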

## Performance

As mentioned in the feature list at the beginning, Gammo doesn't prioritize performance.
Thus, Gammo is not suitable for very performance-sensitive applications (e.g. parsing synchronously while serving an incoming request from an end user).
Instead, the goal is to work well with batch processing such as crawlers.
Gammo places the highest priority on making it easy to parse HTML, which it does without depending on native extensions or external gems.

## References

This was developed with reference to the following software.

- [x/net/html](https://godoc.org/golang.org/x/net/html): I've been working on this package, and it gave me a strong reason to make this happen.
- [Blink](https://www.chromium.org/blink): Blink made a great impression on me regarding tree construction.
- [html5lib-tests](https://github.com/html5lib/html5lib-tests): Gammo relies on this test suite.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
25 changes: 25 additions & 0 deletions Rakefile
@@ -0,0 +1,25 @@
require "bundler/gem_tasks"
require "rake/testtask"
require 'yaml'
require 'erubi'

Rake::TestTask.new(:test) do |t|
  t.libs << "test"
  t.libs << "lib"
  t.test_files = FileList["test/**/*_test.rb"]
end

task default: :test

def camelize(str)
  str.sub(/^[a-z\d]*/) { $&.capitalize }.sub(/\-[a-z]*/) { $&.slice(1..-1).capitalize }
end

task default: :test

task :generate do
  data = YAML.load(File.read('misc/html.yaml'), symbolize_names: true)
  @tags = data.each_value.inject(:+).uniq
  table = eval(Erubi::Engine.new(File.read('misc/table.erubi')).src, binding)
  File.write('lib/gammo/tags/table.rb', table)
end
23 changes: 23 additions & 0 deletions gammo.gemspec
@@ -0,0 +1,23 @@
require_relative 'lib/gammo/version'

Gem::Specification.new do |spec|
  spec.name = "gammo"
  spec.version = Gammo::VERSION
  spec.authors = ["namusyaka"]
  spec.email = ["namusyaka@gmail.com"]

  spec.summary = %q{An HTML parser which implements the WHATWG parsing algorithm.}
  spec.description = %q{Gammo is an implementation of the HTML5 parsing algorithm which conforms to the WHATWG specification with pure Ruby.}
  spec.homepage = "https://github.com/namusyaka/gammo"
  spec.license = "MIT"
  spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")

  spec.metadata["homepage_uri"] = spec.homepage
  spec.metadata["source_code_uri"] = "https://github.com/namusyaka/gammo"
  spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
  end
  spec.bindir = "exe"
  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]
end
15 changes: 15 additions & 0 deletions lib/gammo.rb
@@ -0,0 +1,15 @@
require "gammo/version"
require "gammo/parser"
require "gammo/fragment_parser"

module Gammo
  # Constructs a parser based on the input.
  #
  # @param [String] input
  # @param [TrueClass, FalseClass] fragment
  # @param [Hash] options
  # @return [Gammo::Parser]
  def self.new(input, fragment: false, **options)
    (fragment ? FragmentParser : Parser).new(input, **options)
  end
end
17 changes: 17 additions & 0 deletions lib/gammo/attribute.rb
@@ -0,0 +1,17 @@
module Gammo
  # Class for representing an attribute.
  class Attribute
    attr_accessor :key, :value, :namespace

    # Constructs an attribute with the key-value pair.
    # @param [String] key
    # @param [String] value
    # @param [String] namespace
    # @return [Attribute]
    def initialize(key:, value:, namespace: nil)
      @key = key
      @value = value
      @namespace = namespace
    end
  end
end
