BoilerText

BoilerText is a Go implementation of the algorithm for removing boilerplate text from HTML files, as described at http://www.l3s.de/~kohlschuetter/boilerplate. The paper is found here (PDF). BoilerText output is intended for full-text search indexing.

The reference implementation is found at https://github.com/PageDash/boilerpipe (forked from https://github.com/kohlschutter/boilerpipe). This implementation does its best to mimic the algorithm described in the paper, but it is not 100% identical to the boilerpipe implementation.

This is by no means idiomatic Go yet. We'll get there. PRs are welcome to clean things up or to add new algorithms.

How to use

See example usage in https://github.com/PageDash/boilertext/blob/master/main.go

Language Support (Split Strategy)

There are two split strategies to consider. For English and English-like languages (which consist of words formed from sequences of characters, separated by whitespace), the bufio.ScanWords SplitFunc is appropriate. For languages such as Chinese and Japanese (where text is not split into words by whitespace and is better handled one rune at a time), use the bufio.ScanRunes SplitFunc to obtain the desired result. Obviously this is a simplistic view, but we gotta start somewhere.

Note that the research algorithm was based on the English language. YMMV for other languages. We found that replacing the word split with the rune split for rune-based languages such as Chinese and Japanese performed decently.

See https://github.com/abadojack/whatlanggo for language detection feature support.

Performance

I ran a benchmark, and it shows that naive string concatenation is faster than bytes.Buffer at this scale. Since most HTML is fairly lightweight, with text block counts on the order of hundreds, string concatenation will be just fine. My results are consistent with https://github.com/hermanschaaf/go-string-concat-benchmarks.
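For reference, the two approaches being compared look roughly like the sketch below. This is not the author's benchmark code: the function names, the block count of 300, and the sample block text are assumptions chosen for illustration, and no timings are claimed here.

```go
package main

import (
	"bytes"
	"fmt"
)

// joinNaive concatenates text blocks with the += operator.
// Hypothetical helper, not taken from BoilerText.
func joinNaive(blocks []string) string {
	out := ""
	for _, b := range blocks {
		out += b
	}
	return out
}

// joinBuffer concatenates the same blocks through a bytes.Buffer.
// Hypothetical helper, not taken from BoilerText.
func joinBuffer(blocks []string) string {
	var buf bytes.Buffer
	for _, b := range blocks {
		buf.WriteString(b)
	}
	return buf.String()
}

func main() {
	// Assumed workload: a few hundred short text blocks.
	blocks := make([]string, 300)
	for i := range blocks {
		blocks[i] = "text block "
	}

	// Both approaches must produce identical output.
	fmt.Println(joinNaive(blocks) == joinBuffer(blocks)) // true
}
```

To measure the difference on your own machine, each function can be wrapped in a standard testing.B benchmark and run with go test -bench.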
