Skip to content

beliantech/boilertext

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BoilerText

BoilerText is a Go implementation of the algorithm to remove boilerplate text from HTML files as described by http://www.l3s.de/~kohlschuetter/boilerplate. The paper is found here (PDF). The intent of BoilerText output is for full-text search indexing.

The reference implementation is found in https://github.com/PageDash/boilerpipe (forked from https://github.com/kohlschutter/boilerpipe). This implementation does its best to mimick the algorithm described in the paper, but isn't 100% the same as the boilerpipe implementation.

By no means idiomatic Go. We'll get there. PRs welcome to clean up stuff or to add new algorithms.

How to use

See example usage in https://github.com/PageDash/boilertext/blob/master/main.go

Language Support (Split Strategy)

There are two possible split strategies that you will want to consider. For English and English-like languages (which consists of words formed by a sequence of characters), the bufio.ScanWords SplitFunc is appropriate. For languages such as Chinese and Japanese (which consists of rune characters), use the bufio.ScanRunes SplitFunc to obtain the desired result. Obviously this is a simplistic view, but we gotta start somewhere.

Note that the research algorithm was based on the English language. YMMV for other languages. We found that replacing word split with rune split for runic languages performed decently.

See https://github.com/abadojack/whatlanggo for language detection feature support.

Performance

I did a benchmark, and it actually shows that naive string concatenation is faster than bytes.Buffer. And since most HTML is sort of lightweight with text block count in the order of hundreds, string concatenation will be just fine. My results corroborate with https://github.com/hermanschaaf/go-string-concat-benchmarks.