This is a library for unsupervised data extraction from HTML pages.
The name is an acronym:
- Unsupervised
- HTML
- Extraction
- Based
- On
- Mining Data Records
In short, Uhebom.
It consists of two parts:
- MDR algorithm for extracting data regions from an HTML web page
- Needleman–Wunsch algorithm for alignment of data records
The MDR algorithm is based on the Mining Data Records paper. The implementation is heavily inspired by this library.
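To give a feel for the idea behind MDR, here is a minimal sketch of one ingredient: deciding whether two sibling subtrees look like records of the same data region by comparing their tag sequences with a normalised edit distance. This is not the code used by uhebom. The names `editDistance` and `Similar`, the normalisation by the longer sequence, and the 0.3 threshold are all assumptions made for illustration.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between two tag sequences.
func editDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(b)]
}

// Similar reports whether two subtrees, flattened to tag sequences, are
// close enough to be treated as records of the same region (hypothetical helper).
func Similar(a, b []string, threshold float64) bool {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return true
	}
	return float64(editDistance(a, b))/float64(longest) <= threshold
}

func main() {
	row1 := []string{"tr", "td", "td"}
	row2 := []string{"tr", "td", "td", "td"}
	footer := []string{"div", "p"}
	fmt.Println(Similar(row1, row2, 0.3))   // two table rows
	fmt.Println(Similar(row1, footer, 0.3)) // a row and an unrelated block
}
```

Runs of adjacent siblings that pass such a check are candidates for a data region; the real algorithm also considers combinations of several adjacent nodes.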
The alignment part uses this Needleman–Wunsch implementation.
The goal is to provide a fast and portable way to extract repeating data in tabular form from HTML pages. This implementation also aims to work in a JS environment.
go get -u github.com/MichaelLeachim/uhebom
package main

import (
	"log"

	extractor "github.com/MichaelLeachim/uhebom"
)

func main() {
	// Extract takes the raw HTML bytes of a page and returns the extracted data.
	datumExtracted := extractor.Extract([]byte("<html><div>Hello world</div></html>"))
	log.Println(datumExtracted)
}
Run the program and check the logged result.
TODO: implement the HTML example of this library usage.