This is a library for unsupervised data extraction from HTML pages

pmapcat/uhebom

The name is an acronym:

  • Unsupervised
  • HTML
  • Extraction
  • Based
  • On
  • Mining Data Records

In short, Uhebom.

It consists of two parts:

  • MDR algorithm for extracting data regions from an HTML web page
  • Needleman–Wunsch algorithm for alignment of data records

The MDR algorithm is based on the Mining Data Records paper. The implementation is heavily inspired by this library.
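To give a feel for the idea behind MDR, here is a minimal sketch of the similarity test it relies on: the tag sequences of adjacent sibling subtrees are compared with string edit distance, and runs of similar siblings are treated as one data region. This is not uhebom's code. The names `editDistance` and `normalizedDistance`, the sample tag sequences, and the choice to normalize by the longer sequence are all assumptions made for illustration.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between two tag sequences.
// (Hypothetical helper, not part of uhebom.)
func editDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // turning an empty prefix into b[:j] costs j insertions
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i // turning a[:i] into an empty prefix costs i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(b)]
}

// normalizedDistance scales the edit distance into [0, 1] by dividing by the
// length of the longer sequence (an assumed normalization).
func normalizedDistance(a, b []string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 0
	}
	return float64(editDistance(a, b)) / float64(longest)
}

func main() {
	// Tag sequences of two sibling subtrees, flattened in document order.
	first := []string{"li", "a", "img", "span"}
	second := []string{"li", "a", "span"}
	fmt.Println(normalizedDistance(first, second))
}
```

Siblings whose normalized distance falls under a chosen threshold would be grouped into the same region. The threshold itself is a tuning parameter and is not specified here.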

The alignment part uses this Needleman–Wunsch implementation.
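For readers unfamiliar with Needleman–Wunsch, the following is a minimal sketch of global alignment applied to two token sequences, which is roughly how records with missing fields can be lined up into columns. It is not the implementation referenced above. The function `alignTokens`, the scoring constants, and the `-` gap marker are assumptions chosen for illustration.

```go
package main

import "fmt"

// Scoring parameters are illustrative assumptions, not values taken from uhebom.
const (
	matchScore    = 1
	mismatchScore = -1
	gapScore      = -1
	gapToken      = "-"
)

// subst scores aligning token x against token y.
func subst(x, y string) int {
	if x == y {
		return matchScore
	}
	return mismatchScore
}

// alignTokens runs Needleman–Wunsch global alignment on two token sequences and
// returns both sequences padded with gapToken so that they have equal length.
func alignTokens(a, b []string) ([]string, []string) {
	n, m := len(a), len(b)
	// score[i][j] is the best score for aligning a[:i] with b[:j].
	score := make([][]int, n+1)
	for i := range score {
		score[i] = make([]int, m+1)
		score[i][0] = i * gapScore
	}
	for j := 0; j <= m; j++ {
		score[0][j] = j * gapScore
	}
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			best := score[i-1][j-1] + subst(a[i-1], b[j-1])
			if up := score[i-1][j] + gapScore; up > best {
				best = up
			}
			if left := score[i][j-1] + gapScore; left > best {
				best = left
			}
			score[i][j] = best
		}
	}
	// Trace back from the bottom-right corner to recover one optimal alignment.
	var ra, rb []string
	i, j := n, m
	for i > 0 || j > 0 {
		switch {
		case i > 0 && j > 0 && score[i][j] == score[i-1][j-1]+subst(a[i-1], b[j-1]):
			ra, rb = append(ra, a[i-1]), append(rb, b[j-1])
			i, j = i-1, j-1
		case i > 0 && score[i][j] == score[i-1][j]+gapScore:
			ra, rb = append(ra, a[i-1]), append(rb, gapToken)
			i--
		default:
			ra, rb = append(ra, gapToken), append(rb, b[j-1])
			j--
		}
	}
	// The traceback builds the alignment backwards, so reverse both results.
	for l, r := 0, len(ra)-1; l < r; l, r = l+1, r-1 {
		ra[l], ra[r] = ra[r], ra[l]
		rb[l], rb[r] = rb[r], rb[l]
	}
	return ra, rb
}

func main() {
	// Two records where the second one lacks the link field.
	x, y := alignTokens([]string{"img", "a", "span"}, []string{"img", "span"})
	fmt.Println(x)
	fmt.Println(y)
}
```

After alignment, equal positions in the two outputs can be read as the same column, with the gap marker standing in for a field that a record does not have.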

The purpose of this work

The goal is to provide a fast and portable way to extract repeating data in tabular form from HTML pages. This implementation also aims to work in a JS environment.

Installation

go get -u github.com/MichaelLeachim/uhebom

Usage

package main

import (
	"log"

	extractor "github.com/MichaelLeachim/uhebom"
)

func main() {
	// Pass the raw HTML bytes of a page to Extract and log whatever it returns.
	datumExtracted := extractor.Extract([]byte("<html><div>Hello world</div></html>"))
	log.Println(datumExtracted)
}

Demo

You should check out the result of the system

TODO: add an HTML example showing how to use this library.
