GitHub - pmapcat/uhebom: This is a library for unsupervised data extraction from HTML pages

This is a library for unsupervised data extraction from HTML pages

The name is an acronym:

Unsupervised
HTML
Extraction
Based
On
Mining Data Records

In short, Uhebom.

It consists of two parts:

MDR algorithm for extracting data regions from an HTML web page
Needleman–Wunsch algorithm for alignment of data records

The MDR algorithm is based on the Mining Data Records paper. The implementation is heavily inspired by this library.
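
The region-finding idea behind MDR can be illustrated with a deliberately simplified sketch. It assumes that each child subtree is compared only with its immediate neighbour (the paper describes a more general comparison over combinations of adjacent nodes), that similarity is an edit distance over tag sequences normalized by the longer sequence, and that a threshold of 0.3 is acceptable. The Node type, the function names, and the threshold are hypothetical and are not part of uhebom's API.

```go
package main

import "fmt"

// Node is a hypothetical minimal DOM node used only for this sketch.
type Node struct {
	Tag      string
	Children []*Node
}

// tagSequence flattens a subtree into its pre-order list of tag names.
func tagSequence(n *Node) []string {
	seq := []string{n.Tag}
	for _, c := range n.Children {
		seq = append(seq, tagSequence(c)...)
	}
	return seq
}

// editDistance returns the Levenshtein distance between two tag sequences.
func editDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, free when the tags match
			if a[i-1] != b[j-1] {
				best++
			}
			if del := prev[j] + 1; del < best {
				best = del
			}
			if ins := cur[j-1] + 1; ins < best {
				best = ins
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

// normalizedDistance scales the edit distance by the longer sequence
// (an assumption of this sketch), giving a value between 0 and 1.
func normalizedDistance(a, b []string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 0
	}
	return float64(editDistance(a, b)) / float64(longest)
}

// findRegions returns [first, last] child index pairs for every run of
// two or more adjacent children whose tag sequences are similar.
func findRegions(parent *Node, threshold float64) [][2]int {
	var regions [][2]int
	kids := parent.Children
	start := 0
	for i := 1; i <= len(kids); i++ {
		similar := i < len(kids) &&
			normalizedDistance(tagSequence(kids[i-1]), tagSequence(kids[i])) <= threshold
		if similar {
			continue
		}
		if i-1 > start {
			regions = append(regions, [2]int{start, i - 1})
		}
		start = i
	}
	return regions
}

// item builds one repeated record: <li><a></a><span></span></li>.
func item() *Node {
	return &Node{Tag: "li", Children: []*Node{{Tag: "a"}, {Tag: "span"}}}
}

func main() {
	page := &Node{Tag: "div", Children: []*Node{
		{Tag: "h1"}, item(), item(), item(), {Tag: "p"},
	}}
	fmt.Println(findRegions(page, 0.3))
}
```

In this toy page the three li children form a single region, while the heading and the trailing paragraph are left out.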

The alignment part uses this Needleman–Wunsch implementation.
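
To show what aligning data records means, here is a minimal Needleman–Wunsch sketch over tag sequences. It is not the implementation the library uses. The scoring values (+1 match, -1 mismatch, -1 gap), the gap marker, and the function names are assumptions chosen for illustration.

```go
package main

import "fmt"

// Scoring parameters are illustrative assumptions, not values taken from uhebom.
const (
	matchScore    = 1
	mismatchScore = -1
	gapScore      = -1
	gapToken      = "-"
)

// pairScore returns the score for placing tokens x and y in the same column.
func pairScore(x, y string) int {
	if x == y {
		return matchScore
	}
	return mismatchScore
}

// reverse reverses a slice of strings in place.
func reverse(s []string) {
	for i, j := 0, len(s)-1; i < j; i, j = i+1, j-1 {
		s[i], s[j] = s[j], s[i]
	}
}

// align globally aligns two token sequences with the Needleman–Wunsch
// algorithm and returns both sequences padded with gapToken to equal length.
func align(a, b []string) ([]string, []string) {
	n, m := len(a), len(b)
	// score[i][j] is the best score for aligning a[:i] with b[:j].
	score := make([][]int, n+1)
	for i := range score {
		score[i] = make([]int, m+1)
		score[i][0] = i * gapScore
	}
	for j := 0; j <= m; j++ {
		score[0][j] = j * gapScore
	}
	for i := 1; i <= n; i++ {
		for j := 1; j <= m; j++ {
			best := score[i-1][j-1] + pairScore(a[i-1], b[j-1])
			if up := score[i-1][j] + gapScore; up > best {
				best = up
			}
			if left := score[i][j-1] + gapScore; left > best {
				best = left
			}
			score[i][j] = best
		}
	}
	// Trace back from the bottom-right corner to recover one optimal alignment.
	var ra, rb []string
	i, j := n, m
	for i > 0 || j > 0 {
		switch {
		case i > 0 && j > 0 && score[i][j] == score[i-1][j-1]+pairScore(a[i-1], b[j-1]):
			ra, rb = append(ra, a[i-1]), append(rb, b[j-1])
			i, j = i-1, j-1
		case i > 0 && score[i][j] == score[i-1][j]+gapScore:
			ra, rb = append(ra, a[i-1]), append(rb, gapToken)
			i--
		default:
			ra, rb = append(ra, gapToken), append(rb, b[j-1])
			j--
		}
	}
	reverse(ra)
	reverse(rb)
	return ra, rb
}

func main() {
	// Two records from the same region; the second lacks the span field.
	first := []string{"img", "h3", "span", "a"}
	second := []string{"img", "h3", "a"}
	ra, rb := align(first, second)
	fmt.Println(ra)
	fmt.Println(rb)
}
```

Once records are aligned this way, each column can be read as one field of a table, with the gap marker standing in for a missing value.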

The purpose of this work

The goal is to provide a fast and portable way to extract repeating data in tabular form from HTML pages. This implementation also aims to work in a JS environment.

Installation

go get -u github.com/MichaelLeachim/uhebom

Usage

package main

import (
  "log"

  extractor "github.com/MichaelLeachim/uhebom"
)

func main() {
  // Extract takes the raw HTML of a page as a byte slice.
  datumExtracted := extractor.Extract([]byte("<html><div>Hello world</div></html>"))
  log.Println(datumExtracted)
}

Demo

You should check out the result of the system.

TODO: implement the HTML example of this library usage.
