Unsupervised Content Extractor from Web Pages via Laplacian Graph-Node Weighting

nkt1546789/weightress

Quick start

git clone --depth 1 git@github.com:nkt1546789/weightress.git
cd weightress
python weightress.py "url"
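The snippets in the Usage section pass the page source in a variable named `html`. A minimal sketch of one way to obtain it is shown here, assuming Python 3 and only the standard library. The helper names `fetch_html` and `decode_html` are hypothetical and not part of weightress, and whether `fit` expects text or bytes is an assumption to check against the source.

```python
from urllib.request import urlopen


def decode_html(raw, charset=None):
    """Decode raw response bytes; fall back to UTF-8 when no charset is declared."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Download `url` and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # The charset comes from the Content-Type header, if the server sent one.
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)


# Example (requires network access):
# html = fetch_html("https://example.com/")
```

Splitting the decoding step from the download keeps the charset fallback easy to check without a network connection.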

Usage

Weightress assigns a weight to each DOM element, such as text nodes and images. Each weight represents the importance of the corresponding element. Because every DOM element is weighted, you can extract any kind of element together with its weight. The following examples show how.
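Each example in this section yields (item, weight) pairs, so ordinary list operations are enough to rank or filter them. The sketch below uses made-up pairs rather than real weightress output. The helpers `top_n` and `above_threshold` are hypothetical names, not library functions, and the weight values and their scale are assumptions.

```python
# Hypothetical (item, weight) pairs; the values are invented for illustration.
weighted = [
    ("Site navigation", 0.02),
    ("Main article paragraph", 0.61),
    ("Related links", 0.05),
    ("Article headline", 0.32),
]


def top_n(pairs, n):
    """Return the n pairs with the largest weights, highest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


def above_threshold(pairs, threshold):
    """Keep pairs whose weight is at least `threshold`, in their original order."""
    return [(item, weight) for item, weight in pairs if weight >= threshold]


for item, weight in top_n(weighted, 2):
    print(item, weight)
```

Ranking with `top_n` suits a fixed number of results, while a threshold suits pages where the amount of main content varies.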

Texts

You can obtain a list of weighted text nodes like this:

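from __future__ import print_function
import weightress

# `html` holds the HTML source of the page to analyze.
ce = weightress.ContentExtractor().fit(html)
print("weighted texts (top 10):")
weighted_texts = ce.get_weighted_texts()
# Sort by weight, highest first, and show the ten heaviest text nodes.
for text, weight in sorted(weighted_texts, key=lambda x: x[1], reverse=True)[:10]:
    print(text, weight)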

Images

You can obtain a list of weighted image src values like this:

from __future__ import print_function
import weightress

# `html` holds the HTML source of the page to analyze.
ce = weightress.ContentExtractor().fit(html)
print("content images (in top 3 elements)")
for src, weight in ce.extract_images(topn=3):
    print(src, weight)

bs4 Elements

You can obtain a list of weighted DOM elements (bs4 elements) like this:

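from __future__ import print_function
import weightress

# `html` holds the HTML source of the page to analyze.
ce = weightress.ContentExtractor().fit(html)
print("bs4 elements (top 5):")
for elem, weight in ce.extract_elements(topn=5):
    print(elem.name, elem.attrs, weight)
print()  # trailing blank line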

Top-1 text

from __future__ import print_function
import weightress

# `html` holds the HTML source of the page to analyze.
ce = weightress.ContentExtractor().fit(html)
print("top-1 text:")
print(ce.extract_text(deliminator=u"\n"))
