Quick start

git clone --depth 1 git@github.com:nkt1546789/weightress.git
cd weightress
python weightress.py "url"

Usage

Weightress assigns weights to DOM elements such as text nodes and images. Each weight represents the importance of the corresponding DOM element. Because every DOM element is weighted, you can extract any kind of element together with its weight. Some examples follow.
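
The snippets below pass a variable named html to fit() without showing where it comes from. Here is a minimal sketch of one way to obtain it using only the standard library. fetch_html and decode_html are hypothetical helper names, not part of weightress, and decoding as UTF-8 is an assumption: this README does not say whether fit() expects bytes or text.

```python
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def decode_html(raw, encoding="utf-8"):
    """Decode raw page bytes to text, replacing undecodable bytes.

    Assumption: the page is UTF-8 unless the caller says otherwise.
    """
    return raw.decode(encoding, "replace")


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with closing(urlopen(url, timeout=timeout)) as response:
        return decode_html(response.read())
```

Calling html = fetch_html("http://example.com/") then gives a string to try with the examples that follow.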

Texts

You can obtain a list of weighted text nodes like this:

import weightress
# html holds the HTML source of the page to analyse
ce = weightress.ContentExtractor().fit(html)
print "weighted texts (top 10):"
h = ce.get_weighted_texts()
for text, weight in sorted(h, key=lambda x:x[1], reverse=True)[:10]:
    print text, weight
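
The snippet above keeps a fixed number of texts. As an alternative, the list can be filtered relative to the highest weight. The sketch below assumes only what the snippet shows: that get_weighted_texts() yields (text, weight) pairs and that larger weights mean more important. It further assumes the weights are non-negative, which this README does not state. keep_relative is a hypothetical helper and the sample pairs are made up.

```python
def keep_relative(pairs, ratio=0.5):
    """Keep (item, weight) pairs whose weight is at least ratio * max weight.

    Assumes non-negative weights where larger means more important.
    The result is sorted from heaviest to lightest.
    """
    pairs = list(pairs)
    if not pairs:
        return []
    threshold = ratio * max(weight for _, weight in pairs)
    kept = [(item, weight) for item, weight in pairs if weight >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up sample data standing in for ce.get_weighted_texts().
sample = [("Main article body", 0.9), ("Copyright 2015", 0.1), ("Headline", 0.6)]
for text, weight in keep_relative(sample, ratio=0.5):
    print("%s %s" % (text, weight))
```

A relative threshold adapts to pages of different sizes, whereas a fixed top-N may include low-weight noise on short pages.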

Images

You can obtain a list of weighted image src values like this:

import weightress
ce = weightress.ContentExtractor().fit(html)
print "content images (in top 3 elements)"
for src, weight in ce.extract_images(topn=3):
    print src, weight
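
The src attributes in a page are often relative (for example img/photo.png), so they may need to be resolved against the page URL before the images can be downloaded. The sketch below assumes that extract_images() yields (src, weight) pairs as in the snippet above and that the src values are returned as they appear in the HTML, which this README does not confirm. absolutize is a hypothetical helper and the sample data is made up.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolutize(page_url, weighted_srcs):
    """Resolve each src against page_url, keeping its weight unchanged."""
    return [(urljoin(page_url, src), weight) for src, weight in weighted_srcs]


# Made-up sample data standing in for ce.extract_images(topn=3).
sample = [
    ("img/photo.png", 0.8),
    ("/logo.gif", 0.3),
    ("http://cdn.example.org/a.jpg", 0.2),
]
for src, weight in absolutize("http://example.com/news/story.html", sample):
    print("%s %s" % (src, weight))
```

urljoin leaves absolute URLs untouched, so the helper can be applied to a mixed list.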

bs4 Elements

You can obtain a list of weighted DOM elements (bs4 elements) like this:

import weightress
ce = weightress.ContentExtractor().fit(html)
print "bs4 elements (top 5):"
for elem, weight in ce.extract_elements(topn=5):
    print elem.name, elem.attrs, weight
print
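
To see which kinds of tags carry the most weight, the (element, weight) pairs can be aggregated by tag name. The sketch below relies only on each element exposing a .name attribute, as used in the snippet above. weight_by_tag is a hypothetical helper, and a namedtuple stand-in with made-up weights replaces real bs4 elements so that the example runs without bs4 installed.

```python
from collections import defaultdict, namedtuple


def weight_by_tag(weighted_elems):
    """Sum weights per tag name and return (name, total) pairs, heaviest first."""
    totals = defaultdict(float)
    for elem, weight in weighted_elems:
        totals[elem.name] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Stand-in for bs4 elements; only the .name attribute is used here.
Elem = namedtuple("Elem", ["name"])
sample = [(Elem("div"), 0.5), (Elem("p"), 0.25), (Elem("div"), 0.25)]
for name, total in weight_by_tag(sample):
    print("%s %s" % (name, total))
```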

Top-1 text

import weightress
ce = weightress.ContentExtractor().fit(html)
print "top-1 text:"
print ce.extract_text(deliminator=u"\n")
