First, import the :py:class:`Extractor` class:
from chopper import Extractor
Then create an :py:class:`Extractor` instance, either by instantiating the class explicitly or by calling the :py:meth:`Extractor.keep` and :py:meth:`Extractor.discard` class methods directly:
from chopper import Extractor
# Instantiate style
extractor = Extractor().keep('//div').discard('//a')
# Class method style
extractor = Extractor.keep('//div').discard('//a')
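Supporting both call styles means that ``keep`` and ``discard`` must work whether they are reached through the class or through an instance. The sketch below shows one way such an API could be built, using a small descriptor. The names ``chainable`` and ``RuleBuilder`` are hypothetical stand-ins for illustration and do not describe Chopper's actual implementation.

```python
import functools


class chainable:
    """Descriptor that binds a method to an instance, creating one if needed."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # Accessed on the class: instance is None, so start from a fresh object.
        target = owner() if instance is None else instance
        return functools.partial(self.func, target)


class RuleBuilder:
    """Hypothetical stand-in that only records the rules it is given."""

    def __init__(self):
        self.rules = []

    @chainable
    def keep(self, xpath):
        self.rules.append(("keep", xpath))
        return self  # returning self is what makes chaining possible

    @chainable
    def discard(self, xpath):
        self.rules.append(("discard", xpath))
        return self


# Both styles produce the same list of rules.
by_instance = RuleBuilder().keep("//div").discard("//a")
by_class = RuleBuilder.keep("//div").discard("//a")
print(by_class.rules)  # [('keep', '//div'), ('discard', '//a')]
```

Each class-level call starts from a new object, so two separate chains never share rules.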
An :py:class:`Extractor` instance lets you chain multiple :py:meth:`Extractor.keep` and :py:meth:`Extractor.discard` calls:
from chopper import Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a').keep('strong')
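To make the effect of chained rules concrete, here is a rough model of the behaviour using only the standard library. Kept elements survive together with their ancestors and descendants, discarded elements are removed, and the text following a removed element is preserved. The ``prune`` and ``_remove_keeping_tail`` helpers are hypothetical, the input must be well-formed XML, and ``xml.etree.ElementTree`` only accepts relative paths such as ``.//div`` rather than ``//div``. Chopper itself may resolve rules differently, in particular when keep and discard rules conflict.

```python
import xml.etree.ElementTree as ET


def _remove_keeping_tail(parent, child):
    """Detach child, but keep the text that followed it in the document."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if tail and index == 0:
        parent.text = (parent.text or "") + tail
    elif tail:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def prune(markup, keep, discard=()):
    """Hypothetical helper: apply keep/discard paths to well-formed markup."""
    root = ET.fromstring(markup)
    parent_of = {child: parent for parent in root.iter() for child in parent}

    kept = set()
    for path in keep:
        for element in root.findall(path):
            kept.update(element.iter())   # the match and its descendants
            while element is not None:    # plus every ancestor up to the root
                kept.add(element)
                element = parent_of.get(element)

    # Assumption of this sketch: a discard rule always wins over a keep rule.
    dropped = {el for path in discard for el in root.findall(path)}

    for parent in list(root.iter()):
        for child in list(parent):
            if child in dropped or child not in kept:
                _remove_keeping_tail(parent, child)
    return ET.tostring(root, encoding="unicode")


DOC = (
    "<html><head><title>Hello</title></head><body>"
    "<header>Top</header>"
    "<div><p><span>Main </span>content</p><a href='/'>See more</a></div>"
    "<footer>Bottom</footer>"
    "</body></html>"
)
result = prune(DOC, keep=[".//div[p]"], discard=[".//span", ".//a"])
print(result)  # <html><body><div><p>content</p></div></body></html>
```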
Once your :py:class:`Extractor` instance is created, call its :py:meth:`Extractor.extract` method. It takes at least one argument, the HTML to parse, and returns the extracted HTML.
To parse CSS as well, pass it as the second argument; the method then returns both the extracted HTML and the CSS.
from chopper import Extractor
HTML = """
<html>
<head>
<title>Hello world !</title>
</head>
<body>
<header>This is the header</header>
<div>
<p><span>Main </span>content</p>
<a href="/">See more</a>
</div>
<footer>This is the footer</footer>
</body>
</html>
"""
CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""
# Create the Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a')
# Parse HTML only
html = e.extract(HTML)
>>> html
"""
<html>
<body>
<div>
<p>content</p>
</div>
</body>
</html>
"""
# Parse HTML & CSS
html, css = e.extract(HTML, CSS)
>>> html
"""
<html>
<body>
<div>
<p>content</p>
</div>
</body>
</html>
"""
>>> css
"""
p{color:red;}
body{background-color:green;}
"""
Chopper can also convert relative links to absolute ones. To do so, pass the ``base_url`` keyword argument to the :py:meth:`Extractor.extract` method.
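For intuition, resolving a relative link against a base URL follows the usual URL resolution rules, which can be sketched with the standard library's ``urljoin``. The ``absolutize`` helper is hypothetical and only illustrates the resolution step; it is not necessarily how Chopper implements it.

```python
from urllib.parse import urljoin

BASE_URL = "http://test.com/path/index.html"


def absolutize(href, base_url=BASE_URL):
    """Hypothetical helper: resolve href against base_url."""
    return urljoin(base_url, href)


print(absolutize("page.html"))           # http://test.com/path/page.html
print(absolutize("/"))                   # http://test.com/
print(absolutize("../img/logo.png"))     # http://test.com/img/logo.png
print(absolutize("http://other.org/x"))  # http://other.org/x (already absolute)
```

With Chopper itself, ``base_url`` is passed to ``extract`` as follows: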
from chopper import Extractor
HTML = """
<html>
<head>
<title>Hello world !</title>
</head>
<body>
<div>
<p>content</p>
<a href="page.html">See more</a>
</div>
</body>
</html>
"""
html = Extractor.keep('//a').extract(HTML, base_url='http://test.com/path/index.html')
>>> html
"""
<html>
<body>
<div>
<a href="http://test.com/path/page.html">See more</a>
</div>
</body>
</html>
"""