Usage

Create the Extractor instance

First, import the Extractor class:

from chopper import Extractor

Then create an Extractor instance, either by instantiating the class explicitly or by calling the Extractor.keep or Extractor.discard class methods directly:

from chopper import Extractor

# Instantiate style
extractor = Extractor().keep('//div').discard('//a')

# Class method style
extractor = Extractor.keep('//div').discard('//a')

Add XPath expressions

An Extractor instance lets you chain multiple Extractor.keep and Extractor.discard calls:

from chopper import Extractor

e = Extractor.keep('//div[p]').discard('//span').discard('//a').keep('strong')
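
To get a feel for what expressions such as the ones above select, here is a minimal sketch that uses only the standard library. It is an illustration, not Chopper's own code: xml.etree.ElementTree supports only a subset of XPath and requires relative paths (".//div[p]" instead of "//div[p]"), and the DOC sample is a hypothetical document made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the Chopper docs).
DOC = """
<html>
  <body>
    <div><p><span>Main </span>content</p><a href="/">See more</a></div>
    <div><a href="/other">No paragraph here</a></div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# './/div[p]' selects every <div> that has at least one <p> child.
divs_with_p = root.findall(".//div[p]")

# './/span' and './/a' select every <span> and <a> descendant.
spans = root.findall(".//span")
links = root.findall(".//a")

print(len(divs_with_p), len(spans), len(links))  # 1 1 2
```

Only the first div matches the "div[p]" predicate, because the second one has no p child.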

Extract contents

Once your Extractor instance is created, call its Extractor.extract method. It takes at least one argument: the HTML to parse.

If you want to also parse CSS, pass it as the second argument.

from chopper import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <header>This is the header</header>
    <div>
      <p><span>Main </span>content</p>
      <a href="/">See more</a>
    </div>
    <footer>This is the footer</footer>
  </body>
</html>
"""

CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""

# Create the Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a')

# Parse HTML only
html = e.extract(HTML)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

# Parse HTML & CSS
html, css = e.extract(HTML, CSS)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

>>> css
"""
p{color:red;}
body{background-color:green;}
"""
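
In the output above, the a and span rules are gone because no matching element remains in the extracted HTML. The following is a rough sketch of that idea, assuming simple type selectors only. It is not Chopper's implementation; the names TagCollector and filter_css are hypothetical helpers introduced for this illustration.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def filter_css(html, css):
    """Keep only the rules whose type selector names a tag present in html."""
    collector = TagCollector()
    collector.feed(html)
    kept = []
    # Naive rule splitter: handles 'selector { declarations }' only.
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        selector = selector.strip()
        if selector in collector.tags:
            # Drop whitespace around ':' and ';' to get a compact rule.
            declarations = re.sub(r"\s*([:;])\s*", r"\1", body.strip())
            kept.append(f"{selector}{{{declarations}}}")
    return "\n".join(kept)


KEPT_HTML = "<html><body><div><p>content</p></div></body></html>"
CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""

print(filter_css(KEPT_HTML, CSS))
```

A real stylesheet needs a proper CSS parser and selector matching; the regular expression here only covers the flat rules used in this example.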

Chopper can also convert relative links to absolute ones. To do so, pass the base_url keyword argument to the Extractor.extract method.

from chopper import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <div>
      <p>content</p>
      <a href="page.html">See more</a>
    </div>
  </body>
</html>
"""

html = Extractor.keep('//a').extract(HTML, base_url='http://test.com/path/index.html')

>>> html
"""
<html>
  <body>
    <div>
      <a href="http://test.com/path/page.html">See more</a>
    </div>
  </body>
</html>
"""
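
The way "page.html" resolves against the base URL above follows standard URL resolution rules. As a sketch of those rules (using the standard library's urljoin, which is not necessarily what Chopper calls internally):

```python
from urllib.parse import urljoin

base_url = "http://test.com/path/index.html"

# A relative path replaces the last segment of the base URL.
relative = urljoin(base_url, "page.html")

# A root-relative path keeps only the scheme and host.
root_relative = urljoin(base_url, "/")

# An absolute URL is returned unchanged.
absolute = urljoin(base_url, "http://other.com/x")

print(relative)       # http://test.com/path/page.html
print(root_relative)  # http://test.com/
print(absolute)       # http://other.com/x
```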