Usage

Create the Extractor instance

First, import the Extractor class:

from chopper import Extractor

Then create an Extractor instance, either by instantiating one explicitly or by calling Extractor.keep or Extractor.discard directly on the class:

from chopper import Extractor

# Instantiate style
extractor = Extractor().keep('//div').discard('//a')

# Class method style
extractor = Extractor.keep('//div').discard('//a')
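How a single method name can serve both styles is not described here. The following is a minimal sketch of one possible mechanism: a descriptor that creates a fresh instance when the method is accessed on the class. `MiniExtractor` and `class_or_instance_method` are hypothetical names used for illustration only; this is not chopper's actual implementation.

```python
import functools


class class_or_instance_method:
    """Hypothetical descriptor: bind to the instance, or to a new one
    when the method is looked up on the class itself."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed as MiniExtractor.keep: create an instance on the fly
            instance = owner()
        return functools.partial(self.func, instance)


class MiniExtractor:
    """Toy stand-in that only records rules (assumption, not chopper)."""

    def __init__(self):
        self.rules = []

    @class_or_instance_method
    def keep(self, xpath):
        self.rules.append(('keep', xpath))
        return self  # returning self is what makes chaining possible

    @class_or_instance_method
    def discard(self, xpath):
        self.rules.append(('discard', xpath))
        return self


# Both styles end up with the same recorded rules
a = MiniExtractor().keep('//div').discard('//a')
b = MiniExtractor.keep('//div').discard('//a')
print(a.rules == b.rules)
```

Returning the instance from every call is the design choice that allows the chained style shown in this section.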

Add XPath expressions

An Extractor instance lets you chain multiple Extractor.keep and Extractor.discard calls:

from chopper import Extractor

e = Extractor.keep('//div[p]').discard('//span').discard('//a').keep('strong')
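To make the `//div[p]` predicate concrete, here is a small standalone sketch using only the standard library's `xml.etree.ElementTree`, not chopper itself. The sample document and variable names are made up for illustration. ElementTree supports only a subset of XPath and requires relative paths, so `.//div[p]` stands in for `//div[p]`.

```python
import xml.etree.ElementTree as ET

# Made-up sample document: only the first div has a <p> child
DOC = """
<body>
  <div><p>kept</p></div>
  <div><span>no paragraph here</span></div>
</body>
"""

root = ET.fromstring(DOC)

# 'div[p]' selects div elements that have at least one <p> child
matches = root.findall('.//div[p]')
texts = [div.find('p').text for div in matches]
print(texts)
```

Only the first `div` matches, because the predicate `[p]` requires a direct `p` child.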

Extract contents

Once your Extractor instance is created, call the Extractor.extract method on it. The method takes at least one argument: the HTML to parse.

To also parse CSS, pass it as the second argument. The method then returns both the HTML and the CSS.

from chopper import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <header>This is the header</header>
    <div>
      <p><span>Main </span>content</p>
      <a href="/">See more</a>
    </div>
    <footer>This is the footer</footer>
  </body>
</html>
"""

CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""

# Create the Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a')

# Parse HTML only
html = e.extract(HTML)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

# Parse HTML & CSS
html, css = e.extract(HTML, CSS)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

>>> css
"""
p{color:red;}
body{background-color:green;}
"""

Convert relative links to absolute ones

Chopper can also convert relative links to absolute ones. To do so, pass the base_url keyword argument to the Extractor.extract method.

from chopper import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <div>
      <p>content</p>
      <a href="page.html">See more</a>
    </div>
  </body>
</html>
"""

html = Extractor.keep('//a').extract(HTML, base_url='http://test.com/path/index.html')

>>> html
"""
<html>
  <body>
    <div>
      <a href="http://test.com/path/page.html">See more</a>
    </div>
  </body>
</html>
"""
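The result above matches the usual rules for resolving a relative URL against a base. As an illustration only (chopper's internals may differ), the standard library's `urllib.parse.urljoin` produces the same absolute URL for this example:

```python
from urllib.parse import urljoin

base_url = 'http://test.com/path/index.html'

# The last path segment of the base ('index.html') is replaced by the relative link
absolute = urljoin(base_url, 'page.html')
print(absolute)

# A root-relative link replaces the whole path instead
root_relative = urljoin(base_url, '/')
print(root_relative)
```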
