Usage

Create the |extractor| instance

First, import the |extractor| class:

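from chopper.extractor import Extractor

Then create the instance, either by instantiating the class explicitly or by calling a chaining method directly on the class:

from chopper.extractor import Extractor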

# Instantiate style
extractor = Extractor().keep('//div').discard('//a')

# Class method style
extractor = Extractor.keep('//div').discard('//a')
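
Both styles produce the same chain of rules. How one method name can work on the class and on an instance is not covered on this page; one plausible way to get that behavior in plain Python is a small descriptor. The sketch below is only an illustration under that assumption: `MiniExtractor` and `chainable` are hypothetical names, not part of Chopper, and the code says nothing about how Chopper is actually implemented.

```python
import functools


class chainable:
    """Descriptor: when the method is reached through the class,
    a fresh instance is created first (hypothetical helper)."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # instance is None when the attribute is looked up on the class itself.
        target = owner() if instance is None else instance
        return functools.partial(self.func, target)


class MiniExtractor:
    """Hypothetical stand-in that only records XPath rules in order."""

    def __init__(self):
        self.rules = []

    @chainable
    def keep(self, xpath):
        self.rules.append(("keep", xpath))
        return self  # returning self is what allows chaining

    @chainable
    def discard(self, xpath):
        self.rules.append(("discard", xpath))
        return self


a = MiniExtractor().keep("//div").discard("//a")  # instantiate style
b = MiniExtractor.keep("//div").discard("//a")    # class method style
print(a.rules == b.rules)
```

Each call returns the instance, so further `keep` and `discard` calls can be appended to either form.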

Add XPath expressions

from chopper.extractor import Extractor

e = Extractor.keep('//div[p]').discard('//span').discard('//a').keep('strong')

Extract contents

Pass the HTML to the |extract| method to get the cleaned markup. To also parse CSS, pass it as the second argument.

Warning

Depending on the size of the CSS, parsing and cleaning it can be much slower than parsing and cleaning the HTML.

from chopper.extractor import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <header>This is the header</header>
    <div>
      <p><span>Main </span>content</p>
      <a href="/">See more</a>
    </div>
    <footer>This is the footer</footer>
  </body>
</html>
"""

CSS = """
a { color: blue; }
p { color: red; }
span { border: 1px solid red; }
body { background-color: green; }
"""

# Create the Extractor
e = Extractor.keep('//div[p]').discard('//span').discard('//a')

# Parse HTML only
html = e.extract(HTML)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

# Parse HTML & CSS
html, css = e.extract(HTML, CSS)

>>> html
"""
<html>
  <body>
    <div>
      <p>content</p>
    </div>
  </body>
</html>
"""

>>> css
"""
p{color:red;}
body{background-color:green;}
"""
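
To make the keep/discard semantics of the HTML result concrete, here is a rough standard-library sketch that reproduces the same output for the sample document. It is not Chopper's implementation: `prune` is a hypothetical helper, it relies on `xml.etree.ElementTree` (which only accepts well-formed markup and a limited, relative XPath subset, hence `.//div[p]` instead of `//div[p]`), and it ignores CSS entirely. The assumed rule is that a kept element survives together with its ancestors and descendants, while discarded elements are removed but the text following them is preserved.

```python
import xml.etree.ElementTree as ET

HTML = (
    "<html><head><title>Hello world !</title></head><body>"
    "<header>This is the header</header>"
    '<div><p><span>Main </span>content</p><a href="/">See more</a></div>'
    "<footer>This is the footer</footer>"
    "</body></html>"
)


def prune(markup, keep, discard):
    """Hypothetical helper: apply keep/discard XPath rules to well-formed markup."""
    root = ET.fromstring(markup)
    parents = {child: parent for parent in root.iter() for child in parent}

    # A kept element survives along with its descendants and its ancestors.
    wanted = set()
    for path in keep:
        for node in root.findall(path):
            wanted.update(node.iter())
            while node is not None:
                wanted.add(node)
                node = parents.get(node)

    discarded = {node for path in discard for node in root.findall(path)}

    def walk(elem):
        previous = None
        for child in list(elem):  # copy, because children are removed while looping
            if child in discarded or child not in wanted:
                # Text that follows the removed element is stored as its tail;
                # reattach it so that "content" after <span> is not lost.
                tail = child.tail or ""
                if previous is None:
                    elem.text = (elem.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                elem.remove(child)
            else:
                walk(child)
                previous = child

    walk(root)
    return ET.tostring(root, encoding="unicode")


result = prune(HTML, keep=[".//div[p]"], discard=[".//span", ".//a"])
print(result)
# <html><body><div><p>content</p></div></body></html>
```

The sample markup is written without whitespace between tags so that the printed result is easy to compare with the output shown above.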

Convert relative links to absolute ones

Chopper can also convert relative links to absolute ones. To do so, pass the base_url keyword argument to the |extract| method.

from chopper.extractor import Extractor

HTML = """
<html>
  <head>
    <title>Hello world !</title>
  </head>
  <body>
    <div>
      <p>content</p>
      <a href="page.html">See more</a>
    </div>
  </body>
</html>
"""

html = Extractor.keep('//a').extract(HTML, base_url='http://test.com/path/index.html')

>>> html
"""
<html>
  <body>
    <div>
      <a href="http://test.com/path/page.html">See more</a>
    </div>
  </body>
</html>
"""
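
The rewritten link above follows the usual rules for resolving a relative reference against a base URL. As a quick way to predict what a given `base_url` will produce, the standard library's `urllib.parse.urljoin` applies those rules; whether Chopper calls this function internally is an assumption and is not stated on this page.

```python
from urllib.parse import urljoin

base_url = "http://test.com/path/index.html"

# A relative path replaces the last segment of the base URL.
absolute = urljoin(base_url, "page.html")

# A root-relative path keeps only the scheme and host of the base URL.
root_relative = urljoin(base_url, "/")

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "http://other.example/x.html")

print(absolute)          # http://test.com/path/page.html
print(root_relative)     # http://test.com/
print(already_absolute)  # http://other.example/x.html
```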
