python-htmlparser

Python 3.x HTMLParser extension with ElementTree support.

Why another Python HTML Parser?

There is no HTML Parser in the Python Standard Library. Actually, there is the html.parser.HTMLParser that simply traverses the DOM tree and allows us to be notified as each tag is being parsed.

Usually, when we parse HTML we want to query its elements and extract data from it. The most simple way to do this is to use XPath expressions. Python do support a simple (read limited) XPath engine into its ElementTree, but there is no way to parse an HTML document into XHTML and then use this library to query it.

This HTML Parser extends html.parser.HTMLParser returning an xml.etree.Element instance (the root element) which natively supports the ElementTree API.

You may use this code however you like. You may even copy-paste it into your project in order to keep the result clean and simple (a comment to this source is welcome!).

But... wait!

As the filename implies, this is a very naive approach to this problem. If you really need (or may use) a fully-fledged parsing library, lxml and BeautifulSoup are arguably the most used.

Example

html = """
<html>
  <head>
    <title>GitHub</title>
  </head>
  <body>
    <a href="https://github.com/marmelo">GitHub</a>
    <a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>
  </body>
</html>
"""

parser = NaiveHTMLParser()
root = parser.feed(html)
parser.close()

# root is an xml.etree.Element and supports the ElementTree API
# (e.g. you may use its limited support for XPath expressions)

# get title
print(root.find('head/title').text)

# get all anchors
for a in root.findall('.//a'):
    print(a.get('href'))

# for more information, see:
# http://docs.python.org/2/library/xml.etree.elementtree.html
# http://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support

Output:

GitHub
https://github.com/marmelo
https://github.com/marmelo/python-htmlparser

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
LICENSE		LICENSE
README.md		README.md
naivehtmlparser.py		naivehtmlparser.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

python-htmlparser

Why another Python HTML Parser?

But... wait!

Example

About

Uh oh!

Releases

Packages

Languages

License

marmelo/python-htmlparser

Folders and files

Latest commit

History

Repository files navigation

python-htmlparser

Why another Python HTML Parser?

But... wait!

Example

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages