Commit cbe9633

Added naive html parser implementation.

Author: Rafael Marmelo
1 parent 4aa8099, commit cbe9633

2 files changed: +151 -1 lines changed

README.md

Lines changed: 67 additions & 1 deletion
@@ -1,4 +1,70 @@
 python-htmlparser
 =================
 
-Python 3.x html.parser.HTMLParser extension with ElementTree support.
+Python 3.x HTMLParser extension with ElementTree support.
+
+
+Why another Python HTML Parser?
+-----
+
+The Python Standard Library has no HTML parser that builds a document tree.
+It does have html.parser.HTMLParser, but that class only scans the markup and notifies us, through handler methods, as each tag is parsed.
+
+Usually, when we parse HTML we want to query its elements and extract data from them.
+The simplest way to do this is to use XPath expressions.
+Python does support a simple (read: limited) XPath engine in its ElementTree module, but the standard library offers no way to parse an HTML document into an element tree and then query it with that module.
+
+This HTML Parser extends html.parser.HTMLParser and returns an xml.etree.ElementTree.Element instance (the root element), which natively supports the ElementTree API.
+
+You may use this code however you like.
+You may even copy-paste it into your project in order to keep the result clean and simple (a comment linking back to this source is welcome!).
+
+
+But... wait!
+-----
+
+As the filename implies, this is a very naive approach to this problem.
+If you really need (or may use) a fully-fledged parsing library, [lxml](http://lxml.de/) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) are arguably the most widely used.
+
+
+Example
+-----
+
+```python
+html = """
+<html>
+    <head>
+        <title>GitHub</title>
+    </head>
+    <body>
+        <a href="https://github.com/marmelo">GitHub</a>
+        <a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>
+    </body>
+</html>
+"""
+
+parser = NaiveHTMLParser()
+root = parser.feed(html)
+parser.close()
+
+# root is an xml.etree.ElementTree.Element and supports the ElementTree API
+# (e.g. you may use its limited support for XPath expressions)
+
+# get title
+print(root.find('head/title').text)
+
+# get all anchors
+for a in root.findall('.//a'):
+    print(a.get('href'))
+
+# for more information, see:
+# http://docs.python.org/2/library/xml.etree.elementtree.html
+# http://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support
+```
+
+Output:
+```
+GitHub
+https://github.com/marmelo
+https://github.com/marmelo/python-htmlparser
+```
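
Note on the README hunk: it leans on ElementTree's limited XPath support without showing what that subset can express. As a rough sketch, the snippet below queries a small, well-formed document. It builds the tree with `ElementTree.fromstring` only so that it runs without `naivehtmlparser.py`; the markup and the variable names are illustrative assumptions, not part of the commit.

```python
from xml.etree import ElementTree

# Well-formed markup, so the standard XML parser can build the tree directly.
doc = ElementTree.fromstring(
    "<html><head><title>GitHub</title></head>"
    "<body>"
    '<a href="https://github.com/marmelo">GitHub</a>'
    '<a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>'
    "<a>no link</a>"
    "</body></html>"
)

# Child path relative to the root element; findtext returns the text directly.
title = doc.findtext("head/title")

# Descendant search with an attribute-presence predicate.
links = [a.get("href") for a in doc.findall(".//a[@href]")]

# Attribute-value predicate.
project = doc.find(".//a[@href='https://github.com/marmelo/python-htmlparser']")

# Positional predicate (1-based): the second <a> child of <body>.
second = doc.find("body/a[2]")

print(title)
print(links)
print(project.text, "|", second.text)
```

Paths are always relative to the element they are called on, so `head/title` works from the `<html>` root while `html/head/title` would not.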

naivehtmlparser.py

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+#!/usr/bin/env python
+"""
+Python 3.x HTMLParser extension with ElementTree support.
+"""
+
+from html.parser import HTMLParser
+from xml.etree import ElementTree
+
+
+class NaiveHTMLParser(HTMLParser):
+    """
+    Python 3.x HTMLParser extension with ElementTree support.
+    @see https://github.com/marmelo/python-htmlparser
+    """
+
+    def __init__(self):
+        self.root = None
+        self.tree = []
+        HTMLParser.__init__(self)
+
+    def feed(self, data):
+        HTMLParser.feed(self, data)
+        return self.root
+
+    def handle_starttag(self, tag, attrs):
+        if len(self.tree) == 0:
+            element = ElementTree.Element(tag, dict(self.__filter_attrs(attrs)))
+            self.tree.append(element)
+            self.root = element
+        else:
+            element = ElementTree.SubElement(self.tree[-1], tag, dict(self.__filter_attrs(attrs)))
+            self.tree.append(element)
+
+    def handle_endtag(self, tag):
+        self.tree.pop()
+
+    def handle_startendtag(self, tag, attrs):
+        self.handle_starttag(tag, attrs)
+        self.handle_endtag(tag)
+        pass
+
+    def handle_data(self, data):
+        if self.tree:
+            self.tree[-1].text = data
+
+    def get_root_element(self):
+        return self.root
+
+    def __filter_attrs(self, attrs):
+        return filter(lambda x: x[0] and x[1], attrs) if attrs else []
+
+
# example usage
54+
if __name__ == "__main__":
55+
56+
html = """
57+
<html>
58+
<head>
59+
<title>GitHub</title>
60+
</head>
61+
<body>
62+
<a href="https://github.com/marmelo">GitHub</a>
63+
<a href="https://github.com/marmelo/python-htmlparser">GitHub Project</a>
64+
</body>
65+
</html>
66+
"""
67+
68+
parser = NaiveHTMLParser()
69+
root = parser.feed(html)
70+
parser.close()
71+
72+
# root is an xml.etree.Element and supports the ElementTree API
73+
# (e.g. you may use its limited support for XPath expressions)
74+
75+
# get title
76+
print(root.find('head/title').text)
77+
78+
# get all anchors
79+
for a in root.findall('.//a'):
80+
print(a.get('href'))
81+
82+
# for more information, see:
83+
# http://docs.python.org/2/library/xml.etree.elementtree.html
84+
# http://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support
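
Note on `naivehtmlparser.py`: the README calls this approach naive, and the handlers above show why. `handle_data` overwrites `text` on every call, void tags such as `<br>` are pushed but never popped, and `handle_endtag` pops whatever is on top, raising `IndexError` when the stack is empty. The following is a minimal sketch of one way to handle those cases, not the author's implementation. `TreeBuildingHTMLParser`, `VOID_ELEMENTS`, and the sample markup are hypothetical names and data introduced here for illustration.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree

# Elements that never take an end tag (assumed list, based on the HTML standard).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TreeBuildingHTMLParser(HTMLParser):
    """Hypothetical variant of NaiveHTMLParser; not part of the commit."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []   # currently open elements, innermost last
        self.last = None  # most recently closed child, which receives tail text

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive with value None.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        if self.stack:
            element = ElementTree.SubElement(self.stack[-1], tag, attrib)
        elif self.root is None:
            element = self.root = ElementTree.Element(tag, attrib)
        else:
            return  # ignore markup after the root element has closed
        if tag in VOID_ELEMENTS:
            self.last = element  # never pushed, so it cannot swallow siblings
        else:
            self.stack.append(element)
            self.last = None

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                self.last = self.stack[i]
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return  # text outside the root element is dropped
        if self.last is not None:
            # Text after a closed child belongs to that child's tail.
            self.last.tail = (self.last.tail or "") + data
        else:
            top = self.stack[-1]
            top.text = (top.text or "") + data


parser = TreeBuildingHTMLParser()
parser.feed('<html><body><p>Hello <b>big</b> world<br>next</p></span>'
            '<img src="x.png"><input disabled></body></html>')
parser.close()
root = parser.root
para = root.find("body/p")
print(repr(para.text), repr(para.find("b").tail), repr(para.find("br").tail))
```

This still falls well short of a real HTML parser (no implied end tags, no recovery of misnested elements), which is the case the README makes for lxml or BeautifulSoup.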
