Skip to content
This repository


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Simple HTML Parser Written in Python

branch: master

Fetching latest commit…

Cannot retrieve the latest commit at this time

Simple HTML parser:

Usage: ./parse <file>.html

-Whitespace in tags is dropped, but whitespace in text is preserved.
-A trailing Text tag with a single newline will always be present in the result.
-Line numbers for match errors are not yet printed.

Original problem spec:
Write a program to take as input a file, and determine whether it is a properly
formatted subset of html.  Here are the rules for what should be accepted:

1) A standalone tag is of the form <foo/>, where there is a '/' immediately
before the closing '>'.  It can appear anywhere that text would appear.

2) A tag is of the form <foo> and must be closed with </foo>.  Text and other
tags can appear in between.  Hitting '</bar>' in the document without having an
active '<bar>' tag is illegal.  Hitting the end of the document without finding
'</foo>' while processing a '<foo>' tag is illegal.  Tags of either form must be
valid C identifiers, with hyphens allowed as well.

3) Text can appear anywhere that a tag would start.  When parsing text, the
character '&' is an escape character.  To put a literal '&' the text should
contain '&amp;', and to put a literal '<' the text should contain '&lt;'.  All
other escape characters are invalid.

If the document is not formatted in a valid way, you should print the line
number where there is a problem, and what the problem is.

If the document is formatted in a valid way, you should return a data structure
that represents the document, including all tags and text (with all values
Something went wrong with that request. Please try again.