Simple HTML Parser Written in Python


Simple HTML parser:

Usage: ./parse <file>.html

-Whitespace in tags is dropped, but whitespace in text is preserved.
-A trailing Text tag with a single newline will always be present in the result.
-Line numbers for match errors are not yet printed.

Original problem spec:
Write a program that takes a file as input and determines whether it is properly
formatted according to a subset of HTML.  Here are the rules for what should be
accepted:

1) A standalone tag is of the form <foo/>, where there is a '/' immediately
before the closing '>'.  It can appear anywhere that text would appear.

2) A tag is of the form <foo> and must be closed with </foo>.  Text and other
tags can appear in between.  Hitting '</bar>' in the document without having an
active '<bar>' tag is illegal.  Hitting the end of the document without finding
'</foo>' while processing a '<foo>' tag is illegal.  Tag names of either form
must be valid C identifiers, with hyphens allowed as well.
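
Rules 1 and 2 can be illustrated with a small stack-based check.  This is only a
sketch and not necessarily how ./parse works: the names TAG_RE and check_nesting
are hypothetical, the input is assumed to be a list of tag strings with
whitespace already removed, and a closing tag is assumed to have to match the
innermost open tag (the spec only says the tag must be "active").

```python
import re

# Assumption: a "C identifier with hyphens allowed" starts with a letter or
# underscore and continues with letters, digits, underscores, or hyphens.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_-]*)(/?)>")


def check_nesting(tags):
    """Check a sequence of tag strings such as '<a>', '</a>', '<b/>'.

    Returns None if the tags nest correctly, otherwise an error message.
    """
    stack = []  # names of currently open tags, innermost last
    for tag in tags:
        m = TAG_RE.fullmatch(tag)
        # Reject bad names and the mixed form '</foo/>'.
        if m is None or (m.group(1) and m.group(3)):
            return "malformed tag: " + tag
        closing, name, standalone = m.groups()
        if standalone:
            continue  # <foo/> neither opens nor closes anything
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return "unexpected closing tag: " + tag
    if stack:
        # End of document reached with a tag still open.
        return "unclosed tag: <" + stack[-1] + ">"
    return None
```

For example, check_nesting(["<a>", "<b/>", "</a>"]) returns None, while
check_nesting(["</bar>"]) reports an unexpected closing tag.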

3) Text can appear anywhere that a tag would start.  When parsing text, the
character '&' is an escape character.  To put a literal '&' the text should
contain '&amp;', and to put a literal '<' the text should contain '&lt;'.  All
other escape sequences are invalid.
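
The escape rule in 3) could be handled by a decoder along these lines.  It is a
sketch under assumptions: the function name unescape is hypothetical, it is
given one run of text that has already been separated from the tags, and it
signals an invalid escape by raising ValueError.

```python
def unescape(text):
    """Decode '&amp;' and '&lt;'; raise ValueError on any other use of '&'."""
    escapes = {"&amp;": "&", "&lt;": "<"}
    out = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            for seq, literal in escapes.items():
                if text.startswith(seq, i):
                    out.append(literal)
                    i += len(seq)
                    break
            else:
                # No known escape starts here, so the '&' is illegal.
                raise ValueError("invalid escape at offset %d" % i)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Decoding is done in a single left-to-right pass so that '&amp;lt;' yields the
literal text '&lt;' rather than '<'.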

If the document is not formatted in a valid way, you should print the line
number where there is a problem, and what the problem is.
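
One simple way to get the line number is to count the newlines before the
offending character.  The helpers below are a hypothetical sketch (line_number
and format_error are not names from this repository) and assume the parser
tracks a 0-based character offset into the whole file contents.

```python
def line_number(source, offset):
    """Return the 1-based line number of the character at `offset`."""
    return source.count("\n", 0, offset) + 1


def format_error(source, offset, problem):
    """Build a message naming the line and the problem found there."""
    return "line %d: %s" % (line_number(source, offset), problem)
```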

If the document is formatted in a valid way, you should return a data structure
that represents the document, including all tags and text (with all values