GitHub - lamielle/simphtml: Simple HTML Parser Written in Python

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
simphtml		simphtml
README		README
parse		parse
parse.py		parse.py
test		test

Repository files navigation

Simple HTML parser:

Usage: ./parse <file>.html

Notes:
-Whitespace in tags is dropped, but whitespace in text is preserved.
-A trailing Text tag with a single newline will always be present in the result.
-Line numbers for match errors are not yet printed.

Original problem spec:
Write a program to take as input a file, and determine whether it is a properly
formatted subset of html.  Here are the rules for what should be accepted:

1) A standalone tag is of the form <foo/>, where there is a '/' immediately
before the closing '>'.  It can appear anywhere that text would appear.

2) A tag is of the form <foo> and must be closed with </foo>.  Text and other
tags can appear in between.  Hitting '</bar>' in the document without having an
active '<bar>' tag is illegal.  Hitting the end of the document without finding
'</foo>' while processing a '<foo>' tag is illegal.  Tags of either form must be
valid C identifiers, with hyphens allowed as well.

3) Text can appear anywhere that a tag would start.  When parsing text, the
character '&' is an escape character.  To put a literal '&' the text should
contain '&amp;', and to put a literal '<' the text should contain '&lt;'.  All
other escape characters are invalid.

If the document is not formatted in a valid way, you should print the line
number where there is a problem, and what the problem is.

If the document is formatted in a valid way, you should return a data structure
that represents the document, including all tags and text (with all values
unescaped).