Simple HTML Parser Written in Python
Fetching latest commit…
Cannot retrieve the latest commit at this time
Simple HTML parser: Usage: ./parse <file>.html Notes: -Whitespace in tags is dropped, but whitespace in text is preserved. -A trailing Text tag with a single newline will always be present in the result. -Line numbers for match errors are not yet printed. Original problem spec: Write a program to take as input a file, and determine whether it is a properly formatted subset of html. Here are the rules for what should be accepted: 1) A standalone tag is of the form <foo/>, where there is a '/' immediately before the closing '>'. It can appear anywhere that text would appear. 2) A tag is of the form <foo> and must be closed with </foo>. Text and other tags can appear in between. Hitting '</bar>' in the document without having an active '<bar>' tag is illegal. Hitting the end of the document without finding '</foo>' while processing a '<foo>' tag is illegal. Tags of either form must be valid C identifiers, with hyphens allowed as well. 3) Text can appear anywhere that a tag would start. When parsing text, the character '&' is an escape character. To put a literal '&' the text should contain '&', and to put a literal '<' the text should contain '<'. All other escape characters are invalid. If the document is not formatted in a valid way, you should print the line number where there is a problem, and what the problem is. If the document is formatted in a valid way, you should return a data structure that represents the document, including all tags and text (with all values unescaped).