GitHub - kgaillot/rum: Demonstration project for parser concept

kgaillot / rum Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Demonstration project for parser concept

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
samples		samples
.gitignore		.gitignore
Makefile		Makefile
README		README
README.spec		README.spec
rum.c		rum.c
rum_buffer.c		rum_buffer.c
rum_buffer.h		rum_buffer.h
rum_document.c		rum_document.c
rum_document.h		rum_document.h
rum_language.c		rum_language.c
rum_language.h		rum_language.h
rum_parser.c		rum_parser.c
rum_parser.h		rum_parser.h
rum_private.h		rum_private.h
rum_types.h		rum_types.h
rump.c		rump.c
rump.h		rump.h

Repository files navigation

Project specification
---------------------
See README.spec.


Rudimentary Markup
------------------
Rudimentary Markup (RuM) is a subset of XML:
* Processing instructions (<?...?>) and comments (<!--...-->) are accepted but ignored;
* No <! ... > elements are supported other than comments (<![CDATA[...]]>, <!ENTITY...>, etc.)
* Non-ASCII characters are not supported;
* Numeric character entities (&#DECIMAL; or &#xHEX;) are not supported;
* Line endings are not normalized (i.e. carriage returns are not removed or mapped to newline).
* White space in attribute values is not normalized (i.e. newlines etc. replaced with spaces).
* Empty elements may only be represented by an empty tag (<.../>), not a contentless tag (<...></...>).
* Only a tag's content before any nested tags will be returned to the application, for example
  in the following cases, the application will always receive "XXX" as the content of T1:
     <T1>XXX</T1>
     <T1>XXX <T2>YYY</T2> </T1>    <!-- YYY is content of T2 -->
     <T1>XXX <T2>YYY</T2> ZZZ</T1> <!-- ZZZ is ignored -->
* Probably other minor limitations not worth detailing.


The library
-----------
The RuM parser library, librump.a, consists of:

* rum_types.h: A separate header for typedef's allows different structures
to reference each other.

* rum_buffer.c and rum_buffer.h: This is a support object for the use of
the parser but is exposed to users (all one of them) in case they want
it. It defines a character buffer object that can be dynamically
resized as characters are added to it. The buffer can track a substring
of itself based on start and end positions, allowing the calling code
to "bookmark" a section of the buffer.

* rum_parser.c and rum_parser.h: This is another support object, and defines
a stack of parser states for the parser state engine. One parser state
corresponds to one document element. As nested tags are encountered,
new states are popped onto the stack, and they are finished parsing,
they are popped off. The parser object is exposed so that users of the library
could write custom parse routines for input other than files if desired.

* rum_language.c and rum_language.h: This portion of the library allows
calling code to define a RuM-conformant language. The caller can specify
each tag, including what attributes the tag takes and which other tags
may contain it. This is basically a low-rent replacement for DTDs.

The main language object is the tag specification. Tags are stored
in a tree structure, so a language is simply a pointer to the root tag.

Tags have a function-pointer-based display method, allowing
the calling code to specify how each tag type should be displayed.

* rum_document.c and rum_document.h: This portion of the library
defines the document object (a particular textual instance of
the defined language).

The main structure is the element, which corresponds to a particular
occurrence of a tag with its attribute values and content. Elements
also are stored in a tree structure, and a document is simply
a pointer to the root element.

The document object handles replacement of predefined entities
(&amp; etc.).

The document object has a display function that iterates through the
element tree, calling the appropriate tag display method for each.

* rum_private.h: This contains declarations for unexposed
support functions (currently just one to set the library's global
error message).

* rump.h: This is the overall include file for library, and includes
the other public includes.

* rump.c: This contains high-level functions, most importantly
the file parser.

Though not a full validating parser, it does do some language validation:
- It knows the root tag, and requires it as the outermost tag.
- It knows and enforces which tags may contain which other tags.
- It knows and enforces which attributes are valid for which tags.


The application
---------------
* rum.c is the application. It defines a sample RuM-conformant language
and displays files written in that language. It has the usual Unix utility
behavior of accepting a filename or stdin, so it can be used like:

	rum samples/illegal_char.rum
	rum < samples/illegal_char.rum
	cat samples/illegal_char.rum | rum

* samples/: This directory contains sample RuM files, well-formed and not.


Limitations
-----------
Since this is a demonstration project, some corners were cut:

* Memory management is minimal. There are lots of small mallocs and frees,
and some duplication and waste that could be eliminated. There are some
memory leaks in error conditions; since the application exits on error,
it's not a big deal for this project, but a production library would
need more clean-up.

* Error handling and reporting is basic. Messages could be more detailed
and user-friendly.

* The tree structure for language specification is inefficient when a tag may
be included by more than one other tag, but it is sufficient for this project.

* A "real" project would have to pay more attention to security concerns,
especially crafted input. A simple memory limit would take care of most of it,
since RuM doesn't support <!ENTITY> expansion.

* There are no command-line options. For a "real" project, I'd at least
add standard --debug/--version/--help options, and for any expansions
such as the memory limit or a configurable chunk size for the buffer.

* The sample files are the only formal tests. A proper unit test would
need to be written in C to dynamically generate the very large number of
distinct ways a document can be malformed.

* No bells and whistles such as GNU autotools and gettext.