soh-cah-toa edited this page Sep 24, 2011 · 1 revision

Compiler developers implementing HLL's using the Parrot Compiler Toolkit need a way for unit testing their code. One source of difficulty lies in ensuring that their parser emits correct PAST trees and that the trees are consistent.

TreeUnit is a tool for making assertions about the trees.

Necessary Features

Following is a list of all the features needed in a full implementation.


In a C program, you can't just compile a = 1; and see what comes back. Instead, you need to compile something like:

void main(void) {
   int a;

   a = 1;

So the resulting PAST tree won't be completely independent. It will be nested inside the "scaffolding" that had to be implemented to drive the parser to a state where the expression could be parsed.

TreeUnit will provide some kind of search or match feature so the scaffolding can be ignored.

Background Noise

A PAST node can have a whole collection of attributes set, many of which are completely insignificant.

[10] => PMC 'PAST;Block'  {
    <blocktype> => "declaration"
    <name> => "_runner"
    <lstype> => "function body"
    <hll> => \past
    <namespace> => ResizablePMCArray (size:1) [
    <source> => \past
    <pos> => 1309
    <is_function> => 1
    <params> => "uplifted"
    <pirflags> => " :init :load"
    <rtype> => undef
    <scope> => \past
    <is_rooted> => 0
    <adverbs> => Hash {
        "init" => 1,
        "load" => 1
    [0] => PMC 'PAST;Stmts'  {
        <name> => "local variables"
    [1] => PMC 'PAST;Stmts'  {
        <source> => \past
        <pos> => 1336
        <name> => "compound_stmt"
        [0] => PMC 'PAST;Op'  {
            <pasttype> => "call"
            <source> => \past
            <pos> => 1344
            [0] => PMC 'PAST;Var'  {
                <source> => \past
                <pos> => 1338
                <scope> => "package"
                <name> => "test"
                <is_rooted> => 0
                <hll> => \past
                <namespace> => ResizablePMCArray (size:0) [
                <decl> => \past

TreeUnit will permit selective validation: only those nodes and attributes which are significant will be validated.

Arbitrary descendants

Given that a parser may insert arbitrary containers (PAST::Block or PAST::Stmts) between a parent and child node, TreeUnit syntax has to provide for some kind of ... mechanism. (In Apache Ant file patterns, /**/ is used for this.)

Successor matching

Sequences of statements will produces sequences of nodes, so TreeUnit must be able to recognize A; B; C as well as (A (B (C))) relationships.


In addition to strict sequences ('A', followed by 'B'), it may be necessary to recognize general orderings: 'A', then eventually 'B'.

Lexical Nesting

HLL lexical blocks may map to an entirely separate subroutine that is "lexically contained" in an outer subroutine. TreeUnit should recognize the subroutines "inner" and "outer" linkages as indicating containment.


Here are some tools that deal with trees, or tree-like structures. Maybe they've got ideas to steal?

  • The UNIX find utility and Perl's File::Find
  • Apache Ant's FileList and FileSet elements
  • CSS Descriptors
  • The HTML DOM
  • XML's XQuery and XPath specifications
  • Lisp
  • ANTLR has a tree serialization format
  • LDAP
  • TreeDL
  • YAML (arbitrary structures, user-specified node types)



An important part of TreeUnit is a facile syntax. It's possible to do tree matching in PIR, but who would want to write it?

Instead, consider something like YAML (note that --- and ... are part of YAML):

name      : Test "a = 1;"
error-in  : Simple assignment of int
source    : | 
  void main(void) {
    int a;

    a = 1;
  type       : PAST::Block
  name       : main
  error-in   : PAST::Block for "main".
    type       : PAST::Op
    pirop      : assign
    children   :
      - type     : PAST::Var
        name     : a
        scope    : lexical
      - type     : PAST::Val
        returns  : Integer
        value    : 1

This is verbose, but ultimately readable, and a parser for it exists in Perl 5 already. In this case, we've chosen to express the "eventual descendant" predicate using "descendants" versus "children", and sequences using YAML sequences. There's nothing here directly supporting "in order but not consequent", nor "in any order". Implicitly, any attribute listed must match, and any attribute not listed is ignored.


A Parrot TreeUnit could obviously interoperate with the actual code (once it got working, which is another question entirely).

A Perl 5 TreeUnit would need an intermediate form. One obvious choice is PIR: produce a PIR program that did what was needed. Another, related choice is PAST: produce a PAST representation of a program. Another choice is to translate "easy to write" into "easy to parse" and produce some output format that is trivial for something like a Perl 6 script to parse (since there is no YAML library for Perl 6 yet).