M1 refers to a C-like language targeting the M0 "virtual machine". That is, M0 can be considered as a virtual machine which defines an instruction set and a computation model. Details of M0 can be found in [1].
The M1 compiler is reentrant, which means it is thread-safe. This allows to invoke another instance of the M1 compiler if it's already running.
See the TODO file for the current status.
A few basic design principles (so-called key drivers) drove the design of M1:
- 1 Maintainability
- 2 High-quality output (i.e., M0 instructions)
- 3 Extensibility (e.g., easy to add another phase to the compilation process)
- 4 Ease of implementation
As it turns out, though many of the well-known constructs such as control statements (if, while, etc.) are fairly easy to implement, a few things are quite complex, and have made M1's implementation non-trivial. The most tricky parts are:
assignments
arrays (multi-dimensional)
structs (including nested structs)
function calls and parameters
So far, goal 1 (maintainability) is becoming a problem, as the AST data structure has become quite complex. Goal 2 (high quality output) is not achieved, as the code generator is fairly naive and simple. M1's compiler is fairly easy to extend, by adding another phase to the compiler that traverses the AST. Goal 4, ease of implementation: it is not easy to implement M1.
Compared to C, M1 has:
a built-in
string
type;a built-in
bool
type;a C++-class-style construction for defining PMCs, including inheritance;
a simpler (and more restrictive) grammar;
no
goto
statement (to prevent spaghetti code)namespaces {{XXX do we want that?}}
a try-catch statement {{XXX if we get that working}}
nested multi-line comments (/* comment 1 /* nested comment */ comment 1 part 2 */ )
no pointers; instead, structs and PMCs are treated as references (similar to Java).
variables with the same name in a nested block (hiding the one in the outer blok) are not allowed.
Completed milestones: (for varying interpretations of "completed")
assignments
iterating statements (while, do-while, for)
conditional statements (if/else, switch)
arrays; read and write
structs; read and write
function calls and returns (but still buggy)
register allocator
semantic checker
Enumerations
PMC definitions (though not running)
Incomplete milestones:
bug-free function calls
methods (as part of a PMC)
vtable methods (virtual methods)
PMC inheritance
namespaces
module system (preprocessor?)
optimization phase
register spiller
The M1 compiler follows the so-called batch-sequential
architecture> [2], which is a variant of the better known pipes-and-filters
architectural pattern. The key difference is that in a pipes-filter architecture, the next filter can already start processing before the filter before it has finished. In the M1 compiler, all text needs to be parsed before semantic analysis and other phases can start.
Each phase of the compiler is one "filter". The M1 compiler has the following phases:
- 1. Lexical and grammatical analysis (combined, as is usual for compilers).
- 2. Semantic analysis/type checking.
- 3. Instruction generation. (currently emitted to stdout).
- 4. Optimization and register optimization. (not implemented)
- 5. File generation (writing instructions to an M0 file). (not implemented)
Lexical analysis refers to the tokenization of the input file. Grammatical analysis refers to parsing the input stream of tokens. The compiler invocation starts with the grammar, which requests tokens from the lexer as needed until end of file is reached. During the parse, an Abstract Syntax Tree (AST) is constructed for later phases.
After the parse, the AST is passed on to the semantic analysis phase. In this phase, the compiler performs a number of type checks. Also, the semantic analysis phase visits all nodes and finds the right type information for variables. In this process, pointers are set to type definitions. In the parsing phase, a variable may be declared of a type that has not been parsed itself yet, therefore, the semantic check phase is essential.
After the type checking phase, the code generation phase traverses the AST, and generates appropriate instructions for each node.
M1 is implemented in C. The lexer is implemented using the Flex lexer generator. The parser is implemented using the Bison parser generator. Flex and Bison are more modern implementations of the well-known Lex and Yacc tools, respectively. The choice for these tools is grounded in the fact that the precise definition of M1 was expected to be in flux in this early stage, and the need for maintainability.
The current implementation consists currently of about 9,300 lines of code, using SLOCcount [3], though it is expected this will go up to 10,000 when the implementation is finished.
The implementation consists of the following files:
m1.l - lexer specification.
m1.y - parser specification.
ast.{c,h} - abstract syntax tree nodes definitions and constructors.
eval.{c,h} - code to generate M1 code from the AST. This is to ensure that the compiler can generate the same code as it has parsed.
semcheck.{c,h} - implementation of the semantic check phase.
instr.{c,h} - structures and code to represent M0 instructions.
decl.{c,h} - structures and code to represent type declarations {{XXX needs better name}}
gencode.{c,h} - code generation.
stack.{c,h} - implementation of a stack
symtab.{c,h} - symbol table implementation.
ann.h - code annotation #defines for splint. {{XXX needs better name}}
M1 has a number of built-in types. These are:
int
num
string
char
bool
Besides these built-in types, M1 supports the notion of aggregate types; you can define a struct
, which is essentially the same as a C struct, in which all attributes (member fields) are publicly accessible. An M1 struct is therefore just a means to bundle a number of related variables.
Besides structs, M1 allows you to write PMCs. A PMC is a Parrot Magic Cookie, or Poly-Morphic Container. The PMC is the magic means through which Parrot supports language-dependent behavior. A PMC can be compared to a C++ class definition.
{{XXX need to see whether inheritance and vtables can be implemented.}}
Assignments are as simple as:
lvalue '=' rvalue
Assignments can be chained; that is, you can write:
int a, b, c;
a = b = c = 42; // initializes a, b, and c to 42.
Arrays, when declared, are auto-vivivied. This means that code is generated to allocate memory for them. For instance, the following declaration:
int x[10];
will allocate 10 * sizeof(int) bytes on the heap.
Structs and PMCs are very similar. Structs are similar to C's structures. PMCs are like structs, but may have methods. In that sense, they are somewhat similar to C++ classes. Within PMC methods, the self
handle refers to the current object. self
is a parameter that is automatically available within PMC methods.
Structs and PMCs can be allocated with the new
keyword; they are allocated on the heap. Unlike arrays, they are not auto-vivivied.
This section describes a number of subsystems and how they work.
Variable declarations are stored in the AST by m1_var
nodes. For each variable that is declared, a new m1_symbol
is also created and inserted into the currently active symbol table. The AST node (m1_var
) has a pointer to the m1_symbol
structure.
Types are represented by a m1_type
struct instance. For each built-in type, the compiler inserts a new m1_type
into the type declaration table. For each user-defined type (structs, unions, PMCs and enums), a new m1_type
is created.
The typechecker (in semcheck.c) works mostly with these m1_type
nodes. Since there is exactly one m1_decl
node for each type, the typechecker compares types by doing pointer-comparisons. That is:
m1_type *ltype;
m1_type *rtype;
... // assign ltype and rtype
if (ltype != rtype) { // error! types do not match.
...
For each variable declaration, a m1_symbol
struct instance is created, which is stored in the currently active symbol table. Which table this is depends on the situation. If a struct
, union
or PMC
is parsed, this will be the symbol table for that struct
(or union
or PMC
). Whenever a variable declaration is parsed in a function body (including parameters), they will be entered into the symbol table that is active for that block
.
Each nested scope (block) has its own symbol table. The following code illustrates what this implies:
int main()
{ // scope for "main"
int i;
{ // new scope
int i; // will result in an error, as it is hiding the variable i above
int j;
}
{
int j; // ok, as this is a different scope/block from above.
}
}
M1 has a nifty register allocation system. Three functions are the core of this system:
void alloc_reg(M1_compiler *c, m1_valuetype type);
void free_reg(M1_compiler *c, m1_reg r);
void freeze_reg(M1_compiler *c, m1_reg r);
Whenever a new register is needed, for instance, to store a result of a calculation, you can allocate one with alloc_reg
. This function takes as a second parameter the type that the register should be (int, num, string or pmc).
Once you're done with a register, and you know that you won't need it any longer, then you can give it back to the register manager. This will make the register available; next time you invoke alloc_reg
, you might end up with the register you just freed.
In cases where you want to allocate a register to a symbol (a declared variable), you can freeze the register by calling freeze_reg
. It takes as an argument the register you want to freeze. This means that the register will be reserved for the symbol that you assign the register number to (a m1_symbol
structure stores the register). The freeze_reg
function ensures that, even if you call free_reg
on a register, it won't be released.
In general, free_reg
can be called on register that are just popped off the regstack
. If you allocate register that you push onto the regstack
, then obviously you should not free it, as you're making it available to other code generation functions that will use it.
The M1 compiler uses a number of stacks: two for break
and continue
statements, and one for holding registers in the code generator.
continue
statements are only allowed within for
, while
, and do-while
statements. break
statements are allowed wherever continue
statements are allowed, plus within switch
statements. The parser allows to write them anywhere; the check whether the break
(or continue
) statement is allowed is done in the semantic checker. The compiler pushes the value "1" on the break-stack whenever it encounteres a for
, while
, do-while
or switch
statement. Whenever a break statement is encountered, the breakstack is topped, to see whether there is in fact a "1" on top of the stack. In any other contex, there is a "0" on the stack.
A stack is also used in the code generator. The AST is traversed to generate instructions for each node. A single statement may be represented by more than one AST node. The regstack allows them to communicate. For instance, consider the statement:
x = 42;
First, the AST node representing the rvalue (42) is visited; a new register is allocated, and an instruction is generated to load the constant into that register:
set_imm I0, 0, 42 #load constant 42 in register I0.
The register I0 is pushed onto the regstack. Then, the assignment
node is visited, which will allocate a new register for x
if it hasn't been assigned one before. The assignment node can pop off the register that was assigned to keep the value 42.
A stack is used rather than the code generator functions' return values, since some nodes may be represented by two nodes, in particular for generating instructions for lvalues (which can become quite complex, such as x[3].y[4].z).
The following words are reserved and cannot be used as identifier:
bool
break
case
char
const
continue
default
do
else
false
for
if
int
null
num
pmc
return
string
struct
switch
true
while
The grammar of M1 is expressed using EBNF notation below. Note that this EBNF notation differs from the Bison specification, as Bison expects a certain formatting of the grammar rules.
M1-program: chunk+
chunk: function-def
| pmc-def
| struct-def
| enum-def
| namespace
pmc-def: pmc NAME extends-clause { struct-member* pmc-method* }
pmc-method: vtable? method type NAME block
struct-def: struct NAME { struct-member+ }
struct-member: type NAME ;
function-def: type NAME ( parameters? ) block
parameters: parameter [, parameter]*
parameter: type NAME
enum-def: enum NAME { enum-constant [, enum-constant]* }
enum-constant: NAME [= INTEGER]?
block: { statement* }
statement: assignment
| if-stat
| while-stat
| do-while-stat
| break-stat
| continue-stat
| for-stat
| switch-stat
| return-stat
| funcall-stat
| block
| var-decl
| const-decl
assignment: lvalue '=' expression ;
| lvalue '=' assignment ;
| assignop-expr
assignop-expr: lvalue assignop rvalue ;
assignop: += | -= | *= | /= | %= | >>= | <<= | &= | |=
if-stat: if ( expression ) statement [else statement]?
while-stat: while ( expression ) statement
do-while-stat: do { statement* } while ( expression ) ;
for-statement: for ( for-init? ; for-cond? ; for-step? ) statement
break-stat: break
continue-stat: continue
switch-stat: switch ( expression ) { case* default-case? }
case: case INTEGER : statement*
default-case: default : statement *
return-stat: return expression? ;
funcall-stat: funcall ';'
funcall: lvalue ( arguments? )
arguments: argument [, argument]*
argument: expression
expression: INTEGER
| NUMBER
| STRING
| CHAR
| true
| false
| null
| new NAME ( arguments? )
| expression ? expression : expression
| expression binop expression
| unop expression
| funcall
| target
| lvalue ++
| lvalue --
| ++ lvalue
| -- lvalue
| ( expression )
binop: + | - | * | % | / | >> | << | > | < | >= | <= | == | != | = | & | "|" | && | "||"
unop: !
| -
| ~
target: object field-access*
object: NAME
| self
| super
field-access: "." NAME
| "[" expression "]"
type: native-type
| userdefined-type
native-type: int
| num
| char
| string
| bool
[1] PDD 32: M0 Design Spec
[2] Paris Avgeriou and Uwe Zdun, Architectural Patterns Revisited - A Pattern Language http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.7444&rep=rep1&type=pdf