This project is a complete compiler for a custom procedural language called "M1". It takes M1 source code files (.m1)
and compiles them into assembly language for the Motorola 68000 (M68K) processor.
The compiler is written in Java and built using Maven. It features a full front-end (lexer, parser, semantic analysis), an intermediate representation (Three-Address Code with optimizations), and a back-end for code generation.
M1 is a strongly-typed, procedural language designed for simplicity and clarity. It features constructs familiar from languages such as Ada and Pascal.
- Static typing
- Primitive types (`int`, `bool`)
- Multi-dimensional arrays
- Constants and variables
- Procedures and functions
- Standard arithmetic and logical operators
- Structured control flow (`if`, `while`, `loop`)
- Basic I/O (`print`, `read`)
- `int`: A signed integer. The size is checked by the compiler to fit within the M68K's word size.
- `bool`: A boolean value, represented as `true` or `false`.
All constants, variables, and arrays must be declared in a special `is ... begin` block that precedes the main logic of a scope (global, procedure, or function).
- Constants: `const int MY_CONST = 100;`
- Variables: `int i;` or `int j = 5;`
- Arrays: `int[10] a;` or `bool[2][5] grid = [[true, false, ...], ...];`
- If/Else: `if (x > 10) begin print(x); else print(0); end;`
- While Loop: `while (i < 10) begin i = i + 1; end;`
- Loop (For-style): `loop i = 1 to 10 begin print(i); end;`
M1 supports two types of subprograms: `procedure` (which does not return a value) and `function` (which does). The entry point of every program is `procedure main`.
- Procedure: `procedure print_sum(int a, int b) is begin print(a + b); end print_sum;`
- Function: `function factorial(int n) return int is int result = 1; begin loop i = 2 to n begin result = result * i; end; return result; end factorial;`
A simple program that sorts an array using bubble sort.
```
// programs/correct/bubble_sort.m1
procedure main is
    int[10] arr = [5, 1, 4, 2, 8, 9, 0, 3, 7, 6];
    int n = 10;
    int i = 0;
    int j = 0;
    int temp;
begin
    while (i < n) begin
        j = 0;
        while (j < n - i - 1) begin
            if (arr[j] > arr[j+1]) begin
                temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            end;
            j = j + 1;
        end;
        i = i + 1;
    end;

    // Print the sorted array
    i = 0;
    while (i < n) begin
        print(arr[i]);
        i = i + 1;
    end;
end main
```
The compilation process is a pipeline of sequential phases. A key design decision in this compiler is that parsing, semantic validation, and TAC generation are performed in a single pass.
- Technology: JFlex (`lexer.flex`)
- Responsibility: Reads the raw M1 source code and converts it into a stream of tokens (e.g., keywords, identifiers, literals, operators).
- Details: The lexer recognizes keywords, identifiers, and literals. It also performs an early, efficient integer overflow check against the target machine's word size.
- Technology: CUP (`grammar.cup`)
- Responsibility:
  - Syntactic: Takes the stream of tokens and verifies that they form a valid program according to the M1 language grammar. It builds an Abstract Syntax Tree (AST) representing the code's structure.
  - Semantic: As the AST is being built, the parser triggers validation logic on each node. This phase uses a `SymbolTable` to check for type errors, undeclared variables, incorrect function arguments, etc.
- AST: The AST nodes are defined in the `src/main/java/eu/uib/nodes` package.
- Responsibility: As AST nodes are successfully validated, their `codify()` method is called. This method traverses the AST and generates a linearized, machine-independent intermediate representation called Three-Address Code (TAC).
- Structure: Each TAC instruction (`Instruction.java`) has at most three operands: two sources and one destination. The `ThreeAddressCode.java` class manages the list of instructions and a symbol table for the temporary variables and labels it generates.
- Example TAC Instruction: `PLUS, var1, var2, var3` (var3 = var1 + var2)
- Responsibility: Before final code generation, the TAC instruction list is passed through a series of optimization passes.
- Implementation: The `ThreeAddressCode.optimize()` method iterates over the available optimizations. The project includes an implementation of `DeferredAssignments`. This modular design allows new optimization passes to be added easily.
- Responsibility: This is the final phase, where the optimized TAC is translated into M68K assembly language (`M68KGenerator.java`).
- Strategy:
  - Instruction Dispatch: A `HashMap` is used as a dispatch table to map each TAC opcode to a dedicated assembly generation method.
  - Stack Management: Before generation, a pass calculates the stack offsets for all local variables and parameters. The generator produces `LINK` and `UNLK` instructions to manage the stack frame and correctly computes offsets from the frame pointer (`A6`) to access variables.
  - Register Usage: `D0` is used as the primary scratch register for operations. The generator also tracks register usage to automatically save and restore registers (`MOVEM.L`) across function calls.
  - Macros: I/O and other complex operations are handled by including pre-written M68K macro files (`.X68`) located in `src/main/resources/macros`.
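The dispatch-table idea can be illustrated with a minimal, self-contained Java sketch. The class, record, and method names below are illustrative only, not the compiler's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a HashMap-based dispatch table mapping TAC opcodes
// to dedicated assembly-generation methods.
class DispatchSketch {
    // A TAC instruction reduced to its essentials for this example.
    record Tac(String op, String src1, String src2, String dst) {}

    private static final Map<String, Function<Tac, String>> DISPATCH = new HashMap<>();
    static {
        // Each opcode maps to its own generation method.
        DISPATCH.put("PLUS", DispatchSketch::genPlus);
        DISPATCH.put("COPY", DispatchSketch::genCopy);
    }

    static String genPlus(Tac t) {
        // Load the first source into D0, add the second, store to the destination.
        return "MOVE.W " + t.src1() + ", D0\n"
             + "ADD.W  " + t.src2() + ", D0\n"
             + "MOVE.W D0, " + t.dst();
    }

    static String genCopy(Tac t) {
        return "MOVE.W " + t.src1() + ", " + t.dst();
    }

    static String generate(Tac t) {
        return DISPATCH.get(t.op()).apply(t);
    }
}
```

Adding support for a new opcode then amounts to writing one method and registering it in the map, which is what makes the dispatch-table approach attractive.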
This section dives deeper into the technical design of the compiler's main components.
The project's classes are segmented into packages, each with a distinct responsibility:
- `eu.uib.error`: Contains the error-handling infrastructure, including `ErrorHandler`, `CompilingException`, and the `ErrorType` enum, which defines all possible compilation errors.
- `eu.uib.nodes`: Defines the Abstract Syntax Tree. It is further divided into `declarations`, `expressions`, and `statements`. Each class in this package represents a construct in the M1 language.
- `eu.uib.type`: Contains the implementation of the M1 type system and the `SymbolTable`. It is responsible for representing types and managing scopes.
- `eu.uib.tac`: Implements the Three-Address Code. `Instruction` defines a single operation, while `ThreeAddressCode` manages the instruction stream and related metadata. The `optimitzations` subpackage holds optimization passes that operate on the TAC.
- `eu.uib.m68k`: The compiler back-end. `M68KGenerator` translates the TAC to assembly, and `Register` helps manage M68K register allocation and usage.
A core design pattern in the compiler is the use of two interfaces, `Validable` and `Codable`, implemented by the AST nodes. This pattern enables the single-pass analysis and generation process.

- `Validable`: This interface defines the contract for semantic validation.

  ```java
  public interface Validable {
      void validate(SymbolTable symbolTable);
  }
  ```

  Every AST node that needs to be checked for semantic correctness (e.g., type compatibility in an assignment, variable existence) implements this interface. The parser calls `validate()` on each node as it is created, passing the current `SymbolTable`.

- `Codable`: This interface defines the contract for generating Three-Address Code.

  ```java
  public interface Codable {
      String codify(ThreeAddressCode threeAddressCode, SymbolTable symbolTable);
  }
  ```

  After a node is validated, the parser calls its `codify()` method. This method translates the AST node into one or more TAC instructions and adds them to the `ThreeAddressCode` object. It returns the name of the temporary TAC variable that holds the result of the operation, if any.
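To make the pattern concrete, here is a hypothetical AST node for integer literals implementing both interfaces. The `SymbolTable` and `ThreeAddressCode` classes below are simplified stand-ins for the compiler's real classes, and `IntLiteral` is an illustrative node, not necessarily one from the project:

```java
// Simplified stub types standing in for the compiler's real classes.
class SymbolTable { /* scope lookup omitted in this sketch */ }

class ThreeAddressCode {
    private final java.util.List<String> instructions = new java.util.ArrayList<>();
    private int tempCounter = 0;
    String newTemp() { return "temp_" + (++tempCounter); }
    void add(String instr) { instructions.add(instr); }
    java.util.List<String> instructions() { return instructions; }
}

interface Validable { void validate(SymbolTable symbolTable); }
interface Codable { String codify(ThreeAddressCode tac, SymbolTable symbolTable); }

// A hypothetical literal node: nothing to validate, and codify() emits a
// COPY into a fresh temporary, returning that temporary's name.
class IntLiteral implements Validable, Codable {
    private final int value;
    IntLiteral(int value) { this.value = value; }

    @Override public void validate(SymbolTable symbolTable) {
        // Literals are always semantically valid.
    }

    @Override public String codify(ThreeAddressCode tac, SymbolTable symbolTable) {
        String temp = tac.newTemp();
        tac.add("COPY, " + value + ", " + temp);
        return temp; // the parser uses this name when codifying enclosing expressions
    }
}
```

The returned temporary name is what lets the single pass compose: a parent expression node calls `codify()` on its children and stitches their result names into its own TAC instruction.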
The type system is designed hierarchically to represent and validate the language's types.
- `Type.java`: The abstract base class for all types. A key concept is the `SubjacentBasicType` enum, which prevents invalid operations between fundamentally different types (e.g., a boolean and a number).
- `Primitive.java`: Represents the `int` and `bool` types. It provides static singleton instances (`Primitive.INTEGER`, `Primitive.BOOLEAN`) and defines their size on the target machine (e.g., `WORD` for M68K).
- `Array.java`: Represents an array type. It stores the `Primitive` type of its elements and a `List<Integer>` for the sizes of its dimensions. It includes critical logic for:
  - Checking for out-of-bounds access during TAC generation.
  - Calculating the type resulting from an indexed access. For example, accessing one dimension of an `int[10][5]` array yields a type of `int[5]`.
- `SymbolTable.java`: The symbol table is implemented with two maps: one for the global scope and one for the current local scope.
  - When the parser encounters a subprogram signature, the symbol table automatically transitions to a new, empty `LOCAL` scope.
  - When `scopeOut()` is called at the end of a subprogram, the local scope is cleared, and the table reverts to the `GLOBAL` scope.
  - It stores instances of `Description`, which wrap a `Type` object and hold additional metadata, such as the TAC variable name and whether the symbol is a subprogram argument.
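The dimension-dropping rule for indexed access can be sketched in a few lines. The class and method names here are hypothetical, and the real `Array` class also carries its `Primitive` element type:

```java
import java.util.List;

// Sketch of computing the type that results from indexing an array once.
// E.g. indexing int[10][5] yields int[5]; indexing int[5] yields the element type.
class ArrayTypeSketch {
    // Returns the remaining dimension sizes after one indexed access,
    // or an empty list when the result is the primitive element type.
    static List<Integer> indexOnce(List<Integer> dimensions) {
        if (dimensions.isEmpty()) {
            throw new IllegalStateException("cannot index a primitive type");
        }
        // Drop the outermost dimension; the rest describe the resulting type.
        return dimensions.subList(1, dimensions.size());
    }
}
```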
Abstract classes and interfaces are represented with dashed borders.
The `SubprogramDeclaration` is the root of the AST for a given function or procedure, containing lists of other nodes such as `Declaration` and `Statement`. The `Call` node is unique in that it can be both a `Statement` (if its result is ignored) and an `Expression` (if its result is used).
This section covers implementation details that are key to understanding the compiler's behavior, from memory management to testing.
The compiler implements a standard stack-based memory model for subprogram calls. Each time a function or procedure is
called, a new "stack frame" is pushed onto the stack. The A6 register is always used as the Frame Pointer (FP),
pointing to a fixed location within the current frame.
Here is the layout of a single stack frame:
```
         +-------------------------+
SP ->    |                         |  <-- Stack grows downwards
         +-------------------------+
         | Local Variable N        |
         | ...                     |
         | Local Variable 1        |
         +-------------------------+
A6 ->    | Saved Previous A6 Value |  (LINK A6, #-locals_size)
         +-------------------------+
         | Return Address          |  (Placed by JSR instruction)
         +-------------------------+
         | Parameter 1             |
         | ...                     |
         | Parameter N             |
         +-------------------------+
         | Previous Stack Frame    |
         | ...                     |
```
- Parameters: Pushed onto the stack by the caller before the `JSR` instruction. They are accessed at a positive offset from the Frame Pointer (`(8,A6)`, `(10,A6)`, etc.).
- Return Address: Pushed automatically by the `JSR` instruction.
- Frame Pointer: The `LINK` instruction at the start of the subprogram saves the caller's `A6` value and sets the new `A6` to the current stack pointer.
- Local Variables: Space is reserved on the stack for local variables by subtracting from the Stack Pointer (`SP`). They are accessed at a negative offset from the Frame Pointer (`(-2,A6)`, `(-4,A6)`, etc.).
When the subprogram returns, `UNLK A6` restores the previous frame pointer and `RTS` pops the return address, unwinding the stack.
The generated M68K code follows a strict register convention:
- `A7` (SP): The Stack Pointer, used implicitly by stack operations.
- `A6` (FP): The Frame Pointer, as described above. It remains constant throughout the subprogram.
- `D0`: The primary scratch register for all arithmetic and logical calculations. It also holds the return value of a function call.
- `D1`-`D7` / `A0`-`A5`: If other registers are needed, they are saved on the stack at the beginning of the subprogram using the `MOVEM.L` (Move Multiple Registers) instruction and restored before returning. This is handled automatically by the code generator.
To illustrate the full process, consider this simple M1 statement: `x = a + b;`

- Parsing & Semantic Analysis: The parser recognizes an assignment statement whose right-hand side is an addition `Operation`.
- TAC Generation: The `codify()` methods of the AST nodes produce the following intermediate code (variable names are illustrative):

  ```
  ; TAC for x = a + b
  PLUS, var_a, var_b, temp_1   ; temp_1 = a + b
  COPY, temp_1, var_x          ; x = temp_1
  ```

- Optimization: The `DeferredAssignments` pass recognizes that `temp_1` is a temporary copy and eliminates it:

  ```
  ; Optimized TAC
  PLUS, var_a, var_b, var_x    ; x = a + b
  ```

- M68K Code Generation: The `M68KGenerator` translates the optimized TAC into assembly. Assuming `a`, `b`, and `x` are local variables at offsets -2, -4, and -6 from the frame pointer (`A6`):

  ```
  ; M68K for x = a + b
  MOVE.W (-2,A6), D0   ; D0 = a
  ADD.W  (-4,A6), D0   ; D0 = D0 + b
  MOVE.W D0, (-6,A6)   ; x = D0
  ```
The compiler performs a simple but effective optimization pass at the TAC level called `DeferredAssignments`. This is a form of copy propagation.
Its goal is to remove temporary variables that serve only to copy a value from one place to another.

- How it works: The pass identifies `COPY, source, temp_var` instructions where `temp_var` is used only once, immediately afterwards. It removes the `COPY` instruction and replaces the subsequent use of `temp_var` directly with `source`.
- Example:
  - Before Optimization:

    ```
    COPY, var_a, temp_1
    COPY, temp_1, var_x
    ```

  - After Optimization:

    ```
    COPY, var_a, var_x
    ```
This reduces the number of redundant `MOVE` instructions in the final assembly code.
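The core of such a pass can be sketched in plain Java. This is a simplified illustration of the idea, not the project's actual implementation; the instruction shape and the `temp_` naming convention are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the DeferredAssignments idea: when a COPY writes a
// temporary that the very next instruction reads, drop the COPY and
// substitute its source directly.
class CopyPropagationSketch {
    record Instr(String op, String src1, String src2, String dst) {}

    static List<Instr> optimize(List<Instr> in) {
        List<Instr> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i++) {
            Instr cur = in.get(i);
            if (cur.op().equals("COPY") && i + 1 < in.size() && cur.dst().startsWith("temp_")) {
                Instr next = in.get(i + 1);
                // Substitute the copy's source wherever the next instruction reads the temp.
                if (cur.dst().equals(next.src1()) || cur.dst().equals(next.src2())) {
                    String s1 = cur.dst().equals(next.src1()) ? cur.src1() : next.src1();
                    String s2 = cur.dst().equals(next.src2()) ? cur.src1() : next.src2();
                    out.add(new Instr(next.op(), s1, s2, next.dst()));
                    i++; // skip the instruction we just merged into
                    continue;
                }
            }
            out.add(cur);
        }
        return out;
    }
}
```

A production-quality pass would also verify that the temporary has no other uses later in the stream; the sketch assumes single, immediate use.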
The project uses JUnit 5 for testing and employs a robust integration testing strategy.
- Parameterized Tests: The core of the testing is the `ProgramsArgumentsProvider` class. It automatically discovers all test files in the `src/test/resources/programs/` directory.
- Test File Structure: Each test file is a complete M1 program with a special "front matter" section that declares the expected outcome:
  - A list of comma-separated error codes (from the `ErrorType` enum) that the compiler should produce.
  - If no errors are expected, this list is left blank.
  - A `---` separator divides the front matter from the M1 code.
- Test Logic: The `M1Test` class uses this provider to run the compiler against each file. It then asserts that the set of errors produced by the compiler exactly matches the set declared in the test file's front matter. This makes it easy to add new tests for both correct programs and specific compiler error conditions.
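The front-matter convention can be illustrated with a small parsing sketch. The exact syntax assumed here (error codes on the lines before a `---` line) and the sample error-code names are assumptions for illustration, not the project's actual `ErrorType` values:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of splitting a test file into its expected error codes
// and the M1 program body.
class FrontMatterSketch {
    static Set<String> expectedErrors(String fileContents) {
        // Everything before the first "---" line is the front matter.
        String[] parts = fileContents.split("(?m)^---$", 2);
        Set<String> errors = new LinkedHashSet<>();
        for (String code : parts[0].split(",")) {
            if (!code.isBlank()) errors.add(code.strip());
        }
        return errors;
    }
}
```

A test would then compare this set against the set of errors the compiler actually reports for the program body.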
```
C:.
├───pom.xml                      # Maven build configuration
├───programs                     # Example .m1 programs
│   ├───correct
│   └───incorrect
└───src
    ├───main
    │   ├───java/eu/uib
    │   │   ├───M1.java          # Main compiler driver
    │   │   ├───nodes/           # AST Node classes
    │   │   ├───type/            # Type system and Symbol Table
    │   │   ├───tac/             # Three-Address Code implementation
    │   │   │   └───optimitzations/  # TAC optimization passes
    │   │   └───m68k/            # M68K code generator
    │   └───resources
    │       ├───grammar.cup      # CUP grammar for the M1 language
    │       ├───lexer.flex       # JFlex definition for the lexer
    │       └───macros/          # M68K utility macros
    └───test                     # JUnit tests
```
The project is managed by Apache Maven.
- Java 23 or higher
- Apache Maven
To build the project and generate the executable JAR, run the following command from the project root directory:
```shell
mvn clean package
```

This command will:

- Run JFlex and CUP to generate the lexer and parser classes.
- Compile all Java source files.
- Run the test suite.
- Package the compiler into an executable JAR file in the `target/` directory.
The `.X68` files generated by the compiler contain assembly code for the Motorola 68000 processor. To run these programs, you need an emulator.
It is recommended to use the Easy68K editor and simulator. You can open the `.X68` file in the editor (`EDIT68k.exe`), assemble it (F9 or via the menu), and then run the resulting object code in the simulator (`SIM68k.exe`).
To compile an M1 source file, use the generated JAR:
```shell
java -jar target/compiler-1.0.0.jar <path/to/your/program.m1>
```

This will produce an M68K assembly file named `program.X68` in the same directory.
The Java classes for the lexer (`Lexer.java`) and the parser (`Parser.java`, `Token.java`) are not written manually. Instead, they are automatically generated from definition files:

- `src/main/resources/lexer.flex` (for JFlex)
- `src/main/resources/grammar.cup` (for CUP)
This generation is handled by Maven during the build process, as configured in `pom.xml`. Two specific plugins are responsible:

- `jflex-maven-plugin`: Hooks into the `generate-sources` phase of the Maven lifecycle. It finds the `.flex` file and generates the `Lexer.java` source file in `src/main/java/eu/uib/`.
- `cup-maven-plugin`: Also runs during the `generate-sources` phase and uses the `.cup` file to generate `Parser.java` and the `Token.java` symbol class.
Because this is integrated into the standard Maven build lifecycle, you do not need to do anything special. Simply running the main build command will handle everything:
```shell
# This command automatically generates the sources, then compiles them.
mvn clean package
```

If you only want to generate the Lexer and Parser source files without compiling the entire project, you can run the `generate-sources` goal directly:

```shell
# Only runs the code generation step
mvn generate-sources
```

This final section covers other interesting implementation details.
The parser implements a "panic-mode" error recovery strategy to avoid terminating on the first syntax error. This allows it to report multiple syntax issues in a single compilation run. The recovery is defined in `grammar.cup` using the `error` token.
The strategy is as follows:

- If the parser encounters an invalid token within a list of statements, it enters recovery mode: it discards incoming tokens until it finds a statement-terminating semicolon (`;`). At that point, it assumes the broken statement has ended and attempts to parse the next one.
- If the parser gets lost while parsing a subprogram, it discards tokens until it finds the `end` keyword, which it uses as a synchronization point to recover and continue parsing subsequent subprograms.
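The token-discarding step of panic-mode recovery can be illustrated conceptually. This sketch is not how CUP implements it internally (CUP drives recovery from the `error` symbol in the grammar); it only shows the synchronization idea, with illustrative names:

```java
import java.util.List;

// Conceptual sketch of panic-mode recovery: discard tokens until a
// synchronization point (here, ";") and resume parsing at the next token.
class PanicModeSketch {
    // Returns the index just past the next ";", i.e. where parsing should
    // resume, or tokens.size() if no synchronization point remains.
    static int resumeIndex(List<String> tokens, int errorPos) {
        int i = errorPos;
        while (i < tokens.size() && !tokens.get(i).equals(";")) {
            i++; // discard tokens belonging to the broken statement
        }
        return Math.min(i + 1, tokens.size());
    }
}
```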
All array data is allocated on a simple, custom-built heap rather than on the stack. The heap manager is implemented entirely in M68K assembly in the `src/main/resources/macros/HEAP.X68` file.

- Structure: The heap is a 1 MB block of memory, managed as a linear list of blocks. Each block has a 6-byte header containing its size and status (busy or free).
- `HEAP_INIT`: This routine is called once at the beginning of the `main` program to initialize the entire heap as a single large, free block.
- `MALLOC`: When an array is declared, the compiler generates a `JSR` call to the `MALLOC` routine. This routine uses a first-fit algorithm to find the first available block that is large enough. If a block is larger than required, it is split into a busy block and a new free block. The routine returns a pointer to the allocated memory in the `D0` register.
- Memory Deallocation: This heap implementation is deliberately simple. Although a `FREE` routine exists in the macro file, the compiler does not use it, so memory allocated for arrays is never freed during the program's lifetime.
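The first-fit strategy described above can be sketched in Java for clarity. This mirrors the algorithm the `MALLOC` routine is described as using, but all names are illustrative, and details such as the 6-byte block header are omitted:

```java
import java.util.List;

// Illustrative first-fit allocation over a linear list of blocks.
class FirstFitSketch {
    static class Block {
        int offset, size;
        boolean busy;
        Block(int offset, int size, boolean busy) {
            this.offset = offset; this.size = size; this.busy = busy;
        }
    }

    // Scan for the first free block large enough; split it if oversized.
    // Returns the offset of the allocated block, or -1 on failure.
    static int malloc(List<Block> heap, int request) {
        for (int i = 0; i < heap.size(); i++) {
            Block b = heap.get(i);
            if (!b.busy && b.size >= request) {
                if (b.size > request) {
                    // Split: the remainder becomes a new free block.
                    heap.add(i + 1, new Block(b.offset + request, b.size - request, false));
                    b.size = request;
                }
                b.busy = true;
                return b.offset;
            }
        }
        return -1; // out of memory
    }
}
```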
This compiler provides a solid foundation, but there are many opportunities for future enhancements:

- Improve Memory Management: Implement memory deallocation using the `FREE` routine. A more sophisticated heap manager could also, for example, merge adjacent free blocks.
- Additional Data Types: The language could be extended with new types like `float` or a built-in `string` type with corresponding library functions.
- More Advanced Optimizations: The TAC optimization phase could be expanded with more passes, such as:
  - Constant Folding: Pre-calculate expressions with constant values at compile time.
  - Dead Code Elimination: Remove variables and code that have no effect on the program's output.
- Improved Error Reporting: Error messages could be made more helpful by providing suggestions or more context about the error.
- Additional Backend Targets: The modular nature of the compiler (with its TAC intermediate representation) makes it possible to add new backends that generate code for other architectures, such as ARM or x86.
- A Standard Library: A pre-compiled library of standard M1 functions could be provided for common tasks like string manipulation or more complex math operations.
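As a taste of what a constant-folding pass over the TAC might look like, here is a minimal sketch. No such pass exists in the project; the instruction shape and names are illustrative:

```java
// Sketch of TAC-level constant folding: a PLUS with two integer-literal
// operands is replaced by a COPY of the precomputed constant.
class ConstantFoldingSketch {
    record Instr(String op, String src1, String src2, String dst) {}

    static Instr fold(Instr in) {
        if (in.op().equals("PLUS") && isInt(in.src1()) && isInt(in.src2())) {
            int value = Integer.parseInt(in.src1()) + Integer.parseInt(in.src2());
            return new Instr("COPY", Integer.toString(value), null, in.dst());
        }
        return in; // not foldable; leave the instruction unchanged
    }

    private static boolean isInt(String s) {
        if (s == null) return false;
        try { Integer.parseInt(s); return true; } catch (NumberFormatException e) { return false; }
    }
}
```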
