M1 Compiler

This project is a complete compiler for a custom procedural language called "M1". It takes M1 source code files (.m1) and compiles them into assembly language for the Motorola 68000 (M68K) processor.

The compiler is written in Java and built using Maven. It features a full front-end (lexer, parser, semantic analysis), an intermediate representation (Three-Address Code with optimizations), and a back-end for code generation.

Table of Contents

  1. The M1 Language
  2. Compiler Architecture
  3. Project Structure
  4. How to Build and Run

The M1 Language

M1 is a strongly typed, procedural language designed for simplicity and clarity, with familiar constructs from languages such as Ada and Pascal.

Features

  • Static typing
  • Primitive types (int, bool)
  • Multi-dimensional arrays
  • Constants and variables
  • Procedures and functions
  • Standard arithmetic and logical operators
  • Structured control flow (if, while, loop)
  • Basic I/O (print, read)

Data Types

  • int: A signed integer. The compiler checks that integer literals fit within the M68K's 16-bit word size.
  • bool: A boolean value, represented as true or false.

Declarations

All constants, variables, and arrays must be declared in a special is ... begin block that precedes the main logic of a scope (global, procedure, or function).

  • Constants: const int MY_CONST = 100;
  • Variables: int i; or int j = 5;
  • Arrays: int[10] a; or bool[2][5] grid = [[true, false, ...], ...];

Control Flow

  • If/Else:
    if (x > 10) begin
        print(x);
    else
        print(0);
    end;
    
  • While Loop:
    while (i < 10) begin
        i = i + 1;
    end;
    
  • Loop (For-style):
    loop i = 1 to 10 begin
        print(i);
    end;
    

Subprograms

M1 supports two types of subprograms: procedure (which does not return a value) and function (which does). The entry point of every program is procedure main.

  • Procedure:
    procedure print_sum(int a, int b) is
    begin
        print(a + b);
    end print_sum;
    
  • Function:
    function factorial(int n) return int is
        int result = 1;
    begin
        loop i = 2 to n begin
            result = result * i;
        end;
        return result;
    end factorial;
    

Example Program

A simple program that sorts an array using bubble sort.

// programs/correct/bubble_sort.m1

procedure main is
    int[10] arr = [5, 1, 4, 2, 8, 9, 0, 3, 7, 6];
    int n = 10;
    int i = 0;
    int j = 0;
    int temp;
begin
    while (i < n) begin
        j = 0;
        while (j < n - i - 1) begin
            if (arr[j] > arr[j+1]) begin
                temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            end;
            j = j + 1;
        end;
        i = i + 1;
    end;

    // Print the sorted array
    i = 0;
    while (i < n) begin
        print(arr[i]);
        i = i + 1;
    end;
end main

Compiler Architecture

The compilation process is a pipeline of sequential phases. A key design decision in this compiler is that parsing, semantic validation, and TAC generation are performed in a single pass.

1. Lexical Analysis

  • Technology: JFlex (lexer.flex)
  • Responsibility: Reads the raw M1 source code and converts it into a stream of tokens (e.g., keywords, identifiers, literals, operators).
  • Details: The lexer is responsible for recognizing keywords, identifiers, and literals. It also performs an early, efficient integer overflow check against the target machine's word size.

2. Syntactic & Semantic Analysis

  • Technology: CUP (grammar.cup)
  • Responsibility:
    • Syntactic: Takes the stream of tokens and verifies that they form a valid program according to the M1 language grammar. It builds an Abstract Syntax Tree (AST) representing the code's structure.
    • Semantic: As the AST is being built, the parser triggers validation logic on each node. This phase uses a SymbolTable to check for type errors, undeclared variables, incorrect function arguments, etc.
  • AST: The AST nodes are defined in the src/main/java/eu/uib/nodes package.

3. Three-Address Code (TAC) Generation

  • Responsibility: As AST nodes are successfully validated, their codify() method is called. This method traverses the AST and generates a linearized, machine-independent intermediate representation called Three-Address Code (TAC).
  • Structure: Each TAC instruction (Instruction.java) has at most three operands: two sources and one destination. The ThreeAddressCode.java class manages the list of instructions and a symbol table for the temporary variables and labels it generates.
  • Example TAC Instruction: PLUS, var1, var2, var3 (var3 = var1 + var2)
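As a rough sketch of this structure (the class and method names below are illustrative simplifications, not the project's actual `Instruction`/`ThreeAddressCode` API), an instruction stream with temporary-variable generation can be modeled like this:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a three-address-code instruction stream.
// Op, Instr, and TacStream are hypothetical names, not the real classes.
public class TacSketch {
    enum Op { PLUS, MINUS, COPY }

    // An instruction has an opcode, up to two sources, and one destination.
    record Instr(Op op, String src1, String src2, String dst) {
        @Override public String toString() {
            return op + ", " + (src1 == null ? "" : src1 + ", ")
                      + (src2 == null ? "" : src2 + ", ") + dst;
        }
    }

    static class TacStream {
        final List<Instr> instructions = new ArrayList<>();
        private int tempCounter = 0;

        // Generate a fresh temporary variable name, as the TAC manager does.
        String newTemp() { return "temp_" + (++tempCounter); }

        void emit(Op op, String src1, String src2, String dst) {
            instructions.add(new Instr(op, src1, src2, dst));
        }
    }

    public static void main(String[] args) {
        TacStream tac = new TacStream();
        String t = tac.newTemp();
        tac.emit(Op.PLUS, "var_a", "var_b", t);   // t = a + b
        tac.emit(Op.COPY, t, null, "var_x");      // x = t
        tac.instructions.forEach(System.out::println);
    }
}
```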

4. TAC Optimization

  • Responsibility: Before final code generation, the TAC instruction list is passed through a series of optimization passes.
  • Implementation: The ThreeAddressCode.optimize() method iterates over available optimizations. The project includes an implementation for DeferredAssignments. This modular design allows for new optimization passes to be added easily.

5. M68K Code Generation

  • Responsibility: This is the final phase, where the optimized TAC is translated into M68K assembly language (M68KGenerator.java).
  • Strategy:
    • Instruction Dispatch: A HashMap is used as a dispatch table to map each TAC opcode to a dedicated assembly generation method.
    • Stack Management: Before generation, a pass calculates the stack offsets for all local variables and parameters. The generator produces LINK and UNLK instructions to manage the stack frame and correctly computes offsets from the frame pointer (A6) to access variables.
    • Register Usage: D0 is used as the primary scratch register for operations. The generator also tracks register usage to automatically save and restore them (MOVEM.L) across function calls.
    • Macros: I/O and other complex operations are handled by including pre-written M68K macro files (.X68) located in src/main/resources/macros.
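The dispatch-table idea can be sketched in a few lines (this is a toy model under assumed names, not M68KGenerator's real structure): each TAC opcode maps to a function that emits the corresponding assembly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy dispatch table: TAC opcode -> assembly-emitting function.
// Offsets like (-2,A6) stand in for real frame-pointer offsets.
public class DispatchSketch {
    private final Map<String, Function<String[], String>> dispatch = new HashMap<>();

    public DispatchSketch() {
        // operands for PLUS: [src1, src2, dst]; D0 is the scratch register
        dispatch.put("PLUS", ops ->
            "MOVE.W " + ops[0] + ", D0\n" +
            "ADD.W " + ops[1] + ", D0\n" +
            "MOVE.W D0, " + ops[2]);
        // operands for COPY: [src, dst]
        dispatch.put("COPY", ops -> "MOVE.W " + ops[0] + ", " + ops[1]);
    }

    public String generate(String opcode, String... operands) {
        return dispatch.get(opcode).apply(operands);
    }

    public static void main(String[] args) {
        DispatchSketch gen = new DispatchSketch();
        System.out.println(gen.generate("PLUS", "(-2,A6)", "(-4,A6)", "(-6,A6)"));
    }
}
```

Adding a new opcode then amounts to registering one more entry in the map, which keeps the generator's main loop free of large switch statements.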

Detailed Design and Class Structure

This section dives deeper into the technical design of the compiler's main components.

Class Organization by Package

The project's classes are segmented into packages, each with a distinct responsibility:

  • eu.uib.error: Contains the error handling infrastructure, including ErrorHandler, CompilingException, and ErrorType enum, which defines all possible compilation errors.
  • eu.uib.nodes: Defines the Abstract Syntax Tree. It's further divided into declarations, expressions, and statements. Each class in this package represents a construct in the M1 language.
  • eu.uib.type: Contains the implementation of the M1 type system and the SymbolTable. It's responsible for representing types and managing scopes.
  • eu.uib.tac: Implements the Three-Address Code. Instruction defines a single operation, while ThreeAddressCode manages the instruction stream and related metadata. The optimitzations subpackage holds optimization passes that operate on the TAC.
  • eu.uib.m68k: The compiler back-end. M68KGenerator translates the TAC to assembly, and Register helps manage M68K register allocation and usage.

Core Interfaces: Validable and Codable

A core design pattern in the compiler is the use of two interfaces, Validable and Codable, implemented by the AST nodes. This pattern enables the single-pass analysis and generation process.

  • Validable: This interface defines the contract for semantic validation.

    public interface Validable {
        void validate(SymbolTable symbolTable);
    }

    Every AST node that needs to be checked for semantic correctness (e.g., type compatibility in an assignment, variable existence) implements this. The parser calls validate() on each node as it's created, passing the current SymbolTable.

  • Codable: This interface defines the contract for generating Three-Address Code.

    public interface Codable {
        String codify(ThreeAddressCode threeAddressCode, SymbolTable symbolTable);
    }

    After a node is validated, the parser calls its codify() method. This method is responsible for translating the AST node into one or more TAC instructions and adding them to the ThreeAddressCode object. It returns the name of the temporary TAC variable that holds the result of the operation, if any.
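To make the validate-then-codify flow concrete, here is a toy addition node implementing both interfaces. The symbol table and TAC are heavily simplified stand-ins (plain string lists), so this is a sketch of the pattern, not the project's real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Toy versions of the two interfaces and an addition node implementing both.
public class NodeSketch {
    interface Validable { void validate(List<String> symbolTable); }
    interface Codable { String codify(List<String> tac, List<String> symbolTable); }

    static class Addition implements Validable, Codable {
        final String left, right;
        Addition(String left, String right) { this.left = left; this.right = right; }

        // Semantic check: both operands must be declared.
        public void validate(List<String> symbolTable) {
            if (!symbolTable.contains(left) || !symbolTable.contains(right))
                throw new IllegalStateException("undeclared variable");
        }

        // Emit a PLUS instruction and return the temporary holding the result.
        public String codify(List<String> tac, List<String> symbolTable) {
            String temp = "temp_" + (tac.size() + 1);
            tac.add("PLUS, " + left + ", " + right + ", " + temp);
            return temp;
        }
    }

    public static void main(String[] args) {
        List<String> symbols = new ArrayList<>(List.of("a", "b"));
        List<String> tac = new ArrayList<>();
        Addition add = new Addition("a", "b");
        add.validate(symbols);                 // would throw on an undeclared name
        String result = add.codify(tac, symbols);
        System.out.println(result + " <- " + tac.get(0));
    }
}
```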

Type System Design (eu.uib.type)

The type system is designed hierarchically to represent and validate the language's types.

  • Type.java: This is an abstract base class for all types. A key concept is the SubjacentBasicType enum, which prevents invalid operations between fundamentally different types (e.g., a boolean and a number).

  • Primitive.java: Represents the int and bool types. It provides static singleton instances (Primitive.INTEGER, Primitive.BOOLEAN) and defines their size on the target machine (e.g., WORD for M68K).

  • Array.java: Represents an array type. It stores the Primitive type of its elements and a List<Integer> for the sizes of its dimensions. It includes critical logic for:

    • Checking for out-of-bounds access during TAC generation.
    • Calculating the type resulting from an indexed access. For example, accessing one dimension of an int[10][5] array yields the type int[5].
  • SymbolTable.java: The symbol table is implemented with two maps: one for the global scope and one for the current local scope.

    • When the parser encounters a subprogram signature, the symbol table automatically transitions to a new, empty LOCAL scope.
    • When scopeOut() is called at the end of a subprogram, the local scope is cleared, and the table reverts to the GLOBAL scope.
    • It stores instances of Description, which wrap a Type object and hold additional metadata, such as the TAC variable name and whether the symbol is a subprogram argument.
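The two-map scope scheme can be sketched as follows (descriptions are reduced to plain strings here; the real table stores Description objects):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the two-map symbol table: one global scope plus at most one
// local scope at a time. Lookup tries the local scope first, then global.
public class ScopeSketch {
    private final Map<String, String> global = new HashMap<>();
    private Map<String, String> local = null;   // null while in global scope

    void put(String name, String description) {
        (local != null ? local : global).put(name, description);
    }

    String lookup(String name) {
        if (local != null && local.containsKey(name)) return local.get(name);
        return global.get(name);
    }

    // Entering a subprogram opens a fresh, empty local scope.
    void scopeIn() { local = new HashMap<>(); }

    // Leaving it discards the local scope and reverts to global.
    void scopeOut() { local = null; }

    public static void main(String[] args) {
        ScopeSketch table = new ScopeSketch();
        table.put("g", "global int");
        table.scopeIn();
        table.put("x", "local int");
        System.out.println(table.lookup("x") + " / " + table.lookup("g"));
        table.scopeOut();
        System.out.println(table.lookup("x"));  // null: local scope is gone
    }
}
```

Note that this design only supports one level of nesting (global plus one local scope), which matches M1's flat subprogram structure; a language with nested subprograms would need a stack of scopes instead.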

AST Node Hierarchy

(Class diagram omitted: abstract classes and interfaces are drawn with dashed borders.)

Global Architecture

The SubprogramDeclaration is the root of the AST for a given function or procedure, containing lists of other nodes like Declaration and Statement. The Call node is unique in that it can be both a Statement (if its result is ignored) and an Expression (if its result is used).


Advanced Compiler Design

This section covers implementation details that are key to understanding the compiler's behavior, from memory management to testing.

1. Memory Model and Stack Layout

The compiler implements a standard stack-based memory model for subprogram calls. Each time a function or procedure is called, a new "stack frame" is pushed onto the stack. The A6 register is always used as the Frame Pointer (FP), pointing to a fixed location within the current frame.

Here is the layout of a single stack frame:

      +-------------------------+
SP -> |                         |  <-- Stack grows downwards
      +-------------------------+
      |   Local Variable N      |
      |   ...                   |
      |   Local Variable 1      |
      +-------------------------+
A6 -> | Saved Previous A6 Value |  (LINK A6, #-locals_size)
      +-------------------------+
      |   Return Address        |  (Placed by JSR instruction)
      +-------------------------+
      |   Parameter 1           |
      |   ...                   |
      |   Parameter N           |
      +-------------------------+
      | Previous Stack Frame    |
      | ...                     |
  • Parameters: Pushed onto the stack by the caller before the JSR instruction. They are accessed at a positive offset from the Frame Pointer ((8,A6), (10,A6), etc.).
  • Return Address: Pushed automatically by the JSR instruction.
  • Frame Pointer: The LINK instruction at the start of the subprogram saves the caller's A6 value and sets the new A6 to the current stack pointer.
  • Local Variables: Space is made on the stack for local variables by subtracting from the Stack Pointer (SP). They are accessed at a negative offset from the Frame Pointer ((-2,A6), (-4,A6), etc.).

When the subprogram returns, UNLK A6 restores the previous frame pointer and RTS pops the return address, unwinding the stack.
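The offset arithmetic implied by this layout can be sketched directly. Assuming every variable occupies one M68K word (2 bytes), parameters start at offset 8 (past the saved A6 and the 4-byte return address) and locals grow downward from -2 (which parameter index lands at which offset depends on push order; this is an illustrative convention):

```java
// Sketch of frame-pointer offset computation for word-sized variables.
public class OffsetSketch {
    // i-th parameter (0-based): 8, 10, 12, ...
    static int parameterOffset(int index) { return 8 + 2 * index; }

    // i-th local variable (0-based): -2, -4, -6, ...
    static int localOffset(int index) { return -2 * (index + 1); }

    public static void main(String[] args) {
        System.out.println("param 0 -> (" + parameterOffset(0) + ",A6)");
        System.out.println("local 2 -> (" + localOffset(2) + ",A6)");
    }
}
```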

2. Register Conventions

The generated M68K code follows a strict register convention:

  • A7 (SP): The Stack Pointer, used implicitly by stack operations.
  • A6 (FP): The Frame Pointer, as described above. It remains constant throughout the subprogram.
  • D0: Used as the primary scratch register for all arithmetic and logical calculations. It is also used to hold the return value of a function call.
  • D1-D7 / A0-A5: If other registers are needed, they are saved on the stack at the beginning of a subprogram using the MOVEM.L (Move Multiple Registers) instruction and restored before returning. This is handled automatically by the code generator.

3. TAC to M68K Translation Example

To illustrate the full process, consider this simple M1 statement: x = a + b;

  1. Parsing & Semantic Analysis: The parser recognizes an assignment statement where the right-hand side is an addition Operation.

  2. TAC Generation: The codify() methods of the AST nodes would produce the following intermediate code (variable names are illustrative):

    ; TAC for x = a + b;
    PLUS, var_a, var_b, temp_1   ; temp_1 = a + b
    COPY, temp_1, var_x          ; x = temp_1
    
  3. Optimization: The DeferredAssignments pass recognizes that temp_1 is a temporary copy and eliminates it.

    ; Optimized TAC
    PLUS, var_a, var_b, var_x    ; x = a + b
    
  4. M68K Code Generation: The M68KGenerator translates the optimized TAC into assembly. Assuming a, b, and x are local variables at offsets -2, -4, and -6 from the frame pointer (A6):

    ; M68K for x = a + b
            MOVE.W  (-2,A6), D0     ; D0 = a
            ADD.W   (-4,A6), D0     ; D0 = D0 + b
            MOVE.W  D0, (-6,A6)     ; x = D0

4. Optimization: Deferred Assignments (Copy Propagation)

The compiler performs a simple but effective optimization pass at the TAC level called DeferredAssignments. This is a form of copy propagation.

Its goal is to remove temporary variables that only serve to copy a value from one place to another.

  • How it works: The pass identifies COPY, source, temp_var instructions where temp_var is used only once immediately after. It removes the COPY instruction and replaces the subsequent use of temp_var directly with source.

  • Example:

    • Before Optimization:
      COPY, var_a, temp_1
      COPY, temp_1, var_x
      
    • After Optimization:
      COPY, var_a, var_x
      

This reduces the number of redundant MOVE instructions in the final assembly code.
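A minimal version of this pass can be sketched as follows. Instructions are modeled as string arrays whose last element is the destination ({"COPY", src, dst} or {"PLUS", src1, src2, dst}); a real pass would also verify that the temporary is used exactly once, while this sketch only checks the immediately following instruction, as described above:

```java
import java.util.ArrayList;
import java.util.List;

// Toy deferred-assignments (copy propagation) pass. Illustrative only.
public class CopyPropSketch {
    static List<String[]> optimize(List<String[]> tac) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tac.size(); i++) {
            String[] ins = tac.get(i);
            if (ins[0].equals("COPY") && i + 1 < tac.size()) {
                String src = ins[1], temp = ins[2];
                String[] next = tac.get(i + 1).clone();
                boolean substituted = false;
                // Substitute the temporary in the next instruction's source slots.
                for (int j = 1; j < next.length - 1; j++) {
                    if (temp.equals(next[j])) { next[j] = src; substituted = true; }
                }
                // Drop the COPY and keep the rewritten instruction instead.
                if (substituted) { out.add(next); i++; continue; }
            }
            out.add(ins);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> tac = List.of(
            new String[] { "COPY", "var_a", "temp_1" },
            new String[] { "COPY", "temp_1", "var_x" });
        for (String[] ins : optimize(tac)) System.out.println(String.join(", ", ins));
    }
}
```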

5. Testing Strategy

The project uses JUnit 5 for testing and employs a robust integration testing strategy.

  • Parameterized Tests: The core of the test suite is the ProgramsArgumentsProvider class, which automatically discovers all test files in the src/test/resources/programs/ directory.
  • Test File Structure: Each test file is a complete M1 program with a special "front matter" section that declares the expected outcome:
    • A list of comma-separated error codes (from the ErrorType enum) that the compiler should produce.
    • If no errors are expected, this list is left blank.
    • A --- separator divides the front matter from the M1 code.
  • Test Logic: The M1Test class uses this provider to run the compiler against each file. It then asserts that the set of errors produced by the compiler exactly matches the set of errors declared in the test file's front matter. This makes it very easy to add new tests for both correct programs and specific compiler error conditions.
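Splitting a test file into its front matter and program body, as described above, could look roughly like this (the exact front-matter format in the repository may differ in details; error-code names here are placeholders):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of parsing a test file: comma-separated expected error codes,
// a "---" separator, then the M1 program itself.
public class FrontMatterSketch {
    record TestCase(List<String> expectedErrors, String program) {}

    static TestCase parse(String file) {
        String[] parts = file.split("---", 2);
        List<String> errors = new ArrayList<>();
        for (String code : parts[0].split(",")) {
            if (!code.isBlank()) errors.add(code.trim());   // blank front matter -> no errors
        }
        return new TestCase(errors, parts[1].strip());
    }

    public static void main(String[] args) {
        TestCase tc = parse("UNDECLARED_VARIABLE, TYPE_MISMATCH\n---\nprocedure main is begin end main");
        System.out.println(tc.expectedErrors() + " | " + tc.program());
    }
}
```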

Project Structure

C:.
├───pom.xml              # Maven build configuration
├───programs             # Example .m1 programs
│   ├───correct
│   └───incorrect
└───src
    ├───main
    │   ├───java/eu/uib
    │   │   ├───M1.java               # Main compiler driver
    │   │   ├───nodes/              # AST Node classes
    │   │   ├───type/               # Type system and Symbol Table
    │   │   ├───tac/                # Three-Address Code implementation
    │   │   │   └───optimitzations/ # TAC optimization passes
    │   │   └───m68k/               # M68K code generator
    │   └───resources
    │       ├───grammar.cup         # CUP grammar for the M1 language
    │       ├───lexer.flex          # JFlex definition for the lexer
    │       └───macros/             # M68K utility macros
    └───test                        # JUnit tests

How to Build and Run

The project is managed by Apache Maven.

Prerequisites

  • Java 23 or higher
  • Apache Maven

Building the Compiler

To build the project and generate the executable JAR, run the following command from the project root directory:

mvn clean package

This command will:

  1. Run JFlex and CUP to generate the lexer and parser classes.
  2. Compile all Java source files.
  3. Run the test suite.
  4. Package the compiler into an executable JAR file in the target/ directory.

Executing the Compiled Code

The .X68 files generated by the compiler contain assembly code for the Motorola 68000 processor. To run these programs, you need an emulator.

It is recommended to use the EASy68K editor and simulator: open the .X68 file in the editor (EDIT68K.exe), assemble it (F9 or via the menu), and then run the resulting object code in the simulator (SIM68K.exe).

Running the Compiler

To compile an M1 source file, use the generated JAR:

java -jar target/compiler-1.0.0.jar <path/to/your/program.m1>

This will produce an M68K assembly file named program.X68 in the same directory.

JFlex and CUP Build Process

The Java classes for the lexer (Lexer.java) and the parser (Parser.java, Token.java) are not written manually. Instead, they are automatically generated from definition files:

  • src/main/resources/lexer.flex (for JFlex)
  • src/main/resources/grammar.cup (for CUP)

This generation is handled by Maven during the build process, as configured in pom.xml. Two specific plugins are responsible:

  1. jflex-maven-plugin: This plugin hooks into the generate-sources phase of the Maven lifecycle. It finds the .flex file and generates the Lexer.java source file in src/main/java/eu/uib/.
  2. cup-maven-plugin: This plugin also runs during the generate-sources phase and uses the .cup file to generate Parser.java and the Token.java symbol class.

Because this is integrated into the standard Maven build lifecycle, you do not need to do anything special. Simply running the main build command will handle everything:

# This command automatically generates the sources, then compiles them.
mvn clean package

If you only want to generate the Lexer and Parser source files without compiling the entire project, you can run the generate-sources goal directly:

# Only runs the code generation step
mvn generate-sources

Further Implementation Details

This final section covers other interesting implementation details.

Parser Error Recovery

The parser implements a "panic-mode" error recovery strategy to avoid terminating on the first syntax error. This allows it to report multiple syntax issues in a single compilation run. The recovery is defined in grammar.cup using the error token.

The strategy is as follows:

  • If the parser encounters an invalid token within a list of statements, it enters recovery mode. It will discard incoming tokens until it finds a statement-terminating semicolon (;). At this point, it assumes the broken statement has ended and attempts to parse the next one.
  • If the parser gets lost while parsing a subprogram, it will discard tokens until it finds the end keyword, which it uses as a synchronization point to recover and continue parsing subsequent subprograms.
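The statement-level synchronization step amounts to skipping tokens until the next semicolon. As a standalone sketch (CUP handles this declaratively via the error token; this loop just illustrates the behavior):

```java
import java.util.List;

// Sketch of panic-mode recovery: on a syntax error inside a statement list,
// discard tokens until past the next ';' and resume parsing there.
public class RecoverySketch {
    // Returns the index of the first token after the next ';', i.e. where
    // parsing of the following statement can restart.
    static int synchronize(List<String> tokens, int errorPos) {
        int i = errorPos;
        while (i < tokens.size() && !tokens.get(i).equals(";")) i++;
        return Math.min(i + 1, tokens.size());
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("x", "=", "@", "garbage", ";", "print", "(", "x", ")", ";");
        int resume = synchronize(tokens, 2);   // error detected at '@'
        System.out.println("resume at token: " + tokens.get(resume));
    }
}
```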

Heap Management for Arrays

All array data is allocated on a simple, custom-built heap rather than on the stack. The heap manager is implemented entirely in M68K assembly in the src/main/resources/macros/HEAP.X68 file.

  • Structure: The heap is a 1MB block of memory, managed as a linear list of blocks. Each block has a 6-byte header containing its size and status (busy or free).
  • HEAP_INIT: This routine is called once at the beginning of the main program to initialize the entire heap as a single large, free block.
  • MALLOC: When an array is declared, the compiler generates a JSR call to the MALLOC routine. This routine uses a first-fit algorithm to find the first available block that is large enough. If a block is larger than required, it is split into a busy block and a new free block. The routine returns a pointer to the allocated memory in the D0 register.
  • Memory Deallocation: It is important to note that this heap implementation is very simple. Although a FREE routine exists in the macro file, the compiler does not use it. Therefore, memory allocated for arrays is never freed during the program's lifetime.
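The first-fit-with-splitting strategy can be modeled in Java for clarity (the real allocator is M68K assembly in HEAP.X68, and block sizes there include a 6-byte header that this sketch ignores):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of first-fit allocation over a linear list of blocks.
public class FirstFitSketch {
    static class Block {
        int size; boolean busy;
        Block(int size, boolean busy) { this.size = size; this.busy = busy; }
    }

    // Find the first free block large enough; split it if it is larger than
    // requested. Returns the index of the allocated block, or -1 on failure.
    static int malloc(List<Block> heap, int size) {
        for (int i = 0; i < heap.size(); i++) {
            Block b = heap.get(i);
            if (!b.busy && b.size >= size) {
                if (b.size > size) heap.add(i + 1, new Block(b.size - size, false));
                b.size = size;
                b.busy = true;
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Block> heap = new ArrayList<>(List.of(new Block(1024, false)));
        int idx = malloc(heap, 100);
        System.out.println("allocated block " + idx + ", remaining free: " + heap.get(1).size);
    }
}
```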

Future Work and Potential Improvements

This compiler provides a solid foundation, but there are many opportunities for future enhancements:

  • Improve Memory Management: Implement memory deallocation by using the FREE routine. A more sophisticated heap manager could also be developed to, for example, merge adjacent free blocks.
  • Additional Data Types: The language could be extended with new types like float or a built-in string type with corresponding library functions.
  • More Advanced Optimizations: The TAC optimization phase could be expanded with more passes, such as:
    • Constant Folding: Pre-calculate expressions with constant values at compile time.
    • Dead Code Elimination: Remove variables and code that have no effect on the program's output.
  • Improved Error Reporting: Error messages could be made more helpful by providing suggestions or more context about the error.
  • Additional Backend Targets: The modular nature of the compiler (with its TAC intermediate representation) makes it possible to add new backends to generate code for other architectures, like ARM or x86.
  • A Standard Library: A pre-compiled library of standard M1 functions could be provided for common tasks like string manipulation or more complex math operations.
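One of the passes suggested above, constant folding, could start very small. This sketch folds a PLUS over two integer literals into a COPY of the computed value (instruction encoding as string arrays is illustrative, matching the TAC examples earlier in this document):

```java
// Toy constant-folding step for a single TAC instruction.
public class FoldSketch {
    // {"PLUS", "2", "3", "var_x"} -> {"COPY", "5", "var_x"}; anything else unchanged.
    static String[] fold(String[] ins) {
        if (ins[0].equals("PLUS") && isInt(ins[1]) && isInt(ins[2])) {
            int value = Integer.parseInt(ins[1]) + Integer.parseInt(ins[2]);
            return new String[] { "COPY", String.valueOf(value), ins[3] };
        }
        return ins;
    }

    static boolean isInt(String s) {
        try { Integer.parseInt(s); return true; }
        catch (NumberFormatException e) { return false; }
    }

    public static void main(String[] args) {
        String[] folded = fold(new String[] { "PLUS", "2", "3", "var_x" });
        System.out.println(String.join(", ", folded));   // COPY, 5, var_x
    }
}
```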
