---

# Notes on Compiler Construction
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, January 2019**

---

> _This series of notebooks relies on the following Jupyter extensions:_
> - [`exercise`](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/exercise/readme.html) with [`rubberband`](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/rubberband/readme.html): for revealing solution hints with the ⊞ symbol
> - [`jupyter-emil-extension`](https://gitlab.cas.mcmaster.ca/parksj6/jupyter-se3bb4-extension): for formatting of algorithms and layout of slides
> - [`cite2c`](https://github.com/takluyver/cite2c): for displaying the references properly
> - [`RISE`](https://github.com/damianavila/RISE): optionally, for presenting the notebooks as slides;  
    the resolution is set to 1080p and may need to be adjusted under "Edit Notebook Metadata"

Compilers convert computer programs in "high level" languages to executable code. These lecture notes introduce the concepts of compilation and illustrate them with a compiler for a small imperative programming language. However, these notes and the accompanying exercises are not just about writing compilers!

- You will learn to use parsing, analysis, and generation techniques for compilation and other syntax-directed processing (e.g. for query languages, mark-up languages, protocols).
- You will see other topics in computing connected, in particular computer architecture, programming languages, formal languages, and operating systems.
- You will appreciate efficiency issues in programming languages through the understanding of memory layout of data types (e.g. arrays, objects) and compilation of control structures (e.g. short circuit evaluation, recursion), which leads to a more resource-conscious programming style.
- You will understand why historically programming languages are defined in the ways they are, and be better prepared for new languages.
- You will learn how compilation techniques (e.g. byte code, just-in-time, garbage collection) affect the suitability of programming languages for specific applications.
- You will learn about optimization techniques and the impact of processor architectures, which helps in using compilers more effectively.
- You will be made aware of compiler construction tools (e.g. scanner and parser generators) and will be able to judge if and when to use them.

To achieve the above, besides studying these notes, it is imperative to engage in smaller and larger exercises:

- Some questions are embedded in these notes. You are urged to attempt them on your own while reading a chapter. You may reveal a hint by clicking on the ⊞ symbol to the left of the question.
- Further exercises are provided at the end of each chapter. These are best attempted after reading a chapter and before proceeding with the next one.

The lecture notes are organized as one notebook for each chapter. Each notebook is meant to be covered in approximately one week with three classes per week:

1. Language and Syntax
2. Regular Languages
3. Analysis of Context-free Languages
4. Syntax-Directed Translation
5. The Construction of a Parser
6. A Stack Architecture as Target
7. A RISC Architecture as Target
8. Further Data Types
9. Separate Compilation and Executable Files
10. Concurrency and Parallelism
11. Garbage Collection
12. Generalized Parsing

A course following this layout was delivered (based on a series of earlier offerings) in the Winter Term of 2019 at McMaster University as a 4th-year course that is required for computer science students and optional for software engineering students. The accompanying exercises were split into graded labs of two hours per week and take-home assignments. Students also completed a group project; these projects are not the subject of these notes, but suffice it to say that the topics reflected the broad applicability of compilation techniques.

----

In general, a compiler processes a structured source text and generates (more simply structured) target code or error messages.

<img style="width:24em" src="attachment:00Task.svg"></img>

<br>
<div style="column-count:2">
    <div style="display:inline-block">

**Sources**<br>
_programming languages:_ C, Java, Go, ...<br>
_virtual machine languages:_ LLVM, JVM, CIL, WebAssembly, ...<br>
_scripting languages:_ bash, Python, JavaScript, ...<br>
_text formatting languages:_ TeX, html, markdown, ...<br>
_interchange languages:_ JSON, XML, ...
    </div>
    <div style="display:inline-block">
**Targets**<br>
_virtual machine languages:_ LLVM, JVM, CIL, WebAssembly, ...<br>
_executable code:_ RISC-V, ARMv8, x86-64, ...<br>
_assembly language:_ [as](https://en.wikipedia.org/wiki/GNU_Assembler), [nasm](https://www.nasm.us/), ...<br>
_coprocessor code:_ FPU, GPU, FPGA, ...<br>
_layout languages:_ PDF, html, RTF, ...<br>
    </div>
</div>

Sources can also be _preference files_, _configuration files_, _database queries_, or _hardware description languages_, even if for those no executable code is generated. The same languages, e.g. a virtual machine language, can be both target and source of different compilation processes.

Starting with Algol 60 <cite data-cite="1997494/9KJ5CWCJ"></cite>, the first programming language with a formally defined _grammar_, the translation process of a compiler is guided by the _syntactical structure_ of the source text. The method of _syntax-directed compilation_ leads to the following decomposition:

- _Analysis:_ recognising the structure of the source text according to its grammar.
- _Synthesis:_ generating target code from the recognized structure.

Analysis and synthesis are each split into a number of consecutive _phases_ with different _intermediate representations_.

<img style="width:22em;float: right;border-left:10px solid white" src="attachment:00Phases.svg"></img>

_Symbols_ (or _tokens_) are sequences of characters such as a number (a sequence of digits), an identifier (a sequence of letters and digits), a keyword (**if**, **while**), or a separator (comma, semicolon).

The _lexical analysis_ recognizes symbols; it is carried out by a _scanner_, hence also called _scanning_. The _syntactic analysis_ recognizes the syntactic structure, a _syntax tree_; it is carried out by a _parser_, hence also called _parsing_.
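
As a minimal sketch of scanning, the following Python function splits a source text into symbols using regular expressions. The token classes (`NUMBER`, `IDENT`, `SYMBOL`, `KEYWORD`), the set of keywords, and the function name `scan` are assumptions made for this illustration only; they are not the scanner developed later in these notes.

```python
import re

# Hypothetical token classes, chosen for illustration only.
KEYWORDS = {"if", "while"}
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),                   # a sequence of digits
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),   # a letter followed by letters and digits
    ("SYMBOL", r":=|[+\-×*,;()]"),        # operators and separators
    ("SKIP", r"\s+"),                     # white space separates symbols
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Return the list of (kind, text) pairs for the symbols in source."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        kind, text = match.lastgroup, match.group()
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"              # keywords look like identifiers but are reserved
        if kind != "SKIP":
            tokens.append((kind, text))
        pos = match.end()
    return tokens


print(scan("while n"))  # [('KEYWORD', 'while'), ('IDENT', 'n')]
```

The scanner does not check whether the sequence of symbols makes sense; that is left to the parser.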

The _contextual analysis_ (or _type-checking_) augments the syntax tree with type information. The redundancy provided by type information allows common programming errors to be detected early and quickly; type information is also used by the compiler for memory management. Only syntactically correct and type-correct programs have a meaning and are subject to code generation.
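
The idea of augmenting a syntax tree with types can be sketched in a few lines of Python. The encoding of syntax trees as nested tuples, the function name `annotate`, and the variables `count` and `done` are hypothetical and serve only this illustration; the sketch assumes that arithmetic operators are defined on integers only.

```python
# Hypothetical syntax tree encoding as nested tuples:
#   ("num", 24)   ("var", "count")   ("binop", op, left, right)   ("assign", name, expr)
def annotate(node, env):
    """Return a copy of node in which every expression carries its type as last element.

    env maps declared identifiers to their types; TypeError signals a type error.
    """
    kind = node[0]
    if kind == "num":
        return node + ("integer",)
    if kind == "var":
        name = node[1]
        if name not in env:
            raise TypeError(f"undeclared identifier {name!r}")
        return node + (env[name],)
    if kind == "binop":
        _, op, left, right = node
        left, right = annotate(left, env), annotate(right, env)
        if left[-1] != "integer" or right[-1] != "integer":
            raise TypeError(f"operands of {op!r} must be of integer type")
        return ("binop", op, left, right, "integer")
    if kind == "assign":
        _, name, expr = node
        if name not in env:
            raise TypeError(f"undeclared identifier {name!r}")
        expr = annotate(expr, env)
        if env[name] != expr[-1]:
            raise TypeError(f"cannot assign {expr[-1]} to {name!r} of type {env[name]}")
        return ("assign", name, expr)
    raise ValueError(f"unknown node kind {kind!r}")


env = {"count": "integer", "done": "boolean"}
print(annotate(("binop", "+", ("var", "count"), ("num", 1)), env))
# ('binop', '+', ('var', 'count', 'integer'), ('num', 1, 'integer'), 'integer')
```

With the same `env`, the tree for `done + 1` is rejected with a `TypeError`, so no code would be generated for it.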

In the first phase of code generation, _intermediate code_ is generated, a representation that is simpler than the target code, but close enough that it can be straightforwardly translated to the target code and that the gain of _code optimizations_ can be judged.


<img style="width:32em;float:right;border-left:10px solid white" src="attachment:00PhasesExample.svg"></img>
For example, suppose the context includes that `dist` is an integer variable and that `rot` is an integer parameter of the enclosing procedure `update`:

`var dist: integer`  
`procedure update(rot: integer)`

The figure to the right illustrates the compilation of the assignment `dist := dist + rot × 24` in the body of `update`. The contextual analysis augments the syntax tree with type information; here all expressions are of `integer` type. The intermediate code is _register-based_: all arithmetic operations can only involve registers, here `R1`, `R2`, etc. The code optimization reduces the number of registers and the number of instructions. The generated code is specific about how variables are accessed; here, `rot` is SP-relative (stack pointer) with offset `8₁₆` (base `16`) and `dist` is global at location `4000₁₆`.
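
The step from syntax tree to register-based intermediate code can be sketched as follows. The tuple encoding of the syntax tree and the instruction names `LOAD`, `LOADI`, `MUL`, `ADD`, `STORE` are assumptions for this illustration and need not match the instructions shown in the figure. The sketch allocates a fresh register for every intermediate result, so its output corresponds to unoptimized code.

```python
import itertools

# Hypothetical instruction names for the two arithmetic operators.
OPCODES = {"+": "ADD", "×": "MUL"}


def generate(assignment):
    """Return unoptimized register-based code for a tree ("assign", name, expr)."""
    code = []
    numbers = itertools.count(1)

    def new_register():
        return f"R{next(numbers)}"

    def gen_expr(node):
        """Emit code that evaluates node; return the register holding the result."""
        kind = node[0]
        if kind == "var":
            reg = new_register()
            code.append(("LOAD", reg, node[1]))    # variable to register
        elif kind == "num":
            reg = new_register()
            code.append(("LOADI", reg, node[1]))   # constant to register
        elif kind == "binop":
            _, op, left, right = node
            left_reg, right_reg = gen_expr(left), gen_expr(right)
            reg = new_register()
            code.append((OPCODES[op], reg, left_reg, right_reg))
        else:
            raise ValueError(f"unknown node kind {kind!r}")
        return reg

    _, name, expr = assignment
    code.append(("STORE", name, gen_expr(expr)))   # register to variable
    return code


tree = ("assign", "dist",
        ("binop", "+", ("var", "dist"), ("binop", "×", ("var", "rot"), ("num", 24))))
for instruction in generate(tree):
    print(*instruction)
# LOAD R1 dist
# LOAD R2 rot
# LOADI R3 24
# MUL R4 R2 R3
# ADD R5 R1 R4
# STORE dist R5
```

Five registers are used here; an optimization phase could reuse registers whose values are no longer needed.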

Phases conceptually decompose the task of a compiler. In practice, several phases are merged into _passes_ such that no intermediate data structure is necessary between the phases of a pass.
  
  
<img style="width:58em" src="attachment:00Passes.svg"></img>
  
  
Historically, files were used for passing the data between the passes. Modern compilers use main memory.

<img style="width:18em;float:right;border-left:10px solid white" src="attachment:00FrontEndBackEnd.svg"></img>
A common and advantageous separation of the task is by dividing the compiler into a _front end_ and a _back end_.

This division helps reduce the effort of writing compilers for different targets for the same source language by sharing the front end, or for different source languages for the same target by sharing the back end.

In principle, given `m` source languages and `n` targets, this reduces the effort of writing `m × n` compilers to `m` front ends `+` `n` back ends.

In practice, this only works if the source languages, or respectively the targets, are sufficiently similar. It is nevertheless a good structuring principle for flexibility.

<img style="width:20em;float: right;border-left:10px solid white" src="attachment:00ByteCode.svg"></img>
A variation of the front end / back end decomposition arises when virtual machine code, rather than a syntax tree, is passed from the front end to the back end. In essence, the front end and back end become compilers of their own.

Some virtual machine codes represent each instruction by a single byte, hence are called _byte codes_. The compactness of byte code makes the memory footprint and the download times appealingly small.

Byte codes can either be _interpreted_ without further compilation, i.e. executed instruction by instruction, or further compiled to executable machine code _just-in-time_ when loaded into main memory or while being executed. An advantage of just-in-time compilation over _ahead-of-time_ compilation is that all characteristics of the processor are known when compiling and that the code execution itself can influence the compilation.
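
Interpretation of byte code can be sketched with a small stack machine. The three opcodes below and their encoding are invented for this illustration and do not correspond to p-code, JVM, or CIL instructions; `PUSH` is followed by a one-byte operand, while `ADD` and `MUL` take their operands from the stack.

```python
# Hypothetical one-byte opcodes of a tiny stack machine.
PUSH, ADD, MUL = 0x01, 0x02, 0x03


def interpret(code):
    """Execute the byte code instruction by instruction; return the final stack."""
    stack = []
    pc = 0                                # index of the next byte to execute
    while pc < len(code):
        opcode = code[pc]
        pc += 1
        if opcode == PUSH:
            stack.append(code[pc])        # the operand is the byte after the opcode
            pc += 1
        elif opcode == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at position {pc - 1}")
    return stack


program = bytes([PUSH, 5, PUSH, 3, PUSH, 24, MUL, ADD])   # 5 + 3 × 24
print(len(program), interpret(program))  # 8 [77]
```

The whole expression occupies eight bytes, which illustrates the compactness of byte code. A just-in-time compiler would translate such a byte sequence to machine instructions instead of dispatching on every opcode at run time.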

### Historic Notes and Further Reading

A chapter in Wirth's book _Algorithms + Data Structures = Programs_ <cite data-cite="1997494/Z2YXI6H6"></cite> illustrates with a compiler for a subset of Pascal the principles that became the foundation for a whole line of work, including [Turbo Pascal](https://en.wikipedia.org/wiki/Turbo_Pascal), an early interactive programming environment (which later became Delphi) and [UCSD Pascal](https://en.wikipedia.org/wiki/UCSD_Pascal). Later versions of Wirth's compiler generate RISC code <cite data-cite="1997494/NYTCJWS7"></cite>.  

UCSD Pascal includes an editor and a whole operating system. The compiler generates [p-code](https://en.wikipedia.org/wiki/P-code_machine), a Pascal-influenced byte code, which is then interpreted. On the Apple II computer, UCSD Pascal, which itself is p-code, occupies 16KB of main memory, leaving the rest for programs and data. P-code influenced the design of JVM, the Java Virtual Machine, which itself influenced the design of CIL, the Common Intermediate Language (see [Exercises](#Exercises) below). 

Some programming languages, notably C <cite data-cite="1997494/3R5VIULM"></cite> and Pascal <cite data-cite="1997494/Z2YLVHR5"></cite>, were originally defined with _single pass compilation_ in mind, due to the restricted size of main memory at that time: a single pass compiler generates target code while the source file is read. To allow static type checking and code generation with mutually recursive procedures, these languages have _forward declarations_ (called _function prototypes_ in C):
```Pascal
procedure p(x: integer); forward; (* forward declaration of p, without body *)
procedure q();                    (* q with body *)
    begin ... p(3) ... end;       (* call to p is allowed *)
procedure p(x: integer);          (* parameters must be the same as in forward declaration *)
    begin ... q() ... end;        (* call to q is allowed *)
```
Besides requiring forward declarations, single pass compilation also limits code optimizations. For modern compilers, main memory restrictions are not a concern: limits of human comprehension have kept the size of individual source files unchanged over decades, but hardware technology has provided compilers with an abundance of main memory.

### Exercises

1. The Common Intermediate Language (CIL) is a virtual machine that was originally developed by Microsoft and then standardized by [ECMA](http://www.ecma-international.org/publications/standards/Ecma-335.htm). Front ends exist for C# (an object-oriented language), F# (a functional language), and "Managed C++", but not for full C++. What is the difference between Managed C++ and C++? Explain which properties of CIL prevent full C++ from being supported!


2. The byte code of the Java Virtual Machine (JVM) was originally developed for Java but has been used as the target for other languages. Provide pointers to at least five other languages that target JVM! Is JVM code interpreted, just-in-time compiled, or ahead-of-time compiled?


3. The functionality of processors can be augmented by _coprocessors_ or _units_ on a different chip or on the same chip as the processor. By consulting a textbook on computer architecture, discuss the functionality of SIMD units, FPUs, GPUs, FPGAs, how they are programmed, and what role compilers play, if any!

### Bibliography

<div class="cite2c-biblio"></div>