---

# 1. Language and Syntax
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, January 2019**

---

Every language is based on a _vocabulary_. Its elements are called _words_ or _symbols_ whose structure is of no further interest. The _syntax_ determines which sequences of words, called _sentences_, belong to the language.

| language                | symbols                                          |
|:------------------------|:-------------------------------------------------|
| English                 | `eats`, `Kevin`, `a`, `banana`, ...              |
| Roman numerals          | `I`, `V`, `X`, `L`, `C`, `D`, `M`                |
| identifiers in programs | `A`, `B`, ..., `a`, `b`, ..., `0`, `1`, ..., `_` |
| arithmetic expressions  | `dist`, `rot`, `24`, `+`, `-`, `×`, `/`, ...     |

**Question:** What are other non-spoken languages?

_Answer:_
- Chemical formulae, e.g. <code>H<sub>3</sub>O<sup>+</sup></code> for hydronium.
- Musical scores, with vocabulary 🎼, ♭, ♮, ♯, ♩, ♪, ♫, ♬, etc.
- Morse code, with vocabulary "●" (short), "━━━" (long), " " (pause).

<div style="float:right;background-color:lightgrey;border-left:20px solid white">

**Example:** if `V = {a, b}`, then<br>
<code>V<sup>+</sup> = {a, b, aa, ab, ba, bb, aaa, … }</code><br>
<code>V<sup>\*</sup> = {ε, a, b, aa, ab, ba, bb, aaa, … }</code><br>
The sentences of the language<br>
<code>L = {σaτ | σ, τ ∈ V<sup>\*</sup>}</code><br>
are those sequences that contain at least one `a`.</div>
Formally, a vocabulary `V` is a finite, nonempty set of (atomic) symbols. The set <code>V<sup>\*</sup></code> of all _finite sequences_ or _strings_ over `V` consists of

- the empty string `ε`,
- any symbol `x ∈ V`,
- the _concatenation_ `στ` of strings <code>σ, τ ∈ V<sup>\*</sup></code>.

The empty sequence is both the left and right identity of concatenation, and concatenation is associative, meaning that parentheses can be left out. Formally, for any <code>σ, τ, ω ∈ V<sup>\*</sup></code>:

    σε = σ = εσ
    (στ)ω = σ(τω)

The set of all _non-empty strings_ over `V` is denoted by <code>V<sup>+</sup></code>, formally <code>V<sup>+</sup> = V<sup>\*</sup> – {ε}</code>. The _length_ of string `σ` is written as `|σ|`:

- `|ε| = 0`,
- `|x| = 1` for any `x ∈ V`,
- `|στ| = |σ| + |τ|` for any <code>σ, τ ∈ V<sup>\*</sup></code>.
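
The definitions above can be tried out on a finite sample. The following is a minimal sketch, assuming symbols are single characters and strings over `V` are Python strings; the helper name `strings_up_to` is hypothetical. It enumerates all strings up to a given length and checks the identity, associativity, and length laws on that sample. This is a sanity check, not a proof.

```python
from itertools import product

def strings_up_to(V, n):
    """All strings over vocabulary V of length at most n, shortest first."""
    return ["".join(p) for k in range(n + 1) for p in product(sorted(V), repeat=k)]

V = {"a", "b"}
eps = ""                                   # the empty string ε
V_star = strings_up_to(V, 3)               # finite sample of V*
V_plus = [s for s in V_star if s != eps]   # finite sample of V+ = V* – {ε}

print(V_star[:8])  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa']

for s in V_star:
    assert s + eps == s == eps + s                  # σε = σ = εσ
    for t in V_star:
        assert len(s + t) == len(s) + len(t)        # |στ| = |σ| + |τ|
        for w in V_star:
            assert (s + t) + w == s + (t + w)       # (στ)ω = σ(τω)
```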

<img style="width:16em;height:auto;float:right;border-left:10px solid white" src="attachment:NLexample.svg"></img>
A _grammar_ not only determines unambiguously which sequences of words are sentences and which are not, but also provides sentences with a _structure_. The structure is instrumental in recognizing the _semantics_ of a sentence, which is our ultimate goal.

The theory of formal languages originates in linguistics. A basic rule of English is that a sentence (`S`) consists of a noun phrase (`NP`) followed by a verb phrase (`VP`). A noun phrase is either a proper name (`PN`) or a determiner (`D`) followed by a noun (`N`). A verb phrase is either a verb (`V`) or a verb followed by a noun phrase. Determiners are `a` and `the`. The hierarchical composition of an English sentence is shown by the _parse tree_ to the right; below are the corresponding rules. Grammars of this form are called _generative_ and the rules are called _productions_, as they determine how all sentences of a language are generated.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

`S → NP VP`  
`NP → PN`  
`NP → D N`  
`VP → V`  
`VP → V NP`  
`PN → Kevin`  
`PN → Dave`  
`D → a`  
`D → the`  
`N → banana`  
`N → apple`  
`V → eats`  
`V → runs`</div>
Formally, a grammar `G = (T, N, P, S)` is specified by

- a finite set `T` of _terminal symbols_,
- a finite set `N` of _non-terminal symbols_,
- a finite set `P` of _productions_,
- a symbol `S ∈ N`, the _start symbol_,

where `N ∩ T = {}` and `V = T ∪ N` is its vocabulary. Productions are pairs of strings <code>σ ∈ V<sup>+</sup></code>, <code>τ ∈ V<sup>\*</sup></code>, written `σ → τ`.

**Example.** `G₀ = (T, N, P, S)` with `T = {Kevin, Dave, a, the, banana, apple, eats, runs}`, `N = {S, NP, VP, PN, D, N, V}`, and the productions to the right is a grammar.

<div style="float:right;background-color:lightgrey;margin-left:18pt">

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`S`  
`⇒ NP VP`  
`⇒ PN VP`  
`⇒ Kevin VP`  
`⇒ Kevin V NP`  
`⇒ Kevin eats NP`  
`⇒ Kevin eats D N`  
`⇒ Kevin eats a N`  
`⇒ Kevin eats a banana`</div>
Given grammar `G = (T, N, P, S)`, sequence <code>χ ∈ V<sup>\*</sup></code> is _directly derivable_ from <code>π ∈ V<sup>+</sup></code>, written `π ⇒ χ`, if there exist sequences `σ`, `τ`, `μ`, `ν` such that `π = μσν`, `χ = μτν`, and `σ → τ ∈ P`.

We write

- <code>π ⇒<sup>\*</sup> χ</code> if `χ` is _derivable in zero or more steps_ from `π`,
- <code>π ⇒<sup>+</sup> χ</code> if `χ` is _derivable in one or more steps_ from `π`.

Formally, <code>⇒<sup>\*</sup></code> is the reflexive and transitive closure of relation `⇒` and <code>⇒<sup>+</sup></code> is the transitive closure of `⇒`.

The derivation to the right allows us to conclude that <code>S ⇒<sup>\*</sup> Kevin eats a banana</code> with grammar `G₀`.
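
The definition of `π ⇒ χ` translates almost literally into code. The following is a minimal sketch, assuming that strings are represented as tuples of symbols and productions as pairs of such tuples; the names `directly_derivable`, `prod`, and `P0` are hypothetical, and proper names are assumed to be produced from `PN`, as in the derivation of `Kevin eats a banana`. It checks that every line of that derivation is directly derivable from the line before.

```python
def directly_derivable(pi, productions):
    """Return the set of all χ with π ⇒ χ; strings are tuples of symbols."""
    result = set()
    for lhs, rhs in productions:               # σ → τ ∈ P
        n = len(lhs)
        for i in range(len(pi) - n + 1):
            if pi[i:i + n] == lhs:             # π = μσν with μ = pi[:i], ν = pi[i+n:]
                result.add(pi[:i] + rhs + pi[i + n:])   # χ = μτν
    return result

def prod(lhs, rhs):
    """Build a production from two space-separated strings of symbols."""
    return (tuple(lhs.split()), tuple(rhs.split()))

P0 = [prod("S", "NP VP"), prod("NP", "PN"), prod("NP", "D N"),
      prod("VP", "V"), prod("VP", "V NP"),
      prod("PN", "Kevin"), prod("PN", "Dave"),
      prod("D", "a"), prod("D", "the"),
      prod("N", "banana"), prod("N", "apple"),
      prod("V", "eats"), prod("V", "runs")]

derivation = ["S", "NP VP", "PN VP", "Kevin VP", "Kevin V NP", "Kevin eats NP",
              "Kevin eats D N", "Kevin eats a N", "Kevin eats a banana"]
steps = [tuple(s.split()) for s in derivation]

# every step π ⇒ χ of the derivation is a direct derivation
assert all(chi in directly_derivable(pi, P0) for pi, chi in zip(steps, steps[1:]))
```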

The _language_ `L(G)` generated by grammar `G = (T, N, P, S)` is the set of all sequences of terminal symbols which can be derived from the start symbol:

<code style="margin-left:18pt">L(G) = {χ ∈ T<sup>\*</sup> | S ⇒<sup>+</sup> χ}</code>

Two grammars `G`, `G'` are _equivalent_ if they generate the same language, `L(G) = L(G')`.

**Example.** Given `G₁ = (T, N, P, S)`, where `T = {a, b, c, d}`, `N = {S, X}`, `P = {S → aX, S → bX, X → c, X → d}`, the sequence `ac` is derivable from `S`, formally <code>S ⇒<sup>\*</sup> ac</code>,

    S ⇒ aX ⇒ ac

as are `ad`, `bc`, `bd`. The language generated by `G₁` is:

    L(G₁) = {ac, ad, bc, bd}

**Question.** What are other equivalent grammars? 

_Answer._
- `G₁' = (T, N', P', S)`, where `N' = {S, X, Y}`, `P' = {S → XY, X → a, X → b, Y → c, Y → d}`, is equivalent to `G₁`.
- Renaming the non-terminals also gives an equivalent grammar. In that sense, non-terminals "carry no meaning".
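
For small grammars, the generated language can be enumerated by exploring all derivations breadth-first. The following is a minimal sketch, assuming single-character symbols, Python strings as sequences, and productions that never shorten a string, so that sequences longer than `max_len` can be discarded safely; the names `sentences`, `P1`, and `P1_alt` are hypothetical. It enumerates `L(G₁)` and compares it with the language of the alternative grammar with productions `S → XY, X → a, X → b, Y → c, Y → d`.

```python
from collections import deque

def sentences(productions, start, terminals, max_len):
    """Terminal strings of length ≤ max_len derivable from start.
    Assumes no production shortens a string, so pruning by length is sound."""
    seen, queue, found = {start}, deque([start]), set()
    while queue:
        pi = queue.popleft()
        if all(x in terminals for x in pi):
            found.add(pi)
        for lhs, rhs in productions:
            i = pi.find(lhs)
            while i != -1:                      # every occurrence of the left-hand side
                chi = pi[:i] + rhs + pi[i + len(lhs):]
                if len(chi) <= max_len and chi not in seen:
                    seen.add(chi)
                    queue.append(chi)
                i = pi.find(lhs, i + 1)
    return found

T = set("abcd")
P1 = [("S", "aX"), ("S", "bX"), ("X", "c"), ("X", "d")]
P1_alt = [("S", "XY"), ("X", "a"), ("X", "b"), ("Y", "c"), ("Y", "d")]

print(sorted(sentences(P1, "S", T, max_len=4)))  # ['ac', 'ad', 'bc', 'bd']
assert sentences(P1, "S", T, 4) == sentences(P1_alt, "S", T, 4)
```

Such an enumeration can only confirm equivalence up to the chosen bound; here both languages are finite, so the bound is not a restriction.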

Languages generated by a grammar can be _finite_ or _infinite_. Infinite languages are expressed through recursion with a finite set of productions. For example, let `G₂ = (T, N, P, S)`, where `T = {a}`, `N = {S}` and let the productions `P` be:

    S → ε
    S → aS

The language generated is that of sequences over `a` of arbitrary length,

<code>   L(G₂) = {ε, a, aa, aaa, aaaa, …} = {a<sup>n</sup> | n ≥ 0}</code>

where <code>a<sup>0</sup> = ε</code> and <code>a<sup>n+1</sup> = aa<sup>n</sup></code>. We prove this formally by inclusion in both directions. By definition of `L(G₂)`,

<code style="margin-left:18pt">{χ ∈ T<sup>\*</sup> | S ⇒<sup>+</sup> χ} ⊆ {a<sup>n</sup> | n ≥ 0}</code>

means that for every <code>χ ∈ T<sup>\*</sup></code> that is derivable from `S` there exists an `n ≥ 0` such that <code>χ = a<sup>n</sup></code>. We show this by induction over the length of derivations.

- _Base._ Suppose `χ` is derived directly from `S` by `S ⇒ χ`, which leaves only `χ = ε` according to the first production. Then <code>χ = a<sup>0</sup></code>.
- _Step._ Suppose `χ` is derived from `S` in multiple steps, which implies <code>S ⇒ aS ⇒<sup>+</sup> χ</code> according to the second production, and that for any shorter derivation <code>S ⇒<sup>+</sup> ω</code> there exists an `n` such that <code>ω = a<sup>n</sup></code>. As the leading `a` is never rewritten, <code>aS ⇒<sup>+</sup> χ</code> means that `χ = aω` for some `ω` with a shorter derivation <code>S ⇒<sup>+</sup> ω</code>. By the induction assumption, <code>ω = a<sup>n</sup></code> for some `n`, so we can conclude <code>χ = aa<sup>n</sup> = a<sup>n+1</sup></code>.

The inclusion in the other direction means that every <code>a<sup>n</sup></code> for `n ≥ 0` can be derived from `S`:

<code style="margin-left:18pt">{a<sup>n</sup> | n ≥ 0} ⊆ {χ ∈ T<sup>\*</sup> | S ⇒<sup>+</sup> χ}</code>

We show this by induction over `n`. The string <code>a<sup>0</sup> = ε</code> can be generated by the first production, <code>S ⇒<sup>+</sup> ε</code>. Suppose <code>a<sup>n</sup></code> can be generated, <code>S ⇒<sup>+</sup> a<sup>n</sup></code>. We need to show that <code>a<sup>n+1</sup></code> can be generated as well. This follows from <code>S ⇒ aS ⇒<sup>+</sup> aa<sup>n</sup> = a<sup>n+1</sup></code>.

Thus we can conclude <code>L(G₂) = {a<sup>n</sup> | n ≥ 0}</code>.
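
The claim can also be checked mechanically for short derivations. The following is a minimal sketch, with the hypothetical helper `derivations`, that computes all sequences reachable from `S` in exactly `k` steps with the productions `S → ε` and `S → aS`. It confirms for small `k` that a derivation of `k` steps yields exactly the sentence <code>a<sup>k-1</sup></code>, which covers both inclusions up to the chosen bound. It illustrates the proof but does not replace it.

```python
def derivations(steps):
    """All sequences derivable from S in exactly `steps` steps in G₂."""
    forms = {"S"}
    for _ in range(steps):
        # each sequence contains at most one S, so rewriting the first S suffices
        forms = {f.replace("S", rhs, 1)
                 for f in forms if "S" in f
                 for rhs in ("", "aS")}          # S → ε and S → aS
    return forms

for k in range(1, 8):
    terminal = {f for f in derivations(k) if "S" not in f}
    assert terminal == {"a" * (k - 1)}           # exactly one sentence per length of derivation
```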

Recursion also allows expressing arbitrarily deep _nested structures_. For example, let `G₃ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S}`, and let the productions `P` be:

    S → b
    S → aSc

The sequence `aabcc` is derivable from `S`:

    S ⇒ aSc ⇒ aaScc ⇒ aabcc
    
The generated language is:

<code>    L(G₃) = {b, abc, aabcc, aaabccc, …} = {a<sup>n</sup>bc<sup>n</sup> | n ≥ 0}</code>
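
The nesting expressed by `S → aSc` corresponds directly to recursion in a program. The following is a minimal sketch of a membership test for `L(G₃)` that undoes the two productions; the function name `in_L_G3` is hypothetical.

```python
def in_L_G3(s):
    """Decide whether s ∈ L(G₃) by undoing the productions S → b and S → aSc."""
    if s == "b":                                   # S → b
        return True
    # S → aSc: strip one a from the left and one c from the right, then recurse
    return len(s) >= 3 and s[0] == "a" and s[-1] == "c" and in_L_G3(s[1:-1])

assert in_L_G3("aabcc")
assert not in_L_G3("abcc")                         # unbalanced nesting
assert all(in_L_G3("a" * n + "b" + "c" * n) for n in range(6))
```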

Languages can be classified according to restrictions on their grammar. The following classification is known as the _Chomsky Hierarchy_ <cite data-cite="1997494/AMDT6J5A"></cite>. For a grammar `G = (T, N, P, S)`, let `V = T ∪ N` be its vocabulary, and assume `a ∈ T`, `A, B ∈ N`, `μ, ν, τ ∈ V*`, <code>σ ∈ V<sup>+</sup></code>:

- A grammar is _context-sensitive_ if productions are of the form

    `μAν → μσν`  
    `S → ε`, provided `S` does not occur on the right-hand side of any production


- A grammar is _context-free_ if productions are of the form

    `A → τ`


- A grammar is _regular_ if productions are of the form

    `A → ε`  
    `A → a`  
    `A → aB`

**Question.** Consider the earlier grammars `G₀`, `G₁`, `G₂`, `G₃`. Which grammars are regular, context-free, context-sensitive, or none of those, i.e. unrestricted?

_Answer._
- `G₀` is not regular, but is context-free (and context-sensitive)
- `G₁` is regular (and context-free, context-sensitive)
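- `G₂` is regular (and context-free); as it has `S → ε` while `S` also occurs on the right-hand side of `S → aS`, it is not context-sensitive according to the above definition, although its language is context-sensitive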
- `G₃` is not regular, but is context-free (and context-sensitive)
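
The syntactic restrictions for regular and context-free grammars are easy to check mechanically. The following is a minimal sketch, assuming productions are written as text of the form `A -> x y` with symbols separated by spaces; the helper names `productions`, `is_context_free`, and `is_regular` are hypothetical, and proper names in `G₀` are assumed to be produced from `PN`. The test for context-sensitive grammars is left out.

```python
def productions(text):
    """Read productions 'A -> x y', one per line; an empty right-hand side is ε."""
    P = []
    for line in text.strip().splitlines():
        lhs, rhs = line.split("->")
        P.append((tuple(lhs.split()), tuple(rhs.split())))
    return P

def is_context_free(P, N):
    # every left-hand side is a single non-terminal: A → τ
    return all(len(lhs) == 1 and lhs[0] in N for lhs, _ in P)

def is_regular(P, N, T):
    def ok(rhs):                                   # right-hand side is ε, a, or aB
        return (rhs == ()
                or (len(rhs) == 1 and rhs[0] in T)
                or (len(rhs) == 2 and rhs[0] in T and rhs[1] in N))
    return is_context_free(P, N) and all(ok(rhs) for _, rhs in P)

G0 = ({"Kevin", "Dave", "a", "the", "banana", "apple", "eats", "runs"},
      {"S", "NP", "VP", "PN", "D", "N", "V"},
      productions("""
          S -> NP VP
          NP -> PN
          NP -> D N
          VP -> V
          VP -> V NP
          PN -> Kevin
          PN -> Dave
          D -> a
          D -> the
          N -> banana
          N -> apple
          V -> eats
          V -> runs"""))
G1 = (set("abcd"), {"S", "X"}, productions("S -> a X\nS -> b X\nX -> c\nX -> d"))
G2 = ({"a"}, {"S"}, productions("S ->\nS -> a S"))
G3 = (set("abc"), {"S"}, productions("S -> b\nS -> a S c"))

for name, (T, N, P) in [("G0", G0), ("G1", G1), ("G2", G2), ("G3", G3)]:
    print(name, "regular:", is_regular(P, N, T),
          "context-free:", is_context_free(P, N))
```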

We give some fundamental results from formal language theory. Regular grammars can express repetition, but not nesting:

**Theorem.** For context-free grammar `G₃`, no equivalent regular grammar exists.

Let `G₄ = (T, N, P, S)`, where `T = {a, b, c}`, `N = {S, B}`, and let the productions `P` be:

    S → abc
    S → aBSc
    Ba → aB
    Bb → bb

The language generated is:

<code>   L(G₄) = {abc, aabbcc, aaabbbccc, …} = {a<sup>n</sup>b<sup>n</sup>c<sup>n</sup> | n ≥ 1}</code>

**Theorem.** For context-sensitive grammar `G₄`, no equivalent context-free grammar exists.

**Question.** What is a derivation of `aaabbbccc` in `G₄`? Explain how the grammar works!

_Answer._

      S
    ⇒ aBSc
    ⇒ aBaBScc
    ⇒ aBaBabccc
    ⇒ aBaaBbccc
    ⇒ aaBaBbccc
    ⇒ aaaBBbccc
    ⇒ aaaBbbccc
    ⇒ aaabbbccc

The grammar works by first producing the same number of `a`, `B`, and `c`, with all `c` in their correct position at the end but `a` and `B` alternating. Then the production `Ba → aB` moves all `a` to the left and all `B` to the middle. Once a `B` is in its correct position, it is converted to a `b`.
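
The three phases of this strategy can be replayed in code. The following is a minimal sketch, with the hypothetical function `derive_G4`, that constructs a derivation of <code>a<sup>n</sup>b<sup>n</sup>c<sup>n</sup></code> for a given `n ≥ 1`. It always rewrites the leftmost occurrence of a left-hand side, so the order of the `Ba → aB` steps may differ from the derivation above, while the number of steps is the same.

```python
def derive_G4(n):
    """Derivation of a^n b^n c^n in G₄ (n ≥ 1) as a list of sequences."""
    form, steps = "S", ["S"]

    def apply(lhs, rhs):
        """Rewrite the leftmost occurrence of lhs, if there is one."""
        nonlocal form
        i = form.find(lhs)
        if i < 0:
            return False
        form = form[:i] + rhs + form[i + len(lhs):]
        steps.append(form)
        return True

    for _ in range(n - 1):             # phase 1: same number of a, B, c
        apply("S", "aBSc")
    apply("S", "abc")
    while apply("Ba", "aB"):           # phase 2: move all a to the left
        pass
    while apply("Bb", "bb"):           # phase 3: convert B to b, right to left
        pass
    return steps

print(" ⇒ ".join(derive_G4(2)))        # S ⇒ aBSc ⇒ aBabcc ⇒ aaBbcc ⇒ aabbcc
```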

Let `G₅ = (T, N, P, S)`, where `T = {a, b}`, `N = {A, B, S, X, $}`, and productions `P` are:  
<code>
   S → X$
   X → ε
   X → aXA
   X → bXB
   Aa → aA
   Ab → bA
   Ba → aB
   Bb → bB
   A$ → a$
   B$ → b$
   $ → ε
</code>

The language generated is the _copy language_:

<code>   L(G₅) = {ww | w ∈ T<sup>*</sup>}</code>

**Theorem.** For context-sensitive grammar `G₅`, no equivalent context-free grammar exists.

**Question.** What is a derivation of `abab`?

_Answer._

```
  S
⇒ X$
⇒ aXA$
⇒ abXBA$
⇒ abBA$
⇒ abBa$
⇒ abaB$
⇒ abab$
⇒ abab
```

Languages generated by context-sensitive, context-free, and regular grammars are called _context-sensitive_, _context-free_, and _regular languages_, respectively.

**Theorem.** Every regular language is also context-free. Every context-free language is also context-sensitive.

Note that the inclusion does not quite hold for grammars, as `A → ε` is allowed in regular and context-free grammars, but not in context-sensitive grammars.

For brevity, we write

	σ → τ₀ | τ₁ | …

for the set of productions

    σ → τ₀
    σ → τ₁
    …

We continue with context-free languages. For those, the _parse tree_ or _concrete syntax tree_ is a visual representation of a derivation which abstracts from the order of independent applications of productions. In the example, `E` and `id` stand for expressions and identifiers of programs.

<img style="width:6em;float:right;border-left:10px solid white" src="attachment:idplusid.svg"></img>
**Example.** Let `G₆ = (T, N, P, E)` where `T = {id, +}`, `N = {E}`, and the productions `P` are:

<code>   E → id | E + E</code>

There are two derivations of `id + id`:

    E ⇒ E + E ⇒ id + E ⇒ id + id
    E ⇒ E + E ⇒ E + id ⇒ id + id

<img style="width:16em;float:right;border-left:10px solid white" src="attachment:idplusidplusid.svg"></img>
Continuing with `G₆`, there are two parse trees for `id + id + id`. A sentence with more than one parse tree is an _ambiguous sentence_ and a grammar allowing that is an _ambiguous grammar_. Syntactically ambiguous sentences may have an ambiguous meaning. In natural languages this may be resolved through the context; in programming languages, syntactic ambiguity is avoided.
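
The degree of ambiguity of `G₆` grows quickly with the length of the sentence. The following is a minimal sketch, with the hypothetical function `count_trees`, that counts the parse trees of a sum of `n` identifiers: the production `E → E + E` at the root splits the identifiers into a left part of `k` and a right part of `n - k` identifiers, for every possible `k`.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(n):
    """Number of parse trees in G₆ for id + id + … + id with n identifiers."""
    if n == 1:
        return 1                       # E → id
    # E → E + E: the + at the root has k identifiers to its left, n - k to its right
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))

print([count_trees(n) for n in range(1, 6)])  # [1, 1, 2, 5, 14]
```

The value 2 for `n = 3` corresponds to the two parse trees of `id + id + id`.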

<img style="width:8em;float:right;border-left:10px solid white" src="attachment:idplusleft.svg"></img>
Changing the productions to a _left-recursive_ form eliminates ambiguity and makes `+` associate to the left.

<code>   E → id | E + id</code>

<img style="width:8em;float:right;border-left:10px solid white" src="attachment:idplusright.svg"></img>
Changing the productions to a _right-recursive_ form eliminates ambiguity and makes `+` associate to the right.

<code>   E → id | id + E</code>

**Question.** For which operators in programming languages does associativity matter and for which not?

_Answer._
- For integer division associativity matters.
- For integer addition associativity matters in bounded arithmetic (overflow is an error) and saturating arithmetic (overflow results in the maximal number).
- For integer addition associativity does not matter in modulo arithmetic, e.g. with word size.
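
These cases can be tried out with small examples. The following is a minimal sketch, assuming a 32-bit word size; the functions `checked_add`, `saturating_add`, and `modulo_add` are hypothetical stand-ins for the three kinds of arithmetic, as Python's own integers are unbounded.

```python
MAX, MIN = 2**31 - 1, -2**31           # assumed 32-bit signed range

def checked_add(x, y):
    """Bounded arithmetic: overflow is an error."""
    r = x + y
    if not MIN <= r <= MAX:
        raise OverflowError("integer overflow")
    return r

def saturating_add(x, y):
    """Saturating arithmetic: overflow results in the maximal (or minimal) number."""
    return max(MIN, min(MAX, x + y))

def modulo_add(x, y):
    """Modulo arithmetic on unsigned 32-bit words."""
    return (x + y) % 2**32

# integer division: the grouping changes the result
assert (8 // 4) // 2 == 1 and 8 // (4 // 2) == 4

# saturating addition: the left grouping loses 1
assert saturating_add(saturating_add(MAX, 1), -1) == MAX - 1
assert saturating_add(MAX, saturating_add(1, -1)) == MAX

# bounded addition: only the left grouping overflows
try:
    checked_add(checked_add(MAX, 1), -1)
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed and checked_add(MAX, checked_add(1, -1)) == MAX

# modulo addition: both groupings agree on all samples
samples = [0, 1, MAX, 2**32 - 1]
assert all(modulo_add(modulo_add(x, y), z) == modulo_add(x, modulo_add(y, z))
           for x in samples for y in samples for z in samples)
```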

The next example illustrates operator _precedence_.

<img style="width:16em;float:right;border-left:10px solid white" src="attachment:idplusidtimesid.svg"></img>
**Example.** Let `G₇ = (T, N, P, E)` where `T = {id, +, ×}`, `N = {E}`, and the productions `P` are:

<code>   E → id | E + id | E × id</code>

In `id + id × id`, operator `+` binds tighter; in `id × id + id`, operator `×` binds tighter. That is, `+` and `×` bind equally tightly and associate to the left.

<img style="width:16em;float:right;border-left:10px solid white" src="attachment:idplustimestimesplus.svg"></img>
To have proper operator precedence, nonterminal `T` for terms is introduced and the productions are changed to:  
<code>
   E → T | E + T
   T → id | T × id
</code>

To allow `+` to bind tighter than `×`, parentheses are needed. For this, nonterminal `F` for factors is introduced.

**Example.** Let `G₈ = (T, N, P, E)` where `T = {id, +, ×, (, )}`, `N = {E, T, F}`, and the productions `P` are:  
<code>
   E → T | E + T
   T → F | T × F
   F → id | ( E )
</code>

**Question.** What are the parse trees for `id + id × id`, for `id × id + id`, and for `(id + id) × id`?
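
One way to explore such parse trees is to let a program construct them. The following is a minimal sketch of a recursive-descent parser for `G₈`, assuming the input is already split into the tokens `id`, `+`, `×`, `(`, `)`; the function `parse` and the representation of trees as nested tuples are hypothetical. As a recursive procedure cannot follow the left-recursive productions `E → E + T` and `T → T × F` directly, the sketch replaces left recursion by a loop but still builds the left-leaning tree that these productions prescribe.

```python
def parse(tokens):
    """Parse a list of tokens according to G₈; return the parse tree as nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def factor():                      # F → id | ( E )
        if peek() == "(":
            expect("(")
            e = expression()
            expect(")")
            return ("F", "(", e, ")")
        expect("id")
        return ("F", "id")

    def term():                        # T → F | T × F
        t = ("T", factor())
        while peek() == "×":
            expect("×")
            t = ("T", t, "×", factor())
        return t

    def expression():                  # E → T | E + T
        e = ("E", term())
        while peek() == "+":
            expect("+")
            e = ("E", e, "+", term())
        return e

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r} at position {pos}")
    return tree

print(parse("id + id × id".split()))
```

In the resulting tuple, the first component of each node is the nonterminal and the remaining components are its children from left to right.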

### Historic Notes and Further Reading

The original motivation for the classification of grammars came from the study of natural languages. The following examples illustrate the potential use of regular, context-free, and context-sensitive languages (credit for examples: [C. Chesi, Univ. of Siena](http://www.ciscl.unisi.it/master/chesi/lingcomp-2017_18-03_04-formal_grammar.pdf)):

- _Right recursion_ (_tail recursion_) of the form <code>ab<sup>n</sup></code>:


    [the dog bit [the cat [that chased [the mouse [that ran]]]]]


- _Center embedding_ (_true recursion_) of the form <code>a<sup>n</sup>b<sup>n</sup></code>:


    [the mouse [(that) the cat [(that) the dog bit] chased] ran]


- _Cross‐serial dependencies_ (_identity recursion_) of the form `ww`:


    John, Mary, and David, are a widower, a widow, and a widower, respectively

There is an ongoing discussion on using regular, context-free, and context-sensitive languages for natural languages. The male-female correspondence of the last example can also be seen as a semantic issue rather than a syntactic issue. If one takes human comprehension into account, the full generality of context-sensitive, context-free, and even regular languages is not needed. As a consequence, further classes of grammars have emerged, e.g. <cite data-cite="1997494/JRSJH4RN"></cite>. Considering the translation of natural languages, neural networks can perform better than grammar-based translation <cite data-cite="1997494/TP97XAIW"></cite>, <cite data-cite="1997494/TN2A3DC2"></cite>.

On the other hand, Chomsky's Hierarchy had a profound impact on computing: for each class of languages equivalent _recognizers_ for languages are known. Calling languages of unrestricted grammars _recursively enumerable_, we have:

|language              |recognizer              |
|:---------------------|:-----------------------|
|recursively enumerable|Turing machine          |
|context-sensitive     |linear bounded automaton|
|context-free          |pushdown automaton      |
|regular               |finite state machine    |

Regular and context-free languages are ubiquitous as recognizers for those can be constructed efficiently and are themselves in some sense efficient. The next chapters in these notes discuss their use for scanning and parsing.

Even the earlier examples show the difficulty of writing context-sensitive grammars. After Algol 60 introduced the use of context-free grammars for its syntax, with Algol 68 an attempt was made to go beyond context-free grammars by using a dedicated "two-level grammar" <cite data-cite="1997494/AE3UY6NV"></cite>; that kind of grammar was not used for another language. Around the same time, Knuth proposed _attribute grammars_ as a way of associating computation (which can be type-checking and translation) to recognition of a context-free language <cite data-cite="1997494/K7M86FYQ"></cite>. Since then it has become common to define a programming language with regular and context-free grammars and to use attribute grammars for compilation. Type systems, which can be thought of as context-sensitive grammars, are also used in the definition of some languages <cite data-cite="1997494/GZX9GVPK"></cite>.

### Exercises

1. Considering `G₅`, give a derivation of `abbabb`! Explain how the grammar works!

2. A well-known syntactically ambiguous English sentence is `Time flies like an arrow`. Explain the ambiguity!

### Bibliography

<div class="cite2c-biblio"></div>