---

# 2. Regular Languages
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, January 2019**

---

A _recognizer_ for a language is a program that takes as input a string and _accepts_ it if the string is a sentence of the language or otherwise _rejects_ it.

As recognizers we use _state machines_, which consist of a set of _states_ and a set of labeled _transitions_ between states. An input sequence is recognized by starting in the _initial state_ and, from each state, following the transition that is labeled with the next input symbol. The input is accepted if the machine ends in one of the _final states_.

Regular languages are of particular interest because they can be recognized by finite state machines and can be described by regular expressions.

The _regular expressions_ over symbols `T` consist of

- any symbol `a ∈ T ∪ {ε}`,
- `E₁ | E₂`, where `E₁`, `E₂` are regular expressions,
- `E₁ E₂`, where `E₁`, `E₂` are regular expressions,
- `( E )`, where `E` is a regular expression,
- `[ E ]`, where `E` is a regular expression,
- `E*`, where `E` is a regular expression.

The notation `E*` is the same as `{ E }`. That is, regular expressions are like the right-hand side of productions in EBNF grammars if restricted to terminals.

The language `L(E)` _described_ by regular expression `E` over set `T` of symbols is defined recursively over the structure of `E`. Assume `a ∈ T ∪ {ε}` and `E`, `E₁`, `E₂` are regular expressions:

| regular expression | language          |
|:-------------------|:------------------|
| `a`                | `{a}`             |
| `E₁ \| E₂`         | `L(E₁) ∪ L(E₂)`   |
| `E₁ E₂`            | `L(E₁) L(E₂)`      |
| `(E)`              | `L(E)`            |
| `[E]`              | `{ε} ∪ L(E)`      |
| `E*`               | `⋃ n ≥ 0 • Lⁿ(E)` |

<div style="float:right;background-color:lightgrey;border-left:20px solid white">

**Example.**

 `L([a | b])`  
`=  {ε} ∪ L(a | b)`  
`=  {ε} ∪ L(a) ∪ L(b)`  
`=  {ε} ∪ {a} ∪ {b}`  
`=  {ε, a, b}`
</div>

where for sets `A`, `B` of sequences over `T`:

  `A B = {a b | a ∈ A ∧ b ∈ B}`  
  `A⁰ = {ε}`  
  `Aⁿ = A Aⁿ⁻¹, n > 0`

As a consequence `A¹ = A A⁰ = A {ε} = A` and therefore:

    L(E*) = {ε} ∪ L(E) ∪ L²(E) ∪ …
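
The recursive definition can be sketched in Python. Since `L(E*)` is in general infinite, the sketch only computes the sentences up to a given length `n`. The encoding of regular expressions as nested tuples and the names `concat` and `lang` are assumptions made for illustration, not part of the definition above.

```python
# Regular expressions as nested tuples (a hypothetical encoding):
#   'a'            a symbol; '' stands for ε
#   ('|', E1, E2)  choice, E1 | E2
#   ('.', E1, E2)  concatenation, E1 E2
#   ('?', E)       option, [E]
#   ('*', E)       repetition, E*

def concat(A, B, n):
    """A B = {a b | a ∈ A ∧ b ∈ B}, keeping only sequences of length ≤ n."""
    return {a + b for a in A for b in B if len(a + b) <= n}

def lang(E, n):
    """All sentences of L(E) of length ≤ n."""
    if isinstance(E, str):
        return {E} if len(E) <= n else set()
    op = E[0]
    if op == '|':
        return lang(E[1], n) | lang(E[2], n)
    if op == '.':
        return concat(lang(E[1], n), lang(E[2], n), n)
    if op == '?':
        return {''} | lang(E[1], n)
    if op == '*':
        base = lang(E[1], n)
        result, power = set(), {''}          # power starts as A⁰ = {ε}
        while not power <= result:           # stop once a power adds nothing new
            result |= power
            power = concat(base, power, n)   # next power: A Aᵏ
        return result
    raise ValueError(f"unknown operator {op!r}")

print(sorted(lang(('?', ('|', 'a', 'b')), 3)))   # ['', 'a', 'b']
```

The printed result corresponds to the example `L([a | b]) = {ε, a, b}`, with the empty string playing the role of `ε`.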






**Question.** What language does `[a a*]` describe? How can the expression be simplified? Give a formal proof!

_Answer._

     L([a a*])
    =  {ε} ∪ L(a a*)
    =  {ε} ∪ (L(a) L(a*))
    =  {ε} ∪ ({a} (⋃ n ≥ 0 • Lⁿ(a)))
    =  {ε} ∪ ({a} (⋃ n ≥ 0 • {aⁿ}))
    =  {ε} ∪ (⋃ n ≥ 0 • {a aⁿ})
    =  {a⁰} ∪ (⋃ n > 0 • {aⁿ})
    =  ⋃ n ≥ 0 • {aⁿ}
    =  L(a*)

That is, `[a a*]` describes the same language as `a*`.

The similarity in the notation of regular expressions and productions in EBNF is justified by the following theorem, given without proof:

For regular expression `E` over `T` and grammar `G` with single production `S → E`, the language described by `E` and the language generated by `G` are the same, `L(E) = L(G)`.

For context-free grammars, we won't consider equivalent grammars, unless necessitated by the recognizer. For regular expressions, equivalence is of fundamental importance. Regular expressions `E`, `E'` are _equal_, `E = E'`, if `L(E) = L(E')`; that is, a regular expression is identified with the set of sentences it describes. For example, for regular expressions `E`, `F`, `G`:

- `E | F = F | E`
- `(E | F) | G = E | (F | G)`
- `E (F | G) = E F | E G`
- `(E | F) G = E G | F G`
- `E* = [E E*]`
- `E E* = E* E`
- `E** = E*`
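
Such equalities can be tested, though not proved, by comparing which strings two expressions accept. Below is a minimal sketch using Python's `re` module, in which `[E]` is written `(E)?`. The helper `same_up_to` and the concrete choices `E = ab`, `F = b`, `G = a | ba` are assumptions for illustration.

```python
import re
from itertools import product

def same_up_to(p, q, alphabet="ab", n=6):
    """True if patterns p and q accept the same strings of length ≤ n."""
    for k in range(n + 1):
        for letters in product(alphabet, repeat=k):
            s = "".join(letters)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True

# Sample expressions, parenthesized so they can be spliced into larger patterns
E, F, G = "(ab)", "(b)", "(a|ba)"
laws = [
    (f"{E}|{F}", f"{F}|{E}"),
    (f"({E}|{F})|{G}", f"{E}|({F}|{G})"),
    (f"{E}({F}|{G})", f"{E}{F}|{E}{G}"),
    (f"({E}|{F}){G}", f"{E}{G}|{F}{G}"),
    (f"{E}*", f"({E}{E}*)?"),
    (f"{E}{E}*", f"{E}*{E}"),
    (f"({E}*)*", f"{E}*"),
]
for p, q in laws:
    print(p, "=", q, same_up_to(p, q))
```

A result of `False` would refute an equality; a result of `True` only says that no counterexample of length at most `n` exists.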

For every regular grammar `G` we can construct a regular expression `E` such that `L(G) = L(E)` by transforming the productions of the grammar. The assumption is that the grammar is in BNF, with one production for every nonterminal, including one for the start symbol `S`. An EBNF production of the form

    A → E A | F

where `A` does not occur in `E`, `F`, is equivalent to (_Arden's Rule_)

    A → E* F

which can be used to replace `A` by `E* F` in all other productions. This is repeated until a single production for `S` is left, whose right-hand side is the equivalent regular expression.

**Example.** Given the regular grammar with productions

    S → a | b X    X → b X | c Y    Y → c

first the last production can be eliminated by replacing `Y` with `c` in all other productions:

    S → a | b X    X → b X | c c

An equivalent production for `X` is `X → b* c c` which allows `X` to be replaced by `b* c c`:

    S → a | b b* c c

Writing `bᐩ` as an abbreviation for `b b*`, an equivalent regular expression is thus `a | bᐩ c c`.
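
The result of this example can be cross-checked by generating all sentences of the grammar up to some length and comparing them with the strings matched by the expression. The sketch below assumes a hypothetical encoding of the productions as a dictionary from nonterminals to lists of alternatives, and writes `a | bᐩ c c` as `a|b+cc` in the syntax of Python's `re` module.

```python
import re
from itertools import product

# S → a | b X    X → b X | c Y    Y → c
productions = {"S": ["a", "bX"], "X": ["bX", "cY"], "Y": ["c"]}

def sentences(start, n):
    """All terminal strings of length ≤ n derivable from start."""
    found, todo, seen = set(), [start], {start}
    while todo:
        form = todo.pop()
        # leftmost nonterminal of the sentential form, if any
        nt = next((c for c in form if c in productions), None)
        if nt is None:
            found.add(form)
            continue
        for rhs in productions[nt]:
            new = form.replace(nt, rhs, 1)
            # No production here shrinks a form, so forms longer than n
            # cannot lead to sentences of length ≤ n and are dropped.
            if len(new) <= n and new not in seen:
                seen.add(new)
                todo.append(new)
    return found

n = 6
matching = {"".join(t) for k in range(n + 1) for t in product("abc", repeat=k)
            if re.fullmatch(r"a|b+cc", "".join(t))}
print(sorted(sentences("S", n)))
print(sentences("S", n) == matching)
```

The length-based pruning relies on this grammar having no ε-productions; for other grammars the search would need a different bound.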

**Question.** What is an equivalent regular expression for the grammar with productions:

    S → a S | b X a
    X → a X | b Y | a
    Y → a Y | a

_Answer._

    S → a S | b X a    X → a X | b Y | a    Y → a Y | a

An equivalent production for `Y` is `Y → a* a` which allows `Y` to be replaced by `aᐩ`:

    S → a S | b X a    X → a X | b aᐩ | a

The production for `X` can be written as `X → a X | (b aᐩ | a)`, which matches the form that is needed for it to be rewritten as `X → a* (b aᐩ | a)`, which in turn allows `X` to be eliminated from the other productions:

    S → a S | b (a* (b aᐩ | a)) a

Now `S` can be equivalently defined by `S → a* b (a* (b aᐩ | a)) a`. The regular expression equivalent to the grammar is therefore:

    a* b (a* (b aᐩ | a)) a

A finite state machine `A = (T, Q, R, q₀, F)` is specified by
- a finite set `T` of _symbols_,
- a finite set `Q` of _states_,
- a finite set `R` of _transitions_,
- an _initial state_ `q₀ ∈ Q`,
- a set of _final states_ `F ⊆ Q`,

where `T` is the _vocabulary_ and each transition is a triple with a state `q ∈ Q`, a symbol `t ∈ T`, and a state `q' ∈ Q`, written:

	 q t → q'

A finite state machine is given a sequence `τ ∈ T*` as input and starts in its initial state. A transition `q t → q'` allows it to move from `q` to `q'` while consuming `t` from the input.

Note that sometimes ε-transitions are allowed, i.e. transitions of the form `q → q'`, which can be taken without consuming a symbol from the input.

Finite state machines can be graphically represented by _finite state diagrams_:
- Each state is enclosed in a circle.
- Transitions are arrows between states labeled with a symbol.
- An arrow without a source state points to the initial state.
- Each final state is enclosed in a double circle.

For example, for `A₀ = (T, Q, R, q₀, F)` where `T = {a, b, c}`, `Q = {q₀, q₁, f}`, `R = {q₀ a → q₁, q₁ b → q₁, q₁ c → f, q₀ c → f}`, and `F = {f}`:

<img style="width:20em" src="attachment:A0.svg"></img>


Sequence `τ ∈ T*` is _accepted_ in state `q` leading to state `q'`, written `q τ ⇒ᐩ q'`, if `q'` can be reached in a number of steps, each accepting the next element from the input,

- `q t ⇒ᐩ q'` if `q t → q'`
- `q tσ ⇒ᐩ q'` if `q t → r` and `r σ ⇒ᐩ q'` for some state `r`.

The language `L(A)` accepted by finite state machine `A` is the set of all sequences of symbols which can be recognized when starting from the initial state:

	L(A) = {τ ∈ T* | q₀ τ ⇒ᐩ q, q ∈ F}
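
A minimal sketch of how such a machine can be run, using `A₀` from above. The representation of `R` as a set of triples and the function name `accepts` are assumptions for illustration. The sketch tracks the set of states reachable so far, so it also works when several transitions from one state carry the same symbol; it does not handle ε-transitions.

```python
# A₀: each transition q t → q' is a triple (q, t, q')
R = {("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "c", "f"), ("q0", "c", "f")}
q0, F = "q0", {"f"}

def accepts(tau):
    """True if q0 tau ⇒ᐩ q for some q ∈ F."""
    if not tau:
        return False          # ⇒ᐩ as defined above needs at least one step
    current = {q0}            # states reachable after the input read so far
    for t in tau:
        current = {r for (q, s, r) in R if q in current and s == t}
    return bool(current & F)

print([w for w in ["c", "ac", "abbc", "ab", "ca", ""] if accepts(w)])
# ['c', 'ac', 'abbc']
```

Once `current` becomes empty, no transition applies any more and the remaining input is rejected, as for `ca`.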

**Question.** What is the language accepted by `A₀`? What is an equivalent regular expression? What is an equivalent regular grammar?

_Answer._
- Language: either `c` alone, or an `a` followed by zero or more `b`'s and then a `c`.
- Regular expression: `[a b*] c`
- Productions of grammar: `A → a B | c` and `B → b B | c`
