In [1]:
from IPython.display import HTML
HTML(open('../style.css').read())

In [2]:
%load_ext nb_mypy

Version 1.0.6


In [3]:
from typing import TypeVar

# From Regular Expressions to <span style="font-variant:small-caps;">Nfa</span>s

This notebook shows how a given regular expression $r$ can be transformed into an equivalent non-deterministic finite automaton (<span style="font-variant:small-caps;">Nfa</span>).
It implements the theory that is outlined in section 4.4 of the
lecture notes.

In [4]:
from typing import Literal

In [5]:
type BinaryOp = Literal['⋅', '+']
type UnaryOp  = Literal['*']

The type `RegExp` describes the *parse tree* of a regular expression.  This will be the *input* of the program we develop in this notebook.  Note that `RegExp` is a *recursive* type.  

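The type `tuple[RegExp, UnaryOp]` denotes a pair consisting of a regular expression and a unary operator, while `tuple[RegExp, BinaryOp, RegExp]` denotes a triple consisting of two regular expressions that are separated by a binary operator.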

In [6]:
Char   = str
RegExp = TypeVar('RegExp')
RegExp = int | Char | tuple[RegExp, UnaryOp] | tuple[RegExp, BinaryOp, RegExp]

We will represent the states of the `NFA` as integers.

In [7]:
State = int

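The type `Delta` denotes the transition relation of a non-deterministic finite automaton.  It is represented as a dictionary that maps pairs consisting of a state and a character (or the string `'𝜀'`) to sets of states.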
The type `NFA` denotes a non-deterministic finite automaton.  Its elements are 5-tuples of the form
$$ \langle Q, \Sigma, \delta, q_0, A \rangle $$
where
1. $Q$ is the set of states,
2. $\Sigma$ is the alphabet,
3. $\delta: Q \times (\Sigma \cup \{ \varepsilon \}) \rightarrow 2^Q$ is the transition relation,
4. $q_0 \in Q$ is the start state, and
5. $A \subseteq Q$ is the set of accepting states.
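
As a concrete illustration, the following sketch writes down a small automaton in exactly this 5-tuple shape, using plain Python sets and a dictionary.  The automaton and the names `nfa_a`, `Q`, and `delta` are made up for this example; the string `'𝜀'` stands for $\varepsilon$, following the convention used in the rest of this notebook.

```python
# A hypothetical NFA that accepts exactly the string 'a',
# reached via an initial 𝜀-transition.
State = int
Char  = str
Delta = dict[tuple[State, Char], set[State]]
NFA   = tuple[set[State], set[Char], Delta, State, set[State]]

Q:     set[State] = {1, 2, 3}
Sigma: set[Char]  = {'a', 'b'}
delta: Delta = {
    (1, '𝜀'): {2},   # from state 1 we may move to state 2 without reading input
    (2, 'a'): {3},   # reading 'a' in state 2 leads to state 3
}
nfa_a: NFA = (Q, Sigma, delta, 1, {3})

# A missing key models "no transition": we fall back to the empty set.
print(delta.get((2, 'b'), set()))   # set()
```

Since $\delta$ is stored as a dictionary, pairs without a transition are simply absent and have to be looked up with a default value.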

In [8]:
Delta = dict[tuple[State, Char], set[State]]
NFA   = tuple[set[State], set[Char], Delta, State, set[State]]

The class `RegExp2NFA` maintains two member variables:
- `Sigma` is the <em style="color:blue">alphabet</em>, i.e. the set of characters used.
- `StateCount` is a counter that is needed to create <em style="color:blue">unique</em> state names.

The methods given here are just stubs that are needed by the type checker.  The implementation of these stubs is given later.
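
The pattern of declaring a stub first and attaching the real implementation afterwards can be tried out on a toy class.  The following is only a minimal sketch: the class `Counter` and its method `increment` are hypothetical and not part of this notebook.

```python
class Counter:
    def __init__(self) -> None:
        self.value: int = 0

    def increment(self) -> int:
        return None  # type: ignore  (stub, replaced below)

def increment(self: Counter) -> int:
    self.value += 1
    return self.value

Counter.increment = increment  # type: ignore
del increment                  # remove the module-level name; the method remains

c = Counter()
print(c.increment())  # 1
```

The stub gives the type checker a signature to work with, while the assignment installs the actual behaviour at run time.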

In [9]:
class RegExp2NFA:
    def __init__(self, Sigma: set[Char]):
        self.Sigma     : set[Char] = Sigma
        self.StateCount: int = 0
            
    def toNFA(self, r: RegExp) -> NFA:
        return None # type: ignore
    def genEmptyNFA(self) -> NFA:
        return None # type: ignore
    def genEpsilonNFA(self) -> NFA:
        return None # type: ignore
    def genCharNFA(self, c) -> NFA:
        return None # type: ignore
    def catenate(self, f1, f2) -> NFA:   
        return None # type: ignore
    def disjunction(self, f1, f2) -> NFA:
        return None # type: ignore
    def kleene(self, f) -> NFA:
        return None # type: ignore
    def getNewState(self) -> State:
        return None # type: ignore

The member function `toNFA` takes an object `self` of class `RegExp2NFA` and a regular expression `r` and returns a finite state machine 
that accepts the language described by `r`.  Regular expressions are represented in `Python` as follows:
- The regular expression $\emptyset$ is represented as the number `0`.
- The regular expression $\varepsilon$ is represented as the string `'𝜀'`.
- The regular expression $c$ that matches the character $c$ is represented by the character $c$.
- The regular expression $r_1 \cdot r_2$  is represented by the triple $\bigl(\texttt{repr}(r_1), \texttt{'⋅'}, \texttt{repr}(r_2)\bigr)$ 
  where $\texttt{repr}(r_1)$ and $\texttt{repr}(r_2)$ are the representations of the regular expressions $r_1$ and $r_2$.
  
  Here, and in the following, for a given regular expression $r$ the expression $\texttt{repr}(r)$ denotes the `Python` representation of the regular 
  expression $r$.
- The regular expression $r_1 + r_2$  is represented by the triple $\bigl(\texttt{repr}(r_1), \texttt{'+'}, \texttt{repr}(r_2)\bigr)$.
- The regular expression $r^*$  is represented by the pair $\bigl(\texttt{repr}(r), \texttt{'*'}\bigr)$.

The annotation `# type: ignore` is needed to silence the type checker, which would otherwise complain about the assignment `RegExp2NFA.toNFA = toNFA`.

In [10]:
def toNFA(self: RegExp2NFA, r: RegExp) -> NFA:
    match r:
        case 0: 
            return self.genEmptyNFA()
        case '𝜀': 
            return self.genEpsilonNFA()
        case r if isinstance(r, str) and len(r) == 1: 
            return self.genCharNFA(r)
        case (r1, '⋅', r2):
            return self.catenate(self.toNFA(r1), self.toNFA(r2))
        case (r1, '+', r2):
            return self.disjunction(self.toNFA(r1), self.toNFA(r2))
        case (r1, '*'):
            return self.kleene(self.toNFA(r1))
        case _:
            raise ValueError(f'{r} is not a proper regular expression.') 

RegExp2NFA.toNFA = toNFA # type: ignore
del toNFA

The <span style="font-variant:small-caps;">Nfa</span> `genEmptyNFA()` is defined as
$$\bigl\langle \{ q_0, q_1 \}, \Sigma, \{\}, q_0, \{ q_1 \} \bigr\rangle. $$
Note that this <span style="font-variant:small-caps;">Nfa</span> has no transitions at all.
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the empty set](./aLeer.jpg)

In [11]:
def genEmptyNFA(self: RegExp2NFA) -> NFA:
    q0 = self.getNewState()
    q1 = self.getNewState()
    return {q0, q1}, self.Sigma, {}, q0, { q1 }

RegExp2NFA.genEmptyNFA = genEmptyNFA # type: ignore
del genEmptyNFA

The <span style="font-variant:small-caps;">Nfa</span> `genEpsilonNFA` is defined as
$$  \bigl\langle \{ q_0, q_1 \}, \Sigma, 
                          \bigl\{ \langle q_0, \varepsilon\rangle \mapsto \{q_1\} \bigr\}, q_0, \{ q_1 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the empty string](./aEpsilon.jpg)

In [12]:
def genEpsilonNFA(self: RegExp2NFA) -> NFA:
    q0 = self.getNewState()
    q1 = self.getNewState()
    delta = { (q0, '𝜀'): {q1} }
    return {q0, q1}, self.Sigma, delta, q0, { q1 }

RegExp2NFA.genEpsilonNFA = genEpsilonNFA # type: ignore
del genEpsilonNFA

For a letter $c \in \Sigma$ the <span style="font-variant:small-caps;">Nfa</span> `genCharNFA`$(c)$ is defined as
$$ A(c) = 
   \bigl\langle \{ q_0, q_1 \}, \Sigma, 
   \bigl\{ \langle q_0, c \rangle \mapsto \{q_1\}\bigr\}, q_0, \{ q_1 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![NFA recognizing the character c](./aChar.jpg)
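
Written out as Python tuples, the three base automata differ only in their transition relation.  The sketch below uses the made-up state numbers `1` and `2` and an assumed alphabet `{'a', 'b'}`; the notebook itself obtains fresh states from `getNewState`.

```python
Sigma = {'a', 'b'}
q0, q1 = 1, 2   # made-up state numbers for illustration

empty_nfa   = ({q0, q1}, Sigma, {},                 q0, {q1})  # no transitions at all
epsilon_nfa = ({q0, q1}, Sigma, {(q0, '𝜀'): {q1}}, q0, {q1})  # one 𝜀-transition
char_nfa    = ({q0, q1}, Sigma, {(q0, 'a'): {q1}}, q0, {q1})  # one transition on 'a'

for name, nfa in [('empty', empty_nfa), ('epsilon', epsilon_nfa), ('char', char_nfa)]:
    print(name, nfa[2])   # nfa[2] is the transition relation
```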

In [13]:
def genCharNFA(self: RegExp2NFA, c: str) -> NFA:
    q0 = self.getNewState()
    q1 = self.getNewState()
    delta = { (q0, c): {q1} } 
    return {q0, q1}, self.Sigma, delta, q0, { q1 }

RegExp2NFA.genCharNFA = genCharNFA # type: ignore
del genCharNFA

Given two <span style="font-variant:small-caps;">Nfa</span>s `f1` and `f2`, the function `catenate(f1, f2)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if it can be written 
in the form
$$ s = s_1s_2 $$
and $s_1$ is recognized by `f1` and $s_2$ is recognized by `f2`. 

Assume that $f_1$ and $f_2$ have the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_2 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_3, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.
 
Then $\texttt{catenate}(f_1, f_2)$ is defined as:
$$  \bigl\langle Q_1 \cup Q_2, \Sigma, 
   \bigl\{ \langle q_2,\varepsilon\rangle  \mapsto \{q_3\} \bigr\} 
         \cup \delta_1 \cup \delta_2, q_1, \{ q_4 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the concatenation of two languages](./aConcat.jpg)
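
The construction can be checked by hand on two single-character automata.  The following is a standalone sketch: the function `catenate_sketch` and the state numbers are assumptions made for this example and are independent of the class `RegExp2NFA`.

```python
# Two single-character automata with disjoint state sets (made-up state numbers).
f1 = ({1, 2}, {'a', 'b'}, {(1, 'a'): {2}}, 1, {2})   # accepts 'a'
f2 = ({3, 4}, {'a', 'b'}, {(3, 'b'): {4}}, 3, {4})   # accepts 'b'

def catenate_sketch(f1, f2):
    Q1, Sigma, delta1, q1, A1 = f1
    Q2, _,     delta2, q3, A2 = f2
    (q2,) = A1                  # the unique accepting state of f1
    delta = delta1 | delta2     # dict union (Python 3.9+) builds a new dict
    delta[q2, '𝜀'] = {q3}       # glue: accepting state of f1 -> start state of f2
    return Q1 | Q2, Sigma, delta, q1, A2

Q, Sigma, delta, start, accepting = catenate_sketch(f1, f2)
print(delta)
```

The result contains the two original transitions plus a single $\varepsilon$-transition from state `2` to state `3`.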

In [14]:
def catenate(self: RegExp2NFA, f1: NFA, f2: NFA) -> NFA:
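    Q1, Sigma, delta1, q1, A1 = f1
    Q2, Sigma, delta2, q3, A2 = f2
    q2, = A1                  # extract the unique accepting state of f1
    delta = delta1 | delta2   # Q1 and Q2 are disjoint, so no key is overwritten
    delta[q2, '𝜀'] = {q3}     # connect the accepting state of f1 to the start state of f2
    return Q1 | Q2, Sigma, delta, q1, A2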

RegExp2NFA.catenate = catenate # type: ignore
del catenate

Given two <span style="font-variant:small-caps;">Nfa</span>s `f1` and `f2`, the function `disjunction(f1, f2)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if it is 
recognized by `f1` or by `f2`. 

Assume again that the states of 
$f_1$ and $f_2$ are different and that $f_1$ and $f_2$ have the following form:
- $f_1 = \langle Q_1, \Sigma, \delta_1, q_1, \{ q_3 \}\rangle$,
- $f_2 = \langle Q_2, \Sigma, \delta_2, q_2, \{ q_4 \}\rangle$,
- $Q_1 \cap Q_2 = \{\}$.

Then $\texttt{disjunction}(f_1, f_2)$ is defined as follows:
$$ \bigl\langle \{ q_0, q_5 \} \cup Q_1 \cup Q_2, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_2\},
                   \langle q_3,\varepsilon\rangle \mapsto \{q_5\}, 
                   \langle q_4,\varepsilon\rangle \mapsto \{q_5\} \bigr\} 
                   \cup \delta_1 \cup \delta_2, q_0, \{ q_5 \} \bigr\rangle
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the disjunction](./aPlus.jpg)
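
A standalone sketch of this construction is shown below.  The function `disjunction_sketch`, the state numbers, and the use of `itertools.count` as a source of fresh states are assumptions made for this example; the notebook itself uses `getNewState` for that purpose.

```python
from itertools import count

fresh = count(5)   # hypothetical source of unused state numbers, starting at 5

f1 = ({1, 2}, {'a', 'b'}, {(1, 'a'): {2}}, 1, {2})   # accepts 'a'
f2 = ({3, 4}, {'a', 'b'}, {(3, 'b'): {4}}, 3, {4})   # accepts 'b'

def disjunction_sketch(f1, f2):
    Q1, Sigma, delta1, q1, A1 = f1
    Q2, _,     delta2, q2, A2 = f2
    (q3,) = A1
    (q4,) = A2
    q0, q5 = next(fresh), next(fresh)   # new start state and new accepting state
    delta = delta1 | delta2
    delta[q0, '𝜀'] = {q1, q2}           # branch into either automaton
    delta[q3, '𝜀'] = {q5}               # both old accepting states
    delta[q4, '𝜀'] = {q5}               # lead to the new accepting state
    return {q0, q5} | Q1 | Q2, Sigma, delta, q0, {q5}

Q, Sigma, delta, start, accepting = disjunction_sketch(f1, f2)
print(start, accepting)   # 5 {6}
```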

In [None]:
def disjunction(self: RegExp2NFA, f1: NFA, f2: NFA) -> NFA:
    Q1, Sigma, delta1, q1, A1 = f1
    Q2, Sigma, delta2, q2, A2 = f2
    q3, = A1
    q4, = A2
    q0 = self.getNewState()
    q5 = self.getNewState()
    delta = delta1 | delta2
    delta[q0, '𝜀'] = { q1, q2 }
    delta[q3, '𝜀'] = { q5 }
    delta[q4, '𝜀'] = { q5 }
    return { q0, q5 } | Q1 | Q2, Sigma, delta, q0, { q5 }
    
RegExp2NFA.disjunction = disjunction # type: ignore
del disjunction

Given an <span style="font-variant:small-caps;">Nfa</span> `f`, the function `kleene(f)` 
creates an <span style="font-variant:small-caps;">Nfa</span> that recognizes a string $s$ if it can be written as
$$ s = s_1 s_2 \cdots s_n $$
and all $s_i$ are recognized by `f`.  Note that $n$ might be $0$, in which case $s$ is the empty string.

If `f` is defined as
$$ f = \langle Q, \Sigma, \delta, q_1, \{ q_2 \} \rangle,
$$
then  `kleene(f)` is defined as follows:
$$ \bigl\langle \{ q_0, q_3 \} \cup Q, \Sigma, 
                \bigl\{ \langle q_0,\varepsilon\rangle \mapsto \{q_1, q_3\},  
                        \langle q_2,\varepsilon\rangle \mapsto \{q_1, q_3\} \bigr\} 
                \cup \delta, q_0, \{ q_3 \} \bigr\rangle.
$$
Graphically, this <span style="font-variant:small-caps;">Nfa</span> looks as follows:

![Nfa recognizing the Kleene star](./aStar.jpg)
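
To see that this construction indeed accepts zero or more repetitions, it helps to simulate the resulting automaton.  The sketch below is a standalone illustration: the helpers `epsilon_closure` and `accepts` as well as the function `kleene_sketch` are assumptions made for this example and are not defined elsewhere in this notebook.

```python
def epsilon_closure(S, delta):
    """Return all states reachable from S via 𝜀-transitions only."""
    result, todo = set(S), list(S)
    while todo:
        q = todo.pop()
        for p in delta.get((q, '𝜀'), set()):
            if p not in result:
                result.add(p)
                todo.append(p)
    return result

def accepts(nfa, s):
    """Simulate the NFA on s by tracking the set of reachable states."""
    _, _, delta, q0, A = nfa
    current = epsilon_closure({q0}, delta)
    for c in s:
        step = set()
        for q in current:
            step |= delta.get((q, c), set())
        current = epsilon_closure(step, delta)
    return bool(current & A)

# f accepts exactly 'a'; states 1 and 2 play the roles of q1 and q2.
f = ({1, 2}, {'a'}, {(1, 'a'): {2}}, 1, {2})

def kleene_sketch(f, q0, q3):
    Q, Sigma, delta0, q1, A = f
    (q2,) = A
    delta = dict(delta0)         # copy, so that f itself is left unchanged
    delta[q0, '𝜀'] = {q1, q3}    # enter f, or skip it entirely (n = 0)
    delta[q2, '𝜀'] = {q1, q3}    # repeat f, or leave
    return {q0, q3} | Q, Sigma, delta, q0, {q3}

star = kleene_sketch(f, 0, 3)
print([accepts(star, s) for s in ['', 'a', 'aaa', 'b']])   # [True, True, True, False]
```

The empty string is accepted because of the $\varepsilon$-transition from $q_0$ directly to $q_3$.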

In [None]:
def kleene(self: RegExp2NFA, f: NFA) -> NFA:
    Q, Sigma, delta0, q1, A = f
    q2, = A
    q0 = self.getNewState()
    q3 = self.getNewState()
    delta = dict(delta0)  # copy, so that the transition relation of f is not modified
    delta[q0, '𝜀'] = { q1, q3 }
    delta[q2, '𝜀'] = { q1, q3 }
    return { q0, q3 } | Q, Sigma, delta, q0, { q3 }

RegExp2NFA.kleene = kleene # type: ignore
del kleene

The auxiliary function `getNewState` returns a new number that has not yet been used as a state.

In [None]:
def getNewState(self: RegExp2NFA) -> State:
    self.StateCount += 1
    return self.StateCount

RegExp2NFA.getNewState = getNewState # type: ignore
del getNewState

The notebook `04-Test-Regexp-2-NFA` can be used to test the functions implemented in this notebook.