In [None]:
from IPython.core.display import HTML
with open('../style.css', 'r') as file:
    css = file.read()
HTML(css)

The following cell loads the `mypy` extension for notebooks.  This enables us to check the type annotations of cells.

In [None]:
%load_ext nb_mypy

# A Parser for Regular Expressions

This notebook implements a parser for regular expressions. The parser, whose entry point is the function `parse`, parses a regular expression
according to the following <em style="color:blue">EBNF grammar</em>.
```
   regExp  -> product ('+' product)*
   product -> factor factor*
   factor  -> atom '*'?
   atom    -> '(' regExp ')' | CHAR | '""' | '0'
```
The parse tree is represented as a nested tuple:
- characters are represented by themselves,
- `0` is interpreted as $\emptyset$ and is represented as the integer `0`,
- `""` is interpreted as the regular expression $\varepsilon$ and is represented as the empty string `''`,
- $r_1 \cdot r_2$ is represented as `('cat', `$r_1, r_2$`)`,
- $r_1 + r_2$ is represented as `('or', `$r_1, r_2$`)`,
- $r^*$ is represented as `('star', `$r$`)`.
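
To make the encoding concrete, here is a minimal sketch that renders such a nested tuple back into the notation of the grammar. The alias `Tree` and the function `to_string` are hypothetical helpers introduced only for this illustration; they are not part of the parser.

```python
from typing import Tuple, Union

# Hypothetical alias for this sketch; it mirrors the nested-tuple encoding listed above.
Tree = Union[int, str, Tuple]

def to_string(t: Tree) -> str:
    """Render a parse tree in the grammar's notation, fully parenthesized."""
    if isinstance(t, int):          # the integer 0 stands for the empty set
        return '0'
    if t == '':                     # the empty string stands for epsilon
        return '""'
    if isinstance(t, str):          # a single character stands for itself
        return t
    if t[0] == 'cat':
        return '(' + to_string(t[1]) + to_string(t[2]) + ')'
    if t[0] == 'or':
        return '(' + to_string(t[1]) + '+' + to_string(t[2]) + ')'
    if t[0] == 'star':
        inner = to_string(t[1])
        # A starred star needs parentheses, since the grammar allows only one '*' per atom.
        if isinstance(t[1], tuple) and t[1][0] == 'star':
            inner = '(' + inner + ')'
        return inner + '*'
    raise ValueError(f'not a parse tree: {t!r}')

tree = ('or', ('cat', 'a', ('star', 'b')), '')
print(to_string(tree))  # ((ab*)+"")
```

The tuple `tree` is built by hand here; it is the kind of value the parser is meant to return.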

The parser is implemented as a recursive *top-down* parser.

In [None]:
from typing import List, Union, Tuple, TypeVar, Match

We start with a definition of the type of the parse trees that are generated.  A parse tree is either
* an integer,
* a string, or
* a tuple of parse trees.

In [None]:
ParseTree = TypeVar('ParseTree')

# The type definition for the recursive tuple
ParseTree = Union[int, str, Tuple[ParseTree, ...]]

In order to tokenize strings, we need regular expressions from the module `re`.

In [None]:
import re

The function $\texttt{isWhiteSpace}(s)$ checks whether the string $s$ contains only blanks, tabulators, and newlines.
If this is the case, it returns a `Match` object.  Otherwise, `None` is returned.

In [None]:
def isWhiteSpace(s: str) -> Match[str] | None:
    # Must cover every character matched by the white space alternative in `tokenize`.
    whitespace = re.compile(r'[ \t\n]+')
    return whitespace.fullmatch(s)
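
The choice of `fullmatch` over `match` matters here. The short sketch below uses an illustrative pattern named `blanks` (an assumption made for this example) to show the difference: `match` succeeds as soon as a prefix fits, whereas `fullmatch` requires the whole string to fit.

```python
import re

# Illustrative pattern for this sketch only.
blanks = re.compile(r'[ \t]+')

prefix_only = blanks.match(' \ta')        # succeeds: the prefix ' \t' consists of blanks
whole_mixed = blanks.fullmatch(' \ta')    # fails: the trailing 'a' is not a blank
whole_blank = blanks.fullmatch(' \t')     # succeeds: every character is a blank

print(prefix_only is not None, whole_mixed is not None, whole_blank is not None)
# True False True
```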

The function `tokenize(s)` partitions the string `s` into a list of tokens.
It recognizes 
- the operator symbols `+` and `*`, 
- the parentheses `(`, `)`, 
- single upper or lower case letters, 
- `0`, 
- the empty string `""`.

All whitespace characters are discarded.

In [None]:
def tokenize(s: str) -> List[str]:
    regExp = r'''
              [+*()]   |  # operators and parentheses
              [ \t\n]  |  # white space
              [a-zA-Z] |  # single characters from the alphabet
              0        |  # empty regular expression
              ""          # epsilon
              '''
    return [t for t in re.findall(regExp, s, flags=re.VERBOSE) if not isWhiteSpace(t)]
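
As a quick illustration of the idea, the following sketch tokenizes a sample string with a compact pattern. The name `tokenize_sketch` is hypothetical, and the pattern is a simplified variant that omits the white space alternative altogether. This works because `re.findall` skips any character that matches none of the alternatives, which also means that unknown characters are dropped silently instead of being reported.

```python
import re
from typing import List

def tokenize_sketch(s: str) -> List[str]:
    # Operators and parentheses, single letters, the empty set, epsilon.
    return re.findall(r'[+*()]|[a-zA-Z]|0|""', s)

tokens = tokenize_sketch('(a + b)* ""')
print(tokens)  # ['(', 'a', '+', 'b', ')', '*', '""']
```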

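The parsing functions are mutually recursive.  The following cell provides forward declarations for them, so that each function is already known to the type checker when it is used before its real definition.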

In [None]:
def parseRegExp(TokenList: List[str]) -> Tuple[ParseTree, List[str]]: 
    return None # type: ignore

def parseProduct(TokenList: List[str]) -> Tuple[ParseTree, List[str]]: 
    return None # type: ignore

def parseFactor(TokenList: List[str]) -> Tuple[ParseTree, List[str]]: 
    return None # type: ignore

def parseAtom(TokenList: List[str]) -> Tuple[Union[str, int, ParseTree], List[str]]:
    return None # type: ignore
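
Why such placeholders are harmless at run time can be seen in a small sketch. Python looks up a function name when the call is executed, not when the caller is defined, so a later definition replaces the placeholder. The functions `is_even` and `is_odd` are hypothetical examples and unrelated to the parser; the sketch shows run-time behaviour only and says nothing about how a type checker treats the redefinition.

```python
def is_even(n: int) -> bool:
    return True  # placeholder, replaced by the definition below

def is_odd(n: int) -> bool:
    # The name is_even is resolved when this line runs, not when is_odd is defined.
    return n != 0 and is_even(n - 1)

def is_even(n: int) -> bool:  # real definition, shadows the placeholder
    return n == 0 or is_odd(n - 1)

print(is_odd(3), is_even(3))  # True False
```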

The function `parse` takes a string `s` and tries to parse it as a regular expression.  
The parse tree is returned as a nested tuple.

In [None]:
def parse(s: str) -> ParseTree:
    TokenList = tokenize(s)
    regExp, Rest = parseRegExp(TokenList)
    assert Rest == [], f'Parse Error: could not parse {TokenList}'
    return regExp

The function `parseRegExp` takes a token list `TokenList` and tries to interpret this list
as a regular expression.  It returns the regular expression in the form of a nested tuple and
a list of those tokens that could not be parsed.  It is implemented as a <em style="color:blue">top-down parser</em>.
The function `parseRegExp` implements the following grammar rule:
```
regExp  -> product ('+' product)*
```

In [None]:
def parseRegExp(TokenList: List[str]) -> Tuple[ParseTree, List[str]]:
    result, Rest = parseProduct(TokenList)
    while len(Rest) > 1 and Rest[0] == '+':
        arg, Rest = parseProduct(Rest[1:])
        result = ('or', result, arg)
    return result, Rest
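
Because the loop wraps the tree built so far into a new `'or'` node on every iteration, a chain of alternatives nests to the left. The sketch below reproduces only this folding step with `functools.reduce` on a hand-written list of operands; the name `operands` is an assumption for the example, and no parsing takes place.

```python
from functools import reduce

# Operands as they would be delivered by successive calls to the product parser.
operands = ['a', 'b', 'c']

# Fold from the left, exactly as the while loop does.
tree = reduce(lambda left, right: ('or', left, right), operands)
print(tree)  # ('or', ('or', 'a', 'b'), 'c')
```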

The function `parseProduct` implements the following grammar rule:
```
product -> factor factor*
```

In [None]:
def parseProduct(TokenList: List[str]) -> Tuple[ParseTree, List[str]]:
    result, Rest = parseFactor(TokenList)
    while len(Rest) > 0 and Rest[0] not in ["+", "*", ")"]:
        arg, Rest = parseFactor(Rest)
        result = ('cat', result, arg)
    return result, Rest

The function `parseFactor` implements the following grammar rule:
```
factor  -> atom '*'?
```

In [None]:
def parseFactor(TokenList: List[str]) -> Tuple[ParseTree, List[str]]:
    atom, Rest = parseAtom(TokenList)
    if len(Rest) > 0 and Rest[0] == "*":
        return ('star', atom), Rest[1:]
    return atom, Rest

The function `parseAtom` implements the following grammar rule:
```
atom    -> '0'
         | '(' regExp ')' 
         | '""' 
         | CHAR 
```

In [None]:
def parseAtom(TokenList: List[str]) -> Tuple[Union[str, int, ParseTree], List[str]]:
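    assert len(TokenList) > 0, 'Parse Error: unexpected end of input'
    if TokenList[0] == '0':
        return 0, TokenList[1:]
    if TokenList[0] == '(':
        regExp, Rest = parseRegExp(TokenList[1:])
        # Guard against a missing closing parenthesis at the end of the input.
        assert len(Rest) > 0 and Rest[0] == ")", "Parse Error: expected ')'"
        return regExp, Rest[1:]
    if TokenList[0] == '""':
        return '', TokenList[1:]
    s = TokenList[0]
    # Only single letters are atoms; operators and ')' must not be accepted here.
    assert len(s) == 1 and s.isalpha(), f'Parse Error: {TokenList}'
    return s, TokenList[1:]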

In [None]:
parse('a*b + ba* + 0')
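
Tracing the functions above by hand, this call should produce a tree in which the two `+` operators nest to the left. The literal below is that hand-traced result, not captured notebook output; the name `expected` is introduced only for this sketch.

```python
# Hand-traced parse tree for 'a*b + ba* + 0'.
expected = ('or',
            ('or',
             ('cat', ('star', 'a'), 'b'),     # a*b
             ('cat', 'b', ('star', 'a'))),    # ba*
            0)                                # the empty set

left, right = expected[1], expected[2]
print(left[0], right)  # or 0
```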