In [None]:
from IPython.core.display import HTML
with open ("../../style.css", "r") as file:
    css = file.read()
HTML(css)

# <b>Exercise</b>: Extending a Shift-Reduce Parser

In this exercise your task is to extend the *shift-reduce parser*
that has been discussed in the lecture so that it returns an abstract syntax tree.  You should test it with the program `sum-for.sl` that is given in the directory `Examples`.

In [None]:
cat Examples/sum-for.sl

The grammar that should be used to parse this program is given in the file
`Examples/simple.g`.  It is very similar to the grammar that we have developed previously for our *interpreter*.  I have simplified this grammar at various places to make it more suitable
for the current task.

In [None]:
cat Examples/simple.g

**Exercise 1**:  Generate both the *action table* and the *goto table* for this grammar using the notebook `SLR-Table-Generator.ipynb`.  

## Implementing a Scanner

In [None]:
import re

**Exercise 2:** The function `tokenize(s)` transforms the string `s` into a list of tokens. 
Given the program `sum-for.sl` it should produce the list of tokens shown further below.  Note that a *number* `n` is stored as the pair
```
('NUMBER', n)
```
and an *identifier* `v` is stored as the pair
```
('ID', v).
```
You have to take care of *keywords* like `for` or `while`: Syntactically, they are equal to identifiers, but the scanner should <u>not</u> turn them into pairs but rather return them as strings so that the parser does not mistake them for *identifiers*. 
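
The distinction between keywords and identifiers can be sketched as follows.  This is only an illustration and not the reference solution: the helper `classify` and the set `KEYWORDS` are hypothetical names, and the actual keywords have to be taken from the grammar in `Examples/simple.g`.

```python
import re

# Assumed keyword set; check Examples/simple.g for the real one.
KEYWORDS = {'function', 'for', 'while', 'return'}

def classify(lexeme):
    """Turn a single lexeme into a token in the format described above."""
    if re.fullmatch(r'0|[1-9][0-9]*', lexeme):
        return ('NUMBER', int(lexeme))          # numbers become pairs
    if re.fullmatch(r'[A-Za-z][A-Za-z0-9_]*', lexeme):
        if lexeme in KEYWORDS:
            return lexeme                       # keywords stay plain strings
        return ('ID', lexeme)                   # everything else is an identifier
    return lexeme                               # operators and punctuation

print([classify(x) for x in ['for', 'sum', '42', ':=']])
```

The check against `KEYWORDS` happens only after the lexeme has matched the identifier pattern, so a single regular expression can cover both keywords and identifiers.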

Below is the token list that should be produced from scanning the file `sum-for.sl`:
```
['function',
 ('ID', 'sum'),
 '(',
 ('ID', 'n'),
 ')',
 '{',
 ('ID', 's'),
 ':=',
 ('NUMBER', 0),
 ';',
 'for',
 '(',
 ('ID', 'i'),
 ':=',
 ('NUMBER', 1),
 ';',
 ('ID', 'i'),
 '≤',
 ('ID', 'n'),
 '*',
 ('ID', 'n'),
 ';',
 ('ID', 'i'),
 ':=',
 ('ID', 'i'),
 '+',
 ('NUMBER', 1),
 ')',
 '{',
 ('ID', 's'),
 ':=',
 ('ID', 's'),
 '+',
 ('ID', 'i'),
 ';',
 '}',
 'return',
 ('ID', 's'),
 ';',
 '}',
 ('ID', 'print'),
 '(',
 ('ID', 'sum'),
 '(',
 ('NUMBER', 6),
 ')',
 ')',
 ';']
 ```
For reference, I have given the old implementation of the function `tokenize` that has been used in the notebook `Shift-Reduce-Parser-Pure.ipynb`.  You have to edit this function so that it works with the grammar `simple.g`.

In [None]:
def tokenize(s):
    '''Transform the string s into a list of tokens.  The string s
       is supposed to represent an arithmetic expression.
    '''
    "Edit the code below!"
    lexSpec = r'''([ \t\n]+)      |  # blanks and tabs
                  ([1-9][0-9]*|0) |  # number
                  ([()])          |  # parentheses 
                  ([-+*/])        |  # arithmetical operators
                  (.)                # unrecognized character
               '''
    tokenList = re.findall(lexSpec, s, re.VERBOSE)
    result    = []
    for ws, number, parenthesis, operator, error in tokenList:
        if ws:        # skip blanks and tabs
            continue
        elif number:
            result += [ 'NUMBER' ]
        elif parenthesis:
            result += [ parenthesis ]
        elif operator:
            result += [ operator ]
        elif error:
            result += [ f'ERROR({error})']
    return result

The cell below tests your tokenizer.  Your task is to compare the output with the output shown above.
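
Comparing two long token lists by eye is error prone.  A small helper along the following lines might help; the name `first_mismatch` is hypothetical and the function is only a sketch.

```python
from itertools import zip_longest

_MISSING = object()   # sentinel that is different from every token

def first_mismatch(actual, expected):
    """Return the index of the first differing token, or None if the lists agree."""
    pairs = zip_longest(actual, expected, fillvalue=_MISSING)
    for i, (a, e) in enumerate(pairs):
        if a != e:
            return i
    return None

# The identifier 'i' has wrongly been left as a plain string here.
print(first_mismatch(['for', '(', 'i'], ['for', '(', ('ID', 'i')]))
```

If one list is a proper prefix of the other, the returned index is the length of the shorter list.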

In [None]:
with open('Examples/sum-for.sl', 'r', encoding='utf-8') as file:
    program = file.read()
tokenize(program)  

In [None]:
class ShiftReduceParser():
    def __init__(self, actionTable, gotoTable):
        self.mActionTable = actionTable
        self.mGotoTable   = gotoTable

The function `parse(self, TL)` is called with two arguments:
- `self` is an object of class `ShiftReduceParser` that maintains both an *action table* 
   and a *goto table*.
- `TL` is a list of tokens.  Tokens are either 
   - *literals*, i.e. strings enclosed in single quote characters, 
   - pairs of the form `('NUMBER', n)` where `n` is a natural number, 
   - pairs of the form `('ID', v)` where `v` is the name of an identifier, or 
   - the symbol `$` denoting the *end of input*.
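
Since some tokens are now pairs, they cannot be used directly as keys of the action table.  The sketch below assumes that the generated tables are keyed by the terminal names `'NUMBER'` and `'ID'`, as was the case when `tokenize` still returned the plain string `'NUMBER'`.  The names `table_key`, `lookup`, and `demo_actions` are hypothetical, and the two table entries are made up for illustration.

```python
def table_key(token):
    """Map a token to the terminal symbol under which it is looked up."""
    if isinstance(token, tuple):   # ('NUMBER', n) or ('ID', v)
        return token[0]
    return token                   # literals, keywords, and '$'

# Made-up action table with two entries, for illustration only.
demo_actions = { ('s0', 'NUMBER'): ('shift', 's1'),
                 ('s1', '$'):      'accept' }

def lookup(state, token):
    """Look up the action for the given state and token; default is 'error'."""
    return demo_actions.get((state, table_key(token)), 'error')

print(lookup('s0', ('NUMBER', 7)))
```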

Below, it is assumed that `parse-table.py` is the file that you have created in 
**Exercise 1**.  

In [None]:
%run parse-table.py

**Exercise 3:**
The function `parse` given below is taken from the notebook `Shift-Reduce-Parser.ipynb`. Adapt this function so that it does not just return `True` or `False`
but rather returns a *parse tree* as a nested tuple.  The key idea is that the list `Symbols`
should now be a list of *parse trees* and *tokens* instead of just *syntactical variables* and *tokens*, i.e. the syntactical variables should be replaced by their parse trees.  

It might be useful to implement an auxiliary function `combine_trees` that takes a 
list of parse trees and combines them into a new parse tree.  
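
To illustrate the idea, here is a sketch of a single reduce step that works on a stack of tokens and parse trees.  It is not the full parser: `reduce_step` and `combine_stub` are hypothetical names, and `combine_stub` is only a stand-in for the function `combine_trees`.

```python
def combine_stub(parts):
    """Stand-in for combine_trees: the first string becomes the head,
    the other strings are dropped, everything else becomes a child."""
    strings  = [p for p in parts if isinstance(p, str)]
    children = [p for p in parts if not isinstance(p, str)]
    return (strings[0] if strings else '',) + tuple(children)

def reduce_step(symbols, body_length, combine=combine_stub):
    """Pop body_length entries off the symbol stack and push the combined tree."""
    split = len(symbols) - body_length   # also correct for body_length == 0
    return symbols[:split] + [combine(symbols[split:])]

stack = ['{', ('ID', 's'), ':=', ('NUMBER', 0)]
stack = reduce_step(stack, 3)
print(stack)   # ['{', (':=', ('ID', 's'), ('NUMBER', 0))]
```

Computing the split position explicitly avoids the pitfall that `symbols[-0:]` denotes the whole list, which matters when a rule with an empty body is reduced.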

In [None]:
def parse(self, TL):
    """
    Edit this code so that it returns a parse tree.
    Make use of the auxiliary function combine_trees that you have to
    implement in Exercise 4.
    """
    index   = 0      # points to next token
    Symbols = []     # stack of symbols
    States  = ['s0'] # stack of states, s0 is start state
    TL     += ['$']
    while True:
        q = States[-1]
        t = TL[index]
        print('Symbols:', ' '.join(Symbols + ['|'] + TL[index:]).strip())
        p = self.mActionTable.get((q, t), 'error')
        if p == 'error': 
            return False
        elif p == 'accept':
            return True
        elif p[0] == 'shift':
            s = p[1]
            Symbols += [t]
            States  += [s]
            index   += 1
        elif p[0] == 'reduce':
            head, body = p[1]
            n       = len(body)
            if n > 0:
                Symbols = Symbols[:-n]
                States  = States [:-n]
            Symbols = Symbols + [head]
            state   = States[-1]
            States += [ self.mGotoTable[state, head] ]

ShiftReduceParser.parse = parse
del parse

**Exercise 4:** 
Given a list `TL` of *tokens* and *parse trees*, the function `combine_trees` combines these trees into a new *parse tree*.  The parse trees are represented as *nested tuples*.  The data type of a *nested tuple* is defined recursively:
- A nested tuple is a tuple of the form `(Head,) + Body` where
  * `Head` is a string and
  * `Body` is a tuple of strings, integers, and *nested tuples*. 

When the *nested tuple* `(Head,) + Body` is displayed as a tree, `Head` is used as the label at the root of the tree.  If `len(Body) = n`, then the root has `n` children.  These `n` children are obtained by displaying `Body[0]`, $\cdots$, `Body[n-1]` as trees.
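
The correspondence between a nested tuple and its tree can be made visible with a few lines of code.  The function `show` below is a hypothetical helper and not part of the notebook; it renders every node on its own line and indents the children of a node by two spaces.

```python
def show(tree, depth=0):
    """Return the lines of an indented rendering of a nested tuple."""
    indent = '  ' * depth
    if not isinstance(tree, tuple):      # a string or an integer is a leaf
        return [indent + repr(tree)]
    if tree == ():                       # guard against the empty tuple
        return []
    head, *body = tree
    lines = [indent + repr(head)]        # Head labels the root
    for child in body:                   # Body[0], ..., Body[n-1] are the children
        lines += show(child, depth + 1)
    return lines

print('\n'.join(show(('+', ('ID', 's'), 1))))
```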

In order to convert the list of tokens and parse trees into a nested tuple we need a string that can serve as the `Head` of the parse tree.  The easiest way to do this is to take the first element of `TL` that is a string because the strings in `TL` are keywords like `for` or `while` or they are operator symbols.  The remaining strings after the first in `TL` can be discarded.
If there is no string in `TL`, you can define `Head` as the empty string. 
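
The rule for choosing the head can be tried out in isolation.  The helper `pick_head` is a hypothetical name that is used only in this sketch.

```python
def pick_head(TL):
    """Return the first string in TL, or '' if TL contains no string."""
    strings = [t for t in TL if isinstance(t, str)]
    return strings[0] if strings else ''

# A function call: the opening parenthesis is the first string.
print(pick_head([('ID', 'sum'), '(', ('NUMBER', 6), ')']))   # (
# An assignment: the operator ':=' is the first string.
print(pick_head([('ID', 's'), ':=', ('NUMBER', 0)]))         # :=
# No string at all: the head is empty.
print(pick_head([('NUMBER', 0)]) == '')                      # True
```

The first call shows why a function call ends up with the head `'('` in the parse tree.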

I suggest a *recursive* implementation of this function.
The file `sum-st.pdf` shows the parse tree of the program that is stored in the file `sum-for.sl`.

In [None]:
def combine_trees(TL):
    if len(TL) == 0:
        return ('',)   # empty body: a tree with empty head and no children
    if isinstance(TL, str):
        return (str(TL),)
    Literals = [t for t in TL if     isinstance(t, str)]
    Trees    = [t for t in TL if not isinstance(t, str)]
    if len(Literals) > 0:
        label = Literals[0]
    else:
        label = ''
    result = (label,) + tuple(Trees)
    return result

In [None]:
VoidKeys = { '', '(', ';', 'NUMBER', 'ID' }

**Exercise 5:** 
The function `simplify_tree(tree)` transforms the *parse tree* `tree` into an *abstract syntax tree*.  The parse tree `tree` is represented as a nested tuple of the form
``` 
tree = (head,) + body
```
The function should simplify the `tree` as follows:
- If `head == ''` and `body` is a tuple of length 2 whose first element is the tuple `('',)`,
  i.e. a tree with an empty head and no children, then this tree should be simplified to `body[1]`.
- If `head` does not contain useful information, for example if `head` is the empty string
  or an opening parenthesis and, furthermore,  `body` is a tuple of length 1,
  then this tree should be simplified to `body[0]`.
- By convention, remaining empty `head` labels should be replaced by the label `'.'`
  as this label is traditionally used to construct lists.
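
The second rule can be studied on its own.  The sketch below applies only this rule and is therefore not a replacement for `simplify_tree`; the function name `collapse` is hypothetical, while the set `VoidKeys` is the one defined above.

```python
VoidKeys = { '', '(', ';', 'NUMBER', 'ID' }

def collapse(tree):
    """Apply only the single-child rule: drop a void head that has one child."""
    if not isinstance(tree, tuple) or len(tree) == 0:
        return tree                       # strings and integers are leaves
    head, *body = tree
    if head in VoidKeys and len(body) == 1:
        return collapse(body[0])          # the head carries no information
    return (head,) + tuple(collapse(child) for child in body)

st = (':=', ('ID', 's'), ('', ('', ('', ('NUMBER', 0)))))
print(collapse(st))   # (':=', 's', 0)
```

Since `'NUMBER'` and `'ID'` are elements of `VoidKeys`, the pairs `('NUMBER', 0)` and `('ID', 's')` are reduced to the bare values `0` and `'s'`.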
  
I suggest a *recursive* implementation of this function.
The file `sum-ast.pdf` shows the abstract syntax tree of the program that is stored in the file `sum-for.sl`.

In [None]:
def simplify_tree(tree):
    if isinstance(tree, int) or isinstance(tree, str):
        return tree
    head, *body = tree
    if body == []:
        return tree
    if head == '' and len(body) == 2 and body[0] == ('',):
        return simplify_tree(body[1])
    if head in VoidKeys and len(body) == 1:
        return simplify_tree(body[0])
    body_simplified = simplify_tree_list(body)
    if head == '(' and len(body) == 2:
        return (body_simplified[0],) + body_simplified[1:]
    if head == '':
        head = '.'
    return (head,) + body_simplified

In [None]:
def simplify_tree_list(TL):
    if TL == []:
        return ()
    tree, *Rest = TL
    return (simplify_tree(tree),) + simplify_tree_list(Rest)

## Testing

The notebook `../AST-2-Dot.ipynb` implements the function `tuple2dot(nt)` that displays the nested tuple `nt` as a tree via `graphviz`.

In [None]:
%run ../AST-2-Dot.ipynb

In [None]:
cat -n Examples/sum-for.sl

In [None]:
def test(file):
    with open(file, 'r', encoding='utf-8') as handle:  # do not shadow the parameter
        program = handle.read()
    parser = ShiftReduceParser(actionTable, gotoTable)
    TL     = tokenize(program)
    st     = parser.parse(TL)
    ast    = simplify_tree(st)
    return st, ast

Calling the function `test` below should produce the following nested tuple as *parse tree*:
```
('', ('', ('', ('function', ('ID', 'sum'), ('', ('ID', 'n')), ('', ('', ('', ('',), (';', (':=', ('ID', 's'), ('', ('', ('', ('NUMBER', 0))))))), ('for', (':=', ('ID', 'i'), ('', ('', ('', ('NUMBER', 1))))), ('', ('', ('', ('≤', ('', ('', ('', ('ID', 'i')))), ('', ('*', ('', ('', ('ID', 'n'))), ('', ('ID', 'n')))))))), (':=', ('ID', 'i'), ('+', ('', ('', ('', ('ID', 'i')))), ('', ('', ('NUMBER', 1))))), ('', ('',), (';', (':=', ('ID', 's'), ('+', ('', ('', ('', ('ID', 's')))), ('', ('', ('ID', 'i'))))))))), ('return', ('', ('', ('', ('ID', 's')))))))), (';', ('', ('', ('(', ('ID', 'print'), ('', ('', ('', ('(', ('ID', 'sum'), ('', ('', ('', ('', ('NUMBER', 6)))))))))))))))
```
The file `sum-st.pdf` shows this nested tuple as a tree.

Transforming the parse tree into an *abstract syntax tree* should yield the following nested tuple:
```
('.', ('function', 'sum', 'n', ('.', ('.', (':=', 's', 0), ('for', (':=', 'i', 1), ('≤', 'i', ('*', 'n', 'n')), (':=', 'i', ('+', 'i', 1)), (':=', 's', ('+', 's', 'i')))), ('return', 's'))), ('print', ('sum', 6)))
```
The file `sum-ast.pdf` shows this nested tuple as a tree.

In [None]:
st, ast = test('Examples/sum-for.sl')
print(st)
print(ast)
display(tuple2dot(st))
display(tuple2dot(ast))