# LR parser
This notebook contains both the theory and an implementation of an LR(0) parser, following the
[Dragon Book](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools).

An LR parser is a bottom-up parser that can parse deterministic context-free languages in linear time.
It reads input tokens and combines them into AST nodes in the hope that
at the end the whole input will collapse into one big AST node, which will be the AST itself.
If you are reading this notebook in the hope of understanding LR parsers,
please make sure you already understand what a
[CFG](https://en.wikipedia.org/wiki/Context-free_grammar) and an
[AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) are,
since I won't cover them here.
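To make the "read tokens and collapse them into nodes" idea concrete, here is a deliberately naive sketch. It is not an LR parser: it greedily reduces whenever the top of the stack matches a rule body. The names `RULES`, `symbol_of` and `naive_shift_reduce` and the node layout are made up for this illustration. Greedy reduction happens to work for the parentheses grammar used later in this notebook, but it is not a safe strategy for grammars in general, and deciding when to reduce is exactly what the LR machinery is for.

```python
# Rules of the well-formed parentheses grammar as (head, body) pairs.
RULES = [("S", ("S", "S")), ("S", ("(", "S", ")")), ("S", ("(", ")"))]

def symbol_of(node):
    # A terminal is a plain string; a nonterminal node is [head, child, ...].
    return node if isinstance(node, str) else node[0]

def naive_shift_reduce(tokens):
    stack = []
    for token in tokens:
        stack.append(token)  # shift
        reduced = True
        while reduced:  # keep reducing while some rule body matches the stack top
            reduced = False
            for head, body in RULES:
                size = len(body)
                top = tuple(symbol_of(node) for node in stack[-size:])
                if len(stack) >= size and top == body:
                    # Replace the matched symbols with a single new node.
                    stack[-size:] = [[head] + stack[-size:]]
                    reduced = True
                    break
    # Success means the whole input collapsed into one node.
    return stack[0] if len(stack) == 1 else None

print(naive_shift_reduce("()()"))
print(naive_shift_reduce("(()"))
```

The first call collapses the input into a single `S` node, while the second leaves more than one entry on the stack and therefore returns `None`.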


### LR(0) parser
The LR(0) parser is the simplest one. It is closely related to the SLR parser, which is built on the same states.
* S - Simple. But I can't call it simple: it's much more complex than PEG or LL.
* L - Left-to-right: the parser reads the input from left to right without peeking at the end of the input.
* R - Rightmost derivation in reverse: the parser builds the tree by operating on the right end of the list of nodes.

I don't yet understand what the zero in parentheses means; I will find out when I write the LR(1) parser.

This notebook contains the theory and the code of the LR(0) parser,
split into a bunch of sections. Every section consists of a **header**, a description and `the code itself`.
But before that I will tell you the core idea of the LR parser:
> To build an LR parser, one should take the finite automaton of an LL parser with conflicts and
resolve them by transforming this nondeterministic finite automaton into a deterministic one.
The resulting finite automaton is the LR parser.

I don't expect anyone to understand what I just wrote,
but for me that description of the parser divided my life into before and after:
before I understood the thing and after. So I had to include it in this notebook.
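One way to read that idea, sketched here under my own assumptions: treat every item (a rule with a dot, introduced later in this notebook) as a state of a nondeterministic automaton, and apply the textbook subset construction to it. The function `determinize` and the tiny `item_nfa` below are hypothetical and cover only the rules `E → B`, `B → 0`, `B → 1`. Epsilon moves play the role that `closure()` plays later, and symbol moves play the role of `goto()`.

```python
def determinize(nfa, start, alphabet):
    """Subset construction. `nfa` maps (state, symbol) to a set of states;
    the symbol None marks an epsilon move."""
    def epsilon_closure(states):
        result, work = set(states), list(states)
        while work:
            state = work.pop()
            for target in nfa.get((state, None), ()):
                if target not in result:
                    result.add(target)
                    work.append(target)
        return frozenset(result)

    start_set = epsilon_closure({start})
    dfa, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # All NFA states reachable from the current set by reading `symbol`.
            moved = {t for s in current for t in nfa.get((s, symbol), ())}
            if not moved:
                continue
            target = epsilon_closure(moved)
            dfa[current, symbol] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start_set, dfa

# Hypothetical item NFA for the rules E → B, B → 0, B → 1.
item_nfa = {
    ("E → •B", None): {"B → •0", "B → •1"},  # epsilon moves into rules for B
    ("E → •B", "B"): {"E → B•"},              # moving the dot over a symbol
    ("B → •0", "0"): {"B → 0•"},
    ("B → •1", "1"): {"B → 1•"},
}
start_set, dfa = determinize(item_nfa, "E → •B", ["0", "1", "B"])
print(sorted(start_set))
for (source, symbol), target in dfa.items():
    print(sorted(source), repr(symbol), "->", sorted(target))
```

Each deterministic state is a set of items, which is the representation the rest of the notebook builds by hand.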

### Example context-free grammar
Before doing any kind of experiments with grammars, we need a lab rat.
For that purpose I have copy-pasted the rules of a CFG from [Wikipedia](https://en.wikipedia.org/wiki/Context-free_grammar#Well-formed_parentheses).
But I don't force you to use it:
you can change the variable `grammar_source` to your own lab rat and see if the experiment gives the same outcome.

In [None]:
grammar_source = """
    S → S S
    S → ( S )
    S → ( )
"""

These rules correspond to a context-free grammar
that describes a context-free language containing these sentences:
    
    (), (()), ()(), (()()), ((())()), (()(()(()))), ()()()()()(), (())((()))(((())))

In other words, this is the grammar of well-formed parentheses.
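A quick way to check which sentences a small grammar describes is to enumerate its derivations by brute force. The sketch below is an illustration only (the function `sentences` and its length cut-off are my own choices). It always expands the leftmost variable and relies on the fact that every rule of this particular grammar makes the sentential form longer, so the search terminates.

```python
from collections import deque

RULES = [("S", ("S", "S")), ("S", ("(", "S", ")")), ("S", ("(", ")"))]

def sentences(rules, start, max_len):
    """Enumerate all sentences of at most `max_len` symbols derivable from `start`."""
    variables = {head for head, _ in rules}
    seen, found = {(start,)}, set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if only terminals remain.
        index = next((i for i, s in enumerate(form) if s in variables), None)
        if index is None:
            found.add("".join(form))
            continue
        for head, body in rules:
            if head == form[index]:
                new_form = form[:index] + body + form[index + 1:]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
    return sorted(found, key=lambda s: (len(s), s))

print(sentences(RULES, "S", 4))  # → ['()', '(())', '()()']
```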

### Representing grammar rules
Plain text rules are cool, but we need to represent them with some kind of data structure.
I will use a `namedtuple` for that purpose.

In [None]:
from collections import namedtuple
Rule = namedtuple("Rule", ["head", "body"])
Rule.__str__ = lambda rule: rule[0] + " → " + " ".join(rule[1])
print(Rule("S", ("S", "S")))

S → S S


### Parser of rules
Everyone knows that to write a parser you have to write a parser.
So here is the code of a parser of grammar rules.
We need this parser to translate our lab rat into a list of rules.

In [None]:
def parse_rules(source):
    for rule in source.strip().split("\n"):
        head, body = rule.strip().split(" → ")
        yield Rule(head, tuple(body.split(" ")))
rules = list(parse_rules(grammar_source))  # keep the list: later cells use `rules`
rules

[Rule(head='S', body=('S', 'S')),
 Rule(head='S', body=('(', 'S', ')')),
 Rule(head='S', body=('(', ')'))]

### Derive variables, terminals, start symbol from rules
Mathematically speaking, we [should](https://en.wikipedia.org/wiki/Context-free_grammar#Formal_definitions) specify variables, terminals, rules and a start symbol in order to call it a grammar,
but we (programmers) are too lazy for this.
Why write all this when you can simply extract all the information you need from the rules of the grammar alone?
So instead of specifying all these things, I wrote a function `derive_symbols()`
to derive the terminals and variables from the grammar rules.
The derivation is based on the fact that a variable appears as the head of some rule,
while a terminal may occur only in a body.

In [None]:
################################################################################
def derive_symbols(rules):
    variables = {variable for variable, body in rules}
    terminals = {t for _, body in rules for t in body if t not in variables}
    return variables, terminals
variables, terminals = derive_symbols(rules)
print(f"{variables=}\n{terminals=}")

variables={'S'}
terminals={')', '('}


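I will assume the head of the first rule to be the start symbol.
This assumption is based solely on the fact that it is true for **my** lab rat.
Note that the output below, and every later output in this notebook, comes from a run in which `grammar_source` was changed
to a small expression grammar (its rules are listed in that output), not the parentheses grammar shown earlier.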

In [None]:
import dataclasses
@dataclasses.dataclass(frozen=True)
class Grammar:
    variables: set[str]
    terminals: set[str]
    rules: list[tuple[str, tuple[str, ...]]]
    start: str
        
    def __str__(self):
        s = "start symbol: " + self.start + "\n"
        s += "variables: " + ", ".join(sorted(map(str, self.variables))) + "\n"
        s += "terminals: " + ", ".join(sorted(map(repr, self.terminals))) + "\n"
        rules = [f"{var} → {' '.join(body)}" for var, body in self.rules]
        return s + "rules:\t" + "\n\t".join(sorted(rules)) + "\n"
    
    def __hash__(self):
        return id(self)
start = rules[0][0]  # the head of the first rule is assumed to be the start symbol
grammar = Grammar(variables, terminals, rules, start)
print(grammar)

start symbol: E
variables: B, E
terminals: '*', '+', '0', '1'
rules:	B → 0
	B → 1
	E → B
	E → E * B
	E → E + B



### LR(0) Items
LR(0) items are just rules with a dot in the body, e.g. `E → E •+ B`, `E → •B`, `B → 0•`.
An item indicates that the parser has recognized a string corresponding to the part of the rule before the dot,
e.g. `E → E * •B` means that the parser has recognized `E` and `*` in the input and now expects to read `B`.
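A rule whose body has n symbols gives n + 1 items, one per dot position. As a small illustration, the hypothetical helper `dotted_forms` below (not part of the notebook) lists them as strings in the same notation:

```python
def dotted_forms(head, body):
    """Return every item of the rule `head → body` as a string."""
    forms = []
    for dot in range(len(body) + 1):
        symbols = list(body)
        if dot == len(symbols):
            # Dot at the very end: the whole body has been recognized.
            forms.append(f"{head} → {' '.join(symbols)}•")
        else:
            symbols[dot] = "•" + symbols[dot]
            forms.append(f"{head} → {' '.join(symbols)}")
    return forms

for form in dotted_forms("E", ("E", "+", "B")):
    print(form)
```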

I decided to make a class `Item`. It's absolutely not necessary, and
its only purpose is to print the item nicely.

In [None]:
import dataclasses
@dataclasses.dataclass(frozen=True, order=True)
class Item:
    variable: str
    body: tuple[str]
    dot_position: int
    
    def __str__(self):
        body = list(self.body)
        if self.dot_position == len(body):
            return f"{self.variable} → {' '.join(self.body)}" + "•"
        body[self.dot_position] = "•" + body[self.dot_position]
        return f"{self.variable} → {' '.join(body)}"         
    
    @staticmethod
    def from_str(s):
        variable, body = s.strip().split(" → ")
        body = body.split(" ")
        for dot_position, symbol in enumerate(body + ["•"]):
            if symbol.startswith("•"):
                break
        body = tuple(symbol.strip("•") for symbol in body)
        return Item(variable, body, dot_position)
    
    @property
    def next_symbol(self):
        if self.dot_position == len(self.body):
            return None
        return self.body[self.dot_position]
    

Item.from_str("V → A B •C D")

Item(variable='V', body=('A', 'B', 'C', 'D'), dot_position=2)

### Closure of items
The closure of a set of items is that set plus every item that can be obtained
by pushing the dot from in front of a variable into the body of a rule with that variable as its head,
repeated until nothing new appears, e.g. `closure {E → •B} = {E → •B, B → •0, B → •1}`

In [None]:
def closure(grammar, items):
    rules = grammar.rules
    new_items = set()
    for item in items:
        variable = item.next_symbol
        if variable not in grammar.variables:
            continue
        for head, body in filter(lambda rule: rule[0] == variable, rules):
            new_item = Item(head, body, 0)
            if new_item not in items:
                new_items.add(new_item)
    return closure(grammar, items | new_items) if new_items else items
                    
for item in sorted(map(str, closure(grammar, {Item.from_str("E → E * •B")}))):
    print(item)

B → •0
B → •1
E → E * •B
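The recursive `closure()` recomputes over the whole set on every call. The same result can be reached with a worklist, as in this self-contained sketch. It is an alternative formulation, not the notebook's code: items are plain `(head, body, dot_position)` tuples, and the name `closure_iterative` is my own.

```python
def closure_iterative(rules, variables, items):
    """Worklist version of closure over (head, body, dot_position) tuples."""
    result = set(items)
    worklist = list(items)
    while worklist:
        _, body, dot = worklist.pop()
        # Only a dot in front of a variable can produce new items.
        if dot == len(body) or body[dot] not in variables:
            continue
        for rule_head, rule_body in rules:
            if rule_head == body[dot]:
                new_item = (rule_head, rule_body, 0)
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return result

expression_rules = [
    ("E", ("E", "*", "B")), ("E", ("E", "+", "B")), ("E", ("B",)),
    ("B", ("0",)), ("B", ("1",)),
]
result = closure_iterative(expression_rules, {"E", "B"}, {("E", ("E", "*", "B"), 2)})
for item in sorted(result):
    print(item)
```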


### States (sets of items)
The core idea of the LR parser is that its states are just sets of possible items.
When the parser has already read something from the input, it doesn't "know" yet which
rule it is going to apply and which AST node it is going to build,
but it does know which items correspond to the symbols read so far.
In fact, the set of all items possible in a state fully specifies that state.
And the set of all possible sets of items is finite.
Thus the number of states is finite.
And we are going to precompute all the states!
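To put a number on "finite": the rules fix how many items exist, and a state is a subset of them. The sketch below counts the items of the augmented expression grammar whose rules are printed later in this notebook (the variable names are mine). The resulting bound is very loose, since only 10 states turn out to be reachable.

```python
# Rules of the augmented expression grammar as (head, body) pairs.
augmented_rules = [
    ("START", ("$", "E", "$")),
    ("E", ("E", "*", "B")), ("E", ("E", "+", "B")), ("E", ("B",)),
    ("B", ("0",)), ("B", ("1",)),
]

# A rule with n body symbols has n + 1 dot positions, hence n + 1 items.
item_count = sum(len(body) + 1 for _, body in augmented_rules)
# Every state is a subset of the items, so 2 ** item_count bounds the state count.
upper_bound = 2 ** item_count
print(item_count, upper_bound)  # → 18 262144
```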

To save memory, a set of items can be represented by just those items that
can't be computed as the closure of other items in the set (the kernel items).
For example, the set {`E → E * •B`, `B → •1`, `B → •0`} can be represented by the item
`E → E * •B` alone.
Here is a class `ItemSet` that implements such a representation:

In [None]:
@dataclasses.dataclass(frozen=True)
class ItemSet:
    kernel_items: frozenset[Item]
    domain: Grammar
    
    @staticmethod
    def from_items(grammar, items):
        # I assume an item can be generated by closure() if it has the dot at the
        # beginning of the body.
        # This is not true in the general case, but it is sufficient for LR parser states.
        kernel_items = filter(lambda item: item.dot_position > 0, items)
        kernel_items = frozenset(kernel_items)
        return ItemSet(kernel_items, grammar)
    
    def __iter__(self):
        yield from sorted(closure(self.domain, self.kernel_items))
        
    def __str__(self):
        return "{" + ", ".join(sorted(map(str, self))) + "}"

    def __bool__(self):
        return bool(self.kernel_items)

items = "E → E * •B, B → •1, B → •0".split(", ")
s = ItemSet.from_items(grammar, map(Item.from_str, items))
print("Item set", s, "has kernel items", ", ".join(map(str, s.kernel_items)))

Item set {B → •0, B → •1, E → E * •B} has kernel items E → E * •B


### GOTO

    GOTO(current_parser_state, next_symbol) -> next_parser_state
 
The GOTO function computes the next parser state (item set)
based on the current state (item set) and the next symbol.
Since a state is just a set of items, the function is pretty straightforward:
assuming `next_symbol=Y`, for every item `W → X •Y Z` in the current set of items,
add `W → X Y •Z` to the next state.

In [None]:
def goto(grammar, items, next_symbol):
    next_items = set()
    for item in items:
        if item.next_symbol == next_symbol:
            next_item = Item(item.variable, item.body, item.dot_position + 1)
            next_items.add(next_item)
    return ItemSet.from_items(grammar, next_items)
print(f'goto({s}, "1")  = ', goto(grammar, s, "1"))

goto({B → •0, B → •1, E → E * •B}, "1")  =  {B → 1•}


### The states precomputed
We can use the function `goto()` to precompute all reachable states of the parser.
For that we need a starting state, and therefore a starting item:
a rule that contains the start symbol in its body.
So we augment our grammar with such a rule:
we add a new start symbol `START` and a new rule `START → $ OLD_START $`,
where `$` denotes the start or the end of the input.

In [None]:
def augment_grammar(grammar):
    variables, rules, start = grammar.variables, grammar.rules, grammar.start
    new_start = "START"
    # if variable START is already in the grammar, we use START' or START'' ...
    while new_start in variables:
        new_start += "'"
    new_variables = variables | {new_start}
    new_terminals = grammar.terminals | {"$"}
    new_rules = [(new_start, ("$", start, "$"))] + rules
    return Grammar(new_variables, new_terminals, new_rules, new_start)
augmented_grammar = augment_grammar(grammar)
print(augmented_grammar)

start symbol: START
variables: B, E, START
terminals: '$', '*', '+', '0', '1'
rules:	B → 0
	B → 1
	E → B
	E → E * B
	E → E + B
	START → $ E $



Now we can precompute all the states that are reachable from item `START → $ •OLD_START $`.

In [None]:
def grammar_states(grammar):
    assert grammar.rules[0][0] == grammar.start
    variables, rules, start = grammar.variables, grammar.rules, grammar.start
    symbols = sorted(grammar.variables | grammar.terminals)
    start_state = ItemSet.from_items(grammar, {Item(*rules[0], 1)})
    states = [start_state]
    states_lookup = {start_state}
    processed_states = 0
    while processed_states < len(states):
        state = states[processed_states]
        processed_states += 1
        for symbol in symbols:
            new_state = goto(grammar, state, symbol)
            if new_state and new_state not in states_lookup:
                states_lookup.add(new_state)
                states.append(new_state)
    return states

states = grammar_states(augmented_grammar)
for i, state in enumerate(states):
    print(f"{i}: {state}")

0: {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
1: {B → 0•}
2: {B → 1•}
3: {E → B•}
4: {E → E •* B, E → E •+ B, START → $ E •$}
5: {START → $ E $•}
6: {B → •0, B → •1, E → E * •B}
7: {B → •0, B → •1, E → E + •B}
8: {E → E * B•}
9: {E → E + B•}


### Actions precomputed
We have the states precomputed. Cool! Now let's precompute the actions that should be executed.
For each possible state and each possible terminal on the input we will compute the desired action.

An LR parser supports these types of actions:
1. SHIFT: push the terminal from the input onto the stack.
2. REDUCE: pack a few symbols from the stack into an AST node.
3. ACCEPT: accept the current stack as a successfully built AST.
4. DIE: raise an exception if there is no reasonable action.

In [None]:
def precompute_actions(grammar):
    states, terminals = grammar_states(grammar), sorted(grammar.terminals)
    actions = {}
    for i, state in enumerate(states):
        kernel_item = next(iter(state.kernel_items))
        variable, body = kernel_item.variable, kernel_item.body
        for next_terminal in sorted(grammar.terminals):
            actions[i, next_terminal] = ("die", variable)
        for next_terminal in sorted(grammar.terminals - {'$'}):
            next_state = goto(grammar, state, next_terminal)
            if next_state:
                actions[i, next_terminal] = ("shift", states.index(next_state))
        # Use a separate name for the item so it doesn't read like the state index `i`.
        if any(item.variable == grammar.start and item.dot_position == 2 for item in state.kernel_items):
            actions[i, "$"] = ("accept", variable)
        if kernel_item.dot_position == len(kernel_item.body):
            for next_terminal in sorted(grammar.terminals):
                actions[i, next_terminal] = ("reduce", variable, len(body))
    return actions

actions = precompute_actions(augmented_grammar)
for situation, action in list(actions.items())[:22]:
    state_index, terminal = situation
    print(state_index, repr(terminal), "->", *action, "\t", states[state_index])
if len(actions) > 22: print("...")

0 '$' -> die START 	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
0 '*' -> die START 	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
0 '+' -> die START 	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
0 '0' -> shift 1 	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
0 '1' -> shift 2 	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
1 '$' -> reduce B 1 	 {B → 0•}
1 '*' -> reduce B 1 	 {B → 0•}
1 '+' -> reduce B 1 	 {B → 0•}
1 '0' -> reduce B 1 	 {B → 0•}
1 '1' -> reduce B 1 	 {B → 0•}
2 '$' -> reduce B 1 	 {B → 1•}
2 '*' -> reduce B 1 	 {B → 1•}
2 '+' -> reduce B 1 	 {B → 1•}
2 '0' -> reduce B 1 	 {B → 1•}
2 '1' -> reduce B 1 	 {B → 1•}
3 '$' -> reduce E 1 	 {E → B•}
3 '*' -> reduce E 1 	 {E → B•}
3 '+' -> reduce E 1 	 {E → B•}
3 '0' -> reduce E 1 	 {E → B•}
3 '1' -> reduce E 1 	 {E → B•}
4 '$' -> accept E 	 {E → E •* B, E → E •+ B, START → $ E •$}
4 '*' -> shift 6 	 {E → E •* B, E → E •+ B, STAR

In [None]:
table = [["   "] * (len(augmented_grammar.terminals)) for i in range(len(states))]
terminals = sorted(augmented_grammar.terminals)
for situation, action in list(actions.items()):
    state_index, terminal = situation
    table[state_index][terminals.index(terminal)] = f"{action[0][0]}{str(action[1])[:2]:2}"
print("--", " ".join(map(repr, terminals)))
for i, row in enumerate(table):
    print(f"{i:2}", " ".join(row), "\t", states[i])

-- '$' '*' '+' '0' '1'
 0 dST dST dST s1  s2  	 {B → •0, B → •1, E → •B, E → •E * B, E → •E + B, START → $ •E $}
 1 rB  rB  rB  rB  rB  	 {B → 0•}
 2 rB  rB  rB  rB  rB  	 {B → 1•}
 3 rE  rE  rE  rE  rE  	 {E → B•}
 4 aE  s6  s7  dE  dE  	 {E → E •* B, E → E •+ B, START → $ E •$}
 5 rST rST rST rST rST 	 {START → $ E $•}
 6 dE  dE  dE  s1  s2  	 {B → •0, B → •1, E → E * •B}
 7 dE  dE  dE  s1  s2  	 {B → •0, B → •1, E → E + •B}
 8 rE  rE  rE  rE  rE  	 {E → E * B•}
 9 rE  rE  rE  rE  rE  	 {E → E + B•}
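Note that `precompute_actions()` lets a reduce overwrite any shift recorded for the same state, and it looks at only one kernel item per state. That is safe only when no state mixes a completed item with another completed item or with an item that can still shift a terminal. The following self-contained sketch checks for such conflicts. It rebuilds the states with plain `(head, body, dot)` tuples instead of the notebook's classes, and all function names in it are my own.

```python
def closure(rules, items):
    """Add `B → •γ` for every item that has the dot in front of variable B."""
    variables = {head for head, _ in rules}
    result, work = set(items), list(items)
    while work:
        _, body, dot = work.pop()
        if dot < len(body) and body[dot] in variables:
            for head, rule_body in rules:
                new_item = (head, rule_body, 0)
                if head == body[dot] and new_item not in result:
                    result.add(new_item)
                    work.append(new_item)
    return frozenset(result)

def goto(rules, state, symbol):
    moved = {(h, b, d + 1) for h, b, d in state if d < len(b) and b[d] == symbol}
    return closure(rules, moved)

def lr0_states(rules):
    """Assumes rules[0] is the augmented rule START → $ X $."""
    start = closure(rules, {(rules[0][0], rules[0][1], 1)})
    symbols = sorted({s for _, body in rules for s in body})
    states = [start]
    for state in states:  # the list grows while we iterate over it
        for symbol in symbols:
            target = goto(rules, state, symbol)
            if target and target not in states:
                states.append(target)
    return states

def conflicts(rules):
    variables = {head for head, _ in rules}
    found = []
    for index, state in enumerate(lr0_states(rules)):
        complete = [item for item in state if item[2] == len(item[1])]
        can_shift = any(d < len(b) and b[d] not in variables for _, b, d in state)
        if len(complete) > 1:
            found.append((index, "reduce/reduce"))
        if complete and can_shift:
            found.append((index, "shift/reduce"))
    return found

expression = [("START", ("$", "E", "$")), ("E", ("E", "*", "B")),
              ("E", ("E", "+", "B")), ("E", ("B",)), ("B", ("0",)), ("B", ("1",))]
parentheses = [("START", ("$", "S", "$")), ("S", ("S", "S")),
               ("S", ("(", "S", ")")), ("S", ("(", ")"))]
print(conflicts(expression))
print(conflicts(parentheses))
```

Under this check the expression grammar is conflict-free, while the parentheses grammar with its rule `S → S S` has a state containing both `S → S S•` and items that can still shift `(`, so it is reported as a shift/reduce conflict.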


### Gotos precomputed
We precomputed the actions, so why not precompute the `goto(...)` results too? We need them for nonterminals, to know
which state to go to after reducing something.

In [None]:
def precompute_gotos(grammar):
    states = grammar_states(grammar)
    gotos = {}
    for i, state in enumerate(states):
        for variable in sorted(grammar.variables):
            next_state = goto(grammar, state, variable)
            if next_state:
                gotos[i, variable] = states.index(next_state)
    return gotos
gotos = precompute_gotos(augmented_grammar)
print(gotos)

{(0, 'B'): 3, (0, 'E'): 4, (6, 'B'): 8, (7, 'B'): 9}
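As a cross-check of the two tables, here is a sketch of a recognizer that keeps only state numbers on its stack and builds no AST. The dictionaries are transcribed by hand from the action table and the goto output printed above for the expression grammar. The names `SHIFT`, `REDUCE`, `GOTO`, `ACCEPT` and `recognizes` are not part of the notebook.

```python
# (state, terminal) -> next state, from the "s" entries of the action table.
SHIFT = {(0, "0"): 1, (0, "1"): 2, (4, "*"): 6, (4, "+"): 7,
         (6, "0"): 1, (6, "1"): 2, (7, "0"): 1, (7, "1"): 2}
# state -> (variable, body length); LR(0) reduces without looking at the input.
REDUCE = {1: ("B", 1), 2: ("B", 1), 3: ("E", 1), 8: ("E", 3), 9: ("E", 3)}
# (state, variable) -> next state, copied from the printed gotos.
GOTO = {(0, "B"): 3, (0, "E"): 4, (6, "B"): 8, (7, "B"): 9}
ACCEPT = (4, "$")

def recognizes(text):
    states = [0]
    tokens = list(text) + ["$"]
    position = 0
    while True:
        state, token = states[-1], tokens[position]
        if (state, token) == ACCEPT:
            return True
        if state in REDUCE:
            variable, size = REDUCE[state]
            del states[-size:]  # pop one state per symbol of the rule body
            states.append(GOTO[states[-1], variable])
        elif (state, token) in SHIFT:
            states.append(SHIFT[state, token])
            position += 1
        else:
            return False  # corresponds to the "die" entries

for text in ["1+1", "1*0+1", "1+", "11"]:
    print(repr(text), recognizes(text))
```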


### Runtime of the parser
Using all the precomputed information, we can now write the parser, which uses only the numbers/indexes of the states, not the states themselves.

In [None]:
import itertools

def parser(actions, gotos, DEBUG=False):
    def parse(source):
        stack = [("$", 0)]
        i, tokens = 0, list(source) + ["$"]
        while True:
            token = tokens[i]
            last_token, state = stack[-1]
            action = actions[state, token] 
            if DEBUG: print(stack, "<<<", repr(token), ":", action)
            if action[0] == "shift":
                next_state, i = action[1], i + 1
                stack.append(([token], next_state))
            elif action[0] == "reduce":
                variable, size = action[1:]
                # The last `size` stack entries become the children of the new node.
                stack, node = stack[:-size], stack[-size:]
                next_state = gotos[stack[-1][1], variable]
                stack.append(([variable] + [n for n, _ in node], next_state))
                #stack.append((variable, next_state))
            elif action[0] == "accept":
                return stack[1][0]
            else:
                raise Exception(f"{repr(stack)} <<< {token}")
    return parse

def print_ast(ast, offset = 0):
    head, children = ast[0], ast[1:]
    print(" │" * (offset - 1) + " ├" * bool(offset) + repr(head))
    for child in children:
        print_ast(child, offset + 1)

parse = parser(actions, gotos)
print_ast(parse("1+1"))

'E'
 ├'E'
 │ ├'B'
 │ │ ├'1'
 ├'+'
 ├'B'
 │ ├'1'


# THE END

In [None]:
print_ast(parse("1*0+1"))

'E'
 ├'E'
 │ ├'E'
 │ │ ├'B'
 │ │ │ ├'1'
 │ ├'*'
 │ ├'B'
 │ │ ├'0'
 ├'+'
 ├'B'
 │ ├'1'


In [None]:
print_ast(parse("1"))

'E'
 ├'B'
 │ ├'1'
