<br><br><br><br><br>

# Parsing and interpreting

<br><br><br><br><br>

**Parsing is the conversion of source code text into a tree representing relationships among tokens (words and symbols).**

<img src="https://web.archive.org/web/20180815032316im_/http://www.german-latin-english.com/diagram2.gif" width="50%">

Reports about medicines in newspapers and on television commonly contain little or no information about drugs' risks and cost, and often cite medical "experts" without disclosing their financial ties to the pharmaceutical industry, according to a new study.

   - Susan Okie, The Washington Post (published on June 1, 2000, in Louisville, KY, in The Courier-Journal, page A3)

<img src="https://web.archive.org/web/20181030174508im_/https://www.nltk.org/book/tree_images/ch08-tree-1.png" width="45%"><img src="https://web.archive.org/web/20181030174508im_/https://www.nltk.org/book/tree_images/ch08-tree-2.png" width="45%">

<br>
"How he got into my pajamas, I'll never know." — Groucho Marx

**Grammar:** a list of rules to convert tokens into trees and trees into bigger trees.

```
sentence (S):              noun_phrase verb_phrase
prepositional_phrase (PP): preposition noun_phrase
verb_phrase (VP):          verb noun_phrase | verb noun_phrase prepositional_phrase
noun_phrase (NP):          "John" | "Mary" | "Bob"
                           | determiner noun | determiner noun prepositional_phrase
preposition (P):           "in" | "on" | "by" | "with"
verb (V):                  "saw" | "ate" | "walked"
determiner (Det):          "a" | "an" | "the" | "my"
noun (N):                  "man" | "dog" | "cat" | "telescope" | "park"
```

<center><img src="https://web.archive.org/web/20181030174508im_/https://www.nltk.org/book/tree_images/ch08-tree-4.png" width="20%"><img src="https://web.archive.org/web/20181030174508im_/https://www.nltk.org/book/tree_images/ch08-tree-5.png" width="20%"></center>

<img src="https://web.archive.org/web/20181030174508im_/https://www.nltk.org/images/rdparser1-6.png" width="90%">
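To make the grammar above concrete, here is a minimal sketch of a top-down parser in plain Python that enumerates every tree the rules allow. The `GRAMMAR` dictionary is a hand transcription of the rules listed earlier, and the names `parse`, `match_sequence`, and `all_parses` are made up for this illustration. It is not how NLTK or Lark work internally, and it only terminates because this particular grammar has no left recursion.

```python
# Hand transcription of the toy grammar; each rule maps to its alternatives.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "PP":  [["P", "NP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "NP":  [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"]],
    "P":   [["in"], ["on"], ["by"], ["with"]],
    "V":   [["saw"], ["ate"], ["walked"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N":   [["man"], ["dog"], ["cat"], ["telescope"], ["park"]],
}

def parse(symbol, tokens, start):
    """Yield (tree, end) for each way `symbol` can match tokens[start:end]."""
    if symbol not in GRAMMAR:                      # a literal word
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:             # try every alternative
        for children, end in match_sequence(production, tokens, start):
            yield (symbol,) + children, end

def match_sequence(symbols, tokens, start):
    """Yield (children, end) for each way the symbols can match in order."""
    if not symbols:
        yield (), start
        return
    for tree, middle in parse(symbols[0], tokens, start):
        for rest, end in match_sequence(symbols[1:], tokens, middle):
            yield (tree,) + rest, end

def all_parses(sentence):
    """Return every tree that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

trees = all_parses("John saw the man in the park")
for tree in trees:
    print(tree)
print(len(trees))   # 2: "in the park" can attach to "saw" or to "the man"
```

The two results correspond to the same kind of ambiguity as in the Groucho Marx joke: the prepositional phrase can belong to the verb phrase or to the noun phrase.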

<table><tr><td width="20%">
<img src="https://web.archive.org/web/20190219060132im_/https://ruslanspivak.com/lsbasi-part7/lsbasi_part7_genastdot_01.png" width="100%">
</td><td width="80%">
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">Mathematical expressions and computer programs can be parsed the same way.</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px"><tt>7 + 3 * (10 / (12 / (3 + 1) - 1))</tt></p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">We definitely don't need to write the parsing algorithm ourselves: decades of computer science research have already gone into that.</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">I used to have a favorite (PLY), but while preparing this demo, I found a better one (Lark).</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px"><b>Let's get started!</b></p>
</td></tr></table>

## Lark - a modern parsing library for Python

Parse any context-free grammar, FAST and EASY!

**Beginners**: Lark is not just another parser. It can parse any grammar you throw at it, no matter how complicated or ambiguous, and do so efficiently. It also constructs a parse-tree for you, without additional code on your part.

**Experts**: Lark implements both Earley(SPPF) and LALR(1), and several different lexers, so you can trade-off power and speed, according to your requirements. It also provides a variety of sophisticated features and utilities.

Lark can:

 - Parse all context-free grammars, and handle any ambiguity
 - Build a parse-tree automagically, no construction code required
 - Outperform all other Python libraries when using LALR(1) (Yes, including PLY)
 - Run on every Python interpreter (it's pure-python)
 - Generate a stand-alone parser (for LALR(1) grammars)

In [1]:
import lark

expression_grammar = """
expression: term   | term "+" term     -> add | term "-" term     -> sub
term:       factor | factor "*" factor -> mul | factor "/" factor -> div
factor:     power  | "+" factor        -> pos | "-" factor        -> neg
power:      call ("**" factor)?
call:       atom   | call trailer
atom:       "(" expression ")" | CNAME -> symbol | NUMBER -> literal
trailer:    "(" arglist ")"
arglist:    expression ("," expression)*

%import common.CNAME
%import common.NUMBER
%import common.WS

%ignore WS
"""
grammar = "start: expression\n" + expression_grammar

parser = lark.Lark(grammar)

In [2]:
print(parser.parse("2 + 2").pretty())

start
  add
    term
      factor
        power
          call
            literal	2
    term
      factor
        power
          call
            literal	2



<table><tr><td width="35%">
<img src="https://web.archive.org/web/20190219060132im_/https://ruslanspivak.com/lsbasi-part7/lsbasi_part7_ast_02.png" width="100%">
</td><td width="65%">
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">The parse tree has too much detail: it includes a node for every rule, even rules that exist only to set up operator precedence.</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">Let's reduce it to a tree that contains only what is necessary to understand the meaning of the program.</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">Such a tree is called an <b>Abstract Syntax Tree</b> (AST).</p>
<p style="font-size: 14px; margin-left: 10px; margin-bottom: 30px">This is easy enough (and particular enough to our specific needs) that we should write it ourselves.</p>
</td></tr></table>

In [3]:
class AST:                                       # only three types (and a superclass to set them up)
    _fields = ()
    def __init__(self, *args, line=None):
        self.line = line
        for n, x in zip(self._fields, args):
            setattr(self, n, x)
            if self.line is None: self.line = getattr(x, "line", None)

class Literal(AST):                              # Literal: value that appears in the program text
    _fields = ("value",)
    def __str__(self): return str(self.value)

class Symbol(AST):                               # Symbol: value referenced by name
    _fields = ("symbol",)
    def __str__(self): return self.symbol

class Call(AST):                                 # Call: evaluate a function on arguments
    _fields = ("function", "arguments")
    def __str__(self):
        return "{0}({1})".format(str(self.function), ", ".join(str(x) for x in self.arguments))

In [4]:
def toast(ptnode):  # Recursively convert parsing tree (PT) into abstract syntax tree (AST).
    if ptnode.data in ("add", "sub", "mul", "div", "pos", "neg"):
        arguments = [toast(x) for x in ptnode.children]
        return Call(Symbol(ptnode.data, line=arguments[0].line), arguments)
    elif ptnode.data == "power" and len(ptnode.children) == 2:
        arguments = [toast(ptnode.children[0]), toast(ptnode.children[1])]
        return Call(Symbol("power", line=arguments[0].line), arguments)
    elif ptnode.data == "call" and len(ptnode.children) == 2:
        return Call(toast(ptnode.children[0]), toast(ptnode.children[1]))
    elif ptnode.data == "symbol":
        return Symbol(ptnode.children[0], line=ptnode.children[0].line)
    elif ptnode.data == "literal":
        return Literal(float(ptnode.children[0]), line=ptnode.children[0].line)
    elif ptnode.data == "arglist":
        return [toast(x) for x in ptnode.children]
    else:
        return toast(ptnode.children[0])    # many other cases, all of them simple pass-throughs

print(toast(parser.parse("2 + 2")))

add(2.0, 2.0)


## Execution

The simplest way to run a program is to walk over the AST, evaluating each node as it is visited. A program that does this is an **interpreter**.

_Historical interlude:_

   * The first high-level programming language, [Short Code](https://www.computer.org/csdl/magazine/an/1988/01/man1988010007/13rRUxCitB8) ("Short Order Code"), was interpreted.
   * Proposed by a physicist, John Mauchly, in 1949 and implemented for UNIVAC I in 1950.
   * It ran 50× slower than the corresponding machine instructions.
   * His company hired Grace Hopper, who improved the situation by developing the first compilers (the A-0 system in 1952; her later FLOW-MATIC language was a major basis for COBOL in 1959).

A **compiler** scans the AST to generate a sequence of machine instructions, natively recognized and executed by the computer.

<img src="https://web.archive.org/web/20190322182736im_/https://ruslanspivak.com/lsbasi-part1/lsbasi_part1_compiler_interpreter.png" width="80%">
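To illustrate the difference, here is a minimal sketch of a "compiler" for the same three AST node types. It does not emit real machine instructions: as a stand-in, it flattens the tree into a list of instructions for a toy stack machine, which a separate loop then executes. The opcode names (`PUSH`, `LOAD`, `CALL`) and the functions `compile_ast` and `run` are assumptions made for this example, and the AST classes are trimmed-down stand-ins for the ones defined above.

```python
import operator

class Literal:
    def __init__(self, value):
        self.value = value

class Symbol:
    def __init__(self, symbol):
        self.symbol = symbol

class Call:
    def __init__(self, function, arguments):
        self.function, self.arguments = function, arguments

def compile_ast(node, program=None):
    """Walk the tree once, appending instructions in the order they must run."""
    if program is None:
        program = []
    if isinstance(node, Literal):
        program.append(("PUSH", node.value))
    elif isinstance(node, Symbol):
        program.append(("LOAD", node.symbol))
    elif isinstance(node, Call):
        compile_ast(node.function, program)          # function first,
        for argument in node.arguments:              # then its arguments,
            compile_ast(argument, program)
        program.append(("CALL", len(node.arguments)))  # then the call itself
    return program

def run(program, **symbols):
    """Execute the flat instruction list; no tree is needed at this point."""
    stack = []
    for opcode, operand in program:
        if opcode == "PUSH":
            stack.append(operand)
        elif opcode == "LOAD":
            stack.append(symbols[operand])
        elif opcode == "CALL":
            arguments = [stack.pop() for _ in range(operand)][::-1]
            function = stack.pop()
            stack.append(function(*arguments))
    return stack.pop()

# Hand-built AST for the expression 2 + x * 3
tree = Call(Symbol("add"), [Literal(2.0), Call(Symbol("mul"), [Symbol("x"), Literal(3.0)])])
program = compile_ast(tree)
for instruction in program:
    print(instruction)
print(run(program, add=operator.add, mul=operator.mul, x=5))   # 17.0
```

The tree is traversed only once, in `compile_ast`; `run` can then be called many times on the flat list, which is the essential advantage a compiler has over re-walking the AST.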

In [5]:
def interpreter(astnode, **symbols):
    if isinstance(astnode, Literal):
        return astnode.value

    elif isinstance(astnode, Symbol):
        return symbols[astnode.symbol]

    elif isinstance(astnode, Call):
        function = interpreter(astnode.function, **symbols)
        arguments = [interpreter(x, **symbols) for x in astnode.arguments]
        return function(*arguments)

import math, operator
interpreter(toast(parser.parse("2 + 2")),
            add = operator.add, sub = operator.sub, mul = operator.mul, div = operator.truediv,
            pos = operator.pos, neg = operator.neg, power = math.pow, sqrt = math.sqrt, abs = abs,
            x = 5)

4.0

## What about errors?

If a bad condition is encountered, like `sqrt(-5)`, the interpreter stops because the underlying Python execution engine raises an exception.

When writing a language, we must distinguish between our own internal errors and the users' logic mistakes. In the latter case, we have to let them know that they can fix it and provide a hint about where to start.

Line numbers are the most useful hint—but only when they're lines in the user's code, not the execution engine itself. The parser knows about line numbers—we must propagate that information into the AST (for an interpreter) and the final executable (for a compiler with debugging symbols included).

In [6]:
# We've already propagated line numbers from parsing tree tokens to all AST nodes.
def showline(ast):
    if isinstance(ast, list):
        for x in ast:
            showline(x)
    elif isinstance(ast, AST):
        print("{0:5s} {1:10s} {2}".format(str(ast.line), type(ast).__name__, ast))
        for n in ast._fields:
            showline(getattr(ast, n))

print("{0:5s} {1:10s} {2}".format("line", "AST type", "expression"))
print("--------------------------------------------------------")
showline(toast(parser.parse("""sqrt(-5)""")))

line  AST type   expression
--------------------------------------------------------
1     Call       sqrt(neg(5.0))
1     Symbol     sqrt
1     Call       neg(5.0)
1     Symbol     neg
1     Literal    5.0


In [16]:
# Short exercise: change the line marked CHANGE THIS LINE to report UserErrors with line numbers from the source code.

class UserError(Exception): pass

def interpreter(astnode, **symbols):
    if isinstance(astnode, Literal):
        return astnode.value
    elif isinstance(astnode, Symbol):
        return symbols[astnode.symbol]
    elif isinstance(astnode, Call):
        function = interpreter(astnode.function, **symbols)
        arguments = [interpreter(x, **symbols) for x in astnode.arguments]
        try:
            return function(*arguments)
        except Exception as err:
            raise err   # CHANGE THIS LINE

interpreter(toast(parser.parse("""sqrt(-5)""")), **{**operator.__dict__, **math.__dict__}, x = 5)

ValueError: math domain error
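One possible answer to the exercise, as a sketch rather than the only solution: wrap the original exception in a `UserError` whose message carries the line number stored on the `Call` node. To keep the example runnable without the parser, the AST classes are trimmed-down stand-ins for the ones defined earlier, and the tree for `sqrt(-5)` is built by hand with an assumed line number of 3.

```python
import math
import operator

class UserError(Exception):
    """An error caused by the user's program, not by the interpreter itself."""

class AST:                      # trimmed-down stand-in for the class defined earlier
    _fields = ()
    def __init__(self, *args, line=None):
        self.line = line
        for name, value in zip(self._fields, args):
            setattr(self, name, value)

class Literal(AST):
    _fields = ("value",)

class Symbol(AST):
    _fields = ("symbol",)

class Call(AST):
    _fields = ("function", "arguments")

def interpreter(astnode, **symbols):
    if isinstance(astnode, Literal):
        return astnode.value
    elif isinstance(astnode, Symbol):
        return symbols[astnode.symbol]
    elif isinstance(astnode, Call):
        function = interpreter(astnode.function, **symbols)
        arguments = [interpreter(x, **symbols) for x in astnode.arguments]
        try:
            return function(*arguments)
        except Exception as err:
            # Report the failure against the user's source line and keep the
            # original exception attached as __cause__ for debugging.
            raise UserError("{0} on line {1}: {2}".format(
                type(err).__name__, astnode.line, err)) from err

# Hand-built AST for sqrt(-5), pretending it appeared on line 3 of the user's program.
bad = Call(Symbol("sqrt", line=3),
           [Call(Symbol("neg", line=3), [Literal(5.0, line=3)], line=3)],
           line=3)

try:
    interpreter(bad, sqrt=math.sqrt, neg=operator.neg)
except UserError as err:
    message = str(err)
    print(message)              # begins with "ValueError on line 3:"
```

Because the arguments are evaluated outside the `try` block, a `UserError` raised by a nested call propagates unchanged instead of being wrapped a second time, so the reported line is the innermost one that failed.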

## Assignments

So far, all we've implemented is a calculator. As a next step, let's extend the language to include assignments.

For clarity, we'll use `:=` as an assignment operator.

Blocks will be sets of 

In [None]:
class SymbolTable:
    def __init__(self, parent=None, **symbols):
        self.parent = parent
        self.symbols = symbols

    def __getitem__(self, symbol):
        if symbol in self.symbols:
            return self.symbols[symbol]
        elif self.parent is not None:
            return self.parent[symbol]
        else:
            raise KeyError(symbol)

    def __setitem__(self, symbol, value):
        self.symbols[symbol] = value

builtins = SymbolTable()
builtins["add"] = lambda x, y: x + y
builtins["sub"] = lambda x, y: x - y
builtins["mul"] = lambda x, y: x * y
builtins["div"] = lambda x, y: x / y
builtins["pos"] = lambda x: x
builtins["neg"] = lambda x: -x
builtins["power"] = lambda x, y: x**y
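
The point of the `parent` link is that lookups fall back to enclosing scopes while assignments stay local. A small sketch of that behavior follows; the class is repeated so the snippet runs on its own, and the table names `root`, `outer`, and `inner` and the symbol `pi` are made up for this illustration.

```python
class SymbolTable:              # same class as above, repeated to be self-contained
    def __init__(self, parent=None, **symbols):
        self.parent = parent
        self.symbols = symbols

    def __getitem__(self, symbol):
        if symbol in self.symbols:
            return self.symbols[symbol]
        elif self.parent is not None:
            return self.parent[symbol]
        else:
            raise KeyError(symbol)

    def __setitem__(self, symbol, value):
        self.symbols[symbol] = value

root = SymbolTable(pi=3.14159)          # outermost scope
outer = SymbolTable(root, x=1)          # sees pi through its parent
inner = SymbolTable(outer)              # sees x and pi through the chain

inner["x"] = 2                          # shadows outer's x, does not overwrite it
print(inner["x"], outer["x"], inner["pi"])   # 2 1 3.14159

try:
    inner["missing"]
except KeyError as err:
    print("undefined symbol:", err)     # the lookup ran out of parents
```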

In [None]:
def interpreter(astnode, symboltable):
    if isinstance(astnode, Literal):
        return astnode.value
    elif isinstance(astnode, Symbol):
        return symboltable[astnode.symbol]
    elif isinstance(astnode, Call):
        function = interpreter(astnode.function, symboltable)
        arguments = [interpreter(x, symboltable) for x in astnode.arguments]
        return function(*arguments)

interpreter(toast(parser.parse("2 + 2")), SymbolTable(builtins))
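
The interpreter above does not yet handle assignments. As a sketch of where this could go, the example below adds a hypothetical `Assign` AST node for `symbol := expression` and one extra branch in the interpreter. The grammar rule for `:=` is not shown in this section, so the trees are built by hand, and the AST classes are simplified stand-ins for the ones defined earlier.

```python
class SymbolTable:              # same class as above, repeated to be self-contained
    def __init__(self, parent=None, **symbols):
        self.parent = parent
        self.symbols = symbols

    def __getitem__(self, symbol):
        if symbol in self.symbols:
            return self.symbols[symbol]
        elif self.parent is not None:
            return self.parent[symbol]
        else:
            raise KeyError(symbol)

    def __setitem__(self, symbol, value):
        self.symbols[symbol] = value

class Literal:
    def __init__(self, value):
        self.value = value

class Symbol:
    def __init__(self, symbol):
        self.symbol = symbol

class Call:
    def __init__(self, function, arguments):
        self.function, self.arguments = function, arguments

class Assign:                   # hypothetical node for "symbol := expression"
    def __init__(self, symbol, expression):
        self.symbol, self.expression = symbol, expression

def interpreter(astnode, symboltable):
    if isinstance(astnode, Literal):
        return astnode.value
    elif isinstance(astnode, Symbol):
        return symboltable[astnode.symbol]
    elif isinstance(astnode, Call):
        function = interpreter(astnode.function, symboltable)
        arguments = [interpreter(x, symboltable) for x in astnode.arguments]
        return function(*arguments)
    elif isinstance(astnode, Assign):
        value = interpreter(astnode.expression, symboltable)
        symboltable[astnode.symbol] = value     # bind in the current scope only
        return value

builtins = SymbolTable(add=lambda x, y: x + y, mul=lambda x, y: x * y)
scope = SymbolTable(builtins)

# Equivalent of:  x := 2 + 2   followed by   y := x * 3
interpreter(Assign("x", Call(Symbol("add"), [Literal(2.0), Literal(2.0)])), scope)
result = interpreter(Assign("y", Call(Symbol("mul"), [Symbol("x"), Literal(3.0)])), scope)
print(result)                   # 12.0
```

Writing into `scope` rather than `builtins` keeps user variables separate from the built-in functions, which is the reason for wrapping `builtins` in a fresh `SymbolTable` before running a program.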