# Languages, Parsing, and Abstract Syntax Trees

These notes cover several background concepts, definitions, and constructs that will allow us to describe and reason about programming languages.

## Dependencies

In [73]:
# Presentation dependencies.
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mp
import matplotlib.pyplot as plt
from importlib import reload
from IPython.display import Image
from IPython.display import display_html
from IPython.display import display
from IPython.display import Math
from IPython.display import Latex
from IPython.display import HTML

# Content dependencies (also reproduced inline).
from random import randint
from itertools import permutations
from functools import reduce
from tqdm import tqdm

## Mathematical Background and Prerequisites

**Basic logic.** It is expected that the reader is familiar with the basic concepts of mathematical logic, including formulas, predicates, relational operators, and terms.

**Basic set theory.** It is expected that the reader is familiar with the notion of a set (both finite and infinite), set size, and set comprehension notation (e.g., $\{1,2,3,4\}$ is a set and $\{x \ | \ x \in \mathbb{Z}, x > 0 \} = \{1, 2, 3, ...\}$). It is also expected that the reader is familiar with the membership and subset relations between elements and sets (e.g., $1 \in \{1,2,3\}$ and $\{2,3\} \subset \{1,2,3\}$).
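As a quick refresher, the set notation above maps fairly directly onto Python's built-in `set` type. The sketch below is only illustrative: Python sets must be finite, so the infinite set of positive integers is truncated to an arbitrary bound of 10, and the variable names are chosen for this example.

```python
# A finite set written out explicitly.
s = {1, 2, 3, 4}

# Set comprehension. The bound of 10 is an arbitrary cutoff, since
# {x | x in Z, x > 0} itself is infinite and cannot be stored.
positives = {x for x in range(-10, 11) if x > 0}

print(len(s))               # set size
print(1 in {1, 2, 3})       # membership relation
print({2, 3} < {1, 2, 3})   # proper subset relation
print({2, 3} <= {2, 3})     # subset relation (equality allowed)
```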


## Defining Programming Languages

In order to define and reason about programming languages, and in order to write automated tools and algorithms that can operate on programs written in those languages, we must be able to define formally (i.e., mathematically) what a programming language is and what a program is.

### Sets of Character Strings

In computer science, one common way to mathematically model a formal language is to introduce a finite set of symbols (an *alphabet*). A *language* is then any subset of the set of all strings consisting of symbols from that alphabet.
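To make this definition concrete, the following sketch enumerates every string over a two-symbol alphabet up to a fixed length and then picks out one subset as a language. The alphabet, the length bound, the helper `strings_up_to`, and the rule "no two adjacent `b` symbols" are all assumptions made for illustration.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Return every string over `alphabet` of length 0 through `max_len`."""
    return {
        "".join(symbols)
        for n in range(max_len + 1)
        for symbols in product(sorted(alphabet), repeat=n)
    }

alphabet = {"a", "b"}

# The full set of strings is infinite, so only a finite slice is built here.
universe = strings_up_to(alphabet, 3)

# One possible language: strings that never contain two adjacent b's.
language = {s for s in universe if "bb" not in s}

print(len(universe))          # 1 + 2 + 4 + 8 = 15
print(language <= universe)   # a language is a subset of all strings
print(sorted(language, key=lambda s: (len(s), s)))
```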

### Sets of Token Sequences

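Unlike human languages, programming languages usually have a relatively small collection of symbol strings (e.g., commands or instructions) that are used to construct programs. Thus, we can adjust the definition of what constitutes a language to account for this: we treat each such symbol string as a single *token*, and we view a program as a sequence of tokens rather than a sequence of individual characters.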
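One way to picture the shift from characters to tokens is a small tokenizer. The sketch below assumes a hypothetical token set for a tiny language of Boolean formulas; the names `TOKENS`, `TOKEN_PATTERN`, and `tokenize` are invented for this example and do not come from any particular language definition.

```python
import re

# Hypothetical token set for a tiny Boolean-formula language.
TOKENS = ["true", "false", "and", "or", "not", "(", ")"]

# A single pattern that matches any one token from the set.
TOKEN_PATTERN = re.compile("|".join(re.escape(t) for t in TOKENS))

def tokenize(text):
    """Split `text` into tokens, rejecting anything outside the token set."""
    tokens = []
    position = 0
    while position < len(text):
        # Whitespace separates tokens but is not itself a token.
        if text[position].isspace():
            position += 1
            continue
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise ValueError(f"unexpected input at position {position}")
        tokens.append(match.group())
        position = match.end()
    return tokens

print(tokenize("not (true and false)"))
# ['not', '(', 'true', 'and', 'false', ')']
```

Under this view, a language is a set of lists like the one printed above, drawn from all possible sequences over `TOKENS`.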

If a language is just a subset of the set of all possible token sequences, how do we mathematically specify interesting subsets?

### Formal Tools for Defining Languages

A variety of formal tools and techniques exist within computer science and mathematics to formally define, identify, or delineate interesting subsets of the set of all strings. These range in their expressive power from [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (or, equivalently, [finite state automata](https://en.wikipedia.org/wiki/Finite-state_machine)) to [Turing machines](https://en.wikipedia.org/wiki/Turing_machine). A full review of these techniques is beyond the scope of this course.
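As a small illustration of the least expressive end of that range, a regular expression can act as a membership test for a language. The language chosen here (unsigned binary numerals without leading zeros) and the helper name `in_language` are assumptions for the sake of the example.

```python
import re

# Illustrative regular language: the numeral "0", or a "1" followed
# by any number of binary digits.
BINARY_NUMERAL = re.compile(r"0|1[01]*")

def in_language(s):
    """Return True if the whole string `s` belongs to the language."""
    return BINARY_NUMERAL.fullmatch(s) is not None

for candidate in ["0", "101", "011", ""]:
    print(repr(candidate), in_language(candidate))
```

Note the use of `fullmatch`: membership is a property of the entire string, so a match on a prefix alone is not enough.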

In these notes, we will focus on a common formal technique for defining languages that has been adopted by the creators of programming languages: [context-free grammars](https://en.wikipedia.org/wiki/Context-free_grammar) (CFGs). While CFGs are not the most powerful formal tool for defining languages, they are sufficient to define the kinds of recursive language patterns found in programming languages that are used in practice. The most common representation for such grammars is [Backus-Naur Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form), abbreviated henceforward as BNF.

The production rules in a grammar's BNF representation can be viewed either as a way to construct an element of the language (i.e., a token sequence that is a program), or as a way to break down (i.e., *parse*) a token sequence piece by piece until nothing is left.
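Both readings of the production rules can be sketched for a small example grammar. The grammar below (Boolean formulas in prefix notation) is a hypothetical one chosen because each rule starts with a distinct token, which keeps the parser short; `generate`, `parse_formula`, and `is_program` are names made up for this sketch.

```python
# Hypothetical grammar in BNF (prefix notation):
#
#   formula ::= true
#             | false
#             | not formula
#             | and formula formula
#             | or formula formula

def generate(depth):
    """Construct token sequences by applying rules up to `depth` levels deep."""
    yield ["true"]
    yield ["false"]
    if depth > 0:
        for sub in generate(depth - 1):
            yield ["not"] + sub
        for op in ("and", "or"):
            for left in generate(depth - 1):
                for right in generate(depth - 1):
                    yield [op] + left + right

def parse_formula(tokens):
    """Consume one formula from the front of `tokens`.

    Returns the tokens left over, or None if no production rule applies.
    """
    if not tokens:
        return None
    head, rest = tokens[0], tokens[1:]
    if head in ("true", "false"):
        return rest
    if head == "not":
        return parse_formula(rest)
    if head in ("and", "or"):
        after_left = parse_formula(rest)
        if after_left is None:
            return None
        return parse_formula(after_left)
    return None

def is_program(tokens):
    """A token sequence is in the language if parsing leaves nothing over."""
    return parse_formula(tokens) == []

print(is_program(["and", "true", "not", "false"]))  # True
print(is_program(["and", "true"]))                  # False
print(all(is_program(p) for p in generate(1)))      # True
```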

## Parsing, Abstract Syntax, and Abstract Syntax Trees

Given a programming language definition, we want to have the ability to operate on programs written in that language using an automated process. To do so, we need to convert the character string representations of programs in that programming language (i.e., the concrete syntax representation) into instances of a data structure; each data structure instance would then be a representation of a program in that language as data on which we can operate.
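The following sketch shows what such a conversion might look like for a hypothetical prefix-notation language of Boolean formulas. The nested dictionary representation, and the function names `parse` and `evaluate`, are assumptions made for this example; other encodings (classes, tuples) would work equally well.

```python
def parse(tokens):
    """Parse one prefix-notation formula; return (tree, remaining_tokens)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head in ("true", "false"):
        # Leaves are represented by the strings "True" and "False".
        return head.capitalize(), rest
    if head == "not":
        operand, rest = parse(rest)
        return {"Not": [operand]}, rest
    if head in ("and", "or"):
        left, rest = parse(rest)
        right, rest = parse(rest)
        return {head.capitalize(): [left, right]}, rest
    raise ValueError(f"unexpected token {head!r}")

def evaluate(tree):
    """Operate on the data structure: compute the formula's truth value."""
    if tree == "True":
        return True
    if tree == "False":
        return False
    (label, children), = tree.items()
    values = [evaluate(child) for child in children]
    if label == "Not":
        return not values[0]
    if label == "And":
        return values[0] and values[1]
    return values[0] or values[1]

tree, leftover = parse("and true not false".split())
print(tree)            # {'And': ['True', {'Not': ['False']}]}
print(leftover)        # []
print(evaluate(tree))  # True
```

Once the program is data, `evaluate` is just one of many possible operations; the same tree could be pretty-printed, simplified, or translated.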

### Abstract Syntax

### Parsing Python using Python

The [`ast`](https://docs.python.org/3/library/ast.html) module in Python provides an abstract syntax data structure, as well as associated functionalities (such as parsing a concrete syntax instance into an AST and performing transformations on an AST).

In [74]:
import ast

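# Programmatically build an AST for the
# Python concrete syntax expression `-5`.
# `ast.Constant` is used for the literal because `ast.Num`
# was deprecated in Python 3.8 and removed in Python 3.12.
node =\
  ast.UnaryOp(
    ast.USub(),
    ast.Constant(5, lineno=0, col_offset=0),
    lineno=0,
    col_offset=0
  )

ast.dump(node)

'UnaryOp(op=USub(), operand=Constant(value=5))'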
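The same tree can also be obtained by parsing the concrete syntax, and an AST can be transformed before it is evaluated. The sketch below uses `ast.parse`, `ast.NodeTransformer`, and `ast.fix_missing_locations` from the standard library; the transformation itself (`DropNegation`, which replaces `-e` with `e`) is a made-up example.

```python
import ast

# Parse the concrete syntax `-5` as a single expression.
tree = ast.parse("-5", mode="eval")
expression = tree.body
print(type(expression).__name__)      # UnaryOp
print(type(expression.op).__name__)   # USub

# Evaluate the untransformed tree first, because the
# transformer below modifies the tree in place.
original_value = eval(compile(tree, "<ast>", "eval"))

class DropNegation(ast.NodeTransformer):
    """Hypothetical transformation: replace `-e` with `e`."""

    def visit_UnaryOp(self, node):
        # Transform any nested expressions before this node.
        self.generic_visit(node)
        if isinstance(node.op, ast.USub):
            return node.operand
        return node

transformed = ast.fix_missing_locations(DropNegation().visit(tree))
transformed_value = eval(compile(transformed, "<ast>", "eval"))

print(original_value, transformed_value)   # -5 5
```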

We can use the [`inspect`](https://docs.python.org/3/library/inspect.html) module to retrieve the concrete syntax of a Python function definition (the `def` line together with the body).

In [75]:
def f(x):
    return x + 2

import inspect
inspect.getsource(f)

'def f(x):\n    return x + 2\n'
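A string like the one above can be handed to `ast.parse` to recover the abstract syntax of the function. To keep the sketch self-contained, the source string is written out literally here rather than fetched with `inspect.getsource`; the variable names are chosen for this example.

```python
import ast

# The same string that `inspect.getsource(f)` returned above.
source = 'def f(x):\n    return x + 2\n'

tree = ast.parse(source)
function_def = tree.body[0]

print(type(function_def).__name__)                   # FunctionDef
print(function_def.name)                             # f
print([arg.arg for arg in function_def.args.args])   # ['x']

# The function body is a list of statements; here, a single `return`.
return_statement = function_def.body[0]
print(type(return_statement).__name__)               # Return
print(type(return_statement.value).__name__)         # BinOp
print(type(return_statement.value.op).__name__)      # Add
```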
