# describing code

languages with recursive structure

programming languages often have recursive structure (even if they do not support recursion)

a regex cannot always describe the pattern of a language with recursive structure

e.g. the language of balanced brackets: `<<<><<>><<<>>><><>>>`

the regex `[<>]+` is too permissive: it also matches unbalanced strings such as `><` or `<>>`. a regex can only match such a language up to a fixed, finite nesting depth
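a minimal sketch of the difference, using python's `re` module. `is_balanced` is a hypothetical helper written for this note: it tracks the nesting depth with a counter, which is exactly what a regex cannot do for unbounded depth

```python
import re

def is_balanced(s):
    """Return True if s is a non-empty, properly nested string of < and >."""
    depth = 0
    for ch in s:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:      # a closer appeared before its opener
                return False
        else:
            return False       # character outside the alphabet
    return depth == 0 and len(s) > 0

# the regex accepts any mix of the two characters, balanced or not
print(bool(re.fullmatch(r"[<>]+", "><")))   # True
print(is_balanced("><"))                    # False
print(is_balanced("<<<><<>><<<>>><><>>>"))  # True
```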

# context-free grammar

a language has
- syntax
- semantics

a *grammar* is a compact description of the syntax of a language

a *regular language* is a language whose syntax can be described by a regex

a *context-free language* is a language whose syntax can be described by a **context-free grammar**

## backus-naur form

a particular syntax for describing context-free grammars

```BNF
?start: expr
expr: OPEN CLOSE | OPEN exprs CLOSE
exprs: expr | expr exprs
OPEN: "<"
CLOSE: ">"
```
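one way to read the grammar above is as a recursive procedure: each rule becomes a function, and a rule that mentions itself becomes a recursive call. the sketch below is a hand-written recognizer in plain python, not what `lark` does internally. `match_expr` and `matches` are names made up for this example

```python
def match_expr(s, i=0):
    """Try to match one expr starting at index i.

    Returns the index just past the expr, or None if there is no match.
    """
    if i >= len(s) or s[i] != "<":       # expr must start with OPEN
        return None
    i += 1
    # exprs: zero inner exprs covers OPEN CLOSE, one or more covers OPEN exprs CLOSE
    while i < len(s) and s[i] == "<":
        i = match_expr(s, i)             # recursion mirrors the grammar
        if i is None:
            return None
    if i < len(s) and s[i] == ">":       # expr must end with CLOSE
        return i + 1
    return None

def matches(s):
    """True if the whole string is exactly one expr (the start rule)."""
    return match_expr(s) == len(s)

print(matches("<<><>>"))  # True
print(matches("<>>"))     # False
```

note that `<><>` is rejected: the start rule is a single `expr`, so the whole input must be wrapped in one outer pair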

the `lark` python module, available on [code.cs61a.org](https://code.cs61a.org), has its own flavor of BNF, which is the one used in these notes

## detail

the special symbol `?start` corresponds to a complete expression

symbols in all caps are terminals
- they can only contain `/regular expressions/`, `"text"`, and other `TERMINALS`
- no recursion is allowed within terminals

unnamed literals within non-terminals do not show up in the parse tree (e.g. the `","` in the `numbers` rule below)

```BNF
?start: numbers
numbers: INTEGER | numbers "," INTEGER
INTEGER: "0" | /-?[1-9]\d*/

%ignore /\s+/
```
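since a terminal is just a regular pattern, the `INTEGER` terminal above can be checked in isolation. the sketch below rewrites it as a single python regex (an equivalent written for this note, not something `lark` produces) to show that leading zeros and `-0` are rejected

```python
import re

# "0" | /-?[1-9]\d*/ written as one alternation
INTEGER = re.compile(r"0|-?[1-9]\d*")

for text in ["0", "42", "-7", "007", "-0"]:
    # fullmatch requires the whole string to be one INTEGER
    print(text, bool(INTEGER.fullmatch(text)))
```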

## extended BNF

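- `(item item ..)` group
- `[item item ..]` optional group, same as `(item item ..)?`
- `item?` zero or one
- `item*` zero or more
- `item+` one or more
- `item ~ n` exactly n
- `item ~ n..m` between n and m

- `%import` brings in terminals defined elsewhere, e.g. `%import common.NUMBER`
- `?` in front of a rule name omits that rule from the parse tree when it has only one child, which takes its place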

```BNF
?start: expr
?expr: NUMBER | call
call: "(" OPERATOR expr* ")"
OPERATOR: "+" | "-" | "*" | "/"
%ignore /\s+/
%import common.NUMBER
```
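to make the `expr*` repetition concrete, here is a rough hand-written parser for the same calculator language in plain python. it is only a sketch, not how `lark` works: `tokenize`, `parse_expr`, `parse_call` and `parse` are made-up names, and `NUMBER` is simplified to unsigned integers and decimals, which is narrower than `common.NUMBER`

```python
import re

OPERATORS = {"+", "-", "*", "/"}
# assumption: NUMBER is digits with an optional fractional part
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[()+\-*/])")

def tokenize(text):
    """Split the input into tokens, skipping whitespace (like %ignore)."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens, i):
    """expr: NUMBER | call. Returns (tree, next_index)."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[i]
    if tok == "(":
        return parse_call(tokens, i)
    if tok[0].isdigit():
        return float(tok), i + 1
    raise SyntaxError(f"unexpected token {tok!r}")

def parse_call(tokens, i):
    """call: "(" OPERATOR expr* ")". Returns ([op, arg, ...], next_index)."""
    i += 1                                    # skip "("
    if i >= len(tokens) or tokens[i] not in OPERATORS:
        raise SyntaxError("expected an operator")
    tree = [tokens[i]]
    i += 1
    while i < len(tokens) and tokens[i] != ")":   # expr*: zero or more
        arg, i = parse_expr(tokens, i)
        tree.append(arg)
    if i >= len(tokens):
        raise SyntaxError("missing )")
    return tree, i + 1                        # skip ")"

def parse(text):
    tokens = tokenize(text)
    tree, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("extra input after expression")
    return tree

print(parse("(+ 1 (* 2 3))"))  # ['+', 1.0, ['*', 2.0, 3.0]]
```

as in the grammar, the parentheses are consumed but do not appear in the resulting tree, while the operator does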

## ambiguity

```BNF
?start: expr
?expr: mul_expr | expr PLUS mul_expr
?mul_expr: NUMBER | mul_expr TIMES NUMBER
PLUS: "+"
TIMES: "*"
%ignore /\s+/
%import common.NUMBER
```

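the grammar above avoids ambiguity by layering its rules: `mul_expr` is nested inside `expr`, so `1 + 2 * 3` is parsed as `1 + (2 * 3)`, with the multiplication grouped first and the addition second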
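the layering can be sketched by hand in plain python: one function per rule, with the left recursion of each rule written as a loop. this is an illustration only, not `lark`'s algorithm. `parse_mul` and `parse_sum` are made-up names, and the input is assumed to be integers separated by spaces

```python
def parse_mul(tokens, i):
    """mul_expr: NUMBER | mul_expr TIMES NUMBER. Returns (tree, next_index)."""
    tree = int(tokens[i])
    i += 1
    while i < len(tokens) and tokens[i] == "*":
        tree = ("*", tree, int(tokens[i + 1]))   # group to the left
        i += 2
    return tree, i

def parse_sum(tokens, i=0):
    """expr: mul_expr | expr PLUS mul_expr. Returns (tree, next_index)."""
    tree, i = parse_mul(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        right, i = parse_mul(tokens, i + 1)      # each operand is a whole mul_expr
        tree = ("+", tree, right)
    return tree, i

tree, _ = parse_sum("1 + 2 * 3".split())
print(tree)  # ('+', 1, ('*', 2, 3))
```

because every operand of `+` is a complete `mul_expr`, the `*` ends up deeper in the tree and is therefore evaluated first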