(via Coursera)
## Student
- Interpreters are online: they run the program directly.
- Compilers are offline: they preprocess the program into an executable.
- FORTRAN I, the first compiler, was developed 1954 - 1957. By 1958 it had 50% adoption.
- Lexical analysis. (syntactic)
- Parsing. (syntactic)
- Semantic analysis. (types, scopes)
- Optimization.
- Code generation.
- First step: recognize words.
- Lexical analysis. Divide program into tokens.
- Next step: understand the sentence.
- Parsing. Group into higher-level constructs.
- Then understand the meaning. Very hard!
- Compilers do limited semantic analysis, too hard otherwise.
- Variable bindings wrt scope.
- Type mismatching.
- Then editing, i.e. optimization. Reduce usage of some resource(s):
- Run faster.
- Use less memory.
- Power.
- Network messages.
- Database accesses.
- Finally codegen, code generation. Assembly.
- Compared to FORTRAN, modern compilers:
- Have very small lexing and parsing stages.
- Larger semantic analysis stage.
- Much larger optimization stage, dominant stage.
- Much smaller codegen stage.
- Many programming languages, why?
- Distinct application domains.
- Scientific computing, e.g. FORTRAN.
- Good FP.
- Good arrays.
- Parallelism.
- Business applications, e.g. SQL.
- Persistence.
- Report generation.
- Data analysis.
- Systems programming, e.g. C/C++.
- Control of resources.
- Real time constraints.
- Why are there new programming languages?
- Programmer training is the dominant cost of a language.
- Widely used languages are slow to change - many users!
- Easy to start a new language - no users!
- Productivity > training cost => switch.
- So languages adopted to fill a void, get work done.
- New languages tend to resemble old ones.
- What is a good programming language?
- No universally accepted metric.
- Classroom Object Oriented Language.
- Designed to be implementable in a short time.
- COOL -> MIPS assembly language.
- Programs have a class Main, which has a procedure main, where execution starts.
- Compile using coolc: coolc <filename.cl>
- Run the generated .s file using spim: spim <filename.s>
First:
class Main {
  i : IO <- new IO;
  main():Int { { i.out_string("Hello world!\n"); 1; } };
};
Later: no need for the 1, and we don't care about the return type:
class Main {
  i : IO <- new IO;
  main():Object { { i.out_string("Hello world!\n"); } };
};
No need to allocate:
class Main inherits IO {
  main():Object { { self.out_string("Hello world!\n"); } };
};
But no need to explicitly name self:
class Main inherits IO {
  main():Object { { out_string("Hello world!\n"); } };
};
Converting string to integer from stdin:
class Main inherits A2I {
  main() : Object {
    (new IO).out_string(i2a(a2i((new IO).in_string()) + 1).concat("\n"))
  };
};
- let defines a local variable.
- <- is assignment.
- while (condition) loop { .. } pool is a loop.
- = is comparison.
- Complex!
- Divide strings into tokens.
- In English e.g. noun, verb
- In programming language e.g. identifier, keywords, integer, whitespace.
- Lexical analysis classifies substrings according to their role and communicates the tokens to the parser.
- Output of LA (a token) is <Class, String>.
- Some strings are the only ones in their own class, e.g. '(', ')', ';'.
- LA can require lookahead, but we want to minimize and bound it.
- Lexical structure = set of token classes.
- What set of strings is in a token class?
  - Use regular languages.
  - Usually specified with regular expressions.
- Regular expressions:
  - Single character: "c" = {"c"}. A one-string language.
  - Epsilon: $\epsilon$ = {""}. The language of just the empty string; not the empty set.
  - Union: $A + B = \{a \mid a \in A\} \cup \{b \mid b \in B\}$. Commutative.
  - Concatenation (cross product): $AB = \{ab \mid a \in A \wedge b \in B\}$.
  - Iteration, Kleene closure: $A^* = \bigcup_{i \geq 0} A^i$, where $A^i = A \ldots A$ ($i$ times) and $A^0 = \{\epsilon\}$.
- Regular expressions over $\Sigma$ are the smallest set of expressions R such that:
  - R = $\epsilon$ | 'c' (each c in the alphabet $\Sigma$) | R + R | RR | R*.
  - This is a grammar.
- Examples of regular languages:
  - (0+1)* = all strings of 0's and 1's. In fact {0,1} is the alphabet here, so this is all strings over the alphabet, of any length.
- RLs are not equations. They express languages, i.e sets of strings.
- Hence:
(0+1)* . 1 . (0+1)* = (0+1)* . (10+11+1) . (0+1)*
- LHS => make sure there is a 1 in the middle, preceded or followed by any combination of the alphabet.
- RHS guarantees the same!
- Regular expressions are an example of a formal language.
- A formal language has an alphabet $\Sigma$, a set of characters. A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.
  - e.g. alphabet = ASCII characters, language = C programs.
- Meaning function L maps syntax to semantics.
  - $L(e) = M$. e is an expression, e.g. an RE. M is a set of strings.
  - L maps from expressions (e.g. $\epsilon$) to sets (e.g. {""}).
  - L: Expressions -> Sets of Strings.
  - Defined recursively: L(A + B) = L(A) ∪ L(B), etc.
- Why use a meaning function?
- Clearly delineate syntax from semantics.
- Allows us to consider that notation is a separate issue.
- Consider arabic numbers vs. roman numerals.
- Meaning is same, but arithmetic in roman numerals is painful.
- Expression -> meaning is not 1:1.
- Syntax -> semantics is many:1.
- Never 1:many! That would mean ambiguous output.
- Consider below, all are the same:
- Conditional
- 'i' . 'f' = "if"
- "if" + "then" + "else"
- Integer, nonempty.
- digit = "0" + … + "9"
- digit+ = digit digit^*
- Identifier
- letter = [a-zA-Z]
- letter (letter + digit)*
- Whitespace
- ' ' + '\n' + '\t'
- Email address
- Username, domains.
- letter+ '@' letter+ '.' letter+
- (surely not valid?)
- Optional => (RE)?, is + epsilon.
- Conclusion, rule, justification.
- Summary of RE notation:
  - At least one: A^+ = A . A*
  - Union: A | B = A + B
  - Option: A? = A + \epsilon
  - Range: 'a' + 'b' + … + 'z' = [a-z]
  - Excluded range: complement of [a-z] = [^a-z]
- Hypothetical specification for a predicate:
  - s in set L( R ).
  - i.e. string s in language( R ); the latter is a set of strings.
- Not enough! We need to partition the input into words.
- First, write a regexp for the lexemes of each token class.
- Integer = digit+
- Whitespace = ' ' + '\n' + '\t'
- etc.
- Construct R, matching all lexemes for all tokens
- R = Integer + Whitespace + …
- R = R_1 + R_2 + …
- For every prefix in input check if it is in L( R ).
- If it is then we know the prefix is one of the constituent regexps, i.e. token classes.
- Remove that prefix and continue checking.
- Maximal munch. Longest prefix match.
- What if more than one token matches? e.g. an identifier called if.
  - Priority ordering. Choose the class listed first: keywords before identifiers.
- What if no rule matches?
  - Error token class: a regexp for all strings not in the lexical specification. Put it last. (See the sketch below.)
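To make the lexing loop concrete, here is a minimal maximal-munch lexer in C. The token classes, their priority order, and the sample input are illustrative assumptions, not part of the lecture:

```c
#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Return the length of the longest prefix of s in each class (0 = no match). */
int match_whitespace(const char *s) { int n = 0; while (isspace((unsigned char)s[n])) n++; return n; }
int match_integer(const char *s)    { int n = 0; while (isdigit((unsigned char)s[n])) n++; return n; }
int match_identifier(const char *s) {
    int n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}
int match_keyword_if(const char *s) { return strncmp(s, "if", 2) == 0 ? 2 : 0; }

int main(void) {
    const char *input = "if if17 42 x9";   /* assumed sample input */
    const char *p = input;
    while (*p) {
        /* Try every class; keep the longest match (maximal munch). Keywords
           are listed before identifiers, so on a tie the keyword wins. */
        struct { const char *name; int len; } classes[] = {
            { "whitespace", match_whitespace(p) },
            { "keyword",    match_keyword_if(p) },
            { "identifier", match_identifier(p) },
            { "integer",    match_integer(p) },
        };
        int best = 0;
        for (int i = 1; i < 4; i++)
            if (classes[i].len > classes[best].len) best = i;
        if (classes[best].len == 0) {        /* error class: emit and skip one char */
            printf("<error, \"%c\">\n", *p++);
            continue;
        }
        if (strcmp(classes[best].name, "whitespace") != 0)
            printf("<%s, \"%.*s\">\n", classes[best].name, classes[best].len, p);
        p += classes[best].len;
    }
    return 0;
}
```

On this input "if" lexes as a keyword (priority wins the tie), but "if17" lexes as one identifier, because the longest match overrides priority.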
- Implementation of regular expressions.
- An FA consists of:
  - An input alphabet $\Sigma$.
  - A finite set of states $S$.
  - A start state $n$.
  - A set of accepting states $F \subseteq S$.
  - A set of transitions $state \xrightarrow{input} state$.
- Transition (read) $s_1 \xrightarrow{a} s_2$: in state $s_1$ on input $a$, go to state $s_2$.
- If at end of input and in an accepting state, accept.
- If it terminates in any state other than an accepting state, reject.
- If it gets stuck (no transition), reject.
- Configuration of FA: state it is in and input it is processing (input pointer).
- Language of a FA: set of accepted strings.
- $\epsilon$-move: a transition without consuming input (without moving the input pointer).
- Deterministic Finite Automata (DFA):
  - One transition per input per state.
  - No $\epsilon$-moves.
- Non-deterministic Finite Automata (NFA):
  - Multiple transitions for one input in a given state.
  - Can have $\epsilon$-moves.
  - Since multiple transitions on one input can be transformed into $\epsilon$-moves, it's the $\epsilon$-moves that tell us whether an FA is an NFA.
  - A DFA takes only one path through the state graph; an NFA can choose.
- An NFA accepts if some choices lead to accepting state.
- Configuration of a NFA: the set of states it could be in and the input it's processing.
- NFAs and DFAs recognise same set of languages, regular languages.
- DFAs faster (no choices)
- NFAs are generally smaller.
- Lexical specification -> Regular expressions.
- Regular expressions -> NFA.
- NFA -> DFA.
- DFA -> Table-driven implementation of DFA.
- For epsilon or a single input a, simple: start state s_1, one edge labelled epsilon or a to accepting state s_2.
- Concatenation, AB:
  - The final state of A has an epsilon transition to the start state of B.
- Union, A + B:
  - Make a new start state with two epsilon transitions, one to A, one to B.
  - Make a new accepting state with epsilon transitions from the accepting states of A and B.
- Kleene closure, A*:
  - Complex. New start state, epsilon edge to A.
  - Epsilon edges from the accepting state of A to the start and to a new accepting state.
  - Epsilon edge from the new start state to the new accepting state.
- epsilon-closure of a state: the set of states that can be reached via recursive epsilon transitions, plus the state itself.
  - Keep following epsilons (see the sketch below).
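A small sketch of the epsilon-closure computation in C, using a worklist; the hardcoded NFA is an illustrative assumption:

```c
#include <stdio.h>

#define NSTATES 5

/* eps[s][t] != 0 iff there is an epsilon edge s -> t (assumed tiny NFA) */
int eps[NSTATES][NSTATES] = {
    {0,1,0,0,0},   /* 0 -e-> 1           */
    {0,0,1,1,0},   /* 1 -e-> 2, 1 -e-> 3 */
    {0,0,0,0,0},
    {0,0,0,0,1},   /* 3 -e-> 4           */
    {0,0,0,0,0},
};

void eps_closure(int start, int in_closure[NSTATES]) {
    int worklist[NSTATES], n = 0;
    in_closure[start] = 1;          /* the state itself is in its closure */
    worklist[n++] = start;
    while (n > 0) {                 /* keep following epsilons */
        int s = worklist[--n];
        for (int t = 0; t < NSTATES; t++)
            if (eps[s][t] && !in_closure[t]) {
                in_closure[t] = 1;  /* newly reached: remember and revisit */
                worklist[n++] = t;
            }
    }
}

int main(void) {
    int closure[NSTATES] = {0};
    eps_closure(0, closure);
    printf("eps-closure(0) = {");
    for (int s = 0; s < NSTATES; s++)
        if (closure[s]) printf(" %d", s);
    printf(" }\n");   /* prints { 0 1 2 3 4 } */
    return 0;
}
```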
- An NFA may be in many states at any time.
- How many different configurations?
  - N states in total, so a configuration S satisfies |S| <= N.
  - There are $2^N - 1$ non-empty subsets of the N states.
  - A finite set of possible configurations, so in principle a DFA can simulate an NFA.
- NFA characterisation:
  - states S, start state s, final states $F \subseteq S$.
  - $a(X) = \{ y \mid x \in X \wedge x \xrightarrow{a} y \}$ (i.e. for a set of states X and input a, all the states that can be reached).
- DFA characterisation (the subset construction):
  - states: the subsets of S (except the empty set).
  - start state: the $\epsilon$-closure of the start state of the NFA.
  - final states: $\{ X \mid X \cap F \neq \emptyset \}$.
  - transition from X to Y on input a if $Y = \epsilon\text{-closure}(a(X))$.
    - So for a given input you reach the epsilon closure of the ending states.
- !!AI Pretty tricky! Do on paper, watch end of lecture for example.
- DFA implementation: a table of state vs. input; the value is the next state (sketched below).
  - Downside: many duplicate rows.
- Instead, a one-dimensional array indexed by state, each entry a pointer to a row of moves. Rows can be shared amongst states.
  - There could be 2^N - 1 DFA states for an NFA with N states.
  - But the pointer dereferencing takes time.
- Or could implement the NFA as a table.
  - States are rows; columns are the alphabet plus epsilon (e.g. 0, 1, and epsilon).
  - But the values are sets of states, i.e. the configuration is multiple states. Much slower.
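A minimal table-driven DFA in C, recognizing integer literals (digit+); the states and character classes are illustrative assumptions:

```c
#include <stdio.h>

/* Rows are states, columns are character classes; the value is the next state. */
enum { S_START, S_DIGITS, S_ERROR, NSTATES };
enum { C_DIGIT, C_OTHER, NCLASSES };

int delta[NSTATES][NCLASSES] = {
    /*            digit     other    */
    /* START  */ { S_DIGITS, S_ERROR },
    /* DIGITS */ { S_DIGITS, S_ERROR },
    /* ERROR  */ { S_ERROR,  S_ERROR },
};
int accepting[NSTATES] = { 0, 1, 0 };   /* only S_DIGITS accepts */

int classify(char c) { return (c >= '0' && c <= '9') ? C_DIGIT : C_OTHER; }

int accepts(const char *s) {
    int state = S_START;
    for (; *s; s++)
        state = delta[state][classify(*s)];  /* one table lookup per character */
    return accepting[state];
}

int main(void) {
    printf("%d\n", accepts("12345"));  /* 1 */
    printf("%d\n", accepts("12a45"));  /* 0 */
    printf("%d\n", accepts(""));       /* 0: empty string rejected */
    return 0;
}
```

The duplicate rows for S_START and S_DIGITS are exactly the redundancy the row-sharing scheme above would remove.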
- Regular languages are the weakest class of formal languages in wide use.
- Consider the language of all balanced parentheses: $\{ (^i )^i \mid i \geq 0 \}$.
  - Applies also to nested if/then.
- Regular expressions cannot handle arbitrarily nested balanced parentheses or nested if/then.
- Regular languages can only count "mod k", where k is the finite number of states; they cannot count arbitrarily.
- Lexer: string of characters -> string of tokens
- Parser: string of tokens -> parse tree.
- Parse tree may be implicit.
- We need:
- Language for describing valid strings of tokens.
- Method for distinguishing valid from invalid strings of tokens.
- CFGs are natural notation for describing recursive structures.
- A CFG consists of:
- A set of terminals, T
- A set of non-terminals, N
- A start symbol, S ($S \in N$).
- A set of productions, where a production is $X \to Y_1 \ldots Y_n$, with $X \in N$ and $Y_i \in N \cup T \cup \{\epsilon\}$.
e.g. two productions
This implies:
- Productions can be read as rules.
- Begin with a string with only the start symbol S.
- Replace any non-terminal X by the right-hand side of some production $X \to Y_1 \ldots Y_n$.
- Repeat until no non-terminals are left.
- One step of a context-free derivation reads as: $\alpha X \beta \to \alpha Y_1 \ldots Y_n \beta$ if $X \to Y_1 \ldots Y_n$ is a production.
- Language L(G) of a context-free grammar G: $\{ a_1 \ldots a_n \mid S \to^* a_1 \ldots a_n, a_i \in T \}$.
  - i.e. the set of all strings of terminals that can be derived from the start symbol.
- Terminal symbols do not have rules for replacing them, permanent. e.g. tokens in language.
- Derivation: a sequence of productions.
- Can draw a derivation as a tree. Add the result as children.
- Obviously root is start symbol, interior nodes are non-terminals, leaves are terminals.
- In-order traversal of the leaves is the original input.
- Left-most derivation: at each step replace left-most non-terminal.
- Right-most derivation: at each step replace right-most non-terminal.
- Left-most and right-most derivation have the same parse tree.
- Many kinds of derivations.
- We don't just care if string s in L(G). We need a parse tree for s!
- Ambiguous grammar: grammar has more than one parse tree for some string.
- Equivalently, more than one left-most or right-most derivation for some string.
- Dealing with ambiguity.
- Most direct is to re-write grammar.
E -> E' + E | E'
E' -> id * E' | id | (E) * E' | (E)
- Notice that E controls plus, E' controls multiplication.
- Notice higher precedence towards the bottom of the parse tree.
- E is a Kleene closure over plus, zero or more. The final E becomes E'.
- E' is a Kleene closure over multiplication, zero or more. The final E' becomes id.
- Notice we use (E), not (E'). We can put pluses inside parentheses.
- How to match if/then/else:
E -> MIF | UIF // matched if or unmatched if
MIF -> if E then MIF else MIF | OTHER
UIF -> if E then E | if E then MIF else UIF
- Instead of rewriting the grammar, tools can use precedence and associativity declarations to do it for you.
- bison does this, e.g.
E -> E + E | int
becomes
%left +
E -> E + E | int
and
E -> E + E | E * E | int
becomes:
%left +
%left *
E -> E + E | E * E | int
- Error types:
- Lexical, e.g. a $ b. Detected by lexer.
- Syntax, e.g. a *% b. (Lexically correct). Detected by parser
- Semantic, e.g. int x; y = x(3). (Syntax correct). Detected by type checker.
- Correctness. Detected by user.
- Panic mode
- Simplest, most popular.
- On error discard tokens until one with a clear role is found. Then continue.
- Synchronizing tokens: statement or expression terminations.
- e.g.: (1 + + 2) + 3
- On second plus discard input until it finds the next integer, then continue.
- In bison, use a terminal "error" to describe how much input to skip:
E -> int | E + E | (E) | error int | ( error )
- Error productions:
  - Specify known common mistakes in the grammar.
  - e.g. you write 5x instead of 5 * x.
  - Add the production E -> ... | E E.
  - Disadvantage: complicates the grammar.
- Error correction (not common any more):
- Find a correct "nearby" program.
- Try token insertions and deletions. Minimize edit distance.
- Exhaustive search.
- Disadvantages:
- Hard.
- Slows down parsing.
- Nearby doesn't necessarily imply "intended".
- e.g. PL/C. Always parses to running program.
- But this was developed in the 70's
- Slow recompilation cycle, even one day.
- Want to find as many errors in one cycle as possible.
- Review:
- Parser traces derivation of a sequence of tokens.
- But later parts of compiler needs structural representation.
- Abstract syntax trees (AST). Like a parse tree but ignores some details.
- e.g.: E -> int | ( E ) | E + E
- 5 + (2 + 3).
- The parse tree has too much information: parentheses, single-successor nodes.
- Top-down, left to right.
- Terminals seen in order of appearance in token stream.
- Try production rules in order, e.g.
E -> T | T + E
T -> int | int * T | ( E )
with example input: ( 5 )
- For start symbol E we try T then T + E. On failure may have to do back-tracking.
- Notice while generating that when we generate non-terminal don't know if we're on the right track.
- Once you generate a terminal you can check the input at the input pointer to see if it's right. If not, try another production.
- Don't be too smart. Recursive descent parser tries all productions in order, so you should too.
- Let TOKEN be the type of a token, i.e. INT, OPEN, CLOSE, PLUS, or TIMES.
- Let global next be a pointer to the next input token.
- Define boolean functions:
bool term(TOKEN tok) { return *next++ == tok; }
- Checks whether the token currently pointed to is tok; also increments next.
bool S_n() { ... }
- nth production of S. Does particular production match input?
bool S() { ... }
- Try all productions of S. Do any of all productions match input?
- For production E -> T
bool E_1() { return T(); }
- For production E -> T + E
bool E_2() { return T() && term(PLUS) && E(); }
- Notice short-circuited AND.
- Notice each bool function increments input pointer.
- Notice it's called E_2, second production of E.
- For all productions of E with backtracking
bool E() {
TOKEN *save = next;
return (next = save, E_1())
|| (next = save, E_2());
}
-
Notice short-circuited OR. If first condition true don't evaluate second.
-
Notice we restore next after failing an OR condition.
-
Notice on failing all productions of E we don't restore anything, because we're passing the error back up to a higher production.
-
Functions for non-terminal T
- T -> int
bool T_1() { return term(INT); }
- T -> int * T
bool T_2() { return term(INT) && term(TIMES) && T(); }
- T -> ( E )
bool T_3() { return term(OPEN) && E() && term(CLOSE); }
- For all productions of T, with backtracking:
bool T() {
TOKEN *save = next;
return (next = save, T_1())
|| (next = save, T_2())
|| (next = save, T_3());
}
- To start parser.
- Initialize next to first token
- Invoke start symbol, E().
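As a sanity check, the fragments above can be assembled into a complete program. The TOKEN enum values, the END sentinel, and the main() driver are assumptions for illustration; the parsing functions are the ones from the notes:

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

TOKEN *next;  /* global pointer to the next input token */

bool term(TOKEN tok) { return *next++ == tok; }

bool E(void);
bool T(void);

bool E_1(void) { return T(); }                              /* E -> T     */
bool E_2(void) { return T() && term(PLUS) && E(); }         /* E -> T + E */

bool E(void) {
    TOKEN *save = next;
    return (next = save, E_1())
        || (next = save, E_2());
}

bool T_1(void) { return term(INT); }                        /* T -> int     */
bool T_2(void) { return term(INT) && term(TIMES) && T(); }  /* T -> int * T */
bool T_3(void) { return term(OPEN) && E() && term(CLOSE); } /* T -> ( E )   */

bool T(void) {
    TOKEN *save = next;
    return (next = save, T_1())
        || (next = save, T_2())
        || (next = save, T_3());
}

int main(void) {
    TOKEN input[] = { OPEN, INT, CLOSE, END };  /* ( int ) */
    next = input;                               /* initialize next to the first token */
    /* Invoke the start symbol; also require that all input was consumed. */
    bool ok = E() && *next == END;
    printf("%s\n", ok ? "accepted" : "rejected");
    /* Note: { INT, TIMES, INT, END } would be rejected, as discussed next. */
    return 0;
}
```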
- From before try to parse int * int.
- We parse as E -> E_1 -> T -> int, but input not exhausted, so reject input string.
- What happened?! String is definitely in language of grammar.
- Problem: there is no backtracking once we have found a matching production; ideally we'd try all productions, and then maximal munch.
- If a production for non-terminal X succeeds cannot backtrack to try a different production for X.
- General recursive-descent algorithms do however support such "full" backtracking.
- The presented recursive descent algorithm is not general, but it is easy to implement by hand.
- Sufficient for grammars where for any non-terminal at most one production can succeed.
- Consider S -> S a.
bool S_1() { return S() && term(a); }
bool S() { return S_1(); }
- S() goes into an infinite loop.
- Left-recursive grammar: has some non-terminal S such that S ->+ S alpha for some alpha.
- One or more productions that refer to same non-terminal.
- There is always a non-terminal in the derivation, can't finish.
- Recursive descent does not work with left-recursive grammars.
- Consider:
S -> S alpha | beta
- S generates all strings starting with one beta followed by any number of alphas.
- But it does so right to left.
- Can rewrite using right-recursion:
S -> beta S'
S' -> alpha S' | epsilon
e.g.
S -> beta S' -> beta alpha S' -> beta alpha alpha S' -> ... -> beta alpha ... alpha S' -> beta alpha ... alpha
- So in general take one left-recursive production S and rewrite into right-recursive productions S and S'.
- But above is not the most general form of left-recursion, e.g.
S -> A alpha | beta
A -> S beta
- Above is also left-recursive.
- !!AI don't get it, can't do the quiz question. Look at EC later.
- EC p100 is pretty good.
- Quiz includes question about eliminating indirect left-recursion, so midterm will have it too.
- Summary of recursive descent (general form not presented here)
- simple and general parsing strategy.
- left-recursion must first be eliminated, but can be automatically eliminated.
- Used in gcc.
- Like recursive descent but parser can predict what production to use.
- By looking at next few tokens, no backtracking, always right.
- Accept LL(k) grammars.
- First L: left-to-right scan.
- Second L: left-most derivation.
- k tokens of lookahead.
- In practice k always 1.
- Review, in recursive descent:
- At each step many choices of production.
- Backtrack to undo bad choices.
- In LL(1)
- Only one choice of production.
- Recall our favourite grammar:
E -> T + E | T
T -> int | int * T | ( E )
- Hard to predict:
  - For T, two productions start with int.
  - For E it's not clear how to predict; both productions start with T.
- Need to left-factor the grammar.
- Eliminate the common prefixes of multiple productions for one non-terminal.
E -> T X
X -> + E | epsilon
T -> int Y | ( E )
Y -> * T | epsilon
- Use left-factored grammar to generate an LL(1) parsing table.
- Rows: leftmost non-terminal, i.e. current non-terminal.
- Columns: next input token.
- Value: RHS of production to use.
- How to use the table:
- For the leftmost non-terminal S.
- Look at next input token a.
- Choose production shown at [S, a].
- A stack records frontier of parse tree.
- Non-terminals that have yet to be expanded.
- Terminals that have yet to be matched against input.
- Top of stack = leftmost pending terminal or non-terminal.
- Reject on error state.
- Accept on end of input and empty stack.
- Dollar sign ($) is special end-of-input symbol, put it on the end of the input and in the stack.
initialize stack = <S $> and next
repeat
  case stack of
    <X, rest> : if T[X, *next] = Y_1 ... Y_n
                then stack <- <Y_1 ... Y_n rest>;
                else error();
    <t, rest> : if t == *next++
                then stack <- <rest>;
                else error();
until stack == < >
- Note: after processing symbol X or t we pop it off the stack. In case of X after popping we push on its children such that first (leftmost) child is top of stack.
- When doing this take your LL(1) parse table and draw a new table with columns "stack", "input", "action". The stack starts as "S $", the input as "input $".
- 14:23: worked example.
- Quiz at the end is useful, but would like to see parse tree too. Worked example at 14:23 has this.
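A minimal table-driven LL(1) parser in C for this left-factored grammar — a sketch under some assumptions: terminals are single characters ('i' stands for int), epsilon is the empty string, and the table is the one described above:

```c
#include <stdio.h>
#include <string.h>

/* LL(1) table: rows E,X,T,Y; columns int,*,+,(,),$. NULL = blank entry (error). */
const char *table[4][6] = {
    /*            int    *     +      (      )     $   */
    /* E */    { "TX",  NULL, NULL,  "TX",  NULL, NULL },
    /* X */    { NULL,  NULL, "+E",  NULL,  "",   ""   },
    /* T */    { "iY",  NULL, NULL,  "(E)", NULL, NULL },
    /* Y */    { NULL,  "*T", "",    NULL,  "",   ""   },
};

int row(char s) { return s=='E'?0 : s=='X'?1 : s=='T'?2 : 3; }
int col(char t) { return t=='i'?0 : t=='*'?1 : t=='+'?2 : t=='('?3 : t==')'?4 : 5; }
int is_nonterm(char s) { return s=='E'||s=='X'||s=='T'||s=='Y'; }

int parse(const char *input) {   /* input: string of terminals ending in '$' */
    char stack[100];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'E';          /* start symbol on top */
    while (top > 0) {
        char s = stack[--top];   /* pop the leftmost pending symbol */
        if (is_nonterm(s)) {
            const char *rhs = table[row(s)][col(*input)];
            if (!rhs) return 0;  /* blank entry: parse error */
            /* push the RHS reversed so its leftmost symbol is on top */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[top++] = rhs[i];
        } else {
            if (s != *input) return 0;  /* terminal mismatch */
            input++;                    /* match and advance */
        }
    }
    return 1;   /* stack empty: everything matched, including $ */
}

int main(void) {
    printf("%d\n", parse("i*i+i$"));  /* 1: int * int + int accepted */
    printf("%d\n", parse("i+*i$"));   /* 0: rejected */
    return 0;
}
```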
- Consider leftmost non-terminal A, production $A \to \alpha$, next input token t.
- T[A, t] = $\alpha$ in two cases:
  - If $\alpha \to^* t \beta$.
    - alpha can derive t in the first position in zero or more moves.
    - We say that $t \in First(\alpha)$: t is one of the terminals that $\alpha$ can produce in the first position.
  - If $A \to \alpha$ and $\alpha \to^* \epsilon$ and $S \to^* \beta A t \delta$.
    - Useful if the stack has A, the input is t, and $\alpha$ cannot derive t (t is not in $First(\alpha)$).
    - The only option is to get rid of A by deriving $\epsilon$.
    - We say that $t \in Follow(A)$: there must be a derivation where t comes immediately after A.
- Definition: $First(X) = \{ t \mid X \to^* t \alpha \} \cup \{ \epsilon \mid X \to^* \epsilon \}$.
  - All terminals t that X can derive in the first position.
  - Also need to keep track of whether X derives $\epsilon$.
- Computing First sets:
  - $First(t) = \{ t \}$ for a terminal t.
  - $\epsilon \in First(X)$ if
    - $X \to \epsilon$, or
    - $X \to A_1 \ldots A_n$ and $\epsilon \in First(A_i)$ for $1 \leq i \leq n$.
      - The entire RHS must derive $\epsilon$: all the $A_i$, which must all be non-terminals.
  - $First(\alpha) \subseteq First(X)$ if $X \to A_1 \ldots A_n \alpha$ and $\epsilon \in First(A_i)$ for $1 \leq i \leq n$.
    - All the $A_i$ can go to epsilon, so X can effectively start with $\alpha$.
- Recall our favourite grammar, left-factored:
E -> T X
X -> + E | epsilon
T -> ( E ) | int Y
Y -> * T | epsilon
First( + ) = { + }, First( * ) = { * }, First( ( ) = { ( }, First( ) ) = { ) }, First(int) = { int }
- First(T) ⊆ First(E), from E -> T X.
- Can T go to epsilon? If it could, then First(X) would also be in First(E).
- But no! T can't go to epsilon, so First(X) never contributes to First(E).
Hence:
First(E) = First(T) = { (, int }
First(X) = { +, epsilon }
First(Y) = { *, epsilon }
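A sketch of the First-set computation as a fixed point, in C. The symbol encoding (single characters, 'i' for int) and the grammar tables are assumptions for illustration:

```c
#include <stdio.h>

/* Non-terminals: E, X, T, Y. Terminals: ( ) + * i (i = int). "" = epsilon RHS. */
enum { E, X, T, Y, NNT };
const char *terms = "()+*i";

struct { int lhs; const char *rhs; } prods[] = {
    { E, "TX" }, { X, "+E" }, { X, "" },
    { T, "(E)" }, { T, "iY" }, { Y, "*T" }, { Y, "" },
};

int first[NNT][128];   /* first[A][c] != 0 iff terminal c in First(A) */
int nullable[NNT];     /* nullable[A] != 0 iff epsilon in First(A)    */

int nt(char c) { return c=='E'?E : c=='X'?X : c=='T'?T : c=='Y'?Y : -1; }

int main(void) {
    int changed = 1;
    while (changed) {              /* iterate the rules to a fixed point */
        changed = 0;
        for (int p = 0; p < 7; p++) {
            int A = prods[p].lhs;
            const char *rhs = prods[p].rhs;
            int all_nullable = 1;  /* do all RHS symbols seen so far derive epsilon? */
            for (int i = 0; rhs[i] && all_nullable; i++) {
                int B = nt(rhs[i]);
                if (B < 0) {       /* terminal: goes into First(A), then stop */
                    if (!first[A][(int)rhs[i]]) { first[A][(int)rhs[i]] = 1; changed = 1; }
                    all_nullable = 0;
                } else {           /* non-terminal: add First(B) to First(A) */
                    for (int c = 0; c < 128; c++)
                        if (first[B][c] && !first[A][c]) { first[A][c] = 1; changed = 1; }
                    all_nullable = nullable[B];
                }
            }
            if (all_nullable && !nullable[A]) { nullable[A] = 1; changed = 1; }
        }
    }
    const char *names = "EXTY";
    for (int A = 0; A < NNT; A++) {
        printf("First(%c) = {", names[A]);
        for (int i = 0; terms[i]; i++)
            if (first[A][(int)terms[i]]) printf(" %c", terms[i]);
        printf(nullable[A] ? " epsilon }\n" : " }\n");
    }
    return 0;
}
```

Running it reproduces the sets above: First(E) = First(T) = { (, i }, First(X) = { +, epsilon }, First(Y) = { *, epsilon }.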
- First sets describe what a symbol produces; Follow sets describe where the symbol is used.
- Follow(X) = $\{ t \mid S \to^* \beta X t \delta \}$.
  - All terminals t that can immediately follow X in some derivation.
- Intuition:
  - If X -> A B then $First(B) \subseteq Follow(A)$.
    - Suppose X -> A B ->* A t beta: anything in the First of B is in the Follow of A.
  - Also, $Follow(X) \subseteq Follow(B)$.
    - Suppose S ->* ... X t ... and X -> A B: whatever follows X also follows the last symbol of X's production, here B.
    - If B ->* epsilon, then also $Follow(X) \subseteq Follow(A)$.
  - If S is the start symbol then $ \in Follow(S) ($ is the end-of-input marker).
- Algorithm:
  1. $ \in Follow(S).
  2. For each production A -> alpha X beta: First(beta) - { epsilon } \subseteq Follow(X).
     - epsilon never appears in Follow sets; Follow sets are always sets of terminals.
  3. For each production A -> alpha X beta where epsilon is in First(beta) (or beta is empty): Follow(A) \subseteq Follow(X).
- Recall:
E -> T X
X -> + E | epsilon
T -> ( E ) | int Y
Y -> * T | epsilon
Follow(E)
- = { $, … } (get this for free, always true)
- Close paren must be in Follow(E) because it's a terminal following E in first production of T.
- E is last symbol in X's first production, so Follow(X) in Follow(E).
Follow(X)
- Last symbol in production of E, so Follow(E) in Follow(X).
- Hence Follow(X) = Follow(E).
Now nothing else to learn about X or E, so Follow(E) is concluded, and hence Follow(X) is concluded.
Follow(E) = Follow(X) = { $, ) }
Follow(T):
- Recall that First(X) = { +, epsilon } (from the previous lecture).
- First(X) - { epsilon } ⊆ Follow(T), since X follows T in E -> T X.
- Follow(T) = { +, … } (don't put epsilon there; epsilon is never in Follow sets).
- As X can go to epsilon, it can disappear, so Follow(E) ⊆ Follow(T).
- Hence Follow(T) = { +, $, ), … }
- T is in the rightmost spot of Y's production, so Follow(Y) ⊆ Follow(T).
Follow(Y)
- Y appears in rightmost spot of T, so Follow(T) in Follow(Y).
- So Follow(T) = Follow(Y).
And now we're done!
Follow(T) = Follow(Y) = { +, $, ) }
Now consider terminals:
Follow('(')
- First(E) ⊆ Follow('(').
- From the First-set lecture First(T) ⊆ First(E), and since T never goes to epsilon, First(E) = First(T).
- First(E) = First(T) = { (, int }
- Hence Follow('(') = { (, int }. Makes sense, because the only things that can follow an open paren are another open paren or an int.
Follow(')')
- Right end of T, so Follow(T) in Follow(')').
- Follow(T) = { +, $, ) }.
- Hence Follow(')') = { +, $, ) }.
Follow('+')
- First(E) in Follow('+').
- First(E) = First(T) = { '(', int }
- Hence Follow('+') = { '(', int }
- If T went to epsilon we'd need to consider other rules, but it doesn't.
Follow('*')
- First(T) in Follow('*')
- First(T) = { '(', int }
- Hence Follow('*') = { '(', int }
- If T went to epsilon we'd need to consider other rules, but it doesn't.
Follow(int)
- First(Y) in Follow(int).
- First(Y) = { '*', epsilon }
- Hence Follow(int) = { '*', … }
- But Y can go to epsilon!
- Hence int could be right end of T's second production.
- Hence Follow(T) in Follow(int)
- Hence Follow(int) = { '*', '+', $, ')' }
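The Follow rules can be run to a fixed point the same way as the First rules. A sketch in C for the same grammar; the First sets and nullability flags are hardcoded from the computation above, and only Follow sets of non-terminals are computed (terminals would work the same way):

```c
#include <stdio.h>

enum { E, X, T, Y, NNT };
const char *names = "EXTY";
const char *terms = "()+*i$";

struct { int lhs; const char *rhs; } prods[] = {
    { E, "TX" }, { X, "+E" }, { X, "" },
    { T, "(E)" }, { T, "iY" }, { Y, "*T" }, { Y, "" },
};

/* Hardcoded from the First computation above. */
int nullable[NNT] = { 0, 1, 0, 1 };                      /* E,X,T,Y */
const char *first_sets[NNT] = { "(i", "+", "(i", "*" };  /* terminals only */

int follow[NNT][128];

int nt(char c) { return c=='E'?E : c=='X'?X : c=='T'?T : c=='Y'?Y : -1; }

/* First of a suffix of an RHS (terminals only); returns 1 if suffix is nullable. */
int suffix_first(const char *beta, int out[128]) {
    for (int i = 0; beta[i]; i++) {
        int B = nt(beta[i]);
        if (B < 0) { out[(int)beta[i]] = 1; return 0; }   /* a terminal ends it */
        for (const char *p = first_sets[B]; *p; p++) out[(int)*p] = 1;
        if (!nullable[B]) return 0;
    }
    return 1;  /* the whole suffix can derive epsilon */
}

int main(void) {
    follow[E]['$'] = 1;   /* rule 1: $ in Follow(start symbol) */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int p = 0; p < 7; p++) {
            const char *rhs = prods[p].rhs;
            for (int i = 0; rhs[i]; i++) {
                int Xi = nt(rhs[i]);
                if (Xi < 0) continue;          /* only non-terminals here */
                int fb[128] = {0};
                int beta_nullable = suffix_first(rhs + i + 1, fb);
                for (int c = 0; c < 128; c++)  /* rule 2: First(beta) - {eps} */
                    if (fb[c] && !follow[Xi][c]) { follow[Xi][c] = 1; changed = 1; }
                if (beta_nullable)             /* rule 3: Follow(A) subset Follow(X) */
                    for (int c = 0; c < 128; c++)
                        if (follow[prods[p].lhs][c] && !follow[Xi][c])
                            { follow[Xi][c] = 1; changed = 1; }
            }
        }
    }
    for (int A = 0; A < NNT; A++) {
        printf("Follow(%c) = {", names[A]);
        for (int i = 0; terms[i]; i++)
            if (follow[A][(int)terms[i]]) printf(" %c", terms[i]);
        printf(" }\n");
    }
    return 0;
}
```

It reproduces Follow(E) = Follow(X) = { $, ) } and Follow(T) = Follow(Y) = { +, $, ) }.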
- Construct a parsing table T for CFG G:
  - For each production A -> \alpha in G do:
    - For each terminal t \in First(\alpha) do: T[A, t] = \alpha.
    - If \epsilon \in First(\alpha), for each t \in Follow(A) do: T[A, t] = \alpha.
      - For when we need to get rid of A.
    - If \epsilon \in First(\alpha) and $ \in Follow(A) do: T[A, $] = \alpha.
- 02:38: worked example, see paper. If it doesn't make sense watch again and work through.
- A blank entry in the parsing table is a parsing error.
- Consider:
S -> Sa | b
- This is not an LL(1) grammar; it is left-recursive.
- First(S) = { b }.
- Follow(S) = { $, a }.
- Small LL(1) table (draw it out).
- Given non-terminal S and input b we have multiple moves: b and Sa.
- If you build an LL(1) parsing table and some entry has more than one move in it, then the CFG G is not LL(1).
- Quick checks that a grammar is not LL(1):
  - Any grammar that is not left-factored is not LL(1).
  - Any left-recursive grammar is not LL(1).
  - Any ambiguous grammar is not LL(1).
- However, other grammars are not LL(1) too.
- Most programming language CFGs are not LL(1).
- Bottom-up parsing is more general than top-down parsing.
  - Just as efficient.
  - Builds on ideas from top-down parsing.
  - The preferred method.
- Doesn't need left-factored grammars.
- Revert to the "natural" grammar:
E -> T + E | T
T -> int * T | int | ( E )
- Reduces a string to the start symbol by inverting productions.
- Consider:
int * int + int
- Productions, read backwards, trace a rightmost derivation.
- We parse downwards with reductions.
- But read upwards it's a rightmost derivation.
- Important fact #1: A bottom-up parser traces a rightmost derivation in reverse.
- Let abw be a step of a bottom-up parse.
- Assume the next reduction is by X -> b (replace b by X; remember we run productions backwards).
- Then w is a string of terminals.
  - Why? aXw -> abw is a step in a rightmost derivation, so X must be the rightmost non-terminal, as we're operating on it. Hence w is a string of terminals.
- Idea: split the string into substrings.
  - In aXw, w is the unexamined input.
  - The left substring has terminals and non-terminals.
  - A pipe separates the two: aX | w.
- Two kinds of moves: shift and reduce.
- Shift:
  - Move the pipe one place to the right.
  - ABC | xyz => ABCx | yz.
- Reduce:
  - Apply an inverse production at the right end of the left string.
  - If A -> xy is a production, then Cbxy | ijk => CbA | ijk.
- 05:00: Example of shift-reduce on int * int + int, traced below.
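Traced out for the grammar above (the pipe separates the stack from the unexamined input; the action column is the move taken to reach the next line):

```
| int * int + int      shift
int | * int + int      shift
int * | int + int      shift
int * int | + int      reduce T -> int
int * T | + int        reduce T -> int * T
T | + int              shift
T + | int              shift
T + int |              reduce T -> int
T + T |                reduce E -> T
T + E |                reduce E -> T + E
E |                    accept
```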
- The left string can be implemented by a stack.
  - The top of the stack is at the pipe.
  - Shift pushes a terminal onto the stack.
- Reduce:
  - pops symbols off the stack (production RHS),
  - pushes a non-terminal onto the stack (production LHS).
- Shift-reduce conflict: when it is legal to shift or reduce. (Almost expected; use precedence declarations to remove.)
- Reduce-reduce conflict: when it is legal to reduce by two different productions. (Bad!)
-
Review:
- Shift: move pipe operator by one.
- Reduce: apply inverse production to the right end of the left string.
- Left string can be implemented by stack.
- Top of the stack is the pipe.
- Shift pushes terminal onto the stack.
- Reduce:
- pops >= 0 symbols off stack (production RHS)
- pushes a non-terminal onto the stack (production LHS)
- How to decide when to shift or reduce?
- Example grammar:
E -> T + E | T
T -> int * T | int | ( E )
- Consider the step:
int | * int + int
- Could try to reduce T -> int.
  - Fatal mistake! Can't then reduce to the start symbol E. No production starts with "T *".
- Intuition: only reduce if the result can still be reduced to the start symbol.
- Assume a rightmost derivation:
S ->* aXw -> abw
- Remember we're going in reverse, performing reductions.
- Then ab is a handle of abw.
- A handle is a reduction that still allows further reductions back to the start symbol.
- Only reduce at handles.
- Important fact #2: In shift-reduce parsing, handles appear only at the top of the stack, never inside.
- Informal proof by induction on number of reduce moves.
- True initially, stack empty.
- Immediately after reducing a handle:
- We reduce to a non-terminal, handle on top of the stack, hence right-most non-terminal on top of the stack.
- Next handle must be to right of right-most non-terminal, because this is a right-most derivation.
- Sequence of shift moves reaches next handle.
- Handles are never to the left of the rightmost non-terminal.
  - Shift-reduce moves are sufficient; the pipe never needs to move left.
- Bottom-up parsing algorithms are based on recognizing handles.
- Bad news:
  - No known efficient algorithms to recognize handles in general.
- Good news:
  - Some good heuristics.
  - On some CFGs the heuristics always guess correctly.
- All CFGs $\supset$ Unambiguous CFGs $\supset$ LR(k) CFGs $\supset$ LALR(k) CFGs $\supset$ SLR(k) CFGs.
  - LR(k): left-to-right scan, rightmost derivation, k tokens of lookahead.
  - SLR(k): simple LR(k).
- $\alpha$ is a viable prefix if there is an $\omega$ such that $\alpha | \omega$ is a state of a shift-reduce parser.
  - alpha is the stack.
  - omega is the rest of the input.
  - The parser knows all about alpha.
  - The parser knows a little of omega due to lookahead, but certainly not all of it.
- What does this mean?
  - A viable prefix does not extend past the right end of the handle.
  - It is called viable because it is a prefix of a handle.
  - As long as the parser has viable prefixes on the stack, no parsing error has been detected.
- Important fact #3 about bottom-up parsing: For any grammar, the set of viable prefixes is a regular language.
- Going to show how to compute an automaton that accepts viable prefixes.
- An item is a production with a "." somewhere on the RHS.
- The items for T -> ( E ) are:
T -> .(E)
T -> (.E)
T -> (E.)
T -> (E).
- For X -> epsilon the only item is X -> .
- Items are often called LR(0) items.
- The stack has only bits and pieces of the RHS's of productions.
  - These bits and pieces are always prefixes of the RHS's of productions.
- Consider:
E -> T + E | T
T -> int * T | int | ( E )
- and input:
( int )
- Then ( E | ) is a state of a shift-reduce parse.
  - ( E is a prefix of the RHS of T -> ( E ).
  - It will be reduced after the next shift.
- Item T -> ( E. ) says that so far we have seen ( E of this production and hope to see ).
Stack is actually composed of prefixes of RHS's.
Prefix_1 Prefix_2 ... Prefix_n
-
Let Prefix_i be prefix of RHS of X_i -> alpha_i.
- Prefix_i eventually reduces to X_i.
- Missing part of alpha_{i-1} starts with X_i.
- i.e. there is a X_{i-1} -> Prefix_{i-1} X_i Beta for some Beta.
-
Recursively, Prefix_{k+1} ... Prefix_n eventually reduces to the missing part of alpha_k.
- Favourite grammar, with input ( int * int ):
  - ( int * | int ) is a state of the shift-reduce parse.
  - ( is a prefix of the RHS of T -> ( E ).
  - epsilon is a prefix of the RHS of E -> T.
  - int * is a prefix of the RHS of T -> int * T.
- The "stack of items" (the bottom of this list is the top of the stack):
T -> (.E)
E -> .T
T -> int * .T
- Says:
  - We've seen ( of T -> ( E ).
  - We've seen epsilon of E -> T.
  - We've seen int * of T -> int * T.
  - Notice each item's LHS becomes part of the RHS of the item below it on the stack.
- Idea, to recognize viable prefixes:
  - Recognize a sequence of partial RHS's of productions.
  - Each partial RHS can eventually reduce to part of the missing suffix of its predecessor.
- Algorithm (should watch the video for the full explanation):
  - Add a dummy production S' -> S to G.
  - The NFA states are the items of G, including the extra production.
  - Remember we claim the set of viable prefixes is a regular language, so it can be recognised by an NFA.
  - NFA(stack) = { yes, no } (viable prefix or not).
  - For item $E \to \alpha . X \beta$, add the transition:
    - $E \to \alpha . X \beta \xrightarrow{X} E \to \alpha X . \beta$
    - We've seen alpha on the stack; the next input is the terminal or non-terminal X. (Shift.)
  - For item $E \to \alpha . X \beta$ and production $X \to \gamma$, add:
    - $E \to \alpha . X \beta \xrightarrow{\epsilon} X \to . \gamma$
    - (Epsilon move.) We only have partial RHS's of productions: it's possible that we see alpha and then something that will eventually reduce to X.
  - Every state is an accepting state.
  - The start state is S' -> .S
S' -> E
E -> T + E | T
T -> int * T | int | (E)
- Draw start state.
S' -> .E
- See E on the stack, so draw ->E to:
S' -> E.
- This is the state if you've read the entire input and finished parsing.
- Then draw an epsilon move from S' -> .E to:
E -> .T
- Then draw an epsilon move from S' -> .E to:
E -> .T+E
- Notice this is using the power of an NFA, and the grammar is not even left-factored.
- E -> .T ->T E -> T.
- When the dot reaches all the way to the right-hand end we say that we've recognised a handle.
- If we don't see a T then we need to see something that reduces to T:
  - E -> .T ->e T -> .int
  - E -> .T ->e T -> .(E)
  - E -> .T ->e T -> .int * T
- For terminals nothing reduces to them, so you need to see the terminal itself in order to shift the dot.
- Can convert the viable prefix NFA to a DFA. !!AI watch lecture 04-04 again.
- The states of the DFA are the canonical collections of LR(0) items.
  - The Dragon book gives another way of constructing the LR(0) items.
- Item X -> b.y is valid for viable prefix ab if:
S' ->* aXw -> abyw
(by a right-most derivation)
- After parsing ab, the valid items are the possible tops of the stack of items.
- An item I is valid for a viable prefix a if the DFA recognizing viable prefixes terminates on input a in a state s containing I.
- The items in s describe what the top of the item stack might be after reading input a.
- An item is often valid for many prefixes.
  - e.g. item T -> (.E) is valid for prefixes:
( (( ((( (((( ...
- Define a very weak bottom-up parsing algorithm called LR(0) parsing.
- Assume:
  - the stack contains $\alpha$,
  - the next input is t,
  - the DFA on input $\alpha$ terminates in state s.
- Reduce by $X \to \beta$ if:
  - s contains the item $X \to \beta .$
- Shift if:
  - s contains the item $X \to \beta . t \omega$
  - Equivalent to saying s has a transition labelled t.
- LR(0) has a reduce/reduce conflict if:
  - any state has two reduce items, i.e. $X \to \beta .$ and $Y \to \omega .$
- LR(0) has a shift/reduce conflict if:
  - any state has a reduce item and a shift item, i.e. $X \to \beta .$ and $Y \to \omega . t \delta$.
- Example of a DFA state with a shift-reduce conflict:
E -> T.
E -> T. + E
- The first item suggests reduce, the second suggests shift, if the input is '+'.
- SLR = "Simple LR".
- SLR improves on the LR(0) shift/reduce heuristics.
  - Fewer states have conflicts.
- Same assumptions as LR(0).
- Reduce by $X \to \beta$ if:
  - s contains the item $X \to \beta .$
  - and (new) $t \in Follow(X)$.
    - (If t can't follow X it doesn't make sense to perform this reduction.)
- Shift if:
  - s contains the item $X \to \beta . t \omega$.
- If there are any conflicts under these rules, the grammar is not SLR.
- Take another look at the shift-reduce conflict under LR(0):
E -> T.
E -> T. + E
- We will only reduce on input '+' if '+' is in Follow(E).
- Follow(E) = { $, ')' }, so on '+' we shift: conflict resolved.
- Lots of grammars aren't SLR.
  - Including all ambiguous grammars.
- We can parse more grammars by using precedence declarations.
- Our favourite ambiguous grammar:
E -> E + E | E * E | (E) | int
- The DFA of viable prefixes for this grammar contains a state with the following items, hence a shift/reduce conflict:
E -> E * E .
E -> E . + E
- Declaring "* has higher precedence than +" resolves this conflict in favour of reducing.
- These declarations don't really define precedence; they define conflict resolutions.
- SLR parsing algorithm:
  - Let M be the DFA for viable prefixes of G.
  - Let | x_1 ... x_n $ be the initial configuration.
  - Repeat until the configuration is S | $:
    - Let $\alpha | \omega$ be the current configuration.
    - Run M on the current stack $\alpha$.
    - If M rejects $\alpha$, report a parsing error.
      - Stack $\alpha$ is not a viable prefix.
      - (This rule is not actually needed; the rejection below prevents bad stacks.)
    - If M accepts $\alpha$ with items I, let a be the next input:
      - Shift if X -> b . a y is in I.
      - Reduce if X -> b . is in I and a is in Follow(X).
      - Report a parsing error if neither applies.
- !!AI just watch it!
  - Columns: Configuration, DFA Halt State, Action.
- Rerunning the viable prefix automaton on the stack at each step is wasteful.
  - Most of the work is repeated.
- Remember the state of the automaton on each prefix of the stack.
- Change the stack to contain pairs:
< Symbol, DFA State >
- The bottom of the stack is <any, start>.
  - any is a dummy symbol.
  - start is the start state of the DFA.
- Define goto[i, A] = j if state_i ->A state_j.
  - goto is just the transition function of the DFA.
  - It is one of the two parsing tables (the other is the action table).
- The SLR parsing algorithm has four possible moves:
  - Shift x:
    - Push <a, x> on the stack, where a is the current input and x is a DFA state.
  - Reduce $X \to \alpha$: as before.
  - Accept.
  - Error.
- Constructing the action table: for each state s_i and terminal a:
  - If s_i has item $X \to \alpha . a \beta$ and goto[i, a] = j, then action[i, a] = shift j.
    - The state is s_i, the next input is a, and goto for state i on input a is state j; so it's OK to shift.
  - If s_i has item $X \to \alpha .$ and $a \in Follow(X)$ and $X \neq S'$, then action[i, a] = reduce $X \to \alpha$.
    - S' is the special new start symbol; don't do this if we're reducing to S'.
  - If s_i has item $S' \to S .$, then action[i, $] = accept.
  - Otherwise, action[i, a] = error.
- Full algorithm:
Let I = w$ be initial input
Let j = 0
Let DFA state 1 have item S' -> .S
Let stack = <dummy, 1>
repeat
  case action[top_state(stack), I[j]] of
    shift k: push <I[j++], k>
    reduce X -> A: pop |A| pairs,
                   push <X, goto[top_state(stack), X]>
    accept: halt normally
    error: halt and report error
- I is the input array, indexed by j.
- Say the first DFA state is 1.
- "pop |A| pairs" means pop as many pairs as there are symbols in A (the RHS).
- Note that we only use the DFA states and the input.
  - We don't use the stack symbols!
- We still need the symbols for semantic actions.
- Some common constructs are not SLR(1).
- LR(1) is more powerful.
  - Lookahead is built into the items.
  - An LR(1) item is a pair: an LR(0) item and a lookahead token.
  - [T -> . int * T, $] means: after seeing T -> int * T, reduce if the lookahead is $.
  - More accurate than just using Follow sets.
  - An actual parser generator will show you the LALR(1) automaton.
S -> Sa
S -> b
- The language is b a*.
- Left-recursive, but not a problem for LR(k) parsers.
- First step: add the new production:
S' -> S
S -> Sa
S -> b
- Rather than build the NFA and then use the subset construction, let's just build the DFA directly.
- Start DFA state:
S' -> .S
- Remember all NFA epsilon moves are due to not seeing a non-terminal on the stack but rather something that derives a non-terminal.
  - epsilon moves to all the productions of S.
- We know the subset construction performs an epsilon closure of the start state of the NFA to get the start state of the DFA.
- Hence we know all the items in the start state of the DFA:
S' -> .S
S -> .Sa
S -> .b
- What symbols may we see on the stack?
  - b:
    - The next state is S -> b.
  - S:
    - The next state has two items:
S' -> S. (complete, dot all the way on the right)
S -> S.a (not complete)
    - On input a, the next state is the last state, S -> Sa.
- Look for shift-reduce conflicts and reduce-reduce conflicts.
- Any state with two items that reduce?
- Any ambiguity between needing to shift and needing to reduce on an input? Remember the Follow(X) rule.
- The key state that may contain problems is:
S' -> S.
S -> S.a
- On input a we may either reduce to S' or shift.
- We will only reduce if a is in Follow(S').
- Follow(S') = { $ }.
  - Nothing can follow it, because it is the start symbol.
- Dot all the way on the right => reduction.
- Hence no conflict: on a we shift; on $ we reduce by S' -> S.
- Lexical analysis: detects inputs with illegal tokens.
- Parsing: detects inputs with ill-formed parse trees.
- Semantic analysis: catches the rest of the errors; the last front-end phase.
- Some language constructs are not context-free, so parsing can't catch everything.
- Just as in lexing some things can't be caught by a finite automaton.
- Matching identifier declarations with uses.
- Scope of an identifier: the portion of the program in which the identifier is accessible.
- Different scopes for the same name don't overlap; there is only one meaning for an identifier at any point.
- An identifier may have restricted scope.
- Static scope: scope depends only on the program text.
- Dynamic scope: scope also depends on execution.
- A dynamically-scoped variable refers to the closest enclosing binding in the execution of the program.
- e.g.
g(y) = let a <- 4 in f(3);
f(x) = a;
- What is a? It is 4, because of dynamic scope.
- Much of semantic analysis is a recursive descent of the AST:
  - Before: process AST node n.
  - Recurse: process the children of n.
  - After: finish processing AST node n.
- e.g. the scope of a let binding is one subtree of the AST:
let x: Int <- 0 in e
- x is in scope in the subtree for e.
- Idea:
  - Before processing e: add the definition of x to the current definitions, overriding any previous definition of x.
  - Recurse.
  - After processing e: remove the definition of x and restore any previous definition.
- Symbol table: tracks the current bindings of identifiers.
- Could use a simple stack for the symbol table.
  - This handles variable hiding.
  - But it can't catch multiple definitions in the single scope of a function declaration, e.g.
f(x: Int, x: Int) { }
- Fancier symbol table interface:
  - enter_scope(): start a new nested scope.
  - find_symbol(x): find the current x, or null.
  - add_symbol(x): add symbol x to the table.
  - check_scope(x): true if x is defined in the current scope.
  - exit_scope(): exit the current scope.
- Implemented as a stack of scopes (see the sketch below).
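A minimal sketch of such a symbol table in C: one array doubles as the stack of scopes, with NULL entries as scope markers. Names and sizes are illustrative assumptions:

```c
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 256
#define MARKER NULL   /* a NULL name marks a scope boundary */

static const char *table[MAX_ENTRIES];
static int top = 0;

void enter_scope(void)         { table[top++] = MARKER; }
void add_symbol(const char *x) { table[top++] = x; }

/* find_symbol: search the whole stack, innermost scope first. */
const char *find_symbol(const char *x) {
    for (int i = top - 1; i >= 0; i--)
        if (table[i] != MARKER && strcmp(table[i], x) == 0)
            return table[i];
    return NULL;
}

/* check_scope: search only down to the nearest scope marker. */
int check_scope(const char *x) {
    for (int i = top - 1; i >= 0 && table[i] != MARKER; i--)
        if (strcmp(table[i], x) == 0)
            return 1;
    return 0;
}

/* exit_scope: pop entries until this scope's marker is removed. */
void exit_scope(void) {
    while (top > 0 && table[--top] != MARKER)
        ;
}

int main(void) {
    enter_scope();                 /* scope of a function's parameters */
    add_symbol("x");
    /* f(x: Int, x: Int) is caught: x is already in the current scope */
    printf("duplicate x? %s\n", check_scope("x") ? "yes" : "no");
    enter_scope();                 /* a nested let */
    add_symbol("x");               /* hides the outer x: allowed */
    printf("x visible? %s\n", find_symbol("x") ? "yes" : "no");
    exit_scope();
    exit_scope();
    return 0;
}
```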
- Class names can be used before being defined.
- Hence we need two passes:
  - Pass 1: gather all class names.
  - Pass 2: do the checking.
- Semantic analysis requires multiple passes, probably more than two.
- A type is a set of values together with a set of operations on those values.
- e.g. int: can +, -, leq, geq.
- Classes are one instantiation of the modern notion of type.
- Certain operations are legal only for values of each type.
  - But at the assembly level, e.g., adding two words is always valid, even if one is an int and the other is a pointer to a pointer.
  - The language's type system specifies which operations are valid for which types.
- Statically typed: checked during compilation (C, Java, Cool).
- Dynamically typed: checked during execution (Scheme, Lisp, Python).
- No typing: assembly.
- Type checking: verifying the types of a fully built AST.
- Type inference: filling in missing type information.
- The two are different, but the terms are often used interchangeably.
## 09-05: Type Checking
- So far we've seen two examples of formal notation:
  - Regular expressions.
  - Context-free grammars.
- For type checking the formalism is logical rules of inference.
  - If Hypothesis is true, then Conclusion is true.
- Building blocks:
  - ^ is "and".
  - => is "if-then".
  - x:T is "x has type T".
- e.g.
If e_1 has type Int, and e_2 has type Int, then e_1 + e_2 has type Int.
(e_1 has type Int ^ e_2 has type Int) => e_1 + e_2 has type Int
(e_1: Int ^ e_2: Int) => e_1 + e_2: Int
- Traditionally inference rules are written:
|- Hypothesis ... |- Hypothesis
-------------------------------
|- Conclusion
- |- means "it is provable that..." (turnstile).
e.g.
i is an integer literal [int]
-----------------------
|- i : int
|- e_1: Int |- e_2: Int [Add]
------------------------
|- e_1 + e_2: Int
- A type system is sound if:
  - Whenever |- e:T, then when executed, e evaluates to a value of type T.
  - We only want sound rules.
  - But some sound rules are better than others! Need to be as specific as possible.
- Type checking proves facts e:T.
  - The proof is on the structure of the AST.
  - The proof has the shape of the AST.
  - One type rule is used per AST node.
- Types are computed in a bottom-up pass over the AST (see the sketch below).
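A minimal sketch in C of a bottom-up type computation for just the [int] and [Add] rules above; the AST representation is an assumption for illustration:

```c
#include <stdio.h>

typedef enum { TY_INT, TY_ERROR } Type;
typedef enum { LIT, ADD } Kind;

typedef struct Expr {
    Kind kind;
    int value;                   /* for LIT */
    struct Expr *lhs, *rhs;      /* for ADD */
} Expr;

Type type_check(Expr *e) {
    switch (e->kind) {
    case LIT:
        return TY_INT;           /* [int]: i is an integer literal |- i : Int */
    case ADD: {
        /* [Add]: |- e_1 : Int and |- e_2 : Int  =>  |- e_1 + e_2 : Int */
        Type t1 = type_check(e->lhs);   /* children first: bottom-up */
        Type t2 = type_check(e->rhs);
        return (t1 == TY_INT && t2 == TY_INT) ? TY_INT : TY_ERROR;
    }
    }
    return TY_ERROR;
}

int main(void) {
    Expr one = { LIT, 1, NULL, NULL };
    Expr two = { LIT, 2, NULL, NULL };
    Expr sum = { ADD, 0, &one, &two };   /* 1 + 2 */
    printf("%s\n", type_check(&sum) == TY_INT ? "Int" : "type error");
    return 0;
}
```

Note the proof of 1 + 2 : Int has exactly the shape of the AST: one rule applied per node.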
Readings notes
- CPTT: Compilers: Principles, Techniques, and Tools
- EC: Engineering a Compiler
- Read EC first, then CPTT.
- Week 2 (Lexical Analysis and Finite Automata)
- CPTT sections 3.1, 3.3, 3.4, 3.6, 3.7, 3.8
- EC: Chapter 2 through Section 2.5.1, except 2.4.4
EC, lexical analysis and finite automata
- Recognizer: program that identifies specific words in a stream of characters.
- Syntactic category: each output from recognizer is classified according to grammatical usage.
- Microsyntax: lexical structure of language.
- e.g. English groups alphabetic letters. Spaces terminate words.
- After an English word is identified, use a dictionary lookup to check that it is valid.
- Typical programming language has keywords, reserved for particular syntactic purpose and cannot be used as an identifier.
- Compiler writer starts with microsyntax specification, encodes into notation used by scanner generator.
Recognizing words
- FSM states are circles; the labels above the arrows are the characters that cause each transition.
- A double circle marks a state where a word is recognized.
- Otherwise, move to an error state.
- Finite automaton (FA): a formalism for recognizers that has:
  - a finite set of states (S) and an error state s_e,
  - an alphabet ($\Sigma$), usually the union of edge labels in the transition diagram,
  - a transition function $\delta(s, c)$ that returns the next state, where s is a state and c is a character,
  - a start state (s_0), and
  - one or more accepting states (S_A), the double circles in a transition diagram.
- The recognizer takes one transition per input character, so run time is O(n), where n is the length of the input string.
- Lexeme: the actual text for a word recognized by an FA; an actual instance of a syntactic category.
  - e.g. "113" is a lexeme of the syntactic category unsigned integer.
- Infinite sets, like the set of integers, require cyclic transition diagrams.
Regular expressions
- The set of words accepted by an FA (F) forms a language L(F).
- For any FA we can also describe L(F) using a regular expression (RE).
- Zero or more instances is the * operator, or Kleene closure.
- An RE is made up of three basic operations:
  - Alternation (R | S).
  - Concatenation (xy).
  - Closure ($R^*$).
    - Finite closure ($R^i$): one to i instances of R.
    - Positive closure ($R^+$): one or more instances of R.
- $\epsilon$ is the empty string and is itself an RE.
- Complement operator: the notation ^c specifies the set $(\Sigma - c)$. Higher precedence than *, |, or $^+$.
- Escape sequence: two or more characters that a scanner translates into another character, e.g. \n.
- Regular languages: any language that can be specified by an RE.
- Programming languages vs. natural languages:
  - In natural languages words require understanding of context to derive meaning.
  - In programming languages words are almost always specified lexically.
- REs are closed under concatenation, union, closure, and complement, e.g. RE_1 | RE_2 => RE.
- Complete FA: an FA that explicitly includes all error transitions.
- $\epsilon$-transition: a transition on the empty string $\epsilon$ that does not advance the input.
  - Used to combine FAs.
- Nondeterministic FA: an FA that allows transitions on the empty string $\epsilon$, and states that have multiple transitions on the same character.
  - Its behaviour depends on future characters.
- Deterministic FA: a DFA is an FA where the transition function $\delta(s, a)$ is single-valued. DFAs do not allow $\epsilon$-transitions.
- Configuration of an NFA: the set of concurrently active states of an NFA.
  - On a multi-valued transition the NFA clones itself, once per possibility.
  - Always finite, but could be exponential in number.
- To simulate an NFA, we need a DFA with a state for each configuration of the NFA.
- Powerset of N: the set of all subsets of N, denoted $2^N$.
- p46, figure 2.4: trivial NFAs for REs.
- Thompson's construction: REs -> NFA.
  - Combines the trivial NFAs in an order consistent with the precedence of the operators.
- Subset construction: NFA -> DFA.
- Valid configuration: a configuration of an NFA that can be reached by some input string.
- Subset construction example, p50-51.
- (not in class) Hopcroft's algorithm: DFA -> minimal DFA.
Table-driven scanners
CPTT, lexical analysis and finite automata
- Week 3 (Parsing)
- CPTT section 4.1-4.6, 4.8.1, 4.8.2
- EC sections 3.1-3.4
- CPTT
EC 3.1-3.4
- Parsing: given a stream s of words and a grammar G, find a derivation in G that produces s.
- Context-free grammar: for a language L its CFG defines the set of strings of symbols that are valid sentences in L.
- Sentence: a string of symbols that can be derived from the rules of a grammar.
- Language defined by G: the set of sentences that can be derived from G.
- Production: a rule in a CFG.
- Nonterminal symbol: a syntactic variable used in a grammar's productions.
- Terminal symbol: a word that can occur in a sentence.
- Word: a lexeme and its syntactic class. Words are represented in a grammar by their syntactic category.
- Start symbol or goal symbol: a nonterminal symbol that represents the set of all sentences in L(G).
- Derivation: a sequence of rewriting steps that begins with the grammar's start symbol and ends with a sentence in the language.
- Sentential form: a string of symbols that occurs as one step in a valid derivation.
- Parse tree or syntax tree: a graph that represents a derivation.
- p89 is good diagram of derivation and corresponding parse tree.
- Rightmost derivation: a derivation that rewrites, at each step, the rightmost nonterminal.
- Leftmost derivation: a derivation that rewrites, at each step, the leftmost nonterminal.
- Despite using rightmost or leftmost derivation the corresponding parse trees are identical.
- Ambiguity: a grammar G is ambiguous if some sentence in L(G) has more than one rightmost (or leftmost) derivation.
- p91, if-then-else ambiguity example.
- p93, figure 3.1: classic expression grammar for operator precedence; p94 has an example derivation and corresponding parse tree.
- Classes of CFGs and their parsers.
- Arbitrary CFGs take O(n^3) time to parse, so are usually not used.
- LR(1) grammars are a large subset of unambiguous CFGs. Parsed bottom-up, linear, left-to-right, looking ahead at most one symbol. The favourite.
- LL(1) grammars are a subset of LR(1). Parsed top-down, linear, left-to-right, looking ahead at most one symbol.
- Regular grammars (RGs) are CFGs that generate regular languages. Subset of LL(1). Equivalent to regexp's.
- Why is top-down parsing efficient? A large subset of context-free grammars can be parsed without backtracking.
TOREAD from p97
- Week 4 (Semantic Analysis and Types)
- CPTT section 5.1-5.3, 6.3, 6.5
- EC sections 4.1-4.4
- Week 5 (Runtime Systems and Code Generation)
- CPTT section 6.2, 7.1-7.4, 8.1-8.3, 8.6
- EC chapter 5, sections 7.1-7.2
- Week 6 (Optimization)
- CPTT section 8.5, 8.7, 9.1-9.4
- EC section 8.1-8.4, 8.6, 9.1-9.3
- Week 7 (Advanced topics: Register allocation and Garbage Collection)
- CPTT sections 7.5-7.6, section 8.8
- EC sections 13.1-13.2, 13.4
- Quiz 2
- Question 3
- When does a grammar not produce a regular language?
  - Look for any notion of counting, which an NFA cannot do, as it only has a finite number of states. Compare these two grammars:
A -> (A( | epsilon
A -> (A) | epsilon
  - The first is a regular grammar; it just generates pairs of open parens.
  - The second is not a regular grammar; we're matching open parens with closing ones, hence we need to "count" how many open parens we have in order to close them.
- When looking for left-recursive grammars make sure the left-most symbol in some chain of productions is a non-terminal, not a terminal (check the ORs too).
- For a CFG to encode operator precedence we need:
  - Addition associates to the left.
  - Multiplication binds more tightly than addition.
- Context-free => the grammar requires no memory. Same as a regular language?
- An NFA can determine parity, i.e. an even or odd number of something, but it can't count.