### **Lexical Analysis**
The **first step of compilation**, which converts source code into a sequence of tokens.


### Important terms:

* **Lexeme**: The smallest meaningful unit in the source code. Example: `if`, `x`, `42`, `+` are all lexemes.
* **Token**: A **category** or **type** that groups lexemes. Example:
  * `if` → keyword token
  * `x` → identifier token
  * `42` → number token
  * `+` → operator token
* **Lexer** or **Scanner**: The program that performs lexical analysis: it reads source code and produces tokens. Two common ways to build one:
  * **Regular Expressions (Regex) Lexer**: Write rules like:
    * identifiers = `[a-zA-Z_][a-zA-Z0-9_]*`
    * numbers = `[0-9]+`

    Then, a program uses these rules to recognize tokens.
  * **Finite State Machine (FSM) Lexer**: Build a machine with states and transitions that recognizes valid tokens character by character (like following a flowchart of character inputs).
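The regex approach can be sketched in a few lines of Python. This is a minimal illustration, not a production lexer: the keyword set, the `OPERATOR`/`PUNCT` categories, and the names `TOKEN_SPEC` and `tokenize` are assumptions made for the example; only the identifier and number patterns come from the notes above.

```python
import re

# Token categories paired with the regex that recognizes their lexemes.
# Order matters: KEYWORD is tried before IDENTIFIER so `if` is not an identifier.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|while|int)\b"),  # illustrative keyword set
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("PUNCT",      r"[;(){}]"),
    ("SKIP",       r"\s+"),                        # whitespace, discarded
    ("MISMATCH",   r"."),                          # any other character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Turn source text into a list of (token_type, lexeme) pairs."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {lexeme!r} at offset {match.start()}")
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("int x = 42;"))
```

Each pair in the result holds the token (the category) and the lexeme (the matched text), which mirrors the distinction drawn in the definitions above.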
  
### **Syntax Analysis**
After lexical analysis, the **parser** takes the sequence of tokens produced by the lexer and tries to build a structure (a parse tree) according to the grammar rules of the language. For example:
```C
int x = 42;
```
The tokens are `int`, `x`, `=`, `42`, `;`. The *parser* checks whether this sequence fits a grammar rule such as `Statement → Type Identifier '=' Number ';'` and, if so, builds a parse tree.
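As a rough illustration of that check, the sketch below matches a token list against the single rule `Statement → Type Identifier '=' Number ';'` and returns a parse tree as nested tuples. The `(token_type, lexeme)` token format and the names `RULE` and `parse_statement` are hypothetical choices for this example; a real parser handles many rules, not one.

```python
# Hypothetical token format: (token_type, lexeme) pairs, as a lexer might emit.
TOKENS = [("Type", "int"), ("Identifier", "x"), ("=", "="), ("Number", "42"), (";", ";")]

# The single grammar rule from the text: Statement -> Type Identifier '=' Number ';'
RULE = ("Statement", ["Type", "Identifier", "=", "Number", ";"])

def parse_statement(tokens):
    """Return a parse tree (nested tuples) if tokens fit RULE, else raise SyntaxError."""
    head, body = RULE
    if len(tokens) != len(body):
        raise SyntaxError(f"expected {len(body)} tokens, got {len(tokens)}")
    children = []
    for expected, (kind, lexeme) in zip(body, tokens):
        if kind != expected:
            raise SyntaxError(f"expected {expected}, got {kind} ({lexeme!r})")
        children.append((kind, lexeme))   # each token becomes a leaf of the tree
    return (head, children)

tree = parse_statement(TOKENS)
print(tree)
```

The root of the tree is the non-terminal `Statement`, and its children are the tokens in the order the rule lists them.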

#### Parsing strategies:
* **Top-Down Parsing**: Start from the root of the parse tree and try to build it down to the leaves (tokens).
* **Bottom-Up Parsing**: Start from the tokens and try to build up to the root of the parse tree.

### Parsing as Graph Search

1. **Parsing** = build structure (parse tree) from tokens using grammar.
2. **Naive BFS** over derivations (nodes = sentential forms, edges = rule applications) → explores all possibilities → exponential growth.
3. **Prefix test** → cut branches whose leading terminals don't match the target string's prefix.
4. **Leftmost derivation** = always expand the leftmost non-terminal first.
   * Makes search systematic.
5. **BFS + Leftmost + Prefix test** → faster, fewer branches.
6. **Problem**: Some grammars still blow up (too many useless branches).
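The combination in step 5 can be sketched as follows. The toy grammar `S → a S b | a b` and the function name `bfs_parse` are assumptions for the example. The extra length check is a simplification that is only valid when the grammar has no ε-productions and no unit cycles such as `A → A`, so that sentential forms never shrink.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | a b   (the language a^n b^n, n >= 1)
GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}

def bfs_parse(grammar, start, target):
    """Return a leftmost derivation of target as a list of sentential forms, or None."""
    target = tuple(target)
    queue = deque([[(start,)]])           # each queue entry is a derivation so far
    while queue:
        path = queue.popleft()
        form = path[-1]
        # Leftmost derivation: locate the leftmost non-terminal in the current form.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:                   # only terminals left
            if form == target:
                return path
            continue
        for production in grammar[form[idx]]:
            new_form = form[:idx] + tuple(production) + form[idx + 1:]
            # Prefix test: the terminals before the first non-terminal must
            # agree with the same-length prefix of the target.
            k = next((i for i, s in enumerate(new_form) if s in grammar), len(new_form))
            # Length pruning: assumes no epsilon-productions (forms never shrink).
            if new_form[:k] == target[:k] and len(new_form) <= len(target):
                queue.append(path + [new_form])
    return None

for step in bfs_parse(GRAMMAR, "S", "aabb"):
    print(" ".join(step))
```

Without the prefix test, the branch that expands `S` to `a b` first would stay in the queue even though `ab` can never grow into `aabb`; with it, that branch is dropped immediately.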

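- **Recursive Descent Parsing**: A top-down parsing method that uses DFS instead of BFS, implemented with recursion. Each non-terminal in the grammar has its own function, and each rule for that non-terminal is handled by code inside that function.
    - **Problem**: Infinite recursion on left-recursive grammars $\to$ Solution: eliminate **left recursion**.
- **Left Recursion**: A grammar is left recursive if a non-terminal $A$ can derive a form starting with itself, $A \Rightarrow^+ A\gamma$, where $\Rightarrow^+$ means in one or more steps.
    - Direct left recursion: a rule of the form $A \to A\gamma$, so $A \Rightarrow A\gamma$ in a single step.
    - Indirect left recursion: $A \Rightarrow^+ A\gamma$ only through other non-terminals, e.g. $A \Rightarrow B\delta \Rightarrow^+ A\gamma$.
    - **Problem**: Left recursion causes infinite recursion in recursive descent parsing, because the function for $A$ calls itself before consuming any token.
    - **Solution**: Eliminate left recursion by rewriting the rules. To remove direct left recursion: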

Suppose we have:

```
A ::= Aα1 | Aα2 | ... | Aαn | β1 | ... | βk
```

(where each `β` does NOT start with `A`).

We rewrite as:

```
A  ::= β1Â | β2Â | ... | βkÂ
Â  ::= α1Â | α2Â | ... | αnÂ | ε
```

(Â is a new non-terminal, and ε = empty string).
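The rewrite above is mechanical, so it can be expressed as a short function. This is a sketch under a few assumptions: productions are lists of symbols with `[]` standing for ε, every `α` is non-empty, and the new non-terminal is named by appending `'` (so `E'` plays the role of `Ê`), a name assumed not to be in use already. The function name is hypothetical.

```python
def eliminate_direct_left_recursion(nonterminal, productions):
    """Rewrite A ::= A a1 | ... | A an | b1 | ... | bk without direct left recursion.

    productions is a list of symbol lists; [] stands for epsilon.
    Returns a dict with the rules for A and for the new non-terminal A'.
    """
    # Left-recursive alternatives: keep only the part after the leading A (the alphas).
    alphas = [p[1:] for p in productions if p and p[0] == nonterminal]
    # All other alternatives (the betas) do not start with A.
    betas = [p for p in productions if not p or p[0] != nonterminal]
    if not alphas:
        return {nonterminal: productions}     # no direct left recursion to remove
    new_nt = nonterminal + "'"
    return {
        nonterminal: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],   # [] is epsilon
    }

# E ::= E + T | E - T | T
rules = eliminate_direct_left_recursion("E", [["E", "+", "T"], ["E", "-", "T"], ["T"]])
for head, alternatives in rules.items():
    print(head, "::=", " | ".join(" ".join(alt) or "ε" for alt in alternatives))
```

For the example grammar this yields `E ::= T E'` and `E' ::= + T E' | - T E' | ε`, which matches the general pattern with `β1 = T`, `α1 = + T`, and `α2 = - T`.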

---

#### Predictive Parser

BFS and DFS parsers are slow because they backtrack whenever a chosen rule fails. A **lookahead parser** peeks at the next token to decide which rule to apply, reducing backtracking.
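A minimal sketch of this idea is a recursive-descent parser with one function per non-terminal, where each function inspects the next token before choosing a rule. The grammar used here is an assumption for the example (`E ::= T E'`, `E' ::= '+' T E' | ε`, `T ::= NUMBER`), as are the `Parser` class, its method names, and the `"$"` end-of-input marker.

```python
class Parser:
    """Predictive recursive-descent parser for:
         E  ::= T E'
         E' ::= '+' T E' | epsilon
         T  ::= NUMBER
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (the lookahead)."""
        return self.tokens[self.pos]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() != "$":
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):                          # E ::= T E'
        left = self.term()
        return self.expr_rest(left)

    def expr_rest(self, left):               # E' ::= '+' T E' | epsilon
        if self.peek() == "+":               # lookahead selects the '+' alternative
            self.advance()
            right = self.term()
            return self.expr_rest(("+", left, right))
        return left                          # otherwise take epsilon, no backtracking

    def term(self):                          # T ::= NUMBER
        token = self.peek()
        if not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        self.advance()
        return int(token)

print(Parser(["1", "+", "2", "+", "3"]).parse())
```

Because every decision is made from the single token returned by `peek()`, the parser never has to undo a choice; passing the left operand into `expr_rest` keeps `+` left-associative in the resulting tree.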

#### One-step lookahead
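* We define $FIRST(\gamma)$ as the set of terminals (tokens) that can appear first in any string derived from $\gamma$.
* Rule for an $LL(1)$ grammar:
    - The first L means the input is parsed left to right, the second L means the parser produces a leftmost derivation, and the 1 is the number of lookahead tokens.
    - If $A \rightarrow \alpha$ and $A \rightarrow \beta$ are two different rules, then $FIRST(\alpha) \cap FIRST(\beta) = \emptyset$. (If one alternative can derive $\varepsilon$, the other's FIRST set must also be disjoint from $FOLLOW(A)$, the set of terminals that can appear right after $A$.)
* **LL(k) Parsers** use $k$ tokens of lookahead to decide which rule to apply, which is more powerful but more complex.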
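FIRST sets can be computed by repeating the definition until nothing changes. The sketch below does that and then checks only the pairwise FIRST condition stated above; it does not compute FOLLOW sets, so it is not a complete LL(1) test. The grammar encoding (a dict of symbol lists, `[]` for ε) and all function names are assumptions for the example.

```python
EPSILON = "ε"

def first_of_sequence(symbols, grammar, first):
    """FIRST of a symbol sequence, using the FIRST sets computed so far."""
    result = set()
    for symbol in symbols:
        if symbol not in grammar:            # terminal: it is the first token
            result.add(symbol)
            return result
        result |= first[symbol] - {EPSILON}
        if EPSILON not in first[symbol]:     # this non-terminal cannot vanish
            return result
    result.add(EPSILON)                      # every symbol can vanish (or sequence is empty)
    return result

def first_sets(grammar):
    """Compute FIRST for every non-terminal by fixed-point iteration."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for production in productions:
                new = first_of_sequence(production, grammar, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def first_sets_disjoint(grammar):
    """Check that the alternatives of each non-terminal have disjoint FIRST sets."""
    first = first_sets(grammar)
    for productions in grammar.values():
        seen = set()
        for production in productions:
            current = first_of_sequence(production, grammar, first)
            if seen & current:
                return False
            seen |= current
    return True

# E ::= T E'    E' ::= + T E' | ε    T ::= n | ( E )
G = {"E": [["T", "E'"]], "E'": [["+", "T", "E'"], []], "T": [["n"], ["(", "E", ")"]]}
print(first_sets(G))
print(first_sets_disjoint(G))
```

A left-recursive grammar such as `E ::= E + n | n` fails this check, since both alternatives start with `n`; that is one way to see why left recursion has to be removed before predictive parsing.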
