# First Project

The first project requires you to implement a scanner and a parser for the uC language, which is specified in the [uC BNF Grammar](./uC_Grammar.ipynb) notebook. Study the uC grammar specification carefully. To complete this project, you will use [PLY](http://www.dabeaz.com/ply/), a Python version of the [lex/yacc](http://dinosaur.compilertools.net/) toolset that offers the same functionality with a friendlier interface. Please read this entire section and carefully complete the steps indicated.

## Regular Expressions

Regular expressions are concise ways of describing a set of strings that meet a given pattern. For example, we can specify the regular expression:
```
r'[a-zA-Z_][0-9a-zA-Z_]*'
``` 
to describe valid identifiers in the uC language. Regular expressions are a mini-language that lets you specify the rules for constructing a string set. This specification mini-language is very similar between the different programming languages that contain the concept of regular expressions (also called RE or REGEX). Thus, learning to write regular expressions in Python will also be useful for describing REs in other programming languages.
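
As a quick illustration of how such a pattern defines a set of strings, the sketch below checks a few candidate strings against the identifier pattern using Python's standard `re` module. The helper name `is_identifier` and the sample strings are made up for this example.

```python
import re

# Same identifier pattern as above.
IDENTIFIER = re.compile(r'[a-zA-Z_][0-9a-zA-Z_]*')

def is_identifier(text):
    """Return True when the whole string matches the identifier pattern."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["x", "_tmp1", "9lives", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```

Note that `re.match` only anchors the pattern at the start of the string, so it would accept the prefix `a` of `a-b`; `fullmatch` requires the entire string to belong to the set.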

Your first task is to write a set of regular expressions that will be used by the lexical analyzer (lexer) to recognize the following patterns:

In [None]:
# valid uC identifiers
identifier = r'[a-zA-Z_][0-9a-zA-Z_]*'

In [None]:
# integer constants
int_const = r'0|[1-9][0-9]*'

In [None]:
# floating constants
float_const = r'([0-9]*\.[0-9]+)|([0-9]+\.)' 

In [None]:
# Comments in C-Style /* ... */
ccomment = r'/\*(.|\n)*?\*/'

In [None]:
# Unterminated C-style comment
uccomment = r'/\*(.|\n)*$'

In [None]:
# C++-style comment (//...)
cppcomment = r'//.*\n'

In [None]:
# string_literal
string_literal = r'".*?"'

In [None]:
# unmatched_quote
unquote = r'".*$'

In [None]:
# testing
import re
b = re.match(ccomment, "/***/")
if b:
    pass
else:
    print("Error.")
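
The single check above can be extended into a small table of positive and negative cases. The sketch below is one possible way to do it; the patterns are copied from the cells above, and the sample strings are arbitrary choices for illustration.

```python
import re

int_const = r'0|[1-9][0-9]*'
float_const = r'([0-9]*\.[0-9]+)|([0-9]+\.)'
ccomment = r'/\*(.|\n)*?\*/'

# (pattern, text, should the whole text match?)
cases = [
    (int_const, "0", True),
    (int_const, "42", True),
    (int_const, "007", False),
    (float_const, "1.234", True),
    (float_const, ".5", True),
    (float_const, "3.", True),
    (float_const, "3", False),
    (ccomment, "/***/", True),
    (ccomment, "/* a\n b */", True),
    (ccomment, "/* open", False),
]

# Collect every case whose actual result differs from the expected one.
failures = [
    (pattern, text)
    for pattern, text, expected in cases
    if (re.fullmatch(pattern, text) is not None) != expected
]
print("failures:", failures)
```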

## Deterministic Finite Automata

The lexical analyzer turns your input program into a sequence of tokens: the keywords, names, and punctuation that the language accepts. At the heart of this transformation is the formalism known as finite automata. These are essentially graphs, like transition diagrams, with a few differences:

1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.

2. Finite automata come in two flavors:

(a) Non-deterministic finite automata (NFA) have no restrictions on their edge labels. A symbol can label several edges out of the same state, and $\epsilon$, the empty string, is a possible label.

(b) Deterministic finite automata (DFA) have, for each state and for each symbol of its input alphabet, exactly one edge with that symbol leaving that state.

A DFA accepts an input string x if and only if there is a path in the transition graph from the start state to one of the accepting states such that the symbols along the path spell out x. The following code shows how to implement a simple DFA:

In [None]:
## DFA Simple Implementation Example:
class DFA:
    def __init__(self, states, alphabet, transition_function, start_state, accept_states):
        self.states = states
        self.alphabet = alphabet
        self.transition_function = transition_function
        self.start_state = start_state
        self.accept_states = accept_states
        self.current_state = start_state

    def transition_to_state_with_input(self, input_value):
        # A missing transition sends the automaton to a dead state (None).
        if (self.current_state, input_value) not in self.transition_function:
            self.current_state = None
            return
        self.current_state = self.transition_function[(self.current_state, input_value)]

    def in_accept_state(self):
        # Use the instance's own accept states, not a global variable.
        return self.current_state in self.accept_states

    def go_to_initial_state(self):
        self.current_state = self.start_state

    def run_with_input_list(self, input_list):
        self.go_to_initial_state()
        for inp in input_list:
            self.transition_to_state_with_input(inp)
        return self.in_accept_state()

states = {0, 1, 2, 3};
alphabet = {'a', 'b', 'c', 'd'};

tf = dict();
tf[(0, 'a')] = 1;
tf[(0, 'b')] = 2;
tf[(0, 'c')] = 3;
tf[(0, 'd')] = 0;
tf[(1, 'a')] = 1;
tf[(1, 'b')] = 2;
tf[(1, 'c')] = 3;
tf[(1, 'd')] = 0;
tf[(2, 'a')] = 1;
tf[(2, 'b')] = 2;
tf[(2, 'c')] = 3;
tf[(2, 'd')] = 0;
tf[(3, 'a')] = 1;
tf[(3, 'b')] = 2;
tf[(3, 'c')] = 3;
tf[(3, 'd')] = 0;
start_state = 0;
accept_states = {2, 3};

d = DFA(states, alphabet, tf, start_state, accept_states);

In [None]:
inp_program = list('abcabdc')
print(d.run_with_input_list(inp_program));

In [None]:
inp_program = list('abcdabcdabcd');
print(d.run_with_input_list(inp_program));
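
In the example DFA, every transition depends only on the symbol read (a leads to state 1, b to 2, c to 3, d to 0), so the automaton accepts exactly the non-empty strings whose last symbol is b or c. The sketch below is an illustration, not part of the assignment: it rebuilds the same transition table in compact form and compares it with the regular expression `[abcd]*[bc]` on all strings of length 0 to 4. The names `dfa_accepts` and `EQUIVALENT_RE` are made up for this example.

```python
import re
from itertools import product

ALPHABET = "abcd"
# Same table as above: the target state depends only on the symbol read.
TARGET = {'a': 1, 'b': 2, 'c': 3, 'd': 0}
tf = {(state, sym): TARGET[sym] for state in range(4) for sym in ALPHABET}
ACCEPT = {2, 3}

def dfa_accepts(text, start=0):
    """Run the table-driven DFA over text; a missing transition rejects."""
    state = start
    for sym in text:
        state = tf.get((state, sym))
        if state is None:
            return False
    return state in ACCEPT

EQUIVALENT_RE = re.compile(r'[abcd]*[bc]')

# Compare both recognizers on every string of length 0..4.
mismatches = [
    "".join(chars)
    for n in range(5)
    for chars in product(ALPHABET, repeat=n)
    if dfa_accepts("".join(chars)) != (EQUIVALENT_RE.fullmatch("".join(chars)) is not None)
]
print(dfa_accepts("abcabdc"), dfa_accepts("abcdabcdabcd"), mismatches)
```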

**Exercise**: Change the DFA implementation example above for each of the following languages:

(a) All strings over {a, b, c} that contain an odd number of b’s

(b) All strings over {a, b, c} that contain an even number of a’s and an odd number of b’s

(c) All strings over {a, b, c, d} of length at least 2 whose second symbol does not appear elsewhere in the string. So bdabc, acbab, bacbd, abcdc $\in$ L, while aa, bcabc, abcbc, dd $\not\in$ L

## Writing a Lexer
The process of “lexing” is that of taking input text and breaking it down into a stream of tokens. Each token is like a valid word from the dictionary. Essentially, the role of the lexer is to simply make sure that the input text consists of valid symbols and tokens prior to any further processing related to parsing.

Each token is defined by a regular expression. Thus, your task here is to define a set of regular expressions for the uC language. The actual job of lexing will be handled by PLY. For a better understanding, study the [Lex](http://www.dabeaz.com/ply/ply.html#ply_nn3) chapter in the PLY documentation.

### Specification
Your lexer must recognize the symbols and tokens of uC Grammar. For instance, in the example below, the name on the left is the token name, and the value on the right is the matching text:

Reserved Keywords:
```
    FOR   : 'for'
    IF    : 'if'
    PRINT : 'print'
```

Identifiers:
```
    ID    : any text starting with a letter or '_', followed by any number of letters,
            digits, or underscores, that is not a reserved word.
```

Some Operators and Delimiters:
```
    PLUS    : '+'
    MINUS   : '-'
    TIMES   : '*'
    DIVIDE  : '/'
    ASSIGN  : '='
    SEMI    : ';'
    LPAREN  : '('
    RPAREN  : ')'
```

Literals:
```
    INT_CONST : 123
    FLOAT_CONST : 1.234
    STRING_LITERAL : "Hello World\n"
```


Comments:  To be ignored by your lexer
```
     //             Skips the rest of the line
     /* ... */      Skips a block (no nesting allowed)
```

Errors: Your lexer must report the following error messages:
```
     lineno: Unterminated string
     lineno: Unterminated comment
```
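
To see how the two error cases relate to the patterns from the Regular Expressions section, consider the stand-alone sketch below. It uses only the `re` module, not PLY, and the names `RULES` and `find_unterminated` are hypothetical; in your lexer the same patterns would live in token rules that call the error function. As an assumption of this sketch, the unmatched-quote pattern is compiled with `re.MULTILINE` so that `$` stops at the end of the current line.

```python
import re

# (compiled pattern, error message or None), tried in this order.
# Patterns are the ones from the Regular Expressions section.
RULES = [
    (re.compile(r'//.*\n'), None),                        # C++-style comment
    (re.compile(r'/\*(.|\n)*?\*/'), None),                # terminated comment
    (re.compile(r'/\*(.|\n)*$'), "Unterminated comment"),
    (re.compile(r'".*?"'), None),                         # terminated string
    (re.compile(r'".*$', re.MULTILINE), "Unterminated string"),
]

def find_unterminated(source):
    """Return a list of (lineno, message) pairs, in source order."""
    errors, pos, lineno = [], 0, 1
    while pos < len(source):
        end = pos + 1  # default: skip one ordinary character
        for pattern, message in RULES:
            m = pattern.match(source, pos)
            if m:
                if message:
                    errors.append((lineno, message))
                end = m.end()
                break
        # Keep the line counter in sync with the text just consumed.
        lineno += source.count("\n", pos, end)
        pos = end
    return errors

sample = 'int a = 1; /* ok */\nchar s[] = "abc;\n/* never closed\nint b;\n'
for lineno, message in find_unterminated(sample):
    print("%d: %s" % (lineno, message))
```

The order of the rules matters: the terminated forms are tried before the unterminated ones, so an error is only reported when no closing delimiter can be found.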

### Lex Skeleton

In [None]:
import ply.lex as lex


class UCLexer():
    """ A lexer for the uC language. After building it, set the
        input text with input(), and call token() to get new
        tokens.
    """
    def __init__(self, error_func):
        """ Create a new Lexer.
            An error function. Will be called with an error
            message, line and column as arguments, in case of
            an error during lexing.
        """
        self.error_func = error_func
        self.filename = ''

        # Keeps track of the last token returned from self.token()
        self.last_token = None

    def build(self, **kwargs):
        """ Builds the lexer from the specification. Must be
            called after the lexer object is created.

            This method exists separately, because the PLY
            manual warns against calling lex.lex inside __init__
        """
        self.lexer = lex.lex(object=self, **kwargs)

    def reset_lineno(self):
        """ Resets the internal line number counter of the lexer.
        """
        self.lexer.lineno = 1

    def input(self, text):
        self.lexer.input(text)

    def token(self):
        self.last_token = self.lexer.token()
        return self.last_token

    def find_tok_column(self, token):
        """ Find the column of the token in its line.
        """
        last_cr = self.lexer.lexdata.rfind('\n', 0, token.lexpos)
        return token.lexpos - last_cr

    # Internal auxiliary methods
    def _error(self, msg, token):
        location = self._make_tok_location(token)
        self.error_func(msg, location[0], location[1])
        self.lexer.skip(1)

    def _make_tok_location(self, token):
        return (token.lineno, self.find_tok_column(token))

    # Reserved keywords
    keywords = (
        'ASSERT', 'BREAK', 'CHAR', 'ELSE', 'FLOAT', 'FOR', 'IF',
        'INT', 'PRINT', 'READ', 'RETURN', 'VOID', 'WHILE',
    )

    keyword_map = {}
    for keyword in keywords:
        keyword_map[keyword.lower()] = keyword

    #
    # All the tokens recognized by the lexer
    #
    tokens = keywords + (
        # Identifiers
        'ID',

        # constants
        'INT_CONST', 'FLOAT_CONST',

    )

    #
    # Rules
    #
    t_ignore = ' \t'

    # Newlines
    def t_NEWLINE(self, t):
        r'\n+'
        t.lexer.lineno += t.value.count("\n")

    def t_ID(self, t):
        r'[a-zA-Z_][0-9a-zA-Z_]*'
        t.type = self.keyword_map.get(t.value, "ID")
        return t

    def t_comment(self, t):
        r'/\*(.|\n)*?\*/'
        t.lexer.lineno += t.value.count('\n')

    def t_error(self, t):
        msg = "Illegal character %s" % repr(t.value[0])
        self._error(msg, t)

    # Scanner (used only for test)
    def scan(self, data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print(tok)

if __name__ == '__main__':

    import sys

    def print_error(msg, x, y):
        print("Lexical error: %s at %d:%d" % (msg, x, y))

    m = UCLexer(print_error)
    m.build()  # Build the lexer
    m.scan(open(sys.argv[1]).read())  # print tokens

### Testing
For initial development, try running the lexer on a sample input file such as:

In [None]:
/* comment */
int j = 3;
int main () {
  int i = j;
  int k = 3;
  int p = 2 * j;
  assert p == 2 * i;
}

And the result will look similar to the text shown below.

In [None]:
LexToken(INT,'int',2,14)
LexToken(ID,'j',2,18)
LexToken(EQUALS,'=',2,20)
LexToken(INT_CONST,'3',2,22)
LexToken(SEMI,';',2,23)
LexToken(INT,'int',3,25)
LexToken(ID,'main',3,29)
LexToken(LPAREN,'(',3,34)
LexToken(RPAREN,')',3,35)
LexToken(LBRACE,'{',3,37)
LexToken(INT,'int',4,41)
LexToken(ID,'i',4,45)
LexToken(EQUALS,'=',4,47)
LexToken(ID,'j',4,49)
LexToken(SEMI,';',4,50)
LexToken(INT,'int',5,54)
LexToken(ID,'k',5,58)
LexToken(EQUALS,'=',5,60)
LexToken(INT_CONST,'3',5,62)
LexToken(SEMI,';',5,63)
LexToken(INT,'int',6,67)
LexToken(ID,'p',6,71)
LexToken(EQUALS,'=',6,73)
LexToken(INT_CONST,'2',6,75)
LexToken(TIMES,'*',6,77)
LexToken(ID,'j',6,79)
LexToken(SEMI,';',6,80)
LexToken(ASSERT,'assert',7,84)
LexToken(ID,'p',7,91)
LexToken(EQ,'==',7,93)
LexToken(INT_CONST,'2',7,96)
LexToken(TIMES,'*',7,98)
LexToken(ID,'i',7,100)
LexToken(SEMI,';',7,101)
LexToken(RBRACE,'}',8,103)

Carefully study the output of the lexer and make sure that it makes sense. Once you are reasonably happy with the output, try running some of the more [tricky tests](./lex_unit_tests.ipynb) designed to stress test various corner cases. How would you go about turning these tests into proper unit tests? 

## Writing a Parser

In this step, you write the basic shell of a parser for the uC language. A formal BNF of the language is available [here](./uC_Grammar.ipynb). Your task is to write parsing rules and build the AST for this grammar using PLY. Parsers are defined using PLY’s yacc module (see the [PLY-Yacc](http://www.dabeaz.com/ply/ply.html#ply_nn22) documentation).

Your task is to translate the BNF into a collection of parser functions. For example, a rule such as:
```
  <program> ::= {<global_declaration>}+
```  
gets turned into a Python function of the form:

In [None]:
class Parser:
    ...
    def p_program(self, p):
        """ program  : global_declaration_list
        """
        p[0] = Program(p[1])

    def p_global_declaration_list(self, p):
        """ global_declaration_list : global_declaration
                                    | global_declaration_list global_declaration
        """
        p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]

In the body of each rule, create an appropriate AST node and assign it to p[0] as shown above.
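
To see how the left-recursive list rule accumulates its result, the sketch below replays the two kinds of reduction by hand. A plain Python list stands in for PLY's production object `p` (an assumption made so the example runs without PLY), and the strings stand in for AST nodes.

```python
def reduce_global_declaration_list(p):
    """Same body as p_global_declaration_list, applied to a stand-in p."""
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]
    return p[0]

# global_declaration_list : global_declaration
decls = reduce_global_declaration_list([None, "decl_1"])
# global_declaration_list : global_declaration_list global_declaration
decls = reduce_global_declaration_list([None, decls, "decl_2"])
decls = reduce_global_declaration_list([None, decls, "decl_3"])
print(decls)
```

Each reduction of the second alternative appends one more declaration to the list built so far, which is how the `{...}+` repetition of the BNF is expressed in a yacc-style grammar.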

For line number tracking, you should assign a line number to each AST node as appropriate. See http://www.dabeaz.com/ply/ply.html#ply_nn33. To do this, I suggest pulling the line number off of any nearby terminal symbol. For example:

In [None]:
    def p_identifier(self, p):
        """ identifier : ID """
        p[0] = ID(p[1], lineno=p.lineno(1))

## Abstract Syntax Tree objects
This section defines classes for different kinds of nodes of an Abstract Syntax Tree. During parsing, you will create these nodes and connect them together. In general, you will have a different AST node for each kind of grammar rule.

In [None]:
class Node(object):
    """
    Base class example for the AST nodes.
    
    By default, instances of classes have a dictionary for attribute storage.
    This wastes space for objects having very few instance variables.
    The space consumption can become acute when creating large numbers of instances.

    The default can be overridden by defining __slots__ in a class definition.
    The __slots__ declaration takes a sequence of instance variables and reserves
    just enough space in each instance to hold a value for each variable.
    Space is saved because __dict__ is not created for each instance.
    """
    __slots__ = ()
    
    def children(self):
        """ A sequence of all children that are Nodes. """
        pass

For each specific AST node, you need to add the appropriate ```__slots__``` specification that indicates which fields are to be stored:

In [None]:
class Program(Node):
    __slots__ = ('gdecls', 'coord')
    
    def __init__(self, gdecls, coord=None):
        self.gdecls = gdecls
        self.coord = coord

    def children(self):
        nodelist = []
        for i, child in enumerate(self.gdecls or []):
            nodelist.append(("gdecls[%d]" % i, child))
        return tuple(nodelist)

    attr_names = ()

Just as another example, for a binary operator, you might store the operator, the left expression, and the right expression like this:

In [None]:
class BinaryOp(Node):
    __slots__ = ('op', 'lvalue', 'rvalue', 'coord')
    
    def __init__(self, op, left, right, coord=None):
        self.op = op
        self.lvalue = left
        self.rvalue = right
        self.coord = coord

    def children(self):
        nodelist = []
        if self.lvalue is not None: nodelist.append(("lvalue", self.lvalue))
        if self.rvalue is not None: nodelist.append(("rvalue", self.rvalue))
        return tuple(nodelist)

    attr_names = ('op', )

For Constant objects, you might store the type and value, like this:

In [None]:
class Constant(Node):
    __slots__ = ('type', 'value', 'coord')
    
    def __init__(self, type, value, coord=None):
        self.type = type
        self.value = value
        self.coord = coord

    def children(self):
        nodelist = []
        return tuple(nodelist)

    attr_names = ('type', 'value', )
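
Following the same pattern, a leaf node and a node with a single child could look like the sketch below. The attribute names `name` and `op` come from the node list in this document; the child field name `expr` and the minimal `Node` stand-in are assumptions made only so the example is self-contained.

```python
class Node(object):
    """Minimal stand-in for the Node base class shown above."""
    __slots__ = ()

    def children(self):
        return ()


class ID(Node):
    __slots__ = ('name', 'coord')

    def __init__(self, name, coord=None):
        self.name = name
        self.coord = coord

    def children(self):
        return ()  # leaf node: an identifier has no child nodes

    attr_names = ('name',)


class UnaryOp(Node):
    # 'expr' is a hypothetical field name for the operand.
    __slots__ = ('op', 'expr', 'coord')

    def __init__(self, op, expr, coord=None):
        self.op = op
        self.expr = expr
        self.coord = coord

    def children(self):
        nodelist = []
        if self.expr is not None:
            nodelist.append(("expr", self.expr))
        return tuple(nodelist)

    attr_names = ('op',)


neg = UnaryOp('-', ID('x'))
print(neg.op, [name for name, _ in neg.children()], neg.children()[0][1].name)
```

Because every class in the hierarchy declares `__slots__`, assigning an attribute that is not listed (for example `neg.extra = 1`) raises `AttributeError`, which helps catch misspelled field names early.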

Suggestion: You should start simple and incrementally work your way up to building the complete grammar.

## AST node classes

The list below defines the AST node classes and expected attribute names that must be used in uCParser:

ArrayDecl ( ), ArrayRef ( ), Assert ( ), Assignment (op), BinaryOp (op), Break ( ), Cast ( ), Compound ( ), Constant (type, value), Decl (name), DeclList ( ), EmptyStatement ( ), ExprList ( ), For ( ), FuncCall ( ), FuncDecl ( ), FuncDef ( ), GlobalDecl ( ), ID (name), If ( ), InitList ( ), ParamList ( ), Print ( ), Program ( ), PtrDecl ( ), Read ( ), Return ( ), Type (names), VarDecl ( ), UnaryOp (op), While ( ).

## Visiting the AST
The following class for visiting the AST is modeled after the node visitor in Python’s ast module:

In [None]:
class NodeVisitor(object):
    """ A base NodeVisitor class for visiting uc_ast nodes.
        Subclass it and define your own visit_XXX methods, where
        XXX is the class name you want to visit with these
        methods.

        For example:

        class ConstantVisitor(NodeVisitor):
            def __init__(self):
                self.values = []

            def visit_Constant(self, node):
                self.values.append(node.value)

        Creates a list of values of all the constant nodes
        encountered below the given node. To use it:

        cv = ConstantVisitor()
        cv.visit(node)

        Notes:

        *   generic_visit() will be called for AST nodes for which
            no visit_XXX method was defined.
        *   The children of nodes for which a visit_XXX was
            defined will not be visited - if you need this, call
            generic_visit() on the node.
            You can use:
                NodeVisitor.generic_visit(self, node)
        *   Modeled after Python's own AST visiting facilities
            (the ast module of Python 3.0)
    """

    _method_cache = None

    def visit(self, node):
        """ Visit a node.
        """

        if self._method_cache is None:
            self._method_cache = {}

        visitor = self._method_cache.get(node.__class__.__name__, None)
        if visitor is None:
            method = 'visit_' + node.__class__.__name__
            visitor = getattr(self, method, self.generic_visit)
            self._method_cache[node.__class__.__name__] = visitor

        return visitor(node)

    def generic_visit(self, node):
        """ Called if no explicit visitor function exists for a
            node. Implements preorder visiting of the node.
        """
        # Node objects are not iterable; children() yields (name, child) pairs.
        for _, child in node.children():
            self.visit(child)
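
The notes in the docstring can be tried out with the self-contained sketch below. The node classes and the visitor are reduced stand-ins for the ones in this document (no method cache, and children are walked through `children()`), and `OpCollector` is a hypothetical second visitor added to show when `generic_visit()` has to be called explicitly.

```python
class Node:
    __slots__ = ()

    def children(self):
        return ()


class Constant(Node):
    __slots__ = ('type', 'value', 'coord')

    def __init__(self, type, value, coord=None):
        self.type, self.value, self.coord = type, value, coord


class BinaryOp(Node):
    __slots__ = ('op', 'lvalue', 'rvalue', 'coord')

    def __init__(self, op, left, right, coord=None):
        self.op, self.lvalue, self.rvalue, self.coord = op, left, right, coord

    def children(self):
        return (("lvalue", self.lvalue), ("rvalue", self.rvalue))


class NodeVisitor:
    def visit(self, node):
        # Dispatch on the class name, falling back to generic_visit.
        method = getattr(self, 'visit_' + node.__class__.__name__, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        for _, child in node.children():
            self.visit(child)


class ConstantVisitor(NodeVisitor):
    def __init__(self):
        self.values = []

    def visit_Constant(self, node):
        self.values.append(node.value)


class OpCollector(NodeVisitor):
    def __init__(self):
        self.ops = []

    def visit_BinaryOp(self, node):
        self.ops.append(node.op)
        self.generic_visit(node)  # keep descending into the operands


tree = BinaryOp('==', Constant('int', 1),
                BinaryOp('*', Constant('int', 2), Constant('int', 3)))
cv = ConstantVisitor()
cv.visit(tree)
oc = OpCollector()
oc.visit(tree)
print(cv.values, oc.ops)
```

`ConstantVisitor` reaches the constants through `generic_visit()` because it defines no `visit_BinaryOp`; `OpCollector` does define one, so it must call `generic_visit()` itself or the nested `*` would never be seen.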

### Showing the AST

Consider the previous uC program example:

In [None]:
/* comment */
int j = 3;
int main () {
  int i = j;
  int k = 3;
  int p = 2 * j;
  assert p == 2 * i;
}

A possible dump of the AST looks like this:

In [None]:
Program: 
    GlobalDecl: 
        Decl: ID(name='j'  )
            VarDecl:
                Type: ['int']   @ 2:1
            Constant: int, 3   @ 2:9
    FuncDef: 
        Type: ['int']   @ 3:1
        Decl: ID(name='main'  )
            FuncDecl: 
                VarDecl:
                    Type: ['int']   @ 3:1
        Compound:    @ 3:1
            Decl: ID(name='i'  )
                VarDecl:
                    Type: ['int']   @ 4:3
                ID: j   @ 4:11
            Decl: ID(name='k'  )
                VarDecl:
                    Type: ['int']   @ 5:3
                Constant: int, 3   @ 5:11
            Decl: ID(name='p'  )
                VarDecl:
                    Type: ['int']   @ 6:3
                BinaryOp: *   @ 6:11
                    Constant: int, 2   @ 6:11
                    ID: j   @ 6:15
            Assert:    @ 7:3
                BinaryOp: ==   @ 7:10
                    ID: p   @ 7:10
                    BinaryOp: *   @ 7:15
                        Constant: int, 2   @ 7:15
                        ID: i   @ 7:19

The methods that generate a textual representation of a node and print all of its attributes are shown below:

In [None]:
import sys


def _repr(obj):
    """
    Get the representation of an object, with dedicated pprint-like format for lists.
    """
    if isinstance(obj, list):
        return '[' + (',\n '.join((_repr(e).replace('\n', '\n ') for e in obj))) + '\n]'
    else:
        return repr(obj) 
    
class Node(object):
    """ Abstract base class for AST nodes.
    """
    def __repr__(self):
        """ Generates a python representation of the current node
        """
        result = self.__class__.__name__ + '('
        indent = ''
        separator = ''
        for name in self.__slots__[:-1]:  # every slot except the trailing 'coord'
            result += separator
            result += indent
            result += name + '=' + (_repr(getattr(self, name)).replace('\n', '\n  ' + (' ' * (len(name) + len(self.__class__.__name__)))))
            separator = ','
            indent = ' ' * len(self.__class__.__name__)
        result += indent + ')'
        return result

    def children(self):
        """ A sequence of all children that are Nodes
        """
        pass

    def show(self, buf=sys.stdout, offset=0, attrnames=False, nodenames=False, showcoord=False, _my_node_name=None):
        """ Pretty print the Node and all its attributes and children (recursively) to a buffer.
            buf:
                Open IO buffer into which the Node is printed.
            offset:
                Initial offset (amount of leading spaces)
            attrnames:
                True if you want to see the attribute names in name=value pairs. False to only see the values.
            nodenames:
                True if you want to see the actual node names within their parents.
            showcoord:
                Do you want the coordinates of each Node to be displayed.
        """
        lead = ' ' * offset
        if nodenames and _my_node_name is not None:
            buf.write(lead + self.__class__.__name__+ ' <' + _my_node_name + '>: ')
        else:
            buf.write(lead + self.__class__.__name__+ ': ')

        if self.attr_names:
            if attrnames:
                nvlist = [(n, getattr(self, n)) for n in self.attr_names if getattr(self, n) is not None]
                attrstr = ', '.join('%s=%s' % nv for nv in nvlist)
            else:
                vlist = [getattr(self, n) for n in self.attr_names]
                attrstr = ', '.join('%s' % v for v in vlist)
            buf.write(attrstr)

        if showcoord:
            if self.coord:
                buf.write('%s' % self.coord)
        buf.write('\n')

        for (child_name, child) in self.children():
            child.show(buf, offset + 4, attrnames, nodenames, showcoord, child_name)

## Dealing with line and column information in AST

In [None]:
class Coord(object):
    """ Coordinates of a syntactic element. Consists of:
            - Line number
            - (optional) column number, for the Lexer
    """
    __slots__ = ('line', 'column')

    def __init__(self, line, column=None):
        self.line = line
        self.column = column

    def __str__(self):
        if self.line:
            coord_str = "   @ %s:%s" % (self.line, self.column)
        else:
            coord_str = ""
        return coord_str

The Coord class above should be used to store (and show in the AST) the lines and columns of the productions in the source code. To capture the coordinates of the symbol at index ```token_idx``` of a production ```p```, use the following code in the UCParser class. The coordinate includes the ```lineno``` and the ```column```; both follow the semantics of the lexer and start at 1:

In [None]:
    def _token_coord(self, p, token_idx):
        last_cr = p.lexer.lexer.lexdata.rfind('\n', 0, p.lexpos(token_idx))
        if last_cr < 0:
            last_cr = -1
        column = (p.lexpos(token_idx) - (last_cr))
        return Coord(p.lineno(token_idx), column)
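
The column arithmetic can be checked without PLY. The sketch below applies the same `rfind`-based computation to a plain string; the function name `line_and_column` is hypothetical, and the sample text is the first two lines of the uC program used earlier.

```python
def line_and_column(text, offset):
    """Return the 1-based (line, column) of the character at a 0-based offset."""
    last_cr = text.rfind('\n', 0, offset)   # -1 when offset is on the first line
    line = text.count('\n', 0, offset) + 1
    column = offset - last_cr
    return line, column

source = "/* comment */\nint j = 3;\n"
print(line_and_column(source, source.index("int")))  # start of the declaration
print(line_and_column(source, source.index("3")))    # the constant 3
```

Since `rfind` returns -1 when there is no preceding newline, subtracting it already yields a 1-based column on the first line, so no special case is needed there.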