In [1]:
%%HTML
<style>
.container { width: 100% }
</style>

# Evaluating an Exam Using Ply

This notebook shows how we can use the library [`ply`](https://ply.readthedocs.io/en/latest/ply.html)
to implement a scanner.  Our goal is to implement a scanner that can be used to evaluate the results
of an exam.  Assume the result of an exam is stored in the string `data` that is defined below:

In [2]:
data = \
'''Class: Algorithms and Complexity
   Group: TIT09AID
   MaxPoints = 60
   
   Exercise:      1. 2. 3. 4. 5. 6.
   Jim Smith:     9 12 10  6  6  0
   John Slow:     4  4  2  0  -  -
   Susi Sorglos:  9 12 12  9  9  6
'''

This data shows that there has been an exam on the subject <em style="color:blue">Algorithms and Complexity</em>
in the group <em style="color:blue">TIT09AID</em>.  Furthermore, the data shows that
<em style="color:blue">60</em> points are needed to achieve 100%.
    
There have been 6 different exercises in this exam and, in this small example, only three students took part, namely *Jim Smith*, *John Slow*, and *Susi Sorglos*.  Each row of results begins with the name of a student, followed by the number of points that this student has achieved in each of the exercises.  Our goal is to write a program that computes the marks for all students.

We will use the library [ply](https://ply.readthedocs.io/en/latest/ply.html).
In this example, we will only use the scanner that is provided by the module `ply.lex`. Hence we import this module from `ply`.

In [3]:
import ply.lex as lex

The function `mark(lexer)` takes the object `lexer` as its argument.  This object is supposed to be a scanner object that is generated by the expression `lex.lex()`.  This object provides two member variables:
- `lexer.sum_points` is the number of points achieved by the student whose mark is to be computed.
- `lexer.max_points` is the number of points that need to be achieved in order to get the best mark 
  of $1.0$.
  
It is assumed that the relation between the mark of an exam and the number of points achieved in this exam is linear and that a student who has achieved $50\%$ of `lexer.max_points` points will get the mark $4.0$, while a student who has achieved  $100\%$ of `lexer.max_points` points will be marked as $1.0$.

In [4]:
def mark(lexer):
    return 7 - 6 * lexer.sum_points / lexer.max_points
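
The formula follows from the two conditions stated above.  If $p$ is the fraction of points achieved and the mark has the form $a + b \cdot p$, then $a + b/2 = 4$ and $a + b = 1$ give $b = -6$ and $a = 7$.  The sketch below checks these two anchor points.  The `SimpleNamespace` objects are only hypothetical stand-ins for a scanner object; they are not part of `ply`.

```python
from types import SimpleNamespace

def mark(lexer):
    # Linear scale: 50% of the points yields 4.0, 100% yields 1.0.
    return 7 - 6 * lexer.sum_points / lexer.max_points

# Hypothetical stand-ins for the scanner object; only two attributes are needed.
half = SimpleNamespace(sum_points=30, max_points=60)
full = SimpleNamespace(sum_points=60, max_points=60)

print(mark(half))  # 4.0
print(mark(full))  # 1.0
```

Note that a student with zero points would receive the value $7.0$ under this formula, since the function does not clamp its result.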

We begin by defining the list of tokens.  Note that `tokens` is not an ordinary variable: this name is reserved by `ply.lex`, which reads it to find the names of the token classes.  In this case, we declare six different tokens.
- `HEADER` will match the first two lines of the string `data` as well as the fifth line that begins with 
  the substring `Exercise:`.  
  
  The precise definition of this token as well as the definitions of the other tokens are given later.
- `MAXDEF` is a token that will match the line `MaxPoints = 60`.
- `NAME` is a token that will match the name of a student.
- `NUMBER` is a token that will match a natural number.
- `IGNORE` is a token that will match an empty line.  For example, the fourth line in `data` is empty.
- `LINEBREAK` is a token that will match the newline character `\n` at the end of a line.

In [5]:
tokens = [ 'HEADER',
           'MAXDEF',
           'NAME',
           'NUMBER',
           'IGNORE',
           'LINEBREAK'
         ]

Next, we need to provide the definitions of the tokens.  One way to define tokens is via Python functions.  The <em style="color:blue">document string</em> of such a function is a <em style="color:blue">raw string</em> that contains the regular expression defining the token.  The regular expression can be followed by code that is needed to further process the token.  The name of the function defining a token has to have the form `t_`**name**, where **name** is the name of the token as declared in the list `tokens`.

The token `HEADER` matches any string that is made up of upper and lower case characters followed by a colon.  The token extends to the end of the line and includes the terminating newline.

When the function `t_HEADER` is called it is provided with a token `t`.  This is an object that has five
attributes:
- `t.lexer` is an object of class `Lexer` that contains the scanner that created this token `t`.
  We are free to attach additional attributes to this object.
- `t.type` is  a string containing the type of the token.  For tokens processed in the function
  `t_HEADER` this type is always the string `HEADER`.
- `t.value` is the actual string matched by the token.
- `t.lineno` is the line number.  However, it is our responsibility to update this variable
  by incrementing `t.lexer.lineno` every time we read a newline.
- `t.lexpos` is the position of the token in the input string that is scanned.

In [6]:
def t_HEADER(t):
    r'[A-Za-z]+:.*\n'
    t.lexer.lineno += 1
    return t
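
To see which lines this regular expression accepts, we can try it directly with the module `re`.  This is only a sketch: it compiles the pattern in verbose mode on its own, whereas `ply.lex` combines all token patterns into one scanner, so it approximates the behaviour of `t_HEADER` in isolation.

```python
import re

# Same pattern as in t_HEADER, compiled on its own in verbose mode.
header = re.compile(r'[A-Za-z]+:.*\n', re.VERBOSE)

lines = ['Class: Algorithms and Complexity\n',
         'Exercise:      1. 2. 3. 4. 5. 6.\n',
         'Jim Smith:     9 12 10  6  6  0\n',
         'MaxPoints = 60\n']

# match() anchors at the start of the line, as a scanner would.
is_header = [header.match(line) is not None for line in lines]
print(is_header)  # [True, True, False, False]
```

The line for *Jim Smith* is rejected because the blank in the name prevents `[A-Za-z]+` from reaching the colon.  A student whose name consists of a single word would therefore be mistaken for a header.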

The token `MAXDEF` matches a substring of the form `MaxPoints = 60`.  Note that the regular expression defining the semantics of this token uses the expression `\s` to match the white space before and after the character `=`.  This is necessary because `ply.lex` uses verbose regular expressions that can contain whitespace for formatting.  Hence a blank character ` ` inside a regular expression is silently discarded.

After defining the regular expression, the function `t_MAXDEF` has some <em style="color:blue">action code</em> that is used to extract the maximal number of points from the token value and store this number in the variable `t.lexer.max_points`.  Furthermore, we initialize the student name `t.lexer.name` to the empty string.

In [7]:
def t_MAXDEF(t):
    r'MaxPoints\s=\s[1-9][0-9]*'
    # The prefix 'MaxPoints = ' has 12 characters; the number starts at index 12.
    t.lexer.max_points = int(t.value[12:])
    t.lexer.name       = ''
    return t
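
The effect of verbose mode can be observed with the module `re` alone.  The following sketch compares a pattern written with literal blanks to the pattern used in `t_MAXDEF`.  Both are compiled with `re.VERBOSE`, which is the flag that makes blanks in a pattern insignificant.

```python
import re

text = 'MaxPoints = 60'

# In verbose mode the blanks around '=' are dropped from the pattern.
with_blanks = re.compile(r'MaxPoints = [1-9][0-9]*', re.VERBOSE)
# \s survives verbose mode and matches the blanks in the input.
with_escape = re.compile(r'MaxPoints\s=\s[1-9][0-9]*', re.VERBOSE)

print(with_blanks.match(text))            # None
print(with_blanks.match('MaxPoints=60'))  # a match object

m = with_escape.match(text)
max_points = int(m.group()[12:])          # skip the 12 characters 'MaxPoints = '
print(max_points)                         # 60
```

Since each `\s` matches exactly one whitespace character, the slice `[12:]` relies on there being exactly one blank on either side of the `=`.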

The token `NAME` matches the name of a student followed by a colon.  In general, a student name can be any sequence of letters that may contain hyphens and blanks.  The action code stores the name without the trailing colon in `lexer.name` and resets the sum of points that is stored in `lexer.sum_points` to `0`.

In [8]:
def t_NAME(t):
    r'[a-zA-Z- ]+:'
    t.lexer.name = t.value[:-1]  # drop the trailing colon
    t.lexer.sum_points = 0
    return t
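
The blank inside the character class `[a-zA-Z- ]` is significant even in verbose mode, because whitespace is only discarded outside of character classes.  The sketch below checks this with the module `re`.  The name *Anna-Lena Meier* is a hypothetical example that does not occur in `data`.

```python
import re

# Same pattern as in t_NAME; the blank inside [...] is kept in verbose mode.
name = re.compile(r'[a-zA-Z- ]+:', re.VERBOSE)

m = name.match('Jim Smith:     9 12 10  6  6  0')
student = m.group()[:-1]   # strip the trailing colon
print(student)             # Jim Smith

# A hypothetical hyphenated name is accepted as well.
hyphenated = name.match('Anna-Lena Meier:  3  4').group()[:-1]
print(hyphenated)          # Anna-Lena Meier
```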

The token `NUMBER` matches a natural number.  We have to convert the value, which initially is a string of digits, into an integer.  Furthermore, this value is then added to the number of points that the current student has achieved.

In [9]:
def t_NUMBER(t):
    r'0|[1-9][0-9]*'
    t.value = int(t.value)
    t.lexer.sum_points += t.value
    return t

The token `IGNORE` matches a line that contains only whitespace.  In order to keep track of line numbers we have to increment `lexer.lineno`.  However, we do not return a token at the end of the function.  Hence, an empty line is silently discarded.

In [10]:
def t_IGNORE(t):
    r'^[\s\t]*\n'
    t.lexer.lineno += 1

The token `LINEBREAK` matches a newline character `\n`.  If a student name is defined, then we output the result for this student.  Note that we set `lexer.name` back to the empty string once we have processed the student.
This allows for empty lines between different students.

In [11]:
def t_LINEBREAK(t):
    r'\n'
    t.lexer.lineno += 1
    if t.lexer.name != '':
        print(f'{t.lexer.name} has {t.lexer.sum_points} points and the mark {round(mark(t.lexer), 2)}.')
    t.lexer.name = ''

The string `t_ignore` specifies those characters that should be ignored.  Note that this string is **not** interpreted as a regular expression.  It is just a string of <em style="color:blue">single characters</em>.  These characters are allowed to occur as part of other tokens, but when they occur on their own and would otherwise generate a scanning error, they are silently discarded instead of triggering an error.

In [12]:
t_ignore  = '- \t'
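
The interplay between ignored characters and token rules can be illustrated with a toy scanner written in plain Python.  This is a simplified model and not the implementation used by `ply.lex`: it assumes that ignored characters are skipped before any rule is tried, and it only knows the two rules `NAME` and `NUMBER`.  The function `toy_scan` and the student *Anna-Lena Slow* are hypothetical.

```python
import re

IGNORED = '- \t'   # the same characters as in t_ignore
RULES = [('NAME',   re.compile(r'[a-zA-Z- ]+:')),
         ('NUMBER', re.compile(r'0|[1-9][0-9]*'))]

def toy_scan(text):
    """Return (type, value) pairs; a simplified model of a scanner loop."""
    pos, found = 0, []
    while pos < len(text):
        if text[pos] in IGNORED:      # skip ignored characters one by one
            pos += 1
            continue
        for kind, pattern in RULES:   # first rule that matches wins
            m = pattern.match(text, pos)
            if m:
                found.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f'Illegal character {text[pos]!r} at position {pos}')
    return found

result = toy_scan('   Anna-Lena Slow:  4  4  2  0  -  -')
print(result)
```

In this model the hyphen inside the name becomes part of the `NAME` token, while the two isolated hyphens at the end of the line are dropped, so only the four numbers contribute to the sum of points.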

The function `t_error` is called when a substring at the beginning of the remaining input cannot be matched by any of the regular expressions defined in the various tokens.  In our implementation we print the first character that could not be matched, discard this character, and continue.

In [13]:
def t_error(t):
    print(f"Illegal character '{t.value[0]}'")
    t.lexer.skip(1)

The line below is necessary to trick `ply.lex` into assuming this program is written in an ordinary Python file instead of a *Jupyter notebook*.

In [14]:
__file__ = 'main'

The line below generates the scanner.

In [15]:
lexer = lex.lex()

Next, we feed an input string into the generated scanner.

In [16]:
lexer.input(data)

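In order to scan the data that we provided in the previous cell, we iterate over all tokens generated by our scanner.  The body of the loop is empty because all the work is done by the action code of the token functions.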

In [17]:
def scan(lexer):
    for t in lexer:
        pass

In [18]:
scan(lexer)

Jim Smith has 43 points and the mark 2.7.
John Slow has 10 points and the mark 6.0.
Susi Sorglos has 57 points and the mark 1.3.
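
As a sanity check, the printed results can be recomputed without `ply`.  The following sketch is an independent cross-check and not a replacement for the scanner: the function `cross_check` is hypothetical, and it assumes that header lines are exactly those whose text before the colon is a single word.

```python
data = '''Class: Algorithms and Complexity
   Group: TIT09AID
   MaxPoints = 60

   Exercise:      1. 2. 3. 4. 5. 6.
   Jim Smith:     9 12 10  6  6  0
   John Slow:     4  4  2  0  -  -
   Susi Sorglos:  9 12 12  9  9  6
'''

def cross_check(text):
    """Map each student name to (sum of points, mark)."""
    max_points, results = None, {}
    for line in text.splitlines():
        head, colon, rest = line.strip().partition(':')
        if head.startswith('MaxPoints'):
            max_points = int(head.split('=')[1])
        elif colon and not head.isalpha():   # single-word heads are headers
            # Entries such as '-' are not digits and are skipped.
            points = sum(int(word) for word in rest.split() if word.isdigit())
            results[head] = (points, round(7 - 6 * points / max_points, 2))
    return results

for student, (points, grade) in cross_check(data).items():
    print(f'{student} has {points} points and the mark {grade}.')
```

The three lines printed by this loop agree with the output of the scanner shown above.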
