In [2]:
from IPython.display import HTML
HTML(open('../style.css').read())

The following example has been extracted from the official documentation of Ply.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `ply.lex` contains the code that is necessary to create a scanner.

In [3]:
!pip install ply



In [4]:
import ply.lex as lex

We start with a definition of the <em style="color:blue">token names</em>.  By convention, token names are written in capital letters.
The token names have to be collected in a list with the name `tokens`:

In [5]:
tokens = [
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN'
]

There are two ways to define these tokens:
 - *immediate token definitions* define the token by assigning a regular expression to a variable of the form `t_name`,
   where `name` is the name of the token that is defined.
 - *functional token definitions* define the token via a function of the form `t_name`.  The regular expression that
   defines the token is given as the docstring of the function, i.e. the string on the first line of the function body.
   
We see examples below.  We start with the *immediate token definitions*.  Note that we have to use *raw strings* here to prevent
Python from interpreting backslash escape sequences.  Furthermore, those symbols that are interpreted as *operator* symbols inside a regular expression, for example `+`, `*`, `(`, and `)`, have to be escaped with a backslash character.

In [6]:
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
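
To see why the escaping matters, the following sketch uses only Python's standard `re` module, independently of Ply.  The variable names are illustrative and not part of the scanner definition.

```python
import re

# '+' is a repetition operator, so a bare '+' is not a valid regular expression.
try:
    re.compile('+')
    bare_plus_compiles = True
except re.error:
    bare_plus_compiles = False

# Once escaped, the pattern matches the literal character '+'.
escaped_plus_matches = re.fullmatch(r'\+', '+') is not None

# '-' and '/' have no special meaning outside a character class,
# so they can be written without a backslash.
minus_matches = re.fullmatch(r'-', '-') is not None
divide_matches = re.fullmatch(r'/', '/') is not None

print(bare_plus_compiles, escaped_plus_matches, minus_matches, divide_matches)
```

This is why `t_MINUS` and `t_DIVIDE` above are written without a backslash, while the other four operators are escaped.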

If we need to transform the value of a token, we can define the token via a function.  In that case, the first line of the function
body has to be a string that is a regular expression, i.e. the docstring of the function.  This regular expression then defines the token.
After that, we can add code to transform the token.  The string that makes up the token is stored in `t.value`.  Below, this string
is converted into an integer via the predefined function `int`.  The function has to return the token `t`; otherwise the token is discarded.

In [7]:
def t_NUMBER(t):
    r'0|[1-9][0-9]*'
    t.value = int(t.value)
    return t
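
The regular expression `0|[1-9][0-9]*` does not permit leading zeros.  The following sketch, which uses only the standard `re` module and a hypothetical helper `split_numbers`, shows the consequence: a string like `007` is split into three separate numbers.

```python
import re

# The same pattern as in t_NUMBER.
number = re.compile(r'0|[1-9][0-9]*')

def split_numbers(text):
    """Repeatedly match the number pattern, starting where the last match ended."""
    pos, values = 0, []
    while pos < len(text):
        m = number.match(text, pos)
        if m is None:
            raise ValueError(f"no number at position {pos}")
        values.append(int(m.group()))
        pos = m.end()
    return values

print(split_numbers('10'))   # [10]
print(split_numbers('007'))  # [0, 0, 7]
```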

The rule below is used to keep track of line numbers.  We use the function `len` since a single match might contain
more than one newline.  The member variable `lexer.lineno` keeps track of the current line number.  This variable
is maintained so that we are able to specify the precise location of unknown characters in error messages.
Since the function does not return a token, newlines are discarded.

In [8]:
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
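
The line counting can be sketched without Ply as follows.  The variable `lineno` merely plays the role of `lexer.lineno`; the input string is made up for illustration.

```python
import re

newline_run = re.compile(r'\n+')

lineno = 1                                  # plays the role of lexer.lineno
m = newline_run.match('\n\n\nnext line')    # three consecutive newlines
lineno += len(m.group())                    # the match consists of newlines only

print(lineno)  # 4
```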

The variable `t_ignore` specifies those characters that should be discarded.  Its value is not a regular expression but a plain string containing the characters to ignore.
In the following cell it specifies that space characters and tabulator characters are to be ignored.  Note that we **must not** use a raw string here, since otherwise `\t` would not denote a tabulator character.

In [9]:
t_ignore = ' \t'
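
The difference between the two kinds of string literals can be checked directly in Python.  The names `ignored` and `raw_ignored` are only used for this illustration.

```python
ignored     = ' \t'    # two characters: a space and a tabulator
raw_ignored = r' \t'   # three characters: a space, a backslash, and the letter 't'

print(len(ignored), len(raw_ignored))  # 2 3
print(list(raw_ignored))               # [' ', '\\', 't']
```

With the raw string, the scanner would therefore ignore backslashes and the letter `t` instead of tabulators.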

All characters not recognized by any of the defined tokens are handled by the function `t_error`.
The call `t.lexer.skip(1)` skips one character, which is the character that has not been recognized.  Scanning resumes after this character has been discarded.

In [10]:
def t_error(t):
    print(f"Illegal character '{t.value[0]}' at line {t.lexer.lineno}.")
    print(f"This is the {t.lexpos}th character.")
    t.lexer.skip(1)
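
The strategy of skipping a single character and then resuming can be sketched with a small hand-written loop.  This is not how Ply is implemented; the function `scan` and the simplified pattern `known` are assumptions made for this illustration.

```python
import re

# Simplified pattern: unsigned integers and the six operator symbols.
known = re.compile(r'[0-9]+|[-+*/()]')

def scan(text):
    """Collect all recognized lexemes; record and skip every unknown character."""
    pos, lexemes, errors = 0, [], []
    while pos < len(text):
        if text[pos] in ' \t':           # ignored characters
            pos += 1
            continue
        m = known.match(text, pos)
        if m:
            lexemes.append(m.group())
            pos = m.end()
        else:
            errors.append((text[pos], pos))
            pos += 1                     # corresponds to t.lexer.skip(1)
    return lexemes, errors

print(scan('1 + a2'))  # (['1', '+', '2'], [('a', 4)])
```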

Below, the function `lex.lex()` creates the lexer specified above.  Ply expects this code to be part of a Python file.
Since the code is placed in a Jupyter notebook instead, we have to set the variable
`__file__` manually to fool the system into believing that the code given above is located in a file
called `hugo`.  The name `hugo` is irrelevant and could be replaced by any other name.

In [11]:
__file__ = 'hugo'
lexer = lex.lex(debug=True)

lex: tokens   = ['NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN']
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_NUMBER -> '0|[1-9][0-9]*' (state 'INITIAL')
lex: Adding rule t_newline -> '\n+' (state 'INITIAL')
lex: Adding rule t_PLUS -> '\+' (state 'INITIAL')
lex: Adding rule t_TIMES -> '\*' (state 'INITIAL')
lex: Adding rule t_LPAREN -> '\(' (state 'INITIAL')
lex: Adding rule t_RPAREN -> '\)' (state 'INITIAL')
lex: Adding rule t_MINUS -> '-' (state 'INITIAL')
lex: Adding rule t_DIVIDE -> '/' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_NUMBER>0|[1-9][0-9]*)|(?P<t_newline>\n+)|(?P<t_PLUS>\+)|(?P<t_TIMES>\*)|(?P<t_LPAREN>\()|(?P<t_RPAREN>\))|(?P<t_MINUS>-)|(?P<t_DIVIDE>/)'
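
The debug output shows that the individual rules are combined into one master regular expression in which every rule is a named group.  The rules defined by functions come first, in the order of their definition; the rules defined by strings follow, sorted by decreasing length of their regular expressions.  The following sketch, which uses only the standard `re` module, illustrates how such a pattern identifies the rule that has matched.  The reduced pattern `master` and the helper `rule_names` are assumptions for this illustration; the internals of Ply may differ in detail.

```python
import re

# A reduced master pattern in the style of the debug output.
master = re.compile(r'(?P<t_NUMBER>0|[1-9][0-9]*)|(?P<t_PLUS>\+)|(?P<t_TIMES>\*)')

def rule_names(text):
    """Return pairs (rule name, matched text) for an input without white space."""
    pos, result = 0, []
    while pos < len(text):
        m = master.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        result.append((m.lastgroup, m.group()))  # lastgroup names the matching rule
        pos = m.end()
    return result

print(rule_names('3+4*10'))
```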


Now `lexer` is the scanner that has been created by the previous command. 

In [12]:
lexer

<ply.lex.Lexer at 0x2549c291a90>

Let's test the generated scanner, which is stored in `lexer`, with the following string:

In [13]:
data = """
       3 + 4 * 10 + 007 + (-20) * 2
       42
       a
       """

Next, we feed the scanner with the string `data`.  This is done by calling the method `input` of the generated scanner.  Before doing so, let us inspect the string.  Note that it starts with a newline character.

In [14]:
data

'\n       3 + 4 * 10 + 007 + (-20) * 2\n       42\n       a\n       '

We set the line number to `1` before scanning in order to be able to run the scanner multiple times, since every run of the scanner increases the line number.

In [15]:
lexer.lineno = 1
lexer.input(data)

Now we put the lexer to work by using it as an *iterable*.  This way, we can simply iterate over all the tokens that our scanner recognizes.

In [16]:
for token in lexer:
    print(token)

LexToken(NUMBER,3,2,8)
LexToken(PLUS,'+',2,10)
LexToken(NUMBER,4,2,12)
LexToken(TIMES,'*',2,14)
LexToken(NUMBER,10,2,16)
LexToken(PLUS,'+',2,19)
LexToken(NUMBER,0,2,21)
LexToken(NUMBER,0,2,22)
LexToken(NUMBER,7,2,23)
LexToken(PLUS,'+',2,25)
LexToken(LPAREN,'(',2,27)
LexToken(MINUS,'-',2,28)
LexToken(NUMBER,20,2,29)
LexToken(RPAREN,')',2,31)
LexToken(TIMES,'*',2,33)
LexToken(NUMBER,2,2,35)
LexToken(NUMBER,42,3,44)
Illegal character 'a' at line 4.
This is the 54th character.


We see that the generated tokens contain four pieces of information:
 1. The *type* of the token.
 2. The *value* of the token.  This is either a number or a string.
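 3. The *line number* of the token.  The line number starts with 1.
    However, note that the first line of `data` is empty.  Therefore, the first token is located in line 2.
 4. The *character position* of the token, i.e. the index of its first character in `data`.  For example, the number `42` starts at position 44,
    while the illegal character `a` is located at position 54.  The character count starts with `0`.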
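
Since the character position is counted from the beginning of the input, it has to be converted if we want to know the column of a token inside its line.  The following sketch shows one way to do this.  The helper `find_column` is hypothetical and works on plain positions; the string `data` is rebuilt here with an explicit indentation of seven spaces so that the cell is self-contained.

```python
indent = ' ' * 7
data = f"\n{indent}3 + 4 * 10 + 007 + (-20) * 2\n{indent}42\n{indent}a\n{indent}"

def find_column(text, lexpos):
    """Return the 1-based column of the character at the 0-based position lexpos."""
    line_start = text.rfind('\n', 0, lexpos) + 1   # 0 if there is no newline before lexpos
    return lexpos - line_start + 1

print(data[54])               # a
print(find_column(data, 8))   # 8
print(find_column(data, 54))  # 8
```

Both the number `3` and the illegal character `a` are located in column 8 of their respective lines, since each of these lines starts with seven space characters.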