Note that you have to execute the command `jupyter notebook` in the parent directory of 
this directory; otherwise, `jupyter` won't be able to access the file `style.css`.

In [None]:
from IPython.display import HTML
with open("../style.css", "r") as file:
    css = file.read()
HTML(css)

This example has been extracted from the official documentation of Ply.

## A Tokenizer for Numbers and the Arithmetical Operators

The module `ply.lex` contains the code that is necessary to create a scanner.

In [None]:
import ply.lex as lex

We start with a definition of the <em style="color:blue">token names</em>.  Note that all token names have to start with 
a capital letter.  We have to define these token names as a list with the name `tokens`:

In [None]:
tokens = [
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN'
]

There are two ways to define these tokens:
 - *immediate token definitions* define a token by assigning a regular expression to a variable of the form `t_NAME`,
   where `NAME` is the name of the token being defined.
 - *functional token definitions* define a token via a function named `t_NAME`.  The regular expression that defines the token
   is the docstring of the function, i.e. the string appearing as the first statement of the function body.
   
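Examples of both follow.  We start with the *immediate token definitions*.  Note that we use *raw strings* here so that 
Python does not interpret backslash sequences itself.  Furthermore, characters that have a special meaning in regular 
expressions, such as `+`, `*`, `(`, and `)`, have to be escaped with a backslash, whereas `-` and `/` can be written as they are.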

In [None]:
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
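
To see why only some of these characters need a backslash, the following sketch uses Python's standard `re` module, which PLY relies on for its patterns.  The helper `is_valid_regex` is a hypothetical name introduced for illustration only; it is not part of PLY.

```python
import re

def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Unescaped metacharacters are not valid patterns on their own ...
for char in ['+', '*', '(', ')']:
    print(repr(char), is_valid_regex(char))

# ... whereas '-' and '/' have no special meaning here and need no backslash.
for char in ['-', '/']:
    print(repr(char), is_valid_regex(char), re.fullmatch(char, char) is not None)

# The escaped form matches exactly the literal character.
print(re.fullmatch(r'\+', '+') is not None)
```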

If we need to transform a token, we can define the token via a function.  In that case, the first statement of the 
function body has to be a string containing a regular expression.  This regular expression then defines the token.  
After that, we can add code to transform the token.  The string that makes up the token is stored in `t.value`.  Below, 
this string is converted into an integer via the built-in function `int`.

In [None]:
def t_NUMBER(t):
    r'0|[1-9][0-9]*'
    t.value = int(t.value)
    return t
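
The regular expression used for `NUMBER` does not accept leading zeros.  The following sketch, which uses only the standard `re` module and is independent of PLY, suggests how a string like `007` would therefore be split into three separate numbers.  The name `NUMBER` for the compiled pattern is chosen here for illustration.

```python
import re

# Same pattern as in t_NUMBER: either a single 0 or a number without leading zeros.
NUMBER = re.compile(r'0|[1-9][0-9]*')

# Find all numbers in the text and convert each match to an integer.
values = [int(m.group()) for m in NUMBER.finditer('10 + 007')]
print(values)
```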

The rule below is used to keep track of line numbers.  We use the function `len` since a single match might contain
more than one newline.  The attribute `lexer.lineno` stores the current line number.  This
is needed so that we are able to specify the precise location of errors in error messages.

In [None]:
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
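
The bookkeeping done by this rule can be reproduced with the standard `re` module alone.  The sketch below is not PLY code; the names `text` and `lineno` are chosen for illustration, and the counter starts at 1 just as the line count of the lexer does.

```python
import re

text = "a\n\n\nb\nc"
lineno = 1

# Every match of r'\n+' may span several newlines, hence len() of the match.
for m in re.finditer(r'\n+', text):
    lineno += len(m.group())

print(lineno)  # line number on which the final character 'c' is located
```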

The variable `t_ignore` specifies those characters that should be discarded.
In this case, spaces and tabs are ignored.  Note that we must not use a raw string here, 
since Python has to interpret `\t` as a tab character.

In [None]:
t_ignore = ' \t'
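
The difference between the two kinds of string literals can be checked directly in Python.  The variable names `plain` and `raw` are used for illustration only.

```python
plain = ' \t'    # two characters: a space and a tab
raw   = r' \t'   # three characters: a space, a backslash, and the letter t

print(len(plain), len(raw))
print('\t' in plain, '\t' in raw)
```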

All characters not recognized by any of the defined tokens are handled by the function `t_error`.
When this function is called, `t.value` contains the part of the input that has not yet been scanned, 
so `t.value[0]` is the offending character.  The call `t.lexer.skip(1)` skips this character, and scanning 
resumes after it has been discarded.

In [None]:
def t_error(t):
    print(f"Illegal character {t.value[0]} at line {t.lexer.lineno}.")
    t.lexer.skip(1)

Below, the function `lex.lex()` creates the lexer specified above.  This code is expected to be part 
of a Python file, but it is not, since it is placed in a Jupyter notebook.  Therefore, we have to set the variable 
`__file__` manually to fool the system into believing that the code given above is located in a file 
called `hugo`.  Of course, the name `hugo` is totally irrelevant and could be replaced by any other name.

In [None]:
__file__ = 'hugo'
lexer = lex.lex()

Let's test the generated scanner, which is stored in `lexer`, with the following string:

In [None]:
data = """3 + 4 * 10 + 007 + (-20) * 2
          3 + 4 * 10 + abc + (-20) * 2
       """

Let us feed the scanner with the string `data`.  This is done by calling the method `input` of the generated scanner.

In [None]:
lexer.input(data)

Now we put the lexer to work by using it as an *iterable*.  This way, we can simply iterate over all the tokens that our scanner recognizes.

In [None]:
for tok in lexer:
    print(tok)
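
Instead of iterating, the tokens can also be fetched one at a time via the method `token()` of the lexer, which returns `None` once the input is exhausted.  The helper `all_tokens` below is a hypothetical name used for illustration; it only assumes that its argument offers such a `token()` method.

```python
def all_tokens(lexer):
    """Collect tokens by calling lexer.token() until it returns None."""
    result = []
    while True:
        tok = lexer.token()
        if tok is None:   # end of input reached
            break
        result.append(tok)
    return result
```

Since the loop above has already consumed the input, `lexer.input(data)` would have to be called again before `all_tokens(lexer)` returns any tokens.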

We see that the generated tokens contain four pieces of information:
 1. The *type* of the token.
 2. The *value* of the token.  This is either a number or a string.
 3. The *line number* of the token.  The line count starts with 1.
    As we have only two lines of data, this count is either 1 or 2 in this example.
 4. The *character position* at which the token starts.  This position is counted from `0`.
    For example, the last token starts at position `66`.
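
To make these four pieces of information more concrete, here is a minimal sketch of a scanner written with the standard `re` module only.  It is not how PLY is implemented; it merely mimics the rules defined above.  The names `TOKEN_SPEC`, `MASTER`, `IGNORE`, and `scan` are assumptions introduced for illustration, and the string `data` is rebuilt with explicit whitespace so that the positions are unambiguous.

```python
import re

# Hypothetical stand-in for the PLY rules above, standard library only.
TOKEN_SPEC = [
    ('NUMBER',  r'0|[1-9][0-9]*'),
    ('PLUS',    r'\+'),
    ('MINUS',   r'-'),
    ('TIMES',   r'\*'),
    ('DIVIDE',  r'/'),
    ('LPAREN',  r'\('),
    ('RPAREN',  r'\)'),
    ('NEWLINE', r'\n+'),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{regex})' for name, regex in TOKEN_SPEC))
IGNORE = ' \t'

def scan(text):
    """Yield (type, value, line number, position) for every token in text."""
    pos, lineno = 0, 1
    while pos < len(text):
        if text[pos] in IGNORE:       # discard spaces and tabs
            pos += 1
            continue
        m = MASTER.match(text, pos)
        if m is None:                 # illegal character: skip it
            pos += 1
            continue
        kind = m.lastgroup
        if kind == 'NEWLINE':         # count lines, but produce no token
            lineno += len(m.group())
        else:
            value = int(m.group()) if kind == 'NUMBER' else m.group()
            yield (kind, value, lineno, pos)
        pos = m.end()

# Same content as the test string above: 10 spaces of indentation on line 2.
data = ("3 + 4 * 10 + 007 + (-20) * 2\n"
        + " " * 10 + "3 + 4 * 10 + abc + (-20) * 2\n"
        + " " * 7)

toks = list(scan(data))
print(toks[0])
print(toks[-1])
```

In this sketch the illegal characters `a`, `b`, and `c` are silently dropped, whereas the function `t_error` defined above additionally prints a message for each of them.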