In [1]:
from IPython.display import HTML
HTML(open('../style.css').read())

# Converting <span style="font-variant:small-caps;">Html</span> to Text

This notebook shows how we can use the package [`ply`](https://ply.readthedocs.io/en/latest/ply.html)
to extract the text that is embedded in an <span style="font-variant:small-caps;">Html</span> file.  
In order to be concise, it only supports a small subset of 
<span style="font-variant:small-caps;">Html</span>.  Below is the content of my old
<a href="http://wwwlehre.dhbw-stuttgart.de/~stroetma/">web page</a> that I used while I was still working at the DHBW Stuttgart.  The goal of this notebook is to write 
a scanner that is able to extract the text from this web page.

In [2]:
data = \
'''
<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="flying-pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
'''

In [3]:
from IPython.display import HTML

In [4]:
HTML(data)

The original web page is still available at https://wwwlehre.dhbw-stuttgart.de/~stroetma/.

## Imports

We will use the package [ply](https://ply.readthedocs.io/en/latest/ply.html) to remove the 
<span style="font-variant:small-caps;">Html</span> tags and extract the text that
is embedded in the <span style="font-variant:small-caps;">Html</span> shown above.
In this example, we will only use the scanner that is provided by the module `ply.lex`. 
Hence we import the module `ply.lex` that contains the scanner generator from `ply`.

In [5]:
import ply.lex as lex

## Token Declarations

We begin by declaring the tokens.  Note that `tokens` is a special variable name: `ply` reads it to determine the names of the token classes.  In this case, we declare nine different tokens.
- `HEAD_START` will match the tag `<head>` that starts the definition of the 
  <span style="font-variant:small-caps;">Html</span> header.
- `HEAD_END` will match the tag `</head>` that ends the definition of the 
  <span style="font-variant:small-caps;">Html</span> header.
- `SCRIPT_START` will match the tag `<script>` that starts embedded *JavaScript* code.
- `SCRIPT_END` will match the tag `</script>` that ends embedded *JavaScript* code.
- `TAG` is a token that represents arbitrary <span style="font-variant:small-caps;">Html</span> tags.
  These can be opening tags or closing tags.
- `LINEBREAK` is a token that will match the newline character `\n` at the end of a line.
- `NAMED_ENTITY` is a token that represents named 
  <span style="font-variant:small-caps;">Html5</span> entities.
- `UNICODE` is a token that represents a numeric Unicode entity such as `&#8555;`.
- `ANY` is a token that matches any character.

In [6]:
tokens = [ 'HEAD_START',     # r'<head>'
           'HEAD_END',       # r'</head>'
           'SCRIPT_START',   # r'<script>'
           'SCRIPT_END',     # r'</script>'
           'TAG',            # r'<[^>]+>'
           'LINEBREAK',      # r'(\s*\n\s*)+'
           'NAMED_ENTITY',   # r'&[a-zA-Z]+;?'
           'UNICODE',        # r'&\#[0-9]+;?'
           'ANY'             # r'.'
         ]

## Definition of the States

Once we are inside an <span style="font-variant:small-caps;">Html</span> header or inside some
*JavaScript* code, the rules of the scanning game change.  Therefore, we declare two new <em style="color:blue">exclusive scanner states</em>:
- `header` is the state the scanner is in while it is scanning an 
  <span style="font-variant:small-caps;">Html</span> header.
- `script` is the state of the scanner while scanning *JavaScript* code.  

These states are *exclusive* states and hence the other token definitions do not apply in these
states.

In [7]:
states = [ ('header', 'exclusive'),
           ('script', 'exclusive')
         ]

## Token Definitions

We proceed to give the definitions of the tokens.  Note that none of the functions defined below
returns a token.  Rather, each of these functions either prints the text that corresponds to the 
<span style="font-variant:small-caps;">Html</span> it has matched or silently discards the match.

### The Definition of the Token `HEAD_START`

Once the scanner reads the opening tag `<head>` it switches into the state `header`.  The function `begin` of the lexer can be used to switch into a different scanner state.  In the state `header`, the scanner continues to read and discard characters until the closing tag `</head>` is encountered.  Note that this token is only recognized in the state `INITIAL`.  The state `INITIAL` is the initial state of the scanner, i.e. the scanner always starts in this state.

In [8]:
def t_HEAD_START(t):
    r'<head>'
    t.lexer.begin('header')

### The Definition of the Token `SCRIPT_START`

Once the scanner reads the opening tag `<script>` it switches into the state `script`.  In this state it will continue to read and discard characters until it sees the closing tag `</script>`.

In [9]:
def t_SCRIPT_START(t):
    r'<script>'
    t.lexer.begin('script')

### The Definition of the Token `LINEBREAK`

Runs of newline characters, together with any whitespace surrounding them, are condensed into a single newline character.
As we are not interested in the variable `t.lexer.lineno` in this example, we don't have to count the newlines.
This token is active in the `INITIAL` state.

In [10]:
def t_LINEBREAK(t):
    r'(\s*\n\s*)+'
    print()
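
The effect of the pattern can be checked in isolation with the standard module `re`.  The following is only a sketch: the sample string is made up, and the substitution stands in for the call to `print()` above.

```python
import re

# Same pattern as in t_LINEBREAK.
LINEBREAK = re.compile(r'(\s*\n\s*)+')

# Made-up sample: trailing blanks, a tab, and several newlines in a row.
text = 'Germany   \n\t  \n\n   Office'

# Every run of whitespace that contains a newline becomes one newline.
collapsed = LINEBREAK.sub('\n', text)
print(repr(collapsed))   # 'Germany\nOffice'
```

Whitespace that does not contain a newline is not matched by this pattern.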

### The Definition of the Token `TAG`

The token `TAG` is defined as any string that starts with the character `<` and ends with the character 
`>`. Between these two characters there has to be a nonzero number of characters that are different from 
the character `>`.  The text of the token is discarded.

In [11]:
def t_TAG(t):
    r'<[^>]+>'
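
To see what this pattern removes, we can apply it with the standard module `re`.  This is a sketch with made-up sample strings.  The second sample illustrates a limitation of this small subset: a `>` inside an attribute value ends the match too early.

```python
import re

# Same pattern as in t_TAG.
TAG = re.compile(r'<[^>]+>')

sample = '<a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule</a><br/>'
stripped = TAG.sub('', sample)
print(stripped)   # Duale Hochschule

# Limitation: the first '>' ends the tag, even inside a quoted attribute.
tricky = '<a title="a>b">x</a>'
leftover = TAG.sub('', tricky)
print(leftover)   # b">x
```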

### The Definition of the Token `NAMED_ENTITY`

In order to support named <span style="font-variant:small-caps;">Html</span> entities we need to import
the dictionary `html5` from the module `html.entities`.  For a named 
<span style="font-variant:small-caps;">Html</span> entity `e`, `html5[e]` is the Unicode symbol that is specified by `e`.  Every entity is listed under its name followed by `;`, and the legacy entities are additionally listed without the `;`.

In [12]:
from html.entities import html5

In [13]:
html5['auml']

'ä'

The regular expression `&[a-zA-Z]+;?` searches for <span style="font-variant:small-caps;">Html</span>
entity names.  These are strings that start with the character `&` followed by the name of the entity, optionally followed by the character `;`.  For example, `&auml;` is the entity name that specifies the German umlaut `ä`.  If such an entity is found, the corresponding Unicode character is printed.

In [14]:
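def t_NAMED_ENTITY(t):
    r'&[a-zA-Z]+;?'
    if t.value[-1] == ';':            # ';' is not part of the entity name
        entity_name = t.value[1:-1]   # chop off '&' at the start and ';' at the end
    else:
        entity_name = t.value[1:]     # only chop '&' off 
    # html5 lists every entity with a trailing ';'; unknown names such as
    # the '&T' in 'AT&T' are echoed unchanged instead of raising a KeyError
    unicode_char = html5.get(entity_name + ';', t.value)
    print(unicode_char, end='')       # don't print a line break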
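
The way the dictionary `html5` stores its keys deserves a quick check.  The sketch below uses a hypothetical helper `lookup`, which is not part of the notebook.  It always appends the `;` before the lookup, since every entity is listed in that form.

```python
from html.entities import html5

# Legacy entities are listed both with and without the trailing ';'.
print('uuml' in html5, 'uuml;' in html5)       # True True
# Other entities are listed only with the trailing ';'.
print('hellip' in html5, 'hellip;' in html5)   # False True

def lookup(name):
    """Hypothetical helper: map an entity name (without '&' and ';')
    to its character, or return None if the name is unknown."""
    return html5.get(name + ';')

print(lookup('uuml'))     # ü
print(lookup('hellip'))   # …
print(lookup('nosuch'))   # None
```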

### The Definition of the Token `UNICODE` 

The regular expression `&\#[0-9]+;?` searches for <span style="font-variant:small-caps;">Html</span> entities that specify a Unicode character numerically.  The corresponding strings start with the character `&`
followed by the character `#` followed by decimal digits and are optionally terminated by the character `;`.

Note that we have to escape the character `#` with a backslash: `ply` compiles its regular expressions with the flag `re.VERBOSE`, and in this mode an unescaped `#` starts a comment.

Note further that the function `chr` takes a number and returns the corresponding Unicode character.
For example, `chr(128034)` returns the character `'🐢'`. 

In [15]:
def t_UNICODE(t):
    r'&\#[0-9]+;?'
    if t.value[-1] == ';':
        number = t.value[2:-1]      # chop off '&#' at the start and ';' at the end
    else:
        number = t.value[2:]        # chop off '&#' at the start
    print(chr(int(number)), end='')

In [16]:
chr(8555)

'Ⅻ'

In [17]:
chr(128034)

'🐢'
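
The need to escape `#` can be reproduced with the standard module `re`, assuming the pattern is compiled with the flag `re.VERBOSE`, which is the default used by `ply.lex`.  The helper `decode_numeric` is hypothetical and only mirrors what `t_UNICODE` does for a single match.

```python
import re

escaped   = re.compile(r'&\#[0-9]+;?', re.VERBOSE)
unescaped = re.compile(r'&#[0-9]+;?',  re.VERBOSE)   # '#' starts a comment here

print(escaped.match('&#8555;').group())     # &#8555;
print(unescaped.match('&#8555;').group())   # &

def decode_numeric(text):
    """Replace decimal character references by the characters they denote."""
    return escaped.sub(lambda m: chr(int(m.group()[2:].rstrip(';'))), text)

print(decode_numeric('Pope Hugo &#8555;.'))  # Pope Hugo Ⅻ.
```

In the unescaped variant everything after the `#` is treated as a comment, so the pattern degenerates to a single `&`.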

### The Definition of the Token `ANY` 

The regular expression `.` matches any character that is different from a newline character.  These characters are printed unmodified.  Note that the scanner tries the regular expressions for a given state in the order that they are defined in this notebook.  Therefore, it is crucial that the function `t_ANY` is defined after all other token definitions for the `INITIAL` state are given.  The `INITIAL` state is the default state of the scanner and therefore the state the scanner is in when it starts scanning.

In [18]:
def t_ANY(t):
    r'.'
    print(t.value, end='')

### The Definition of the Token `HEAD_END` 

The regular expression `</head>` matches the closing head tag.  Note that this regular expression is only
active in the state `header` because the name of this function starts with `t_header`.  Once the closing tag has been found, the function `lexer.begin` switches the lexer back into the state `INITIAL`, which is the 
<em style="color:blue">start state</em> of the scanner.  In the state `INITIAL`, all token definitions whose names do not start with either `t_header` or `t_script` are active.

In [19]:
def t_header_HEAD_END(t):
    r'</head>'
    t.lexer.begin('INITIAL')

### The Definition of the Token `SCRIPT_END`

The regular expression `</script>` matches the closing script tag.  Note that this regular expression is only
active in state `script`.  Once the closing tag has been found, the function `lexer.begin` switches the lexer back into the state `INITIAL`, which is the start state of the scanner.  

In [20]:
def t_script_SCRIPT_END(t):
    r'</script>'
    t.lexer.begin('INITIAL')

### The Definition of the Token `ANY` in the States `header` and `script`

If the scanner is either in the state `header` or the state `script`, the function 
`t_header_script_ANY` eats up all characters without echoing them.

In [21]:
def t_header_script_ANY(t):
    r'.|\n'
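
The interplay of the states can be imitated in a few lines of plain Python.  The following is a minimal sketch of the idea, not of how `ply` is implemented.  The table `RULES` and the function `strip_header` are hypothetical names, and only the states `INITIAL` and `header` are modelled.

```python
import re

# For each state: (pattern, state to switch to or None, echo the match?).
# Rules are tried in the order given, as with the token functions above.
RULES = {
    'INITIAL': [(re.compile(r'<head>'),   'header',  False),
                (re.compile(r'<[^>]+>'),  None,      False),
                (re.compile(r'.|\n'),     None,      True)],
    'header':  [(re.compile(r'</head>'),  'INITIAL', False),
                (re.compile(r'.|\n'),     None,      False)],
}

def strip_header(text):
    state, pos, out = 'INITIAL', 0, []
    while pos < len(text):
        for pattern, next_state, echo in RULES[state]:
            m = pattern.match(text, pos)
            if m:                          # the last rule always matches
                if echo:
                    out.append(m.group())
                if next_state is not None:
                    state = next_state     # corresponds to t.lexer.begin(...)
                pos = m.end()
                break
    return ''.join(out)

print(strip_header('<head><title>T</title></head><h1>Hi</h1>'))   # Hi
```

Because the state `header` has its own rule list, the tag `<title>` is consumed character by character there and is never seen by the `INITIAL` rules.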

## Error Handling

The function `t_error` is called when none of the regular expressions defined for the current state matches at the current position of the input.  In our implementation we print the first character that could not be matched, discard this character, and continue.

<b>Note:</b>  Because the token `ANY` matches every character except a newline and the token `LINEBREAK` matches newlines, there can be no scanning **error** in the state `INITIAL`.

In [22]:
def t_error(t):
    print(f"Illegal character: '{t.value[0]}'")
    t.lexer.skip(1)

The function `t_header_error` is called when the scanner is in the state `header` and none of the regular expressions defined for this state matches at the current position of the input.  Since `t_header_script_ANY` matches every character, including newlines, this function can never be called.

In [23]:
def t_header_error(t):
    print(f"Illegal character in state 'header': '{t.value[0]}'")
    t.lexer.skip(1)

The function `t_script_error` is called when the scanner is in the state `script` and none of the regular expressions defined for this state matches at the current position of the input.  Since `t_header_script_ANY` matches every character, including newlines, this function can never be called.

In [24]:
def t_script_error(t):
    print(f"Illegal character in state 'script': '{t.value[0]}'")
    t.lexer.skip(1)

## Running the Scanner

The line below is necessary to trick `ply.lex` into assuming this program is written in an ordinary Python file instead of a *Jupyter notebook*.

In [25]:
__file__ = 'main'

The line below generates the scanner.  Because the option `debug=True` is set, we can see the regular expression that is generated for scanning.

In [26]:
lexer = lex.lex(debug=True)

lex: tokens   = ['HEAD_START', 'HEAD_END', 'SCRIPT_START', 'SCRIPT_END', 'TAG', 'LINEBREAK', 'NAMED_ENTITY', 'UNICODE', 'ANY']
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive', 'header': 'exclusive', 'script': 'exclusive'}
lex: Adding rule t_HEAD_START -> '<head>' (state 'INITIAL')
lex: Adding rule t_SCRIPT_START -> '<script>' (state 'INITIAL')
lex: Adding rule t_LINEBREAK -> '(\s*\n\s*)+' (state 'INITIAL')
lex: Adding rule t_TAG -> '<[^>]+>' (state 'INITIAL')
lex: Adding rule t_NAMED_ENTITY -> '&[a-zA-Z]+;?' (state 'INITIAL')
lex: Adding rule t_UNICODE -> '&\#[0-9]+;?' (state 'INITIAL')
lex: Adding rule t_ANY -> '.' (state 'INITIAL')
lex: Adding rule t_header_HEAD_END -> '</head>' (state 'header')
lex: Adding rule t_header_script_ANY -> '.|\n' (state 'header')
lex: Adding rule t_script_SCRIPT_END -> '</script>' (state 'script')
lex: Adding rule t_header_script_ANY -> '.|\n' (state 'script')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_HEAD_STA
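
Judging by the prefix `(?P<t_HEAD_STA` shown above, the rules of a state are combined into one alternation of named groups.  The sketch below rebuilds such an expression for three of the rules with the standard module `re`.  It is an approximation for illustration, and the names `rules` and `master` are hypothetical.

```python
import re

# Three of the INITIAL rules, in the order in which they are defined.
rules = [('t_HEAD_START', r'<head>'),
         ('t_TAG',        r'<[^>]+>'),
         ('t_ANY',        r'.')]

master = re.compile('|'.join(f'(?P<{name}>{regex})' for name, regex in rules),
                    re.VERBOSE)

# The first alternative that matches wins, so the order matters.
print(master.match('<head>').lastgroup)   # t_HEAD_START
print(master.match('<body>').lastgroup)   # t_TAG
print(master.match('x').lastgroup)        # t_ANY
```

If `t_TAG` came first in the list, the string `<head>` would be classified as an ordinary tag and the scanner would never enter the state `header`.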

Next, we feed our input string into the generated scanner.

In [27]:
lexer.input(data)

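In order to scan the data that we provided in the previous cell, we iterate over all tokens generated by our scanner.  Since none of the token functions returns a token, the body of the loop is never executed; the text is printed as a side effect of the token functions.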

In [28]:
def scan(lexer):
    for _ in lexer:
        pass

In [29]:
scan(lexer)









Prof. Dr. Karl Stroetmann



Duale Hochschule Baden-Württemberg
Coblitzallee 1-9
68163 Mannheim
Germany

Office:     Raum 344B
Phone:    +49 621 4105-1376
Fax:        +49 621 4105-1194
Skype:     karlstroetmann


Some links:


My lecture notes,
as well as the programs presented in class, can be found
at 
https://github.com/karlstroetmann.

Most of my papers can be found at researchgate.net.
The programming language SetlX can be downloaded at 
http://randoom.org/Software/SetlX.






As I am getting old and wise, I have to accept the limits of
my own capabilities.  I have condensed these deep philosophical
insights into a most beautiful pearl of poetry.  I would like
to share these humble words of wisdom:

I am a teacher by profession,    
mostly really by obsession;      
But even though I boldly try,    
I just cannot teach pigs to fly.
Instead, I slaughter them and fry.



Any sufficiently advanced poetry is indistinguishable from divine wisdom.

His holiness Pope Hugo Ⅻ.



