In [None]:
%%HTML
<style>
.container { width: 100% }
</style>

# Converting <span style="font-variant:small-caps;">Html</span> to Text

This notebook shows how we can use the library [`ply`](https://ply.readthedocs.io/en/latest/ply.html)
to extract the text that is embedded in an <span style="font-variant:small-caps;">Html</span> file.  
To keep the notebook concise, it supports only a small subset of 
<span style="font-variant:small-caps;">Html</span>.  Below is the content of my old
<a href="http://wwwlehre.dhbw-stuttgart.de/~stroetma/">web page</a> that I used while I still 
worked at the DHBW Stuttgart.  The goal of this notebook is to write a scanner that is able to extract 
the text from this web page.

In [None]:
data = \
'''
<html>
  <head>
    <meta charset="utf-8">
    <title>Homepage of Prof. Dr. Karl Stroetmann</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    <link href="http://fonts.googleapis.com/css?family=Rochester&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Pacifico&subset=latin,latin-ext"
          rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Cabin+Sketch&subset=latin,latin-ext" rel="stylesheet" type="text/css">
    <link href="http://fonts.googleapis.com/css?family=Sacramento" rel="stylesheet" type="text/css">
  </head>
  <body>
    <hr/>

    <div id="table">
      <header>
        <h1 id="name">Prof. Dr. Karl Stroetmann</h1>
      </header>

      <div id="row1">
        <div class="right">
          <a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule Baden-W&uuml;rttemberg</a>
          <br/>Coblitzallee 1-9
          <br/>68163 Mannheim
          <br/>Germany
	  <br>
          <br/>Office: &nbsp;&nbsp;&nbsp; Raum 344B
          <br/>Phone:&nbsp;&nbsp;&nbsp; +49 621 4105-1376
          <br/>Fax:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; +49 621 4105-1194
          <br/>Skype: &nbsp;&nbsp;&nbsp; karlstroetmann
        </div>  


        <div id="links">
          <strong class="some">Some links:</strong>
          <ul class="inlink">
            <li class="inlink">
	      My <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">lecture notes</a>,
              as well as the programs presented in class, can be found
              at <br>
              <a class="inlink" href="https://github.com/karlstroetmann?tab=repositories">https://github.com/karlstroetmann</a>.
              
            </li>
            <li class="inlink">Most of my papers can be found at <a class="inlink" href="https://www.researchgate.net/">researchgate.net</a>.</li>
            <li class="inlink">The programming language SetlX can be downloaded at <br>
              <a href="http://randoom.org/Software/SetlX"><tt class="inlink">http://randoom.org/Software/SetlX</tt></a>.
            </li>
          </ul>
        </div>
      </div>
    </div>
    
    <div id="intro">
      As I am getting old and wise, I have to accept the limits of
      my own capabilities.  I have condensed these deep philosophical
      insights into a most beautiful pearl of poetry.  I would like 
      to share these humble words of wisdom:
      
      <div class="poetry">
        I am a teacher by profession,    <br>
        mostly really by obsession;      <br>
        But even though I boldly try,    <br>
        I just cannot teach <a href="http://img1.wikia.nocookie.net/__cb20070831020747/uncyclopedia/images/a/a2/Flying_Pig.jpg" id="fp">pigs</a> to fly.</br>
        Instead, I slaughter them and fry.
      </div>
      
      <div class="citation">
        <div class="quote">
          Any sufficiently advanced poetry is indistinguishable from divine wisdom.
        </div>
        <div id="sign">His holiness Pope Hugo &#8555;.</div>
      </div>
    </div>
</div>

</body>
</html>
'''

We will use the library [ply](https://ply.readthedocs.io/en/latest/ply.html) to extract the text that
is embedded in the <span style="font-variant:small-caps;">Html</span> shown above.
In this example, we only need the scanner generator, which is provided by the module `ply.lex`. 
Hence we import this module.

In [None]:
import ply.lex as lex

We begin by defining the list of tokens.  Note that `tokens` is a name that `ply` looks up in order to find the names of the token classes.  In this case, we declare nine different tokens.
- `HEAD_START` will match the tag `<head>` that starts the definition of the 
  <span style="font-variant:small-caps;">Html</span> header.
- `HEAD_END` will match the tag `</head>` that ends the definition of the 
  <span style="font-variant:small-caps;">Html</span> header.
- `SCRIPT_START` will match the tag `<script>` that starts embedded *javascript* code.
- `SCRIPT_END` will match the tag `</script>` that ends embedded *javascript* code.
- `TAG` is a token that represents arbitrary <span style="font-variant:small-caps;">Html</span> tags.
- `LINEBREAK` is a token that will match the newline character `\n` at the end of a line.
- `NAMED_ENTITY` is a token that represents named <span style="font-variant:small-caps;">Html5</span>
  entities.
- `UNICODE` is a token that represents a unicode entity.
- `ANY` is a token that matches any character.

In [None]:
tokens = [ 'HEAD_START',
           'HEAD_END',
           'SCRIPT_START',
           'SCRIPT_END',
           'TAG',
           'LINEBREAK', 
           'NAMED_ENTITY',
           'UNICODE',
           'ANY'
         ]

Once we are inside an <span style="font-variant:small-caps;">Html</span> header or inside some
*javascript* code, the rules of the scanning game change.  Therefore, we declare two new *exclusive* states.  While the scanner is in an exclusive state, only the rules defined for that state are active.
- `header` is the state the scanner is in while it is scanning an 
  <span style="font-variant:small-caps;">Html</span> header.
- `script` is the state of the scanner while scanning *javascript*.  

In [None]:
states = [ ('header', 'exclusive'),
           ('script', 'exclusive')
         ]

We proceed to define the tokens.  Note that none of the functions defined below
returns a token.  Rather, these functions either print the transformation of the 
<span style="font-variant:small-caps;">Html</span> that they have matched, discard it, or switch the state of the scanner.

Once the scanner reads the opening tag `<head>`, it switches into the state `header`.  In this state it will continue to read and discard characters until it sees the closing tag `</head>`.

In [None]:
def t_HEAD_START(t):
    r'<head>'
    t.lexer.begin('header')

Once the scanner reads an opening `<script>` tag, with or without attributes, it switches into the state `script`.  In this state it will continue to read and discard characters until it sees the closing tag `</script>`.

In [None]:
def t_SCRIPT_START(t):
    r'<script[^>]*>'
    t.lexer.begin('script')

Groups of newline characters are condensed into a single newline character.

In [None]:
def t_LINEBREAK(t):
    r'\n+'
    print()
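The effect of the pattern `\n+` can be checked in isolation with the module `re`.  The lines below are a minimal sketch that uses `re.sub` instead of `ply`; the function name `collapse_newlines` is made up for this illustration.

```python
import re

def collapse_newlines(text):
    # Replace every run of consecutive newline characters by a single newline.
    return re.sub(r'\n+', '\n', text)

collapsed = collapse_newlines('Germany\n\n\nOffice')

# A line that contains only blanks separates two runs of newlines,
# so both newlines survive.
with_blank = collapse_newlines('Germany\n  \nOffice')
```

The second call suggests why the output of the scanner may still contain lines that look empty: a line consisting only of blanks is not a group of newline characters.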

The token `TAG` is defined as any string that starts with the character `<` and ends with the character 
`>`. Between these two characters there has to be a nonzero number of characters that are different from 
the character `>`.

In [None]:
def t_TAG(t):
    r'<[^>]+>'
    pass
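The pattern for tags can be tried out with plain `re`.  This is only a sketch outside of `ply`; the names `TAG`, `sample`, `stripped`, `empty_is_tag`, and `leftover` are chosen for this illustration.

```python
import re

TAG = re.compile(r'<[^>]+>')

sample   = '<a id="dhbw" href="http://www.ba-stuttgart.de">Duale Hochschule</a><br/>'
stripped = TAG.sub('', sample)

# At least one character is required between '<' and '>'.
empty_is_tag = TAG.fullmatch('<>') is not None

# Limitation: a '>' inside an attribute value ends the match too early.
leftover = TAG.sub('', '<a title="a>b">x</a>')
```

The last line shows one reason why this notebook supports only a subset of <span style="font-variant:small-caps;">Html</span>: the pattern does not know about quoted attribute values.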

In order to support named <span style="font-variant:small-caps;">Html</span> entities we need to import
the dictionary `html5` from the module `html.entities`.  For every named 
<span style="font-variant:small-caps;">Html</span> entity `e`, `html5[e]` is the unicode symbol that is specified by `e`.

In [None]:
from html.entities import html5

In [None]:
html5['auml']
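The keys of `html5` come in two spellings, which matters when a name is looked up.  The following sketch inspects a few keys; the variable names are chosen for this illustration.

```python
from html.entities import html5

# 'auml' may be written with or without the trailing ';', so both keys exist.
both_forms = ('auml' in html5, 'auml;' in html5)

# 'alpha' is only valid with the trailing ';'.
alpha_forms = ('alpha' in html5, 'alpha;' in html5)

uuml = html5['uuml;']
```

In other words, the trailing `;` is part of the key, and only those entities that may be written without it are stored a second time without it.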

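The regular expression `&[a-zA-Z]+;?` searches for <span style="font-variant:small-caps;">Html</span>
entity names.  These are strings that start with the character `&` followed by the name of the entity, optionally followed by the character `;`.  The keys of the dictionary `html5` include the trailing `;`, and entities that may be written without it are additionally stored without it.  Hence the matched text without the leading `&` is looked up directly and the corresponding unicode character is printed.  A name that is not a known entity is echoed unchanged.

In [None]:
def t_NAMED_ENTITY(t):
    r'&[a-zA-Z]+;?'
    # Keys of html5 include the trailing ';' (e.g. 'uuml;'); entities that are
    # valid without ';' are stored under both spellings, so nothing is stripped.
    entity_name = t.value[1:]
    # Fall back to the matched text if the name is not a known entity.
    print(html5.get(entity_name, t.value), end='')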
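As a cross-check for the output of this rule, the standard library function `html.unescape` decodes named entities as well.  The lines below are only a comparison and not part of the scanner; the variable names are chosen for this illustration.

```python
import html

sample  = 'Duale Hochschule Baden-W&uuml;rttemberg'
decoded = html.unescape(sample)

# The entity &nbsp; is decoded into the no-break space U+00A0.
nbsp = html.unescape('&nbsp;')
```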

The regular expression `&\#[0-9]+;` searches for <span style="font-variant:small-caps;">Html</span> entities that specify a unicode character numerically.  The corresponding strings start with the character `&`
followed by the character `#` followed by digits and are terminated by the character `;`.

Note that we had to escape the character `#` with a backslash: `ply` compiles its regular expressions in verbose mode, where an unescaped `#` signals the beginning of a comment.

In [None]:
def t_UNICODE(t):
    r'&\#[0-9]+;'
    print(chr(int(t.value[2:-1])), end='')
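Both the need for the backslash and the conversion of the digits can be checked with plain `re`.  The sketch below passes the flag `re.VERBOSE` explicitly, which is the flag that `ply.lex` uses by default; the variable names are chosen for this illustration.

```python
import re

# In verbose mode an unescaped '#' starts a comment, so the first
# pattern shrinks to the single character '&'.
unescaped = re.compile(r'&#[0-9]+;',  re.VERBOSE)
escaped   = re.compile(r'&\#[0-9]+;', re.VERBOSE)

entity = '&#8555;'
matches_unescaped = unescaped.fullmatch(entity) is not None
matches_escaped   = escaped.fullmatch(entity) is not None

# Strip the leading '&#' and the trailing ';', then convert the code point.
symbol = chr(int(entity[2:-1]))
```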

The regular expression `.` matches any character that is different from a newline character.

In [None]:
def t_ANY(t):
    r'.'
    print(t.value, end='')

The regular expression `</head>` matches the closing head tag.  Note that this regular expression is only
active in state `header` as the name of this function starts with `t_header`.  Once the closing tag has been found, the function `lexer.begin` switches the lexer back into the state `INITIAL`, which is the 
start state of the scanner.  In this state, all token definitions are active that do not start with 
either `t_header` or `t_script`.

In [None]:
def t_header_HEAD_END(t):
    r'</head>'
    t.lexer.begin('INITIAL')

The regular expression `</script>` matches the closing script tag.  Note that this regular expression is only
active in state `script`.  Once the closing tag has been found, the function `lexer.begin` switches the lexer back into the state `INITIAL`, which is the 
start state of the scanner.  This rule has to be defined before the catch-all rule `t_header_script_ANY` below: `ply` tries function rules in the order of their definition, and the catch-all rule would otherwise consume the `<` of `</script>` before the closing tag could be recognized.

In [None]:
def t_script_SCRIPT_END(t):
    r'</script>'
    t.lexer.begin('INITIAL')

If the scanner is either in the state `header` or the state `script`, the function 
`t_header_script_ANY` eats all characters without echoing them.

In [None]:
def t_header_script_ANY(t):
    r'.|\n'
    pass
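The interplay of the states can be imitated with plain `re`.  The following is a hand-written sketch and not the code that `ply` generates: the table `RULES` and the function `strip_header` are hypothetical names, and only the state `header` is modelled.  The rules of a state are tried in order, so the rule for the closing tag is listed before the catch-all rule.

```python
import re

# One list of (pattern, action) pairs per state.  An action returns the
# text to emit and the state to continue in.
RULES = {
    'INITIAL': [
        (re.compile(r'<head>'),  lambda m: ('', 'header')),
        (re.compile(r'<[^>]+>'), lambda m: ('', 'INITIAL')),
        (re.compile(r'.|\n'),    lambda m: (m.group(), 'INITIAL')),
    ],
    'header': [
        (re.compile(r'</head>'), lambda m: ('', 'INITIAL')),
        (re.compile(r'.|\n'),    lambda m: ('', 'header')),
    ],
}

def strip_header(text):
    state, pos, out = 'INITIAL', 0, []
    while pos < len(text):
        # Only the rules of the current state are consulted (exclusive state).
        for pattern, action in RULES[state]:
            m = pattern.match(text, pos)
            if m:
                emitted, state = action(m)
                out.append(emitted)
                pos = m.end()
                break
    return ''.join(out)

result = strip_header('<head><title>T</title></head><p>Hi</p>')
```

The title inside the header is discarded because the tag rule of the state `INITIAL` is not consulted while the sketch is in the state `header`.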

The function `t_error` is called when no prefix of the remaining input can be matched by any of the regular expressions defined for the tokens.  In our implementation we print the first character that could not be matched, discard this character, and continue.

In [None]:
def t_error(t):
    print(f"Illegal character: '{t.value[0]}'")
    t.lexer.skip(1)

The function `t_header_error` is called when the scanner is in state `header` and no prefix of the remaining input can be matched by any of the regular expressions that are active in this state. 

In [None]:
def t_header_error(t):
    print(f"Illegal character in state 'header': '{t.value[0]}'")
    t.lexer.skip(1)

The function `t_script_error` is called when the scanner is in state `script` and no prefix of the remaining input can be matched by any of the regular expressions that are active in this state. 

In [None]:
def t_script_error(t):
    print(f"Illegal character in state 'script': '{t.value[0]}'")
    t.lexer.skip(1)

The line below is necessary to trick `ply.lex` into assuming this program is written in an ordinary Python file instead of a *Jupyter notebook*.

In [None]:
__file__ = 'main'

The line below generates the scanner.

In [None]:
lexer = lex.lex(debug=True)

Next, we feed our input string into the generated scanner.

In [None]:
lexer.input(data)

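In order to scan the data that we provided in the previous cell, we iterate over the scanner.  Since none of the token functions returns a token, the body of the loop is never executed; the iteration merely drives the scanner through the input, and the extracted text is printed as a side effect.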

In [None]:
def scan(lexer):
    for t in lexer:
        pass

In [None]:
scan(lexer)
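Since the rules print their output, the extracted text ends up on standard output.  If the text is needed as a string, the output can be captured.  The sketch below assumes nothing about `ply`: the helper `capture` is a hypothetical name, and the function `demo` merely stands in for a call like `scan(lexer)`.

```python
import io
from contextlib import redirect_stdout

def capture(action, *args):
    # Run action(*args) and return everything it printed as a string.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        action(*args)
    return buffer.getvalue()

def demo():
    # Stand-in for the scanner: prints without a trailing newline.
    print('Prof. Dr. Karl Stroetmann', end='')

text = capture(demo)
```

In this notebook, a call like `capture(scan, lexer)` would presumably work the same way, provided that `lexer.input(data)` has been called again beforehand, because the previous scan has already consumed the input.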