<div class="alert alert-danger">
**Due date:** 2018-01-19
</div>

# L0: Text segmentation

## Introduction

From a computer&rsquo;s perspective, a text is first of all a sequence of characters, such as letters and digits. Before we can process a text with language technology tools, we need to segment it into linguistically more meaningful units, such as paragraphs, sentences, or words. This basic technique is called **text segmentation**. When the target units are words, segmentation is called **tokenisation**. In this lab you will implement a simple tokeniser for running text.

In [1]:
import nlp0

## Data

The text you will be working with is an article from Swedish Wikipedia: [Gustav III](https://sv.wikipedia.org/wiki/Gustav_III). Look at the webpage and see how it is structured.

A Wikipedia page consists not only of text but also of other data, such as pictures and tables. Before you can start tokenising the text, you would usually need to extract it from the page using a tool like [Scrapy](https://scrapy.org). For this lab this has already been done for you, which means that your starting point will be the extracted text.

### Read in the raw text

In order to read in the extracted text in Python, we define a helper function `read_data()`. The function opens the given file and returns its content as a list of lines. The text file uses newline characters (`\n`) to end each line; these are removed using Python's [`str.rstrip()`](https://docs.python.org/3.5/library/stdtypes.html#str.rstrip), which strips all trailing whitespace from a line but leaves leading whitespace untouched.

In [2]:
def read_data(filename):
    with open(filename) as f:
        return [line.rstrip() for line in f]
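
The effect of `str.rstrip()` can be checked on a few made-up lines. This is only a minimal sketch: the sample lines below are invented for illustration and are not read from the lab data.

```python
# Made-up sample lines (not read from the lab data), as they would come from a file
raw_lines = ["Gustav III\n", ", född\n", " och \n", "\n"]

# rstrip() removes all trailing whitespace (including "\n") but keeps leading whitespace
stripped = [line.rstrip() for line in raw_lines]
print(stripped)

# rstrip("\n") would remove only trailing newline characters
newline_only = [line.rstrip("\n") for line in raw_lines]
print(newline_only)
```

Because leading whitespace survives, lines such as `' och'` still start with a space after reading.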

You can now read in the raw text:

In [3]:
text1 = read_data("/home/TDDE09/labs/l0/data/text1.txt")

Look at the text in a text editor and try to identify peculiarities that might create problems for further analysis. The text is automatically extracted, using methods that read the data from the website&rsquo;s HTML tree.

You can also look at the text directly from the notebook. The next command prints a list with the first 50&nbsp;lines of the text:

In [4]:
print(text1[:50])

['Gustav III', ', född', '13 januari', ' (', 'g.s.', ')/', '24 januari', ' (', 'n.s.', ')', '1746', ', död', '29 mars', '', '1792', ', var', 'Sveriges kung', ' 1771–1792.', '', 'Han var son till', 'Adolf Fredrik', ' och', 'Lovisa Ulrika', ', bror till', 'Karl XIII', ', far till', 'Gustav IV Adolf', ', och kusin till', 'Katarina II av Ryssland', '. På grund av sitt stora kulturintresse – han instiftade bland annat', 'Svenska Akademien', ' – kallas han ibland "Teaterkungen".', '[', '1', ']', ' 1772 genomförde han', 'Gustav III:s statskupp', ', då regeringen avsattes, Sveriges första politiska partier tvångsupplöstes och kungen blev i praktiken enväldig. Hans', 'upplysta despotism', ' bekräftades senare genom en inskränkt tryckfrihetsförordning 1774 och', 'förenings- och säkerhetsakten', ' 1789, som kraftigt begränsade riksdagens makt. Samtidigt liberaliserades ekonomin och strafflagstiftningen,', 'dödsstraffet', ' begränsades,', 'tortyr', ' under förhör förbjöds och tortyrkammaren i Stoc

The following code snippet recreates the content from the text file in lines 51 to 60, glueing the lines together using the newline character:

In [5]:
print("\n".join(text1[50:60]))

Katoliker
 och
judar
 tilläts att bosätta sig i riket, dock med begränsade medborgerliga rättigheter.

Hans krig mot Ryssland 1788-1790 slutade utan landvinningar eller -förluster efter
slaget vid Svensksund
, som blev en seger mot den ryska flottan. Gustav var personlig vän med det franska kungahuset och engagerade sig i motståndet mot den
franska revolutionen
 och undertryckte all opposition med järnhand. Hans politik gjorde honom impopulär inom delar av adeln och det bildades en sammansvärjning för att avsätta honom. Han skadesköts vid ett


### Read in the gold standard

There exists a gold standard tokenisation for the raw text. This tokenisation follows the rules used in the [Stockholm–Umeå Corpus (SUC)](https://spraakbanken.gu.se/swe/resurs/suc3), a standard corpus for Swedish. The file containing the gold standard tokenisation consists of all tokens from the raw text, with one token per line.

In [6]:
gold1 = read_data("/home/TDDE09/labs/l0/data/gold1.txt")

Look at the gold standard and try to understand the principles it is based on. Most tokens are normal words or punctuation marks, but note that abbreviations are handled as one token.

In [7]:
print(gold1[:50])

['Gustav', 'III', ',', 'född', '13', 'januari', '(', 'g.s.', ')', '/', '24', 'januari', '(', 'n.s.', ')', '1746', ',', 'död', '29', 'mars', '1792', ',', 'var', 'Sveriges', 'kung', '1771', '–', '1792', '.', 'Han', 'var', 'son', 'till', 'Adolf', 'Fredrik', 'och', 'Lovisa', 'Ulrika', ',', 'bror', 'till', 'Karl', 'XIII', ',', 'far', 'till', 'Gustav', 'IV', 'Adolf', ',']


## Whitespace tokenisation

The next cell contains a very simple tokeniser:

In [8]:
def tokenize_ws(lines):
    tokens = []
    for line in lines:
        for token in line.split():
            tokens.append(token)
    return tokens

This function takes a list with text lines, splits every line at whitespace using the function [`str.split()`](https://docs.python.org/3.5/library/stdtypes.html#str.split), and collects the resulting strings in a list `tokens`.
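
To see what this means in practice, the tokeniser can be run on a handful of lines. The sketch below repeats the function so that it is self-contained; the sample lines are copied by hand from the beginning of the raw text shown earlier.

```python
def tokenize_ws(lines):
    tokens = []
    for line in lines:
        # split() without arguments splits at runs of whitespace and drops empty strings
        for token in line.split():
            tokens.append(token)
    return tokens

# A few lines copied by hand from the start of the raw text
sample = ["Gustav III", ", född", " (", "g.s.", ")/", "", " 1771–1792."]
tokens = tokenize_ws(sample)
print(tokens)
# ['Gustav', 'III', ',', 'född', '(', 'g.s.', ')/', '1771–1792.']
```

Empty lines contribute no tokens, and character sequences without internal whitespace, such as `)/` and `1771–1792.`, are never split.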

### Compare the tokenisation with the gold standard

Test the tokeniser on the first 50 lines of the text:

In [9]:
print(tokenize_ws(text1[:50]))

['Gustav', 'III', ',', 'född', '13', 'januari', '(', 'g.s.', ')/', '24', 'januari', '(', 'n.s.', ')', '1746', ',', 'död', '29', 'mars', '1792', ',', 'var', 'Sveriges', 'kung', '1771–1792.', 'Han', 'var', 'son', 'till', 'Adolf', 'Fredrik', 'och', 'Lovisa', 'Ulrika', ',', 'bror', 'till', 'Karl', 'XIII', ',', 'far', 'till', 'Gustav', 'IV', 'Adolf', ',', 'och', 'kusin', 'till', 'Katarina', 'II', 'av', 'Ryssland', '.', 'På', 'grund', 'av', 'sitt', 'stora', 'kulturintresse', '–', 'han', 'instiftade', 'bland', 'annat', 'Svenska', 'Akademien', '–', 'kallas', 'han', 'ibland', '"Teaterkungen".', '[', '1', ']', '1772', 'genomförde', 'han', 'Gustav', 'III:s', 'statskupp', ',', 'då', 'regeringen', 'avsattes,', 'Sveriges', 'första', 'politiska', 'partier', 'tvångsupplöstes', 'och', 'kungen', 'blev', 'i', 'praktiken', 'enväldig.', 'Hans', 'upplysta', 'despotism', 'bekräftades', 'senare', 'genom', 'en', 'inskränkt', 'tryckfrihetsförordning', '1774', 'och', 'förenings-', 'och', 'säkerhetsakten', '1789,

Compare this tokenisation with the gold standard. Which differences do you find?

Most differences can be explained as **undersegmentation**, where the tokeniser has failed to split a character sequence that the gold standard treats as several tokens. The opposite is **oversegmentation**, where the tokeniser splits a character sequence that should really be one token.
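
One rough way to tell the two error types apart is to compare how many tokens each side has for a divergent stretch of text. The sketch below assumes hand-written pairs of the form (gold tokens, predicted tokens); the helper `classify()` is hypothetical and not part of the lab module, and counting tokens is only a heuristic, since a single stretch can mix both error types.

```python
def classify(gold_tokens, pred_tokens):
    """Label one divergent pair by comparing the number of tokens on each side."""
    if len(pred_tokens) < len(gold_tokens):
        return "undersegmentation"   # the tokeniser produced too few tokens
    if len(pred_tokens) > len(gold_tokens):
        return "oversegmentation"    # the tokeniser produced too many tokens
    return "other"                   # same count, but different boundaries

# Hand-written example pairs: (gold tokens, predicted tokens)
pairs = [
    (["1771", "–", "1792", "."], ["1771–1792."]),
    (["tecknings-"], ["tecknings", "-"]),
]

labels = [classify(gold_tokens, pred_tokens) for gold_tokens, pred_tokens in pairs]
print(labels)
# ['undersegmentation', 'oversegmentation']
```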

In order to examine the differences, you can use the function `diff()` from the lab module. This function expects two arguments, a list with gold standard tokens and a list with automatically predicted tokens. It returns a new list that shows the differences between the two tokenisations in a compact way. The following command shows the first ten differences:

In [10]:
nlp0.diff(gold1, tokenize_ws(text1))[:10]

[([')', '/'], [')/']),
 (['1771', '–', '1792', '.'], ['1771–1792.']),
 (['"', 'Teaterkungen', '"', '.'], ['"Teaterkungen".']),
 (['avsattes', ','], ['avsattes,']),
 (['enväldig', '.'], ['enväldig.']),
 (['1789', ','], ['1789,']),
 (['makt', '.'], ['makt.']),
 (['strafflagstiftningen', ','], ['strafflagstiftningen,']),
 (['begränsades', ','], ['begränsades,']),
 (['gott', '.'], ['gott.'])]

The list contains pairs whose first component is a sequence of tokens that appear in the gold standard but not in the automatic tokenisation, and whose second component is a sequence of tokens that appear in the automatic tokenisation but not in the gold standard. The following code snippet prints the list in a more readable way:

In [11]:
# Helper function that formats a list of tokens
def fmt_tokens(tokens):
    return "{} {}".format(" ".join(tokens), len(tokens))

# Print out information about divergent subsequences
print("Gold tokens".ljust(40), "Predicted tokens".ljust(40))
print()
for gold_tokens, pred_tokens in nlp0.diff(gold1, tokenize_ws(text1)):
    print(fmt_tokens(gold_tokens).ljust(40), fmt_tokens(pred_tokens).ljust(40))

Gold tokens                              Predicted tokens                        

) / 2                                    )/ 1                                    
1771 – 1792 . 4                          1771–1792. 1                            
" Teaterkungen " . 4                     "Teaterkungen". 1                       
avsattes , 2                             avsattes, 1                             
enväldig . 2                             enväldig. 1                             
1789 , 2                                 1789, 1                                 
makt . 2                                 makt. 1                                 
strafflagstiftningen , 2                 strafflagstiftningen, 1                 
begränsades , 2                          begränsades, 1                          
gott . 2                                 gott. 1                                 
riket , 2                                riket, 1                                
rättigheter . 2

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Examine the differences between the gold standard and the whitespace-based tokenisation. Try to classify different types of undersegmentation and think of ways to eliminate them. Give at least three examples of different types and describe their characteristics. Give at least one example of oversegmentation.
</div>
</div>

In order to solve this problem, you can either examine the output from the previous code cell by hand or write code to solve this task for you.

In [12]:
# You might want to write some code here.

*TODO: Insert your answer to Problem&nbsp;1 here*

Examples of different types of undersegmentation:

* Example 1: [1771 – 1792 . 4   vs   1771–1792. 1]
    * The gold standard splits at the dash "–" and the full stop ".", but the whitespace tokeniser only splits at whitespace.
* Example 2: [lös ! " , 4   vs   lös!", 1]
    * A similar issue: the gold standard treats the "!" character, the quotation mark, and the comma as tokens of their own.
* Example 3: [löd ; Gustaf , 4   vs   löd; Gustaf, 2]
    * A similar issue: the gold standard treats the ";" and "," characters as tokens of their own.

Example of oversegmentation:

* [tecknings- 1   vs   tecknings - 2]
    * For this to occur, the raw text must contain "tecknings" and "-" separated by whitespace or a line break. The whitespace tokeniser therefore produces two tokens, whereas the gold standard treats "tecknings-" as a single token.

### Compute precision and recall

One way to do a quantitative evaluation of the tokeniser is to compute its **precision** and its **recall**. Precision is defined as the percentage of correct tokens among all tokens the system has identified. Recall is defined as the percentage of correctly identified tokens among all tokens in the gold standard. In order to compute those values you can use the next code cell:

In [13]:
tokens_ws = tokenize_ws(text1)

print("Errors: {}".format(nlp0.n_errors(gold1, tokens_ws)))
print("Precision: {:.2%}".format(nlp0.precision(gold1, tokens_ws)))
print("Recall: {:.2%}".format(nlp0.recall(gold1, tokens_ws)))

Errors: 1055
Precision: 90.82%
Recall: 82.03%
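
To make the two definitions concrete, here is a minimal sketch of one way to compute them. It is not the implementation in `nlp0`, whose internals are not shown here. It assumes that both token lists are non-empty, cover the same characters, and contain no whitespace, and it counts a predicted token as correct if it occupies exactly the same character span as a gold token. The helper names `spans()` and `precision_recall()` are hypothetical.

```python
def spans(tokens):
    """Map a token list to the set of (start, end) character offsets,
    counted over the text with all whitespace removed."""
    result = set()
    position = 0
    for token in tokens:
        result.add((position, position + len(token)))
        position += len(token)
    return result

def precision_recall(gold, pred):
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)   # tokens with identical boundaries
    return correct / len(pred_spans), correct / len(gold_spans)

gold = ["1771", "–", "1792", ".", "Han"]
pred = ["1771–1792.", "Han"]
precision, recall = precision_recall(gold, pred)
print("Precision: {:.2%}".format(precision))   # Precision: 50.00%
print("Recall: {:.2%}".format(recall))         # Recall: 20.00%
```

Only `Han` has the same boundaries in both lists, so one of two predicted tokens is correct (precision) and one of five gold tokens is found (recall). Undersegmentation tends to hurt recall more than precision, which matches the pattern in the numbers above.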


## Tokenisation based on regular expressions

In the second part of this lab you will replace the simple whitespace-based tokenisation with a more advanced tokenisation based on **regular expressions**. Before you can use regular expressions in Python, you first have to import the relevant module:

In [14]:
import re

A simple tokeniser based on regular expressions looks like this:

In [15]:
def tokenize_re(regex, lines):
    output = []
    for line in lines:
        for match in re.finditer(regex, line):
            output.append(match.group(0))
    return output

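This function finds all non-overlapping matches of the pattern `regex` in the string `line` and collects them in a list. The line is scanned from left to right and the matching substrings are returned in the same order. Quantifiers such as `+` are greedy, but when the pattern contains alternatives (`|`), they are tried from left to right and the first one that matches is used, even if a later alternative would give a longer match.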
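
When a pattern consists of several alternatives, their order matters: Python's `re` module tries alternatives from left to right and takes the first one that matches at the current position, not the longest one. A small sketch with two toy patterns (chosen for illustration only):

```python
import re

line = "förenings- och säkerhetsakten"

# "\w+" is listed first and wins, so the trailing hyphen is never matched
general_first = re.findall(r"\w+|\w+-", line)
print(general_first)
# ['förenings', 'och', 'säkerhetsakten']

# Listing the more specific alternative first keeps the hyphen attached
specific_first = re.findall(r"\w+-|\w+", line)
print(specific_first)
# ['förenings-', 'och', 'säkerhetsakten']
```

For a tokeniser this means that more specific patterns should come before more general ones.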

The whitespace-based tokeniser corresponds to the regular expression `\S+`. The following cell uses a more fine-grained expression instead, built from several alternatives, and evaluates it on the first text:

In [16]:
# Regular expression the tokeniser will use.
# Alternatives are tried from left to right, so the more specific patterns come first.
zero = r"etc\."                               # Hardcoded abbreviation; the dot is escaped to match literally
first = r"\w['’]\w+"                          # E.g. "d'andrea"
second = r"\w+-\w+"                           # E.g. "word1-word2"
third = r"\w+-"                               # E.g. "förenings-"
fourth = r"\w+[-.:]+(?:\w+[-.:]?)+"           # E.g. "g.s."
fifth = r"'\w+"                               # E.g. the "'re" in "what're"
sixth = r"\w+"                                # E.g. "word"
seventh = r"\S"                               # Any other non-whitespace character

regex = "|".join([zero, first, second, third, fourth, fifth, sixth, seventh])
tokens_re = tokenize_re(regex, text1)


print("Errors: {}".format(nlp0.n_errors(gold1, tokens_re)))
print("Precision: {:.2%}".format(nlp0.precision(gold1, tokens_re)))
print("Recall: {:.2%}".format(nlp0.recall(gold1, tokens_re)))

# In order to debug the regex, you might want to comment in the next line.
nlp0.diff(gold1, tokens_re)

Errors: 9
Precision: 99.85%
Recall: 99.93%


[(['-förluster'], ['-', 'förluster']),
 (['tecknings-'], ['tecknings', '-']),
 (['drama-'], ['drama', '-'])]

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Find a regular expression that eliminates as many differences between the gold standard and the automatic tokenisation as possible. Your final tokeniser should have at least 99.5% precision and recall.
</div>
</div>

Here are some hints:

* Read the [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) and [the documentation for the module  `re`](https://docs.python.org/3.5/library/re.html).

* If you want to use grouping sub-expressions, you might want to use *non-capturing* groups.

* If your expression gets too long and hard to read, have a look at [`re.VERBOSE`](https://docs.python.org/3.5/library/re.html#re.VERBOSE) for writing the expression over multiple lines.

* If you want to practice your regex skills a little more, hop over to [RegexOne](https://regexone.com) or [RegExr](http://regexr.com).
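
The hints about non-capturing groups and `re.VERBOSE` can be sketched as follows. The pattern is a deliberately small illustration and not a solution to Problem&nbsp;2; the name `TOKEN` is made up for this example.

```python
import re

# Illustrative pattern only, not a solution to Problem 2
TOKEN = re.compile(r"""
    (?:\w\.)+    # abbreviations made of single characters and dots, e.g. "g.s."
  | \w+          # ordinary words and numbers
  | \S           # any other single non-whitespace character
""", re.VERBOSE)

tokens = TOKEN.findall("13 januari (g.s.)/24 januari")
print(tokens)
# ['13', 'januari', '(', 'g.s.', ')', '/', '24', 'januari']

# With a capturing group, findall() returns the group instead of the whole match
captured = re.findall(r"(\w\.)+|\w+", "g.s. var")
print(captured)
# ['s.', '']
```

In verbose mode, whitespace and `#` comments inside the pattern are ignored, so each alternative can sit on its own documented line. The second call shows why non-capturing groups are the safer default. The function `tokenize_re()` above avoids this pitfall because it reads `match.group(0)`, which is always the whole match.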

## Evaluate the tokeniser on new text

Your last task is to evaluate your tokeniser on another article from Swedish Wikipedia: [Katarina II av Ryssland](https://sv.wikipedia.org/wiki/Katarina_II_av_Ryssland). (She was Gustav&nbsp;III&rsquo;s cousin.)

The raw text and the gold standard tokenisation are loaded like this:

In [17]:
text2 = read_data("/home/TDDE09/labs/l0/data/text2.txt")
gold2 = read_data("/home/TDDE09/labs/l0/data/gold2.txt")



<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Evaluate your regular expression from Problem&nbsp;2 on the new text. Report the results and try to explain them. Write a short discussion (max. 250 words).
</div>
</div>

In [18]:

tokens_re = tokenize_re(regex, text2)


print("Errors: {}".format(nlp0.n_errors(gold2, tokens_re)))
print("Precision: {:.2%}".format(nlp0.precision(gold2, tokens_re)))
print("Recall: {:.2%}".format(nlp0.recall(gold2, tokens_re)))

# In order to debug the regex, you might want to comment in the next line.
nlp0.diff(gold2, tokens_re)

Errors: 10
Precision: 99.82%
Recall: 99.92%


[(['Алексе́евна'], ['Алексе', '́', 'евна']),
 (['14000'], ['14', '000']),
 (['3000'], ['3', '000'])]


With this new text we get two types of errors. The first type is:

`(['Алексе́евна'], ['Алексе', '́', 'евна'])`

Our regex fails here because the stress mark on "е́" is a separate combining character (a combining acute accent), which `\w` does not match. The word is therefore split around the accent, and the accent itself becomes a token of its own via `\S`.

The second type is:

`(['14000'], ['14', '000'])`

This error occurs because the number is written as "14 000" in the text, so our regex splits it at the whitespace, whereas the gold standard treats it as a single token.
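
Both error types can be reproduced in isolation. The following is a minimal sketch under stated assumptions: the stress mark is taken to be U+0301 (combining acute accent), the digit groups are taken to be separated by a single whitespace character, and the names `WORD`, `NUMBER`, `TOKEN` and `tokenize_line()` are hypothetical. The two extensions are possible directions rather than tested improvements; the number rule in particular could merge unrelated adjacent numbers and would need to be evaluated on the data.

```python
import re
import unicodedata

# Assumption: the stress mark is U+0301 (combining acute accent) after a plain "е"
word = "Алексе\u0301евна"

# "Mn" is the Unicode category for non-spacing combining marks, which \w does not match
category = unicodedata.category("\u0301")
plain = re.findall(r"\w+|\S", word)
print(category, plain)

# Hypothetical extension 1: let combining diacritics (U+0300 to U+036F) follow a word character
WORD = r"(?:\w[\u0300-\u036f]*)+"
# Hypothetical extension 2: digit groups such as "14 000", separated by one whitespace character
NUMBER = r"\d{1,3}(?:\s\d{3})+(?!\d)"
TOKEN = re.compile(NUMBER + "|" + WORD + r"|\S")

def tokenize_line(line):
    tokens = []
    for match in TOKEN.finditer(line):
        # Only number tokens can contain whitespace; removing it turns "14 000" into "14000"
        tokens.append(re.sub(r"\s", "", match.group(0)))
    return tokens

print(tokenize_line(word))
print(tokenize_line("omkring 14 000 man"))
```

With these assumptions, the accented word comes out as a single token and "14 000" is emitted as `14000`, which is the form used in the gold standard.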
