# Chapter 3: Corpus Processing Tools
A description of regular expressions and efficient tools to process text

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Modules 
We use the `regex` module for its better Unicode support.

In [1]:
import regex as re

## The Corpora
Adjust the paths of your files

In [2]:
PATH = '../datasets/classics/'

In [3]:
odyssey_file = PATH + 'odyssey.mb.txt'
iliad_file = PATH + 'iliad.mb.txt'

In [4]:
with open(iliad_file) as f:  # Closes the file once it is read
    iliad = f.read()
iliad[:100]

'Provided by The Internet Classics Archive.\nSee bottom for copyright. Available online at\n    http://'

In [5]:
with open(odyssey_file) as f:  # Closes the file once it is read
    odyssey = f.read()
odyssey[:100]

'Provided by The Internet Classics Archive.\nSee bottom for copyright. Available online at\n    http://'

## Matching one occurrence

#### A first match with `re.search()`

In [6]:
line = 'The aerial acceleration alerted the ace pilot'

In [7]:
match = re.search('ac*e', line)
match      # <regex.Match object; span=(4, 6), match='ae'>

<regex.Match object; span=(4, 6), match='ae'>

#### Getting the match value

In [8]:
match.group()  # ae

'ae'

## Getting all the matches

#### The list of all the strings with `ac*e`, `ac?e`, `ac+e`, `ac{2}e`, `ac{2,}e`, and `a.e`

In [9]:
match_list = re.findall('ac*e', line)   # ['ae', 'acce', 'ace']
match_list

['ae', 'acce', 'ace']

In [10]:
re.findall('ac?e', line)  # ['ae', 'ace']

['ae', 'ace']

In [11]:
re.findall('ac+e', line)

['acce', 'ace']

In [12]:
re.findall('ac{2}e', line)

['acce']

In [13]:
re.findall('ac{2,}e', line)

['acce']

In [14]:
re.findall('a.e', line)

['ale', 'ace']

#### The match groups (the objects) with the `finditer()` iterator

In [15]:
match_iter = re.finditer('ac*e', line)
list(match_iter)

[<regex.Match object; span=(4, 6), match='ae'>,
 <regex.Match object; span=(11, 15), match='acce'>,
 <regex.Match object; span=(36, 39), match='ace'>]

## Nonprintable characters and modifiers
#### Start of a string

In [16]:
iliad_opening = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()

In [17]:
re.findall('^S', iliad_opening)  # m/^S/g

['S']

In [18]:
re.findall('^s', iliad_opening)  # m/^s/g

[]

#### The case-insensitive modifier

The pattern `^s` did not match `S`. We can make the regex case-insensitive with the `re.I` flag:

In [19]:
re.findall('^s', iliad_opening, re.I)  # m/^s/ig

['S']

#### Case insensitive and multiline

The start anchor `^` corresponds to the unique start of a string. With the multiline modifier, `re.M`, the position after each `\n` also counts as a start position.

In [20]:
re.findall('^s', iliad_opening, re.I | re.M)  # m/^s/img

['S', 's']
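The same modifiers can also be written inside the pattern as inline flags. The sketch below illustrates this on the two-line opening used above. It uses the standard-library `re` module so that it runs without extra dependencies, on the assumption that the `regex` module accepts the same syntax; the names `with_flags` and `inline` are only illustrative.

```python
import re

iliad_opening = ('Sing, O goddess, the anger of Achilles \n'
                 'son of Peleus, that brought countless ills upon the Achaeans.')

# Flags passed as an argument, as in the cell above
with_flags = re.findall('^s', iliad_opening, re.I | re.M)

# The same flags written inline at the start of the pattern
inline = re.findall('(?im)^s', iliad_opening)

print(with_flags)
print(inline)
```

Both calls should return the same list, since `(?im)` simply switches on the case-insensitive and multiline modes for the whole pattern.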

#### Getting all the matches

In [21]:
iliad_opening = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()
iliad_opening

'Sing, O goddess, the anger of Achilles \nson of Peleus, that brought countless ills upon the Achaeans.'

In [22]:
match_list = re.findall('^s', iliad_opening, re.I | re.M)
print(match_list)

['S', 's']


#### Getting all the matches with `finditer()`

In [23]:
match_list = re.finditer('^s', iliad_opening, re.I | re.M)
match_list

<_regex.Scanner at 0x105804620>

In [24]:
list(match_list)

[<regex.Match object; span=(0, 1), match='S'>,
 <regex.Match object; span=(40, 41), match='s'>]

## Substitution

#### Global replacement: `s/pattern/replacement/g`

In [25]:
print(re.sub('es+', 'EZ', iliad_opening))  # s/es+/EZ/g

Sing, O goddEZ, the anger of AchillEZ 
son of Peleus, that brought countlEZ ills upon the Achaeans.


#### Just one replacement: `s/pattern/replacement/`

In [26]:
print(re.sub('es+', 'EZ', iliad_opening, count=1))  # s/es+/EZ/

Sing, O goddEZ, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
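To check how many replacements a substitution performed, one option is `subn()`, which returns the new string together with the number of substitutions. This is a minimal sketch using the standard-library `re` module and the same opening lines; the variable names are illustrative.

```python
import re

iliad_opening = ('Sing, O goddess, the anger of Achilles \n'
                 'son of Peleus, that brought countless ills upon the Achaeans.')

# Global replacement: every occurrence of es+ is replaced
replaced_all, n_all = re.subn('es+', 'EZ', iliad_opening)

# A single replacement: count limits the number of substitutions
replaced_one, n_one = re.subn('es+', 'EZ', iliad_opening, count=1)

print(n_all, n_one)
print(replaced_one)
```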


## Backreferences

In [27]:
line

'The aerial acceleration alerted the ace pilot'

In [28]:
re.search('ac+e', line)

<regex.Match object; span=(11, 15), match='acce'>

In [29]:
match = re.search('a(c+)e', line)

The backreference

In [30]:
match.group(1)

'cc'

The whole pattern

In [31]:
match.group()

'acce'

Equivalent to

In [32]:
match.group(0)

'acce'

More backreferences

In [33]:
match = re.search('(.)(c+)e', line)

Backreference 1: `(.)` is stored in `\1`

In [34]:
match.group(1)

'a'

Backreference 2: `(c+)` is stored in `\2`

In [35]:
match.group(2)

'cc'

Matching a sequence of three identical characters

In [36]:
match = re.search(r'(.)\1\1', 'accceleration')
match.group(1)                # 'c'

'c'
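The same idea applies to doubled characters in the example sentence. The sketch below, written with the standard-library `re` module, lists every doubled character with its position. It also shows a named-group variant, where the group name `char` is an arbitrary choice for illustration.

```python
import re

line = 'The aerial acceleration alerted the ace pilot'

# (.) captures one character and \1 requires the same character again
doubles = [(m.group(), m.start()) for m in re.finditer(r'(.)\1', line)]

# The same pattern with a named group instead of a numbered one
named = [m.group('char') for m in re.finditer(r'(?P<char>.)(?P=char)', line)]

print(doubles)
print(named)
```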

#### Raw Strings and the `r` Prefix

In [37]:
'\1'

'\x01'

In [38]:
'\141'

'a'

In [39]:
'\n'

'\n'

In [40]:
r'\1'

'\\1'

In [41]:
r'\141'

'\\141'

In [42]:
r'\n'

'\\n'

#### Patterns and the Regex Engine
The regex engine compiles the pattern into a finite-state automaton. It is recommended to use an `r` prefix to avoid ambiguities when the engine reads the Perl-style regex. For instance, `r'a+\n'` contains two metacharacters: `+` and `\n`.

In [43]:
re.search(r'a+\n', 'aaa\nbc')

<regex.Match object; span=(0, 4), match='aaa\n'>

Nonetheless, the pattern often works without the `r` prefix, as here, where Python converts `\n` to a newline character that the engine then matches literally:

In [44]:
re.search('a+\n', 'aaa\nbc')

<regex.Match object; span=(0, 4), match='aaa\n'>

But not here: without the `r` prefix, Python reads `\1` as the character `\x01` and not as a backreference

In [45]:
re.search(r'(.)\1', 'aaa\1bc')

<regex.Match object; span=(0, 2), match='aa'>

In [46]:
re.search('(.)\1', 'aaa\1bc')

<regex.Match object; span=(2, 4), match='a\x01'>

Note that in the searched string, the second argument, an `r` prefix would force the interpretation of `\n` as the two-character sequence `\` and `n`

In [47]:
r'aaa\nbc'

'aaa\\nbc'

In [48]:
list(r'aaa\nbc')

['a', 'a', 'a', '\\', 'n', 'b', 'c']

And the search would fail

In [49]:
re.search(r'a+\n', r'aaa\nbc')
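The difference between the two notations can be checked by inspecting the strings before the regex engine sees them. The following sketch uses the standard-library `re` module; the names `plain` and `raw` are illustrative.

```python
import re

plain = '(.)\1'    # Python turns \1 into the single character \x01
raw = r'(.)\1'     # The backslash and the digit are kept as two characters

print(len(plain), len(raw))
print(raw == '(.)\\1')      # A doubled backslash is the non-raw equivalent

# The plain pattern looks for any character followed by \x01
plain_match = re.search(plain, 'aaa\1bc')
# The raw pattern looks for a repeated character
raw_match = re.search(raw, 'aaa\1bc')
# A raw searched string contains no newline, so this search returns None
failed = re.search(r'a+\n', r'aaa\nbc')

print(plain_match.span(), raw_match.span(), failed)
```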

## Python escape sequences

In [50]:
re.search(r'\p{N}+\t\p{N}+', 'Frequencies: 100	200	300')

<regex.Match object; span=(13, 20), match='100\t200'>

In [51]:
re.search('\N{COMMERCIAL AT}', 'pierre@dot.com')

<regex.Match object; span=(6, 7), match='@'>

### Substitutions

#### Multiple backreferences `m/\$ *([0-9]+)\.?([0-9]*)/`

In [52]:
price = "We'll buy it for $72.40"

In [53]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match.group()  # '$72.40' The entire match

'$72.40'

In [54]:
match.group(1)  # '72' The first group

'72'

In [55]:
match.group(2)  # '40' The second group

'40'

#### Substitutions `s/\$ *([0-9]+)\.?([0-9]*)/\1 dollars and \2 cents/g`

In [56]:
re.sub(r'\$ *([0-9]+)\.?([0-9]*)',
       r'\1 dollars and \2 cents', price)
# "We'll buy it for 72 dollars and 40 cents"

"We'll buy it for 72 dollars and 40 cents"

## Matching objects

In [57]:
price = "We'll buy it for $72.40"

In [58]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match

<regex.Match object; span=(17, 23), match='$72.40'>

#### Input

In [59]:
match.string            # "We'll buy it for $72.40"

"We'll buy it for $72.40"

#### Groups

In [60]:
match.groups()          # ('72', '40')

('72', '40')

In [61]:
match.group(0)          # '$72.40'

'$72.40'

In [62]:
match.group(1)

'72'

In [63]:
match.group(2)

'40'
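Groups can also be given names, which may make the code easier to read than numeric indices. This is a minimal sketch with the standard-library `re` module; the group names `dollars` and `cents` are hypothetical and do not come from the original pattern.

```python
import re

price = "We'll buy it for $72.40"

# The same pattern as above, with named groups
match = re.search(r'\$ *(?P<dollars>[0-9]+)\.?(?P<cents>[0-9]*)', price)

print(match.group('dollars'))   # Access by name
print(match.group(1))           # Numeric access still works
print(match.groupdict())        # All the named groups as a dictionary

# Named groups are referenced with \g<name> in a replacement string
sentence = re.sub(r'\$ *(?P<dollars>[0-9]+)\.?(?P<cents>[0-9]*)',
                  r'\g<dollars> dollars and \g<cents> cents', price)
print(sentence)
```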

#### Match objects: The indices

In [64]:
match.start(0)

17

In [65]:
match.end(0)

23

In [66]:
match.start(1)

18

In [67]:
match.end(1)

20
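The start and end indices can also be obtained together with `span()`. The sketch below, using the standard-library `re` module, retrieves the spans of the whole match and of each group, and uses them to slice the input string.

```python
import re

price = "We'll buy it for $72.40"
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)

# span(n) returns the pair (start(n), end(n))
whole = match.span()
first = match.span(1)
second = match.span(2)
print(whole, first, second)

# Slicing the input with a span gives back the group
start, end = second
print(price[start:end])
```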

#### Example

In [68]:
odyssey_opening = """Tell me, O muse, of that ingenious hero
  who travelled far and wide after he had sacked
  the famous town of Troy.""".strip()
odyssey_opening

'Tell me, O muse, of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'

In [69]:
match = re.search(',.*,', odyssey_opening, re.S)
match

<regex.Match object; span=(7, 16), match=', O muse,'>

In [70]:
odyssey_opening[0:match.start()]             # 'Tell me'

'Tell me'

In [71]:
odyssey_opening[match.start():match.end()]   # ', O muse,'

', O muse,'

In [72]:
odyssey_opening[match.end():]   # ' of that ingenious hero
#  who travelled far and wide after he had sacked
#  the famous town of Troy.'

' of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'

### Parameterable Regular Expressions

In [73]:
string = 'my string'
width = 20

In [74]:
def make_regex(string, width):
    return ('.{{0,{width}}}{string}.{{0,{width}}}'
            .format(string=string, width=width))

In [75]:
make_regex(string, width)

'.{0,20}my string.{0,20}'

But a metacharacter in the string, here the final `.`, is not escaped:

In [76]:
make_regex(string + '.', width)

'.{0,20}my string..{0,20}'

In [77]:
def make_regex(string, width):
    string = re.escape(string)
    return ('.{{0,{width}}}{string}.{{0,{width}}}'
            .format(string=string, width=width))

In [78]:
string = 'my string.'
width = 20

In [79]:
make_regex(string, width)

'.{0,20}my\\ string\\..{0,20}'

In [80]:
pattern = make_regex('Penelope', 15)
re.search(pattern, odyssey, re.S).group()

' of his\nmother Penelope, who persist i'
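The same function can be written with an f-string, where doubled braces stand for literal braces. This is a sketch using the standard-library `re` module; the test strings are made up for illustration.

```python
import re


def make_regex(string, width):
    """Builds a pattern matching `string` with up to `width`
    characters of context on each side."""
    return f'.{{0,{width}}}{re.escape(string)}.{{0,{width}}}'


pattern = make_regex('a.b', 2)

# The escaped dot only matches a literal dot
hit = re.search(pattern, 'xxa.byy')
miss = re.search(pattern, 'xxaXbyy')
print(hit.group(), miss)
```

Escaping inside the function means that callers can pass any literal string without worrying about regex metacharacters.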

## Concordances: `.{0,15}the Achaeans.{0,15}`

In [81]:

# pattern = 'Penelope'
pattern = 'the Achaeans'
width = 25
text = odyssey

In [82]:
# Each space in the pattern will match any run of whitespace (spaces, tabs, newlines)
pattern = re.sub(' ', r'\\s+', pattern)
pattern

'the\\s+Achaeans'

In [83]:
# Replaces each run of whitespace, including newlines, with a single space
text = re.sub(r'\s+', ' ', text)

In [84]:
for match in re.finditer(pattern, text):
    # max() prevents a negative start index when a match is near the beginning
    print(text[max(0, match.start() - width):match.end() + width])

ill embolden him to call the Achaeans in assembly, and speak o
ting were done; for then the Achaeans would have built a mound
 he got home last of all the Achaeans; if you hear that your f
ls Minerva had laid upon the Achaeans. Penelope, daughter of I
lysses did some wrong to the Achaeans which you would now aven
nswer, that both you and the Achaeans may understand-'Send you
g. I would obey you, but the Achaeans, and more particularly t
ld not tell you all that the Achaeans suffered, and you would 
e, for it was sunset and the Achaeans were heavy with wine. Wh
ying hard words, whereon the Achaeans sprang to their feet wit
our to the Achaean name, the Achaeans applaud Orestes and his 
ke me, for no one of all the Achaeans worked so hard or risked
ips, he told me all that the Achaeans meant to do. He killed m
 him so heavily that all the Achaeans cheered him- if he is st
ell me true, whether all the Achaeans whom Nestor and I left b
o of the chief men among the Achaeans perished during t
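The steps above can be gathered into one function. The following is a sketch under a few assumptions: it uses the standard-library `re` module, the function name `concordance` and its parameters are illustrative, and the phrase is treated as a literal string, not as a regex.

```python
import re


def concordance(phrase, text, width=25):
    """Returns the occurrences of `phrase` in `text`, each with up to
    `width` characters of left and right context."""
    # Words of the phrase may be separated by any run of whitespace
    pattern = r'\s+'.join(re.escape(word) for word in phrase.split())
    # Normalizes the whitespace so that each context fits on one line
    text = re.sub(r'\s+', ' ', text)
    lines = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - width)  # No negative index
        lines.append(text[start:match.end() + width])
    return lines


sample = 'ills upon\nthe   Achaeans. Many'
for line in concordance('the Achaeans', sample, 5):
    print(line)
```

Returning a list instead of printing makes it possible to count or filter the concordance lines afterward.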

### Lookahead

In [85]:
la_text = 'Meanwhile great Ajax kept on trying to drive a spear into Hector, but Hector was so skilful that he held his broad shoulders well under cover of his ox-hide shield, ever on the look-out for the whizzing of the arrows and the heavy thud of the spears.'
la_text

'Meanwhile great Ajax kept on trying to drive a spear into Hector, but Hector was so skilful that he held his broad shoulders well under cover of his ox-hide shield, ever on the look-out for the whizzing of the arrows and the heavy thud of the spears.'

In [86]:
string = 'Hector'
width = 20

In [87]:
pattern = make_regex(string, width)
pattern

'.{0,20}Hector.{0,20}'

In [88]:
for match in re.finditer(pattern, la_text):
    print(match.group())

 drive a spear into Hector, but Hector was so 


In [89]:
la_pattern = '.{0,20}Hector(?=(.{0,20}))'

In [90]:
re.search(la_pattern, la_text)

<regex.Match object; span=(38, 64), match=' drive a spear into Hector'>

In [91]:
for match in re.finditer(la_pattern, la_text):
    print(match.group(0), match.group(1))

 drive a spear into Hector , but Hector was so 
, but Hector  was so skilful that


In [92]:
la_pattern = '(?<=(.{0,20}))Hector(?=(.{0,20}))'

In [93]:
for match in re.finditer(la_pattern, la_text):
    print(match.group(1), match.group(0), match.group(2))

 drive a spear into  Hector , but Hector was so 
ar into Hector, but  Hector  was so skilful that
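The variable-width lookbehind `(?<=(.{0,20}))` relies on the `regex` module; the standard-library `re` module only accepts fixed-width lookbehinds. As an alternative technique, and not the one used above, overlapping contexts can be obtained by matching the word alone and slicing the text around each match. The function name `overlapping_contexts` and the short test string are illustrative.

```python
import re


def overlapping_contexts(word, text, width):
    """Returns (left, word, right) triples; contexts may overlap
    because only the word itself is consumed by the match."""
    results = []
    for match in re.finditer(re.escape(word), text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        results.append((left, match.group(), right))
    return results


contexts = overlapping_contexts('Hector', 'ab Hector cd Hector ef', 10)
for left, word, right in contexts:
    print(left, word, right)
```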


## Minimum Edit Distance

In [94]:
source, target = 'language', 'lineage'

In [95]:
length_s = len(source) + 1
length_t = len(target) + 1

# Initialize first row and column
table = [None] * length_s

for i in range(length_s):
    table[i] = [None] * length_t
    table[i][0] = i
for j in range(length_t):
    table[0][j] = j

In [96]:
table

[[0, 1, 2, 3, 4, 5, 6, 7],
 [1, None, None, None, None, None, None, None],
 [2, None, None, None, None, None, None, None],
 [3, None, None, None, None, None, None, None],
 [4, None, None, None, None, None, None, None],
 [5, None, None, None, None, None, None, None],
 [6, None, None, None, None, None, None, None],
 [7, None, None, None, None, None, None, None],
 [8, None, None, None, None, None, None, None]]

In [97]:
# Fills the table. Start index of rows and columns is 1
for i in range(1, length_s):
    for j in range(1, length_t):
        # Is it a copy or a substitution?
        cost = 0 if source[i - 1] == target[j - 1] else 2
        # Computes the minimum
        minimum = table[i - 1][j - 1] + cost
        if minimum > table[i][j - 1] + 1:
            minimum = table[i][j - 1] + 1
        if minimum > table[i - 1][j] + 1:
            minimum = table[i - 1][j] + 1
        table[i][j] = minimum

In [98]:
table

[[0, 1, 2, 3, 4, 5, 6, 7],
 [1, 0, 1, 2, 3, 4, 5, 6],
 [2, 1, 2, 3, 4, 3, 4, 5],
 [3, 2, 3, 2, 3, 4, 5, 6],
 [4, 3, 4, 3, 4, 5, 4, 5],
 [5, 4, 5, 4, 5, 6, 5, 6],
 [6, 5, 6, 5, 6, 5, 6, 7],
 [7, 6, 7, 6, 7, 6, 5, 6],
 [8, 7, 8, 7, 6, 7, 6, 5]]

In [99]:
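# Prints the table with the source on the horizontal axis
# and the target on the vertical axis, from the last row down to the first
for j in range(length_t):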
    for i in range(length_s):
        print(table[i][length_t - j - 1], " ", end='')
    print()

7  6  5  6  5  6  7  6  5  
6  5  4  5  4  5  6  5  6  
5  4  3  4  5  6  5  6  7  
4  3  4  3  4  5  6  7  6  
3  2  3  2  3  4  5  6  7  
2  1  2  3  4  5  6  7  8  
1  0  1  2  3  4  5  6  7  
0  1  2  3  4  5  6  7  8  


In [100]:
'Minimum distance: ', table[length_s - 1][length_t - 1]

('Minimum distance: ', 5)
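The cells above can be wrapped into a reusable function. This is a sketch that follows the same recurrence, with a cost of 1 for insertions and deletions and 2 for substitutions. The function name `min_edit_distance` and the `sub_cost` parameter are additions for illustration.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Computes the minimum edit distance between two strings with
    insertions and deletions of cost 1 and substitutions of `sub_cost`."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column: deletions only; first row: insertions only
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            # A copy costs nothing, a substitution costs sub_cost
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            table[i][j] = min(table[i - 1][j - 1] + cost,
                              table[i][j - 1] + 1,
                              table[i - 1][j] + 1)
    return table[-1][-1]


print(min_edit_distance('language', 'lineage'))
```

Setting `sub_cost=1` gives the variant where a substitution counts as a single operation.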