# Chapter 3: Corpus Processing Tools
A description of regular expressions and efficient tools to process text

Programs from the book: [Python for Natural Language Processing](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

## Modules 
We use the `regex` module for its better Unicode support

In [1]:
import regex as re


## The Corpora
Adjust the paths to match the location of your files

In [2]:
odyssey_file = '../../corpus/odyssey.mb.txt'
iliad_file = '../../corpus/iliad.mb.txt'
selma_file = '../../corpus/Selma.txt'


In [3]:
iliad = open(iliad_file).read()
iliad[:100]


'The Iliad\nBy Homer\n\n\nTranslated by Samuel Butler\n\n--------------------------------------------------'

In [4]:
odyssey = open(odyssey_file).read()
odyssey[:100]


'The Odyssey\nBy Homer\n\n\nTranslated by Samuel Butler\n\n------------------------------------------------'

In [5]:
selma = open(selma_file).read()
selma[:100]


'Nils Holgerssons underbara resa genom Sverige\nSelma Lagerlöf\n\nInnehåll\n\tDen kristna dagvisan - Sveri'

## Matching one occurrence

#### A first match with `re.search()`

In [6]:
line = 'The aerial acceleration alerted the ace pilot'


In [7]:
match = re.search('ac*e', line)
match      # <regex.Match object; span=(4, 6), match='ae'>


<regex.Match object; span=(4, 6), match='ae'>

#### Getting the match value

In [8]:
match.group()  # ae


'ae'
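
`re.search()` returns `None` when the pattern does not occur in the string, and calling `.group()` on `None` raises an `AttributeError`. The sketch below shows one way to guard against this. The helper name `first_match` is hypothetical, and the block uses the standard-library `re` module so that it runs without installing `regex`; these calls should behave the same with `import regex as re`.

```python
import re

line = 'The aerial acceleration alerted the ace pilot'


def first_match(pattern, text):
    """Returns the first matched string, or None (hypothetical helper)."""
    match = re.search(pattern, text)
    return match.group() if match else None


found = first_match('ac*e', line)      # the pattern occurs in the line
missing = first_match('ac{3}e', line)  # 'accce' does not occur
print(found, missing)
```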

## Getting all the matches

#### The list of all the strings with `ac*e`, `ac?e`, `ac+e`, `ac{2}e`, `ac{2,}e`, and `a.e`

In [9]:
match_list = re.findall('ac*e', line)   # ['ae', 'acce', 'ace']
match_list


['ae', 'acce', 'ace']

In [10]:
re.findall('ac?e', line)  # ['ae', 'ace']


['ae', 'ace']

In [11]:
re.findall('ac+e', line)


['acce', 'ace']

In [12]:
re.findall('ac{2}e', line)


['acce']

In [13]:
re.findall('ac{2,}e', line)


['acce']

In [14]:
re.findall('a.e', line)


['ale', 'ace']

#### The match groups (the objects) with the `finditer()` iterator

In [15]:
match_iter = re.finditer('ac*e', line)
list(match_iter)


[<regex.Match object; span=(4, 6), match='ae'>,
 <regex.Match object; span=(11, 15), match='acce'>,
 <regex.Match object; span=(36, 39), match='ace'>]
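
Unlike `findall()`, which returns strings only, `finditer()` yields match objects that also carry the positions. As a small sketch, again with the standard-library `re` module as a stand-in for `regex`, we can collect the start index, end index, and text of each match:

```python
import re

line = 'The aerial acceleration alerted the ace pilot'

# One (start, end, text) triple per match, in order of occurrence
occurrences = [(m.start(), m.end(), m.group())
               for m in re.finditer('ac*e', line)]
print(occurrences)
```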

## Nonprintable characters and modifiers
#### Start of a string

In [16]:
iliad_opening = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()


In [17]:
re.findall('^S', iliad_opening)  # m/^S/g


['S']

In [18]:
re.findall('^s', iliad_opening)  # m/^s/g


[]

#### The case-insensitive modifier

The pattern `^s` did not match the initial `S`. We can make the regex case-insensitive with the `re.I` flag

In [19]:
re.findall('^s', iliad_opening, re.I)  # m/^s/ig


['S']

#### Case insensitive and multiline

The start anchor `^` corresponds to the unique start of a string. With the multiline modifier, `re.M`, the position after each `\n` also counts as a start position

In [20]:
re.findall('^s', iliad_opening, re.I | re.M)  # m/^s/img


['S', 's']
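
The effect of the two flags can be compared side by side. The sketch below rebuilds the same two-line string and also shows the inline form `(?im)`, which sets the flags inside the pattern itself. It uses the standard-library `re` module so that it runs as is; the variable names are only illustrative.

```python
import re

iliad_opening = ('Sing, O goddess, the anger of Achilles \n'
                 'son of Peleus, that brought countless ills upon the Achaeans.')

no_flags = re.findall('^s', iliad_opening)            # case-sensitive, one start
ignore_case = re.findall('^s', iliad_opening, re.I)   # matches the initial S
both = re.findall('^s', iliad_opening, re.I | re.M)   # every line start
inline = re.findall('(?im)^s', iliad_opening)         # same flags, inline
print(no_flags, ignore_case, both, inline)
```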

#### Getting all the matches

In [21]:
iliad_opening = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()
iliad_opening


'Sing, O goddess, the anger of Achilles \nson of Peleus, that brought countless ills upon the Achaeans.'

In [22]:
match_list = re.findall('^s', iliad_opening, re.I | re.M)
print(match_list)


['S', 's']


#### Getting all the matches with `finditer()`

In [23]:
match_list = re.finditer('^s', iliad_opening, re.I | re.M)
match_list


<_regex.Scanner at 0x105f1d810>

In [24]:
list(match_list)


[<regex.Match object; span=(0, 1), match='S'>,
 <regex.Match object; span=(40, 41), match='s'>]

## Substitution

#### Global replacement: `s/pattern/replacement/g`

In [25]:
print(re.sub('es+', 'EZ', iliad_opening))  # s/es+/EZ/g


Sing, O goddEZ, the anger of AchillEZ 
son of Peleus, that brought countlEZ ills upon the Achaeans.


 
#### Just one replacement: `s/pattern/replacement/`

In [26]:
print(re.sub('es+', 'EZ', iliad_opening, count=1))  # s/es+/EZ/


Sing, O goddEZ, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
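
When we also need to know how many replacements were made, `subn()` returns the new string together with the count. A minimal sketch with the standard-library `re` module, where the variable names are assumptions for the illustration:

```python
import re

iliad_opening = ('Sing, O goddess, the anger of Achilles \n'
                 'son of Peleus, that brought countless ills upon the Achaeans.')

# subn() returns a pair: (new string, number of substitutions)
all_replaced, n_all = re.subn('es+', 'EZ', iliad_opening)
first_only, n_first = re.subn('es+', 'EZ', iliad_opening, count=1)
print(n_all, n_first)
```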


## Backreferences

In [27]:
line


'The aerial acceleration alerted the ace pilot'

In [28]:
re.search('ac+e', line)


<regex.Match object; span=(11, 15), match='acce'>

In [29]:
match = re.search('a(c+)e', line)


The backreference

In [30]:
match.group(1)


'cc'

The whole pattern

In [31]:
match.group()


'acce'

Equivalent to

In [32]:
match.group(0)


'acce'

More backreferences

In [33]:
match = re.search('(.)(c+)e', line)


Backreference 1, `(.)` stored in `\1`

In [34]:
match.group(1)


'a'

Backreference 2, `(c+)` stored in `\2`

In [35]:
match.group(2)


'cc'

Matching a sequence of three identical characters

In [36]:
match = re.search(r'(.)\1\1', 'accceleration')
match.group(1)                # 'c'


'c'
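
The same idea with a single `\1` finds doubled letters. The sketch below restricts the group to lowercase letters, which is a choice made for this illustration, and uses the standard-library `re` module:

```python
import re

line = 'The aerial acceleration alerted the ace pilot'

# ([a-z]) captures one letter; \1 requires the same letter again
doubled = [m.group() for m in re.finditer(r'([a-z])\1', line)]
doubled_bookkeeper = [m.group()
                      for m in re.finditer(r'([a-z])\1', 'bookkeeper')]
print(doubled, doubled_bookkeeper)
```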

#### Raw strings and the `r` prefix

In [37]:
'\1'


'\x01'

In [38]:
'\141'


'a'

In [39]:
r'\1'


'\\1'

In [40]:
r'\141'


'\\141'

In [41]:
iliad_opening


'Sing, O goddess, the anger of Achilles \nson of Peleus, that brought countless ills upon the Achaeans.'

In [42]:
r'\n'


'\\n'

In [43]:
re.sub(' +\n', ' ', iliad_opening)


'Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans.'
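
The cells above suggest why regex patterns with backreferences should be written as raw strings: without the `r` prefix, Python turns `\1` into the character with code 1 before the regex engine sees it. A short sketch of the consequence, using the standard-library `re` module:

```python
import re

plain = '\1'   # Python string escape: one character, code point 1
raw = r'\1'    # two characters: a backslash and the digit 1

lengths = (len(plain), len(raw))

# With the raw string, \1 reaches the regex engine as a backreference
with_raw = re.search(r'(.)\1', 'aa')
# Without it, the pattern means: any character followed by '\x01'
without_raw = re.search('(.)\1', 'aa')
print(lengths, with_raw, without_raw)
```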

## Python escape sequences

In [44]:
re.search(r'\p{N}+\t\p{N}+', 'Frequencies: 100	200	300')


<regex.Match object; span=(13, 20), match='100\t200'>

In [45]:
re.search('\N{COMMERCIAL AT}', 'pierre@dot.com')


<regex.Match object; span=(6, 7), match='@'>

### Substitutions

#### Multiple backreferences `m/\$ *([0-9]+)\.?([0-9]*)/`

In [46]:
price = "We'll buy it for $72.40"


In [47]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match.group()  # '$72.40' The entire match


'$72.40'

In [48]:
match.group(1)  # '72' The first group


'72'

In [49]:
match.group(2)  # '40' The second group


'40'

#### Substitutions `s/\$ *([0-9]+)\.?([0-9]*)/\1 dollars and \2 cents/g`

In [50]:
re.sub(r'\$ *([0-9]+)\.?([0-9]*)',
       r'\1 dollars and \2 cents', price)
# "We'll buy it for 72 dollars and 40 cents"


"We'll buy it for 72 dollars and 40 cents"

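With a replacement string, a price without decimals such as `$72` would leave an empty `\2` and produce `72 dollars and  cents`. One possible way around this, sketched below, is to pass a function as the replacement: `sub()` calls it with each match object. The names `PRICE` and `spell_out` are hypothetical, and the block uses the standard-library `re` module.

```python
import re

PRICE = re.compile(r'\$ *([0-9]+)\.?([0-9]*)')


def spell_out(match):
    """Builds the replacement text from the two groups (hypothetical helper)."""
    dollars, cents = match.group(1), match.group(2)
    if cents:
        return f'{dollars} dollars and {cents} cents'
    return f'{dollars} dollars'


with_cents = PRICE.sub(spell_out, "We'll buy it for $72.40")
without_cents = PRICE.sub(spell_out, "We'll buy it for $72")
print(with_cents)
print(without_cents)
```
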
## Matching objects

In [51]:
price = "We'll buy it for $72.40"


In [52]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match


<regex.Match object; span=(17, 23), match='$72.40'>

#### Input

In [53]:
match.string            # "We'll buy it for $72.40"


"We'll buy it for $72.40"

#### Groups

In [54]:
match.groups()          # ('72', '40')


('72', '40')

In [55]:
match.group(0)          # '$72.40'


'$72.40'

In [56]:
match.group(1)


'72'

In [57]:
match.group(2)


'40'

#### Match objects: The indices

In [58]:
match.start(0)


17

In [59]:
match.end(0)


23

In [60]:
match.start(1)


18

In [61]:
match.end(1)


20
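
`match.span(i)` returns the pair `(match.start(i), match.end(i))` in one call. The sketch below lists the spans of the whole match and of both groups, then slices the input string with them to recover the same text as `match.group(i)`. It uses the standard-library `re` module.

```python
import re

price = "We'll buy it for $72.40"
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)

# Index 0 is the whole match; 1 and 2 are the groups
spans = [match.span(i) for i in range(3)]
pieces = [price[start:end] for start, end in spans]
print(spans, pieces)
```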

#### Example

In [62]:
odyssey_opening = """Tell me, O muse, of that ingenious hero
  who travelled far and wide after he had sacked
  the famous town of Troy.""".strip()
odyssey_opening


'Tell me, O muse, of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'

In [63]:
match = re.search(',.*,', odyssey_opening, re.S)
match


<regex.Match object; span=(7, 16), match=', O muse,'>

In [64]:
odyssey_opening[0:match.start()]             # 'Tell me'


'Tell me'

In [65]:
odyssey_opening[match.start():match.end()]   # ', O muse,'


', O muse,'

In [66]:
odyssey_opening[match.end():]   # ' of that ingenious hero
#  who travelled far and wide after he had sacked
#  the famous town of Troy.'


' of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'

### Parameterizable Regular Expressions

In [67]:
string = 'my string'
width = 20


In [68]:
def make_regex(string, width):
    return ('.{{0,{width}}}{string}.{{0,{width}}}'
            .format(string=string, width=width))


In [69]:
make_regex(string, width)


'.{0,20}my string.{0,20}'

In [70]:
def make_regex(string, width):
    string = re.escape(string)
    return ('.{{0,{width}}}{string}.{{0,{width}}}'
            .format(string=string, width=width))


In [71]:
string = 'my string.'
width = 20


In [72]:
make_regex(string, width)


'.{0,20}my\\ string\\..{0,20}'

In [73]:
pattern = make_regex('Penelope', 15)
re.search(pattern, odyssey, re.S).group()


' of his\nmother Penelope, who persist i'
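
The second version of `make_regex()` escapes the search string before inserting it in the pattern. The sketch below illustrates why: without `re.escape()`, a dot in the search string matches any character. The search string `3.14` and the test strings are made up for the illustration, and the block uses the standard-library `re` module.

```python
import re

literal = '3.14'

# Unescaped: the dot matches any character, here the digit 5
unescaped_hit = re.search(literal, 'id 3514')
# Escaped: only a literal dot is accepted
escaped_miss = re.search(re.escape(literal), 'id 3514')
escaped_hit = re.search(re.escape(literal), 'pi is 3.14')
print(unescaped_hit, escaped_miss, escaped_hit)
```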

## Concordances: `.{0,15}Nils Holgersson.{0,15}`

In [74]:
pattern = 'Nils Holgersson.'
#pattern = 'Penelope'
#pattern = 'the Achaeans'
width = 25
#text = odyssey
text = selma


In [75]:
# Lets each space in the pattern match any run of whitespace, including tabs and newlines
pattern = re.sub(' ', r'\\s+', pattern)
pattern


'Nils\\s+Holgersson.'

In [76]:
# Replaces every run of whitespace, newlines included, with a single space
text = re.sub(r'\s+', ' ', text)


In [77]:
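for match in re.finditer(pattern, text):
    # Clamps the left index: a negative start would wrap around to the end of the text
    start = max(0, match.start() - width)
    print(text[start:match.end() + width])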



! Se på Tummetott! Se på Nils Holgersson Tummetott!" Genast vände 
r," sade han. "Jag heter Nils Holgersson och är son till en husman
lden. "Inte är det värt, Nils Holgersson, att du är ängslig eller 
 i dem. På den tiden, då Nils Holgersson drog omkring med vildgäss
ulle allt visa honom vad Nils Holgersson från Västra Vemmenhög var
om ägde rum det året, då Nils Holgersson for omkring med vildgässe
m vad det kan kosta dem. Nils Holgersson hade inte haft förstånd p
de det inte mer sägas om Nils Holgersson, att han inte tyckte om n
 Rosenbom?" För där stod Nils Holgersson mitt uppe på Rosenboms na
 Med ens fingo de syn på Nils Holgersson, och då sköt den store vi
vila. När vildgässen och Nils Holgersson äntligen hade letat sig f
 slags arbetare. Men vad Nils Holgersson inte så, det var, att sta
nde han fråga, och om då Nils Holgersson sade nej, började han gen
de lille Mats, och om nu Nils Holgersson också hade tegat, så hade
åg så försmädlig ut, att Nils Holgersson kastade sig över hon
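
The steps above can be gathered in one function. This is a sketch only: the name `concordance` and the short sample text are assumptions, and the block uses the standard-library `re` module so that it runs without the corpus files.

```python
import re


def concordance(pattern, text, width):
    """Returns the contexts of pattern in text (hypothetical helper)."""
    # A space in the pattern matches any run of whitespace
    pattern = re.sub(' ', r'\\s+', pattern)
    # Normalizes every run of whitespace in the text to one space
    text = re.sub(r'\s+', ' ', text)
    contexts = []
    for match in re.finditer(pattern, text):
        # Clamps the left index so that it never becomes negative
        start = max(0, match.start() - width)
        contexts.append(text[start:match.end() + width])
    return contexts


sample = 'Sing, O goddess, the anger\nof Achilles son of Peleus'
middle = concordance('anger of', sample, 5)
at_start = concordance('Sing', sample, 5)
print(middle, at_start)
```

The second call shows the clamping at work: the match starts at index 0, so the left context is simply empty.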

### Lookahead and Lookbehind

In [78]:
la_text = 'Meanwhile great Ajax kept on trying to drive a spear into Hector, but Hector was so skilful that he held his broad shoulders well under cover of his ox-hide shield, ever on the look-out for the whizzing of the arrows and the heavy thud of the spears.'
la_text


'Meanwhile great Ajax kept on trying to drive a spear into Hector, but Hector was so skilful that he held his broad shoulders well under cover of his ox-hide shield, ever on the look-out for the whizzing of the arrows and the heavy thud of the spears.'

In [79]:
string = 'Hector'
width = 20


In [80]:
pattern = make_regex(string, width)
pattern


'.{0,20}Hector.{0,20}'

In [81]:
for match in re.finditer(pattern, la_text):
    print(match.group())


 drive a spear into Hector, but Hector was so 


In [82]:
la_pattern = '.{0,20}Hector(?=(.{0,20}))'


In [83]:
re.search(la_pattern, la_text)

<regex.Match object; span=(38, 64), match=' drive a spear into Hector'>

In [84]:
for match in re.finditer(la_pattern, la_text):
    print(match.group(0), match.group(1))


 drive a spear into Hector , but Hector was so 
, but Hector  was so skilful that


In [85]:
la_pattern = '(?<=(.{0,20}))Hector(?=(.{0,20}))'


In [86]:
for match in re.finditer(la_pattern, la_text):
    print(match.group(1), match.group(0), match.group(2))


 drive a spear into  Hector , but Hector was so 
ar into Hector, but  Hector  was so skilful that
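
The first pattern finds only one `Hector` because its right context consumes the second occurrence, and matches cannot overlap. A lookahead checks the right context without consuming it. The sketch below reproduces this on a shortened sentence with the standard-library `re` module. It covers the lookahead only, as the variable-width lookbehind of the last pattern requires `regex`; the standard `re` module accepts only fixed-width lookbehinds.

```python
import re

text = 'a spear into Hector, but Hector was so skilful'

# The right context is consumed: the second Hector is swallowed
consuming = re.findall('.{0,20}Hector.{0,20}', text)

# The right context is only looked at, and captured in group 1
right_contexts = re.findall('Hector(?=(.{0,20}))', text)
print(consuming)
print(right_contexts)
```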


## Minimum edit distance

In [87]:
source, target = 'language', 'lineage'


In [88]:
length_s = len(source) + 1
length_t = len(target) + 1

# Initialize first row and column
table = [None] * length_s

for i in range(length_s):
    table[i] = [None] * length_t
    table[i][0] = i
for j in range(length_t):
    table[0][j] = j


In [89]:
table


[[0, 1, 2, 3, 4, 5, 6, 7],
 [1, None, None, None, None, None, None, None],
 [2, None, None, None, None, None, None, None],
 [3, None, None, None, None, None, None, None],
 [4, None, None, None, None, None, None, None],
 [5, None, None, None, None, None, None, None],
 [6, None, None, None, None, None, None, None],
 [7, None, None, None, None, None, None, None],
 [8, None, None, None, None, None, None, None]]

In [90]:
# Fills the table. Start index of rows and columns is 1
for i in range(1, length_s):
    for j in range(1, length_t):
        # Is it a copy (cost 0) or a substitution (cost 2)?
        cost = 0 if source[i - 1] == target[j - 1] else 2
        # Computes the minimum over copy/substitution, insertion, and deletion
        table[i][j] = min(table[i - 1][j - 1] + cost,
                          table[i][j - 1] + 1,
                          table[i - 1][j] + 1)


In [91]:
table


[[0, 1, 2, 3, 4, 5, 6, 7],
 [1, 0, 1, 2, 3, 4, 5, 6],
 [2, 1, 2, 3, 4, 3, 4, 5],
 [3, 2, 3, 2, 3, 4, 5, 6],
 [4, 3, 4, 3, 4, 5, 4, 5],
 [5, 4, 5, 4, 5, 6, 5, 6],
 [6, 5, 6, 5, 6, 5, 6, 7],
 [7, 6, 7, 6, 7, 6, 5, 6],
 [8, 7, 8, 7, 6, 7, 6, 5]]

In [92]:
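# Prints the table with the source along the horizontal axis and
# the target along the vertical axis, origin at the bottom left
for j in range(length_t):
    for i in range(length_s):
        print(table[i][length_t - j - 1], " ", end='')
    print()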


7  6  5  6  5  6  7  6  5  
6  5  4  5  4  5  6  5  6  
5  4  3  4  5  6  5  6  7  
4  3  4  3  4  5  6  7  6  
3  2  3  2  3  4  5  6  7  
2  1  2  3  4  5  6  7  8  
1  0  1  2  3  4  5  6  7  
0  1  2  3  4  5  6  7  8  


In [93]:
'Minimum distance: ', table[length_s - 1][length_t - 1]


('Minimum distance: ', 5)
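
The cells above can be wrapped in a function. The sketch below follows the same recurrence; the name `min_edit_distance` and the `sub_cost` parameter are assumptions added for the illustration. The default substitution cost of 2 matches the notebook, and a cost of 1 gives the Levenshtein distance.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Returns the minimum edit distance (hypothetical helper)."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column: deletions only; first row: insertions only
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            # A copy costs 0, a substitution costs sub_cost
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            table[i][j] = min(table[i - 1][j - 1] + cost,
                              table[i][j - 1] + 1,
                              table[i - 1][j] + 1)
    return table[-1][-1]


distance = min_edit_distance('language', 'lineage')
levenshtein = min_edit_distance('kitten', 'sitting', sub_cost=1)
print(distance, levenshtein)
```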