# Chapter 2: Corpus Processing Tools
Author: Pierre Nugues

We use the `regex` module for its better Unicode support.

In [1]:
import regex as re

## Matching one occurrence

#### A first match with `re.search()`

In [2]:
line = 'The aerial acceleration alerted the ace pilot'
match = re.search('ab*c', line)
match      # <regex.Match object; span=(11, 13), match='ac'>

<regex.Match object; span=(11, 13), match='ac'>

#### Getting the match value

In [3]:
match.group() # ac

'ac'

## Getting all the matches

#### The list of all the strings

In [4]:
match_list = re.findall('ab*c', line)   # ['ac', 'ac']
match_list

['ac', 'ac']

#### The match groups (the objects)

In [5]:
match_iter = re.finditer('ab*c', line)   
list(match_iter)

[<regex.Match object; span=(11, 13), match='ac'>,
 <regex.Match object; span=(36, 38), match='ac'>]
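As a small complement to the two cells above, the sketch below gathers the position and the text of every match into plain tuples. It uses the standard `re` module so that it runs without extra packages; for this pattern, `regex` should give the same spans.

```python
import re  # standard library; enough for this simple pattern

line = 'The aerial acceleration alerted the ace pilot'

# Each match object carries both its text and its position in the line
matches = [(m.start(), m.end(), m.group())
           for m in re.finditer('ab*c', line)]
print(matches)
```

Slicing `line[start:end]` with each pair of indices gives back the matched string.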

## Interactive match

#### Using the shell (does not work with the notebooks)

In [6]:
import sys

for line in sys.stdin:
    if re.search('ab*c', line):    # m/ab*c/
        print('-> ' + line, end='')

KeyboardInterrupt: 
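To try the same filter inside a notebook, one option is to wrap the loop in a function and feed it a file-like object. This is only a sketch: the name `matching_lines` and the sample lines are made up for the illustration, and `io.StringIO` stands in for `sys.stdin`.

```python
import io
import re

def matching_lines(stream, pattern):
    """Yield the lines of stream that contain a match of pattern."""
    compiled = re.compile(pattern)
    for line in stream:
        if compiled.search(line):
            yield line

# io.StringIO replaces sys.stdin so that the cell terminates on its own
fake_stdin = io.StringIO('abc\nxyz\nthe ace pilot\n')
selected = list(matching_lines(fake_stdin, 'ab*c'))
print(selected)
```

In a shell script, passing `sys.stdin` as the first argument would give the behavior of the loop above.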

#### Using IPython ipywidgets

In [7]:
# https://github.com/ipython/ipywidgets
import ipywidgets as widgets
from IPython.display import display

# The input box
text = widgets.Text()
display(text)

def handle_submit(sender):
    if re.search('ab*c', text.value):
        print('->', text.value)
    text.value = ''

# Hitting return fires handle_submit
text.on_submit(handle_submit)

Text(value='')

-> jkldsalabbcöl


## Nonprintable characters and modifiers
#### Start of a line

We create a list of strings with `split()`.

In [8]:
# text = sys.stdin.read()
text = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".split('\n')
text

['Sing, O goddess, the anger of Achilles ',
 'son of Peleus, that brought countless ills upon the Achaeans.',
 '']

`split('\n')` returns a trailing empty string because the text ends with a newline. We strip the string before we split it.

In [9]:
# text = sys.stdin.read()
text = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip().split('\n')
text

['Sing, O goddess, the anger of Achilles ',
 'son of Peleus, that brought countless ills upon the Achaeans.']
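An alternative to stripping first is the built-in `str.splitlines()`, which does not produce the trailing empty string. The comparison below is a minimal sketch on the same two verses.

```python
text = """Sing, O goddess, the anger of Achilles
son of Peleus, that brought countless ills upon the Achaeans.
"""

with_split = text.split('\n')         # the final newline yields a trailing ''
with_splitlines = text.splitlines()   # no trailing empty string
print(len(with_split), len(with_splitlines))
```

Unlike `strip()`, `splitlines()` leaves any leading or trailing spaces inside each line untouched.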

In [10]:
for line in text:
    match = re.search('^s', line) # m/^s/
    if match:
        print('-> ' + match.group())

-> s


#### The case-insensitive modifier

We did not match `S`. We can make the regex case-insensitive with the `re.I` flag.

In [11]:
for line in text:
    match = re.search('^s', line, re.I) # m/^s/i
    if match:
        print('-> ' + match.group())

-> S
-> s
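The flag can also be written inside the pattern itself with the inline modifier `(?i)`, placed at the very start. The sketch below, which uses the standard `re` module, checks that both spellings find the same matches on the two lines.

```python
import re

lines = ['Sing, O goddess, the anger of Achilles',
         'son of Peleus, that brought countless ills upon the Achaeans.']

# Flag passed as an argument
with_flag = [re.search('^s', line, re.I).group() for line in lines]
# Same flag written inside the pattern
with_inline = [re.search('(?i)^s', line).group() for line in lines]
print(with_flag, with_inline)
```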


#### Case insensitive and multiline

The start anchor `^` normally matches only at the start of the string. With the multiline modifier, `re.M`, the position after each `\n` also counts as a start position.

In [12]:
text = """Sing, O goddess, the anger of Achilles son
of Peleus, that brought countless ills upon the Achaeans.
""".strip()
text

'Sing, O goddess, the anger of Achilles son\nof Peleus, that brought countless ills upon the Achaeans.'

In [13]:
match = re.search('^s', text, re.I | re.M) # m/^s/im
if match:
    print('-> ' + match.group())

-> S


In [14]:
match = re.search('^o', text, re.I | re.M)
if match:
    print('-> ' + match.group())

-> o
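To make the effect of `re.M` visible, the sketch below runs the same search with and without the flag on a shortened version of the text. Without the flag, `^o` finds nothing because the string starts with `S`.

```python
import re

text = 'Sing, O goddess, the anger of Achilles son\nof Peleus'

# Without re.M, only position 0 is a start position
without_m = re.search('^o', text, re.I)
# With re.M, the position right after the newline is one too
with_m = re.search('^o', text, re.I | re.M)
print(without_m, with_m.group())
```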


#### Getting all the matches

In [15]:
text = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip()
text

'Sing, O goddess, the anger of Achilles \nson of Peleus, that brought countless ills upon the Achaeans.'

In [16]:
match_list = re.findall('^s', text, re.I | re.M)
print(match_list)

['S', 's']


#### Getting all the matches with `finditer()`

In [17]:
match_list = re.finditer('^s', text, re.I | re.M)
match_list

<_regex.Scanner at 0x7fd3a90b4820>

In [18]:
list(match_list)

[<regex.Match object; span=(0, 1), match='S'>,
 <regex.Match object; span=(40, 41), match='s'>]
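The object returned by `finditer()` is a one-shot iterator: once it has been traversed, a second traversal yields nothing. A minimal sketch with the standard `re` module and a shortened text:

```python
import re

text = 'Sing\nson'
match_iter = re.finditer('^s', text, re.I | re.M)

first_pass = [m.group() for m in match_iter]
second_pass = [m.group() for m in match_iter]   # the iterator is exhausted
print(first_pass, second_pass)
```

Storing `list(match_iter)` in a variable is a simple way to reuse the matches several times.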

## Substitution

#### Global replacement: `s/regex/replacement/g`

In [19]:
text = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip().split('\n')
text

['Sing, O goddess, the anger of Achilles ',
 'son of Peleus, that brought countless ills upon the Achaeans.']

In [20]:
for line in text:
    if re.search('es+', line):
        print("Old: " + line)
        # Replaces all the occurrences
        line = re.sub('es+', 'ES', line)
        print("New: " + line)
# s/es+/ES/g

Old: Sing, O goddess, the anger of Achilles 
New: Sing, O goddES, the anger of AchillES 
Old: son of Peleus, that brought countless ills upon the Achaeans.
New: son of Peleus, that brought countlES ills upon the Achaeans.


 
#### Just one replacement: `s/regex/replacement/`

In [21]:
text = """Sing, O goddess, the anger of Achilles 
son of Peleus, that brought countless ills upon the Achaeans.
""".strip().split('\n')
text

['Sing, O goddess, the anger of Achilles ',
 'son of Peleus, that brought countless ills upon the Achaeans.']

In [22]:
for line in text:
    if re.search('es+', line):
        print("Old: " + line)
        # Replaces only the first occurrence
        line = re.sub('es+', 'ES', line, count=1)
        print("New: " + line)
# s/es+/ES/

Old: Sing, O goddess, the anger of Achilles 
New: Sing, O goddES, the anger of Achilles 
Old: son of Peleus, that brought countless ills upon the Achaeans.
New: son of Peleus, that brought countlES ills upon the Achaeans.
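When the number of replacements matters, `subn()` returns it together with the new string. The sketch below contrasts the global and the single replacement on the first verse, using the standard `re` module.

```python
import re

line = 'Sing, O goddess, the anger of Achilles'

# subn() returns a pair: (new string, number of replacements made)
all_replaced, n_all = re.subn('es+', 'ES', line)
first_replaced, n_first = re.subn('es+', 'ES', line, count=1)
print(n_all, all_replaced)
print(n_first, first_replaced)
```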


## Backreferences

In [23]:
line = 'abbbcdeeef'

In [24]:
match = re.search('^(.)(b+)c+', line)

The whole match

In [25]:
match.group()

'abbbc'

Equivalent to

In [26]:
match.group(0)

'abbbc'

Backreference 1, `(.)` stored in `\1`

In [27]:
match.group(1)  

'a'

Backreference 2, `(b+)` stored in `\2`

In [28]:
match.group(2)

'bbb'

Matching a sequence of three identical characters

In [29]:
match = re.search(r'(.)\1\1', line)
match.group(1)                # 'b'

'b'

#### Substitutions `s/(.)\1\1/***/g`

In [30]:
re.sub(r'(.)\1\1', '***', 'abbbcdeeef')  # 'a***cd***f'

'a***cd***f'
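The same idea extends to runs of any length: `(.)\1+` matches a character followed by one or more copies of itself. The sketch below lists the runs and then collapses each of them to a single character by reusing `\1` in the replacement.

```python
import re

line = 'abbbcdeeef'

# A character followed by at least one copy of itself
runs = [m.group() for m in re.finditer(r'(.)\1+', line)]
# \1 in the replacement keeps one copy of the repeated character
collapsed = re.sub(r'(.)\1+', r'\1', line)
print(runs, collapsed)
```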

#### Multiple backreferences `m/\$ *([0-9]+)\.?([0-9]*)/`

In [31]:
price = "We'll buy it for $72.40"

In [32]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match.group() # '$72.40' The entire match

'$72.40'

In [33]:
match.group(1) # '72' The first group

'72'

In [34]:
match.group(2) # '40' The second group

'40'

#### Substitutions `s/\$ *([0-9]+)\.?([0-9]*)/\1 dollars and \2 cents/g`

In [35]:
re.sub(r'\$ *([0-9]+)\.?([0-9]*)',
       r'\1 dollars and \2 cents', price)
# "We'll buy it for 72 dollars and 40 cents"

"We'll buy it for 72 dollars and 40 cents"

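With a price such as `$72`, the second group is empty and the template above would leave a dangling "and  cents". One way around this, sketched below, is to pass a function instead of a template to `re.sub()`. The helper name `spell_price` is made up for the illustration.

```python
import re

def spell_price(match):
    """Build the replacement text, dropping the cents when they are absent."""
    dollars, cents = match.group(1), match.group(2)
    if cents:
        return f'{dollars} dollars and {cents} cents'
    return f'{dollars} dollars'

pattern = r'\$ *([0-9]+)\.?([0-9]*)'
with_cents = re.sub(pattern, spell_price, "We'll buy it for $72.40")
without_cents = re.sub(pattern, spell_price, "We'll buy it for $72")
print(with_cents)
print(without_cents)
```

The function receives each match object in turn and returns the string that replaces it.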
#### Why `r`

In [36]:
'\1'

'\x01'

In [37]:
'\141'

'a'

In [38]:
r'\1'

'\\1'

In [39]:
r'\141'

'\\141'
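The practical consequence shows up in a substitution. In the sketch below, the raw replacement string reaches the regex engine as a backreference, while the non-raw one has already been turned into the control character `chr(1)` by Python.

```python
import re

# One control character versus a backslash followed by a digit
print(len('\1'), len(r'\1'))

# Raw string: \1 is interpreted by the regex engine as group 1
raw = re.sub(r'(.)\1\1', r'<\1>', 'abbbcdeeef')
# Non-raw string: Python has already replaced '\1' with chr(1)
not_raw = re.sub(r'(.)\1\1', '<\1>', 'abbbcdeeef')
print(repr(raw), repr(not_raw))
```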

## Matching objects

In [40]:
price = "We'll buy it for $72.40"

In [41]:
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)
match

<regex.Match object; span=(17, 23), match='$72.40'>

#### Input

In [42]:
match.string            # "We'll buy it for $72.40"

"We'll buy it for $72.40"

#### Groups

In [43]:
match.groups()          # ('72', '40')

('72', '40')

In [44]:
match.group(0)          # '$72.40'

'$72.40'

In [45]:
match.group(1)

'72'

In [46]:
match.group(2)

'40'

#### Match objects: The indices

In [47]:
match.start(0)

17

In [48]:
match.end(0)

23

In [49]:
match.start(1)

18

In [50]:
match.end(1)

20
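The `span()` method returns the start and end indices of a group as a pair. The sketch below collects the spans of the whole match and of both groups, and checks that slicing the input with them gives back the group values.

```python
import re

price = "We'll buy it for $72.40"
match = re.search(r'\$ *([0-9]+)\.?([0-9]*)', price)

# Group 0 is the whole match; groups 1 and 2 are the parenthesized parts
spans = [match.span(i) for i in range(3)]
slices = [price[start:end] for start, end in spans]
print(spans)
print(slices)
```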

#### Example

In [51]:
line = """Tell me, O muse, of that ingenious hero
  who travelled far and wide after he had sacked
  the famous town of Troy.""".strip()
line

'Tell me, O muse, of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'

In [52]:
match = re.search(',.*,', line, re.S)
match

<regex.Match object; span=(7, 16), match=', O muse,'>

In [53]:
line[0:match.start()]             # 'Tell me'

'Tell me'

In [54]:
line[match.start():match.end()]   # ', O muse,'

', O muse,'

In [55]:
line[match.end():]   # ' of that ingenious hero
         #   who travelled far and wide after he had sacked
         #   the famous town of Troy.'

' of that ingenious hero\n  who travelled far and wide after he had sacked\n  the famous town of Troy.'
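The three slices above can be wrapped in a small helper that relies only on the match object, since `match.string` gives access to the input. The name `split_around` is hypothetical and the line is shortened to its first verse.

```python
import re

def split_around(match):
    """Return the text before, inside, and after a match."""
    s = match.string
    return s[:match.start()], s[match.start():match.end()], s[match.end():]

line = 'Tell me, O muse, of that ingenious hero'
before, inside, after = split_around(re.search(',.*,', line))
print(repr(before), repr(inside), repr(after))
```

Concatenating the three parts gives back the original line.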

## Concordances: `.{0,15}Nils Holgersson.{0,15}`

In [56]:
pattern = 'Nils Holgersson'
width = 15

We build a regex from these parameters: `.{0,width}pattern.{0,width}`

In [57]:
('.{{0,{width}}}{pattern}.{{0,{width}}}'
 .format(pattern=pattern, width=width))

'.{0,15}Nils Holgersson.{0,15}'
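In the template, the doubled braces `{{` and `}}` produce literal braces, while `{width}` and `{pattern}` are filled in. The same convention applies to f-strings, as the sketch below checks.

```python
pattern = 'Nils Holgersson'
width = 15

# str.format(): {{ and }} are literal braces
with_format = ('.{{0,{width}}}{pattern}.{{0,{width}}}'
               .format(pattern=pattern, width=width))
# f-string: same brace-doubling rule
with_fstring = f'.{{0,{width}}}{pattern}.{{0,{width}}}'
print(with_format == with_fstring, with_fstring)
```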

In [58]:
file_name = '../../corpus/Selma.txt'
# Assuming the corpus file is encoded in UTF-8
with open(file_name, encoding='utf-8') as f:
    text = f.read()
text[:100]

'Nils Holgerssons underbara resa genom Sverige\nSelma Lagerlöf\n\nInnehåll\n\tDen kristna dagvisan - Sveri'

In [59]:
# A space in the pattern now matches any run of whitespace (spaces, tabs, newlines)
pattern = re.sub(' ', r'\\s+', pattern)
pattern

'Nils\\s+Holgersson'

In [60]:
# Replaces every run of whitespace (including newlines) with a single space
text = re.sub(r'\s+', ' ', text)

In [61]:
concordance = ('(.{{0,{width}}}{pattern}.{{0,{width}}})'
               .format(pattern=pattern, width=width))
concordance

'(.{0,15}Nils\\s+Holgersson.{0,15})'
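The steps so far can be gathered into a single function. The version below is only a sketch: the name `concordance_lines` is hypothetical, the two-sentence sample stands in for the corpus file, and the standard `re` module is used so that the block runs on its own.

```python
import re

def concordance_lines(pattern, text, width=15):
    """Return each match of pattern with up to width characters of context."""
    pattern = re.sub(' ', r'\\s+', pattern)   # a space matches any whitespace run
    text = re.sub(r'\s+', ' ', text)          # one line, single spaces
    concordance = f'(.{{0,{width}}}{pattern}.{{0,{width}}})'
    return [m.group(1) for m in re.finditer(concordance, text)]

# Made-up sample; note the newline inside the first occurrence of the name
sample = 'Se på Nils\nHolgersson Tummetott! Jag heter Nils Holgersson och är son'
for found in concordance_lines('Nils Holgersson', sample, width=6):
    print(found)
```

Because the text is normalized before the search, the occurrence split across two lines is found like the other one.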

In [62]:
for match in re.finditer(concordance, text):
    print(match.group(1))

Nils Holgerssons underbara res
mmetott! Se på Nils Holgersson Tummetott!" Ge
an. "Jag heter Nils Holgersson och är son til
e är det värt, Nils Holgersson, att du är äng
 den tiden, då Nils Holgersson drog omkring m
visa honom vad Nils Holgersson från Västra Ve
m det året, då Nils Holgersson for omkring me
kan kosta dem. Nils Holgersson hade inte haft
e mer sägas om Nils Holgersson, att han inte 
" För där stod Nils Holgersson mitt uppe på R
ingo de syn på Nils Holgersson, och då sköt d
vildgässen och Nils Holgersson äntligen hade 
etare. Men vad Nils Holgersson inte så, det v
åga, och om då Nils Holgersson sade nej, börj
ats, och om nu Nils Holgersson också hade teg
mädlig ut, att Nils Holgersson kastade sig öv
Och inte ville Nils Holgersson slåss med en t
den tiden, när Nils Holgersson for omkring me
kull honom. Om Nils Holgersson genast hade ro
 på egen hand, Nils Holgersson," sade han då 
mle-Drumle ner Nils Holgersson på bottnen av 
 - "Jo, jag är Nils Holgersson från Västra Ve
 de