# Regular expression search (2)

by Koenraad De Smedt at UiB


---
This notebook continues the introduction to regular expression search.
It demonstrates how to:

1.   Use the result of a match as a truth value
2.   Obtain the matching parts from a search
3.   Find all matches

For more information on regular expressions and their use for NLP, read ➜ Jurafsky & Martin, *Speech and Language Processing, 3rd ed.*, Ch. 2: [Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf). Note, however, that some conventions are system-dependent: Jurafsky & Martin delimit regular expressions with slashes, whereas in Python they are simply strings.

See also the [documentation of Python regular expression operations](https://docs.python.org/3/library/re.html) and the [Python regular expression howto](https://docs.python.org/3/howto/regex.html).

---

In order to use regular expressions in Python, we import the `re` module. Let's also make an example string in which we will search for various patterns.

In [None]:
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

### Using the result as a truth value

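The simplest way to use the result of a search is as a truth value, for instance as the condition of an `if` statement. `re.search` returns a match object, which counts as true, if the pattern is found, and `None`, which counts as false, otherwise.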

In [None]:
# The r prefix makes a raw string, so \w reaches the regex engine unchanged
if re.search(r'My \w+ \w+', juliet):
  print('There is a match')
else:
  print('There is no match')
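
To see why this works, it can help to look at the values that `re.search` returns. The following is a small sketch; the pattern starting with `Thy` is only an illustrative choice of a pattern that does not occur in the example string.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

hit = re.search(r'My \w+ \w+', juliet)    # this pattern occurs in the text
miss = re.search(r'Thy \w+ \w+', juliet)  # illustrative pattern that does not occur

print(hit is None, miss is None)  # False True
print(bool(hit), bool(miss))      # True False
```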

### Extracting the matching part

If you need to obtain the whole matching part of the string, it can be extracted from the match object using `.group(0)`.

In [None]:
m = re.search(r'My \w+ \w+', juliet)
print(m.group(0))
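
Note that calling `.group(0)` on the result fails with an `AttributeError` when there is no match, because the result is then `None`. One possible way to guard against this is sketched below; `first_match` is a hypothetical helper name introduced here for illustration, not part of the `re` module.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

def first_match(pattern, text):
    """Return the first matching part of text, or None if there is no match."""
    m = re.search(pattern, text)
    # Only call .group(0) when a match object was actually returned
    return m.group(0) if m else None

print(first_match(r'My \w+ \w+', juliet))   # My bounty is
print(first_match(r'Thy \w+ \w+', juliet))  # None
```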

If the pattern contains groups (parts enclosed in parentheses), the text matched by the first group is obtained with `.group(1)`, the text matched by the second group with `.group(2)`, and so on.

In [None]:
m = re.search(r'My (\w+) (\w+)', juliet)
print(m.group(0))
print(m.group(1))
print(m.group(2))
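
As a small addition, a match object also has a `.groups()` method, which returns the text of all groups at once as a tuple. The sketch below shows this together with tuple unpacking; the variable names `first` and `second` are arbitrary.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

m = re.search(r'My (\w+) (\w+)', juliet)
if m:
    print(m.groups())           # ('bounty', 'is')
    first, second = m.groups()  # unpack the two groups into variables
    print(first)                # bounty
    print(second)               # is
```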

### Finding all matches

The `re.findall` function returns a list of all non-overlapping matching parts of the string, not just the first one.

In [None]:
print(re.findall(r'My \w+ \w+', juliet))
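
Because the result is an ordinary list, the number of matches can be obtained with `len`. The sketch below also illustrates what "non-overlapping" means, using a short made-up string `'a b c'`: once `'a b'` has been matched, the `b` cannot be reused for a second match `'b c'`.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

matches = re.findall(r'My \w+ \w+', juliet)
print(len(matches))                          # 2

# Count occurrences of the whole word "more"
print(len(re.findall(r'\bmore\b', juliet)))  # 2

# Matches do not overlap: 'b' is consumed by the first match
pairs = re.findall(r'\w+ \w+', 'a b c')
print(pairs)                                 # ['a b']
```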

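However, if the pattern contains more than one *group*, `re.findall` returns a list of tuples, each holding the text matched by every group for one occurrence of the pattern. With exactly one group, it returns a list of the strings matched by that group. If no match is found, an empty list is returned.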

In [None]:
print(re.findall(r'My (\w+) (\w+)', juliet))
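
The different shapes of the result can be compared in a short sketch. It also shows one possible way to get the whole match even when the pattern contains groups, namely by iterating over match objects with `re.finditer`; the `Thy` pattern is again only an illustrative pattern without matches.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# One group: a list of strings
one_group = re.findall(r'My (\w+)', juliet)
print(one_group)    # ['bounty', 'love']

# No match: an empty list
no_match = re.findall(r'Thy (\w+) (\w+)', juliet)
print(no_match)     # []

# re.finditer yields match objects, so .group(0) is still available
whole = [m.group(0) for m in re.finditer(r'My (\w+) (\w+)', juliet)]
print(whole)        # ['My bounty is', 'My love as']
```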

### Exercises

Choose one of the texts below (or another text with a variety of punctuation and numbers) as a test string for the following exercises.

1.  Use a regex to find how many digits there are in the text.  Note: instead of `[0-9]`, you can also use `\d`. Try it.
2.  Use a regex to find how many sequences of digits there are in the text.
3.  Use a regex to find the numbers in the text, where a number is a sequence of digits but may also contain a period or comma between digits. Tip: use a disjunction; before the `|` write a pattern for numbers with period or comma and after `|` write a pattern for plain numbers.
4.  Use a regex to find the number of sentence delimiters in the text, where a sentence delimiter consists of *one or more* consecutive periods, colons, semicolons, exclamation marks or question marks, followed by a space or newline.


In [None]:
text1 = '''20 isbreer i Norge er nå borte: 364 kvadratkilometer isbre
har forsvunnet mellom 2006 og 2022. Det tilsvarer et område på
størrelse med Mjøsa!!!
Samtidig som de 20 breene har forsvunnet, har isbreer i Norge totalt
minket 14,5 % siden forrige kartlegging. Ismassene som har smeltet
bort siden da, har en størrelse på 364 kvadratkilometer til sammen; det
er et område like stort som omtrent 50.000 fotballbaner...
Kan denne utviklingen stanses??'''

In [None]:
text2 = '''«Après le record de 2022, le marché mondial du luxe devrait
enregistrer une croissance de 5 % à 12 % en 2023», estimait une
étude sectorielle publiée en juin. LVMH a confirmé un «retour à la
normale», le 10 octobre [...].
Le groupe de luxe, qui fait office de baromètre de l’industrie du luxe,
avait publié un chiffre d’affaires de 19,96 milliards d’euros, réalisé
en trois mois, cet été, soit 9 % de plus qu’au troisième trimestre 2022.
Mais cette croissance est inférieure aux + 17 % enregistrés au premier
semestre 2023.'''