# **Regular expressions**

Python library: **Lib/re.py** (the package **Lib/re/** since Python 3.11)

Regular expressions (also called “RegEx”) are a powerful tool and a standardized language to search for patterns in strings.

The built-in Python module ``re`` provides several functions for performing matches. The most significant are:
- ``match()``:$\qquad \text{Determine if the RE matches at the beginning of the string.}$<br>
- ``search()``:$\qquad \text{Scan through a string, looking for any location where this RE matches.}$<br>
- ``findall()``:$\qquad \text{Find all substrings where the RE matches, and return them as a list.}$<br>
- ``finditer()``:$\qquad \text{Find all substrings where the RE matches, and return them as an iterator.}$<br>
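
As a minimal sketch of how the four functions differ, the cell below applies each of them to one sentence. The sentence and the variable names are made up for illustration only.

```python
import re

sentence = "bananas are cheap, banana bread is not"

# match(): only succeeds if the pattern matches at position 0
first = re.match("banana", sentence)

# search(): returns the first match found anywhere in the string
anywhere = re.search("bread", sentence)

# findall(): returns all non-overlapping matches as a list of strings
all_matches = re.findall("banana", sentence)

# finditer(): yields one match object per non-overlapping match
spans = [m.span() for m in re.finditer("banana", sentence)]

print(first.group())    # banana
print(anywhere.span())  # (26, 31)
print(all_matches)      # ['banana', 'banana']
print(spans)            # [(0, 6), (19, 25)]
```

``findall()`` is convenient when only the matched text is needed, while ``finditer()`` gives access to positions through the match objects.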

This notebook shows examples of the following categories of metacharacters:
- Anchors
- Quantifiers
- Disjunctions
- Character Classes

python documentation:
- Regular Expression Syntax [library "re"](https://docs.python.org/3/library/re.html)
- How-to use re [re-how-to](https://docs.python.org/3/howto/regex.html#regex-howto)




In [None]:
# Load resources for all following code cells
import re

## **Anchors mark a position in the string**

- $\text{^}$ : $\quad$ means the beginning of the string
- $\text{\$}$: $\quad$  means the end

In [None]:
# Define regular expression

# Does the sentence "bananas are cheap" begin with "banana"?
p = re.search('^banana', "bananas are cheap")

# Show results:

if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())

    # Print the start index and the end index of the match
    # (end() is exclusive: it points one past the last matching character)
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end position as a tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))
else:
    print("no match")

matching string:	banana
index of first and last matching character in the sentence: 	 start: 0 	 end: 6
index of first and last matching character as tuple: 	(0, 6)


In [None]:
# The function match() returns the same result here.
#
# Note:
# match() only checks whether the RE matches at the beginning of the string,
# while search() scans forward through the string for a match.
# It is important to keep this distinction in mind:
# match() only reports a match that starts at index 0;
# if the match would start anywhere else, match() returns None.

p = re.match('^banana', "bananas are cheap")

if p:
  print('matching string: ' + p.group())

matching string: banana
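
The difference between the two functions becomes visible as soon as the pattern does not occur at the start of the string. The following sketch reuses the sentence "I have a banana"; the variable names are chosen for illustration.

```python
import re

sentence = "I have a banana"

# match() is tied to position 0, so 'banana' further along is not reported
m = re.match("banana", sentence)

# search() scans forward and finds the same pattern at index 9
s = re.search("banana", sentence)

print(m)         # None
print(s.span())  # (9, 15)
```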


In [None]:
# Check if the sentence "I have a banana" ends with "banana"

p = re.search('banana$', "I have a banana")

# Show results:

if p:
    # Print the string which matches the regular expression
    print('matching string:\t' + p.group())

    # Print the start index and the end index of the match
    # (end() is exclusive: it points one past the last matching character)
    print('index of first and last matching character in the sentence: \t start: {} \t end: {}'.format(p.start(), p.end()))

    # Print the start and end position as a tuple:
    print('index of first and last matching character as tuple: \t{}'.format(p.span()))

else:
    print("no match")

matching string:	banana
index of first and last matching character in the sentence: 	 start: 9 	 end: 15
index of first and last matching character as tuple: 	(9, 15)


In [None]:
# By default, $ matches
#  - at the end of the string,
#  or,
#  - just before a newline character that ends the string.
# (Only with the flag re.MULTILINE does $ match before every newline.)

p = re.search('}$', '{block}')

print('search {} in {}'.format('}$', '{block}'))
if p:
    print('is {} the last character?: True\t match: {}'.format('}', p.group()))
else:
    print('{} is not the last character in this line'.format('}'))

print('\n-----------------------------------------')


p = re.search('}$', '{block} ')

print('search {} in {}'.format('}$', '{block} '))
if p:
    print('is {} the last character?: True\t match: {}'.format('}', p.group()))
else:
    print('{} is not the last character in this line'.format('}'))

print('\n-----------------------------------------')

p = re.search('}$', '{block}\n')

print('search {} in {}'.format('}$', '{block}\\n'))
if p:
    print('is {} the last character in this line? True \t match: {}'.format('}', p.group()))
else:
    print('{} is not the last character in this line'.format('}'))

search }$ in {block}
is } the last character?: True	 match: }

-----------------------------------------
search }$ in {block} 
} is not the last character in this line

-----------------------------------------
search }$ in {block}\n
is } the last character in this line? True 	 match: }
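
For strings with several lines, the behaviour of ``$`` depends on the flag ``re.MULTILINE``. The sketch below counts the matches with and without the flag; the two-line sample string is an assumption made for illustration.

```python
import re

text = "{first}\n{second}"

# Default: $ matches only at the end of the string
# (or just before a newline that ends the string)
default_hits = re.findall(r"\}$", text)

# re.MULTILINE: $ additionally matches before every newline
multiline_hits = re.findall(r"\}$", text, flags=re.MULTILINE)

print(len(default_hits))    # 1
print(len(multiline_hits))  # 2
```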


In [None]:
# ^ means the beginning of a string
# $ means the end of a string

# Combined, '^banana$' only matches if the whole string is 'banana'.
# Do the following strings consist of exactly 'banana'?
print(re.search('^banana$', 'bananas are cheap'))

print(re.search('^banana$', 'I have a banana'))

print(re.search('^banana$', 'banana').group())

None
None
banana


## **Quantifiers indicate the number of repetitions of the previous character**

- $\text{*}$: $\quad$  means zero or more
- $\text{+}$: $\quad$ means one or more
- $\text{?}$: $\quad$ means zero or one repetitions
- If more precise quantifiers are needed, the minimum and maximum number of repetitions can be written in curly brackets, e.g. ``{2,7}``

In [None]:
# '?' means zero or one repetitions

print(re.search('ba?nana', 'banana').group())
print(re.search('ba?nana', 'bnana').group())
print(re.search('ba?nana', 'baaaaanana'))

banana
bnana
None


In [None]:
# '+' means one or more repetitions
print(re.search('ba+nana', 'banana').group())
print(re.search('ba+nana', 'bnana'))
print(re.search('ba+nana', 'baaaaanana').group())

banana
None
baaaaanana


In [None]:
# '*' means zero or more repetitions
print(re.search('ba*nana', 'banana').group())
print(re.search('ba*nana', 'bnana').group())
print(re.search('ba*nana', 'baaaaanana').group())

banana
bnana
baaaaanana


In [None]:
# a{2,7} means that the letter “a” must repeat at least 2 times and 
# at maximum 7 times for the string to match the pattern

print(re.search('a{2,7}', 'banana'))
print(re.search('a{2,7}', 'bnana'))
print(re.search('a{2,7}', 'baaaaanana').group())

None
None
aaaaa
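
Besides the form ``{m,n}``, the curly brackets also accept an exact count or an open bound. The following sketch shows these variants and why 'banana' does not match ``a{2,7}``; the test strings are chosen for illustration.

```python
import re

# {3} means exactly three repetitions of the preceding element
exactly_three = re.search("ba{3}nana", "baaanana")

# {2,} means two or more repetitions (no upper bound)
two_or_more = re.search("ba{2,}nana", "baaaaanana")

# {,2} means at most two repetitions, including zero
at_most_two = re.search("ba{,2}nana", "bnana")

# In 'banana' every 'a' stands alone, so a run of two or more never occurs
no_run = re.search("a{2,}", "banana")

print(exactly_three.group())  # baaanana
print(two_or_more.group())    # baaaaanana
print(at_most_two.group())    # bnana
print(no_run)                 # None
```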


## **Disjunctions represent a logical OR**

They are written as a set of characters in square brackets [ ] or as alternatives separated by the pipe sign |.

In [None]:
# b[aou]nana matches strings of the form 'b?nana' whose 2nd character is either 'a', 'o' or 'u'

print(re.search('b[aou]nana', 'banana').group())
print(re.search('b[aou]nana', 'bonana').group())
print(re.search('b[aou]nana', 'bunana').group())

banana
bonana
bunana


In [None]:
# b(a|o|u)nana matches strings of the form 'b?nana' whose 2nd character is either 'a', 'o' or 'u'

print(re.search('b(a|o|u)nana', 'banana').group())
print(re.search('b(a|o|u)nana', 'bonana').group())
print(re.search('b(a|o|u)nana', 'bunana').group())

banana
bonana
bunana


In [None]:
# Here, the three complete words 'banana', 'bonana' and 'bunana' are listed as alternatives.

print(re.search('banana|bonana|bunana', 'banana').group())
print(re.search('banana|bonana|bunana', 'bonana').group())
print(re.search('banana|bonana|bunana', 'bunana').group())

banana
bonana
bunana
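
The pipe sign applies to everything on its left and right unless parentheses limit its scope. The sketch below illustrates this, and also shows that the parentheses capture the chosen alternative as group 1; the pattern ``ba|onana`` is a made-up counterexample.

```python
import re

# Without parentheses, | splits the whole pattern: 'ba' OR 'onana'
loose = re.search("ba|onana", "banana")

# Parentheses limit the alternation to the second character
grouped = re.search("b(a|o|u)nana", "bonana")

print(loose.group())     # ba
print(grouped.group())   # bonana
print(grouped.group(1))  # o
```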


## **Character classes represent certain groups of characters**

* $\text{\d or [0-9]}\quad \quad \quad \quad$ :   matches digits
* $\text{\w or [0-9A-Za-z_]}\quad$ :  matches alphanumeric characters and underscores
* $\text{\s} \qquad \qquad \qquad \qquad$:  matches whitespace characters (spaces, tabs, newlines)
* $. \qquad \qquad \qquad \qquad$ :  matches any character except a newline.

In [None]:
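# Both expressions only match strings which start with 'Hello'.
# '\w+' accepts exactly one further word, '.+' accepts any remaining characters.
# Raw strings (r'...') keep the backslash in '\w' from being read as a Python escape.

print(re.search(r'^Hello \w+$', 'Hello World').group())
print(re.search(r'^Hello \w+$', 'Hello new World'))

print(re.search(r'^Hello.+$', 'Hello World').group())
print(re.search(r'^Hello.+$', 'Hello new world').group())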

Hello World
None
Hello World
Hello new world
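
The remaining classes can be tried out in the same way. The sketch below uses ``\d``, ``\w`` and ``\s`` on a made-up sentence and shows that ``.`` skips newlines unless the flag ``re.DOTALL`` is set.

```python
import re

sentence = "I bought 12 bananas for 3 euros"

# \d+ collects runs of digits
numbers = re.findall(r"\d+", sentence)

# \w+ collects runs of word characters (letters, digits, underscore)
tokens = re.findall(r"\w+", sentence)

# \s+ matches runs of whitespace, used here as the separator for split()
pieces = re.split(r"\s+", sentence)

# '.' does not match a newline by default, but does with re.DOTALL
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", flags=re.DOTALL)

print(numbers)          # ['12', '3']
print(tokens == pieces) # True
print(dot_default)      # None
```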


## **Find regular expressions in files**

In [None]:
# Create sample text file

text = """\
 this is a sample text which contains several characters.
 For example, I bought cheap bananas today. 
 """
# Write the content to the output file
with open("/content/regText.txt", 'w') as file:
  file.write(text)

In [None]:
# Find the expression 'text' as a whole word in the sample file 'regText.txt'
# (-w: match whole words only, -e: the pattern follows)

!fgrep -w -e text regText.txt

 this is a sample text which contains several characters.


In [None]:
# Replace the first occurrence of 'banana' on each line of regText.txt with 'test'
# (sed prints the result; the file itself is not modified)

!sed -e 's/banana/test/' regText.txt

 this is a sample text which contains several characters.
For example, I bought cheap tests today. 
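
The same two steps can also be done with the ``re`` module instead of shell commands. This is only a sketch: it writes the sample text to a temporary directory (an assumption, so that the cell runs outside of ``/content`` as well) and mimics the whole-word search and the replacement line by line.

```python
import os
import re
import tempfile

text = """\
 this is a sample text which contains several characters.
 For example, I bought cheap bananas today.
 """

# Write the sample text to a temporary file (location chosen for illustration)
path = os.path.join(tempfile.mkdtemp(), "regText.txt")
with open(path, "w") as file:
    file.write(text)

with open(path) as file:
    lines = file.read().splitlines()

# Similar to a whole-word grep: keep lines that contain the word 'text'
hits = [line for line in lines if re.search(r"\btext\b", line)]

# Similar to sed 's/banana/test/': replace the first occurrence on each line
replaced = [re.sub("banana", "test", line, count=1) for line in lines]

print(hits)
print(replaced[1])
```

As with the ``sed`` call, the file on disk stays unchanged; only the list ``replaced`` holds the modified lines.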


Copyright © 2020 IUBH Internationale Hochschule