# **Regular Expressions**

Regular expressions (also called “RegEx”) are a powerful tool and a standardized language to search for patterns in strings. They are available in almost any programming language (e.g., in Python, Perl, JavaScript, Awk) and can also be used in shell scripts and at the UNIX command line. Many editors (e.g., Vim, Emacs, and Sed) support regular expressions for search-and-replace operations. In rule-based NLP, regular expressions can be used to extract data from text. Especially well-defined patterns such as dates, times, and prices can easily be found and extracted from text using regular expressions [[1]](#scrollTo=Al4v1MD0ZhJq).

This notebook shows examples for the following topics:
- Compiling regular expressions
- Anchors
- Quantifiers
- Disjunctions
- Character classes
- Finding regular expressions in files

For more information about the regular expression syntax, please refer to [[2](https://docs.python.org/3/library/re.html)] and [[3](https://docs.python.org/3/howto/regex.html#regex-howto)].

## **Compiling regular expressions**

To work with regular expressions, we should import the ``re`` module. It provides a set of powerful regular expression facilities, which allows us to quickly check whether a given string matches a given pattern or contains such a pattern.

Regular expressions in Python are compiled into pattern objects with the ``re.compile()`` function. Once a pattern object has been created, you can call various methods on it, such as:
- ``match()``:$\qquad$ Determines if the regular expression matches at the **beginning** of the string.<br>
- ``search()``:$\qquad$ Scans a string, looks for any location where this regular expression matches and returns the **first match**.<br>
- ``findall()``:$\qquad$ Finds **all** substrings where the regular expression matches and returns them as a list.<br>

In this notebook, two different compiling methods are demonstrated:
1. Define a regular expression and compile it into a pattern object ``reg``. After that, run a function with this pattern object.
2. Define a regular expression inside the function by using the ``re`` module.  

Below, you can see an example for each function:


### match()
``match()`` determines if the regular expression matches at the **beginning** of the string and returns an object which contains the match.

In [4]:
# import the 're' module
import re

# Method-1
## Define regular expression and compile it into a pattern object "reg"
reg = re.compile('banana')

## Run match() function with the pattern object "reg"
p1 = reg.match('''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')


# Method-2
## Define regular expression inside the match() function.
## Python automatically compiles the regular expression and runs the function.
p2 = re.match('banana', '''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')

# Print
## Print the object returned by the match() function.
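## If there is a match, only the matched text can be printed with: print(p1.group())
## If there is no match, the function returns None, and calling group() on it raises an error.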
print(p1)
print(p2)

None
None
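
Both calls return ``None`` because the text starts with "We" and not with "banana". The following minimal sketch shows a case where ``match()`` succeeds and how to guard against ``None`` before calling ``group()``. The helper name ``matched_text`` is a hypothetical name chosen for this illustration:

```python
import re

reg = re.compile('banana')

def matched_text(match):
    """Return the matched text, or None if there was no match (hypothetical helper)."""
    return match.group() if match is not None else None

# match() only succeeds when the pattern occurs at the very beginning of the string
print(matched_text(reg.match('banana prices are low')))
print(matched_text(reg.match('We visited a banana farm')))
```

The first call prints the matched text, the second prints ``None`` without raising an error.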


### search()
``search()`` finds any location where the regular expression matches and returns an object which contains the **first match**.

In [5]:
# import the 're' module
import re

# Method-1
## Define regular expression and compile it into a pattern object "reg"
reg = re.compile('banana')

## Run search() function with the pattern object "reg"
## search() scans the whole string and returns the first location where the regular expression matches.
p1 = reg.search('''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')


# Method-2
## Define regular expression inside the search() function.
## Python automatically compiles the regular expression and runs the function.
p2 = re.search('banana', '''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')

# Print
## Print the object returned by the search() function.
## If there is a match, only the matched text can be printed with: print(p1.group())
print(p1)
print(p1.group())
print(p2)

<re.Match object; span=(13, 19), match='banana'>
banana
<re.Match object; span=(13, 19), match='banana'>


### findall()
``findall()`` finds **all** substrings where the regular expression matches and returns them as a list.

In [None]:
# import the 're' module
import re

# Method-1
## Define regular expression and compile it into a pattern object "reg"
reg = re.compile('banana')

## Run findall() function with the pattern object "reg"
p1 = reg.findall('''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')


# Method-2
## Define regular expression inside the findall() function.
## Python automatically compiles the regular expression and runs the function.
p2 = re.findall('banana', '''We visited a banana farm in our island. 
                  A farmer told us that the retailers determined 
                  to keep banana prices low''')

# Print
## Print the list returned by the findall() function.
print(p1)
print(p2)

['banana', 'banana']
['banana', 'banana']
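
To compare the three functions directly, the following sketch runs them on the same input. The sample sentence is a shortened version of the text used above and is only an assumption for illustration:

```python
import re

reg = re.compile('banana')
text = 'We visited a banana farm. The retailers keep banana prices low.'

at_start = reg.match(text)    # None: the text does not start with 'banana'
first = reg.search(text)      # Match object for the first occurrence
every = reg.findall(text)     # list with all matched substrings

print(at_start)
print(first.span())
print(every)
```

``match()`` and ``search()`` return a match object or ``None``, whereas ``findall()`` always returns a list, which is empty if nothing matches.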


## **Anchors**

Anchors do not match any character. Instead, they match a position before, after, or between characters.


For more information about anchors, please refer to [[4](https://www.regular-expressions.info/anchors.html)].
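
That anchors match a position rather than a character can be checked directly: a pattern that consists only of an anchor matches the empty string. A minimal sketch:

```python
import re

# '^' and '$' match positions, so the matched text is the empty string
start = re.search('^', 'banana')
end = re.search('$', 'banana')

print(repr(start.group()), start.span())
print(repr(end.group()), end.span())
```

Both matches have zero length: the span of the first one begins and ends at index 0, the span of the second one begins and ends at index 6, right after the last character.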


### <b>" ^ " </b>

The anchor <b>" ^ " </b> matches the position before the first character in the string.

The following example shows how to find a string which begins with the word "banana":

In [6]:
# Import the 're' module
import re

# Compile the regular expression
reg = re.compile('^banana')

# Define strings for the search function
p1 = reg.search("are bananas cheap?")
p2 = reg.search("bananas are cheap.")

# Show results:
if p1:
    # Print the result
    print("The string 'p1' begins with '" + p1.group()+"'")

else:
    print("The string 'p1' does not begin with 'banana'.")

if p2:
    # Print the string which matches the regular expression
    print("The string 'p2' begins with '" + p2.group()+"'.")
    
else:
    print("The string 'p2' does not begin with 'banana'")


The string 'p1' does not begin with 'banana'.
The string 'p2' begins with 'banana'.


### <b>" $ " </b>

The anchor <b>" $ " </b> matches the position right after the last character in the string.

The following example shows how to find a string which ends with the word "banana":

In [7]:
# Import the 're' module
import re

# Compile the regular expression
reg = re.compile('banana$')

# Define strings for the search function
p1 = reg.search("I have a banana")
p2 = reg.search("I have a banana and an apple")

# Show results:
if p1:
    # Print the result
    print("The string 'p1' ends with '" + p1.group()+"'")

else:
    print("The string 'p1' does not end with 'banana'.")

if p2:
    # Print the string which matches the regular expression
    print("The string 'p2' ends with '" + p2.group()+"'.")
    
else:
    print("The string 'p2' does not end with 'banana'")



The string 'p1' ends with 'banana'
The string 'p2' does not end with 'banana'


### <b>" $ "</b> and <b>" ^ "</b>
The following example shows how to match only strings that consist of nothing but the word “banana”:

In [None]:
# Import the 're' module
import re

# Compile the regular expression
reg = re.compile('^banana$')

# Define strings for the search function
p1=reg.search('bananas are cheap')
p2=reg.search('I have a banana')
p3=reg.search('I have a banana and an apple')
p4=reg.search('banana')

# Print the results
print(p1)
print(p2)
print(p3)
print(p4.group())

None
None
None
banana
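
One detail worth knowing: in Python, <b>" $ "</b> also matches just before a single newline at the very end of the string. The sketch below illustrates this and, as a stricter alternative, uses ``re.fullmatch()``, which is not used elsewhere in this notebook and requires the pattern to cover the entire string:

```python
import re

reg = re.compile('^banana$')

# '$' also matches just before a trailing newline
with_newline = reg.search('banana\n')
print(with_newline is not None)

# fullmatch() requires the pattern to cover the whole string
strict_ok = re.fullmatch('banana', 'banana')
strict_newline = re.fullmatch('banana', 'banana\n')
print(strict_ok is not None)
print(strict_newline is not None)
```

The anchored pattern still matches the string with the trailing newline, while ``fullmatch()`` rejects it.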


## **Quantifiers**
Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found [[5](https://docs.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions)].



### <b>" ? "</b>

The quantifier <b>" ? "</b> matches zero or one occurrence of the character (or group) directly in front of the <b>" ? "</b>.

The following example shows how to find an exact match for the strings “banana” and "bnana":

In [None]:
# Import the 're' module
import re

# Compile the regular expression
reg = re.compile('ba?nana')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana'))

banana
bnana
None


### <b>" + "</b>

The quantifier <b>" + "</b> matches one or more occurrences of the character (or group) directly in front of the <b>" + "</b>. ``search()`` returns the leftmost match, and the quantifier is greedy, i.e. it takes as many repetitions as possible.


The following example shows how to find a match for the string “banana” and for variations in which the first appearance of the character 'a' is repeated:

In [None]:
# Import the 're' module
import re

# Compile the regular expression
reg = re.compile('ba+nana')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana').group())
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

banana
None
baaaaanana


### <b>" * "</b>

The quantifier <b>" * "</b> matches zero or more occurrences of the character (or group) directly in front of the <b>" * "</b>. ``search()`` returns the leftmost match, and the quantifier is greedy, i.e. it takes as many repetitions as possible.


The following example shows how to find a match for the string “banana” and for variations in which the first 'a' is missing or repeated:

In [None]:
# import the 're' module
import re

# Compile the regular expression
reg = re.compile('ba*nana')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana').group())
print(reg.search('bnana').group())
print(reg.search('baaaaanana').group())

banana
bnana
baaaaanana
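
The three quantifiers shown so far can be compared side by side. The following sketch applies each pattern to the same three words and collects the words that match. It also shows the greedy behavior on a pattern that consists only of a quantified character:

```python
import re

patterns = ['ba?nana', 'ba+nana', 'ba*nana']
words = ['bnana', 'banana', 'baaaaanana']

results = {}
for pattern in patterns:
    reg = re.compile(pattern)
    # Keep every word in which the pattern is found
    results[pattern] = [word for word in words if reg.search(word)]
    print(pattern, results[pattern])

# Quantifiers are greedy: 'a+' takes the longest run of 'a' it can find
greedy = re.search('a+', 'baaaaanana').group()
print(greedy)
```

Only <b>" * "</b> accepts all three words, because it allows the 'a' to be missing as well as repeated.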


### <b>" { } "</b>

The quantifier <b>" { } "</b> is used as a range quantifier. The minimum and maximum number of repetitions are written between the braces, e.g. <b>" {2,7} "</b>.


The following example shows how to find a sequence in which the character 'a' repeats at least 2 times and at most 7 times:

In [None]:
# import the 're' module
import re

# Compile the regular expression
## a{2,7} means that the character 'a' must repeat at least 2 times and at maximum 7 times in the string to match the pattern.
reg = re.compile('a{2,7}')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana'))
print(reg.search('bnana'))
print(reg.search('baaaaanana').group())

None
None
aaaaa
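
Besides a range with two bounds, the braces also accept an exact count or an open upper bound. The following sketch shows these variants with ``findall()`` on the same sample word:

```python
import re

text = 'baaaaanana'

exactly_two = re.findall('a{2}', text)      # exactly 2 repetitions
three_or_more = re.findall('a{3,}', text)   # at least 3 repetitions
one_to_seven = re.findall('a{1,7}', text)   # between 1 and 7 repetitions

print(exactly_two)
print(three_or_more)
print(one_to_seven)
```

Since ``findall()`` returns non-overlapping matches from left to right, the run of five 'a' characters yields two matches for ``a{2}`` and the fifth 'a' is left over.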


## **Disjunctions**

Disjunctions represent a logical OR and can be expressed by using <b>" [ ] "</b> or <b>" | "</b>.



### <b>" [ ] "</b>
The disjunction <b>" [ ] "</b> is used to match one character of a set of characters.

The following example shows how to find an exact match for the strings “banana”, "bonana" and "bunana":

In [9]:
# import the 're' module
import re

# Compile the regular expression
## The regular expression 'b[aou]nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b[aou]nana')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())
print(reg.search('binana'))

banana
bonana
bunana
None


### <b>" | "</b>

The disjunction <b>" | "</b> is used as an OR operator.

The following example shows how to find an exact match for the strings “banana”, "bonana" and "bunana":

In [None]:
# import the 're' module
import re

# Compile the regular expression
## The regular expression 'b(a|o|u)nana' matches strings which have either 'a', 'o' or 'u' as 2nd character.
reg = re.compile('b(a|o|u)nana')

## Alternatively, you can run the following code which returns the same result:
##reg = re.compile('banana|bonana|bunana')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('banana').group())
print(reg.search('bonana').group())
print(reg.search('bunana').group())
print(reg.search('binana'))

banana
bonana
bunana
None
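
The parentheses around the alternatives also create a capturing group, which changes what some functions return. The following sketch illustrates the difference between the whole match and the captured group, and shows a non-capturing group ``(?:...)`` as one possible way to get whole matches from ``findall()``:

```python
import re

reg = re.compile('b(a|o|u)nana')
m = reg.search('bonana')

whole = m.group()       # the whole match
captured = m.group(1)   # only the text matched inside the parentheses
print(whole, captured)

# findall() returns the captured group if the pattern contains one
groups_only = reg.findall('banana bonana')
print(groups_only)

# A non-capturing group (?:...) makes findall() return the whole matches
whole_matches = re.findall('b(?:a|o|u)nana', 'banana bonana')
print(whole_matches)
```

For a single character, the set notation ``b[aou]nana`` avoids the group altogether. The <b>" | "</b> operator is needed when the alternatives are longer than one character.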


## **Character classes**

Character classes represent certain groups of characters:

<ul>

<li><b>" \d "</b> or <b>" [0-9] "</b> matches digits.</li>
<ul>
<li>The pattern " [0-9] " matches any single digit.</li>
<li>The pattern " [^0-9] " matches any single character that is not a digit.</li>
<li>NOTE: In the "Anchors" section, it is shown that  " ^ " works as an anchor when it is used outside " [ ] ".
But if it is inside " [ ] ", it works as a complement operator, i.e. matches any character other than the ones mentioned inside " [ ] ".</li>
</ul>

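<li><b>" \w "</b> matches alphanumeric characters and underscores. For ASCII text, this corresponds to <b>" [0-9A-Za-z_] "</b>; by default, Python also matches Unicode letters and digits.</li>
<li><b>" \s "</b> matches white space characters such as spaces, tabs, and newlines.</li>
<li><b>" . "</b> matches any character except a newline (by default).</li>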

</ul>
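
The character classes listed above can be tried out with ``findall()``. The following sketch uses a made-up sample string and writes the patterns as raw strings (``r'...'``) so that Python does not interpret the backslashes itself:

```python
import re

text = 'Room 42_b, 9 am'

digits = re.findall(r'\d', text)               # every single digit
non_digits = re.findall(r'[^0-9]', 'a1 b2')    # every character that is not a digit
words = re.findall(r'\w+', text)               # runs of letters, digits and underscores
spaces = re.findall(r'\s', text)               # every white space character
dot_match = re.search(r'R..m', text).group()   # '.' stands for any single character
dot_newline = re.search(r'a.b', 'a\nb')        # '.' does not match a newline by default

print(digits)
print(non_digits)
print(words)
print(spaces)
print(dot_match)
print(dot_newline)
```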


### Find an alphanumerical pattern

The following example shows how to find a pattern which consists of one or more of the characters "a", "b" or "c", directly followed by one of the digits "1", "2" or "3":

In [13]:
# import the 're' module
import re

# Compile the regular expression.
## The regular expression '[abc]+[123]' matches one or more of the characters "a", "b" or "c",
## directly followed by one of the digits "1", "2" or "3".
## The pattern is not anchored, so search() returns the first matching substring (e.g. 'ab2' in 'ab23').
reg = re.compile('[abc]+[123]')

# Define strings for the search function and print the matching patterns for each string
print(reg.search('a1').group())
print(reg.search('45'))
print(reg.search('ab23').group())
print(reg.search('44ee'))


a1
None
ab2
None


### Find a specific string
The following example shows how to find a string which contains only two words where the first word is "Hello":

In [None]:
# import the 're' module
import re

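# This regular expression matches strings which start with 'Hello', followed by a space and exactly one more word.
# The raw string (r'...') keeps Python from interpreting '\w' as a string escape sequence.
reg = re.compile(r'^Hello \w+$')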

# Define strings for the search function and print the matching patterns for each string
print(reg.search('Hello world').group())
print(reg.search('Hello everyone').group())
print(reg.search('Hello new world'))

Hello world
Hello everyone
None


The following example shows how to find any string where the first word is "Hello":

In [14]:
# import the 're' module
import re

# The expression matches strings which start with 'Hello', followed by at least one more character
reg = re.compile('^Hello.+$')


# Define strings for the search function and print the matching patterns for each string
print(reg.search('Hello world').group())
print(reg.search('Hello everyone').group())
print(reg.search('Hello new world').group())
print(reg.search('Say Hello new world'))

Hello world
Hello everyone
Hello new world
None


### Find white spaces
The following example shows how to find all white spaces in the given string:


In [None]:
# import the 're' module
import re

# Find all white space characters in the given string
result = re.findall(r'[\s]', 'The Indian Express')

print(result)

[' ', ' ']


## **Find regular expressions in files**

Text editors such as Vim and Emacs can be used to find and optionally replace text by using regular expressions. At the command line or in shell scripts, these operations can also be performed, for example, with ``sed`` or ``grep`` [[1]](#scrollTo=Al4v1MD0ZhJq).

``sed`` is a stream editor for manipulating text from files or input streams. 

``grep`` is used to read a stream, file or list of files, and print the lines containing a match for the pattern. It allows us to search plain text files for specific lines using regular expressions.


#### Create a sample text file
The following example defines a text which consists of two lines. Then it creates the file "regText.txt" and writes the text to it:


In [21]:
# Create a sample text
# \n defines a newline
text ="This is a sample text which contains several characters.\nFor example, I bought cheap bananas today."

# Write content and save the file
with open("/content/regText.txt", 'w') as file:
  file.write(text)

#### Find a word inside the sample file
The following example runs a search within the "regText.txt" file. Then it finds and returns the line which contains the word "text":


In [22]:
# Find the expression 'text' in the file
!grep text '/content/regText.txt'

This is a sample text which contains several characters.


#### Replace a word
The following example runs a search within the "regText.txt" file. It replaces the first occurrence of "banana" on each line with the word "test" and prints the result. The file itself is not changed:

In [23]:
# Replace the first "banana" on each line with "test" (only the output changes, not the file)
!sed -e 's/banana/test/' '/content/regText.txt'

This is a sample text which contains several characters.
For example, I bought cheap tests today.
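
The same two operations can also be sketched in Python with the ``re`` module. The example below is only an approximation of what ``grep`` and ``sed`` do. It writes the sample text to a temporary file (an assumption made so that the sketch runs outside of this notebook) and uses ``re.sub()``, which has not been introduced above, for the replacement:

```python
import os
import re
import tempfile

text = ("This is a sample text which contains several characters.\n"
        "For example, I bought cheap bananas today.")

# Write the sample text to a temporary file (location chosen only for this sketch)
path = os.path.join(tempfile.mkdtemp(), 'regText.txt')
with open(path, 'w') as file:
    file.write(text)

with open(path) as file:
    lines = file.read().splitlines()

# Similar to grep: keep the lines that contain a match
grep_like = [line for line in lines if re.search('text', line)]

# Similar to sed 's/banana/test/': replace the first match on each line
sed_like = [re.sub('banana', 'test', line, count=1) for line in lines]

print(grep_like)
print(sed_like)
```

As with the ``sed`` call above, the replacement only affects the printed result; the file on disk keeps its original content.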

# **References**

- [1] NLP and Computer Vision_DLMAINLPCV01 Course Book
- [2] https://docs.python.org/3/library/re.html
- [3] https://docs.python.org/3/howto/regex.html#regex-howto
- [4] https://www.regular-expressions.info/anchors.html
- [5] https://docs.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions


Copyright © 2022 IU International University of Applied Sciences