# Programming with Python

# 8 Regular Expressions

A **regular expression** (also called **regex, re** or **regexp**) is a sequence of characters that defines a **search pattern**. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. The technique was developed in theoretical computer science and formal language theory. <br>
In Python, regular expressions are provided by the **re** module of the standard library, which must be imported before use:

In [3]:
# import regular expressions module
import re

**find a match**

To find a match using regular expressions, call the search function of the re module: **re.search(rgx, string)**. <br>
The search function needs two mandatory arguments: a regular expression **rgx**, given as a string literal, and the **string** that is matched against the rgx. <br>
It returns a Match object for the first match found in the string. <br>

Example: Find a **single digit**

In [4]:
string = 'Prüfgeschwindigkeit\t1\tmm/s'
m = re.search(r'\d', string)  # raw string, so the backslash reaches the regex engine unchanged
m

<re.Match object; span=(20, 21), match='1'>

**Index 0** of the Match object always returns the complete matched string:

In [6]:
m[0]

'1'
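
If the pattern does not occur in the string, re.search returns None, and indexing the result would raise a TypeError. A minimal sketch of a guarded lookup follows; the helper name `first_digit` is hypothetical and used only for illustration.

```python
import re

def first_digit(text):
    """Return the first digit in text as a string, or None if there is none."""
    m = re.search(r'\d', text)
    # re.search returns None when nothing matches, so check before indexing
    return m[0] if m else None

print(first_digit('Prüfgeschwindigkeit\t1\tmm/s'))  # 1
print(first_digit('no digits here'))                # None
```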

**search and replace** <br>
The function **re.sub(rgx, repl, string)** searches the **string** for matches of **rgx** and returns a new string in which every match is replaced by **repl**.

In [8]:
# replace digit by 2
string = 'Prüfgeschwindigkeit\t1\tmm/s'
new_string = re.sub(r'\d', '2', string)
new_string

'Prüfgeschwindigkeit\t2\tmm/s'
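
By default re.sub replaces every match, not only the first one. The following sketch uses a made-up sample string to show this, together with the optional `count` argument that limits the number of replacements.

```python
import re

text = 'a1b2c3'  # sample input, chosen only for illustration

# without count, every digit is replaced
all_replaced = re.sub(r'\d', 'X', text)
# count=1 stops after the first replacement
first_only = re.sub(r'\d', 'X', text, count=1)

print(all_replaced)  # aXbXcX
print(first_only)    # aXb2c3
```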

**collections** <br>
Square brackets **[]** define a collection of characters (a character set); exactly one character from the collection must match, so the entries are combined by a logical or. <br>
For example, [ae] matches either a or e.

In [14]:
rgx = 'M[ae][iy]er'

print(bool(re.search(rgx, 'Maier')))
print(bool(re.search(rgx, 'Mayer')))
print(bool(re.search(rgx, 'Meier')))
print(bool(re.search(rgx, 'Meyer')))

True
True
True
True
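
Collections can also hold ranges and can be negated with a leading ^. The sketch below illustrates both, and also shows that re.search looks for the pattern anywhere in the string. The call to re.fullmatch is not covered in the text above and is included here only as an assumption about what is wanted for validating a whole string.

```python
import re

rgx = r'M[ae][iy]er'

# [0-9] is a range: any single character from 0 to 9
has_digit = bool(re.search(r'[0-9]', 'abc7'))
# [^0-9] is a negated set: any single character that is not 0-9
has_non_digit = bool(re.search(r'[^0-9]', '123'))
# re.search scans the whole string, so a longer name still matches
found_inside = bool(re.search(rgx, 'Meyerhof'))
# re.fullmatch succeeds only if the entire string fits the pattern
whole_string = bool(re.fullmatch(rgx, 'Meyerhof'))

print(has_digit, has_non_digit, found_inside, whole_string)  # True False True False
```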


**optional chars**  
Use ? to make the preceding char optional:

In [5]:
rgx = 'M[ae][iy]e?r'
print(bool(re.search(rgx, 'Mayr')))
print(bool(re.search(rgx, 'Mair')))

True
True


? can also be used with collections:

In [7]:
rgx = '9[,.]?0'
print(bool(re.search(rgx, '9.0')))
print(bool(re.search(rgx, '9,0')))
print(bool(re.search(rgx, '90')))

True
True
True


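**Some predefined collections in regular expressions**:
<br>
<br> **\d** digits [0-9]
<br> **\D** non-digits [^0-9]
<br>
<br> **\s** whitespace [ \t\n\r\f\v]
<br> **\S** non-whitespace [^ \t\n\r\f\v]
<br>
<br> **\w** word chars [a-zA-Z0-9_]
<br> **\W** non-word chars [^a-zA-Z0-9_]
<br>
<br> The bracket forms are the ASCII equivalents. For str patterns in Python 3, \d, \s and \w also match Unicode digits, whitespace and letters (for example ü), unless the re.ASCII flag is set.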
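
As a rough illustration of the predefined collections, the sketch below lists all matches in the sample string from the earlier cells. It relies on re.findall and the re.ASCII flag, neither of which is introduced in the text above; both are part of the standard re module.

```python
import re

text = 'Prüfgeschwindigkeit\t1\tmm/s'

# \d: every single digit in the text
digits = re.findall(r'\d', text)
# \w+: runs of word characters; in Python 3 this includes letters such as ü
words = re.findall(r'\w+', text)
# with re.ASCII, \w is restricted to [a-zA-Z0-9_], so ü splits the word
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(digits)       # ['1']
print(words)        # ['Prüfgeschwindigkeit', '1', 'mm', 's']
print(ascii_words)  # ['Pr', 'fgeschwindigkeit', '1', 'mm', 's']
```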

**Quantifiers** <br>
A quantifier defines how many times the preceding element may occur in a match.
<br>
<br> **\d*** 0+ digits
<br> **\d+** 1+ digits
<br>
<br> **\d{1,3}** 1-3 digits
<br> **\d{1,}** 1+ digits (same as \d+)
<br> **\d{0,}** 0+ digits (same as \d*)
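
The difference between the quantifiers is easiest to see on small inputs. The sketch below uses made-up sample strings; note that quantifiers are greedy, meaning they take as many characters as the pattern allows.

```python
import re

# \d{1,3} is greedy: it takes as many digits as allowed, here three
m_range = re.search(r'\d{1,3}', 'id 12345')
print(m_range[0])  # 123

# \d* may match zero digits, so it succeeds with an empty match at position 0
m_star = re.search(r'\d*', 'abc 42')
print(repr(m_star[0]))  # ''

# \d+ needs at least one digit, so the match is the first run of digits
m_plus = re.search(r'\d+', 'abc 42')
print(m_plus[0])  # 42
```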

**get values from matched objects**

By enclosing parts of the regex in round brackets (capturing groups), the substrings matched by those parts can be accessed by index.

In [10]:
string = 'Prüfgeschwindigkeit\t1\tmm/s'
m = re.search(r'(\w+)\s+(\d+)\s+(\S+)', string)
m

<re.Match object; span=(0, 26), match='Prüfgeschwindigkeit\t1\tmm/s'>

In [12]:
print('m[0]= ', repr(m[0])) # returns the complete matched string
print('m[1]= ', m[1]) # returns the text captured by the first pair of round brackets
print('m[2]= ', m[2]) # returns the text captured by the second pair of round brackets
print('m[3]= ', m[3]) # returns the text captured by the third pair of round brackets

m[0]=  'Prüfgeschwindigkeit\t1\tmm/s'
m[1]=  Prüfgeschwindigkeit
m[2]=  1
m[3]=  mm/s
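
All captured values can also be fetched at once, and groups can be given names. The sketch below reuses the sample string from above; the group names `name`, `value` and `unit` are chosen here for illustration and do not come from the original example.

```python
import re

string = 'Prüfgeschwindigkeit\t1\tmm/s'

# groups() returns all captured substrings as a tuple
m = re.search(r'(\w+)\s+(\d+)\s+(\S+)', string)
name, value, unit = m.groups()
speed = float(value)  # captured values are strings, convert them as needed
print(name, speed, unit)  # Prüfgeschwindigkeit 1.0 mm/s

# (?P<label>...) names a group, so it can be accessed by label instead of index
m_named = re.search(r'(?P<name>\w+)\s+(?P<value>\d+)\s+(?P<unit>\S+)', string)
print(m_named['unit'])  # mm/s
```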


### Further reading:
For further reference, see the documentation of the re module: https://docs.python.org/3/library/re.html