**Regular expressions (regex)** are a powerful formalism for pattern matching in strings.

A regex is a sequence of characters that specifies a search pattern in text.

Regular expressions provide a declarative language to match patterns within strings.

They are commonly used for string validation, parsing, and transformation. 

Regexes are everywhere. Different languages like Python, PHP and Java all use regexes, but with minor differences.

Python has a built-in module called re, which can be used to work with regular expressions.

Import the re module:

In [1]:
import re

- When writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings.

- **Raw strings begin with a special prefix (r)** and tell Python not to process backslash escape sequences in the string, so the backslashes are passed through unchanged to the regular expression engine.

- This means that a pattern like \n\w can be written as r"\n\w" instead of "\\n\\w", as a regular string (and some other languages) would require, which is much easier to read.
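A minimal sketch of the difference, using only the standard re module. The sample text is made up for illustration:

```python
import re

# In a regular string, "\n" is a single newline character;
# in a raw string, r"\n" is two characters: a backslash and an "n".
plain_len = len("\n")
raw_len = len(r"\n")
print(plain_len, raw_len)

# Both spellings produce the same pattern text,
# but the raw string avoids doubling every backslash.
same_pattern = r"\d+\.\d+" == "\\d+\\.\\d+"
print(same_pattern)

# Hypothetical sample text, used only for this illustration.
price = re.findall(r"\d+\.\d+", "total: 12.50 USD")
print(price)
```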

# RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

| Function | Description | 
| --- | --- |
| findall | Returns a list containing all matches |
| search | Returns a Match object if there is a match anywhere in the string |
| split | Returns a list where the string has been split at each match |
| sub | Replaces one or many matches with a string |

## The findall() Function 

- The findall() function returns a list containing all matches.

- The list contains the matches in the order they are found.

- If no matches are found, an empty list is returned:

> matchList = re.findall(pattern, input_str, flags=0)

In [4]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [138]:
txt = 'this cost 1039190'
x = re.findall(r"\d", txt)
print(x)

['1', '0', '3', '9', '1', '9', '0']


In [7]:
#Return an empty list if no match was found.

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


## The search() Function

- The search() function searches the string for a match, and returns a Match object if there is a match.

- If there is more than one match, only the first occurrence of the match will be returned.
- If no matches are found, the value None is returned.

> matchObject = re.search(pattern, input_str, flags=0)

In [8]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [10]:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None


### re.match and re.search

**re.match()** checks for a match only at the beginning of the string, while **re.search()** checks for a match anywhere in the string.

In [68]:
m = re.match("c", "abcdef")    # No match
s = re.search("c", "abcdef")   # Match

In [70]:
print(m)

None


Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

In [12]:
r = re.match("c", "abcdef")    # No match
s = re.search("^c", "abcdef")  # No match
search = re.search("^a", "abcdef")  # Match

In [17]:
search

<re.Match object; span=(0, 1), match='a'>

However, in MULTILINE mode, match() still only matches at the beginning of the string, whereas using search() with a regular expression **beginning with '^' will match at the beginning of each line.**

In [18]:
r = re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
s = re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
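
The cell above prints nothing, so here is a small sketch with the same string that makes the results visible. The variable names are chosen for illustration only:

```python
import re

text = "A\nB\nX"

# match() is anchored at the start of the whole string, even in MULTILINE mode.
m = re.match("X", text, re.MULTILINE)
print(m)

# search() with ^ can match at the start of any line in MULTILINE mode.
s = re.search("^X", text, re.MULTILINE)
print(s.span())

# Without the flag, ^ only matches at the very start of the string.
no_flag = re.search("^X", text)
print(no_flag)
```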

### Match Object
- A Match Object is an object containing information about the search and the result.
- If there is no match, the value None will be returned, instead of the Match Object.

In [22]:
txt = "The rain in Yangon"
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

In [24]:
match = re.match("ai", txt)
if match:
    print('there is a match')
else:
    print('there is not a match')

there is not a match


The Match object has properties and methods used to retrieve information about the search, and the result:

- **.span()** returns a tuple containing the start and end positions of the match.
- **.string** returns the string passed into the function.
- **.group()** returns the part of the string where there was a match.

### span 
Print the position (start and end position) of the first match.

In [43]:
txt = "The rain in Yangon"
x = re.search(r"Y\w+", txt)
print(x.span())

(12, 18)


In [32]:
print("re.match examples")
print(re.match('super', 'superstition').span())
print(re.match('super', 'insuperable'))

print("re.search examples")
print(re.search('super', 'superstition').span())
print(re.search('super', 'insuperable').span())

re.match examples
(0, 5)
None
re.search examples
(0, 5)
(2, 7)


### string
Print the string passed into the function.

In [41]:
txt = "The rain in Yangon."
x = re.search(r"Y\w+", txt)
print(x.string)

The rain in Yangon.


### group

Print the part of the string where there was a match.

In [40]:
txt = "The rain in Spain"
x = re.search(r"S\w+", txt)
print(x.group())

Spain


In [37]:
x.group()

'Spain'

In [33]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist doctor")
print(m[0])      # The entire match

print(m[1])      # The first parenthesized subgroup.

print(m[2])       # The second parenthesized subgroup.

Isaac Newton
Isaac
Newton


In [47]:
match_object = re.match(r'(\w+)@(\w+)\.(\w+)', 'username@elysian.com')

# the entire match; match_object.group(0) is equivalent
print(match_object.group())

# the first parenthesized subgroup
print(match_object.group(1))

# the second parenthesized subgroup
print(match_object.group(2))

# the third parenthesized subgroup
print(match_object.group(3))

# a tuple of all matched subgroups
print(match_object.group(1, 2, 3))

username@elysian.com
username
elysian
com
('username', 'elysian', 'com')


## The split() Function

- The split() function returns a list where the string has been split at each match.
- You can control the number of splits by specifying the maxsplit parameter.

> re.split(pattern, string, maxsplit=0, flags=0)

In [11]:
#Split at each white-space character:

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [12]:
#Split the string only at the first occurrence
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


In [45]:
split_text = re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
split_text

['0', '3', '9']

In [46]:
split_text = re.split('[a-f]+', '0a3B9')
split_text

['0', '3B9']

## The sub() Function
- The sub() function replaces the matches with the text of your choice.
- You can control the number of replacements by specifying the count parameter.

> replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0)

In [13]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [14]:
#Replace the first 2 occurrences:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain
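
The replacement text may also refer back to captured groups with \1, \2, and so on. A minimal sketch, assuming a made-up date string; the variable names are illustrative only:

```python
import re

date = "2024-03-15"

# Reorder year-month-day into day/month/year using group references.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)
print(swapped)

# count limits how many matches are replaced (passed by keyword here).
limited = re.sub(r"\s", "_", "a b c d", count=2)
print(limited)
```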


**Note re Flags**

- re.IGNORECASE makes the pattern case-insensitive, so that it matches strings regardless of capitalization.

- re.MULTILINE matters when your input string contains newline characters (\n): it allows the start and end metacharacters (^ and \$ respectively) to match at the beginning and end of each line instead of only at the beginning and end of the whole input string.
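
A short sketch of both flags, separately and combined. The two-line sample text is an assumption made up for this illustration:

```python
import re

text = "Rain today\nrain tomorrow"

# Default: case-sensitive, and ^ matches only at the start of the string.
default = re.findall(r"^rain", text)

# IGNORECASE: "Rain" at the start of the string now matches.
ignore_case = re.findall(r"^rain", text, flags=re.IGNORECASE)

# MULTILINE: ^ also matches right after each newline.
multiline = re.findall(r"^rain", text, flags=re.MULTILINE)

# Flags can be combined with the bitwise OR operator.
both = re.findall(r"^rain", text, flags=re.IGNORECASE | re.MULTILINE)

print(default, ignore_case, multiline, both)
```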

# Metacharacters

Metacharacters are characters with a special meaning:

| Character | Description | 
| --- | --- |
| []| A set of characters |
| \ | Signals a special sequence (can also be used to escape special characters) |
| . | Any character (except newline character) |
| ^ | Starts with|
| $ | Ends with |
| * | Zero or more occurrences|
| + | One or more occurrences |
| ? | Zero or one occurrences |
| {} | Exactly the specified number of occurrences (or a range, such as {2,4}) |
| \| | Either or |
| () | Capture and group |

In [40]:
# []

txt = "The rain in Yangon"
#Find all lower case characters alphabetically between "a" and "g":
x = re.findall("[a-g]", txt)
print(x)

['e', 'a', 'a', 'g']


In [42]:
# \
txt = "It will cost 999 dollars"

#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)

['9', '9', '9']


In [46]:
# .
txt = "hello planet hell"

#Search for a sequence that starts with "he", followed by any three characters:
x = re.findall("he...", txt)
print(x)

['hello']


In [48]:
# ^
txt = "hello planet earth"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
    print("Yes, the string starts with 'hello'")
else:
    print("No match")

Yes, the string starts with 'hello'


In [52]:
# $
txt = "hello planet earth"

#Check if the string ends with 'earth':

x = re.findall("earth$", txt)
if x:
    print("Yes, the string ends with 'earth'")
else:
    print("No match")

Yes, the string ends with 'earth'


In [54]:
# *
txt = "hello planet mars"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":
x = re.findall("he.*o", txt)

print(x)

['hello']


In [55]:
# +
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)


['hello']


In [56]:
# ?
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":

x = re.findall("he.?o", txt)

print(x)

#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

[]


In [3]:
# {}
txt = "hello planet earth"

#Search for a sequence that starts with "ea", followed by exactly 2 (any) characters, and an "h":

x = re.findall("ea.{2}h", txt)

print(x)

['earth']
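
Besides an exact count, the braces also accept a range: {m,n} matches between m and n repetitions, and {m,} matches m or more. A short sketch with made-up sample text; it uses \b (word boundary, described under Special Sequences) so that each number is matched as a whole:

```python
import re

txt = "1 22 333 4444 55555"

# Exactly three digits.
exact = re.findall(r"\b\d{3}\b", txt)

# Between two and four digits.
ranged = re.findall(r"\b\d{2,4}\b", txt)

# Four or more digits.
at_least = re.findall(r"\b\d{4,}\b", txt)

print(exact, ranged, at_least)
```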


In [62]:
# |
txt = "The rain falls heavily today at Yangon."

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['falls']
Yes, there is at least one match!


# Sets

| Set | Description | 
| --- | --- |
| [arn] | Returns a match where one of the specified characters (a, r, or n) is present |
| [a-n] | Returns a match for any lower case character, alphabetically between a and n |
| [^arn] | Returns a match for any character EXCEPT a, r, and n |
| [0123] | Returns a match where any of the specified digits (0, 1, 2, or 3) are present |
| [0-9] | Returns a match for any digit between 0 and 9 |
| [0-5][0-9] | Returns a match for any two-digit number from 00 to 59 |
| [a-zA-Z] | Returns a match for any character alphabetically between a and z, lower case OR upper case |
| [+] | In sets, +, *, ., \|, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string |


In [75]:
txt = "There is a lot of traffic today"

#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)
print(x)
print('---------------------------')
y = re.findall("[a-n]", txt)
print(y)
print('---------------------------')
z = re.findall("[^arn]", txt)
print(z)

['r', 'a', 'r', 'a', 'a']
---------------------------
['h', 'e', 'e', 'i', 'a', 'l', 'f', 'a', 'f', 'f', 'i', 'c', 'd', 'a']
---------------------------
['T', 'h', 'e', 'e', ' ', 'i', 's', ' ', ' ', 'l', 'o', 't', ' ', 'o', 'f', ' ', 't', 'f', 'f', 'i', 'c', ' ', 't', 'o', 'd', 'y']


In [76]:
txt = "This coat's price is too high 90$."

d = re.findall("[0123]", txt)
print(d)
d = re.findall("[0-9]", txt)
print(d)

['0']
['9', '0']


In [79]:
txt = "I wake up today at 7:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:
x = re.findall("[0-5][0-9]", txt)
print(x)

['45']


In [80]:
txt = "I wake up today at 7:45 AM"

#Check if the string has any characters from a to z lower case, and A to Z upper case:
x = re.findall("[a-zA-Z]", txt)
print(x)

['I', 'w', 'a', 'k', 'e', 'u', 'p', 't', 'o', 'd', 'a', 'y', 'a', 't', 'A', 'M']


In [88]:
txt = "I wake up today at 7:45 AM"

#Check if the string has any + characters:

x = re.findall("[+]", txt)
print(x)

[]


# Special Sequences

| Character | Description | Example |
| --- | --- | --- |
| \A | Matches only at the beginning of the string | r"\AThe" |
| \b | Matches at a word boundary, that is, at the beginning or at the end of a word | r"\bain" r"ain\b" |
| \B | Matches at a position that is NOT a word boundary | r"\Bain" r"ain\B" |
| \d | Matches any digit character (0-9) | r"\d" |
| \D | Matches any character that is NOT a digit | r"\D" |
| \s | Matches any white space character | r"\s" |
| \S | Matches any character that is NOT white space | r"\S" |
| \w | Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character) | r"\w" |
| \W | Matches any character that is NOT a word character | r"\W" |
| \Z | Matches only at the end of the string | r"Spain\Z" |


- \d : Matches any decimal digit; this is equivalent to the class [0-9].
- \D : Matches any non-digit character; this is equivalent to the class [^0-9].
- \s : Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
- \S : Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
- \w : Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
- \W : Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].

These equivalences are exact for bytes patterns or when the re.ASCII flag is set; with str patterns, the sequences also cover the corresponding Unicode characters by default.
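
In Python 3, str patterns are Unicode-aware by default, so \d can match digits outside 0-9 unless re.ASCII is set. A minimal sketch, using one non-ASCII digit chosen purely for illustration:

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside 0-9.
txt = "7 \u0663"

# Default for str patterns: any Unicode decimal digit matches.
unicode_digits = re.findall(r"\d", txt)

# With re.ASCII, \d is restricted to 0-9.
ascii_digits = re.findall(r"\d", txt, flags=re.ASCII)

# The explicit class only ever matches 0-9.
bracket_digits = re.findall(r"[0-9]", txt)

print(len(unicode_digits), ascii_digits, bracket_digits)
```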

## Difference between ^ and \A, $ and \Z


They differ when matching strings with multiple lines in MULTILINE mode.

In MULTILINE mode, ^ matches at the start of the string and after each line break. \A only ever matches at the start of the string.

In MULTILINE mode, $ matches at the end of the string and before each line break. \Z only ever matches at the end of the string.

In [103]:
txt = "It is raining heavily in Yangon.\nIt is not raining at all in Mandalay."

#Check if the string starts with "It":
#Multiline mode
x = re.findall(r"\AIt", txt, re.MULTILINE)
print(x)
y = re.findall("^It", txt, re.MULTILINE)
print(y)

#no Multiline mode
print('No multiline')
x = re.findall(r"\AIt", txt)
print(x)
y = re.findall("^It", txt)
print(y)

['It']
['It', 'It']
No multiline
['It']
['It']


In [108]:
#Multiline mode
txt = "He is a student.\n She is also a student."
print('end string')
x = re.findall(r"student.\Z", txt, re.MULTILINE)
print(x)
y = re.findall("student.$", txt, re.MULTILINE)
print(y)

end string
['student.']
['student.', 'student.']
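
One further difference shows up even without MULTILINE: $ also matches just before a newline that ends the string, while \Z matches only at the very end. A small sketch with a made-up string:

```python
import re

txt = "the end\n"

# $ tolerates a single trailing newline at the end of the string.
dollar = re.search(r"end$", txt)
print(dollar.span())

# \Z requires the true end of the string, which here comes after "\n".
strict = re.search(r"end\Z", txt)
print(strict)
```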


## \b and \B

'\b' matches at a word boundary, so it anchors the pattern to the beginning or end of a word.

'\B' matches only where there is no word boundary, so at that position the pattern must sit inside a word.

In [95]:
txt = "The rain in Spain"

#Check if "ain" is present at the beginning of a word, then at the end of a word:

x = re.findall(r"\bain", txt)
print(x)
x = re.findall(r"ain\b", txt)
print(x)

[]
['ain', 'ain']


In [121]:
x = re.findall(r"\Bain", txt)
print(x)
x = re.findall(r"ain\B", txt)
print(x)

['ain', 'ain']
[]


## Other Sequences

In [124]:
txt = "I wake up at 7:45 AM today."

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)
print(x)
x = re.findall(r"\D", txt)
print(x)

['7', '4', '5']
['I', ' ', 'w', 'a', 'k', 'e', ' ', 'u', 'p', ' ', 'a', 't', ' ', ':', ' ', 'A', 'M', ' ', 't', 'o', 'd', 'a', 'y', '.']


In [126]:
txt = "I wake up at 7:45 AM today."

#Return a match at every white-space character:

x = re.findall(r"\s", txt)
print(x)
x = re.findall(r"\S", txt)
print(x)

[' ', ' ', ' ', ' ', ' ', ' ']
['I', 'w', 'a', 'k', 'e', 'u', 'p', 'a', 't', '7', ':', '4', '5', 'A', 'M', 't', 'o', 'd', 'a', 'y', '.']


In [128]:
txt = "I wake up at 7:45 AM today."

#Return a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)
print(x)
x = re.findall(r"\W", txt)
print(x)

['I', 'w', 'a', 'k', 'e', 'u', 'p', 'a', 't', '7', '4', '5', 'A', 'M', 't', 'o', 'd', 'a', 'y']
[' ', ' ', ' ', ' ', ':', ' ', ' ', '.']


# re.compile

We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us reuse a pattern without rewriting it.

> pattern = re.compile(pattern, flags=0)

In [134]:
import re
pattern = re.compile("..eep")
text = "I wanna sleep"

result = pattern.findall(text)
print(result)
result2 = re.findall("..eep", text)
print(result2)

['sleep']
['sleep']
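
A compiled pattern pays off when the same expression is applied to several strings. A minimal sketch; the pattern, flag choice, and sample sentences are assumptions made up for illustration:

```python
import re

# Compile once, with a flag, and reuse the pattern object.
word = re.compile(r"\b\w+eep\b", flags=re.IGNORECASE)

# Hypothetical sample sentences.
sentences = ["I wanna sleep", "Sheep SLEEP deeply", "Nothing here"]

# The pattern object offers the same methods as the module-level functions.
results = [word.findall(s) for s in sentences]
first = word.search(sentences[0]).group()

print(results)
print(first)
```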


In [None]:
# Named groups: (?P<name>...) gives a group a name

match = re.search('(?P<name>.*) (?P<phone>.*)', 'John 123456')
print(match.group('name'))
print(match.group('phone'))
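
The cell above was not executed. As a sketch of what named groups provide, here is the same sample string with a slightly stricter pattern (\w+ and \d+ instead of .*, which is an assumption about the intended fields):

```python
import re

match = re.search(r"(?P<name>\w+) (?P<phone>\d+)", "John 123456")

# Access a group by its name.
name = match.group("name")

# Named groups are still numbered, so index access also works.
phone = match.group(2)

# groupdict() returns all named groups as a dictionary.
fields = match.groupdict()

print(name, phone, fields)
```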