# <center>Regular Expressions</center>

In this tutorial, you will learn about regular expressions (RegEx), and use Python's re module to work with RegEx (with the help of examples).

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. 

**Pattern Example:**

`^a...s$`

The above code defines a RegEx pattern. The pattern means: any five-character string starting with **a** and ending with **s**.

A pattern defined using RegEx can be used to match against a string.

|Expression|String|Matched?|
|----------|------|--------|
||abs|No match|
||alias| Match|
|**^a...s$**|abyss| Match|
||Alias| No match (**A** instead of **a**)|
||An abacus| No match|

Python has a module named re to work with RegEx.  Here's an example:

In [None]:
import re

pattern = r'^a...s$'  # Pattern: any five-character string that starts with a and ends with s.
test_string = 'alias'  # Starts with a and ends with s.
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")	

Here, we used the **re.match( )** function to check whether **pattern** matches **test_string**. The function returns a match object if the match succeeds and **None** if it does not, in which case the **else** branch prints "Search unsuccessful."
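
One detail worth seeing in code: **re.match( )** only tries to match at the beginning of the string, whereas **re.search( )** (covered later in this tutorial) scans the whole string. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

pattern = r'a...s'   # no anchors here, so the function decides where to look
text = 'An abyss'

at_start = re.match(pattern, text)    # tries position 0 only
anywhere = re.search(pattern, text)   # scans the whole string

print(at_start)           # None
print(anywhere.group())   # abyss
```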

## Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the previous example, **^**, **.** and **$** are metacharacters.

`^a...s$`

### MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

|Metacharacters|Name|Metacharacters|Name|
|--------------|----|----|----|
|**[ ]**| Square Brackets|**{ }**|Braces|
|**.**|Period|**( )**|Group|
|**^**|Caret|**\**|Backslash|
|**$**|Dollar|**\|**|Alternation|
|**\***|Star| | |
|**+**|Plus| | |
|**?**|Question Mark| | |

### [ ] - Square brackets

Square brackets specify a set of characters you wish to match.

|Expression|String|Matched?|
|----------|------|--------|
|**[abc]**|a|1 match|
|**[abc]**|ac|2 matches|
|**[abc]**|Hey Jude|No match|
|**[abc]**|abc de ca|5 matches|

Here, **[abc]** will match if the string you are trying to match contains any of **a**, **b** or **c**.

You can also specify a range of characters using a dash (**-**) inside square brackets.

* **[a-e]** is the same as **[abcde]**.
* **[1-4]** is the same as **[1234]**.
* **[0-39]** is the same as **[01239]**.

You can complement (invert) the character set by using the caret (**^**) symbol at the start of a square bracket.

* **[^abc]** means any character except **a** or **b** or **c**.
* **[^0-9]** means any non-digit character.
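
The character sets above can be tried out with **re.findall( )**, which is covered later in this tutorial. This is a small sketch; the sample strings and variable names are chosen only for illustration.

```python
import re

# findall returns every non-overlapping match as a list of strings.
set_matches = re.findall(r'[abc]', 'abc de ca')
range_matches = re.findall(r'[a-e]', 'Hey Jude')
inverted_matches = re.findall(r'[^0-9]', 'a1b22c')

print(set_matches)        # ['a', 'b', 'c', 'c', 'a']
print(range_matches)      # ['e', 'd', 'e']
print(inverted_matches)   # ['a', 'b', 'c']
```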

### . - Period

A period matches any single character (except newline **'\n'**).

|Expression|String|Matched?|
|----------|------|--------|
|**..**|a|No match|
|**..**|ac|1 match|
|**..**|acd|1 match|
|**..**|acde|2 matches (contains 4 characters)|
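
The rows of this table can be reproduced in a few lines. This is only a sketch; the variable names are illustrative.

```python
import re

pairs = re.findall(r'..', 'acde')       # two complete pairs of characters
odd_length = re.findall(r'..', 'acd')   # the trailing 'd' has no partner
around_newline = re.findall(r'.', 'a\nb')   # the period skips the newline

print(pairs)            # ['ac', 'de']
print(odd_length)       # ['ac']
print(around_newline)   # ['a', 'b']
```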

### ^ - Caret

The caret symbol **^** is used to check if a string starts with a certain character.

|Expression|String|Matched?|
|----------|------|--------|
|**^a**|a|1 match|
|**^a**|abc|1 match|
|**^a**|bac|No match|
|**^ab**|abc|1 match|
|**^ab**|acb|No match (starts with a but not followed by b)|

### $ - Dollar

The dollar symbol **$** is used to check if a string **ends with** a certain character.

|Expression|String|Matched?|
|----------|------|--------|
|**a\$**|a|1 match|
|**a\$**|formula|1 match|
|**a\$**|cab|No match|
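
The caret and dollar anchors can be checked together in a short sketch. The list of test cases below is taken from the two tables; the variable names are illustrative.

```python
import re

cases = [
    (r'^ab', 'abc'),      # starts with ab
    (r'^ab', 'acb'),      # starts with a, but b does not follow
    (r'a$', 'formula'),   # ends with a
    (r'a$', 'cab'),       # ends with b
]

# True where the pattern is found, False otherwise.
outcomes = [re.search(pattern, text) is not None for pattern, text in cases]
print(outcomes)   # [True, False, True, False]
```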

### * - Star

The star symbol **\*** matches zero or more occurrences of the pattern to its left.

|Expression|String|Matched?|
|----------|------|--------|
|**ma*n**|mn|1 match|
|**ma*n**|man|1 match|
|**ma*n**|maaan|1 match|
|**ma*n**|main|No match (**a** is not followed by **n**)|
|**ma*n**|woman|1 match|

### + - Plus

The plus symbol **+** matches one or more occurrences of the pattern to its left.

|Expression|String|Matched?|
|----------|------|--------|
|**ma+n**|mn|No match (no a character)|
|**ma+n**|man|1 match|
|**ma+n**|maaan|1 match|
|**ma+n**|main|No match (**a** is not followed by **n**)|
|**ma+n**|woman|1 match|

### ? - Question Mark

The question mark symbol **?** matches zero or one occurrence of the pattern to its left.

|Expression|String|Matched?|
|----------|------|--------|
|**ma?n**|mn|1 match|
|**ma?n**|man|1 match|
|**ma?n**|maaan|No match (more than one a character)|
|**ma?n**|main|No match (a is not followed by n)|
|**ma?n**|woman|1 match|
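
The star, plus and question mark tables use the same five strings, so the three quantifiers can be compared side by side. This is a minimal sketch; `words` and `results` are names chosen for this example.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

# For each quantifier, collect the words in which the pattern is found.
results = {}
for pattern in (r'ma*n', r'ma+n', r'ma?n'):
    results[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, results[pattern])

# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```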

### { } - Braces

Consider this pattern: **{n,m}**. It means at least **n** and at most **m** repetitions of the pattern to its left.

|Expression|String|Matched?|
|----------|------|--------|
|**a{2,3}**|abc dat|No match|
|**a{2,3}**|abc daat|1 match (at d**aa**t)|
|**a{2,3}**|aabc daaat|2 matches (at **aa**bc and d**aaa**t)|
|**a{2,3}**|aabc daaaat|2 matches (at **aa**bc and d**aaa**at)|

Let's try one more example. The RegEx **[0-9]{2,4}** matches at least 2 digits but not more than 4 digits.

|Expression|String|Matched?|
|----------|------|--------|
|**[0-9]{2,4}**|ab123csde|1 match (match at ab**123**csde)|
|**[0-9]{2,4}**|12 and 345673|3 matches (12, 3456, 73)|
|**[0-9]{2,4}**|1 and 2|No match|
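
To see exactly which substrings the brace quantifier picks up, the table rows can be run through **re.findall( )**. This is a short sketch with illustrative variable names.

```python
import re

# The engine takes as many repetitions as allowed (up to m) before moving on.
letters = re.findall(r'a{2,3}', 'aabc daaaat')
digits = re.findall(r'[0-9]{2,4}', '12 and 345673')
too_short = re.findall(r'[0-9]{2,4}', '1 and 2')

print(letters)     # ['aa', 'aaa']
print(digits)      # ['12', '3456', '73']
print(too_short)   # []
```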

### | - Alternation

Vertical bar **|** is used for alternation (**or** operator).

|Expression|String|Matched?|
|----------|------|--------|
|**a\|b**|cde|No match|
|**a\|b**|ade|1 match (match at **a**de)|
|**a\|b**|acdbea|3 matches (at **a**cd**b**e**a**)|

Here, **a|b** matches any string that contains either **a** or **b**.
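
A quick sketch of alternation in Python follows. The second sample string is made up to show that the alternatives can be longer than one character.

```python
import re

single_chars = re.findall(r'a|b', 'acdbea')
whole_words = re.findall(r'cat|dog', 'hotdog and catnip')

print(single_chars)   # ['a', 'b', 'a']
print(whole_words)    # ['dog', 'cat']
```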

### ( ) - Group

Parentheses **( )** are used to group sub-patterns. For example, **(a|b|c)xz** matches any string that contains **a**, **b** or **c** followed by **xz**.

|Expression|String|Matched?|
|----------|------|--------|
|**(a\|b\|c)xz**|ab xz|No match|
|**(a\|b\|c)xz**|abxz|1 match (match at a**bxz**)|
|**(a\|b\|c)xz**|axz cabxz|2 matches (at **axz** ca**bxz**)|
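
Grouping also lets you read back what the parentheses matched. The sketch below uses **re.search( )**, **re.finditer( )** and the match object's **group( )** method, which are described later in this tutorial; the variable names are illustrative.

```python
import re

match = re.search(r'(a|b|c)xz', 'abxz')
print(match.group())    # bxz  (the whole match)
print(match.group(1))   # b    (the part captured by the parentheses)

# finditer yields one match object per match, so group() gives the full text.
full_matches = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]
print(full_matches)     # ['axz', 'bxz']
```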

### \ - Backslash

The backslash **\\** is used to escape various characters, including all metacharacters. For example,

**\$a** matches if a string contains **\$** followed by **a**. Here, **\$** is not interpreted by the RegEx engine in a special way.

If you are unsure whether a symbol has a special meaning, you can put **\\** in front of it. This makes sure the symbol is not treated in a special way. (Do not do this with letters or digits, because sequences such as **\d** have their own meaning.)
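
The effect of escaping is easiest to see with a metacharacter such as the plus sign. In this sketch the sample text is made up, and `re.escape()` is the standard-library helper that adds the backslashes for you.

```python
import re

text = 'what is 1+1?'

# Unescaped, + means "one or more of the preceding 1", so this looks for '11'.
unescaped = re.search(r'1+1', text)

# Escaping the plus sign makes it a literal character.
escaped = re.search(r'1\+1', text)

# re.escape() escapes every special character in the given string.
auto_escaped = re.search(re.escape('1+1'), text)

print(unescaped)              # None
print(escaped.group())        # 1+1
print(auto_escaped.group())   # 1+1
```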

## Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:


### \A - Matches if the specified characters are at the start of a string.

|Expression|String|Matched?|
|----------|------|--------|
|**\Athe**|the sun|Match|
|**\Athe**|In the sun|No match|

### \b - Matches if the specified characters are at the beginning or end of a word.

|Expression|String|Matched?|
|----------|------|--------|
|**\bfoo**|football|Match|
|**\bfoo**|a football|Match|
|**\bfoo**|afootball|No match|
|**foo\b**|the foo|Match|
|**foo\b**|the afoo test|Match|
|**foo\b**|the afootest|No match|



### \B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

|Expression|String|Matched?|
|----------|------|--------|
|**\Bfoo**|football|No match|
|**\Bfoo**|a football|No match|
|**\Bfoo**|afootball|Match|
|**foo\B**|the foo|No match|
|**foo\B**|the afoo test|No match|
|**foo\B**|the afootest|Match|
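
Word boundaries are most useful for picking out whole words. The following sketch reuses rows from the two tables and adds one made-up sentence for the whole-word case.

```python
import re

at_word_start = re.search(r'\bfoo', 'a football') is not None
inside_word = re.search(r'\bfoo', 'afootball') is not None
not_at_boundary = re.search(r'\Bfoo', 'afootball') is not None

# \b on both sides keeps only the standalone word 'cat'.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')

print(at_word_start, inside_word, not_at_boundary)   # True False True
print(whole_words)                                   # ['cat', 'cat']
```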

### \d - Matches any decimal digit. Equivalent to [0-9]

|Expression|String|Matched?|
|----------|------|--------|
|**\d**|12abc3|3 matches (at **12**abc**3**)|
|**\d**|Python|No match|

### \D - Matches any non-digit character. Equivalent to [^0-9]

|Expression|String|Matched?|
|----------|------|--------|
|**\D**|1ab34"50|3 matches (at 1**ab**34**"**50)|
|**\D**|1345|No match|

### \s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

|Expression|String|Matched?|
|----------|------|--------|
|**\s**|Python RegEx|1 match|
|**\s**|PythonRegEx|No match|

### \S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

|Expression|String|Matched?|
|----------|------|--------|
|**\S**|a b|2 matches (at **a** **b**)|
|**\S**| |No match|

### \w - Matches any alphanumeric character (digits and letters). Equivalent to [a-zA-Z0-9_]. Note that the underscore _ is also treated as an alphanumeric character.

|Expression|String|Matched?|
|----------|------|--------|
|**\w**|12&": ;c|3 matches (at **12**&": ;**c**)|
|**\w**|%"> !|No match|

### \W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

|Expression|String|Matched?|
|----------|------|--------|
|**\W**|1a2%c|1 match (at 1a2**%**c)|
|**\W**|Python|No match|
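
The character-class sequences above can be checked against the strings from their tables in one pass. This is a sketch only; `samples` and `found` are names invented for the example.

```python
import re

samples = [
    (r'\d', '12abc3'),
    (r'\D', '1ab34"50'),
    (r'\s', 'Python RegEx'),
    (r'\w', '12&": ;c '),
    (r'\W', '1a2%c'),
]

# Map each special sequence to the list of characters it matches.
found = {pattern: re.findall(pattern, text) for pattern, text in samples}
for pattern, matches in found.items():
    print(pattern, matches)

# \d ['1', '2', '3']
# \D ['a', 'b', '"']
# \s [' ']
# \w ['1', '2', 'c']
# \W ['%']
```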

### \Z - Matches if the specified characters are at the end of a string.

|Expression|String|Matched?|
|----------|------|--------|
|**Python\Z**|I like Python|1 match|
|**Python\Z**|I like Python Programming|No match|
|**Python\Z**|Python is fun.|No match|
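
The start-of-string and end-of-string sequences can be verified with the strings from their tables. A minimal sketch, with illustrative variable names:

```python
import re

# \A: the match must begin at the very start of the string.
starts = [re.search(r'\Athe', s) is not None
          for s in ('the sun', 'In the sun')]

# \Z: the match must finish at the very end of the string.
ends = [re.search(r'Python\Z', s) is not None
        for s in ('I like Python', 'I like Python Programming', 'Python is fun.')]

print(starts)   # [True, False]
print(ends)     # [True, False, False]
```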

## Python RegEx

Python has a module named re to work with regular expressions. To use it, we need to import the module.

`import re`

The module defines several functions and constants to work with RegEx.

### re.findall( )

The **re.findall( )** method returns a list of strings containing all matches.

#### Example:

In [None]:
# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

If the pattern is not found, re.findall() returns an empty list.

### re.split( )

The **re.split( )** method splits the string wherever there is a match and returns a list of the resulting substrings.

#### Example:

In [None]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

If the pattern is not found, **re.split( )** returns a list containing the original string.

You can pass the **maxsplit** argument to the **re.split( )** method. It's the maximum number of splits that will occur.

In [None]:
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

**Note:** The default value of **maxsplit** is 0, meaning that all possible splits are made.

### re.sub( )

The syntax of **re.sub( )** is:

`re.sub(pattern, replace, string)`

The method returns a string where matched occurrences are replaced with the content of **replace** variable.

#### Example:

In [None]:
# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

If the pattern is not found, **re.sub( )** returns the original string.

You can pass **count** as a fourth parameter to the **re.sub( )** method. If omitted, it defaults to 0, which replaces all occurrences.

In [None]:
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

### re.subn( )

The **re.subn( )** method is similar to **re.sub( )** except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

#### Example:

In [None]:
# Program to replace all whitespace and count the substitutions
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# replacement text
replace = 'mr perfect'

new_string = re.subn(pattern, replace, string) 
print(new_string)

### re.search( )

The **re.search( )** method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, **re.search( )** returns a match object; if not, it returns **None**.

`match = re.search(pattern, string)`

#### Example: 

In [None]:
import re

string = "Mr Perfect is fun"

# check if 'Mr Perfect' is at the beginning
match = re.search(r'\AMr Perfect', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

Here, match contains a match object.

## Match object

You can get the methods and attributes of a match object using the **dir( )** function.
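
As a small sketch of that idea, the snippet below lists the public names of a match object. The filter on leading underscores is just a convenience chosen for this example, and the exact list can vary between Python versions.

```python
import re

match = re.search(r'(\d{3}) (\d{2})', '39801 356, 2102 1111')

# Keep only the public names; names starting with an underscore are internal.
public_names = [name for name in dir(match) if not name.startswith('_')]
print(public_names)
```

The printed list includes the names discussed in this section, such as `group`, `start`, `end`, `span`, `re` and `string`.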

Some of the commonly used methods and attributes of match objects are:

### match.group( )

The **group( )** method returns the part of the string where there is a match.

#### Example:

In [1]:
import re

string = '39801 356, 2102 1111'

# Three digits followed by a space followed by two digits
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

801 35


Here, the **match** variable contains a match object.

Our pattern **(\d{3}) (\d{2})** has two subgroups, `(\d{3})` and `(\d{2})`. You can get the part of the string matched by each of these parenthesized subgroups. Here's how:

In [2]:
match.group(1)

'801'

In [3]:
match.group(2)

'35'

In [4]:
match.group(1, 2)

('801', '35')

In [5]:
match.groups()

('801', '35')

### match.start( ), match.end( ) and match.span( )

The **start( )** method returns the index of the start of the matched substring. Similarly, **end( )** returns the index just past the end of the matched substring.

In [6]:
match.start()

2

In [7]:
match.end()

8

The **span( )** method returns a tuple containing the start and end index of the matched part.

In [8]:
match.span()

(2, 8)

### match.re and match.string

The **re** attribute of a match object returns the regular expression object that produced the match. Similarly, the **string** attribute returns the string that was passed in.

In [9]:
match.re

re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [10]:
match.string

'39801 356, 2102 1111'

### Using r prefix before RegEx

When the **r** or **R** prefix is used before a string literal, it marks a raw string. For example, **'\n'** is a new line whereas **r'\n'** means two characters: a backslash **\\** followed by **n**.

Backslash **\\** is used to escape various characters including all metacharacters. With the **r** prefix, Python leaves **\\** in the string as an ordinary character instead of treating it as the start of an escape sequence, so the RegEx engine receives the backslash unchanged.

#### Example: Raw string using r prefix

In [None]:
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)
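
The difference between a normal and a raw string literal can also be checked directly. This is a short sketch; the variable names are illustrative.

```python
import re

print(len('\n'))     # 1: a single newline character
print(len(r'\n'))    # 2: a backslash followed by n

# Without the r prefix, the backslash has to be doubled to reach the RegEx engine.
same_pattern = (r'\d+' == '\\d+')
print(same_pattern)  # True

digits = re.findall(r'\d+', 'hello 12 hi 89')
print(digits)        # ['12', '89']
```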

# <center>THE END</center>

## Practice Exercises

### Exercise 1: Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

In [None]:
# Write your solution here

### Exercise 2: Write a Python program that matches a string that has an a followed by zero or more b's.

In [None]:
# Write your solution here

### Exercise 3: Write a Python program that matches a string that has an a followed by one or more b's.

In [None]:
# Write your solution here

### Exercise 4: Write a Python program that matches a string that has an a followed by zero or one 'b'.

In [None]:
# Write your solution here

### Exercise 5: Write a Python program that matches a string that has an a followed by three 'b'.

In [None]:
# Write your solution here

### Exercise 6: Write a Python program that matches a string that has an a followed by two to three 'b'.

In [None]:
# Write your solution here

### Exercise 7: Write a Python program to find sequences of lowercase letters joined with an underscore.

In [None]:
# Write your solution here

### Exercise 8: Write a Python program to find the sequences of one upper case letter followed by lower case letters.

In [None]:
# Write your solution here

### Exercise 9: Write a Python program that matches a string that has an 'a' followed by anything, ending in 'b'.

In [None]:
# Write your solution here

### Exercise 10: Write a Python program that matches a word at the beginning of a string.

In [None]:
# Write your solution here

### Exercise 11: Write a Python program that matches a word at the end of a string, with optional punctuation.

In [None]:
# Write your solution here

### Exercise 12: Write a Python program that matches a word containing 'z'.

In [None]:
# Write your solution here

### Exercise 13: Write a Python program that matches a word containing 'z', not at the start or end of the word.

In [None]:
# Write your solution here

### Exercise 14: Write a Python program to match a string that contains only upper and lowercase letters, numbers, and underscores.

In [None]:
# Write your solution here

### Exercise 15: Write a Python program where a string will start with a specific number.

In [None]:
# Write your solution here

### Exercise 16: Write a Python program to remove leading zeros from an IP address.

In [None]:
# Write your solution here

### Exercise 17: Write a Python program to check for a number at the end of a string.

In [None]:
# Write your solution here

### Exercise 18: Write a Python program to search for numbers (0-9) of length 1 to 3 in a given string.

In [None]:
# Write your solution here

### Exercise 19: Write a Python program to search for some literal strings in a string.

In [None]:
# Write your solution here

### Exercise 20: Write a Python program to search for a literal string in a string and also find the location within the original string where the pattern occurs.

In [None]:
# Write your solution here

## Solutions

### Exercise 1:

In [None]:
import re
def is_allowed_specific_char(string):
    charRe = re.compile(r'[^a-zA-Z0-9]')
    string = charRe.search(string)
    return not bool(string)

print(is_allowed_specific_char("ABCDEFabcdef123450")) 
print(is_allowed_specific_char("*&%@#!}{"))

### Exercise 2:

In [None]:
import re
def text_match(text):
        patterns = 'ab*'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("ac"))
print(text_match("abc"))
print(text_match("abbc"))

### Exercise 3:

In [None]:
import re
def text_match(text):
        patterns = 'ab+'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("ab"))
print(text_match("abc"))

### Exercise 4:

In [None]:
import re
def text_match(text):
        patterns = 'ab?'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("ab"))
print(text_match("abc"))
print(text_match("abbc"))
print(text_match("aabbc"))

### Exercise 5:

In [None]:
import re
def text_match(text):
        patterns = 'ab{3}'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("abbb"))
print(text_match("aabbbbbc"))

### Exercise 6:

In [None]:
import re
def text_match(text):
        patterns = 'ab{2,3}'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("ab"))
print(text_match("aabbbbbc"))

### Exercise 7: 

In [None]:
import re
def text_match(text):
        patterns = '^[a-z]+_[a-z]+$'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("aab_cbbbc"))
print(text_match("aab_Abbbc"))
print(text_match("Aaab_abbbc"))

### Exercise 8:

In [None]:
import re
def text_match(text):
        patterns = '[A-Z][a-z]+'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("aab_cbbbc"))
print(text_match("aab_Abbbc"))
print(text_match("Aaab_abbbc"))

### Exercise 9:

In [None]:
import re
def text_match(text):
        patterns = 'a.*?b$'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("aabbbbd"))
print(text_match("aabAbbbc"))
print(text_match("accddbbjjjb"))

### Exercise 10:

In [None]:
import re
def text_match(text):
        patterns = r'^\w+'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("The quick brown fox jumps over the lazy dog."))
print(text_match(" The quick brown fox jumps over the lazy dog."))

### Exercise 11:

In [None]:
import re
def text_match(text):
        patterns = r'\w+\S*$'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("The quick brown dog jumps over the lazy cat."))
print(text_match("The quick brown dog jumps over the lazy cat. "))
print(text_match("The quick brown dog jumps over the lazy cat "))

### Exercise 12:

In [None]:
import re
def text_match(text):
        patterns = r'\w*z\w*'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("The quick brown dog jumps over the lazy cat."))
print(text_match("Python Exercises."))

### Exercise 13: 

In [None]:
import re
def text_match(text):
        patterns = r'\Bz\B'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("The quick brown dog jumps over the lazy cat."))
print(text_match("Python Exercises."))

### Exercise 14:

In [None]:
import re
def text_match(text):
        patterns = '^[a-zA-Z0-9_]*$'
        if re.search(patterns,  text):
                return 'Found a match!'
        else:
                return('Not matched!')

print(text_match("The quick brown dog jumps over the lazy cat."))
print(text_match("Python_Exercises_1"))

### Exercise 15:

In [None]:
import re
def match_num(string):
    text = re.compile(r"^5")
    if text.match(string):
        return True
    else:
        return False
print(match_num('5-2345861'))
print(match_num('6-2345861'))

### Exercise 16:

In [None]:
import re
ip = "216.08.094.196"
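# \b0+ finds zeros at the start of a number; (?=\d) requires a digit after them, so a lone 0 is kept
string = re.sub(r'\b0+(?=\d)', '', ip)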
print(string)

### Exercise 17:

In [None]:
import re
def end_num(string):
    text = re.compile(r".*[0-9]$")
    if text.match(string):
        return True
    else:
        return False

print(end_num('abcdef'))
print(end_num('abcdef6'))

### Exercise 18:

In [None]:
import re
results = re.finditer(r"([0-9]{1,3})", "Exercises number 1, 12, 13, and 345 are important")
print("Number of length 1 to 3")
for n in results:
     print(n.group(0))

### Exercise 19:

In [None]:
import re
patterns = [ 'cat', 'dog', 'cow' ]
text = 'The quick brown dog jumps over the lazy cat.'
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text), end=' ')
    if re.search(pattern,  text):
        print('Matched!')
    else:
        print('Not Matched!')

### Exercise 20:

In [None]:
import re
pattern = 'dog'
text = 'The quick brown dog jumps over the lazy cat.'
match = re.search(pattern, text)
s = match.start()
e = match.end()
print('Found "%s" in "%s" from %d to %d' % (match.re.pattern, match.string, s, e))