# RegEx-Regular Expressions

#### A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,
 ^a...s$


##### The above code defines a RegEx pattern: any five-character string starting with a and ending with s.
#### A pattern defined using RegEx can be used to match against a string.


|Expression|	String|	Matched?|
|----------|----------|---------|
|^a...s$   | abs	  | No match|
|          | alias    | Match   |
|          | abyss	  | Match   |
|          | Alias	  | No match|
|          | An abacus| No match|

In [55]:
# Python has a module named re to work with RegEx
import re

### RegEx Functions
#### The re module offers a set of functions that allow us to search a string for a match:

|Function| Description |
|--------|-------------|
|findall |	Returns a list containing all matches|
| search |	Returns a Match object if there is a match anywhere in the string |
| split	 |  Returns a list where the string has been split at each match|
| sub	 |  Replaces one or many matches with a string|

In [28]:
import re

# Metacharacters are used to specify regular expressions; here ^ and $ are metacharacters.
pattern = '^a...s$'
test_string = 'abysaaas'   # eight characters, so the five-character pattern cannot match
result = re.match(pattern, test_string)
print(result)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")


None
Search unsuccessful.
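
As a quick check of the table at the top of this section, the sketch below runs the same pattern against each sample string. The names `samples` and `results` are made up for this illustration.

```python
import re

pattern = r'^a...s$'
# Sample strings taken from the table in the introduction
samples = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# Map each string to True (match) or False (no match)
results = {s: bool(re.search(pattern, s)) for s in samples}
for s, matched in results.items():
    print(f'{s!r}: {"Match" if matched else "No match"}')
```

Only 'alias' and 'abyss' should be reported as matches, which agrees with the table.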


#### MetaCharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:


|Character|Description	| Example |
|---------|-------------|---------|
|[]	|A set of characters|"[a-m]"|
| \ 	|Signals a special sequence (can also be used to escape special characters)	| "\d"	|
|.	|Any character (except newline character)	|"he..o"	|
|^	|Starts with	|"^hello"	|
|$	|Ends with      |"world$" |
|*	|Zero or more occurrences|"aix*"	|
|+	|One or more occurrences|"aix+"	|
|?	|Zero or one occurrence|"ma?n"	|
|{}	|Exactly the specified number of occurrences	|"al{2}"	|
| \| 	|Either or	|"falls\|stays"	|
|()	|Capture and group|"(a\|b\|c)xz"|


### [ ] - Square brackets

#### Square brackets specify a set of characters you wish to match
| Expression |	String | Matched?|
|------------|---------|---------|
| [abc]      | 	a	   |1 match  |
|            |  ac     |2 matches|
|            | Hey Jude|No match |
|            |abc de ca|5 matches|

In [57]:
import re

txt = "The rain in Spain and Newyork"
# Find all occurrences of the lowercase letters a, b or c:
x = re.findall("[abc]", txt)
print(x)

['a', 'a', 'a']


In [58]:
txt = "The rain in Spain and Newyork"
# You can complement (invert) the character set by placing a caret ^ at the start of the square brackets.
x = re.findall("[^abc]", txt)
print(x)

['T', 'h', 'e', ' ', 'r', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'i', 'n', ' ', 'n', 'd', ' ', 'N', 'e', 'w', 'y', 'o', 'r', 'k']


In [None]:
# You can also specify a range of characters using - inside square brackets.

# [a-e] is the same as [abcde].
# [1-4] is the same as [1234].
# [0-39] is the same as [01239]
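
The range rules above can be tried out with a short sketch. The test strings 'badge' and '0123456789' are arbitrary choices for this illustration.

```python
import re

# [a-e] behaves like [abcde]: the g in 'badge' is skipped
letters = re.findall(r'[a-e]', 'badge')

# [0-39] is the range 0-3 plus the single character 9
digits = re.findall(r'[0-39]', '0123456789')

print(letters)
print(digits)
```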

### . - Period
#### A period matches any single character (except newline '\n')

|Expression|String|Matched?|
|----------|------|--------|
|    ..    |a     |No match |
|          |ac    |1 match  |
|          |acd   |1 match  |
|          |acde  |2 matches|

In [59]:
import re

txt = "hello world"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
print(x)


['hello']


#### ^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.


|Expression|	String|	Matched?|
|----------|----------|---------|
|^a	|a|	1 match|
||abc|	1 match|
||bac|	No match|
|^ab|	abc|	1 match|
||acb	|No match |
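
The caret table can be reproduced with a minimal sketch like the one below. The variable names are hypothetical, and `re.search` is used so that the ^ anchor itself does the work.

```python
import re

# Strings from the table above, checked against ^a and ^ab
caret_a = [bool(re.search(r'^a', s)) for s in ['a', 'abc', 'bac']]
caret_ab = [bool(re.search(r'^ab', s)) for s in ['abc', 'acb']]

print(caret_a)
print(caret_ab)
```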

#### $ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

|Expression|	String|	Matched?|
|----------|----------|---------|
|a$	|a	|1 match|
||formula	|1 match|
||cab	|No match|
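
A small sketch of the dollar anchor, using the strings from the table. The last check goes slightly beyond the table: in Python, $ also matches just before a newline that ends the string. The variable names are made up for this example.

```python
import re

# Strings from the table above, checked against a$
ends_with_a = {s: bool(re.search(r'a$', s)) for s in ['a', 'formula', 'cab']}
print(ends_with_a)

# $ also matches just before a newline that ends the string
before_newline = bool(re.search(r'a$', 'formula\n'))
print(before_newline)
```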

#### * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

|Expression|	String|	Matched?|
|----------|----------|---------|
|ma*n|	mn	|1 match|
||man	|1 match|
||maaan|	1 match|
||main	|No match |
||woman|	1 match|
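
To see what the star actually matches, the following sketch runs `re.findall` over the strings from the table. The dictionary name `star_matches` is an assumption made for this illustration.

```python
import re

# Strings from the table above
strings = ['mn', 'man', 'maaan', 'main', 'woman']

# findall returns the matched text, or an empty list when nothing matches
star_matches = {s: re.findall(r'ma*n', s) for s in strings}
for s, found in star_matches.items():
    print(s, found)
```

Note that for 'woman' the match is the substring 'man', not the whole string.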

#### + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

|Expression|	String|	Matched?|
|----------|----------|---------|
|ma+n	|mn	|No match|
||man	|1 match|
||maaan|	1 match|
||main	|No match |
||woman|	1 match|

#### ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

|Expression|	String|	Matched?|
|----------|----------|---------|
|ma?n|	mn	|1 match|
||man	|1 match|
||maaan|	No match |
||main	|No match |
||woman	|1 match|
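
The three quantifiers *, + and ? differ only in how many repetitions they allow. The sketch below compares them side by side on three of the strings from the tables. The names `patterns` and `comparison` are hypothetical.

```python
import re

strings = ['mn', 'man', 'maaan']
patterns = {'*': r'ma*n', '+': r'ma+n', '?': r'ma?n'}

# For each quantifier, record which strings contain a match
comparison = {
    q: [s for s in strings if re.search(p, s)]
    for q, p in patterns.items()
}
print(comparison)
```

Here * accepts all three strings, + rejects 'mn' (no a at all), and ? rejects 'maaan' (more than one a).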

#### {} - Braces

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left.

|Expression|	String|	Matched?|
|----------|----------|---------|
|a{2,3}	|abc dat	|No match|
||abc daat|	1 match (at daat)|
||aabc daaat|	2 matches (at aabc and daaat)|
||aabc daaaat	|2 matches (at aabc and daaaat)|

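Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 but not more than 4 consecutive digits. There must be no space after the comma: Python treats {2, 4} as literal text rather than a repetition count.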

|Expression|	String|	Matched?|
|----------|----------|---------|
|[0-9]{2,4}|	ab123csde	|1 match (match at ab123csde)|
||12 and 345673|	3 matches (12, 3456, 73)|
||1 and 2|	No match|
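
The brace examples above can be checked with a short sketch. The variable names are made up here, and the last line illustrates what happens when a space is placed inside the braces.

```python
import re

# a{2,3}: runs of two or three a's (the fourth a in 'daaaat' is left over)
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')

# [0-9]{2,4}: runs of two to four digits, taken greedily
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')

# With a space after the comma the braces are treated as literal characters
spaced = re.findall(r'[0-9]{2, 4}', '12 and 345673')

print(a_runs)
print(digit_runs)
print(spaced)
```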

#### | - Alternation

The vertical bar | is used for alternation (the "or" operator).

|Expression|	String|	Matched?|
|----------|----------|---------|
|a\|b	|cde	|No match|
||ade	|1 match (match at ade)|
||acdbea|	3 matches (at acdbea)|

Here, a|b matches any string that contains either a or b.
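
A minimal sketch of alternation, using the three strings from the table. The name `alternation_results` is an assumption for this example.

```python
import re

# Each string from the table above, searched for either a or b
alternation_results = {s: re.findall(r'a|b', s) for s in ['cde', 'ade', 'acdbea']}
print(alternation_results)
```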

#### () - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.

|Expression|	String|	Matched?|
|----------|----------|---------|
|(a\|b\|c)xz |	ab xz |	No match|
|          |abxz      |	1 match (match at abxz)|
|          |axz cabxz |2 matches (at axz cabxz)|
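
One detail worth knowing when experimenting with groups: when the pattern contains a single capturing group, `re.findall` returns only the text captured by that group, not the whole match. The sketch below, with hypothetical variable names, shows both views using `re.finditer` for the full matches.

```python
import re

pattern = r'(a|b|c)xz'

# Full matched text for each occurrence
full_matches = [m.group() for m in re.finditer(pattern, 'axz cabxz')]

# findall returns only what the group captured
captured = re.findall(pattern, 'axz cabxz')

# The space between b and xz prevents a match
no_match = re.search(pattern, 'ab xz')

print(full_matches)
print(captured)
print(no_match)
```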

### Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

|Character|Description|Example|
|---------|-----------|-------|
|\A	|Returns a match if the specified characters are at the beginning of the string|"\AThe"|
|\b	|Returns a match where the specified characters are at the beginning or at the end of a word (the "r" at the beginning makes sure that the string is treated as a "raw string")|r"\bain", r"ain\b"|
|\B	|Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" at the beginning makes sure that the string is treated as a "raw string")|r"\Bain", r"ain\B"|
|\d	|Returns a match where the string contains digits (numbers from 0-9)|"\d"|
|\D	|Returns a match where the string DOES NOT contain digits|"\D"|
|\s	|Returns a match where the string contains a white space character|"\s"|
|\S	|Returns a match where the string DOES NOT contain a white space character|"\S"|
|\w	|Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)|"\w"|
|\W	|Returns a match where the string DOES NOT contain any word characters|"\W"|
|\Z	|Returns a match if the specified characters are at the end of the string|"Spain\Z"|
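
A few of these special sequences can be tried out with the sketch below. The sample strings and variable names are illustrative choices, not part of the table. Raw strings are used throughout so that the backslashes reach the RegEx engine unchanged.

```python
import re

txt = "The rain in Spain"

# \b: "ain" at the beginning of a word (none here) or at the end of a word
word_start = re.findall(r'\bain', txt)
word_end = re.findall(r'ain\b', txt)

# \B: "ain" present but not at the beginning of a word
not_word_start = re.findall(r'\Bain', txt)

# \d matches single digits; \w+ matches runs of word characters
digits = re.findall(r'\d', 'Room 42')
words = re.findall(r'\w+', 'snake_case, 2 words!')

print(word_start, word_end, not_word_start)
print(digits, words)
```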

####  Example 1: re.findall()

In [62]:


# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'   # raw string, so \d is passed to the RegEx engine unchanged

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


#### Example 2: re.split()

In [63]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']


In [64]:
string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit=1: split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

['Twelve:', ' Eighty nine:89 Nine:9.']


#### Example 3: re.sub()


In [65]:

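# Program to remove all whitespace
import re

# the trailing backslash continues the literal onto the next line,
# so the only newline in this string is the explicit \n
string = 'abc 12\
de 23 \n f45 6'

# matches every run of whitespace characters
pattern = r'\s+'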

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

abc12de23f456


In [66]:

import re

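# the trailing backslash continues the literal onto the next line
string = 'abc 12\
de 23 \n f45 6'

# matches every run of whitespace characters
pattern = r'\s+'
replace = ''

# count=1: replace only the first match
new_string = re.sub(pattern, replace, string, count=1)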
print(new_string)

# Output:
# abc12de 23
# f45 6

abc12de 23 
 f45 6


#### Example 4: re.subn()

In [67]:
# Program to remove all whitespace and count the replacements
import re

# the trailing backslash continues the literal onto the next line
string = 'abc 12\
de 23 \n f45 6'

# matches every run of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

('abc12de23f456', 4)


#### Example 5: re.search()

In [69]:
import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")  

# Output: pattern found inside the string

pattern found inside the string


### Match object
You can get the methods and attributes of a match object using the dir() function.

Some of the commonly used methods and attributes of match objects are group(), groups(), start(), end(), span(), re and string.
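
One way to explore a match object is to filter the output of dir() down to the public names. This is a minimal sketch; the pattern and string are borrowed from Example 6, and `public_names` is a name made up for this illustration.

```python
import re

match = re.search(r'(\d{3}) (\d{2})', '39801 356, 2102 1111')

# Names without a leading underscore are the public methods and attributes
public_names = [name for name in dir(match) if not name.startswith('_')]
print(public_names)
```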

#### Example 6: Match object


In [71]:
import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

# Output: 801 35

801 35


In [75]:
print(match.group(1))
print(match.group(2))
print(match.group(1, 2))
print(match.groups())

801
35
('801', '35')
('801', '35')


In [77]:
print(match.start())
print(match.end())
print(match.span())

2
8
(2, 8)


match.re and match.string:
The re attribute of a match object returns the compiled regular expression object. Similarly, the string attribute returns the string that was passed to the search.

In [78]:
print(match.re)

re.compile('(\\d{3}) (\\d{2})')


In [79]:
print(match.string)

39801 356, 2102 1111


#### Example 7: Raw string using r prefix

In [81]:
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']

['\n', '\r']


#### Important Links :
#### https://www.programiz.com/python-programming/regex
#### https://www.w3schools.com/python/python_regex.asp
#### https://regex101.com/