There are several ways to search for matches: finditer, findall, match, and search

In [3]:
# Method 1 : Giving the pattern directly to finditer
import re 
test = '117abciujoi8987abc7465'
match = re.finditer(r'abc',test)
for i in match:
    print(i)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(15, 18), match='abc'>


In [3]:
# Method 2 : Findall
import re 
test = '117abciujoi8987abc7465'
match = re.findall(r'abc',test)
for i in match:
    print(i)

abc
abc


In [5]:
# Method 3 : Match
import re 
test = '117abciujoi8987abc7465'
match = re.match(r'abc',test)
print(match)   # match() only looks for abc at the beginning of the test string, so this prints None

None


In [6]:
# Method 4 : Search
import re 
test = '117abciujoi8987abc7465'
match = re.search(r'abc',test)
print(match)    # search() returns the first match found anywhere in the string

<re.Match object; span=(3, 6), match='abc'>


In [14]:
# Method 5 : Compiling the pattern explicitly (the most commonly used approach)
import re 
test = '117abciujoi8987abc7465'
pattern = re.compile(r'abc')
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(15, 18), match='abc'>
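
As a side-by-side summary of the methods above, here is a minimal sketch that calls all four on one compiled pattern. The variable names (`at_start`, `first`, `all_text`, `all_spans`) are only illustrative.

```python
import re

test = '117abciujoi8987abc7465'
pattern = re.compile(r'abc')

# match() only succeeds if the pattern matches at index 0
at_start = pattern.match(test)

# search() returns the first match found anywhere in the string
first = pattern.search(test)

# findall() returns the matched text as a list of strings
all_text = pattern.findall(test)

# finditer() yields Match objects, so the positions are available too
all_spans = [m.span() for m in pattern.finditer(test)]

print(at_start)        # None
print(first.span())    # (3, 6)
print(all_text)        # ['abc', 'abc']
print(all_spans)       # [(3, 6), (15, 18)]
```

Compiling once and reusing the pattern object keeps the pattern in one place when it is needed by several calls.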


There are four useful methods we can call on each match object above: group, start, end, and span

In [13]:
import re 
test = '117abciujoi8987abc7465'
pattern = re.compile(r'abc')
matches = pattern.finditer(test)
for  match in matches:
    print(match.span(),match.start(),match.end(),match.group())

(3, 6) 3 6 abc
(15, 18) 15 18 abc


### Metacharacters:
All metacharacters: . ^ $ * + ? { } [ ] \ | ( )
Metacharacters need to be escaped (with a backslash, \\) if we actually want to search for the character itself.

- **.** Any character (except newline character) "he..o"<br>
- **^**  Starts with "^hello"<br>
- **\$**  Ends with "world$"<br>
- **\***  Zero or more occurrences "aix*"<br>
- **\+**  One or more occurrences "aix+"<br>
- **{ }**  Exactly the specified number of occurrences "al{2}"<br>
- **[]** A set of characters "[a-m]"<br>
- **\\** Signals a special sequence (can also be used to escape special characters) "\d"<br>
- **|** Either or "falls|stays"<br>
- **( )** Capture and group<br>
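
To illustrate the note on escaping, here is a small sketch. The sample string `price` and the names `loose`, `strict`, and `literal` are made up for this example; `re.escape` is the standard-library helper that escapes a literal string for us.

```python
import re

price = 'Total: 3.50 (approx.)'

# Unescaped '.' matches any character, so '3.50' also matches '3x50'
loose = re.compile(r'3.50')
# Escaped '\.' matches only a literal dot
strict = re.compile(r'3\.50')

print(bool(loose.search('3x50')))     # True
print(bool(strict.search('3x50')))    # False
print(strict.search(price).group())   # 3.50

# re.escape() adds the backslashes needed to match a string literally
literal = re.escape('(approx.)')
print(re.search(literal, price).span())   # (12, 21)
```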


In [19]:
import re 
test = '117'
pattern = re.compile('.')
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(1, 2), match='1'>
<re.Match object; span=(2, 3), match='7'>


In [16]:
import re 
test = '117abciujoi8987abc7465'
pattern = re.compile('^117')
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(0, 3), match='117'>


In [18]:
import re 
test = '117abciujoi8987abc7465'
pattern = re.compile('7465$')
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(18, 22), match='7465'>
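
The remaining metacharacters from the list can be tried the same way. This is a minimal sketch that reuses the list's own sample patterns; the test strings are assumptions chosen for illustration.

```python
import re

text = 'The rain in Spain falls mainly on the plain; all is well'

# '|' matches either alternative
either = re.findall(r'falls|stays', text)
print(either)          # ['falls']

# '{2}' repeats the preceding character exactly twice, so 'al{2}' means 'all'
double_l = re.findall(r'al{2}', text)
print(double_l)        # ['all', 'all']  (inside 'falls' and the word 'all')

# '*' allows zero or more x, '+' requires at least one x
star = re.findall(r'aix*', 'ai aix aixx')
plus = re.findall(r'aix+', 'ai aix aixx')
print(star)            # ['ai', 'aix', 'aixx']
print(plus)            # ['aix', 'aixx']

# '( )' captures parts of the match as groups
groups = re.search(r'(rain) in (Spain)', text).groups()
print(groups)          # ('rain', 'Spain')
```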


### More Metacharacters / Special Sequences<br>
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

- \d : Matches any decimal digit; for ASCII text this is equivalent to the class [0-9] (str patterns also match other Unicode digits unless the re.ASCII flag is set).
- \D : Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].
- \s : Matches any whitespace character (space, tab, newline, and so on).
- \S : Matches any non-whitespace character.
- \w : Matches any alphanumeric (word) character; for ASCII text this is equivalent to the class [a-zA-Z0-9_] (str patterns also match Unicode letters and digits unless the re.ASCII flag is set).
- \W : Matches any non-alphanumeric character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].
- \b : Returns a match where the specified characters are at the beginning or at the end of a word: r"\bain", r"ain\b"
- \B : Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word: r"\Bain", r"ain\B"
- \A : Returns a match if the specified characters are at the beginning of the string: r"\AThe"
- \Z : Returns a match if the specified characters are at the end of the string: r"Spain\Z"<br>

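Notice the pattern here: each uppercase sequence matches the exact opposite of its lowercase counterpart (\D vs \d, \S vs \s, \W vs \w, \B vs \b). The exceptions are \A and \Z, which anchor the start and the end of the string.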

In [23]:
import re 
test = '17soi7ab5'
pattern = re.compile(r'\d')  # raw string, so the backslash reaches the regex engine unchanged
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(1, 2), match='7'>
<re.Match object; span=(5, 6), match='7'>
<re.Match object; span=(8, 9), match='5'>


In [24]:
import re 
test = '17soi7ab5'
pattern = re.compile(r'\D')  # raw string, so the backslash reaches the regex engine unchanged
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(2, 3), match='s'>
<re.Match object; span=(3, 4), match='o'>
<re.Match object; span=(4, 5), match='i'>
<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(7, 8), match='b'>
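
The other special sequences from the list behave as follows. This is a small sketch using the sample patterns from the list; the sentence `text` is an assumption chosen so that those patterns have something to match.

```python
import re

text = 'The rain in Spain'

starts_word = re.findall(r'\bain', text)   # 'ain' at the start of a word
ends_word = re.findall(r'ain\b', text)     # 'ain' at the end of a word
inside_word = re.findall(r'\Bain', text)   # 'ain' NOT at the start of a word
spaces = re.findall(r'\s', text)           # every whitespace character
words = re.findall(r'\w+', text)           # runs of word characters

print(starts_word)   # []
print(ends_word)     # ['ain', 'ain']
print(inside_word)   # ['ain', 'ain']
print(spaces)        # [' ', ' ', ' ']
print(words)         # ['The', 'rain', 'in', 'Spain']

# \A and \Z anchor the very beginning and the very end of the string
print(bool(re.search(r'\AThe', text)))     # True
print(bool(re.search(r'Spain\Z', text)))   # True
```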


### Sets in regex:<br>
We can use a set of characters to define a pattern in regex.

In [26]:
import re 
test = '17soi7ab5'
pattern = re.compile('[7sa]')
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(1, 2), match='7'>
<re.Match object; span=(2, 3), match='s'>
<re.Match object; span=(5, 6), match='7'>
<re.Match object; span=(6, 7), match='a'>


In [42]:
import re 
test = '17soi7ab5'
pattern = re.compile('[1-5a-d]')  # ranges inside a set; matching is case-sensitive
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(7, 8), match='b'>
<re.Match object; span=(8, 9), match='5'>
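
Sets have a few more behaviours worth trying. This is a short sketch, not part of the original notebook: a leading ^ negates the set, most metacharacters are literal inside a set, and the re.IGNORECASE flag lifts the case sensitivity noted above. The variable names are illustrative.

```python
import re

test = '17soi7ab5'

# A leading '^' negates the set: everything except the digits 0-9
non_digits = re.findall(r'[^0-9]', test)
print(non_digits)      # ['s', 'o', 'i', 'a', 'b']

# Inside a set, '.' and '+' lose their special meaning
literals = re.findall(r'[.+]', '1.5+2')
print(literals)        # ['.', '+']

# Ranges are case-sensitive unless re.IGNORECASE is passed
sensitive = re.findall(r'[a-d]', 'aBcD')
insensitive = re.findall(r'[a-d]', 'aBcD', re.IGNORECASE)
print(sensitive)       # ['a', 'c']
print(insensitive)     # ['a', 'B', 'c', 'D']
```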


### Quantifiers:
- \* - 0 or more
- \+ - 1 or more
- ? - 0 or 1 (optional)
- {n} - exactly n occurrences
- {min,max} - between min and max occurrences, e.g. {1,100}

In [44]:
import re 
test = '17soi7ab5'
pattern = re.compile(r'[a-d]?\d*')  # optional letter a-d, then zero or more digits; both parts may be empty
matches = pattern.finditer(test)
for  match in matches:
    print(match)

<re.Match object; span=(0, 2), match='17'>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 6), match='7'>
<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(7, 9), match='b5'>
<re.Match object; span=(9, 9), match=''>
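
The empty matches above appear because every part of the pattern is optional. Here is a minimal sketch of the other quantifiers, with a variant that requires at least one digit; the string `numbers` and the variable names are assumptions for illustration.

```python
import re

numbers = '1 22 333 4444 55555'

# {3} : exactly three digits (\b keeps longer numbers from matching partially)
exactly_three = re.findall(r'\b\d{3}\b', numbers)
print(exactly_three)   # ['333']

# {2,4} : between two and four digits
two_to_four = re.findall(r'\b\d{2,4}\b', numbers)
print(two_to_four)     # ['22', '333', '4444']

# ? : the 'u' is optional
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)       # ['color', 'colour']

# Using + instead of * requires at least one digit, so no empty matches
test = '17soi7ab5'
non_empty = re.findall(r'[a-d]?\d+', test)
print(non_empty)       # ['17', '7', 'b5']
```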


In [49]:
import re 
names = '''
hi 
hello
1232
Mr Sri
Ms Sriya
Mrs Sri
Mr.Sri
Ms. Sriya
'''
pattern = re.compile(r'Ms*r*s*[.\s]+\w+') # Ms*r*s* loosely covers Mr, Ms and Mrs (it would also accept M, Mss, ...)
matches = pattern.finditer(names)
for  match in matches:
    print(match)

<re.Match object; span=(16, 22), match='Mr Sri'>
<re.Match object; span=(23, 31), match='Ms Sriya'>
<re.Match object; span=(32, 39), match='Mrs Sri'>
<re.Match object; span=(40, 46), match='Mr.Sri'>
<re.Match object; span=(47, 56), match='Ms. Sriya'>
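
The prefix Ms*r*s* is looser than an explicit list of titles, because each star also allows zero or many repetitions. The sketch below compares it with an alternation. The `strict` pattern and the extra test strings 'M Sri' and 'Mss Sri' are assumptions added for illustration; (?: ) groups the alternatives without capturing them.

```python
import re

titles = ['Mr Sri', 'Ms Sriya', 'Mrs Sri', 'Mr.Sri', 'Ms. Sriya',
          'M Sri', 'Mss Sri']   # the last two are not real titles

loose = re.compile(r'Ms*r*s*[.\s]+\w+')
strict = re.compile(r'(?:Mrs|Mr|Ms)[.\s]+\w+')

# fullmatch() requires the whole string to match the pattern
loose_ok = [t for t in titles if loose.fullmatch(t)]
strict_ok = [t for t in titles if strict.fullmatch(t)]

print(loose_ok)    # all seven strings, including 'M Sri' and 'Mss Sri'
print(strict_ok)   # ['Mr Sri', 'Ms Sriya', 'Mrs Sri', 'Mr.Sri', 'Ms. Sriya']
```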


In [75]:
import re 
emails = '''
hi 
hello
1232
sriya111@gmail.com
sriya2004@gmail.com
msriya@yahoo.co.uk
sriya_44@yahoo.com
'''
pattern = re.compile(r'\w+@\w+\.(com|co\.uk|org)')  # raw string keeps \w and \. intact
matches = pattern.finditer(emails)
for  match in matches:
    print(match)

<re.Match object; span=(16, 34), match='sriya111@gmail.com'>
<re.Match object; span=(35, 54), match='sriya2004@gmail.com'>
<re.Match object; span=(55, 73), match='msriya@yahoo.co.uk'>
<re.Match object; span=(74, 92), match='sriya_44@yahoo.com'>


We can further use the group method to extract specific pieces of information by adding capture groups (parentheses) to the pattern above

In [74]:
import re 
emails = '''
hi 
hello
1232
sriya111@gmail.com
sriya2004@gmail.com
msriya@yahoo.co.uk
sriya_44@yahoo.com
'''
pattern = re.compile(r'(\w+)@(\w+)\.(com|co\.uk|org)')  # three capture groups: name, domain, ending
matches = pattern.finditer(emails)
for  match in matches:
    print(f'name is {match.group(1)}')
    print(f'domain name is {match.group(2)}')
    print(f'ending is {match.group(3)}')

name is sriya111
domain name is gmail
ending is com
name is sriya2004
domain name is gmail
ending is com
name is msriya
domain name is yahoo
ending is co.uk
name is sriya_44
domain name is yahoo
ending is com
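
Numbered groups get harder to follow as a pattern grows. One alternative, sketched here as an optional variation, is to name the groups with (?P<name>...). The group names `name`, `domain`, and `ending` are choices made for this example.

```python
import re

pattern = re.compile(
    r'(?P<name>\w+)@(?P<domain>\w+)\.(?P<ending>com|co\.uk|org)'
)
m = pattern.search('msriya@yahoo.co.uk')

print(m.group(0))          # msriya@yahoo.co.uk  (group 0 is the whole match)
print(m.group('name'))     # msriya
print(m.group('domain'))   # yahoo
print(m.groups())          # ('msriya', 'yahoo', 'co.uk')
print(m.groupdict())       # {'name': 'msriya', 'domain': 'yahoo', 'ending': 'co.uk'}
```

Named groups can still be read by number, so m.group(1) and m.group('name') return the same text.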


### Modifying strings in regex:
There are two methods for modifying a string with regex: split and sub. Both return a new object and leave the original string unchanged.

In [71]:
import re 
test = '123abc456abc123ABC'
pattern = re.compile('abc')
splitted = pattern.split(test)
print(splitted)

['123', '456', '123ABC']


In [73]:
import re 
test = 'Hello World'
pattern = re.compile('World')
substituted = pattern.sub('Sriya',test)
print(substituted)

Hello Sriya
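
Both methods accept a few optional arguments. This is a minimal sketch, assuming we also want the uppercase ABC to count as a separator; the flag, the `maxsplit` and `count` limits, and the backreference example are additions for illustration.

```python
import re

test = '123abc456abc123ABC'
pattern = re.compile(r'abc', re.IGNORECASE)   # now 'ABC' matches as well

parts = pattern.split(test)
print(parts)            # ['123', '456', '123', '']  (trailing match leaves '')

first_split = pattern.split(test, maxsplit=1)
print(first_split)      # ['123', '456abc123ABC']

replaced = pattern.sub('-', test)
print(replaced)         # 123-456-123-

replaced_once = pattern.sub('-', test, count=1)
print(replaced_once)    # 123-456abc123ABC

# \1 and \2 in the replacement refer back to the captured groups
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'Hello World')
print(swapped)          # World Hello
```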


Final example

In [87]:
import re 
url = '''
hi 
hello
1232
http://sriya.com
https://m-sriya.com
http://msriya.org
https://google.com
'''
pattern = re.compile(r'https?://[0-9a-zA-Z_-]+\.[a-z]+')  # s? = optional s; the hyphen goes last in the set so it is literal
matches = pattern.finditer(url)
for  match in matches:
    print(match)

<re.Match object; span=(16, 32), match='http://sriya.com'>
<re.Match object; span=(33, 52), match='https://m-sriya.com'>
<re.Match object; span=(53, 70), match='http://msriya.org'>
<re.Match object; span=(71, 89), match='https://google.com'>
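
Combining this with groups and sub, here is a sketch that pulls out just the domain part of each URL. It also shows why an optional s (s?) is a tighter fit for the scheme than s*, which accepts any number of s characters. The names `strict`, `loose`, and `domains` and the test string 'httpsss://x.com' are assumptions for illustration.

```python
import re

urls = '''
http://sriya.com
https://m-sriya.com
http://msriya.org
https://google.com
'''

# Group 1 is the host name, group 2 is the ending including the dot
strict = re.compile(r'https?://([0-9a-zA-Z_-]+)(\.[a-z]+)')
loose = re.compile(r'https*://([0-9a-zA-Z_-]+)(\.[a-z]+)')

domains = [m.group(1) + m.group(2) for m in strict.finditer(urls)]
print(domains)   # ['sriya.com', 'm-sriya.com', 'msriya.org', 'google.com']

# sub() with backreferences strips the scheme from every URL in the text
stripped = strict.sub(r'\1\2', urls)
print(stripped)

# s* accepts a malformed scheme, s? does not
print(bool(loose.search('httpsss://x.com')))    # True
print(bool(strict.search('httpsss://x.com')))   # False
```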
