### Regular Expression

- `re.search` 
`search()` looks for the first match anywhere in the string. It returns a `Match` object when the pattern is found and `None` otherwise.

Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.  

- `re.match` 
`match()` succeeds only if the pattern matches at the very start of the string. It also returns a `Match` object, or `None` when there is no match.

- `re.findall` 
`findall()` finds all non-overlapping matches anywhere in the string and returns them as a list of substrings (or of groups, if the pattern contains capturing groups).

- `re.finditer` 
`finditer()` works like `findall()` but returns an iterator that produces `Match` instances instead of the strings returned by `findall()`.
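
As a quick side-by-side of the four functions described above, here is a minimal sketch. The sample string `text` is made up for illustration and is not part of the original notebook.

```python
import re

text = 'cat bat cat'  # hypothetical sample string

# search: first match anywhere in the string
m = re.search('bat', text)
print(m.group(), m.span(), m.re.pattern)

# match: succeeds only if the pattern matches at position 0
at_start = re.match('cat', text)
not_at_start = re.match('bat', text)   # None, 'bat' is not at the start
print(at_start.span(), not_at_start)

# findall: list of matched substrings
all_cats = re.findall('cat', text)
print(all_cats)

# finditer: iterator of Match objects, here reduced to their spans
cat_spans = [mm.span() for mm in re.finditer('cat', text)]
print(cat_spans)
```

Because `search()` and `match()` return `None` on failure, it is usually safest to test the result before calling `.group()` or `.span()` on it.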

### `re.compile`

In [91]:
import re

content  = 'mayank 1sam 11128 peter1 smith dog cat'
pattern = '[a-z]+'

p = re.compile(pattern)

In [94]:
%pprint
[x for x in dir(p) if '_' not in x]

Pretty printing has been turned OFF


['findall', 'finditer', 'flags', 'fullmatch', 'groupindex', 'groups', 'match', 'pattern', 'scanner', 'search', 'split', 'sub', 'subn']

In [95]:
p.findall(content)

['mayank', 'sam', 'peter', 'smith', 'dog', 'cat']

In [115]:
a = p.finditer(content)
for match in a:
    print(match.start(), match.end(), match.string[match.start():match.end()])

0 6 mayank
8 11 sam
18 23 peter
25 30 smith
31 34 dog
35 38 cat


#### Flags

|Flag      |Meaning      | 
|----------|-------------|
|ASCII, A  |Make escapes such as `\w`, `\b`, `\s` and `\d` match only ASCII characters with the respective property|
|DOTALL, S |Make `.` match any character, including newline|
|IGNORECASE, I|Do case-insensitive matching|
|MULTILINE, M|Multiline matching, affecting `^` and `$`|
|VERBOSE, X|Enable verbose REs, which can be laid out more cleanly and can include comments|
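
The effect of each flag in the table can be seen in a small sketch like the one below. The sample strings are invented for illustration; only the flags themselves come from the `re` module.

```python
import re

text = 'Hello\nworld'  # hypothetical two-line sample

# IGNORECASE: 'hello' also matches 'Hello'
ignorecase = re.findall('hello', text, re.IGNORECASE)

# DOTALL: by default '.' stops at a newline; with the flag it matches it
no_dotall = re.findall('o.w', text)
dotall = re.findall('o.w', text, re.DOTALL)

# MULTILINE: '^' also matches right after each newline
multiline = re.findall(r'^\w', text, re.MULTILINE)

# VERBOSE: whitespace and '#' comments inside the pattern are ignored
time_re = re.compile(r"""
    \d{2}   # two digits (hours)
    :       # a literal colon
    \d{2}   # two digits (minutes)
""", re.VERBOSE)
verbose_hits = time_re.findall('at 09:30 and 17:45')

# ASCII: \w stops matching non-ASCII letters such as 'é'
unicode_w = re.findall(r'\w+', 'caf\u00e9')
ascii_w = re.findall(r'\w+', 'caf\u00e9', re.ASCII)

print(ignorecase, no_dotall, dotall, multiline, verbose_hits, unicode_w, ascii_w)
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.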

**Note:** The following characters have special meaning:

`\ ^ $ . | ? * + ( ) { } [ ]`


pattern|explanation
-------|-------------
`.` (dot)| any character except newline. Specify `re.DOTALL` to include newline
`^m`     | m at the start of the string. In `re.MULTILINE` mode, also matches m immediately after each newline
`m$`     | m at the end of the string (or just before a trailing newline). In `re.MULTILINE` mode, also matches m before each newline
`m..`    | m followed by any 2 characters other than newline, anywhere in the string
`ab*`    | a followed by 0 or more occurrences of b
`ab+`    | a followed by 1 or more occurrences of b
`ab?`    | a followed by 0 or 1 occurrence of b
`ab{3}`  | a followed by exactly 3 b
`ab{2,3}`| a followed by 2 to 3 b
`\d`     | any digit
`\D`     | any character that is not a digit
`\s`     | any whitespace character (e.g. space, `\t`, `\n`)
`\S`     | any character that is not whitespace
`\w`     | any word character (0 to 9, a to z, A to Z, _; also other Unicode letters and digits unless `re.ASCII` is set)
`\W`     | any character that is not a word character
`[a-zA-Z0-9]` | any single letter (a to z or A to Z) or digit (0 to 9)
`[^a-zA-Z0-9]`| any single character not in the set
`[ab]`   | either a or b
`[0-9]`  | any single digit in the range
`[0-5][0-9]`| any two-digit number from 00 to 59
`[0-9]+` | one or more consecutive digits; being greedy, it matches the longest run available
a &#124; b| either a or b

**Note**

 - `a|b` matches either a or b.
 - `*` is greedy and tries to consume as much of the input as possible. To control its greediness, we use `*?`. In fact, if `?` is added to any quantifier (`+`, `*`, `?` itself, or `{m,n}`), it matches in a non-greedy manner.
 - Be careful: `[a-z]` matches any lowercase letter, but `[a\-z]` matches only `a`, `-` or `z` (`\` escapes the `-` here).
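
The last two points can be checked with a short sketch. The sample strings below are assumptions chosen for illustration.

```python
import re

markup = '<b>bold</b> and <i>italic</i>'  # hypothetical sample

# Greedy: '.*' runs to the LAST '>' it can reach, giving one long match
greedy = re.findall(r'<.*>', markup)

# Non-greedy: '.*?' stops at the FIRST '>' it can reach
lazy = re.findall(r'<.*?>', markup)

# '[a-z]' is a range; '[a\-z]' is the three literal characters a, - and z
range_class = re.findall(r'[a-z]', 'a-m-z')
literal_class = re.findall(r'[a\-z]', 'a-m-z')

print(greedy)
print(lazy)
print(range_class, literal_class)
```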



#### Raw String

Consider the following 2 examples. In both, we are trying to match a literal backslash (`\`) in a string. The problem is that `\` has a special meaning both in Python string literals and in regular expressions: in both, it is the escape character.

Suppose we have the string `'This\nThat'`. Normally `\n` is read as a single character (newline), but what if we want two separate characters (`\` and `n`)? One way is to escape the backslash with another backslash and write `'This\\nThat'`, which contains exactly one backslash. To match that backslash, the regular expression itself must be `\\` (an escaped backslash), and each of those two backslashes must in turn be escaped in a normal string literal, so the pattern is written as `'\\\\'`.

The solution is to use a raw string, that is, a string written in the form `r'some text'`, in which Python does not process backslashes. The same pattern then becomes `r'\\'`.

In [86]:
pattern = '\\\\'
content = 'mayan\\xx'

re.findall(pattern, content)

['\\']

In [84]:
pattern = r'\\'
content = r'mayan\xx'

re.findall(pattern, content)

['\\']

#### Examples of RE

In [5]:
import re
f = open('file.txt','r') # the first line of the file is 'amayank'
content = f.read()
pattern = 'mayank'
match = re.search(pattern,content) #only first match. Anywhere in the string
match

<_sre.SRE_Match object; span=(1, 7), match='mayank'>

In [15]:
print('Found "{}"\n in\n "{}"\n from position {} to position {}'
      .format(match.re.pattern, match.string, match.start(), match.end()))

Found "mayank"
 in
 "amayank
mayank
mayank

abbbasaacfafafbbbbaaaag

1___2222"
 from position 1 to position 7


In [24]:
import re
f = open('file.txt','r')
content = f.read()
a = re.match('mayank',content) #only first match. Start of the string
print(a)

None


In [25]:
import re
f = open('file.txt','r')
content = f.read()
a = re.search('ayank',content) #only first match. Anywhere in the string
a.group()

'ayank'

In [26]:
a = re.findall('mayank',content) #all instances of match
print(a)

['mayank', 'mayank', 'mayank']


In [27]:
a = re.findall('ayank',content) #all instances of match
print(a)

['ayank', 'ayank', 'ayank']


In [36]:
f = open('file.txt','r')
content = f.readlines() #list of strings
a = re.match('ama', content[0]) #content[0] is first line as string
a.re.pattern,a.string, a.start(), a.end()

('ama', 'amayank\n', 0, 3)

In [75]:
f = open('file.txt','r')
content = f.read() #entire file content in a single string
a = re.match('my', content)
a.group()

'my'

In [76]:
type(content)

str

### `re.sub`

In [67]:
string = "mayankdfddsf"
pattern = "(may)"

re.sub(pattern, '***', string)

'***ankdfddsf'

### `re.split`

In [71]:
string = "mayankdfddsf iammayank"
pattern = "(may)"

re.split(pattern, string)

['', 'may', 'ankdfddsf iam', 'may', 'ank']

In [75]:
string = "mayankdfddsf iammayank"
pattern = r"\W+"  # raw string, so that \W reaches the regex engine unchanged

re.split(pattern, string)

['mayankdfddsf', 'iammayank']

### `re.finditer`

In [42]:
f = open('file.txt','r')
content = f.read()

pattern = 'ab'

for match in re.finditer(pattern, content):
    s = match.start()
    e = match.end()
    print('Found {!r} at {:d}:{:d}'.format(
        content[s:e], s, e))

Found 'ab' at 23:25
Found 'ab' at 46:48


In [52]:
text = 'abbaabbba'
pattern = 'ab*'
matches = re.finditer(pattern, text)
for match in matches:
    s = match.start()
    e = match.end()
    substr = text[s:e]
    n_backslashes = text[:s].count('\\')
    prefix = '.' * (s + n_backslashes)
    print("  {}'{}'".format(prefix, substr))



  'abb'
  ...'a'
  ....'abbb'
  ........'a'


### RE Patterns

### `^`,`$`,`.`,`*`,`+`,`?` 

In [18]:
# . matches any character except newline

content = 'cookies'
a = re.findall('coo..es', content) #matched
a

['cookies']

In [56]:
content

'amayank\nmayank\nmayank\n\nabbbasaacfafafbbbbaaaagab\n\n1___2222'


In [57]:
# ^m -> match 'm' at the start of the string

a = re.findall('^m', content) # no match: content starts with 'a'
print(a)
b = re.findall('^m', 'mxyzmzm') #match
print(b)

[]
['m']


In [58]:
# m$ -> match 'm' at the end of the string
a = re.findall('m$', content) #no match
print(a)
b = re.findall('a$', 'bca') #matched
print(b)

[]
['a']


In [62]:
string = "mayank\nxyzaaak"
pattern = "k$"

re.findall(pattern, string, re.MULTILINE)

['k', 'k']

In the above example, without the `re.MULTILINE` flag the pattern matches only the last `k`, at the end of the string. With `re.MULTILINE`, `$` also matches just before each newline, so the `k` at the end of the first line is found as well.
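
To make the contrast explicit, the same string can be searched with and without the flag. This is a small sketch that reuses `string` and the pattern from the cell above; the variable names `without_flag`, `with_flag` and `positions` are illustrative.

```python
import re

string = "mayank\nxyzaaak"
pattern = "k$"

# Without the flag, '$' matches only at the end of the whole string
without_flag = re.findall(pattern, string)

# With re.MULTILINE, '$' also matches just before each newline
with_flag = re.findall(pattern, string, re.MULTILINE)

# Start offsets of the matches found in multiline mode
positions = [m.start() for m in re.finditer(pattern, string, re.MULTILINE)]

print(without_flag, with_flag, positions)
```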

In [87]:
# m.. -> matches m followed by any 2 characters, anywhere in the string. '.' does not match a newline

a = re.findall('m..','ma\n') # '.' does not match '\n', so no match
print(a)
b = re.findall('m..', 'mayank') #matched
print(b)
c = re.findall('m..','xyzmayank') #will search anywhere in the string, hence matched.
print(c)

[]
['may']
['may']


In [89]:
# ab* matches a followed by 0 or more occurrences of b (a, ab, abb and so on), anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab*', i)
    print(a)

['a']
['ab']
['abb']
['ab']
[]


In [91]:
# ab+ matches a followed by 1 or more occurrences of b (ab, abb and so on), anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab+', i)
    print(a)

[]
['ab']
['abb']
['ab']
[]


In [92]:
# ab? matches a followed by 0 or 1 occurrence of b (a and ab, but not abb as a whole), anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab?', i)
    print(a)

['a']
['ab']
['ab']
['ab']
[]


#### Specific Character Group

In [108]:
# \s -> matches whitespace character
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\s', items)
a

['\n']

In [110]:
%pprint
# \S -> matches all but whitespace character
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\S', items)
a

Pretty printing has been turned OFF


['a', 'b', 'a', 'm', 'b', 'a', 'm', 'm', 'b', 'a', 'm', 'm', 'm', 'b', 'a', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'b']

In [111]:
# \W -> matches all but word characters (0 to 9, a to z, A to Z, _)
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\W', items)
a

['\n']

In [112]:
# \w -> matches all word characters (0 to 9, a to z, A to Z, _)
items = 'ammmb\naccccb'
a = re.findall(r'\w', items)
a

['a', 'm', 'm', 'm', 'b', 'a', 'c', 'c', 'c', 'c', 'b']

#### Customized Character Set

In [16]:
# [a-zA-Z0-9] Matches any letter from (a to z) or (A to Z) or (0 to 9). 
#Characters that are not within a range can be matched by complementing the set.
#If the first character of the set is ^, all the characters that are not in the set will be matched.

item1 = '$$$$%^$^&&&&'
item2 = 'a%'
a = re.findall('[a-zA-Z0-9]',item1)
b = re.findall('[a-zA-Z0-9]',item2)
c = re.findall('[^a-zA-Z0-9]',item1)
a,b,c

([], ['a'], ['$', '$', '$', '$', '%', '^', '$', '^', '&', '&', '&', '&'])

In [113]:
# [ab] -> matches either a or b
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall('[ab]', items)
a

['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']

In [11]:
import re
strings= ['gray', 'grey']
pattern = 'gr[ae]y' #either 'a' or 'e'. Not both.

for i in strings:
    a = re.findall(pattern, i)
    print(a)


['gray']
['grey']


In [114]:
# [0-9] -> matches any digit in this range
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
a = re.findall('[0-9]', items)
a

['0', '1']

In [115]:
# [0-5][0-9] -> matches any two-digit number in the range [00, 59]
items = 'abambammbammmb\naccccccccccccccccccccb0i1a55'
a = re.findall('[0-5][0-9]', items)
a

['55']

#### Pipe

In [123]:
# A|B -> match either regex A or B. 
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
a = re.findall('[0-9]|[b]', items)
a
# at each position [0-9] is tried first, then [b]; results come back in the order they occur in the string

['b', 'b', 'b', 'b', 'b', '0', '1']

In [63]:
# A|B -> Either pattern should match. If first matched, second isn't tested
pattern = ['am|b', 'am+|ab']
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
for i in pattern:
    a = re.findall(i, items)
    print(a)


['b', 'am', 'b', 'am', 'b', 'am', 'b', 'b']
['ab', 'am', 'amm', 'ammm']


In [9]:
# For single characters, the alternation a|b and the character class [ab] match the same things
pattern = ['a|b', '[ab]']
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
for i in pattern:
    a = re.findall(i, items)
    print(a)


['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']


In [3]:
import re
string = 'setvalue'
pattern = 'get|getvalue|set|setvalue'
re.findall(pattern,string)


['set']

Compare the above example with the following 2 examples. Note that `findall` keeps searching for the pattern in the remaining part of the string even after it has found a match.

In [4]:
string = 'setsetvalue'
pattern = 'get|getvalue|set|setvalue'
re.findall(pattern,string)


['set', 'set']

In [6]:
import re
string = 'setgetvalue'
pattern = 'get|getvalue|set|setvalue'
re.findall(pattern,string)


['set', 'get']
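
The three results above follow from alternatives being tried left to right: `set` succeeds before `setvalue` is ever attempted. As a sketch of one way around this (the reordered pattern and the optional-group variant below are illustrative, not from the original notebook), the longer alternatives can be listed first:

```python
import re

string = 'setvalue getvalue'  # hypothetical sample

# Shorter alternatives first: 'set' and 'get' win, 'value' is never consumed
short_first = re.findall('get|getvalue|set|setvalue', string)

# Longer alternatives first: the full words are matched
long_first = re.findall('getvalue|get|setvalue|set', string)

# Equivalent idea using an optional non-capturing group
optional_suffix = re.findall('(?:get|set)(?:value)?', string)

print(short_first, long_first, optional_suffix)
```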

#### `a{m,n}` - m to n repetitions of `a` (no space after the comma)

In [14]:
string = 'akkkkkueeiaouul'
pattern = '[aeiou]{2,}'

re.findall(pattern, string)

['ueeiaouu']

In [15]:
string = 'akkkkkueeiaouul'
pattern = '[aeiou]{2,3}'

re.findall(pattern, string)

['uee', 'iao', 'uu']

In [16]:
string = 'akkkkkueeiaouul'
pattern = '[aeiou]{2}'

re.findall(pattern, string)

['ue', 'ei', 'ao', 'uu']

**Note**

`*` is greedy in nature and tries to consume as much of input as possible. To control its greediness, we use `*?`

In [25]:
pattern1 = '^(.*)(s|es)$'
pattern2 = '^(.*?)(s|es)$'
pattern3 = '^(.*?)(es|s)$'

re.findall(pattern1, 'processes'), re.findall(pattern2, 'processes'), re.findall(pattern3, 'processes')


([('processe', 's')], [('process', 'es')], [('process', 'es')])

In [15]:
pattern1 = '^(.*)(s|es)$'
pattern2 = '^(.*?)(s|es)$'

re.findall(pattern1, 'abbbb'), re.findall(pattern2, 'abbbb')

([], [])

#### `\b` Word boundary

In [116]:
p = re.compile(r'\bclass\b')
a = p.findall('no class at all')
b = p.findall('the classified docs')
a,b

(['class'], [])

#### Lookahead assertion 

 - `(?=...)` - Positive lookahead assertion
 - `(?!...)` - Negative lookahead assertion
 
Suppose we want to match a filename and split it into basename and extension. For example, in the filename `test.pdf`, `test` is the basename and `pdf` is the extension. To match this type of filename, we can write the RE as **`.*[.].*$`**. Notice that the `.` needs to be treated specially because it is a metacharacter, so it is placed inside a character class to match only that specific character.

Now suppose we don't want to match any file with the extension `.bat`. Easy as it may seem, writing an RE that meets this condition without lookahead turns out to be messy. Negative lookahead cuts through all this mess: **`.*[.](?!bat$)[^.]*$`**

The negative lookahead means: if the expression `bat$` doesn't match at this point, try the rest of the pattern. If `bat$` does match, the whole pattern fails. The trailing `$` is required to ensure that something like `sample.batch`, where the extension only starts with `bat`, is still allowed.
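
The negative-lookahead pattern above can be tried on a few filenames. This is a minimal sketch: the list `names` is invented for illustration, and the positive-lookahead pattern at the end is an extra assumption added to show `(?=...)`.

```python
import re

# Pattern from the text: any filename whose extension is not exactly 'bat'
not_bat = re.compile(r'.*[.](?!bat$)[^.]*$')

names = ['test.pdf', 'autoexec.bat', 'sample.batch', 'archive.tar.gz', 'noextension']
accepted = [n for n in names if not_bat.match(n)]
print(accepted)

# Positive lookahead (illustrative): the basename, only if '.pdf' follows.
# The lookahead checks for '.pdf' but does not include it in the match.
pdf_basename = re.findall(r'\w+(?=\.pdf$)', 'test.pdf')
print(pdf_basename)
```

`autoexec.bat` is rejected by the lookahead, `noextension` is rejected because it contains no dot, and `sample.batch` passes because `bat$` does not match `batch`.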

In [23]:
strings = ['123', 'island333']
pattern = r'\d+'

for i in strings:
    a =  re.findall(pattern, i)
    print(a)

['123']
['333']


In [2]:
# find 10-digit numbers starting with 9
import re
pattern = r'^9[\d]{9}$'

numbers = ['123', '9897971000','0000000000', '11111111111', '99999999999', 'a9999888822', '9999A999999' ]

for i in numbers:
    a = re.findall(pattern, i)
    print(a)



[]
['9897971000']
[]
[]
[]
[]
[]


In [118]:
# finding email
pattern = r'[a-z0-9]+[\._]?[a-z0-9]+@[a-z0-9]+[\.][a-z]+'

re.findall(pattern, 'a11dfdff mayank1@a2zemail.com xyz11_11@gmail.com @@@gmail.com\
maya..@gmail.com addd@dddd a@.com mayank.kaizen@gmail.com  123@gmail.com, mayank@outlook.com')


['mayank1@a2zemail.com', 'xyz11_11@gmail.com', 'mayank.kaizen@gmail.com', '123@gmail.com', 'mayank@outlook.com']