### Regular Expression

- `re.search`
The `search()` function takes a pattern and the text to scan, and returns a `Match` object for the first place the pattern is found. If the pattern is not found, `search()` returns `None`.

Each `Match` object holds information about the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.

- `re.match` 
This checks for a match only at the very start of the string. It returns a `Match` object on success and `None` otherwise.

- `re.findall` 
This finds all non-overlapping matches anywhere in the string and returns them as a list of substrings.

- `re.finditer` 
It is just like `findall()` but returns an iterator that produces `Match` instances instead of the strings returned by `findall()`.
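
The four functions can be compared side by side on an in-memory string. This is a minimal sketch: the `text` value is an assumed sample that only imitates the first lines of `file.txt`, so no file is needed to run it.

```python
import re

# Assumed sample text standing in for the file content
text = 'amayank\nmayank\nmayank'

# search: first match anywhere in the string
first = re.search('mayank', text)
print(first.span())                    # (1, 7)

# match: succeeds only if the pattern matches at position 0
at_start = re.match('mayank', text)
print(at_start)                        # None, because the text starts with 'a'

# findall: list of all non-overlapping matching substrings
found = re.findall('mayank', text)
print(found)                           # ['mayank', 'mayank', 'mayank']

# finditer: iterator of Match objects, one per match
spans = [m.span() for m in re.finditer('mayank', text)]
print(spans)                           # [(1, 7), (8, 14), (15, 21)]
```

`finditer` is the one to reach for when the positions of the matches matter, since `findall` returns only the matched text.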

In [5]:
import re
f = open('file.txt','r') #first line of the file is 'amayank'
content = f.read()
pattern = 'mayank'
match = re.search(pattern,content) #only first match. Anywhere in the string
match

<_sre.SRE_Match object; span=(1, 7), match='mayank'>

In [15]:
print('Found "{}"\n in\n "{}"\n from position {} to position {}'
      .format(match.re.pattern, match.string, match.start(), match.end()))

Found "mayank"
 in
 "amayank
mayank
mayank

abbbasaacfafafbbbbaaaag

1___2222"
 from position 1 to position 7


In [24]:
import re
f = open('file.txt','r')
content = f.read()
a = re.match('mayank',content) #only first match. Start of the string
print(a)

None


In [25]:
import re
f = open('file.txt','r')
content = f.read()
a = re.search('ayank',content) #only first match. Anywhere in the string
a.group()

'ayank'

In [26]:
a = re.findall('mayank',content) #all instances of match
print(a)

['mayank', 'mayank', 'mayank']


In [27]:
a = re.findall('ayank',content) #all instances of match
print(a)

['ayank', 'ayank', 'ayank']


In [36]:
f = open('file.txt','r')
content = f.readlines() #list of strings
a = re.match('ama', content[0]) #content[0] is first line as string
a.re.pattern,a.string, a.start(), a.end()

('ama', 'amayank\n', 0, 3)

In [75]:
f = open('file.txt','r')
content = f.read() #entire file content in a single string
a = re.match('my', content)
a.group()

'my'

In [76]:
type(content)

str

### `re.finditer`

In [42]:
f = open('file.txt','r')
content = f.read()

pattern = 'ab'

for match in re.finditer(pattern, content):
    s = match.start()
    e = match.end()
    print('Found {!r} at {:d}:{:d}'.format(
        content[s:e], s, e))

Found 'ab' at 23:25
Found 'ab' at 46:48


In [52]:
text = 'abbaabbba'
pattern = 'ab*'
matches = re.finditer(pattern, text)
for match in matches:
    s = match.start()
    e = match.end()
    substr = text[s:e]
    n_backslashes = text[:s].count('\\')
    prefix = '.' * (s + n_backslashes)
    print("  {}'{}'".format(prefix, substr))



  'abb'
  ...'a'
  ....'abbb'
  ........'a'


### RE Patterns

In [18]:
# . matches any character except a newline

content = 'cookies'
a = re.findall('coo..es', content) #matched
a

['cookies']

In [56]:
content

'amayank\nmayank\nmayank\n\nabbbasaacfafafbbbbaaaagab\n\n1___2222'

In [57]:
# ^m -> match 'm' at the start of the string

a = re.findall('^m', content) #no match: content starts with 'a'
print(a)
b = re.findall('^m', 'mxyzmzm') #match
print(b)

[]
['m']


In [58]:
# m$ -> match 'm' at the end of the string
a = re.findall('m$', content) #no match
print(a)
b = re.findall('a$', 'bca') #matched
print(b)

[]
['a']
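
By default `^` and `$` anchor to the whole string, which is why lines that begin with `m` further down in the content are not found. A small sketch of the `re.MULTILINE` flag, which makes the anchors apply to every line; the `content` value here is an assumed sample, not the real file:

```python
import re

# Assumed sample, similar in shape to the file content used above
content = 'amayank\nmayank\nmayank'

# Default: ^ matches only at the very start of the string
whole_string = re.findall('^m', content)
print(whole_string)   # []

# re.MULTILINE: ^ and $ also match at the start and end of each line
line_starts = re.findall('^m', content, re.MULTILINE)
line_ends = re.findall('k$', content, re.MULTILINE)
print(line_starts)    # ['m', 'm']
print(line_ends)      # ['k', 'k', 'k']
```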


In [87]:
# m.. -> matches 'm' followed by any 2 characters, anywhere in the string. By default '.' does not match a newline

a = re.findall('m..','ma\n') # '.' does not match '\n', so no match
print(a)
b = re.findall('m..', 'mayank') #matched
print(b)
c = re.findall('m..','xyzmayank') #will search anywhere in the string, hence matched.
print(c)
print(c)

[]
['may']
['may']
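
The newline behaviour of `.` is a default rather than a fixed rule. As a short sketch, the `re.DOTALL` flag lets `.` match a newline too (the `text` value is an assumed sample):

```python
import re

text = 'ma\nyank'  # assumed sample with a newline right after 'ma'

default = re.findall('m..', text)             # '.' refuses to match '\n'
dotall = re.findall('m..', text, re.DOTALL)   # '.' now matches '\n' as well
print(default)   # []
print(dotall)    # ['ma\n']
```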


In [89]:
# ab* matches 'a' followed by 0 or more occurrences of 'b': a, ab, abb and so on, anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab*', i)
    print(a)

['a']
['ab']
['abb']
['ab']
[]


In [91]:
# ab+ matches 'a' followed by 1 or more occurrences of 'b': ab, abb and so on, anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab+', i)
    print(a)

[]
['ab']
['abb']
['ab']
[]


In [92]:
# ab? matches 'a' followed by 0 or 1 occurrence of 'b': a and ab match, while in abb only the leading ab is matched. Anywhere in the string.

items = ['a','ab','abb','bab','xyz']
for i in items:
    a = re.findall('ab?', i)
    print(a)

['a']
['ab']
['ab']
['ab']
[]


In [108]:
# \s -> matches whitespace character
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\s', items) # raw string keeps the backslash intact
a

['\n']

In [110]:
%pprint
# \S -> matches all but whitespace character
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\S', items)
a

Pretty printing has been turned OFF


['a', 'b', 'a', 'm', 'b', 'a', 'm', 'm', 'b', 'a', 'm', 'm', 'm', 'b', 'a', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'b']

In [111]:
# \W -> matches all but word characters (0 to 9, a to z, A to Z, _)
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall(r'\W', items)
a

['\n']

In [112]:
# \w -> matches all word characters (0 to 9, a to z, A to Z, _)
items = 'ammmb\naccccb'
a = re.findall(r'\w', items)
a

['a', 'm', 'm', 'm', 'b', 'a', 'c', 'c', 'c', 'c', 'b']
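
Two details about the backslash classes are easy to miss, so here is a hedged sketch of both. Patterns with backslashes are usually written as raw strings (`r'...'`) so that Python does not interpret the backslash itself, and for `str` patterns `\w` is Unicode-aware unless `re.ASCII` is passed. The `sample` string is an assumption made up for illustration.

```python
import re

# r'\w' and '\\w' are the same two-character pattern: a backslash followed by w
same_pattern = (r'\w' == '\\w')
print(same_pattern)    # True

sample = '\u00e9_1 !'  # assumed sample: e-acute, underscore, digit, space, '!'

unicode_words = re.findall(r'\w', sample)          # Unicode-aware by default
ascii_words = re.findall(r'\w', sample, re.ASCII)  # restrict \w to [a-zA-Z0-9_]
print(unicode_words)   # ['é', '_', '1']
print(ascii_words)     # ['_', '1']
```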

In [16]:
# [a-zA-Z0-9] matches any single letter (a to z or A to Z) or digit (0 to 9).
# Characters outside a set can be matched by complementing the set:
# if the first character of the set is ^, every character that is not in the set is matched.

item1 = '$$$$%^$^&&&&'
item2 = 'a%'
a = re.findall('[a-zA-Z0-9]',item1)
b = re.findall('[a-zA-Z0-9]',item2)
c = re.findall('[^a-zA-Z0-9]',item1)
a,b,c

([], ['a'], ['$', '$', '$', '$', '%', '^', '$', '^', '&', '&', '&', '&'])

In [113]:
# [ab] -> matches either a or b
items = 'abambammbammmb\naccccccccccccccccccccb'
a = re.findall('[ab]', items)
a

['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']

In [114]:
# [0-9] -> matches any digit in this range
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
a = re.findall('[0-9]', items)
a

['0', '1']

In [115]:
# [0-5][0-9] -> matches any two-digit string from 00 to 59
items = 'abambammbammmb\naccccccccccccccccccccb0i1a55'
a = re.findall('[0-5][0-9]', items)
a

['55']

In [123]:
# A|B -> match either regex A or B. 
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
a = re.findall('[0-9]|[b]', items)
a
# results are listed in string order: the five b's come first, then the digits 0 and 1

['b', 'b', 'b', 'b', 'b', '0', '1']

In [63]:
# A|B -> Either pattern should match
pattern = ['am|b', 'am+|ab']
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
for i in pattern:
    a = re.findall(i, items)
    print(a)


['b', 'am', 'b', 'am', 'b', 'am', 'b', 'b']
['ab', 'am', 'amm', 'ammm']


In [9]:
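# a|b vs [ab] -> for single characters, alternation and a character class match the same thing
pattern = ['a|b', '[ab]'] #same result here; a|b can also alternate multi-character patterns, [ab] cannot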
items = 'abambammbammmb\naccccccccccccccccccccb0i1'
for i in pattern:
    a = re.findall(i, items)
    print(a)


['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
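
The two patterns only diverge once an alternative is longer than one character. A minimal sketch with an assumed sample string:

```python
import re

text = 'cat dog cog'  # assumed sample

# Alternation chooses between whole subpatterns
alternation = re.findall('cat|dog', text)
# A character class matches exactly one character from the set
char_class = re.findall('[catdog]', text)
# For single-character alternatives the two forms give the same result
same = re.findall('a|b', 'abc') == re.findall('[ab]', 'abc')

print(alternation)   # ['cat', 'dog']
print(char_class)    # each of c, a, t, d, o, g matched one character at a time
print(same)          # True
```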


**Note**

`*` is greedy: it tries to consume as much of the input as possible. To make it non-greedy, so that it matches as little as possible, we use `*?`

In [14]:
pattern1 = '^(.*)(s|es)$'
pattern2 = '^(.*?)(s|es)$'

re.findall(pattern1, 'processes'), re.findall(pattern2, 'processes')

([('processe', 's')], [('process', 'es')])

In [15]:
pattern1 = '^(.*)(s|es)$'
pattern2 = '^(.*?)(s|es)$'

re.findall(pattern1, 'abbbb'), re.findall(pattern2, 'abbbb')

([], [])
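
Greedy versus non-greedy matching is often easiest to see on delimited text. The sketch below uses an assumed HTML-like sample string; it is only an illustration of `.*` against `.*?`, not a recommended way to parse HTML.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # assumed sample

greedy = re.findall('<.*>', html)   # .* runs on to the last '>' in the string
lazy = re.findall('<.*?>', html)    # .*? stops at the first '>' it can reach
print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```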


Pattern | Explanation
--------|------------
`.` | any character except a newline
`^m` | m at the start of the string
`m$` | m at the end of the string
`m..` | m followed by any 2 characters (not newlines), anywhere in the string
`ab*` | a followed by 0 or more occurrences of b
`ab+` | a followed by 1 or more occurrences of b
`ab?` | a followed by 0 or 1 occurrence of b
`ab{3}` | a followed by exactly 3 b
`ab{2,3}` | a followed by 2 to 3 b
`\s` | any whitespace character (e.g. `\n`)
`\S` | any character except whitespace
`\w` | any word character (0 to 9, a to z, A to Z, _)
`\W` | any character except word characters
`[a-zA-Z0-9]` | any single letter (a to z, A to Z) or digit (0 to 9)
`[^a-zA-Z0-9]` | any single character not in the set
`[ab]` | either a or b
`[0-9]` | any single digit
`[0-5][0-9]` | any two-digit string from 00 to 59
`[0-9]+` | one or more consecutive digits; being greedy, it takes the longest run available
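
The brace quantifiers and `[0-9]+` in the table are not exercised by the cells above, so here is a short sketch of how they behave. The input strings are assumed samples chosen for illustration.

```python
import re

items = ['ab', 'abb', 'abbb', 'abbbb']  # assumed samples

# {3}: exactly three b's; {2,3}: two or three b's, taking three when available
exactly_three = [re.findall('ab{3}', s) for s in items]
two_to_three = [re.findall('ab{2,3}', s) for s in items]
print(exactly_three)   # [[], [], ['abbb'], ['abbb']]
print(two_to_three)    # [[], ['abb'], ['abbb'], ['abbb']]

# [0-9]+ takes each run of digits as a whole rather than digit by digit
numbers = re.findall('[0-9]+', 'b0i1a55 2222')
print(numbers)         # ['0', '1', '55', '2222']
```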


**Note**

 - `a|b` -> match either a or b
 - `*` is greedy in nature and tries to consume as much of input as possible. To control its greediness, we use `*?`



In [10]:
pattern = '[aeiou]{2,}'

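# matches a run of 2 or more consecutive lowercase vowels, e.g. 'oo' in 'book'
a = 'a|b'
b = '[ab]'
# both match a single 'a' or 'b'; a|b is alternation between subpatterns, [ab] is a character class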
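
To see what `[aeiou]{2,}` picks out, it can be run against a few words. This is a sketch with an assumed sample string:

```python
import re

pattern = '[aeiou]{2,}'
words = 'queue, book, sky, beautiful'  # assumed sample

# Only runs of at least two consecutive lowercase vowels are returned;
# isolated vowels such as the 'i' in 'beautiful' are skipped
vowel_runs = re.findall(pattern, words)
print(vowel_runs)   # ['ueue', 'oo', 'eau']
```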