## Regular expression

In [1]:
import re

### Search and match
The search() function takes the pattern and text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.

In [2]:
pattern = "apple"
sen = "There is an apple under the apple table"
match = re.search(pattern, sen)
match # match object for the first match

<re.Match object; span=(12, 17), match='apple'>

In [3]:
# we can use the start and end of the match to slice the original sentence
sen[12:17]

'apple'
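
Rather than hard-coding the indices, the span can be read from the match object itself. The following is a minimal sketch using the standard `span()` and `group()` methods of a match object; the variable names simply repeat those from the cells above.

```python
import re

pattern = "apple"
sen = "There is an apple under the apple table"

match = re.search(pattern, sen)
if match is not None:          # search() returns None when nothing matches
    start, end = match.span()  # (start, end) of the first match
    print(start, end)          # 12 17
    print(sen[start:end])      # apple
    print(match.group())       # apple
```

Checking for `None` before calling methods on the result avoids an `AttributeError` when the pattern is absent.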

### Compiling Expressions
It is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a Pattern object (re.Pattern), which exposes the same search(), findall(), and finditer() operations as methods.

In [4]:
regex = re.compile(pattern)
regex.search(sen) # matches "apple" and returns the match object 

<re.Match object; span=(12, 17), match='apple'>

In [5]:
regex = re.compile('batman')
regex.search(sen) # returns None
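
A compiled pattern is most useful when the same expression is applied to many inputs. The sketch below is illustrative only; `fruit_regex` and the sample `lines` are made-up names and data, not part of the original notebook.

```python
import re

# Compile once, then reuse the pattern object for every input.
fruit_regex = re.compile("apple")

lines = [
    "There is an apple under the table",
    "No fruit here",
    "apple pie",
]

# Keep only the lines in which the pattern occurs somewhere.
hits = [line for line in lines if fruit_regex.search(line)]
print(hits)  # ['There is an apple under the table', 'apple pie']
```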

### Multiple Matching
The findall() function returns all of the substrings of the input that match the pattern without overlapping.

In [6]:
sentence = "There are multiple apples in this apples sentence. ok apples"
pattern = "apples"
re.findall(pattern, sentence)

['apples', 'apples', 'apples']
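
To see what "without overlapping" means in practice, here is a small sketch on made-up strings: once a match is consumed, scanning resumes after its end, so characters are never reused by a later match.

```python
import re

# 'aaaa' contains three overlapping occurrences of 'aa',
# but only two non-overlapping ones are returned.
pairs = re.findall("aa", "aaaa")
print(pairs)  # ['aa', 'aa']

# The second 'aba' in 'ababa' shares its first 'a' with the first match,
# so it is not reported.
abas = re.findall("aba", "ababa")
print(abas)  # ['aba']
```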

#### Quantifiers

* 'ab*',     # a followed by zero or more b
* 'ab+',     # a followed by one or more b
* 'ab?',     # a followed by zero or one b
* 'ab{3}',   # a followed by exactly three b
* 'ab{2,3}', # a followed by two to three b (minimum and maximum)

In [7]:
sen = 'abbaaabbbbaaaaa'
re.findall('ab*', sen)

['abb', 'a', 'a', 'abbbb', 'a', 'a', 'a', 'a', 'a']

In [8]:
re.findall('ab+', sen)

['abb', 'abbbb']

In [9]:
re.findall('ab?', sen)

['ab', 'a', 'a', 'ab', 'a', 'a', 'a', 'a', 'a']
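
The counted forms `{m}` and `{m,n}` from the quantifier list can be tried on the same string. This is a small sketch that reuses the `sen` value from the cells above.

```python
import re

sen = 'abbaaabbbbaaaaa'

# Exactly three b's after an a: only the run 'abbbb' qualifies,
# and just its first three b's are consumed.
exactly_three = re.findall('ab{3}', sen)
print(exactly_three)  # ['abbb']

# Two to three b's after an a: the quantifier is greedy,
# so it takes three b's where they are available.
two_to_three = re.findall('ab{2,3}', sen)
print(two_to_three)  # ['abb', 'abbb']
```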

We can use finditer() to get an iterator of Match objects instead of the plain strings that findall() returns.

In [10]:
for match in re.finditer('ab?', sen):
    print(match)

<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(4, 5), match='a'>
<re.Match object; span=(5, 7), match='ab'>
<re.Match object; span=(10, 11), match='a'>
<re.Match object; span=(11, 12), match='a'>
<re.Match object; span=(12, 13), match='a'>
<re.Match object; span=(13, 14), match='a'>
<re.Match object; span=(14, 15), match='a'>
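
Because each item is a match object, positions and matched text can be collected together. A minimal sketch, again assuming the same `sen` string as above:

```python
import re

sen = 'abbaaabbbbaaaaa'

# Collect (start, end, text) for every 'a followed by one or more b'.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer('ab+', sen)]
print(spans)  # [(0, 3, 'abb'), (5, 10, 'abbbb')]
```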


In [11]:
text = """
abc 123 
Hello world!

abc

Mr. Rojit
Mr. Bill

123-456-7899
789:321:4567
786 538 9032

There is good in bad. And (bad) in [good].

My site = https://rojitmanandhar.com.np/

"""

#### Simple search pattern

In [12]:
abcpattern = re.compile(r'abc')
for match in re.finditer(abcpattern, text):
    print(match)

<re.Match object; span=(1, 4), match='abc'>
<re.Match object; span=(24, 27), match='abc'>


Using dot (.) to match any character except a newline

In [13]:
dotpattern = re.compile(r'.')
# for match in re.finditer(dotpattern, text):
#     print(match)
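
The newline exception is easiest to see on a short string. The sketch below uses a made-up `sample` value and also shows the standard `re.DOTALL` flag, which makes the dot match newlines as well.

```python
import re

sample = "ab\ncd"

# By default the dot skips the newline character.
default_dot = re.findall(r'.', sample)
print(default_dot)  # ['a', 'b', 'c', 'd']

# With re.DOTALL the dot matches every character, newline included.
dotall_dot = re.findall(r'.', sample, re.DOTALL)
print(dotall_dot)  # ['a', 'b', '\n', 'c', 'd']
```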

*  \d : a digit (0-9)
*  \D : any character that is not a digit

In [14]:
digit_pattern = re.compile(r'\d')
for match in re.finditer(digit_pattern, text):
    print(match)

<re.Match object; span=(5, 6), match='1'>
<re.Match object; span=(6, 7), match='2'>
<re.Match object; span=(7, 8), match='3'>
<re.Match object; span=(49, 50), match='1'>
<re.Match object; span=(50, 51), match='2'>
<re.Match object; span=(51, 52), match='3'>
<re.Match object; span=(53, 54), match='4'>
<re.Match object; span=(54, 55), match='5'>
<re.Match object; span=(55, 56), match='6'>
<re.Match object; span=(57, 58), match='7'>
<re.Match object; span=(58, 59), match='8'>
<re.Match object; span=(59, 60), match='9'>
<re.Match object; span=(60, 61), match='9'>
<re.Match object; span=(62, 63), match='7'>
<re.Match object; span=(63, 64), match='8'>
<re.Match object; span=(64, 65), match='9'>
<re.Match object; span=(66, 67), match='3'>
<re.Match object; span=(67, 68), match='2'>
<re.Match object; span=(68, 69), match='1'>
<re.Match object; span=(70, 71), match='4'>
<re.Match object; span=(71, 72), match='5'>
<re.Match object; span=(72, 73), match='6'>
<re.Match object; span=(73, 74), match='7'>

In [15]:
non_digit_pattern = re.compile(r'\D')
for match in re.finditer(non_digit_pattern, text):
#     print(match)
    pass
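
Character classes become more useful when combined with quantifiers. As a sketch, the pattern below picks out the three phone-number-like lines from the sample text; `phone_text` and `phone_pattern` are illustrative names, and the pattern is only one of several ways to write this.

```python
import re

# The three number formats from the sample text above.
phone_text = """
123-456-7899
789:321:4567
786 538 9032
"""

# Three digits, any non-digit separator, three digits,
# another separator, then four digits.
phone_pattern = re.compile(r'\d{3}\D\d{3}\D\d{4}')
phones = phone_pattern.findall(phone_text)
print(phones)  # ['123-456-7899', '789:321:4567', '786 538 9032']
```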

* \w : a word character (a-z, A-Z, 0-9, and _; for str patterns, other Unicode letters and digits as well)
* \W : any character that is not a word character

In [16]:
word_pattern = re.compile(r'\w')
for match in re.finditer(word_pattern, text):
#     print(match)
    pass

In [17]:
non_word_pattern = re.compile(r'\W')
for match in re.finditer(non_word_pattern, text):
#     print(match)
    pass
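
Printing every single-character match is noisy, which is why the loops above are silenced. A shorter illustration, on a made-up `sample` string, is to pair `\w` with `+` to extract whole words and use `\W` to see what separates them.

```python
import re

sample = "Hello world! abc_123"

# Runs of word characters: letters, digits, and the underscore.
words = re.findall(r'\w+', sample)
print(words)  # ['Hello', 'world', 'abc_123']

# Everything else: spaces and punctuation.
separators = re.findall(r'\W', sample)
print(separators)  # [' ', '!', ' ']
```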

### Anchoring
In addition to describing the content of a pattern to match, you can also specify the relative location in the input text where the pattern should appear using anchoring instructions.
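
As a brief sketch of the idea, the standard anchors `^` (start of string), `$` (end of string), and `\b` (word boundary) can be tried on one line of the sample text; the `sample` variable is introduced here for illustration.

```python
import re

sample = "There is good in bad. And (bad) in [good]."

# ^ anchors the match to the start of the string.
first_word = re.findall(r'^\w+', sample)
print(first_word)  # ['There']

# 'good' occurs in the string, but not at the start.
good_at_start = re.search(r'^good', sample)
print(good_at_start)  # None

# $ anchors the match to the end of the string.
last_token = re.findall(r'\S+$', sample)
print(last_token)  # ['[good].']

# \b matches at a word boundary, so only the standalone word is found.
bad_words = re.findall(r'\bbad\b', sample)
print(bad_words)  # ['bad', 'bad']
```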