# re (Regular expression) Module

In [1]:
import re

- A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. 
- They’re typically used to find a sequence of characters within a string so you can extract and manipulate them. 

__`findall >> list;`
`search >> object;`
`match >> object;`
`sub >> str`__

### Synopsis:

#### Functions: 

1) __re.findall()__ : Returns all non-overlapping matches of the RE pattern as a list of strings.

2) __re.search()__ : Scans the string for the RE pattern and returns a match object for the first occurrence only.

3) __re.match()__ : Returns a match object if zero or more characters at the beginning of string match the regular expression pattern.

4) __re.sub()__ : Works as a replace function. Returns a new string in which every match of the pattern is replaced with the replacement.

5) __re.split()__ : Splits the string at each occurrence of the RE pattern and returns the pieces as a list.

6) __re.compile()__ : Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.
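
As a quick orientation before the detailed sections, the sketch below calls each of the six functions once on a made-up sentence (the sample text and variable names are illustrative, not part of the original notes) so the return types can be compared side by side.

```python
import re

text = "Order 66 shipped on 03/06/2022, order 77 is pending"

all_numbers = re.findall(r"\d+", text)           # list of every match
first_number = re.search(r"\d+", text)           # match object for the first hit, or None
at_start = re.match(r"\d+", text)                # None here: the text does not start with a digit
masked = re.sub(r"\d", "#", text)                # new string with every digit replaced
parts = re.split(r",\s*", text)                  # list of pieces around the delimiter
date_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")  # reusable pattern object

print(all_numbers)
print(first_number.group())
print(at_start)
print(masked)
print(parts)
print(date_pattern.findall(text))
```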

#### Brackets: 

__[]__ - Square brackets define a set of characters (letters, digits, special characters, whitespace). The pattern matches any one character from the set, so brackets are a way to build your own list of allowed characters. Most special characters lose their special meaning inside square brackets. ex. [0-9], [a-z], [a-zA-Z]

__{}__ - Curly brackets specify how many times the preceding item must repeat, either as an exact count or as a range: `"{m,n} - Causes the resulting RE to match from m to n repetitions of the preceding RE."`

__()__ - Parentheses group part of a pattern so it can be repeated or captured as a unit.
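
The three kinds of brackets can be tried together on a small string. This is a minimal sketch; the room labels are invented for illustration.

```python
import re

text = "room A12, room B7, room C345"

# [] : one character from the set
one_letter = re.findall(r"[A-C]", text)
# {} : how many times the preceding item repeats (here 2 to 3 digits)
two_to_three_digits = re.findall(r"[0-9]{2,3}", text)
# () : capture parts of the match; findall then returns tuples
letter_and_number = re.findall(r"([A-C])([0-9]{1,3})", text)
# Inside [] the dot and the plus are literal characters
literal_specials = re.findall(r"[.+]", "1.5+2")

print(one_letter)
print(two_to_three_digits)
print(letter_and_number)
print(literal_specials)
```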

#### Special Escape sequence:

1) __\d__ Matches any decimal digit. __\d == [0-9]__

2) __\D__ Matches any character which is not a decimal digit. __\D == [^0-9]__

3) __\s__ Matches any whitespace character (space, tab, newline, carriage return, form feed, vertical tab). __\s == [ \t\n\r\f\v]__

4) __\S__ Matches any character which is not a whitespace character. __\S == [^ \t\n\r\f\v]__

5) __\w__ Matches letters, digits and the underscore. __\w == [a-zA-Z0-9_]__

6) __\W__ Matches any character which is not a letter, digit or underscore. __\W == [^a-zA-Z0-9_]__

7) __\b__ Matches a word boundary, i.e. the empty position __at the beginning of a word or the end of a word__. Write the pattern as a raw string (`r'\bword\b'`), otherwise Python reads `\b` as a backspace character.
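
A short sketch of the escape sequences above, run on an invented sample string. Note that for `str` patterns Python matches Unicode digits, letters and whitespace by default, so the bracket equivalents listed above are exact only for ASCII text.

```python
import re

sample = "id_7 cost: 45\tUSD\n"   # contains a real tab and a real newline

digits = re.findall(r"\d", sample)        # digit characters
whitespace = re.findall(r"\s", sample)    # spaces, tab, newline
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
non_words = re.findall(r"\W", sample)     # everything that \w rejects
whole_word = re.findall(r"\bcost\b", "cost costly cost")  # 'costly' is skipped

print(digits)
print(whitespace)
print(words)
print(non_words)
print(whole_word)
```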

#### Special characters:

1) __'+' (Plus sign)__ Causes the resulting RE to match __`one or more occurrences`__ of the preceding RE.

2) __'\*' (star sign)__ Causes the resulting RE to match __`zero or more occurrences`__ of the preceding RE.

3) __'.' (dot sign)__ Matches exactly one character, which can be any character except a newline. If the DOTALL flag has been specified, it matches any character including a newline.

4) __'?' (Question mark sign)__ Causes the resulting RE to match zero or one occurrence of the preceding RE.

5) __'^' (Caret sign); startswith__ Matches at the start of the string, and in MULTILINE mode also matches immediately after each newline.

6) __'$' (Dollar sign); endswith__ Matches at the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before each newline.

7) __'|' (Or sign)__ A|B, creates a regular expression that will match either A or B.
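
The quantifiers are easiest to compare on the same input. The following sketch uses an invented list of words and is only meant to illustrate the definitions above.

```python
import re

words = "ct cat caat caaat"

plus = re.findall(r"ca+t", words)       # one or more 'a'
star = re.findall(r"ca*t", words)       # zero or more 'a'
optional = re.findall(r"ca?t", words)   # zero or one 'a'
dot = re.findall(r"c.t", words)         # exactly one character between 'c' and 't'
either = re.findall(r"ct|caat", words)  # one alternative or the other

print(plus)
print(star)
print(optional)
print(dot)
print(either)
```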

### Functions

#### 1) re.findall()
- It returns a list of all matches (an empty list if nothing matches).
- Return all non-overlapping matches of pattern as a list of strings or tuples. The string is scanned left-to-right and matches are returned in the order found. Empty matches are also included in the result. 

`Syntax: re.findall(pattern, string)`
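
Whether `findall` returns strings or tuples depends on the number of capturing groups in the pattern. A minimal sketch with an invented sentence:

```python
import re

text = "todays date is 03/06/2022 and tomorrow is 04/06/2022"

# No group: each item is the whole match
no_groups = re.findall(r"\d{2}/\d{2}/\d{4}", text)
# One group: each item is only the captured part
one_group = re.findall(r"\d{2}/\d{2}/(\d{4})", text)
# Several groups: each item is a tuple of the captured parts
many_groups = re.findall(r"(\d{2})/(\d{2})/(\d{4})", text)

print(no_groups)
print(one_group)
print(many_groups)
```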

In [5]:
# Accessing numbers [0-9]

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[0-9]{1}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['8', '9', '8', '3', '7', '4', '3', '9', '6', '8', '9', '6', '8', '0', '3', '0', '6', '2', '0', '2', '2']
21


In [6]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[0-9]{2}',text1) # [condition] and {count you want}
print(result) # each match is a run of two consecutive digits
print(len(result))

['89', '83', '74', '39', '68', '96', '03', '06', '20', '22']
10


In [None]:
import re
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """
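try:
    result = re.findall('[]',text1) # an empty character set is not a valid pattern
except re.error as err:
    print(err) # the pattern fails to compile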

In [7]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[0-9]{10}',text1) # [condition] and {count you want}
print(result) # since mobile number is of 10 digits
print(len(result))

['8983743968']
1


In [14]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[0-5]{1}',text1) # [condition] and {count you want}
print(result) # we can specify range also [0-5]
print(len(result))

['3', '4', '3', '0', '3', '0', '2', '0', '2', '2']
10


In [15]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[012345]{1}',text1) # [condition] and {count you want}
print(result) # we can specify only numbers we want [012345]
print(len(result))

['3', '4', '3', '0', '3', '0', '2', '0', '2', '2']
10


In [8]:
# Accessing lower characters [a-z]

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[a-z]{1}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['y', 'm', 'o', 'b', 'i', 'l', 'e', 'u', 'm', 'b', 'e', 'r', 'i', 's', 'm', 'y', 'e', 'm', 'a', 'i', 'l', 'i', 'd', 'i', 's', 'o', 'm', 'k', 'a', 'r', 'f', 'a', 'd', 't', 'a', 'r', 'e', 'g', 'm', 'a', 'i', 'l', 'c', 'o', 'm', 't', 'o', 'd', 'a', 'y', 's', 'd', 'a', 't', 'e', 'i', 's']
57


In [9]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[a-z]{3}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['mob', 'ile', 'umb', 'ema', 'omk', 'arf', 'adt', 'are', 'gma', 'com', 'tod', 'ays', 'dat']
13


In [17]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[abcdefg]{2}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['be', 'fa', 'da', 'da']
4


In [20]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[abcdefg]{3}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['fad']
1


In [10]:
# Accessing upper characters [A-Z]

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[A-Z]{1}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['M', 'N']
2


In [11]:
# Accessing lower & upper characters [A-Za-z]

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[A-Za-z]{1}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['M', 'y', 'm', 'o', 'b', 'i', 'l', 'e', 'N', 'u', 'm', 'b', 'e', 'r', 'i', 's', 'm', 'y', 'e', 'm', 'a', 'i', 'l', 'i', 'd', 'i', 's', 'o', 'm', 'k', 'a', 'r', 'f', 'a', 'd', 't', 'a', 'r', 'e', 'g', 'm', 'a', 'i', 'l', 'c', 'o', 'm', 't', 'o', 'd', 'a', 'y', 's', 'd', 'a', 't', 'e', 'i', 's']
59


In [12]:
# Accessing lower, upper characters & numbers [A-Za-z0-9]

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[A-Za-z0-9]{1}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['M', 'y', 'm', 'o', 'b', 'i', 'l', 'e', 'N', 'u', 'm', 'b', 'e', 'r', 'i', 's', '8', '9', '8', '3', '7', '4', '3', '9', '6', '8', 'm', 'y', 'e', 'm', 'a', 'i', 'l', 'i', 'd', 'i', 's', 'o', 'm', 'k', 'a', 'r', 'f', 'a', 'd', 't', 'a', 'r', 'e', '9', '6', '8', 'g', 'm', 'a', 'i', 'l', 'c', 'o', 'm', 't', 'o', 'd', 'a', 'y', 's', 'd', 'a', 't', 'e', 'i', 's', '0', '3', '0', '6', '2', '0', '2', '2']
80


In [23]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[a-z]{5}[0-9]{0,3}[@]{1}[a-z.]{4,10}',text1) # [condition] and {count you want}
print(result)
print(len(result))

['dtare968@gmail.com']
1


In [24]:
# See the difference between the example above and this one

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[a-z]{,155}[0-9]{0,3}[@]{1}[a-z.]{4,10}',text1) 
print(result)
print(len(result))

['omkarfadtare968@gmail.com']
1


In [32]:
# This is useful when we have some data and we want to access only the email ids

text = """ my email id is : viratkohli@vctcpune.com
        viratkohli123@gmail.com
        virat_kohli123@gmail.com
        virat.kohli@gmail.com
        virat_kohli_123@vctcpune.co.in
        virat_kohli_123@coep.edu.in
        hr@capgemini.com """
email_id = re.findall('[a-z_.]{2,20}[0-9]{0,4}[@]{1}[a-z.]{5,20}',text)
print(email_id)
print(len(email_id))

['viratkohli@vctcpune.com', 'viratkohli123@gmail.com', 'virat_kohli123@gmail.com', 'virat.kohli@gmail.com', 'virat_kohli_123@vctcpune.co.in', 'virat_kohli_123@coep.edu.in', 'hr@capgemini.com']
7


In [34]:
# Accessing aadhar card number (we know that aadhar card numbers have 12 digits, written as three groups of 4 digits separated by spaces)

text = """ 1234 245 0978    
        1234 2345 0987    
        4567 2345 0987678    
        1234 2345 987 """

aadhar_num = re.findall('[0-9]{4}[ ][0-9]{4}[ ][0-9]{4}',text)
aadhar_num

['1234 2345 0987', '4567 2345 0987']

In [35]:
# Accessing pan card number (we know that a pan card number has a unique pattern: 5 capital letters followed by 4 digits and a single capital letter)

text = """ ASDFB0987P
        RTYUI65780
        POIUY3456T """
pan_num = re.findall('[A-Z]{5}[0-9]{4}[A-Z]',text)
pan_num

['ASDFB0987P', 'POIUY3456T']

In [36]:
# Accessing dates 

text = """ ASDFB0987P
        RTYUI65780
        POIUY3456T
        
        Dates:
        01/03/2022
        28/02/2022
        08-02-2022
        25/05/2022 """
dates = re.findall('[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}',text)
dates

['01/03/2022', '28/02/2022', '08-02-2022', '25/05/2022']

In [37]:
text = """ ASDFB0987P
        RTYUI65780
        POIUY3456T
        
        Dates:
        01/03/2022
        28/02/2022
        08-02-2022
        25/05/2022
        40-08-2021 """
dates = re.findall('[0-3][0-9][/-][0-9]{2}[/-][0-9]{4}',text)
dates

['01/03/2022', '28/02/2022', '08-02-2022', '25/05/2022']
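
A regular expression can only check the shape of a date, not whether the date exists (for example `31/02/2022` has a valid shape). One possible approach, sketched below under the assumption that all dates are written day first, is to find candidates with a regex and then validate them with `datetime.strptime`. The helper name `is_real_date` and the sample text are made up for this illustration.

```python
import re
from datetime import datetime

text = "01/03/2022 28/02/2022 08-02-2022 40-08-2021 31/02/2022"

# Step 1: the regex collects everything that looks like DD/MM/YYYY or DD-MM-YYYY
candidates = re.findall(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b", text)

def is_real_date(value):
    """Return True when value is an existing calendar date (day first)."""
    try:
        datetime.strptime(value.replace("-", "/"), "%d/%m/%Y")
        return True
    except ValueError:
        return False

# Step 2: keep only the candidates that are real dates
valid_dates = [d for d in candidates if is_real_date(d)]

print(candidates)
print(valid_dates)
```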

__To make pattern searching easier, we have some special sequences, as below.__

1) \d Matches any decimal digit. __\d == [0-9]__

2) \D Matches any character which is not a decimal digit. __\D == [^0-9]__

In [42]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall('[0-9]{10}',text1)
print(result)
print(len(result))

['8983743968']
1


In [43]:
# We can write in this way also

text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall(r'[\d]{10}',text1) # raw string, so that Python does not treat \d as a string escape
print(result)
print(len(result))

['8983743968']
1


In [44]:
text1 = """ My mobile Number is 8983743968
            my email id is : omkarfadtare968@gmail.com
            todays date is : 03/06/2022 """

result = re.findall(r'[\D]{2}',text1)
print(result)
print(len(result))

[' M', 'y ', 'mo', 'bi', 'le', ' N', 'um', 'be', 'r ', 'is', '\n ', '  ', '  ', '  ', '  ', '  ', ' m', 'y ', 'em', 'ai', 'l ', 'id', ' i', 's ', ': ', 'om', 'ka', 'rf', 'ad', 'ta', 're', '@g', 'ma', 'il', '.c', 'om', '\n ', '  ', '  ', '  ', '  ', '  ', ' t', 'od', 'ay', 's ', 'da', 'te', ' i', 's ', ': ']
51


__To make pattern searching easier, we have some special sequences, as below.__

3) \s Matches any whitespace character (space, tab, newline, carriage return, form feed, vertical tab). __\s == [ \t\n\r\f\v]__

4) \S Matches any character which is not a whitespace character. __\S == [^ \t\n\r\f\v]__

In [45]:
text = """ 1234 245 0978    
        1234 2345 0987    
        4567 2345 0987678    
        1234 2345 987 """
aadhar = re.findall(r'[\d]{4}[\s]{1}[\d]{4}[\s]{1}[\d]{4}',text)
aadhar

['1234 2345 0987', '4567 2345 0987']

In [47]:
text = """ 1234 245 0978    
        1234 2345 0987    
        4567 2345 0987678    
        1234 2345 987 """
aadhar = re.findall(r'\d{4}\s{1}\d{4}\s{1}\d{4}',text) # square brackets are not necessary around a special sequence, but they are necessary for a range such as [a-z]
aadhar

['1234 2345 0987', '4567 2345 0987']

In [50]:
text = """ ASDFB0987P
        RTYUI65780
        POIUY3456T
        1234 2345 0987 """
pan_num = re.findall(r'\S',text)
pan_num

['A',
 'S',
 'D',
 'F',
 'B',
 '0',
 '9',
 '8',
 '7',
 'P',
 'R',
 'T',
 'Y',
 'U',
 'I',
 '6',
 '5',
 '7',
 '8',
 '0',
 'P',
 'O',
 'I',
 'U',
 'Y',
 '3',
 '4',
 '5',
 '6',
 'T',
 '1',
 '2',
 '3',
 '4',
 '2',
 '3',
 '4',
 '5',
 '0',
 '9',
 '8',
 '7']

__To make pattern searching easier, we have some special sequences, as below.__

5) \w Matches letters, digits and the underscore. __\w == [a-zA-Z0-9_]__

6) \W Matches any character which is not a letter, digit or underscore. __\W == [^a-zA-Z0-9_]__

In [54]:
text = """ Data science is the domain of study that deals with vast volumes of data using modern tools and 
techniques to find unseen patterns, derive meaningful information, 
and make business decisions. Data science uses complex machine learning algorithms to build predictive models.
ASDFB0987P
RTYUI65780
POIUY3456T
1234 2345 0987 """

result = re.findall(r'\w{2,9}',text)
print(result)

['Data', 'science', 'is', 'the', 'domain', 'of', 'study', 'that', 'deals', 'with', 'vast', 'volumes', 'of', 'data', 'using', 'modern', 'tools', 'and', 'technique', 'to', 'find', 'unseen', 'patterns', 'derive', 'meaningfu', 'informati', 'on', 'and', 'make', 'business', 'decisions', 'Data', 'science', 'uses', 'complex', 'machine', 'learning', 'algorithm', 'to', 'build', 'predictiv', 'models', 'ASDFB0987', 'RTYUI6578', 'POIUY3456', '1234', '2345', '0987']


In [56]:
text = """ Data science is the domain of study that deals with vast volumes of data
RTYUI65780@
POIUY3456T&
1234 2345 0987* """

result = re.findall(r'\W',text)
print(result)

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '\n', '@', '\n', '&', '\n', ' ', ' ', '*', ' ']


__To make pattern searching easier, we have some special sequences, as below.__

7) \b Matches a word boundary, i.e. the position __at the beginning of a word or the end of a word__. Use a raw string (r'...') so that Python does not read \b as a backspace.

In [148]:
text = """Bank IFSC Code is :
          HDFC0009873
          9736467483926
          
          9876543218
          ASDFG5678PA
          ASDSDVFHDBD5566POIUYY
          HGHSBMHSD4567XBCBXJHCXCMX"""

pan_num = re.findall(r'\b[A-Z]{5}\d{4}[A-Z]\b',text) # 'ASDFG5678PA' is rejected: there is no word boundary between 'P' and 'A'
pan_num

[]

In [157]:
text = """Bank IFSC Code is :
          HDFC0009873
          9736467483926
          
          9876543218
          ASDFG5678PA
          ASDSDVFHDBD5566POIUYY
          HGHSBMHSD4567XBCBXJHCXCMX"""

pan_num = re.findall(r'\b\d{10}\b',text)
pan_num

['9876543218']

In [160]:
text = "This is pune "
output = re.findall(r'\bne',text)
output

[]
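
The empty result above follows from where the boundary sits: no word in the text begins with `ne`. The sketch below moves the `\b` around on the same sentence, and also shows what happens when the raw-string prefix is forgotten.

```python
import re

text = "This is pune "

starts_with_pu = re.findall(r"\bpu", text)  # 'pu' begins the word 'pune'
ends_with_ne = re.findall(r"ne\b", text)    # 'ne' ends the word 'pune'
starts_with_ne = re.findall(r"\bne", text)  # no word begins with 'ne'
# Without the r prefix, "\b" is a backspace character, not a word boundary
backspace = re.findall("\bpu", text)

print(starts_with_pu)
print(ends_with_ne)
print(starts_with_ne)
print(backspace)
```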

In [112]:
text = """Bank IFSC Code is :
          HDFC0009873
          9736467483926
          
          9876543218
          ASDFG5678P
          ASDFG56784
          ASDSDVFHDBD5566POIUYY
          HGHSBMHSD4567XBCBXJHCXCMX"""

mobile = re.findall(r'\b\d{10}\b',text) # the 13-digit number is rejected because of the word boundaries
mobile

['9876543218']

In [166]:
# Practical example

import pytesseract
path = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# folder_path = r'C:\Users\MY\anaconda3\Data_science\Velocity\Python_\My_notes\h1_glob_Module\sample1'

In [171]:
pytesseract.pytesseract.tesseract_cmd = path
pan_data = pytesseract.image_to_string(r'C:\Users\MY\anaconda3\Data_science\Velocity\Python_\My_notes\h1_glob_Module\sample1\pan2.jpeg')
pan_data

name = re.findall(r'[A-Z]{3,10}\s\b[A-Z]{5,10}\s[A-Z]{6}',pan_data)
print('Name of the candidate is: ',name[0])

dob = re.findall(r'\d{2}[/]\d{2}[/]\d{4}',pan_data)
print('Date of Birth of the candidate is: ',dob[0])

pan_num = re.findall(r'[A-Z]{5}\d{4}[A-Z]',pan_data)
print('PAN CARD number is: ',pan_num[0])

Name of the candidate is:  MONIKA MAHADEV SHINDE
Date of Birth of the candidate is:  31/10/1992
PAN CARD number is:  EJAPS0276M


__Note:__

__`findall >> list;`
`search >> object;`
`match >> object;`
`sub >> str`__

In [202]:
# More practical example

import pytesseract
path = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

In [203]:
image_name_list = []
name_list = []
dob_list = []
pan_num_list = []

In [204]:
import os
folder_path = r'C:\Users\MY\anaconda3\Data_science\Velocity\Python_\My_notes\h1_glob_Module\sample1\sample2'
file_names = os.listdir(folder_path)

In [205]:
for image_name in file_names:
    print('Image name: ',image_name )
    image_name_list.append(image_name)
    image_name = os.path.join(folder_path,image_name)
    pan_data = pytesseract.image_to_string(image_name)
    name = re.findall(r'[A-Z]{3,11}\s\b[A-Z]{5,10}\s[A-Z]{7}',pan_data)
    if name:
        print('Name of the candidate is: ',name[0])
        name_list.append(name[0])
    else:
        name_list.append('')
    dob = re.findall(r'\d{2}[/]\d{2}[/]\d{4}', pan_data)
    if dob:
        print('Date of Birth of the candidate is: ',dob[0])
        dob_list.append(dob[0])
    else:
        dob_list.append('')
    pan_num = re.findall(r'[A-Z]{5}\d{4}[A-Z]',pan_data)
    if pan_num:
        print('Pan card number of the candidate is: ',pan_num[0])
        pan_num_list.append(pan_num[0])
    else:
        pan_num_list.append('')
    
    print('*'*30)

Image name:  pan1.jpg
Name of the candidate is:  HARICHANDRA PANDURANG KANSARE
Date of Birth of the candidate is:  14/06/1987
Pan card number of the candidate is:  CGZPK0281N
******************************
Image name:  pan2.jpeg
Name of the candidate is:  MAHADEV SHINDE
MAHADEV
Date of Birth of the candidate is:  31/10/1992
Pan card number of the candidate is:  EJAPS0276M
******************************
Image name:  pan3.jpg
Name of the candidate is:  ADITYA MEHENDALE
PRAKASH
Date of Birth of the candidate is:  02/06/1976
Pan card number of the candidate is:  BODPM4264E
******************************
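
The repeated "find, then fall back to an empty string" steps in the loop can be folded into a small helper. The sketch below is one possible arrangement, not the method used above: the function names, the compiled patterns and the sample OCR text are all hypothetical, and the text is hard-coded so the example runs without Tesseract.

```python
import re

# Hypothetical compiled patterns for the fields extracted above
PAN_PATTERN = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
DOB_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def first_match(pattern, text, default=""):
    """Return the first match of pattern in text, or default when there is none."""
    found = pattern.search(text)
    return found.group() if found else default

def extract_pan_fields(ocr_text):
    """Collect the fields of one card into a dictionary."""
    return {
        "dob": first_match(DOB_PATTERN, ocr_text),
        "pan": first_match(PAN_PATTERN, ocr_text),
    }

# Invented stand-in for the string returned by the OCR step
sample_ocr_text = "INCOME TAX DEPARTMENT\nEXAMPLE NAME\n01/01/1990\nABCDE1234F\n"

fields = extract_pan_fields(sample_ocr_text)
empty = extract_pan_fields("unreadable scan")

print(fields)
print(empty)
```

Because a missing field becomes an empty string, every card yields a row of the same shape even when the OCR output is unreadable.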


In [211]:
result = {'Image Name' : image_name_list, 'Candidate Name': name_list, 'Date of birth' : dob_list, 'PAN card number' : pan_num_list}
print(result)

{'Image Name': ['pan1.jpg', 'pan2.jpeg', 'pan3.jpg'], 'Candidate Name': ['HARICHANDRA PANDURANG KANSARE', 'MAHADEV SHINDE\nMAHADEV', 'ADITYA MEHENDALE\nPRAKASH'], 'Date of birth': ['14/06/1987', '31/10/1992', '02/06/1976'], 'PAN card number': ['CGZPK0281N', 'EJAPS0276M', 'BODPM4264E']}


In [215]:
import pandas as pd

df  = pd.DataFrame(result)
df.to_csv('pan_card.csv')

In [216]:
df = pd.read_csv('pan_card.csv')
df

Unnamed: 0.1,Unnamed: 0,Image Name,Candidate Name,Date of birth,PAN card number
0,0,pan1.jpg,HARICHANDRA PANDURANG KANSARE,14/06/1987,CGZPK0281N
1,1,pan2.jpeg,MAHADEV SHINDE\nMAHADEV,31/10/1992,EJAPS0276M
2,2,pan3.jpg,ADITYA MEHENDALE\nPRAKASH,02/06/1976,BODPM4264E


#### 2) re.search()
- It searches for a match and returns a match object for the first occurrence only. Returns None if no position in the string matches the pattern.
- __.group()__ returns the matched text.
- Usage: obj.group()

`Syntax: re.search(pattern, string)`

In [219]:
text = """PAN CARDs
        QWERT5678A
        QWERT3456P
        POIYU5678P
        9988776655
        1234567777
        QWEKL9876L
        
        3456 7890 1234"""

pan_num = re.search(r'\b[A-Z]{5}\d{4}[A-Z]',text)
print(pan_num)
print(pan_num.group())
print(pan_num.start())
print(pan_num.end())

<re.Match object; span=(18, 28), match='QWERT5678A'>
QWERT5678A
18
28


In [220]:
text = """PAN CARDs
        QWERT5678A
        QWERT3456P
        POIYU5678P
        9988776655
        1234567777
        QWEKL9876L
        
        3456 7890 1234"""

pan_num = re.search(r'\b[A-Z]{5}\d{4}[A-Z]',text)
print(pan_num)
if pan_num:
    print('Match found')
    print(pan_num.group())

<re.Match object; span=(18, 28), match='QWERT5678A'>
Match found
QWERT5678A
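
Since `re.search()` returns None when nothing matches, calling `.group()` without a check raises an AttributeError. The sketch below, on a shortened version of the text above, shows the guard together with a few other match-object methods; the parentheses in the pattern are added here only to demonstrate `group(1)` and `groups()`.

```python
import re

text = "PAN CARDs QWERT5678A QWERT3456P"

found = re.search(r"\b([A-Z]{5})(\d{4})([A-Z])\b", text)
missing = re.search(r"\b\d{12}\b", text)  # there is no 12-digit number in the text

if found:
    print(found.group())    # whole match
    print(found.group(1))   # text captured by the first pair of parentheses
    print(found.groups())   # all captured parts as a tuple
    print(found.span())     # (start, end) positions of the match

print(missing)  # None, so missing.group() would raise AttributeError
```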


#### 3) re.match()
- If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. 
- Returns None, if the string does not match the pattern. 
- In other words, it looks for a match only at the beginning of the string and returns a match object.

`Syntax: re.match(pattern, string)`

In [222]:
text = """PAN CARDs
        QWERT5678A
        QWERT3456P
        POIYU5678P
        9988776655
        1234567777
        QWEKL9876L
        
        3456 7890 1234"""

result = re.match(r'[A-Z]{5}\d{4}[A-Z]',text)
if result:
    print('Match found')
    print(result.group())
    
else:
    print('No match found')

No match found


In [223]:
text = """WERTYU CARDs
        QWERT5678A
        QWERT3456P
        POIYU5678P
        9988776655
        1234567777
        QWEKL9876L
        
        3456 7890 1234"""

result = re.match(r'[A-Z]{3}',text)
if result:
    print('Match found')
    print(result.group())
    
else:
    print('No match found')

Match found
WER


In [224]:
text = """ python class
        QWERT5678A
        QWERT3456P
        POIYU5678P
        9988776655
        1234567777
        QWEKL9876L
        
        3456 7890 1234"""

result = re.match(r'python',text)
if result:
    print('Match found')
    print(result.group())
    
else:
    print('No match found')

No match found
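
The last example fails only because of the leading space. The following sketch contrasts `re.match()` with `re.search()` and `re.fullmatch()` on the same kind of input; the variable names are illustrative.

```python
import re

text = " python class"

strict = re.match(r"python", text)              # None: the string starts with a space
lenient = re.match(r"\s*python", text)          # allow leading whitespace in the pattern
anywhere = re.search(r"python", text)           # search scans the whole string
whole = re.fullmatch(r"\s*python class", text)  # the entire string must match

print(strict)
print(repr(lenient.group()))
print(anywhere.span())
print(whole is not None)
```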


#### 4) re.sub()
It works as a replace function. Returns a new string with regular expression replaced with the replacement. 

`Syntax: re.sub(pattern, replacement, string)`

In [226]:
string = """Data science is an interdisciplinary field uses 45,678 scienctific methods"""
string = string.replace('Data science', 'Python')
string

'Python is an interdisciplinary field uses 45,678 scienctific methods'

In [227]:
string = """Data science is an interdisciplinary field uses 45678 scienctific methods"""
new_string = re.sub(r'\d','+', string)
print(new_string)

Data science is an interdisciplinary field uses +++++ scienctific methods


In [228]:
string = """Data science is an @#$%^ interdisciplinary field uses 45678 scienctific methods"""
new_string = re.sub('[^A-Za-z]','', string)
print(new_string)

Datascienceisaninterdisciplinaryfieldusesscienctificmethods


In [229]:
string = """Data science is an @#$%^ interdisciplinary field uses 45678 scienctific methods"""
new_string = re.sub('[^A-Za-z]',' ', string)
print(new_string)

Data science is an       interdisciplinary field uses       scienctific methods
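
Beyond plain replacement, `re.sub()` accepts a `count` argument, backreferences such as `\1` in the replacement, and a function as the replacement; `re.subn()` additionally reports how many replacements were made. A minimal sketch on an invented sentence:

```python
import re

text = "Meeting on 27-05-2022, follow-up on 03-06-2022"

# count=1 replaces only the first match
first_only = re.sub(r"\d", "#", text, count=1)
# \1, \2, \3 refer to the captured groups: reorder DD-MM-YYYY to YYYY-MM-DD
iso_dates = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3-\2-\1", text)
# A function receives each match object and returns the replacement text
upper_words = re.sub(r"[a-z]+-[a-z]+", lambda m: m.group().upper(), text)
# subn returns the new string and the number of replacements
masked, replacements = re.subn(r"\d", "#", text)

print(first_only)
print(iso_dates)
print(upper_words)
print(masked, replacements)
```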


__Characters with special meaning__

#### 1) '+' (Plus sign)
- Causes the resulting RE to match __`one or more occurrences`__ of the preceding RE.

In [231]:
text = """Data         science is the domain of       study that deals with     vast volumes of data using modern tools and 
techniques to find unseen patterns, derive meaningful     information, and     make business decisions.
Data science uses complex machine learning algorithms to build predictive models."""

result = re.sub(r'\s+', ' ',text)
result

'Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.'

In [233]:
text = """Data @#$$%^  science is the domain of study that deals with vast volumes of data using modern tools and 
techniques to find unseen patterns, derive meaningful ##$%^&  information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models."""

result = re.sub(r'\W+', ' ',text)
result

'Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns derive meaningful information and make business decisions Data science uses complex machine learning algorithms to build predictive models '

In [234]:
text = """Data @#$$%^  science is the domain of study that deals with vast volumes of data using modern tools and 
techniques to find unseen patterns, derive meaningful ##$%^&  information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models."""

result = re.sub('[^a-zA-Z0-9]+', ' ',text)
result

'Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns derive meaningful information and make business decisions Data science uses complex machine learning algorithms to build predictive models '

In [235]:
text = 'python and data science pythonnnn pyttthon'
result = re.findall('python',text)
result

['python', 'python']

In [236]:
text = 'python and data science pythonnnn pyttthon'
result = re.findall('pyt+hon',text)
result

['python', 'python', 'pyttthon']

In [237]:
text = 'python and data science pythonnnn pyttthon pyttthonnnnn'
result = re.findall('pyt+hon',text)
result

['python', 'python', 'pyttthon', 'pyttthon']

#### 2) '\*' (star sign)
- Causes the resulting RE to match __`zero or more occurrences`__ of the preceding RE.

In [238]:
text = 'python and data science pyhonnnn pyttthon pyttthooonnnnn'
result = re.findall('pyt*hon',text)
result

['python', 'pyhon', 'pyttthon']

In [239]:
text = 'python and data science pyhonnnn pyttthon pyttthnnnnn'
result = re.findall('pyt*ho*n',text)
result

['python', 'pyhon', 'pyttthon', 'pyttthn']

In [240]:
text = 'python and data science pyhon  pytthhhhon'
result = re.findall('pyt*h+on',text)
result

['python', 'pyhon', 'pytthhhhon']

#### 3) '.' (dot sign)
- Matches exactly one character, which can be any character except a newline. If the DOTALL flag has been specified, it matches any character including a newline.

In [241]:
text = 'python and data science pyhon pytthhhhon'
result = re.findall('py..on', text)
result

['python']

In [242]:
text = 'python and data science pyhon pytthhhhon pysuon'
result = re.findall('sci..ce', text)
result

['science']

In [243]:
text = 'python and data science pyhon pytthhhhon pysuon'
result = re.findall('p...on', text)
result

['python', 'pysuon']

In [244]:
string = '8812345656 8812343030 9812343030 7812343030 8898763030 8823456700'
result = re.findall('9.......30', string)
result

['9812343030']
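
The effect of the DOTALL flag mentioned above can be seen on a two-line string. This is a small sketch with invented text:

```python
import re

text = "first line\nsecond line"

default_dot = re.findall(r"first.*", text)                  # stops at the newline
dotall_dot = re.findall(r"first.*", text, flags=re.DOTALL)  # the dot also matches '\n'
one_char = re.findall(r"l.ne", text)                        # the dot stands for exactly one character

print(default_dot)
print(dotall_dot)
print(one_char)
```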

#### 4) '?' (Question mark sign)
- Causes the resulting RE to match zero or one occurrence of the preceding RE.

In [245]:
text  = 'python and data science pyhon pytthhhhon'
result = re.findall('pyt?hon',text)
result

['python', 'pyhon']

In [249]:
# Difference between '*', '+' and '?'

text  = 'python and data science pyhon pytthon'
result = re.findall('pyt*hon',text)
result

['python', 'pyhon', 'pytthon']

In [246]:
text  = 'python and data science pyhon pytthon'
result = re.findall('pyt?hon',text)
result

['python', 'pyhon']

In [248]:
text  = 'python and data science pyhon pytthon'
result = re.findall('pyt+hon',text)
result

['python', 'pytthon']

#### 5) '^' (Caret sign); startswith 
- Matches at the start of the string, and in MULTILINE mode also matches immediately after each newline.

In [251]:
string = '8812345656 8812343030 9812343030 7812343030 8898763030 8823456700'
result = re.findall('^8', string)
result

['8']

In [252]:
string = '7812345656 8812345656 8812343030 9812343030 7812343030 8898763030 8823456700'
result = re.findall(r'^\d', string)
result

['7']

In [253]:
string = '7812345656 8812345656 8812343030 9812343030 7812343030 8898763030 8823456700'
result = re.findall('^[987]', string) # Caret is outside the square bracket
result

['7']

In [254]:
# difference

string = '7812345656 8812345656 8812343030 9812343030 7812343030 8898763030 8823456700'
result = re.findall('[^987]', string) # Caret is inside the square bracket, so it negates the set: any character except 9, 8 or 7
result

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '5',
 '6',
 ' ',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '5',
 '6',
 ' ',
 '1',
 '2',
 '3',
 '4',
 '3',
 '0',
 '3',
 '0',
 ' ',
 '1',
 '2',
 '3',
 '4',
 '3',
 '0',
 '3',
 '0',
 ' ',
 '1',
 '2',
 '3',
 '4',
 '3',
 '0',
 '3',
 '0',
 ' ',
 '6',
 '3',
 '0',
 '3',
 '0',
 ' ',
 '2',
 '3',
 '4',
 '5',
 '6',
 '0',
 '0']

In [255]:
string = 'Python and data science pyton pytthon'
result = re.findall('^[a-zA-Z]', string)
result

['P']

In [256]:
string = 'Python and data science pyton pytthon'
result = re.findall('^[a-zA-Z]{6}', string)
result

['Python']

#### 6) '$' (Dollar sign); endswith
- Matches at the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before each newline.

In [257]:
string = 'Python and data science pyton pytthon'
result = re.findall('n$', string)
result

['n']

In [258]:
string = 'Python and data science pyton pytthon'
result = re.findall('[a-z]$', string)
result

['n']

In [259]:
string = '   Python and data science pyton pytthon 12345   '
result = re.findall('[a-z0-9]$', string) # string is ending with whitespace character.
result

[]

In [264]:
string = '   Python and data science pyton pytthon 12345   '
string = string.strip() # using the strip function to remove leading and trailing spaces
result = re.findall('[a-z0-9]$', string)
result

['5']

In [265]:
text = """Data science is the domain of study that deals with vast volumes of data using modern 
tools and techniques to find unseen patterns"""
result = re.findall('^D......', text)
result

['Data sc']

In [266]:
text = """Data science is the domain of study that deals with vast volumes of data using modern 
tools and techniques to find unseen patterns"""
result = re.findall('^D.*', text)
result

['Data science is the domain of study that deals with vast volumes of data using modern ']

In [267]:
text = """Data science is the domain of study that deals with vast volumes of data using modern 
tools and techniques to find unseen patterns"""
result = re.findall('^D.+', text)
result

['Data science is the domain of study that deals with vast volumes of data using modern ']
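
In the multi-line examples above, `^` matched only at the very start of the text. With the MULTILINE flag, `^` and `$` also apply to every line. A minimal sketch on an invented three-line string:

```python
import re

text = "Data science\ntools and techniques\nDeep learning"

first_only = re.findall(r"^D\w+", text)                      # start of the string only
every_line = re.findall(r"^D\w+", text, flags=re.MULTILINE)  # start of every line
last_only = re.findall(r"\w+$", text)                        # end of the string only
line_endings = re.findall(r"\w+$", text, flags=re.MULTILINE) # end of every line

print(first_only)
print(every_line)
print(last_only)
print(line_endings)
```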

#### 7) '|' (Or sign)
- A|B, creates a regular expression that will match either A or B.

In [268]:
text = """Data science is the domain of study that deals with vast volumes of data using modern 
tools and techniques to find unseen patterns"""
result = re.findall('Data|science', text) # each word occurs once
result

['Data', 'science']

In [273]:
text = """Data science is the domain of study that Data deals with vast volumes of science data using modern 
tools and techniques to find unseen patterns"""
result = re.findall('Data|science', text) # each word occurs twice
result

['Data', 'science', 'Data', 'science']

In [274]:
text = """Data science is the domain of study that Data deals with vast volumes of science data using modern 
tools and techniques to find unseen patterns 2345"""
result = re.findall('Data|science|2345', text)
result

['Data', 'science', 'Data', 'science', '2345']

In [275]:
text = """27-05-2022 and 27/05/2022"""
result = re.findall(r'\d{2}[-/]\d{2}[-/]\d{4}', text) # You can write it in this way also
result

['27-05-2022', '27/05/2022']

In [276]:
text = """27-05-2022 and 27/05/2022"""
result = re.findall(r'\b\d{2}[-]\d{2}[-]\d{4}\b|\d{2}[/]\d{2}[/]\d{4}', text)
result

['27-05-2022', '27/05/2022']

In [278]:
text = """27-05-2022 and 27/05/2022, '27 May 2022'"""
result = re.findall(r'\b\d{2}[-/]\d{2}[-/]\d{4}\b|\d{2}[ ][A-Za-z]{3}[ ]\d{4}', text) # You can combine as many patterns as you want using |
result

['27-05-2022', '27/05/2022', '27 May 2022']

__.group()__

In [279]:
text = ' batwoman and batman'
result = re.search('bat(wo)?man', text)
result.group()

'batwoman'
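
`.group()` with no argument returns the whole match, while `.group(1)` returns what the first pair of parentheses captured, or None when an optional group did not take part in the match. The sketch below extends the example above; the named groups (`?P<name>`) and the date string are additions for illustration.

```python
import re

with_group = re.search(r"bat(wo)?man", "batwoman and batman")
without_group = re.search(r"bat(wo)?man", "batman only")
named = re.search(r"(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})", "dob 31/10/1992")

print(with_group.group(), with_group.group(1))        # whole match and the captured 'wo'
print(without_group.group(), without_group.group(1))  # the optional group is None
print(named.group("year"))                            # access a group by name
print(named.groupdict())                              # all named groups as a dict
```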

#### 5) re.split()
- Splits the string at each occurrence of the pattern and returns the pieces as a list.

`Syntax: re.split(pattern,string,maxsplit=0)`

In [280]:
text = """Data science is the domain of study that Data deals with vast volumes of science data using modern 
tools and techniques to find unseen patterns 2345 1234"""
list1 = text.split()
print(list1)

['Data', 'science', 'is', 'the', 'domain', 'of', 'study', 'that', 'Data', 'deals', 'with', 'vast', 'volumes', 'of', 'science', 'data', 'using', 'modern', 'tools', 'and', 'techniques', 'to', 'find', 'unseen', 'patterns', '2345', '1234']


In [282]:
text = """Data science 1 is the domain of study that Data deals with vast volumes of science data using modern 
tools and techniques to find unseen patterns 2345 1234"""
list1 = text.split('1')
print(list1)

['Data science ', ' is the domain of study that Data deals with vast volumes of science data using modern \ntools and techniques to find unseen patterns 2345 ', '234']


In [281]:
text = """python A and B data science"""
list1 = re.split('[A-Z]',text)
print(list1)

['python ', ' and ', ' data science']


In [283]:
text = """99,33,44,12-34-67-51"""
list1 = re.split('[,-]',text)
print(list1)

['99', '33', '44', '12', '34', '67', '51']


In [284]:
text = """99,33,44,12-34-67-51"""
list1 = re.split(',|-',text)
print(list1)

['99', '33', '44', '12', '34', '67', '51']
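
The `maxsplit` argument from the syntax line limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A short sketch reusing the string above (the whitespace example is invented):

```python
import re

text = "99,33,44,12-34-67-51"

# Stop after two splits; the rest of the string stays in the last item
first_two = re.split(r"[,-]", text, maxsplit=2)
# Parentheses around the delimiter keep it in the output list
keep_delims = re.split(r"([,-])", text, maxsplit=2)
# Split on any run of whitespace, including newlines
whitespace_split = re.split(r"\s+", "Data   science \n is fun")

print(first_two)
print(keep_delims)
print(whitespace_split)
```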


#### 6) re.compile()
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

`Syntax: re.compile(pattern)`

In [285]:
string5 = '2123-456-9876'
result = re.findall(r'\d{3}[-]\d{3}[-]\d{4}',string5)
result

['123-456-9876']

In [286]:
pattern = re.compile(r'(\d{3}[-])?\d{3}[-]\d{4}')
string1 = '123-456-9876'
result = pattern.search(string1)
result.group()

'123-456-9876'

In [287]:
string1 = '123-456-9876'
string2 = '456-9876'
string3 = '123-9876'
string4 = '567-456-9876'
string5 = '2123-456-9876'
string6 = '123-456-9876'

In [288]:
pattern = re.compile(r'(\d{3}[-])?\d{3}[-]\d{4}')

In [289]:
result1 = pattern.search(string1)
result2 = pattern.search(string2)
result3 = pattern.search(string3)
result4 = pattern.search(string4)
result5 = pattern.search(string5)
result6 = pattern.search(string6)

print(result1.group())
print(result2.group())
print(result3.group())
print(result4.group())
print(result5.group())
print(result6.group())

123-456-9876
456-9876
123-9876
567-456-9876
123-456-9876
123-456-9876


In [290]:
# For example

# prog = re.compile(pattern) # We can use this again and again
# result = prog.match(string)

# Which is equivalent to
# result = re.match(pattern, string) 
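
A runnable version of the idea in the comments above: compile the pattern once and reuse the object in a loop. The list of inputs and the variable names are illustrative; flags such as `re.IGNORECASE` can be passed to `re.compile()` as a second argument.

```python
import re

# Compile once, reuse many times
phone = re.compile(r"(\d{3}-)?\d{3}-\d{4}")

numbers = ["123-456-9876", "456-9876", "2123-456-9876", "no number here"]

results = []
for value in numbers:
    found = phone.search(value)
    # search returns None when there is no match, so guard before calling group()
    results.append(found.group() if found else None)

# Flags are given when the pattern is compiled
any_case = re.compile(r"python", re.IGNORECASE).findall("Python python PYTHON")

print(results)
print(any_case)
```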