# Python_regex::

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

Python has a built-in package called re, which can be used to work with Regular Expressions.

In [1]:
import re

# RegEx Functions::

![image.png](attachment:image.png)
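
As a rough text summary of the functions this notebook goes on to use, here is a minimal sketch that calls `findall`, `search`, `split` and `sub` on the same sample string. The variable names are illustrative only.

```python
import re

txt = "The rain in Spain"

# findall: list of every non-overlapping match
all_ai = re.findall(r"ai", txt)

# search: Match object for the first match, or None if there is none
first_space = re.search(r"\s", txt)

# split: list of the pieces found between matches
words = re.split(r"\s", txt)

# sub: new string with every match replaced
dashed = re.sub(r"\s", "-", txt)

print(all_ai)               # ['ai', 'ai']
print(first_space.start())  # 3
print(words)                # ['The', 'rain', 'in', 'Spain']
print(dashed)               # The-rain-in-Spain
```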

## The findall() Function
The findall() function returns a list containing all matches.
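
Two details of `findall()` are easy to miss: it returns an empty list when nothing matches, and when the pattern contains capturing groups it returns the group contents rather than the whole match. A small sketch, with a sample string chosen for illustration:

```python
import re

txt = "The rain in Spain"

# No match: an empty list, not None
no_match = re.findall(r"Portugal", txt)

# One capturing group: a list of that group's text
one_group = re.findall(r"(\w)ain", txt)

# Two capturing groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w)(ain)", txt)

print(no_match)    # []
print(one_group)   # ['r', 'p']
print(two_groups)  # [('r', 'ain'), ('p', 'ain')]
```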

## MetaCharacters:

In [12]:
txt = "The rain in spain."
#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


In [7]:
txt = "That will be 59 dollars"

#Find all digit characters:

a = re.findall(r"\d", txt)
print(a)

['5', '9']
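
Patterns that contain backslashes are safest written as raw strings (`r"..."`). In a normal string, Python processes escapes before `re` sees them: `"\d"` only works because `\d` is not a recognised string escape (recent Python versions warn about it), while `"\b"` silently becomes a backspace character. A short sketch of the difference:

```python
import re

normal = "\bain"   # backspace character followed by "ain"
raw = r"\bain"     # backslash, "b", then "ain": a word boundary for re

print(len(normal))  # 4
print(len(raw))     # 5

txt = "ain rain"
with_raw = re.findall(raw, txt)        # word-boundary match
with_normal = re.findall(normal, txt)  # looks for a literal backspace

print(with_raw)     # ['ain']
print(with_normal)  # []
```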


In [11]:
txt = "hello world"

#Search for a sequence that starts with "wo", followed by two (any) characters, and a "d":

x = re.findall("wo..d", txt)
print(x)

['world']


In [16]:
#Check if the string starts with 'hello':
txt = "hello world"
x = re.findall("^hello", txt)
if x:
    print("Yes, the string starts with 'hello'")
else:
    print("No match")


Yes, the string starts with 'hello'


In [17]:
txt = "hello world"

#Check if the string ends with 'world':

x = re.findall("world$", txt)
if x:
    print("Yes, the string ends with 'world'")
else:
    print("No match")


Yes, the string ends with 'world'


In [22]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains "a" followed by exactly two "l" characters:

x = re.findall("al{2}", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")


['all']
Yes, there is at least one match!


In [23]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['falls']
Yes, there is at least one match!


## Special sequence::

In [29]:
txt = "The rain in Spain"

#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)
print(x)

[]


In [28]:
txt = "The rain in Spain"

#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

['ain', 'ain']


Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word.

In [30]:
txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

['ain', 'ain']


In [31]:
txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

[]
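
Combining the two boundary sequences makes it possible to match a whole word only. The sketch below searches for "in" in the same sample string; the variable names are illustrative.

```python
import re

txt = "The rain in Spain"

# "in" anywhere, including inside "rain" and "Spain"
anywhere = re.findall(r"in", txt)

# "in" only as a whole word: boundary on both sides
whole_word = re.findall(r"\bin\b", txt)

# "in" at the end of a word but not at its beginning
word_ending = re.findall(r"\Bin\b", txt)

print(anywhere)     # ['in', 'in', 'in']
print(whole_word)   # ['in']
print(word_ending)  # ['in', 'in']
```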


Returns a match where the string contains digits (numbers from 0-9).

In [35]:
txt = "i have 65 dollars."
x = re.findall(r"\d", txt)
print(x)

['6', '5']


Returns a match at every character that is NOT a digit: "\D".

In [36]:
txt = "i have 65 dollars."
x = re.findall(r"\D", txt)
print(x)

['i', ' ', 'h', 'a', 'v', 'e', ' ', ' ', 'd', 'o', 'l', 'l', 'a', 'r', 's', '.']


Returns a match where the string contains a white space character: "\s".

In [37]:
txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

[' ', ' ', ' ']


Returns a match at every character that is NOT a white space character: "\S".

In [38]:
txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']


Returns a match where the string contains any word characters (letters from a to z and A to Z, digits from 0-9, and the underscore _ character): "\w".

In [39]:

txt = "The rain in Spain"

#Return a match at every word character (letters, digits from 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']


In [40]:

txt = "The rain in Spain"

#Return a match at every character that is NOT a word character:

x = re.findall(r"\W", txt)

print(x)

[' ', ' ', ' ']


In [41]:
txt = "The rain in Spain"

#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

['Spain']
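
The difference between `\Z` and `$` (and between `\A` and `^`) only shows up on multi-line text with the `re.MULTILINE` flag: `^` and `$` then match at every line, while `\A` and `\Z` still match only at the very start and end of the string. A minimal sketch, using a two-line variant of the sample string:

```python
import re

txt = "The rain\nin Spain"

line_starts = re.findall(r"^\w+", txt, re.MULTILINE)
string_start = re.findall(r"\A\w+", txt, re.MULTILINE)

line_ends = re.findall(r"\w+$", txt, re.MULTILINE)
string_end = re.findall(r"\w+\Z", txt, re.MULTILINE)

print(line_starts)   # ['The', 'in']
print(string_start)  # ['The']
print(line_ends)     # ['rain', 'Spain']
print(string_end)    # ['Spain']
```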


## Sets::

### A set is a set of characters inside a pair of square brackets [] with a special meaning:

![image.png](attachment:image.png)

In [42]:

txt = "The rain in Spain"

#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)

print(x)

['r', 'a', 'n', 'n', 'a', 'n']


In [46]:
txt = "The rain in Spain"

#Check if the string has any characters between a and n:

x = re.findall("[a-n]", txt)
print(x)

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']


In [47]:
txt = "The rain in Spain"

#Check if the string has other characters than a, r, or n:

x = re.findall("[^arn]", txt)

print(x)

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']


In [48]:
txt = "The rain in Spain"

#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

[]


In [49]:
txt = "8 times before 11:45 AM"

#Check if the string has any digits:

x = re.findall("[0-9]", txt)

print(x)


['8', '1', '1', '4', '5']


In [50]:
txt = "8 times before 16:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)


print(x)

['16', '45']


In [51]:
txt = "8 times before 11:45 AM"

#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

In [52]:
print(x)


['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']


In [53]:
txt = "8 times before  + / = # 11:45 AM"

#Check if the string has any +, /, = or # characters.
#Inside a set these characters are literal, so no comma separators are needed:

x = re.findall("[+/=#]", txt)

print(x)

['+', '/', '=', '#']
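
Characters listed inside a set are taken literally, one by one, so a comma placed between them is itself part of the set and will be matched. The sketch below shows the effect on a sample string that does contain commas (the string is made up for illustration).

```python
import re

txt = "1+1=2, a/b, #tag"

# Only the four symbols
symbols = re.findall(r"[+/=#]", txt)

# Commas inside the brackets are matched too
symbols_and_commas = re.findall(r"[+,/,=,#]", txt)

print(symbols)             # ['+', '=', '/', '#']
print(symbols_and_commas)  # ['+', '=', ',', '/', ',', '#']
```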


## The search() Function

The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [54]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3
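
When there is no match, `search()` returns `None`, so calling `.start()` on the result directly would raise an `AttributeError`. A minimal sketch of guarding against that; the helper name `first_digit_position` is hypothetical and not part of `re`.

```python
import re

def first_digit_position(text):
    """Return the index of the first digit in text, or None if there is none."""
    m = re.search(r"\d", text)
    # search() gives None on failure, so test before using the Match object
    return m.start() if m else None

print(first_digit_position("That will be 59 dollars"))  # 13
print(first_digit_position("The rain in Spain"))        # None
```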


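## The split() Function
The split() function returns a list where the string has been split at each match:

In [60]:
#Split at each white-space character:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)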
print(x)

['The', 'rain', 'in', 'Spain']


In [61]:
#You can control the number of occurrences by specifying the maxsplit parameter:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=2)
print(x)


['The', 'rain', 'in Spain']
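
Splitting on a single `\s` produces empty strings when the text contains runs of spaces; splitting on `\s+` treats each run as one separator. If the pattern is wrapped in a capturing group, the separators are kept in the result. A short sketch with a made-up, irregularly spaced string:

```python
import re

txt = "The  rain in   Spain"  # two spaces, one space, three spaces

single = re.split(r"\s", txt)
runs = re.split(r"\s+", txt)
keep_separators = re.split(r"(\s+)", "a b")

print(single)           # ['The', '', 'rain', 'in', '', '', 'Spain']
print(runs)             # ['The', 'rain', 'in', 'Spain']
print(keep_separators)  # ['a', ' ', 'b']
```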


## The sub() Function
The sub() function replaces the matches with the text of your choice:

In [66]:
#Replace every white-space character with the number 9:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [67]:
#We can control the number of replacements by specifying the count parameter.
#Replace the first 2 occurrences:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain
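
The replacement passed to `sub()` does not have to be a fixed string: it can be a function that receives each Match object and returns the replacement text. The related `subn()` also reports how many replacements were made. A minimal sketch:

```python
import re

txt = "The rain in Spain"

# Replacement function: upper-case the first letter of every word
capitalised = re.sub(r"\b\w", lambda m: m.group().upper(), txt)

# subn returns a (new_string, number_of_replacements) tuple
replaced, n = re.subn(r"\s", "9", txt)

print(capitalised)  # The Rain In Spain
print(replaced, n)  # The9rain9in9Spain 3
```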


## Match Object::

![image.png](attachment:image.png)

In [68]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [70]:
#Print the string passed into the function:
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


In [71]:
#The regular expression looks for any word that starts with an upper case "S":
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain
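
A Match object also exposes the individual capturing groups of the pattern. The sketch below reuses the pattern from the cells above but splits it into two named groups; the group names `initial` and `rest` are chosen here for illustration.

```python
import re

txt = "The rain in Spain"
m = re.search(r"\b(?P<initial>S)(?P<rest>\w+)", txt)

print(m.group())           # Spain   (the whole match)
print(m.group("initial"))  # S
print(m.group("rest"))     # pain
print(m.groups())          # ('S', 'pain')
print(m.span())            # (12, 17)
print(m.span("rest"))      # (13, 17)
```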


## Practice::

In [72]:
var='Image/NotProcessed/image_sign_small.jpg'

In [73]:
var = re.split("/",var)
print(var)

['Image', 'NotProcessed', 'image_sign_small.jpg']


In [86]:
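keys = '92 @ hello world \n h'
txt = keys
#Keep only the first line:
x = re.split("\n", txt)
print(x)
y = x[0]
#Replace each run of non-alphanumeric characters with a single space:
z = re.sub(r"[^a-zA-Z0-9]+", ' ', y)
print(z)
#Remove the first two digits:
z1 = re.sub("[0-9]", "", z, count=2)
print(z1)
#Trim leading and trailing white space:
z2 = z1.strip()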
print(z2)


['92 @ hello world ', ' h']
92 hello world 
 hello world 
hello world


In [87]:
import re
txt="gs://context_primary/pdf_type_entity_extraction_pdf_PDF-1.pdf"
x = re.sub("[.]pdf", "_result.json", txt)
print(x)

gs://context_primary/pdf_type_entity_extraction_pdf_PDF-1_result.json


In [23]:
import re
txt="gs://context_primary/pdf_type_entity_extraction_pdf_PDF-1.pdf"
match1 = re.match(r'gs://([^/]+)/(.+)', txt)


In [24]:
#Group 1 is the bucket name, group 2 is the object path inside the bucket:
bucket = match1.group(1)
print(bucket)
object_path = match1.group(2)
print(object_path)

context_primary
pdf_type_entity_extraction_pdf_PDF-1.pdf
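
The same pattern can be wrapped in a small helper that fails loudly when the input is not a `gs://` URI, instead of raising an `AttributeError` on `None`. This is only a sketch: the function name `split_gcs_uri` is hypothetical and not part of any Google Cloud library.

```python
import re

def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    m = re.match(r"gs://([^/]+)/(.+)", uri)
    if m is None:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return m.group(1), m.group(2)

bucket, object_path = split_gcs_uri(
    "gs://context_primary/pdf_type_entity_extraction_pdf_PDF-1.pdf"
)
print(bucket)       # context_primary
print(object_path)  # pdf_type_entity_extraction_pdf_PDF-1.pdf
```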


In [102]:
import re
txt='comp_form1_result.txt'
result_big = re.split('[.]',txt)
result_big[0]

'comp_form1_result'

In [108]:
import re
txt='gs://context_primary/document/NotProcessed/doc_nlp (1).pdf'
match = re.match(r'gs://([^/]+)/(.+)', txt)
b=match.group(2)
print(b)
z=match.group(1)
print(z)
x = re.split(r"\s", b)
print(x)
y=x[0]
print(y)
y=re.split("[/]",y)
print(y)
l=y[2]
l


document/NotProcessed/doc_nlp (1).pdf
context_primary
['document/NotProcessed/doc_nlp', '(1).pdf']
document/NotProcessed/doc_nlp
['document', 'NotProcessed', 'doc_nlp']


'doc_nlp'

In [109]:
import re
txt='gs://context_primary/document/NotProcessed/doc_nlp (1).pdf'
match = re.match(r'gs://([^/]+)/(.+)', txt)
match=match.group(2)

match=re.sub("[.]pdf", "_result.json", match)
match= re.split("[/]", match)
match=match[2]
match

'doc_nlp (1)_result.json'

In [110]:
import re
k='gs://context_primary/document/Processed/text/b2e2e670-19da-11eb-b82b-7b5705310159doc_hello_result.txt+b2e2e670-19da-11eb-b82b-7b5705310159'
x = re.split('[+]', k)
print(x)


['gs://context_primary/document/Processed/text/b2e2e670-19da-11eb-b82b-7b5705310159doc_hello_result.txt', 'b2e2e670-19da-11eb-b82b-7b5705310159']


In [111]:
#BigQuery table names should contain only letters, numbers, or underscores
import re
txt="document/NotProcessed/doc_schema_testing (3).csv"
txt1="document/NotProcessed/doc_schema_testing(3).csv"
txt = re.split('[/]', txt)
txt=txt[2]
txt = re.split('[.]', txt)
txt=txt[0]
print(txt)

txt=re.sub(r"[^a-zA-Z0-9]+", '_', txt)
print(txt)
#txt=txt.replace(" ","")
#txt



doc_schema_testing (3)
doc_schema_testing_3_
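
The result above keeps a trailing underscore where the closing parenthesis was. One possible way to tidy this up is sketched below: take the file name without its extension, replace every run of other characters with an underscore, then strip underscores from both ends. The helper name `to_table_name` is hypothetical, and the naming rule is the one stated in the comment of the cell above.

```python
import posixpath
import re

def to_table_name(path):
    """Derive a letters/digits/underscore-only name from a file path."""
    # File name without directories and without the extension
    stem = posixpath.splitext(posixpath.basename(path))[0]
    # Replace each run of disallowed characters with one underscore
    name = re.sub(r"[^a-zA-Z0-9_]+", "_", stem)
    # Drop underscores left over at either end
    return name.strip("_")

print(to_table_name("document/NotProcessed/doc_schema_testing (3).csv"))
# doc_schema_testing_3
print(to_table_name("document/NotProcessed/doc_schema_testing(3).csv"))
# doc_schema_testing_3
```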


In [112]:
import re
txt='0.9121770262718201%Bussiness plan document'
txt=re.split('[%]',txt)
score=txt[0]
form_type=txt[1]
print(score)
print(form_type)

0.9121770262718201
Bussiness plan document


In [2]:
import re
y ="My 2 favourite numbers are 45 and 78"
x = re.findall("[0-9]+", y)  # look for runs of one or more digits
print(x)

['2', '45', '78']


For comparison, re.search("[0-9]+", y) returns a Match object for the first match only:

<re.Match object; span=(3, 4), match='2'>
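
To get every match together with its position, `finditer()` yields one Match object per match. The sketch below also converts the matched text to integers; variable names are illustrative.

```python
import re

y = "My 2 favourite numbers are 45 and 78"

# findall gives the matched text only; convert each piece to int
numbers = [int(n) for n in re.findall(r"[0-9]+", y)]

# finditer gives Match objects, so the span of each match is available
spans = [(m.group(), m.span()) for m in re.finditer(r"[0-9]+", y)]

print(numbers)   # [2, 45, 78]
print(spans[0])  # ('2', (3, 4))
```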


![image.png](attachment:image.png)