# What is the name of the feature responsible for generating Regex objects?



<p>To work with regex objects or patterns, we use the <strong>re</strong> module. It is imported as shown below:</p>
<p><em><strong>import re</strong></em></p>
<p>Python&rsquo;s <em><strong>re.compile()</strong></em> function compiles a regular expression pattern, provided as a string, into a regex pattern object.</p>
<p>Later we can use this pattern object to search for a match inside different target strings by calling its methods, such as <em><strong>match()</strong></em> or <em><strong>search()</strong></em>.</p>
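As a small sketch of this workflow (the pattern and the sample string are made up for illustration), a pattern compiled once can be reused through its own methods:

```python
import re

# Compile the pattern once; re.compile returns a reusable pattern object.
word_pattern = re.compile(r"\bS\w+")

# search() scans the whole string; match() only tries at the very start.
found = word_pattern.search("The rain in Spain")
at_start = word_pattern.match("The rain in Spain")

print(found.group())  # Spain
print(at_start)       # None
```

Here match() returns None because the string does not start with a word beginning with "S", while search() finds one later in the string.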

# Why do raw strings often appear in Regex objects?


Raw strings are used so that backslashes do not have to be escaped.

The issue with using a normal string to write regexes that contain a \ is that you have to write \\ for every \. For example, the string literals "stuff\\things" and r"stuff\things" produce the same string. Raw strings are especially useful when you want to write a regular expression that matches literal backslashes.
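A minimal sketch of the difference, assuming a made-up Windows-style path as the sample text:

```python
import re

# Both literals describe the same three-character regex: backslash, "d", "+".
escaped = "\\d+"
raw = r"\d+"
print(escaped == raw)  # True

# To match one literal backslash the regex itself needs \\,
# which takes four backslashes in a normal string but only two in a raw string.
path = r"C:\temp"
normal_hits = re.findall("\\\\", path)
raw_hits = re.findall(r"\\", path)
print(normal_hits == raw_hits)  # True
print(len(raw_hits))            # 1
```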

# What is the return value of the search() method?


The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match is returned.

If no matches are found, the value None is returned:

In [1]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [2]:
import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None


# From a Match item, how do you get the actual strings that match the pattern?


The group() method returns the text that matched the pattern as a string.

In [3]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


# In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 1? Group 2?


Group 0 is the entire match, group 1 covers the first set of parentheses, and group 2 covers the second set of parentheses.
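A short sketch with the regex from the question, applied to a made-up phone number:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

print(mo.group(0))  # 415-555-4242  (the entire match)
print(mo.group(1))  # 415           (first set of parentheses)
print(mo.group(2))  # 555-4242      (second set of parentheses)
print(mo.groups())  # ('415', '555-4242')
```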

# In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?



A backslash \ placed directly before the character makes the regex match a literal parenthesis or period, for example:

 \.

 \(

 \)
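As an illustration (the sample text is invented), the escaped characters match only themselves. The standard library function re.escape() can also build the escaped pattern from a literal string:

```python
import re

text = "Call f(x). Then stop."

# Unescaped, "." matches any character and "(" ")" open and close a group.
# Escaped, each one matches only the literal character.
literal_call = re.search(r"f\(x\)\.", text)
print(literal_call.group())  # f(x).

# re.escape adds the backslashes for us.
escaped_pattern = re.escape("f(x).")
escaped_call = re.search(escaped_pattern, text)
print(escaped_call.group())  # f(x).
```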


# The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?


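If the regex has no groups, a list of strings is returned, one string per match.

If the regex has exactly one group, a list of strings is still returned, each holding the text captured by that group. If the regex has two or more groups, a list of tuples of strings is returned, one tuple per match.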

In [13]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [16]:
import re
s = 'ab1cd2efg1hij2k'
a = re.findall( r'((?:1)([a-z]+)(?:2)|([a-z]))', s )
print(a)

[('a', '', 'a'), ('b', '', 'b'), ('1cd2', 'cd', ''), ('e', '', 'e'), ('f', '', 'f'), ('g', '', 'g'), ('1hij2', 'hij', ''), ('k', '', 'k')]
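The effect of the number of groups can be sketched on the same sample sentence. The patterns here are made up for illustration:

```python
import re

txt = "The rain in Spain"

no_group = re.findall(r"\w+ain", txt)        # whole matches
one_group = re.findall(r"(\w+)ain", txt)     # text of the single group
two_groups = re.findall(r"(\w+)(ain)", txt)  # one tuple per match

print(no_group)    # ['rain', 'Spain']
print(one_group)   # ['r', 'Sp']
print(two_groups)  # [('r', 'ain'), ('Sp', 'ain')]
```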


# In regular expressions, what does the | character mean?


The | character means “either, or”: the regex matches the expression on its left or the expression on its right.

In [19]:
import re
txt = 'tim is walking and tom is running'
pattern = 'tim|tom'
re.findall(pattern, txt)

['tim', 'tom']

# In regular expressions, what does the ? character stand for?


The ? character matches zero or one occurrence of the preceding group, which makes that group optional.

In [21]:
import re

txt = "helo hello heo planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":

x = re.findall("he.?o", txt)

print(x)

['helo', 'heo']


# In regular expressions, what is the difference between the + and * characters?


The + matches one or more occurrences of the preceding group.

In [25]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)


['hello']


The * matches zero or more occurrences of the preceding group.

In [22]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":

x = re.findall("he.*o", txt)

print(x)


['hello']
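Both examples above return the same result, so here is a sketch on a different, made-up string where the two quantifiers disagree. It uses \w instead of . so that a match cannot run across the space between words:

```python
import re

txt = "heo hello"

# "+" needs at least one word character between "he" and "o", so "heo" fails.
plus_matches = re.findall(r"he\w+o", txt)

# "*" also accepts zero characters in between, so "heo" matches too.
star_matches = re.findall(r"he\w*o", txt)

print(plus_matches)  # ['hello']
print(star_matches)  # ['heo', 'hello']
```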


#  What is the difference between {4} and {4,5} in regular expression?


The {4} matches exactly four instances of the preceding group. The {4,5} matches between four and five instances.

In [27]:
import re

txt = "hello helo heo helllo hellllo planet"

#Search for a sequence that starts with "he", followed by exactly 3 (any) characters, and an "o":

x = re.findall("he.{3}o", txt)

print(x)



x = re.findall("he.{4}o", txt)

print(x)

['helllo']
['hellllo']
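A small sketch of {4} against {4,5} on made-up digit runs. The \b word boundaries are an addition here so that each run of digits has to match as a whole:

```python
import re

txt = "1234 12345 123456"

exactly_four = re.findall(r"\b\d{4}\b", txt)    # only runs of 4 digits
four_or_five = re.findall(r"\b\d{4,5}\b", txt)  # runs of 4 or 5 digits

print(exactly_four)  # ['1234']
print(four_or_five)  # ['1234', '12345']
```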


#  What do the \d, \w, and \s shorthand character classes signify in regular expressions?


<p><strong>\d</strong> matches any single digit (such as 0-9)</p>

In [28]:
import re

txt = "The rain in Spain"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match


<p><strong>\w</strong> matches any single word character: a letter (a-z, A-Z), a digit (0-9), or the underscore _ character</p>

In [29]:
import re

txt = "The rain in Spain"

#Return a match at every word character (letters a-z and A-Z, digits from 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


<p><strong>\s</strong> matches any single white space character (space, tab, newline and so on)</p>

In [30]:
import re

txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


#  What do the \D, \W, and \S shorthand character classes signify in regular expressions?



<p><strong>\D</strong> matches any single character that is NOT a digit</p>

In [31]:
import re

txt = "The rain in Spain"

#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


<p><strong>\W</strong> matches any single character that is NOT a word character</p>

In [32]:
import re

txt = "The rain in Spain"

#Return a match at every NON word character (anything that is not a letter, digit, or underscore, such as "!", "?", white-space etc.):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


<p><strong>\S</strong> matches any single character that is NOT a white space character</p>

In [33]:
import re

txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


#  What is the difference between .* and .*? in a regex?


The .* performs a greedy match, and the .*? performs a nongreedy match.
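A minimal sketch of the difference, using an invented snippet of markup as the sample text:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.findall(r"<.*>", html)

# Nongreedy: .*? grabs as little as possible, so each tag matches separately.
nongreedy = re.findall(r"<.*?>", html)

print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(nongreedy)  # ['<b>', '</b>', '<i>', '</i>']
```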

#  What is the syntax for matching both numbers and lowercase letters with a character class?


In [35]:
import re

text = 'We live at 9-162 Malibeu. My phone number is 666688888.'
# Find every run of digits; [0-9] is a character class covering the digits 0 to 9
x = re.findall('[0-9]+', text)
print(x)

['9', '162', '666688888']


In [91]:
sequences = ['asdasdadsadsad','asdasdadsasd','NONEARELOWERCASE', '666688888.', '665786976576765'] 

lower_indx = []
Upr_indx = []
numeric = []

for seq in sequences:
    matches = re.finditer("[a-z]+", seq) # Iterator of Match objects.
    lower_indx.append([match.group(0)  for match in matches]) # add substrings
    matches = re.finditer("[A-Z]+", seq) # Iterator of Match objects.
    Upr_indx.append([match.group(0) for match in matches]) # add substrings
    matches = re.finditer("[0-9]+", seq) # Iterator of Match objects.
    numeric.append([match.group(0) for match in matches]) # add substrings
    
    
print('Lower Index')
print([ele for ele in lower_indx if ele != []])

print('Upper Index')
print([ele for ele in Upr_indx if ele != []])

print('Numeric values')
print([ele for ele in numeric if ele != []])

Lower Index
[['asdasdadsadsad'], ['asdasdadsasd']]
Upper Index
[['NONEARELOWERCASE']]
Numeric values
[['666688888'], ['665786976576765']]
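The cells above match digits and letters separately. To match both with a single character class, the two ranges can be combined as [0-9a-z] (or [a-z0-9]). A short sketch on a made-up string:

```python
import re

text = "Room 9b, Gate A12, seat c3"

# [a-z0-9] matches one character that is a lowercase letter or a digit;
# uppercase letters, spaces and punctuation are left out.
runs = re.findall(r"[a-z0-9]+", text)

print(runs)  # ['oom', '9b', 'ate', '12', 'seat', 'c3']
```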


#  What is the procedure for making a regular expression case insensitive?


In [41]:
import re

data = [
    {'system_name': 'a1pvdb092', 'fdc_inv_sa_team': 'X2AIX_GBS'},
    {'system_name': 'W00000001.1DC.com', 'fdc_inv_sa_team': 'LAA.BRAZIL.AAA.WINDOWS\n'},
    {'system_name': 'a10000048', 'fdc_inv_sa_team': 'X2AIX_NSS'},
    {'system_name': 'a10000049', 'fdc_inv_sa_team': 'X2AIX_NSS'},
]

for row in data:
    sysname = row['system_name']
    print([re.sub(r'\.1dc\.com', '', sysname, flags=re.IGNORECASE)])

['a1pvdb092']
['W00000001']
['a10000048']
['a10000049']
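The cell above passes flags=re.IGNORECASE to re.sub(). The same flag (short alias re.I) can also be passed as the second argument to re.compile(). A sketch with a made-up pattern and sample strings:

```python
import re

# Passing re.IGNORECASE (or re.I) makes the compiled pattern ignore case.
robocop = re.compile(r"robocop", re.IGNORECASE)

first = robocop.search("RoboCop is part man").group()
every = robocop.findall("ROBOCOP meets robocop")

# The inline flag (?i) at the start of the pattern has the same effect.
inline = re.search(r"(?i)robocop", "ROBOCOP").group()

print(first)   # RoboCop
print(every)   # ['ROBOCOP', 'robocop']
print(inline)  # ROBOCOP
```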


#  What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?


In a regex, . matches any character except a newline (\n).

So if the string contains newlines, .* stops at the first newline and cannot match across lines.

If the re.DOTALL flag (also known as re.S) is passed, . matches every character, including the newline (\n).

In [82]:
string = '\nSubject sentence is:  Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is :  may have\npredicate sentence is:  a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'

match = re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)


print(match.groups())

AttributeError: 'NoneType' object has no attribute 'groups'

In [79]:
string = '\nSubject sentence is:  Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is :  may have\npredicate sentence is:  a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'

match = re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)


print(match.groups())

('  Appropriate support for families of children diagnosed with hearing impairment\n', '  may have\n', '  a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss')


#  If numReg = re.compile(r'\d+'), what will numReg.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') return?


In [69]:
numReg = re.compile(r'\d+')

numReg.sub('X', '11 drummers, 10 pipers, five rings, 4 hen')

'X drummers, X pipers, five rings, X hen'

#  What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?



The re.VERBOSE argument allows you to add whitespace and comments to the string passed to re.compile().

In [83]:
import re
verbose_item_pattern = re.compile(r"""
    $            # end of line boundary
    \s{1,2}      # 1-or-2 whitespace character, including the newline
    I            # a capital I
    [tT][eE][mM] # one character from each of the three sets this allows for unknown case
    \s+          # 1-or-more whitespaces INCLUDING newline
    \d{1,2}      # 1-or-2 digits
    [.]?         # 0-or-1 literal .
    \(?          # 0-or-1 literal open paren
    [a-e]?       # 0-or-1 letter in the range a-e
    \)?          # 0-or-1 closing paren
    .*           # any number of unknown characters so we can have words and punctuation
    [^0-9]       # anything but [0-9]
    $            # end of line boundary
    """, re.VERBOSE|re.MULTILINE)

x = verbose_item_pattern.search("""
 Item 1.0(a) foo bar
""")

print(x.group())


 Item 1.0(a) foo bar



#  How would you write a regex that matches a number with a comma for every three digits? It must match the following:
'42'
'1,234'
'6,368,745'



In [90]:
com = re.compile(r'^(\d{1,3}(,\d{3})*)$')

# The last entry has badly placed commas and must not match
numbers = ['42','1,234','6,368,745','234,3,234,34']

for num in numbers:
    x = com.search(num)
    if x is not None:
        print(x.group())

42
1,234
6,368,745
