# Regular Expressions

**Why use regular expressions?**

In [1]:
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"

In [2]:
index = log.index("[")
print(log[index+1:index+6])

12345


In [5]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"

result = re.search(regex, log)
print(result[1])

12345


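Here, the `re` module provides the `search` function, which looks for a regular expression inside a string. The pattern `r"\[(\d+)\]"` matches a literal `[`, captures one or more digits, and then matches a literal `]`; `result[1]` is the captured group. Unlike the slicing approach above, it does not assume the PID is exactly five digits long.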
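
As a minimal sketch of what the match object holds, the example below reads the full match, the captured group, and the position of that group. The variable names `whole`, `pid`, and `span` are illustrative and not part of the original notebook.

```python
import re

log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"

# \[ and \] match literal square brackets; (\d+) captures one or more digits.
match = re.search(r"\[(\d+)\]", log)

whole = match.group(0)  # the full matched text, brackets included
pid = match.group(1)    # only the captured digits
span = match.span(1)    # (start, end) offsets of the captured group

print(whole)
print(pid)
print(log[span[0]:span[1]])  # slicing with the span gives the PID again
```

`match[1]` is shorthand for `match.group(1)`, which is the form the cells in this notebook use.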

## Simple matching in Python

In [6]:
import re
result = re.search(r"aza", "plaza")
print(result)

<re.Match object; span=(2, 5), match='aza'>


In [7]:
import re
result = re.search(r"aza", "bazaar")
print(result)

<re.Match object; span=(1, 4), match='aza'>


In [8]:
import re
result = re.search(r"aza", "maze")
print(result)

print(re.search(r"^x", "xenon"))

None
<re.Match object; span=(0, 1), match='x'>


In [9]:
import re
print(re.search(r"p.ng", "penguin"))

<re.Match object; span=(0, 4), match='peng'>


In [10]:
import re
print(re.search(r"p.ng", "clapping"))
print(re.search(r"p.ng", "sponge"))

<re.Match object; span=(4, 8), match='ping'>
<re.Match object; span=(1, 5), match='pong'>


In [11]:
import re
print(re.search(r"p.ng", "Pangaea", re.IGNORECASE))

<re.Match object; span=(0, 4), match='Pang'>


## Wildcards and character classes

In [1]:
import re
print(re.search(r"[Pp]ython", "Python"))

<re.Match object; span=(0, 6), match='Python'>


In [2]:
import re
print(re.search(r"[a-z]way", "The end of the highway"))
print(re.search(r"[a-z]way", "What a way to go"))
print(re.search("cloud[a-zA-Z0-9]", "cloudy"))
print(re.search("cloud[a-zA-Z0-9]", "cloud9"))

<re.Match object; span=(18, 22), match='hway'>
None
<re.Match object; span=(0, 6), match='cloudy'>
<re.Match object; span=(0, 6), match='cloud9'>


In [11]:
import re
print(re.search(r"[^a-zA-Z]", "This is a sentence with spaces."))
print(re.search(r"[^a-zA-Z ]", "This is a sentence with spaces."))

print(re.search(r"cat|dog", "I like cats."))
print(re.search(r"cat|dog", "I love dogs!"))
print(re.search(r"cat|dog", "I like both dogs and cats."))

print(re.findall(r"cat|dog", "I like both dogs and cats."))

<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(30, 31), match='.'>
<re.Match object; span=(7, 10), match='cat'>
<re.Match object; span=(7, 10), match='dog'>
<re.Match object; span=(12, 15), match='dog'>
['dog', 'cat']


## Repetition qualifiers

In [13]:
import re

# *: 0 or more
print(re.search(r"Py.*n", "Pygmalion"))
print(re.search(r"Py.*n", "Python Programming"))  # greedy: .* runs to the last "n"

print(re.search(r"Py[a-z]*n", "Python Programming"))
print(re.search(r"Py[a-z]*n", "Pyn"))

<re.Match object; span=(0, 9), match='Pygmalion'>
<re.Match object; span=(0, 17), match='Python Programmin'>
<re.Match object; span=(0, 6), match='Python'>
<re.Match object; span=(0, 3), match='Pyn'>
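
The second result above shows `.*` matching greedily: it extends to the last `n` in the string. As a hedged illustration (not part of the original notebook), adding `?` after the qualifier makes it lazy, so it stops at the first `n` instead. The names `greedy` and `lazy` are chosen here for the example only.

```python
import re

text = "Python Programming"

greedy = re.search(r"Py.*n", text)  # .* takes as many characters as it can
lazy = re.search(r"Py.*?n", text)   # .*? takes as few characters as it can

print(greedy[0])
print(lazy[0])
```

Restricting the character class, as in `Py[a-z]*n`, is another way to keep the match from running across the space.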


In [5]:
import re

# + one or more
print(re.search(r"o+l+", "goldfish"))
print(re.search(r"o+l+", "woolly"))
print(re.search(r"o+l+", "boil"))

<re.Match object; span=(1, 3), match='ol'>
<re.Match object; span=(1, 5), match='ooll'>
None


In [6]:
import re

# ? zero or one
print(re.search(r"p?each", "To each their own"))
print(re.search(r"p?each", "I like peaches"))

<re.Match object; span=(3, 7), match='each'>
<re.Match object; span=(7, 12), match='peach'>


## Escaping characters

In [7]:
import re
print(re.search(r".com", "welcome"))
print(re.search(r"\.com", "welcome"))
print(re.search(r"\.com", "mydomain.com"))

<re.Match object; span=(2, 6), match='lcom'>
None
<re.Match object; span=(8, 12), match='.com'>


In [8]:
import re
print(re.search(r"\w*", "This is an example"))
print(re.search(r"\w*", "And_this_is_another"))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 19), match='And_this_is_another'>


## Regular expressions in action

In [14]:
import re
print(re.search(r"A.*a", "Argentina"))
print(re.search(r"A.*a", "Azerbaijan"))
print(re.search(r"^A.*a$", "Australia")) # begins with "A" and ends with "a"

<re.Match object; span=(0, 9), match='Argentina'>
<re.Match object; span=(0, 9), match='Azerbaija'>
<re.Match object; span=(0, 9), match='Australia'>


In [15]:
import re
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
print(re.search(pattern, "_this_is_a_valid_variable_name"))
print(re.search(pattern, "this isn't a valid variable"))
print(re.search(pattern, "my_variable1"))
print(re.search(pattern, "2my_variable1"))

<re.Match object; span=(0, 30), match='_this_is_a_valid_variable_name'>
None
<re.Match object; span=(0, 12), match='my_variable1'>
None
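
A possible alternative to wrapping the pattern in `^` and `$` is `re.fullmatch`, which succeeds only when the pattern consumes the whole string. The sketch below assumes a helper named `is_identifier` and a constant `IDENTIFIER`, both invented for this example. It also shows one difference: `$` tolerates a single trailing newline, while `fullmatch` does not.

```python
import re

# Same rule as above: a letter or underscore, then letters, digits, or underscores.
IDENTIFIER = r"[a-zA-Z_][a-zA-Z0-9_]*"

def is_identifier(text):
    # fullmatch returns a match only if the entire string fits the pattern.
    return re.fullmatch(IDENTIFIER, text) is not None

print(is_identifier("my_variable1"))
print(is_identifier("2my_variable1"))
print(is_identifier("my_variable1\n"))

# With re.search and ^...$, the trailing newline is still accepted.
anchored = re.search(r"^[a-zA-Z_][a-zA-Z0-9_]*$", "my_variable1\n") is not None
print(anchored)
```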


## Exercise

The check_web_address function checks if the text passed qualifies as a top-level web address, meaning that it contains alphanumeric characters (letters, numbers, and underscores), as well as periods, dashes, and plus signs, followed by a period and a letters-only top-level domain such as ".com", ".info", ".edu", etc. Fill in the regular expression to do that, using escape characters, wildcards, repetition qualifiers, beginning and end-of-line characters, and character classes.

In [16]:
import re
def check_web_address(text):
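  # Letters, digits, underscores, periods, plus signs, and dashes,
  # followed by a period and a letters-only top-level domain.
  pattern = r"^[\w.+-]+\.[a-zA-Z]+$"
  result = re.search(pattern, text)
  return result is not None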

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True

True
False
True
False
True


The check_time function checks for the time format of a 12-hour clock, as follows: the hour is between 1 and 12, with no leading zero, followed by a colon, then minutes between 00 and 59, then an optional space, and then AM or PM, in upper or lower case. Fill in the regular expression to do that. How many of the concepts that you just learned can you use here?

In [17]:
import re
def check_time(text):
  # Hour 1-12 without a leading zero, minutes 00-59, an optional space, then AM or PM.
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9] ?[AaPp][Mm]$"
  result = re.search(pattern, text)
  return result is not None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False
print(check_time("6:02 am")) # True
print(check_time("6:02km")) # False

True
True
False
False
True
False


The contains_acronym function checks the text for the presence of 2 or more letters or digits surrounded by parentheses, with at least the first character in uppercase (if it's a letter), returning True if the condition is met, or False otherwise. For example, "Instant messaging (IM) is a set of communication technologies used for text-based communication" should return True since (IM) satisfies the match conditions. Fill in the regular expression in this function:

In [19]:
import re
def contains_acronym(text):
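  # An uppercase letter or digit, then at least one more letter or digit, inside parentheses.
  pattern = r"\([A-Z0-9][a-zA-Z0-9]+\)"
  result = re.search(pattern, text)
  return result is not None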

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True

True
True
False
True
True


What does the "r" before the pattern string in re.search(r"Py.*n", sample.txt) indicate?
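
The `r` prefix marks a raw string literal: Python leaves backslashes untouched, so the pattern reaches the `re` module exactly as written. The sketch below illustrates this with `\b`, which Python otherwise reads as a backspace character. The variable names `plain`, `raw`, and `text` are illustrative.

```python
import re

plain = "\bcat\b"  # Python turns each \b into a single backspace character
raw = r"\bcat\b"   # the backslashes survive, so re sees two word boundaries

print(len(plain), len(raw))

text = "a cat sat"
print(re.search(plain, text))  # searches for literal backspace characters
print(re.search(raw, text))    # searches for the whole word "cat"
```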

An intern implemented a zip code checker, but it works only with five-digit zip codes. Your task is to update the checker so that it also accepts nine-digit zip codes: the leading five digits plus an optional four digits after a hyphen. The zip code needs to be preceded by at least one space, and cannot be at the start of the text. Update the regular expression.

In [21]:
import re

def check_zip_code(text):
  # Whitespace, five digits, then an optional hyphen and four more digits.
  # The trailing \b keeps the match from ending inside a longer number.
  result = re.search(r"\s\d{5}(-\d{4})?\b", text)
  return result is not None

# Call the check_zip_code function with test cases
print(check_zip_code("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code("90210 is a TV show"))  # False (no space before 90210)
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))  # False

True
False
True
False


# Advanced RegEx

## Capturing groups

In [23]:
import re
result = re.search(r"^(\w*), (\w*)$", "Lovelace, Ada")
print(result)
print(result.groups())
print(result[0])
print(result[1])
print(result[2])
"{} {}".format(result[2], result[1])

<re.Match object; span=(0, 13), match='Lovelace, Ada'>
('Lovelace', 'Ada')
Lovelace, Ada
Lovelace
Ada


'Ada Lovelace'

In [24]:
import re
def rearrange_name(name):
    result = re.search(r"^(\w*), (\w*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Lovelace, Ada")

'Ada Lovelace'

In [25]:
import re
def rearrange_name(name):
    result = re.search(r"^(\w*), (\w*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Ritchie, Dennis")

'Dennis Ritchie'

In [26]:
import re
def rearrange_name(name):
    result = re.search(r"^([\w \.-]*), ([\w \.-]*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Hopper, Grace M.")

'Grace M. Hopper'

## More on repetition qualifiers

In [27]:
import re
print(re.search(r"[a-zA-Z]{5}", "a ghost"))

<re.Match object; span=(2, 7), match='ghost'>


In [28]:
import re
print(re.search(r"[a-zA-Z]{5}", "a scary ghost appeared"))

<re.Match object; span=(2, 7), match='scary'>


In [29]:
import re
print(re.findall(r"[a-zA-Z]{5}", "a scary ghost appeared"))

['scary', 'ghost', 'appea']


In [30]:
import re
re.findall(r"\b[a-zA-Z]{5}\b", "A scary ghost appeared") # \b marks a word boundary, so only whole five-letter words match

['scary', 'ghost']

In [31]:
import re
print(re.findall(r"\w{5,10}", "I really like strawberries"))

['really', 'strawberri']


In [32]:
import re
print(re.findall(r"\w{5,}", "I really like strawberries"))

['really', 'strawberries']


In [33]:
import re
print(re.search(r"s\w{,20}", "I really like strawberries"))

<re.Match object; span=(14, 26), match='strawberries'>


## Extracting a PID using regexes in Python

In [34]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
print(result[1])

12345


In [35]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
print(result[1])

34567


In [36]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
print(result[1])
# result is None here because "[cage]" contains no digits, so result[1] raises an error.

TypeError: 'NoneType' object is not subscriptable

In [37]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
def extract_pid(log_line):
    regex = r"\[(\d+)\]"
    result = re.search(regex, log_line)
    if result is None:
        return ""
    return result[1]
print(extract_pid(log))

12345


In [38]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
def extract_pid(log_line):
    regex = r"\[(\d+)\]"
    result = re.search(regex, log_line)
    if result is None:
        return ""
    return result[1]
print(extract_pid(log))
print(extract_pid("99 elephants in a [cage]"))

12345



## Splitting and replacing

In [39]:
import re
re.split(r"[.?!]", "One sentence. Another one? And the last one!")

['One sentence', ' Another one', ' And the last one', '']

In [40]:
import re
re.split(r"([.?!])", "One sentence. Another one? And the last one!")

['One sentence', '.', ' Another one', '?', ' And the last one', '!', '']
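
Because the capturing group keeps each separator in the result, the pieces can be joined back into full sentences. The following is a minimal sketch of that idea; the names `parts` and `sentences` are illustrative and not from the original notebook.

```python
import re

text = "One sentence. Another one? And the last one!"
parts = re.split(r"([.?!])", text)

# parts alternates between sentence text and the separator that followed it,
# and ends with an empty string because the text ends with a separator.
sentences = [
    (body + mark).strip()
    for body, mark in zip(parts[0::2], parts[1::2])
]

print(sentences)
```

`zip` stops at the shorter sequence, which drops the trailing empty string without a special case.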

In [41]:
import re
re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", "Received an email for go_nuts95@my.example.com")

'Received an email for [REDACTED]'

In [42]:
import re
re.sub(r"^([\w .-]*), ([\w .-]*)$", r"\2 \1", "Lovelace, Ada")

'Ada Lovelace'

In [43]:
import re
def find_gov_urls(website):
    # An http or https URL with no whitespace that ends in ".gov" at a word boundary.
    pattern = r"https?://\S*\.gov\b"
    result = re.findall(pattern, website)
    return result


print(find_gov_urls("https://www.data.gov is a great place to find open source datasets!")) # Should return ['https://www.data.gov']
print(find_gov_urls("Learn more about US National Parks at https://www.nps.gov, https://www.nationalparks.org, or https://www.recreation.gov.")) # Should return ['https://www.nps.gov', 'https://www.recreation.gov']
print(find_gov_urls("The Library of Congress (https://www.loc.gov) is an incredible resource!")) # Should return ['https://www.loc.gov']
print(find_gov_urls("The Library of Congress (www.loc.gov) is an incredible resource!")) # Should return []

['https://www.data.gov']
['https://www.nps.gov', 'https://www.recreation.gov']
['https://www.loc.gov']
[]


In [44]:
import re
def parse_sentences(sentence):
    # Each run of non-whitespace characters is one token, so punctuation stays attached to its word.
    pattern = r"\S+"
    result = re.findall(pattern, sentence)
    return result

print(parse_sentences("Hello! How are you doing?")) # should return ['Hello!', 'How', 'are', 'you', 'doing?']
print(parse_sentences("what a beautiful day it is")) # should return ['what', 'a', 'beautiful', 'day', 'it', 'is']
print(parse_sentences("2 + 2 is definitely 4!")) # should return ['2', '+', '2', 'is', 'definitely', '4!']


['Hello!', 'How', 'are', 'you', 'doing?']
['what', 'a', 'beautiful', 'day', 'it', 'is']
['2', '+', '2', 'is', 'definitely', '4!']


In [45]:
import re
def find_productID(report):
  # A 1 followed by three digits, two uppercase letters, and two digits, separated by hyphens.
  pattern = r'\b1\d{3}-[A-Z]{2}-\d{2}\b'
  result = re.findall(pattern, report)
  return result
  
print(find_productID("Products 1234-AB-30 and 2234-AB-30, not items 12-AB-30 or 12345-AB-30")) # Should return ['1234-AB-30']
print(find_productID("Products of interest are 1234-AB-30, 1678-XZ-11, and 1561-CD-57. We're not interested in other products like 2345-AB-29.")) # Should return ['1234-AB-30', '1678-XZ-11', '1561-CD-57']

['1234-AB-30']
['1234-AB-30', '1678-XZ-11', '1561-CD-57']
