# Assignment 7 Solutions

#### 1. What is the name of the feature responsible for generating Regex objects?
**Ans:** **`re.compile()`** is the feature responsible for generating Regex objects.
In Python, the `re` module provides support for working with regular expressions. Its `compile()` function compiles a regular expression pattern into a Regex object (an instance of `re.Pattern`).

The compile() function takes a regular expression pattern as its first argument, and it returns a Regex object that can be used to match against text.

In [3]:
import re

regex_pattern = re.compile(r'\d+')

Here, the regular expression pattern r'\d+' matches one or more digits, and we pass it to the compile() function to create a Regex object. We can then use this Regex object to search for matches in text:

In [4]:
text = 'The price is $100'
match = regex_pattern.search(text)

if match:
    print('Match found:', match.group())
else:
    print('No match')


Match found: 100


In this example, the search() method of the Regex object looks for the first match of the pattern anywhere in the text string. If a match is found, the group() method of the returned match object extracts the matched string.

#### 2. Why do raw strings often appear in Regex objects?

**Ans:** The difference between a regular string and a raw string is that in a regular string, certain characters are treated specially, such as the backslash character "\\". The backslash is used to escape certain characters, such as a newline "\n" or a tab "\t", or to insert special characters such as Unicode characters.

In a raw string, however, the backslash character is treated as a regular character and is not used to escape any other characters. This means that a raw string can contain backslashes without needing to escape them with another backslash.

In [5]:
regular_string = "Hello\nWorld"
raw_string = r"Hello\nWorld"

print(regular_string)  # output: Hello
                       #         World
print(raw_string)      # output: Hello\nWorld


Hello
World
Hello\nWorld


For example, consider the regex pattern `\d+`, which matches one or more digits. To write this pattern in a regular string, we would need to escape the backslash character by writing it as "\\\\", so the pattern would become "\\\\d+". By using a raw string, we can write the pattern as r"\d+", which is much easier to read and write.

Raw strings are therefore common in regex patterns: they keep the pattern readable and avoid bugs where Python interprets a backslash sequence itself (for example "\b", the backspace character) before the regex engine ever sees it.
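
As a small sketch of the point above (the sample text and variable names are illustrative assumptions, not part of the assignment), the following compares a raw string with its escaped equivalent and shows the `\b` pitfall:

```python
import re

# A raw string and a doubled-backslash regular string spell the same pattern.
raw_pattern = r"\d+"
escaped_pattern = "\\d+"
same_pattern = raw_pattern == escaped_pattern
print(same_pattern)  # True

# In a regular string "\b" is a single backspace character;
# in a raw string it is a backslash followed by "b" (a regex word boundary).
backspace_len = len("\b")
boundary_len = len(r"\b")
print(backspace_len, boundary_len)  # 1 2

text = "cat concat"
with_raw = re.findall(r"\bcat\b", text)     # word-boundary pattern
without_raw = re.findall("\bcat\b", text)   # looks for backspace characters
print(with_raw)     # ['cat']
print(without_raw)  # []
```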

#### 3. What is the return value of the search() method?
**Ans:** `re.search(pattern, string)` returns a match object if the pattern is found anywhere in the string; otherwise it returns `None`.

In [10]:
import re

# search for a pattern in a string
text = "The quick brown fox jumps over the lazy dog"
pattern = r"brown"

match_object = re.search(pattern, text)

if match_object:
    print("Pattern found at position:", match_object.start())
else:
    print("Pattern not found")


Pattern found at position: 10


The search() method returns a match object if the pattern is found in the string, and None if the pattern is not found.
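
The `None` case is worth seeing next to the match case, because calling a method on `None` raises an error. A minimal sketch, with patterns chosen here only for illustration:

```python
import re

text = "The quick brown fox jumps over the lazy dog"

found = re.search(r"fox", text)    # pattern occurs in the text
missing = re.search(r"cat", text)  # pattern does not occur

print(found.start())  # 16
print(missing)        # None

# Guard before using the result, since None has no start() method.
position = missing.start() if missing else -1
print(position)       # -1
```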

#### 4. From a Match item, how do you get the actual strings that match the pattern?
**Ans:** The `group()` method of a match object returns the actual string that matched the pattern.

In [12]:
import re

# search for a pattern in a string
text = "The quick brown fox jumps over the lazy dog"
pattern = r"brown"

match_object = re.search(pattern, text)

if match_object:
    print("Pattern found :", match_object.group())
else:
    print("Pattern not found")

Pattern found : brown


#### 5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?
**Ans:** In the regex **`r'(\d\d\d)-(\d\d\d-\d\d\d\d)'`**, group 0 covers the entire match, group 1 covers the first set of parentheses **`(\d\d\d)`**, and group 2 covers the second set **`(\d\d\d-\d\d\d\d)`**.

In [19]:
import re

text = "My phone number is 123-456-7890"
pattern = r'(\d\d\d)-(\d\d\d-\d\d\d\d)'

match_object = re.search(pattern, text)

if match_object:
    print("Group 0 (full match):", match_object.group(0))
    print("Group 1 (area code):", match_object.group(1))
    print("Group 2 (phone number):", match_object.group(2))

Group 0 (full match): 123-456-7890
Group 1 (area code): 123
Group 2 (phone number): 456-7890


In [14]:
print(match_object.group())

123-456-7890
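
Beyond `group(n)`, a match object can return all groups at once. The sketch below also uses named groups; the names `area` and `number` are illustrative assumptions, not part of the original pattern:

```python
import re

text = "My phone number is 123-456-7890"
match_object = re.search(r'(\d\d\d)-(\d\d\d-\d\d\d\d)', text)

all_groups = match_object.groups()  # every numbered group as a tuple
pair = match_object.group(1, 2)     # several groups in one call
print(all_groups)  # ('123', '456-7890')
print(pair)        # ('123', '456-7890')

# The same pattern with named groups (names chosen for this example).
named = re.search(r'(?P<area>\d{3})-(?P<number>\d{3}-\d{4})', text)
area = named.group('area')
as_dict = named.groupdict()
print(area)     # 123
print(as_dict)  # {'area': '123', 'number': '456-7890'}
```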


#### 6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?
**Ans:** Escape them with a backslash: **`\.`**, **`\(`** and **`\)`** in the raw string passed to re.compile() match a literal period and literal parentheses.

In [20]:
import re

text = "(example.com)"
pattern = r"\(example\.com\)"  # escape parentheses and period

match_object = re.search(pattern, text)

if match_object:
    print("Match found:", match_object.group())
else:
    print("Match not found")

Match found: (example.com)


In [21]:
print(match_object.group())

(example.com)
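
When the literal text is not known in advance, `re.escape()` can add the backslashes. A short sketch, where the string `exampleXcom` is a made-up input showing what goes wrong without escaping:

```python
import re

literal = "(example.com)"

# re.escape() backslash-escapes the characters that are special in a regex.
escaped = re.escape(literal)
print(escaped)  # \(example\.com\)

exact = re.fullmatch(escaped, literal)
print(exact.group())  # (example.com)

# Unescaped, the parentheses form a group and the dot matches any character.
loose = re.search(r"(example.com)", "exampleXcom")
print(loose.group())  # exampleXcom
```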


#### 7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?
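**Ans:** If the regex pattern has no groups, `findall()` returns a list of the matched strings. If the pattern has exactly one group, it returns a list of the strings matched by that group. If the pattern has two or more groups, it returns a list of tuples of strings, one tuple per match.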

In [27]:
import re

text = "The cat sat on the mat. The other cat ate the rat."
pattern = r"cat"

matches = re.findall(pattern, text)
print(matches)
# Output: ['cat', 'cat']

['cat', 'cat']


In [28]:
import re

text = "My phone number is 123-456-7890. Her phone number is 456-789-1234."
pattern = r"(\d{3})-(\d{3}-\d{4})"

matches = re.findall(pattern, text)
print(matches)
# Output: [('123', '456-7890'), ('456', '789-1234')]

[('123', '456-7890'), ('456', '789-1234')]
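
The single-group case and non-capturing groups are easy to overlook. A minimal sketch on the same sample text (variable names are illustrative):

```python
import re

text = "My phone number is 123-456-7890. Her phone number is 456-789-1234."

# No groups: a list of whole matches.
no_groups = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(no_groups)  # ['123-456-7890', '456-789-1234']

# Exactly one group: a list of what that group captured.
one_group = re.findall(r"(\d{3})-\d{3}-\d{4}", text)
print(one_group)  # ['123', '456']

# A non-capturing group (?:...) does not count as a group for findall().
non_capturing = re.findall(r"(?:\d{3})-\d{3}-\d{4}", text)
print(non_capturing)  # ['123-456-7890', '456-789-1234']
```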


#### 8. In regular expressions, what does the | character mean?
**Ans:** In regular expressions, `|` is the alternation (`OR`) operator: `A|B` matches either `A` or `B`.

In [29]:
import re

text = "I have a cat and a dog"
pattern = r"cat|dog"

matches = re.findall(pattern, text)
print(matches)
# Output: ['cat', 'dog']


['cat', 'dog']
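
Alternation applies to everything on each side of the `|` unless parentheses limit its scope. A sketch with made-up sample text:

```python
import re

text = "Batman drove the Batmobile"

# Parentheses limit the alternation to the suffix after "Bat".
grouped = re.findall(r"Bat(?:man|mobile)", text)
print(grouped)  # ['Batman', 'Batmobile']

# Without them, the alternatives are "Batman" and "mobile".
ungrouped = re.findall(r"Batman|mobile", text)
print(ungrouped)  # ['Batman', 'mobile']

# search() returns the leftmost match, whichever alternative produces it.
first = re.search(r"cat|dog", "dog chases cat").group()
print(first)  # dog
```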


#### 9. In regular expressions, what does the `?` character stand for?
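**Ans:** In regular expressions, the `?` character matches zero or one occurrence of the preceding character or group, which makes it optional. (Placed after another quantifier, as in `.*?`, it makes that quantifier non-greedy.)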

In [30]:
import re

pattern = r"colou?r"

string1 = "color"
string2 = "colour"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)

print(match1)  # <re.Match object; span=(0, 5), match='color'>
print(match2)  # <re.Match object; span=(0, 6), match='colour'>


<re.Match object; span=(0, 5), match='color'>
<re.Match object; span=(0, 6), match='colour'>


The regular expression "colou?r" matches both "color" and "colour" because the "u" character preceding the question mark is optional.

In [31]:
import re

pattern = r"https?://(www\.)?google\.com"

string1 = "http://google.com"
string2 = "https://www.google.com"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)

print(match1)  # <re.Match object; span=(0, 17), match='http://google.com'>
print(match2)  # <re.Match object; span=(0, 22), match='https://www.google.com'>


<re.Match object; span=(0, 17), match='http://google.com'>
<re.Match object; span=(0, 22), match='https://www.google.com'>


#### 10. In regular expressions, what is the difference between the + and * characters?
In regular expressions, the + and * characters are used to specify the number of occurrences of the preceding character or group. The main difference between the two characters is that the + matches one or more occurrences of the preceding character or group, while the * matches zero or more occurrences of the preceding character or group.

In [8]:
import re
match_1 = re.search("Bat(wo)*man","Batman returns")
print(match_1)
match_2 = re.search("Bat(wo)+man","Batman returns")
print(match_2)

<re.Match object; span=(0, 6), match='Batman'>
None


In [33]:
import re

pattern = r"go+l"

string1 = "gol"
string2 = "gool"
string3 = "gooool"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

print(match1) 
print(match2)  # <re.Match object; span=(0, 4), match='gool'>
print(match3)  # <re.Match object; span=(0, 6), match='gooool'>

<re.Match object; span=(0, 3), match='gol'>
<re.Match object; span=(0, 4), match='gool'>
<re.Match object; span=(0, 6), match='gooool'>


In [34]:
import re

pattern = r"go*l"

string1 = "gl"
string2 = "gol"
string3 = "gool"
string4 = "gooool"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)
match4 = re.match(pattern, string4)

print(match1)  # <re.Match object; span=(0, 2), match='gl'>
print(match2)  # <re.Match object; span=(0, 3), match='gol'>
print(match3)  # <re.Match object; span=(0, 4), match='gool'>
print(match4)  # <re.Match object; span=(0, 6), match='gooool'>


<re.Match object; span=(0, 2), match='gl'>
<re.Match object; span=(0, 3), match='gol'>
<re.Match object; span=(0, 4), match='gool'>
<re.Match object; span=(0, 6), match='gooool'>


#### 11. What is the difference between {4} and {4,5} in regular expression?
**Ans:** `{4}` means that the preceding character or group must repeat exactly 4 times, whereas `{4,5}` means that it must repeat at least 4 and at most 5 times (inclusive).

In [36]:
import re

pattern = r"go{4}d"

string1 = "gooood"
string2 = "good"
string3 = "goooodd"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

print(match1)  # <re.Match object; span=(0, 6), match='gooood'>
print(match2)  # None
print(match3)


<re.Match object; span=(0, 6), match='gooood'>
None
<re.Match object; span=(0, 6), match='gooood'>


In [37]:
import re

pattern = r"go{4,5}d"

string1 = "gooood"
string2 = "gooooood"
string3 = "good"
string4 = "goooodd"

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)
match4 = re.match(pattern, string4)

print(match1)  # <re.Match object; span=(0, 6), match='gooood'>
print(match2)
print(match3) 
print(match4)  


<re.Match object; span=(0, 6), match='gooood'>
None
None
<re.Match object; span=(0, 6), match='gooood'>
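
The cells above do not include a string with exactly five repetitions, and `re.match()` only anchors at the start of the string. The following sketch fills both gaps; using `fullmatch()` here is a suggestion, not part of the original answer:

```python
import re

pattern = re.compile(r"go{4,5}d")

five = pattern.match("g" + "o" * 5 + "d")  # five o's: within the range
six = pattern.match("g" + "o" * 6 + "d")   # six o's: too many
print(five.group())  # goooood
print(six)           # None

# match() accepts trailing text; fullmatch() requires the whole string to match.
prefix_ok = pattern.match("goooodd")
whole = pattern.fullmatch("goooodd")
print(prefix_ok.group())  # gooood
print(whole)              # None
```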


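#### 12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?
**Ans:** `\d`, `\w` and `\s` are shorthand character classes in Python regular expressions:
1. **`\w`** – Matches a word character (letter, digit, or underscore). With the `re.ASCII` flag this is equivalent to [a-zA-Z0-9_]; by default, for `str` patterns, it also matches Unicode letters and digits.
2. **`\d`** – Matches a digit character. With the `re.ASCII` flag this is equivalent to [0-9]; by default it also matches other Unicode decimal digits.
3. **`\s`** – Matches a whitespace character (space, tab, newline, etc.)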

In regular expressions, shorthand character classes are special codes that represent a group of characters. 

In [39]:
import re

pattern = r"\d+"

string1 = "1234"
string2 = "a1b2c3"
string3 = "   45   "

# Matching using the regular expression
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

print(match1)  # <re.Match object; span=(0, 4), match='1234'>
print(match2) 
print(match3) 


<re.Match object; span=(0, 4), match='1234'>
None
None


In [44]:
import re

# Define a pattern that matches one or more word characters (re.match only tries the start of the string)
pattern = r'\w+'

# Test the pattern against some example strings
strings = ['Hello, world!', '12345', '   \t']

for s in strings:
    match = re.match(pattern, s)
    if match:
        print(f"'{s}' matches the pattern '{pattern}'")
        print(match)
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")
        print(match)


'Hello, world!' matches the pattern '\w+'
<re.Match object; span=(0, 5), match='Hello'>
'12345' matches the pattern '\w+'
<re.Match object; span=(0, 5), match='12345'>
'   	' does not match the pattern '\w+'
None


In [47]:
import re

# Define a pattern that matches one or more whitespace characters (re.match only tries the start of the string)
pattern = r'\s+'

# Test the pattern against some example strings
strings = ['Hello, world!', '12345', '   \t']

for s in strings:
    match = re.match(pattern, s)
    if match:
        print(f"'{s}' matches the pattern '{pattern}'")
        print(match)
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")
        print(match)


'Hello, world!' does not match the pattern '\s+'
None
'12345' does not match the pattern '\s+'
None
'   	' matches the pattern '\s+'
<re.Match object; span=(0, 4), match='   \t'>


#### 13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?
**Ans:** `\D`, `\W` and `\S` are the negated shorthand character classes in Python regular expressions:
1. **`\W`** – Matches any non-word character, that is, anything `\w` does not match (equivalent to [^a-zA-Z0-9_] with the `re.ASCII` flag)
2. **`\D`** – Matches any non-digit character, that is, anything `\d` does not match (equivalent to [^0-9] with the `re.ASCII` flag)
3. **`\S`** – Matches any non-whitespace character
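
A brief sketch of the three negated classes on a made-up sample string:

```python
import re

text = "Room 42, floor_3!"

non_digits = re.findall(r"\D+", text)  # runs of characters that are not digits
non_word = re.findall(r"\W+", text)    # runs of characters that are not letters, digits, or "_"
non_space = re.findall(r"\S+", text)   # runs of characters that are not whitespace

print(non_digits)  # ['Room ', ', floor_', '!']
print(non_word)    # [' ', ', ', '!']
print(non_space)   # ['Room', '42,', 'floor_3!']
```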

#### 14. What is the difference between `.*?` and `.*`?
**Ans:** **`.*`** is greedy: it matches as many characters as possible while still allowing the rest of the pattern to match. **`.*?`** is non-greedy (lazy): it matches as few characters as possible.

In [48]:
import re

# Define a regular expression pattern that matches any text between two tags
pattern = r'<.*>'

# Test the pattern against an example string
text = '<html><head><title>Title</title></head><body>Body</body></html>'
matches = re.findall(pattern, text)

print(matches)


['<html><head><title>Title</title></head><body>Body</body></html>']


In [49]:
# Define a regular expression pattern that matches any text between two tags in a non-greedy way
pattern = r'<.*?>'

# Test the pattern against the same example string
matches = re.findall(pattern, text)

print(matches)


['<html>', '<head>', '<title>', '</title>', '</head>', '<body>', '</body>', '</html>']


#### 15. What is the syntax for matching both numbers and lowercase letters with a character class?
**Ans:** The syntax is either **`[a-z0-9]`** or **`[0-9a-z]`**; the order of the ranges inside the brackets does not matter.
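
A minimal sketch of the character class in use; the sample strings are assumptions made for this example:

```python
import re

text = "User42 logged IN at 9am"

# Runs of lowercase letters and digits; uppercase letters break the runs.
runs = re.findall(r"[a-z0-9]+", text)
print(runs)  # ['ser42', 'logged', 'at', '9am']

# Both orderings of the ranges describe the same set of characters.
same = re.findall(r"[0-9a-z]+", text) == runs
print(same)  # True

# fullmatch() checks that a whole string consists only of such characters.
valid = re.fullmatch(r"[a-z0-9]+", "abc123") is not None
invalid = re.fullmatch(r"[a-z0-9]+", "Abc123") is not None
print(valid, invalid)  # True False
```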

#### 16. What is the procedure for making a regular expression case-insensitive?
**Ans:** We can pass **`re.IGNORECASE`** as a flag to make a regular expression case-insensitive.

To make a regular expression case-insensitive in Python, pass the `re.IGNORECASE` flag (or its short form `re.I`) as the second argument to the `re.compile()` function.

In [50]:
import re

# Define a regular expression pattern
pattern = re.compile(r'hello', re.IGNORECASE)

# Test the pattern against some example strings
text1 = 'Hello, World!'
text2 = 'HELLO, WORLD!'
text3 = 'Hi there!'
matches1 = pattern.findall(text1)
matches2 = pattern.findall(text2)
matches3 = pattern.findall(text3)

print(matches1)  # Output: ['Hello']
print(matches2)  # Output: ['HELLO']
print(matches3)  # Output: []


['Hello']
['HELLO']
[]


#### 17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?
**Ans:** 
In regular expressions, the . (dot) character matches any character except for a newline character (\n).

However, if you pass the re.DOTALL flag as the second argument to the re.compile() function, then the dot will match any character, including a newline character.

In [53]:
import re

# Define a regular expression pattern that matches any character between two tags
pattern = re.compile(r'<.*>')

# Test the pattern against an example string
text = '<html>\n<head>\n<title>Title</title>\n</head>\n<body>\nBody\n</body>\n</html>'
matches = pattern.findall(text)

print(matches)


['<html>', '<head>', '<title>Title</title>', '</head>', '<body>', '</body>', '</html>']


In [51]:
import re

# Define a regular expression pattern that matches any character between two tags
pattern = re.compile(r'<.*>', re.DOTALL)

# Test the pattern against an example string
text = '<html>\n<head>\n<title>Title</title>\n</head>\n<body>\nBody\n</body>\n</html>'
matches = pattern.findall(text)

print(matches)


['<html>\n<head>\n<title>Title</title>\n</head>\n<body>\nBody\n</body>\n</html>']


#### 18. If numReg = re.compile(r'\d+'), what will numReg.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') return?
**Ans:** The output will be **`'X drummers, X pipers, five rings, X hen'`**

In [54]:
import re
numReg = re.compile(r'\d+')
numReg.sub('X', '11 drummers, 10 pipers, five rings, 4 hen')

'X drummers, X pipers, five rings, X hen'

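The sub() method takes the replacement string 'X' as its first argument and the input string '11 drummers, 10 pipers, five rings, 4 hen' as its second. It returns a copy of the input in which every match of the numReg pattern is replaced with 'X'; 'five' is left alone because it contains no digits.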
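
A few variations on `sub()` may help show how the replacement is controlled. This is a sketch; the name `num_regex` and the doubling function are illustrative choices:

```python
import re

num_regex = re.compile(r'\d+')
line = '11 drummers, 10 pipers, five rings, 4 hen'

# count limits how many matches are replaced.
first_only = num_regex.sub('X', line, count=1)
print(first_only)  # X drummers, 10 pipers, five rings, 4 hen

# subn() also reports how many replacements were made.
replaced, n = num_regex.subn('X', line)
print(replaced, n)  # X drummers, X pipers, five rings, X hen 3

# The replacement can be a function that receives each match object.
doubled = num_regex.sub(lambda m: str(int(m.group()) * 2), line)
print(doubled)  # 22 drummers, 20 pipers, five rings, 8 hen
```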

#### 19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?
**Ans:** **`re.VERBOSE`** allows you to add whitespace and comments to the pattern string passed to **`re.compile()`**.

The re.VERBOSE flag enables verbose mode, which allows you to break up the regular expression into multiple lines and add comments using the # character. This can be useful for creating complex regular expressions that are easier to read and understand.

In [55]:
import re

phoneRegex = re.compile(r'''
    (\d{3}-|\(\d{3}\)\s?)?   # optional area code
    \d{3}                  # first three digits
    -                      # separator
    \d{4}                  # last four digits
    \b                     # word boundary
''', re.VERBOSE)

# example usage
text = "Call me at 555-1234"
match = phoneRegex.search(text)
print(match.group())

555-1234
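
Two details of verbose mode are sketched below under illustrative names: flags can be combined with `|`, and because whitespace in the pattern is ignored, a literal space has to be written inside a character class (or escaped).

```python
import re

# Verbose and compact forms of the same simple pattern.
phone_regex = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)
compact = re.compile(r'\d{3}-\d{4}')

sample = "Call me at 555-1234"
verbose_hit = phone_regex.search(sample).group()
compact_hit = compact.search(sample).group()
print(verbose_hit, compact_hit)  # 555-1234 555-1234

# Flags combine with the | operator.
spaced = re.compile(r'''
    hello [ ] world   # [ ] matches a single literal space
''', re.VERBOSE | re.IGNORECASE)
greeting = spaced.search("Say Hello World").group()
print(greeting)  # Hello World
```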


#### 20. How would you write a regex that matches a number with a comma for every three digits? It must match the following:
`'42'`, `'1,234'`, `'6,368,745'` but not the following: `'12,34,567'` (which has only two digits between the commas) or `'1234'` (which lacks commas)

**Ans:** **`pattern = r'^\d{1,3}(,\d{3})*$'`**

In [12]:
import re
pattern = r'^\d{1,3}(,\d{3})*$'
pagex = re.compile(pattern)
for ele in ['42','1,234', '6,368,745','12,34,567','1234']:
    print('Output:',ele, '->', pagex.search(ele))

Output: 42 -> <re.Match object; span=(0, 2), match='42'>
Output: 1,234 -> <re.Match object; span=(0, 5), match='1,234'>
Output: 6,368,745 -> <re.Match object; span=(0, 9), match='6,368,745'>
Output: 12,34,567 -> None
Output: 1234 -> None


#### 21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:
`'Haruto Watanabe'`  
`'Alice Watanabe'`  
`'RoboCop Watanabe'`  

but not the following:

`'haruto Watanabe'` (where the first name is not capitalized)  
`'Mr. Watanabe'` (where the preceding word has a nonletter character)  
`'Watanabe'` (which has no first name)  
`'Haruto watanabe'` (where Watanabe is not capitalized)  

**Ans:** **`pattern = r'[A-Z][a-zA-Z]*\sWatanabe'`**

The letters after the initial capital may be upper- or lowercase, so that the whole of `'RoboCop'` is matched; with `[a-z]*` the match would be only `'Cop Watanabe'`.

In [13]:
import re
pattern = r'[A-Z][a-zA-Z]*\sWatanabe'
namex = re.compile(pattern)
for name in ['Haruto Watanabe','Alice Watanabe','RoboCop Watanabe','haruto Watanabe','Mr. Watanabe','Watanabe','Haruto watanabe']:
    print('Output: ',name,'->',namex.search(name))

Output:  Haruto Watanabe -> <re.Match object; span=(0, 15), match='Haruto Watanabe'>
Output:  Alice Watanabe -> <re.Match object; span=(0, 14), match='Alice Watanabe'>
Output:  RoboCop Watanabe -> <re.Match object; span=(0, 16), match='RoboCop Watanabe'>
Output:  haruto Watanabe -> None
Output:  Mr. Watanabe -> None
Output:  Watanabe -> None
Output:  Haruto watanabe -> None


#### 22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:
`'Alice eats apples.'`  
`'Bob pets cats.'`  
`'Carol throws baseballs.'`  
`'Alice throws Apples.'`  
`'BOB EATS CATS.'`  

but not the following:  

`'RoboCop eats apples.'`  
`'ALICE THROWS FOOTBALLS.'`   
`'Carol eats 7 cats.'`  

**Ans:** pattern = **`r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.'`**

In [14]:
import re
pattern = r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.'
casex = re.compile(pattern,re.IGNORECASE)
for ele in ['Alice eats apples.','Bob pets cats.','Carol throws baseballs.','Alice throws Apples.','BOB EATS CATS.','RoboCop eats apples.'
,'ALICE THROWS FOOTBALLS.','Carol eats 7 cats.']:
    print('Output: ',ele,'->',casex.search(ele))

Output:  Alice eats apples. -> <re.Match object; span=(0, 18), match='Alice eats apples.'>
Output:  Bob pets cats. -> <re.Match object; span=(0, 14), match='Bob pets cats.'>
Output:  Carol throws baseballs. -> <re.Match object; span=(0, 23), match='Carol throws baseballs.'>
Output:  Alice throws Apples. -> <re.Match object; span=(0, 20), match='Alice throws Apples.'>
Output:  BOB EATS CATS. -> <re.Match object; span=(0, 14), match='BOB EATS CATS.'>
Output:  RoboCop eats apples. -> None
Output:  ALICE THROWS FOOTBALLS. -> None
Output:  Carol eats 7 cats. -> None
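
Because `search()` looks anywhere in the string, the pattern also finds a qualifying sentence embedded in longer text. If the whole string must be the sentence, `fullmatch()` is one option. A sketch, where the `embedded` string is a made-up input:

```python
import re

pattern = r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.'
casex = re.compile(pattern, re.IGNORECASE)

embedded = "Today Bob pets cats. Then he left."

# search() finds the sentence inside the longer text.
found_inside = casex.search(embedded).group()
print(found_inside)  # Bob pets cats.

# fullmatch() succeeds only if the entire string is the sentence.
whole_embedded = casex.fullmatch(embedded)
whole_exact = casex.fullmatch('BOB EATS CATS.')
print(whole_embedded)       # None
print(whole_exact.group())  # BOB EATS CATS.
```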
