# 1. What is the name of the feature responsible for generating Regex objects?

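In Python, Regex objects (compiled regular expression objects) are created by the re.compile() function of the re module. You pass it a pattern string, and it returns a pattern object whose methods, such as search(), match(), and findall(), can be reused without recompiling the pattern. The re module as a whole provides the support for creating, compiling, and using regular expressions to search, match, and manipulate strings.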

To use regular expressions in Python, you need to import the re module and then use its functions to work with regex patterns and strings. Some commonly used functions from the re module include re.compile(), re.search(), re.match(), re.findall(), and others.

In [24]:
#example
import re

# Regular expression pattern to match one or more digits
pattern = r'\d+'

# Sample string
text = "There are 90 apples and 476 oranges."

# Using re.findall() to find all occurrences of the pattern
matches = re.findall(pattern, text)

print(matches) 


['90', '476']
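
The example above calls re.findall() directly on the pattern string. As a minimal sketch of the compile step itself (the variable names are illustrative only), re.compile() returns a reusable pattern object that offers the same operations as methods:

```python
import re

# re.compile() turns a pattern string into a reusable pattern object.
digit_regex = re.compile(r'\d+')

# The compiled object exposes findall(), search(), match(), and so on as methods.
found = digit_regex.findall("There are 90 apples and 476 oranges.")

print(isinstance(digit_regex, re.Pattern))  # True
print(found)                                # ['90', '476']
```

Compiling once is mainly a convenience when the same pattern is applied to many strings.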


# 2. Why do raw strings often appear in Regex objects?


Avoiding Double Escaping: Raw strings remove the need to double-escape backslashes. Regular expressions frequently use backslashes as escape characters for special characters and metacharacters. Without a raw string, you would have to write every backslash as \\ so that it survives Python's string-literal processing and reaches the regex engine as a single backslash.

Improving Readability: Regular expressions can become complex and include many special characters and metacharacters. Using raw strings makes the regex patterns more readable by eliminating unnecessary backslashes.

In [12]:
#example
import re
 
# Regular string with escape sequences (double escaping)
regular_string_pattern = "\\bapple\\b"
# Raw string with no escape sequences
raw_string_pattern = r"\bapple\b"


# Sample text
text = "An apple a day keeps the doctor away. I like apples."

# Using re.findall() with a regular string pattern
matches_regular = re.findall(regular_string_pattern, text)
print("Matches with regular string:", matches_regular) 
# Using re.findall() with a raw string pattern
matches_raw = re.findall(raw_string_pattern, text)
print("Matches with raw string:", matches_raw)  


Matches with regular string: ['apple']
Matches with raw string: ['apple']


In the regular string "\\bapple\\b", each backslash must be doubled so that the regex engine receives \b (a word boundary) rather than Python's backspace character. With the raw string r"\bapple\b", no doubling is needed. Both patterns are identical once Python has processed the literals, so both match only the whole word "apple" and not part of a longer word such as "apples".
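
To see why the doubling matters, here is a small sketch of what happens when the backslashes are neither doubled nor placed in a raw string. The variable names are illustrative only.

```python
import re

text = "An apple a day keeps the doctor away. I like apples."

# In a normal string literal, "\b" is the single backspace character.
plain = "\bapple\b"
# In a raw string, r"\b" stays as two characters: a backslash and the letter b.
raw = r"\bapple\b"

plain_length = len(plain)  # 7: backspace + "apple" + backspace
raw_length = len(raw)      # 9: backslash, b, "apple", backslash, b

plain_matches = re.findall(plain, text)  # searches for literal backspace characters
raw_matches = re.findall(raw, text)      # searches for word boundaries

print(plain_length, raw_length)    # 7 9
print(plain_matches, raw_matches)  # [] ['apple']
```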

# 3. What is the return value of the search() method?

The search() method of the re module in Python returns a match object if a match is found, or None if no match is found. The match object contains information about the match, such as the matched substring, its starting and ending positions in the input string, and more.

In [7]:
import re

text = "Hello, World!"
pattern = r"new"

match = re.search(pattern, text)
print(match)


None


In [8]:
#example
import re

text = "Hello, World!"
pattern = r"Hello"

match = re.search(pattern, text)

if match:
    print("Pattern found")
else:
    print("Pattern not found")

Pattern found
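
As a brief sketch of the information a match object carries, the following reads the matched substring and its position. The sample text and variable names are assumptions chosen for illustration.

```python
import re

match = re.search(r"World", "Hello, World!")

if match is not None:
    matched_text = match.group()  # the substring that matched
    start_index = match.start()   # index where the match begins
    end_index = match.end()       # index just past the end of the match
    print(matched_text, start_index, end_index)  # World 7 12
    print(match.span())                          # (7, 12)
```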


# 4. From a Match item, how do you get the actual strings that match the pattern?

To get the actual string that matched the pattern from a re.Match object in Python, call its .group() method without any arguments.

The example below uses re.match(), which only looks for a match at the beginning of the string. Match objects returned by re.search() work the same way.

In [18]:
import re

pattern = r'\d+'  # This is a pattern that matches one or more digits
text = '123 apples and 456 oranges.'

match_obj = re.match(pattern, text)  # Find a match at the beginning of the text

if match_obj:
    matched_string = match_obj.group()  # Get the matched string
    print(matched_string)  
else:
    print("No match found")


123


If you use re.findall() instead of re.match() or re.search(), you get a list of all matched strings directly, with no match object involved:

In [26]:
import re

pattern = r'\d+'  # This is a pattern that matches one or more digits
text = 'There are 123 apples and 456 oranges.'

matches = re.findall(pattern, text)  # Find all occurrences of the pattern in the text

print(matches) 

['123', '456']


# 5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

In the regular expression r'(\d\d\d)-(\d\d\d-\d\d\d\d)', the groups are defined by the parentheses. 

Group 0 (or match.group(0)): This represents the entire matched portion of the text that satisfies the entire regular expression. In this case, it would be the full phone number in the format 123-456-7890. This group always covers the entire matched string.

Group 1 (or match.group(1)): This captures the first set of three digits in the phone number (the area code). In this example, it would capture 123.

Group 2 (or match.group(2)): This captures the second set of three digits, followed by a hyphen, and then the last four digits of the phone number. In this example, it would capture 456-7890.

In [27]:
import re

pattern = r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
text = 'Phone numbers: 123-456-7890 and 987-654-3210'

match = re.search(pattern, text)
if match:
    group_0 = match.group(0)
    group_1 = match.group(1)
    group_2 = match.group(2)
    
    print("Group 0:", group_0)  
    print("Group 1:", group_1)  
    print("Group 2:", group_2)  
else:
    print("No match found")


Group 0: 123-456-7890
Group 1: 123
Group 2: 456-7890
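
As a small supplementary sketch, the groups() method returns all numbered groups at once, which can be handy for unpacking. The names area_code and main_number are illustrative only.

```python
import re

pattern = r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
match = re.search(pattern, 'Phone numbers: 123-456-7890 and 987-654-3210')

# groups() returns groups 1, 2, ... as a tuple; group 0 is not included.
area_code, main_number = match.groups()

print(area_code)    # 123
print(main_number)  # 456-7890
```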


# 6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

In regular expressions (regex), parentheses and periods have special meanings. Parentheses are used for grouping and capturing, while periods (also known as dots) are used to match any character. If you want to match literal parentheses and periods in your regex pattern, you need to escape them using a backslash (\) to indicate that you're looking for those specific characters.

Here's how you would use backslashes to match literal parentheses and periods in a regex:



Escaping Parentheses: To match literal parentheses, use \( to match an opening parenthesis and \) to match a closing parenthesis.

In [None]:
 \(Hello\)


In [1]:
import re

text = "This is (a test) string with (parentheses)."
pattern = r'\(a test\)'
matches = re.findall(pattern, text)
print(matches)


['(a test)']


Escaping Periods: To match a literal period (dot), use \. in the pattern.

In [None]:
www\.example\.com


In [2]:
import re

text = "This is a test with some dots like this..."
pattern = r'\.\.\.'
matches = re.findall(pattern, text)
print(matches)


['...']


# 7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

The findall() method in Python's re module returns a list of strings when the regular expression pattern contains no capturing groups or exactly one capturing group, and it returns a list of tuples when the pattern contains two or more capturing groups.

1)List of Strings (Zero or One Capturing Group): When the pattern does not contain any capturing groups (i.e., no parentheses ( and ) to group parts of the pattern), findall() returns a list of strings in which each element is a complete match found in the input string. When the pattern contains exactly one capturing group, findall() still returns a list of strings, but each element is the text captured by that group rather than the whole match.

In [3]:
import re

text = "The price of apples is $1.25, and the price of bananas is $0.75."
pattern = r'\$\d+\.\d+'
matches = re.findall(pattern, text)
print(matches)


['$1.25', '$0.75']


2)List of Tuples (Two or More Capturing Groups): When the pattern contains two or more capturing groups (i.e., two or more sets of parentheses ( and ) to group parts of the pattern), findall() returns a list of tuples. Each tuple corresponds to a match, and within each tuple, you will find the substrings that correspond to the capturing groups in the pattern.

In [5]:
import re

text = "John Roy (Age: 30), Kris Smith (Age: 25)"
pattern = r'(\w+) \w+ \(Age: (\d+)\)'
matches = re.findall(pattern, text)
print(matches)


[('John', '30'), ('Kris', '25')]
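
A quick sketch comparing zero, one, and two capturing groups on the same text. Note that with exactly one group, findall() returns a plain list of strings holding that group's text. The variable names are illustrative only.

```python
import re

text = "John Roy (Age: 30), Kris Smith (Age: 25)"

# No groups: each element is the whole match.
no_groups = re.findall(r'Age: \d+', text)
# One group: each element is the text captured by that group.
one_group = re.findall(r'Age: (\d+)', text)
# Two groups: each element is a tuple with one entry per group.
two_groups = re.findall(r'(\w+) \w+ \(Age: (\d+)\)', text)

print(no_groups)   # ['Age: 30', 'Age: 25']
print(one_group)   # ['30', '25']
print(two_groups)  # [('John', '30'), ('Kris', '25')]
```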


# 8. In regular expressions, what does the | character mean?

In standard regular expressions, the | character is used to denote an alternation, which means it functions as an OR operator. It allows you to specify multiple alternatives in your regular expression pattern, and it will match any of the alternatives.

Pattern Alternation: When you use | in a regular expression pattern, it separates different subpatterns, and the regex engine will try to match any of these subpatterns. If any of the subpatterns matches, the whole pattern is considered a match.

In [6]:
import re

text = "The color of the car is either red or blue."
pattern = r'red|blue'
matches = re.findall(pattern, text)
print(matches)


['red', 'blue']


Grouping with Parentheses: You can use parentheses ( and ) to group alternatives when you have more complex patterns. This allows you to specify the scope of the alternation.

In [7]:
import re

text = "The cat and the hat are on the mat."
pattern = r'(cat|hat)'
matches = re.findall(pattern, text)
print(matches)


['cat', 'hat']
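
To illustrate how grouping limits the scope of the alternation, here is a small sketch using a made-up sample string. It uses a non-capturing group, (?:...), so that findall() returns whole matches.

```python
import re

text = "gray grey gra ey"

# Without grouping, | splits the entire pattern: "gra" OR "ey".
ungrouped = re.findall(r'gra|ey', text)
# With a group, the alternation applies only to the single letter a or e.
grouped = re.findall(r'gr(?:a|e)y', text)

print(ungrouped)  # ['gra', 'ey', 'gra', 'ey']
print(grouped)    # ['gray', 'grey']
```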


# 9. In regular expressions, what does the . character stand for?

In regular expressions, the character . (period) is a metacharacter that represents any character except a newline. It acts as a wildcard, matching any single character in the input string, with the exception of a newline character. Here's how the . character works in regular expressions:

Match Any Single Character (Except Newline): The . character, when used in a regular expression pattern, matches any single character (letter, digit, symbol, whitespace, etc.) in the input text, except for a newline character.

In [8]:
import re

text = "cat, hat, bat"
pattern = r'.at'  # Matches any single character (except a newline) followed by "at"
matches = re.findall(pattern, text)
print(matches)


['cat', 'hat', 'bat']


Excludes Newlines by Default: The . character does not match newline characters by default. If you want it to match newline characters as well, you can use the re.DOTALL flag when compiling your regular expression. This flag makes the . match any character, including newline characters.

In [9]:
import re

text = "Line 1\nLine 2\nLine 3"
pattern = r'.*'  # This pattern matches any sequence of characters.
matches = re.findall(pattern, text, re.DOTALL)
print(matches)


['Line 1\nLine 2\nLine 3', '']


# 10. In regular expressions, what is the difference between the + and * characters?

In regular expressions, the "+" and "*" characters are quantifiers that specify how many times the preceding element should appear in the text being matched, but they have different meanings:

"+" (Plus Quantifier):

"+" stands for "one or more occurrences" of the preceding element.
It requires that the preceding element must appear at least once, but it can appear more than once.
For example, the regular expression abc+ would match "abc", "abcc", "abccc", and so on, but it would not match "ab", because at least one "c" is required.

"*" (Asterisk Quantifier):

"*" stands for "zero or more occurrences" of the preceding element.
It allows the preceding element to appear zero times or more times, which means it can be completely absent or occur multiple times.
For example, the regular expression abc* would match "ab", "abc", "abcc", "abccc", and so on, including cases where "c" is absent.
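
The difference can be checked with a short sketch. It uses re.fullmatch() so that each sample string must match in its entirety; the sample list is an assumption for illustration.

```python
import re

samples = ["ab", "abc", "abccc"]

# + requires at least one "c"; * also accepts zero.
plus_results = [bool(re.fullmatch(r'abc+', s)) for s in samples]
star_results = [bool(re.fullmatch(r'abc*', s)) for s in samples]

print(plus_results)  # [False, True, True]
print(star_results)  # [True, True, True]
```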

# 11. What is the difference between {4} and {4,5} in regular expression?

In regular expressions, both "{4}" and "{4,5}" are used as quantifiers to specify the number of occurrences of the preceding element in a pattern, but they have slightly different meanings:

"{4}":

"{4}" specifies exactly four occurrences of the preceding element.
It requires that the preceding element appears precisely four times consecutively in the text being matched.
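For example, the regular expression a{4} matches "aaaa" in full and cannot match "aaa". Against "aaaaa" it matches only the first four characters, so a whole-string match (for example with re.fullmatch()) fails.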

"{4,5}":

"{4,5}" specifies a range of occurrences, allowing between four and five occurrences of the preceding element.
It means that the preceding element must appear at least four times and can appear up to five times in the text being matched.
For example, the regular expression a{4,5} matches both "aaaa" and "aaaaa" in full and cannot match "aaa". Against "aaaaaa" it matches only the first five characters, so a whole-string match fails.
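
A minimal sketch of both quantifiers, assuming whole-string matching with re.fullmatch(). The last line shows that re.search() can still find four characters inside a longer run.

```python
import re

samples = ["aaa", "aaaa", "aaaaa", "aaaaaa"]

# Keep only the samples that match each pattern from start to end.
exactly_four = [s for s in samples if re.fullmatch(r'a{4}', s)]
four_or_five = [s for s in samples if re.fullmatch(r'a{4,5}', s)]

# search() does not require the whole string to match.
inside_longer = re.search(r'a{4}', "aaaaaa").group()

print(exactly_four)   # ['aaaa']
print(four_or_five)   # ['aaaa', 'aaaaa']
print(inside_longer)  # aaaa
```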

"{4}" is a specific quantifier indicating an exact number of occurrences, while "{4,5}" is a range quantifier indicating a minimum and maximum number of occurrences for the preceding element.

# 12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

In regular expressions, shorthand character classes like \d, \w, and \s are used to represent commonly used character sets. These shorthand notations are helpful for simplifying regular expressions and making them more concise. Here's what each of these shorthand character classes signifies:

\d:

\d represents the digit character class.
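It matches any single decimal digit. For ASCII text that means 0 to 9; in Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set.
For example, \d would match any of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.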

\w:

\w represents the word character class.
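It matches any word character, which includes uppercase and lowercase letters, digits, and underscores.
Essentially, \w matches alphanumeric characters and underscores. In Python 3 string patterns this includes Unicode letters and digits unless the re.ASCII flag is set.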
For example, \w would match characters like "a," "Z," "5," and "_."

\s:

\s represents the whitespace character class.
It matches any whitespace character, such as spaces, tabs, and newline characters.
It's useful for identifying and working with whitespace in text.
For example, \s would match spaces and tabs in a string.

Here are some examples of how you might use these shorthand character classes in regular expressions:

\d{3} would match any three consecutive digits in a string.
\w+ would match one or more word characters in a row.
\s* would match zero or more whitespace characters.
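
The three classes can be tried out on a short sample string. The text below is made up for illustration.

```python
import re

text = "Order 66: 3 items_total\tdone"

digits = re.findall(r'\d+', text)  # runs of digits
words = re.findall(r'\w+', text)   # runs of letters, digits, and underscores
spaces = re.findall(r'\s', text)   # individual whitespace characters

print(digits)  # ['66', '3']
print(words)   # ['Order', '66', '3', 'items_total', 'done']
print(spaces)  # [' ', ' ', ' ', '\t']
```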

# 13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

In regular expressions, the shorthand character classes \D, \W, and \S are the negations or opposites of the \d, \w, and \s shorthand character classes. They match characters that are not in the respective character classes they represent. Here's what each of these shorthand character classes signifies:

\D:

\D represents the non-digit character class.
It matches any character that is not a digit (0-9).
For example, \D would match any character that is not a digit, such as letters, symbols, or whitespace.

\W:

\W represents the non-word character class.
It matches any character that is not a word character, which includes characters other than letters, digits, and underscores.
For example, \W would match symbols, punctuation marks, and whitespace.

\S:

\S represents the non-whitespace character class.
It matches any character that is not whitespace, including letters, digits, symbols, and punctuation marks.
For example, \S would match any character that is not a space, tab, or newline.

Here are some examples of how you might use these negated shorthand character classes in regular expressions:

\D+ would match one or more consecutive characters that are not digits.
\W* would match zero or more consecutive characters that are not word characters.
\S{2,} would match two or more consecutive characters that are not whitespace.
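
A short sketch of the negated classes on a made-up sample string:

```python
import re

text = "Room 42, floor 3!"

non_digits = re.findall(r'\D+', text)  # runs of anything except digits
non_words = re.findall(r'\W+', text)   # runs of punctuation and whitespace
non_spaces = re.findall(r'\S+', text)  # runs of anything except whitespace

print(non_digits)  # ['Room ', ', floor ', '!']
print(non_words)   # [' ', ', ', ' ', '!']
print(non_spaces)  # ['Room', '42,', 'floor', '3!']
```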

# 14. What is the difference between .* and .*?

The difference between .* and .*? is whether the match is greedy or lazy.

.* (Greedy Match):
.* is a greedy match, which means it will try to match as much as possible while still allowing the rest of the pattern to match successfully.
It matches any character (except for newline characters) zero or more times.
In practical terms, it will match the longest possible sequence of characters that allows the overall pattern to match.

.*? (Lazy or Non-Greedy Match):

.*? is a lazy or non-greedy match, which means it will match as little as possible while still allowing the rest of the pattern to match successfully.
It also matches any character (except for newline characters) zero or more times, but it does so in the shortest possible sequence.
In practical terms, it will match the shortest possible sequence of characters that allows the overall pattern to match.
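
The contrast is easiest to see on text with more than one possible end point. This sketch uses a made-up HTML-like string:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: runs to the last ">" in the string.
greedy = re.search(r'<.*>', text).group()
# Lazy: stops at the first ">" it can.
lazy = re.search(r'<.*?>', text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```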

# 15. What is the syntax for matching both numbers and lowercase letters with a character class?

In regular expressions, you can use a character class to match both numbers and lowercase letters. To create a character class that matches numbers (0-9) and lowercase letters (a-z), enclose both ranges in a single pair of square brackets:

In [None]:
[0-9a-z]


In this character class:

[0-9] matches any digit from 0 to 9.                                                                         
[a-z] matches any lowercase letter from a to z.                                                      
So, [0-9a-z] will match any character that is either a digit or a lowercase letter.
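
A minimal sketch of the class in use, with a + quantifier added so that it collects runs of matching characters. The sample text is an assumption.

```python
import re

text = "abc123 XYZ def_456"

# Uppercase letters, spaces, and the underscore are outside the class.
runs = re.findall(r'[0-9a-z]+', text)

print(runs)  # ['abc123', 'def', '456']
```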

# 16. What is the procedure for making a regular expression case-insensitive?


In Python, you can make a regular expression case-insensitive by using the re.IGNORECASE or re.I flag when compiling your regular expression pattern. 

1)Import the re module: You need to import Python's re module, which provides regular expression functionality.

2)Compile the regex pattern with the re.IGNORECASE flag: When you compile your regular expression pattern using re.compile(), include the re.IGNORECASE or re.I flag as the second argument. This flag makes the pattern match in a case-insensitive manner.

3)Use the compiled pattern to perform matches or searches: You can then use the compiled pattern object to search for or match text in a case-insensitive manner.

In [11]:

#example
import re

text = "Hello World"
pattern = re.compile(r'hello', re.IGNORECASE)  # Case-insensitive pattern

result = pattern.search(text)

if result:
    print("Pattern found:", result.group())
else:
    print("Pattern not found.")


Pattern found: Hello


# 17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

In Python's regular expressions, the "." (dot) character normally matches any character except for a newline character ("\n"). However, if you pass re.DOTALL (or re.S) as the second argument when compiling the regular expression pattern using re.compile(), it changes the behavior of the dot to match any character, including newline characters.

Without re.DOTALL:
By default, the dot (".") matches any character except for a newline character ("\n").
So, it will match letters, digits, symbols, whitespace, etc., but not newline characters.


In [12]:
import re

text = "Hello\nWorld"
pattern = re.compile(r'Hello.World')  # Dot matches any character except newline

result = pattern.search(text)

if result:
    print("Pattern found:", result.group())
else:
    print("Pattern not found.")  # The dot cannot match the "\n" between the words


Pattern not found.


With re.DOTALL:
When you pass re.DOTALL (or re.S) as the second argument to re.compile(), it changes the behavior of the dot to match any character, including newline characters.
This is useful when you want the dot to match across multiple lines, making it more permissive in matching text.


In [13]:
import re

text = "Hello\nWorld"
pattern = re.compile(r'Hello.World', re.DOTALL)  # Dot matches any character, including newline

result = pattern.search(text)

if result:
    print("Pattern found:", repr(result.group()))  # repr() shows the newline as \n
else:
    print("Pattern not found.")


Pattern found: 'Hello\nWorld'


# 18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X','11 drummers, 10 pipers, five rings, 4 hen') return?

In Python, calling numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') with the given regular expression r'\d+' replaces every sequence of one or more digits with the letter 'X' and returns the new string. The word "five" is left alone because it contains no digits.

In [14]:
import re

numRegex = re.compile(r'\d+')
result = numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen')

print(result)


X drummers, X pipers, five rings, X hen


# 19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

Passing re.VERBOSE as the second argument to re.compile() in Python allows you to write regular expressions with more readability and whitespace, making them easier to understand. It also allows you to add comments within the regular expression pattern without affecting its functionality. Here is what it allows you to do:

1)Ignore Whitespace: When you use re.VERBOSE, whitespace (spaces and line breaks) within the regular expression pattern is ignored, except when it appears inside a character class or is escaped with a backslash. This means you can format the regular expression more clearly by adding spaces and line breaks to separate different parts of the pattern.

2)Add Comments: You can add comments to the regular expression using the "#" symbol. These comments help you and others understand the purpose of different parts of the pattern. Comments are ignored by the regex engine.

In [15]:
import re

# Without re.VERBOSE
pattern1 = r'\d{3}-\d{2}-\d{4}'

# With re.VERBOSE
pattern2 = r'''
    \d{3}    # Match three digits
    -        # Match a hyphen
    \d{2}    # Match two digits
    -        # Match another hyphen
    \d{4}    # Match four digits
'''

text = '123-45-6789'

result1 = re.search(pattern1, text)
result2 = re.search(pattern2, text, re.VERBOSE)

print(result1.group())  
print(result2.group())  


123-45-6789
123-45-6789


# 20. How would you write a regex that match a number with comma for every three digits? It must match the given following:

In [None]:

'42'
'1,234'
'6,368,745'

but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)

You can write a regular expression to match numbers with commas for every three digits as follows:

In [None]:
^(?!.*,\d{1,2},)\d{1,3}(,\d{3})*$


^ and $ indicate the start and end of the string, respectively, ensuring that the entire string matches the pattern.
(?!.*,\d{1,2},) is a negative lookahead assertion that rejects the string if only one or two digits appear between two commas. It is a redundant safeguard here, because (,\d{3})* already requires exactly three digits after every comma.
\d{1,3} matches between 1 and 3 digits at the beginning of the string.
(,\d{3})* matches zero or more groups of a comma followed by exactly three digits.

Here's how you can use this regular expression in Python:

In [16]:
import re

pattern = r'^(?!.*,\d{1,2},)\d{1,3}(,\d{3})*$'

strings = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for s in strings:
    if re.match(pattern, s):
        print(f"Matched: {s}")
    else:
        print(f"Not Matched: {s}")


Matched: 42
Matched: 1,234
Matched: 6,368,745
Not Matched: 12,34,567
Not Matched: 1234


# 21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:

In [None]:

'Haruto Watanabe'
'Alice Watanabe'
'RoboCop Watanabe'
but not the following:
'haruto Watanabe' (where the first name is not capitalized)
'Mr. Watanabe' (where the preceding word has a nonletter character)
'Watanabe'(which has no first name)
'Haruto watanabe' (where Watanabe is not capitalized)

To write a regular expression that matches the full name of someone whose last name is "Watanabe" with the assumption that the first name always begins with a capital letter and is a single word, you can use the following pattern:

In [None]:
^[A-Z][a-zA-Z]*\sWatanabe$


^ and $ indicate the start and end of the string, respectively, ensuring that the entire string matches the pattern.
[A-Z] matches an uppercase letter (the first letter of the first name).
[a-zA-Z]* matches zero or more lowercase and uppercase letters for the rest of the first name.
\s matches a single whitespace character (here, the space between the two names).
Watanabe matches the last name "Watanabe."

In [17]:
import re

pattern = r'^[A-Z][a-zA-Z]*\sWatanabe$'

names = ['Haruto Watanabe', 'Alice Watanabe', 'RoboCop Watanabe', 'haruto Watanabe', 'Mr. Watanabe', 'Watanabe', 'Haruto watanabe']

for name in names:
    if re.match(pattern, name):
        print(f"Matched: {name}")
    else:
        print(f"Not Matched: {name}")


Matched: Haruto Watanabe
Matched: Alice Watanabe
Matched: RoboCop Watanabe
Not Matched: haruto Watanabe
Not Matched: Mr. Watanabe
Not Matched: Watanabe
Not Matched: Haruto watanabe


# 22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:


In [None]:
'Alice eats apples.'
'Bob pets cats.'
'Carol throws baseballs.'
'Alice throws Apples.'
'BOB EATS CATS.'
but not the following:
'RoboCop eats apples.'
'ALICE THROWS FOOTBALLS.'
'Carol eats 7 cats.'

We can write a case-insensitive regular expression that matches sentences meeting the specified criteria as follows:

In [None]:
(?i)^(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.$


(?i) is an inline case-insensitive flag. It must appear at the very start of the pattern; Python 3.11 and later raise an error if a global flag appears anywhere else.
^ and $ indicate the start and end of the string, respectively, ensuring that the entire string matches the pattern.
(Alice|Bob|Carol) matches one of the specified first names (Alice, Bob, or Carol).
\s+ matches one or more whitespace characters (spaces, tabs, or newlines).
(eats|pets|throws) matches one of the specified action words (eats, pets, or throws).
\s+ again matches one or more whitespace characters.
(apples|cats|baseballs) matches one of the specified objects (apples, cats, or baseballs).
\. matches a period at the end of the sentence.

In [18]:
import re

pattern = r'(?i)^(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.$'

sentences = [
    'Alice eats apples.',
    'Bob pets cats.',
    'Carol throws baseballs.',
    'Alice throws Apples.',
    'BOB EATS CATS.',
    'RoboCop eats apples.',
    'ALICE THROWS FOOTBALLS.',
    'Carol eats 7 cats.',
]

for sentence in sentences:
    if re.match(pattern, sentence):
        print(f"Matched: {sentence}")
    else:
        print(f"Not Matched: {sentence}")


Matched: Alice eats apples.
Matched: Bob pets cats.
Matched: Carol throws baseballs.
Matched: Alice throws Apples.
Matched: BOB EATS CATS.
Not Matched: RoboCop eats apples.
Not Matched: ALICE THROWS FOOTBALLS.
Not Matched: Carol eats 7 cats.


