
## 1. What is the name of the feature responsible for generating Regex objects?

In Python, Regex objects are created by the re.compile() function, which is part of the built-in 're' module. The module provides several functions for working with regular expressions; re.compile() takes a pattern string and returns a regex (Pattern) object.

To create a regex object, we pass the pattern to re.compile(). Here's an example:

In [None]:
import re

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

In this example, re.compile() creates a regex object that represents the pattern \d{3}-\d{3}-\d{4}. This pattern matches a phone number in the format of three digits, followed by a hyphen, followed by three digits, and finally another hyphen and four digits.

Once we have a regex object, we can use its various methods, such as search(), match(), findall(), etc., to perform operations on strings and find matches based on the specified pattern.
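As a rough sketch of the methods mentioned above, a compiled regex object can be reused for several kinds of lookups. The sample text and the variable names are made up for illustration.

```python
import re

# Compile once, then reuse the resulting regex object.
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
text = 'Call 415-555-1234 or 212-555-9876.'  # hypothetical sample text

first = phone_re.search(text)         # first match anywhere in the string
at_start = phone_re.match(text)       # only matches at the very beginning
all_numbers = phone_re.findall(text)  # every non-overlapping match

print(first.group())   # 415-555-1234
print(at_start)        # None, because the text starts with 'Call'
print(all_numbers)     # ['415-555-1234', '212-555-9876']
```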

## 2. Why do raw strings often appear in Regex objects?

Raw strings are often used in regex objects because they treat backslashes (\) as literal characters rather than escape characters. In regular expressions, backslashes have special meanings and are used to escape metacharacters or introduce special sequences.

By using raw strings (denoted by the r prefix), you can write regular expressions more conveniently and avoid excessive use of escape characters. This is particularly useful in regex patterns that contain a lot of backslashes or special sequences.

For example, let's say you want to match a literal backslash followed by the letter "n" in a string. The regex engine needs to see the pattern \\n (an escaped backslash, then "n"). In a non-raw string, each of those backslashes must itself be escaped for Python, so you would have to write "\\\\n". In a raw string, you can simply write r"\\n", and Python passes both backslashes through to the regex engine unchanged.

Here's an example of using a raw string in a regex pattern:

import re

pattern = r'\d{3}-\d{3}-\d{4}'

In this example, the r prefix before the string indicates that it is a raw string. The pattern \d{3}-\d{3}-\d{4} matches a phone number in the format of three digits, followed by a hyphen, followed by three digits, and finally another hyphen and four digits.
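To make the backslash point concrete, here is a small sketch comparing the raw and non-raw spellings of the same pattern. The sample text and variable names are illustrative assumptions.

```python
import re

text = 'path\\name'   # hypothetical sample: contains one backslash followed by 'n'

normal = '\\\\n'      # non-raw string: four backslashes in the source code
raw = r'\\n'          # raw string: two backslashes in the source code

print(normal == raw)  # True: both spell the same 3-character pattern
print(len(raw))       # 3

# The pattern matches a literal backslash followed by the letter 'n'.
found = re.search(raw, text).group()
print(len(found))     # 2
```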

## 3. What is the return value of the search() method?

The search() method in Python's regular expression module (re) searches for a pattern within a string and returns a match object if a match is found. The return value of the search() method depends on whether a match is found or not. Here are the possible return values:

Match object: If a match is found, the search() method returns a match object. You can use this match object to obtain information about the match, such as the matched string, the position of the match, and any captured groups within the pattern.

None: If no match is found, the search() method returns None, which indicates that the pattern was not found in the string.

Here's an example demonstrating the use of the search() method:

In [None]:
import re

pattern = r'apple'
text = 'I have an apple and a banana.'

match = re.search(pattern, text)
if match:
    print('Match found:', match.group())
else:
    print('No match found.')

Match found: apple
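
The match object also carries the position information mentioned above. A minimal sketch, reusing the same sample sentence (the variable names are illustrative):

```python
import re

text = 'I have an apple and a banana.'
match = re.search(r'apple', text)

if match is not None:
    print(match.group())  # apple
    print(match.start())  # 10
    print(match.end())    # 15
    print(match.span())   # (10, 15)

# When the pattern is absent, search() returns None.
missing = re.search(r'cherry', text)
print(missing)            # None
```

Because search() can return None, checking the result before calling group() avoids an AttributeError.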


## 4. From a Match object, how do you get the actual strings that match the pattern?

To get the actual strings that match the pattern from a match object, we can use the group() method. The group() method returns the substring of the input string that matched the pattern.

By default, group() called with no arguments (or with 0) returns the entire matched string. If the regular expression contains capturing parentheses, we can also pass a group number to group() to retrieve the text matched by that particular capturing group.

Here's an example that demonstrates how to retrieve the matched strings from a match object:

In [None]:
import re

pattern = r'\d+'
text = 'I have 3 apples and 5 bananas.'

match = re.search(pattern, text)
if match:
    print('Match found:', match.group())  # Retrieve the entire matched string
else:
    print('No match found.')

Match found: 3


If the pattern contains capturing groups, we can pass an argument to group() to specify which group to retrieve. For example:

In [None]:
import re

pattern = r'(\d+)\s+(\w+)'
text = 'I have 3 apples.'

match = re.search(pattern, text)
if match:
    print('Number:', match.group(1))  # Retrieve the first capturing group (digits)
    print('Fruit:', match.group(2))   # Retrieve the second capturing group (word)
else:
    print('No match found.')

Number: 3
Fruit: apples


In this example, the pattern (\d+)\s+(\w+) matches one or more digits, followed by one or more whitespace characters, followed by one or more word characters. The search() method finds the first match within the string text; group(1) retrieves the digits, and group(2) retrieves the word.

## 5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

In the given regular expression r'(\d\d\d)-(\d\d\d-\d\d\d\d)', the groups are defined by the parentheses. Let's break down the groups:

Group 0: The entire match. It covers the entire pattern, including both groups 1 and 2.

Group 1: The first group defined by (\d\d\d). It matches three consecutive digits.

Group 2: The second group defined by (\d\d\d-\d\d\d\d). It matches three digits followed by a hyphen and then four digits.
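The group numbering can be checked with a short sketch. The phone number and variable names are made-up sample data.

```python
import re

phone_re = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
match = phone_re.search('My number is 415-555-4242.')  # hypothetical text

print(match.group(0))  # 415-555-4242  (the entire match)
print(match.group(1))  # 415           (first set of parentheses)
print(match.group(2))  # 555-4242      (second set of parentheses)
print(match.groups())  # ('415', '555-4242')
```

Note that groups() returns only the numbered groups, not group 0.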

## 6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

In regular expressions, certain characters like parentheses and periods have special meanings. If you want to match these characters literally, put a backslash in front of them: \( and \) match literal parentheses, and \. matches a literal period. This is called "escaping" the character.

In [6]:
import re

pattern = r'\.'
string = 'This is a test.'

matches = re.findall(pattern, string)
print(matches)  # Output: ['.']

['.']
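
The same escaping works for parentheses. The sketch below uses a made-up sample string; it also shows re.escape(), a standard-library helper that adds the backslashes for you.

```python
import re

text = 'Call (415) 555-4242.'  # hypothetical sample text

# \( and \) match literal parentheses; the unescaped pair forms a capturing group.
paren_re = re.compile(r'\((\d{3})\)')
match = paren_re.search(text)
print(match.group())   # (415)
print(match.group(1))  # 415

# re.escape() escapes every special character in a literal string.
literal = re.escape('(415)')
escaped_match = re.search(literal, text)
print(escaped_match.group())  # (415)
```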


## 7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

The findall() method in Python's regular expression module (re) returns different data structures based on the structure of the regular expression pattern used.

If the pattern contains no capturing groups (no parentheses), findall() returns a list of strings. Each element of the list is a complete match of the pattern in the input string.

In [7]:
import re

pattern = r'\d+'
string = 'I have 3 apples and 5 oranges'

matches = re.findall(pattern, string)
print(matches)

['3', '5']


In this case, the pattern \d+ matches one or more digits. Since there are no capturing groups, findall() returns a list of strings, where each string is a complete match.

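If the pattern contains two or more capturing groups (parentheses), findall() returns a list of string tuples. Each tuple represents one match of the entire pattern, and each element of the tuple corresponds to a capturing group within the pattern. If the pattern contains exactly one capturing group, findall() returns a list of strings holding just that group's text for each match.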

In [8]:
import re

pattern = r'(\d+)-(\w+)'
string = 'ID: 123-ABC, ID: 456-DEF'

matches = re.findall(pattern, string)
print(matches)  

[('123', 'ABC'), ('456', 'DEF')]


In this case, the pattern (\d+)-(\w+) matches a sequence of one or more digits followed by a hyphen, followed by a sequence of one or more word characters. Since the pattern has two capturing groups, findall() returns a list of tuples, where each tuple contains the text matched by each group.
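One case is easy to overlook: with exactly one capturing group, findall() returns a list of strings containing only that group's text, not tuples. A small sketch comparing zero, one, and two groups on the same sample string (variable names are illustrative):

```python
import re

text = 'ID: 123-ABC, ID: 456-DEF'

no_groups = re.findall(r'\d+-\w+', text)       # whole matches
one_group = re.findall(r'(\d+)-\w+', text)     # only the group's text
two_groups = re.findall(r'(\d+)-(\w+)', text)  # one tuple per match

print(no_groups)   # ['123-ABC', '456-DEF']
print(one_group)   # ['123', '456']
print(two_groups)  # [('123', 'ABC'), ('456', 'DEF')]
```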

## 8. In regular expressions, what does the | character mean?

In regular expressions, the | character is known as the pipe or alternation operator. It is used to specify multiple alternatives within a pattern. It matches either the expression preceding it or the expression following it.

Here's how the | operator works in regular expressions:

expression1|expression2: Matches either expression1 or expression2.

For example, let's say we want to match either the word "cat" or "dog" in a string. We can use the | operator to create a pattern that matches either of the two words:

In [9]:
import re

pattern = r'cat|dog'
string = 'I have a cat and a dog'

matches = re.findall(pattern, string)
print(matches)

['cat', 'dog']


In this case, the pattern cat|dog matches either the word "cat" or the word "dog". The findall() method returns a list of all matches found in the input string.

## 9. In regular expressions, what does the . character stand for?

The character . (period) in regular expressions is known as a metacharacter and has a special meaning. It is used to match any single character except for a newline.

Here are some key points regarding the . metacharacter:

`.` matches any character except a newline character (\n).
It matches a single occurrence of any character, including letters, digits, symbols, whitespace, etc.
If you want to match a literal period character, you need to escape it with a backslash `\.`.
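
The key points above can be sketched as follows. The sample strings are made up for illustration.

```python
import re

# '.' stands for any single character except a newline.
any_at = re.findall(r'.at', 'cat hat\nat')
print(any_at)        # ['cat', 'hat']  (the newline before the last 'at' is not matched)

# An unescaped '.' matches every character here, including the period itself.
every_char = re.findall(r'.', 'a.b')
print(every_char)    # ['a', '.', 'b']

# An escaped '\.' matches only literal periods.
only_periods = re.findall(r'\.', 'v1.2.3')
print(only_periods)  # ['.', '.']
```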

## 10. In regular expressions, what is the difference between the + and * characters?

In regular expressions, the + and * characters are quantifiers used to specify the repetition of the preceding element. Here's the difference between the two:

`+` (plus quantifier):
The `+` quantifier matches one or more occurrences of the preceding element. It requires at least one occurrence of the preceding element for a match to be found.
For example, consider the pattern `a+`:

`a+` matches one or more consecutive occurrences of the letter 'a'.
Examples:

aaa matches `a+`

a matches `a+`

b does not match `a+`

`*` (asterisk quantifier):
The `*` quantifier matches zero or more occurrences of the preceding element. It allows zero occurrences of the preceding element to be considered a match.
For example, consider the pattern `a*`:

`a*` matches zero or more consecutive occurrences of the letter 'a'.
Examples:

aaa matches `a*`

a matches `a*`

b matches `a*` (zero occurrences: the match is the empty string at the start of b)

In summary:

`+` matches one or more occurrences of the preceding element.

`*` matches zero or more occurrences of the preceding element.

It's important to note that these quantifiers are "greedy" by default, meaning they match as many occurrences as possible. If you want a "non-greedy" match, you can add a ? after the quantifier, as in `+?` or `*?`, which will match as few occurrences as possible.
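A short sketch of these differences, using re.fullmatch() to test whole strings and re.search() to show what actually gets matched (the variable names are illustrative):

```python
import re

plus_on_aaa = re.fullmatch(r'a+', 'aaa') is not None
plus_on_empty = re.fullmatch(r'a+', '') is not None
star_on_empty = re.fullmatch(r'a*', '') is not None
print(plus_on_aaa, plus_on_empty, star_on_empty)  # True False True

# On 'b', a* still succeeds by matching the empty string; a+ finds nothing.
star_on_b = re.search(r'a*', 'b').group()
plus_on_b = re.search(r'a+', 'b')
print(repr(star_on_b))  # ''
print(plus_on_b)        # None

# Greedy versus non-greedy on the same input.
greedy = re.search(r'a+', 'aaa').group()
lazy = re.search(r'a+?', 'aaa').group()
print(greedy)  # aaa
print(lazy)    # a
```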

## 11. What is the difference between {4} and {4,5} in regular expression?

In regular expressions, the curly braces {} are used as quantifiers to specify the exact or a range of repetition for the preceding element. Here's the difference between {4} and {4,5}:

{4}: Specifies an exact repetition count
The {4} quantifier specifies that the preceding element must occur exactly four times.
For example, consider the pattern a{4}:

a{4} matches exactly four consecutive occurrences of the letter 'a'.

Examples:

aaaa matches a{4}

aaa does not match a{4}

aaaaa does not match a{4} as a whole (a search would still find aaaa inside it)

{4,5}: Specifies a range of repetition counts
The {4,5} quantifier specifies that the preceding element must occur between four and five times, inclusive.

For example, consider the pattern a{4,5}:

a{4,5} matches between four and five consecutive occurrences of the letter 'a'.
Examples:

aaaa matches a{4,5}

aaaaa matches a{4,5}

aaa does not match a{4,5}

aaaaaaaa does not match a{4,5} as a whole (a search would still find aaaaa inside it)
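
The examples above describe whole-string matches, which re.fullmatch() checks directly. A minimal sketch (the `results` dictionary is just an illustrative way to collect the outcomes):

```python
import re

results = {}
for s in ['aaa', 'aaaa', 'aaaaa', 'aaaaaa']:
    exact = re.fullmatch(r'a{4}', s) is not None     # exactly four
    ranged = re.fullmatch(r'a{4,5}', s) is not None  # four or five
    results[s] = (exact, ranged)
    print(s, exact, ranged)

# search() only needs part of the string, so it finds four a's inside five.
inside = re.search(r'a{4}', 'aaaaa').group()
print(inside)  # aaaa
```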

## 12. What do the `\d`, `\w`, and `\s` shorthand character classes signify in regular expressions?

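\d: Matches any digit.
The \d shorthand character class represents any digit character. For ASCII text it is equivalent to the character range [0-9]; with Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is used.
For example, the pattern \d would match a single digit.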

Examples:

0 matches \d

5 matches \d

a does not match \d

10 does not match \d as a whole (\d matches only one digit at a time)

\w: Matches any word character.
The \w shorthand character class represents any word character: letters, digits, and the underscore (_). For ASCII text it is equivalent to the character class [a-zA-Z0-9_]; with Python 3 string patterns it also matches Unicode letters and digits unless the re.ASCII flag is used.
For example, the pattern \w would match a single word character.

Examples:

a matches \w

Z matches \w

9 matches \w

@ does not match \w (non-word character)

\s: Matches any whitespace character.
The \s shorthand character class represents any whitespace character, including spaces, tabs, and newline characters.
For example, the pattern \s would match a single whitespace character.

Examples:

' ' matches \s (space character)

\t matches \s (tab character)

\n matches \s (newline character)

a does not match \s (non-whitespace character)
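
The three classes can be tried side by side. This is a sketch with a made-up sample string; the last two lines illustrate how the re.ASCII flag restricts \d to 0-9 in Python 3.

```python
import re

text = 'Room_7 costs\t$42\n'  # hypothetical sample text

digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
spaces = re.findall(r'\s', text)

print(digits)  # ['7', '4', '2']
print(words)   # ['Room_7', 'costs', '42']
print(spaces)  # [' ', '\t', '\n']

# '\u0663' is the Arabic-Indic digit three, a Unicode decimal digit.
unicode_digits = re.findall(r'\d', '\u0663')
ascii_digits = re.findall(r'\d', '\u0663', re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 1 0
```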

## 13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

In regular expressions, the shorthand character classes \D, \W, and \S are negated versions of their counterparts (\d, \w, and \s). They represent character classes that match characters that are not part of certain predefined types. Here's what each of these negated shorthand character classes signifies:

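\D: Matches any non-digit character.
The \D shorthand character class matches any character that is not a digit. It is the negation of the \d shorthand character class.
For example, the pattern \D would match a single non-digit character.

Examples:

a matches \D

@ matches \D

5 does not match \D

10 does not match \D (both characters are digits)

\W: Matches any non-word character.
The \W shorthand character class matches any character that is not a letter, digit, or underscore. It is the negation of the \w shorthand character class.

Examples:

@ matches \W

a does not match \W

\S: Matches any non-whitespace character.
The \S shorthand character class matches any character that is not a space, tab, newline, or other whitespace. It is the negation of the \s shorthand character class.

Examples:

a matches \S

' ' does not match \S (space character)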
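
As a quick sketch of all three negated classes (\D for non-digits, \W for non-word characters, \S for non-whitespace), applied to a made-up sample string:

```python
import re

text = 'a1 b_2!'  # hypothetical sample text

non_digits = re.findall(r'\D', text)   # everything except digits
non_words = re.findall(r'\W', text)    # everything except letters, digits, '_'
non_spaces = re.findall(r'\S+', text)  # runs of non-whitespace characters

print(non_digits)  # ['a', ' ', 'b', '_', '!']
print(non_words)   # [' ', '!']
print(non_spaces)  # ['a1', 'b_2!']
```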

## 14. What is the difference between .* and .*?

In regular expressions, the .*? and .* are both quantifiers used for matching patterns. However, they have different behaviors due to the presence or absence of the ? modifier.

`.*` (greedy quantifier):
The `.*` expression is a greedy quantifier that matches zero or more occurrences of any character, except for a newline (\n). The greedy behavior means that it will match as many characters as possible while still allowing the overall pattern to match.
For example, consider the pattern `a.*b`:

`a.*b` matches an 'a', followed by any number of characters, and ends with a 'b'. It will match the longest possible string that satisfies this condition.
Examples:

In the string "abcb", `a.*b` matches "abcb" (greedily extends the match to the last 'b').

`.*?` (non-greedy/lazy quantifier):
The `.*?` expression is a non-greedy or lazy quantifier that matches zero or more occurrences of any character, except for a newline (\n). The non-greedy behavior means that it will match as few characters as possible while still allowing the overall pattern to match.
For example, consider the pattern `a.*?b`:

`a.*?b` matches an 'a', followed by as few characters as possible, and ends with a 'b'. Starting from the first 'a', it stops at the nearest 'b'.
Examples:

In the string "abcb", `a.*?b` matches "ab" (stops at the first 'b').
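
The difference shows up clearly when both forms run on the same input. In this sketch the HTML-like string is a made-up sample:

```python
import re

text = 'abcb'
greedy = re.search(r'a.*b', text).group()
lazy = re.search(r'a.*?b', text).group()
print(greedy)  # abcb
print(lazy)    # ab

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample text
greedy_tags = re.findall(r'<.*>', html)
lazy_tags = re.findall(r'<.*?>', html)
print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```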

## 15. What is the syntax for matching both numbers and lowercase letters with a character class?

In [10]:
import re

pattern = r'[0-9a-z]'
string = 'a1b2c3'

matches = re.findall(pattern, string)
print(matches)

['a', '1', 'b', '2', 'c', '3']


## 16. What is the procedure for making a regular expression case-insensitive?

To make a regular expression case-insensitive in Python, you can use the re.IGNORECASE flag or the re.I flag as an argument to the regex functions. Here's the procedure:

Compile the regular expression pattern with the case-insensitive flag:
When compiling your regular expression pattern, you can include the re.IGNORECASE flag or the re.I flag as an argument to the re.compile() function. This flag makes the regular expression case-insensitive.

pattern = re.compile(r'pattern_here', re.IGNORECASE)

Use the compiled pattern for matching:
You can then use the compiled pattern object to perform matching using methods like search(), match(), or findall(). The case-insensitive flag will be applied to the pattern during matching.

result = pattern.search(your_text_here)

In [11]:
import re

pattern = re.compile(r'hello', re.IGNORECASE)
text = 'Hello, World!'

result = pattern.search(text)
if result:
    print('Match found.')
else:
    print('Match not found.')

Match found.
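
Compiling is optional: the flag can also be passed straight to a module-level function such as re.search(), or written inline as (?i) at the start of the pattern. A minimal sketch with illustrative variable names:

```python
import re

text = 'Hello, World!'

via_flag = re.search(r'hello', text, re.I)    # re.I is short for re.IGNORECASE
via_inline = re.search(r'(?i)hello', text)    # inline flag inside the pattern
case_sensitive = re.search(r'hello', text)    # no flag: case must match

print(via_flag.group())    # Hello
print(via_inline.group())  # Hello
print(case_sensitive)      # None
```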


## 17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

In regular expressions, the . (dot) character normally matches any character except for a newline (\n). However, if the re.DOTALL flag is passed as the second argument to the re.compile() function or used inline with the regular expression pattern, the . character will match any character including a newline.

In [12]:
import re

pattern = re.compile(r'a.b')
text = 'a\nb'

result = pattern.search(text)
if result:
    print('Match found.')
else:
    print('Match not found.')

Match not found.


In [13]:
import re

pattern = re.compile(r'a.b', re.DOTALL)
text = 'a\nb'

result = pattern.search(text)
if result:
    print('Match found.')
else:
    print('Match not found.')

Match found.


In this example, by using the re.DOTALL flag, the a.b pattern is able to match the 'a\nb' text because the . now matches the newline character as well.
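The inline form mentioned above is (?s), placed at the start of the pattern. Flags can also be combined with the | operator. A short sketch (the variable names are illustrative):

```python
import re

text = 'a\nb'

default = re.search(r'a.b', text)      # '.' stops at the newline
inline = re.search(r'(?s)a.b', text)   # inline equivalent of re.DOTALL

# Combine several flags with the bitwise OR operator.
combined_re = re.compile(r'A.B', re.DOTALL | re.IGNORECASE)
combined = combined_re.search(text)

print(default)                 # None
print(inline is not None)      # True
print(combined is not None)    # True
```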

## 19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

Passing re.VERBOSE as the second argument to re.compile() in Python allows you to write regular expressions in a more readable and organized manner by ignoring whitespace and adding comments.

Here's what re.VERBOSE enables you to do:

Ignore whitespace: When re.VERBOSE is used, whitespace within the regular expression pattern is ignored, except inside a character class or when escaped with a backslash. This allows you to add spaces, tabs, and newlines for formatting and readability without affecting the pattern itself.

Add comments: You can add comments to the regular expression pattern by using the # symbol. Everything from an unescaped # to the end of the line is ignored by the regular expression engine and serves as an explanatory note for understanding the pattern.

In [14]:
import re

pattern = re.compile(r"""
    \d{3}  # Match three digits
    -      # Match a hyphen
    \d{3}  # Match three digits
    -      # Match a hyphen
    \d{4}  # Match four digits
""", re.VERBOSE)

text = 'Phone number: 123-456-7890'

result = pattern.search(text)
if result:
    print('Match found.')
else:
    print('Match not found.')

Match found.


In this example, the regular expression pattern for matching a phone number is written in a multi-line string with re.VERBOSE as the second argument to re.compile(). The pattern is broken down into multiple lines and includes comments for better readability.
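One consequence of ignoring whitespace is that a literal space in a verbose pattern has to be written explicitly, for example as [ ] or as a backslash followed by a space. A sketch with a made-up name pattern and sample text:

```python
import re

name_re = re.compile(r"""
    [A-Z][a-z]+   # first name
    [ ]           # a literal space (a bare space would be ignored)
    [A-Z][a-z]+   # last name
""", re.VERBOSE)

found = name_re.search('by Ada Lovelace').group()  # hypothetical sample text
print(found)  # Ada Lovelace

# Without the character class, the space is dropped from the pattern.
loose_re = re.compile(r'a b', re.VERBOSE)
matches_ab = loose_re.fullmatch('ab') is not None
matches_a_b = loose_re.fullmatch('a b') is not None
print(matches_ab, matches_a_b)  # True False
```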

## 20. How would you write a regex that matches a number with a comma after every three digits? It must match the following:

'42'

'1,234'

'6,368,745'

but not the following:

'12,34,567' (which has only two digits between the commas)

'1234' (which lacks commas)

In [15]:
import re

pattern = re.compile(r'^\d{1,3}(,\d{3})*$')

numbers = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for number in numbers:
    match = pattern.match(number)
    if match:
        print(f"Match found: {number}")
    else:
        print(f"No match found: {number}")

Match found: 42
Match found: 1,234
Match found: 6,368,745
No match found: 12,34,567
No match found: 1234


^ asserts the start of the string.

\d{1,3} matches one to three digits.

(,\d{3})* matches zero or more occurrences of a comma followed by exactly three digits.

$ asserts the end of the string.
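
One subtlety with the anchors: $ also matches just before a newline at the very end of the string. If that matters, an alternative sketch is to drop ^ and $ and use fullmatch() instead. The names `anchored_re` and `bare_re` are illustrative, and this is one possible variant rather than a required change.

```python
import re

anchored_re = re.compile(r'^\d{1,3}(,\d{3})*$')
bare_re = re.compile(r'\d{1,3}(,\d{3})*')  # same pattern without ^ and $

samples = ['42', '1,234', '6,368,745', '12,34,567', '1234']
results = {s: bare_re.fullmatch(s) is not None for s in samples}
print(results)

# A trailing newline slips past $ but not past fullmatch().
with_dollar = anchored_re.match('42\n') is not None
with_fullmatch = bare_re.fullmatch('42\n') is not None
print(with_dollar, with_fullmatch)  # True False
```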

## 21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:

Haruto Watanabe

Alice Watanabe

RoboCop Watanabe

but not the following:

haruto Watanabe (where the first name is not capitalized)

Mr. Watanabe (where the preceding word has a nonletter character)

Watanabe (which has no first name)

Haruto watanabe (where Watanabe is not capitalized)

In [16]:
import re

pattern = re.compile(r'^[A-Z][a-zA-Z]* Watanabe$')

names = ['Haruto Watanabe', 'Alice Watanabe', 'RoboCop Watanabe', 'haruto Watanabe', 'Mr. Watanabe', 'Watanabe', 'Haruto watanabe']

for name in names:
    match = pattern.match(name)
    if match:
        print(f"Match found: {name}")
    else:
        print(f"No match found: {name}")

Match found: Haruto Watanabe
Match found: Alice Watanabe
Match found: RoboCop Watanabe
No match found: haruto Watanabe
No match found: Mr. Watanabe
No match found: Watanabe
No match found: Haruto watanabe


^ asserts the start of the string.

[A-Z] matches an uppercase letter (first letter of the first name).

[a-zA-Z]* matches zero or more lowercase or uppercase letters (remaining letters of the first name).

The single space in the pattern matches a space character.

Watanabe matches the last name "Watanabe" exactly.

$ asserts the end of the string.

## 22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

Alice eats apples.

Bob pets cats.

Carol throws baseballs.

Alice throws Apples.

BOB EATS CATS.

but not the following:

RoboCop eats apples.

ALICE THROWS FOOTBALLS.

Carol eats 7 cats.

In [17]:
import re

pattern = re.compile(r'^(Alice|Bob|Carol) (eats|pets|throws) (apples|cats|baseballs)\.$', re.IGNORECASE)

sentences = ['Alice eats apples.', 'Bob pets cats.', 'Carol throws baseballs.', 'Alice throws Apples.', 'BOB EATS CATS.', 
             'RoboCop eats apples.', 'ALICE THROWS FOOTBALLS.', 'Carol eats 7 cats.']

for sentence in sentences:
    match = pattern.match(sentence)
    if match:
        print(f"Match found: {sentence}")
    else:
        print(f"No match found: {sentence}")

Match found: Alice eats apples.
Match found: Bob pets cats.
Match found: Carol throws baseballs.
Match found: Alice throws Apples.
Match found: BOB EATS CATS.
No match found: RoboCop eats apples.
No match found: ALICE THROWS FOOTBALLS.
No match found: Carol eats 7 cats.


^ asserts the start of the string.

(Alice|Bob|Carol) matches one of the specified names: "Alice", "Bob", or "Carol".

(eats|pets|throws) matches one of the specified actions: "eats", "pets", or "throws".

(apples|cats|baseballs) matches one of the specified objects: "apples", "cats", or "baseballs".

\. matches a period (escaped with a backslash to match the actual period character).

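$ asserts the end of the string.

re.IGNORECASE makes the whole pattern case-insensitive, so "BOB EATS CATS." and "Alice throws Apples." match as well.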