# Regular Expressions

Python provides the re module for working with regular expressions. You can use it to find, match, and manipulate string patterns.

In [3]:
import re

### Basic Metacharacters and Their Meanings

#### 2. . (Dot):
The dot . is a special character that matches any character except a newline (\n).

In [11]:
string = 'hello\nworld'
pattern = r'.'
res = re.findall(pattern,string)
res

['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']

In [13]:
type(res)

list
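
The newline exclusion described above can be lifted with the re.DOTALL flag. The following is a minimal sketch that reuses the string from the previous cell; the variable names default_matches and dotall_matches are illustrative, not part of the original notebook.

```python
import re

string = 'hello\nworld'

# By default the dot skips the newline character.
default_matches = re.findall(r'.', string)

# With re.DOTALL the dot also matches '\n'.
dotall_matches = re.findall(r'.', string, flags=re.DOTALL)

print(len(default_matches))    # 10
print(len(dotall_matches))     # 11
print('\n' in dotall_matches)  # True
```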

#### 3. ^ (Caret):
The ^ symbol anchors a pattern to the __beginning__ of a string, so the pattern matches only if the string starts with it.

In [19]:
pattern = r"^Hello"
string1 = "Hello, World!"
string2 = "World, Hello!"

res1 = re.match(pattern,string1)
res1

<re.Match object; span=(0, 5), match='Hello'>

In [25]:
res2 = re.match(pattern,string2)
res2  # no output: string2 does not start with "Hello", so re.match returns None
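
Because re.match always starts at the beginning of the string, the caret is easier to observe with re.search. The sketch below is illustrative only: the two-line string text is a made-up example, and it assumes the re.MULTILINE flag is acceptable for the use case.

```python
import re

text = "World, Hello!\nHello again"

# Without flags, ^ anchors only at the very start of the string.
start_only = re.search(r"^Hello", text)

# With re.MULTILINE, ^ also anchors right after each newline.
per_line = re.search(r"^Hello", text, flags=re.MULTILINE)

print(start_only)       # None
print(per_line.span())  # (14, 19)
```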

#### 4. $ (Dollar Sign):
The $ symbol matches the end of a string. It ensures the string ends with the pattern that precedes the $.

In [40]:
pattern = r'World$'
string1 = "Hello, World"
string2 = "Hello World!"
re.search(pattern,string1)

<re.Match object; span=(7, 12), match='World'>

In [42]:
re.search(pattern,string2)
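
One detail worth checking by hand is how the end anchor treats a trailing newline. The sketch below reuses the pattern from the cell above; the strings with "\n" and the \Z comparison are additions for illustration.

```python
import re

pattern = r'World$'

# $ matches at the very end of the string ...
at_end = re.search(pattern, "Hello, World")

# ... and also just before a single trailing newline.
before_newline = re.search(pattern, "Hello, World\n")

# \Z matches only at the very end, so the trailing newline blocks it.
strict_end = re.search(r'World\Z', "Hello, World\n")

print(at_end.span())          # (7, 12)
print(before_newline.span())  # (7, 12)
print(strict_end)             # None
```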

#### 5. * (Asterisk):
The * symbol matches 0 or more occurrences of the preceding element.

In [45]:
pattern = r"a*"
string = "aaabbb"

# Match 0 or more 'a's
result = re.findall(pattern, string)
print(result)  # Output: ['aaa', '', '', '', '']


['aaa', '', '', '', '']


Here, a* matches 0 or more occurrences of "a". It matches "aaa" at the start and then an empty string at every remaining position where no "a" follows, including the very end of the string.

In [52]:
pattern = r"a*"
string = "aaabbbaaca"

# Match 0 or more 'a's
result = re.findall(pattern, string)
print(result)  

['aaa', '', '', '', 'aa', '', 'a', '']
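
The empty matches disappear once * sits between other required characters. As a small sketch (the pattern ab*c and the sample string are made up for illustration), the asterisk lets "b" appear any number of times, including not at all:

```python
import re

pattern = r"ab*c"
string = "ac abc abbbc adc"

# 'a', then zero or more 'b's, then 'c'
result = re.findall(pattern, string)
print(result)  # ['ac', 'abc', 'abbbc']
```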


#### 6. + (Plus Sign):
The + symbol matches 1 or more occurrences of the preceding element.

In [57]:
pattern = r"a+"
string = "aaabbbaaca,''"

# Match 1 or more 'a's
result = re.findall(pattern, string)
print(result)  # Output: ['aaa', 'aa', 'a']

['aaa', 'aa', 'a']
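
Unlike *, the + quantifier never produces an empty match, because at least one occurrence is required. A short sketch with a hypothetical pattern lo+l shows that the form with no "o" is rejected:

```python
import re

pattern = r"lo+l"
string = "ll lol loool"

# 'l', then one or more 'o's, then 'l'; "ll" has no 'o', so it is skipped
result = re.findall(pattern, string)
print(result)  # ['lol', 'loool']
```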


#### 7. ? (Question Mark):
The ? symbol matches 0 or 1 occurrence of the preceding element.

In [62]:
pattern = r"ca?t"
string = "cat and ct"

# Match 'c' followed by 'a' 0 or 1 time, followed by 't'
result = re.findall(pattern, string)
print(result)  # Output: ['cat', 'ct']

['cat', 'ct']


Here, ca?t matches both "cat" and "ct" because the ? allows "a" to appear either 0 or 1 time.
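
A common practical use of ? is an optional trailing letter. The sketch below uses an illustrative pattern https? that is not part of the original notebook:

```python
import re

pattern = r"https?"
string = "http and https"

# 'http' followed by an optional 's'
result = re.findall(pattern, string)
print(result)  # ['http', 'https']
```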

#### 8. [] (Square Brackets):
Square brackets are used to match a set of characters. Inside square brackets, you can specify a range or a list of characters.

In [66]:
pattern = r"[abc]"
string = "a test b for c matching"

# Match 'a', 'b', or 'c'
result = re.findall(pattern, string)
print(result)  # Output: ['a', 'b', 'c', 'a', 'c']

['a', 'b', 'c', 'a', 'c']


In this case, [abc] matches either "a", "b", or "c" in the string. You can also specify ranges, like [a-z] to match lowercase letters.

In [71]:
pattern = r"[a-z]"
string = "ABCDapda"

# Match any lowercase letter
result = re.findall(pattern, string)
print(result)  # Output: ['a', 'p', 'd', 'a']

['a', 'p', 'd', 'a']


In [106]:
pattern = r"[A-Z]"
string = "ABCDapda"

# Match any uppercase letter
result = re.findall(pattern, string)
print(result)  # Output: ['A', 'B', 'C', 'D']

['A', 'B', 'C', 'D']
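
Ranges can also be combined inside one set, and a leading ^ inside the brackets negates the set. The following sketch extends the string used above with two digits purely for illustration; the names letters and non_letters are hypothetical.

```python
import re

string = "ABCDapda42"

# Several ranges can share one set: any letter, upper or lower case.
letters = re.findall(r"[A-Za-z]", string)

# A leading ^ inside the brackets negates the set: anything that is not a letter.
non_letters = re.findall(r"[^A-Za-z]", string)

print(letters)      # ['A', 'B', 'C', 'D', 'a', 'p', 'd', 'a']
print(non_letters)  # ['4', '2']
```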


#### 9. {} (Braces):
Curly braces {} specify how many times the preceding character or group must occur.

 - {n}: Exactly n times.
 - {n,}: At least n times.
 - {n,m}: Between n and m times, inclusive.

In [76]:
pattern = r"a{2}"
string = "aaaabc"

# Match exactly 2 'a's in a row; four 'a's give two non-overlapping matches
result = re.findall(pattern, string)
print(result)  # Output: ['aa', 'aa']

['aa', 'aa']


In [78]:
pattern = r"a{2,}"
string = "aaaabc"

# Match 2 or more 'a's in a row
result = re.findall(pattern, string)
print(result)  # Output: ['aaaa']

['aaaa']


In [80]:
pattern = r"a{2,4}"
string = "aaaabc"

# Match between 2 and 4 'a's in a row
result = re.findall(pattern, string)
print(result)  # Output: ['aaaa']

['aaaa']
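
With only four "a" characters, {2,} and {2,4} give the same result. The sketch below uses a longer, made-up string of six "a" characters so that the three forms differ; the variable names are illustrative.

```python
import re

string = "aaaaaa"  # six 'a's

exact = re.findall(r"a{2}", string)      # pairs, taken left to right
at_least = re.findall(r"a{2,}", string)  # as many as possible, minimum 2
bounded = re.findall(r"a{2,4}", string)  # as many as possible, maximum 4

print(exact)     # ['aa', 'aa', 'aa']
print(at_least)  # ['aaaaaa']
print(bounded)   # ['aaaa', 'aa']
```

The quantifiers are greedy, so each match takes as many characters as its upper bound allows before the search continues.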


#### 10. \ (Backslash):
The backslash \ is used to escape special characters in regex or to define special sequences like \d, \w, \s, etc.

Special Sequences:
 - \d: Matches any digit (equivalent to [0-9] for ASCII text).
 - \w: Matches any word character: a letter, a digit, or the underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
 - \s: Matches any whitespace character, such as a space, tab, or newline.

In [86]:
pattern = r"\d+"
string = "There are 123 apples and 45 bananas."

# Match each run of one or more digits
result = re.findall(pattern, string)
print(result)  # Output: ['123', '45']

['123', '45']


Here, \d+ matches 1 or more digits in the string. It finds "123" and "45".
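
The escaping role of the backslash and the \w and \s sequences are not exercised above, so here is a small sketch. The sample strings and variable names are made up for illustration.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only '.'.
any_char = re.findall(r"3.5", "3.5 3x5")
literal_dot = re.findall(r"3\.5", "3.5 3x5")

string = "Price: 3.50 USD"

# \w+ collects runs of word characters; \s matches each whitespace character.
words = re.findall(r"\w+", string)
spaces = re.findall(r"\s", string)

print(any_char)     # ['3.5', '3x5']
print(literal_dot)  # ['3.5']
print(words)        # ['Price', '3', '50', 'USD']
print(spaces)       # [' ', ' ']
```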

## Basic Regex Functions in Python:

#### 11. re.match():
This function checks if the pattern matches at the start of the string. It only looks for a match at the beginning.

In [93]:
pattern = r"Hello"
string = "Hello, World!"

result = re.match(pattern, string)
if result:
    print("Matched at the start!")

Matched at the start!
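
When the pattern is not at the start, re.match returns None; when it is, the returned match object exposes the matched text and its position. A minimal sketch, with the second string "Say Hello" added as a hypothetical counterexample:

```python
import re

pattern = r"Hello"

hit = re.match(pattern, "Hello, World!")
miss = re.match(pattern, "Say Hello")  # "Hello" is present, but not at the start

print(hit.group())  # Hello
print(hit.span())   # (0, 5)
print(miss)         # None
```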


#### 12. re.search():
This function searches the entire string for the first occurrence of the pattern.

In [98]:
pattern = r"World"
string = "Hello, World!"

result = re.search(pattern, string)
if result:
    print("Found 'World' in the string!")

Found 'World' in the string!
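
To contrast the two functions, the sketch below runs both on a made-up string that contains "World" twice. It illustrates that re.match fails because the string does not start with the pattern, while re.search reports only the first occurrence.

```python
import re

string = "Hello, World! Hello again, World!"

at_start = re.match(r"World", string)
found = re.search(r"World", string)

print(at_start)       # None
print(found.span())   # (7, 12)
print(found.group())  # World
```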


#### 13. re.findall():
This function finds all non-overlapping occurrences of a pattern in a string and returns them as a list of strings.

Example:

In [101]:
pattern = r"\d+"
string = "My phone number is 12345 and my ID is 67890."

matches = re.findall(pattern, string)
print(matches)  # Output: ['12345', '67890']


['12345', '67890']
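
Two practical points follow from the list return type. The sketch below (variable names are illustrative) converts the matched strings to integers and shows that a string without matches yields an empty list rather than None.

```python
import re

string = "My phone number is 12345 and my ID is 67890."

# Matches come back as strings, so convert them if numbers are needed.
numbers = [int(m) for m in re.findall(r"\d+", string)]

# With no match at all, findall returns an empty list.
no_matches = re.findall(r"\d+", "no digits here")

print(numbers)     # [12345, 67890]
print(no_matches)  # []
```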


#### 14. re.sub():
This function replaces every match of a pattern with a given string and returns the new string.

In [104]:
pattern = r"cat"
string = "I have a cat and another cat."

# Replace 'cat' with 'dog'
new_string = re.sub(pattern, "dog", string)
print(new_string)  # Output: "I have a dog and another dog."

I have a dog and another dog.
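
re.sub also accepts an optional count argument that limits how many matches are replaced. The following sketch reuses the string above; the "PIN 1234" string is a made-up example of replacing with a regex pattern rather than a fixed word.

```python
import re

string = "I have a cat and another cat."

# count=1 replaces only the first match.
first_only = re.sub(r"cat", "dog", string, count=1)
print(first_only)  # I have a dog and another cat.

# The pattern can be any regex, for example a single digit.
masked = re.sub(r"\d", "#", "PIN 1234")
print(masked)  # PIN ####
```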
