# Regular Expressions in Python

Regular expressions (also known as regex or regexp) are a powerful tool for searching and manipulating text. They allow you to define a pattern or set of rules that describe a particular string of characters, and then search for or manipulate any text that matches that pattern.

Regular expressions are commonly used in programming, particularly for tasks like data validation, searching and replacing text, and parsing strings. They are also useful in text editors, command-line tools, and other applications that involve working with text.

Some of the benefits of using regular expressions include:

- **Flexibility:** Regular expressions are incredibly flexible and can match a wide range of patterns, from simple strings to complex sequences of characters.
- **Efficiency:** A single regular expression can replace many lines of hand-written string handling, and a well-written pattern can scan large amounts of text quickly.
- **Accuracy:** Regular expressions are very precise and can be used to match specific patterns, ensuring that you only work with the data that you need.
- **Standardization:** Regular expressions are a widely accepted standard for working with text, making it easier to share and collaborate on code that involves text processing.

To experiment with patterns interactively, you can use an online tester such as https://regexr.com/.

## Python RegEx Methods

Python provides a powerful module called `re` for working with regular expressions. This module provides various functions, including:

### 1. re.search(pattern, string, flags=0) 

The `re.search()` function scans a string for a pattern and returns a match object for the first occurrence. It returns `None` if no match is found. This is similar in spirit to the `in` operator on Python strings, except that `in` tests for a literal substring and returns a boolean, while `re.search()` matches a pattern and returns a match object. Because the result is either a match object (truthy) or `None` (falsy), it can be used directly in conditional expressions.

- `pattern`: The regular expression pattern to search for
- `string`: The string to search in
- `flags (optional)`: A set of flags that modify the behavior of the search

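It's a good idea to use raw strings (written as `r'...'`) to define regular expression patterns, because they stop Python from interpreting backslashes before the pattern reaches the `re` module. This matters for sequences such as `\d` and `\b`, which are covered later.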
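To make the raw-string advice concrete, here is a minimal sketch. The sample text and variable names are made up for illustration. In a normal string literal, Python turns `\b` into a backspace character before `re` ever sees it, whereas in a raw string the backslash survives and `\b` reaches the regex engine as a word boundary.

```python
import re

text = "a cat sat"

# Normal string: "\b" becomes a single backspace character.
plain = "\bcat\b"
# Raw string: the backslashes are kept, so re sees the \b word-boundary sequence.
raw = r"\bcat\b"

print(len(plain), len(raw))     # 5 7
print(re.search(plain, text))   # None
print(re.search(raw, text))     # <re.Match object; span=(2, 5), match='cat'>
```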

The match object contains information about the match. Some of the useful methods and attributes of the match object are:

- `group()`: Returns the matched string
- `start()`: Returns the starting index of the match
- `end()`: Returns the ending index of the match (exclusive, that is, the index just past the last matched character)
- `span()`: Returns a tuple containing the starting and ending indices of the match

In [1]:
import re

In [2]:
string = "The quick brown fox jumps over the lazy dog."
pattern = r"he"

match = re.search(pattern, string)

In [3]:
# Using re.search() with conditional expressions
if match:
    print("Match Object:", match)
    print("Match Group:", match.group())
    print("Match Start:", match.start())
    print("Match End:", match.end())
    print("Match Span:", match.span())
else:
    print("No match found.")

Match Object: <re.Match object; span=(1, 3), match='he'>
Match Group: he
Match Start: 1
Match End: 3
Match Span: (1, 3)


In the output `<re.Match object; span=(1, 3), match='he'>` above, `re.Match` is the type of the object, `match='he'` is the text that was matched, and `span=(1, 3)` gives the start and end indices of the match within `string`. Indexing starts from `0`, as usual in Python, and the end index is exclusive.
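Because the end index is exclusive, the span can be used directly as a slice. The short sketch below (variable names are illustrative) checks that slicing the original string with the span gives back the matched text.

```python
import re

string = "The quick brown fox jumps over the lazy dog."
match = re.search(r"he", string)

start, end = match.span()
# Slicing with the span returns exactly the matched text.
sliced = string[start:end]
print(sliced)                    # he
print(sliced == match.group())   # True
```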


### 2. re.findall(pattern, string, flags=0)

The `re.findall()` function finds all non-overlapping occurrences of a regular expression pattern in a string. The parameters are the same as for `re.search()`. The result is a list of all the matches found, or an empty list if there are none. The example below is deliberately simple; pattern design is discussed in the sections on metacharacters and special sequences.

In [4]:
string = "The quick brown fox jumps over the lazy dog."
pattern = r"[A-Z]he"

matches = re.findall(pattern, string)

In [5]:
print(matches)


['The']
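The example above yields a single match. As a small additional sketch (the patterns here are chosen only for illustration), the following shows `re.findall()` returning several matches in left-to-right order, and an empty list when nothing matches.

```python
import re

string = "The quick brown fox jumps over the lazy dog."

# Both the capitalized and the lowercase word are found.
both = re.findall(r"[A-Za-z]he", string)
print(both)       # ['The', 'the']

# The same result using a flag instead of a wider character set.
ignoring_case = re.findall(r"the", string, flags=re.IGNORECASE)
print(ignoring_case)   # ['The', 'the']

# No match gives an empty list, not None.
nothing = re.findall(r"cat", string)
print(nothing)    # []
```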


### 3. re.match(pattern, string, flags=0)

`re.match()` is a function that looks for a pattern only at the beginning of a string. It returns a match object if the string starts with a match, and `None` otherwise. The parameters are the same as for `re.search()` and `re.findall()`. Like the match object returned by `re.search()`, the one returned by `re.match()` has methods such as `group()`, `start()`, `end()`, and `span()`.

In [6]:
string = "The quick brown fox jumps over the lazy dog."
pattern = r"[A-Z]he"

match = re.match(pattern, string)

In [7]:
# Using re.match() with conditional expressions
if match:
    print("Match Object:", match)
    print("Match Group:", match.group())
    print("Match Start:", match.start())
    print("Match End:", match.end())
    print("Match Span:", match.span())
else:
    print("No match found.")

Match Object: <re.Match object; span=(0, 3), match='The'>
Match Group: The
Match Start: 0
Match End: 3
Match Span: (0, 3)
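The difference between `re.match()` and `re.search()` is easiest to see with a pattern that occurs in the string but not at its start. A minimal sketch, using an illustrative pattern:

```python
import re

string = "The quick brown fox jumps over the lazy dog."

# "fox" is in the string, but not at position 0, so re.match() returns None.
at_start = re.match(r"fox", string)
print(at_start)            # None

# re.search() scans the whole string and finds it.
anywhere = re.search(r"fox", string)
print(anywhere.span())     # (16, 19)
```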


### 4. re.sub(pattern, repl, string, count=0, flags=0)

`re.sub()` is a function that replaces occurrences of a pattern in a string with a replacement string. It returns a new string with the replacements made; the original string is left unchanged. Here are the parameters used:

- `pattern`: The regular expression pattern to search for
- `repl`: The replacement string to use in place of each match
- `string`: The string to search in
- `count (optional)`: Maximum number of replacements to make; the default `0` replaces all matches
- `flags (optional)`: A set of flags that modify the behavior of the search

In [8]:
string = "The quick brown fox jumps over the lazy dog."
pattern = r"[a-z]he"
repl = "The"

# count=1 replaces only the first match
result = re.sub(pattern, repl, string, count=1)


In [9]:
result

'The quick brown fox jumps over The lazy dog.'
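The effect of `count` is clearer on a string with several matches. The sketch below uses a made-up sample string and compares the default (replace everything) with a limited number of replacements.

```python
import re

string = "one 1, two 22, three 333"

# count=0 (the default) replaces every match.
replaced_all = re.sub(r"[0-9]+", "#", string)
print(replaced_all)    # one #, two #, three #

# count=2 stops after the first two replacements.
replaced_two = re.sub(r"[0-9]+", "#", string, count=2)
print(replaced_two)    # one #, two #, three 333
```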

### 5. re.split(pattern, string, maxsplit=0, flags=0)

`re.split()` is a function that splits a string into a list of substrings wherever a regular expression pattern matches. It returns a list of the substrings. It is similar to the `split()` method of Python `str` objects, except that the separator is a pattern rather than a fixed string. Let's see how each parameter works:

- `pattern`: The regular expression pattern to split on
- `string`: The string to split
- `maxsplit (optional)`: Maximum number of splits to make; the default `0` means no limit
- `flags (optional)`: A set of flags that modify the behavior of the search

In [10]:
string = "The quick brown fox 1 <div> over the 2 lazy dog."
pattern = r"<[a-z]\w+>"

segments = re.split(pattern, string)

In [11]:
segments

['The quick brown fox 1 ', ' over the 2 lazy dog.']
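As a further sketch of what a pattern separator buys over `str.split()`, the example below (sample data invented for illustration) splits on either a comma or a semicolon with optional surrounding spaces, and then shows how `maxsplit` limits the number of splits.

```python
import re

string = "apple, banana;cherry ,  date"

# A comma or semicolon, with any number of spaces on either side.
pattern = r" *[,;] *"

parts = re.split(pattern, string)
print(parts)         # ['apple', 'banana', 'cherry', 'date']

# With maxsplit=1 only the first separator is used.
first_only = re.split(pattern, string, maxsplit=1)
print(first_only)    # ['apple', 'banana;cherry ,  date']
```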

### 6. re.compile(pattern, flags=0)

`re.compile()` is a function that compiles a regular expression pattern into a regular expression object. This object has its own methods, such as `search()`, `findall()`, and `sub()`, for matching, searching, or replacing patterns in strings.

**Is it worth compiling regular expression patterns?**

Here are some reasons why defining a regular expression pattern with `re.compile()` might be worth it:

- **Improved performance:** A compiled pattern can be reused many times without the pattern string being processed again. The gain is usually modest, because the `re` module keeps an internal cache of recently compiled patterns, but it can add up when the same pattern is used in a tight loop.
- **Easier debugging:** If a regular expression is not working as expected, a compiled pattern is easier to inspect. You can print the compiled object, and its `pattern` and `flags` attributes show exactly what is being matched.

If you get more curious, you might like to see this: [Is it worth using Python's re.compile?](https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile)

In [12]:
pattern = re.compile(r'[A-Z]he')

In [13]:
pattern.findall("The quick fox")


['The']

In [14]:
pattern.sub("the", "The quick fox")

'the quick fox'
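A compiled pattern is most useful when the same expression is applied many times. The following is a minimal sketch under that assumption; the name `word_he` and the sample lines are hypothetical. It also prints the attributes mentioned above as debugging aids.

```python
import re

# Compile once, then reuse the same object for every line.
word_he = re.compile(r"[A-Z]he", flags=re.IGNORECASE)

lines = ["The quick fox", "over the hill", "a lazy dog"]
results = [word_he.findall(line) for line in lines]
print(results)                               # [['The'], ['the'], []]

# Attributes that help when debugging a pattern.
print(word_he.pattern)                       # [A-Z]he
print(bool(word_he.flags & re.IGNORECASE))   # True
```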

## Metacharacters

Metacharacters are special characters in regular expressions that have a special meaning and are used to match specific patterns in a string. Here are some of the most commonly used metacharacters in Python's `re` module:

- `. (dot)`: Matches any single character except newline.
- `^ (caret)`: Matches the beginning of a string.
- `$ (dollar)`: Matches the end of a string, or just before a newline at the end of the string.
- `* (asterisk)`: Matches zero or more occurrences of the preceding element.
- `+ (plus)`: Matches one or more occurrences of the preceding element.
- `? (question mark)`: Matches zero or one occurrence of the preceding element.
- `{m} (curly braces)`: Matches exactly m occurrences of the preceding element.
- `{m,n} (curly braces)`: Matches between m and n occurrences of the preceding element.
- `[] (square brackets)`: Matches any one of the characters enclosed in the brackets.
- `| (pipe)`: Matches either the expression before or after the pipe.
- `\ (backslash)`: Used to escape metacharacters and match literal characters. For example, `\.` matches a period and `\\` matches a backslash.

Here are some examples of how to use metacharacters in regular expressions:

In [16]:
import re

# Matches any string that starts with 'hello'
pattern = r'^hello'
string = 'hello world'
match = re.match(pattern, string)
print(match.group())

# Matches any string that ends with 'world'
pattern = r'world$'
string = 'hello world'
match = re.search(pattern, string)
print(match.group())

# Matches any string that contains 'python'
pattern = r'python'
string = 'I love python'
match = re.search(pattern, string)
print(match.group())

# Matches zero or more 'a's followed by zero or more 'b's at the start of the string
# (re.IGNORECASE would also accept 'A' and 'B')
pattern = r'^a*b*'
string = 'aabbb'
match = re.match(pattern, string, flags=re.IGNORECASE)
print(match.group())

# Matches any string that contains 'cat' or 'dog'
pattern = r'cat|dog'
string = 'I have a cat and a dog'
match = re.search(pattern, string)
print(match.group())

# Matches any string that starts with 'http://' or 'https://'
pattern = r'^https?://'
string = 'http://www.google.com'
match = re.match(pattern, string)
print(match.group())

hello
world
python
aabbb
cat
http://


These are just a few examples of how to use metacharacters in regular expressions. By combining metacharacters and literal characters, you can create complex regular expressions that match specific patterns in strings.
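The metacharacters `.`, `[]`, `{m}`, `{m,n}`, `+`, `*`, and `\` can be sketched in the same way. The sample strings below are invented for illustration, and `re.findall()` is used so that every match is visible.

```python
import re

# . matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt cut")
print(dot)         # ['cat', 'cot', 'cut']

# [] picks one character from a set; {m} repeats the preceding element exactly m times.
years = re.findall(r"[0-9]{4}", "born 1990, moved 2015")
print(years)       # ['1990', '2015']

# {m,n} matches between m and n repetitions, taking as many as it can.
runs = re.findall(r"o{2,3}", "o oo ooo oooo")
print(runs)        # ['oo', 'ooo', 'ooo']

# + needs at least one repetition, * allows zero.
plus = re.findall(r"ab+", "a ab abb")
star = re.findall(r"ab*", "a ab abb")
print(plus)        # ['ab', 'abb']
print(star)        # ['a', 'ab', 'abb']

# \ escapes a metacharacter so that it is matched literally.
literal_dots = re.findall(r"\.", "v1.2.3")
print(literal_dots)   # ['.', '.']
```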

## Special Sequences

Special sequences in Python's `re` module are sequences of characters that represent a special type of pattern. Here are some of the most commonly used special sequences:

- `\d`: Matches any digit (0-9; for `str` patterns it also matches other Unicode decimal digits unless `re.ASCII` is set).
- `\D`: Matches any non-digit character.
- `\s`: Matches any whitespace character (space, tab, newline, etc.).
- `\S`: Matches any non-whitespace character.
- `\w`: Matches any word character: a letter, a digit, or an underscore (a-z, A-Z, 0-9, _ in ASCII mode; Unicode letters and digits as well by default).
- `\W`: Matches any non-word character.
- `\b`: Matches the empty string at the boundary between a word character and a non-word character, including the start or end of the string.
- `\B`: Matches the empty string at a position that is not a word boundary, such as between two word characters or two non-word characters.
- `\A`: Matches only at the beginning of the string. Unlike the caret (`^`), it is not affected by `re.MULTILINE`.
- `\Z`: Matches only at the end of the string. Unlike the dollar (`$`), it does not match before a trailing newline and is not affected by `re.MULTILINE`.

Here are some examples of how to use these special sequences:

In [18]:
import re

# Match any digit
pattern = re.compile(r'\d+')
string = 'The price is $123.45'
matches = pattern.findall(string)
print(matches)

# Match any non-word character
pattern = re.compile(r'\W+')
string = 'Hello, World!'
matches = pattern.findall(string)
print(matches)

# Match the boundary between a word character and a non-word character
pattern = re.compile(r'\bworld\b')
string = 'Hello, world!'
match = pattern.search(string)
print(match.group())

['123', '45']
[', ', '!']
world
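The remaining sequences can be sketched in the same style. The sample strings are invented for illustration; the first part contrasts `\A` and `\Z` with `^` and `$`, and the second part covers `\s`, `\S`, `\D`, and `\B`.

```python
import re

text = "line one\nline two\n"

# $ also matches just before a trailing newline; \Z matches only at the very end.
dollar = re.findall(r"two$", text)
end_only = re.findall(r"two\Z", text)
print(dollar)      # ['two']
print(end_only)    # []

# With re.MULTILINE, ^ matches at every line start; \A still matches only once.
caret = re.findall(r"^line", text, flags=re.MULTILINE)
start_only = re.findall(r"\Aline", text, flags=re.MULTILINE)
print(caret)       # ['line', 'line']
print(start_only)  # ['line']

# \S+ collects runs of non-whitespace; \s+ is a convenient separator for splitting.
words = re.findall(r"\S+", "a  b\tc")
fields = re.split(r"\s+", "a  b\tc")
print(words)       # ['a', 'b', 'c']
print(fields)      # ['a', 'b', 'c']

# \D+ collects runs of non-digits.
non_digits = re.findall(r"\D+", "ab12cd")
print(non_digits)  # ['ab', 'cd']

# \B: 'cat' only where it does NOT start a word.
inner = re.findall(r"\Bcat", "cat concat")
print(inner)       # ['cat']
```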


## Python RegEx Flags

Flags in Python's `re` module are optional arguments that modify the behavior of regular expression functions. These flags allow you to control how regular expressions are matched and interpreted. Here are some of the most commonly used flags:

- `re.IGNORECASE`: Ignores case when matching characters. For example, the pattern `r'cat'` matches "cat", "Cat", and "CAT" with this flag.
- `re.MULTILINE`: Changes the meaning of the `^` and `$` metacharacters so that they match at the beginning and end of each line, rather than only at the beginning and end of the entire string.
- `re.DOTALL`: Allows the dot (`.`) metacharacter to match any character, including newline characters.
- `re.VERBOSE`: Allows you to add comments and whitespace to your regular expression pattern, making it easier to read and understand.
- `re.ASCII`: Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` perform ASCII-only matching instead of Unicode matching.
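- `re.UNICODE`: Makes `\w`, `\W`, `\b`, `\B`, `\d` and `\D` perform Unicode matching instead of ASCII-only matching. In Python 3 this is already the default for `str` patterns, so the flag is redundant and kept only for backward compatibility.
- `re.LOCALE`: Makes `\w`, `\W`, `\b`, `\B` and case-insensitive matching depend on the current locale. It can be used only with `bytes` patterns.
- `re.DEBUG`: Displays debug information about the compiled expression.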
- `re.A`: This is an alias for `re.ASCII`.
- `re.I`: This is an alias for `re.IGNORECASE`.
- `re.M`: This is an alias for `re.MULTILINE`.
- `re.S`: This is an alias for `re.DOTALL`.
- `re.X`: This is an alias for `re.VERBOSE`.
- `re.U`: This is an alias for `re.UNICODE`.
- `re.L`: This is an alias for `re.LOCALE`.
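The most commonly used flags can be tried out with a short sketch like the one below. The sample text and the date pattern are made up for illustration; `"\u0667"` is the Arabic-Indic digit seven, used here only to show the difference that `re.ASCII` makes.

```python
import re

text = "Cat\ncat\nCAT"

# IGNORECASE: letters match regardless of case.
any_case = re.findall(r"cat", text, flags=re.IGNORECASE)
print(any_case)       # ['Cat', 'cat', 'CAT']

# MULTILINE: ^ and $ also match at the start and end of each line.
whole_string = re.findall(r"^cat$", text)
per_line = re.findall(r"^cat$", text, flags=re.MULTILINE)
print(whole_string)   # []
print(per_line)       # ['cat']

# Flags can be combined with the | operator (here using the short aliases).
combined = re.findall(r"^cat$", text, flags=re.I | re.M)
print(combined)       # ['Cat', 'cat', 'CAT']

# DOTALL: the dot also matches the newline between two lines.
without_dotall = re.search(r"Cat.cat", text)
with_dotall = re.search(r"Cat.cat", text, flags=re.DOTALL)
print(without_dotall is None, with_dotall is not None)   # True True

# VERBOSE: whitespace and # comments inside the pattern are ignored.
date = re.compile(
    r"""
    \d{4}   # year
    -
    \d{2}   # month
    -
    \d{2}   # day
    """,
    flags=re.VERBOSE,
)
dates = date.findall("released 2023-04-15")
print(dates)          # ['2023-04-15']

# ASCII: \d is restricted to 0-9.
digits = "7 \u0667"
unicode_count = len(re.findall(r"\d", digits))
ascii_count = len(re.findall(r"\d", digits, flags=re.ASCII))
print(unicode_count, ascii_count)   # 2 1
```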