# Regular expressions


## Meta-characters

Meta-characters are characters that carry a special meaning in a regular expression instead of matching themselves. They are the building blocks for patterns that match particular kinds of characters or sequences of characters. Commonly used meta-characters include:

- `.` (Dot): Matches any single character except a newline (in Python, the `re.DOTALL` flag makes it match newlines too).
- `*` (Asterisk): Matches zero or more occurrences of the preceding character or group.
- `+` (Plus): Matches one or more occurrences of the preceding character or group.
- `?` (Question Mark): Matches zero or one occurrence of the preceding character or group.
- `|` (Pipe): Acts like a logical OR, allowing you to specify alternatives.
- `()` (Parentheses): Creates a capturing group for capturing and extracting matched text.
- `[]` (Square Brackets): Defines a character class, allowing you to match any one of the characters specified within the brackets.
- `[^...]` (Caret Inside Square Brackets): Defines a negated character class, matching any single character that is not in the specified set.
- `\` (Backslash): Escapes the character that follows it, allowing you to match literal instances of meta-characters.
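- `^` (Caret): Matches the beginning of the string; with the `re.MULTILINE` flag it also matches the beginning of each line.
- `$` (Dollar Sign): Matches the end of the string (or just before a trailing newline); with the `re.MULTILINE` flag it also matches the end of each line.
- `{}` (Curly Braces): Specifies how many times the preceding character or group may repeat: `{n}` for exactly n, `{n,}` for n or more, `{m,n}` for between m and n.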
- `\b` (Word Boundary): Matches a position at a word boundary.
- `\d`, `\D`: `\d` matches a digit; `\D` matches any character that is not a digit.
- `\w`, `\W`: `\w` matches a word character (letter, digit, or underscore); `\W` matches any character that is not a word character.
- `\s`, `\S`: `\s` matches a whitespace character; `\S` matches any character that is not whitespace.
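
Several entries in the list above behave differently depending on flags. As a minimal sketch (the two-line string `lines` is made up for illustration and is not part of the original notebook), the following shows how `re.MULTILINE` changes `^` and `$`, and how `re.DOTALL` changes `.`:

```python
import re

# Hypothetical two-line sample, used only for this illustration
lines = "first line\nsecond line"

# Without flags, ^ anchors only at the very start of the string
start_default = re.findall(r"^\w+", lines)

# With re.MULTILINE, ^ also anchors right after each newline
start_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# With re.MULTILINE, $ also anchors right before each newline
end_multiline = re.findall(r"\w+$", lines, flags=re.MULTILINE)

# . does not cross a newline unless re.DOTALL is set
dot_default = re.search(r"first.*second", lines)
dot_dotall = re.search(r"first.*second", lines, flags=re.DOTALL)

print(start_default)           # ['first']
print(start_multiline)         # ['first', 'second']
print(end_multiline)           # ['line', 'line']
print(dot_default)             # None
print(dot_dotall is not None)  # True
```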


In [1]:
import re

In [21]:
text = "The quick brown fox jumps over the lazy dog."

In [13]:
# Match any single character followed by "ox"
pattern_dot = r".ox"
result_dot = re.search(pattern_dot, text)
print("Using . (Dot):", result_dot.group())

Using . (Dot): fox


In [14]:
# Match any word character followed by "ox"
pattern_word = r"\wox"
result_word = re.search(pattern_word, text)
print("Using \\w (Word Character):", result_word.group())

Using \w (Word Character): fox


In [15]:
# Match "fox" or "dog"
pattern_pipe = r"fox|dog"
result_pipe = re.search(pattern_pipe, text)
print("Using | (Pipe):", result_pipe.group())

Using | (Pipe): fox


In [16]:
# Match one or more "o" followed by "x"
pattern_plus = r"o+x"
result_plus = re.search(pattern_plus, text)
print("Using + (Plus):", result_plus.group())

Using + (Plus): ox


In [17]:
# Match "q" followed by zero or one "u"
pattern_question = r"qu?"
result_question = re.search(pattern_question, text)
print("Using ? (Question Mark):", result_question.group())

Using ? (Question Mark): qu


In [18]:
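# Match any digit followed by "ox"
pattern_digit = r"\dox"
result_digit = re.search(pattern_digit, text)
# re.search returns None when nothing matches (the text contains no digits),
# so check the result before calling .group()
print("Using \\d (Digit):", result_digit.group() if result_digit else "no match")

Using \d (Digit): no match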

In [19]:
# Match "ox" as a whole word (a word boundary on each side)
pattern_boundary = r"\box\b"
result_boundary = re.search(pattern_boundary, text)
# No match here: "ox" occurs only inside "fox", so there is no boundary before the "o"
print("Using \\b (Word Boundary):", result_boundary.group() if result_boundary else "no match")

Using \b (Word Boundary): no match

In [20]:
# Match any whitespace character
pattern_whitespace = r"\s"
result_whitespace = re.search(pattern_whitespace, text)
print("Using \\s (Whitespace):", result_whitespace.group())


Using \s (Whitespace):  
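
The cells above do not exercise character classes, negated classes, counted repetition, or escaping. The following is a small sketch of those four items; the `sample` string and the variable names are hypothetical and not part of the original notebook:

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "Order A17 cost 3.50 USD; order B204 cost 12.00 USD."

# [A-Z] is a character class; {2,3} allows two to three digits after it
order_ids = re.findall(r"[A-Z]\d{2,3}", sample)

# \. escapes the dot so it matches a literal period, not "any character"
prices = re.findall(r"\d+\.\d{2}", sample)

# [^0-9.] is a negated class: remove everything that is not a digit or a period
digits_only = re.sub(r"[^0-9.]", "", "3.50 USD")

print(order_ids)    # ['A17', 'B204']
print(prices)       # ['3.50', '12.00']
print(digits_only)  # 3.50
```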


## Group

Groups let you capture and extract specific parts of the matched text. You define a group by wrapping part of the pattern in parentheses `()`. When a pattern contains more than one group, `re.findall` returns one tuple per match, with one element per group.

In [22]:
import re

In [23]:
text = "John's email is john@example.com. Mary can be reached at mary123@gmail.com."

In [24]:
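# Use two groups to extract the name and the email address.
# Only the "<Name>'s email is <address>" phrasing is covered, so Mary's address is not matched.
# Note: inside [...] a "|" is a literal pipe, so the top-level domain class is [A-Za-z].
pattern = r"([A-Za-z]+)'s email is ([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"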
matches = re.findall(pattern, text)

for match in matches:
    name, email = match
    print("Name:", name)
    print("Email:", email)
    print()

Name: John
Email: john@example.com
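
The pattern above finds only John, because Mary's address is introduced with a different phrase. One possible extension, sketched here under the assumption that only these two phrasings occur, uses named groups `(?P<name>...)`, a non-capturing group `(?:...)` for the alternation, and `re.finditer`. The group names `name` and `email` are chosen for illustration:

```python
import re

text = "John's email is john@example.com. Mary can be reached at mary123@gmail.com."

# (?P<name>...) gives a group a name; (?:...) groups without capturing.
# The alternation covers the two phrasings that appear in the sample sentence.
pattern = re.compile(
    r"(?P<name>[A-Z][a-z]+)(?:'s email is| can be reached at) "
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)

# finditer yields match objects, so groups can be read by name
contacts = [(m.group("name"), m.group("email")) for m in pattern.finditer(text)]

for name, email in contacts:
    print("Name:", name)
    print("Email:", email)
```

Named groups keep the extraction readable when a pattern grows, since the code no longer depends on the position of each pair of parentheses.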



## Lookarounds

Lookarounds are constructs that require a pattern to appear (or not appear) before or after the current position, without making that text part of the match. They are zero-width assertions: they test a condition but consume no characters, so the text they inspect is not included in the result. This makes them useful for adding contextual constraints to a match.

There are four types of lookarounds:

- Positive Lookahead `(?=...)`: Asserts that the given pattern matches immediately after the current position. It doesn't consume any characters.
- Negative Lookahead `(?!...)`: Asserts that the given pattern does not match immediately after the current position.
- Positive Lookbehind `(?<=...)`: Asserts that the given pattern matches immediately before the current position.
- Negative Lookbehind `(?<!...)`: Asserts that the given pattern does not match immediately before the current position.
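
To make the zero-width behavior concrete, here is a short sketch on a made-up string (`text` and the variable names are illustrative only). The match spans show that the text inspected by a lookaround is not part of the match:

```python
import re

# Hypothetical sample text, used only for this illustration
text = "price: 42 USD, weight: 42 kg"

# Positive lookahead: digits followed by " USD"; the span covers only the digits
usd = re.search(r"\d+(?= USD)", text)
print(usd.group(), usd.span())        # 42 (7, 9)

# Positive lookbehind: digits preceded by "weight: "
weight = re.search(r"(?<=weight: )\d+", text)
print(weight.group(), weight.span())  # 42 (23, 25)

# Negative lookahead: whole numbers NOT followed by " USD".
# The \b anchors stop the engine from backtracking to a partial number such as "4".
not_usd = re.findall(r"\b\d+\b(?! USD)", text)
print(not_usd)                        # ['42']
```

Without the `\b` anchors, the negative lookahead would also accept the single digit `4` from the first number, because `4` is followed by `2` rather than by ` USD`.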

In [25]:
import re

In [26]:
text = "I like Python and Java programming languages."

In [27]:
# Positive lookahead to find "Python" with "Java" appearing anywhere later in the text
pattern_positive_lookahead = r"Python(?=.*Java)"
result_positive_lookahead = re.search(pattern_positive_lookahead, text)
print("Positive Lookahead:", result_positive_lookahead.group())

Positive Lookahead: Python


In [28]:
# Negative lookahead to find "Python" with no "C++" appearing anywhere later in the text
pattern_negative_lookahead = r"Python(?!.*C\+\+)"
result_negative_lookahead = re.search(pattern_negative_lookahead, text)
print("Negative Lookahead:", result_negative_lookahead.group())

Negative Lookahead: Python


In [29]:
# Positive lookbehind to find "Java" immediately preceded by "Python and ".
# The lookbehind must match the text directly before "Java", so "Python " alone would not match here.
pattern_positive_lookbehind = r"(?<=Python and )Java"
result_positive_lookbehind = re.search(pattern_positive_lookbehind, text)
print("Positive Lookbehind:", result_positive_lookbehind.group())

Positive Lookbehind: Java

In [30]:
# Negative lookbehind to find "Java" not immediately preceded by "Python " (here it is preceded by "and ")
pattern_negative_lookbehind = r"(?<!Python )Java"
result_negative_lookbehind = re.search(pattern_negative_lookbehind, text)
print("Negative Lookbehind:", result_negative_lookbehind.group())

Negative Lookbehind: Java
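
One practical limit worth knowing: Python's built-in `re` module accepts only fixed-width patterns inside a lookbehind. The sketch below shows the error raised for a variable-width lookbehind and one possible workaround using a capturing group; the variable names are illustrative:

```python
import re

text = "I like Python and Java programming languages."

# A fixed-width lookbehind compiles and matches
fixed = re.search(r"(?<=and )Java", text)
print(fixed.group())  # Java

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r"(?<=Python.*)Java")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)

# Workaround: match the context normally and capture only the part of interest
workaround = re.search(r"Python.*(Java)", text)
print(workaround.group(1))  # Java
```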


## Greedy and non-greedy

Greedy and non-greedy (also known as lazy or reluctant) matching are concepts in regular expressions that determine how the engine matches patterns with multiple occurrences.

### Greedy Matching:
In greedy matching, which is the default, a quantifier takes as many characters as it can and gives some back only if the rest of the pattern would otherwise fail. In the example below, `.*` therefore runs all the way to the last `</div>` in the text.

In [31]:
import re

text = "This is a <div>sample</div> text with <div>multiple</div> div tags."
pattern_greedy = r"<div>.*</div>"
result_greedy = re.search(pattern_greedy, text)
print("Greedy Matching:", result_greedy.group())


Greedy Matching: <div>sample</div> text with <div>multiple</div>


### Non-Greedy (Lazy) Matching:
In non-greedy matching, written by appending `?` to a quantifier (`*?`, `+?`, `??`, `{m,n}?`), the quantifier takes as few characters as it can and expands only as far as needed for the rest of the pattern to match. In the example below, `.*?` stops at the first `</div>`.
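
Greediness is a property of every quantifier, not only `*`. As a quick sketch on a made-up digit string (`digits` and the variable names are illustrative), compare each quantifier with its lazy form:

```python
import re

# Hypothetical sample, used only for this illustration
digits = "123456"

greedy_plus = re.search(r"\d+", digits).group()       # takes every digit
lazy_plus = re.search(r"\d+?", digits).group()        # takes the minimum: one digit
greedy_range = re.search(r"\d{2,4}", digits).group()  # takes up to four digits
lazy_range = re.search(r"\d{2,4}?", digits).group()   # takes the minimum: two digits

print(greedy_plus)   # 123456
print(lazy_plus)     # 1
print(greedy_range)  # 1234
print(lazy_range)    # 12
```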

In [33]:
import re

text = "This is a <div>sample</div> text with <div>multiple</div> div tags."
pattern_non_greedy = r"<div>.*?</div>"
result_non_greedy = re.search(pattern_non_greedy, text)
print("Non-Greedy (Lazy) Matching:", result_non_greedy.group())


Non-Greedy (Lazy) Matching: <div>sample</div>
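
`re.search` returns only the first match. To see the effect on every tag, the same patterns can be passed to `re.findall`. The sketch below also shows a negated character class as an alternative to the lazy quantifier, under the assumption that the tag content never contains `<`:

```python
import re

text = "This is a <div>sample</div> text with <div>multiple</div> div tags."

# Greedy: one match spanning from the first <div> to the last </div>
greedy_all = re.findall(r"<div>.*</div>", text)

# Lazy: one match per tag pair
lazy_all = re.findall(r"<div>.*?</div>", text)

# A capturing group returns only the text between the tags
inner_lazy = re.findall(r"<div>(.*?)</div>", text)

# Alternative: [^<]* cannot run past the next "<", so no lazy quantifier is needed
inner_negated = re.findall(r"<div>([^<]*)</div>", text)

print(greedy_all)     # ['<div>sample</div> text with <div>multiple</div>']
print(lazy_all)       # ['<div>sample</div>', '<div>multiple</div>']
print(inner_lazy)     # ['sample', 'multiple']
print(inner_negated)  # ['sample', 'multiple']
```

Regular expressions like these are fine for short, flat snippets; nested or irregular markup is usually better handled by an HTML parser.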
