# Regular Expressions (Regex)
A regular expression (regex) is a pattern that describes a set of strings. Python's re module uses such a pattern to search through text (a string) and to match, extract, or replace the parts that fit.

Regex is a common way to check whether user input conforms to a required format.

In [1]:
# Importing the module
import re

# Common regex syntax

| Syntax               | Meaning / Description                                                                                           |
| -------------------- | --------------------------------------------------------------------------------------------------------------- |
| `.`                  | Any character except newline (use DOTALL to include newline)                                                    |
| `^`                  | Start of string (or start of line in MULTILINE mode)                                                            |
| `$`                  | End of string (or end of line in MULTILINE mode)                                                                |
| `[abc]`              | Character class: match **one** of the characters `a`, `b`, or `c`                                               |
| `[a-zA-Z]`           | Character range: any letter from a–z or A–Z                                                                     |
| `[^xyz]`             | Negated class: any character *except* `x`, `y`, or `z`                                                          |
| `\d`                 | Digit (0–9)                                                                                                     |
| `\D`                 | Non-digit (opposite of `\d`)                                                                                    |
| `\w`                 | “Word” character (letters, digits, underscore)                                                                  |
| `\W`                 | Non-word character (opposite of `\w`)                                                                           |
| `\s`                 | Whitespace (spaces, tabs, newlines)                                                                             |
| `\S`                 | Non-whitespace (opposite of `\s`)                                                                               |
| `\b`                 | Word boundary (between `\w` and `\W`)                                                                           |
| `\B`                 | Non-word-boundary (opposite of `\b`)                                                                            |
| `\A`                 | Start of string (absolute anchor)                                                                               |
| `\Z`                 | End of string (absolute anchor)                                                                                 |
| `( ... )`            | Grouping parentheses (capturing)                                                                                |
| `(?: ... )`          | Non-capturing group                                                                                             |
| `(?P<name> ... )`    | Named capturing group (name the group)                                                                          |
| `\|`                 | Alternation (OR): matches the pattern on the left or the right of the bar                                       |
| `*`                  | Quantifier: 0 or more of previous element (greedy)                                                              |
| `+`                  | 1 or more (greedy)                                                                                              |
| `?`                  | 0 or 1 (greedy) – optional element                                                                              |
| `{m,n}`              | Between m and n repetitions (inclusive)                                                                         |
| `+?`, `*?`, `{m,n}?` | Non-greedy versions of above (match as little as possible)                                                      |
| `\`                  | Escape: treat next metachar as literal or give special sequence (e.g. `\.` for literal dot, `\\` for backslash) |
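
A few of the table entries are easier to see in action. The sketch below exercises alternation, a word boundary, named groups, and the greedy versus non-greedy quantifiers. The `log_line` string is invented for this illustration and is not taken from any real log.

```python
import re

# Hypothetical sample line, invented for this illustration.
log_line = "ERROR 2021-10-05 disk /dev/sda1 is <b>full</b> or <b>failing</b>"

# Alternation inside a non-capturing group, followed by a word boundary:
# matches ERROR or WARN only as a whole word at the start of the string.
level = re.match(r"(?:ERROR|WARN)\b", log_line).group()

# Named groups: read the date parts by name instead of by position.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", log_line)

# Greedy .* runs to the last </b>; non-greedy .*? stops at the first one.
greedy = re.search(r"<b>.*</b>", log_line).group()
lazy = re.search(r"<b>.*?</b>", log_line).group()

print(level)               # ERROR
print(date.group("year"))  # 2021
print(greedy)              # <b>full</b> or <b>failing</b>
print(lazy)                # <b>full</b>
```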


In [3]:
#Basic Syntax
security_log='user123 logged in on 2021-10-05.'
m1=re.search(r'\d+',security_log)
print(m1)

<re.Match object; span=(4, 7), match='123'>


In [12]:
log2='user123456789 session has timed out'
m2=re.search(r'\d+',log2)
print(m2.group())

123456789


In [6]:
print(bool(re.match(r'user',security_log)))

True


In [8]:
print(bool(re.search(r'session',log2)))

True


In [11]:
pattern = r'[\w.-]+@[\w.-]+\.\w+'
text2 = "Please contact us at support@example.com."
m2 = re.search(pattern, text2)
print(m2.group())

support@example.com


# re.compile()

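re.compile() compiles a regular expression pattern into a reusable Pattern object. Its methods, such as search() and findall(), mirror the module-level functions, so the same pattern can be applied to many strings without repeating the pattern string.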

In [16]:
year_pattern = re.compile(r'\d{4}')

text1 = "The first version was released in 1991."
text2 = "Development began around 1990, but the official launch was 1993."


match1 = year_pattern.search(text1)
matches2 = year_pattern.findall(text2)

print(f"Text 1 match: {match1.group()}" if match1 else "No match in Text 1")
print(f"Text 2 matches: {matches2}")

Text 1 match: 1991
Text 2 matches: ['1990', '1993']
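
Compiling pays off most when one pattern is applied to many strings. The following is a minimal sketch of that usage, with a flag passed at compile time. The `lines` list and the name `error_pattern` are made up for the example.

```python
import re

# Compile once (with a flag), then reuse the same Pattern object.
error_pattern = re.compile(r"\berror\b", flags=re.IGNORECASE)

# Hypothetical log lines, invented for this illustration.
lines = [
    "ERROR: disk full",
    "info: all good",
    "Warning: 1 error found",
    "terror level unchanged",  # "error" is not a whole word here
]

# Keep only the lines where "error" appears as a whole word, in any case.
flagged = [line for line in lines if error_pattern.search(line)]

print(error_pattern.pattern)  # the original pattern string
print(flagged)                # ['ERROR: disk full', 'Warning: 1 error found']
```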


# re.match()
The re.match() function in Python's re module attempts to match a regular expression pattern only at the very beginning of the string.

In [17]:
comment1='food is delicious'
comment2='Add more toppings to my pizza'

m3=re.match(r'foo',comment1)
m4=re.match(r'foo',comment2)

print(m3.group())
print(f'{m4.group()}' if m4 else 'No match in the string')

foo
No match in the string


In [18]:

print(re.match(r'\d+', '123abc'))
print(re.match(r'\d+', 'abc123'))


<re.Match object; span=(0, 3), match='123'>
None
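
Two details of re.match() are easy to miss: it only ever tries position 0, even when re.MULTILINE is set, and it does not require the pattern to reach the end of the string. A short sketch, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# re.match() only tries the very start of the string, even with re.MULTILINE.
m_match = re.match(r"second", text, flags=re.MULTILINE)

# re.search() with ^ and re.MULTILINE can match at the start of any line.
m_search = re.search(r"^second", text, flags=re.MULTILINE)

# re.match() is satisfied by a matching prefix; the rest of the string is ignored.
m_prefix = re.match(r"\d+", "123abc")

print(m_match)           # None
print(m_search.span())   # (11, 17)
print(m_prefix.group())  # 123
```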


# re.search()

The re.search() function in Python's re module scans an entire string to find the first location where the regular expression pattern produces a match.

In [19]:
import re

text1 = "food is good"
text2 = "good food is here"

# Match at the start
match_result = re.match(r'foo', text1)
print(f"Match 1 (start): {match_result.group()}" if match_result else "Match 1: None")

# No match at the start
match_result = re.match(r'foo', text2)
print(f"Match 2 (start): {match_result}" if match_result else "Match 2: None")

# Search finds it anywhere
search_result = re.search(r'foo', text2)
print(f"Search 2 (anywhere): {search_result.group()}" if search_result else "Search 2: None")

Match 1 (start): foo
Match 2: None
Search 2 (anywhere): foo


In [21]:
text = "Call us for support at 555-123-4567 or email us."

phone_pattern = r'\d{3}-\d{3}-\d{4}'

m5=re.search(phone_pattern,text)
print(m5.group())

555-123-4567


In [22]:
contact='To get in touch with Hamza dial 0712345678'

s_pattern=r'\d{10}'
m6=re.search(s_pattern,contact)
print(m6.group())

0712345678
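
re.search() returns None when nothing matches, so calling .group() on the result without checking raises AttributeError. One way to guard against that is a small wrapper like the one sketched here. The helper name `first_match` is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first matched substring, or `default` if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else default

phone_pattern = r"\d{3}-\d{3}-\d{4}"

found = first_match(phone_pattern, "Call 555-123-4567 today")
missing = first_match(phone_pattern, "No number here", default="n/a")

print(found)    # 555-123-4567
print(missing)  # n/a
```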


# re.fullmatch()
The re.fullmatch() function in Python's re module is used to determine if a regular expression pattern matches the entirety of a given string.

In [24]:

# Pattern: r'[A-Z]{3}\d{4}'
# [A-Z]{3} = exactly 3 uppercase letters
# \d{4}    = exactly 4 digits
id_pattern = r'[A-Z]{3}\d{4}'

valid_id = "XYZ9876"
too_long = "XYZ9876A"
# Right length, wrong shape: 2 letters + 5 digits instead of 3 letters + 4 digits
wrong_shape = "AB12345"


match1 = re.fullmatch(id_pattern, valid_id)
print(f"'{valid_id}': {'Match' if match1 else 'No Match'}")

match2 = re.fullmatch(id_pattern, too_long)
print(f"'{too_long}': {'Match' if match2 else 'No Match'}")

match3 = re.fullmatch(id_pattern, wrong_shape)
print(f"'{wrong_shape}': {'Match' if match3 else 'No Match'}")

'XYZ9876': Match
'XYZ9876A': No Match
'AB12345': No Match


In [26]:
id1=14503224
id2=1150198

d_pattern=r'\d{8}'

print(re.fullmatch(d_pattern,str(id1)))
print(re.fullmatch(d_pattern,str(id2)))

<re.Match object; span=(0, 8), match='14503224'>
None
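
Because re.fullmatch() must consume the whole string, it is a natural fit for validating input. The sketch below wraps it in a hypothetical helper, `is_valid_id`, and contrasts it with re.match(), which accepts a matching prefix, and with `$`, which also matches just before a trailing newline.

```python
import re

ID_PATTERN = re.compile(r"[A-Z]{3}\d{4}")

def is_valid_id(value):
    """True only if the entire string is 3 uppercase letters plus 4 digits."""
    return ID_PATTERN.fullmatch(value) is not None

# Hypothetical inputs, invented for this illustration.
candidates = ["XYZ9876", "XYZ9876A", "xyz9876", " XYZ9876"]
results = {c: is_valid_id(c) for c in candidates}
print(results)

# match() only anchors the start, so extra trailing characters slip through.
prefix_only = re.match(r"[A-Z]{3}\d{4}", "XYZ9876A") is not None

# $ also matches just before a final newline; fullmatch() does not allow it.
dollar_newline = re.match(r"[A-Z]{3}\d{4}$", "XYZ9876\n") is not None
full_newline = re.fullmatch(r"[A-Z]{3}\d{4}", "XYZ9876\n") is not None

print(prefix_only, dollar_newline, full_newline)  # True True False
```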


# re.findall()
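The re.findall() function in Python's re module finds all non-overlapping matches of a regular expression pattern within a string and returns them as a list. If the pattern contains one capturing group, the list holds that group's text for each match; with several groups, it holds one tuple per match.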

In [27]:
import re
text = "12 drummers drumming, 11 pipers piping, 10 lords a-leaping"
print(re.findall(r'\d+', text))


['12', '11', '10']


In [28]:
import re

text = "Project start date was 01/15/24. Final review is set for 11/20/25."

# Pattern to match two digits, slash, two digits, slash, two digits
date_pattern = r'\d{2}/\d{2}/\d{2}'

# Use re.findall()
dates = re.findall(date_pattern, text)

print(f"Extracted Dates: {dates}")

Extracted Dates: ['01/15/24', '11/20/25']


In [29]:
import re

name_list = "Jane Doe, John Smith, Alice Brown"

# Pattern with two capturing groups: (First Name) (Last Name)
name_pattern = r'(\w+)\s(\w+)'

# Use re.findall()
names_data = re.findall(name_pattern, name_list)

print(f"Extracted Data (List of Tuples): {names_data}")

for first, last in names_data:
    print(f"First: {first}, Last: {last}")


Extracted Data (List of Tuples): [('Jane', 'Doe'), ('John', 'Smith'), ('Alice', 'Brown')]
First: Jane, Last: Doe
First: John, Last: Smith
First: Alice, Last: Brown
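
What re.findall() returns depends on how many capturing groups the pattern has. The following sketch runs four variants of one pattern over the same made-up string to show the difference.

```python
import re

# Hypothetical input, invented for this illustration.
text = "alice=30, bob=25"

# No groups: each item is the whole match.
no_groups = re.findall(r"\w+=\d+", text)

# One capturing group: each item is just that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Two capturing groups: each item is a tuple of the groups.
two_groups = re.findall(r"(\w+)=(\d+)", text)

# A non-capturing group does not change the result shape.
non_capturing = re.findall(r"(?:\w+)=\d+", text)

print(no_groups)      # ['alice=30', 'bob=25']
print(one_group)      # ['30', '25']
print(two_groups)     # [('alice', '30'), ('bob', '25')]
print(non_capturing)  # ['alice=30', 'bob=25']
```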


# re.finditer()

The re.finditer() function in Python's re module is similar to re.findall(), but instead of returning a list of strings or tuples, it returns an iterator yielding Match Objects for every non-overlapping match found in the string.

In [33]:
text = "Price A: $10.99, Price B: $5.00, Price C: $100.45"

# Pattern to find a dollar sign followed by one or more digits,
# then a decimal point, then two digits.
price_pattern = r'\$\d+\.\d{2}'

# 1. Use re.finditer() to get an iterator of match objects
match_iterator = re.finditer(price_pattern, text)

print("Scanning text for prices...\n")

# 2. Iterate through the match objects
for match in match_iterator:
    # Use Match Object methods for details
    matched_text = match.group()
    start, end = match.span()

    print(f"✅ Found: {matched_text}")
    print(f"   Span: ({start}, {end})")
    print("-" * 20)

Scanning text for prices...

✅ Found: $10.99
   Span: (9, 15)
--------------------
✅ Found: $5.00
   Span: (26, 31)
--------------------
✅ Found: $100.45
   Span: (42, 49)
--------------------


In [34]:
kes_price="Price A: KES109 ,PriceB: KES200 ,PriceC: KES300"

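# \w+ matches letters, digits, and underscores, so it picks up the 'KES' prefix;
# \d+ then needs at least one digit. Words without digits (e.g. 'PriceB') are skipped.
p_pattern=r'\w+\d+'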

m_iterator=re.finditer(p_pattern,kes_price)

for match in m_iterator:
    matched_text=match.group()
    start,end=match.span()

    print(f'Price Found: {matched_text}')
    print(f'Span: ({start},{end})')

Price Found: KES109
Span: (9,15)
Price Found: KES200
Span: (25,31)
Price Found: KES300
Span: (41,47)
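
Since re.finditer() yields Match objects, named groups can be read from each match directly. The sketch below reuses the price string from the cell above with a stricter pattern; the group names `currency` and `amount` are choices made for this example.

```python
import re

kes_price = "Price A: KES109 ,PriceB: KES200 ,PriceC: KES300"

# Three uppercase letters for the currency code, then the digits of the amount.
price_re = re.compile(r"(?P<currency>[A-Z]{3})(?P<amount>\d+)")

# Build one record per match, converting the amount to an int.
prices = [
    {
        "currency": m.group("currency"),
        "amount": int(m.group("amount")),
        "start": m.start(),
    }
    for m in price_re.finditer(kes_price)
]

total = sum(p["amount"] for p in prices)

for p in prices:
    print(p)
print(f"Total: {total}")  # Total: 609
```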


# re.sub()

The re.sub() function in Python's re module is used for substitution—it finds all occurrences of a pattern in a string and replaces them with a specified replacement string or the result of a replacement function.

In [35]:
date = "2023-03-15"
new_date = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)
print(new_date)

15/03/2023


In [37]:
text = "Baked Beans and Spam"
text2 = re.sub(r'\sand\s', ' & ', text)
print(text2)

Baked Beans & Spam


In [40]:
email="abubakarhamzamaina  .com"
new_email=re.sub(r'\s\s','@gmail',email)
print(new_email)

abubakarhamzamaina@gmail.com
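
The replacement passed to re.sub() can also be a function: it receives each Match object and returns the replacement text. A minimal sketch follows, along with the `\g<name>` form for referring to named groups in a replacement string. The function name `double_price` and the sample text are invented for the example.

```python
import re

def double_price(match):
    """Replacement function: called once per match, returns the new text."""
    amount = float(match.group(1))
    return f"${amount * 2:.2f}"

text = "Price A: $10.99, Price B: $5.00"
doubled = re.sub(r"\$(\d+\.\d{2})", double_price, text)
print(doubled)  # Price A: $21.98, Price B: $10.00

# Named groups can be referenced in the replacement string with \g<name>.
date = "2023-03-15"
new_date = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    date,
)
print(new_date)  # 15/03/2023
```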


# re.subn()

The re.subn() function in Python's re module performs the same substitution operation as re.sub(), but instead of returning only the modified string, it returns a tuple containing the new string and the number of substitutions made.

In [41]:
import re
text = "a b c d e"
new_text, num = re.subn(r'\s', '-', text)
print(new_text, num)

a-b-c-d-e 4


In [43]:
text = "Hello, world! How are you today? I'm fine."

# Pattern to match punctuation
punctuation_pattern = r'[.!,?;]'
replacement = " "

# Use re.subn()
result_tuple = re.subn(punctuation_pattern, replacement, text)

# Unpack the results
new_text, substitution_count = result_tuple

print(f"Original Text: '{text}'")
print(f"Substituted Text: '{new_text}'")
print(f"Total Substitutions Made: {substitution_count}")

Original Text: 'Hello, world! How are you today? I'm fine.'
Substituted Text: 'Hello  world  How are you today  I'm fine '
Total Substitutions Made: 4
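
The returned count is a convenient way to tell whether anything was replaced at all. Both re.sub() and re.subn() also accept a `count` argument that caps the number of replacements. A short sketch of both ideas:

```python
import re

text = "a b c d e"

# count=2 replaces only the first two matches.
limited, n_limited = re.subn(r"\s", "-", text, count=2)

# No digits in the text, so nothing is replaced and the count is 0.
unchanged, n_none = re.subn(r"\d", "#", text)

print(limited, n_limited)  # a-b-c d e 2
print(unchanged, n_none)   # a b c d e 0

if n_none == 0:
    print("Nothing to replace")
```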


# Regex Flags

In [58]:
text = "First line\nSecond line"
print(re.search(r"^Second", text))
print(re.search(r"^Second", text, flags=re.MULTILINE))


None
<re.Match object; span=(11, 17), match='Second'>


In [57]:
text = "Baked Beans And Spam"
text2 = re.sub(r'\sAND\s', ' & ', text, flags=re.IGNORECASE)
print(text2)

Baked Beans & Spam
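
Flags change how a pattern is interpreted, and several can be combined with `|`. The sketch below adds two flags not shown above: re.DOTALL, which lets `.` match a newline, and re.VERBOSE, which allows whitespace and comments inside the pattern. The sample strings are made up for the example.

```python
import re

text = "start\nmiddle\nEND"

# Without DOTALL, . cannot cross a newline, so there is no match.
no_dotall = re.search(r"start.*end", text, flags=re.IGNORECASE)

# Combine flags with | : case-insensitive and . matches newlines.
with_dotall = re.search(r"start.*end", text, flags=re.IGNORECASE | re.DOTALL)

print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'start\nmiddle\nEND'

# VERBOSE ignores whitespace in the pattern and allows # comments.
phone_re = re.compile(
    r"""
    \d{3}   # area code
    -
    \d{3}   # prefix
    -
    \d{4}   # line number
    """,
    flags=re.VERBOSE,
)
phone = phone_re.search("Call us at 555-123-4567").group()
print(phone)  # 555-123-4567
```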
