# Regular Expressions
- Meta Characters & Special Sequence Characters: [Notes](https://marslearnings.github.io/marslearnings/blog/2025/04/13/introduction-to-regular-expressions/)
- YouTube Video: [Video](https://youtu.be/qQnFIzwXwp0?si=QyXblqL-t50yRbD4)

In [1]:
import re
# re: Python's built-in regular expression module

### Match
- `re.match()`
Matches the pattern only at the **beginning** of the string and returns `None` if it is not found there

In [2]:
string = "Hello world"
results = re.match("Hello", string) 
# re.match looks for the pattern only at the start of the string

print(results)

<re.Match object; span=(0, 5), match='Hello'>


In [3]:
string = "Hello world"
results = re.match("world", string) 
# "world" is not at the start of the string, so re.match returns None

print(results)

None
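Because `re.match()` returns `None` when the pattern is not at the start, calling `.group()` on the result without a check raises an `AttributeError`. A minimal sketch of a guard, where `first_match_text` is a hypothetical helper name used only for illustration:

```python
import re

def first_match_text(pattern, string):
    """Return the matched text, or None when the string does not start with the pattern."""
    result = re.match(pattern, string)
    # re.match returns None on failure, so check before calling .group()
    return result.group() if result is not None else None

print(first_match_text("Hello", "Hello world"))  # Hello
print(first_match_text("world", "Hello world"))  # None
```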


In [4]:
string = "Hello world"
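pattern = r'\w+'
# the r prefix makes this a raw string, so the backslash reaches the regex engine unchanged
# \w+ matches one or more word characters (letters, digits, underscore)
results = re.match(pattern, string)
# matches the first run of word characters at the start of the string
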

print(results)

<re.Match object; span=(0, 5), match='Hello'>


In [5]:
print(results.group())

Hello


In [6]:
results.span()

(0, 5)

In [7]:
results.start(), results.end()

(0, 5)
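The values returned by `span()`, `start()`, and `end()` are ordinary string indices, so slicing the original string with them should give back the same text as `group()`. A small sketch to check this:

```python
import re

string = "Hello world"
results = re.match(r"\w+", string)

start, end = results.span()
# Slicing the original string with the span gives the same text as group()
matched_by_slice = string[start:end]
print(start, end, matched_by_slice)            # 0 5 Hello
print(matched_by_slice == results.group())     # True
```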

- Practise Example

In [8]:
text_list = ['Python is fun',
             'I love Python',
             'Python is general purpose language',
             'python 3.12 is my current python version'
            ]

# Extract the elements that start with "Python" (uppercase P)
# case-sensitive method

In [9]:
pattern = 'Python'
filter_text_list = []
for text in text_list: 
    result = re.match(pattern, text)
    if result:
        filter_text_list.append(text)

print(filter_text_list)

['Python is fun', 'Python is general purpose language']


In [10]:
# Extract the elements that start with "Python" (P may be uppercase or lowercase)
# case-insensitive method
pattern = 'Python'
filter_text_list = []
for text in text_list: 
    result = re.match(pattern, text, flags=re.IGNORECASE)
    if result:
        filter_text_list.append(text)

print(filter_text_list)

['Python is fun', 'Python is general purpose language', 'python 3.12 is my current python version']
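The same filter can be written more compactly as a list comprehension. This is only an alternative sketch of the loop above; the name `filtered` is introduced here for illustration:

```python
import re

text_list = ['Python is fun',
             'I love Python',
             'Python is general purpose language',
             'python 3.12 is my current python version'
            ]

pattern = 'Python'
# Keep a string only when re.match finds the pattern at its start (ignoring case)
filtered = [text for text in text_list
            if re.match(pattern, text, flags=re.IGNORECASE)]

print(filtered)
```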


### Search
- `re.search()`
  
Searches for the first occurrence of the pattern anywhere in the string


In [11]:
text = 'Hello world'
pattern = 'world'

result = re.search(pattern, text)

print(result)

<re.Match object; span=(6, 11), match='world'>
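To make the difference from `re.match()` concrete, the sketch below runs both functions on the same string. It also shows that, with default flags, anchoring the pattern with `^` makes `re.search()` behave like `re.match()` here. The variable names are illustrative only:

```python
import re

text = 'Hello world'

at_start = re.match('world', text)      # only tries the start of the string
anywhere = re.search('world', text)     # scans the whole string
anchored = re.search(r'^world', text)   # ^ pins the search to the start

print(at_start)          # None
print(anywhere.span())   # (6, 11)
print(anchored)          # None
```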


In [12]:
text = 'I having 42 apples and 31 bannas'
# extract the first run of digits from the text
result = re.search(r'\d+', text)
# re.search returns only the first occurrence

print(result)

<re.Match object; span=(9, 11), match='42'>


In [13]:
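# continue searching in the part of the text after the first match
text_new = text[result.end():]
print(text_new)
result_new = re.search(r'\d+', text_new)

# note: the span below is relative to text_new, not the original text
print(result_new)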

 apples and 31 bannas
<re.Match object; span=(12, 14), match='31'>
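Slicing works, but each new span is measured from the start of the slice. As an alternative sketch, `re.finditer()` walks through every match and reports positions in the original string:

```python
import re

text = 'I having 42 apples and 31 bannas'

# finditer yields one match object per occurrence, left to right
matches = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

for value, span in matches:
    print(value, span)
# 42 (9, 11)
# 31 (23, 25)
```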


### Find all
- `re.findall()`

Returns all non-overlapping matches of the pattern in the string as a list

In [14]:
text = 'I having 42 apples and 31 bannas'
pattern = r'\d+'

results = re.findall(pattern, text)

print(results)

['42', '31']
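Note that `re.findall()` returns the matches as strings, not numbers. A short sketch of converting them before doing arithmetic (the names `numbers` and `total` are illustrative):

```python
import re

text = 'I having 42 apples and 31 bannas'

# findall gives ['42', '31']; convert each string to int before adding
numbers = [int(n) for n in re.findall(r'\d+', text)]
total = sum(numbers)

print(numbers, total)  # [42, 31] 73
```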


### Practise Examples

In [18]:
# Example 1: Extract words of different lengths
text = "The quick brown fox jumps over the lazy dog"

# Words with exactly 3 letters
pattern = r"\b\w{3}\b"
three_letter_word = re.findall(pattern, text)
print("The three letter words are =", three_letter_word)

# Words with 4 or more letters
pattern = r"\b\w{4,}\b"
four_or_more_words = re.findall(pattern, text)
print("The four or more letter words are =", four_or_more_words)


The three letter words are = ['The', 'fox', 'the', 'dog']
The four or more letter words are = ['quick', 'brown', 'jumps', 'over', 'lazy']


In [19]:
# Example 2: Extract all capitalized words
text = "Python was created by Guido van Rossum. Java was created by James Gosling at Sun Microsystems."

pattern = r"\b[A-Z]\w+"

capitalized_words = re.findall(pattern, text)

print("Capitalized words:", capitalized_words)

Capitalized words: ['Python', 'Guido', 'Rossum', 'Java', 'James', 'Gosling', 'Sun', 'Microsystems']
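One detail of the pattern `\b[A-Z]\w+`: the `\w+` part needs at least one more character after the capital letter, so single-letter words such as "I" are skipped. Whether that is desired depends on the task. The sketch below compares it with `\w*` on a made-up sentence:

```python
import re

sample = "I think Python is A great language"

# \w+ requires at least two characters in total
two_or_more = re.findall(r"\b[A-Z]\w+", sample)
# \w* also accepts a capital letter on its own
one_or_more = re.findall(r"\b[A-Z]\w*", sample)

print(two_or_more)  # ['Python']
print(one_or_more)  # ['I', 'Python', 'A']
```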


### re.compile()
`re.compile(pattern, flags=0)`

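- Compiles a regex pattern into a reusable pattern object
- Convenient, and can be more efficient, when the same pattern is used many times (the `re` module also caches recently used patterns, so the speed gain is often small)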

In [20]:
text = "Python was first developed in Feb 1991"
# extract the four-digit year from the text
pattern = re.compile(r"\d{4}")  # compile the pattern into a reusable pattern object

results = pattern.search(text)
print(results.group())
print(results)

1991
<re.Match object; span=(34, 38), match='1991'>
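The benefit of a compiled pattern shows up when it is applied to several strings. A minimal sketch, assuming a small list of made-up sentences and the illustrative names `year_pattern` and `years`:

```python
import re

year_pattern = re.compile(r"\d{4}")

sentences = [
    "Python was first developed in Feb 1991",
    "Python 3.0 was released in 2008",
    "No year here",
]

years = []
for sentence in sentences:
    found = year_pattern.search(sentence)
    # search returns None when there is no four-digit run
    years.append(found.group() if found else None)

print(years)  # ['1991', '2008', None]
```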


#### END
---