Regular Expressions

Regular expressions (regex) are patterns used to search, match, and extract text. They help validate input, find specific formats, and work with complex string patterns.

In [1]:
text = "the agents phone number is 234-234-2355. Call. Call soon! "


In [2]:
'phone' in text

True

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern,text)

<re.Match object; span=(11, 16), match='phone'>

In [6]:
p = 'not in text'

In [7]:
re.search(p,text)  # no match: re.search returns None, so the cell shows no output
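
Because a failed search yields `None`, calling `.span()` or `.group()` directly on the result would raise an `AttributeError`. A minimal sketch of guarding against that case follows; the helper name `find_span` is hypothetical and introduced only for illustration.

```python
import re

text = "the agents phone number is 234-234-2355. Call. Call soon! "

def find_span(pattern, string):
    """Return the (start, end) span of the first match, or None if nothing matches."""
    match = re.search(pattern, string)
    if match is None:          # re.search found nothing
        return None
    return match.span()

found = find_span('phone', text)
missing = find_span('not in text', text)
print(found)    # (11, 16)
print(missing)  # None
```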

In [8]:
pattern = 'phone'

In [9]:
match = re.search(pattern,text)


In [10]:
match.span()

(11, 16)

In [11]:
match.start()

11

In [12]:
match.end()

16

In [13]:
test = 'my phone 1 my phone 2'

In [14]:
match = re.search('phone',test)

In [15]:
print(match)

<re.Match object; span=(3, 8), match='phone'>


In [16]:
matches = re.findall('phone',test)

In [17]:
matches

['phone', 'phone']

In [18]:
for match in re.finditer('phone',test):
    print(match.span())

(3, 8)
(14, 19)


In [None]:
for match in re.finditer('phone',test):
    print(match.group())

phone
phone


Regular Expression Character Identifiers (re)

Character identifiers in regex define what kind of characters a pattern can match.

Common ones:

\d → any digit (0–9)

\D → any non-digit

\w → any word character (letters, digits, underscore)

\W → any non-word character

\s → any whitespace (space, tab, newline)

\S → any non-whitespace
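
The identifiers listed above can be tried side by side on one string. This is only an illustrative sketch; the sample string and variable names are made up for the example.

```python
import re

sample = "Room 42, floor_3!"

digits = re.findall(r'\d', sample)        # single digits
non_digits = re.findall(r'\D+', sample)   # runs of non-digit characters
words = re.findall(r'\w+', sample)        # runs of letters, digits, underscore
non_words = re.findall(r'\W', sample)     # single non-word characters
spaces = re.findall(r'\s', sample)        # single whitespace characters
chunks = re.findall(r'\S+', sample)       # runs of non-whitespace

print(digits)      # ['4', '2', '3']
print(non_digits)  # ['Room ', ', floor_', '!']
print(words)       # ['Room', '42', 'floor_3']
print(non_words)   # [' ', ',', ' ', '!']
print(spaces)      # [' ', ' ']
print(chunks)      # ['Room', '42,', 'floor_3!']
```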

In [20]:
text = "the agents phone number is 234-234-2355. Call. Call soon! "

In [21]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)   

In [22]:
phone

<re.Match object; span=(27, 39), match='234-234-2355'>

In [23]:
phone.group()

'234-234-2355'

Quantifiers specify how many times the preceding character or group may repeat, which avoids writing `\d` once per digit:

| Quantifier | Meaning           | Example   | What it matches        |
| ---------- | ----------------- | --------- | ---------------------- |
| `*`        | 0 or more         | `a*`      | `""`, `a`, `aa`, `aaa` |
| `+`        | 1 or more         | `a+`      | `a`, `aa`, `aaa`       |
| `?`        | 0 or 1 (optional) | `colou?r` | `color`, `colour`      |
| `{n}`      | exactly n         | `\d{3}`   | `123`                  |
| `{n,}`     | n or more         | `\d{2,}`  | `12`, `1234`, …        |
| `{n,m}`    | between n and m   | `\d{2,4}` | `12`, `123`, `1234`    |


In [24]:
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone = re.search(phone_pattern,text)   
phone.group()

'234-234-2355'
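
The rows of the quantifier table can be checked with `re.fullmatch`, which succeeds only when the entire string matches the pattern. A small sketch, where the helper `matches_fully` and the test strings are assumptions made for illustration:

```python
import re

def matches_fully(pattern, candidate):
    """True when the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

checks = [
    (r'a*', ''), (r'a*', 'aaa'),                 # 0 or more
    (r'a+', ''), (r'a+', 'aa'),                  # 1 or more
    (r'colou?r', 'color'), (r'colou?r', 'colour'), (r'colou?r', 'colouur'),
    (r'\d{3}', '123'), (r'\d{3}', '1234'),       # exactly 3
    (r'\d{2,}', '1'), (r'\d{2,}', '1234'),       # 2 or more
    (r'\d{2,4}', '123'), (r'\d{2,4}', '12345'),  # between 2 and 4
]

results = {pair: matches_fully(*pair) for pair in checks}
for (pattern, candidate), ok in results.items():
    print(f'{pattern!r:12} {candidate!r:10} -> {ok}')
```

Only `a+` on the empty string, `colouur`, `1234` against `\d{3}`, `1` against `\d{2,}`, and `12345` against `\d{2,4}` come back `False`.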

In [25]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
phone = phone_pattern.search(text)  # a compiled pattern exposes search() directly
phone.group()

'234-234-2355'

In [26]:
phone.group(1)

'234'

In [27]:
phone.group(4)
# group 4 does not exist: the pattern defines only three groups

IndexError: no such group
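
To avoid guessing group numbers, the match and the compiled pattern can both report what was captured. A brief sketch using the same phone pattern; the variable names `whole`, `parts`, and `area_and_line` are chosen for the example.

```python
import re

text = "the agents phone number is 234-234-2355. Call. Call soon! "
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

phone = phone_pattern.search(text)
if phone is not None:
    whole = phone.group(0)             # group 0 is the entire match
    parts = phone.groups()             # all captured groups as a tuple
    area_and_line = phone.group(1, 3)  # several groups at once
    print(whole)                 # 234-234-2355
    print(parts)                 # ('234', '234', '2355')
    print(area_and_line)         # ('234', '2355')
    print(phone_pattern.groups)  # 3, the number of capturing groups
```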

In [28]:
re.search(r'agent','the agent has a phone')

<re.Match object; span=(4, 9), match='agent'>

In [30]:
re.search(r'officer','the agent has a phone')  # no match; to accept either word, use the | (or) operator

In [31]:
re.search(r'agent|officer','the agent has a phone')

<re.Match object; span=(4, 9), match='agent'>

In [None]:
re.findall(r'.at', 'the cat in the hat sat there')
# '.' is a wildcard that matches any single character except a newline

['cat', 'hat', 'sat']

In [35]:
re.findall(r'^\d',"1 is a number")
# ^ anchors the pattern to the start of the string

['1']

In [36]:
re.findall(r'\d$',"is a number 2")
# $ anchors the pattern to the end of the string

['2']
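
The anchors are easiest to see next to cases where they prevent a match. The sketch below also uses the `re.MULTILINE` flag, which makes `^` match at the start of every line; the extra sample strings are made up for illustration.

```python
import re

starts_with_digit = re.findall(r'^\d', "1 is a number")
digit_not_at_start = re.findall(r'^\d', "number 1 is here")
ends_with_digit = re.findall(r'\d$', "is a number 2")
digit_not_at_end = re.findall(r'\d$', "2 is a number")

# With re.MULTILINE, ^ matches at the beginning of each line
per_line = re.findall(r'^\d', "1 a\n2 b", flags=re.MULTILINE)

print(starts_with_digit)   # ['1']
print(digit_not_at_start)  # []
print(ends_with_digit)     # ['2']
print(digit_not_at_end)    # []
print(per_line)            # ['1', '2']
```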

Exclusion with [^...]

In [37]:
phrase = 'there are 3 numbers 34 inside 5 this sentence'

In [40]:
pattern = r'[^\d]+'

In [41]:
re.findall(pattern,phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [46]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [48]:
clean = re.findall(r'[^!.? ]+',test_phrase)

In [50]:
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
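
As an alternative to `findall` followed by `join`, `re.sub` can delete the unwanted characters in one step. This is a sketch of a different technique, not what the cells above do; it produces the same result for this sentence.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.', or '?' with the empty string
cleaned = re.sub(r'[!.?]', '', test_phrase)
print(cleaned)
# This is a string But it has punctuation How can we remove it
```

Unlike the `findall` approach, `re.sub` leaves the original spacing untouched, which matters if the text contains runs of several spaces.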

Inclusion with [...]

In [51]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [55]:
pattern = r'[\w]+-[\w]+'

In [56]:
re.findall(pattern,text)

['hyphen-words', 'long-ish']
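
Square brackets can also list explicit characters or ranges rather than a shorthand such as `\w`. A short sketch; the sample strings are invented for the example.

```python
import re

codes = 'Order A-12 shipped with part b-7 and code zz-9'

# One letter of either case, a hyphen, then one or more digits
any_case = re.findall(r'[A-Za-z]-\d+', codes)
# Restrict the letter to uppercase only
upper_only = re.findall(r'[A-Z]-\d+', codes)
# An explicit set of characters: the vowels in 'regex'
vowels = re.findall(r'[aeiou]', 'regex')

print(any_case)    # ['A-12', 'b-7', 'z-9']
print(upper_only)  # ['A-12']
print(vowels)      # ['e', 'e']
```

Note that `zz-9` yields `z-9`: the class matches exactly one letter, so the search picks up at the second `z`.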

In [60]:
text = 'Hello, would you like some catfish?'
texttwo = 'Hello, would you like to take a catnap?'
textthree = 'Hello, have you seen this caterpillar?'

In [58]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [61]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [63]:
re.search(r'cat(fish|nap|erpillar)',textthree)

<re.Match object; span=(26, 37), match='caterpillar'>
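
One caveat with parenthesized alternatives: when a pattern contains a capturing group, `re.findall` returns the group's text rather than the whole match. A sketch of the difference, using a non-capturing group `(?:...)` to get full words back; the sample sentence is made up.

```python
import re

sentence = 'catfish, catnap and catclaw'

# Capturing group: findall returns only what the group captured
suffixes = re.findall(r'cat(fish|nap|claw)', sentence)
# Non-capturing group: findall returns the entire match
full_words = re.findall(r'cat(?:fish|nap|claw)', sentence)

first = re.search(r'cat(fish|nap|claw)', sentence)
print(suffixes)        # ['fish', 'nap', 'claw']
print(full_words)      # ['catfish', 'catnap', 'catclaw']
print(first.group())   # catfish
print(first.group(1))  # fish
```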