<a href="https://colab.research.google.com/github/ngzhiwei517/NLP/blob/main/Lecture_02_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regular Expressions
A regular expression (regex) is a compact way to describe patterns in text. You use such a pattern to search, extract, or replace text.

In [13]:
import re
match = re.search(r"\d+", "Price is 100 dollars")
print(match.group())  # Output: 100

match = re.match(r"\d+", "200 100 dollars")
print(match.group())  # Output: 200 (match() only looks at the start of the string)

#📃 findall()
#Finds all matching patterns and returns them as a list

matches = re.findall(r"\d+", "Price is 100 and discount is 20")
print(matches)  # Output: ['100', '20']

#🔁 finditer()
#Returns an iterator of match objects (more detailed info).
for match in re.finditer(r"\d+", "100 apples and 20 bananas"):
    print(match.group())  # Output: 100, then 20

#✏️ sub()
#Substitute (replace) text that matches a pattern.
result = re.sub(r"\d+", "XX", "Price is 100 dollars")
print(result)  # Output: Price is XX dollars
#📌 Replace numbers with “XX” 🔁

100
200
['100', '20']
100
20
Price is XX dollars






*   🎯 `match()`: checks only the start of the string.
*   🔍 `search()`: looks for the first match anywhere in the string.

Example with `text = "I love Python!"`:

*   search: `result_search = re.search("Python", text)` looks for "Python" anywhere in the text. ✅ Match object found
*   match: `result_match = re.match("Python", text)` looks for "Python" only at the beginning. ❌ `None`
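The difference between the two functions can be checked directly. The following is a minimal sketch that reuses the `text` string from above; the printed spans are only there to show where each function looks.

```python
import re

text = "I love Python!"

# search() scans the whole string, so it finds "Python" at index 7
result_search = re.search("Python", text)
print(result_search.span())   # (7, 13)

# match() only tries at index 0, where the string starts with "I"
result_match = re.match("Python", text)
print(result_match)           # None

# match() succeeds when the pattern sits at the very beginning
result_start = re.match("I love", text)
print(result_start.group())   # I love
```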





The full list of functions, including their input parameters, can be found here:
- https://docs.python.org/3.4/library/re.html#re.search


The function `show_matches()`

It's a helper function that makes it easier to see which parts of a string match a pattern (using Python's `re` module).

In [14]:
def show_matches(pattern, string, flags=0):
    matches = re.findall(pattern, string, flags)
    if len(matches) == 0:
        print("No match found.", "\n")
    else:
        try:
            print(', '.join(matches), '\n')
        except TypeError:
            # With two or more groups, findall() returns tuples, which join() cannot handle
            print(matches, '\n')

In [17]:
show_matches(r'\d+', "Today is 2025-05-06")

2025, 05, 06 



In [18]:
show_matches(r'cat', "There's a dog here.")

No match found. 



Let's use the next controversial statement for a series of examples.


## Basic patterns

🔤 Ordinary Characters
a, X, 9, < – match exactly themselves



---
Meta-characters (have special meanings)

. ^ $ * + ? { [ ] \ | ( )


 Dot (.) — Matches any 1 character except newline

In [21]:
re.findall(r'c.t', 'cat, cut, cot, c#t')
# 💡 ['cat', 'cut', 'cot', 'c#t']


['cat', 'cut', 'cot', 'c#t']

\w — Matches 1 word character: letter, digit, or _

In [22]:
re.findall(r'\w', 'Hi_123!')
# 💡 ['H', 'i', '_', '1', '2', '3']


['H', 'i', '_', '1', '2', '3']

 \W — Matches 1 non-word character


In [23]:
re.findall(r'\W', 'Hi_123!')
# 💡 ['!']


['!']

\d — Matches 1 digit (0-9)

In [25]:
re.findall(r'\d+', 'My number is 12 345')
# 💡 ['12', '345']  (\d+ matches each run of digits; the space separates the two runs)

['12', '345']

\D — Matches non-digits

In [26]:
re.findall(r'\D+', '123ABC!456')
# 💡 ['ABC!']

['ABC!']

\s — Matches 1 whitespace (space, tab, newline)

In [27]:
re.findall(r'\s', 'Hello world\t!')
# 💡 [' ', '\t']


[' ', '\t']

\S — Matches non-whitespace characters

In [29]:
re.findall(r'\S+', 'Hi there!')
# 💡 ['Hi', 'there!']

['Hi', 'there!']

\b — Matches word boundary

In [28]:
re.findall(r'\bcat\b', 'A cat in the catalog.')
# 💡 ['cat']  ✅ only whole word "cat", not "catalog"
#👉 Return only the exact word cat — nothing more, nothing less.

['cat']

 ^ and $ — Start and End of string


In [35]:
re.findall(r'^Hi', 'Hi there!')
# 💡 ['Hi']



['Hi']

In [38]:
re.findall(r'!$', 'Hi there!')
# 💡 ['!']

['!']

 Escaping special characters (`\`)

In [47]:
re.findall(r'\.', 'File.txt')
# 💡 ['.'] ✅ matches the dot literally

re.findall(r'\\', 'C:\\Users')
# 💡 ['\\'] ✅ matches backslash


['\\']

In [43]:
sentence = "Having 1 cat is nice, but having 20 cats is crazy!"
sent1="cat haha cats"

In [44]:
show_matches(r'cat', sentence)
show_matches(r'cats', sentence)
show_matches(r'cat\s', sentence)
show_matches(r'\bcat\b', sentence)
show_matches(r'^cat', sentence)
show_matches(r'^cat', sent1)
show_matches(r'cat$', sentence)
show_matches(r'cat$', sent1)

cat, cat 

cats 

cat  

cat 

No match found. 

cat 

No match found. 

No match found. 



In [45]:
show_matches(r'Cat', sentence)
show_matches(r'Cat', sentence, flags=re.IGNORECASE)

No match found. 

cat, cat 



By default, every part of a pattern has to match for the whole pattern to match. The pipe operator `|` allows combining two patterns by means of a logical OR:

In [46]:
show_matches(r'cats|cat', sentence)
show_matches(r'cat|cats', sentence)

cat, cats 

cat, cat 



Notice that the two expressions do not yield the same result. In the second expression, the first alternative `cat` already succeeds on the first three letters of "cats", so the alternative `cats` is never tried. If neither word were a prefix of the other, the order wouldn't matter; for example, `cat|dog` and `dog|cat` would indeed yield the same results.

In [None]:
show_matches(r'\d', sentence)        # Same as: show_matches(r'[0-9]', sentence)
show_matches(r'\d\d', sentence)      # Same as: show_matches(r'[0-9][0-9]', sentence)
show_matches(r'\d\d\d', sentence)    # Same as: show_matches(r'[0-9][0-9][0-9]', sentence)

In [None]:
show_matches(r'\w', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]', sentence)
show_matches(r'\w\w', sentence)      # Same as: show_matches(r'[a-zA-Z0-9_][a-zA-Z0-9_]', sentence)
show_matches(r'\w\w\w', sentence)    # Same as: show_matches(r'[a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_]', sentence)

Notice that matches never overlap. For example, in the last example, one might expect something like `Hav`, `avi`, `vin`, `ing`, `cat`, `nic`, `ice`, etc. The functions in Python's built-in `re` module do not return overlapping matches, but there are third-party RegEx packages for Python that do, e.g., `regex` (`import regex`).
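Overlapping candidates can still be collected with the built-in `re` module by wrapping the pattern in a lookahead that contains a capturing group (lookaheads are covered later in this notebook). This is a small sketch of that workaround on the single word "Having", not a replacement for the overlapping-match support of other packages.

```python
import re

word = "Having"

# Plain findall(): consecutive, non-overlapping matches only
non_overlapping = re.findall(r'\w\w\w', word)
print(non_overlapping)   # ['Hav', 'ing']

# Lookahead with a capturing group: the match itself is empty, so the
# scan advances one character at a time and group 1 records each window
overlapping = re.findall(r'(?=(\w\w\w))', word)
print(overlapping)       # ['Hav', 'avi', 'vin', 'ing']
```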


In [None]:
show_matches(r'Having', sentence)
show_matches(r'^Having', sentence)
show_matches(r'Having$', sentence)

In [None]:
print(re.sub(r'cat', 'dog', sentence))

**Important:** Always be careful what you're doing when you replace substrings! :)

In [None]:
print(re.sub(r'cat', 'dog', 'The scatter plot shows all the categories'))
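One way to avoid such accidental replacements is to anchor the pattern with word boundaries (`\b`). The sketch below uses a made-up example string to compare the naive and the anchored replacement.

```python
import re

text = 'The cat ignored the scatter plot of cat categories'

# Naive replacement also rewrites "scatter" and "categories"
naive = re.sub(r'cat', 'dog', text)
print(naive)    # The dog ignored the sdogter plot of dog dogegories

# \b on both sides restricts the replacement to the whole word "cat"
careful = re.sub(r'\bcat\b', 'dog', text)
print(careful)  # The dog ignored the scatter plot of dog categories
```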

## Repetition patterns

- `+`: 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
- `*`: 0 or more occurrences of the pattern to its left
- `?`: match 0 or 1 occurrences of the pattern to its left
- `{n}`: exactly `n` occurrences of the pattern to its left
- `{l,u}`: specification of a lower bound `l` and an upper bound `u`. The interval can be unbounded: `{l,}` or `{,u}`
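Before applying these quantifiers to the longer example sentence, it can help to see them on very short strings. The strings below are made up for illustration only.

```python
import re

# {2,3}: two to three b's after the a (greedy, so it takes three if it can)
bounded = re.findall(r'ab{2,3}', 'ab abb abbb abbbb')
print(bounded)       # ['abb', 'abbb', 'abbb']

# ?: the u is optional
optional = re.findall(r'colou?r', 'color colour colouur')
print(optional)      # ['color', 'colour']

# + needs at least one digit; * also accepts zero digits
one_or_more = re.findall(r'x\d+', 'x x1 x22')
zero_or_more = re.findall(r'x\d*', 'x x1 x22')
print(one_or_more)   # ['x1', 'x22']
print(zero_or_more)  # ['x', 'x1', 'x22']
```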

In [None]:
show_matches(r'\d+', sentence)        # Same as: show_matches(r'[0-9]+', sentence)
show_matches(r'\d*', sentence)        # Same as: show_matches(r'[0-9]*', sentence)
show_matches(r'\d?', sentence)        # Same as: show_matches(r'[0-9]?', sentence)

In [None]:
show_matches(r'\d{,3}', sentence)
show_matches(r'\d{1}', sentence)
show_matches(r'\d{1,1}', sentence)
show_matches(r'\d{2,3}', sentence)
show_matches(r'\d{2,2}', sentence)
show_matches(r'\d{1,}', sentence)

In [None]:
show_matches(r'\w+', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]+', sentence)
show_matches(r'\w*', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]*', sentence)
show_matches(r'\w?', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]?', sentence)
show_matches(r'[a-zA-Z]+', sentence)

In [None]:
show_matches(r'cats+', sentence)
show_matches(r'cats*', sentence)
show_matches(r'cats?', sentence)

## Group extraction

Groups are very useful to "organize" a regular expression into different parts. While for a match all parts/groups must match, the individual groups are captured individually and can be accessed as such. As a result, the output is no longer a list of strings, one for each match, but a list of tuples. The number of elements in each tuple reflects the number of groups in the regular expression.


The following looks for all number-word pairs (e.g., "20 cats", but also "300km" since the whitespace between number and word is optional).

In [None]:
show_matches(r'(\d+)\s?(\w+)', sentence)
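To make the switch from strings to tuples visible, the following sketch runs the same pattern with and without groups on a short, made-up string. It calls `re.findall()` directly instead of the `show_matches()` helper so that the return values can be inspected.

```python
import re

text = "20 cats walked 300km"

# No groups: findall() returns the full matches as strings
full = re.findall(r'\d+\s?\w+', text)
print(full)    # ['20 cats', '300km']

# Two groups: findall() returns one (number, word) tuple per match
pairs = re.findall(r'(\d+)\s?(\w+)', text)
print(pairs)   # [('20', 'cats'), ('300', 'km')]

# The tuple elements can be unpacked and processed separately
for number, word in pairs:
    print(int(number), word)
```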

The following example splits a sentence into different clauses by looking for commas, colons, and semicolons.

In [None]:
show_matches(r'([\w\s]+)[,;:]([\w\s]+)', sentence)

The following examples look for email addresses in a string.

In [None]:
email_sentence = "You can contact me via test@example.org or demo.user.@example.org."

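# Pattern 1: no groups, so findall() returns each address as a plain string
matches = re.findall(r'[\w.-]+@[\w.-]+\w', email_sentence)
for m in matches:
    print(m)

# Pattern 2: contains groups, so findall() returns one tuple per match
matches = re.findall(r'((\w+[\w.])*@(\w+[.])*\w+)', email_sentence)
for m in matches:
    print(m)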

We can use groups to make it easy to individually get the account name (string before @) and the server name (string after @).

In [None]:
matches = re.findall(r'([\w.-]+)@([\w.-]+\w)', email_sentence)

for m in matches:
    print(m)
    print("-- account name: {}".format(m[0]))
    print("-- server name:  {}".format(m[1]))

Sometimes we want not only the matching substrings but also the position of each match in the input string, for example to see how far apart two matches are. To accomplish this, we can use `finditer()`, which returns an iterator of match objects. Each match object provides `start()`, `end()`, and `span()` to get the location of a match; the end position is exclusive, i.e., it is the index just after the last matched character.

In [None]:
pattern = re.compile(r'([\w.-]+)@([\w.-]+\w)')  # trailing \w keeps the sentence-final "." out of the match

for m in pattern.finditer(email_sentence):
    print(m)
    print(m.group(), m.span())      # Same as print(m.group(0), m.span(0))
    print(m.group(1), m.span(1))
    print(m.group(2), m.span(2))
    print()

Note that `m` is not a string but a match object, since we are using `finditer()` instead of `findall()`.
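As a rough sketch of the "how far apart" use case, the distance between two consecutive matches can be computed from `end()` of the first and `start()` of the second. The variable names `gaps`, `distance`, and `between` are only illustrative.

```python
import re

email_sentence = "You can contact me via test@example.org or demo.user.@example.org."
pattern = re.compile(r'[\w.-]+@[\w.-]+\w')

matches = list(pattern.finditer(email_sentence))

# Pair each match with its successor and measure the gap between them
gaps = []
for left, right in zip(matches, matches[1:]):
    distance = right.start() - left.end()   # end() is exclusive
    between = email_sentence[left.end():right.start()]
    gaps.append((distance, between))

print(gaps)   # [(4, ' or ')]
```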

By default, groups are numbered in order of appearance, starting at 1. Thus, to access the first group, we have to write `group(1)` and `span(1)`. `group(0)` and `span(0)`, or simply `group()` and `span()`, refer to the complete match covering all groups and the "rest".

Groups can also be nested -- see the example later below. In case of nesting, the numbering derives from the order in which groups are opened by `(`.
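The numbering rule for nested groups is easiest to see on a small case. The date pattern below is a made-up illustration; the comment lists the groups in the order in which their opening parentheses appear.

```python
import re

# Opening parentheses, left to right:
# 1 = whole date, 2 = year, 3 = month-day, 4 = day
pattern = re.compile(r'((\d{4})-(\d{2}-(\d{2})))')
m = pattern.search("Today is 2025-05-06")

print(m.groups())   # ('2025-05-06', '2025', '05-06', '06')
print(m.group(3))   # 05-06
```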


### Named groups

Instead of using the implicit ordering of groups to access the (partial) matches, we can also give each group its own name with `(?P<account>...)`. This later allows access to the groups using the name, which avoids keeping track of the ordering -- but makes the regular expression more scary looking.

The example below again extracts all email addresses and differentiates between the account and server part of the address; see above. However, here we use named groups to make printing the matches more intuitive.


In [None]:
pattern = re.compile(r'(?P<account>[\w.-]+)@(?P<server>[\w.-]+\w)')

for m in pattern.finditer(email_sentence):
    print(m)
    print(m.group(), m.span())
    print(m.group('account'), m.span('account'))
    print(m.group('server'), m.span('server'))
    print()
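Match objects also offer `groupdict()`, which returns all named groups at once as a dictionary. A minimal sketch with the same pattern and a made-up input string:

```python
import re

pattern = re.compile(r'(?P<account>[\w.-]+)@(?P<server>[\w.-]+\w)')
m = pattern.search("Write to test@example.org today.")

# groupdict() maps each group name to the text it captured
parts = m.groupdict()
print(parts)   # {'account': 'test', 'server': 'example.org'}

# Named groups still keep their positional numbers
same = m.group(1) == m.group('account')
print(same)    # True
```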

### Grouping and Capturing

Parentheses serve two different purposes: grouping expressions and capturing the text that matches an expression. That is, by default, each matching group is captured and returned. However, sometimes a group is only needed for grouping, not for capturing. Consider the following example:

In [None]:
gc_sentence = "Aircraft and airplane are synonyms, but a jet is a special kind of airplane."

From this sentence we want to extract all mentions of "aircraft", "airplane", and "jet". Making use of the syntactic similarities between "aircraft" and "airplane", we can accomplish this by using:

In [None]:
pattern = re.compile(r'(air(craft|plane)|jet)', flags=re.IGNORECASE)

for m in pattern.finditer(gc_sentence):
    print(m.group(1), m.span(1))
    print(m.group(2), m.span(2))

Since we have two groups, we not only capture "aircraft" and "airplane" but also "craft" and "plane". This is not really what we want. To get rid of it, we can switch off the capturing of the 2nd group with `?:` -- now the 2nd group is only used for matching:

In [None]:
pattern = re.compile(r'(air(?:craft|plane)|jet)', flags=re.IGNORECASE)

for m in pattern.finditer(gc_sentence):
    print(m.group(1), m.span(1))
    # print(m.group(2), m.span(2)) # Now throws an error since the 2nd group has been "disabled"

Side note: In this easy example, `(aircraft|airplane|jet)` would also work just fine.
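The effect of `?:` is especially visible with `findall()`, which returns group contents whenever the pattern has capturing groups. The sketch below drops the outer parentheses from the pattern above to show both behaviours side by side.

```python
import re

gc_sentence = "Aircraft and airplane are synonyms, but a jet is a special kind of airplane."

# Capturing inner group: findall() returns only that group ('' where "jet" matched)
capturing = re.findall(r'air(craft|plane)|jet', gc_sentence, flags=re.IGNORECASE)
print(capturing)       # ['craft', 'plane', '', 'plane']

# Non-capturing inner group: no groups left, so findall() returns the full matches
non_capturing = re.findall(r'air(?:craft|plane)|jet', gc_sentence, flags=re.IGNORECASE)
print(non_capturing)   # ['Aircraft', 'airplane', 'jet', 'airplane']
```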

## Backreferences

Backreferences are a concept related to groups. They allow referring to a group in the regular expressions. The common use case is to find repeated patterns in a string.

In [None]:
br_sentence = "School is the coolest time of the day hahahahahaha."

In the following example, we are looking for words that contain repeated substrings. According to the sentence above, that can be "school" where the repeated substring is "o", or "hahahahahaha" where the repeated substring is "ha".

In [None]:
show_matches(r'\w*(\w)\1\w*', br_sentence)
show_matches(r'\w*(\w+)\1\w*', br_sentence)

This didn't quite work since `findall()` returns only the groups if the pattern contains groups. To get the full match, one can use `finditer()` or keep `findall()` but change the expression a little bit.

In [None]:
pattern = re.compile(r'\w*(\w+)\1\w*')

for m in pattern.finditer(br_sentence):
    print(m.group(), m.span())

The alternative is to use `findall()` and pack the whole expression into its own group. The only required change is that we need to replace `\1` by `\2` since we now have two groups -- actually nested groups, and the group that checks for doubled characters is the inner group.

In [None]:
br_sentence = "This is a test."

show_matches(r'(\w*(\w)\2\w*)', br_sentence)
show_matches(r'(\w*(\w{2,})\2\w*)', br_sentence)
show_matches(r'(\b(\w)\w*\2\b)', br_sentence, flags=re.IGNORECASE)

for m in re.findall(r'(\w*(\w+)\2\w*)', br_sentence):
    print(m[0]) # We know that we are interested in the first group ==> first element in tuple

Just as an additional example, let's use named groups to make the access to the matches more intuitive.

In [None]:
pattern = re.compile(r'(?P<doubledword>\w*(\w+)\2\w*)')

for m in pattern.finditer(br_sentence):
    print(m.group('doubledword'), m.span('doubledword'))
    print()

### Replacing substrings using backreferences

The replacement string can also contain backreferences. Here the syntax `\g<1>` is preferred over `\1` to avoid potential ambiguities: `\g<1>0` clearly means group 1 followed by the character "0", whereas `\10` would be read as a reference to group 10.

In [None]:
print(re.sub(r'(is)', r'was and \g<1>', br_sentence))  # raw string so that \g is not treated as a string escape
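The ambiguity that `\g<1>` avoids shows up as soon as a digit directly follows the backreference. The following sketch uses a made-up string; the `try`/`except` is only there to display the error that the ambiguous form produces.

```python
import re

# \g<1>0: group 1 followed by a literal "0"
times_ten = re.sub(r'(\d)', r'\g<1>0', 'a1 b2')
print(times_ten)   # a10 b20

# \10 is read as "group 10", which this pattern does not have
try:
    re.sub(r'(\d)', r'\10', 'a1 b2')
    failed = False
except re.error as err:
    failed = True
    print('re.error:', err)
```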

The following substring replacement using `sub()` first finds double words and replaces them with just the first occurrence.

In [None]:
print(re.sub(r'\b(\w+)\s+\1\b', r'\g<1>', 'The the length of of a text is often defined by the the number of words.', flags=re.IGNORECASE))

We can also replace doubled words with the second occurrence, but this is likely to cause more problems when the doubled words appear at the beginning of sentence with the first occurrence being capitalized and the second one being lowercase. Notice that we first have to change `\1` to `(\1)` to put the second occurrence into its own group, otherwise we cannot refer to it with `\g<2>`.


In [None]:
print(re.sub(r'\b(\w+)\s+(\1)\b', r'\g<2>', 'The the length of of a text is often defined by the the number of words.', flags=re.IGNORECASE))

## Lookarounds: Lookaheads & Lookbehinds

Lookarounds are assertions that look ahead or behind to ensure that a subpattern does or does not occur. Lookarounds match characters just like any other pattern, but then give up the match, returning only the result: match or no match (hence an assertion).

With the 2 directions (ahead, behind) and the 2 types of assertions (match, no match), there are four types of lookarounds:

- `(?=)`: positive lookahead -- `A(?=B)` finds expression A but only when followed by expression B

- `(?!)`: negative lookahead -- `A(?!B)` finds expression A but only when *not* followed by expression B

- `(?<=)`: positive lookbehind -- `(?<=B)A` finds expression A but only when preceded by expression B

- `(?<!)`: negative lookbehind -- `(?<!B)A` finds expression A but only when *not* preceded by expression B


Lookaround expressions are very useful when there are several conditions.

**Important:** In Python's `re` module, the pattern of a lookbehind needs to be of fixed length, that is, it cannot contain a repetition specifier that allows for a varying number of characters. For example, `(?<!B+)A`, for some expression `B`, would throw an error.
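The fixed-length restriction can be observed directly: `re` rejects a variable-length lookbehind when the pattern is compiled. This is a small sketch with a made-up CSS-like string; the `px` example is purely illustrative.

```python
import re

# Variable-length lookbehind: rejected by Python's re at compile time
try:
    re.compile(r'(?<=\d+)px')
    compiled = True
except re.error as err:
    compiled = False
    print('re.error:', err)

# Fixed-length lookbehind: exactly one digit must precede "px"
sizes = re.findall(r'(?<=\d)px', 'width: 12px; unit: px')
print(sizes)   # ['px']
```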


### Lookaheads

The following example shows a simple positive lookahead: we are searching for the first names of all people with the last name "Simpson". That means we are looking for all words that are followed by " Simpson" (but we do not care about "Simpson" itself).

In [None]:
lap_sentence = "The team consists of Homer Simpson, Barney Gumble, Monty Burns, Marge Simpson, Ned Flanders, and Lenny Lennard."

pattern = re.compile(r'\w+(?= Simpson)')

for m in pattern.finditer(lap_sentence):
    print(m.group())

### Lookbehinds

The following example shows a simple positive lookbehind: we are searching for all amounts of money in SGD.

In [None]:
lbp_sentence = "For 5 years, the funding is $1,000,000.00 which converts to S$1,307,600.00."

# pattern = re.compile(r'[0-9,.]*\d')     # try the pattern without the lookbehind
pattern = re.compile(r'(?<=S\$)[0-9,.]*\d')

for m in pattern.finditer(lbp_sentence):
    print(m.group())

Note that we have to escape `$` to `\$` since `$` is a reserved symbol within regular expressions.
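The two negative variants from the list of lookarounds work the same way, with the condition inverted. The sketch below reuses `lbp_sentence` for a negative lookbehind and a made-up string for a negative lookahead.

```python
import re

lbp_sentence = "For 5 years, the funding is $1,000,000.00 which converts to S$1,307,600.00."

# Negative lookbehind: a "$" amount that is NOT preceded by "S"
not_sgd = re.findall(r'(?<!S)\$[0-9,.]*\d', lbp_sentence)
print(not_sgd)    # ['$1,000,000.00']

# Negative lookahead: "cat" at a word start that is NOT followed by "s"
no_plural = re.findall(r'\bcat(?!s)', 'cat cats catalog')
print(no_plural)  # ['cat', 'cat']  (from "cat" and "catalog")
```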

## More Examples

### Find all dates in a text

In [None]:
date_sentence = "the office will be closed from 20/03/2017 to 25.03.2017."

pattern = re.compile(r'\d+[/.]\d+[/.]\d+')

for m in pattern.finditer(date_sentence):
    print(m.group())

Don't forget that regular expressions only look for patterns. The expression above cannot truly find only real dates. For example, it would also match `123.456.789`, which might be a phone number. It is usually possible to make a regular expression more robust against wrong matches, but its complexity can quickly shoot up. In this example, we can use `r'\d{1,2}[/.]\d{1,2}[/.]\d{2,4}'` to ensure that `123.456.789` wouldn't match.
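The difference between the loose and the stricter date pattern can be checked on a sentence that contains both dates and a phone-like number. The input string here is made up for illustration.

```python
import re

text = "Closed from 20/03/2017 to 25.03.2017; call 123.456.789 for details."

# Loose pattern: any digit runs separated by "/" or "."
loose = re.findall(r'\d+[/.]\d+[/.]\d+', text)
print(loose)    # ['20/03/2017', '25.03.2017', '123.456.789']

# Stricter pattern: limits the number of digits in each part
strict = re.findall(r'\d{1,2}[/.]\d{1,2}[/.]\d{2,4}', text)
print(strict)   # ['20/03/2017', '25.03.2017']
```

Even the stricter pattern only checks the shape of the string; something like `99/99/9999` would still match.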

### Find all IATA flight numbers in a text

According to Wikipedia, IATA flight numbers consist of a 2-letter identifier of the airline and an assigned number -- fun fact: low numbers represent more prestigious flights. Officially, there's no whitespace between the airline code and the number, but many sources place one.


In [None]:
iata_sentence = "Was thinking about taking the SQ 326 but now I'm taking LH9765."

pattern = re.compile(r'\b[a-zA-Z]{2}\s?[\d]{1,4}\b')

for m in pattern.finditer(iata_sentence):
    print(m.group())

### Change text from American English to British English (a bit)

In [None]:
ae_sentence = "Let's organize a meeting to list all commercialized products for the citizens."

In [None]:
# be_sentence = re.sub(r'ize', 'ise', ae_sentence, flags=re.IGNORECASE) # Check why this fails

be_sentence = re.sub(r'\b([\w]+)iz(ing|e[sd]?)?\b', r'\g<1>is\g<2>', ae_sentence, flags=re.IGNORECASE) # Only rewrites "iz" at the end of a word (-ize, -izes, -ized, -izing)

print(be_sentence)

### Password strength checker

Regular expressions do not necessarily need to return a match. In this example, we use a regular expression to check the strength of passwords, i.e., a password needs to fulfil some requirements to be valid (see list below). A check for several conditions is often a good indicator that lookarounds are the best option.

- `(?=.*[a-z])`: at least 1 lowercase alphabetical character

- `(?=.*[A-Z])`: at least 1 uppercase alphabetical character

- `(?=.*[0-9])`: at least 1 numeric character

- `(?=.*[!@#\$%\^&\*])`: at least one special character (note the escaped reserved RegEx characters)

- `(?=.{8,})`: eight characters or longer

In [None]:
good_password = 'testTEST123#'
bad_password = 'testtest123#'   # no uppercase character

In [None]:
pattern = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!@#\$%\^&\*])(?=.{8,})')

print(pattern.match(good_password))
print(pattern.match(bad_password))

Note that even valid passwords return only an empty string as the match. This is because the whole pattern is composed of lookaheads, which are assertions (i.e., patterns that give up the match; see above). If you want to get the password as a matched string you can change the last lookahead `(?=.{8,})` to a normal pattern `.{8,}`.
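A possible version of that variant is sketched below. The helper name `check_password` and the trailing `$` anchor are assumptions added for this illustration; they are not part of the pattern used above.

```python
import re

# Same lookaheads as above; the final length check is a normal pattern,
# so the match object now contains the password itself.
# The trailing $ (an assumption of this sketch) anchors the match at the end.
pattern = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!@#\$%\^&\*]).{8,}$')

def check_password(password):
    """Return the password if it meets all requirements, else None."""
    m = pattern.match(password)
    return m.group() if m else None

print(check_password('testTEST123#'))   # testTEST123#
print(check_password('testtest123#'))   # None (no uppercase character)
print(check_password('tT1#'))           # None (shorter than 8 characters)
```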

## Summary

Regular expressions are a powerful and flexible tool to analyze text. This tutorial covered only a small subset of the capabilities of regular expressions.

- In general, the same task can be solved with different regular expressions, from simple to (overly) complex

- Regular expressions are "easy to learn but hard to master"

- Practically all modern programming languages support regular expressions

- RegEx engines (e.g., Python's `re`) are not all created equal. Not all support all (advanced) features.
