# PETI8123 Lab 9: Regular Expressions

<!--
Q1:
regex = re.compile(r'\b[tT]he\b')

for match in regex.finditer(text):
    print(f"start={match.start()}, end={match.end()}, text: {match.group()}")


Q2:
for match in re.finditer(r"\w+ly\b", text):
    print(f"start={match.start()}, end={match.end()}, text: {match.group()}")



Q3.
total = 0
for row in my_costs:
    match = re.search(r"\d+", row)
    if match:
        total += int(match.group())

print("Total: ", total)
print("Discounted: ", total * 0.95)
-->

Regular expressions (regex or regexp) are concise and flexible tools for finding, validating, and manipulating patterns within text. They are widely used in programming for tasks like pattern matching, data extraction, and text manipulation. In this lab, we learn some basic concepts and examine simple examples of the applications of regex.

For Python, regular expression functions are provided in the ``re`` module. You can find its full documentation [here](https://docs.python.org/3/library/re.html).

Now, we import the ``re`` module:

In [1]:
import re

Regular expressions look like ordinary strings, but some characters carry special meanings. Some basic rules are listed below:

1. **Characters**:
  - Ordinary characters such as letters, digits, or any other specific characters, which will match the corresponding text.
  - Meta-characters such as ``. * + ? ^ $`` have special meanings.
  
  
2. **Character Sets**:

  - Square brackets (``[]``) match any one of the characters inside. For example, ``[abc]`` matches the letter 'a', 'b', or 'c'.


3. **Repetition**:

	- ``*``: matches the preceding element zero or more times.
	- ``+``: matches the preceding element one or more times.
	- ``?``: matches the preceding element zero or one time.
	- ``{n}``: matches the preceding element exactly n times.
	- ``{n,}``: matches the preceding element at least n times.
	- ``{n,m}``: matches the preceding element at least n times but not more than m times.


4. **Positions**:

  - ``^``: matches the beginning of a string.
  - ``$``: matches the end of a string.
  - ``\b``: matches a word boundary.
  - ``\B``: matches a non-word boundary.
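
The rules above can be tried out directly. The following is a minimal sketch using ``re.findall``, which returns every non-overlapping match as a list; the sample strings and variable names are made up for illustration.

```python
import re

# Character set: [abc] matches exactly one of 'a', 'b', or 'c'
char_set = re.findall(r"[abc]", "a big cab")
print(char_set)        # ['a', 'b', 'c', 'a', 'b']

# Repetition: each quantifier applies to the element just before it (here, 'b')
star = re.findall(r"ab*", "a ab abbb")            # zero or more 'b'
plus = re.findall(r"ab+", "a ab abbb")            # one or more 'b'
optional = re.findall(r"ab?", "a ab abbb")        # zero or one 'b'
bounded = re.findall(r"ab{2,3}", "ab abb abbbb")  # two or three 'b'
print(star)            # ['a', 'ab', 'abbb']
print(plus)            # ['ab', 'abbb']
print(optional)        # ['a', 'ab', 'ab']
print(bounded)         # ['abb', 'abbb']

# Positions: these match a location in the string, not a character
sample = "cat concat cat"
at_start = re.findall(r"^cat", sample)       # only at the very beginning
at_end = re.findall(r"cat$", sample)         # only at the very end
whole_word = re.findall(r"\bcat\b", sample)  # 'cat' as a separate word
inside_word = re.findall(r"\Bcat", sample)   # 'cat' not at the start of a word
print(at_start, at_end, whole_word, inside_word)
```

Only the ``cat`` inside ``concat`` satisfies ``\Bcat``, because the other two occurrences begin at a word boundary.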



## 1. Matching at the Beginning of Strings

*Matching at the beginning of strings* means checking whether a string starts with a given pattern. In Python, the ``match(...)`` method does this: it tries the pattern at the first character of the string only, and returns a ``Match`` object on success or ``None`` otherwise.

This is useful in text processing and searching whenever the decision depends on how a string begins.

In [2]:
# Create a regular expression object using re.compile to match the pattern 'key'
regex = re.compile("key")

# Use the match method to attempt to match the pattern against the string "I have a key"
match = regex.match("I have a key")

# Print the match result (match will be a Match object if a match is found, or None if no match)
print(match)

None


In [3]:
# Use the match method to attempt to match the pattern against the string "keys open doors"
match = regex.match("keys open doors")

# Print the match result (match will be a Match object if a match is found, or None if no match)
print(match)

<re.Match object; span=(0, 3), match='key'>


⚠️ Think about why the first example doesn't work, but the second one works.

## 2. Matching within Strings

*Matching within strings* refers to the process of identifying specific patterns or sequences of characters that occur anywhere within a larger string or text.

Instead of focusing solely on the beginning of the string, this approach involves searching for patterns that may be located at any position within the text.

In [4]:
# Create a regular expression object using re.compile to match the pattern '11*'
regex = re.compile('11*')

# Use the match method to attempt to match the pattern against the string 'I have $111 dollars'
match = regex.match('I have $111 dollars')

# Print the match result (match will be a Match object if a match is found, or None if no match)
print(match)

None


In [5]:
# Use the search method to search for the pattern '11*' in the string 'I have $111 dollars'
match = regex.search('I have $111 dollars')

# Print the search result (match will be a Match object if a match is found, or None if no match)
print(match)

<re.Match object; span=(8, 11), match='111'>


A slightly different example:

In [6]:
# Use the search method to search for the pattern '11*' in the string 'I have $25111 dollars'
match = regex.search('I have $25111 dollars')

# Print the search result (match will be a Match object if a match is found, or None if no match)
print(match)

<re.Match object; span=(10, 13), match='111'>
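
To summarise the difference between the methods, here is a small sketch that applies the same compiled pattern with ``match``, ``search``, and ``fullmatch`` (a third method of compiled patterns, not used elsewhere in this lab). The sample strings are made up for illustration.

```python
import re

regex = re.compile("11*")   # a '1' followed by zero or more further '1's

text = "111 dollars"
at_start = regex.match(text)      # tries position 0 only
anywhere = regex.search(text)     # scans forward for the first match
whole = regex.fullmatch(text)     # the pattern must cover the entire string

print(at_start.group())   # 111
print(anywhere.span())    # (0, 3)
print(whole)              # None, because ' dollars' is left over

# search returns the FIRST match, not the longest one
first_hit = regex.search("a1b11c111").group()
all_hits = regex.findall("a1b11c111")
print(first_hit)          # 1
print(all_hits)           # ['1', '11', '111']
```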


### Another Example with a Different Regular Expression:

In [7]:
# \b: matches a word boundary
regex = re.compile(r"\bkey\b")  # Note the use of r"..." here

print(regex)

# Text to search
text = "I have a key, and keys open doors."

re.compile('\\bkey\\b')


In [8]:
# Use re.search to search for the pattern in the text
match = regex.search(text)

# Check if a match is found
if match:
  print("Match found!")
else:
  print("No match found.")

Match found!


In [9]:
# Check if a match is found

if match:
  # Print the matched text
  print("Matched text:", match.group())

  # Print the starting index of the match
  print("Start position:", match.start())

  # Print the ending index of the match
  print("End position:", match.end())

Matched text: key
Start position: 9
End position: 12
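
The sketch below illustrates why the ``r"..."`` prefix matters for this pattern and what ``\b`` changes. It reuses the sentence from the cells above; the variable names are illustrative.

```python
import re

text = "I have a key, and keys open doors."

# Without \b, the 'key' inside 'keys' is matched as well
plain = re.findall("key", text)
print(plain)          # ['key', 'key']

# With \b on both sides, only the standalone word matches
whole_word = re.findall(r"\bkey\b", text)
print(whole_word)     # ['key']

# In a normal (non-raw) string, "\b" is the backspace character,
# so this pattern looks for backspace + 'key' + backspace and finds nothing
not_raw = re.findall("\bkey\b", text)
print(not_raw)        # []

# The two spellings really are different strings
print(len("\b"), len(r"\b"))   # 1 2
```

Writing patterns as raw strings by default avoids this kind of silent mismatch.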


## 3. Extracting Matching Strings

*Extracting matching strings* means retrieving every part of a text that matches a pattern. The ``findall(...)`` method returns a list of the matched strings (when the pattern contains no groups), while ``finditer(...)`` yields ``Match`` objects that also carry the position of each match.

In [10]:
# Create a regular expression object using re.compile to match one or more digits [0-9]+
regex = re.compile('[0-9]+') # +: represents the preceding element one or more times.

# Use the findall method to find all occurrences of one or more digits in the string
matches = regex.findall('My 2 favorite numbers are 19 and 42')

# Print the list of matches found
print(matches)

['2', '19', '42']


In [11]:
text = "My 2 favorite numbers are 19 and 42"

for match in regex.finditer(text):
    # Print information about each match
    start = match.start()
    end = match.end()
    print(f"start={start}, end={end}, text: {match.group()}")

start=3, end=4, text: 2
start=26, end=28, text: 19
start=33, end=35, text: 42
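
Note that the extracted matches are strings, even when they look like numbers. A minimal sketch of converting them and of using the positions from ``finditer`` (variable names are illustrative):

```python
import re

text = "My 2 favorite numbers are 19 and 42"
regex = re.compile(r"[0-9]+")

# findall returns strings; convert them before doing arithmetic
numbers = [int(s) for s in regex.findall(text)]
total = sum(numbers)
print(numbers, total)     # [2, 19, 42] 63

# The (start, end) pairs from finditer are ordinary slice indices into text
spans = [m.span() for m in regex.finditer(text)]
slices = [text[start:end] for start, end in spans]
print(spans)              # [(3, 4), (26, 28), (33, 35)]
print(slices)             # ['2', '19', '42']
```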


## 4. Grouping

*Grouping* refers to creating subexpressions within a larger regular expression. By wrapping part of a pattern in parentheses, you can apply operators or quantifiers to several characters at once. After a successful match, the text captured by each group can be retrieved by its number.

In [12]:
# Create a regular expression object using re.compile to match the pattern '(ab*)(c+)(def)'
# Note how we define 3 groups here
regex = re.compile('(ab*)(c+)(def)')
print(regex)

# Use the match method to attempt to match the pattern against the string 'abbbdefhhgg'
match = regex.match('abbbdefhhgg')

# Print the match result (match will be a Match object if a match is found, or None if no match)
print(match)

re.compile('(ab*)(c+)(def)')
None


⚠️ Why doesn't it work?

In [13]:
# Use the match method to attempt to match the pattern against the string 'abbbcdefhhgg'
match = regex.match('abbbcdefhhgg')

# Print the match result (match will be a Match object if a match is found, or None if no match)
print(match)

<re.Match object; span=(0, 8), match='abbbcdef'>


In [14]:
# .groups(): returns a tuple containing the text captured by each group
match.groups()

('abbb', 'c', 'def')

In [15]:
# Let's examine the groups:

print(match.group(1))
print(match.group(2))
print(match.group(3))

abbb
c
def
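
Groups can also be given names with the ``(?P<name>...)`` syntax, which can be easier to read than numbers. The sketch below rewrites the same pattern this way; the names ``head``, ``middle``, and ``tail`` are arbitrary choices for illustration.

```python
import re

# Same three groups as above, but each one is named
regex = re.compile(r"(?P<head>ab*)(?P<middle>c+)(?P<tail>def)")
match = regex.match("abbbcdefhhgg")

whole = match.group(0)          # group 0 is always the full match
head = match.group("head")      # look a group up by name...
also_head = match.group(1)      # ...or still by number
as_dict = match.groupdict()     # all named groups at once
middle_span = match.span(2)     # position of the second group

print(whole)         # abbbcdef
print(head)          # abbb
print(as_dict)       # {'head': 'abbb', 'middle': 'c', 'tail': 'def'}
print(middle_span)   # (4, 5)
```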


### Another Example:

We use ``\b\d+\b`` to match one or more digits that stand alone as a whole word.

In [16]:
# Define the regular expression pattern to match one or more digits as whole words

# \b: matches a word boundary
# \d: matches any single digit (0-9)
regex = re.compile(r"\b\d+\b")

# The text to search for a match
text = "The price is $1000 and $200 for shipping at I30."

# Use re.search to search for the pattern in the text
match = regex.search(text)

# Check if a match is found
if match:
    # Print the matched text
    print("Matched text:", match.group())
else:
    # Print a message if no match is found
    print("No match found.")

Matched text: 1000


In [17]:
regex.findall(text)

['1000', '200']

## 5. Shortcut Syntax (No ``compile(...)``)

The ``re`` module also provides the common functions at module level, so you can skip ``compile(...)``. These functions take the regex as the first parameter and the string to search as the second.

Example: ``[aeiou]`` matches any single vowel (a, e, i, o, or u) with the shortcut syntax:

In [18]:
# Define the regular expression pattern to match any single vowel (a, e, i, o, or u)
pattern = r"[aeiou]"

# The text in which we want to find matches
text = "This is a sample text."

# Use re.findall to find all matches of the pattern in the text
matches = re.findall(pattern, text)

print("Matches found:", matches)


Matches found: ['i', 'i', 'a', 'a', 'e', 'e']


The example below uses repetition and groups to extract the day, month, and year from a date string. Details:

- ``(\d{2})``: matches two digits and captures them as a group (used twice: once for the day, once for the month)
- ``/``: matches a forward slash
- ``(\d{4})``: matches four digits and captures them as a group (the year)

In [19]:
# Define the regular expression pattern to match a date in the format "dd/mm/yyyy"
pattern = r"(\d{2})/(\d{2})/(\d{4})"

# The text in which we want to find a match
text = "Date: 25/12/2023"

# Use re.search to search for the pattern in the text
match = re.search(pattern, text)

# Check if a match is found
if match:
    print("Full match group:", match.group())  # Note the call of group() without a number

    day = match.group(1)
    month = match.group(2)
    year = match.group(3)
    print(f"{year} - {month} - {day}")
else:
    # Print a message if no match is found
    print("No match found.")

Full match group: 25/12/2023
2023 - 12 - 25
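
Captured groups can also drive a replacement. As a sketch, ``re.sub`` (another module-level function, not used elsewhere in this lab) accepts ``\1``, ``\2``, ... in its replacement string to refer back to the groups. The sample text with two dates is made up for illustration.

```python
import re

pattern = r"(\d{2})/(\d{2})/(\d{4})"
text = "Start: 25/12/2023, end: 01/01/2024"

# With groups in the pattern, findall returns one tuple per match
parts = re.findall(pattern, text)
print(parts)   # [('25', '12', '2023'), ('01', '01', '2024')]

# \3, \2, \1 refer to the year, month, and day groups respectively
iso = re.sub(pattern, r"\3-\2-\1", text)
print(iso)     # Start: 2023-12-25, end: 2024-01-01
```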


## ⚠️ Exercises

**Q1.** Write a regular expression to find all instances of the determiner "the" (consider both "the" and "The") in the ``text`` variable. Print out their start and end positions.


In [29]:
text = 'The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.'

regex = re.compile(r'\b[Tt]he\b') # match 'the' or 'The' as a whole word

for m in regex.finditer(text):
  start_pos = m.start()
  end_pos = m.end()
  print(f'the word "{m.group()}" start at {start_pos}, end at {end_pos}')

the word "The" start at 0, end at 3
the word "the" start at 22, end at 25
the word "the" start at 103, end at 106


**Q2.** Find all adverbs (words ending in ``ly``) in the following sentence:

*Hint: ``\w`` matches any single word character: a letter, a digit, or an underscore.*
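
As a quick check of what ``\w`` covers, the sketch below compares it with a letters-only character set, and compares ``\w*`` with ``\w+`` in front of ``ly``. The sample strings are made up for illustration. For ``str`` patterns, ``\w`` also matches non-English letters by default.

```python
import re

sample = "snake_case x2 well-known"

# \w includes digits and the underscore, but not the hyphen
word_chars = re.findall(r"\w+", sample)
letters_only = re.findall(r"[A-Za-z]+", sample)
print(word_chars)     # ['snake_case', 'x2', 'well', 'known']
print(letters_only)   # ['snake', 'case', 'x', 'well', 'known']

# \w*ly also accepts a bare 'ly'; \w+ly requires at least one character before it
with_star = re.findall(r"\w*ly\b", "ly fully")
with_plus = re.findall(r"\w+ly\b", "ly fully")
print(with_star)      # ['ly', 'fully']
print(with_plus)      # ['fully']
```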

In [61]:
# method 1: use regex to find the adverbs
text = "He was carefully disguised but captured quickly by lying police."

all_adverbs = re.findall(r'\w*ly\b', text)
print(f'all_adverbs: {all_adverbs}')

all_adverbs: ['carefully', 'quickly']


In [62]:
# method 2: use nltk to split words and use simple regex to find the adverbs
import nltk
nltk.download('punkt')
from nltk import word_tokenize

text = "He was carefully disguised but captured quickly by lying police."

words_nltk = word_tokenize(text) # Split the text into individual word tokens

all_adverbs = []

for i in words_nltk:
  check_end_with = re.findall('ly$', i)
  if check_end_with:
    all_adverbs.append(i)

print(f'all_adverbs: {all_adverbs}')

all_adverbs: ['carefully', 'quickly']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Q3.** Given the following statements, calculate the total cost needed, then apply a discount of 5% and calculate the final cost:

In [63]:
my_costs = [
    "An apple costs $9",
    "A computer is MOP 7999 worth",
    "We went to Taipa and spent 255dollars for drinks.",
    "Where should I have lunch if I am planning to spend $ 20?",
    "MPU charges $12$ for parking"
]

In [64]:
regex = re.compile(r'[0-9]+')

cal_cost = 0

for s in my_costs:
  matches = regex.findall(s) # find the numbers in each sentence
  if matches:                # skip any sentence that contains no number
    cal_cost = cal_cost + int(matches[0]) # each sentence states one cost

print(f'cal_cost: {cal_cost}')
print(f'cal_cost with 5% discount: {cal_cost * (1-0.05)}')

cal_cost: 8295
cal_cost with 5% discount: 7880.25
