<a href="https://colab.research.google.com/github/kedrick07/NLP-Semester-5/blob/main/Lab_1_Regex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 1: Regular Expression**
**Course:** SKM3206 – Natural Language Processing  
**Lecturer:** Dr Nurul Amelina Nasharuddin  
**Due date:** 20 October 2025 (Monday)

> ⚠️ **Instructions:**  
> 1. Go to *File → Save a copy in Drive* before editing.  
> 2. Rename your notebook as `Lab_LabNo_StudentID_Name.ipynb`.
> 3. Read the examples carefully before attempting the exercises.
> 4. Complete all the tasks in the code cells provided.  
> 5. Submit your Colab link (e.g. http://colab.research.google.com/drive/....) on PutraBlast.  





## **REGEX Python Documentation**

In Python, the `re` module provides functions such as:
- `re.search()` – finds the first match of a pattern.
- `re.findall()` – returns all matches of a pattern.
- `re.sub()` – replaces matched text with another string.
- `re.split()` – splits text by the occurrences of a pattern.
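
The four functions can be compared side by side with a minimal sketch like the one below. The sample sentence and variable names are illustrative only and are not part of the lab.

```python
import re

text = "Order 66 shipped on 2025-10-20, order 7 is pending."

# re.search returns a match object for the first match (or None if there is none)
first = re.search(r"\d+", text)
print(first.group())

# re.findall returns every non-overlapping match as a list of strings
numbers = re.findall(r"\d+", text)
print(numbers)

# re.sub replaces every match with the replacement string
masked = re.sub(r"\d", "#", text)
print(masked)

# re.split cuts the string wherever the pattern matches
parts = re.split(r",\s*", text)
print(parts)
```

Note that `re.search()` returns a match object rather than a string, so `.group()` is needed to read the matched text.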

You will certainly want to consult the [Python regular expression documentation](https://docs.python.org/3/library/re.html) as you work on this project.

## **Refresher: Regex Symbols and Examples**

### **1. Basic Metacharacters**
- **`.` (dot)** : Matches any character except a newline.
  - **Regex**: `c.t`
  - **Sentence**: "The cat sat on the mat."
  - **Match**: `cat` (matches "c", then any single character, then "t"; "mat" does not match because it does not start with "c")

- **`^`** : Matches the start of a string.
  - **Regex**: `^Hello`
  - **Sentence**: "Hello world! Hello again."
  - **Match**: `Hello` (only the first "Hello" at the start of the string)

- **`$`** : Matches the end of a string.
  - **Regex**: `world$`
  - **Sentence**: "Hello world"
  - **Match**: `world` (only if "world" is at the end)

- **`*`** : Matches 0 or more repetitions of the preceding element.
  - **Regex**: `ab*c`
  - **Sentence**: "ac, abc, abbc"
  - **Match**: `ac`, `abc`, `abbc` (matches "a" followed by any number of "b"s and then "c")

- **`+`** : Matches 1 or more repetitions of the preceding element.
  - **Regex**: `ab+c`
  - **Sentence**: "ac, abc, abbc"
  - **Match**: `abc`, `abbc` (requires at least one "b" after "a" before "c")

- **`?`** : Makes the preceding element optional (0 or 1 occurrence).
  - **Regex**: `colou?r`
  - **Sentence**: "color, colour"
  - **Match**: `color`, `colour` (matches both versions with or without "u")

- **`{n}`** : Matches exactly `n` occurrences of the preceding element.
  - **Regex**: `a{3}`
  - **Sentence**: "aa, aaa, aaaa"
  - **Match**: `aaa`, `aaa` (the whole of "aaa" and the first three letters of "aaaa"; `{3}` on its own does not require the run to stop at three)

- **`{n,}`** : Matches `n` or more occurrences of the preceding element.
  - **Regex**: `a{2,}`
  - **Sentence**: "a, aa, aaa"
  - **Match**: `aa`, `aaa` (matches two or more "a"s)

- **`{n,m}`** : Matches between `n` and `m` occurrences of the preceding element.
  - **Regex**: `a{2,3}`
  - **Sentence**: "a, aa, aaa, aaaa"
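  - **Match**: `aa`, `aaa`, `aaa` (matches two or three "a"s; the quantifier is greedy, so it takes three "a"s from "aaaa" and leaves the fourth unmatched)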

- **`\`** : Escapes a metacharacter to match it literally.
  - **Regex**: `3\.14`
  - **Sentence**: "Pi is approximately 3.14"
  - **Match**: `3.14` (the dot is treated as a literal dot, not any character)
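
The quantifier examples above can be checked quickly with `re.findall()`. This is a small sketch; the variable names and the extra `\b` variant are additions for illustration, not part of the original examples.

```python
import re

# ? makes the preceding character optional
optional_u = re.findall(r"colou?r", "color, colour")

# + needs at least one "b", so "ac" is skipped
one_or_more = re.findall(r"ab+c", "ac, abc, abbc")

# {3} is not anchored: it also matches the first three letters of "aaaa"
exactly_three = re.findall(r"a{3}", "aa, aaa, aaaa")

# Adding word boundaries restricts the match to runs of exactly three
bounded_three = re.findall(r"\ba{3}\b", "aa, aaa, aaaa")

# A backslash turns the dot into a literal dot, so "3514" is not matched
literal_dot = re.findall(r"3\.14", "3.14 and 3514")

print(optional_u)     # ['color', 'colour']
print(one_or_more)    # ['abc', 'abbc']
print(exactly_three)  # ['aaa', 'aaa']
print(bounded_three)  # ['aaa']
print(literal_dot)    # ['3.14']
```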

### **2. Character Classes**
- **`[abc]`** : Matches any one character inside the square brackets.
  - **Regex**: `[aeiou]`
  - **Sentence**: "The cat sat on the mat."
  - **Match**: `e`, `a`, `a`, `o`, `e`, `a` (matches each vowel)

- **`[^abc]`** : Matches any character NOT inside the square brackets.
  - **Regex**: `[^aeiou]`
  - **Sentence**: "Hello world"
  - **Match**: `H`, `l`, `l`, ` `, `w`, `r`, `l`, `d` (matches non-vowels)

### **3. Predefined Character Classes**
- **`\d`** : Matches any digit (0-9).
  - **Regex**: `\d+`
  - **Sentence**: "My number is 12345."
  - **Match**: `12345` (matches the whole sequence of digits)

- **`\D`** : Matches any non-digit character.
  - **Regex**: `\D+`
  - **Sentence**: "Room 101"
  - **Match**: `Room ` (matches the non-digit part, including the space before "101")

- **`\w`** : Matches any word character (alphanumeric + underscore).
  - **Regex**: `\w+`
  - **Sentence**: "Hello_123!"
  - **Match**: `Hello_123` (matches word characters)

- **`\W`** : Matches any non-word character.
  - **Regex**: `\W+`
  - **Sentence**: "Hello, world!"
  - **Match**: `, `, `!` (matches punctuation and spaces)

- **`\s`** : Matches any whitespace character (spaces, tabs, line breaks).
  - **Regex**: `\s+`
  - **Sentence**: "Hello world"
  - **Match**: ` ` (matches the space between "Hello" and "world")

- **`\S`** : Matches any non-whitespace character.
  - **Regex**: `\S+`
  - **Sentence**: "Hello world"
  - **Match**: `Hello`, `world` (matches words)
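
Each lowercase class and its uppercase opposite split the same string in complementary ways. The sketch below uses a made-up sample string to show this; whitespace and punctuation end up in the `\D` and `\W` results, which is easy to overlook.

```python
import re

text = "Room 101, Level_2!"

digits     = re.findall(r"\d+", text)   # runs of digits
non_digits = re.findall(r"\D+", text)   # everything between the digits
words      = re.findall(r"\w+", text)   # letters, digits and underscores
non_words  = re.findall(r"\W+", text)   # spaces and punctuation
chunks     = re.findall(r"\S+", text)   # anything separated by whitespace

print(digits)      # ['101', '2']
print(non_digits)  # ['Room ', ', Level_', '!']
print(words)       # ['Room', '101', 'Level_2']
print(non_words)   # [' ', ', ', '!']
print(chunks)      # ['Room', '101,', 'Level_2!']
```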

### **4. Anchors**
- **`\b`** : Matches a word boundary.
  - **Regex**: `\bcat\b`
  - **Sentence**: "The cat sat on the catalogue."
  - **Match**: `cat` (matches "cat" as a whole word)

- **`\B`** : Matches a non-word boundary.
  - **Regex**: `\Bcat\B`
  - **Sentence**: "The cat will concatenate the catalogue."
  - **Match**: `cat` (only the "cat" inside "concatenate"; the standalone "cat" and the "cat" that begins "catalogue" both start on a word boundary, so they are skipped)
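
One way to see the difference between `\b` and `\B` is to test the same patterns against several words. This is a sketch with a hypothetical word list chosen for illustration: `\Bcat\B` only succeeds when "cat" has word characters on both sides.

```python
import re

words = ["cat", "concatenate", "catalogue", "bobcat"]

# For each word, record whether "cat" is found as a whole word (\b...\b)
# and whether it is found strictly inside a word (\B...\B)
results = {
    word: (bool(re.search(r"\bcat\b", word)), bool(re.search(r"\Bcat\B", word)))
    for word in words
}

for word, (whole, inside) in results.items():
    print(f"{word:12} whole word: {whole!s:5}  inside: {inside}")
```

Here "catalogue" and "bobcat" fail both tests, because "cat" touches exactly one edge of the word in each case.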


## **1. Basic Metacharacters**

In [None]:
import re

# Example 1: Dot (.) - Match any character except a newline
text = "The cat sat on the mat."
pattern = r"c.t"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: c.t, Matches: ['cat']


In [None]:
# Example 2: Star (*) - Match zero or more repetitions
text = "ab, abc, abbc, abbbc"
pattern = r"ab*c"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: ab*c, Matches: ['abc', 'abbc', 'abbbc']


In [None]:
# Example 3: Curly braces ({n,m}) - Match between n and m occurrences
text = "ha, haha, hahahaha"
# Group "ha" so the quantifier repeats the whole syllable, not just the "a"
pattern = r"(?:ha){2,4}"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: (?:ha){2,4}, Matches: ['haha', 'hahahaha']


## **2. Character Classes**

In [None]:
# Example 1: Square brackets [] - Match any character inside the brackets
text = "apple, orange, grape"
pattern = r"[aeiou]"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: [aeiou], Matches: ['a', 'e', 'o', 'a', 'e', 'a', 'e']


In [None]:
# Example 2: Negated character class [^] - Match characters NOT in the brackets
text = "123 ABC abc"
pattern= r"[^0-9]"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: [^0-9], Matches: [' ', 'A', 'B', 'C', ' ', 'a', 'b', 'c']


In [None]:
# Example 3: Range in character class - Match a range of characters
text = "The quick brown fox jumps over the lazy dog."
pattern = r"[a-m]"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: [a-m], Matches: ['h', 'e', 'i', 'c', 'k', 'b', 'f', 'j', 'm', 'e', 'h', 'e', 'l', 'a', 'd', 'g']


## **3. Predefined Character Classes**

In [None]:
# Example 1: \d - Match any digit
text = "Phone number: 123-456-7890"
pattern = r"\d+"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: \d+, Matches: ['123', '456', '7890']


In [None]:
# Example 2: \w - Match any word character (alphanumeric + underscore)
text = "Welcome to Regex_101!"
pattern = r"\w+"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: \w+, Matches: ['Welcome', 'to', 'Regex_101']


In [None]:
# Example 3: \s - Match any whitespace character
text = "Hello world, it's a beautiful day."
pattern = r"\s"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: \s, Matches: [' ', ' ', ' ', ' ', ' ']


## **4. Anchors**

In [None]:
# Example 1: ^ - Match the start of a string
text = "Hello there, Hello world!"
pattern = r"^Hello"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: ^Hello, Matches: ['Hello']


In [None]:
# Example 2: $ - Match the end of a string
text = "This is the end."
pattern = r"end\.$"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")


Pattern: end\.$, Matches: ['end.']


In [None]:
# Example 3: \b - Match a word boundary
text = "The cat in the catalog."
pattern = r"\bcat\b"
matches = re.findall(pattern, text)
print(f"Pattern: {pattern}, Matches: {matches}")

Pattern: \bcat\b, Matches: ['cat']


## **More examples**

### 1. Find all 3-letter words.

In [None]:
import re

text = "The cat ran fast but the dog was slow."
pattern = r"\b\w{3}\b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['The', 'cat', 'ran', 'but', 'the', 'dog', 'was']

# Regex pattern: \b\w{3}\b
# Explanation:
# The opening \b ensures the match starts at a word boundary.
# \w{3} matches exactly three word characters (letters, digits, or underscores).
# The closing \b ensures the match also ends at a word boundary.

['The', 'cat', 'ran', 'but', 'the', 'dog', 'was']


### 2. Find words that start with a capital letter.

In [None]:
text = "Alice went to Paris to see the Eiffel Tower."
pattern = r"\b[A-Z]\w*"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Alice', 'Paris', 'Eiffel', 'Tower']

#Regex: \b[A-Z]\w*
#Explanation:
#\b ensures the match starts at a word boundary.
#[A-Z] matches any uppercase letter.
#\w* matches zero or more word characters following the uppercase letter.

['Alice', 'Paris', 'Eiffel', 'Tower']


### 3. Extract all numbers from a text.

In [None]:
text = "There are 12 apples, 5 oranges, and 100 grapes."
pattern = r"\b\d+\b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['12', '5', '100']

# Regex: \b\d+\b
# Explanation:
# \b ensures the match starts at a word boundary.
# \d+ matches one or more digits.
# The closing \b requires a word boundary after the digits, so only whole numbers are captured.

['12', '5', '100']


### 4. Find all words that end with "ing".

In [None]:
text = "She was running, jumping, and singing."
pattern = r"\b\w+ing\b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['running', 'jumping', 'singing']

# Regex: \b\w+ing\b
# Explanation:
# \b ensures the match starts at a word boundary.
# \w+ matches one or more word characters.
# ing matches the literal string "ing".
# The closing \b requires a word boundary after "ing", so the word must end with "ing".

['running', 'jumping', 'singing']


### 5. Find all instances of a single character followed by exactly two digits.

In [None]:
text = "A12 B34 C56 D789"
pattern = r"\b\w\d{2}\b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['A12', 'B34', 'C56']

# Regex: \b\w\d{2}\b
# Explanation:
# \b ensures the match starts at a word boundary.
# \w matches a single word character.
# \d{2} matches exactly two digits.
# The closing \b requires a word boundary after the two digits, which excludes "D789".

['A12', 'B34', 'C56']


# 🧮 **Lab Assignment**
Complete the following task and ensure your answers run without error.


## 📘 **Question Brief**

Go through the code below. This is a functioning program, but currently the only things it will match are integers and real numbers. You will need to **add regular expressions** so that it correctly matches all of the items described later in these instructions.



In [None]:
"""
match_strings.py

This notebook cell simulates reading lines from standard input.
Students can enter multiple text lines, and the program will print
the name of the first matching pattern or "unknown" if no match.
"""

import re

# ------------------------------------------------------------
# STEP 1: Define regex patterns
# ------------------------------------------------------------
patterns = [
    (r'^\d+$', 'integer'),
    (r'^\d+\.\d+$', 'real number'),
]

# ------------------------------------------------------------
# STEP 2: Provide input lines
# ------------------------------------------------------------
user_input = """
1234
5.6
"""

# Convert to list of non-empty lines
lines = [line.strip() for line in user_input.strip().split("\n") if line.strip()]

# ------------------------------------------------------------
# STEP 3: Match each line to patterns
# ------------------------------------------------------------
def print_match(string, patterns):
    for pattern, name in patterns:
        if re.fullmatch(pattern, string):
            print(f"{string} → {name}")
            return
    print(f"{string} → unknown")

print("🔍 Matching Results:\n")
for line in lines:
    print_match(line, patterns)


🔍 Matching Results:

1234 → integer
5.6 → real number


In STEP 1 of the program there is the following list of tuples. Each tuple in the list contains a regular expression and the name of the pattern that it recognizes.

    patterns = [
        (r'^\d+$', 'integer'),
        (r'^\d+\.\d+$', 'real number'),
    ]

The only changes you need to make to the program are additional entries in this list.
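
As a sketch of what adding an entry looks like, the example below appends a hypothetical `'word'` pattern. It is for illustration only and is not one of the patterns the lab asks for. The helper name `classify` is also an assumption; it mirrors the logic of `print_match` but returns the name instead of printing it.

```python
import re

# Hypothetical extra entry ('word') added for illustration only;
# it is not one of the patterns required by the lab.
patterns = [
    (r'^\d+$', 'integer'),
    (r'^\d+\.\d+$', 'real number'),
    (r'^[A-Za-z]+$', 'word'),
]

def classify(string, patterns):
    """Return the name of the first pattern that matches the whole string."""
    for pattern, name in patterns:
        if re.fullmatch(pattern, string):
            return name
    return "unknown"

for line in ["1234", "5.6", "hello", "hello!"]:
    print(f"{line} → {classify(line, patterns)}")
```

Because the first matching pattern wins, the order of the entries matters whenever two patterns could match the same string.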


### **🧩 Pattern 1 – Match UPM Course Code**

A valid UPM course code:
- Begins with **three uppercase letters (A–Z)**.  
- Followed by **four digits**.  
- There may or may not be **one space** between letters and digits.

✅ Valid examples: `SKM3206`, `CSC 3100`, `QKB1234`  
❌ Invalid examples: `sse123`, `PKP567`, `S km 3001`

### **💰 Pattern 2 – Match Malaysian Fee Format**

Rules:
- Must begin with `RM`.  
- Can include **commas** as thousand separators.  
- May have **optional cents (two digits)** after a decimal point.  

✅ Valid: `RM1`, `RM1.99`, `RM2,000.99`, `RM1,234,567.89`  
❌ Invalid: `RM1.9`, `RM10,23.4`

### **☎️ Pattern 3 – Match Malaysian Phone Numbers**

A valid phone number:
- Area code in parentheses `(03)` or with a hyphen `03-`.  
- Followed by **8 digits**.  
- May include **one space** after the closing parenthesis.

✅ Valid: `(03)23456789`, `(05) 23456789`, `04-23456789`  
❌ Invalid: `(123)4567890`, `456-7890`

---
## **Task:**
1. Copy the `match_strings.py` code into the code section below.
2. For each of the patterns above, add a regex entry to the `patterns` list.
3. Paste or type your test strings inside the `user_input` variable.
4. Run the cell to see which pattern each line matches.
---

---
## ✅ **Submission Reminder:**  
- Upload your Colab link to Moodle (as *Viewer link*).  
- Ensure the file name is `Lab1_StudentID_Name.ipynb`.  
---
