
# Regular Expressions for Text Mining Applications

In [1]:
import re

# Sample: customer transaction record
sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

print(sentence)

On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 from our New York branch at 125 Market St. He rated his experience 9/10. 


**1. SEARCH: Does a pattern appear in the text?**

In [2]:
# Does the text contain the company branch location "New York"?
print("Contains 'New York'?")
print(bool(re.search(r"New York", sentence)))  # True

Contains 'New York'?
True


In [3]:
# Does the text mention any monetary value (e.g. $xxx or $x,xxx)?
print("Contains a monetary amount?")
print(bool(re.search(r"\$\d[\d,]*", sentence)))  # True

Contains a monetary amount?
True
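`re.search` returns a match object (or `None`), so it can also report what was matched and where. The sketch below adds a case-insensitive lookup; the lowercase query `"new york"` is only an illustrative assumption about how a branch name might be typed.

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

# re.IGNORECASE lets the lowercase query match the capitalized text
match = re.search(r"new york", sentence, flags=re.IGNORECASE)

if match:
    # group() is the matched text, span() its (start, end) character offsets
    print("Found:", match.group(), "at", match.span())
else:
    print("Not found")
```

Checking `if match:` before calling `.group()` avoids an `AttributeError` when the pattern is absent.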


**2. EXTRACT: Find and retrieve matching patterns**

In [4]:
# Extract the date of the transaction
date = re.search(r"\b\d{2}/\d{2}/\d{4}\b", sentence)
print("Date of transaction:", date.group())  # '03/12/2025'

Date of transaction: 03/12/2025
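The matched date is still just a string. A possible next step, sketched here, is to split it with capture groups and parse it with `datetime.strptime`. The notebook does not say whether the date is month/day or day/month, so the `%m/%d/%Y` format below is an assumption.

```python
import re
from datetime import datetime

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

# Parentheses create capture groups for the three date components
date_match = re.search(r"\b(\d{2})/(\d{2})/(\d{4})\b", sentence)

if date_match:
    print("Components:", date_match.groups())
    # Assumption: US-style month/day/year ordering
    parsed = datetime.strptime(date_match.group(), "%m/%d/%Y").date()
    print("ISO date:", parsed.isoformat())
```

`strptime` also validates the value: a string such as `99/99/2025` fits the regular expression but raises a `ValueError` when parsed.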


In [5]:
# Extract the first name and last name (the first pair of consecutive capitalized words)
name = re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence)
print("Customer name:", name.group())  # 'John Smith'


Customer name: John Smith


In [6]:
# Extract numeric values; "a/b" pairs such as ratings stay together,
# but the date and the comma-separated amount are split into pieces
numbers = re.findall(r"\b\d+(?:/\d+)?\b", sentence)
print("Numeric values:", numbers)  # ['03/12', '2025', '3', '6', '400', '125', '9/10']

Numeric values: ['03/12', '2025', '3', '6', '400', '125', '9/10']
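The output above splits the date into `'03/12'` and `'2025'` and the amount into `'6'` and `'400'`. One way to keep each numeric expression whole is an alternation that tries the most specific forms first. This is a sketch tailored to the sample sentence, not a general-purpose number parser.

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

# Alternatives are tried left to right, so specific forms come before \d+
numeric_pattern = (
    r"\d{2}/\d{2}/\d{4}"   # full date
    r"|\$\d[\d,]*"         # monetary amount
    r"|\d+/\d+"            # rating such as 9/10
    r"|\d+"                # any remaining integer
)

tokens = re.findall(numeric_pattern, sentence)
print(tokens)
# ['03/12/2025', '3', '$6,400', '125', '9/10']
```

Because the pattern contains no capture groups, `re.findall` returns the full text of each match.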


In [7]:
# Extract all monetary amounts
amounts = re.findall(r"\$\d[\d,]*", sentence)
print("Monetary amounts:", amounts)  # ['$6,400']

Monetary amounts: ['$6,400']
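Extracted amounts are strings, so they cannot be summed or compared as numbers yet. A minimal sketch of the conversion, assuming whole-dollar amounts with commas as thousands separators (as in the sample):

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

amounts = re.findall(r"\$\d[\d,]*", sentence)

# Strip the dollar sign and the thousands separators, then convert to int
values = [int(a.lstrip("$").replace(",", "")) for a in amounts]

print("Numeric amounts:", values)
print("Total:", sum(values))
```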


In [8]:
# Extract all capitalized words (potential entities or brands)
caps = re.findall(r"\b[A-Z][a-zA-Z]+\b", sentence)
print("Capitalized words:", caps)
# ['On', 'John', 'Smith', 'MacBook', 'Pro', 'New', 'York', 'Market', 'St', 'He']

Capitalized words: ['On', 'John', 'Smith', 'MacBook', 'Pro', 'New', 'York', 'Market', 'St', 'He']


**3. COUNT: How many times do certain patterns appear?**


In [9]:
# Count how many capitalized words appear (includes sentence-initial words such as 'On' and 'He', not only proper nouns)
print("Count of capitalized words:", len(caps))

Count of capitalized words: 10


In [10]:
# Count how many separate runs of digits appear (the date counts as 3, the amount and the rating as 2 each)
print("Count of numeric patterns:", len(re.findall(r"\d+", sentence)))

Count of numeric patterns: 9
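The count depends on how a "numeric pattern" is defined. The comparison below reuses the slash-aware pattern from the extraction step and shows that the same sentence yields a different total.

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

# Every run of digits is counted separately
digit_runs = re.findall(r"\d+", sentence)

# "a/b" pairs such as 9/10 are counted as a single item
slash_aware = re.findall(r"\b\d+(?:/\d+)?\b", sentence)

print("Digit runs:", len(digit_runs))         # 9
print("Slash-aware items:", len(slash_aware))  # 7
```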


In [11]:
# Count how many times the standalone word 'Pro' appears (e.g. in product names)
print("Count of 'Pro' occurrences:", len(re.findall(r"\bPro\b", sentence)))

Count of 'Pro' occurrences: 1


**4. REPLACE: Clean or modify text**

In [12]:
# Replace the customer's name with a placeholder for anonymization
anon = re.sub(r"\bJohn Smith\b", "[CUSTOMER]", sentence)
print("Anonymized name:", anon)

Anonymized name: On 03/12/2025, [CUSTOMER] purchased 3 MacBook Pro laptops for $6,400 from our New York branch at 125 Market St. He rated his experience 9/10. 
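Hard-coding `John Smith` only works for this one record. A possible generalization, sketched below, reuses the name pattern from the extraction step and substitutes whatever it finds. It inherits that pattern's limits: it takes the first pair of consecutive capitalized words, which is not guaranteed to be a person's name in other texts.

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

name_pattern = r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"
name_match = re.search(name_pattern, sentence)

if name_match:
    # re.escape protects against regex metacharacters in the extracted text
    anon = re.sub(re.escape(name_match.group()), "[CUSTOMER]", sentence)
else:
    anon = sentence  # nothing detected, leave the text unchanged

print(anon)
```

Only the detected name is replaced, so other capitalized pairs such as `New York` remain in the text.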


In [13]:
# Replace all monetary amounts with a tag <AMOUNT>
masked = re.sub(r"\$\d[\d,]*", "<AMOUNT>", anon)
print("Masked monetary amounts:", masked)

Masked monetary amounts: On 03/12/2025, [CUSTOMER] purchased 3 MacBook Pro laptops for <AMOUNT> from our New York branch at 125 Market St. He rated his experience 9/10. 


In [14]:
# Replace runs of two or more whitespace characters (if any) with a single space
cleaned_spaces = re.sub(r"\s{2,}", " ", masked)
print("Normalized spacing:", cleaned_spaces)

Normalized spacing: On 03/12/2025, [CUSTOMER] purchased 3 MacBook Pro laptops for <AMOUNT> from our New York branch at 125 Market St. He rated his experience 9/10. 


In [15]:
# Remove numbers, punctuation, and symbols for further NLP analysis
clean_text = re.sub(r"[^A-Za-z\s]", "", sentence)
clean_text = re.sub(r"\s{2,}", " ", clean_text).strip()
print("Cleaned text:", clean_text)

Cleaned text: On John Smith purchased MacBook Pro laptops for from our New York branch at Market St He rated his experience
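The cleaned text is typically lowercased and split into tokens before further analysis. A minimal sketch using plain whitespace splitting; real pipelines often use a dedicated tokenizer instead.

```python
import re

sentence = (
    "On 03/12/2025, John Smith purchased 3 MacBook Pro laptops for $6,400 "
    "from our New York branch at 125 Market St. "
    "He rated his experience 9/10. "
)

# Same cleaning steps as above: keep letters and whitespace, collapse gaps
clean_text = re.sub(r"[^A-Za-z\s]", "", sentence)
clean_text = re.sub(r"\s{2,}", " ", clean_text).strip()

# Lowercase so that 'Pro' and 'pro' would count as the same token
tokens = clean_text.lower().split()

print("Tokens:", tokens)
print("Number of tokens:", len(tokens))
```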
