**Note: For this assignment, you may only use standard Python and the `re` (Regular Expression) module. Advanced libraries such as NumPy and Pandas are not permitted.**

## Exercise 1

Use `re.search` to find whether a string contains a phone number. The pattern that you write should detect a phone number in the following strings.  
```
"Call me at 382-384-3840."  
"my number is (510) 849-3519. Call me!"
```  
It should not find a match in the following strings.
```
"my number is 510-849-35192"  
"here’s my number: 510-849.3519"
``` 
Consider writing your own tests as well.

In [1]:
import re
from datetime import datetime, timedelta

In [2]:
# YOUR CODE HERE
texts = [
    'Call me at 382-384-3840.',
    'my number is (510) 849-3519. Call me!',
    'my number is 510-849-35192',
    'here’s my number: 510-849.3519'
]

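# The area code is either "(ddd) " or "ddd-", so unbalanced parentheses are rejected;
# the lookarounds reject extra digits touching either end of the number.
phone = r'(?<!\d)(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?!\d)'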
for t in texts:
    match = re.search(phone, t)
    if match:
        print(match.group(), ': Match')
    else:
        print(t, ': No match')

382-384-3840 : Match
(510) 849-3519 : Match
my number is 510-849-35192 : No match
here’s my number: 510-849.3519 : No match
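The prompt suggests writing extra tests. A minimal sketch of such a check is given here, assuming a stricter pattern named `strict_phone` (a hypothetical name, not part of the assignment) and two made-up negative cases: a number with an extra leading digit and one with an unbalanced parenthesis.

```python
import re

# Hypothetical stricter pattern: the area code is either "(ddd) " or "ddd-",
# and the lookarounds reject digits touching either end of the number.
strict_phone = re.compile(r'(?<!\d)(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?!\d)')

# Each case pairs a string with whether a match is expected.
# The last two strings are invented for illustration.
cases = [
    ('Call me at 382-384-3840.', True),
    ('my number is (510) 849-3519. Call me!', True),
    ('my number is 510-849-35192', False),
    ('here’s my number: 510-849.3519', False),
    ('order id 1510-849-3519', False),
    ('typo: 510)-849-3519', False),
]

results = [(text, bool(strict_phone.search(text)) == expected)
           for text, expected in cases]
for text, passed in results:
    print('PASS' if passed else 'FAIL', '|', text)
```

Keeping the expected outcome next to each string makes it easy to add further cases when the pattern changes.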


## Exercise 2

Use `re.sub` to alter the string below so that the dates have a common format that uses a dash for the day, month, and year separator.  
```
03/12/2018, 03.13.18, 03/14/2018, 03:15:2018
```

In [3]:
# YOUR CODE HERE
text = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'

re.sub(r'[/.:]', '-', text)

'03-12-2018, 03-13-18, 03-14-2018, 03-15-2018'
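Replacing every `/`, `.` or `:` works for this string but would also rewrite punctuation outside dates, and it leaves `18` as a two-digit year. A possible alternative is sketched here: it only touches date-shaped tokens that use the same separator twice. The names `date_like` and `to_dashes` are hypothetical, and expanding a two-digit year with `20` is an assumption that the exercise does not state.

```python
import re

text = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'

# Two digits, a separator, two digits, the SAME separator (\2), then a 2- or 4-digit year.
date_like = re.compile(r'\b(\d{2})([/.:])(\d{2})\2(\d{2}|\d{4})\b')

def to_dashes(match):
    first, _sep, second, year = match.groups()
    if len(year) == 2:
        year = '20' + year  # assumption: two-digit years belong to the 2000s
    return f'{first}-{second}-{year}'

normalized = date_like.sub(to_dashes, text)
print(normalized)
```

Passing a function to `re.sub` lets each match be rebuilt from its groups instead of being replaced with a fixed string.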

## Exercise 3

Consider the first five sentences of the novel “Little Women” below. Extract the spoken dialog from each sentence.

In [4]:
text = '''
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
'''
print(text)


"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."



In [5]:
# YOUR CODE HERE
dialog = re.findall(r'"(.*?)"', text)
print(dialog)

["Christmas won't be Christmas without any presents,", "It's so dreadful to be poor!", "I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all,", "We've got Father and Mother, and each other,", "We haven't got Father, and shall not have him for a long time."]
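The extracted quotes keep the comma that closes the spoken part. A small variation is sketched here, assuming that trailing comma should be dropped; it uses the negated class `[^"]*` instead of `.*?`, which also matches quotes that wrap across lines. The two-sentence `sample` and the names `quote` and `cleaned` are for illustration only.

```python
import re

sample = '''"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.'''

# Everything between a pair of double quotes, without crossing another quote.
quote = re.compile(r'"([^"]*)"')

# Drop only a trailing comma; other end punctuation is kept.
cleaned = [q.rstrip(',') for q in quote.findall(sample)]
print(cleaned)
```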


## Exercise 4

In this exercise, you will work with the `email_test.txt` file (attached), using regular expressions.\
`Original Dataset: https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus`

In [6]:
# YOUR CODE HERE: open file
file_path = 'email_test.txt'
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()

print(content[:500])

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; char


#### Simple Fraudulent email detection

1. Count how many emails contain urgency-related words (URGENT, IMMEDIATELY, QUICK, ASSISTANCE, CONFIDENTIAL). 
Calculate what percentage of emails use these tactics.

In [7]:
emails = re.split(r"(?m)^From r", content)
emails = [e.strip() for e in emails if e.strip()]

urgent_pattern = re.compile(r"\b(?:URGENT|IMMEDIATELY|QUICK|ASSISTANCE|CONFIDENTIAL)\b", re.IGNORECASE)

urgent_count = sum(1 for mail in emails if urgent_pattern.search(mail))

total_emails = len(emails)

percentage = (urgent_count / total_emails) * 100

print("Total emails:", total_emails)
print("Emails containing urgency keywords:", urgent_count)
print("Percentage:", percentage, "%")

Total emails: 1330
Emails containing urgency keywords: 1143
Percentage: 85.93984962406014 %
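To see which keywords drive that share, the same idea can be broken down per keyword. This is a minimal sketch on three made-up emails (`sample_emails` is not the real dataset), using a plain dictionary so that only standard Python and `re` are needed.

```python
import re

# Invented sample emails, for illustration only.
sample_emails = [
    "URGENT business proposal. Please reply immediately.",
    "Strictly confidential: I need your assistance.",
    "Lunch on Friday?",
]

urgency = re.compile(
    r"\b(URGENT|IMMEDIATELY|QUICK|ASSISTANCE|CONFIDENTIAL)\b", re.IGNORECASE)

keyword_counts = {}
for mail in sample_emails:
    # A set ensures each keyword is counted at most once per email.
    for word in {w.upper() for w in urgency.findall(mail)}:
        keyword_counts[word] = keyword_counts.get(word, 0) + 1

flagged = sum(1 for mail in sample_emails if urgency.search(mail))
share = flagged / len(sample_emails) * 100
print(keyword_counts, f"{share:.1f}%")
```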


2. Find all mentions of money amounts in the email bodies (e.g., `US$25M`, `$100,000.00`, `USD$31,000,000.00`). Calculate:
- Total number of money mentions across all emails
- The largest amount mentioned
- The smallest amount mentioned
- Average amount per email

In [8]:
money_pattern = re.compile(r'(?:US\$|USD\$|\$)\s?\d[\d,]*(?:\.\d+)?\s?(?:M|m|million|Million)?(?![A-Za-z])', re.IGNORECASE)

money_mentions = money_pattern.findall(content)

def parse_money(val):
    v = val.upper().replace("USD", "").replace("US$", "").replace("$", "").strip()
    multiplier = 1
    if v.endswith("M"):
        multiplier = 1_000_000
        v = v[:-1]
    elif v.endswith("MILLION"):
        multiplier = 1_000_000
        v = v.replace("MILLION", "")
    v = v.replace(",", "").strip()
    try:
        return float(v) * multiplier
    except ValueError:
        return None

money_values = [
    parse_money(m) for m in money_mentions
    if parse_money(m) not in (None, 0.0)
]

print("Total money mentions:", len(money_values))
print(f"Largest amount: ${max(money_values):,.2f}")
print(f"Smallest amount: ${min(money_values):,.2f}")
print(f"Average amount mentioned per email: ${sum(money_values)/len(emails):,.2f}")

Total money mentions: 1967
Largest amount: $85,050,000,000,000.00
Smallest amount: $0.28
Average amount mentioned per email: $409,842,566,355.90
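A quick way to sanity-check the parsing is to run it on the three example formats from the task. The sketch here assumes a slightly different design: the number and the optional `M`/`million` suffix are captured as groups, so no string stripping is needed afterwards. The names `money`, `to_number` and the `sample` sentence are hypothetical.

```python
import re

# Group 1: the digits (with commas and optional decimals); group 2: optional suffix.
money = re.compile(
    r'(?:USD?\$|\$)\s?(\d[\d,]*(?:\.\d+)?)\s?(million|m)?(?![A-Za-z])',
    re.IGNORECASE)

def to_number(match):
    digits, suffix = match.groups()
    value = float(digits.replace(',', ''))
    # Any captured suffix ("M" or "million") means the value is in millions.
    return value * 1_000_000 if suffix else value

sample = "He left US$25M, then $100,000.00 and later USD$31,000,000.00 in the account."
amounts = [to_number(m) for m in money.finditer(sample)]

print(len(amounts), max(amounts), min(amounts), sum(amounts) / len(amounts))
```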


3. Extract all mentions of deaths or deceased persons (e.g., "late father", "died", "deceased", "death of").\
   What percentage of emails use death as part of their story?

In [9]:
# YOUR CODE HERE
death_pattern = re.compile(r"\b(?:LATE FATHER|DIED|DECEASED|DEATH OF)\b", re.IGNORECASE)
death_count = sum(1 for mail in emails if death_pattern.search(mail))

percentage = (death_count/total_emails) * 100

print("Percentage of emails mentioning death or deceased persons:", percentage)

Percentage of emails mentioning death or deceased persons: 51.72932330827068
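Email bodies are often hard-wrapped, so a phrase such as "late father" may be split across two lines and missed by a pattern with a literal space. A possible variation is sketched here that uses `\s+` between words and returns the mentions themselves. The extra relatives (`husband`, `mother`, `wife`) and the `sample` text are assumptions added for illustration.

```python
import re

# \s+ lets a phrase match even when a line break falls between its words.
death = re.compile(
    r"\b(?:late\s+(?:father|husband|mother|wife)|died|deceased|death\s+of)\b",
    re.IGNORECASE)

sample = ("My late\nfather died in 1999. The deceased left no will. "
          "Later we heard of the death  of his wife.")

# Collapse internal whitespace so wrapped phrases compare equal.
mentions = [" ".join(m.lower().split()) for m in death.findall(sample)]
print(mentions)
```

Note that "Later" is not matched, because `late` must be followed by whitespace and one of the listed relatives.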


4. Many emails mention percentage splits of money (e.g., "70% for us", "20% for you", "10% for expenses"). \
   Extract all percentage distributions and identify the most common split pattern offered to recipients.

   Example of the most common split patterns:\
   70% - 20% - 10%: appears 15 times\
   75% - 20% - 5%: appears 12 times\
   60% - 30% - 10%: appears 8 times\
   80% - 15% - 5%: appears 6 times\
   55% - 30% - 10% - 5%: appears 4 times

In [10]:
percent_pattern = re.compile(r'(\d{1,3})\s*%')
splits = []
pattern_counts = {}

for email in emails:
    nums = percent_pattern.findall(email)
    if len(nums) >= 2:
        nums = list(map(int, nums))
        # Keep only emails whose percentages add up to exactly 100
        if sum(nums) == 100:
            norm = " - ".join(f"{n}%" for n in sorted(nums, reverse=True))
            splits.append(norm)
            if norm in pattern_counts:
                pattern_counts[norm] += 1
            else:
                pattern_counts[norm] = 1

sorted_patterns = sorted(pattern_counts.items(), key=lambda x: x[1], reverse=True)

for pat, cnt in sorted_patterns[:5]:
    print(f"{pat}: appears {cnt} times")

60% - 30% - 10%: appears 46 times
70% - 25% - 5%: appears 43 times
70% - 20% - 10%: appears 34 times
60% - 35% - 5%: appears 33 times
75% - 20% - 5%: appears 27 times
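Summing every percentage in an email discards messages that mention an unrelated figure or repeat the split. One possible heuristic, sketched here as an assumption rather than as the required method, looks for the first run of consecutive percentages that adds up to exactly 100. The function name `first_split` and the sample sentences are hypothetical.

```python
import re

percent = re.compile(r'(\d{1,3})\s*%')

def first_split(text):
    """Return the first run of 2+ consecutive percentages summing to 100, or None."""
    values = [int(n) for n in percent.findall(text)]
    for start in range(len(values)):
        total = 0
        for end in range(start, len(values)):
            total += values[end]
            if total == 100 and end > start:
                run = sorted(values[start:end + 1], reverse=True)
                return " - ".join(f"{v}%" for v in run)
            if total > 100:
                break  # this run overshot; try the next starting point
    return None

sample = "We offer 20% to you, 70% for us and 10% for expenses. Reply with 100% confidence."
print(first_split(sample))
print(first_split("Interest is 5% yearly."))
```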


5. Create a "scam score" for each email based on:\
    Urgency keywords (1 point each)\
    Money mentions (2 points each)\
    Percentages offered (1 point each)\
    Death mentions (1 point)\
    ALL CAPS usage (1 point if >20% of text)

    ***Rank the top 10 highest-scoring emails*** 

In [11]:
# YOUR CODE HERE

scam_scores = []

for idx, mail in enumerate(emails, 1):
    score = 0
    # Urgency
    score += len(urgent_pattern.findall(mail)) * 1
    # Money
    score += len(money_pattern.findall(mail)) * 2
    # Percentages
    score += len(percent_pattern.findall(mail)) * 1
    # Death mentions
    if death_pattern.search(mail):
        score += 1
    # ALL CAPS
    letters = [c for c in mail if c.isalpha()]
    if letters:
        caps_ratio = sum(1 for c in letters if c.isupper()) / len(letters)
        if caps_ratio > 0.2:
            score += 1
    
    scam_scores.append((idx, score))

scam_scores_sorted = sorted(scam_scores, key=lambda x: x[1], reverse=True)

for email_id, score in scam_scores_sorted[:10]:
    print(f"Email #{email_id}: Scam score = {score}")

Email #134: Scam score = 38
Email #591: Scam score = 32
Email #633: Scam score = 31
Email #634: Scam score = 31
Email #244: Scam score = 29
Email #394: Scam score = 29
Email #945: Scam score = 28
Email #416: Scam score = 27
Email #461: Scam score = 27
Email #462: Scam score = 27
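A ranking is easier to interpret when each score can be traced back to its components. This is a minimal sketch of a per-email breakdown, assuming simplified patterns and a made-up `sample` message; `score_breakdown` is a hypothetical helper, not part of the assignment.

```python
import re

urgency = re.compile(r"\b(?:URGENT|IMMEDIATELY|QUICK|ASSISTANCE|CONFIDENTIAL)\b", re.IGNORECASE)
money = re.compile(r"(?:USD?\$|\$)\s?\d[\d,]*(?:\.\d+)?", re.IGNORECASE)
percent = re.compile(r"\d{1,3}\s*%")
death = re.compile(r"\b(?:late father|died|deceased|death of)\b", re.IGNORECASE)

def score_breakdown(mail):
    """Return the points contributed by each rule, plus their total."""
    letters = [c for c in mail if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    parts = {
        "urgency": len(urgency.findall(mail)),    # 1 point each
        "money": 2 * len(money.findall(mail)),    # 2 points each
        "percent": len(percent.findall(mail)),    # 1 point each
        "death": 1 if death.search(mail) else 0,  # 1 point if present
        "caps": 1 if caps_ratio > 0.2 else 0,     # 1 point if >20% uppercase
    }
    parts["total"] = sum(parts.values())
    return parts

sample = "Urgent: my late father left $5,000,000. You keep 20% and I keep 80%."
breakdown = score_breakdown(sample)
print(breakdown)
```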


6. Identify emails that appear to be duplicates or near-duplicates (same sender, similar subject, sent within 24 hours). How many duplicate emails exist?

In [12]:
# YOUR CODE HERE
from_re = re.compile(r"From:\s*(.*)", re.IGNORECASE)
subject_re = re.compile(r"Subject:\s*(.*)", re.IGNORECASE)
date_re = re.compile(r"Date:\s*(.*)", re.IGNORECASE)

def parse_date(date_str):
    try:
        return datetime.strptime(date_str.strip(), "%a, %d %b %Y %H:%M:%S %z")
    except ValueError:
        return None

records = []
for mail in emails:
    sender = None
    subject = None
    date = None
    
    m_from = from_re.search(mail)
    if m_from:
        sender = m_from.group(1).strip()
    
    m_subject = subject_re.search(mail)
    if m_subject:
        subject = m_subject.group(1).strip().lower()
    
    m_date = date_re.search(mail)
    if m_date:
        date = parse_date(m_date.group(1))
    
    if sender and subject and date:
        records.append((sender, subject, date))

duplicates = []
for i in range(len(records)):
    for j in range(i+1, len(records)):
        sender1, subj1, date1 = records[i]
        sender2, subj2, date2 = records[j]
        if sender1 == sender2 and subj1 == subj2:
            if date1 and date2 and abs(date1 - date2) <= timedelta(hours=24):
                duplicates.append((i, j))

# Each entry in `duplicates` is a pair (i, j) of matching emails
print("Number of (near-)duplicate email pairs:", len(duplicates))

Number of (near-)duplicate email pairs: 252
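The nested loop counts pairs, so three copies of one message contribute three pairs. If the goal is to count duplicate emails instead, one option is to group by sender and subject, sort each group by date, and count every message that follows the previous one within 24 hours. This sketch uses invented `records` in the same `(sender, subject, date)` shape; the counting rule is an assumption about what "duplicate" should mean.

```python
from datetime import datetime, timedelta

# Invented records for illustration: (sender, subject, date).
records = [
    ("a@x.com", "urgent help", datetime(2002, 10, 30, 9, 0)),
    ("a@x.com", "urgent help", datetime(2002, 10, 30, 21, 0)),
    ("a@x.com", "urgent help", datetime(2002, 11, 5, 9, 0)),
    ("b@y.com", "hello", datetime(2002, 10, 30, 9, 0)),
]

# Group the dates by (sender, subject).
groups = {}
for sender, subject, date in records:
    groups.setdefault((sender, subject), []).append(date)

duplicate_emails = 0
for dates in groups.values():
    dates.sort()
    # Compare each email with the one sent just before it in the same group.
    for prev, cur in zip(dates, dates[1:]):
        if cur - prev <= timedelta(hours=24):
            duplicate_emails += 1

print("Duplicate emails:", duplicate_emails)
```

Sorting within groups avoids comparing every pair of emails, which matters as the corpus grows.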
