
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [None]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")
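
Once the file exists, it can be read back with `pathlib`. The sketch below shows one way to handle a missing file without crashing. The helper name `read_lines` is hypothetical, and the example uses a temporary directory so it does not touch the real `contacts_raw.txt`.

```python
from pathlib import Path
import tempfile


def read_lines(path: Path) -> list[str]:
    """Hypothetical helper: return the file's lines, or [] if the file is missing."""
    try:
        return path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        # Friendly message instead of a traceback.
        print(f"Could not find {path.name}; create it first.")
        return []


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "contacts_raw.txt"
    before = read_lines(target)  # file does not exist yet
    target.write_text("Sara M. , sara@mail.co , 214 555 8888\n", encoding="utf-8")
    after = read_lines(target)  # file exists now

print(before)
print(after)
```

Returning an empty list keeps the caller simple, but returning `None` or exiting early (as a `main()` function might) would be equally reasonable.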

## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
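
As an illustration of steps 2 and 3, the following minimal sketch contrasts partial and full regex matching and applies the "last 10 digits" rule. The helper name `last_ten_digits` is an assumption made for this example, not a required name.

```python
import re

# Same pattern as in step 2.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

candidate = " <alice@example.com> "

# search() accepts any matching substring, so the angle brackets slip through.
partial_ok = EMAIL_PATTERN.search(candidate) is not None

# fullmatch() requires the whole (stripped) string to match, so this is rejected.
full_ok = EMAIL_PATTERN.fullmatch(candidate.strip()) is not None


def last_ten_digits(raw: str) -> str:
    """Hypothetical helper: drop non-digits, keep the last 10 digits, else return ""."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


print(partial_ok, full_ok)
print(last_ten_digits("+1 (469) 555-1234"))
print(repr(last_ten_digits("972-555-777")))
```

The second phone number has only nine digits, which is why it normalizes to an empty string.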

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
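
Steps 5 to 7 (de-duplicate, write CSV, preserve order) can be sketched as follows. The rows are made-up sample values based on the provided data, and the CSV is written to an in-memory buffer rather than to `contacts_clean.csv` so the example has no side effects.

```python
import csv
import io

rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    {"name": "Nima", "email": "NIMA@example.io", "phone": "9727772121"},
    {"name": "duplicate", "email": "Alice@Example.com", "phone": "4695551234"},
]

# Keep the first occurrence of each email, compared case-insensitively.
seen = set()
unique_rows = []
for row in rows:
    key = row["email"].casefold()
    if key in seen:
        continue
    seen.add(key)
    unique_rows.append(row)  # appending keeps first-appearance order

# Write the header and rows; no sorting is applied.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "email", "phone"])
writer.writeheader()
writer.writerows(unique_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A set for lookups plus a list for output is one simple way to combine fast duplicate checks with stable ordering.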

In [3]:
%%writefile q1_crm_cleanup.py
from pathlib import Path
import re
import csv

EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def validate_email(email: str) -> bool:
    """Return True if the provided email matches the regex exactly."""
    email = email.strip()
    return bool(EMAIL_REGEX.fullmatch(email))


def normalize_phone(raw_number: str) -> str:
    """Normalize a phone number by removing non-digit characters and returning the last 10 digits if available."""
    digits = re.sub(r"\D", "", raw_number)
    if len(digits) < 10:
        return ""
    return digits[-10:]


def parse_contacts(lines):
    """Parse raw lines into cleaned contact dictionaries.

    Supports both layouts found in contacts_raw.txt: `Name <email> , phone`
    and `name , email , phone`. Rows with invalid emails are skipped, phones
    are normalized, and duplicate emails (case-insensitive) are dropped,
    keeping the first occurrence.
    """
    contacts = []
    seen = set()
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        bracketed = re.fullmatch(r"(.*?)\s*<([^<>]*)>", parts[0])
        if bracketed and len(parts) >= 2:
            # Layout: Name <email> , phone
            name, email, phone = bracketed.group(1), bracketed.group(2), parts[1]
        elif len(parts) >= 3:
            # Layout: name , email , phone
            name, email, phone = parts[:3]
        else:
            continue
        email = email.strip()
        if not validate_email(email):
            continue
        key = email.casefold()
        if key in seen:
            continue  # later duplicate: keep the first occurrence only
        seen.add(key)
        name = name.strip('"').strip()  # drop quotes around names like "Mehdi A."
        contacts.append({"name": name, "email": email, "phone": normalize_phone(phone)})
    return contacts


def main():
    """Main function to read raw contacts, clean them, and write to a CSV file."""
    input_path = Path("contacts_raw.txt")
    try:
        lines = input_path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print("File not found. Please make sure contacts_raw.txt exists.")
        return
    contacts = parse_contacts(lines)
    output_path = Path("contacts_clean.csv")
    with output_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "phone"])
        for c in contacts:
            writer.writerow([c["name"], c["email"], c["phone"]])
    print("contacts_clean.csv created successfully!")


if __name__ == "__main__":
    main()

Writing q1_crm_cleanup.py


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
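
A minimal sketch of two patterns that fit these requirements: table-driven cases with `subTest`, and feeding a multi-line string through `splitlines()` instead of reading a file. The function `count_digits` is a hypothetical stand-in for a helper under test, not part of the assignment.

```python
import io
import unittest


def count_digits(raw: str) -> int:
    """Hypothetical stand-in for a helper under test."""
    return sum(ch.isdigit() for ch in raw)


class TestDigitCounting(unittest.TestCase):
    def test_formats(self):
        cases = {"(469)555-9999": 10, "972-555-777": 9, "abc": 0}
        for raw, expected in cases.items():
            # subTest reports each failing case separately instead of stopping at the first.
            with self.subTest(raw=raw):
                self.assertEqual(count_digits(raw), expected)

    def test_multiline_input(self):
        text = "a , 111\nb , 22\n"
        lines = text.splitlines()  # same shape as lines read from a file
        self.assertEqual([count_digits(line) for line in lines], [3, 2])


# Run the tests in-process (handy inside a notebook) and capture the report.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDigitCounting)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.testsRun, result.wasSuccessful())
```

From a terminal, the same kind of test file can be run with `python -m unittest`.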




In [2]:
%%writefile test_crm_cleanup.py
import unittest
from q1_crm_cleanup import validate_email, normalize_phone, parse_contacts


class TestCrmCleanup(unittest.TestCase):
    def test_email_validation(self):
        """Test valid and invalid email formats."""
        valid_emails = [
            "user@example.com",
            "john.doe+test@sub.domain.co",
            "USER123@EXAMPLE.ORG"
        ]
        for email in valid_emails:
            self.assertTrue(validate_email(email), f"{email} should be valid")
        invalid_emails = [
            "userexample.com",
            "john@.com",
            "invalid@domain",
            "bad@domain,com",
            ""
        ]
        for email in invalid_emails:
            self.assertFalse(validate_email(email), f"{email} should be invalid")

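    def test_phone_normalization(self):
        """Test parentheses, dashes, spaces, dots, country codes, and too-short input."""
        cases = {
            "(469)555-9999": "4695559999",
            "214 555 8888": "2145558888",  # spaces only
            "972.777.2121": "9727772121",  # dots
            "+1 (469) 555-1234": "4695551234",  # country code dropped
            "+1-972-777-2121": "9727772121",
            "5555555555": "5555555555",
            "972-555-777": "",  # only 9 digits after stripping non-digits
            "abc": "",
        }
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)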

    def test_parse_contacts(self):
        """Test parsing from a list of strings and deduplication by email."""
        sample_lines = [
            "Alice Johnson, alice@example.com, (469) 555-1234",
            "Bob Smith, bob@example.com, 972-555-7777",
            "Invalid User, invalid-email, 1234567890",
            "Case Duplicate, Alice@Example.com, +1-234-567-8901"
        ]
        contacts = parse_contacts(sample_lines)
        # The invalid email should be removed and the duplicate email (case-insensitive) dropped
        self.assertEqual(len(contacts), 2)
        # Check first contact details
        self.assertEqual(contacts[0]["name"], "Alice Johnson")
        self.assertEqual(contacts[0]["email"], "alice@example.com")
        self.assertEqual(contacts[0]["phone"], "4695551234")
        # Check second contact details
        self.assertEqual(contacts[1]["name"], "Bob Smith")
        self.assertEqual(contacts[1]["email"], "bob@example.com")
        self.assertEqual(contacts[1]["phone"], "9725557777")


if __name__ == "__main__":
    unittest.main()


Overwriting test_crm_cleanup.py


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
