
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [2]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
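
As a rough illustration of steps 2, 3, 5, and 7, the sketch below shows why full matching matters, how the phone rule behaves on long and short inputs, and one way to de-duplicate while keeping input order. The names `EMAIL_PATTERN`, `normalize_phone`, and `dedupe_by_email` are hypothetical and chosen only for this example; they are not required by the assignment.

```python
import re

# Pattern copied from step 2 of the task list.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_email(text):
    """Return True only if the whole stripped string is an email address."""
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None


def normalize_phone(raw):
    """Strip non-digits; keep the last 10 digits, or return "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


def dedupe_by_email(rows):
    """Keep the first row per email (case-insensitive), preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = row["email"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Partial matching accepts an address buried in other text; full matching does not.
noisy = "alice@example.com, extra text"
print(EMAIL_PATTERN.search(noisy) is not None)  # True
print(is_valid_email(noisy))                    # False

print(normalize_phone("+1 (469) 555-1234"))     # 4695551234
print(repr(normalize_phone("972-555-777")))     # ''

# Made-up rows for illustration only.
rows = [
    {"name": "Alice Johnson", "email": "alice@example.com"},
    {"name": "Nima", "email": "NIMA@example.io"},
    {"name": "duplicate", "email": "Alice@Example.com"},
]
print([row["name"] for row in dedupe_by_email(rows)])  # ['Alice Johnson', 'Nima']
```

Appending to a list while tracking seen keys in a set keeps first-appearance order without any sorting, which is what step 7 asks for.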

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [8]:
import re
from pathlib import Path

# Configuration constants
EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
INPUT_FILE = "contacts_raw.txt"
OUTPUT_FILE = "contacts_clean.csv"

# Parses the two raw line formats:
#   Name <email> , phone
#   Name , email , phone
# Whitespace around each comma is optional, so "Name <email>, phone" also parses.
# The first alternative deliberately accepts malformed addresses such as
# "bob[at]example.com"; is_valid_email() rejects them afterwards.
LINE_PARSER_REGEX = re.compile(
    r'("?.+?"?)\s*<([a-zA-Z0-9._%+-]+[\[\]@a-zA-Z0-9.-]+)>\s*,\s*(.+)|'
    r'(.+?)\s*,\s*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\s*,\s*(.+)'
)


def phone_cleaner(raw_number):
    """
    Normalize a phone number.
    Strips all non-digits with re.sub, then keeps the last 10 digits.
    Returns "" if fewer than 10 digits remain.
    """
    digits = re.sub(r"\D", "", raw_number)

    if len(digits) >= 10:
        return digits[-10:]
    return ""


def is_valid_email(email_str):
    """
    Return True if the email, stripped of surrounding whitespace,
    fully matches EMAIL_REGEX.
    """
    return re.fullmatch(EMAIL_REGEX, email_str.strip()) is not None


def run_cleanup():
    """
    Read the raw contacts, filter and de-duplicate them, and write the CSV.
    """
    input_path = Path(INPUT_FILE)

    final_contacts = []
    seen_emails = set()  # Case-folded emails already kept, for de-duplication

    try:
        with input_path.open(encoding="utf-8") as f:
            raw_data = f.readlines()
    except FileNotFoundError:
        print(f"The file '{INPUT_FILE}' wasn't found.")
        return  # Nothing to clean if the input file is missing

    print(f"Processing {len(raw_data)} lines from raw data...")

    for line in raw_data:
        line = line.strip()
        if not line:
            continue  # Skip blank lines

        match = LINE_PARSER_REGEX.match(line)
        if not match:
            continue  # Skip lines that match neither format

        name, email, raw_phone = None, None, None

        # Pick the groups of whichever regex alternative matched
        if match.group(1):
            name = match.group(1).strip().strip('"')
            email = match.group(2).strip()
            raw_phone = match.group(3).strip()
        elif match.group(4):
            name = match.group(4).strip()
            email = match.group(5).strip()
            raw_phone = match.group(6).strip()

        if not all([name, email, raw_phone]):
            continue

        if is_valid_email(email):  # Keep only rows with a valid email
            email_key = email.casefold()

            if email_key not in seen_emails:  # First occurrence wins
                clean_phone = phone_cleaner(raw_phone)

                final_contacts.append({
                    'name': name,
                    'email': email,
                    'phone': clean_phone
                })

                seen_emails.add(email_key)

    print(f"Finished processing. Found {len(final_contacts)} unique and valid contacts.")

    # CSV output
    output_path = Path(OUTPUT_FILE)

    csv_content = ["name,email,phone"]  # Header row

    for contact in final_contacts:
        # The name is always quoted because it may contain spaces or periods
        row = f'"{contact["name"]}",{contact["email"]},{contact["phone"]}'
        csv_content.append(row)

    try:
        with output_path.open("w", encoding="utf-8") as f:
            f.write('\n'.join(csv_content))
        print(f"Success: clean data written to '{OUTPUT_FILE}'.")
    except OSError as e:
        print(f"Error writing CSV: {e}")


if __name__ == "__main__":
    run_cleanup()

Processing 7 lines from raw data...
Finished processing. Found 5 unique and valid contacts.
Success: clean data written to 'contacts_clean.csv'.
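
The cell above builds each CSV row by hand with an f-string. As an alternative sketch, the standard `csv` module can handle quoting automatically, which matters if a name ever contains a comma or a double quote. The helper name `write_contacts_csv`, the output file `contacts_demo.csv`, and the second sample row are assumptions made up for this illustration.

```python
import csv
from pathlib import Path


def write_contacts_csv(rows, path):
    """Write contact dicts to a CSV file with the header name,email,phone."""
    # newline="" lets the csv module control line endings itself.
    with Path(path).open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(rows)


rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    # Made-up row: the comma and quotes in the name need escaping in CSV.
    {"name": 'Smith, "JR" John', "email": "jr@example.com", "phone": ""},
]
write_contacts_csv(rows, "contacts_demo.csv")  # hypothetical file name
print(Path("contacts_demo.csv").read_text(encoding="utf-8"))
```

With this approach only fields that need quoting are quoted, so plain names such as `Alice Johnson` appear unquoted in the file; both forms are valid CSV.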


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
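
For item 2, one possible way to cover many phone formats without repeating assertions is a table of cases run through `subTest`, so that each failing input is reported separately. This is only a sketch: the class name `TestPhoneCleanerTable` and the `CASES` table are illustrative, and `phone_cleaner` is re-declared here with the Q1 rule so the block runs on its own.

```python
import re
import unittest


def phone_cleaner(raw_number):
    """Same rule as Q1: strip non-digits, keep the last 10, else return ""."""
    digits = re.sub(r"\D", "", raw_number)
    return digits[-10:] if len(digits) >= 10 else ""


class TestPhoneCleanerTable(unittest.TestCase):
    # Each pair is (raw input, expected normalized output).
    CASES = [
        ("+1 (469) 555-1234", "4695551234"),  # country code and parentheses
        ("972.777.2121", "9727772121"),       # dots
        ("214 555 8888", "2145558888"),       # spaces
        ("972-555-777", ""),                  # only 9 digits: too short
        ("", ""),                             # empty input
    ]

    def test_phone_cleaner_cases(self):
        for raw, expected in self.CASES:
            with self.subTest(raw=raw):
                self.assertEqual(phone_cleaner(raw), expected)


if __name__ == "__main__":
    # argv and exit=False keep this runnable inside a notebook cell.
    unittest.main(argv=["first-arg-is-ignored"], exit=False)
```

Adding a new format then means adding one line to `CASES` rather than another assertion.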




In [12]:
import unittest
import re

# Configuration constants
EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# Parses the two raw line formats ("Name <email> , phone" and "Name , email , phone").
# Whitespace around each comma is optional.
LINE_PARSER_REGEX = re.compile(
    r'("?.+?"?)\s*<([a-zA-Z0-9._%+-]+[\[\]@a-zA-Z0-9.-]+)>\s*,\s*(.+)|'
    r'(.+?)\s*,\s*([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\s*,\s*(.+)'
)

def phone_cleaner(raw_number):
    """
    Normalize a phone number to its last 10 digits, or "" if too short. (Copied from Q1)
    """
    digits = re.sub(r"\D", "", raw_number)

    if len(digits) >= 10:
        return digits[-10:]
    return ""


def is_valid_email(email_str):
    """
    Return True if the stripped email fully matches EMAIL_REGEX. (Copied from Q1)
    """
    return re.fullmatch(EMAIL_REGEX, email_str.strip()) is not None


class TestCRMCleanup(unittest.TestCase):
    """Unit tests for the CRM cleaning helpers."""

    def test_email_validation(self):
        """
        Test valid and invalid email addresses using is_valid_email.
        """
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("mehdi.ay@example.org"))
        self.assertTrue(is_valid_email("NIMA@example.io"))

        self.assertFalse(is_valid_email("bob[at]example.com"), "Email with [at] should be invalid")
        self.assertFalse(is_valid_email("missingatsign.com"))
        self.assertFalse(is_valid_email("user@.com"))
        self.assertFalse(is_valid_email("user@domain."))


    def test_phone_normalization(self):
        """
        Test phone_cleaner with different formats and lengths.
        """
        self.assertEqual(phone_cleaner("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(phone_cleaner("972-777-2121"), "9727772121")
        self.assertEqual(phone_cleaner("(469)555-9999"), "4695559999")

        self.assertEqual(phone_cleaner("+1 972 777 2121"), "9727772121")

        self.assertEqual(phone_cleaner("123-4567"), "")
        self.assertEqual(phone_cleaner(""), "")


    def test_parsing_and_deduplication(self):
        """
        Test parsing of a small multi-line string and case-insensitive de-duplication.
        """
        raw_input_data = (
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234\n'
            'Sara M. , sara@mail.co , 214 555 8888\n'
            'Mehdi <MEHDI@example.org> , 469.555.9999\n'
            'duplicate <Alice@Example.com> , 469 555 1234'
        )

        final_rows = []
        seen_emails_casefold = set()  # Case-folded emails already kept

        for line in raw_input_data.strip().split('\n'):
            line = line.strip()
            match = LINE_PARSER_REGEX.match(line)

            name, email, raw_phone = None, None, None
            if match:
                # Pick the groups of whichever regex alternative matched
                if match.group(1):
                    name, email, raw_phone = match.group(1).strip().strip('"'), match.group(2).strip(), match.group(3).strip()
                elif match.group(4):
                    name, email, raw_phone = match.group(4).strip(), match.group(5).strip(), match.group(6).strip()

            # Keep only complete rows with a valid email
            if name and email and raw_phone and is_valid_email(email):
                email_key = email.casefold()

                if email_key not in seen_emails_casefold:  # First occurrence wins
                    clean_phone = phone_cleaner(raw_phone)
                    final_rows.append({'name': name, 'email': email, 'phone': clean_phone})
                    seen_emails_casefold.add(email_key)

        expected_rows = [
            {'name': 'Alice Johnson', 'email': 'alice@example.com', 'phone': '4695551234'},
            {'name': 'Sara M.', 'email': 'sara@mail.co', 'phone': '2145558888'},
            {'name': 'Mehdi', 'email': 'MEHDI@example.org', 'phone': '4695559999'}
        ]

        self.assertEqual(len(final_rows), 3, "Only 3 unique, valid contacts should remain after filtering/deduplication.")  # Assert final count
        self.assertEqual(final_rows, expected_rows, "Parsed and de-duplicated rows do not match the expected output.")  # Assert exact rows and order


if __name__ == '__main__':
    # argv keeps unittest from parsing the notebook's own command-line arguments;
    # exit=False stops it from calling sys.exit() when the tests finish.
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

...
----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK
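
The parsing test above repeats the Q1 loop inline, so it checks a copy of the logic and not a reusable function. One possible alternative, sketched here under the assumption that parsing lives in a hypothetical `parse_line` helper, is to test the parser directly, including a line with no space before the comma. The pattern `LINE_PARSER` below is a simplified stand-in written for this example, not the pattern used in the cells above.

```python
import re

# Simplified stand-in pattern (an assumption for this sketch):
#   Name <anything> , phone   or   Name , token , phone
# Whitespace around each comma is optional.
LINE_PARSER = re.compile(
    r'("?.+?"?)\s*<([^<>]+)>\s*,\s*(.+)|'
    r'(.+?)\s*,\s*([^,\s]+)\s*,\s*(.+)'
)


def parse_line(line):
    """Return (name, email, raw_phone) for one raw line, or None if it does not parse."""
    match = LINE_PARSER.match(line.strip())
    if match is None:
        return None
    if match.group(1) is not None:
        name, email, phone = match.group(1, 2, 3)
    else:
        name, email, phone = match.group(4, 5, 6)
    return name.strip().strip('"'), email.strip(), phone.strip()


print(parse_line('Delaram <delaram@example.io>, +1-972-777-2121'))
# ('Delaram', 'delaram@example.io', '+1-972-777-2121')
print(parse_line('Sara M. , sara@mail.co , 214 555 8888'))
# ('Sara M.', 'sara@mail.co', '214 555 8888')
print(parse_line('"Mehdi A." <mehdi.ay@example.org> , (469)555-9999'))
# ('Mehdi A.', 'mehdi.ay@example.org', '(469)555-9999')
print(parse_line('no separators here'))
# None
```

Keeping parsing separate from validation means a malformed address such as `bob[at]example.com` still parses and is then rejected by the email check, which makes each step testable in isolation.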


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
