# Question 2: Extract Digits and Phone Numbers Using Regular Expressions

This notebook demonstrates how to use Python regular expressions to extract the following from text read from a file:
- All digits from a string
- 10-digit phone numbers in various formats

## 1. Import Required Libraries

Import the `re` module for regular expressions and `Counter` for tallying digit frequencies. File reading uses the built-in `open`, so no extra import is needed for it.

In [1]:
import re
from collections import Counter

print("Required libraries imported successfully!")

Required libraries imported successfully!


## 2. Read Input Text from File

Read the input paragraph from a text file and display the content for processing.

In [2]:
# Read input text from file
file_path = 'input_text.txt'

try:
    with open(file_path, 'r', encoding='utf-8') as file:
        input_text = file.read()
    
    print("Input Text from File:")
    print("-" * 50)
    print(input_text)
    print("-" * 50)
    print(f"Total characters: {len(input_text)}")
    
except FileNotFoundError:
    print(f"Error: {file_path} not found!")
    # Fall back to sample text for demonstration (no stray indentation in the string)
    input_text = (
        "Contact us at 9876543210 or call (555) 123-4567.\n"
        "Our office number is 8765432109. Emergency: 911.\n"
        "Room 123, Code 456, Phone: 7890123456."
    )
    print("Using sample text instead:")
    print(input_text)

Input Text from File:
--------------------------------------------------
Contact information for our office:
Phone: 9876543210 or call us at (555) 123-4567
Alternative numbers: 8765432109, 7654321098
Our office address is 123 Main Street, Suite 456
Zip code: 98765
Emergency contact: 9123456780
International format: +1-800-555-0199
Some random digits: 42, 100, 2023, 7890
More phone numbers: 5551234567, 4567891230
Mixed content: Room 789, Phone 6789012345, Code 2024

--------------------------------------------------
Total characters: 396
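
The read-with-fallback logic above can also be packaged as a small reusable helper. The following is a minimal sketch; the name `load_text`, its `fallback` parameter, and `SAMPLE_TEXT` are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path


def load_text(path, fallback):
    """Return the contents of `path`, or `fallback` if the file is missing.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return fallback


# Fallback text used only when the file cannot be found
SAMPLE_TEXT = (
    "Contact us at 9876543210 or call (555) 123-4567.\n"
    "Our office number is 8765432109. Emergency: 911."
)

text = load_text("input_text.txt", SAMPLE_TEXT)
print(f"Loaded {len(text)} characters")
```

Returning the fallback instead of printing inside the `except` branch keeps the loading step separate from the display step, which makes the helper easier to reuse in later cells.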


## 3. Extract All Digits from Text

Use regular expressions to find and extract all individual digits from the input text.

In [3]:
# Extract all individual digits
def extract_all_digits(text):
    r"""
    Extract all individual digits from text
    Pattern: \d matches any single digit (0-9)
    """
    digit_pattern = r'\d'
    digits = re.findall(digit_pattern, text)
    return digits

# Extract digits
all_digits = extract_all_digits(input_text)

print("3.1 All Individual Digits:")
print(f"Digits found: {all_digits}")
print(f"Total digits: {len(all_digits)}")

# Count frequency of each digit
digit_frequency = Counter(all_digits)
print("\nDigit frequency:")
for digit, count in sorted(digit_frequency.items()):
    print(f"  Digit '{digit}': {count} times")

3.1 All Individual Digits:
Digits found: ['9', '8', '7', '6', '5', '4', '3', '2', '1', '0', '5', '5', '5', '1', '2', '3', '4', '5', '6', '7', '8', '7', '6', '5', '4', '3', '2', '1', '0', '9', '7', '6', '5', '4', '3', '2', '1', '0', '9', '8', '1', '2', '3', '4', '5', '6', '9', '8', '7', '6', '5', '9', '1', '2', '3', '4', '5', '6', '7', '8', '0', '1', '8', '0', '0', '5', '5', '5', '0', '1', '9', '9', '4', '2', '1', '0', '0', '2', '0', '2', '3', '7', '8', '9', '0', '5', '5', '5', '1', '2', '3', '4', '5', '6', '7', '4', '5', '6', '7', '8', '9', '1', '2', '3', '0', '7', '8', '9', '6', '7', '8', '9', '0', '1', '2', '3', '4', '5', '2', '0', '2', '4']
Total digits: 122

Digit frequency:
  Digit '0': 14 times
  Digit '1': 12 times
  Digit '2': 14 times
  Digit '3': 10 times
  Digit '4': 11 times
  Digit '5': 19 times
  Digit '6': 10 times
  Digit '7': 11 times
  Digit '8': 10 times
  Digit '9': 11 times
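
One detail worth knowing about `\d`: for Python 3 `str` patterns it matches any Unicode decimal digit, not only `0-9`. The sketch below uses a made-up sample string (an assumption for illustration) to show how `re.ASCII` or an explicit `[0-9]` class restricts the match to ASCII digits.

```python
import re

# Mixed text: ASCII digits plus two Arabic-Indic digits (U+0664, U+0662)
sample = "Room 12, code \u0664\u0662"

# Default for str patterns: \d is Unicode-aware
unicode_digits = re.findall(r"\d", sample)

# re.ASCII restricts \d to the characters 0-9
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

# An explicit character class gives the same ASCII-only result
explicit_digits = re.findall(r"[0-9]", sample)

print(len(unicode_digits))  # 4
print(ascii_digits)         # ['1', '2']
print(explicit_digits)      # ['1', '2']
```

For the input file used here the distinction does not change the results, since it contains only ASCII digits.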



## 4. Extract Specific Digit Patterns

Use regular expressions to extract runs of consecutive digits, and to pick out numbers of a specific length (here, exactly 3 or 4 digits).

In [4]:
# Extract all number sequences (consecutive digits)
def extract_number_sequences(text):
    r"""
    Extract sequences of consecutive digits
    Pattern: \d+ matches one or more consecutive digits
    """
    number_pattern = r'\d+'
    numbers = re.findall(number_pattern, text)
    return numbers

# Extract 3-digit numbers
def extract_3_digit_numbers(text):
    r"""
    Extract exactly 3-digit numbers
    Pattern: \b\d{3}\b matches exactly 3 digits with word boundaries
    """
    three_digit_pattern = r'\b\d{3}\b'
    three_digit_numbers = re.findall(three_digit_pattern, text)
    return three_digit_numbers

# Extract 4-digit numbers
def extract_4_digit_numbers(text):
    r"""
    Extract exactly 4-digit numbers
    Pattern: \b\d{4}\b matches exactly 4 digits with word boundaries
    """
    four_digit_pattern = r'\b\d{4}\b'
    four_digit_numbers = re.findall(four_digit_pattern, text)
    return four_digit_numbers

# Extract all patterns
number_sequences = extract_number_sequences(input_text)
three_digit_nums = extract_3_digit_numbers(input_text)
four_digit_nums = extract_4_digit_numbers(input_text)

print("4.1 All Number Sequences:")
print(f"Number sequences found: {number_sequences}")
print(f"Total sequences: {len(number_sequences)}")

print("\n4.2 3-Digit Numbers:")
print(f"3-digit numbers: {three_digit_nums}")
print(f"Count: {len(three_digit_nums)}")

print("\n4.3 4-Digit Numbers:")
print(f"4-digit numbers: {four_digit_nums}")
print(f"Count: {len(four_digit_nums)}")

# Categorize numbers by length
print("\n4.4 Numbers categorized by length:")
length_categories = {}
for num in number_sequences:
    # Group each sequence under its digit count
    length_categories.setdefault(len(num), []).append(num)

for length, nums in sorted(length_categories.items()):
    print(f"  {length}-digit numbers: {nums} (Count: {len(nums)})")

4.1 All Number Sequences:
Number sequences found: ['9876543210', '555', '123', '4567', '8765432109', '7654321098', '123', '456', '98765', '9123456780', '1', '800', '555', '0199', '42', '100', '2023', '7890', '5551234567', '4567891230', '789', '6789012345', '2024']
Total sequences: 23

4.2 3-Digit Numbers:
3-digit numbers: ['555', '123', '123', '456', '800', '555', '100', '789']
Count: 8

4.3 4-Digit Numbers:
4-digit numbers: ['4567', '0199', '2023', '7890', '2024']
Count: 5

4.4 Numbers categorized by length:
  1-digit numbers: ['1'] (Count: 1)
  2-digit numbers: ['42'] (Count: 1)
  3-digit numbers: ['555', '123', '123', '456', '800', '555', '100', '789'] (Count: 8)
  4-digit numbers: ['4567', '0199', '2023', '7890', '2024'] (Count: 5)
  5-digit numbers: ['98765'] (Count: 1)
  10-digit numbers: ['9876543210', '8765432109', '7654321098', '9123456780', '5551234567', '4567891230', '6789012345'] (Count: 7)
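
The 3-digit results above include `555` and `800`, which are pieces of phone numbers rather than standalone values. This happens because `\b` treats a hyphen or parenthesis as a boundary. The sketch below, using a made-up sample string, shows one possible alternative: lookarounds that reject a run touching another digit or a hyphen. The exact characters to exclude are an assumption and would need adjusting for other formats.

```python
import re

sample = "Room 789, call 800-555-0199, code 123"

# \b treats '-' as a boundary, so pieces of the phone number also match
with_boundaries = re.findall(r"\b\d{3}\b", sample)

# Lookarounds reject a run that touches another digit or a hyphen
standalone = re.findall(r"(?<![\d-])\d{3}(?![\d-])", sample)

print(with_boundaries)  # ['789', '800', '555', '123']
print(standalone)       # ['789', '123']
```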



## 5. Extract 10-Digit Phone Numbers

Use regular expressions to identify and extract 10-digit phone numbers in basic format (e.g., 1234567890).

In [5]:
# Extract 10-digit phone numbers (basic format)
def extract_10_digit_phones(text):
    r"""
    Extract exactly 10-digit phone numbers
    Pattern: \b\d{10}\b matches exactly 10 consecutive digits with word boundaries
    """
    phone_pattern = r'\b\d{10}\b'
    phones = re.findall(phone_pattern, text)
    return phones

# Extract 10-digit phone numbers
basic_phones = extract_10_digit_phones(input_text)

print("5.1 Basic 10-Digit Phone Numbers:")
print(f"Phone numbers found: {basic_phones}")
print(f"Total 10-digit phones: {len(basic_phones)}")

# Validate and format phone numbers
print("\n5.2 Phone Number Validation:")
for i, phone in enumerate(basic_phones, 1):
    if len(phone) == 10:
        formatted = f"({phone[:3]}) {phone[3:6]}-{phone[6:]}"
        print(f"  Phone {i}: {phone} → {formatted} ✓ Valid")
    else:
        print(f"  Phone {i}: {phone} ✗ Invalid (length: {len(phone)})")

# Check for potential phone numbers (any digit sequence of 7 or more digits)
potential_phones = [num for num in number_sequences if len(num) >= 7]
print("\n5.3 Potential Phone Numbers (7+ digits):")
for phone in potential_phones:
    if len(phone) == 10:
        print(f"  {phone} → Likely phone number ✓")
    elif len(phone) == 7:
        print(f"  {phone} → Local number (7 digits)")
    else:
        print(f"  {phone} → {len(phone)} digits")

5.1 Basic 10-Digit Phone Numbers:
Phone numbers found: ['9876543210', '8765432109', '7654321098', '9123456780', '5551234567', '4567891230', '6789012345']
Total 10-digit phones: 7

5.2 Phone Number Validation:
  Phone 1: 9876543210 → (987) 654-3210 ✓ Valid
  Phone 2: 8765432109 → (876) 543-2109 ✓ Valid
  Phone 3: 7654321098 → (765) 432-1098 ✓ Valid
  Phone 4: 9123456780 → (912) 345-6780 ✓ Valid
  Phone 5: 5551234567 → (555) 123-4567 ✓ Valid
  Phone 6: 4567891230 → (456) 789-1230 ✓ Valid
  Phone 7: 6789012345 → (678) 901-2345 ✓ Valid

5.3 Potential Phone Numbers (7+ digits):
  9876543210 → Likely phone number ✓
  8765432109 → Likely phone number ✓
  7654321098 → Likely phone number ✓
  9123456780 → Likely phone number ✓
  5551234567 → Likely phone number ✓
  4567891230 → Likely phone number ✓
  6789012345 → Likely phone number ✓
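
The `\b\d{10}\b` pattern relies on word boundaries, and letters and underscores count as word characters. A 10-digit run glued to a letter or underscore is therefore skipped. The sketch below uses a made-up sample string to compare it with a digit-only lookaround variant; which behavior is preferable depends on the input, so treat this as one option rather than a replacement.

```python
import re

sample = "tel:9876543210 ref_9123456780 12345678901"

# \b needs a word/non-word transition, and '_' counts as a word character
word_boundary = re.findall(r"\b\d{10}\b", sample)

# Lookarounds only require that no digit touches either end of the run
digit_boundary = re.findall(r"(?<!\d)\d{10}(?!\d)", sample)

print(word_boundary)   # ['9876543210']
print(digit_boundary)  # ['9876543210', '9123456780']
```

Both variants correctly ignore the 11-digit run at the end of the sample.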



## 6. Extract Phone Numbers with Different Formats

Use regular expressions to extract phone numbers in various formats, including those written with dashes, dots, parentheses, and spaces (e.g., (123) 456-7890, 123-456-7890).

In [6]:
# Extract formatted phone numbers
def extract_formatted_phones(text):
    """
    Extract phone numbers in various formats:
    - (123) 456-7890
    - 123-456-7890
    - 123.456.7890
    - 123 456 7890
    - 1234567890
    """
    # One pattern per supported phone format
    phone_patterns = [
        r'\(\d{3}\)\s?\d{3}[-.]?\d{4}',  # (123) 456-7890 or (123)456-7890
        r'\d{3}[-.]\d{3}[-.]\d{4}',      # 123-456-7890 or 123.456.7890
        r'\d{3}\s\d{3}\s\d{4}',          # 123 456 7890
        r'\b\d{10}\b',                   # 1234567890
    ]
    
    all_formatted_phones = []
    for pattern in phone_patterns:
        phones = re.findall(pattern, text)
        all_formatted_phones.extend(phones)
    
    # Remove duplicates while preserving order (dicts keep insertion order)
    unique_phones = list(dict.fromkeys(all_formatted_phones))
    
    return unique_phones

# Extract US/International format phones
def extract_international_phones(text):
    """
    Extract international format phone numbers:
    - +1-800-555-0199
    - +1 (800) 555-0199
    """
    international_pattern = r'\+\d{1,3}[-\s]?\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}'
    international_phones = re.findall(international_pattern, text)
    return international_phones

# Extract all formatted phone numbers
formatted_phones = extract_formatted_phones(input_text)
international_phones = extract_international_phones(input_text)

print("6.1 All Formatted Phone Numbers:")
for i, phone in enumerate(formatted_phones, 1):
    # Extract just digits to verify it's 10 digits
    digits_only = re.sub(r'\D', '', phone)
    if len(digits_only) == 10:
        print(f"  {i}. {phone} → {digits_only} ✓ Valid format")
    else:
        print(f"  {i}. {phone} → {digits_only} ({len(digits_only)} digits)")

print(f"\nTotal formatted phones found: {len(formatted_phones)}")

print("\n6.2 International Format Phone Numbers:")
if international_phones:
    for i, phone in enumerate(international_phones, 1):
        digits_only = re.sub(r'\D', '', phone)
        print(f"  {i}. {phone} → {digits_only}")
else:
    print("  No international format phone numbers found.")

# Advanced phone number extraction with validation
def clean_and_validate_phone(phone_str):
    """
    Clean phone number string and validate
    """
    # Remove all non-digit characters
    digits = re.sub(r'\D', '', phone_str)
    
    # Validate length
    if len(digits) == 10:
        # Format as (XXX) XXX-XXXX
        formatted = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        return {'original': phone_str, 'cleaned': digits, 'formatted': formatted, 'valid': True}
    elif len(digits) == 11 and digits[0] == '1':
        # US number with country code
        clean_digits = digits[1:]
        formatted = f"+1 ({clean_digits[:3]}) {clean_digits[3:6]}-{clean_digits[6:]}"
        return {'original': phone_str, 'cleaned': clean_digits, 'formatted': formatted, 'valid': True}
    else:
        return {'original': phone_str, 'cleaned': digits, 'formatted': None, 'valid': False}

print("\n6.3 Cleaned and Validated Phone Numbers:")
all_phone_candidates = list(set(formatted_phones + international_phones))

for phone in all_phone_candidates:
    result = clean_and_validate_phone(phone)
    if result['valid']:
        print(f"  ✓ {result['original']} → {result['formatted']}")
    else:
        print(f"  ✗ {result['original']} → Invalid ({len(result['cleaned'])} digits)")

6.1 All Formatted Phone Numbers:
  1. (555) 123-4567 → 5551234567 ✓ Valid format
  2. 800-555-0199 → 8005550199 ✓ Valid format
  3. 9876543210 → 9876543210 ✓ Valid format
  4. 8765432109 → 8765432109 ✓ Valid format
  5. 7654321098 → 7654321098 ✓ Valid format
  6. 9123456780 → 9123456780 ✓ Valid format
  7. 5551234567 → 5551234567 ✓ Valid format
  8. 4567891230 → 4567891230 ✓ Valid format
  9. 6789012345 → 6789012345 ✓ Valid format

Total formatted phones found: 9

6.2 International Format Phone Numbers:
  1. +1-800-555-0199 → 18005550199

6.3 Cleaned and Validated Phone Numbers:
  ✓ 9123456780 → (912) 345-6780
  ✓ 6789012345 → (678) 901-2345
  ✓ (555) 123-4567 → (555) 123-4567
  ✓ 4567891230 → (456) 789-1230
  ✓ 8765432109 → (876) 543-2109
  ✓ 9876543210 → (987) 654-3210
  ✓ 5551234567 → (555) 123-4567
  ✓ 7654321098 → (765) 432-1098
  ✓ 800-555-0199 → (800) 555-0199
  ✓ +1-800-555-0199 → +1 (800) 555-0199
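
In the output above, `800-555-0199` is reported separately even though it is the tail of `+1-800-555-0199`, because each pattern in the list scans the text independently. One possible way to avoid this is a single combined pattern, so that every stretch of text is matched at most once. The sketch below is an assumption for illustration (the name `PHONE_RE`, the sample string, and the exact separators allowed are not from the notebook) and covers only US-style numbers.

```python
import re

# Hypothetical combined pattern (an assumption, not the notebook's own regex)
PHONE_RE = re.compile(
    r"""
    (?<!\d)                 # no digit immediately before the number
    (?:\+?1[-.\s]?)?        # optional US country code, with or without '+'
    (?:\(\d{3}\)\s?         # area code in parentheses
      |\d{3}[-.\s]?)        # or a bare area code with an optional separator
    \d{3}[-.\s]?            # exchange
    \d{4}                   # line number
    (?!\d)                  # no digit immediately after the number
    """,
    re.VERBOSE,
)

sample = "Call (555) 123-4567 or +1-800-555-0199, fax 555.123.4567, id 123456789012."

# All groups are non-capturing, so findall returns the full matched strings
matches = PHONE_RE.findall(sample)
print(matches)
# ['(555) 123-4567', '+1-800-555-0199', '555.123.4567']
```

Because `re.findall` returns non-overlapping matches from left to right, the international number is consumed whole and its last ten digits are not reported again. The 12-digit id is rejected by the two lookarounds.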


## 7. Summary and Results

Let's summarize all our findings from the text analysis.

In [7]:
# Final summary
print("=" * 60)
print("           FINAL SUMMARY AND RESULTS")
print("=" * 60)

print("\n📊 DIGIT EXTRACTION RESULTS:")
print(f"  • Total individual digits found: {len(all_digits)}")
print(f"  • Total number sequences found: {len(number_sequences)}")
print(f"  • Unique digits used: {sorted(set(all_digits))}")

print("\n📞 PHONE NUMBER EXTRACTION RESULTS:")
print(f"  • Basic 10-digit phones: {len(basic_phones)}")
print(f"  • Formatted phones (all types): {len(formatted_phones)}")
print(f"  • International format phones: {len(international_phones)}")

print("\n🔍 REGULAR EXPRESSION PATTERNS USED:")
patterns = {
    'Individual digits': r'\d',
    'Number sequences': r'\d+',
    '10-digit phones': r'\b\d{10}\b',
    'Formatted phones': r'\(\d{3}\)\s?\d{3}[-.]?\d{4}',
    'International phones': r'\+\d{1,3}[-\s]?\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}'
}

for description, pattern in patterns.items():
    print(f"  • {description}: {pattern}")

print("\n✅ VALIDATED PHONE NUMBERS:")
valid_phones = [phone for phone in formatted_phones if len(re.sub(r'\D', '', phone)) == 10]
if valid_phones:
    for phone in valid_phones:
        digits = re.sub(r'\D', '', phone)
        formatted = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
        print(f"  • {phone} → {formatted}")
else:
    print("  • No valid 10-digit phone numbers found")

print("\n" + "=" * 60)
print("Analysis completed successfully!")
print("=" * 60)

           FINAL SUMMARY AND RESULTS

📊 DIGIT EXTRACTION RESULTS:
  • Total individual digits found: 122
  • Total number sequences found: 23
  • Unique digits used: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

📞 PHONE NUMBER EXTRACTION RESULTS:
  • Basic 10-digit phones: 7
  • Formatted phones (all types): 9
  • International format phones: 1

🔍 REGULAR EXPRESSION PATTERNS USED:
  • Individual digits: \d
  • Number sequences: \d+
  • 10-digit phones: \b\d{10}\b
  • Formatted phones: \(\d{3}\)\s?\d{3}[-.]?\d{4}
  • International phones: \+\d{1,3}[-\s]?\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}

✅ VALIDATED PHONE NUMBERS:
  • (555) 123-4567 → (555) 123-4567
  • 800-555-0199 → (800) 555-0199
  • 9876543210 → (987) 654-3210
  • 8765432109 → (876) 543-2109
  • 7654321098 → (765) 432-1098
  • 9123456780 → (912) 345-6780
  • 5551234567 → (555) 123-4567
  • 4567891230 → (456) 789-1230
  • 6789012345 → (678) 901-2345

Analysis completed successfully!
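
The validated list above shows `(555) 123-4567` twice, because `(555) 123-4567` and `5551234567` are different strings for the same number. A minimal sketch of de-duplicating by normalized digits follows. The helper `normalize_phone` and the candidate list are hypothetical, and the sketch assumes that a leading `1` on an 11-digit number is a US country code.

```python
import re


def normalize_phone(raw):
    """Return the 10-digit form of a US-style number, or None if it does not fit.

    Hypothetical helper; assumes a leading '1' on an 11-digit number is a country code.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


candidates = ["(555) 123-4567", "800-555-0199", "5551234567", "+1-800-555-0199", "911"]

# Map each normalized number to the first spelling seen; dicts keep insertion order
unique = {}
for raw in candidates:
    key = normalize_phone(raw)
    if key is not None:
        unique.setdefault(key, raw)

print(unique)
# {'5551234567': '(555) 123-4567', '8005550199': '800-555-0199'}
```

Keying on the normalized digits also gives a deterministic order, unlike building the candidate list through `set`.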
