## Setup and Imports

In [None]:
from explodingham.models.baseline_models.regex import RegexPartialMatchClassifier, RegexFullMatchClassifier
import re

## Example 1: URL Validation with RegexPartialMatchClassifier

The `RegexPartialMatchClassifier` uses `re.match()` to check if a pattern matches at the **beginning** of a string. Let's use it to validate URLs that start with http:// or https://.

In [10]:
# Create classifier to match URLs starting with http:// or https://
url_clf = RegexPartialMatchClassifier(pattern=r'https?://')

urls = [
    "https://example.com",
    "http://site.org",
    "ftp://files.com",
    "example.com",
    "https://secure.bank.com/login"
]

predictions = url_clf.predict(urls)

print("URL Validation Results (must start with http:// or https://):")
for url, is_valid in zip(urls, predictions):
    status = "✓ Valid" if is_valid else "✗ Invalid"
    print(f"  {status}: {url}")

URL Validation Results (must start with http:// or https://):
  ✓ Valid: https://example.com
  ✓ Valid: http://site.org
  ✗ Invalid: ftp://files.com
  ✗ Invalid: example.com
  ✓ Valid: https://secure.bank.com/login


## Example 2: Email Detection with RegexFullMatchClassifier

The `RegexFullMatchClassifier` uses `re.search()` to find a pattern **anywhere** in the string. This makes it well suited to detecting email addresses embedded in longer text.

In [11]:
# Create classifier to detect emails anywhere in text
email_clf = RegexFullMatchClassifier(
    pattern=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
)

texts = [
    "Contact us at support@example.com for help",
    "No contact information in this message",
    "Email me at user@domain.org or call",
    "Visit our website",
    "Send questions to help@company.co.uk"
]

predictions = email_clf.predict(texts)

print("Email Detection Results:")
for text, has_email in zip(texts, predictions):
    status = "📧 Contains email" if has_email else "📭 No email"
    print(f"  {status}: '{text}'")

Email Detection Results:
  📧 Contains email: 'Contact us at support@example.com for help'
  📭 No email: 'No contact information in this message'
  📧 Contains email: 'Email me at user@domain.org or call'
  📭 No email: 'Visit our website'
  📧 Contains email: 'Send questions to help@company.co.uk'


## Example 3: Phone Number Format Validation

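Use `RegexPartialMatchClassifier` to check that a string starts with a phone number in a specific format. Because `re.match()` anchors only the start of the string, any characters after the number are not checked.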

In [None]:
# Validate US phone numbers in format +1-XXX-XXX-XXXX
phone_clf = RegexPartialMatchClassifier(pattern=r'\+1-\d{3}-\d{3}-\d{4}')

phone_numbers = [
    "+1-555-123-4567",
    "555-123-4567",
    "+1-800-555-0199",
    "1-555-123-4567",
    "+1-555-1234"
]

predictions = phone_clf.predict(phone_numbers)

print("Phone Number Validation (format: +1-XXX-XXX-XXXX):")
for phone, is_valid in zip(phone_numbers, predictions):
    status = "✓ Valid" if is_valid else "✗ Invalid"
    print(f"  {status}: {phone}")
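
The prefix-only behavior is worth seeing in isolation. The sketch below uses the standard `re` module directly rather than the classifier, and assumes the classifier behaves like `re.match()` as described earlier; the sample strings are illustrative and not part of the notebook's data.

```python
import re

PHONE_PATTERN = r'\+1-\d{3}-\d{3}-\d{4}'

# Illustrative inputs: one exact number and two with trailing characters.
samples = ["+1-555-123-4567", "+1-555-123-4567 ext. 9", "+1-555-123-45678"]

# re.match anchors only the start, so trailing characters are accepted.
prefix_results = [bool(re.match(PHONE_PATTERN, s)) for s in samples]

# re.fullmatch requires the whole string to match the pattern.
strict_results = [bool(re.fullmatch(PHONE_PATTERN, s)) for s in samples]

print(prefix_results)  # [True, True, True]
print(strict_results)  # [True, False, False]
```

If strict validation is needed with a start-anchored matcher, one option is to end the pattern with `\Z`, which matches only at the very end of the string.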

## Example 4: Case-Insensitive Keyword Detection

Set `ignore_case=True` to detect keywords regardless of capitalization.

In [12]:
# Detect "python" keyword anywhere in text (case-insensitive)
keyword_clf = RegexFullMatchClassifier(pattern=r'python', ignore_case=True)

texts = [
    "I love Python programming",
    "Java is also great",
    "PYTHON is powerful",
    "Learning python was fun",
    "JavaScript and Ruby"
]

predictions = keyword_clf.predict(texts)

print("Keyword Detection (case-insensitive 'python'):")
for text, has_keyword in zip(texts, predictions):
    status = "🐍 Found" if has_keyword else "❌ Not found"
    print(f"  {status}: '{text}'")

Keyword Detection (case-insensitive 'python'):
  🐍 Found: 'I love Python programming'
  ❌ Not found: 'Java is also great'
  🐍 Found: 'PYTHON is powerful'
  🐍 Found: 'Learning python was fun'
  ❌ Not found: 'JavaScript and Ruby'


## Example 5: Spam Detection with Multiple Patterns

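Use a single pattern with alternation (`|`) to detect common spam keywords in messages. The `\b` word boundaries keep a keyword from matching inside a longer word.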

In [None]:
# Detect spam keywords (free, win, click, offer, etc.)
spam_clf = RegexFullMatchClassifier(
    pattern=r'\b(free|win|winner|click|offer|prize|urgent|limited)\b',
    ignore_case=True
)

messages = [
    "You are the WINNER of $1000!",
    "Meeting scheduled for tomorrow at 2pm",
    "Click here for a FREE offer",
    "Can we discuss the project?",
    "URGENT: Limited time to claim your prize!"
]

predictions = spam_clf.predict(messages)

print("Spam Detection Results:")
for message, is_spam in zip(messages, predictions):
    status = "🚫 Likely spam" if is_spam else "✓ Legitimate"
    print(f"  {status}: '{message}'")
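
To see what the `\b` word boundaries contribute, here is a small sketch using the standard `re` module with the same pattern. The sample sentences are made up for illustration; they show keywords embedded in longer words being ignored.

```python
import re

SPAM_PATTERN = re.compile(
    r'\b(free|win|winner|click|offer|prize|urgent|limited)\b',
    re.IGNORECASE,
)

# Illustrative inputs (not from the notebook's message list).
samples = [
    "Click here for a FREE offer",  # whole keywords: match
    "Open the window, please",      # "win" inside "window": no match
    "Freedom of the press",         # "free" inside "Freedom": no match
    "We are unlimited partners",    # "limited" inside "unlimited": no match
]

spam_flags = [bool(SPAM_PATTERN.search(s)) for s in samples]
print(spam_flags)  # [True, False, False, False]
```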

## Example 6: File Extension Validation

Check that a filename consists only of letters, digits, and underscores and ends with the `.py` extension.

In [None]:
# Validate Python files (name made of letters, digits, or underscores, ending with .py)
python_file_clf = RegexPartialMatchClassifier(pattern=r'[a-zA-Z0-9_]+\.py$')

filenames = [
    "main.py",
    "test_utils.py",
    "script.txt",
    "data.csv",
    "helper_functions.py"
]

predictions = python_file_clf.predict(filenames)

print("Python File Validation:")
for filename, is_valid in zip(filenames, predictions):
    status = "✓ Valid .py" if is_valid else "✗ Not .py"
    print(f"  {status}: {filename}")
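
A few edge cases of this pattern are easy to miss. The sketch below checks them with the standard `re` module, assuming the classifier applies `re.match()` as described earlier; the filenames and the `STRICT_PATTERN` name are illustrative.

```python
import re

PATTERN = r'[a-zA-Z0-9_]+\.py$'
STRICT_PATTERN = r'[a-zA-Z0-9_]+\.py\Z'  # hypothetical stricter variant

# "$" also matches just before a trailing newline, so this passes.
trailing_newline = bool(re.match(PATTERN, "main.py\n"))
# "\Z" matches only at the very end of the string, so this fails.
strict_newline = bool(re.match(STRICT_PATTERN, "main.py\n"))
# "-" is outside the character class, so the match stops before ".py".
hyphenated = bool(re.match(PATTERN, "my-script.py"))
# "/" is outside the character class too, so a path prefix fails.
with_path = bool(re.match(PATTERN, "src/main.py"))

print(trailing_newline, strict_newline, hyphenated, with_path)
# True False False False
```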

## Example 7: Social Media Handle Detection

Detect Twitter/X handles (starting with @) anywhere in text.

In [None]:
# Detect social media handles (@username)
handle_clf = RegexFullMatchClassifier(pattern=r'@[A-Za-z0-9_]+')

tweets = [
    "Check out @user123 for great content",
    "This is a regular tweet",
    "Shoutout to @python_dev and @data_science",
    "Email me at user@example.com",
    "Follow @ai_news for updates"
]

predictions = handle_clf.predict(tweets)

print("Social Media Handle Detection:")
for tweet, has_handle in zip(tweets, predictions):
    status = "@ Found handle" if has_handle else "○ No handle"
    print(f"  {status}: '{tweet}'")
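
Note that the pattern `@[A-Za-z0-9_]+` also matches the `@example` part of an email address, so the email sentence above would be reported as containing a handle. One possible refinement, sketched here with the standard `re` module, is a negative lookbehind that rejects an `@` directly preceded by a character that can appear in an email local part. The `STRICT` pattern is an illustration, not something the library provides.

```python
import re

NAIVE = re.compile(r'@[A-Za-z0-9_]+')
# Hypothetical refinement: skip "@" when it directly follows a character
# that can appear in an email local part.
STRICT = re.compile(r'(?<![A-Za-z0-9._%+-])@[A-Za-z0-9_]+')

samples = [
    "Check out @user123 for great content",
    "Email me at user@example.com",
    "@ai_news posts daily",
]

naive_hits = [bool(NAIVE.search(s)) for s in samples]
strict_hits = [bool(STRICT.search(s)) for s in samples]

print(naive_hits)   # [True, True, True]
print(strict_hits)  # [True, False, True]
```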

## Example 8: Code Comment Detection

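Identify lines whose first non-whitespace character is `#`, the Python comment marker. Inline comments that follow code on the same line are not matched.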

In [None]:
# Detect lines starting with Python comments
comment_clf = RegexPartialMatchClassifier(pattern=r'\s*#')

code_lines = [
    "# This is a comment",
    "x = 42  # Inline comment",
    "    # Indented comment",
    "print('Hello')",
    "#TODO: Fix this"
]

predictions = comment_clf.predict(code_lines)

print("Python Comment Detection (lines starting with #):")
for line, is_comment in zip(code_lines, predictions):
    status = "💬 Comment" if is_comment else "⚙️  Code"
    print(f"  {status}: '{line}'")

## Example 9: Credit Card Pattern Detection (Simplified)

Detect if text contains patterns that look like credit card numbers.

In [None]:
# Detect credit card-like patterns (simplified)
cc_clf = RegexFullMatchClassifier(pattern=r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b')

texts = [
    "My card is 1234 5678 9012 3456",
    "Invoice #12345",
    "Card: 4532-1234-5678-9010",
    "No sensitive information here",
    "Number: 1234567890123456"
]

predictions = cc_clf.predict(texts)

print("Credit Card Pattern Detection:")
for text, has_cc_pattern in zip(texts, predictions):
    status = "⚠️  Contains CC pattern" if has_cc_pattern else "✓ Safe"
    print(f"  {status}: '{text}'")

## Example 10: Hashtag Detection

Detect posts containing hashtags.

In [None]:
# Detect hashtags
hashtag_clf = RegexFullMatchClassifier(pattern=r'#[A-Za-z0-9_]+')

posts = [
    "Loving this weather! #sunny #happy",
    "Just a regular post",
    "Check out #MachineLearning and #AI",
    "No tags here",
    "#Python is awesome!"
]

predictions = hashtag_clf.predict(posts)

print("Hashtag Detection:")
for post, has_hashtag in zip(posts, predictions):
    status = "# Has hashtag" if has_hashtag else "○ No hashtag"
    print(f"  {status}: '{post}'")

## Example 11: Date Format Validation

Validate dates in YYYY-MM-DD format at the start of strings.

In [None]:
# Validate ISO date format (YYYY-MM-DD) at start
date_clf = RegexPartialMatchClassifier(pattern=r'\d{4}-\d{2}-\d{2}')

log_entries = [
    "2025-12-18 System started",
    "Error occurred yesterday",
    "2024-01-01 New year log",
    "Status: OK",
    "2025-06-15 Maintenance completed"
]

predictions = date_clf.predict(log_entries)

print("Date Format Validation (starts with YYYY-MM-DD):")
for entry, has_date in zip(log_entries, predictions):
    status = "📅 Dated" if has_date else "○ Undated"
    print(f"  {status}: '{entry}'")
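
The pattern `\d{4}-\d{2}-\d{2}` checks only the shape of the date, so a prefix such as `2025-13-45` would also pass. If calendar validity matters, one option is to parse the matched prefix with `datetime.strptime`. The helper name `starts_with_real_date` and the sample entries below are hypothetical, and the sketch uses the standard library rather than the classifier.

```python
import re
from datetime import datetime

DATE_PREFIX = re.compile(r'\d{4}-\d{2}-\d{2}')

def starts_with_real_date(entry: str) -> bool:
    """Return True if entry starts with a YYYY-MM-DD date that exists."""
    match = DATE_PREFIX.match(entry)
    if match is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as month 13.
        datetime.strptime(match.group(0), "%Y-%m-%d")
    except ValueError:
        return False
    return True

# Illustrative log entries.
entries = ["2025-12-18 System started", "2025-13-45 Impossible date", "Status: OK"]

shape_only = [bool(DATE_PREFIX.match(e)) for e in entries]
real_dates = [starts_with_real_date(e) for e in entries]

print(shape_only)  # [True, True, False]
print(real_dates)  # [True, False, False]
```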

## Example 12: Using Regex Flags

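Pass explicit regex flags through the `flags` parameter. This example uses `re.MULTILINE` so that `^` matches at the start of every line, not only at the start of the string.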

In [None]:
# Match lines starting with digits (using MULTILINE flag)
multiline_clf = RegexFullMatchClassifier(
    pattern=r'^\d+',
    flags=[re.MULTILINE]
)

texts = [
    "First line\n123 Second line",
    "No numbers at start\nOf any line",
    "Line 1\n42 is the answer",
    "All text\nNo digits"
]

predictions = multiline_clf.predict(texts)

print("Multiline Pattern Detection (lines starting with digits):")
for text, has_match in zip(texts, predictions):
    status = "✓ Found" if has_match else "✗ Not found"
    display_text = text.replace('\n', '↵')
    print(f"  {status}: '{display_text}'")
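
The effect of `re.MULTILINE` can be checked with the standard `re` module alone. The second half of this sketch shows one way a list of flags could be combined into a single value with bitwise OR; how the classifier combines its `flags` list internally is not shown here, so treat that part as an assumption.

```python
import re
from functools import reduce
from operator import or_

text = "First line\n123 Second line"

# Without MULTILINE, "^" matches only at the start of the whole string.
without_flag = bool(re.search(r'^\d+', text))
# With MULTILINE, "^" also matches right after each newline.
with_flag = bool(re.search(r'^\d+', text, re.MULTILINE))
print(without_flag, with_flag)  # False True

# Assumption: a list of flags is folded into one value with bitwise OR.
flag_list = [re.MULTILINE, re.IGNORECASE]
combined = reduce(or_, flag_list, 0)
both_flags = bool(re.search(r'^second', "first\nSecond", combined))
print(both_flags)  # True
```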

## Example 13: Comparing Partial vs Full Match

See the difference between `RegexPartialMatchClassifier` (start only) and `RegexFullMatchClassifier` (anywhere).

In [None]:
pattern = r'test'

# Partial match (start only)
partial_clf = RegexPartialMatchClassifier(pattern=pattern)

# Full match (anywhere)
full_clf = RegexFullMatchClassifier(pattern=pattern)

test_strings = [
    "test case",
    "this is a test",
    "testing",
    "no match",
    "retest"
]

partial_predictions = partial_clf.predict(test_strings)
full_predictions = full_clf.predict(test_strings)

print("Comparing Partial vs Full Match for pattern 'test':")
print("\nString                  | Partial (start) | Full (anywhere)")
print("-" * 65)
for text, partial, full in zip(test_strings, partial_predictions, full_predictions):
    partial_str = "✓" if partial else "✗"
    full_str = "✓" if full else "✗"
    print(f"{text:23} | {partial_str:15} | {full_str}")
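
For reference, the same comparison can be reproduced with the standard `re` module, assuming the two classifiers wrap `re.match()` and `re.search()` as described earlier. Only `"this is a test"` and `"retest"` differ between the two modes, because the pattern occurs in them but not at the start.

```python
import re

pattern = re.compile(r'test')
test_strings = ["test case", "this is a test", "testing", "no match", "retest"]

# re.match: the pattern must occur at the start of the string.
at_start = [bool(pattern.match(s)) for s in test_strings]
# re.search: the pattern may occur anywhere in the string.
anywhere = [bool(pattern.search(s)) for s in test_strings]

print(at_start)  # [True, False, True, False, False]
print(anywhere)  # [True, True, True, False, True]
```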

## Example 14: Encoding Support with Bytes

Both classifiers support automatic decoding of byte strings using the specified encoding.

In [None]:
# Create classifier with UTF-8 encoding
unicode_clf = RegexFullMatchClassifier(pattern=r'café|naïve', encoding='utf-8')

# Mix of strings and bytes
texts = [
    "I love café",
    b"That's naive",  # plain ASCII "naive" (no diaeresis), so no match is expected
    "Visit the café".encode('utf-8'),
    "Regular text",
    b"A na\xc3\xafve approach"  # "naïve" in UTF-8 bytes
]

predictions = unicode_clf.predict(texts)

print("Unicode Pattern Detection (café or naïve):")
for text, has_match in zip(texts, predictions):
    status = "✓ Found" if has_match else "✗ Not found"
    display = text if isinstance(text, str) else f"[bytes: {text[:30]}...]"
    print(f"  {status}: {display}")
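
As a rough illustration of what decoding before matching involves, the sketch below converts bytes to text and then searches with the standard `re` module. The helper `to_text` is hypothetical and is not the library's implementation.

```python
import re

PATTERN = re.compile(r'café|naïve')

def to_text(value, encoding="utf-8"):
    """Decode bytes to str; return str values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

samples = ["I love café", b"That's naive", b"A na\xc3\xafve approach"]

# Decode first: a str pattern cannot be applied to a bytes object directly.
results = [bool(PATTERN.search(to_text(s))) for s in samples]
print(results)  # [True, False, True]
```

Without the decoding step, searching a bytes value with a string pattern raises `TypeError`.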

## Summary

This notebook demonstrated:

1. **RegexPartialMatchClassifier** - Matches patterns at the **beginning** of strings (uses `re.match()`)
   - URL validation
   - Phone number format checking
   - File extension validation
   - Comment line detection
   - Date format validation

2. **RegexFullMatchClassifier** - Matches patterns **anywhere** in strings (uses `re.search()`)
   - Email detection
   - Keyword detection
   - Spam filtering
   - Social media handle detection
   - Credit card pattern detection
   - Hashtag detection

3. **Key Features**
   - `ignore_case` parameter for case-insensitive matching
   - `flags` parameter for custom regex flags (MULTILINE, VERBOSE, etc.)
   - `encoding` parameter for automatic byte string decoding
   - Support for both string and bytes inputs

4. **Use Cases**
   - Input validation
   - Text classification
   - Pattern detection
   - Format checking
   - Content filtering

### When to Use Which?

- **Use RegexPartialMatchClassifier** when:
  - You need to validate formats (URLs, phone numbers, dates)
  - The pattern should be at the start
  - You're checking prefixes or starting patterns

- **Use RegexFullMatchClassifier** when:
  - You need to detect patterns anywhere
  - Searching for keywords or phrases
  - Finding embedded patterns (emails, mentions, hashtags)
  - Content filtering and classification