# Text Cleaning & Manipulation

**Module 4: Data Cleaning & Transformation**

## Learning Objectives
- Master the pandas `.str` accessor for text operations
- Apply regular expressions for pattern matching
- Clean and standardize text data
- Extract useful information from text fields

## Business Context
> "Text data is messy by nature. Names have typos, emails have different formats, and free-text fields are chaos!"

As a Data Analyst, you'll frequently encounter text data that needs cleaning before analysis. This notebook covers the essential techniques.

In [None]:
import pandas as pd
import numpy as np
import re

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

print("✓ Libraries loaded successfully")

---
## 1. The `.str` Accessor

Pandas provides the `.str` accessor to apply string methods to entire columns.
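
One detail worth seeing early is how `.str` methods treat missing values. The sketch below uses a small made-up Series (not part of the course data) to show that chained `.str` calls skip over NaN rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series(['  Alice ', np.nan, 'BOB'])

# Each .str method works element-wise; the missing value stays missing
cleaned = s.str.strip().str.lower()

print(cleaned.tolist())
print(cleaned.isna().tolist())
```

Because missing values pass through untouched, it is usually safe to chain several `.str` calls before deciding how to fill or drop them.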

### Basic String Operations

In [None]:
# Sample data with messy text
customers = pd.DataFrame({
    'name': ['  alice SMITH  ', 'BOB jones', 'Charlie Brown', '  diana PRINCE', 'EVE   '],
    'email': ['Alice.Smith@GMAIL.com', 'bob@yahoo.COM', 'CHARLIE@outlook.com', 
              'diana.prince@Company.CO.UK', 'eve@HOTMAIL.com'],
    'phone': ['555-123-4567', '(555) 234-5678', '555.345.6789', '5554567890', '+1 555 567 8901'],
    'address': ['123 Main St, New York, NY 10001', 
                '456 Oak Ave, Los Angeles, CA 90001',
                '789 Pine Rd, Chicago, IL 60601',
                '321 Elm Blvd, Houston, TX 77001',
                '654 Maple Dr, Phoenix, AZ 85001']
})

print("Original customer data:")
print(customers)

In [None]:
# Case conversion
print("=== Case Conversion ===")
print(f"\nOriginal names: {customers['name'].tolist()}")
print(f"lowercase: {customers['name'].str.lower().tolist()}")
print(f"UPPERCASE: {customers['name'].str.upper().tolist()}")
print(f"Title Case: {customers['name'].str.title().tolist()}")
print(f"Capitalize: {customers['name'].str.capitalize().tolist()}")

In [None]:
# Whitespace handling
print("=== Whitespace Handling ===")
print(f"\nOriginal: {customers['name'].tolist()}")
print(f"strip(): {customers['name'].str.strip().tolist()}")
print(f"lstrip(): {customers['name'].str.lstrip().tolist()}")
print(f"rstrip(): {customers['name'].str.rstrip().tolist()}")

# Clean and standardize names
customers['name_clean'] = customers['name'].str.strip().str.title()
print(f"\nCleaned: {customers['name_clean'].tolist()}")

### String Length and Content Checks

In [None]:
# Length operations
print("=== Length Operations ===")
customers['name_length'] = customers['name_clean'].str.len()
print(customers[['name_clean', 'name_length']])

# Word count
customers['word_count'] = customers['name_clean'].str.split().str.len()
print("\nWith word count:")
print(customers[['name_clean', 'name_length', 'word_count']])

In [None]:
# Content checks
print("=== Content Checks ===")

# Check if email contains gmail
print("\nEmails containing 'gmail':")
print(customers['email'].str.lower().str.contains('gmail'))

# Check whether each address starts with a digit
print("\nAddresses starting with digit:")
print(customers['address'].str[0].str.isdigit())

# Strip every non-digit character from the phone numbers
print("\nPhones (digits only):")
print(customers['phone'].str.replace(r'\D', '', regex=True))
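
For literal prefix and suffix checks, `.str.startswith()` and `.str.endswith()` are often simpler than a regex, and `.str.isdigit()` tests whether a value consists of digits only. A minimal sketch on made-up values (the Series names are illustrative):

```python
import pandas as pd

# Hypothetical sample values
emails = pd.Series(['alice@gmail.com', 'bob@yahoo.com', 'eve@company.co.uk'])
phones = pd.Series(['5551234567', '555-123-4567'])

# Literal (non-regex) prefix and suffix checks
ends_with_com = emails.str.endswith('.com')
starts_with_a = emails.str.startswith('a')

# True only when every character in the value is a digit
digits_only = phones.str.isdigit()

print(ends_with_com.tolist())
print(starts_with_a.tolist())
print(digits_only.tolist())
```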

---
## 2. Splitting and Extracting

Extract parts of strings or split into multiple columns.

In [None]:
# Split names into first and last
customers[['first_name', 'last_name']] = customers['name_clean'].str.split(' ', n=1, expand=True)

print("=== Split Names ===")
print(customers[['name_clean', 'first_name', 'last_name']])

In [None]:
# Extract email parts
customers['email_clean'] = customers['email'].str.lower()
customers['email_username'] = customers['email_clean'].str.split('@').str[0]
customers['email_domain'] = customers['email_clean'].str.split('@').str[1]

print("=== Email Parts ===")
print(customers[['email_clean', 'email_username', 'email_domain']])

In [None]:
# Extract address components
# Pattern: Street, City, State ZIP
address_parts = customers['address'].str.split(', ', expand=True)
address_parts.columns = ['street', 'city', 'state_zip']

# Further split state and zip
address_parts[['state', 'zip']] = address_parts['state_zip'].str.split(' ', expand=True)

print("=== Address Components ===")
print(address_parts)
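
Splitting on `', '` assumes every address has exactly three comma-separated parts. As an alternative sketch, `.str.extract()` with named groups pulls all components in one step and leaves NaN where an address does not fit the pattern. The pattern below is an assumption about US-style addresses, not a general address parser.

```python
import pandas as pd

# Hypothetical addresses; the last one does not follow the pattern
addresses = pd.Series([
    '123 Main St, New York, NY 10001',
    '456 Oak Ave, Los Angeles, CA 90001',
    'PO Box 12',
])

# Named groups become the column names of the result
pattern = r'^(?P<street>[^,]+), (?P<city>[^,]+), (?P<state>[A-Z]{2}) (?P<zip>\d{5})$'
parts = addresses.str.extract(pattern)

print(parts)
```

Rows that fail to match come back as all-NaN, which makes malformed addresses easy to find with `parts['zip'].isna()`.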

### Slicing Strings

In [None]:
# String slicing with .str[]
print("=== String Slicing ===")
print(f"\nOriginal emails: {customers['email_clean'].tolist()}")
print(f"First 5 chars: {customers['email_clean'].str[:5].tolist()}")
print(f"Last 4 chars: {customers['email_clean'].str[-4:].tolist()}")

# Get initials ('Eve' has no last name, so fill the missing value with '')
customers['initials'] = customers['first_name'].str[0] + customers['last_name'].str[0].fillna('')
print(f"\nInitials: {customers['initials'].tolist()}")

---
## 3. String Replacement

Replace text patterns within strings.

In [None]:
# Basic replacement
print("=== Basic Replacement ===")
print(f"\nOriginal phones: {customers['phone'].tolist()}")

# Remove all non-digit characters
customers['phone_clean'] = customers['phone'].str.replace(r'\D', '', regex=True)
print(f"Digits only: {customers['phone_clean'].tolist()}")

In [None]:
# Format phone numbers consistently
def format_phone(phone):
    """
    Format phone number as (XXX) XXX-XXXX
    """
    if pd.isna(phone):
        return np.nan
    
    # Remove all non-digits
    digits = re.sub(r'\D', '', str(phone))
    
    # Drop a leading country code (1) from 11-digit numbers
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]
    
    # Format if we have 10 digits
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    else:
        return phone  # Return original if can't format

customers['phone_formatted'] = customers['phone'].apply(format_phone)
print("=== Formatted Phone Numbers ===")
print(customers[['phone', 'phone_formatted']])
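
The same formatting can also be done without `.apply()`, using `.str.replace()` with capture groups and backreferences. This is a sketch on made-up phone values that mirrors the logic above (strip non-digits, drop a leading `1`, format 10-digit numbers, keep anything else unchanged).

```python
import pandas as pd

# Hypothetical phone values; the last one cannot be formatted
phones = pd.Series(['555-123-4567', '(555) 234-5678', '+1 555 567 8901', 'ext. 42'])

# Keep digits only
digits = phones.str.replace(r'\D', '', regex=True)

# Drop a leading country code '1' from 11-digit numbers
digits = digits.str.replace(r'^1(\d{10})$', r'\1', regex=True)

# Reformat 10-digit numbers using backreferences to the three groups
formatted = digits.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', regex=True)

# Fall back to the original value when we do not have exactly 10 digits
formatted = formatted.where(digits.str.len() == 10, phones)

print(formatted.tolist())
```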

In [None]:
# Multiple replacements with a dictionary
abbreviations = {
    'St': 'Street',
    'Ave': 'Avenue',
    'Rd': 'Road',
    'Blvd': 'Boulevard',
    'Dr': 'Drive'
}

# Apply each replacement from the dictionary in turn
def expand_abbreviations(text, mapping):
    for abbrev, full in mapping.items():
        # Word boundaries avoid partial replacements; re.escape guards against regex metacharacters in keys
        text = re.sub(rf'\b{re.escape(abbrev)}\b', full, text)
    return text

customers['address_expanded'] = customers['address'].apply(
    lambda x: expand_abbreviations(x, abbreviations)
)

print("=== Expanded Abbreviations ===")
print(customers[['address', 'address_expanded']])
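
A variation on the loop above is to build one alternation pattern from the dictionary keys and replace everything in a single pass. The sketch below assumes a shortened mapping, and `expand_once` is a hypothetical helper name.

```python
import re

# Shortened, illustrative mapping
abbreviations = {'St': 'Street', 'Ave': 'Avenue', 'Rd': 'Road'}

# One pattern equivalent to \b(St|Ave|Rd)\b, with each key escaped
pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in abbreviations) + r')\b')

def expand_once(text):
    """Replace every abbreviation in a single pass over the text."""
    return pattern.sub(lambda m: abbreviations[m.group(1)], text)

print(expand_once('123 Main St, near Oak Ave'))
print(expand_once('9 Stanley Rd'))
```

Because the text is scanned only once, a replacement is never re-examined by a later rule, which can matter when one expansion contains another abbreviation.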

---
## 4. Regular Expressions (Regex) for Data Analysts

Regex is powerful for pattern matching. Here are the most useful patterns for data cleaning.

### Essential Regex Patterns

| Pattern | Description | Example Match |
|---------|-------------|---------------|
| `\d` | Any digit | 0, 1, 2, ... |
| `\D` | Any non-digit | a, @, ! |
| `\w` | Word character (letter, digit, _) | a, 1, _ |
| `\W` | Non-word character | @, !, space |
| `\s` | Whitespace | space, tab, newline |
| `+` | One or more | `\d+` matches "123" |
| `*` | Zero or more | `\d*` matches "" or "123" |
| `?` | Zero or one | `colou?r` matches "color" or "colour" |
| `{n}` | Exactly n times | `\d{4}` matches "2024" |
| `{n,m}` | Between n and m times | `\d{2,4}` matches "12" to "1234" |
| `^` | Start of string | `^Hello` |
| `$` | End of string | `world$` |
| `\b` | Word boundary | `\bcat\b` matches "cat" not "category" |
| `( )` | Capture group | `(\d{3})` captures 3 digits |

In [None]:
# Sample data with various patterns
data = pd.DataFrame({
    'text': [
        'Order #12345 placed on 2024-01-15',
        'Invoice INV-2024-0001 total $1,234.56',
        'Customer ID: CUST-789, Email: john@example.com',
        'SKU: ABC-123-XYZ, Qty: 5',
        'Phone: (555) 123-4567, Fax: 555-987-6543'
    ]
})

print("Sample text data:")
for i, text in enumerate(data['text']):
    print(f"{i}: {text}")

In [None]:
# Extract patterns using .str.extract()
print("=== Extract Patterns ===")

# Extract order numbers (# followed by digits)
data['order_num'] = data['text'].str.extract(r'#(\d+)')

# Extract dates (YYYY-MM-DD format)
data['date'] = data['text'].str.extract(r'(\d{4}-\d{2}-\d{2})')

# Extract dollar amounts
data['amount'] = data['text'].str.extract(r'\$([\d,]+\.?\d*)')

# Extract email addresses
data['email'] = data['text'].str.extract(r'([\w.]+@[\w.]+)')

print(data)

In [None]:
# Find every match in each string using .str.findall()
print("=== Find All Matches ===")

# Find all phone numbers (returns a list per row)
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
data['phones'] = data['text'].str.findall(phone_pattern)
print(data[['text', 'phones']])
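
`.str.findall()` returns a Python list in each row. When one row per match is more convenient, `.str.extractall()` is an alternative: it needs at least one capture group and returns a DataFrame indexed by the original row and the match number. The named groups below are illustrative choices, not part of the original pattern.

```python
import pandas as pd

# Hypothetical text values; the second has no phone number
text = pd.Series(['Phone: (555) 123-4567, Fax: 555-987-6543', 'No numbers here'])

# Same idea as phone_pattern, split into three named groups
grouped_pattern = r'(?P<area>\d{3})\)?[-.\s]?(?P<prefix>\d{3})[-.\s]?(?P<line>\d{4})'

# One row per match; rows without any match are simply absent
matches = text.str.extractall(grouped_pattern)

print(matches)
```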

In [None]:
# Check if pattern exists using .str.contains()
print("=== Pattern Matching ===")

# Contains email address?
data['has_email'] = data['text'].str.contains(r'[\w.]+@[\w.]+', regex=True)

# Contains money amount?
data['has_amount'] = data['text'].str.contains(r'\$[\d,]+', regex=True)

print(data[['text', 'has_email', 'has_amount']])

---
## 5. Common Text Cleaning Tasks

### 5.1 Standardizing Names

In [None]:
# Messy names from different sources
names = pd.DataFrame({
    'raw_name': [
        '  SMITH, JOHN  ',
        'jane doe',
        'Dr. Robert Johnson III',
        'O\'Connor, Mary',
        'jean-pierre DUPONT',
        'van der Berg, Peter',
        'WILLIAMS,   SARAH'
    ]
})

print("Original names:")
print(names)

In [None]:
def clean_name(name):
    """
    Clean and standardize a name.
    """
    if pd.isna(name):
        return np.nan
    
    # Remove extra whitespace
    name = ' '.join(name.split())
    
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        if len(parts) == 2:
            name = f"{parts[1].strip()} {parts[0].strip()}"
    
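    # Title case, but keep particles ('van', 'de', 'der') lowercase and suffixes ('III') uppercase
    particles = ['van', 'de', 'der', 'von', 'la', 'le']
    suffixes = ['ii', 'iii', 'iv']
    words = name.lower().split()
    result = []
    for word in words:
        if word in particles and result:  # Only keep lowercase if not first word
            result.append(word)
        elif word in suffixes:
            result.append(word.upper())
        else:
            # Capitalize the first letter and any letter after a hyphen or apostrophe
            # (plain .capitalize() would give "O'connor" and "Jean-pierre")
            result.append(re.sub(r"(^|[-'])([^\W\d_])",
                                 lambda m: m.group(1) + m.group(2).upper(), word))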
    
    return ' '.join(result)

names['clean_name'] = names['raw_name'].apply(clean_name)
print("\nCleaned names:")
print(names)

### 5.2 Validating Email Addresses

In [None]:
# Email validation
emails = pd.DataFrame({
    'email': [
        'john@example.com',
        'Jane.Doe@company.co.uk',
        'invalid-email',
        'test@test',
        'user.name+tag@domain.org',
        '@nodomain.com',
        'noat.com'
    ]
})

# Basic email pattern
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

emails['is_valid'] = emails['email'].str.match(email_pattern)
emails['email_lower'] = emails['email'].str.lower()

print("Email Validation:")
print(emails)

### 5.3 Cleaning Product Descriptions

In [None]:
# Messy product data
products = pd.DataFrame({
    'description': [
        'iPhone 13 Pro  - 128GB   Blue',
        'SAMSUNG Galaxy S21, 256gb, BLACK',
        'google pixel 6 -- 128 gb (white)',
        'OnePlus 9 Pro\n256GB\nGreen',
        'Sony   Xperia   5   III    128GB'
    ]
})

print("Original descriptions:")
print(products)

In [None]:
def clean_product_description(desc):
    """
    Standardize product descriptions.
    """
    if pd.isna(desc):
        return np.nan
    
    # Replace newlines and tabs with spaces
    desc = re.sub(r'[\n\t\r]', ' ', desc)
    
    # Collapse repeated dashes into a single dash
    desc = re.sub(r'-+', '-', desc)
    
    # Collapse runs of whitespace into a single space
    desc = re.sub(r'\s+', ' ', desc)
    
    # Standardize storage format (128gb -> 128GB)
    desc = re.sub(r'(\d+)\s*gb', r'\1GB', desc, flags=re.IGNORECASE)
    
    # Remove parentheses
    desc = re.sub(r'[\(\)]', '', desc)
    
    # Remove leading/trailing whitespace and punctuation
    desc = desc.strip(' -,')
    
    return desc

products['clean_description'] = products['description'].apply(clean_product_description)
print("\nCleaned descriptions:")
print(products)

In [None]:
# Extract product attributes (the first word is only a rough stand-in for the brand, e.g. 'iPhone')
products['brand'] = products['clean_description'].str.extract(r'^(\w+)')
products['storage'] = products['clean_description'].str.extract(r'(\d+GB)')
products['color'] = products['clean_description'].str.extract(r'(Blue|Black|White|Green|Red)', 
                                                               flags=re.IGNORECASE)
products['color'] = products['color'].str.title()

print("\nExtracted attributes:")
print(products[['clean_description', 'brand', 'storage', 'color']])

---
## 6. Handling Special Characters and Encoding

International data often has special characters that need attention.

In [None]:
# International names and text
international = pd.DataFrame({
    'name': ['José García', 'François Müller', 'Søren Østergaard', 
             'Владимир', 'محمد', '田中太郎'],
    'city': ['São Paulo', 'Zürich', 'København', 'Москва', 'الرياض', '東京']
})

print("International data:")
print(international)

In [None]:
# Remove accents for ASCII-only systems
import unicodedata

def remove_accents(text):
    """
    Remove accents from text while preserving base letters.
    Only works for Latin letters that decompose into a base letter plus
    an accent; letters with no decomposition (e.g. 'ø') are dropped.
    """
    if pd.isna(text):
        return np.nan
    
    # Normalize unicode and remove accents
    normalized = unicodedata.normalize('NFKD', text)
    ascii_text = normalized.encode('ASCII', 'ignore').decode('ASCII')
    
    return ascii_text if ascii_text else text  # Return original if result is empty

international['name_ascii'] = international['name'].apply(remove_accents)
international['city_ascii'] = international['city'].apply(remove_accents)

print("\nWith ASCII versions:")
print(international)
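
If the goal is only to drop accents, and the result does not have to be pure ASCII, one alternative sketch is to remove just the combining marks after normalization. This keeps letters such as 'ø' that have no accent-free decomposition, and it leaves the Cyrillic sample untouched. `strip_combining_marks` is a hypothetical helper name.

```python
import unicodedata

def strip_combining_marks(text):
    """Drop combining accent marks but keep every base character."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# For comparison: the ASCII-encoding approach drops 'ø' entirely
ascii_only = unicodedata.normalize('NFKD', 'Søren').encode('ASCII', 'ignore').decode('ASCII')

print(strip_combining_marks('José García'))
print(strip_combining_marks('Søren Østergaard'))
print(strip_combining_marks('Владимир'))
print(ascii_only)
```

Which version fits depends on the target system: the ASCII approach guarantees ASCII output at the cost of losing characters, while this one preserves characters but may still return non-ASCII text.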

In [None]:
# Detect script type
def detect_script(text):
    """
    Detect the primary script used in text.
    """
    if pd.isna(text) or len(text) == 0:
        return 'Unknown'
    
    # Count characters by script (the ASCII range also counts spaces, digits, and punctuation as Latin)
    scripts = {'Latin': 0, 'Cyrillic': 0, 'Arabic': 0, 'CJK': 0}
    
    for char in text:
        if '\u0000' <= char <= '\u007F' or '\u00C0' <= char <= '\u024F':
            scripts['Latin'] += 1
        elif '\u0400' <= char <= '\u04FF':
            scripts['Cyrillic'] += 1
        elif '\u0600' <= char <= '\u06FF':
            scripts['Arabic'] += 1
        elif '\u4E00' <= char <= '\u9FFF' or '\u3040' <= char <= '\u30FF':
            scripts['CJK'] += 1
    
    # Return the most common script
    return max(scripts, key=scripts.get) if max(scripts.values()) > 0 else 'Unknown'

international['name_script'] = international['name'].apply(detect_script)
print("\nScript detection:")
print(international[['name', 'name_script']])

---
## 7. Practical Exercises

### Exercise 1: Clean Customer Data

In [None]:
# Messy customer data
messy_customers = pd.DataFrame({
    'full_name': ['  JOHNSON, MARY  ', 'Robert Smith Jr.', '  sarah   o\'brien  ',
                  'Dr. James WILSON', 'emily-rose davis'],
    'email': ['MARY.J@Gmail.COM', 'r.smith@@company.com', 'sarah.o.brien@outlook',
              'jwilson@hospital.org', 'Emily_Davis@yahoo.com'],
    'phone': ['555.123.4567', '(555) 234-5678', '555-345-6789',
              '15554567890', '+1 (555) 567-8901']
})

print("Messy customer data:")
print(messy_customers)

In [None]:
# TODO: Clean the customer data
# 1. Standardize names (Title Case, handle "LAST, FIRST" format)
# 2. Split into first_name and last_name
# 3. Clean and validate email addresses (lowercase, mark invalid ones)
# 4. Standardize phone numbers to (XXX) XXX-XXXX format


### Exercise 2: Extract Information from Log Data

In [None]:
# Log data
logs = pd.DataFrame({
    'log_entry': [
        '[2024-01-15 10:23:45] ERROR: Failed to connect to 192.168.1.100:8080',
        '[2024-01-15 10:24:12] INFO: User admin logged in from IP 10.0.0.5',
        '[2024-01-15 10:25:33] WARNING: Disk usage at 85% on server-01',
        '[2024-01-15 10:26:01] ERROR: Database timeout after 30s',
        '[2024-01-15 10:27:45] INFO: Backup completed successfully, size: 2.5GB'
    ]
})

print("Log entries:")
for log in logs['log_entry']:
    print(log)

In [None]:
# TODO: Extract the following from each log entry:
# 1. timestamp (convert to datetime)
# 2. log_level (ERROR, INFO, WARNING)
# 3. ip_address (if present)
# 4. message (everything after the log level)


### Exercise 3: Clean Survey Responses

In [None]:
# Survey responses with free text
survey = pd.DataFrame({
    'response': [
        'I REALLY loved the product!!!',
        '  meh... it was ok i guess   ',
        'TERRIBLE customer service!!!! Never again!!!',
        'Great product, fast shipping :)',
        'wtf is this garbage???',
        'Pretty good. Would recommend.',
        'N/A',
        ''
    ]
})

print("Survey responses:")
print(survey)

In [None]:
# TODO: Clean the survey data
# 1. Handle empty/N/A responses
# 2. Normalize case (Title Case or lowercase)
# 3. Remove excessive punctuation (multiple !, ?, etc.)
# 4. Calculate word count
# 5. Add a 'has_profanity' flag (for words like 'wtf', 'garbage')


---
## 8. Key Takeaways

### ✅ Essential `.str` Methods

```python
# Case conversion
df['col'].str.lower()   # also .str.upper(), .str.title()

# Whitespace
df['col'].str.strip()   # also .str.lstrip(), .str.rstrip()

# Split and join
df['col'].str.split('delimiter', expand=True)
df['col'].str.cat(df['col2'], sep=' ')

# Search and replace
df['col'].str.contains('pattern', regex=True)
df['col'].str.replace('old', 'new', regex=True)
df['col'].str.extract(r'(pattern)')

# Length and slicing
df['col'].str.len()
df['col'].str[0:5]
```

### ✅ Common Regex Patterns

```python
# Email: r'^[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}$'
# Phone: r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
# Date: r'\d{4}-\d{2}-\d{2}'
# Money: r'\$[\d,]+\.?\d*'
# Digits only: r'\d+'
# Non-digits: r'\D'
```

### ⚠️ Common Mistakes

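1. Forgetting `regex=True` in `.str.replace()` when the pattern is a regex (since pandas 2.0 the default is `regex=False`)
2. Not handling NaN values: `.str` methods propagate NaN, so pass `na=False` to `.str.contains()` before using the result as a filter
3. Case sensitivity issues (use `.str.lower()` before comparing, or `case=False` in `.str.contains()`)
4. Not escaping special regex characters (`$`, `.`, `?`, etc.); `re.escape()` handles this for literal text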
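
The sketch below illustrates three of these mistakes on a small made-up Series; the variable names are illustrative.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical price strings with one missing value
prices = pd.Series(['$1,234.56', 'free', np.nan])

# Mistake 1: without regex=True the pattern is treated as literal text
literal = prices.str.replace(r'[$,]', '', regex=False)   # nothing is replaced
cleaned = prices.str.replace(r'[$,]', '', regex=True)    # '$' and ',' removed

# Mistake 2: na=False turns the missing value into False so the mask can filter rows
has_comma = prices.str.contains(',', na=False)

# Mistake 4: an unescaped '$' means "end of string" and matches every non-missing row
unescaped = prices.str.contains('$', na=False)
escaped = prices.str.contains(re.escape('$'), na=False)

print(cleaned.tolist())
print(unescaped.tolist(), escaped.tolist())
```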