# Lesson 7: String Manipulation and Date/Time Data

**Topic:** Working with Text and Temporal Data in R

**Learning Objectives:**
- Use `stringr` functions to clean and manipulate text data
- Apply `lubridate` functions to parse and work with dates
- Extract components from dates (year, month, day, weekday)
- Perform date calculations for business analytics
- Combine string and date operations for real-world data cleaning

**Time:** 60 minutes

---

## Background: Why These Skills Matter in Business

### The Reality of Data in the Wild

Commonly cited industry estimates and practitioner experience suggest:
- **As much as 80% of analysis time** can go to data cleaning and preparation
- **Text data** contains valuable insights but arrives messy: inconsistent capitalization, extra spaces, typos, abbreviations
- **Date/time data** comes in countless formats: "2024-01-15", "01/15/2024", "Jan 15, 2024", "15-Jan-24"
- **Data integration** from multiple sources requires standardization before analysis

### Business Impact

Mastering string and date manipulation enables you to:

**Revenue Growth:**
- Analyze customer feedback to identify product improvements
- Segment customers by purchase recency for targeted marketing
- Identify seasonal patterns to optimize inventory and pricing

**Operational Efficiency:**
- Automate data cleaning processes (saving hours of manual work)
- Standardize product names and categories across systems
- Track response times and identify bottlenecks

**Strategic Insights:**
- Understand temporal patterns (daily, weekly, seasonal trends)
- Identify at-risk customers based on engagement recency
- Measure campaign effectiveness over time

### Real-World Applications

**E-commerce Example:**
```
Problem: Product names from suppliers are inconsistent
- "  laptop PRO 15-inch  " vs "Laptop Pro 15\"" vs "LAPTOP-PRO-15"

Solution: Use stringr to standardize
- Remove extra spaces: str_trim()
- Standardize case: str_to_title()
- Extract key info: str_extract() for screen size
```

**Customer Service Example:**
```
Problem: Need to identify customers who haven't purchased recently

Solution: Use lubridate to calculate recency
- Parse dates: ymd()
- Calculate days since: today() - last_purchase_date
- Segment: case_when() based on recency
```

---

## Key Concepts You'll Master

### String Manipulation (`stringr` package)

**Core Functions:**
- `str_trim()` - Remove whitespace from start/end
- `str_squish()` - Remove extra whitespace everywhere
- `str_to_lower()`, `str_to_upper()`, `str_to_title()` - Change case
- `str_detect()` - Find if pattern exists (returns TRUE/FALSE)
- `str_count()` - Count pattern occurrences
- `str_extract()` - Pull out first match
- `str_extract_all()` - Pull out all matches
- `str_replace()` - Replace first match
- `str_replace_all()` - Replace all matches
- `str_sub()` - Extract substring by position

**Regular Expressions (Regex) Basics:**
- `\\d` - Any digit (0-9)
- `\\w` - Any word character (letter, digit, underscore)
- `\\s` - Any whitespace
- `+` - One or more
- `*` - Zero or more
- `^` - Start of string
- `$` - End of string
- `|` - OR operator
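
As a small illustration of the tokens listed above, the sketch below applies them with `str_detect()`, `str_extract()`, and `str_count()`. The `codes` vector is made up for this example and is not part of the lesson data.

```r
library(stringr)

# Hypothetical product codes, invented for this illustration
codes <- c("SKU-1042", "sku 77", "Pro Mouse", "1TB")

starts_with_digit <- str_detect(codes, "^\\d")     # ^ anchors the match to the start
ends_with_digits  <- str_detect(codes, "\\d+$")    # + is one or more, $ anchors to the end
has_pro_or_sku    <- str_detect(codes, "Pro|SKU")  # | means OR (matching is case-sensitive)
first_word        <- str_extract(codes, "^\\w+")   # leading run of word characters
space_count       <- str_count(codes, "\\s")       # number of whitespace characters

print(data.frame(codes, starts_with_digit, ends_with_digits,
                 has_pro_or_sku, first_word, space_count))
```

Note that `"sku 77"` does not match `"Pro|SKU"` because regex matching is case-sensitive unless you say otherwise.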

### Date/Time Operations (`lubridate` package)

**Parsing Functions:**
- `ymd()` - Parse "2024-01-15" (Year-Month-Day)
- `mdy()` - Parse "01/15/2024" (Month-Day-Year)
- `dmy()` - Parse "15-01-2024" (Day-Month-Year)
- `ymd_hms()` - Parse date with time "2024-01-15 14:30:00"

**Extraction Functions:**
- `year()` - Extract year (2024)
- `month()` - Extract month number (1-12) or name
- `day()` - Extract day of month (1-31)
- `wday()` - Extract day of week (1-7 or name)
- `quarter()` - Extract quarter (1-4)
- `week()` - Extract week of year
- `hour()`, `minute()`, `second()` - Extract time components

**Calculation Functions:**
- `today()` - Current date
- `now()` - Current date and time
- Date arithmetic: `date1 - date2` gives difference
- `floor_date()`, `ceiling_date()` - Round dates
- `interval()`, `duration()`, `period()` - Time spans
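
To make the time-span functions a little more concrete, here is a minimal sketch that measures the gap between two dates in a few different ways. The two dates are borrowed from the sample transactions used later in the lesson; the variable names are only illustrative.

```r
library(lubridate)

start <- ymd("2024-01-15")
end   <- ymd("2024-03-20")

# Plain subtraction returns a difftime; as.numeric() with explicit units gives a number
days_between <- as.numeric(end - start, units = "days")

# An interval is anchored to two specific dates
span <- interval(start, end)
days_in_span <- time_length(span, "day")   # length of the interval in days
whole_months <- span %/% months(1)         # number of complete calendar months

cat("Days between:", days_between, "\n")
cat("Whole months:", whole_months, "\n")
```

Intervals are useful when the answer depends on the calendar (months differ in length), while a plain difference in days is usually enough for recency calculations.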

---

## Part 1: Setup and Sample Data

### Understanding the Packages

**`stringr` (part of tidyverse):**
- Provides consistent, intuitive functions for string manipulation
- All functions start with `str_` making them easy to discover
- Handles regular expressions (regex) for powerful pattern matching
- Vectorized operations work efficiently on entire columns
- Part of the tidyverse, so integrates seamlessly with dplyr pipes

**`lubridate`:**
- Dramatically simplifies date/time parsing and manipulation
- Intuitive function names match date format (ymd, mdy, dmy)
- Handles time zones, periods, durations, and intervals
- Makes date arithmetic as simple as regular math
- Essential for any time-based business analysis

### Why We Need Sample Data

We'll create realistic messy business data that demonstrates common data quality issues:
- **Extra whitespace** (leading/trailing spaces)
- **Inconsistent capitalization** (UPPERCASE, lowercase, Title Case)
- **Mixed formats** in the same column
- **Embedded information** that needs extraction (sizes, capacities)
- **Date variations** across different systems

This mirrors what you'll encounter in real business environments where data comes from multiple sources.

In [5]:
# Load necessary packages
library(tidyverse)  # includes stringr for text manipulation
library(lubridate)  # for date/time operations

# Confirm successful loading
cat("✅ Packages loaded successfully!\n\n")

# Display key functions for reference
cat("STRING MANIPULATION (stringr):\n")
cat("  Cleaning: str_trim(), str_squish()\n")
cat("  Case: str_to_lower(), str_to_upper(), str_to_title()\n")
cat("  Detection: str_detect(), str_count()\n")
cat("  Extraction: str_extract(), str_extract_all()\n")
cat("  Replacement: str_replace(), str_replace_all()\n\n")

cat("DATE/TIME OPERATIONS (lubridate):\n")
cat("  Parsing: ymd(), mdy(), dmy(), ymd_hms()\n")
cat("  Extraction: year(), month(), day(), wday(), quarter()\n")
cat("  Current: today(), now()\n")
cat("  Arithmetic: date1 - date2, floor_date(), ceiling_date()\n")

✅ Packages loaded successfully!

STRING MANIPULATION (stringr):
  Cleaning: str_trim(), str_squish()
  Case: str_to_lower(), str_to_upper(), str_to_title()
  Detection: str_detect(), str_count()
  Extraction: str_extract(), str_extract_all()
  Replacement: str_replace(), str_replace_all()

DATE/TIME OPERATIONS (lubridate):
  Parsing: ymd(), mdy(), dmy(), ymd_hms()
  Extraction: year(), month(), day(), wday(), quarter()
  Current: today(), now()
  Arithmetic: date1 - date2, floor_date(), ceiling_date()


In [6]:
# Create sample product data with intentionally messy text
# This demonstrates common data quality issues in real business data

products <- data.frame(
  ProductID = 1:8,
  
  # Notice the data quality issues:
  # - Extra spaces at start/end
  # - Inconsistent capitalization
  # - Mixed formats
  ProductName = c(
    "  laptop PRO 15-inch  ",      # Leading/trailing spaces, mixed case
    "wireless MOUSE",                # Mixed case
    "USB-C Hub with HDMI",          # Proper format
    "27-inch Monitor 4K",           # Contains size and resolution
    "mechanical keyboard RGB",      # All lowercase
    "  Webcam HD 1080p  ",          # Spaces and resolution
    "Gaming Headset Pro",           # Title case
    "portable SSD 1TB"              # Contains capacity
  ),
  
  # Categories also have inconsistent capitalization
  Category = c(
    "computers",      # lowercase
    "Peripherals",    # Title case
    "ACCESSORIES",    # UPPERCASE
    "monitors",       # lowercase
    "Peripherals",    # Title case
    "accessories",    # lowercase (same as row 3 but different case!)
    "PERIPHERALS",    # UPPERCASE (same as rows 2,5 but different case!)
    "Storage"         # Title case
  ),
  
  Price = c(1299.99, 29.99, 49.99, 599.99, 149.99, 89.99, 199.99, 179.99)
)

cat("📊 Original Product Data (notice the messy formatting):\n\n")
cat("Data Quality Issues to Fix:\n")
cat("  ❌ Extra whitespace in product names\n")
cat("  ❌ Inconsistent capitalization in categories\n")
cat("  ❌ Mixed case in product names\n")
cat("  ❌ Embedded information (sizes, capacities) not extracted\n\n")

print(products)

📊 Original Product Data (notice the messy formatting):

Data Quality Issues to Fix:
  ❌ Extra whitespace in product names
  ❌ Inconsistent capitalization in categories
  ❌ Mixed case in product names
  ❌ Embedded information (sizes, capacities) not extracted

  ProductID             ProductName    Category   Price
1         1    laptop PRO 15-inch     computers 1299.99
2         2          wireless MOUSE Peripherals   29.99
3         3     USB-C Hub with HDMI ACCESSORIES   49.99
4         4      27-inch Monitor 4K    monitors  599.99
5         5 mechanical keyboard RGB Peripherals  149.99
6         6       Webcam HD 1080p   accessories   89.99
7         7      Gaming Headset Pro PERIPHERALS  199.99
8         8        portable SSD 1TB     Storage  179.99


In [7]:
# Create sample transaction data with various date formats
# In real business scenarios, dates come from different systems with different formats

transactions <- data.frame(
  TransactionID = 1:10,
  
  CustomerName = c(
    "John Smith", "Jane Doe", "Bob Johnson", "Alice Williams", "Charlie Brown",
    "Diana Prince", "Eve Adams", "Frank Miller", "Grace Lee", "Henry Ford"
  ),
  
  # All dates are in YYYY-MM-DD format here (ISO 8601 standard)
  # In real data, you might see: "01/15/2024", "15-Jan-2024", "2024-01-15", etc.
  OrderDate = c(
    "2024-01-15", "2024-02-20", "2024-01-10", "2024-03-05", "2024-02-14",
    "2024-03-20", "2024-01-25", "2024-02-28", "2024-03-10", "2024-01-30"
  ),
  
  Amount = c(1299.99, 29.99, 599.99, 149.99, 49.99, 89.99, 199.99, 179.99, 1299.99, 599.99)
)

cat("📅 Transaction Data:\n\n")
cat("What we'll learn to do with this data:\n")
cat("  ✓ Parse date strings into Date objects\n")
cat("  ✓ Extract year, month, day, weekday\n")
cat("  ✓ Calculate days since transaction\n")
cat("  ✓ Identify weekend vs weekday transactions\n")
cat("  ✓ Segment customers by purchase recency\n")
cat("  ✓ Extract first names for personalization\n\n")

print(transactions)

📅 Transaction Data:

What we'll learn to do with this data:
  ✓ Parse date strings into Date objects
  ✓ Extract year, month, day, weekday
  ✓ Calculate days since transaction
  ✓ Identify weekend vs weekday transactions
  ✓ Segment customers by purchase recency
  ✓ Extract first names for personalization

   TransactionID   CustomerName  OrderDate  Amount
1              1     John Smith 2024-01-15 1299.99
2              2       Jane Doe 2024-02-20   29.99
3              3    Bob Johnson 2024-01-10  599.99
4              4 Alice Williams 2024-03-05  149.99
5              5  Charlie Brown 2024-02-14   49.99
6              6   Diana Prince 2024-03-20   89.99
7              7      Eve Adams 2024-01-25  199.99
8              8   Frank Miller 2024-02-28  179.99
9              9      Grace Lee 2024-03-10 1299.99
10            10     Henry Ford 2024-01-30  599.99


## Part 2: String Manipulation with stringr

### Why String Cleaning Matters

**Business Problem:** Inconsistent text data leads to:
- Duplicate categories that should be the same ("Peripherals" vs "PERIPHERALS" vs "peripherals")
- Failed joins between datasets ("Laptop Pro" doesn't match "  laptop PRO  ")
- Inaccurate counts and summaries
- Poor user experience in reports and dashboards

**Solution:** Standardize text data using stringr functions

### Core stringr Functions Explained

**Cleaning Functions:**
- `str_trim(string)` - Removes whitespace from start and end
  - Example: `str_trim("  hello  ")` → `"hello"`
- `str_squish(string)` - Removes leading/trailing whitespace AND reduces internal whitespace to single spaces
  - Example: `str_squish("  hello    world  ")` → `"hello world"`

**Case Conversion:**
- `str_to_lower(string)` - Convert to lowercase
  - Example: `str_to_lower("HELLO World")` → `"hello world"`
- `str_to_upper(string)` - Convert to UPPERCASE
  - Example: `str_to_upper("hello world")` → `"HELLO WORLD"`
- `str_to_title(string)` - Convert to Title Case
  - Example: `str_to_title("hello world")` → `"Hello World"`

**Pattern Detection:**
- `str_detect(string, pattern)` - Returns TRUE if pattern found, FALSE otherwise
  - Example: `str_detect("wireless mouse", "wireless")` → `TRUE`
  - Use case: Filter products, flag records, create indicators

**Pattern Extraction:**
- `str_extract(string, pattern)` - Extracts first match of pattern
  - Example: `str_extract("15-inch screen", "\\d+")` → `"15"`
  - Use case: Pull out numbers, codes, specific words

**Pattern Replacement:**
- `str_replace(string, pattern, replacement)` - Replace first match
- `str_replace_all(string, pattern, replacement)` - Replace all matches
  - Example: `str_replace_all("comp", "comp", "computer")` → `"computer"`
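
The differences between `str_trim()` and `str_squish()`, and between `str_replace()` and `str_replace_all()`, are easiest to see side by side. The following is a minimal sketch; the `raw` value is a made-up string with extra internal spaces.

```r
library(stringr)

raw <- "  laptop   PRO  15-inch  "   # made-up messy value

trimmed  <- str_trim(raw)            # outer spaces removed, inner gaps remain
squished <- str_squish(raw)          # outer spaces removed, inner gaps collapsed
titled   <- str_to_title(squished)   # first letter of each word capitalized

first_only  <- str_replace("a-b-c", "-", " ")      # only the first hyphen changes
all_of_them <- str_replace_all("a-b-c", "-", " ")  # every hyphen changes

print(c(trimmed, squished, titled, first_only, all_of_them))
```

If a column may contain doubled spaces inside values, `str_squish()` is usually the safer default for cleaning.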

### Regular Expression Quick Reference

- `\\d` - Any digit (0-9)
- `\\d+` - One or more digits
- `\\w` - Any word character (letter, digit, underscore)
- `\\s` - Any whitespace
- `^` - Start of string
- `$` - End of string
- `|` - OR (e.g., "pro|premium|deluxe")
- `.` - Any character
- `*` - Zero or more
- `+` - One or more

**Note:** In R string literals you write a double backslash `\\` because a single backslash is itself the string escape character; `"\\d"` is the R spelling of the regex `\d`.
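
A short sketch of what the double backslash actually does may help. It assumes R 4.0.0 or later for the raw-string syntax `r"(...)"`; the price string is invented for the example.

```r
library(stringr)

pattern_escaped <- "\\d+"     # R string holding the three characters \ d +
pattern_raw     <- r"(\d+)"   # raw string (R >= 4.0.0): no doubling needed

same_pattern <- identical(pattern_escaped, pattern_raw)
n_chars      <- nchar(pattern_escaped)

writeLines(pattern_escaped)   # shows the pattern as the regex engine receives it

# Regex metacharacters such as $ and . need escaping to be matched literally
price <- str_extract("Price: $49.99", "\\$\\d+\\.\\d+")
print(price)
```

Using `writeLines()` on a pattern is a quick way to check what is really being passed to the regex engine.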

In [8]:
# BUSINESS USE CASE: Standardize product names and categories for consistent reporting
#
# Problem: Product names have extra spaces and inconsistent capitalization
# Impact: Can't accurately count products, join with other data, or create clean reports
# Solution: Use str_trim() and str_to_title() to standardize

products_clean <- products %>%
  mutate(
    # Step 1: Remove leading/trailing whitespace
    # str_trim() removes spaces from start and end
    # Before: "  laptop PRO 15-inch  "
    # After:  "laptop PRO 15-inch"
    ProductName_Clean = str_trim(ProductName),
    
    # Step 2: Convert to title case for consistency
    # str_to_title() capitalizes first letter of each word
    # Before: "laptop PRO 15-inch"
    # After:  "Laptop Pro 15-Inch"
    ProductName_Clean = str_to_title(ProductName_Clean),
    
    # Step 3: Standardize category names
    # This ensures "Peripherals", "PERIPHERALS", and "peripherals" all become "Peripherals"
    Category_Clean = str_to_title(Category)
  )

cat("🧹 BEFORE vs AFTER Cleaning:\n\n")

# Show the transformation
products_clean %>% 
  select(ProductID, ProductName, ProductName_Clean, Category, Category_Clean) %>%
  print()

cat("\n✅ Benefits of cleaning:\n")
cat("  • Consistent formatting across all products\n")
cat("  • No extra spaces causing join failures\n")
cat("  • Categories now group correctly\n")
cat("  • Professional appearance in reports\n")

🧹 BEFORE vs AFTER Cleaning:

  ProductID             ProductName       ProductName_Clean    Category
1         1    laptop PRO 15-inch        Laptop Pro 15-Inch   computers
2         2          wireless MOUSE          Wireless Mouse Peripherals
3         3     USB-C Hub with HDMI     Usb-C Hub With Hdmi ACCESSORIES
4         4      27-inch Monitor 4K      27-Inch Monitor 4k    monitors
5         5 mechanical keyboard RGB Mechanical Keyboard Rgb Peripherals
6         6       Webcam HD 1080p           Webcam Hd 1080p accessories
7         7      Gaming Headset Pro      Gaming Headset Pro PERIPHERALS
8         8        portable SSD 1TB        Portable Ssd 1tb     Storage
  Category_Clean
1      Computers
2    Peripherals
3    Accessories
4       Monitors
5    Peripherals
6    Accessories
7    Peripherals
8        Storage

✅ Benefits of cleaning:
  • Consistent formatting across all products
  • No extra spaces causing join failures
  • Categories now group correctly
  • Professional appearance in reports

In [9]:
# BUSINESS USE CASE: Create product feature flags for filtering and analysis
#
# Problem: Need to identify products with specific features (wireless, gaming, premium)
# Impact: Can't easily filter products or create targeted marketing campaigns
# Solution: Use str_detect() to find patterns in product names

products_clean <- products_clean %>%
  mutate(
    # Check if product is wireless
    # str_detect() returns TRUE if pattern found, FALSE otherwise
    # str_to_lower() makes search case-insensitive
    Is_Wireless = str_detect(str_to_lower(ProductName), "wireless"),
    
    # Check if product is gaming-related
    # Useful for: Gaming category reports, targeted ads, inventory planning
    Is_Gaming = str_detect(str_to_lower(ProductName), "gaming"),
    
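    # Check if product is premium (contains Pro, HD, or 4K)
    # The | symbol means OR in regex
    # Matches: "Pro", "HD", "4K" (case-insensitive)
    # Caveat: this is plain substring matching, so "HDMI" (row 3) is also
    # flagged because it contains "hd"; word boundaries (\\b) would avoid that
    Is_Premium = str_detect(str_to_lower(ProductName), "pro|hd|4k")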
  )

cat("🏷️  PRODUCT FEATURE FLAGS:\n\n")

products_clean %>%
  select(ProductName_Clean, Is_Wireless, Is_Gaming, Is_Premium) %>%
  print()

cat("\n💡 Business Applications:\n")
cat("  • Filter products for targeted marketing campaigns\n")
cat("  • Analyze premium vs standard product performance\n")
cat("  • Create product category reports\n")
cat("  • Identify cross-sell opportunities\n")

# Show summary statistics
cat("\n📊 Summary:\n")
cat("  Wireless products:", sum(products_clean$Is_Wireless), "\n")
cat("  Gaming products:", sum(products_clean$Is_Gaming), "\n")
cat("  Premium products:", sum(products_clean$Is_Premium), "\n")

🏷️  PRODUCT FEATURE FLAGS:

        ProductName_Clean Is_Wireless Is_Gaming Is_Premium
1      Laptop Pro 15-Inch       FALSE     FALSE       TRUE
2          Wireless Mouse        TRUE     FALSE      FALSE
3     Usb-C Hub With Hdmi       FALSE     FALSE       TRUE
4      27-Inch Monitor 4k       FALSE     FALSE       TRUE
5 Mechanical Keyboard Rgb       FALSE     FALSE      FALSE
6         Webcam Hd 1080p       FALSE     FALSE       TRUE
7      Gaming Headset Pro       FALSE      TRUE       TRUE
8        Portable Ssd 1tb       FALSE     FALSE      FALSE

💡 Business Applications:
  • Filter products for targeted marketing campaigns
  • Analyze premium vs standard product performance
  • Create product category reports
  • Identify cross-sell opportunities

📊 Summary:
  Wireless products: 1 
  Gaming products: 1 
  Premium products: 5 
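
The premium count above includes "USB-C Hub with HDMI", which matches only because "HDMI" contains "hd". One possible refinement, sketched here on a subset of the product names, is to use `regex()` with `ignore_case = TRUE` and word boundaries (`\\b`). This is an alternative to the lesson's approach and not a required change.

```r
library(stringr)

product_names <- c("Laptop PRO 15-inch", "USB-C Hub with HDMI",
                   "Webcam HD 1080p", "27-inch Monitor 4K")

# Substring matching as in the cell above: "hdmi" contains "hd"
loose <- str_detect(str_to_lower(product_names), "pro|hd|4k")

# regex(ignore_case = TRUE) replaces the str_to_lower() step;
# \\b requires the keyword to stand alone as a whole word
premium_pattern <- regex("\\b(pro|hd|4k)\\b", ignore_case = TRUE)
strict <- str_detect(product_names, premium_pattern)

print(data.frame(product_names, loose, strict))
```

With the stricter pattern the HDMI hub is no longer flagged, while "PRO", "HD", and "4K" as standalone words still are.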


In [10]:
# BUSINESS USE CASE: Extract specifications from product descriptions
#
# Problem: Product specs (sizes, capacities) are embedded in text
# Impact: Can't filter by size, sort by capacity, or analyze spec trends
# Solution: Use str_extract() with regex to pull out numbers

products_clean <- products_clean %>%
  mutate(
    # Extract first number found in product name
    # \\d+ means "one or more digits"
    # Examples:
    #   "27-inch Monitor 4K" → "27"
    #   "Webcam HD 1080p" → "1080"
    #   "portable SSD 1TB" → "1"
    # The result is a character string; wrap in as.numeric() for calculations
    Size_Number = str_extract(ProductName, "\\d+"),
    
    # Check if product has capacity indicator (TB or GB)
    # Useful for storage products
    # Caveat: "RGB" (row 5) also matches because it contains "GB"
    Has_Capacity = str_detect(ProductName, "TB|GB")
  )

cat("🔍 EXTRACTED PRODUCT SPECIFICATIONS:\n\n")

products_clean %>%
  select(ProductName_Clean, Size_Number, Has_Capacity) %>%
  print()

cat("\n💡 Business Applications:\n")
cat("  • Filter monitors by screen size\n")
cat("  • Sort storage products by capacity\n")
cat("  • Analyze price vs. specifications\n")
cat("  • Identify missing product information\n")
cat("  • Create size-based product recommendations\n")

# Four backslashes in the cat() string print as two, which is how the
# pattern must be typed in R source code
cat("\n📝 Note: For more complex extraction (like '1TB'), you'd use:\n")
cat("  str_extract(ProductName, \"\\\\d+\\\\s*(TB|GB)\")\n")

🔍 EXTRACTED PRODUCT SPECIFICATIONS:

        ProductName_Clean Size_Number Has_Capacity
1      Laptop Pro 15-Inch          15        FALSE
2          Wireless Mouse        <NA>        FALSE
3     Usb-C Hub With Hdmi        <NA>        FALSE
4      27-Inch Monitor 4k          27        FALSE
5 Mechanical Keyboard Rgb        <NA>         TRUE
6         Webcam Hd 1080p        1080        FALSE
7      Gaming Headset Pro        <NA>        FALSE
8        Portable Ssd 1tb           1         TRUE

💡 Business Applications:
  • Filter monitors by screen size
  • Sort storage products by capacity
  • Analyze price vs. specifications
  • Identify missing product information
  • Create size-based product recommendations

📝 Note: For more complex extraction (like '1TB'), you'd use:
  str_extract(ProductName, "\\d+\\s*(TB|GB)")
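
The note about extracting values such as '1TB' can be sketched as follows. Requiring digits in front of the unit also avoids the false match on "RGB". The third product name is hypothetical, and the conversion treats 1 TB as 1000 GB, which is an assumption of this example.

```r
library(stringr)

product_names <- c("portable SSD 1TB",
                   "mechanical keyboard RGB",
                   "Backup Drive 512 GB")   # hypothetical extra product

# Digits, optional whitespace, then a unit: "RGB" has no digits, so it yields NA
capacity_text  <- str_extract(product_names, "\\d+\\s*(TB|GB)")
capacity_value <- as.numeric(str_extract(capacity_text, "\\d+"))
capacity_unit  <- str_extract(capacity_text, "TB|GB")

# Normalize to gigabytes (assumes decimal units: 1 TB = 1000 GB)
capacity_gb  <- ifelse(capacity_unit == "TB", capacity_value * 1000, capacity_value)
has_capacity <- !is.na(capacity_text)

print(data.frame(product_names, capacity_text, capacity_gb, has_capacity))
```

A numeric `capacity_gb` column can then be used for sorting or filtering storage products.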


## Part 3: Date/Time Manipulation with lubridate

### Why Date/Time Skills Are Critical for Business

**Business Problems:**
- Dates come in different formats from different systems
- Need to calculate customer recency for segmentation
- Must identify seasonal patterns and trends
- Time-based analysis drives inventory, staffing, marketing decisions
- Incorrect date handling leads to wrong business decisions

**Real-World Example:**
```
Scenario: E-commerce company wants to re-engage inactive customers

Questions to answer:
- Who hasn't purchased in 90+ days?
- What day of week do customers shop most?
- Are there seasonal patterns in purchases?
- How long between repeat purchases?

All require date manipulation!
```

### Core lubridate Functions Explained

**Parsing Functions (String → Date):**
- `ymd("2024-01-15")` - Parse Year-Month-Day format
- `mdy("01/15/2024")` - Parse Month-Day-Year format (common in US)
- `dmy("15-01-2024")` - Parse Day-Month-Year format (common in Europe)
- `ymd_hms("2024-01-15 14:30:00")` - Parse date with time

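**Why this matters:** Different systems use different formats. lubridate has a parser for each common ordering, but you still have to choose the one that matches your data: an ambiguous value such as "01/02/2024" is February 1 under `dmy()` and January 2 under `mdy()`.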
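
As a quick check of the parsers listed above, the sketch below feeds the same calendar day to them in five different spellings. It assumes an English (or C) locale for the month abbreviation "Jan"; the variable names are illustrative.

```r
library(lubridate)

target <- as.Date("2024-01-15")

parsed <- c(
  ymd("2024-01-15"),     # Year-Month-Day
  mdy("01/15/2024"),     # Month-Day-Year
  dmy("15-01-2024"),     # Day-Month-Year
  mdy("Jan 15, 2024"),   # month abbreviation (locale-dependent)
  dmy("15-Jan-24")       # two-digit year
)
all_same <- all(parsed == target)

# A value that cannot be parsed becomes NA (with a warning unless quiet = TRUE)
bad <- ymd("not a date", quiet = TRUE)

print(parsed)
print(is.na(bad))
```

Counting the `NA` values after parsing is a simple way to spot rows whose format does not match the parser you picked.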

**Extraction Functions (Date → Components):**
- `year(date)` - Extract year as number (2024)
- `month(date)` - Extract month as number (1-12)
- `month(date, label=TRUE)` - Extract month name, abbreviated by default ("Jan"); add `abbr=FALSE` for "January"
- `day(date)` - Extract day of month (1-31)
- `wday(date)` - Extract day of week as number (by default 1=Sunday, 7=Saturday)
- `wday(date, label=TRUE)` - Extract weekday name, abbreviated by default ("Mon"); add `abbr=FALSE` for "Monday"
- `quarter(date)` - Extract quarter (1-4)
- `week(date)` - Extract week of year (1-53)
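
The extraction functions can be tried on a single date before applying them to a whole column. This sketch uses 2024-03-10, a Sunday from the sample transactions; `week_start = 1` is shown as one way to number weekdays from Monday instead of the default Sunday.

```r
library(lubridate)

d <- ymd("2024-03-10")   # a Sunday

month_num    <- month(d)                              # numeric month
month_abbr   <- month(d, label = TRUE)                # abbreviated label (ordered factor)
month_full   <- month(d, label = TRUE, abbr = FALSE)  # full month name
wday_default <- wday(d)                               # Sunday counts as day 1 by default
wday_monday  <- wday(d, week_start = 1)               # Monday counts as day 1
quarter_num  <- quarter(d)

print(month_abbr)
print(month_full)
cat("wday default:", wday_default, " wday with Monday start:", wday_monday, "\n")
```

If your weekend test relies on weekday numbers, make sure it uses the same `week_start` as the `wday()` call that produced them.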

**Current Date/Time:**
- `today()` - Current date (no time)
- `now()` - Current date and time

**Date Arithmetic:**
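- `today() - date` - Days between dates (returned as a `difftime`)
- `date + days(7)` - Add 7 days
- `date + months(1)` - Add 1 month (gives `NA` when the resulting day does not exist, such as January 31 plus one month; `date %m+% months(1)` rolls back to the last valid day instead)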
- `floor_date(date, "month")` - Round down to start of month
- `ceiling_date(date, "month")` - Round up to start of next month
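
The following sketch runs the arithmetic functions above on a month-end date, where adding months behaves differently from adding days. The date is chosen only to show that edge case.

```r
library(lubridate)

d <- ymd("2024-01-31")

plus_week   <- d + days(7)        # one week later
naive_month <- d + months(1)      # NA, because "February 31" does not exist
safe_month  <- d %m+% months(1)   # rolls back to the last day of February

month_start <- floor_date(d, "month")     # first day of the same month
next_month  <- ceiling_date(d, "month")   # first day of the following month

print(c(plus_week, naive_month, safe_month, month_start, next_month))
```

For billing cycles or subscription renewals, `%m+%` is usually the safer choice because it never produces `NA` for a valid input date.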

### Business Applications

**Customer Segmentation:**
- Calculate days since last purchase
- Segment: New (< 30 days), Active (30-89 days), At-Risk (90-180 days), Churned (> 180 days)

**Trend Analysis:**
- Group sales by month, quarter, or year
- Identify seasonal patterns
- Calculate year-over-year growth

**Operational Planning:**
- Identify peak days/hours for staffing
- Analyze weekend vs weekday patterns
- Track response times and SLAs

**Marketing Timing:**
- Determine best days to send campaigns
- Identify seasonal promotion opportunities
- Calculate optimal re-engagement timing

In [11]:
# BUSINESS USE CASE: Parse dates and extract components for analysis
#
# Problem: Dates are stored as text strings, can't do calculations or extract patterns
# Impact: Can't segment by month, identify weekday patterns, or calculate recency
# Solution: Use ymd() to parse, then extract components

transactions_clean <- transactions %>%
  mutate(
    # Step 1: Parse date string to Date object
    # ymd() handles "2024-01-15" format (Year-Month-Day)
    # For "01/15/2024" use mdy()
    # For "15-01-2024" use dmy()
    OrderDate_Parsed = ymd(OrderDate),
    
    # Step 2: Extract year (useful for year-over-year analysis)
    Year = year(OrderDate_Parsed),
    
    # Step 3: Extract month as number (1-12)
    Month = month(OrderDate_Parsed),
    
    # Step 4: Extract month name (more readable in reports)
    # label=TRUE gives name instead of number
    # abbr=FALSE gives full name ("January" not "Jan")
    Month_Name = month(OrderDate_Parsed, label = TRUE, abbr = FALSE),
    
    # Step 5: Extract day of month (1-31)
    Day = day(OrderDate_Parsed),
    
    # Step 6: Get day of week (critical for pattern analysis)
    # label=TRUE gives name ("Monday", "Tuesday", etc.)
    # 1=Sunday, 2=Monday, ..., 7=Saturday
    Weekday = wday(OrderDate_Parsed, label = TRUE, abbr = FALSE)
  )

cat("📅 TRANSACTIONS WITH DATE COMPONENTS:\n\n")

transactions_clean %>%
  select(TransactionID, OrderDate, Year, Month_Name, Weekday) %>%
  print()

cat("\n💡 Business Applications:\n")
cat("  • Group sales by month for trend analysis\n")
cat("  • Identify which weekdays have most transactions\n")
cat("  • Create monthly/quarterly reports\n")
cat("  • Analyze seasonal patterns\n")
cat("  • Plan staffing based on busy days\n")

📅 TRANSACTIONS WITH DATE COMPONENTS:

   TransactionID  OrderDate Year Month_Name   Weekday
1              1 2024-01-15 2024    January    Monday
2              2 2024-02-20 2024   February   Tuesday
3              3 2024-01-10 2024    January Wednesday
4              4 2024-03-05 2024      March   Tuesday
5              5 2024-02-14 2024   February Wednesday
6              6 2024-03-20 2024      March Wednesday
7              7 2024-01-25 2024    January  Thursday
8              8 2024-02-28 2024   February Wednesday
9              9 2024-03-10 2024      March    Sunday
10            10 2024-01-30 2024    January   Tuesday

💡 Business Applications:
  • Group sales by month for trend analysis
  • Identify which weekdays have most transactions
  • Create monthly/quarterly reports
  • Analyze seasonal patterns
  • Plan staffing based on busy days


In [12]:
# BUSINESS USE CASE: Customer recency analysis for segmentation
#
# Problem: Need to identify customers who haven't purchased recently for re-engagement
# Impact: Can't target at-risk customers, missing revenue opportunities
# Solution: Calculate days since purchase and categorize

transactions_clean <- transactions_clean %>%
  mutate(
    # Calculate days since transaction (from today)
    # today() gives current date, so results change with the day this cell is run
    # Subtracting dates gives a 'difftime' object
    # as.numeric() converts to number of days
    Days_Since = as.numeric(today() - OrderDate_Parsed),
    
    # Categorize customers by recency (RFM analysis component)
    # case_when() is like if-else but cleaner for multiple conditions
    Recency_Category = case_when(
      Days_Since <= 30 ~ "Recent (<= 30 days)",     # Active, engaged customers
      Days_Since <= 60 ~ "Moderate (31-60 days)",   # Still engaged
      TRUE ~ "Old (> 60 days)"                       # At-risk, need re-engagement
    ),
    
    # Check if transaction was on weekend
    # wday() returns 1 for Sunday, 7 for Saturday
    # %in% checks if value is in the vector
    # Useful for: Staffing decisions, campaign timing
    Is_Weekend = wday(OrderDate_Parsed) %in% c(1, 7)
  )

cat("🎯 CUSTOMER RECENCY ANALYSIS:\n\n")

transactions_clean %>%
  select(CustomerName, OrderDate, Days_Since, Recency_Category, Is_Weekend) %>%
  print()

cat("\n💡 Business Applications:\n")
cat("  • Identify at-risk customers for re-engagement campaigns\n")
cat("  • Prioritize follow-up based on recency\n")
cat("  • Calculate customer lifetime value (CLV)\n")
cat("  • Analyze weekend vs weekday shopping patterns\n")
cat("  • Optimize campaign timing\n")

# Show distribution
cat("\n📊 Recency Distribution:\n")
table(transactions_clean$Recency_Category) %>% print()

cat("\n📊 Weekend vs Weekday:\n")
table(transactions_clean$Is_Weekend) %>% print()

🎯 CUSTOMER RECENCY ANALYSIS:

     CustomerName  OrderDate Days_Since Recency_Category Is_Weekend
1      John Smith 2024-01-15        630  Old (> 60 days)      FALSE
2        Jane Doe 2024-02-20        594  Old (> 60 days)      FALSE
3     Bob Johnson 2024-01-10        635  Old (> 60 days)      FALSE
4  Alice Williams 2024-03-05        580  Old (> 60 days)      FALSE
5   Charlie Brown 2024-02-14        600  Old (> 60 days)      FALSE
6    Diana Prince 2024-03-20        565  Old (> 60 days)      FALSE
7       Eve Adams 2024-01-25        620  Old (> 60 days)      FALSE
8    Frank Miller 2024-02-28        586  Old (> 60 days)      FALSE
9       Grace Lee 2024-03-10        575  Old (> 60 days)       TRUE
10     Henry Ford 2024-01-30        615  Old (> 60 days)      FALSE

💡 Business Applications:
  • Identify at-risk customers for re-engagement campaigns
  • Prioritize follow-up based on recency
  • Calculate customer lifetime value (CLV)
  • Analyze weekend vs weekday shopping patterns
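
Because `today()` changes every day, the `Days_Since` values above depend on when the cell was run. A minimal sketch of a reproducible variant is shown below: it measures recency against a fixed reference date. The date 2024-04-01 and the name `as_of` are arbitrary choices for this illustration, and only three of the sample customers are included.

```r
library(dplyr)
library(lubridate)

# Fixed reference date (an arbitrary choice) instead of today()
as_of <- ymd("2024-04-01")

orders <- tibble(
  CustomerName = c("John Smith", "Jane Doe", "Diana Prince"),
  OrderDate    = ymd(c("2024-01-15", "2024-02-20", "2024-03-20"))
)

recency <- orders %>%
  mutate(
    Days_Since = as.numeric(as_of - OrderDate, units = "days"),
    Recency_Category = case_when(
      Days_Since <= 30 ~ "Recent",
      Days_Since <= 60 ~ "Moderate",
      TRUE             ~ "Old"
    )
  )

print(recency)
```

In a scheduled report you would typically set the reference date to the report date, so that rerunning the analysis later gives the same segments.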

In [13]:
# BUSINESS USE CASE: Identify transaction patterns by day of week
#
# Problem: Need to understand when customers shop to optimize staffing and campaigns
# Impact: Inefficient staffing, missed revenue opportunities
# Solution: Aggregate transactions by weekday

weekday_analysis <- transactions_clean %>%
  # Group by day of week
  group_by(Weekday) %>%
  # Calculate key metrics for each day
  summarize(
    # Count of transactions
    Transaction_Count = n(),
    
    # Total revenue for the day
    Total_Revenue = sum(Amount),
    
    # Average transaction size
    Avg_Transaction = mean(Amount),
    
    .groups = 'drop'  # Remove grouping after summarize
  ) %>%
  # Sort by transaction count (highest first)
  arrange(desc(Transaction_Count))

cat("TRANSACTION PATTERNS BY DAY OF WEEK:\n\n")
print(weekday_analysis)

cat("\n💡 Business Insights:\n")
cat("  • Identify peak days for staffing optimization\n")
cat("  • Determine best days to launch campaigns\n")
cat("  • Understand customer shopping behavior\n")
cat("  • Plan inventory based on demand patterns\n")
cat("  • Optimize customer service availability\n")

# Find the busiest day
busiest_day <- weekday_analysis$Weekday[1]
cat("\n🔥 Busiest Day:", as.character(busiest_day), "\n")

TRANSACTION PATTERNS BY DAY OF WEEK:

# A tibble: 5 × 4
  Weekday   Transaction_Count Total_Revenue Avg_Transaction
  <ord>                 <int>         <dbl>           <dbl>
1 Wednesday                 4          920.            230.
2 Tuesday                   3          780.            260.
3 Sunday                    1         1300.           1300.
4 Monday                    1         1300.           1300.
5 Thursday                  1          200.            200.

💡 Business Insights:
  • Identify peak days for staffing optimization
  • Determine best days to launch campaigns
  • Understand customer shopping behavior
  • Plan inventory based on demand patterns
  • Optimize customer service availability

🔥 Busiest Day: Wednesday 


## Part 4: Combined String and Date Operations

Real-world example: Extract customer first names and create personalized messages based on purchase recency.

In [14]:
# BUSINESS USE CASE: Personalized customer outreach based on purchase recency
#
# Problem: Need to send personalized re-engagement messages to customers
# Impact: Generic messages have low engagement, missing revenue opportunities
# Solution: Combine string extraction (first name) with date calculations (recency)

customer_outreach <- transactions_clean %>%
  mutate(
    # Extract first name from full name
    # ^\\w+ means: from start (^) match one or more word characters (\\w+)
    # This captures the leading run of letters, digits, and underscores,
    # which for these names is everything before the first space
    # "John Smith" → "John"
    # "Alice Williams" → "Alice"
    FirstName = str_extract(CustomerName, "^\\w+"),
    
    # Create personalized message based on recency
    # case_when() allows complex conditional logic
    # paste0() joins strings without inserting spaces, so punctuation
    # sits directly after the name
    Message = case_when(
      # Recent customers (<= 30 days): Thank them
      Days_Since <= 30 ~ paste0("Hi ", FirstName, "! Thanks for your recent purchase!"),
      
      # Moderate recency (31-60 days): Gentle reminder
      Days_Since <= 60 ~ paste0("Hi ", FirstName, ", we miss you! Check out our new products."),
      
      # Old customers (> 60 days): Special offer to re-engage
      TRUE ~ paste0("Hi ", FirstName, ", it's been a while! Here's a special offer for you.")
    )
  )

cat("💌 PERSONALIZED CUSTOMER MESSAGES:\n\n")

customer_outreach %>%
  select(CustomerName, FirstName, Days_Since, Message) %>%
  print()

cat("\n💡 Business Applications:\n")
cat("  • Automated email campaigns with personalization\n")
cat("  • SMS marketing with customer names\n")
cat("  • Targeted re-engagement based on behavior\n")
cat("  • Improved customer experience through personalization\n")
cat("  • Higher engagement rates vs generic messages\n")

cat("\n📈 Potential Impact (illustrative; not measured in this lesson):\n")
cat("  • Personalized messages: often reported to lift open rates\n")
cat("  • Recency-based targeting: can improve conversion\n")
cat("  • Automated workflow: Saves hours of manual work\n")

💌 PERSONALIZED CUSTOMER MESSAGES:

     CustomerName FirstName Days_Since
1      John Smith      John        630
2        Jane Doe      Jane        594
3     Bob Johnson       Bob        635
4  Alice Williams     Alice        580
5   Charlie Brown   Charlie        600
6    Diana Prince     Diana        565
7       Eve Adams       Eve        620
8    Frank Miller     Frank        586
9       Grace Lee     Grace        575
10     Henry Ford     Henry        615
                                                          Message
1     Hi John, it's been a while! Here's a special offer for you.
2     Hi Jane, it's been a while! Here's a special offer for you.
3      Hi Bob, it's been a while! Here's a special offer for you.
4    Hi Alice, it's been a while! Here's a special offer for you.
5  Hi Charlie, it's been a while! Here's a special offer for you.
6    Hi Diana, it's been a while! Here's a special offer for you.
7      Hi Eve, it's been a while! Here's a special offer for you.
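
First-name extraction with `^\\w+` works for the sample names, but real customer lists often contain hyphenated names or stray leading spaces. The sketch below compares it with a whitespace-based pattern; the second and third names are hypothetical and were added only to show the difference.

```r
library(stringr)

customer_names <- c("John Smith",
                    "Mary-Jane Watson",   # hypothetical hyphenated name
                    "  Grace Lee")        # hypothetical leading spaces

# Word characters only: stops at the hyphen, and fails if the string starts with a space
by_word_chars <- str_extract(customer_names, "^\\w+")

# Trim first, then take everything up to the first whitespace
by_non_space <- str_extract(str_trim(customer_names), "^\\S+")

# paste0() adds no separator, so the comma sits directly after the name
greeting <- paste0("Hi ", by_non_space, ", thanks for your recent purchase!")

print(data.frame(customer_names, by_word_chars, by_non_space))
print(greeting)
```

Which pattern is appropriate depends on how names are stored in your source system, so it is worth checking a sample of the extracted values before sending any messages.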

## Part 5: Business Analytics Application

Putting it all together: Customer segmentation and product analysis.

In [15]:
# Monthly revenue analysis
# Keep the numeric Month in the grouping so it survives summarize()
# and is still available to arrange() for chronological sorting
monthly_revenue <- transactions_clean %>%
  group_by(Year, Month, Month_Name) %>%
  summarize(
    Total_Revenue = sum(Amount),
    Transaction_Count = n(),
    Avg_Transaction = mean(Amount),
    .groups = 'drop'
  ) %>%
  arrange(Year, Month)

print("Monthly Revenue Summary:")
print(monthly_revenue)


In [None]:
# Product category analysis with clean names
category_summary <- products_clean %>%
  group_by(Category_Clean) %>%
  summarize(
    Product_Count = n(),
    Avg_Price = mean(Price),
    Total_Value = sum(Price),
    Premium_Count = sum(Is_Premium),
    .groups = 'drop'
  ) %>%
  arrange(desc(Total_Value))

print("Product Category Analysis:")
print(category_summary)

## Summary: Key Takeaways

### String Manipulation (stringr):
- `str_trim()` - Clean whitespace
- `str_to_lower/upper/title()` - Standardize case
- `str_detect()` - Find patterns
- `str_extract()` - Pull out specific parts
- `str_replace()` - Fix inconsistencies

### Date/Time Operations (lubridate):
- `ymd()`, `mdy()`, `dmy()` - Parse dates
- `year()`, `month()`, `day()` - Extract components
- `wday()` - Get weekday
- Date arithmetic - Calculate differences
- `today()`, `now()` - Current date/time

### Business Applications:
- Clean messy product names and categories
- Analyze transaction patterns by time
- Segment customers by recency
- Create personalized communications
- Identify trends and seasonality

### Next Steps:
Practice these skills with your own data! Text and date cleaning are essential for real-world analytics.

---

**End of Lesson 7**