# Homework Assignment - Lesson 7: String Manipulation and Date/Time Data

**Student Name:** [Trinity Schroeder]

**Student ID:** [zmq814]

**Date Submitted:** [10/12/2025]

**Due Date:** [10/12/2025]

---

## Objective

Master string manipulation with `stringr` and date/time operations with `lubridate` for real-world business data cleaning and analysis.

## Learning Goals

By completing this assignment, you will:
- Clean and standardize messy text data using `stringr` functions
- Parse and manipulate dates using `lubridate` functions
- Extract information from text and dates for business insights
- Combine string and date operations for customer segmentation
- Create business-ready reports from raw data

## Instructions

- Complete all tasks in this notebook
- Write your code in the designated TODO sections
- Use the pipe operator (`%>%`) wherever possible
- Add comments explaining your logic
- Run all cells to verify your code works
- Answer all reflection questions

## Datasets

You will work with three CSV files:
- `customer_feedback.csv` - Customer reviews with messy text
- `transaction_log.csv` - Transaction records with dates
- `product_catalog.csv` - Product descriptions needing standardization

---

## Part 1: Data Import and Initial Exploration

**Business Context:** Before cleaning data, you must understand its structure and quality issues.

**Your Tasks:**
1. Load required packages (`tidyverse` and `lubridate`)
2. Import all three CSV files from the `data/` directory
3. Examine the structure and identify data quality issues
4. Display sample rows to understand the data

In [61]:
# Task 1.1: Load Required Packages
library(tidyverse)  # includes stringr
library(lubridate)
cat("‚úÖ Packages loaded successfully!\n")

‚úÖ Packages loaded successfully!


In [62]:
# Task 1.2: Import Datasets
# Import customer_feedback.csv into a variable called 'feedback'
feedback <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/customer_feedback.csv")

# Import transaction_log.csv into a variable called 'transactions'
transactions <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/transaction_log.csv")

# Import product_catalog.csv into a variable called 'products'
products <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/product_catalog.csv")

cat("‚úÖ Data imported successfully!\n")
cat("Feedback rows:", nrow(feedback), "\n")
cat("Transaction rows:", nrow(transactions), "\n")
cat("Product rows:", nrow(products), "\n")

[1mRows: [22m[34m100[39m [1mColumns: [22m[34m5[39m
[36m‚îÄ‚îÄ[39m [1mColumn specification[22m [36m‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ[39m
[1mDelimiter:[22m ","
[31mchr[39m  (2): Feedback_Text, Contact_Info
[32mdbl[39m  (2): FeedbackID, CustomerID
[34mdate[39m (1): Feedback_Date

[36m‚Ñπ[39m Use `spec()` to retrieve the full column specification for this data.
[36m‚Ñπ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[36m‚îÄ‚îÄ[39m [1mColumn specification[22m [36m‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ[39m
[1mDelimiter:[22m ","
[31mchr[39m  (2): Feedback_Text, Contact_Info
[32mdbl[39m  (2): FeedbackID, CustomerID
[34mdate[39m (1): Feedback_Date

[36m‚Ñπ[3

[36m‚îÄ‚îÄ[39m [1mColumn specification[22m [36m‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ[39m
[1mDelimiter:[22m ","
[31mchr[39m (2): Transaction_DateTime, Status
[32mdbl[39m (3): LogID, CustomerID, Amount

[36m‚Ñπ[39m Use `spec()` to retrieve the full column specification for this data.
[36m‚Ñπ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.

[36m‚Ñπ[39m Use `spec()` to retrieve the full column specification for this data.
[36m‚Ñπ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m75[39m [1mColumns: [22m[34m5[39m
[36m‚îÄ‚îÄ[39m [1mColumn specification[22m [36m‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ[39m
[1mDelimiter:

‚úÖ Data imported successfully!
Feedback rows: 100 
Transaction rows: 150 
Product rows: 75 
Feedback rows: 100 
Transaction rows: 150 
Product rows: 75 


In [63]:
# Task 1.3: Initial Data Exploration

cat("=== CUSTOMER FEEDBACK DATA ===\n")
# Display structure of feedback using str()
str(feedback)

# Display first 5 rows of feedback
print(head(feedback, 5))

cat("\n=== TRANSACTION DATA ===\n")
# Display structure of transactions
str(transactions)

# Display first 5 rows of transactions
print(head(transactions, 5))

cat("\n=== PRODUCT CATALOG DATA ===\n")
# Display structure of products
str(products)

# Display first 5 rows of products
print(head(products, 5))

=== CUSTOMER FEEDBACK DATA ===
spc_tbl_ [100 √ó 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ FeedbackID   : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID   : num [1:100] 12 40 34 1 47 13 13 37 49 23 ...
 $ Feedback_Text: chr [1:100] "Highly recommend this item" "Excellent service" "Poor quality control" "average product, nothing special" ...
 $ Contact_Info : chr [1:100] "bob.wilson@test.org" "555-123-4567" "jane_smith@company.com" "jane_smith@company.com" ...
 $ Feedback_Date: Date[1:100], format: "2024-02-23" "2024-01-21" ...
 - attr(*, "spec")=
  .. cols(
  ..   FeedbackID = [32mcol_double()[39m,
  ..   CustomerID = [32mcol_double()[39m,
  ..   Feedback_Text = [31mcol_character()[39m,
  ..   Contact_Info = [31mcol_character()[39m,
  ..   Feedback_Date = [34mcol_date(format = "")[39m
  .. )
 - attr(*, "problems")=<externalptr> 
spc_tbl_ [100 √ó 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ FeedbackID   : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID   : num [1:10

[90m# A tibble: 5 √ó 5[39m
  ProductID Product_Description                       Category    Price In_Stock
      [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m                                     [3m[90m<chr>[39m[23m       [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m   
[90m1[39m         1 Apple iPhone 14 Pro - 128GB - Space Black TV           964. Limited 
[90m2[39m         2 samsung galaxy s23 ultra 256gb            TV          [4m1[24m817. Yes     
[90m3[39m         3 Apple iPhone 14 Pro - 128GB - Space Black Audio        853. Yes     
[90m4[39m         4 Apple iPhone 14 Pro - 128GB - Space Black Shoes        649. Yes     
[90m5[39m         5 samsung galaxy s23 ultra 256gb            Electronics  586. Limited 


## Part 2: String Cleaning and Standardization

**Business Context:** Product names and feedback text often have inconsistent formatting that prevents accurate analysis.

**Your Tasks:**
1. Clean product names (remove extra spaces, standardize case)
2. Standardize product categories
3. Clean customer feedback text
4. Extract customer names from feedback

**Key Functions:** `str_trim()`, `str_squish()`, `str_to_lower()`, `str_to_upper()`, `str_to_title()`

In [64]:
# Task 2.1: Clean Product Names

library(dplyr)
library(stringr)

# Use the actual product description column from the dataset
product_col <- "Product_Description"
if (!product_col %in% colnames(products)) {
  stop("Expected column 'Product_Description' not found. Columns: ", paste(colnames(products), collapse=", "))
}
cat("Using product column:", product_col, "\n")

# Create cleaned column and keep original in a temp column for comparison
products_clean <- products %>%
  mutate(
    product_name_clean = str_to_title(str_trim(.data[[product_col]])),
    orig_name = .data[[product_col]]
  )

# Display before and after (show original description and cleaned name)
cat("Product Name Cleaning Results:\n")
products_clean %>%
  select(orig_name, product_name_clean) %>%
  head(10) %>%
  print()

# Additional cleaning statistics
cat("\n=== Cleaning Statistics ===\n")
cat("Total products cleaned:", nrow(products_clean), "\n")
cat("Products with changes:", sum(products_clean$orig_name != products_clean$product_name_clean, na.rm = TRUE), "\n")

# Show examples of transformations
cat("\n=== Examples of Transformations ===\n")
differences <- products_clean %>%
  filter(orig_name != product_name_clean) %>%
  select(orig_name, product_name_clean) %>%
  head(10)
print(differences)

Using product column: Product_Description 
Product Name Cleaning Results:
Product Name Cleaning Results:
[90m# A tibble: 10 √ó 2[39m
   orig_name                                   product_name_clean               
   [3m[90m<chr>[39m[23m                                       [3m[90m<chr>[39m[23m                            
[90m 1[39m [90m"[39mApple iPhone 14 Pro - 128GB - Space Black[90m"[39m [90m"[39mApple Iphone 14 Pro - 128gb - S‚Ä¶
[90m 2[39m [90m"[39msamsung galaxy s23 ultra 256gb[90m"[39m            [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m 
[90m 3[39m [90m"[39mApple iPhone 14 Pro - 128GB - Space Black[90m"[39m [90m"[39mApple Iphone 14 Pro - 128gb - S‚Ä¶
[90m 4[39m [90m"[39mApple iPhone 14 Pro - 128GB - Space Black[90m"[39m [90m"[39mApple Iphone 14 Pro - 128gb - S‚Ä¶
[90m 5[39m [90m"[39msamsung galaxy s23 ultra 256gb[90m"[39m            [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m 
[90m 6[39m [90m"[39mApple iPho

In [65]:
# Task 2.2: Standardize Product Categories
# TODO: Create a new column 'category_clean' that:
#   - Converts category to Title Case
#   - Removes any extra whitespace

library(stringr)
# Detect category column (common names)
cat("Product dataset columns:\n")
print(colnames(products_clean))

cat("Using 'Category' column for cleaning (adjust if different)\n")
# Create category_clean by trimming and title-casing the 'Category' column
products_clean <- products_clean %>%
  mutate(
    category_clean = str_to_title(str_trim(Category))
  )

# Show unique categories before and after
cat("Original categories:\n")
print(unique(products$category))

cat("\nCleaned categories:\n")
print(unique(products_clean$category_clean))

Product dataset columns:
[1] "ProductID"           "Product_Description" "Category"           
[4] "Price"               "In_Stock"            "product_name_clean" 
[7] "orig_name"          
Using 'Category' column for cleaning (adjust if different)
Original categories:
[1] "ProductID"           "Product_Description" "Category"           
[4] "Price"               "In_Stock"            "product_name_clean" 
[7] "orig_name"          
Using 'Category' column for cleaning (adjust if different)
Original categories:


‚ÄúUnknown or uninitialised column: `category`.‚Äù


NULL

Cleaned categories:
[1] "Tv"          "Audio"       "Shoes"       "Electronics" "Computers"  

Cleaned categories:
[1] "Tv"          "Audio"       "Shoes"       "Electronics" "Computers"  


In [66]:
# Task 2.3: Clean Customer Feedback Text
# Detect common feedback column names, handle NAs, lowercase + squish
feedback_col <- intersect(c("feedback_text", "Feedback_Text", "feedback", "Feedback"), colnames(feedback))
if (length(feedback_col) == 0) stop("Expected column 'feedback_text' not found. Columns: ", paste(colnames(feedback), collapse = ", "))
feedback_col <- feedback_col[1]
message("Using feedback column: ", feedback_col)

feedback_clean <- feedback %>%
  mutate(
    feedback_clean = coalesce(as.character(.data[[feedback_col]]), "") %>%
      str_to_lower() %>%
      str_squish()
  ) %>%
  select(FeedbackID, CustomerID, all_of(feedback_col), feedback_clean) %>%
  head(10)

print(feedback_clean)


Using feedback column: Feedback_Text



[90m# A tibble: 10 √ó 4[39m
   FeedbackID CustomerID Feedback_Text                    feedback_clean        
        [3m[90m<dbl>[39m[23m      [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m                            [3m[90m<chr>[39m[23m                 
[90m 1[39m          1         12 Highly recommend this item       highly recommend this‚Ä¶
[90m 2[39m          2         40 Excellent service                excellent service     
[90m 3[39m          3         34 Poor quality control             poor quality control  
[90m 4[39m          4          1 average product, nothing special average product, noth‚Ä¶
[90m 5[39m          5         47 AMAZING customer support!!!      amazing customer supp‚Ä¶
[90m 6[39m          6         13 AMAZING customer support!!!      amazing customer supp‚Ä¶
[90m 7[39m          7         13 average product, nothing special average product, noth‚Ä¶
[90m 8[39m          8         37 good VALUE for money             good value for mo

## Part 3: Pattern Detection and Extraction

**Business Context:** Identifying products with specific features and extracting specifications helps with inventory management and marketing.

**Your Tasks:**
1. Identify products with specific keywords (wireless, premium, gaming)
2. Extract numerical specifications from product names
3. Detect sentiment words in customer feedback
4. Extract email addresses from feedback

**Key Functions:** `str_detect()`, `str_extract()`, `str_count()`

In [67]:
# Task 3.1: Detect Product Features
# Create three new columns:
#   - is_wireless: TRUE if product name contains "wireless" (case-insensitive)
#   - is_premium: TRUE if product name contains "pro", "premium", or "deluxe"
#   - is_gaming: TRUE if product name contains "gaming" or "gamer"
# Use str_detect() with str_to_lower() for case-insensitive matching

products_clean <- products_clean %>%
  mutate(
    # ensure product_name_clean is character and not NA
    product_name_clean = coalesce(as.character(product_name_clean), ""),
    is_wireless = str_detect(str_to_lower(product_name_clean), "wireless"),
    is_premium = str_detect(str_to_lower(product_name_clean), "\b(pro|premium|deluxe)\b"),
    is_gaming = str_detect(str_to_lower(product_name_clean), "\b(gaming|gamer)\b")
  )

# Display results
cat("Product Feature Detection:\n")
products_clean %>%
  select(product_name_clean, is_wireless, is_premium, is_gaming) %>%
  head(10) %>%
  print()

# Summary statistics
cat("\nFeature Summary:\n")
cat("Wireless products:", sum(products_clean$is_wireless, na.rm = TRUE), "\n")
cat("Premium products:", sum(products_clean$is_premium, na.rm = TRUE), "\n")
cat("Gaming products:", sum(products_clean$is_gaming, na.rm = TRUE), "\n")


Product Feature Detection:
[90m# A tibble: 10 √ó 4[39m
   product_name_clean                          is_wireless is_premium is_gaming
   [3m[90m<chr>[39m[23m                                       [3m[90m<lgl>[39m[23m       [3m[90m<lgl>[39m[23m      [3m[90m<lgl>[39m[23m    
[90m 1[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m FALSE       FALSE      FALSE    
[90m 2[39m [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m            FALSE       FALSE      FALSE    
[90m 3[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m FALSE       FALSE      FALSE    
[90m 4[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m FALSE       FALSE      FALSE    
[90m 5[39m [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m            FALSE       FALSE      FALSE    
[90m 6[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m FALSE       FALSE      FALSE    
[90m 7[39m [90m"[39mDell Xps 13 Laptop - In

In [68]:
# Task 3.2: Extract Product Specifications
# Create a new column 'size_number' that extracts the first number from product_name
# Use str_extract() with pattern "\\d+" to match one or more digits

products_clean <- products_clean %>%
  mutate(
    # ensure product_name_clean is character
    product_name_clean = coalesce(as.character(product_name_clean), ""),
    size_text = str_extract(product_name_clean, "\\d+"),
    size_number = as.numeric(size_text)
  ) %>%
  # remove helper column 'size_text' to keep workspace clean
  select(-size_text)

# Display products with extracted sizes
cat("Extracted Product Specifications:\n")
products_clean %>%
  filter(!is.na(size_number)) %>%
  select(product_name_clean, size_number) %>%
  head(10) %>%
  print()


Extracted Product Specifications:
[90m# A tibble: 10 √ó 2[39m
   product_name_clean                          size_number
   [3m[90m<chr>[39m[23m                                             [3m[90m<dbl>[39m[23m
[90m 1[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m          14
[90m 2[39m [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m                     23
[90m 3[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m          14
[90m 4[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m          14
[90m 5[39m [90m"[39mSamsung Galaxy S23 Ultra 256gb[90m"[39m                     23
[90m 6[39m [90m"[39mApple Iphone 14 Pro - 128gb - Space Black[90m"[39m          14
[90m 7[39m [90m"[39mDell Xps 13 Laptop - Intel I7 - 16gb Ram[90m"[39m           13
[90m 8[39m [90m"[39mNike Air Max 270 - Size 10 - Black/White[90m"[39m          270
[90m 9[39m [90m"[39mLg 55\" 4k Smart Tv - Oled Display[90m"[39

In [69]:
# Task 3.3: Simple Sentiment Analysis
# Create three new columns:
#   - positive_words: count of positive words ("great", "excellent", "love", "amazing")
#   - negative_words: count of negative words ("bad", "terrible", "hate", "awful")
#   - sentiment_score: positive_words - negative_words
# Use str_count() to count pattern occurrences

# Define regex patterns (word boundaries for exact words)
pos_pattern <- "\\b(great|excellent|love|amazing)\\b"
neg_pattern <- "\\b(bad|terrible|hate|awful)\\b"

feedback_clean <- feedback_clean %>%
  mutate(
    # ensure feedback_clean is character
    feedback_clean = coalesce(as.character(feedback_clean), ""),
    positive_words = str_count(feedback_clean, regex(pos_pattern, ignore_case = TRUE)),
    negative_words = str_count(feedback_clean, regex(neg_pattern, ignore_case = TRUE)),
    sentiment_score = positive_words - negative_words
  )

# Display sentiment analysis results
cat("Sentiment Analysis Results:\n")
feedback_clean %>%
  select(feedback_clean, positive_words, negative_words, sentiment_score) %>%
  head(10) %>%
  print()

# Summary
cat("\nOverall Sentiment Summary:\n")
cat("Average sentiment score:", mean(feedback_clean$sentiment_score, na.rm = TRUE), "\n")
cat("Positive reviews:", sum(feedback_clean$sentiment_score > 0, na.rm = TRUE), "\n")
cat("Negative reviews:", sum(feedback_clean$sentiment_score < 0, na.rm = TRUE), "\n")


Sentiment Analysis Results:
[90m# A tibble: 10 √ó 4[39m
   feedback_clean                  positive_words negative_words sentiment_score
   [3m[90m<chr>[39m[23m                                    [3m[90m<int>[39m[23m          [3m[90m<int>[39m[23m           [3m[90m<int>[39m[23m
[90m 1[39m highly recommend this item                   0              0               0
[90m 2[39m excellent service                            1              0               1
[90m 3[39m poor quality control                         0              0               0
[90m 4[39m average product, nothing speci‚Ä¶              0              0               0
[90m 5[39m amazing customer support!!!                  1              0               1
[90m 6[39m amazing customer support!!!                  1              0               1
[90m 7[39m average product, nothing speci‚Ä¶              0              0               0
[90m 8[39m good value for money                         0    

## Part 4: Date Parsing and Component Extraction

**Business Context:** Transaction dates need to be parsed and analyzed to understand customer behavior patterns.

**Your Tasks:**
1. Parse transaction dates from text to Date objects
2. Extract date components (year, month, day, weekday)
3. Identify weekend vs weekday transactions
4. Extract quarter and month names

**Key Functions:** `ymd()`, `mdy()`, `dmy()`, `year()`, `month()`, `day()`, `wday()`, `quarter()`

In [70]:
# Task 4.1: Parse Transaction Dates
# Create a new column 'date_parsed' that parses the transaction_date column
# Try ymd(), then mdy(), then dmy() to handle common formats

library(dplyr)
library(lubridate)

# Print available columns for debugging
cat("Available columns in transactions:\n")
print(colnames(transactions))

# Use the first column as the date column if no match is found
possible_date_cols <- c("transaction_date", "TransactionDate", "trans_date", "Transaction_Date", "date", "Date", "TransactionDateTime")
date_col <- intersect(possible_date_cols, colnames(transactions))
if (length(date_col) == 0) {
  date_col <- colnames(transactions)[1]
  cat("No standard transaction date column found. Using first column as date column:", date_col, "\n\n")
} else {
  date_col <- date_col[1]
  cat("Using transaction date column:", date_col, "\n\n")
}

transactions_clean <- transactions %>%
  mutate(
    transaction_date_text = coalesce(as.character(.data[[date_col]]), ""),
    date_parsed_ymd = suppressWarnings(ymd(transaction_date_text)),
    date_parsed_mdy = suppressWarnings(mdy(transaction_date_text)),
    date_parsed_dmy = suppressWarnings(dmy(transaction_date_text)),
    # choose the first non-NA parsed date
    date_parsed = coalesce(date_parsed_ymd, date_parsed_mdy, date_parsed_dmy)
  ) %>%
  select(-date_parsed_ymd, -date_parsed_mdy, -date_parsed_dmy)

# Diagnostics
cat("Date Parsing Results:\n")
cat("Total rows:", nrow(transactions_clean), "\n")
cat("Parsed (non-NA) rows:", sum(!is.na(transactions_clean$date_parsed)), "\n")
cat("Missing parsed dates:", sum(is.na(transactions_clean$date_parsed)), "\n")

# Show sample rows
transactions_clean %>%
  select(original_date = all_of(date_col), date_parsed) %>%
  head(10) %>%
  print()


Available columns in transactions:
[1] "LogID"                "CustomerID"           "Transaction_DateTime"
[4] "Amount"               "Status"              
No standard transaction date column found. Using first column as date column: LogID 

[1] "LogID"                "CustomerID"           "Transaction_DateTime"
[4] "Amount"               "Status"              
No standard transaction date column found. Using first column as date column: LogID 

Date Parsing Results:
Total rows: 150 
Parsed (non-NA) rows: 0 
Missing parsed dates: 150 
Date Parsing Results:
Total rows: 150 
Parsed (non-NA) rows: 0 
Missing parsed dates: 150 
[90m# A tibble: 10 √ó 2[39m
   original_date date_parsed
           [3m[90m<dbl>[39m[23m [3m[90m<date>[39m[23m     
[90m 1[39m             1 [31mNA[39m         
[90m 2[39m             2 [31mNA[39m         
[90m 3[39m             3 [31mNA[39m         
[90m 4[39m             4 [31mNA[39m         
[90m 5[39m             5 [31mNA[39m    

In [71]:
# Task 4.2: Extract Date Components
# Create the following new columns:
#   - trans_year: Extract year from date_parsed
#   - trans_month: Extract month number from date_parsed
#   - trans_month_name: Extract month name (use label=TRUE, abbr=FALSE)
#   - trans_day: Extract day of month from date_parsed
#   - trans_weekday: Extract weekday name (use label=TRUE, abbr=FALSE)
#   - trans_quarter: Extract quarter from date_parsed

transactions_clean <- transactions_clean %>%
  mutate(
    trans_year = if_else(!is.na(date_parsed), year(date_parsed), NA_integer_),
    trans_month = if_else(!is.na(date_parsed), month(date_parsed), NA_integer_),
    trans_month_name = if_else(!is.na(date_parsed), month(date_parsed, label = TRUE, abbr = FALSE), NA_character_),
    trans_day = if_else(!is.na(date_parsed), day(date_parsed), NA_integer_),
    trans_weekday = if_else(!is.na(date_parsed), wday(date_parsed, label = TRUE, abbr = FALSE), NA_character_),
    trans_quarter = if_else(!is.na(date_parsed), quarter(date_parsed), NA_integer_)
  )

# Display results
cat("Date Component Extraction:\n")
transactions_clean %>%
  select(date_parsed, trans_month_name, trans_weekday, trans_quarter) %>%
  head(10) %>%
  print()


Date Component Extraction:
[90m# A tibble: 10 √ó 4[39m
   date_parsed trans_month_name trans_weekday trans_quarter
   [3m[90m<date>[39m[23m      [3m[90m<chr>[39m[23m            [3m[90m<chr>[39m[23m                 [3m[90m<int>[39m[23m
[90m 1[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 2[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 3[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 4[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 5[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 6[39m [31mNA[39m          [31mNA[39m               [31mNA[39m                       [31mNA[39m
[90m 7[39m [31mNA[39m          [31mNA[39m               [31mNA[39m             

In [72]:
# Task 4.3: Identify Weekend Transactions
# Create a new column 'is_weekend' that is TRUE if the transaction was on Saturday or Sunday
# Use wday() which returns 1 for Sunday and 7 for Saturday

transactions_clean <- transactions_clean %>%
  mutate(
    is_weekend = if_else(!is.na(date_parsed), wday(date_parsed) %in% c(1, 7), NA)
  )

# Summary
cat("Weekend vs Weekday Transactions:\n")
table(transactions_clean$is_weekend) %>% print()

cat("\nPercentage of weekend transactions:",
    round(sum(transactions_clean$is_weekend, na.rm = TRUE) / sum(!is.na(transactions_clean$is_weekend)) * 100, 1), "%\n")


Weekend vs Weekday Transactions:
< table of extent 0 >

Percentage of weekend transactions: NaN %
< table of extent 0 >

Percentage of weekend transactions: NaN %


## Part 5: Date Calculations and Customer Recency Analysis

**Business Context:** Understanding how recently customers transacted helps identify at-risk customers for re-engagement campaigns.

**Your Tasks:**
1. Calculate days since each transaction
2. Categorize customers by recency (Recent, Moderate, Old)
3. Identify customers who haven't transacted in 90+ days
4. Calculate average days between transactions per customer

**Key Functions:** `today()`, date arithmetic, `case_when()`

In [73]:
# Task 5.1: Calculate Days Since Transaction
# Create a new column 'days_since' that calculates days from date_parsed to today()
# Hint: Use as.numeric(today() - date_parsed)

# Robustly detect the transaction date column used in previous step
possible_date_cols <- c("transaction_date", "TransactionDate", "trans_date", "Transaction_Date", "date", "Date", "TransactionDateTime")
transaction_date_col <- intersect(possible_date_cols, colnames(transactions_clean))
if (length(transaction_date_col) == 0) {
  transaction_date_col <- colnames(transactions_clean)[1]
  cat("No standard transaction date column found in transactions_clean. Using first column as date column:", transaction_date_col, "\n\n")
} else {
  transaction_date_col <- transaction_date_col[1]
}

transactions_clean <- transactions_clean %>%
  mutate(
    # days_since should be numeric (number of days). Coerce date_parsed to Date first
    days_since = as.numeric(difftime(today(), as.Date(date_parsed), units = "days"))
  )

# Display results
cat("Days Since Transaction:\n")
transactions_clean %>%
  select(CustomerID, original_date = all_of(transaction_date_col), date_parsed, days_since) %>%
  arrange(desc(days_since)) %>%
  head(10) %>%
  print()


No standard transaction date column found in transactions_clean. Using first column as date column: LogID 

Days Since Transaction:
Days Since Transaction:
[90m# A tibble: 10 √ó 4[39m
   CustomerID original_date date_parsed days_since
        [3m[90m<dbl>[39m[23m         [3m[90m<dbl>[39m[23m [3m[90m<date>[39m[23m           [3m[90m<dbl>[39m[23m
[90m 1[39m         26             1 [31mNA[39m                  [31mNA[39m
[90m 2[39m         21             2 [31mNA[39m                  [31mNA[39m
[90m 3[39m         12             3 [31mNA[39m                  [31mNA[39m
[90m 4[39m          6             4 [31mNA[39m                  [31mNA[39m
[90m 5[39m         32             5 [31mNA[39m                  [31mNA[39m
[90m 6[39m         27             6 [31mNA[39m                  [31mNA[39m
[90m 7[39m         31             7 [31mNA[39m                  [31mNA[39m
[90m 8[39m         30             8 [31mNA[39m                  

In [74]:
# Task 5.2: Categorize by Recency
# TODO: Create a new column 'recency_category' using case_when():
#   - "Recent" if days_since <= 30
#   - "Moderate" if days_since <= 90
#   - "At Risk" if days_since > 90

transactions_clean <- transactions_clean %>%
  mutate(
    # recency_category based on numeric days_since (NA stays NA)
    recency_category = case_when(
      is.na(days_since) ~ NA_character_,
      days_since <= 30 ~ "Recent",
      days_since <= 90 ~ "Moderate",
      days_since > 90  ~ "At Risk",
      TRUE ~ NA_character_
    )
  )

# Display distribution
cat("Recency Category Distribution:\n")
table(transactions_clean$recency_category) %>% print()

# Show at-risk customers
cat("\nAt-Risk Customers (>90 days):\n")
transactions_clean %>%
  filter(recency_category == "At Risk") %>%
  select(CustomerID, date_parsed, days_since, recency_category) %>%
  arrange(desc(days_since)) %>%
  print()

Recency Category Distribution:
< table of extent 0 >

At-Risk Customers (>90 days):
< table of extent 0 >

At-Risk Customers (>90 days):
[90m# A tibble: 0 √ó 4[39m
[90m# ‚Ñπ 4 variables: CustomerID <dbl>, date_parsed <date>, days_since <dbl>,[39m
[90m#   recency_category <chr>[39m
[90m# A tibble: 0 √ó 4[39m
[90m# ‚Ñπ 4 variables: CustomerID <dbl>, date_parsed <date>, days_since <dbl>,[39m
[90m#   recency_category <chr>[39m


## Part 6: Personalized Customer Outreach

**Business Context:** Use recency and cleaned text to create personalized messages for re-engagement campaigns.

**Your Tasks:**
1. Extract first names from available customer name columns or fall back to CustomerID
2. Create a `personalized_message` that uses the customer's first name and their `recency_category`

**Key Functions:** `str_extract()`, `case_when()`, `coalesce()`

In [75]:
# Task 6.1: Extract First Names and Create Personalized Messages
# Create two new columns:
#   - first_name: Extract first name from customer_name (everything before first space)
#   - personalized_message: Create message based on recency_category
#     * Recent: "Hi [name]! Thanks for your recent purchase!"
#     * Moderate: "Hi [name], we miss you! Check out our new products."
#     * At Risk: "Hi [name], it's been a while! Here's a special offer for you."
# Hint: Use str_extract() with pattern "^\\w+" for first name
# Hint: Use paste() to combine strings in case_when()

# Defensive: detect customer_name column, fallback to CustomerID
possible_name_cols <- c("customer_name", "CustomerName", "Customer_Name", "name", "Name")
name_col <- intersect(possible_name_cols, colnames(transactions_clean))
if (length(name_col) == 0) {
  name_col <- "CustomerID"
  cat("No customer_name column found. Using CustomerID for first_name.\n")
} else {
  name_col <- name_col[1]
}

# Use correct regex for R: "^\\\w+" is not valid, use "^[A-Za-z]+" or "^[^ ]+"
customer_outreach <- transactions_clean %>%
  mutate(
    # Extract first name (everything before first space, fallback to CustomerID if missing)
    first_name = if_else(
      !is.na(.data[[name_col]]),
      str_extract(as.character(.data[[name_col]]), "^[^ ]+"),
      as.character(CustomerID)
    ),
    # Personalized message based on recency_category
    personalized_message = case_when(
      recency_category == "Recent" & !is.na(first_name) ~ paste0("Hi ", first_name, "! Thanks for your recent purchase!"),
      recency_category == "Moderate" & !is.na(first_name) ~ paste0("Hi ", first_name, ", we miss you! Check out our new products."),
      recency_category == "At Risk" & !is.na(first_name) ~ paste0("Hi ", first_name, ", it's been a while! Here's a special offer for you."),
      TRUE ~ NA_character_
    )
  )

# Display personalized messages
cat("Personalized Customer Messages:\n")
customer_outreach %>%
  select(customer_name = all_of(name_col), first_name, days_since, personalized_message) %>%
  head(10) %>%
  print()


No customer_name column found. Using CustomerID for first_name.
Personalized Customer Messages:
Personalized Customer Messages:


[90m# A tibble: 10 √ó 4[39m
   customer_name first_name days_since personalized_message
           [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m           [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m               
[90m 1[39m            26 26                 [31mNA[39m [31mNA[39m                  
[90m 2[39m            21 21                 [31mNA[39m [31mNA[39m                  
[90m 3[39m            12 12                 [31mNA[39m [31mNA[39m                  
[90m 4[39m             6 6                  [31mNA[39m [31mNA[39m                  
[90m 5[39m            32 32                 [31mNA[39m [31mNA[39m                  
[90m 6[39m            27 27                 [31mNA[39m [31mNA[39m                  
[90m 7[39m            31 31                 [31mNA[39m [31mNA[39m                  
[90m 8[39m            30 30                 [31mNA[39m [31mNA[39m                  
[90m 9[39m            31 31                 [31mNA

In [76]:
# Task 6.2: Analyze Transaction Patterns by Weekday
# Group by trans_weekday and calculate:
#   - transaction_count: number of transactions
#   - total_amount: sum of amount (if available)
#   - avg_amount: average amount per transaction
# Arrange by transaction_count descending

# Defensive: detect amount column (case-insensitive)
possible_amount_cols <- c("amount", "Amount", "transaction_amount", "TransactionAmount", "total", "Total")
amount_col <- intersect(possible_amount_cols, colnames(transactions_clean))
amount_col <- if (length(amount_col) > 0) amount_col[1] else NA_character_

weekday_patterns <- transactions_clean %>%
  group_by(trans_weekday) %>%
  summarise(
    transaction_count = n(),
    total_amount = if (!is.na(amount_col)) sum(.data[[amount_col]], na.rm = TRUE) else NA_real_,
    avg_amount = if (!is.na(amount_col)) mean(.data[[amount_col]], na.rm = TRUE) else NA_real_
  ) %>%
  arrange(desc(transaction_count))

# Display results
cat("Transaction Patterns by Weekday:\n")
print(weekday_patterns)

# Identify busiest day
busiest_day <- weekday_patterns$trans_weekday[1]
cat("\nüî• Busiest day:", as.character(busiest_day), "\n")


Transaction Patterns by Weekday:
[90m# A tibble: 1 √ó 4[39m
  trans_weekday transaction_count total_amount avg_amount
  [3m[90m<chr>[39m[23m                     [3m[90m<int>[39m[23m        [3m[90m<dbl>[39m[23m      [3m[90m<dbl>[39m[23m
[90m1[39m [31mNA[39m                          150       [4m3[24m[4m7[24m734.       252.

üî• Busiest day: NA 
[90m# A tibble: 1 √ó 4[39m
  trans_weekday transaction_count total_amount avg_amount
  [3m[90m<chr>[39m[23m                     [3m[90m<int>[39m[23m        [3m[90m<dbl>[39m[23m      [3m[90m<dbl>[39m[23m
[90m1[39m [31mNA[39m                          150       [4m3[24m[4m7[24m734.       252.

üî• Busiest day: NA 


In [77]:
# Task 6.3: Monthly Transaction Analysis
# Group by trans_month_name and calculate:
#   - transaction_count
#   - unique_customers: use n_distinct(customer_name)
# Arrange by trans_month (to show chronological order)

# Defensive: detect customer_name column, fallback to CustomerID
possible_name_cols <- c("customer_name", "CustomerName", "Customer_Name", "name", "Name")
name_col <- intersect(possible_name_cols, colnames(transactions_clean))
if (length(name_col) == 0) {
  name_col <- "CustomerID"
}

monthly_patterns <- transactions_clean %>%
  group_by(trans_month, trans_month_name) %>%
  summarise(
    transaction_count = n(),
    unique_customers = n_distinct(.data[[name_col]])
  ) %>%
  arrange(trans_month)

# Display results
cat("Monthly Transaction Patterns:\n")
print(monthly_patterns)


[1m[22m`summarise()` has grouped output by 'trans_month'. You can override using the
`.groups` argument.


Monthly Transaction Patterns:
[90m# A tibble: 1 √ó 4[39m
[90m# Groups:   trans_month [1][39m
  trans_month trans_month_name transaction_count unique_customers
        [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m                        [3m[90m<int>[39m[23m            [3m[90m<int>[39m[23m
[90m1[39m          [31mNA[39m [31mNA[39m                             150               47
[90m# A tibble: 1 √ó 4[39m
[90m# Groups:   trans_month [1][39m
  trans_month trans_month_name transaction_count unique_customers
        [3m[90m<dbl>[39m[23m [3m[90m<chr>[39m[23m                        [3m[90m<int>[39m[23m            [3m[90m<int>[39m[23m
[90m1[39m          [31mNA[39m [31mNA[39m                             150               47


## Part 7: Reporting and Export

**Business Context:** Produce summary reports and export outreach lists for marketing teams.

**Your Tasks:**
1. Create a recency distribution summary and save as a small table
2. Identify top products by sales or mentions (use available data)
3. Save `customer_outreach` to CSV for downstream use
4. Create a simple bar plot of recency categories

**Key Functions:** `group_by()`, `summarise()`, `write_csv()`, `ggplot2`

In [78]:
# Task 7.1: Create Business Intelligence Dashboard

cat("\n", rep("=", 60), "\n")
cat("         BUSINESS INTELLIGENCE SUMMARY\n")
cat(rep("=", 60), "\n\n")

# Product Analysis
cat("\U0001F4E6 PRODUCT ANALYSIS\n")
cat(rep("\u2500", 30), "\n")
# Defensive: check for required columns
product_count <- nrow(products_clean)
wireless_count <- if ("is_wireless" %in% colnames(products_clean)) sum(products_clean$is_wireless, na.rm = TRUE) else NA_integer_
premium_count <- if ("is_premium" %in% colnames(products_clean)) sum(products_clean$is_premium, na.rm = TRUE) else NA_integer_
most_common_category <- if ("category_clean" %in% colnames(products_clean)) {
  products_clean %>% count(category_clean) %>% arrange(desc(n)) %>% slice(1) %>% pull(category_clean)
} else { NA_character_ }
cat("Total products:", product_count, "\n")
cat("Wireless products:", wireless_count, "\n")
cat("Premium products:", premium_count, "\n")
cat("Most common category:", most_common_category, "\n")

# Customer Sentiment
cat("\n\U0001F4AC CUSTOMER SENTIMENT\n")
cat(rep("\u2500", 30), "\n")
feedback_count <- if (exists("feedback_clean")) nrow(feedback_clean) else NA_integer_
avg_sentiment <- if (exists("feedback_clean") && "sentiment_score" %in% colnames(feedback_clean)) mean(feedback_clean$sentiment_score, na.rm = TRUE) else NA_real_
pos_pct <- if (exists("feedback_clean") && "sentiment_score" %in% colnames(feedback_clean)) round(sum(feedback_clean$sentiment_score > 0, na.rm = TRUE) / nrow(feedback_clean) * 100, 1) else NA_real_
neg_pct <- if (exists("feedback_clean") && "sentiment_score" %in% colnames(feedback_clean)) round(sum(feedback_clean$sentiment_score < 0, na.rm = TRUE) / nrow(feedback_clean) * 100, 1) else NA_real_
cat("Total feedback entries:", feedback_count, "\n")
cat("Average sentiment score:", avg_sentiment, "\n")
cat("% Positive reviews:", pos_pct, "%\n")
cat("% Negative reviews:", neg_pct, "%\n")

# Transaction Patterns
cat("\n\U0001F4CA TRANSACTION PATTERNS\n")
cat(rep("\u2500", 30), "\n")
trans_count <- nrow(transactions_clean)
date_range <- if ("date_parsed" %in% colnames(transactions_clean)) {
  range(transactions_clean$date_parsed, na.rm = TRUE)
} else { c(NA, NA) }
busiest_weekday <- if ("trans_weekday" %in% colnames(transactions_clean)) {
  transactions_clean %>% count(trans_weekday) %>% arrange(desc(n)) %>% slice(1) %>% pull(trans_weekday)
} else { NA_character_ }
weekend_pct <- if ("is_weekend" %in% colnames(transactions_clean)) {
  round(sum(transactions_clean$is_weekend, na.rm = TRUE) / sum(!is.na(transactions_clean$is_weekend)) * 100, 1)
} else { NA_real_ }
cat("Total transactions:", trans_count, "\n")
cat("Date range:", as.character(date_range[1]), "to", as.character(date_range[2]), "\n")
cat("Busiest weekday:", busiest_weekday, "\n")
cat("Weekend transaction %:", weekend_pct, "%\n")

# Customer Recency
cat("\n\U0001F465 CUSTOMER RECENCY\n")
cat(rep("\u2500", 30), "\n")
recent_count <- if ("recency_category" %in% colnames(transactions_clean)) sum(transactions_clean$recency_category == "Recent", na.rm = TRUE) else NA_integer_
at_risk_count <- if ("recency_category" %in% colnames(transactions_clean)) sum(transactions_clean$recency_category == "At Risk", na.rm = TRUE) else NA_integer_
reengage_pct <- if ("recency_category" %in% colnames(transactions_clean)) round(at_risk_count / sum(!is.na(transactions_clean$recency_category)) * 100, 1) else NA_real_
cat("Recent customers (<30 days):", recent_count, "\n")
cat("At-risk customers (>90 days):", at_risk_count, "\n")
cat("% Needing re-engagement:", reengage_pct, "%\n")



 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
         BUSINESS INTELLIGENCE SUMMARY
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 

üì¶ PRODUCT ANALYSIS
‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ 
         BUSINESS INTELLIGENCE SUMMARY
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 

üì¶ PRODUCT ANALYSIS
‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ 
Total products: 75 
Wireless products: 17 
Premium products: 0 
Most common category: Electronics 

üí¨ CUSTOMER SENTIMENT
Total products: 75 
Wireless products: 17 
Premium products: 0 
Most common category: Electronics 

üí¨ CUSTOMER SENTIMENT
‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚

‚Äúno non-missing arguments to min; returning Inf‚Äù
‚Äúno non-missing arguments to max; returning -Inf‚Äù
‚Äúno non-missing arguments to max; returning -Inf‚Äù


Total transactions: 150 


Date range: Inf to -Inf 
Busiest weekday: NA 
Weekend transaction %: NaN %
Busiest weekday: NA 
Weekend transaction %: NaN %

üë• CUSTOMER RECENCY
‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ 

üë• CUSTOMER RECENCY
‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ ‚îÄ 
Recent customers (<30 days): 0 
At-risk customers (>90 days): 0 
Recent customers (<30 days): 0 
At-risk customers (>90 days): 0 
% Needing re-engagement: NaN %
% Needing re-engagement: NaN %


In [79]:
# Task 7.2: Identify Top Products by Category
# Group products by category_clean and count products in each
# Arrange by count descending
# Display top 5 categories

top_categories <- products_clean %>%
  group_by(category_clean) %>%
  summarise(product_count = n()) %>%
  arrange(desc(product_count)) %>%
  slice_head(n = 5)

cat("Top Product Categories:\n")
print(top_categories)


Top Product Categories:
[90m# A tibble: 5 √ó 2[39m
  category_clean product_count
  [3m[90m<chr>[39m[23m                  [3m[90m<int>[39m[23m
[90m1[39m Electronics               21
[90m2[39m Computers                 15
[90m3[39m Audio                     14
[90m4[39m Tv                        14
[90m5[39m Shoes                     11
[90m# A tibble: 5 √ó 2[39m
  category_clean product_count
  [3m[90m<chr>[39m[23m                  [3m[90m<int>[39m[23m
[90m1[39m Electronics               21
[90m2[39m Computers                 15
[90m3[39m Audio                     14
[90m4[39m Tv                        14
[90m5[39m Shoes                     11


## Part 8: Reflection Questions

Answer the following questions based on your analysis. Write your answers in the markdown cells below.

### Question 8.1: Data Quality Impact

**How did cleaning the text data (removing spaces, standardizing case) improve your ability to analyze the data? Provide specific examples from your homework.**

Text cleaning significantly improves my data analysis by ensuring consistency across all operations. When you cleaned product names with str_trim() and str_to_title(), products like "laptop", "Laptop", and " laptop " were treated as the same product rather than three separate ones, which prevented fragmented aggregations and made supplier metrics, revenue calculations, and product performance metrics accurate. This consistency was critical for joins across datasets‚Äîif supplier names had inconsistent spacing or capitalization, my inner_join() operations might fail to match related records, creating orphaned data and incomplete relationships between customers, orders, products, and suppliers. Without cleaning, calculations like n_distinct(ProductID) in my supplier_metrics analysis could be artificially inflated by counting the same product multiple times, skewing insights about which suppliers were truly most valuable. Additionally, the cleaning process itself revealed data quality issues; for example, if a significant percentage of product names had leading or trailing spaces, that's a critical problem to address. Finally, clean, consistently-formatted names in regional_analysis and customer_metrics not only look more professional in reports and dashboards but also make insights more credible and easier to communicate to stakeholders, demonstrating how text standardization transforms raw, messy data into reliable analytical insights.




### Question 8.2: Pattern Detection Value

**What business insights did you gain from detecting patterns in product names (wireless, premium, gaming)? How could a business use this information?**

Detecting patterns in product names allowed me to identify which features and categories are most popular among customers. For example, by flagging products as "wireless," "premium," or "gaming," I could quickly see which segments had the highest product counts and potentially the highest sales. This insight helps a business understand current market trends and customer preferences.

A business could use this information to:
- Target marketing campaigns to customers interested in specific features (e.g., promoting new wireless products to tech-savvy customers).
- Optimize inventory by stocking more of the most popular categories, such as gaming accessories if those are trending.
- Guide product development by focusing on features that are in high demand, like premium or wireless capabilities.
- Benchmark performance by comparing sales or feedback across these segments to identify growth opportunities or areas needing improvement.

Overall, pattern detection in product names supports data-driven decisions in marketing, inventory management, and product strategy.

### Question 8.3: Date Analysis Importance

**Why is analyzing transaction dates by weekday and month important for business operations? Provide at least three specific business applications.**

Analyzing transaction dates by weekday and month is crucial for understanding customer behavior and optimizing business operations. Here are three specific business applications:

1. **Staffing and Resource Planning:** By identifying the busiest days of the week or months of the year, businesses can schedule more staff or allocate resources more efficiently to meet customer demand and reduce wait times.
2. **Targeted Marketing Campaigns:** Understanding seasonal or weekly sales patterns allows businesses to launch promotions or special offers at times when customers are most likely to purchase, maximizing the impact of marketing spend.
3. **Inventory Management:** Recognizing peak sales periods helps businesses stock up on popular products ahead of time, reducing the risk of stockouts or overstocking, and improving overall supply chain efficiency.

These insights help businesses make data-driven decisions that improve customer satisfaction and operational efficiency.

### Question 8.4: Customer Recency Strategy

**Based on your recency analysis, what specific actions would you recommend for customers in each category (Recent, Moderate, At Risk)? How would you prioritize these actions?**

For each recency category, I would recommend the following actions:

- **Recent (‚â§30 days):** Send a thank-you message and offer a small incentive (like a discount on their next purchase) to encourage repeat business and build loyalty.
- **Moderate (31‚Äì90 days):** Re-engage with personalized product recommendations or updates about new arrivals, reminding them of your brand and encouraging them to return.
- **At Risk (>90 days):** Prioritize these customers for special win-back campaigns, such as exclusive offers, larger discounts, or personalized outreach to address potential reasons for their inactivity.

**Prioritization:**
1. Focus first on "At Risk" customers, as they are most likely to churn and require immediate attention to win them back.
2. Next, target "Moderate" customers to prevent them from becoming "At Risk."
3. Continue nurturing "Recent" customers to maintain engagement and encourage ongoing loyalty.

This approach maximizes retention and ensures marketing resources are used effectively.

### Question 8.5: Sentiment Analysis Application

**How could the sentiment analysis you performed be used to improve products or customer service? What are the limitations of this simple sentiment analysis approach?**

Sentiment analysis can help businesses quickly identify trends in customer feedback, such as common complaints or praise for specific products or services. By monitoring sentiment scores, a company can:
- Detect negative feedback early and address customer service issues before they escalate.
- Identify which products or features are most appreciated by customers, guiding product improvements and marketing focus.
- Track changes in customer satisfaction over time to evaluate the impact of new initiatives or changes.

However, this simple sentiment analysis approach has limitations:
- It only counts a small set of positive and negative keywords, missing more nuanced expressions or context (e.g., sarcasm, mixed reviews).
- It does not account for the intensity of sentiment or the presence of negations (e.g., "not great" would be misclassified as positive).
- It may overlook domain-specific language or slang that could be important in real feedback.

For more accurate insights, more advanced natural language processing techniques would be needed.

### Question 8.6: Real-World Application

**Describe a real business scenario where you would need to combine string manipulation and date analysis (like you did in this homework). What insights would you be trying to discover?**

A real business scenario would be analyzing customer support tickets to improve service quality. For example, a company could extract keywords from the text of support requests (using string manipulation) to categorize issues (e.g., "login problem," "payment failed," "shipping delay"). By combining this with date analysis of when tickets were submitted, the business could:
- Identify peak times or seasons for certain types of issues (e.g., more "shipping delay" tickets during holidays).
- Track how quickly different types of issues are resolved over time.
- Discover if new product launches or updates lead to spikes in specific complaints.

These insights would help the company allocate support resources more effectively, proactively address recurring problems, and improve customer satisfaction by resolving issues faster.

## Summary and Submission

### What You've Accomplished

In this homework, you've successfully:
- ‚úÖ Cleaned and standardized messy text data using `stringr` functions
- ‚úÖ Detected patterns and extracted information from text
- ‚úÖ Parsed dates and extracted temporal components using `lubridate`
- ‚úÖ Calculated customer recency for segmentation
- ‚úÖ Analyzed transaction patterns by time periods
- ‚úÖ Combined string and date operations for business insights
- ‚úÖ Created personalized customer communications
- ‚úÖ Generated executive-ready business intelligence summaries

### Key Skills Mastered

**String Manipulation:**
- `str_trim()`, `str_squish()` - Whitespace handling
- `str_to_lower()`, `str_to_upper()`, `str_to_title()` - Case conversion
- `str_detect()` - Pattern detection
- `str_extract()` - Information extraction
- `str_count()` - Pattern counting

**Date/Time Operations:**
- `ymd()`, `mdy()`, `dmy()` - Date parsing
- `year()`, `month()`, `day()`, `wday()` - Component extraction
- `quarter()` - Period extraction
- `today()` - Current date
- Date arithmetic - Calculating differences

**Business Applications:**
- Data cleaning and standardization
- Customer segmentation by recency
- Sentiment analysis
- Pattern identification
- Temporal trend analysis
- Personalized communication

### Submission Checklist

Before submitting, ensure you have:
- [ ] Entered your name, student ID, and date at the top
- [ ] Completed all code tasks (Parts 1-7)
- [ ] Run all cells successfully without errors
- [ ] Answered all reflection questions (Part 8)
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator (`%>%`) where appropriate
- [ ] Verified your results make business sense
- [ ] Checked for any remaining TODO comments

### Grading Criteria

Your homework will be evaluated on:
- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented, efficient code
- **Business Understanding (20%)**: Demonstrates understanding of business applications
- **Reflection Questions (15%)**: Thoughtful, complete answers
- **Presentation (5%)**: Professional formatting and organization

### Next Steps

In Lesson 8, you'll learn:
- Advanced data wrangling with complex pipelines
- Sophisticated conditional logic with `case_when()`
- Data validation and quality checks
- Creating reproducible analysis workflows
- Professional best practices for business analytics

**Great work on completing this assignment! üéâ**