# Homework Assignment - Lesson 7: String Manipulation and Date/Time Data

**Student Name:** [Deon Schoeman]

**Student ID:** [MCH616]

**Date Submitted:** [10/8/2025]

**Due Date:** [10/12/2025]

---

## Objective

Master string manipulation with `stringr` and date/time operations with `lubridate` for real-world business data cleaning and analysis.

## Learning Goals

By completing this assignment, you will:
- Clean and standardize messy text data using `stringr` functions
- Parse and manipulate dates using `lubridate` functions
- Extract information from text and dates for business insights
- Combine string and date operations for customer segmentation
- Create business-ready reports from raw data

## Instructions

- Complete all tasks in this notebook
- Write your code in the designated TODO sections
- Use the pipe operator (`%>%`) wherever possible
- Add comments explaining your logic
- Run all cells to verify your code works
- Answer all reflection questions

## Datasets

You will work with three CSV files:
- `customer_feedback.csv` - Customer reviews with messy text
- `transaction_log.csv` - Transaction records with dates
- `product_catalog.csv` - Product descriptions needing standardization

---

## Part 1: Data Import and Initial Exploration

**Business Context:** Before cleaning data, you must understand its structure and quality issues.

**Your Tasks:**
1. Load required packages (`tidyverse` and `lubridate`)
2. Import all three CSV files from the `data/` directory
3. Examine the structure and identify data quality issues
4. Display sample rows to understand the data

In [5]:
# Task 1.1: Load Required Packages
# TODO: Load tidyverse (includes stringr)
library(tidyverse)

# TODO: Load lubridate
library(lubridate)

# Point the session at the data/ directory so the CSV files can be read by name
setwd("/workspaces/Assignment-3-Data-Transformation-with-dplyr---Part-1/data/")

cat("✅ Packages loaded successfully!\n")

✅ Packages loaded successfully!


In [6]:
# Task 1.2: Import Datasets
# TODO: Import customer_feedback.csv into a variable called 'feedback'
feedback <- read.csv("customer_feedback.csv")

# TODO: Import transaction_log.csv into a variable called 'transactions'
transactions <- read.csv("transaction_log.csv")

# TODO: Import product_catalog.csv into a variable called 'products'
products <- read.csv("product_catalog.csv")

cat("✅ Data imported successfully!\n")
cat("Feedback rows:", nrow(feedback), "\n")
cat("Transaction rows:", nrow(transactions), "\n")
cat("Product rows:", nrow(products), "\n")

✅ Data imported successfully!
Feedback rows: 100 
Transaction rows: 150 
Product rows: 75 


In [7]:
# Task 1.3: Initial Data Exploration

cat("=== CUSTOMER FEEDBACK DATA ===\n")
# TODO: Display structure of feedback using str()
str(feedback)


# TODO: Display first 5 rows of feedback
head(feedback, 5)

cat("\n=== TRANSACTION DATA ===\n")
# TODO: Display structure of transactions
str(transactions)

# TODO: Display first 5 rows of transactions
head(transactions, 5)

cat("\n=== PRODUCT CATALOG DATA ===\n")
# TODO: Display structure of products
str(products)

# TODO: Display first 5 rows of products
head(products, 5)


=== CUSTOMER FEEDBACK DATA ===
'data.frame':	100 obs. of  5 variables:
 $ FeedbackID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID   : int  12 40 34 1 47 13 13 37 49 23 ...
 $ Feedback_Text: chr  "Highly recommend this item" "   Excellent service   " "Poor quality control" "average product, nothing special" ...
 $ Contact_Info : chr  "bob.wilson@test.org" "555-123-4567" "jane_smith@company.com" "jane_smith@company.com" ...
 $ Feedback_Date: chr  "2024-02-23" "2024-01-21" "2023-09-02" "2023-08-21" ...


| | FeedbackID | CustomerID | Feedback_Text | Contact_Info | Feedback_Date |
|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 1 | 12 | Highly recommend this item | bob.wilson@test.org | 2024-02-23 |
| 2 | 2 | 40 | Excellent service | 555-123-4567 | 2024-01-21 |
| 3 | 3 | 34 | Poor quality control | jane_smith@company.com | 2023-09-02 |
| 4 | 4 | 1 | average product, nothing special | jane_smith@company.com | 2023-08-21 |
| 5 | 5 | 47 | AMAZING customer support!!! | 555-123-4567 | 2023-04-24 |



=== TRANSACTION DATA ===
'data.frame':	150 obs. of  5 variables:
 $ LogID               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID          : int  26 21 12 6 32 27 31 30 31 13 ...
 $ Transaction_DateTime: chr  "4/5/24 14:30" "3/15/24 14:30" "3/15/24 14:30" "3/20/24 9:15" ...
 $ Amount              : num  277 175 252 215 269 ...
 $ Status              : chr  "Pending" "Pending" "Pending" "Pending" ...


| | LogID | CustomerID | Transaction_DateTime | Amount | Status |
|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<chr>` | `<dbl>` | `<chr>` |
| 1 | 1 | 26 | 4/5/24 14:30 | 277.22 | Pending |
| 2 | 2 | 21 | 3/15/24 14:30 | 175.16 | Pending |
| 3 | 3 | 12 | 3/15/24 14:30 | 251.71 | Pending |
| 4 | 4 | 6 | 3/20/24 9:15 | 214.98 | Pending |
| 5 | 5 | 32 | 3/20/24 9:15 | 268.91 | Completed |



=== PRODUCT CATALOG DATA ===
'data.frame':	75 obs. of  5 variables:
 $ ProductID          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Product_Description: chr  "Apple iPhone 14 Pro - 128GB - Space Black" "samsung galaxy s23 ultra 256gb" "Apple iPhone 14 Pro - 128GB - Space Black" "Apple iPhone 14 Pro - 128GB - Space Black" ...
 $ Category           : chr  "TV" "TV" "Audio" "Shoes" ...
 $ Price              : num  964 1817 853 649 586 ...
 $ In_Stock           : chr  "Limited" "Yes" "Yes" "Yes" ...


| | ProductID | Product_Description | Category | Price | In_Stock |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` |
| 1 | 1 | Apple iPhone 14 Pro - 128GB - Space Black | TV | 963.53 | Limited |
| 2 | 2 | samsung galaxy s23 ultra 256gb | TV | 1817.44 | Yes |
| 3 | 3 | Apple iPhone 14 Pro - 128GB - Space Black | Audio | 852.79 | Yes |
| 4 | 4 | Apple iPhone 14 Pro - 128GB - Space Black | Shoes | 648.58 | Yes |
| 5 | 5 | samsung galaxy s23 ultra 256gb | Electronics | 586.35 | Limited |


## Part 2: String Cleaning and Standardization

**Business Context:** Product names and feedback text often have inconsistent formatting that prevents accurate analysis.

**Your Tasks:**
1. Clean product names (remove extra spaces, standardize case)
2. Standardize product categories
3. Clean customer feedback text
4. Extract customer names from feedback

**Key Functions:** `str_trim()`, `str_squish()`, `str_to_lower()`, `str_to_upper()`, `str_to_title()`
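
The difference between `str_trim()` and `str_squish()` matters for the tasks in this part: `str_trim()` only strips the ends of a string, while `str_squish()` also collapses repeated internal spaces. A minimal sketch on a made-up string (the value of `messy` is an illustrative assumption, not a row from the dataset):

```r
library(stringr)

# Made-up example string with padding and doubled internal spaces
messy <- "  hp   envy  printer "

trimmed  <- str_trim(messy)         # strips leading/trailing whitespace only
squished <- str_squish(messy)       # also collapses internal runs to one space
titled   <- str_to_title(squished)  # capitalizes the first letter of each word

print(c(trimmed, squished, titled))
# "hp   envy  printer" "hp envy printer" "Hp Envy Printer"
```

If the source text may contain doubled internal spaces, `str_squish()` is usually the safer choice for grouping keys.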

In [15]:
# Task 2.1: Clean Product Names
# TODO: Create a new column 'product_name_clean' that:
#   - Removes leading/trailing whitespace using str_trim()
#   - Converts to Title Case using str_to_title()

products_clean <- products %>%
  mutate(
    product_name = Product_Description,  # keep the original text for comparison
    product_name_clean = Product_Description %>%
      str_trim() %>%
      str_to_title()
  )

# Display before and after
cat("Product Name Cleaning Results:\n")
products_clean %>%
  select(product_name, product_name_clean) %>%
  head(10) %>%
  print()

Product Name Cleaning Results:
                                product_name
1  Apple iPhone 14 Pro - 128GB - Space Black
2             samsung galaxy s23 ultra 256gb
3  Apple iPhone 14 Pro - 128GB - Space Black
4  Apple iPhone 14 Pro - 128GB - Space Black
5             samsung galaxy s23 ultra 256gb
6  Apple iPhone 14 Pro - 128GB - Space Black
7   DELL XPS 13 Laptop - Intel i7 - 16GB RAM
8         hp envy printer - wireless - color
9   Nike Air Max 270 - Size 10 - Black/White
10         LG 55" 4K Smart TV - OLED Display
                          product_name_clean
1  Apple Iphone 14 Pro - 128gb - Space Black
2             Samsung Galaxy S23 Ultra 256gb
3  Apple Iphone 14 Pro - 128gb - Space Black
4  Apple Iphone 14 Pro - 128gb - Space Black
5             Samsung Galaxy S23 Ultra 256gb
6  Apple Iphone 14 Pro - 128gb - Space Black
7   Dell Xps 13 Laptop - Intel I7 - 16gb Ram
8         Hp Envy Printer - Wireless - Color
9   Nike Air Max 270 - Size 10 - Black/White
10         Lg 55" 4k Sma

In [20]:
# Task 2.2: Standardize Product Categories
# TODO: Create a new column 'category_clean' that:
#   - Converts category to Title Case
#   - Removes any extra whitespace

products_clean <- products_clean %>%
  mutate(
    category_clean = Category %>%
      str_trim() %>%
      str_to_title()
  )

# Show unique categories before and after
cat("Original categories:\n")
print(unique(products$Category))

cat("\nCleaned categories:\n")
print(unique(products_clean$category_clean))

Original categories:
[1] "TV"          "Audio"       "Shoes"       "Electronics" "Computers"  

Cleaned categories:


[1] "Tv"          "Audio"       "Shoes"       "Electronics" "Computers"  
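
Title-casing turns the acronym `TV` into `Tv`, as the cleaned output shows. One possible workaround is sketched here, assuming a short list of known acronyms is maintained by hand (the `acronyms` vector and the sample values are hypothetical):

```r
library(stringr)

categories <- c(" TV ", "audio", "SHOES")  # made-up sample values
acronyms   <- c("TV")                      # hypothetical list of values to keep upper-case

trimmed <- str_squish(categories)

category_clean <- ifelse(
  str_to_upper(trimmed) %in% acronyms,
  str_to_upper(trimmed),  # keep known acronyms fully capitalized
  str_to_title(trimmed)   # title-case everything else
)

print(category_clean)
# "TV" "Audio" "Shoes"
```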


In [22]:
# Task 2.3: Clean Customer Feedback Text
# TODO: Create a new column 'feedback_clean' that:
#   - Converts text to lowercase using str_to_lower()
#   - Removes extra whitespace using str_squish()

feedback_clean <- feedback %>%
  mutate(
    feedback_text = Feedback_Text,  # keep the original text for comparison
    feedback_clean = feedback_text %>%
      str_to_lower() %>%
      str_squish()
  )

# Display sample
cat("Feedback Cleaning Sample:\n")
feedback_clean %>%
  select(feedback_text, feedback_clean) %>%
  head(5) %>%
  print()

Feedback Cleaning Sample:
                     feedback_text                   feedback_clean
1       Highly recommend this item       highly recommend this item
2             Excellent service                   excellent service
3             Poor quality control             poor quality control
4 average product, nothing special average product, nothing special
5      AMAZING customer support!!!      amazing customer support!!!


## Part 3: Pattern Detection and Extraction

**Business Context:** Identifying products with specific features and extracting specifications helps with inventory management and marketing.

**Your Tasks:**
1. Identify products with specific keywords (wireless, premium, gaming)
2. Extract numerical specifications from product names
3. Detect sentiment words in customer feedback
4. Extract email addresses from feedback

**Key Functions:** `str_detect()`, `str_extract()`, `str_count()`
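
The email-extraction task in the list above can be approached with `str_extract()`, since `Contact_Info` holds either an email address or a phone number. The sketch below uses three values seen in the Part 1 output; the pattern is a deliberately simplified assumption and does not cover every valid address:

```r
library(stringr)

# Sample values taken from the Contact_Info column shown in Part 1
contact_info <- c("bob.wilson@test.org", "555-123-4567", "jane_smith@company.com")

# Simplified email pattern (an assumption, not a full address validator)
email_pattern <- "[[:alnum:]._%+\\-]+@[[:alnum:].\\-]+\\.[[:alpha:]]{2,}"

email     <- str_extract(contact_info, email_pattern)  # NA where no email is found
has_email <- !is.na(email)

print(data.frame(contact_info, email, has_email))
```

Rows where `email` is `NA` are the phone-only contacts, which could be routed to a different outreach channel.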

In [23]:
# Task 3.1: Detect Product Features
# TODO: Create three new columns:
#   - is_wireless: TRUE if product name contains "wireless" (case-insensitive)
#   - is_premium: TRUE if product name contains "pro", "premium", or "deluxe"
#   - is_gaming: TRUE if product name contains "gaming" or "gamer"
# Hint: Use str_detect() with str_to_lower() for case-insensitive matching
# Hint: Use | (pipe) in regex for OR conditions

products_clean <- products_clean %>%
  mutate(
    is_wireless = str_detect(str_to_lower(product_name_clean), "wireless"),
    is_premium = str_detect(str_to_lower(product_name_clean), "pro|premium|deluxe"),
    is_gaming = str_detect(str_to_lower(product_name_clean), "gaming|gamer")
    
  )

# Display results
cat("Product Feature Detection:\n")
products_clean %>%
  select(product_name_clean, is_wireless, is_premium, is_gaming) %>%
  head(10) %>%
  print()

# Summary statistics
cat("\nFeature Summary:\n")
cat("Wireless products:", sum(products_clean$is_wireless), "\n")
cat("Premium products:", sum(products_clean$is_premium), "\n")
cat("Gaming products:", sum(products_clean$is_gaming), "\n")

Product Feature Detection:
                          product_name_clean is_wireless is_premium is_gaming
1  Apple Iphone 14 Pro - 128gb - Space Black       FALSE       TRUE     FALSE
2             Samsung Galaxy S23 Ultra 256gb       FALSE      FALSE     FALSE
3  Apple Iphone 14 Pro - 128gb - Space Black       FALSE       TRUE     FALSE
4  Apple Iphone 14 Pro - 128gb - Space Black       FALSE       TRUE     FALSE
5             Samsung Galaxy S23 Ultra 256gb       FALSE      FALSE     FALSE
6  Apple Iphone 14 Pro - 128gb - Space Black       FALSE       TRUE     FALSE
7   Dell Xps 13 Laptop - Intel I7 - 16gb Ram       FALSE      FALSE     FALSE
8         Hp Envy Printer - Wireless - Color        TRUE      FALSE     FALSE
9   Nike Air Max 270 - Size 10 - Black/White       FALSE      FALSE     FALSE
10         Lg 55" 4k Smart Tv - Oled Display       FALSE      FALSE     FALSE

Feature Summary:
Wireless products: 17 
Premium products: 13 
Gaming products: 0 
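
One caveat with the pattern `"pro|premium|deluxe"` is that it matches substrings, so a name containing "protective" or "processor" would also be flagged as premium. A hedged sketch of a stricter variant using word boundaries and `regex(ignore_case = TRUE)`; the product names here are made up for illustration:

```r
library(stringr)

# Made-up names chosen to show the difference
product_names <- c("Apple iPhone 14 Pro", "Protective Case", "Premium Headphones")

# Substring match: "protective" contains "pro", so it is flagged too
loose <- str_detect(str_to_lower(product_names), "pro|premium|deluxe")

# Whole-word match, case-insensitive without lowercasing first
strict <- str_detect(
  product_names,
  regex("\\b(pro|premium|deluxe)\\b", ignore_case = TRUE)
)

print(data.frame(product_names, loose, strict))
```

Here `loose` is `TRUE` for all three names, while `strict` is `FALSE` for "Protective Case".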


In [30]:
# Task 3.2: Extract Product Specifications
# TODO: Create a new column 'size_number' that extracts the first number from product_name
# Hint: Use str_extract() with pattern "\\d+" to match one or more digits

products_clean <- products_clean %>%
  mutate(
    size_number = str_extract(product_name_clean, "\\d+")
    
  )

# Display products with extracted sizes
cat("Extracted Product Specifications:\n")
products_clean %>%
  filter(!is.na(size_number)) %>%
  select(product_name_clean, size_number) %>%
  head(10) %>%
  print()

Extracted Product Specifications:


                          product_name_clean size_number
1  Apple Iphone 14 Pro - 128gb - Space Black          14
2             Samsung Galaxy S23 Ultra 256gb          23
3  Apple Iphone 14 Pro - 128gb - Space Black          14
4  Apple Iphone 14 Pro - 128gb - Space Black          14
5             Samsung Galaxy S23 Ultra 256gb          23
6  Apple Iphone 14 Pro - 128gb - Space Black          14
7   Dell Xps 13 Laptop - Intel I7 - 16gb Ram          13
8   Nike Air Max 270 - Size 10 - Black/White         270
9          Lg 55" 4k Smart Tv - Oled Display          55
10       Sony Wh-1000xm4 Wireless Headphones        1000
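
As the output shows, the first number in a name is often a model number (14, 23, 270) rather than a size or capacity. If storage capacity is the value of interest, a lookahead can anchor the digits to a unit. This is a sketch under the assumption that capacity is always written as digits followed by `gb`; `storage_gb` is a hypothetical column name:

```r
library(stringr)

# Names copied from the cleaned output above
product_names <- c(
  "Apple Iphone 14 Pro - 128gb - Space Black",
  "Samsung Galaxy S23 Ultra 256gb",
  "Nike Air Max 270 - Size 10 - Black/White"
)

# First run of digits, as in Task 3.2
first_number <- str_extract(product_names, "\\d+")

# Digits that are immediately followed by "gb" (case-insensitive)
storage_gb <- as.integer(
  str_extract(product_names, regex("\\d+(?=\\s*gb\\b)", ignore_case = TRUE))
)

print(data.frame(first_number, storage_gb))
# first_number: "14" "23" "270"; storage_gb: 128 256 NA
```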


In [32]:
# Task 3.3: Simple Sentiment Analysis
# TODO: Create three new columns:
#   - positive_words: count of positive words ("great", "excellent", "love", "amazing")
#   - negative_words: count of negative words ("bad", "terrible", "hate", "awful")
#   - sentiment_score: positive_words - negative_words
# Hint: Use str_count() to count pattern occurrences

feedback_clean <- feedback_clean %>%
  mutate(
    positive_words = str_count(feedback_clean, "great|excellent|love|amazing"),
    negative_words = str_count(feedback_clean, "bad|terrible|hate|awful"),
    sentiment_score = positive_words - negative_words
  )

# Display sentiment analysis results
cat("Sentiment Analysis Results:\n")
feedback_clean %>%
  select(feedback_clean, positive_words, negative_words, sentiment_score) %>%
  head(10) %>%
  print()

# Summary
cat("\nOverall Sentiment Summary:\n")
cat("Average sentiment score:", mean(feedback_clean$sentiment_score), "\n")
cat("Positive reviews:", sum(feedback_clean$sentiment_score > 0), "\n")
cat("Negative reviews:", sum(feedback_clean$sentiment_score < 0), "\n")

Sentiment Analysis Results:
                     feedback_clean positive_words negative_words
1        highly recommend this item              0              0
2                 excellent service              1              0
3              poor quality control              0              0
4  average product, nothing special              0              0
5       amazing customer support!!!              1              0
6       amazing customer support!!!              1              0
7  average product, nothing special              0              0
8              good value for money              0              0
9        highly recommend this item              0              0
10       highly recommend this item              0              0
   sentiment_score
1                0
2                1
3                0
4                0
5                1
6                1
7                0
8                0
9                0
10               0

Overall Sentiment Summary:


Average sentiment score: 0.18 
Positive reviews: 30 
Negative reviews: 20 
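
Two limits of this approach are visible in the sample: reviews such as "poor quality control" and "good value for money" score 0 because those words are not in the lists, and `str_count()` matches substrings rather than whole words. A small sketch of the second issue, using made-up review text:

```r
library(stringr)

# Made-up reviews chosen to trigger substring matches
reviews <- c("great gloves, love them", "bad badge design")

# Substring counting: "gloves" contains "love", "badge" contains "bad"
loose_positive <- str_count(reviews, "great|love")
loose_negative <- str_count(reviews, "bad|hate")

# Whole-word counting with word boundaries
strict_positive <- str_count(reviews, "\\b(great|love)\\b")
strict_negative <- str_count(reviews, "\\b(bad|hate)\\b")

print(data.frame(reviews, loose_positive, strict_positive,
                 loose_negative, strict_negative))
# loose_positive: 3 0; strict_positive: 2 0
# loose_negative: 0 2; strict_negative: 0 1
```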


## Part 4: Date Parsing and Component Extraction

**Business Context:** Transaction dates need to be parsed and analyzed to understand customer behavior patterns.

**Your Tasks:**
1. Parse transaction dates from text to Date objects
2. Extract date components (year, month, day, weekday)
3. Identify weekend vs weekday transactions
4. Extract quarter and month names

**Key Functions:** `ymd()`, `mdy()`, `dmy()`, `year()`, `month()`, `day()`, `wday()`, `quarter()`

In [None]:
# Task 4.1: Parse Transaction Dates
# TODO: Create a new column 'date_parsed' that parses the transaction_date column
# Hint: Check the format of transaction_date first, then use ymd(), mdy(), or dmy()

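transactions_clean <- transactions %>%
  mutate(
    transaction_date = Transaction_DateTime,  # keep the raw text for comparison
    # The column mixes several formats, so list every order that may occur
    date_parsed = parse_date_time(transaction_date, orders = c("mdy_HM", "dmy_HMS", "ymd_HMS"))
  )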

# Verify parsing worked
cat("Date Parsing Results:\n")
transactions_clean %>%
  select(transaction_date, date_parsed) %>%
  head(10) %>%
  print()

Date Parsing Results:
      transaction_date         date_parsed
1         4/5/24 14:30 2024-04-05 14:30:00
2        3/15/24 14:30 2024-03-15 14:30:00
3        3/15/24 14:30 2024-03-15 14:30:00
4         3/20/24 9:15 2024-03-20 09:15:00
5         3/20/24 9:15 2024-03-20 09:15:00
6         3/20/24 9:15 2024-03-20 09:15:00
7         3/20/24 9:15 2024-03-20 09:15:00
8        3/15/24 14:30 2024-03-15 14:30:00
9  25-03-2024 16:45:30 2024-03-25 16:45:30
10        4/5/24 14:30 2024-04-05 14:30:00
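
Row 9 shows that the column mixes at least two formats. An alternative to a single multi-order call is to apply each format-specific parser and fall back explicitly, which also makes it easy to count values that no format could parse. This is a sketch, not the assignment's required method; the `"not a date"` value and the variable names are illustrative assumptions:

```r
library(lubridate)

# Two formats seen in the data plus one deliberately invalid value
raw <- c("4/5/24 14:30", "25-03-2024 16:45:30", "not a date")

us_style <- mdy_hm(raw, quiet = TRUE)   # month/day/year hour:minute
eu_style <- dmy_hms(raw, quiet = TRUE)  # day-month-year hour:minute:second

# Start from the first parser and fill its gaps from the second
parsed <- us_style
parsed[is.na(parsed)] <- eu_style[is.na(parsed)]

n_failed <- sum(is.na(parsed))  # values that matched neither format
print(parsed)
cat("Unparsed values:", n_failed, "\n")
```

Checking `n_failed` after parsing is a cheap guard against silently dropping rows in later date calculations.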


In [72]:
# Task 4.2: Extract Date Components
# TODO: Create the following new columns:
#   - trans_year: Extract year from date_parsed
#   - trans_month: Extract month number from date_parsed
#   - trans_month_name: Extract month name (use label=TRUE, abbr=FALSE)
#   - trans_day: Extract day of month from date_parsed
#   - trans_weekday: Extract weekday name (use label=TRUE, abbr=FALSE)
#   - trans_quarter: Extract quarter from date_parsed

transactions_clean <- transactions_clean %>%
  mutate(
    trans_year = year(date_parsed),
    trans_month = month(date_parsed),
    trans_month_name = month(date_parsed, label=TRUE, abbr=FALSE),
    trans_day = day(date_parsed),
    trans_weekday = wday(date_parsed, label=TRUE, abbr=FALSE),
    trans_quarter = quarter(date_parsed)
  )

# Display results
cat("Date Component Extraction:\n")
transactions_clean %>%
  select(date_parsed, trans_month_name, trans_weekday, trans_quarter) %>%
  head(10) %>%
  print()

Date Component Extraction:
           date_parsed trans_month_name trans_weekday trans_quarter
1  2024-04-05 14:30:00            April        Friday             2
2  2024-04-05 14:30:00            April        Friday             2
3  2024-04-05 14:30:00            April        Friday             2
4  2024-04-05 14:30:00            April        Friday             2
5  2024-04-05 14:30:00            April        Friday             2
6  2024-04-05 14:30:00            April        Friday             2
7  2024-04-05 14:30:00            April        Friday             2
8  2024-04-05 14:30:00            April        Friday             2
9  2024-04-05 14:30:00            April        Friday             2
10 2024-04-05 14:30:00            April        Friday             2


In [56]:
# Task 4.3: Identify Weekend Transactions
# TODO: Create a new column 'is_weekend' that is TRUE if the transaction was on Saturday or Sunday
# Hint: Use wday() which returns 1 for Sunday and 7 for Saturday
# Hint: Use %in% c(1, 7) to check if day is weekend

transactions_clean <- transactions_clean %>%
  mutate(
    is_weekend = wday(date_parsed) %in% c(1, 7),
    is_weekday = wday(date_parsed) %in% c(2,3,4,5,6)
  )

# Summary
cat("Weekend vs Weekday Transactions:\n")
table(transactions_clean$is_weekend) %>% print()

cat("\nPercentage of weekend transactions:",
    round(sum(transactions_clean$is_weekend) / nrow(transactions_clean) * 100, 1), "%\n")

Weekend vs Weekday Transactions:

FALSE 
  150 

Percentage of weekend transactions: 0 %
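
The `c(1, 7)` check relies on `wday()` numbering Sunday as 1, which is the default but can be changed through the `week_start` argument or the `lubridate.week.start` option. A sketch that states the numbering explicitly, using three dates around the 2024-04-05 transactions (the `dates` vector is illustrative):

```r
library(lubridate)

dates <- ymd(c("2024-04-05", "2024-04-06", "2024-04-07"))  # Friday, Saturday, Sunday

# With week_start = 1, Monday is 1 and Sunday is 7
day_number <- wday(dates, week_start = 1)
is_weekend <- day_number >= 6

print(data.frame(dates, day_number, is_weekend))
# day_number: 5 6 7; is_weekend: FALSE TRUE TRUE
```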


## Part 5: Date Calculations and Customer Recency Analysis

**Business Context:** Understanding how recently customers transacted helps identify at-risk customers for re-engagement campaigns.

**Your Tasks:**
1. Calculate days since each transaction
2. Categorize customers by recency (Recent, Moderate, Old)
3. Identify customers who haven't transacted in 90+ days
4. Calculate average days between transactions per customer

**Key Functions:** `today()`, date arithmetic, `case_when()`

In [74]:
# Task 5.1: Calculate Days Since Transaction
# TODO: Create a new column 'days_since' that calculates days from date_parsed to today()
# Hint: Use as.numeric(today() - date_parsed)

transactions_clean <- transactions_clean %>%
  mutate(
    date_parsed_ymd = as.Date(date_parsed),
    days_since = as.numeric(today() - date_parsed_ymd)
  )

# Display results
cat("Days Since Transaction:\n")
transactions_clean %>%
  select(CustomerID, date_parsed_ymd, days_since) %>%
  arrange(desc(days_since)) %>%
  head(10) %>%
  print()

Days Since Transaction:
   CustomerID date_parsed_ymd days_since
1          21      2024-03-15        572
2          12      2024-03-15        572
3          30      2024-03-15        572
4          45      2024-03-15        572
5           2      2024-03-15        572
6          18      2024-03-15        572
7          34      2024-03-15        572
8          48      2024-03-15        572
9          28      2024-03-15        572
10         30      2024-03-15        572
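
Because `today()` changes each time the notebook runs, `days_since` depends on the run date. A minimal sketch that pins a fixed reference date instead; `reference_date` is a hypothetical name, and 2025-10-08 is used only because it is the submission date in the header:

```r
library(lubridate)

reference_date <- ymd("2025-10-08")  # hypothetical fixed "as of" date

# The earliest and latest transaction dates seen in the data
txn_dates <- ymd(c("2024-03-15", "2024-04-05"))

# Date minus Date gives a difftime in days; as.numeric() drops the unit
days_since <- as.numeric(reference_date - txn_dates)

print(days_since)
# 572 551
```

With this reference date the result for 2024-03-15 is 572, matching the output above.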


In [76]:
# Task 5.2: Categorize by Recency
# TODO: Create a new column 'recency_category' using case_when():
#   - "Recent" if days_since <= 30
#   - "Moderate" if days_since <= 90
#   - "At Risk" if days_since > 90

transactions_clean <- transactions_clean %>%
  mutate(
    recency_category = case_when(
      days_since <= 30 ~ "Recent",
      days_since <= 90 ~ "Moderate",
      days_since > 90 ~ "At Risk"
    )
  )

# Display distribution
cat("Recency Category Distribution:\n")
table(transactions_clean$recency_category) %>% print()

# Show at-risk customers
cat("\nAt-Risk Customers (>90 days):\n")
transactions_clean %>%
  filter(recency_category == "At Risk") %>%
  select(CustomerID, date_parsed, days_since) %>%
  arrange(desc(days_since)) %>%
  print()

Recency Category Distribution:

At Risk 
    150 

At-Risk Customers (>90 days):


    CustomerID         date_parsed days_since
1           21 2024-03-15 14:30:00        572
2           12 2024-03-15 14:30:00        572
3           30 2024-03-15 14:30:00        572
4           45 2024-03-15 14:30:00        572
5            2 2024-03-15 14:30:00        572
6           18 2024-03-15 14:30:00        572
7           34 2024-03-15 14:30:00        572
8           48 2024-03-15 14:30:00        572
9           28 2024-03-15 14:30:00        572
10          30 2024-03-15 14:30:00        572
11          33 2024-03-15 14:30:00        572
12          11 2024-03-15 14:30:00        572
13          36 2024-03-15 14:30:00        572
14           6 2024-03-15 14:30:00        572
15           8 2024-03-15 14:30:00        572
16          38 2024-03-15 14:30:00        572
17          49 2024-03-15 14:30:00        572
18          28 2024-03-15 14:30:00        572
19          44 2024-03-15 14:30:00        572
20          33 2024-03-15 14:30:00        572
21          29 2024-03-15 14:30:00
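
The table above classifies individual transactions, so a customer with several purchases appears more than once (for example, CustomerID 30, 28, and 33). For a re-engagement list it may be more useful to classify each customer by their most recent transaction. A sketch on made-up rows, assuming `days_since` has already been computed; `customer_recency` and `days_since_last` are hypothetical names:

```r
library(dplyr)

# Made-up rows: customer 1 has an old and a recent purchase
txns <- tibble(
  CustomerID = c(1, 1, 2),
  days_since = c(20, 200, 120)
)

customer_recency <- txns %>%
  group_by(CustomerID) %>%
  # The smallest days_since is the most recent transaction
  summarize(days_since_last = min(days_since), .groups = "drop") %>%
  mutate(
    recency_category = case_when(
      days_since_last <= 30 ~ "Recent",
      days_since_last <= 90 ~ "Moderate",
      TRUE ~ "At Risk"
    )
  )

print(customer_recency)
```

Customer 1 is classified as "Recent" here even though one of their transactions is 200 days old.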

## Part 6: Combined String and Date Operations

**Business Context:** Create personalized customer outreach messages based on purchase recency.

**Your Tasks:**
1. Extract first names from customer names
2. Create personalized messages based on recency
3. Analyze transaction patterns by weekday
4. Identify best customers (recent + high value)

**Key Functions:** Combine `str_extract()`, date calculations, `case_when()`, `group_by()`, `summarize()`

In [None]:
# Task 6.1: Extract First Names and Create Personalized Messages
# TODO: Create two new columns:
#   - first_name: Extract first name from customer_name (everything before first space)
#   - personalized_message: Create message based on recency_category
#     * Recent: "Hi [name]! Thanks for your recent purchase!"
#     * Moderate: "Hi [name], we miss you! Check out our new products."
#     * At Risk: "Hi [name], it's been a while! Here's a special offer for you."
# Hint: Use str_extract() with pattern "^\\w+" for first name
# Hint: Use paste() to combine strings in case_when()


# Student Note: The transactions data frame has no customer_name column, so I used CustomerID instead.
customer_outreach <- transactions_clean %>%
  mutate(
    first_name = str_extract(CustomerID, "^\\w+"),
    personalized_message = case_when(
      recency_category == "Recent" ~ paste("Hi", first_name, "! Thanks for your recent purchase!"),
      recency_category == "Moderate" ~ paste("Hi", first_name, ", we miss you! Check out our new products."),
      recency_category == "At Risk" ~ paste("Hi", first_name, ", it's been a while! Here's a special offer for you.")
    )
    
  )

# Display personalized messages
cat("Personalized Customer Messages:\n")
customer_outreach %>%
  select(CustomerID, first_name, days_since, personalized_message) %>%
  head(10) %>%
  print()

Personalized Customer Messages:
   CustomerID first_name days_since
1          26         26        551
2          13         13        551
3          44         44        551
4          22         22        551
5          13         13        551
6          35         35        551
7          30         30        551
8          48         48        551
9           4          4        551
10         44         44        551
                                         personalized_message
1  Hi 26 , it's been a while! Here's a special offer for you.
2  Hi 13 , it's been a while! Here's a special offer for you.
3  Hi 44 , it's been a while! Here's a special offer for you.
4  Hi 22 , it's been a while! Here's a special offer for you.
5  Hi 13 , it's been a while! Here's a special offer for you.
6  Hi 35 , it's been a while! Here's a special offer for you.
7  Hi 30 , it's been a while! Here's a special offer for you.
8  Hi 48 , it's been a while! Here's a special offer for you.
9   Hi 4 , it'
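
The stray space before the comma in the messages ("Hi 26 , it's...") comes from `paste()`, which separates its arguments with a space by default. A small sketch of the difference using `paste0()`; the name "Jane" is a placeholder, since the data has no customer names:

```r
first_name <- "Jane"  # placeholder value for illustration

# paste() inserts a space between every argument
spaced <- paste("Hi", first_name, ", we miss you!")

# paste0() adds no separator, so spacing is controlled explicitly
tight <- paste0("Hi ", first_name, ", we miss you!")

print(spaced)  # "Hi Jane , we miss you!"
print(tight)   # "Hi Jane, we miss you!"
```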

In [79]:
# Task 6.2: Analyze Transaction Patterns by Weekday
# TODO: Group by trans_weekday and calculate:
#   - transaction_count: number of transactions
#   - total_amount: sum of amount (if available)
#   - avg_amount: average amount per transaction
# TODO: Arrange by transaction_count descending

weekday_patterns <- transactions_clean %>%
  group_by(trans_weekday) %>%
  summarize(
    transaction_count = n(),
    total_amount = sum(Amount, na.rm=TRUE),
    avg_amount = mean(Amount, na.rm=TRUE)
  ) %>%
  arrange(desc(transaction_count))
  

# Display results
cat("Transaction Patterns by Weekday:\n")
print(weekday_patterns)

# Identify busiest day
busiest_day <- weekday_patterns$trans_weekday[1]
cat("\n🔥 Busiest day:", as.character(busiest_day), "\n")

Transaction Patterns by Weekday:
# A tibble: 3 × 4
  trans_weekday transaction_count total_amount avg_amount
  <ord>                     <int>        <dbl>      <dbl>
1 Monday                       61       15367.       252.
2 Friday                       55       14789.       269.
3 Wednesday                    34        7578.       223.

🔥 Busiest day: Monday 


In [None]:
# Task 6.3: Monthly Transaction Analysis
# TODO: Group by trans_month_name and calculate:
#   - transaction_count
#   - unique_customers: use n_distinct(customer_name)
# TODO: Arrange by trans_month (to show chronological order)

# Student Note: customer_name does not exist in the datasets, so I am using CustomerID.
monthly_patterns <- transactions_clean %>%
  group_by(trans_month_name) %>%
  summarize(
    transaction_count = n(),
    unique_customers = n_distinct(CustomerID)
  ) %>%
  arrange(trans_month_name)

# Display results
cat("Monthly Transaction Patterns:\n")
print(monthly_patterns)

Monthly Transaction Patterns:
# A tibble: 2 × 3
  trans_month_name transaction_count unique_customers
  <ord>                        <int>            <int>
1 March                           89               37
2 April                           61               34


## Part 7: Business Intelligence Summary

**Business Context:** Create an executive summary that combines all your analyses into actionable insights.

**Your Tasks:**
1. Calculate key metrics across all datasets
2. Identify top products and categories
3. Summarize customer sentiment
4. Provide data-driven recommendations

In [None]:
# Task 7.1: Create Business Intelligence Dashboard

cat("\n", rep("=", 60), "\n")
cat("         BUSINESS INTELLIGENCE SUMMARY\n")
cat(rep("=", 60), "\n\n")

# Product Analysis
cat("📦 PRODUCT ANALYSIS\n")
cat(rep("─", 30), "\n")
# TODO: Calculate and display:
#   - Total number of products
#   - Number of wireless products
#   - Number of premium products
#   - Most common category
product_analysis <- products_clean %>%
    summarize(
    total_products = n(),
    wireless_products = sum(is_wireless),
    premium_products = sum(is_premium),
    most_common_category = names(sort(table(category_clean), decreasing=TRUE))[1]
    )
print(product_analysis)

# Customer Sentiment
cat("\n💬 CUSTOMER SENTIMENT\n")
cat(rep("─", 30), "\n")
# TODO: Calculate and display:
#   - Total feedback entries
#   - Average sentiment score
#   - Percentage of positive reviews
#   - Percentage of negative reviews
sentiment_analysis <- feedback_clean %>%
    summarize(
    total_feedback = n(),
    avg_sentiment_score = mean(sentiment_score),
    positive_reviews = sum(sentiment_score > 0),
    negative_reviews = sum(sentiment_score < 0),
    pct_positive = round(positive_reviews / total_feedback * 100, 1),
    pct_negative = round(negative_reviews / total_feedback * 100, 1)
    )
print(sentiment_analysis)


# Transaction Patterns
cat("\n📊 TRANSACTION PATTERNS\n")
cat(rep("─", 30), "\n")
# TODO: Calculate and display:
#   - Total transactions
#   - Date range (earliest to latest)
#   - Busiest weekday
#   - Weekend transaction percentage
transaction_patterns <- transactions_clean %>%
    summarize(
    total_transactions = n(),
    date_range = paste(min(date_parsed_ymd, na.rm=TRUE), "to", max(date_parsed_ymd, na.rm=TRUE)),
    busiest_weekday = busiest_day,
    weekend_pct = round(sum(is_weekend) / total_transactions * 100, 1)
    )
print(transaction_patterns)


# Customer Recency
cat("\n👥 CUSTOMER RECENCY\n")
cat(rep("─", 30), "\n")
# TODO: Calculate and display:
#   - Number of recent customers (< 30 days)
#   - Number of at-risk customers (> 90 days)
#   - Percentage needing re-engagement
customer_recency <- transactions_clean %>%
    summarize(
    recent_customers = sum(recency_category == "Recent"),
    at_risk_customers = sum(recency_category == "At Risk"),
    pct_needing_reengagement = round(at_risk_customers / n() * 100, 1)
    )
print(customer_recency)



 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
         BUSINESS INTELLIGENCE SUMMARY
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 

📦 PRODUCT ANALYSIS
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 

  total_products wireless_products premium_products most_common_category
1             75                17               13          Electronics

💬 CUSTOMER SENTIMENT
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
  total_feedback avg_sentiment_score positive_reviews negative_reviews
1            100                0.18               30               20
  pct_positive pct_negative
1           30           20

📊 TRANSACTION PATTERNS
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
  total_transactions               date_range busiest_weekday weekend_pct
1                150 2024-03-15 to 2024-04-05          Monday           0

👥 CUSTOMER RECENCY
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ 
  recent_customers at_risk_customers pct_needing_reengagement
1       
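
The divider lines in this output are printed with a space between every character because `cat()` separates the elements of `rep("=", 60)` with its default `sep = " "`. If a solid rule is preferred, base R's `strrep()` builds a single string instead. A short sketch with a width of 10 for readability:

```r
# What cat(rep("=", 10)) effectively prints: ten elements joined by spaces
spaced <- paste(rep("=", 10), collapse = " ")

# strrep() repeats the character inside one string, so no separators appear
solid <- strrep("=", 10)

cat(spaced, "\n")  # = = = = = = = = = =
cat(solid, "\n")   # ==========
```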

In [None]:
# Task 7.2: Identify Top Products by Category
# TODO: Group products by category_clean and count products in each
# TODO: Arrange by count descending
# TODO: Display top 5 categories

top_categories <- products_clean %>%
  group_by(category_clean) %>%
  summarize(total_products = n()) %>%
  arrange(desc(total_products)) %>%
  head(5)

cat("Top Product Categories:\n")
print(top_categories)

Top Product Categories:


# A tibble: 5 × 2
  category_clean total_products
  <chr>                   <int>
1 Electronics                21
2 Computers                  15
3 Audio                      14
4 Tv                         14
5 Shoes                      11


## Part 8: Reflection Questions

Answer the following questions based on your analysis. Write your answers in the markdown cells below.

### Question 8.1: Data Quality Impact

**How did cleaning the text data (removing spaces, standardizing case) improve your ability to analyze the data? Provide specific examples from your homework.**

Your answer here: Cleaning the text data helped with analysis by standardizing the text format. Counting and grouping became much easier because I did not have to worry about different cases or extra spaces splitting one value into several. For example, after cleaning it was easy to count how many products were in each category; without it, I would have had to write much more code to account for every variation of the same category name. Likewise, converting product names to lower case made it easy to check whether a name contained keywords such as "wireless" or "premium".
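
The effect can be shown with a minimal sketch. The labels below are made up for illustration and are not the homework data, and the cleaning steps (`str_squish()` followed by `str_to_title()`) are an assumption about how the categories were standardized.

```r
library(dplyr)
library(stringr)

# Hypothetical messy labels, invented for illustration only
raw <- tibble(
  category = c("electronics", " Electronics ", "ELECTRONICS", "audio", "Audio  ")
)

# Without cleaning, every spelling counts as its own group
n_raw_groups <- n_distinct(raw$category)

# After removing extra whitespace and standardizing case, the variants collapse
cleaned <- raw %>%
  mutate(category_clean = str_to_title(str_squish(category))) %>%
  count(category_clean, name = "total_products")

print(n_raw_groups)
print(cleaned)
```

In this sketch the five raw spellings reduce to two clean categories, which is why the grouped counts only become meaningful after cleaning.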



### Question 8.2: Pattern Detection Value

**What business insights did you gain from detecting patterns in product names (wireless, premium, gaming)? How could a business use this information?**

Your answer here: No gaming products were found, so I would say the business should consider adding gaming products to its lineup. Gaming is a big category, and the business is missing out on a lot of potential sales. Being able to tell which products are wireless or premium can help show which types of products are more popular. The company can then stock more of those products or run promotions on them. It could also look into why premium products are not as popular and perhaps lower the price or run a sale on them.



### Question 8.3: Date Analysis Importance

**Why is analyzing transaction dates by weekday and month important for business operations? Provide at least three specific business applications.**

Your answer here:

1. Knowing which weekdays are busiest helps make sure there is enough inventory to handle the demand. If there isn't enough inventory, the business could lose sales and customers.
2. Analyzing by month can reveal seasonal trends. For example, if sales are higher in December, the business can plan for that by stocking more inventory and running promotions.
3. Analyzing by month can also help with budgeting and forecasting. The business can see which months are typically slower and plan accordingly by cutting costs or running special promotions to boost sales during those times.
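
As a rough sketch of the weekday part of this analysis, the code below uses four invented dates (not the homework's `transaction_log.csv`). It relies on `wday()` with `week_start = 1`, so that 1 means Monday and 7 means Sunday; the column names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical transaction dates, invented for illustration only
tx <- tibble(
  transaction_date = ymd(c("2024-03-15", "2024-03-16", "2024-03-18", "2024-03-25"))
)

by_weekday <- tx %>%
  mutate(
    day_num = wday(transaction_date, week_start = 1),  # 1 = Monday, 7 = Sunday
    is_weekend = day_num >= 6                          # Saturday or Sunday
  )

# Share of transactions that fall on a weekend
weekend_pct <- mean(by_weekday$is_weekend) * 100

# Weekday number with the most transactions
busiest_day <- by_weekday %>%
  count(day_num, sort = TRUE) %>%
  slice(1) %>%
  pull(day_num)

print(weekend_pct)
print(busiest_day)
```

Using numeric weekdays avoids locale-dependent day labels; `wday(..., label = TRUE)` could be used instead when readable names are wanted in a report.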


### Question 8.4: Customer Recency Strategy

**Based on your recency analysis, what specific actions would you recommend for customers in each category (Recent, Moderate, At Risk)? How would you prioritize these actions?**

Your answer here: Because the transaction dates were from long ago, every customer fell into the At Risk category, so my recommendations go beyond what the data showed. For Moderate customers, I would run a marketing campaign announcing new products or upcoming sales to bring them back. For At Risk customers, I would send a special offer or discount to get them to come back. For Recent customers, I would send a feedback survey to see how their experience was and how they are enjoying the products they purchased.
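
One possible way to get a more useful split with historical data is to measure recency from the latest date in the dataset instead of from `today()`. The sketch below is an assumption, not what the homework did: the data is invented, the thresholds (under 30 days and over 90 days) come from the task comments, and treating everything in between as Moderate is my own choice.

```r
library(dplyr)
library(lubridate)

# Hypothetical last-purchase dates, invented for illustration only
tx <- tibble(
  customer_id = c("C1", "C2", "C3"),
  last_purchase = ymd(c("2024-04-01", "2024-02-10", "2023-11-20"))
)

# Assumption: use the most recent date in the data as the reference point
reference_date <- max(tx$last_purchase)

tx_recency <- tx %>%
  mutate(
    days_since = as.numeric(reference_date - last_purchase),
    recency_category = case_when(
      days_since < 30 ~ "Recent",
      days_since > 90 ~ "At Risk",
      TRUE ~ "Moderate"
    )
  )

print(tx_recency)
```

With a reference date taken from the data, the three example customers land in three different categories rather than all being At Risk.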



### Question 8.5: Sentiment Analysis Application

**How could the sentiment analysis you performed be used to improve products or customer service? What are the limitations of this simple sentiment analysis approach?**

Your answer here: It would help identify common issues with products or customer service. The limitation of this approach is that it did not capture all of the negative or positive reviews, because it only looked for a few keywords. A more comprehensive keyword list would have identified more of the sentiment in the reviews.
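
The limitation can be seen in a small sketch of keyword-based scoring. The keyword lists and reviews here are invented for illustration, and the scoring rule (positive matches minus negative matches) is an assumption about the general approach rather than the exact homework code.

```r
library(stringr)

# Hypothetical keyword lists and reviews, invented for illustration only
positive_words <- c("great", "love", "excellent")
negative_words <- c("bad", "broken", "terrible")

reviews <- c(
  "Great product, love it",
  "Arrived broken and support was terrible",
  "Not great, stopped working after a week"
)

text <- str_to_lower(reviews)

# Score = number of positive keyword matches minus negative keyword matches
pos_hits <- str_count(text, str_c(positive_words, collapse = "|"))
neg_hits <- str_count(text, str_c(negative_words, collapse = "|"))
sentiment_score <- pos_hits - neg_hits

print(sentiment_score)
```

The third review is clearly a complaint, yet it scores as positive because "great" matches and neither the negation nor "stopped working" is in the keyword list. A longer list would catch more cases, but negation would still need separate handling.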



### Question 8.6: Real-World Application

**Describe a real business scenario where you would need to combine string manipulation and date analysis (like you did in this homework). What insights would you be trying to discover?**

Your answer here: I would look at seasonal items and when they are most popular. For example, are more gaming-related items sold during summer break, or more premium items around the holidays? Which items sell the most during back-to-school season? Being able to plan ahead for inventory and marketing campaigns can help the business be more successful. Also, depending on what inventory normally sells, suppliers might offer discounts for ordering in bulk. That could lead to higher profit margins, savings passed on to customers and therefore higher customer satisfaction, or sales on those items to increase volume and beat the local competition.
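
A scenario like this could be sketched by flagging products with `str_detect()` and grouping by the month taken from the sale date. Everything below is hypothetical: the `sales` table, its column names, and the product names are invented for illustration.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical sales records, invented for illustration only
sales <- tibble(
  product_name = c("Gaming Mouse", "Office Chair", "GAMING headset",
                   "Premium Laptop", "gaming keyboard"),
  sale_date = ymd(c("2024-06-15", "2024-06-20", "2024-07-02",
                    "2024-12-10", "2024-12-18"))
)

gaming_by_month <- sales %>%
  mutate(
    # String step: case-insensitive keyword flag
    is_gaming = str_detect(str_to_lower(product_name), "gaming"),
    # Date step: numeric month of the sale (1 to 12)
    sale_month = month(sale_date)
  ) %>%
  group_by(sale_month) %>%
  summarize(
    total_sales = n(),
    gaming_sales = sum(is_gaming),
    .groups = "drop"
  )

print(gaming_by_month)
```

Comparing `gaming_sales` with `total_sales` per month is one simple way to see whether a keyword-defined product group has a seasonal pattern.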



## Summary and Submission

### What You've Accomplished

In this homework, you've successfully:
- ✅ Cleaned and standardized messy text data using `stringr` functions
- ✅ Detected patterns and extracted information from text
- ✅ Parsed dates and extracted temporal components using `lubridate`
- ✅ Calculated customer recency for segmentation
- ✅ Analyzed transaction patterns by time periods
- ✅ Combined string and date operations for business insights
- ✅ Created personalized customer communications
- ✅ Generated executive-ready business intelligence summaries

### Key Skills Mastered

**String Manipulation:**
- `str_trim()`, `str_squish()` - Whitespace handling
- `str_to_lower()`, `str_to_upper()`, `str_to_title()` - Case conversion
- `str_detect()` - Pattern detection
- `str_extract()` - Information extraction
- `str_count()` - Pattern counting

**Date/Time Operations:**
- `ymd()`, `mdy()`, `dmy()` - Date parsing
- `year()`, `month()`, `day()`, `wday()` - Component extraction
- `quarter()` - Period extraction
- `today()` - Current date
- Date arithmetic - Calculating differences

**Business Applications:**
- Data cleaning and standardization
- Customer segmentation by recency
- Sentiment analysis
- Pattern identification
- Temporal trend analysis
- Personalized communication

### Submission Checklist

Before submitting, ensure you have:
- [ ] Entered your name, student ID, and date at the top
- [ ] Completed all code tasks (Parts 1-7)
- [ ] Run all cells successfully without errors
- [ ] Answered all reflection questions (Part 8)
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator (`%>%`) where appropriate
- [ ] Verified your results make business sense
- [ ] Checked for any remaining TODO comments

### Grading Criteria

Your homework will be evaluated on:
- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented, efficient code
- **Business Understanding (20%)**: Demonstrates understanding of business applications
- **Reflection Questions (15%)**: Thoughtful, complete answers
- **Presentation (5%)**: Professional formatting and organization

### Next Steps

In Lesson 8, you'll learn:
- Advanced data wrangling with complex pipelines
- Sophisticated conditional logic with `case_when()`
- Data validation and quality checks
- Creating reproducible analysis workflows
- Professional best practices for business analytics

**Great work on completing this assignment! 🎉**