# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment

**Student Name:** Hale Lacquement

**Student ID:** hio884

**Date:** 10/19/2025

**Time Limit:** 4 hours

---

## Exam Overview

This comprehensive midterm exam assesses your mastery of ALL R data wrangling skills covered in Lessons 1-8:

- **Lesson 1:** R Basics and Data Import
- **Lesson 2:** Data Cleaning (Missing Values & Outliers)
- **Lesson 3:** Data Transformation Part 1 (select, filter, arrange)
- **Lesson 4:** Data Transformation Part 2 (mutate, summarize, group_by)
- **Lesson 5:** Data Reshaping (pivot_longer, pivot_wider)
- **Lesson 6:** Combining Datasets (joins)
- **Lesson 7:** String Manipulation & Date/Time
- **Lesson 8:** Advanced Wrangling & Best Practices

## Business Scenario

You are a data analyst for a retail company. The executive team needs a comprehensive analysis of:
- Sales performance across products and regions
- Customer behavior and segmentation
- Data quality issues and recommendations
- Strategic insights for business growth

## Instructions

1. **Set your working directory** to where your data files are located
2. Complete ALL tasks in order
3. Write code in the TODO sections
4. Use the pipe operator (%>%) to chain operations
5. Add comments explaining your logic
6. Run all cells to verify your code works
7. Answer all reflection questions

## Grading

- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented code
- **Business Understanding (20%)**: Demonstrates understanding of context
- **Analysis & Insights (15%)**: Meaningful insights and recommendations
- **Reflection Questions (5%)**: Thoughtful answers

## Academic Integrity

This is an individual exam. You may use:
- Course notes and lesson materials
- R documentation and help files
- Your previous homework assignments

You may NOT:
- Collaborate with other students
- Use AI assistants or online forums
- Share code or solutions

---

**Good luck! 🎓**

## Part 1: R Basics and Data Import (Lesson 1)

**Skills Assessed:** Variables, data types, data import, working directory

**Your Tasks:**
1. Set working directory
2. Load required packages
3. Import multiple datasets
4. Examine data structures

In [99]:
# Task 1.1: Set Working Directory
# TODO: Set your working directory to where your data files are located
# IMPORTANT: Students must set their own path!
# Example: setwd("/Users/yourname/GitHub/ai-homework-grader-clean/data")

# Your code here:
setwd("/workspaces/Fall2025-MS3083-Base_Template/data")

# Verify working directory
cat("Current working directory:", getwd(), "\n")

Current working directory: /workspaces/Fall2025-MS3083-Base_Template/data 


In [100]:
# Task 1.2: Load Required Packages
# TODO: Load tidyverse (includes dplyr, tidyr, stringr, ggplot2)
library(tidyverse)

# TODO: Load lubridate for date operations
library(lubridate)

cat("✅ Packages loaded successfully!\n")

✅ Packages loaded successfully!


In [101]:
# Task 1.3: Import Datasets
# TODO: Import the following CSV files using read_csv():
#   - company_sales_data.csv -> sales_data
#   - customers.csv -> customers
#   - products.csv -> products
#   - orders.csv -> orders
#   - order_items.csv -> order_items

# Your code here:
sales_data <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/sales_data.csv")

customers <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/customers.csv")

products <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/products.csv")

orders <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/orders.csv")

order_items <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/order_items.csv")


# Display import summary
cat("✅ Data imported successfully!\n")
cat("Sales data:", nrow(sales_data), "rows\n")
cat("Customers:", nrow(customers), "rows\n")
cat("Products:", nrow(products), "rows\n")
cat("Orders:", nrow(orders), "rows\n")
cat("Order items:", nrow(order_items), "rows\n")

Rows: 100 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Product, Region
dbl  (3): TransactionID, Amount, Quantity
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 100 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Name, Email, City
dbl  (1): CustomerID
date (1): R

✅ Data imported successfully!
Sales data: 100 rows
Customers: 100 rows
Products: 50 rows
Orders: 250 rows
Order items: 400 rows


## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

**Skills Assessed:** Identifying NAs, handling missing data, detecting outliers

**Your Tasks:**
1. Check for missing values in sales_data
2. Handle missing values appropriately
3. Identify outliers in Revenue column
4. Create a cleaned dataset

In [102]:
# Task 2.1: Check for Missing Values
# TODO: Create 'missing_summary' that shows count of NAs in each column of sales_data


missing_summary <- colSums(is.na(sales_data))


cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_summary)
cat("\nTotal missing values:", sum(missing_summary), "\n")

TransactionID          Date       Product        Amount      Quantity 
            0             0             0             0             0 
       Region 
            0 

Total missing values: 0 


In [103]:
# Task 2.2: Handle Missing Values
# TODO: Create 'sales_clean' by removing rows with ANY missing values

sales_clean <- na.omit(sales_data)

cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", nrow(sales_data), "\n")
cat("Cleaned rows:", nrow(sales_clean), "\n")
cat("Rows removed:", nrow(sales_data) - nrow(sales_clean), "\n")

Original rows: 100 
Cleaned rows: 100 
Rows removed: 0 


In [104]:
# Task 2.3: Detect Outliers in Revenue
# TODO: Calculate outlier thresholds using IQR method
#   - Calculate Q1 (25th percentile) and Q3 (75th percentile) of Revenue
#   - Calculate IQR = Q3 - Q1
#   - Lower bound = Q1 - 1.5 * IQR
#   - Upper bound = Q3 + 1.5 * IQR
# TODO: Create 'outlier_analysis' dataframe with these values
sales_clean$Revenue <- sales_clean$Amount * sales_clean$Quantity

Q1 <- quantile(sales_clean$Revenue, 0.25, na.rm = TRUE)
Q3 <- quantile(sales_clean$Revenue, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Count outliers
outlier_count <- sum(sales_clean$Revenue < lower_bound | sales_clean$Revenue > upper_bound)
cat("\nNumber of outliers detected:", outlier_count, "\n")

       Metric     Value
1          Q1  1042.102
2          Q3  3575.080
3         IQR  2532.977
4 Lower Bound -2757.364
5 Upper Bound  7374.546



Number of outliers detected: 1 
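
The IQR steps in Task 2.3 can be wrapped in a small reusable function. The sketch below is illustrative only: `flag_iqr_outliers` and the `revenue` vector are hypothetical, with made-up values rather than the exam data, and it assumes the default `quantile()` method.

```r
# Hypothetical helper: TRUE for values outside the k * IQR fences.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Missing values are never flagged as outliers.
  !is.na(x) & (x < lower | x > upper)
}

# Made-up revenue values for illustration only.
revenue <- c(100, 120, 130, 150, 160, 175, 190, 2000)
is_outlier <- flag_iqr_outliers(revenue)
revenue[is_outlier]  # 2000
```

Returning a logical vector rather than a count makes it possible to either count the outliers with `sum()` or filter them out, depending on what the analysis needs.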


## Part 3: Data Transformation Part 1 (Lesson 3)

**Skills Assessed:** select(), filter(), arrange(), pipe operator

**Your Tasks:**
1. Select specific columns
2. Filter data by conditions
3. Sort data
4. Chain operations with pipe

In [105]:
# Task 3.1: Select Specific Columns
# TODO: Create 'sales_summary' with only these columns from sales_clean:
#   Region, Product_Category, Revenue, Units_Sold, Sale_Date
sales_clean <- sales_clean %>%
  rename(
    Sale_Date = Date,
    Product_Category = Product,
    Units_Sold = Quantity
  ) %>%
  # Revenue is not a column in the raw file; derive it as Amount * Units_Sold
  mutate(Revenue = Amount * Units_Sold)

sales_summary <- sales_clean %>%
  # Your code here:
  select(Region, Product_Category, Revenue, Units_Sold, Sale_Date)

cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", names(sales_summary), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)



Columns: Region Product_Category Revenue Units_Sold Sale_Date 
Rows: 100 


Region,Product_Category,Revenue,Units_Sold,Sale_Date
<chr>,<chr>,<dbl>,<dbl>,<date>
South,Keyboard,1039.05,1,2023-01-27
East,Tablet,2987.75,5,2023-05-18
West,Keyboard,1539.48,1,2023-02-14
South,Mouse,261.66,1,2023-07-28
South,Monitor,1047.73,1,2023-05-17


In [106]:
# Task 3.2: Filter High Revenue Sales
# TODO: Create 'high_revenue_sales' by filtering sales_clean for Revenue > 20000


high_revenue_sales <- sales_clean %>%
  # Your code here:
  filter(Revenue > 20000)

cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", sum(high_revenue_sales$Revenue), "\n")

Total high revenue transactions: 0 


Total revenue from these sales: $ 0 


In [107]:
# Task 3.3: Sort by Revenue
# TODO: Create 'top_sales' by arranging sales_clean by Revenue in descending order
#       and keeping only the top 10 rows

top_sales <- sales_clean %>%
  # Your code here:
  # Sort by Revenue (highest first), then keep only the first 10 rows
  arrange(desc(Revenue)) %>%
  head(10)

cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% select(Region, Product_Category, Revenue, Units_Sold))



# A tibble: 10 × 4
   Region Product_Category Revenue Units_Sold
   <chr>  <chr>              <dbl>      <dbl>
 1 West   Monitor            7687           5
 2 East   Keyboard           7118.          4
 3 South  Mouse              6869.          4
 4 South  Monitor            6511.          4
 5 West   Mouse              6260.          5
 6 South  Monitor            5927.          3
 7 West   Tablet             5534.          4
 8 East   Monitor            5531.          3
 9 West   Keyboard           5526.          5
10 East   Keyboard           5312.          4


In [108]:
# Task 3.4: Chain Multiple Operations
# TODO: Create 'regional_top_sales' by:
#   1. Filtering for Revenue > 15000
#   2. Selecting: Region, Product_Category, Revenue
#   3. Arranging by Region (ascending) then Revenue (descending)
#   4. Keeping top 15 rows
# Use the pipe operator to chain all operations

regional_top_sales <- sales_clean %>%
  # Your code here:
  # No group_by() is needed: the filter applies to every row regardless of region
  filter(Revenue > 15000) %>%
  select(Region, Product_Category, Revenue) %>%
  arrange(Region, desc(Revenue)) %>%
  head(15)


cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)



# A tibble: 0 × 3
# ℹ 3 variables: Region <chr>, Product_Category <chr>, Revenue <dbl>


## Part 4: Data Transformation Part 2 (Lesson 4)

**Skills Assessed:** mutate(), summarize(), group_by()

**Your Tasks:**
1. Create calculated columns with mutate()
2. Calculate summary statistics
3. Perform grouped analysis
4. Generate business metrics

In [109]:
# Task 4.1: Create Calculated Columns
# TODO: Add these new columns to sales_clean using mutate():
#   - revenue_per_unit: Revenue / Units_Sold
#   - high_value: "Yes" if Revenue > 20000, else "No"
# Store result in 'sales_enhanced'

sales_enhanced <- sales_clean %>%
  mutate(
    revenue_per_unit = Revenue / Units_Sold,
    # if_else() covers the "else" case, including Revenue exactly equal to 20000
    high_value = if_else(Revenue > 20000, "Yes", "No")
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

New columns added: revenue_per_unit, high_value


Revenue,Units_Sold,revenue_per_unit,high_value
<dbl>,<dbl>,<dbl>,<chr>
1039.05,1,1039.05,No
2987.75,5,597.55,No
1539.48,1,1539.48,No
261.66,1,261.66,No
1047.73,1,1047.73,No
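
The rule for `high_value` ("Yes" if Revenue > 20000, else "No") has an "else" branch that should also cover a revenue of exactly 20000. The following is a minimal sketch of why that matters, using a made-up `toy` table (not the exam data): two strict comparisons leave the boundary value unlabeled, while a single test with an explicit fallback labels every non-missing row.

```r
library(dplyr)

# Made-up revenues, including one exactly on the 20000 threshold.
toy <- tibble(Revenue = c(5000, 20000, 25000))

toy_flagged <- toy %>%
  mutate(
    # Two strict comparisons: 20000 matches neither branch, so it becomes NA.
    strict_both = case_when(Revenue > 20000 ~ "Yes", Revenue < 20000 ~ "No"),
    # One test plus a fallback: every non-missing value gets a label.
    with_else = if_else(Revenue > 20000, "Yes", "No")
  )

toy_flagged
```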


In [110]:
# Task 4.2: Calculate Overall Summary Statistics
# TODO: Create 'overall_summary' with these metrics from sales_enhanced:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - total_units: sum of Units_Sold
#   - transaction_count: count using n()


overall_summary <- sales_enhanced %>%
  summarise(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    total_units = sum(Units_Sold),
    transaction_count = n()
  )


cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)

# A tibble: 1 × 4
  total_revenue avg_revenue total_units transaction_count
          <dbl>       <dbl>       <dbl>             <int>
1       246378.       2464.         269               100


In [111]:
# Task 4.3: Regional Performance Analysis
# TODO: Create 'regional_summary' by grouping sales_enhanced by Region
#       and calculating:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - transaction_count: count using n()
# Then arrange by total_revenue descending
# Hint: Use group_by() %>% summarize() %>% arrange()

regional_summary <- sales_enhanced %>%
  group_by(Region) %>%
  summarise(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    total_units = sum(Units_Sold),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("\n========== REGIONAL SUMMARY ==========\n")
print(regional_summary)


# A tibble: 4 × 5
  Region total_revenue avg_revenue total_units transaction_count
  <chr>          <dbl>       <dbl>       <dbl>             <int>
1 West          73632.       2454.          82                30
2 South         68778.       2751.          67                25
3 East          54814.       2383.          66                23
4 North         49153.       2234.          54                22


In [112]:
# Task 4.4: Product Category Analysis
# TODO: Create 'category_summary' by grouping by Product_Category
#       and calculating the same metrics as regional_summary
#       Then arrange by total_revenue descending

category_summary <- sales_enhanced %>%
  group_by(Product_Category) %>%
  summarise(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    total_units = sum(Units_Sold),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))


cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)



# A tibble: 5 × 5
  Product_Category total_revenue avg_revenue total_units transaction_count
  <chr>                    <dbl>       <dbl>       <dbl>             <int>
1 Monitor                 69283.       2665.          66                26
2 Mouse                   48687.       2117.          60                23
3 Tablet                  47007.       2612.          49                18
4 Keyboard                44432.       2614.          42                17
5 Laptop                  36968.       2311.          52                16


## Part 5: Data Reshaping with tidyr (Lesson 5)

**Skills Assessed:** pivot_longer(), pivot_wider(), tidy data principles

**Your Tasks:**
1. Reshape data from wide to long format
2. Reshape data from long to wide format
3. Create analysis-ready datasets

In [113]:
# Task 5.1: Create Wide Format Data
# First, create a summary by Region and Product_Category
region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))



# A tibble: 10 × 3
   Region Product_Category total_revenue
   <chr>  <chr>                    <dbl>
 1 East   Keyboard                20912.
 2 East   Laptop                   6750.
 3 East   Monitor                 18143.
 4 East   Mouse                    6022.
 5 East   Tablet                   2988.
 6 North  Keyboard                 5974.
 7 North  Laptop                   8397.
 8 North  Monitor                  5831.
 9 North  Mouse                   14447.
10 North  Tablet                  14505.


In [114]:
# Task 5.2: Reshape to Wide Format
# TODO: Create 'revenue_wide' by pivoting region_category_revenue
#       so that Product_Category values become column names
#       with total_revenue as the values


revenue_wide <- region_category_revenue %>%
  pivot_wider(
    names_from = Product_Category,
    values_from = total_revenue
  )

cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)



# A tibble: 4 × 6
  Region Keyboard Laptop Monitor  Mouse Tablet
  <chr>     <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 East     20912.  6750.  18143.  6022.  2988.
2 North     5974.  8397.   5831. 14447. 14505.
3 South     5158. 11018.  24848. 18113.  9641.
4 West     12390. 10805.  20461. 10104. 19873.


In [115]:
# Task 5.3: Reshape Back to Long Format
# TODO: Create 'revenue_long' by pivoting revenue_wide back to long format
#       Column names (except Region) should go into 'Product_Category'
#       Values should go into 'revenue'


revenue_long <- revenue_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = "Product_Category",
    values_to = "revenue"
  )

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))

# A tibble: 10 × 3
   Region Product_Category revenue
   <chr>  <chr>              <dbl>
 1 East   Keyboard          20912.
 2 East   Laptop             6750.
 3 East   Monitor           18143.
 4 East   Mouse              6022.
 5 East   Tablet             2988.
 6 North  Keyboard           5974.
 7 North  Laptop             8397.
 8 North  Monitor            5831.
 9 North  Mouse             14447.
10 North  Tablet            14505.


## Part 6: Combining Datasets with Joins (Lesson 6)

**Skills Assessed:** left_join(), inner_join(), data integration

**Your Tasks:**
1. Join customers with orders
2. Join orders with order_items
3. Create integrated dataset

In [116]:
# Task 6.1: Join Customers and Orders
# TODO: Create 'customer_orders' by left joining customers with orders
#       Join on CustomerID

customer_orders <- left_join(customers, orders, by = "CustomerID")

cat("========== CUSTOMER ORDERS ==========\n")
cat("Total rows:", nrow(customer_orders), "\n")
cat("Columns:", ncol(customer_orders), "\n")

Total rows: 200 
Columns: 8 


In [117]:
# Task 6.2: Join Orders and Order Items
# TODO: Create 'orders_with_items' by inner joining orders with order_items
#       Join on OrderID

orders_with_items <- inner_join(orders, order_items, by = "OrderID")

cat("========== ORDERS WITH ITEMS ==========\n")
cat("Total rows:", nrow(orders_with_items), "\n")
head(orders_with_items, 5)

Total rows: 400 


OrderID,CustomerID,Order_Date,Total_Amount,ProductID,Quantity,Unit_Price
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>
1,87,2023-08-30,424.3,2,3,115.72
1,87,2023-08-30,424.3,22,5,206.62
1,87,2023-08-30,424.3,26,5,61.75
3,37,2024-03-19,549.07,19,1,474.92
6,101,2023-07-22,189.85,32,4,272.64


## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

**Skills Assessed:** stringr functions, lubridate functions

**Your Tasks:**
1. Clean text data
2. Parse dates
3. Extract date components

In [118]:
# Task 7.1: Clean Text Data
# TODO: Add these columns to sales_enhanced using mutate():
#   - region_clean: Region with trimmed whitespace and Title Case
#   - category_clean: Product_Category with trimmed whitespace and Title Case

sales_enhanced <- sales_enhanced %>%
  mutate(
    # Your code here:
    region_clean = str_to_title(str_trim(Region)),
    category_clean = str_to_title(str_trim(Product_Category))
  )

cat("========== CLEANED TEXT DATA ==========\n")
head(sales_enhanced %>% select(Region, region_clean, Product_Category, category_clean), 5)



Region,region_clean,Product_Category,category_clean
<chr>,<chr>,<chr>,<chr>
South,South,Keyboard,Keyboard
East,East,Tablet,Tablet
West,West,Keyboard,Keyboard
South,South,Mouse,Mouse
South,South,Monitor,Monitor


In [119]:
# Task 7.2: Parse Dates and Extract Components
# TODO: Add these date-related columns using mutate():
#   - date_parsed: Parse Sale_Date column (use ymd(), mdy(), or dmy() as appropriate)
#   - sale_month: Extract month name from date_parsed
#   - sale_weekday: Extract weekday name from date_parsed


sales_enhanced <- sales_enhanced %>%
  mutate(
    date_parsed = ymd(Sale_Date),
    sale_month = month(date_parsed, label = TRUE, abbr = FALSE),
    sale_weekday = wday(date_parsed, label = TRUE, abbr = FALSE)
  )
cat("========== DATE COMPONENTS ==========\n")
head(sales_enhanced %>% select(Sale_Date, date_parsed, sale_month, sale_weekday), 5)



Sale_Date,date_parsed,sale_month,sale_weekday
<date>,<date>,<ord>,<ord>
2023-01-27,2023-01-27,January,Friday
2023-05-18,2023-05-18,May,Thursday
2023-02-14,2023-02-14,February,Tuesday
2023-07-28,2023-07-28,July,Friday
2023-05-17,2023-05-17,May,Wednesday


## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

**Skills Assessed:** case_when(), complex logic, KPIs

**Your Tasks:**
1. Create business categories with case_when()
2. Calculate KPIs
3. Generate executive summary

In [120]:
# Task 8.1: Create Performance Categories
# TODO: Add 'performance_tier' column using case_when():
#   - "High" if Revenue > 25000
#   - "Medium" if Revenue > 15000
#   - "Low" otherwise

sales_enhanced <- sales_enhanced %>%
  mutate(
    performance_tier = case_when(
      Revenue > 25000 ~ "High",
      Revenue > 15000 ~ "Medium",
      TRUE ~ "Low"
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
table(sales_enhanced$performance_tier)




Low 
100 
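
Every row lands in "Low" here because the largest Revenue in this dataset is 7687, which is below both thresholds. To show how the tiers separate when values do cross the thresholds, here is a small sketch on made-up revenues (the `toy` table is hypothetical). It relies on `case_when()` checking conditions in order and using the first one that matches, so the strictest threshold has to come first.

```r
library(dplyr)

# Made-up revenues chosen to land in each tier (not the exam data).
toy <- tibble(Revenue = c(30000, 18000, 4000))

tiered <- toy %>%
  mutate(
    performance_tier = case_when(
      Revenue > 25000 ~ "High",    # checked first
      Revenue > 15000 ~ "Medium",  # only reached if the first test fails
      TRUE ~ "Low"                 # fallback for everything else
    )
  )

tiered$performance_tier  # "High" "Medium" "Low"
```

If the first two conditions were swapped, 30000 would be labeled "Medium", since it also satisfies `Revenue > 15000`.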

In [121]:
# Task 8.2: Calculate Business KPIs
# TODO: Create 'business_kpis' with these metrics:
#   - total_revenue: sum of Revenue
#   - total_transactions: count of rows
#   - avg_transaction_value: mean of Revenue
#   - high_value_pct: percentage where high_value = "Yes"

business_kpis <- sales_enhanced %>%
  summarize(
    total_revenue = sum(Revenue),
    total_transactions = n(),
    avg_transaction_value = mean(Revenue),
    high_value_pct = mean(high_value == "Yes") * 100
  )

cat("========== BUSINESS KPIs ==========\n")
print(business_kpis)

# A tibble: 1 × 4
  total_revenue total_transactions avg_transaction_value high_value_pct
          <dbl>              <int>                 <dbl>          <dbl>
1       246378.                100                 2464.              0


## Part 9: Reflection Questions

Answer the following questions based on your analysis.

### Question 9.1: Data Cleaning Impact

**How did handling missing values and outliers affect your analysis? Why is data cleaning important before performing business analysis?**

Your answer here:
Data cleaning is very important before performing business analysis because missing values and outliers can distort the results. Missing data can throw off totals and averages, and a single extreme value can shift an average noticeably. In this dataset there were no missing values, so no rows were removed, and the IQR method flagged one outlier in Revenue. Cleaning the data makes sure the numbers reflect what's actually going on with sales, so we can make smart decisions based on real information.
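
To make the effect concrete, here is a small sketch with made-up numbers (the `sales` vector is hypothetical, not the exam data). It shows how one missing value and one extreme value each change a simple average.

```r
# Made-up daily sales with one missing value and one extreme value.
sales <- c(200, 220, 210, NA, 230, 5000)

mean(sales)                  # NA: a single missing value propagates
mean(sales, na.rm = TRUE)    # 1172: pulled far above the typical ~215
median(sales, na.rm = TRUE)  # 220: barely affected by the extreme value
```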

### Question 9.2: Grouped Analysis Value

**What insights did you gain from the regional and category summaries that you couldn't see in the raw data? How can businesses use this type of grouped analysis?**

Your answer here:
Looking at the regional and category summaries made it easier to see patterns that weren't obvious in the raw data. For example, we could quickly tell that West brought in the most revenue and North the least, and that Monitor was the top product category while Laptop was the lowest. This helps businesses see where their top clients are, which products aren't doing well, and which regions need more targeted marketing. As I discussed in previous homework, businesses can use this data to optimize their inventory, plan new marketing strategies, and even create promotions like BOGO for items that are often purchased together.

### Question 9.3: Data Reshaping Purpose

**Why would you need to reshape data between wide and long formats? Provide a business scenario where each format would be useful.**

Your answer here:
Reshaping data between wide and long formats is useful because each format is better for different tasks. Wide format is easier to read and makes it simple to compare values side by side. Long format is better for analysis and visualization, such as creating a chart. For example, a manager might use wide format to quickly review monthly sales by region, but use long format to calculate each month's percentage of total sales.
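
The manager scenario could look roughly like the sketch below. The `monthly_long` table and its values are made up for illustration, and the column name `pct_of_region` is an assumption, not something from the exam data.

```r
library(dplyr)
library(tidyr)

# Made-up monthly revenue per region (long format: one row per observation).
monthly_long <- tibble(
  Region  = c("East", "East", "West", "West"),
  Month   = c("Jan", "Feb", "Jan", "Feb"),
  revenue = c(100, 150, 200, 250)
)

# Wide: one row per region, months side by side for a quick review.
monthly_wide <- monthly_long %>%
  pivot_wider(names_from = Month, values_from = revenue)

# Long again: convenient for grouped calculations such as monthly share.
monthly_share <- monthly_wide %>%
  pivot_longer(cols = -Region, names_to = "Month", values_to = "revenue") %>%
  group_by(Region) %>%
  mutate(pct_of_region = revenue / sum(revenue) * 100) %>%
  ungroup()

monthly_wide
monthly_share
```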

### Question 9.4: Joining Datasets

**What is the difference between left_join() and inner_join()? When would you use each one in a business context?**

Your answer here:
A left_join() keeps all the rows from the first table and adds matching data from the second table. An inner_join() only keeps rows that have a match in both tables. You'd use a left join if you want to keep all customers even if some don't have transactions. You'd use an inner join if you only want to analyze customers who actually made purchases.
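
A small sketch of the difference, using made-up tables (`customers_toy` and `orders_toy` are hypothetical, not the exam files). Customer 3 has no orders and customer 1 has two, so the two joins return different row counts.

```r
library(dplyr)

# Made-up tables: customer 3 has no orders, customer 1 has two.
customers_toy <- tibble(CustomerID = c(1, 2, 3), Name = c("Ana", "Ben", "Cy"))
orders_toy <- tibble(OrderID = c(10, 11, 12), CustomerID = c(1, 1, 2))

kept_all <- left_join(customers_toy, orders_toy, by = "CustomerID")
matched_only <- inner_join(customers_toy, orders_toy, by = "CustomerID")

nrow(kept_all)      # 4: customer 3 stays, with NA for OrderID
nrow(matched_only)  # 3: customer 3 is dropped
```

Note that the left join returns more rows than `customers_toy` has, because a customer with several orders appears once per matching order.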

### Question 9.5: Skills Integration

**Which R data wrangling skill (from Lessons 1-8) do you think is most valuable for business analytics? Why?**

Your answer here:
I think grouping and summarizing data with group_by() and summarize() is the most valuable skill for business analytics. It lets you quickly see totals, averages, and percentages. In this exam we used these functions to build the regional and category summaries, as well as a revenue summary by Region and Product_Category, which helps businesses understand trends, performance, and where they need more help. Being able to summarize large sets of data into useful information is important for making decisions.
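
As a rough illustration of totals, averages, and percentages coming out of one grouped pipeline, here is a sketch on a made-up table (`toy_sales` and the `pct_of_total` column are assumptions for this example only).

```r
library(dplyr)

# Made-up transactions for two regions.
toy_sales <- tibble(
  Region  = c("East", "East", "West", "West", "West"),
  Revenue = c(100, 300, 200, 400, 600)
)

toy_summary <- toy_sales %>%
  group_by(Region) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n(),
    .groups = "drop"
  ) %>%
  # Share of overall revenue, computed after the groups are dropped.
  mutate(pct_of_total = total_revenue / sum(total_revenue) * 100) %>%
  arrange(desc(total_revenue))

toy_summary
```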

## Exam Complete!

### What You've Demonstrated

✅ **Lesson 1:** R basics and data import

✅ **Lesson 2:** Data cleaning (missing values & outliers)

✅ **Lesson 3:** Data transformation (select, filter, arrange)

✅ **Lesson 4:** Advanced transformation (mutate, summarize, group_by)

✅ **Lesson 5:** Data reshaping (pivot_longer, pivot_wider)

✅ **Lesson 6:** Combining datasets (joins)

✅ **Lesson 7:** String manipulation & date/time operations

✅ **Lesson 8:** Advanced wrangling & business intelligence

### Submission Checklist

Before submitting, ensure:
- [ ] All code cells run without errors
- [ ] All TODO sections completed
- [ ] All required dataframes created with correct names
- [ ] All 5 reflection questions answered
- [ ] Student name and ID filled in at top

**Good work! 🎉**