# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment - SOLUTION

**Student Name:** [SOLUTION KEY]

**Student ID:** [SOLUTION KEY]

**Date:** [SOLUTION KEY]

---

This solution demonstrates correct implementation of all tasks using only:
- tidyverse (dplyr, tidyr, stringr, readr)
- lubridate

---

## Part 1: R Basics and Data Import (Lesson 1)

In [None]:
# Task 1.1: Set Working Directory
# SOLUTION: Students should set their own path
# Example: setwd("/Users/username/GitHub/ai-homework-grader-clean/data")

# For this solution, set to the data directory
setwd("/Users/humphrjk/GitHub/ai-homework-grader-clean/data")

# Verify working directory
cat("Current working directory:", getwd(), "\n")

In [None]:
# Task 1.2: Load Required Packages
# SOLUTION:
library(tidyverse)

# Load lubridate for date operations
library(lubridate)

cat("✅ Packages loaded successfully!\n")

In [None]:
# Task 1.3: Import Datasets
# SOLUTION:
sales_data <- read_csv("company_sales_data.csv")

customers <- read_csv("customers.csv")

products <- read_csv("products.csv")

orders <- read_csv("orders.csv")

order_items <- read_csv("order_items.csv")

# Display import summary
cat("✅ Data imported successfully!\n")
cat("Sales data:", nrow(sales_data), "rows\n")
cat("Customers:", nrow(customers), "rows\n")
cat("Products:", nrow(products), "rows\n")
cat("Orders:", nrow(orders), "rows\n")
cat("Order items:", nrow(order_items), "rows\n")

## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

In [None]:
# Task 2.1: Check for Missing Values
# SOLUTION:
missing_summary <- colSums(is.na(sales_data))

cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_summary)
cat("\nTotal missing values:", sum(missing_summary), "\n")

In [None]:
# Task 2.2: Handle Missing Values
# SOLUTION:
sales_clean <- sales_data %>%
  drop_na()

cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", nrow(sales_data), "\n")
cat("Cleaned rows:", nrow(sales_clean), "\n")
cat("Rows removed:", nrow(sales_data) - nrow(sales_clean), "\n")

In [None]:
# Task 2.3: Detect Outliers in Revenue
# SOLUTION:
Q1 <- quantile(sales_clean$Revenue, 0.25)
Q3 <- quantile(sales_clean$Revenue, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Count outliers
outlier_count <- sum(sales_clean$Revenue < lower_bound | sales_clean$Revenue > upper_bound)
cat("\nNumber of outliers detected:", outlier_count, "\n")

## Part 3: Data Transformation Part 1 (Lesson 3)

In [None]:
# Task 3.1: Select Specific Columns
# SOLUTION:
sales_summary <- sales_clean %>%
  select(Region, Product_Category, Revenue, Units_Sold, Sale_Date)

cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", names(sales_summary), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)

In [None]:
# Task 3.2: Filter High Revenue Sales
# SOLUTION:
high_revenue_sales <- sales_clean %>%
  filter(Revenue > 20000)

cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", format(sum(high_revenue_sales$Revenue), big.mark = ","), "\n", sep = "")

In [None]:
# Task 3.3: Sort by Revenue
# SOLUTION:
top_sales <- sales_clean %>%
  arrange(desc(Revenue)) %>%
  head(10)

cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% select(Region, Product_Category, Revenue, Units_Sold))

In [None]:
# Task 3.4: Chain Multiple Operations
# SOLUTION:
regional_top_sales <- sales_clean %>%
  filter(Revenue > 15000) %>%
  select(Region, Product_Category, Revenue) %>%
  arrange(Region, desc(Revenue)) %>%
  head(15)

cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)

## Part 4: Data Transformation Part 2 (Lesson 4)

In [None]:
# Task 4.1: Create Calculated Columns
# SOLUTION:
sales_enhanced <- sales_clean %>%
  mutate(
    revenue_per_unit = Revenue / Units_Sold,
    high_value = ifelse(Revenue > 20000, "Yes", "No")
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

In [None]:
# Task 4.2: Calculate Overall Summary Statistics
# SOLUTION:
overall_summary <- sales_enhanced %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    total_units = sum(Units_Sold),
    transaction_count = n()
  )

cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)

In [None]:
# Task 4.3: Regional Performance Analysis
# SOLUTION:
regional_summary <- sales_enhanced %>%
  group_by(Region) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== REGIONAL SUMMARY ==========\n")
print(regional_summary)

In [None]:
# Task 4.4: Product Category Analysis
# SOLUTION:
category_summary <- sales_enhanced %>%
  group_by(Product_Category) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)

## Part 5: Data Reshaping with tidyr (Lesson 5)

In [None]:
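# Task 5.1: Create Wide Format Data
# SOLUTION: First aggregate revenue by region and category (long format);
# Task 5.2 reshapes this result to wide format.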
region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))

In [None]:
# Task 5.2: Reshape to Wide Format
# SOLUTION:
revenue_wide <- region_category_revenue %>%
  pivot_wider(names_from = Product_Category, values_from = total_revenue)

cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)

In [None]:
# Task 5.3: Reshape Back to Long Format
# SOLUTION:
revenue_long <- revenue_wide %>%
  pivot_longer(cols = -Region, names_to = "Product_Category", values_to = "revenue")

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))

## Part 6: Combining Datasets with Joins (Lesson 6)

In [None]:
# Task 6.1: Join Customers and Orders
# SOLUTION:
customer_orders <- customers %>%
  left_join(orders, by = "CustomerID")

cat("========== CUSTOMER ORDERS ==========\n")
cat("Total rows:", nrow(customer_orders), "\n")
cat("Columns:", ncol(customer_orders), "\n")

In [None]:
# Task 6.2: Join Orders and Order Items
# SOLUTION:
orders_with_items <- orders %>%
  inner_join(order_items, by = "OrderID")

cat("========== ORDERS WITH ITEMS ==========\n")
cat("Total rows:", nrow(orders_with_items), "\n")
head(orders_with_items, 5)

## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

In [None]:
# Task 7.1: Clean Text Data
# SOLUTION:
sales_enhanced <- sales_enhanced %>%
  mutate(
    region_clean = str_to_title(str_trim(Region)),
    category_clean = str_to_title(str_trim(Product_Category))
  )

cat("========== CLEANED TEXT DATA ==========\n")
head(sales_enhanced %>% select(Region, region_clean, Product_Category, category_clean), 5)

In [None]:
# Task 7.2: Parse Dates and Extract Components
# SOLUTION:
sales_enhanced <- sales_enhanced %>%
  mutate(
    date_parsed = ymd(Sale_Date),
    sale_month = month(date_parsed, label = TRUE, abbr = FALSE),
    sale_weekday = wday(date_parsed, label = TRUE, abbr = FALSE)
  )

cat("========== DATE COMPONENTS ==========\n")
head(sales_enhanced %>% select(Sale_Date, date_parsed, sale_month, sale_weekday), 5)

## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

In [None]:
# Task 8.1: Create Performance Categories
# SOLUTION:
sales_enhanced <- sales_enhanced %>%
  mutate(
    performance_tier = case_when(
      Revenue > 25000 ~ "High",
      Revenue > 15000 ~ "Medium",
      TRUE ~ "Low"
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
table(sales_enhanced$performance_tier)

In [None]:
# Task 8.2: Calculate Business KPIs
# SOLUTION:
business_kpis <- sales_enhanced %>%
  summarize(
    total_revenue = sum(Revenue),
    total_transactions = n(),
    avg_transaction_value = mean(Revenue),
    high_value_pct = sum(high_value == "Yes") / n() * 100
  )

cat("========== BUSINESS KPIs ==========\n")
print(business_kpis)

## Part 9: Reflection Questions

### Question 9.1: Data Cleaning Impact

**SOLUTION:**

Handling missing values and checking for outliers made the analysis more reliable. Removing incomplete rows with drop_na() meant every sum and mean was computed on complete records instead of returning NA. The trade-off is that dropped rows can bias results when values are not missing at random, which is why the number of rows removed is reported. Flagging outliers with the 1.5 × IQR rule showed which extreme Revenue values could skew averages and other statistics.

Data cleaning is crucial before business analysis because:
- Missing values can lead to incorrect calculations and misleading insights
- Outliers can distort summary statistics and hide true patterns
- Clean data ensures stakeholders can trust the analysis results
- It prevents poor business decisions based on flawed data
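
The effect of a missing value and of an extreme value on a summary can be seen in a small sketch. The toy_sales table and its numbers are hypothetical and are not taken from the exam data; only the column names follow the exam's conventions.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one missing Revenue and one extreme value
toy_sales <- tibble(
  Region  = c("North", "South", "East", "West", "North"),
  Revenue = c(100, 120, NA, 110, 5000)
)

# A single NA makes the naive mean undefined
naive_mean <- mean(toy_sales$Revenue)        # NA

# drop_na() removes the incomplete row so summaries are defined
toy_clean <- toy_sales %>% drop_na()

# The extreme value pulls the mean far above the median
clean_mean   <- mean(toy_clean$Revenue)      # 1332.5
clean_median <- median(toy_clean$Revenue)    # 115

cat("Rows removed:", nrow(toy_sales) - nrow(toy_clean), "\n")
cat("Mean:", clean_mean, " Median:", clean_median, "\n")
```

The gap between the mean and the median is a quick sign that an outlier is worth investigating before averages are reported.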

### Question 9.2: Grouped Analysis Value

**SOLUTION:**

The regional and category summaries revealed patterns invisible in raw data:
- Which regions generate the most revenue
- Which product categories generate the most revenue (the data contains revenue, not profit)
- Average transaction values by segment
- Transaction volume differences across regions

Businesses use grouped analysis to:
- Allocate resources to high-performing regions
- Identify underperforming segments needing attention
- Set region-specific sales targets
- Optimize inventory based on category performance
- Develop targeted marketing strategies
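
As a minimal sketch of why grouping reveals what a single overall figure hides, the example below compares an overall average with per-region summaries. The toy_sales values are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: two regions with very different transaction sizes
toy_sales <- tibble(
  Region  = c("North", "North", "South", "South", "South"),
  Revenue = c(100, 300, 50, 70, 60)
)

# One overall number hides the difference between regions
overall_avg <- mean(toy_sales$Revenue)       # 116

# Grouping shows each segment separately
by_region <- toy_sales %>%
  group_by(Region) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

print(by_region)
```

Here South has more transactions but North has the higher total and average revenue, a pattern the overall average of 116 does not show.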

### Question 9.3: Data Reshaping Purpose

**SOLUTION:**

Data reshaping serves different analytical purposes:

**Wide format** is useful for:
- Creating comparison tables (e.g., revenue by region across product categories)
- Excel-style reports for executives
- Correlation analysis between categories
- Dashboard displays

**Long format** is useful for:
- Statistical modeling and machine learning
- Creating visualizations with ggplot2
- Database storage (normalized form)
- Filtering and grouping operations

**Business scenario:** A sales manager might want wide format to quickly compare product performance across regions in a spreadsheet, but long format for creating trend charts in a dashboard.
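
The two formats hold the same information, which a short round trip can illustrate. The toy_long table below is hypothetical; the pivot_wider() and pivot_longer() calls mirror the ones used in Part 5.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per region-category pair
toy_long <- tibble(
  Region = c("North", "North", "South", "South"),
  Product_Category = c("Electronics", "Furniture", "Electronics", "Furniture"),
  total_revenue = c(400, 150, 180, 90)
)

# Wide format: one row per region, one column per category
toy_wide <- toy_long %>%
  pivot_wider(names_from = Product_Category, values_from = total_revenue)

# Back to long format: one row per region-category pair again
toy_back <- toy_wide %>%
  pivot_longer(cols = -Region, names_to = "Product_Category", values_to = "total_revenue")

print(toy_wide)
print(toy_back)
```

The wide table has 2 rows and 3 columns, and pivoting it back gives the original 4 rows, so choosing a format is a matter of presentation and tooling rather than content.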

### Question 9.4: Joining Datasets

**SOLUTION:**

**left_join()** keeps all rows from the left table and adds the matching columns from the right table; left-table rows with no match get NA in those columns:
- Use when you want to preserve all records from your primary dataset
- Example: Keep all customers even if they haven't placed orders yet
- Business use: Customer analysis where you need all customers, including inactive ones

**inner_join()** keeps only rows that match in both tables:
- Use when you only want records that exist in both datasets
- Example: Only customers who have placed orders
- Business use: Active customer analysis, calculating actual purchase behavior

**Business context:** Use left_join for customer retention analysis (including non-buyers), but inner_join for purchase pattern analysis (only actual buyers).
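
The difference in row counts between the two joins is easiest to see on a small example. The toy_customers and toy_orders tables are hypothetical; the CustomerID key matches the one used in Task 6.1.

```r
library(dplyr)

# Hypothetical toy tables: customer 2 has no orders, customer 1 has two
toy_customers <- tibble(
  CustomerID = c(1, 2, 3),
  Name = c("Ana", "Ben", "Cy")
)
toy_orders <- tibble(
  OrderID = c(10, 11, 12),
  CustomerID = c(1, 1, 3)
)

# left_join(): every customer is kept; customer 2 gets NA for OrderID
all_customers <- toy_customers %>%
  left_join(toy_orders, by = "CustomerID")

# inner_join(): only customers with at least one order remain
buyers_only <- toy_customers %>%
  inner_join(toy_orders, by = "CustomerID")

# Customers without orders can be read off the left join result
inactive <- all_customers %>%
  filter(is.na(OrderID))

cat("left_join rows:", nrow(all_customers), "\n")   # 4
cat("inner_join rows:", nrow(buyers_only), "\n")    # 3
```

Note that the left join returns 4 rows for 3 customers because customer 1 matches two orders, so row counts after a join reflect matches, not just the size of the left table.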

### Question 9.5: Skills Integration

**SOLUTION:**

The most valuable skill for business analytics is **group_by() with summarize()** (Lesson 4) because:

1. **Reveals patterns:** Aggregating data by categories uncovers insights hidden in raw data
2. **Supports decisions:** Executives need summarized metrics, not individual transactions
3. **Enables comparisons:** Comparing performance across segments drives strategy
4. **Scales well:** Works with small and large datasets efficiently
5. **Foundation for KPIs:** Most business metrics require aggregation

While all skills are important, grouped aggregation transforms raw data into actionable business intelligence that stakeholders can actually use to make decisions.

## Exam Complete!

### Solution Summary

This solution demonstrates:
- ✅ Proper use of tidyverse and lubridate functions, with base R used only for simple helpers (quantile(), colSums(), table())
- ✅ Correct implementation of all data wrangling tasks
- ✅ Clean, readable code with pipe operators
- ✅ Thoughtful reflection answers showing business understanding
- ✅ No external packages beyond tidyverse and lubridate

**All tasks completed successfully! 🎉**