# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment

**Student Name:** Gavin Lara

**Student ID:** 01985022

**Date:** 10/19/2025

**Time Limit:** 4 hours

---

## Exam Overview

This comprehensive midterm exam assesses your mastery of ALL R data wrangling skills covered in Lessons 1-8:

- **Lesson 1:** R Basics and Data Import
- **Lesson 2:** Data Cleaning (Missing Values & Outliers)
- **Lesson 3:** Data Transformation Part 1 (select, filter, arrange)
- **Lesson 4:** Data Transformation Part 2 (mutate, summarize, group_by)
- **Lesson 5:** Data Reshaping (pivot_longer, pivot_wider)
- **Lesson 6:** Combining Datasets (joins)
- **Lesson 7:** String Manipulation & Date/Time
- **Lesson 8:** Advanced Wrangling & Best Practices

## Business Scenario

You are a data analyst for a retail company. The executive team needs a comprehensive analysis of:
- Sales performance across products and regions
- Customer behavior and segmentation
- Data quality issues and recommendations
- Strategic insights for business growth

## Instructions

1. **Set your working directory** to where your data files are located
2. Complete ALL tasks in order
3. Write code in the TODO sections
4. Use the pipe operator (%>%) to chain operations
5. Add comments explaining your logic
6. Run all cells to verify your code works
7. Answer all reflection questions

## Grading

- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented code
- **Business Understanding (20%)**: Demonstrates understanding of context
- **Analysis & Insights (15%)**: Meaningful insights and recommendations
- **Reflection Questions (5%)**: Thoughtful answers

## Academic Integrity

This is an individual exam. You may use:
- Course notes and lesson materials
- R documentation and help files
- Your previous homework assignments

You may NOT:
- Collaborate with other students
- Use AI assistants or online forums
- Share code or solutions

---

**Good luck! 🎓**

## Part 1: R Basics and Data Import (Lesson 1)

**Skills Assessed:** Variables, data types, data import, working directory

**Your Tasks:**
1. Set working directory
2. Load required packages
3. Import multiple datasets
4. Examine data structures
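
The import and inspection steps listed above can be sketched on a throwaway file. This is a minimal illustration, assuming the `readr` package (part of the tidyverse) is installed; the file contents and the name `toy` are hypothetical and are not the exam data.

```r
library(readr)

# Write a tiny CSV to a temporary file (hypothetical data, not the exam files)
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "TransactionID,Region,Revenue,Sale_Date",
    "1,Europe,100.5,2023-04-24",
    "2,Asia Pacific,200,2023-06-09"
  ),
  tmp
)

# Declaring column types up front makes import problems surface early
toy <- read_csv(
  tmp,
  col_types = cols(
    TransactionID = col_integer(),
    Region        = col_character(),
    Revenue       = col_double(),
    Sale_Date     = col_date(format = "%Y-%m-%d")
  )
)

# Examine the structure: dimensions, column names, and column types
dim(toy)
names(toy)
str(toy)
```

Spelling out `col_types` is optional; `read_csv()` guesses types when it is omitted, but an explicit specification documents what each column is expected to contain.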

In [40]:
# Task 1.1: Set Working Directory
# TODO: Set your working directory to where your data files are located
# IMPORTANT: Students must set their own path!
# Example: setwd("/Users/yourname/GitHub/ai-homework-grader-clean/data")

# Your code here:

data_dir <- "/workspaces/assignment-1-version-3-Gavinlara1/assignment/Homework"  # replace with your path

if (!dir.exists(data_dir)) stop(paste("Path not found:", data_dir))
setwd(data_dir)

# Verify
cat("Current working directory:", getwd(), "\n")
cat("Files:\n"); print(list.files())

Current working directory: /workspaces/assignment-1-version-3-Gavinlara1/assignment/Homework 
Files:
 [1] "homework_lesson_3_data_transformation (3).ipynb"                 
 [2] "homework_lesson_4_data_transformation_part2.ipynb"               
 [3] "Lara_Gavin homework_lesson_3.ipynb"                              
 [4] "Lara_Gavin_homework_lesson_4_data_transformation_part2 (1).ipynb"
 [5] "Lara_Gavin_homework_lesson_5_data_reshaping.ipynb"               
 [6] "Lara_Gavin_homework_lesson_6_joins.ipynb"                        
 [7] "Lara_Gavin_homework_lesson_7_string_datetime.ipynb"              
 [8] "MIDTERM_EXAM_COMPREHENSIVE (1).ipynb"                            
 [9] "MIDTERM_EXAM_COMPREHENSIVE (2).ipynb"                            
[10] "README.md"                                                       

In [41]:
# Task 1.2: Load Required Packages
# TODO: Load tidyverse (includes dplyr, tidyr, stringr, ggplot2)
# TODO: Load lubridate for date operations

required_pkgs <- c("tidyverse", "lubridate")
to_install <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org", quiet = TRUE)
}

suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
})

cat("✅ Packages loaded successfully!\n")

✅ Packages loaded successfully!


In [42]:
# Task 1.3: Import Datasets
# TODO: Import the following CSV files using read_csv():
#   - company_sales_data.csv -> sales_data
#   - customers.csv -> customers
#   - products.csv -> products
#   - orders.csv -> orders
#   - order_items.csv -> order_items

# Your code here:
files <- c(
  sales_data  = "company_sales_data.csv",
  customers   = "customers.csv",
  products    = "products.csv",
  orders      = "orders.csv",
  order_items = "order_items.csv"
)

# Try locations in this order: working dir, ./data subfolder, data_dir (if defined),
# plus parent-level data folders and the repo-level data folder if present.
repo_data_path <- '/workspaces/assignment-1-version-3-Gavinlara1/data'
parent1 <- dirname(getwd())
parent2 <- dirname(parent1)
locs <- unique(c(
  getwd(),
  file.path(getwd(), 'data'),
  if (exists('data_dir')) normalizePath(data_dir, mustWork = FALSE) else NULL,
  file.path(parent1, 'data'),
  file.path(parent2, 'data'),
  repo_data_path
))

found_paths <- sapply(files, function(fname) {
  candidates <- file.path(locs, fname)
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  hit <- candidates[file.exists(candidates)]
  if (length(hit) > 0) return(hit[1])
  NA_character_
})

missing <- names(found_paths)[is.na(found_paths)]
if (length(missing) > 0) {
  cat("Files in current directory:\n"); print(list.files())
  if (dir.exists("data")) { cat("\nFiles in ./data directory:\n"); print(list.files("data")) }
  if (exists("data_dir")) { cat("\nFiles in data_dir (", data_dir, "):\n"); print(list.files(data_dir)) }
  stop(paste("Missing required file(s):", paste(missing, collapse = ", ")))
}

suppressMessages({
  sales_data  <- readr::read_csv(found_paths["sales_data"],  show_col_types = FALSE)
  customers   <- readr::read_csv(found_paths["customers"],   show_col_types = FALSE)
  products    <- readr::read_csv(found_paths["products"],    show_col_types = FALSE)
  orders      <- readr::read_csv(found_paths["orders"],      show_col_types = FALSE)
  order_items <- readr::read_csv(found_paths["order_items"], show_col_types = FALSE)
})

cat("✅ Imported datasets:\n")
datasets <- list(sales_data = sales_data, customers = customers, products = products, orders = orders, order_items = order_items)
for (nm in names(datasets)) {
  df <- datasets[[nm]]
  cat(sprintf("- %s: %d rows, %d cols\n", nm, nrow(df), ncol(df)))
}

cat("\nPreview of sales_data:\n")
print(utils::head(sales_data, 5))

✅ Imported datasets:
- sales_data: 300 rows, 8 cols
- customers: 100 rows, 5 cols
- products: 50 rows, 4 cols
- orders: 250 rows, 4 cols
- order_items: 400 rows, 4 cols

Preview of sales_data:
# A tibble: 5 × 8
  TransactionID Sales_Rep_Name Region Product_Category Revenue   Cost Units_Sold
          <dbl> <chr>          <chr>  <chr>              <dbl>  <dbl>      <dbl>
1             1 Carol Davis    Latin… Services          20751. 12253.         78
2             2 Carol Davis    Europe Hardware          32360. 24595.         13
3             3 Carol Davis    Europe Servi

## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

**Skills Assessed:** Identifying NAs, handling missing data, detecting outliers

**Your Tasks:**
1. Check for missing values in sales_data
2. Handle missing values appropriately
3. Identify outliers in Revenue column
4. Create a cleaned dataset
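
The IQR rule used for outlier detection in this part can be sketched on a small vector. This is an illustration only: the values in `revenue` are hypothetical, and the quartiles use R's default `quantile()` algorithm (type 7).

```r
# Toy revenue values (hypothetical, not taken from the exam data)
revenue <- c(12, 15, 14, 13, 16, 15, 14, 90)

# Quartiles; unname() drops the "25%" / "75%" labels
q1 <- unname(quantile(revenue, 0.25))
q3 <- unname(quantile(revenue, 0.75))
iqr_value <- q3 - q1

# Fences: values outside [lower_bound, upper_bound] are flagged
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

is_outlier <- revenue < lower_bound | revenue > upper_bound
outliers <- revenue[is_outlier]
outliers
```

For these values the quartiles are 13.75 and 15.25, so the fences are 11.5 and 17.5 and only the value 90 is flagged. A flagged value is a candidate for review, not automatically an error.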

In [43]:
# Task 2.1: Check for Missing Values
# TODO: Create 'missing_summary' that shows count of NAs in each column of sales_data

# Calculate missing values per column (works for data.frames and tibbles)
missing_summary <- sapply(sales_data, function(x) sum(is.na(x)))
# Also compute percent missing for convenience
missing_pct <- round(100 * missing_summary / nrow(sales_data), 2)
missing_df <- data.frame(column = names(missing_summary), missing_count = as.integer(missing_summary), missing_pct = missing_pct, stringsAsFactors = FALSE)
missing_df <- missing_df[order(-missing_df$missing_count), ]

cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_df)
cat("\nTotal missing values:", sum(missing_summary), " ( ", sum(missing_summary) / (nrow(sales_data) * ncol(sales_data)) * 100, ")% of all cells\n")

                           column missing_count missing_pct
TransactionID       TransactionID             0           0
Sales_Rep_Name     Sales_Rep_Name             0           0
Region                     Region             0           0
Product_Category Product_Category             0           0
Revenue                   Revenue             0           0
Cost                         Cost             0           0
Units_Sold             Units_Sold             0           0
Sale_Date               Sale_Date             0           0

Total missing values: 0  (  0 )% of all cells

In [44]:
# Task 2.2: Handle Missing Values
# TODO: Create 'sales_clean' by removing rows with ANY missing values

# Remove rows that contain any NA in any column (use tidyr::drop_na)
sales_clean <- tidyr::drop_na(sales_data)

# Sanity checks and summary
original_rows <- nrow(sales_data)
cleaned_rows  <- nrow(sales_clean)
rows_removed  <- original_rows - cleaned_rows
pct_removed   <- if (original_rows > 0) round(100 * rows_removed / original_rows, 2) else 0

cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", original_rows, "\n")
cat("Cleaned rows:", cleaned_rows, "\n")
cat("Rows removed:", rows_removed, " (", pct_removed, "% )\n")

# Display a small sample of cleaned data to confirm
print(utils::head(sales_clean, 5))

Original rows: 300 
Cleaned rows: 300 
Rows removed: 0  ( 0 % )
# A tibble: 5 × 8
  TransactionID Sales_Rep_Name Region Product_Category Revenue   Cost Units_Sold
          <dbl> <chr>          <chr>  <chr>              <dbl>  <dbl>      <dbl>
1             1 Carol Davis    Latin… Services          20751. 12253.         78
2             2 Carol Davis    Europe Hardware          32360. 24595.         13
3             3 Carol Davis    Europe Services          39268. 23291.         34
4             4 Bob Smith      Europe Hardware          28865. 12429.         90
5             5 Frank Miller

In [45]:
# Task 2.3: Detect Outliers in Revenue
# Calculate outlier thresholds using the IQR method
#   - Calculate Q1 (25th percentile) and Q3 (75th percentile) of Revenue
#   - Calculate IQR = Q3 - Q1
#   - Lower bound = Q1 - 1.5 * IQR
#   - Upper bound = Q3 + 1.5 * IQR
# Create 'outlier_analysis' dataframe with these values and identify outlier rows

# Ensure Revenue is numeric and remove NA values for the calculation
revenue_vec <- as.numeric(sales_clean$Revenue)
revenue_vec <- revenue_vec[!is.na(revenue_vec)]

if (length(revenue_vec) == 0) {
  stop('Revenue column has no numeric values after removing NAs; cannot compute outliers')
}

Q1 <- as.numeric(quantile(revenue_vec, probs = 0.25, na.rm = TRUE, type = 7))
Q3 <- as.numeric(quantile(revenue_vec, probs = 0.75, na.rm = TRUE, type = 7))
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Identify outlier rows in sales_clean (using original sales_clean to preserve rows)
is_outlier <- !is.na(sales_clean$Revenue) & (as.numeric(sales_clean$Revenue) < lower_bound | as.numeric(sales_clean$Revenue) > upper_bound)
outlier_count <- sum(is_outlier, na.rm = TRUE)
cat("\nNumber of outliers detected:", outlier_count, "\n")

# Show a small sample of outliers if any
if (outlier_count > 0) {
  cat("\nSample outlier rows:\n")
  print(head(sales_clean[which(is_outlier), ], 10))
} else {
  cat("\nNo outliers detected using the IQR method.\n")
}

       Metric     Value
1          Q1  15034.29
2          Q3  37707.71
3         IQR  22673.42
4 Lower Bound -18975.84
5 Upper Bound  71717.84

Number of outliers detected: 0 

No outliers detected using the IQR method.


## Part 3: Data Transformation Part 1 (Lesson 3)

**Skills Assessed:** select(), filter(), arrange(), pipe operator

**Your Tasks:**
1. Select specific columns
2. Filter data by conditions
3. Sort data
4. Chain operations with pipe
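
The chaining pattern in the task list above can be sketched on toy data. This is a minimal illustration assuming `dplyr` is installed; `toy_sales` and its values are hypothetical, with column names chosen to mirror the sales data. It also shows one thing worth checking when sorting by a grouping column first: `slice_head()` on an ungrouped table takes rows from the top of the whole table, so the result may come from a single region. The grouped `slice_max()` variant is an alternative, not part of the exam's required steps.

```r
library(dplyr)

# Hypothetical toy data
toy_sales <- tibble(
  Region           = c("Europe", "Asia Pacific", "Europe", "Asia Pacific", "Europe"),
  Product_Category = c("Hardware", "Software", "Services", "Consulting", "Software"),
  Revenue          = c(32000, 41000, 9000, 18000, 27000)
)

# filter -> select -> arrange -> slice, chained with the pipe
chained <- toy_sales %>%
  filter(Revenue > 15000) %>%
  select(Region, Revenue) %>%
  arrange(Region, desc(Revenue)) %>%
  slice_head(n = 2)

# Alternative: the single highest-revenue row within each region
top_per_region <- toy_sales %>%
  group_by(Region) %>%
  slice_max(Revenue, n = 1) %>%
  ungroup()

chained
top_per_region
```

Here `chained` contains only Asia Pacific rows because that region sorts first alphabetically, while `top_per_region` has one row for each region.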

In [46]:
# Task 3.1: Select Specific Columns
# Create 'sales_summary' with only these columns from sales_clean:
#   Region, Product_Category, Revenue, Units_Sold, Sale_Date

# Robust selection: handle minor column-name variations (e.g., Sale_Date vs SaleDate)
required_cols <- c('Region', 'Product_Category', 'Revenue', 'Units_Sold', 'Sale_Date')
available_cols <- names(sales_clean)

# Helper to find best match for a desired column name
find_col <- function(name, candidates) {
  if (name %in% candidates) return(name)
  # try common variants
  variants <- c(name, gsub('_', '', name), gsub('_', '', tolower(name)), tolower(name))
  hit <- variants[variants %in% candidates]
  if (length(hit) > 0) return(hit[1])
  NA_character_
}

mapped <- sapply(required_cols, find_col, candidates = available_cols, USE.NAMES = FALSE)
missing <- required_cols[is.na(mapped)]
if (length(missing) > 0) stop(paste('Missing required columns in sales_clean:', paste(missing, collapse = ', ')))

# Build sales_summary by selecting the mapped columns in the requested order
sales_summary <- sales_clean %>% dplyr::select(all_of(mapped))

cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", paste(names(sales_summary), collapse = ', '), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)

Columns: Region, Product_Category, Revenue, Units_Sold, Sale_Date 
Rows: 300 


Region,Product_Category,Revenue,Units_Sold,Sale_Date
<chr>,<chr>,<dbl>,<dbl>,<date>
Latin America,Services,20750.92,78,2023-04-24
Europe,Hardware,32359.98,13,2023-06-09
Europe,Services,39268.4,34,2023-03-25
Europe,Hardware,28865.09,90,2023-04-11
Latin America,Software,3932.36,63,2023-08-26


In [47]:
# Task 3.2: Filter High Revenue Sales
# Create 'high_revenue_sales' by filtering sales_clean for Revenue > 20000

# Ensure Revenue is numeric for comparison (coerce safely)
sales_clean <- sales_clean %>% dplyr::mutate(Revenue = as.numeric(Revenue))

high_revenue_sales <- sales_clean %>%
  dplyr::filter(!is.na(Revenue) & Revenue > 20000)

cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", sum(high_revenue_sales$Revenue, na.rm = TRUE), "\n")

# Show top 5 high revenue transactions for quick inspection
print(head(high_revenue_sales %>% dplyr::arrange(desc(Revenue)), 5))

Total high revenue transactions: 194 
Total revenue from these sales: $ 6671906 
# A tibble: 5 × 8
  TransactionID Sales_Rep_Name Region Product_Category Revenue   Cost Units_Sold
          <dbl> <chr>          <chr>  <chr>              <dbl>  <dbl>      <dbl>
1           230 Alice Johnson  Asia … Consulting        49956. 21741.         88
2           284 David Wilson   Europe Software          49867. 19117.         96
3             8 Alice Johnson  Europe Consulting        49857. 37119.          1
4           119 Eva Brown      Asia … Consulting        49239. 24051.         72

In [48]:
# Task 3.3: Sort by Revenue
# Create 'top_sales' by arranging sales_clean by Revenue in descending order
# and keeping only the top 10 rows

# Ensure Revenue is numeric (should already be from Task 3.2, but coerce again to be safe)
sales_clean <- sales_clean %>% dplyr::mutate(Revenue = as.numeric(Revenue))

top_sales <- sales_clean %>%
  dplyr::arrange(dplyr::desc(Revenue)) %>%
  dplyr::slice_head(n = 10)

cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% dplyr::select(Region, Product_Category, Revenue, Units_Sold))

# A tibble: 10 × 4
   Region        Product_Category Revenue Units_Sold
   <chr>         <chr>              <dbl>      <dbl>
 1 Asia Pacific  Consulting        49956.         88
 2 Europe        Software          49867.         96
 3 Europe        Consulting        49857.          1
 4 Asia Pacific  Consulting        49239.         72
 5 Asia Pacific  Hardware          48997.         92
 6 North America Services          48884.         62
 7 Europe        Software          48794.         77
 8 North America Hardware          48772.         16
 9 North America Consulting        48748.         63
10 Europe        Consulting

In [49]:
# Task 3.4: Chain Multiple Operations
# TODO: Create 'regional_top_sales' by:
#   1. Filtering for Revenue > 15000
#   2. Selecting: Region, Product_Category, Revenue
#   3. Arranging by Region (ascending) then Revenue (descending)
#   4. Keeping top 15 rows
# Use the pipe operator to chain all operations

regional_top_sales <- sales_clean %>%
  dplyr::mutate(Revenue = as.numeric(Revenue)) %>%
  dplyr::filter(!is.na(Revenue) & Revenue > 15000) %>%
  dplyr::select(Region, Product_Category, Revenue) %>%
  dplyr::arrange(Region, dplyr::desc(Revenue)) %>%
  dplyr::slice_head(n = 15)

cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)

# A tibble: 15 × 3
   Region       Product_Category Revenue
   <chr>        <chr>              <dbl>
 1 Asia Pacific Consulting        49956.
 2 Asia Pacific Consulting        49239.
 3 Asia Pacific Hardware          48997.
 4 Asia Pacific Consulting        48063.
 5 Asia Pacific Services          46731.
 6 Asia Pacific Software          46615.
 7 Asia Pacific Consulting        46545.
 8 Asia Pacific Consulting        45756.
 9 Asia Pacific Services          45298.
10 Asia Pacific Services          44624.
11 Asia Pacific Consulting        44455.
12 Asia Pacific Consulting        44364.

## Part 4: Data Transformation Part 2 (Lesson 4)

**Skills Assessed:** mutate(), summarize(), group_by()

**Your Tasks:**
1. Create calculated columns with mutate()
2. Calculate summary statistics
3. Perform grouped analysis
4. Generate business metrics
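
The `mutate()` and `group_by()` / `summarize()` steps listed above can be sketched on toy data. This is an illustration under assumptions: `dplyr` is installed, and `toy_sales` with its values is hypothetical. It includes a guard for zero units, since dividing by zero in R yields `Inf` and not an error.

```r
library(dplyr)

# Hypothetical toy data; one row has zero units sold
toy_sales <- tibble(
  Region     = c("Europe", "Europe", "Asia Pacific", "Asia Pacific"),
  Revenue    = c(30000, 10000, 25000, 5000),
  Units_Sold = c(10, 0, 50, 5)
)

toy_enhanced <- toy_sales %>%
  mutate(
    # NA when no units were sold, so Inf never reaches later summaries
    revenue_per_unit = if_else(Units_Sold > 0, Revenue / Units_Sold, NA_real_),
    high_value       = if_else(Revenue > 20000, "Yes", "No")
  )

toy_regional <- toy_enhanced %>%
  group_by(Region) %>%
  summarize(
    total_revenue     = sum(Revenue),
    avg_revenue       = mean(Revenue),
    transaction_count = n(),
    .groups = "drop"   # return an ungrouped tibble
  ) %>%
  arrange(desc(total_revenue))

toy_regional
```

Setting `.groups = "drop"` avoids carrying grouping into later steps, which matters when the summary is reused in further `mutate()` or `arrange()` calls.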

In [50]:
# Task 4.1: Create Calculated Columns
# TODO: Add these new columns to sales_clean using mutate():
#   - revenue_per_unit: Revenue / Units_Sold
#   - high_value: "Yes" if Revenue > 20000, else "No"
# Store result in 'sales_enhanced'

# Ensure numeric columns before calculations
sales_enhanced <- sales_clean %>%
  dplyr::mutate(
    Revenue = as.numeric(Revenue),
    Units_Sold = as.numeric(Units_Sold),
    revenue_per_unit = ifelse(!is.na(Units_Sold) & Units_Sold != 0, Revenue / Units_Sold, NA_real_),
    high_value = ifelse(!is.na(Revenue) & Revenue > 20000, 'Yes', 'No')
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% dplyr::select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

New columns added: revenue_per_unit, high_value


Revenue,Units_Sold,revenue_per_unit,high_value
<dbl>,<dbl>,<dbl>,<chr>
20750.92,78,266.03744,Yes
32359.98,13,2489.22923,Yes
39268.4,34,1154.95294,Yes
28865.09,90,320.72322,Yes
3932.36,63,62.41841,No


In [51]:
# Task 4.2: Calculate Overall Summary Statistics
# Create 'overall_summary' with these metrics from sales_enhanced:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - total_units: sum of Units_Sold
#   - transaction_count: count using n()

overall_summary <- sales_enhanced %>%
  dplyr::summarize(
    total_revenue = sum(Revenue, na.rm = TRUE),
    avg_revenue = mean(Revenue, na.rm = TRUE),
    total_units = sum(Units_Sold, na.rm = TRUE),
    transaction_count = dplyr::n()
  )

cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)

# A tibble: 1 × 4
  total_revenue avg_revenue total_units transaction_count
          <dbl>       <dbl>       <dbl>             <int>
1      7771711.      25906.       16169               300


In [52]:
# Task 4.3: Regional Performance Analysis
# TODO: Create 'regional_summary' by grouping sales_enhanced by Region
#       and calculating:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - transaction_count: count using n()
# Then arrange by total_revenue descending
# Hint: Use group_by() %>% summarize() %>% arrange()

regional_summary <- sales_enhanced %>%
  dplyr::mutate(Revenue = as.numeric(Revenue)) %>%
  dplyr::group_by(Region) %>%
  dplyr::summarize(
    total_revenue = sum(Revenue, na.rm = TRUE),
    avg_revenue = mean(Revenue, na.rm = TRUE),
    transaction_count = dplyr::n(),
    .groups = 'drop'
  ) %>%
  dplyr::arrange(dplyr::desc(total_revenue))

cat("========== REGIONAL SUMMARY ==========\n")
print(regional_summary)

# A tibble: 4 × 4
  Region        total_revenue avg_revenue transaction_count
  <chr>                 <dbl>       <dbl>             <int>
1 Europe             2224182.      27124.                82
2 Latin America      2112037.      25446.                83
3 Asia Pacific       1804243.      26929.                67
4 North America      1631248.      23989.                68

In [53]:
# Task 4.4: Product Category Analysis
# TODO: Create 'category_summary' by grouping by Product_Category
#       and calculating the same metrics as regional_summary
#       Then arrange by total_revenue descending

category_summary <- sales_enhanced %>%
  dplyr::mutate(Revenue = as.numeric(Revenue)) %>%
  dplyr::group_by(Product_Category) %>%
  dplyr::summarize(
    total_revenue = sum(Revenue, na.rm = TRUE),
    avg_revenue = mean(Revenue, na.rm = TRUE),
    transaction_count = dplyr::n(),
    .groups = 'drop'
  ) %>%
  dplyr::arrange(dplyr::desc(total_revenue))

cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)

# A tibble: 4 × 4
  Product_Category total_revenue avg_revenue transaction_count
  <chr>                    <dbl>       <dbl>             <int>
1 Consulting            1978840.      26037.                76
2 Services              1961565.      27244.                72
3 Hardware              1951325.      26730.                73
4 Software              1879981.      23797.                79

## Part 5: Data Reshaping with tidyr (Lesson 5)

**Skills Assessed:** pivot_longer(), pivot_wider(), tidy data principles

**Your Tasks:**
1. Reshape data from wide to long format
2. Reshape data from long to wide format
3. Create analysis-ready datasets
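
The wide and long reshaping steps listed above are inverses of each other, which can be checked on toy data. This is a minimal sketch assuming `dplyr` and `tidyr` are installed; `toy_long` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per Region / Product_Category pair
toy_long <- tibble(
  Region           = c("Europe", "Europe", "Asia Pacific", "Asia Pacific"),
  Product_Category = c("Hardware", "Software", "Hardware", "Software"),
  total_revenue    = c(100, 200, 300, 400)
)

# Long -> wide: one column per product category
toy_wide <- toy_long %>%
  pivot_wider(names_from = Product_Category, values_from = total_revenue)

# Wide -> long: back to one row per Region / Product_Category pair
toy_back <- toy_wide %>%
  pivot_longer(
    cols      = -Region,
    names_to  = "Product_Category",
    values_to = "total_revenue"
  )

# Row order can differ after a round trip, so sort both before comparing
round_trip_ok <- isTRUE(all.equal(
  as.data.frame(arrange(toy_long, Region, Product_Category)),
  as.data.frame(arrange(toy_back, Region, Product_Category))
))
round_trip_ok
```

A round-trip check like this is a quick way to confirm that no category was dropped or renamed during reshaping.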

In [54]:
# Task 5.1: Create Wide Format Data
# First, create a summary by Region and Product_Category
region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))

# A tibble: 10 × 3
   Region        Product_Category total_revenue
   <chr>         <chr>                    <dbl>
 1 Asia Pacific  Consulting             759641.
 2 Asia Pacific  Hardware               271979.
 3 Asia Pacific  Services               331826.
 4 Asia Pacific  Software               440797.
 5 Europe        Consulting             390670.
 6 Europe        Hardware               777044.
 7 Europe        Services               513507.
 8 Europe        Software               542961.
 9 Latin America Consulting             433397.
10 Latin America Hardware

In [55]:
# Task 5.2: Reshape to Wide Format
# TODO: Create 'revenue_wide' by pivoting region_category_revenue
#       so that Product_Category values become column names
#       with total_revenue as the values


revenue_wide <- region_category_revenue %>%
  tidyr::pivot_wider(
    names_from = Product_Category,
    values_from = total_revenue,
    values_fill = list(total_revenue = 0)
  ) %>%
  # Make column names safe (optional)
  dplyr::rename_with(~ make.names(., unique = TRUE))

cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)

# A tibble: 4 × 5
  Region        Consulting Hardware Services Software
  <chr>              <dbl>    <dbl>    <dbl>    <dbl>
1 Asia Pacific     759641.  271979.  331826.  440797.
2 Europe           390670.  777044.  513507.  542961.
3 Latin America    433397.  474257.  644772.  559611.
4 North America    395132.  428046.  471460.  336611.

In [56]:
# Task 5.3: Reshape Back to Long Format
# TODO: Create 'revenue_long' by pivoting revenue_wide back to long format
#       Column names (except Region) should go into 'Product_Category'
#       Values should go into 'revenue'


revenue_long <- revenue_wide %>%
  tidyr::pivot_longer(
    cols = -Region,
    names_to = 'Product_Category',
    values_to = 'revenue'
  ) %>%
  # If column names were made syntactic, reverse make.names() effects where possible
  dplyr::mutate(
    Product_Category = gsub('.', ' ', Product_Category, fixed = TRUE),
    revenue = as.numeric(revenue)
  ) %>%
  dplyr::arrange(Region, dplyr::desc(revenue))

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))

# A tibble: 10 × 3
   Region        Product_Category revenue
   <chr>         <chr>              <dbl>
 1 Asia Pacific  Consulting       759641.
 2 Asia Pacific  Software         440797.
 3 Asia Pacific  Services         331826.
 4 Asia Pacific  Hardware         271979.
 5 Europe        Hardware         777044.
 6 Europe        Software         542961.
 7 Europe        Services         513507.
 8 Europe        Consulting       390670.
 9 Latin America Services         644772.
10 Latin America Software         559611.

## Part 6: Combining Datasets with Joins (Lesson 6)

**Skills Assessed:** left_join(), inner_join(), data integration

**Your Tasks:**
1. Join customers with orders
2. Join orders with order_items
3. Create integrated dataset
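
The join tasks listed above can be sketched on toy tables to show how row counts differ between join types. This is an illustration assuming `dplyr` is installed; `toy_customers`, `toy_orders`, and their values are hypothetical. The `anti_join()` calls are an optional diagnostic for finding unmatched keys, not a required exam step.

```r
library(dplyr)

# Hypothetical toy tables
toy_customers <- tibble(CustomerID = 1:3, Name = c("A", "B", "C"))
toy_orders <- tibble(
  OrderID    = 101:104,
  CustomerID = c(1L, 1L, 2L, 9L)  # customer 9 does not exist in toy_customers
)

# left_join keeps every customer, with NA order columns where nothing matches
left_joined <- left_join(toy_customers, toy_orders, by = "CustomerID")

# inner_join keeps only customers that have at least one order
inner_joined <- inner_join(toy_customers, toy_orders, by = "CustomerID")

# Diagnostics: rows on each side with no partner on the other side
customers_without_orders <- anti_join(toy_customers, toy_orders, by = "CustomerID")
orders_without_customer  <- anti_join(toy_orders, toy_customers, by = "CustomerID")

nrow(left_joined)   # 4: two orders for customer 1, one for 2, one NA row for 3
nrow(inner_joined)  # 3: the NA row for customer 3 is gone
```

Comparing row counts before and after a join, together with the two `anti_join()` results, helps explain why a joined table is larger or smaller than either input.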

In [57]:
# Task 6.1: Join Customers and Orders
# Create 'customer_orders' by left joining customers with orders
# Robust join: try common key name variants (CustomerID, customer_id, CustomerId)

# Detect possible join keys in each dataframe
cust_keys <- tolower(names(customers))
order_keys <- tolower(names(orders))
possible_keys <- c('customerid', 'customer_id')  # names are compared after tolower(), so case variants collapse
join_key <- NULL
order_key <- NULL
for (k in possible_keys) {
  if (k %in% cust_keys && k %in% order_keys) {
    # use the original column names as present in dataframes
    join_key <- names(customers)[which(cust_keys == k)[1]]
    order_key <- names(orders)[which(order_keys == k)[1]]
    break
  }
}
if (is.null(join_key)) {
  stop('Could not find a common Customer key in customers and orders dataframes')
}

# Perform left join using the detected keys (allow differing names)
customer_orders <- dplyr::left_join(customers, orders, by = setNames(order_key, join_key))

cat("========== CUSTOMER ORDERS ==========\n")
cat("Detected join keys -> customers:", join_key, ", orders:", order_key, "\n")
cat("Total rows (customer_orders):", nrow(customer_orders), "\n")
cat("Columns (customer_orders):", ncol(customer_orders), "\n")
# show a small sample to validate the join
print(head(customer_orders, 5))

Detected join keys -> customers: CustomerID , orders: CustomerID 
Total rows (customer_orders): 200 
Columns (customer_orders): 8 
# A tibble: 5 × 8
  CustomerID Name  Email City  Registration_Date OrderID Order_Date Total_Amount
       <dbl> <chr> <chr> <chr> <date>              <dbl> <date>            <dbl>
1          1 Cust… cust… Phoe… 2020-10-03             87 2023-03-28         716.
2          1 Cust… cust… Phoe… 2020-10-03            214 2023-09-12        1344.
3          2 Cust… cust… Los … 2020-06-02            173 2024-02-25         160.
4          2 Cust… cust… Los … 2020-06-02            190 2023-04-19

In [58]:
# Task 6.2: Join Orders and Order Items
# Create 'orders_with_items' by inner joining orders with order_items
# Robust join: detect common Order key variants (OrderID, order_id, OrderId)

order_keys_orders <- tolower(names(orders))
order_keys_items  <- tolower(names(order_items))
possible_order_keys <- c('orderid', 'order_id')  # names are compared after tolower(), so case variants collapse
order_join_key <- NULL
order_item_key <- NULL
for (k in possible_order_keys) {
  if (k %in% order_keys_orders && k %in% order_keys_items) {
    order_join_key <- names(orders)[which(order_keys_orders == k)[1]]
    order_item_key <- names(order_items)[which(order_keys_items == k)[1]]
    break
  }
}
if (is.null(order_join_key)) {
  stop('Could not find a common Order key in orders and order_items dataframes')
}

# Perform inner join (orders -> order_items)
orders_with_items <- dplyr::inner_join(orders, order_items, by = setNames(order_item_key, order_join_key))

cat("========== ORDERS WITH ITEMS ==========\n")
cat("Detected order join keys -> orders:", order_join_key, ", order_items:", order_item_key, "\n")
cat("Total rows (orders_with_items):", nrow(orders_with_items), "\n")
cat("Columns (orders_with_items):", ncol(orders_with_items), "\n")
print(head(orders_with_items, 5))

Detected order join keys -> orders: OrderID , order_items: OrderID 
Total rows (orders_with_items): 400 
Columns (orders_with_items): 7 
# A tibble: 5 × 7
  OrderID CustomerID Order_Date Total_Amount ProductID Quantity Unit_Price
    <dbl>      <dbl> <date>            <dbl>     <dbl>    <dbl>      <dbl>
1       1         87 2023-08-30         424.         2        3      116. 
2       1         87 2023-08-30         424.        22        5      207. 
3       1         87 2023-08-30         424.        26        5       61.8
4       3         37 2024-03-19         549.        19        1      475. 
5       6        101 2023-07-22         190.        32        4

## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

**Skills Assessed:** stringr functions, lubridate functions

**Your Tasks:**
1. Clean text data
2. Parse dates
3. Extract date components

In [59]:
# Task 7.1: Clean Text Data
# TODO: Add these columns to sales_enhanced using mutate():
#   - region_clean: Region with trimmed whitespace and Title Case
#   - category_clean: Product_Category with trimmed whitespace and Title Case


sales_enhanced <- sales_enhanced %>%
  dplyr::mutate(
    # str_squish() trims both ends and collapses repeated internal spaces; str_to_title() then applies Title Case
    region_clean = stringr::str_to_title(stringr::str_squish(as.character(Region))),
    category_clean = stringr::str_to_title(stringr::str_squish(as.character(Product_Category)))
  )

cat("========== CLEANED TEXT DATA ==========\n")
print(head(sales_enhanced %>% dplyr::select(Region, region_clean, Product_Category, category_clean), 5))

# A tibble: 5 × 4
  Region        region_clean  Product_Category category_clean
  <chr>         <chr>         <chr>            <chr>
1 Latin America Latin America Services         Services
2 Europe        Europe        Hardware         Hardware
3 Europe        Europe        Services         Services
4 Europe        Europe        Hardware         Hardware
5 Latin America Latin America Software         Software

In [60]:
# Task 7.2: Parse Dates and Extract Components
# TODO: Add these date-related columns using mutate():
#   - date_parsed: Parse Sale_Date column (use ymd(), mdy(), or dmy() as appropriate)
#   - sale_month: Extract month name from date_parsed
#   - sale_weekday: Extract weekday name from date_parsed


# Parse with lubridate::parse_date_time(), trying several common component orders.
# Orders give the component order only; lubridate handles the separators.
sales_enhanced <- sales_enhanced %>%
  dplyr::mutate(
    raw_sale_date = as.character(Sale_Date),
    # Convert straight to Date. Do not route dates through base ifelse():
    # it drops the class and returns bare numbers, which as.Date() then reads as day counts.
    date_parsed = as.Date(lubridate::parse_date_time(raw_sale_date, orders = c('ymd', 'mdy', 'dmy'), tz = 'UTC')),
    # format() returns NA for missing dates, so no explicit NA guard is needed
    sale_month = format(date_parsed, '%B'),
    sale_weekday = format(date_parsed, '%A')
  )

cat("========== DATE COMPONENTS ==========\n")
print(head(sales_enhanced %>% dplyr::select(Sale_Date, raw_sale_date, date_parsed, sale_month, sale_weekday), 5))

# A tibble: 5 × 5
  Sale_Date  raw_sale_date date_parsed sale_month sale_weekday
  <date>     <chr>         <date>      <chr>      <chr>
1 2023-04-24 2023-04-24    2023-04-24  April      Monday
2 2023-06-09 2023-06-09    2023-06-09  June       Friday
3 2023-03-25 2023-03-25    2023-03-25  March      Saturday
4 2023-04-11 2023-04-11    2023-04-11  April      Tuesday
5 2023-08-26 2023-08-26    2023-08-26  August     Saturday

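A pitfall worth checking whenever dates are filled in conditionally: base `ifelse()` keeps only the underlying numbers and drops the `Date` (or `POSIXct`) class. The sketch below uses made-up dates and a hypothetical fallback value to show the effect and one class-preserving alternative in base R.

```r
# Hypothetical values for illustration only
dates <- as.Date(c("2023-04-24", NA, "2023-06-09"))
fallback <- as.Date("2023-01-01")

# Base ifelse() returns a plain numeric vector: the Date class is lost
via_ifelse <- ifelse(is.na(dates), fallback, dates)
print(class(via_ifelse))

# Index-based assignment keeps the Date class intact
filled <- dates
filled[is.na(filled)] <- fallback
print(class(filled))
print(filled)
```

`dplyr::if_else()` is another option, since it preserves the type of its inputs.
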
## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

**Skills Assessed:** case_when(), complex logic, KPIs

**Your Tasks:**
1. Create business categories with case_when()
2. Calculate KPIs
3. Generate executive summary

In [61]:
# Task 8.1: Create Performance Categories
# TODO: Add 'performance_tier' column using case_when():
#   - "High" if Revenue > 25000
#   - "Medium" if Revenue > 15000
#   - "Low" otherwise

# Ensure Revenue is numeric before applying thresholds
sales_enhanced <- sales_enhanced %>%
  dplyr::mutate(Revenue = as.numeric(Revenue)) %>%
  dplyr::mutate(
    performance_tier = dplyr::case_when(
      !is.na(Revenue) & Revenue > 25000 ~ 'High',
      !is.na(Revenue) & Revenue > 15000 ~ 'Medium',
      !is.na(Revenue) ~ 'Low',
      TRUE ~ NA_character_
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
print(dplyr::count(sales_enhanced, performance_tier))

# A tibble: 3 × 2
  performance_tier     n
  <chr>            <int>
1 High               154
2 Low                 74
3 Medium              72


In [62]:
# Task 8.2: Calculate Business KPIs
# TODO: Create 'business_kpis' with these metrics:
#   - total_revenue: sum of Revenue
#   - total_transactions: count of rows
#   - avg_transaction_value: mean of Revenue
#   - high_value_pct: percentage where high_value = "Yes"

# Compute business KPIs from sales_enhanced
business_kpis <- sales_enhanced %>%
  dplyr::summarize(
    total_revenue = sum(as.numeric(Revenue), na.rm = TRUE),
    total_transactions = dplyr::n(),
    avg_transaction_value = mean(as.numeric(Revenue), na.rm = TRUE),
    high_value_pct = 100 * mean(ifelse(tolower(as.character(high_value)) == 'yes', 1, 0), na.rm = TRUE)
  )

cat("Business KPIs are shown below:\n")
print(business_kpis)

Business KPIs are shown below:
# A tibble: 1 × 4
  total_revenue total_transactions avg_transaction_value high_value_pct
          <dbl>              <int>                 <dbl>          <dbl>
1      7771711.                300                25906.           64.7


## Part 9: Reflection Questions

Answer the following questions based on your analysis.

### Question 9.1: Data Cleaning Impact

**How did handling missing values and outliers affect your analysis? Why is data cleaning important before performing business analysis?**

Your answer:

- Removing missing values and flagging outliers keeps summary metrics (such as totals and averages) tied to valid transactions, so they aren't skewed by bad or incomplete data.
- Cleaner inputs make the KPIs more reliable, which supports better business decisions.
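
As a minimal sketch of that effect, the base R snippet below uses made-up revenue values (not the exam data) with one missing entry and one data-entry outlier. The 1.5 × IQR fence is one common convention for flagging outliers, assumed here for illustration.

```r
# Hypothetical revenue values: one missing entry and one extreme outlier
revenue <- c(120, 135, NA, 128, 140, 13200)

mean_raw   <- mean(revenue)                # NA: a single missing value propagates
mean_no_na <- mean(revenue, na.rm = TRUE)  # the outlier still dominates the average

# Flag outliers with the 1.5 * IQR rule
q <- unname(quantile(revenue, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(revenue) & (revenue < q[1] - fence | revenue > q[2] + fence)

# Average over the remaining valid transactions
mean_clean <- mean(revenue[!is_outlier], na.rm = TRUE)

print(c(raw = mean_raw, no_na = mean_no_na, clean = mean_clean))
```

The three averages differ by orders of magnitude, which is why the cleaning decisions need to be documented alongside any KPI.
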

### Question 9.2: Grouped Analysis Value

**What insights did you gain from the regional and category summaries that you couldn't see in the raw data? How can businesses use this type of grouped analysis?**

Your answer:

- Grouped summaries reveal which regions or product categories drive most revenue and which have higher average transaction values, insights that are not obvious from raw rows.
- Businesses can use these results to target promotions, allocate inventory, and set regional sales strategies based on high-performing segments.
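
A small sketch of such a grouped summary, assuming a toy table called `toy_sales` whose column names mirror the exam data but whose values are invented:

```r
library(dplyr)

# Hypothetical transactions for illustration only
toy_sales <- tibble(
  Region  = c("Europe", "Europe", "Asia Pacific", "Asia Pacific", "Latin America"),
  Revenue = c(30000, 12000, 22000, 18000, 5000)
)

region_summary <- toy_sales %>%
  group_by(Region) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue   = mean(Revenue),
    n_sales       = n(),
    .groups = "drop"
  ) %>%
  # Share of overall revenue, computed after the groups are collapsed
  mutate(revenue_share = total_revenue / sum(total_revenue)) %>%
  arrange(desc(total_revenue))

print(region_summary)
```

Each region becomes a single row, so totals, averages, and shares can be compared directly instead of being inferred from individual transactions.
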

### Question 9.3: Data Reshaping Purpose

**Why would you need to reshape data between wide and long formats? Provide a business scenario where each format would be useful.**

Your answer:

- Wide format is useful for summary tables or dashboards where each product or category is a column (e.g., a regional sales dashboard showing revenue by category across columns).
- Long format is best for analysis and modeling, and for functions that expect tidy data (e.g., time-series plotting or grouped aggregation where Product_Category is a variable rather than a set of separate columns).
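
The round trip between the two shapes can be sketched as follows. The table `toy_long` and its values are hypothetical; only the column names follow the exam data.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format revenue table
toy_long <- tibble(
  Region           = rep(c("Europe", "Asia Pacific"), each = 2),
  Product_Category = rep(c("Hardware", "Software"), times = 2),
  revenue          = c(100, 200, 300, 400)
)

# Wide: one row per region, one column per category (dashboard-style layout)
toy_wide <- toy_long %>%
  pivot_wider(names_from = Product_Category, values_from = revenue)

# Long again: category becomes a variable, ready for group_by() or plotting
toy_long_again <- toy_wide %>%
  pivot_longer(cols = -Region, names_to = "Product_Category", values_to = "revenue")

print(toy_wide)
print(toy_long_again)
```
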

### Question 9.4: Joining Datasets

**What is the difference between left_join() and inner_join()? When would you use each one in a business context?**

Your answer:

- `left_join()` keeps all rows from the left table and brings matching rows from the right; use it when you want to retain all customers and attach orders (even if some customers have no orders).
- `inner_join()` keeps only rows with matches in both tables; use it when you need only complete order-item records (e.g., computing metrics only for orders that have line items).
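
To make the difference concrete, here is a sketch with two invented tables, `toy_customers` and `toy_orders`, in which one customer has no orders:

```r
library(dplyr)

# Hypothetical data: customer 3 has never placed an order
toy_customers <- tibble(CustomerID = 1:3, Name = c("Ana", "Ben", "Chloe"))
toy_orders <- tibble(
  OrderID      = c(10, 11, 12),
  CustomerID   = c(1L, 1L, 2L),
  Total_Amount = c(50, 75, 20)
)

# left_join(): every customer is kept; customer 3 gets NA in the order columns
left_result <- left_join(toy_customers, toy_orders, by = "CustomerID")

# inner_join(): only customers with at least one matching order remain
inner_result <- inner_join(toy_customers, toy_orders, by = "CustomerID")

# anti_join() lists the rows that the inner join dropped
no_orders <- anti_join(toy_customers, toy_orders, by = "CustomerID")

print(left_result)
print(inner_result)
print(no_orders)
```

Comparing `nrow()` before and after a join is a quick check that no rows were lost or duplicated unexpectedly.
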

### Question 9.5: Skills Integration

**Which R data wrangling skill (from Lessons 1-8) do you think is most valuable for business analytics? Why?**

Your answer:

- Data transformation with dplyr is the most valuable skill because it lets you clean, reshape, and summarize data efficiently for analysis and reporting.
- These functions are versatile, fast, and compose well with pipes, making them central to nearly every business analytics workflow.

## Exam Complete!

### What You've Demonstrated

✅ **Lesson 1:** R basics and data import
✅ **Lesson 2:** Data cleaning (missing values & outliers)
✅ **Lesson 3:** Data transformation (select, filter, arrange)
✅ **Lesson 4:** Advanced transformation (mutate, summarize, group_by)
✅ **Lesson 5:** Data reshaping (pivot_longer, pivot_wider)
✅ **Lesson 6:** Combining datasets (joins)
✅ **Lesson 7:** String manipulation & date/time operations
✅ **Lesson 8:** Advanced wrangling & business intelligence

### Submission Checklist

Before submitting, ensure:
- [ ] All code cells run without errors
- [ ] All TODO sections completed
- [ ] All required dataframes created with correct names
- [ ] All 5 reflection questions answered
- [ ] Student name and ID filled in at top

**Good work! 🎉**