# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment

**Student Name:** [Deon Schoeman]

**Student ID:** [MCH616]

**Date:** [10/19/2025]

**Time Limit:** 4 hours

---

## Exam Overview

This comprehensive midterm exam assesses your mastery of ALL R data wrangling skills covered in Lessons 1-8:

- **Lesson 1:** R Basics and Data Import
- **Lesson 2:** Data Cleaning (Missing Values & Outliers)
- **Lesson 3:** Data Transformation Part 1 (select, filter, arrange)
- **Lesson 4:** Data Transformation Part 2 (mutate, summarize, group_by)
- **Lesson 5:** Data Reshaping (pivot_longer, pivot_wider)
- **Lesson 6:** Combining Datasets (joins)
- **Lesson 7:** String Manipulation & Date/Time
- **Lesson 8:** Advanced Wrangling & Best Practices

## Business Scenario

You are a data analyst for a retail company. The executive team needs a comprehensive analysis of:
- Sales performance across products and regions
- Customer behavior and segmentation
- Data quality issues and recommendations
- Strategic insights for business growth

## Instructions

1. **Set your working directory** to where your data files are located
2. Complete ALL tasks in order
3. Write code in the TODO sections
4. Use the pipe operator (%>%) to chain operations
5. Add comments explaining your logic
6. Run all cells to verify your code works
7. Answer all reflection questions

## Grading

- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented code
- **Business Understanding (20%)**: Demonstrates understanding of context
- **Analysis & Insights (15%)**: Meaningful insights and recommendations
- **Reflection Questions (5%)**: Thoughtful answers

## Academic Integrity

This is an individual exam. You may use:
- Course notes and lesson materials
- R documentation and help files
- Your previous homework assignments

You may NOT:
- Collaborate with other students
- Use AI assistants or online forums
- Share code or solutions

---

**Good luck! 🎓**

## Part 1: R Basics and Data Import (Lesson 1)

**Skills Assessed:** Variables, data types, data import, working directory

**Your Tasks:**
1. Set working directory
2. Load required packages
3. Import multiple datasets
4. Examine data structures

In [123]:
# Task 1.1: Set Working Directory
# TODO: Set your working directory to where your data files are located
# IMPORTANT: Students must set their own path!
# Example: setwd("/Users/yourname/GitHub/ai-homework-grader-clean/data")

# Sets the working directory to the data folder in the repository.
setwd("/workspaces/Assignment-3-Data-Transformation-with-dplyr---Part-1/data/")


# Verify working directory
cat("Current working directory:", getwd(), "\n")

Current working directory: /workspaces/Assignment-3-Data-Transformation-with-dplyr---Part-1/data 


In [124]:
# Task 1.2: Load Required Packages
# TODO: Load tidyverse (includes dplyr, tidyr, stringr, ggplot2)

# Loads tidyverse for data manipulation and visualization
library(tidyverse)


# TODO: Load lubridate for date operations
# Loads lubridate for date operations
library(lubridate)


cat("✅ Packages loaded successfully!\n")

✅ Packages loaded successfully!


In [125]:
# Task 1.3: Import Datasets
# TODO: Import the following CSV files using read_csv():
#   - company_sales_data.csv -> sales_data
#   - customers.csv -> customers
#   - products.csv -> products
#   - orders.csv -> orders
#   - order_items.csv -> order_items

# Your code here:

# Assigns each CSV file to a variable.
sales_data <- read.csv("company_sales_data.csv")

customers <- read.csv("customers.csv")

products <- read.csv("products.csv")

orders <- read.csv("orders.csv")

order_items <- read.csv("order_items.csv")


# Display import summary
cat("✅ Data imported successfully!\n")
cat("Sales data:", nrow(sales_data), "rows\n")
cat("Customers:", nrow(customers), "rows\n")
cat("Products:", nrow(products), "rows\n")
cat("Orders:", nrow(orders), "rows\n")
cat("Order items:", nrow(order_items), "rows\n")

✅ Data imported successfully!
Sales data: 300 rows
Customers: 100 rows
Products: 50 rows
Orders: 250 rows
Order items: 400 rows


## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

**Skills Assessed:** Identifying NAs, handling missing data, detecting outliers

**Your Tasks:**
1. Check for missing values in sales_data
2. Handle missing values appropriately
3. Identify outliers in Revenue column
4. Create a cleaned dataset

In [126]:
# Task 2.1: Check for Missing Values
# TODO: Create 'missing_summary' that shows count of NAs in each column of sales_data

# Used sapply() to count NAs in each column of sales_data. The result shows no missing values in the dataset.
# I also verified this by checking the file in Excel.
missing_summary <- sapply(sales_data, function(x) sum(is.na(x)))


cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_summary)
cat("\nTotal missing values:", sum(missing_summary), "\n")



   TransactionID   Sales_Rep_Name           Region Product_Category 
               0                0                0                0 
         Revenue             Cost       Units_Sold        Sale_Date 
               0                0                0                0 

Total missing values: 0 
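
A base R alternative for the same per-column count is `colSums(is.na(...))`. The sketch below is only an illustration on a small made-up data frame (`toy` is hypothetical, not the exam's `sales_data`) and checks that both approaches agree.

```r
# Hypothetical toy data, not the exam's sales_data
toy <- data.frame(
  Revenue = c(100, NA, 300),
  Region  = c("Europe", "Asia Pacific", NA),
  Units   = c(1L, 2L, 3L)
)

# Per-column NA counts, computed two equivalent ways
na_by_sapply  <- sapply(toy, function(x) sum(is.na(x)))
na_by_colsums <- colSums(is.na(toy))

print(na_by_colsums)
```

Both results are named vectors with one entry per column, so either can be passed to `sum()` for the overall total.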


In [127]:
# Task 2.2: Handle Missing Values
# TODO: Create 'sales_clean' by removing rows with ANY missing values

# Removed rows with any missing values from sales_data by using na.omit().
sales_clean <- na.omit(sales_data)


cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", nrow(sales_data), "\n")
cat("Cleaned rows:", nrow(sales_clean), "\n")
cat("Rows removed:", nrow(sales_data) - nrow(sales_clean), "\n")

Original rows: 300 
Cleaned rows: 300 
Rows removed: 0 


In [128]:
# Task 2.3: Detect Outliers in Revenue
# TODO: Calculate outlier thresholds using IQR method
#   - Calculate Q1 (25th percentile) and Q3 (75th percentile) of Revenue
#   - Calculate IQR = Q3 - Q1
#   - Lower bound = Q1 - 1.5 * IQR
#   - Upper bound = Q3 + 1.5 * IQR
# TODO: Create 'outlier_analysis' dataframe with these values

# quantile() is used to calculate Q1 and Q3. na.rm = TRUE is not needed because the data set was already cleaned.
# $ is used to access the Revenue column in sales_clean dataframe.
Q1 <- quantile(sales_clean$Revenue, 0.25)
Q3 <- quantile(sales_clean$Revenue, 0.75)
# Quartile 3 minus Quartile 1 gives the IQR value.
IQR_value <- Q3 - Q1
# Lower and upper bounds are calculated using the IQR method.
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Count outliers
outlier_count <- sum(sales_clean$Revenue < lower_bound | sales_clean$Revenue > upper_bound)
cat("\nNumber of outliers detected:", outlier_count, "\n")

       Metric     Value
1          Q1  15034.29
2          Q3  37707.71
3         IQR  22673.42
4 Lower Bound -18975.84
5 Upper Bound  71717.84

Number of outliers detected: 0 
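
The IQR rule above can be wrapped in a small reusable function. This is a minimal sketch on made-up numbers; `flag_iqr_outliers` and `toy_revenue` are hypothetical names, not part of the exam template. Note that in the exam output the lower bound is negative, so if revenues are never negative only the upper bound could flag anything.

```r
# Hypothetical helper: TRUE for values outside the k * IQR fences
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

toy_revenue <- c(10, 12, 11, 13, 12, 500)  # made-up values with one extreme point
flags <- flag_iqr_outliers(toy_revenue)

print(toy_revenue[flags])
```

With these values the fences work out to 9 and 15, so only 500 is flagged.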


## Part 3: Data Transformation Part 1 (Lesson 3)

**Skills Assessed:** select(), filter(), arrange(), pipe operator

**Your Tasks:**
1. Select specific columns
2. Filter data by conditions
3. Sort data
4. Chain operations with pipe

In [129]:
# Task 3.1: Select Specific Columns
# TODO: Create 'sales_summary' with only these columns from sales_clean:
#   Region, Product_Category, Revenue, Units_Sold, Sale_Date


# Used select() to keep only the columns Region, Product_Category, Revenue, Units_Sold, and Sale_Date in sales_summary.
# The pipe operator passes the result on to the next function. It helps with the readability of the code.
sales_summary <- sales_clean %>%
  select(Region, Product_Category, Revenue, Units_Sold, Sale_Date)
  

cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", names(sales_summary), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)



Columns: Region Product_Category Revenue Units_Sold Sale_Date 
Rows: 300 


  Region        Product_Category Revenue  Units_Sold Sale_Date
  <chr>         <chr>            <dbl>    <int>      <chr>
1 Latin America Services         20750.92 78         2023-04-24
2 Europe        Hardware         32359.98 13         2023-06-09
3 Europe        Services         39268.4  34         2023-03-25
4 Europe        Hardware         28865.09 90         2023-04-11
5 Latin America Software         3932.36  63         2023-08-26


In [130]:
# Task 3.2: Filter High Revenue Sales
# TODO: Create 'high_revenue_sales' by filtering sales_clean for Revenue > 20000

#Used filter() to filter Revenue to greater than 20k.
high_revenue_sales <- sales_clean %>%
  filter(Revenue > 20000)
  

cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", sum(high_revenue_sales$Revenue), "\n")



Total high revenue transactions: 194 
Total revenue from these sales: $ 6671906 


In [131]:
# Task 3.3: Sort by Revenue
# TODO: Create 'top_sales' by arranging sales_clean by Revenue in descending order
#       and keeping only the top 10 rows

#Used arrange with desc on Revenue to order the top_sales in descending order.
#Then used head to show only the first 10 rows in the descending order.
top_sales <- sales_clean %>%
  arrange(desc(Revenue)) %>%
  head(10)
  

cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% select(Region, Product_Category, Revenue, Units_Sold))



          Region Product_Category  Revenue Units_Sold
1   Asia Pacific       Consulting 49956.01         88
2         Europe         Software 49866.51         96
3         Europe       Consulting 49856.54          1
4   Asia Pacific       Consulting 49238.97         72
5   Asia Pacific         Hardware 48997.22         92
6  North America         Services 48884.31         62
7         Europe         Software 48793.64         77
8  North America         Hardware 48771.67         16
9  North America       Consulting 48747.95         63
10        Europe       Consulting 48571.74         22


In [132]:
# Task 3.4: Chain Multiple Operations
# TODO: Create 'regional_top_sales' by:
#   1. Filtering for Revenue > 15000
#   2. Selecting: Region, Product_Category, Revenue
#   3. Arranging by Region (ascending) then Revenue (descending)
#   4. Keeping top 15 rows
# Use the pipe operator to chain all operations

#Used the pipe operator to chain operations, which keeps the code clean and readable: filter -> select -> arrange -> head.
regional_top_sales <- sales_clean %>%
  filter(Revenue > 15000) %>%
  select(Region, Product_Category, Revenue) %>%
  arrange(Region, desc(Revenue)) %>%
  head(15)
  

cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)

         Region Product_Category  Revenue
1  Asia Pacific       Consulting 49956.01
2  Asia Pacific       Consulting 49238.97
3  Asia Pacific         Hardware 48997.22
4  Asia Pacific       Consulting 48063.17
5  Asia Pacific         Services 46731.05
6  Asia Pacific         Software 46615.43
7  Asia Pacific       Consulting 46544.57
8  Asia Pacific       Consulting 45755.79
9  Asia Pacific         Services 45298.38
10 Asia Pacific         Services 44624.16
11 Asia Pacific       Consulting 44454.58
12 Asia Pacific       Consulting 44364.40
13 Asia Pacific         Software 43221.14
14 Asia Pacific         Software 42062.52
15 Asia Pacific         Software 41411.58
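
Because the rows are sorted by Region first, `head(15)` returns only Asia Pacific rows here. If the goal were instead the top rows within each region, one option is a grouped `slice_max()`. The sketch below assumes that interpretation and uses made-up data (`toy_sales` is hypothetical).

```r
library(dplyr)

# Hypothetical toy data, not the exam's sales_clean
toy_sales <- tibble(
  Region  = c("Europe", "Europe", "Europe", "Asia Pacific", "Asia Pacific"),
  Revenue = c(300, 100, 200, 50, 400)
)

# Keep the 2 highest-revenue rows inside each region
top2_per_region <- toy_sales %>%
  group_by(Region) %>%
  slice_max(Revenue, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(Region, desc(Revenue))

print(top2_per_region)
```

`with_ties = FALSE` guarantees at most `n` rows per group even when revenues are tied.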


## Part 4: Data Transformation Part 2 (Lesson 4)

**Skills Assessed:** mutate(), summarize(), group_by()

**Your Tasks:**
1. Create calculated columns with mutate()
2. Calculate summary statistics
3. Perform grouped analysis
4. Generate business metrics

In [133]:
# Task 4.1: Create Calculated Columns
# TODO: Add these new columns to sales_clean using mutate():
#   - revenue_per_unit: Revenue / Units_Sold
#   - high_value: "Yes" if Revenue > 20000, else "No"
# Store result in 'sales_enhanced'

#Created a new column for revenue_per_unit and rounded it to two decimal places to clean the numbers up.
#Also created a new column for high_value, using case_when() to flag "Yes" when Revenue is above 20k; otherwise it is "No".
sales_enhanced <- sales_clean %>%
  mutate(
    revenue_per_unit = round(Revenue / Units_Sold, 2),
    high_value = case_when(
      Revenue > 20000 ~ "Yes",
      TRUE ~ "No"
    )
    
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

New columns added: revenue_per_unit, high_value


  Revenue  Units_Sold revenue_per_unit high_value
  <dbl>    <int>      <dbl>            <chr>
1 20750.92 78         266.04           Yes
2 32359.98 13         2489.23          Yes
3 39268.4  34         1154.95          Yes
4 28865.09 90         320.72           Yes
5 3932.36  63         62.42            No


In [134]:
# Task 4.2: Calculate Overall Summary Statistics
# TODO: Create 'overall_summary' with these metrics from sales_enhanced:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - total_units: sum of Units_Sold
#   - transaction_count: count using n()

#Used summarize to create a summary of metrics from sales_enhanced.
#Metrics created - total_revenue = sum of Revenue, avg_revenue = mean of Revenue
#Continued, total_units = sum of Units_Sold, transaction_count = count of each transaction

overall_summary <- sales_enhanced %>%
    summarize(
        total_revenue = sum(Revenue),
        avg_revenue = mean(Revenue),
        total_units = sum(Units_Sold),
        transaction_count = n()
    )

cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)

  total_revenue avg_revenue total_units transaction_count
1       7771711     25905.7       16169               300


In [135]:
# Task 4.3: Regional Performance Analysis
# TODO: Create 'regional_summary' by grouping sales_enhanced by Region
#       and calculating:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - transaction_count: count using n()
# Then arrange by total_revenue descending
# Hint: Use group_by() %>% summarize() %>% arrange()

#Grouped by Region, then used the pipe operator to chain the operations together.
#Metrics created are total_revenue, avg_revenue, and transaction_count.
# .groups = "drop" removes the grouping from the result.
#Arranged the data in descending order of total_revenue.

regional_summary <- sales_enhanced %>%
  group_by(Region) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

cat("========== REGIONAL SUMMARY ==========\n")
print(regional_summary)

# A tibble: 4 × 4
  Region        total_revenue avg_revenue transaction_count
  <chr>                 <dbl>       <dbl>             <int>
1 Europe             2224182.      27124.                82
2 Latin America      2112037.      25446.                83
3 Asia Pacific       1804243.      26929.                67
4 North America      1631248.      23989.                68


In [136]:
# Task 4.4: Product Category Analysis
# TODO: Create 'category_summary' by grouping by Product_Category
#       and calculating the same metrics as regional_summary
#       Then arrange by total_revenue descending

#Grouped by Product_Category, then used the pipe operator to chain the operations together.
#Metrics created are total_revenue, avg_revenue, and transaction_count, the same as in regional_summary.
#Arranged the data in descending order of total_revenue, the same as in regional_summary.
category_summary <- sales_enhanced %>%
  group_by(Product_Category) %>%
  summarize(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)



# A tibble: 4 × 4
  Product_Category total_revenue avg_revenue transaction_count
  <chr>                    <dbl>       <dbl>             <int>
1 Consulting            1978840.      26037.                76
2 Services              1961565.      27244.                72
3 Hardware              1951325.      26730.                73
4 Software              1879981.      23797.                79


## Part 5: Data Reshaping with tidyr (Lesson 5)

**Skills Assessed:** pivot_longer(), pivot_wider(), tidy data principles

**Your Tasks:**
1. Reshape data from wide to long format
2. Reshape data from long to wide format
3. Create analysis-ready datasets

In [137]:
# Task 5.1: Create Wide Format Data
# First, create a summary by Region and Product_Category

region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))



# A tibble: 10 × 3
   Region        Product_Category total_revenue
   <chr>         <chr>                    <dbl>
 1 Asia Pacific  Consulting             759641.
 2 Asia Pacific  Hardware               271979.
 3 Asia Pacific  Services               331826.
 4 Asia Pacific  Software               440797.
 5 Europe        Consulting             390670.
 6 Europe        Hardware               777044.
 7 Europe        Services               513507.
 8 Europe        Software               542961.
 9 Latin America Consulting             433397.
10 Latin America Hardware               474257.

In [138]:
# Task 5.2: Reshape to Wide Format
# TODO: Create 'revenue_wide' by pivoting region_category_revenue
#       so that Product_Category values become column names
#       with total_revenue as the values

#Used pivot_wider to spread each product category into its own column, with values from total_revenue.
revenue_wide <- region_category_revenue %>%
  pivot_wider(
    names_from = Product_Category,
    values_from = total_revenue
  )
  

cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)

# A tibble: 4 × 5
  Region        Consulting Hardware Services Software
  <chr>              <dbl>    <dbl>    <dbl>    <dbl>
1 Asia Pacific     759641.  271979.  331826.  440797.
2 Europe           390670.  777044.  513507.  542961.
3 Latin America    433397.  474257.  644772.  559611.
4 North America    395132.  428046.  471460.  336611.


In [139]:
# Task 5.3: Reshape Back to Long Format
# TODO: Create 'revenue_long' by pivoting revenue_wide back to long format
#       Column names (except Region) should go into 'Product_Category'
#       Values should go into 'revenue'

#Used pivot_longer to put consulting, hardware, services, and software back under Product_Category column.
#The new column also got assigned its respective values under revenue.

revenue_long <- revenue_wide %>%
  pivot_longer(
    cols = c(Consulting, Hardware, Services, Software),
    names_to = "Product_Category",
    values_to = "revenue"
  )

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))

# A tibble: 10 × 3
   Region        Product_Category revenue
   <chr>         <chr>              <dbl>
 1 Asia Pacific  Consulting       759641.
 2 Asia Pacific  Hardware         271979.
 3 Asia Pacific  Services         331826.
 4 Asia Pacific  Software         440797.
 5 Europe        Consulting       390670.
 6 Europe        Hardware         777044.
 7 Europe        Services         513507.
 8 Europe        Software         542961.
 9 Latin America Consulting       433397.
10 Latin America Hardware         474257.


## Part 6: Combining Datasets with Joins (Lesson 6)

**Skills Assessed:** left_join(), inner_join(), data integration

**Your Tasks:**
1. Join customers with orders
2. Join orders with order_items
3. Create integrated dataset

In [140]:
# Task 6.1: Join Customers and Orders
# TODO: Create 'customer_orders' by left joining customers with orders
#       Join on CustomerID

#Used left_join() to left join customers with orders by CustomerID.
#A left join keeps every row of the left table and adds data from the right table only where it matches.

customer_orders <- left_join(customers, orders, by = "CustomerID")


cat("========== CUSTOMER ORDERS ==========\n")
cat("Total rows:", nrow(customer_orders), "\n")
cat("Columns:", ncol(customer_orders), "\n")



Total rows: 200 
Columns: 8 
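
When a join's row count is surprising, `anti_join()` can show which keys fail to match in each direction. This is a minimal sketch on made-up tables (`toy_customers` and `toy_orders` are hypothetical and only share the `CustomerID` column name with the exam data).

```r
library(dplyr)

# Hypothetical toy tables
toy_customers <- tibble(CustomerID = 1:3, Name = c("Ana", "Ben", "Cy"))
toy_orders    <- tibble(OrderID = 10:13, CustomerID = c(1L, 1L, 2L, 9L))

# Customers with no order: these get NA order fields after a left join
no_orders <- anti_join(toy_customers, toy_orders, by = "CustomerID")

# Orders whose CustomerID is absent from the customers table:
# a left join from customers drops these rows
orphan_orders <- anti_join(toy_orders, toy_customers, by = "CustomerID")

print(no_orders)
print(orphan_orders)
```

Running both checks before joining makes it clear whether rows are being dropped or padded with NA.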


In [141]:
# Task 6.2: Join Orders and Order Items
# TODO: Create 'orders_with_items' by inner joining orders with order_items
#       Join on OrderID

#Used inner_join to inner join orders and order_items by OrderID.
#Inner Join only uses matching data between the two data sets.

orders_with_items <- inner_join(orders, order_items, by = "OrderID")


cat("========== ORDERS WITH ITEMS ==========\n")
cat("Total rows:", nrow(orders_with_items), "\n")
head(orders_with_items, 5)

Total rows: 400 


  OrderID CustomerID Order_Date Total_Amount ProductID Quantity Unit_Price
  <int>   <int>      <chr>      <dbl>        <int>     <int>    <dbl>
1 1       87         2023-08-30 424.3        2         3        115.72
2 1       87         2023-08-30 424.3        22        5        206.62
3 1       87         2023-08-30 424.3        26        5        61.75
4 3       37         2024-03-19 549.07       19        1        474.92
5 6       101        2023-07-22 189.85       32        4        272.64


## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

**Skills Assessed:** stringr functions, lubridate functions

**Your Tasks:**
1. Clean text data
2. Parse dates
3. Extract date components

In [142]:
# Task 7.1: Clean Text Data
# TODO: Add these columns to sales_enhanced using mutate():
#   - region_clean: Region with trimmed whitespace and Title Case
#   - category_clean: Product_Category with trimmed whitespace and Title Case

#Used str_trim to clean up white space and used str_to_title to Title Case both region_clean and category_clean.
sales_enhanced <- sales_enhanced %>%
  mutate(
    region_clean = str_trim(Region),
    region_clean = str_to_title(region_clean),
    category_clean = str_trim(Product_Category),
    category_clean = str_to_title(category_clean)
  )

cat("========== CLEANED TEXT DATA ==========\n")
head(sales_enhanced %>% select(Region, region_clean, Product_Category, category_clean), 5)



  Region        region_clean  Product_Category category_clean
  <chr>         <chr>         <chr>            <chr>
1 Latin America Latin America Services         Services
2 Europe        Europe        Hardware         Hardware
3 Europe        Europe        Services         Services
4 Europe        Europe        Hardware         Hardware
5 Latin America Latin America Software         Software
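
The exam data is already clean, so the effect of the cleaning step is not visible in the output above. The sketch below shows it on deliberately messy, made-up strings (`messy` is hypothetical). It swaps in `str_squish()`, which also collapses repeated spaces inside the string, whereas `str_trim()` only removes leading and trailing whitespace.

```r
library(stringr)

# Hypothetical messy region names
messy <- c("  latin   america ", "EUROPE", "asia pacific")

# Trim the ends, collapse inner runs of spaces, then convert to Title Case
clean <- str_to_title(str_squish(messy))

print(clean)
```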


In [143]:
# Task 7.2: Parse Dates and Extract Components
# TODO: Add these date-related columns using mutate():
#   - date_parsed: Parse Sale_Date column (use ymd(), mdy(), or dmy() as appropriate)
#   - sale_month: Extract month name from date_parsed
#   - sale_weekday: Extract weekday name from date_parsed

#Used ymd() because Sale_Date is in year-month-day format. Then used month() with label = TRUE and abbr = FALSE to get the full month name from date_parsed.
#Then I used wday() with label = TRUE to give a weekday name instead of a number, and abbr = FALSE so the name is not abbreviated.
sales_enhanced <- sales_enhanced %>%
  mutate(
    date_parsed = ymd(Sale_Date),
    sale_month = month(date_parsed, label = TRUE, abbr = FALSE),
    sale_weekday = wday(date_parsed, label = TRUE, abbr = FALSE)
  )

cat("========== DATE COMPONENTS ==========\n")
head(sales_enhanced %>% select(Sale_Date, date_parsed, sale_month, sale_weekday), 5)



  Sale_Date  date_parsed sale_month sale_weekday
  <chr>      <date>      <ord>      <ord>
1 2023-04-24 2023-04-24  April      Monday
2 2023-06-09 2023-06-09  June       Friday
3 2023-03-25 2023-03-25  March      Saturday
4 2023-04-11 2023-04-11  April      Tuesday
5 2023-08-26 2023-08-26  August     Saturday
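
Which lubridate parser to use depends on the order of the date parts in the text. The sketch below is only an illustration with made-up strings: it parses the same day written three ways and confirms that the results match.

```r
library(lubridate)

# The same calendar day written in three different orders
d1 <- ymd("2023-04-24")   # year-month-day
d2 <- mdy("04/24/2023")   # month-day-year
d3 <- dmy("24-04-2023")   # day-month-year

same_day <- d1 == d2 && d2 == d3
print(same_day)

# Numeric components; week_start = 1 makes Monday day 1
print(month(d1))
print(wday(d1, week_start = 1))
```

Using the wrong parser on an ambiguous string such as "04/05/2023" does not raise an error; it silently yields a different date, so it is worth checking a few parsed values against the source.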


## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

**Skills Assessed:** case_when(), complex logic, KPIs

**Your Tasks:**
1. Create business categories with case_when()
2. Calculate KPIs
3. Generate executive summary

In [144]:
# Task 8.1: Create Performance Categories
# TODO: Add 'performance_tier' column using case_when():
#   - "High" if Revenue > 25000
#   - "Medium" if Revenue > 15000
#   - "Low" otherwise

#Used mutate with case_when to create a new column called performance_tier to sort between high, medium and low revenue.
sales_enhanced <- sales_enhanced %>%
  mutate(
    performance_tier = case_when(
      Revenue > 25000 ~ "High",
      Revenue > 15000 ~ "Medium",
      TRUE ~ "Low"
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
table(sales_enhanced$performance_tier)




  High    Low Medium 
   154     74     72 
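
`table()` lists character values alphabetically, which is why "Low" appears before "Medium" above. If a business ordering is preferred, one option is to convert the column to an ordered factor first. A minimal sketch on made-up values (`tiers` is hypothetical):

```r
# Hypothetical tier labels
tiers <- c("High", "Low", "Medium", "High", "Low")

# Fix the level order so summaries follow Low < Medium < High
tiers_f <- factor(tiers, levels = c("Low", "Medium", "High"), ordered = TRUE)

tier_counts <- table(tiers_f)
print(tier_counts)
```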

In [145]:
# Task 8.2: Calculate Business KPIs
# TODO: Create 'business_kpis' with these metrics:
#   - total_revenue: sum of Revenue
#   - total_transactions: count of rows
#   - avg_transaction_value: mean of Revenue
#   - high_value_pct: percentage where high_value = "Yes"

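# Used mean() on the logical test high_value == "Yes" to get the share of high-value rows,
# so the percentage is computed from the data instead of being hardcoded.
business_kpis <- sales_enhanced %>%
  summarize(
    total_revenue = sum(Revenue),
    total_transactions = n(),
    avg_transaction_value = mean(Revenue),
    high_value_pct = round(mean(high_value == "Yes") * 100, 2)
  )
cat("========== BUSINESS KPIs ==========\n")
print(business_kpis)

  total_revenue total_transactions avg_transaction_value high_value_pct
1       7771711                300               25905.7          64.67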


## Part 9: Reflection Questions

Answer the following questions based on your analysis.

### Question 9.1: Data Cleaning Impact

**How did handling missing values and outliers affect your analysis? Why is data cleaning important before performing business analysis?**

Your answer here: The data set had no missing values, and the IQR method flagged no outliers, so cleaning did not change the analysis: all 300 rows were kept. If outliers or missing values had been present, however, they would have affected the averages and percentiles.
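
To illustrate how much a single outlier can move an average, here is a minimal sketch on made-up numbers (both vectors are hypothetical, not exam data). It compares the mean and the median with and without one extreme value.

```r
# Hypothetical revenue values
clean_revenue <- c(20, 22, 25, 27, 30)
with_outlier  <- c(clean_revenue, 1000)  # add one extreme value

# The mean shifts dramatically; the median barely moves
mean_clean     <- mean(clean_revenue)
mean_outlier   <- mean(with_outlier)
median_clean   <- median(clean_revenue)
median_outlier <- median(with_outlier)

print(c(mean_clean = mean_clean, mean_outlier = mean_outlier))
print(c(median_clean = median_clean, median_outlier = median_outlier))
```

Here the mean rises from 24.8 to about 187.3, while the median only moves from 25 to 26.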



### Question 9.2: Grouped Analysis Value

**What insights did you gain from the regional and category summaries that you couldn't see in the raw data? How can businesses use this type of grouped analysis?**

Your answer here: The grouped summaries made it easy to see which regions sold the most. At first Asia Pacific looked strongest because it had the highest individual sale, but the regional summary showed that Europe and Latin America led in total revenue. The top revenue-producing categories were Consulting and Services. These answers would have been very difficult to reach from the raw data alone.



### Question 9.3: Data Reshaping Purpose

**Why would you need to reshape data between wide and long formats? Provide a business scenario where each format would be useful.**

Your answer here: Wide format helps when one column holds many categories: spreading them into separate columns makes the table easier for a person to read and compare, as in the region-by-category revenue table above. Long format is better suited to computation, because it is easier to run calculations on a table with one observation per row.



### Question 9.4: Joining Datasets

**What is the difference between left_join() and inner_join()? When would you use each one in a business context?**

Your answer here: In a left join, the complete left table is kept and only matching data from the right table is joined to it. This can leave some NA fields in the result. In an inner join, only rows that match in both data sets are kept, so the join itself introduces no NA fields. A business context for a left join: a company merging data from a company it bought wants to retain all of its current data while joining whatever it can from the new company. An inner join is useful when analyzing supplier data together with the data for products the company sells, since the company only wants to look at the matching products.
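
The difference can be seen on a small made-up example (`toy_customers` and `toy_orders` are hypothetical tables, not the exam data): the left join keeps the customer without orders and fills the order fields with NA, while the inner join drops that customer.

```r
library(dplyr)

# Hypothetical toy tables: customer 3 has no orders
toy_customers <- tibble(CustomerID = 1:3, Name = c("Ana", "Ben", "Cy"))
toy_orders    <- tibble(OrderID = c(10L, 11L, 12L), CustomerID = c(1L, 1L, 2L))

left_result  <- left_join(toy_customers, toy_orders, by = "CustomerID")
inner_result <- inner_join(toy_customers, toy_orders, by = "CustomerID")

print(left_result)   # 4 rows, one with NA in OrderID
print(inner_result)  # 3 rows, no NA
```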



### Question 9.5: Skills Integration

**Which R data wrangling skill (from Lessons 1-8) do you think is most valuable for business analytics? Why?**

Your answer here: I think all of them are important for running any kind of analysis, but being able to mutate, summarize, and group data is especially valuable. Mutating, for example, creates new columns from existing information, and grouping makes it possible to analyze a region's total sales or units sold. Summarizing a data set into KPIs makes it quick and easy to see its performance metrics. I think these are very powerful tools that help focus the answers a data set can give.



## Exam Complete!

### What You've Demonstrated

✅ **Lesson 1:** R basics and data import
✅ **Lesson 2:** Data cleaning (missing values & outliers)
✅ **Lesson 3:** Data transformation (select, filter, arrange)
✅ **Lesson 4:** Advanced transformation (mutate, summarize, group_by)
✅ **Lesson 5:** Data reshaping (pivot_longer, pivot_wider)
✅ **Lesson 6:** Combining datasets (joins)
✅ **Lesson 7:** String manipulation & date/time operations
✅ **Lesson 8:** Advanced wrangling & business intelligence

### Submission Checklist

Before submitting, ensure:
- [ ] All code cells run without errors
- [ ] All TODO sections completed
- [ ] All required dataframes created with correct names
- [ ] All 5 reflection questions answered
- [ ] Student name and ID filled in at top

**Good work! 🎉**