# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment

**Student Name:** Alejandro De Santiago Palomares Salinas

**Student ID:** aeb923

**Date:** 10/19/2025

**Time Limit:** 4 hours

---

## Exam Overview

This comprehensive midterm exam assesses your mastery of ALL R data wrangling skills covered in Lessons 1-8:

- **Lesson 1:** R Basics and Data Import
- **Lesson 2:** Data Cleaning (Missing Values & Outliers)
- **Lesson 3:** Data Transformation Part 1 (select, filter, arrange)
- **Lesson 4:** Data Transformation Part 2 (mutate, summarize, group_by)
- **Lesson 5:** Data Reshaping (pivot_longer, pivot_wider)
- **Lesson 6:** Combining Datasets (joins)
- **Lesson 7:** String Manipulation & Date/Time
- **Lesson 8:** Advanced Wrangling & Best Practices

## Business Scenario

You are a data analyst for a retail company. The executive team needs a comprehensive analysis of:
- Sales performance across products and regions
- Customer behavior and segmentation
- Data quality issues and recommendations
- Strategic insights for business growth

## Instructions

1. **Set your working directory** to where your data files are located
2. Complete ALL tasks in order
3. Write code in the TODO sections
4. Use the pipe operator (%>%) to chain operations
5. Add comments explaining your logic
6. Run all cells to verify your code works
7. Answer all reflection questions

## Grading

- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented code
- **Business Understanding (20%)**: Demonstrates understanding of context
- **Analysis & Insights (15%)**: Meaningful insights and recommendations
- **Reflection Questions (5%)**: Thoughtful answers

## Academic Integrity

This is an individual exam. You may use:
- Course notes and lesson materials
- R documentation and help files
- Your previous homework assignments

You may NOT:
- Collaborate with other students
- Use AI assistants or online forums
- Share code or solutions

---

**Good luck! 🎓**

## Part 1: R Basics and Data Import (Lesson 1)

**Skills Assessed:** Variables, data types, data import, working directory

**Your Tasks:**
1. Set working directory
2. Load required packages
3. Import multiple datasets
4. Examine data structures

In [None]:
# Task 1.1: Set Working Directory
# TODO: Set your working directory to where your data files are located
# IMPORTANT: Students must set their own path!
# Example: setwd("/Users/yourname/GitHub/ai-homework-grader-clean/data")
setwd("/workspaces/assignment-2-version3-Aledesan-utsa/data")
getwd()
# Verify working directory
cat("Current working directory:", getwd(), "\n")

#no comment needed, set up of working directory and data path

Current working directory: /workspaces/assignment-2-version3-Aledesan-utsa/data 


In [None]:
# Task 1.2: Load Required Packages
# TODO: Load tidyverse (includes dplyr, tidyr, stringr, ggplot2)
library(tidyverse)

# TODO: Load lubridate for date operations
library(lubridate)

cat("✅ Packages loaded successfully!\n")

#no comment needed, loading of required packages

✅ Packages loaded successfully!


In [None]:
# Task 1.3: Import Datasets
# TODO: Import the following CSV files using read_csv():
#   - company_sales_data.csv -> sales_data
#   - customers.csv -> customers
#   - products.csv -> products
#   - orders.csv -> orders
#   - order_items.csv -> order_items

# Your code here:
sales_data <- read_csv("company_sales_data.csv")

customers <- read_csv("customers.csv")

products <- read_csv("products.csv")

orders <- read_csv("orders.csv")

order_items <- read_csv("order_items.csv")


# Display import summary
cat("✅ Data imported successfully!\n")
cat("Sales data:", nrow(sales_data), "rows\n")
cat("Customers:", nrow(customers), "rows\n")
cat("Products:", nrow(products), "rows\n")
cat("Orders:", nrow(orders), "rows\n")
cat("Order items:", nrow(order_items), "rows\n")

#no comment needed, importing of datasets and naming them accordingly

Rows: 300 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Sales_Rep_Name, Region, Product_Category
dbl  (4): TransactionID, Revenue, Cost, Units_Sold
date (1): Sale_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 100 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Name, Email, City
dbl  …

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 50 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Product_Name, Category
dbl (2): ProductID, Supplier_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 250 Columns: 4
── Column specification ──…

✅ Data imported successfully!
Sales data: 300 rows
Customers: 100 rows
Products: 50 rows
Orders: 250 rows
Order items: 400 rows


## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

**Skills Assessed:** Identifying NAs, handling missing data, detecting outliers

**Your Tasks:**
1. Check for missing values in sales_data
2. Handle missing values appropriately
3. Identify outliers in Revenue column
4. Create a cleaned dataset

In [None]:
# Task 2.1: Check for Missing Values
# TODO: Create 'missing_summary' that shows count of NAs in each column of sales_data


missing_summary <- sapply(sales_data, function(x) sum(is.na(x)))

cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_summary)
cat("\nTotal missing values:", sum(missing_summary), "\n")

#I noticed that there are no NA values in any column of the dataset. I am not sure whether the data is clean by design or my copy of the dataset is faulty.

   TransactionID   Sales_Rep_Name           Region Product_Category 
               0                0                0                0 
         Revenue             Cost       Units_Sold        Sale_Date 
               0                0                0                0 

Total missing values: 0 


In [None]:
# Task 2.2: Handle Missing Values
# TODO: Create 'sales_clean' by removing rows with ANY missing values


# Keep only complete rows. Assign to sales_clean alone so that missing_summary from Task 2.1 is not overwritten.
sales_clean <- sales_data %>%
  drop_na()

cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", nrow(sales_data), "\n")
cat("Cleaned rows:", nrow(sales_clean), "\n")
cat("Rows removed:", nrow(sales_data) - nrow(sales_clean), "\n")

#drop_na() was still applied even though there were no NAs, so no rows were removed. Again, I am not sure whether the data is clean by design or my copy of the dataset is faulty.

Original rows: 300 
Cleaned rows: 300 
Rows removed: 0 


In [None]:
# Task 2.3: Detect Outliers in Revenue
# TODO: Calculate outlier thresholds using IQR method
#   - Calculate Q1 (25th percentile) and Q3 (75th percentile) of Revenue
#   - Calculate IQR = Q3 - Q1
#   - Lower bound = Q1 - 1.5 * IQR
#   - Upper bound = Q3 + 1.5 * IQR
# TODO: Create 'outlier_analysis' dataframe with these values

Q1 <- quantile(sales_clean$Revenue, 0.25, na.rm = TRUE)
Q3 <- quantile(sales_clean$Revenue, 0.75, na.rm = TRUE)
IQR_value <-  Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Count outliers
outlier_count <- sum(sales_clean$Revenue < lower_bound | sales_clean$Revenue > upper_bound)
cat("\nNumber of outliers detected:", outlier_count, "\n")

#I defined the key metrics for the IQR method and counted the Revenue values outside the bounds. Zero outliers is plausible for this data: the upper bound (71717.84) is above the largest Revenue in the Task 3.3 output (about 49956), and the lower bound is negative.

       Metric     Value
1          Q1  15034.29
2          Q3  37707.71
3         IQR  22673.42
4 Lower Bound -18975.84
5 Upper Bound  71717.84

Number of outliers detected: 0 
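
The IQR rule from Task 2.3 can be sanity-checked on a small vector where the answer is known in advance. The sketch below is only an illustration: the `revenue` vector and its values are made up (they are not exam data), and one extreme value is planted on purpose so the fences have something to catch.

```r
# Toy revenue vector (made-up values) with one planted extreme value
revenue <- c(120, 135, 150, 160, 175, 180, 195, 5000)

# Quartiles and interquartile range
q1 <- quantile(revenue, 0.25, names = FALSE)
q3 <- quantile(revenue, 0.75, names = FALSE)
iqr_value <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

# Logical flag: TRUE where a value falls outside the fences
is_outlier <- revenue < lower_bound | revenue > upper_bound
outliers <- revenue[is_outlier]
print(outliers)
```

If a check like this flags the planted value, the same logic returning zero on real data suggests the data simply has no values beyond the fences.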


## Part 3: Data Transformation Part 1 (Lesson 3)

**Skills Assessed:** select(), filter(), arrange(), pipe operator

**Your Tasks:**
1. Select specific columns
2. Filter data by conditions
3. Sort data
4. Chain operations with pipe

In [None]:
# Task 3.1: Select Specific Columns
# TODO: Create 'sales_summary' with only these columns from sales_clean:
#   Region, Product_Category, Revenue, Units_Sold, Sale_Date

sales_summary <- sales_clean %>%
  select(Region, Product_Category, Revenue, Units_Sold, Sale_Date)
  
cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", names(sales_summary), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)

#used the basic select function to select specific columns from the cleaned sales data created in the task before

Columns: Region Product_Category Revenue Units_Sold Sale_Date 
Rows: 300 


Region,Product_Category,Revenue,Units_Sold,Sale_Date
<chr>,<chr>,<dbl>,<dbl>,<date>
Latin America,Services,20750.92,78,2023-04-24
Europe,Hardware,32359.98,13,2023-06-09
Europe,Services,39268.4,34,2023-03-25
Europe,Hardware,28865.09,90,2023-04-11
Latin America,Software,3932.36,63,2023-08-26


In [None]:
# Task 3.2: Filter High Revenue Sales
# TODO: Create 'high_revenue_sales' by filtering sales_clean for Revenue > 20000


high_revenue_sales <- sales_clean %>%
  filter(Revenue > 20000)
  
cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", sum(high_revenue_sales$Revenue), "\n")

#filtered the cleaned sales data to only include transactions where revenue was higher than 20000

Total high revenue transactions: 194 
Total revenue from these sales: $ 6671906 


In [None]:
# Task 3.3: Sort by Revenue
# TODO: Create 'top_sales' by arranging sales_clean by Revenue in descending order
#       and keeping only the top 10 rows


top_sales <- sales_clean %>%
  arrange(desc(Revenue)) %>%
  slice_head(n = 10)
  
cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% select(Region, Product_Category, Revenue, Units_Sold))

#sorted the cleaned sales data by revenue in descending order and kept only the top 10 rows to display top sales

# A tibble: 10 × 4
   Region        Product_Category Revenue Units_Sold
   <chr>         <chr>              <dbl>      <dbl>
 1 Asia Pacific  Consulting        49956.         88
 2 Europe        Software          49867.         96
 3 Europe        Consulting        49857.          1
 4 Asia Pacific  Consulting        49239.         72
 5 Asia Pacific  Hardware          48997.         92
 6 North America Services          48884.         62
 7 Europe        Software          48794.         77
 8 North America Hardware          48772.         16
 9 North America Consulting        48748.         63
10 Europe        Consulting        4…

In [None]:
# Task 3.4: Chain Multiple Operations
# TODO: Create 'regional_top_sales' by:
#   1. Filtering for Revenue > 15000
#   2. Selecting: Region, Product_Category, Revenue
#   3. Arranging by Region (ascending) then Revenue (descending)
#   4. Keeping top 15 rows
# Use the pipe operator to chain all operations

regional_top_sales <- sales_clean %>%
  filter(Revenue > 15000) %>%
  select(Region, Product_Category, Revenue) %>%
  arrange(Region, desc(Revenue)) %>%
  slice_head(n = 15)
  
cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)

#chained multiple dplyr operations using the pipe operator to filter, select, arrange, and slice the sales data accordingly for tidy display and easier analysis

# A tibble: 15 × 3
   Region       Product_Category Revenue
   <chr>        <chr>              <dbl>
 1 Asia Pacific Consulting        49956.
 2 Asia Pacific Consulting        49239.
 3 Asia Pacific Hardware          48997.
 4 Asia Pacific Consulting        48063.
 5 Asia Pacific Services          46731.
 6 Asia Pacific Software          46615.
 7 Asia Pacific Consulting        46545.
 8 Asia Pacific Consulting        45756.
 9 Asia Pacific Services          45298.
10 Asia Pacific Services          44624.
11 Asia Pacific Consulting        44455.
12 Asia Pacific Consulting        44364.
…

## Part 4: Data Transformation Part 2 (Lesson 4)

**Skills Assessed:** mutate(), summarize(), group_by()

**Your Tasks:**
1. Create calculated columns with mutate()
2. Calculate summary statistics
3. Perform grouped analysis
4. Generate business metrics

In [None]:
# Task 4.1: Create Calculated Columns
# TODO: Add these new columns to sales_clean using mutate():
#   - revenue_per_unit: Revenue / Units_Sold
#   - high_value: "Yes" if Revenue > 20000, else "No"
# Store result in 'sales_enhanced'

sales_enhanced <- sales_clean %>%
  mutate(
    revenue_per_unit = Revenue / Units_Sold,
    high_value = ifelse(Revenue > 20000, "Yes", "No")
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

#used mutate to create new calculated columns in the cleaned sales data for better analysis

New columns added: revenue_per_unit, high_value


Revenue,Units_Sold,revenue_per_unit,high_value
<dbl>,<dbl>,<dbl>,<chr>
20750.92,78,266.03744,Yes
32359.98,13,2489.22923,Yes
39268.4,34,1154.95294,Yes
28865.09,90,320.72322,Yes
3932.36,63,62.41841,No


In [None]:
# Task 4.2: Calculate Overall Summary Statistics
# TODO: Create 'overall_summary' with these metrics from sales_enhanced:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - total_units: sum of Units_Sold
#   - transaction_count: count using n()


overall_summary <- sales_enhanced %>%
    summarise(
        total_revenue = sum(Revenue, na.rm = TRUE),
        avg_revenue = mean(Revenue, na.rm = TRUE),
        total_units = sum(Units_Sold, na.rm = TRUE),
        transaction_count = n()
    )

cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)
#used the summarise function to calculate key overall metrics from the cleaned sales data

# A tibble: 1 × 4
  total_revenue avg_revenue total_units transaction_count
          <dbl>       <dbl>       <dbl>             <int>
1      7771711.      25906.       16169               300


In [None]:
# Task 4.3: Regional Performance Analysis
# TODO: Create 'regional_summary' by grouping sales_enhanced by Region
#       and calculating:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - transaction_count: count using n()
# Then arrange by total_revenue descending
# Hint: Use group_by() %>% summarize() %>% arrange()

regional_summary <- sales_enhanced %>%
  group_by(Region) %>%
  summarise(
    total_revenue = sum(Revenue, na.rm = TRUE),
    avg_revenue = mean(Revenue, na.rm = TRUE),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== REGIONAL SUMMARY ==========\n")
print(regional_summary)
#grouped the cleaned sales data by region and calculated key metrics, then arranged by total revenue in descending order for better analysis

# A tibble: 4 × 4
  Region        total_revenue avg_revenue transaction_count
  <chr>                 <dbl>       <dbl>             <int>
1 Europe             2224182.      27124.                82
2 Latin America      2112037.      25446.                83
3 Asia Pacific       1804243.      26929.                67
4 North America      1631248.      23989.                68


In [None]:
# Task 4.4: Product Category Analysis
# TODO: Create 'category_summary' by grouping by Product_Category
#       and calculating the same metrics as regional_summary
#       Then arrange by total_revenue descending

category_summary <- sales_enhanced %>%
  group_by(Product_Category) %>%
  summarise(
    total_revenue = sum(Revenue, na.rm = TRUE),
    avg_revenue = mean(Revenue, na.rm = TRUE),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)
#grouped the cleaned sales data by product category and calculated key metrics, then arranged by total revenue in descending order for better analysis

# A tibble: 4 × 4
  Product_Category total_revenue avg_revenue transaction_count
  <chr>                    <dbl>       <dbl>             <int>
1 Consulting            1978840.      26037.                76
2 Services              1961565.      27244.                72
3 Hardware              1951325.      26730.                73
4 Software              1879981.      23797.                79


## Part 5: Data Reshaping with tidyr (Lesson 5)

**Skills Assessed:** pivot_longer(), pivot_wider(), tidy data principles

**Your Tasks:**
1. Reshape data from wide to long format
2. Reshape data from long to wide format
3. Create analysis-ready datasets

In [None]:
# Task 5.1: Create Wide Format Data
# First, create a summary by Region and Product_Category
region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))
#used the group_by and summarize functions to create a summary of total revenue by region and product category in long format

# A tibble: 10 × 3
   Region        Product_Category total_revenue
   <chr>         <chr>                    <dbl>
 1 Asia Pacific  Consulting             759641.
 2 Asia Pacific  Hardware               271979.
 3 Asia Pacific  Services               331826.
 4 Asia Pacific  Software               440797.
 5 Europe        Consulting             390670.
 6 Europe        Hardware               777044.
 7 Europe        Services               513507.
 8 Europe        Software               542961.
 9 Latin America Consulting             433397.
10 Latin America Hardware               …

In [None]:
# Task 5.2: Reshape to Wide Format
# TODO: Create 'revenue_wide' by pivoting region_category_revenue
#       so that Product_Category values become column names
#       with total_revenue as the values
revenue_wide <- region_category_revenue %>%
  pivot_wider(names_from = Product_Category, values_from = total_revenue, values_fill = 0)
  
cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)
#used pivot_wider to reshape the long format data into wide format for better readability and analysis of revenue by region and product category

# A tibble: 4 × 5
  Region        Consulting Hardware Services Software
  <chr>              <dbl>    <dbl>    <dbl>    <dbl>
1 Asia Pacific     759641.  271979.  331826.  440797.
2 Europe           390670.  777044.  513507.  542961.
3 Latin America    433397.  474257.  644772.  559611.
4 North America    395132.  428046.  471460.  336611.


In [None]:
# Task 5.3: Reshape Back to Long Format
# TODO: Create 'revenue_long' by pivoting revenue_wide back to long format
#       Column names (except Region) should go into 'Product_Category'
#       Values should go into 'revenue'

# Values go into 'revenue', as the task specifies
revenue_long <- revenue_wide %>%
  pivot_longer(cols = -Region, names_to = "Product_Category", values_to = "revenue")

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))
#used pivot_longer to reshape the wide format data back into long format for further analysis

# A tibble: 10 × 3
   Region        Product_Category revenue
   <chr>         <chr>              <dbl>
 1 Asia Pacific  Consulting       759641.
 2 Asia Pacific  Hardware         271979.
 3 Asia Pacific  Services         331826.
 4 Asia Pacific  Software         440797.
 5 Europe        Consulting       390670.
 6 Europe        Hardware         777044.
 7 Europe        Services         513507.
 8 Europe        Software         542961.
 9 Latin America Consulting       433397.
10 Latin America Hardware         …

## Part 6: Combining Datasets with Joins (Lesson 6)

**Skills Assessed:** left_join(), inner_join(), data integration

**Your Tasks:**
1. Join customers with orders
2. Join orders with order_items
3. Create integrated dataset

In [None]:
# Task 6.1: Join Customers and Orders
# TODO: Create 'customer_orders' by left joining customers with orders
#       Join on CustomerID

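# Pipe customers into left_join() so the joined table (not order_items) is what gets assigned
customer_orders <- customers %>%
  left_join(orders, by = "CustomerID")

cat("========== CUSTOMER ORDERS ==========\n")
cat("Total rows:", nrow(customer_orders), "\n")
cat("Columns:", ncol(customer_orders), "\n")
#used left_join to keep every customer and attach their orders by CustomerID; customers without orders get NA in the order columns
#the row and column counts recorded below (400 and 4) are those of order_items, which customer_orders held when this output was captured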

CustomerID,Name,Email,City,Registration_Date,OrderID,Order_Date,Total_Amount
<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<date>,<dbl>
1,Customer 1,customer1@email.com,Phoenix,2020-10-03,87,2023-03-28,716.18
1,Customer 1,customer1@email.com,Phoenix,2020-10-03,214,2023-09-12,1343.63
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02,173,2024-02-25,159.98
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02,190,2023-04-19,1503.04
3,Customer 3,customer3@email.com,Chicago,2021-04-20,29,2023-03-07,441.06
3,Customer 3,customer3@email.com,Chicago,2021-04-20,146,2023-04-10,85.28
4,Customer 4,customer4@email.com,Houston,2022-06-16,7,2023-05-01,482.98
4,Customer 4,customer4@email.com,Houston,2022-06-16,61,2024-01-19,666.43
5,Customer 5,customer5@email.com,Phoenix,2023-08-22,15,2023-12-27,560.36
5,Customer 5,customer5@email.com,Phoenix,2023-08-22,150,2023-03-28,278.89


Total rows: 400 
Columns: 4 


In [None]:
# Task 6.2: Join Orders and Order Items
# TODO: Create 'orders_with_items' by inner joining orders with order_items
#       Join on OrderID

orders_with_items <- order_items %>%
  inner_join(orders, by = "OrderID")

cat("========== ORDERS WITH ITEMS ==========\n")
cat("Total rows:", nrow(orders_with_items), "\n")
head(orders_with_items, 5)
#used inner_join to combine order and order item data based on OrderID

Total rows: 400 


OrderID,ProductID,Quantity,Unit_Price,CustomerID,Order_Date,Total_Amount
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<date>,<dbl>
213,8,1,43.81,75,2023-11-09,1941.93
176,18,5,489.16,53,2023-09-27,1811.81
118,2,5,442.09,41,2024-02-21,267.1
58,19,3,321.92,84,2023-05-07,880.59
202,2,4,280.43,58,2023-03-30,1557.9


## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

**Skills Assessed:** stringr functions, lubridate functions

**Your Tasks:**
1. Clean text data
2. Parse dates
3. Extract date components

In [None]:
# Task 7.1: Clean Text Data
# TODO: Add these columns to sales_enhanced using mutate():
#   - region_clean: Region with trimmed whitespace and Title Case
#   - category_clean: Product_Category with trimmed whitespace and Title Case


sales_enhanced <- sales_enhanced %>%
  mutate(
    region_clean = str_to_title(str_trim(Region)),
    category_clean = str_to_title(str_trim(Product_Category))
  )

cat("========== CLEANED TEXT DATA ==========\n")
head(sales_enhanced %>% select(Region, region_clean, Product_Category, category_clean), 5)
#used stringr functions to clean text data in the sales dataset by trimming whitespace and converting to title case



Region,region_clean,Product_Category,category_clean
<chr>,<chr>,<chr>,<chr>
Latin America,Latin America,Services,Services
Europe,Europe,Hardware,Hardware
Europe,Europe,Services,Services
Europe,Europe,Hardware,Hardware
Latin America,Latin America,Software,Software


In [None]:
# Task 7.2: Parse Dates and Extract Components
# TODO: Add these date-related columns using mutate():
#   - date_parsed: Parse Sale_Date column (use ymd(), mdy(), or dmy() as appropriate)
#   - sale_month: Extract month name from date_parsed
#   - sale_weekday: Extract weekday name from date_parsed


sales_enhanced <- sales_enhanced %>%
  mutate(
    date_parsed = ymd(Sale_Date),
    sale_month = month(date_parsed, label = TRUE, abbr = FALSE),
    sale_weekday = wday(date_parsed, label = TRUE, abbr = FALSE)
  )

cat("========== DATE COMPONENTS ==========\n")
head(sales_enhanced %>% select(Sale_Date, date_parsed, sale_month, sale_weekday), 5)

#used lubridate functions to parse Sale_Date and extract month and weekday names for better analysis; read_csv had already read Sale_Date as a date, so ymd() leaves the values unchanged



Sale_Date,date_parsed,sale_month,sale_weekday
<date>,<date>,<ord>,<ord>
2023-04-24,2023-04-24,April,Monday
2023-06-09,2023-06-09,June,Friday
2023-03-25,2023-03-25,March,Saturday
2023-04-11,2023-04-11,April,Tuesday
2023-08-26,2023-08-26,August,Saturday
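
The choice between `ymd()`, `mdy()`, and `dmy()` only matters when dates arrive as text. As a minimal sketch, assuming three hypothetical strings that describe the same day in different layouts (the variable names and values are illustrative, not exam data), each parser reads its own layout and all three return the same Date:

```r
library(lubridate)

# Hypothetical date strings for the same day in three layouts
iso_text <- "2023-04-24"   # year-month-day
us_text  <- "04/24/2023"   # month/day/year
eu_text  <- "24.04.2023"   # day.month.year

# Each parser matches the order of the components in its input
parsed <- c(ymd(iso_text), mdy(us_text), dmy(eu_text))
print(parsed)

# Month and weekday labels (the label text depends on the session locale)
month(parsed[1], label = TRUE, abbr = FALSE)
wday(parsed[1], label = TRUE, abbr = FALSE)
```

Picking the wrong parser for a layout typically yields `NA` with a warning, or a silently wrong date when day and month are both 12 or less, so it is worth checking a few parsed values against the raw text.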


## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

**Skills Assessed:** case_when(), complex logic, KPIs

**Your Tasks:**
1. Create business categories with case_when()
2. Calculate KPIs
3. Generate executive summary

In [None]:
# Task 8.1: Create Performance Categories
# TODO: Add 'performance_tier' column using case_when():
#   - "High" if Revenue > 25000
#   - "Medium" if Revenue > 15000
#   - "Low" otherwise

sales_enhanced <- sales_enhanced %>%
  mutate(
    performance_tier = case_when(
      Revenue > 25000 ~ "High",
      Revenue > 15000 ~ "Medium",
      TRUE ~ "Low" 
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
table(sales_enhanced$performance_tier)
#used case_when to categorize sales performance into tiers based on revenue thresholds to get a better understanding of sales distribution and possibly get average ticket size




  High    Low Medium 
   154     74     72 

In [None]:
# Task 8.2: Calculate Business KPIs
# TODO: Create 'business_kpis' with these metrics:
#   - total_revenue: sum of Revenue
#   - total_transactions: count of rows
#   - avg_transaction_value: mean of Revenue
#   - high_value_pct: percentage where high_value = "Yes"

business_kpis <- sales_enhanced %>%
  summarize(
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_transactions = n(),
    avg_transaction_value = mean(Revenue, na.rm = TRUE),
    high_value_pct = mean(high_value == "Yes") * 100
  )

cat("========== BUSINESS KPIs ==========\n")
print(business_kpis)
#calculated key business performance indicators from the cleaned sales data for overall business insights

# A tibble: 1 × 4
  total_revenue total_transactions avg_transaction_value high_value_pct
          <dbl>              <int>                 <dbl>          <dbl>
1      7771711.                300                25906.           64.7


## Part 9: Reflection Questions

Answer the following questions based on your analysis.

### Question 9.1: Data Cleaning Impact

**How did handling missing values and outliers affect your analysis? Why is data cleaning important before performing business analysis?**

Your answer here: The checks found no missing values and no IQR outliers in sales_data, so cleaning removed no rows and did not change any results. I am not sure whether the dataset is clean by design or my copy is faulty. If there had been missing values, I would have removed the rows with NAs in critical columns such as Revenue. Data cleaning is crucial because it ensures the accuracy and reliability of the analysis, which leads to better business decisions and better-planned business strategies.
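
Removing rows only where a critical column is missing keeps more data than removing every incomplete row. The sketch below illustrates the difference with `tidyr::drop_na()`; the `sales_toy` table and its planted NA values are assumptions made up for this example.

```r
library(tidyr)

# Hypothetical mini sales table; NA values are planted for illustration
sales_toy <- data.frame(
  TransactionID = 1:5,
  Region = c("Europe", NA, "Asia Pacific", "Europe", "Latin America"),
  Revenue = c(1200, 800, NA, 450, 975)
)

# Count NAs per column
na_counts <- colSums(is.na(sales_toy))
print(na_counts)

# Drop rows only where the critical column (Revenue) is missing
sales_revenue_ok <- drop_na(sales_toy, Revenue)

# Drop rows with a missing value in any column
sales_complete <- drop_na(sales_toy)

cat("Rows with Revenue present:", nrow(sales_revenue_ok), "\n")
cat("Fully complete rows:", nrow(sales_complete), "\n")
```

The targeted version keeps the row with a missing Region, which still contributes to revenue totals even though it cannot be assigned to a region.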

### Question 9.2: Grouped Analysis Value

**What insights did you gain from the regional and category summaries that you couldn't see in the raw data? How can businesses use this type of grouped analysis?**

Your answer here: Grouped analysis allowed me to see which regions and product categories were performing better than others. For example, Europe had the highest total revenue (about 2.22 million) and North America the lowest (about 1.63 million), while Software had the most transactions (79) but the lowest total and average revenue of the four categories. Businesses can use this type of analysis to allocate resources more effectively, target marketing efforts, and optimize inventory based on regional preferences and category performance. We can also gather insights about which products to promote in which regions based on past performance, and which products to retire or remarket based on low performance.
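
One way to turn a grouped summary into a concrete recommendation is to pick the best-selling category within each region. This is a minimal sketch under stated assumptions: `sales_toy` and its values are invented for illustration, and `slice_max()` is used as one possible way to take the top row per group.

```r
library(dplyr)

# Hypothetical transactions (made-up values)
sales_toy <- data.frame(
  Region = c("Europe", "Europe", "Europe",
             "Asia Pacific", "Asia Pacific", "Asia Pacific"),
  Product_Category = c("Hardware", "Software", "Hardware",
                       "Consulting", "Software", "Consulting"),
  Revenue = c(100, 250, 200, 400, 150, 100)
)

top_category_by_region <- sales_toy %>%
  # Total revenue for each region-category pair
  group_by(Region, Product_Category) %>%
  summarise(total_revenue = sum(Revenue), .groups = "drop") %>%
  # Within each region, keep the category with the highest total
  group_by(Region) %>%
  slice_max(total_revenue, n = 1) %>%
  ungroup()

print(top_category_by_region)
```

The result has one row per region, which is a convenient shape for a "what to promote where" table.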

### Question 9.3: Data Reshaping Purpose

**Why would you need to reshape data between wide and long formats? Provide a business scenario where each format would be useful.**

Your answer here: Reshaping data is essential because different types of analysis and visualization expect different layouts. Wide format is useful when you want to compare multiple variables side by side, such as sales figures for different products across several months, with one column per product. This format makes it easier to gather insights and to create charts and tables. Long format is beneficial for time series analysis or when using statistical models that require tidy data, with one row per observation. For example, if we want to analyze sales trends over time for each product, having the data in long format allows us to group by product and date, which facilitates trend analysis and forecasting.
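
The two layouts hold the same information, which a round trip makes visible. The sketch below assumes a small made-up table (`revenue_long_toy` and its numbers are illustrative only) and pivots it to wide and back with `tidyr`.

```r
library(tidyr)

# Hypothetical long-format summary: one row per region-category pair
revenue_long_toy <- data.frame(
  Region = c("Europe", "Europe", "Asia Pacific", "Asia Pacific"),
  Product_Category = c("Hardware", "Software", "Hardware", "Software"),
  total_revenue = c(100, 250, 300, 150)
)

# Long -> wide: one row per region, one column per category
revenue_wide_toy <- pivot_wider(
  revenue_long_toy,
  names_from = Product_Category,
  values_from = total_revenue
)
print(revenue_wide_toy)

# Wide -> long: category columns collapse back into name/value pairs
revenue_back_to_long <- pivot_longer(
  revenue_wide_toy,
  cols = -Region,
  names_to = "Product_Category",
  values_to = "total_revenue"
)
print(revenue_back_to_long)
```

The wide table suits a report where regions are compared across categories at a glance; the long table is the shape that `group_by()` and most plotting code expect.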

### Question 9.4: Joining Datasets

**What is the difference between left_join() and inner_join()? When would you use each one in a business context?**

Your answer here: left_join() returns all rows from the left dataset and the matched rows from the right dataset. If there is no match, it fills the right-hand columns with NAs. This is useful when you want to retain all records from the primary dataset, such as keeping all customers even if they haven't made any orders. inner_join() returns only the rows that have a match in both datasets. This is useful when you only want to analyze records that have matching entries in both datasets, for example, analyzing only customers who have made purchases.
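
The difference shows up in the row counts once one customer has no orders. This is a minimal sketch with invented tables (`customers_toy`, `orders_toy`, and their values are assumptions for illustration); `anti_join()` is added as one way to list the customers that `inner_join()` drops.

```r
library(dplyr)

# Hypothetical tables: customer 2 has no orders, customer 1 has two
customers_toy <- data.frame(
  CustomerID = c(1, 2, 3),
  Name = c("Ana", "Ben", "Caro")
)
orders_toy <- data.frame(
  OrderID = c(10, 11, 12),
  CustomerID = c(1, 1, 3),
  Total_Amount = c(50, 20, 75)
)

# Keeps every customer; Ben gets NA in the order columns
all_customers <- left_join(customers_toy, orders_toy, by = "CustomerID")

# Keeps only customers with at least one order
buyers_only <- inner_join(customers_toy, orders_toy, by = "CustomerID")

# Customers with no matching order
no_orders <- anti_join(customers_toy, orders_toy, by = "CustomerID")

cat("left_join rows:", nrow(all_customers), "\n")
cat("inner_join rows:", nrow(buyers_only), "\n")
print(no_orders)
```

Comparing `nrow()` before and after a join, as here, is a quick check that the join kept or dropped the rows you intended.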

### Question 9.5: Skills Integration

**Which R data wrangling skill (from Lessons 1-8) do you think is most valuable for business analytics? Why?**

Your answer here: I believe that all the functions learned so far are very useful, depending on the needs of the business analysis. However, if I had to choose, I would say that mutate() and summarize() are the most valuable. These functions allow analysts to create new insights by transforming existing data and calculating key metrics that drive business decisions. For example, creating performance tiers or calculating average transaction values can directly inform marketing strategies and operational improvements without complex coding or manual calculations.

## Exam Complete!

### What You've Demonstrated

✅ **Lesson 1:** R basics and data import
✅ **Lesson 2:** Data cleaning (missing values & outliers)
✅ **Lesson 3:** Data transformation (select, filter, arrange)
✅ **Lesson 4:** Advanced transformation (mutate, summarize, group_by)
✅ **Lesson 5:** Data reshaping (pivot_longer, pivot_wider)
✅ **Lesson 6:** Combining datasets (joins)
✅ **Lesson 7:** String manipulation & date/time operations
✅ **Lesson 8:** Advanced wrangling & business intelligence

### Submission Checklist

Before submitting, ensure:
- [ ] All code cells run without errors
- [ ] All TODO sections completed
- [ ] All required dataframes created with correct names
- [ ] All 5 reflection questions answered
- [ ] Student name and ID filled in at top

**Good work! 🎉**