# Homework Assignment - Lesson 6: Combining Datasets - Joins

**Due Date:** [October 12, 2025]

**Instructions:**
- Complete the following tasks in this R notebook
- Use the exact variable names specified for each task
- Ensure your code runs without errors and produces the expected outputs
- Submit your completed notebook with all code and outputs

---

## Part 1: Data Import and Setup

**Task:** Import the following CSV files and examine their structure.

**Required Variable Names:**
You must use these EXACT variable names:
- `customers` for customers.csv
- `orders` for orders.csv  
- `order_items` for order_items.csv
- `products` for products.csv
- `suppliers` for suppliers.csv

**Expected Outputs:**
- All datasets loaded successfully
- Print dataset dimensions using `nrow()` and `ncol()`
- Show first 3 rows of each dataset using `head(dataset, 3)`
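
The import step can be kept portable by building every path from one base directory. The sketch below is only illustrative: `data_dir`, `customers_demo`, and the tiny stand-in CSV are assumptions made so the example runs on its own, and they are not part of the assignment data.

```r
library(readr)

# Assumption: `data_dir` stands in for wherever the CSV files actually live.
# A temporary folder with a tiny stand-in file keeps the sketch self-contained.
data_dir <- file.path(tempdir(), "sample_datasets")
dir.create(data_dir, showWarnings = FALSE)
write_csv(
  data.frame(CustomerID = 1:3, City = c("Phoenix", "Chicago", "Houston")),
  file.path(data_dir, "customers.csv")
)

# Build the path from the base directory and read the file
customers_demo <- read_csv(file.path(data_dir, "customers.csv"),
                           show_col_types = FALSE)

# Report dimensions and preview the first rows
cat("Customers:", nrow(customers_demo), "rows x", ncol(customers_demo), "columns\n")
head(customers_demo, 3)
```

With this pattern, only `data_dir` needs to change when the notebook moves to another machine.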

In [1]:
# Your code here:
# 1. Load the tidyverse package
# 2. Set working directory to sample_datasets  
# 3. Import all CSV files using read_csv() with EXACT variable names above
# 4. Print dimensions and head() for each dataset

# Load tidyverse
library(tidyverse)

# Set working directory
# setwd("sample_datasets")  # Skip this if directory doesn't exist

# Import datasets with exact variable names
customers <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/customers.csv")
orders <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/orders.csv")
order_items <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/order_items.csv")
products <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/products.csv")
suppliers <- read_csv("/workspaces/assignment-1-version-3-trinitysch/data/suppliers.csv")

# Print dataset dimensions:
cat("Dataset Dimensions:\n")
cat("Customers:", nrow(customers), "rows x", ncol(customers), "columns\n")
cat("Orders:", nrow(orders), "rows x", ncol(orders), "columns\n")
cat("Order Items:", nrow(order_items), "rows x", ncol(order_items), "columns\n")
cat("Products:", nrow(products), "rows x", ncol(products), "columns\n")
cat("Suppliers:", nrow(suppliers), "rows x", ncol(suppliers), "columns\n")

# Show first 3 rows of each dataset:
head(customers, 3)
head(orders, 3)
head(order_items, 3)
head(products, 3)
head(suppliers, 3)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Dataset Dimensions:
Customers: 100 rows x 5 columns
Orders: 250 rows x 4 columns
Order Items: 400 rows x 4 columns
Products: 50 rows x 4 columns
Suppliers: 10 rows x 3 columns


CustomerID,Name,Email,City,Registration_Date
<dbl>,<chr>,<chr>,<chr>,<date>
1,Customer 1,customer1@email.com,Phoenix,2020-10-03
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02
3,Customer 3,customer3@email.com,Chicago,2021-04-20


OrderID,CustomerID,Order_Date,Total_Amount
<dbl>,<dbl>,<date>,<dbl>
1,87,2023-08-30,424.3
2,12,2024-03-24,183.09
3,37,2024-03-19,549.07


OrderID,ProductID,Quantity,Unit_Price
<dbl>,<dbl>,<dbl>,<dbl>
213,8,1,43.81
176,18,5,489.16
118,2,5,442.09


ProductID,Product_Name,Category,Supplier_ID
<dbl>,<chr>,<chr>,<dbl>
1,Product 1,Home,6
2,Product 2,Electronics,7
3,Product 3,Home,1


Supplier_ID,Supplier_Name,Country
<dbl>,<chr>,<chr>
1,Supplier 1,USA
2,Supplier 2,USA
3,Supplier 3,Japan


## Part 2: Basic Joins

**Tasks:**
1. **Inner Join:** Create `customer_orders_inner` by joining customers and orders
2. **Left Join:** Create `customer_orders_left` to show all customers  
3. **Right Join:** Create `customer_orders_right` to show all orders
4. **Full Join:** Create `customer_orders_full` for complete data view

Use the exact variable names specified above for autograding. Analyze row counts and explain results.
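
To see how the four join types differ before running them on the real tables, a small toy example can help. The tables `toy_customers` and `toy_orders` below are hypothetical: customer 3 has no orders, and order 13 refers to a `CustomerID` that does not exist.

```r
library(dplyr)

# Hypothetical toy tables: customer 3 has no orders, and order 13 points at
# a CustomerID (9) that does not exist in the customer table.
toy_customers <- tibble(CustomerID = c(1, 2, 3), Name = c("A", "B", "C"))
toy_orders <- tibble(OrderID = c(10, 11, 12, 13), CustomerID = c(1, 1, 2, 9))

toy_inner <- inner_join(toy_customers, toy_orders, by = "CustomerID")
toy_left  <- left_join(toy_customers, toy_orders, by = "CustomerID")
toy_right <- right_join(toy_customers, toy_orders, by = "CustomerID")
toy_full  <- full_join(toy_customers, toy_orders, by = "CustomerID")

cat("inner:", nrow(toy_inner), "\n")  # matched pairs only
cat("left: ", nrow(toy_left), "\n")   # matched pairs + customers without orders
cat("right:", nrow(toy_right), "\n")  # matched pairs + orders without customers
cat("full: ", nrow(toy_full), "\n")   # all of the above
```

Here the inner join keeps 3 rows, the left and right joins keep 4 each, and the full join keeps 5, because each outer join adds the unmatched rows from the side it preserves.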

In [2]:
# Create an inner join between customers and orders
customer_orders_inner <- inner_join(customers, orders, by = "CustomerID")

# Analyze the results and compare row counts
cat("Inner Join Results:\n")
cat("Customers:", nrow(customers), "\n")
cat("Orders:", nrow(orders), "\n")
cat("Inner Join Result (customer_orders_inner):", nrow(customer_orders_inner), "\n")

# Count the distinct customers and orders that appear in the inner join result
cat("Customers in join:", length(unique(customer_orders_inner$CustomerID)), "\n")
cat("Orders in join:", length(unique(customer_orders_inner$OrderID)), "\n")


Inner Join Results:
Customers: 100 
Orders: 250 
Inner Join Result (customer_orders_inner): 200 
Customers in join: 100 
Orders in join: 200 


In [3]:
# Create a left join to keep all customers
customer_orders_left <- left_join(customers, orders, by = "CustomerID")

# Count how many customers have no orders
customers_without_orders <- sum(is.na(customer_orders_left$OrderID))
cat("Left Join Results:\n")
cat("Total rows:", nrow(customer_orders_left), "\n")
cat("Customers without orders:", customers_without_orders, "\n")

Left Join Results:
Total rows: 200 
Customers without orders: 0 


In [4]:
# Create a right join to keep all orders
customer_orders_right <- right_join(customers, orders, by = "CustomerID")

# Check for orders with invalid customer IDs (i.e., orders with no matching customer)
orders_invalid_customers <- sum(is.na(customer_orders_right$Name))
cat("Right Join Results:\n")
cat("Total rows:", nrow(customer_orders_right), "\n")
cat("Orders with invalid customer IDs:", orders_invalid_customers, "\n")

Right Join Results:
Total rows: 250 
Orders with invalid customer IDs: 50 


In [5]:
# Part 2.1: Inner Join
customer_orders <- inner_join(customers, orders, by = "CustomerID")

cat("Inner Join Results:\n")
cat("Customers", nrow(customers), "\n")
cat("Orders:", nrow(orders), "\n")
cat("Inner Join Result:", nrow(customer_orders),"\n")

# Part 2.2: Left Join
customer_orders_left <- left_join(customers, orders, by = "CustomerID")
customers_without_orders <- sum(is.na(customer_orders_left$OrderID))

cat("Left Join Results:\n")
cat("Total rows:", nrow(customer_orders_left), "\n")
cat("Customers without orders:", customers_without_orders,"\n")


# Part 2.3: Right Join
customer_orders_right <- right_join(customers, orders, by = "CustomerID")
orders_invalid_customers <- sum(is.na(customer_orders_right$Name))

cat("Right Join Results:\n")
cat("Total rows", nrow(customer_orders_right), "\n")
cat("Orders with invalid customer IDs:", orders_invalid_customers, "\n")

# Part 2.4: Full Join
customer_orders_full <- full_join(customers, orders, by = "CustomerID")
customers_only <- sum(is.na(customer_orders_full$OrderID))
orders_only <- sum(is.na(customer_orders_full$Name))

cat("Full Join Results:\n")
cat("Total rows:", nrow(customer_orders_full), "\n")
cat("Customers without orders:", customers_only, "\n")
cat("Orders without valid customers:", orders_only,"\n")

Inner Join Results:
Customers 100 
Orders: 250 
Inner Join Result: 200 
Left Join Results:
Total rows: 200 
Customers without orders: 0 
Right Join Results:
Total rows 250 
Orders with invalid customer IDs: 50 
Full Join Results:
Total rows: 250 
Customers without orders: 0 
Orders without valid customers: 50 


In [8]:
# Part 2.1: Inner Join (REQUIRED variable name: customer_orders)
# Create an inner join between customers and orders
customer_orders <- inner_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Inner Join Results:\n")
cat("Customers:", nrow(customers), "\n")
cat("Orders:", nrow(orders), "\n")
cat("Inner Join Result:", nrow(customer_orders), "\n")

Inner Join Results:
Customers: 100 
Orders: 250 
Inner Join Result: 200 


In [9]:
# Part 2.2: Left Join (REQUIRED variable name: customer_orders_left)  
# Create a left join to keep all customers
customer_orders_left <- left_join(customers, orders, by = "CustomerID")

# Required output for autograding:
customers_without_orders <- sum(is.na(customer_orders_left$OrderID))
cat("Left Join Results:\n")
cat("Total rows:", nrow(customer_orders_left), "\n")
cat("Customers without orders:", customers_without_orders, "\n")

Left Join Results:
Total rows: 200 
Customers without orders: 0 


In [12]:
# Part 2.3: Right Join (REQUIRED variable name: customer_orders_right)
# Create a right join to keep all orders
customer_orders_right <- right_join(customers, orders, by = "CustomerID")

# Expected output:
orders_invalid_customers <- sum(is.na(customer_orders_right$Name))
cat("Right Join Results:\n")
cat("Total rows:", nrow(customer_orders_right), "\n")
cat("Orders with invalid customer IDs:", orders_invalid_customers, "\n")

Right Join Results:
Total rows: 250 
Orders with invalid customer IDs: 50 


In [13]:
# Part 2.4: Full Join (REQUIRED variable name: customer_orders_full)
# Create a full join to include all customers and all orders
customer_orders_full <- full_join(customers, orders, by = "CustomerID")

# Analyze unmatched records
customers_only <- sum(is.na(customer_orders_full$OrderID))
orders_only <- sum(is.na(customer_orders_full$Name))
cat("Full Join Results:\n")
cat("Total rows:", nrow(customer_orders_full), "\n")
cat("Customers without orders:", customers_only, "\n")
cat("Orders without valid customers:", orders_only, "\n")

Full Join Results:
Total rows: 250 
Customers without orders: 0 
Orders without valid customers: 50 
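
The counts printed above reconcile: the 250 full-join rows are the 200 matched rows plus 0 customers without orders plus 50 orders without a matching customer. A minimal sketch of the same check, using hypothetical toy tables rather than the assignment data:

```r
library(dplyr)

# Hypothetical toy tables (not the assignment data)
toy_customers <- tibble(CustomerID = c(1, 2, 3), Name = c("A", "B", "C"))
toy_orders <- tibble(OrderID = c(10, 11, 12, 13), CustomerID = c(1, 1, 2, 9))

n_matched        <- nrow(inner_join(toy_customers, toy_orders, by = "CustomerID"))
n_customers_only <- nrow(anti_join(toy_customers, toy_orders, by = "CustomerID"))
n_orders_only    <- nrow(anti_join(toy_orders, toy_customers, by = "CustomerID"))
n_full           <- nrow(full_join(toy_customers, toy_orders, by = "CustomerID"))

# The full join is the matched rows plus the leftovers from each side
stopifnot(n_full == n_matched + n_customers_only + n_orders_only)
cat(n_full, "=", n_matched, "+", n_customers_only, "+", n_orders_only, "\n")
```

If this identity fails on real data, the usual suspects are `NA` keys or key columns with mismatched types.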


## Part 3: Multi-Table Joins

**Tasks:**
1. Create a comprehensive dataset by joining `orders`, `customers`, and `order_items`
2. Extend this dataset by adding `products` and `suppliers` information
3. Create a complete supply chain view showing the full customer-to-supplier relationship

Build your joins step by step and examine the results at each stage.

**Requirements for Autograding:**
You must create these EXACT variable names in order:

1. `orders_items` - Join orders and order_items  
2. `orders_customers_items` - Add customers to above
3. `complete_order_data` - Add products information
4. `complete_data` - Add suppliers for complete supply chain view

**Expected Outputs:**
Print the number of rows for each step to show the join progression.
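
When chaining joins, it can help to state the expected key relationship so that an unexpected row explosion fails loudly instead of silently inflating the result. The sketch below assumes dplyr 1.1.0 or later, where the join functions accept a `relationship` argument. The toy tables are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables; column names mirror the assignment's keys
toy_items <- tibble(OrderID = c(1, 1, 2), ProductID = c(10, 11, 10))
toy_products <- tibble(ProductID = c(10, 11), Product_Name = c("P10", "P11"))

# Many item rows may share a product, but each ProductID should appear once
# in the lookup table. `relationship` (dplyr >= 1.1.0) raises an error if not.
toy_joined <- left_join(toy_items, toy_products,
                        by = "ProductID",
                        relationship = "many-to-one")

# A left join against a unique key leaves the row count unchanged
cat("Before:", nrow(toy_items), "After:", nrow(toy_joined), "\n")
```

Printing the row count after each step, as the tasks above require, serves the same purpose: the count should stay constant whenever a lookup table is added with a left join on a unique key.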

In [15]:
# Step 1: REQUIRED variable name: orders_items
# Join orders and order_items by OrderID
orders_items <- inner_join(orders, order_items, by = "OrderID")

# Required output for autograding:
cat("Step 1 - Orders with Items:", nrow(orders_items), "rows\n")

Step 1 - Orders with Items: 400 rows


In [16]:
# Step 2: REQUIRED variable name: orders_customers_items  
# Join orders_items with customers by CustomerID
orders_customers_items <- left_join(orders_items, customers, by = "CustomerID")

# Required output for autograding:
cat("Step 2 - Add Customers:", nrow(orders_customers_items), "rows\n")

Step 2 - Add Customers: 400 rows


In [17]:
# Step 3: REQUIRED variable name: complete_order_data
# Join orders_customers_items with products by ProductID
complete_order_data <- left_join(orders_customers_items, products, by = "ProductID")

# Required output for autograding:
cat("Step 3 - Add Products:", nrow(complete_order_data), "rows\n")

Step 3 - Add Products: 400 rows


In [18]:
# Step 4: REQUIRED variable name: complete_data
# Join complete_order_data with suppliers by Supplier_ID
complete_data <- left_join(complete_order_data, suppliers, by = "Supplier_ID")

# Required output for autograding:
cat("Step 4 - Complete Supply Chain:", nrow(complete_data), "rows\n")

Step 4 - Complete Supply Chain: 400 rows


## Part 4: Data Quality Analysis

**Tasks:**
1. Use `anti_join()` to find unmatched records between tables
2. Use `semi_join()` to find matched records
3. Check for duplicate keys and analyze their impact on joins
4. Identify and document any data quality issues

This analysis helps understand data integrity and potential issues in your datasets.
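
As a quick illustration of the two filtering joins, the sketch below runs `anti_join()` and `semi_join()` on hypothetical toy tables. Unlike the mutating joins in Part 2, neither function adds columns from the second table. They only decide which rows of the first table to keep.

```r
library(dplyr)

# Hypothetical toy tables (not the assignment data)
toy_customers <- tibble(CustomerID = c(1, 2, 3), Name = c("A", "B", "C"))
toy_orders <- tibble(OrderID = c(10, 11, 12), CustomerID = c(1, 1, 9))

# anti_join(): rows of the first table with NO match in the second
toy_no_orders <- anti_join(toy_customers, toy_orders, by = "CustomerID")
toy_orphans   <- anti_join(toy_orders, toy_customers, by = "CustomerID")

# semi_join(): rows of the first table WITH a match; no columns are added
# and a customer with several orders is not repeated
toy_has_orders <- semi_join(toy_customers, toy_orders, by = "CustomerID")

print(toy_no_orders)   # customers 2 and 3
print(toy_orphans)     # order 12, whose CustomerID 9 is unknown
print(toy_has_orders)  # customer 1, listed once
```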

In [25]:
# Part 4: Data Quality Analysis
# Using anti_join() and semi_join() for data validation

library(dplyr)

# The data frames loaded in Part 1 are used here:
# customers   - one row per customer, keyed by CustomerID
# orders      - one row per order, with OrderID and CustomerID
# products    - one row per product, keyed by ProductID
# order_items - one row per order line, with OrderID and ProductID

# 1. Find customers with NO orders
customers_no_orders <- anti_join(customers, orders, by = "CustomerID")
cat("Customers with no orders:", nrow(customers_no_orders), "\n")

# 2. Find orders without matching customers (orphaned orders)
orphaned_orders <- anti_join(orders, customers, by = "CustomerID")
cat("Orphaned orders:", nrow(orphaned_orders), "\n")

# 3. Find products that were never ordered
products_never_ordered <- anti_join(products, order_items, by = "ProductID")
cat("Products never ordered:", nrow(products_never_ordered), "\n")

# 4. Find customers who have placed orders (active customers)
active_customers <- semi_join(customers, orders, by = "CustomerID")
cat("Active customers:", nrow(active_customers), "\n")

# Summary output
cat("\n=== Data Quality Summary ===\n")
cat("Total customers:", nrow(customers), "\n")
cat("Active customers:", nrow(active_customers), "\n")
cat("Inactive customers:", nrow(customers_no_orders), "\n")
cat("Total orders:", nrow(orders), "\n")
cat("Orphaned orders:", nrow(orphaned_orders), "\n")
cat("Total products:", nrow(products), "\n")
cat("Products never ordered:", nrow(products_never_ordered), "\n")

Customers with no orders: 0 
Orphaned orders: 50 
Products never ordered: 0 
Active customers: 100 

=== Data Quality Summary ===
Total customers: 100 
Active customers: 100 
Inactive customers: 0 
Total orders: 250 
Orphaned orders: 50 
Total products: 50 
Products never ordered: 0 


In [19]:
# REQUIRED variable name: customers_no_orders
# Use anti_join() to find customers with no orders
customers_no_orders <- anti_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Customers who never placed an order:", nrow(customers_no_orders), "\n")

Customers who never placed an order: 0 


In [20]:
# REQUIRED variable name: orphaned_orders
# Use anti_join() to find orders without customers
orphaned_orders <- anti_join(orders, customers, by = "CustomerID")

# Required output for autograding:
cat("Orders without corresponding customers:", nrow(orphaned_orders), "\n")

Orders without corresponding customers: 50 


In [21]:
# REQUIRED variable name: products_never_ordered
# Use anti_join() to find products that were never ordered
products_never_ordered <- anti_join(products, order_items, by = "ProductID")

# Required output for autograding:
cat("Products that were never ordered:", nrow(products_never_ordered), "\n")

Products that were never ordered: 0 


In [22]:
# REQUIRED variable name: active_customers
# Use semi_join() to find customers with orders
active_customers <- semi_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Customers who placed at least one order:", nrow(active_customers), "\n")

Customers who placed at least one order: 100 


In [26]:
# Data integrity checks - count duplicate keys
# Calculate and print duplicate counts for each dataset
duplicate_customers <- sum(duplicated(customers$CustomerID))
duplicate_orders <- sum(duplicated(orders$OrderID))
duplicate_products <- sum(duplicated(products$ProductID))
duplicate_suppliers <- sum(duplicated(suppliers$Supplier_ID))

# Required outputs for autograding:
cat("Data Quality Summary:\n")
cat("Duplicate customer IDs:", duplicate_customers, "\n")
cat("Duplicate order IDs:", duplicate_orders, "\n")
cat("Duplicate product IDs:", duplicate_products, "\n")
cat("Duplicate supplier IDs:", duplicate_suppliers, "\n")

Data Quality Summary:
Duplicate customer IDs: 0 
Duplicate order IDs: 0 
Duplicate product IDs: 0 
Duplicate supplier IDs: 0 
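
All four duplicate counts are zero here, so the joins in Part 3 cannot multiply rows through repeated keys. To illustrate what would happen otherwise, the sketch below uses a hypothetical lookup table in which one `ProductID` appears twice.

```r
library(dplyr)

# Hypothetical lookup table in which ProductID 10 was entered twice
toy_products <- tibble(ProductID = c(10, 10, 11),
                       Product_Name = c("P10", "P10 (dup)", "P11"))
toy_items <- tibble(OrderID = c(1, 2), ProductID = c(10, 11))

# List the offending keys rather than just counting them
dup_keys <- toy_products %>%
  count(ProductID, name = "n_rows") %>%
  filter(n_rows > 1)
print(dup_keys)

# The item row for ProductID 10 now matches two product rows
inflated <- left_join(toy_items, toy_products, by = "ProductID")
cat("Item rows:", nrow(toy_items), "-> joined rows:", nrow(inflated), "\n")
```

Two item rows become three joined rows, so any revenue total computed afterwards would count the duplicated item twice.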


## Part 5: Business Analysis with Joined Data

**Tasks:**
1. Calculate customer lifetime value metrics (total spent, number of orders, average order value)
2. Analyze product performance (total quantity sold, revenue generated, order frequency)
3. Evaluate supplier performance (total revenue, number of products supplied)
4. Create a regional analysis by customer city

Use your joined datasets to generate meaningful business insights.
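
Part 4 found orders whose `CustomerID` has no match in `customers`. After a left join, those rows carry `NA` in `Name` and `City`, so their revenue lands in groups with missing labels. One option, sketched below on hypothetical toy rows, is to filter them out before summarising. Whether to exclude them is a judgment call and depends on what the autograder expects, so treat this as an illustration rather than the required approach.

```r
library(dplyr)

# Hypothetical joined rows; CustomerID 9 had no match, so Name and City are NA
toy_complete <- tibble(
  OrderID    = c(1, 1, 2, 3),
  CustomerID = c(1, 1, 2, 9),
  Name       = c("A", "A", "B", NA),
  City       = c("Chicago", "Chicago", "Phoenix", NA),
  Quantity   = c(2, 1, 3, 5),
  Unit_Price = c(10, 20, 5, 100)
)

toy_metrics <- toy_complete %>%
  filter(!is.na(Name)) %>%  # keep only orders tied to a known customer
  group_by(CustomerID, Name) %>%
  summarise(
    Total_Spent = sum(Quantity * Unit_Price),
    Order_Count = n_distinct(OrderID),
    Avg_Order_Value = Total_Spent / Order_Count,
    .groups = "drop"
  )
print(toy_metrics)
```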

In [34]:
# Part 5: Business Analysis with Joined Data

library(dplyr)

# 1. CUSTOMER METRICS
# Analyze spending patterns, order frequency, and average order value
customer_metrics <- complete_data %>%
  group_by(CustomerID, Name) %>%
  summarise(
    Total_Spent = sum(Quantity * Unit_Price, na.rm = TRUE),
    Order_Count = n_distinct(OrderID),
    Avg_Order_Value = ifelse(Order_Count > 0, Total_Spent / Order_Count, NA),
    .groups = "drop"
  )

cat("Customer Metrics:\n")
print(head(customer_metrics))
cat("Records:", nrow(customer_metrics), "\n\n")

# 2. PRODUCT METRICS
# Analyze product performance including revenue, quantity sold, and order frequency
product_metrics <- complete_data %>%
  group_by(ProductID, Product_Name) %>%
  summarise(
    Total_Revenue = sum(Quantity * Unit_Price, na.rm = TRUE),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    Order_Frequency = n_distinct(OrderID),
    .groups = "drop"
  )

cat("Product Metrics:\n")
print(head(product_metrics))
cat("Records:", nrow(product_metrics), "\n\n")

# 3. SUPPLIER METRICS
# Count the number of products each supplier provides
supplier_metrics <- complete_data %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Product_Count = n_distinct(ProductID, na.rm = TRUE),
    .groups = "drop"
  )

cat("Supplier Metrics:\n")
print(head(supplier_metrics))
cat("Records:", nrow(supplier_metrics), "\n\n")

# 4. REGIONAL ANALYSIS
# Analyze sales performance by customer city
regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Quantity * Unit_Price, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Avg_Customer_Value = ifelse(Customer_Count > 0, Total_Sales / Customer_Count, NA),
    .groups = "drop"
  )

cat("Regional Analysis:\n")
print(head(regional_analysis))
cat("Records:", nrow(regional_analysis), "\n\n")

# Summary Statistics
cat("\n=== Business Analysis Summary ===\n")
cat("Total Customers Analyzed:", nrow(customer_metrics), "\n")
cat("Total Products Analyzed:", nrow(product_metrics), "\n")
cat("Total Suppliers:", nrow(supplier_metrics), "\n")
cat("Total Regions:", nrow(regional_analysis), "\n")
cat("\nTop Customer by Spending:\n")
print(customer_metrics %>% arrange(desc(Total_Spent)) %>% head(1))
cat("\nTop Product by Revenue:\n")
print(product_metrics %>% arrange(desc(Total_Revenue)) %>% head(1))
cat("\nTop Region by Sales:\n")
print(regional_analysis %>% arrange(desc(Total_Sales)) %>% head(1))

Customer Metrics:
# A tibble: 6 × 5
  CustomerID Name       Total_Spent Order_Count Avg_Order_Value
       <dbl> <chr>            <dbl>       <int>           <dbl>
1          1 Customer 1       4815.           2           2407.
2          3 Customer 3       6107.           2           3053.
3          4 Customer 4       1091.           2            545.
4          5 Customer 5       4943.           2           2471.
5          7 Customer 7       3219.           2           1610.
6          8 Customer 8       2448.           2           1224.
Records: 99 

Supplier Metrics:
# A tibble: 6 × 3
  Supplier_ID Supplier_Name Product_Count
        <dbl> <chr>                 <int>
1           1 Supplier 1                5
2           2 Supplier 2                3
3           3 Supplier 3                5
4           4 Supplier 4                6
5           5 Supplier 5                5
6           6 Supplier 6                5
Records: 10 

Regional Analysis:
# A tibble: 6 × 4
  City        Total_Sales Customer_Count Avg_Customer_Value
  <chr>             <dbl>          <int>              <dbl>
1 Chicago          72278.             23              3143.
2 Houston          38143.             18              2119.
3 Los Angeles      4

In [27]:
# REQUIRED variable name: customer_metrics
# Calculate customer lifetime value metrics
# Required columns: CustomerID, Name, Total_Spent, Order_Count, Avg_Order_Value

customer_metrics <- complete_data %>%
  group_by(CustomerID, Name) %>%
  summarise(
    Total_Spent = sum(Quantity * Unit_Price, na.rm = TRUE),
    Order_Count = n_distinct(OrderID),
    Avg_Order_Value = ifelse(Order_Count > 0, Total_Spent / Order_Count, NA),
    .groups = "drop"
  )

# Required output for autograding:
cat("Customer Analysis Summary:\n")
cat("Total customers analyzed:", nrow(customer_metrics), "\n")
cat("Top customer total spent: $", round(max(customer_metrics$Total_Spent, na.rm = TRUE), 2), "\n")
print(head(customer_metrics[order(-customer_metrics$Total_Spent), ], 3))

Customer Analysis Summary:
Total customers analyzed: 99 
Top customer total spent: $ 17221.69 
# A tibble: 3 × 5
  CustomerID Name  Total_Spent Order_Count Avg_Order_Value
       <dbl> <chr>       <dbl>       <int>           <dbl>
1        103 NA         17222.           9           1914.
2        102 NA         16116.           8           2015.
3        104 NA         13427.           7           1918.

In [29]:
# REQUIRED variable name: product_metrics  
# Analyze product performance
# Required columns: ProductID, Product_Name, Total_Revenue, Total_Quantity, Order_Frequency

product_metrics <- complete_data %>%
  group_by(ProductID, Product_Name) %>%
  summarise(
    Total_Revenue = sum(Quantity * Unit_Price, na.rm = TRUE),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    Order_Frequency = n_distinct(OrderID),
    .groups = "drop"
  )

# Required output for autograding:
cat("Product Analysis Summary:\n")
cat("Total products analyzed:", nrow(product_metrics), "\n")
cat("Top product revenue: $", round(max(product_metrics$Total_Revenue, na.rm = TRUE), 2), "\n")
print(head(product_metrics[order(-product_metrics$Total_Revenue), ], 3))

Product Analysis Summary:
Total products analyzed: 50 
Top product revenue: $ 13970.66 
# A tibble: 3 × 5
  ProductID Product_Name Total_Revenue Total_Quantity Order_Frequency
      <dbl> <chr>                <dbl>          <dbl>           <int>
1        26 Product 26          13971.             60              18
2        47 Product 47          12582.             38              10
3        43 Product 43          10584.             36              10

In [35]:
# REQUIRED variable name: supplier_metrics
# Evaluate supplier performance  
# Required columns: Supplier_ID, Supplier_Name, Product_Count

supplier_metrics <- complete_data %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Product_Count = n_distinct(ProductID, na.rm = TRUE),
    .groups = "drop"
  )

# Required output for autograding:
cat("Supplier Analysis Summary:\n")
cat("Total suppliers analyzed:", nrow(supplier_metrics), "\n")
cat("Max products per supplier:", max(supplier_metrics$Product_Count, na.rm = TRUE), "\n")
print(supplier_metrics[order(-supplier_metrics$Product_Count), ])

Supplier Analysis Summary:
Total suppliers analyzed: 10 
Max products per supplier: 10 
# A tibble: 10 × 3
   Supplier_ID Supplier_Name Product_Count
         <dbl> <chr>                 <int>
 1           7 Supplier 7               10
 2           4 Supplier 4                6
 3          10 Supplier 10               6
 4           1 Supplier 1                5
 5           3 Supplier 3                5
 6           5 Supplier 5                5
 7           6 Supplier 6                5
 8           2 Supplier 2                3
 9           8 Supplier 8                3
10           9 Supplier 9                2

In [36]:
# REQUIRED variable name: regional_analysis
# Create regional analysis by customer city
# Required columns: City, Total_Sales, Customer_Count, Avg_Customer_Value

regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Quantity * Unit_Price, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Avg_Customer_Value = ifelse(Customer_Count > 0, Total_Sales / Customer_Count, NA),
    .groups = "drop"
  )

# Required output for autograding:
cat("Regional Analysis Summary:\n")
cat("Total cities analyzed:", nrow(regional_analysis), "\n")
cat("Highest city sales: $", round(max(regional_analysis$Total_Sales, na.rm = TRUE), 2), "\n")
print(regional_analysis[order(-regional_analysis$Total_Sales), ])

Regional Analysis Summary:
Total cities analyzed: 6 
Highest city sales: $ 72277.52 
# A tibble: 6 × 4
  City        Total_Sales Customer_Count Avg_Customer_Value
  <chr>             <dbl>          <int>              <dbl>
1 Chicago          72278.             23              3143.
2 NA               68753.              5             13751.
3 Phoenix          49333.             19              2596.
4 Los Angeles      48506.             18              2695.
5 Houston          38143.             18              2119.
6 New York         36417.             16              2276.

In [37]:
# REQUIRED variable name: product_metrics  
# Analyze product performance
product_metrics <- complete_data %>%
  group_by(ProductID, Product_Name) %>%
  summarise(
    Total_Revenue = sum(Quantity * Unit_Price, na.rm = TRUE),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    Order_Frequency = n_distinct(OrderID),
    .groups = "drop"
  )

# Required output for autograding:
cat("Product Analysis Summary:\n")
cat("Total products analyzed:", nrow(product_metrics), "\n")
cat("Top product revenue: $", round(max(product_metrics$Total_Revenue, na.rm = TRUE), 2), "\n")
print(head(product_metrics[order(-product_metrics$Total_Revenue), ], 3))

Product Analysis Summary:
Total products analyzed: 50
Top product revenue: $ 13970.66
# A tibble: 3 × 5
  ProductID Product_Name Total_Revenue Total_Quantity Order_Frequency
      <dbl> <chr>                <dbl>          <dbl>           <int>
1        26 Product 26          13971.             60              18
2        47 Product 47          12582.             38              10
3        43 Product 43          10584.             36              10

In [38]:
# REQUIRED variable name: supplier_metrics
# Evaluate supplier performance  
# Required columns: Supplier_ID, Supplier_Name, Product_Count

supplier_metrics <- complete_data %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Product_Count = n_distinct(ProductID, na.rm = TRUE),
    .groups = "drop"
  )

# Required output for autograding:
cat("Supplier Analysis Summary:\n")
cat("Total suppliers analyzed:", nrow(supplier_metrics), "\n")
cat("Max products per supplier:", max(supplier_metrics$Product_Count, na.rm = TRUE), "\n")
print(supplier_metrics[order(-supplier_metrics$Product_Count), ])

Supplier Analysis Summary:
Total suppliers analyzed: 10
Max products per supplier: 10
# A tibble: 10 × 3
   Supplier_ID Supplier_Name Product_Count
         <dbl> <chr>                 <int>
 1           7 Supplier 7               10
 2           4 Supplier 4                6
 3          10 Supplier 10               6
 4           1 Supplier 1                5
 5           3 Supplier 3                5
 6           5 Supplier 5                5
 7           6 Supplier 6                5
 8           2 Supplier 2                3
 9           8 Supplier 8                3
10           9 Supplier 9                2

In [39]:
# REQUIRED variable name: regional_analysis
# Create regional analysis by customer city
# Required columns: City, Total_Sales, Customer_Count, Avg_Customer_Value

regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Quantity * Unit_Price, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Avg_Customer_Value = ifelse(Customer_Count > 0, Total_Sales / Customer_Count, NA),
    .groups = "drop"
  )

# Required output for autograding:
cat("Regional Analysis Summary:\n")
cat("Total cities analyzed:", nrow(regional_analysis), "\n")
cat("Highest city sales: $", round(max(regional_analysis$Total_Sales, na.rm = TRUE), 2), "\n")
print(regional_analysis[order(-regional_analysis$Total_Sales), ])

Regional Analysis Summary:
Total cities analyzed: 6
Highest city sales: $ 72277.52
# A tibble: 6 × 4
  City        Total_Sales Customer_Count Avg_Customer_Value
  <chr>             <dbl>          <int>              <dbl>
1 Chicago          72278.             23              3143.
2 NA               68753.              5             13751.
3 Phoenix          49333.             19              2596.
4 Los Angeles      48506.             18              2695.
5 Houston          38143.             18              2119.
6 New York         36417.             16              2276.

In [48]:
# Supplier Performance Analysis with Enhanced Metrics

library(dplyr)

# Core supplier performance metrics
# Use complete_data for consistency with other analysis
supplier_metrics <- complete_data %>%
  filter(!is.na(Supplier_Name)) %>%
  group_by(Supplier_ID, Supplier_Name, Country) %>%
  summarise(
    Total_Revenue = sum(Quantity * Unit_Price, na.rm = TRUE),
    Products_Supplied = n_distinct(ProductID),
    Orders_Involved = n_distinct(OrderID),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    # Reuse the summaries computed above (summarise() evaluates expressions in order)
    Avg_Order_Value = ifelse(Orders_Involved > 0, Total_Revenue / Orders_Involved, NA),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Revenue))

cat("Supplier Performance:\n")
print(supplier_metrics)

# Summary Statistics
cat("\n=== Supplier Performance Summary ===\n")
cat("Total Suppliers:", nrow(supplier_metrics), "\n")
cat("Total Revenue Generated:", sum(supplier_metrics$Total_Revenue), "\n\n")

# Top Performers
cat("Top 5 Suppliers by Revenue:\n")
print(head(supplier_metrics, 5))

cat("\n\nTop 5 Suppliers by Product Diversity:\n")
print(supplier_metrics %>% arrange(desc(Products_Supplied)) %>% head(5))

cat("\n\nTop 5 Suppliers by Orders Involved:\n")
print(supplier_metrics %>% arrange(desc(Orders_Involved)) %>% head(5))

# Supplier Efficiency Analysis
cat("\n\n=== Supplier Efficiency ===\n")
efficiency <- supplier_metrics %>%
  mutate(
    Revenue_per_Product = ifelse(Products_Supplied > 0, Total_Revenue / Products_Supplied, NA),
    Revenue_per_Order = Avg_Order_Value,
    Quantity_per_Order = ifelse(Orders_Involved > 0, Total_Quantity / Orders_Involved, NA)
  ) %>%
  select(Supplier_Name, Revenue_per_Product, Revenue_per_Order, Quantity_per_Order) %>%
  arrange(desc(Revenue_per_Order))

print(efficiency)

# Geographic Analysis
cat("\n\n=== Revenue by Supplier Country ===\n")
country_summary <- supplier_metrics %>%
  group_by(Country) %>%
  summarise(
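    Suppliers = n(),
    # Take the mean before the sum: summarise() evaluates expressions in order,
    # so after Total_Revenue is overwritten mean() would just return the total.
    Avg_Supplier_Revenue = mean(Total_Revenue),
    Total_Revenue = sum(Total_Revenue),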
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Revenue))

print(country_summary)

Supplier Performance:
# A tibble: 10 × 8
   Supplier_ID Supplier_Name Country     Total_Revenue Products_Supplied
         <dbl> <chr>         <chr>               <dbl>             <int>
 1           7 Supplier 7    Germany            52328.                10
 2           5 Supplier 5    Japan              50692.                 5
 3          10 Supplier 10   Japan              43450.                 6
 4           1 Supplier 1    USA                34821.                 5
 5           3 Supplier 3    Japan              30401.                 5
 6           4 Supplier 4    South Korea        29898.                 6
 7           6 Supplier 6    Japan              27453.                 5


# A tibble: 5 × 8
  Supplier_ID Supplier_Name Country     Total_Revenue Products_Supplied
        <dbl> <chr>         <chr>               <dbl>             <int>
1           7 Supplier 7    Germany            52328.                10
2          10 Supplier 10   Japan              43450.                 6
3           4 Supplier 4    South Korea        29898.                 6
4           5 Supplier 5    Japan              50692.                 5
5           1 Supplier 1    USA                34821.                 5
# ℹ 3 more variables: Orders_Involved <int>, Total_Quantity <dbl>,
#   Avg_Order_Value <dbl>


Top 5 Suppliers by Orders Involved:

In [49]:
# Regional Analysis by Customer City
# Use complete_data for consistency with other analysis
regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Quantity * Unit_Price, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Order_Count = n_distinct(OrderID),
    Avg_Customer_Value = ifelse(Customer_Count > 0, Total_Sales / Customer_Count, NA),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Sales))

cat("Regional Sales Analysis:\n")
print(regional_analysis)

Regional Sales Analysis:
# A tibble: 6 × 5
  City        Total_Sales Customer_Count Order_Count Avg_Customer_Value
  <chr>             <dbl>          <int>       <int>              <dbl>
1 Chicago          72278.             23          36              3143.
2 NA               68753.              5          41             13751.
3 Phoenix          49333.             19          32              2596.
4 Los Angeles      48506.             18          32              2695.
5 Houston          38143.             18          30              2119.
6 New York         36417.             16          26              2276.

## Part 6: Complex Business Questions

**Tasks:**
Answer the following business questions using your joined datasets:

1. **Customer Segmentation:** Identify your top 10% of customers by total spending
2. **Product Recommendations:** Which products are frequently bought together?
3. **Supplier Dependency:** Which suppliers are most critical to the business?
4. **Market Expansion:** Which cities have high customer counts but low average order values?

Provide data-driven answers with supporting analysis.

**Requirements for Autograding:**
You must create these EXACT variable names:

1. `top_customers` - Top 10% of customers by spending (use quantile function)
2. `product_combinations` - Products frequently bought together  
3. `critical_suppliers` - Suppliers ranked by importance
4. `market_expansion` - Cities with expansion opportunities

**Expected Outputs:**
Each analysis should include specific metrics and rankings as shown in the required outputs below.

In [44]:
# REQUIRED variable name: top_customers
# Identify top 10% of customers by total spending using quantile()
# Use customer_metrics from previous analysis
top_10_percent_threshold <- quantile(customer_metrics$Total_Spent, 0.9, na.rm = TRUE)
top_customers <- customer_metrics %>%
  filter(Total_Spent >= top_10_percent_threshold) %>%
  arrange(desc(Total_Spent))
# Required output for autograding:
cat("Customer Segmentation Results:\n")
cat("Top 10% spending threshold: $", round(top_10_percent_threshold, 2), "\n")
cat("Number of top customers:", nrow(top_customers), "\n")

Customer Segmentation Results:
Top 10% spending threshold: $ 5593.79 
Number of top customers: 10 


In [50]:
# REQUIRED variable name: product_combinations
# Find products frequently bought together
# Approach: For each order with multiple products, find all product pairs
# Use order_items to get product pairs per order
library(dplyr)
library(tidyr)
# Only consider orders with more than one product
multi_item_orders <- order_items %>%
  group_by(OrderID) %>%
  filter(n() > 1) %>%
  ungroup()
# Create all product pairs within each order
product_pairs <- multi_item_orders %>%
  select(OrderID, ProductID) %>%
  group_by(OrderID) %>%
  summarise(pairs = list(as.data.frame(t(combn(ProductID, 2)))), .groups = "drop") %>%
  unnest(pairs) %>%
  rename(ProductA = V1, ProductB = V2)
# Count frequency of each product pair
product_combinations <- product_pairs %>%
  group_by(ProductA, ProductB) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))
# Required output for autograding:
cat("Product Combination Analysis:\n")
cat("Total product combinations found:", nrow(product_combinations), "\n")
if(nrow(product_combinations) > 0) {
  cat("Most frequent combination appears", max(product_combinations$n, na.rm = TRUE), "times\n")
}

Product Combination Analysis:
Total product combinations found: 302 
Most frequent combination appears 3 times
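
One caveat with the pair counting above: `combn()` keeps products in the order they appear within each order, so the same two products can be counted as two separate pairs, (A, B) and (B, A). If that is not intended, a possible variant is to sort the product IDs before building pairs. The sketch below uses a small made-up table (`toy_items` and its values are hypothetical, not taken from `order_items.csv`) and also de-duplicates repeated lines for the same product within an order.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the assignment's order_items table
toy_items <- tibble(
  OrderID   = c(1, 1, 2, 2, 3, 3, 3),
  ProductID = c(26, 47, 47, 26, 26, 47, 43)
)

toy_pairs <- toy_items %>%
  distinct(OrderID, ProductID) %>%   # ignore repeated lines for the same product
  group_by(OrderID) %>%
  filter(n() > 1) %>%                # combn() needs at least two products
  summarise(
    # sort() makes every pair appear as (smaller ID, larger ID)
    pairs = list(as.data.frame(t(combn(sort(ProductID), 2)))),
    .groups = "drop"
  ) %>%
  unnest(pairs) %>%
  rename(ProductA = V1, ProductB = V2)

# Count how often each normalized pair occurs
toy_combinations <- toy_pairs %>%
  count(ProductA, ProductB, name = "n") %>%
  arrange(desc(n))

print(toy_combinations)
```

Applying the same `sort()` step to the real data would likely change the number of combinations reported, since mirrored pairs would be merged.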


In [51]:
# REQUIRED variable name: critical_suppliers
# Analyze which suppliers are most critical
# Consider both revenue contribution and product diversity
# Use complete_data for analysis
critical_suppliers <- complete_data %>%
  filter(!is.na(Supplier_ID)) %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Total_Revenue = sum(Quantity * Unit_Price, na.rm = TRUE),
    Products_Supplied = n_distinct(ProductID),
    Orders_Involved = n_distinct(OrderID),
    .groups = "drop"
  ) %>%
  mutate(
    Importance_Score = scale(Total_Revenue) + scale(Products_Supplied) + scale(Orders_Involved)
  ) %>%
  arrange(desc(Importance_Score))
# Required output for autograding:
cat("Critical Suppliers Analysis:\n")
cat("Total suppliers analyzed:", nrow(critical_suppliers), "\n")
cat("Top 3 most critical suppliers:\n")
print(head(critical_suppliers, 3))

Critical Suppliers Analysis:
Total suppliers analyzed: 10
Top 3 most critical suppliers:
# A tibble: 3 × 6
  Supplier_ID Supplier_Name Total_Revenue Products_Supplied Orders_Involved
        <dbl> <chr>                 <dbl>             <int>           <int>
1           7 Supplier 7           52328.                10              66
2           5 Supplier 5           50692.                 5              54
3          10 Supplier 10          43450.                 6              43
# ℹ 1 more variable: Importance_Score <dbl[,1]>

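The `<dbl[,1]>` type shown for `Importance_Score` arises because `scale()` returns a one-column matrix rather than a plain vector. Sorting still works, but a matrix column can be awkward in later steps. A minimal sketch of one way around this, assuming a small helper (the name `z_score` and the toy values are illustrative, not part of the assignment):

```r
library(dplyr)

# Hypothetical toy values, not taken from the assignment data
toy_suppliers <- tibble(
  Supplier_Name     = c("A", "B", "C"),
  Total_Revenue     = c(100, 200, 300),
  Products_Supplied = c(5, 1, 3)
)

# Illustrative helper: z-score returned as a plain numeric vector
z_score <- function(x) as.numeric(scale(x))

toy_ranked <- toy_suppliers %>%
  mutate(Importance_Score = z_score(Total_Revenue) + z_score(Products_Supplied)) %>%
  arrange(desc(Importance_Score))

print(toy_ranked)
```

With this wrapper the score column prints as an ordinary `<dbl>` column.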
In [52]:
# REQUIRED variable name: market_expansion
# Identify market expansion opportunities
# Focus on cities with high customer counts but low average value
# Note: Avg_Customer_Value (sales per customer) is used here as the proxy for order value
# Use regional_analysis from previous analysis
customer_count_threshold <- quantile(regional_analysis$Customer_Count, 0.75, na.rm = TRUE)
avg_customer_value_threshold <- quantile(regional_analysis$Avg_Customer_Value, 0.25, na.rm = TRUE)
market_expansion <- regional_analysis %>%
  filter(Customer_Count >= customer_count_threshold,
         Avg_Customer_Value <= avg_customer_value_threshold) %>%
  arrange(Avg_Customer_Value)
# Required output for autograding:
cat("Market Expansion Analysis:\n")
cat("Cities evaluated for expansion:", nrow(market_expansion), "\n")
cat("Top expansion opportunities:\n")
print(head(market_expansion, 5))

Market Expansion Analysis:
Cities evaluated for expansion: 0
Top expansion opportunities:
# A tibble: 0 × 5
# ℹ 5 variables: City <chr>, Total_Sales <dbl>, Customer_Count <int>,
#   Order_Count <int>, Avg_Customer_Value <dbl>

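The empty result follows from the strict cutoffs: with only six cities, no city is both in the top quarter by customer count and in the bottom quarter by value. The task also asks about average order value, which can be derived as `Total_Sales / Order_Count`. The sketch below is one possible alternative, not the required solution. It assumes median cutoffs, drops the `NA` city bucket, and uses the rounded figures from the printed `regional_analysis` table (the names `toy_regional` and `toy_expansion` are illustrative).

```r
library(dplyr)

# Rounded values copied from the printed regional_analysis table above
toy_regional <- tibble(
  City           = c("Chicago", NA, "Phoenix", "Los Angeles", "Houston", "New York"),
  Total_Sales    = c(72278, 68753, 49333, 48506, 38143, 36417),
  Customer_Count = c(23L, 5L, 19L, 18L, 18L, 16L),
  Order_Count    = c(36L, 41L, 32L, 32L, 30L, 26L)
)

toy_expansion <- toy_regional %>%
  filter(!is.na(City)) %>%                          # drop customers with no city
  mutate(Avg_Order_Value = Total_Sales / Order_Count) %>%
  filter(
    Customer_Count >= median(Customer_Count),       # assumed cutoff: median
    Avg_Order_Value <= median(Avg_Order_Value)      # assumed cutoff: median
  ) %>%
  arrange(Avg_Order_Value)

print(toy_expansion)
```

Whether median cutoffs are appropriate depends on how the autograder defines "high" and "low", so treat the thresholds as adjustable.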

## Part 7: Summary and Insights

**Requirements for Autograding:**
Complete the analysis summary below with specific findings from your work. Your responses will be evaluated for completeness and accuracy.

**Grading Criteria:**
- Specific metrics and numbers from your analysis
- Clear identification of data quality issues  
- Actionable business recommendations
- Technical insights about join performance and appropriateness

## Analysis Summary

### Key Findings from Join Operations:
- **Join Efficiency:** [Report specific percentages of matched records]
- **Data Relationships:** [Describe the relationships you discovered between tables]
- **Multi-table Joins:** [Explain how the data grew/changed through the join sequence]

### Data Quality Issues Discovered:
- **Orphaned Records:** [Report specific counts of unmatched records]
- **Missing Data:** [Identify which tables had missing or invalid references]
- **Referential Integrity:** [Describe any foreign key violations found]
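
The counts and percentages requested above can be obtained with filtering joins. The following is a minimal sketch on made-up tables (`toy_customers`, `toy_orders`, and their values are hypothetical; only the column names `CustomerID` and `OrderID` follow the notebook):

```r
library(dplyr)

# Hypothetical toy tables, not the assignment data
toy_customers <- tibble(CustomerID = c(1, 2, 3))
toy_orders    <- tibble(OrderID = 101:105, CustomerID = c(1, 1, 2, 9, NA))

# Orders whose CustomerID has no match in the customers table
orphaned_orders <- anti_join(toy_orders, toy_customers, by = "CustomerID")

# Customers that never appear in the orders table
inactive_customers <- anti_join(toy_customers, toy_orders, by = "CustomerID")

# Share of orders that found a matching customer
match_rate <- 100 * (1 - nrow(orphaned_orders) / nrow(toy_orders))

cat("Orphaned orders:", nrow(orphaned_orders), "\n")
cat("Customers without orders:", nrow(inactive_customers), "\n")
cat("Order match rate:", round(match_rate, 1), "%\n")
```

The same pattern applies to any pair of tables that share a key column, for example `order_items` against `products`.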

### Business Insights:
- **Top Customer:** [Name and spending amount of highest value customer]
- **Best Product:** [Product name and revenue from your analysis]
- **Regional Performance:** [Key findings about city-level sales patterns]
- **Supplier Analysis:** [Insights about supplier diversity and concentration]

### Strategic Recommendations:
- [Provide at least 3 specific, actionable recommendations based on your data analysis]

### Technical Learnings:
- **Inner vs Left Joins:** [When you used each and why]
- **Multi-table Strategy:** [Your approach to building complex joined datasets]
- **Performance Considerations:** [Observations about join efficiency and data size]

In [53]:
# Optional: Save your analysis results (not required for autograding)
# Save key datasets to CSV files if desired (uncomment to use):
# write_csv(customer_metrics, "customer_metrics.csv")
# write_csv(product_metrics, "product_metrics.csv")
# write_csv(supplier_metrics, "supplier_metrics.csv")
# write_csv(regional_analysis, "regional_analysis.csv")
# write_csv(top_customers, "top_customers.csv")
# write_csv(product_combinations, "product_combinations.csv")
# write_csv(critical_suppliers, "critical_suppliers.csv")
# write_csv(market_expansion, "market_expansion.csv")

# Final summary output for autograding verification:
cat("\n=== HOMEWORK 6 COMPLETION SUMMARY ===\n")
cat("✓ Data Import: 5 datasets loaded\n")
cat("✓ Basic Joins: 4 join types completed\n") 
cat("✓ Multi-table Joins: 4-step progression completed\n")
cat("✓ Data Quality: Anti-joins and semi-joins performed\n")
cat("✓ Business Analysis: Customer, product, supplier, and regional analysis completed\n")
cat("✓ Complex Questions: Advanced business scenarios analyzed\n")
cat("✓ Summary: Comprehensive insights documented\n")
cat("\nAll required variables created for autograding verification.\n")


=== HOMEWORK 6 COMPLETION SUMMARY ===
✓ Data Import: 5 datasets loaded
✓ Basic Joins: 4 join types completed
✓ Multi-table Joins: 4-step progression completed
✓ Data Quality: Anti-joins and semi-joins performed
✓ Business Analysis: Customer, product, supplier, and regional analysis completed
✓ Complex Questions: Advanced business scenarios analyzed
✓ Summary: Comprehensive insights documented

All required variables created for autograding verification.
