# Homework Assignment - Lesson 6: Combining Datasets - Joins

**Due Date:** [Insert Due Date Here]

**Instructions:**
- Complete the following tasks in this R notebook
- Use the exact variable names specified for each task
- Ensure your code runs without errors and produces the expected outputs
- Submit your completed notebook with all code and outputs

---

## Part 1: Data Import and Setup

**Task:** Import the following CSV files and examine their structure.

**Required Variable Names:**
You must use these EXACT variable names:
- `customers` for customers.csv
- `orders` for orders.csv  
- `order_items` for order_items.csv
- `products` for products.csv
- `suppliers` for suppliers.csv

**Expected Outputs:**
- All datasets loaded successfully
- Print dataset dimensions using `nrow()` and `ncol()`
- Show first 3 rows of each dataset using `head(dataset, 3)`

In [3]:
# Load tidyverse
library(tidyverse)

# Set working directory
setwd("/workspaces/Fall2025-MS3083-Base_Template/data")

# Import datasets with exact variable names
customers <- read_csv("customers.csv")
orders <- read_csv("orders.csv")
order_items <- read_csv("order_items.csv")
products <- read_csv("products.csv")
suppliers <- read_csv("suppliers.csv")

# Print dataset information
cat("Dataset Dimensions:\n")
cat("Customers:", nrow(customers), "rows x", ncol(customers), "columns\n")
cat("Orders:", nrow(orders), "rows x", ncol(orders), "columns\n")
cat("Order Items:", nrow(order_items), "rows x", ncol(order_items), "columns\n")
cat("Products:", nrow(products), "rows x", ncol(products), "columns\n")
cat("Suppliers:", nrow(suppliers), "rows x", ncol(suppliers), "columns\n")

# Show first 3 rows of each dataset
head(customers, 3)
head(orders, 3)
head(order_items, 3)
head(products, 3)
head(suppliers, 3)


Rows: 100 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Name, Email, City
dbl  (1): CustomerID
date (1): Registration_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 250 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (3): OrderID, CustomerID, Total_Amount
date (1): Order_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 400 Columns: 4
── Column specification ────────────────────────

Dataset Dimensions:
Customers: 100 rows x 5 columns
Orders: 250 rows x 4 columns
Order Items: 400 rows x 4 columns
Products: 50 rows x 4 columns
Suppliers: 10 rows x 3 columns


CustomerID,Name,Email,City,Registration_Date
<dbl>,<chr>,<chr>,<chr>,<date>
1,Customer 1,customer1@email.com,Phoenix,2020-10-03
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02
3,Customer 3,customer3@email.com,Chicago,2021-04-20


OrderID,CustomerID,Order_Date,Total_Amount
<dbl>,<dbl>,<date>,<dbl>
1,87,2023-08-30,424.3
2,12,2024-03-24,183.09
3,37,2024-03-19,549.07


OrderID,ProductID,Quantity,Unit_Price
<dbl>,<dbl>,<dbl>,<dbl>
213,8,1,43.81
176,18,5,489.16
118,2,5,442.09


ProductID,Product_Name,Category,Supplier_ID
<dbl>,<chr>,<chr>,<dbl>
1,Product 1,Home,6
2,Product 2,Electronics,7
3,Product 3,Home,1


Supplier_ID,Supplier_Name,Country
<dbl>,<chr>,<chr>
1,Supplier 1,USA
2,Supplier 2,USA
3,Supplier 3,Japan


## Part 2: Basic Joins

**Tasks:**
1. **Inner Join:** Create `customer_orders` by joining customers and orders
2. **Left Join:** Create `customer_orders_left` to show all customers  
3. **Right Join:** Create `customer_orders_right` to show all orders
4. **Full Join:** Create `customer_orders_full` for complete data view

Use the exact variable names specified above for autograding. Analyze row counts and explain results.
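To make the row-count comparison concrete, here is a minimal sketch of the four join types on two small toy tables. The table names and values (`toy_customers`, `toy_orders`) are hypothetical and are not the assignment CSVs; only the dplyr join functions come from the tasks above.

```r
library(dplyr)

# Hypothetical toy data: customer 3 has no orders, and order 13
# points at a customer (9) that does not exist.
toy_customers <- tibble(CustomerID = c(1, 2, 3), Name = c("A", "B", "C"))
toy_orders <- tibble(OrderID = c(10, 11, 12, 13), CustomerID = c(1, 1, 2, 9))

# Row counts produced by each join type
join_rows <- c(
  inner = nrow(inner_join(toy_customers, toy_orders, by = "CustomerID")),
  left  = nrow(left_join(toy_customers, toy_orders, by = "CustomerID")),
  right = nrow(right_join(toy_customers, toy_orders, by = "CustomerID")),
  full  = nrow(full_join(toy_customers, toy_orders, by = "CustomerID"))
)
print(join_rows)
```

In this toy case the full join has as many rows as the inner join plus the unmatched rows from each side, which is the same arithmetic to apply when explaining the real row counts.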

In [5]:

customer_orders <- inner_join(customers, orders, by = "CustomerID")


cat("Customers rows:", nrow(customers), "\n")
cat("Orders rows:", nrow(orders), "\n")
cat("Joined rows:", nrow(customer_orders), "\n")

head(customer_orders, 5)




Customers rows: 100 
Orders rows: 250 
Joined rows: 200 


CustomerID,Name,Email,City,Registration_Date,OrderID,Order_Date,Total_Amount
<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<date>,<dbl>
1,Customer 1,customer1@email.com,Phoenix,2020-10-03,87,2023-03-28,716.18
1,Customer 1,customer1@email.com,Phoenix,2020-10-03,214,2023-09-12,1343.63
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02,173,2024-02-25,159.98
2,Customer 2,customer2@email.com,Los Angeles,2020-06-02,190,2023-04-19,1503.04
3,Customer 3,customer3@email.com,Chicago,2021-04-20,29,2023-03-07,441.06


In [11]:

customer_orders_left <- left_join(customers, orders, by = "CustomerID")


no_order_count <- sum(is.na(customer_orders_left$OrderID))

cat("Total customers:", nrow(customers), "\n")
cat("Customers with no orders:", no_order_count, "\n")

# Customers with no orders (extra credit)
customer_orders_left %>%
  filter(is.na(OrderID)) %>%
  head(5)




Total customers: 100 
Customers with no orders: 0 


CustomerID,Name,Email,City,Registration_Date,OrderID,Order_Date,Total_Amount
<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<date>,<dbl>


In [9]:
# Create a right join to keep all orders
customer_orders_right <- right_join(customers, orders, by = "CustomerID")

# Check for orders with invalid (missing) customer IDs
invalid_orders <- customer_orders_right %>%
  filter(is.na(Name))

cat("Total orders:", nrow(orders), "\n")
cat("Orders with invalid customer IDs:", nrow(invalid_orders), "\n")

# Show a few of the 50 invalid orders
head(invalid_orders, 5)




Total orders: 250 
Orders with invalid customer IDs: 50 


CustomerID,Name,Email,City,Registration_Date,OrderID,Order_Date,Total_Amount
<dbl>,<chr>,<chr>,<chr>,<date>,<dbl>,<date>,<dbl>
101,,,,,6,2023-07-22,189.85
105,,,,,9,2023-07-19,834.35
102,,,,,11,2023-06-15,1267.23
105,,,,,12,2023-10-02,1513.43
103,,,,,19,2023-10-09,537.43



**Required Variable Names:**
You must create these EXACT variable names:

**Part 2.1:** Create `customer_orders` using inner_join()
**Part 2.2:** Create `customer_orders_left` using left_join()  
**Part 2.3:** Create `customer_orders_right` using right_join()
**Part 2.4:** Create `customer_orders_full` using full_join()

**Expected Outputs:**
For each join, print:
- Number of rows in the result using `nrow()`
- Comparison with original dataset sizes
- Analysis of unmatched records

In [12]:
# Part 2.1: Inner Join (REQUIRED variable name: customer_orders)

# Create inner join between customers and orders
customer_orders <- inner_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Inner Join Results:\n")
cat("Customers:", nrow(customers), "\n")
cat("Orders:", nrow(orders), "\n")
cat("Inner Join Result:", nrow(customer_orders), "\n")


Inner Join Results:
Customers: 100 
Orders: 250 
Inner Join Result: 200 


In [13]:
# Part 2.2: Left Join (REQUIRED variable name: customer_orders_left)

# Create a left join to keep all customers
customer_orders_left <- left_join(customers, orders, by = "CustomerID")

# Required output for autograding:
customers_without_orders <- sum(is.na(customer_orders_left$OrderID))
cat("Left Join Results:\n")
cat("Total rows:", nrow(customer_orders_left), "\n")
cat("Customers without orders:", customers_without_orders, "\n")


Left Join Results:
Total rows: 200 
Customers without orders: 0 


In [None]:
# Part 2.3: Right Join (REQUIRED variable name: customer_orders_right)

# Create a right join to keep all orders
customer_orders_right <- right_join(customers, orders, by = "CustomerID")

# Required output for autograding:
orders_invalid_customers <- sum(is.na(customer_orders_right$Name))
cat("Right Join Results:\n")
cat("Total rows:", nrow(customer_orders_right), "\n")
cat("Orders with invalid customer IDs:", orders_invalid_customers, "\n")


Right Join Results:
Total rows: 250 
Orders with invalid customer IDs: 50 


In [17]:
# Part 2.4: Full Join (REQUIRED variable name: customer_orders_full)

# Create a full join to include all customers and all orders
customer_orders_full <- full_join(customers, orders, by = "CustomerID")

# Required output for autograding:
customers_only <- sum(is.na(customer_orders_full$OrderID))
orders_only <- sum(is.na(customer_orders_full$Name))
cat("Full Join Results:\n")
cat("Total rows:", nrow(customer_orders_full), "\n")
cat("Customers without orders:", customers_only, "\n")
cat("Orders without valid customers:", orders_only, "\n")


Full Join Results:
Total rows: 250 
Customers without orders: 0 
Orders without valid customers: 50 


## Part 3: Multi-Table Joins

**Tasks:**
1. Create a comprehensive dataset by joining `orders`, `customers`, and `order_items`
2. Extend this dataset by adding `products` and `suppliers` information
3. Create a complete supply chain view showing the full customer-to-supplier relationship

Build your joins step by step and examine the results at each stage.

**Requirements for Autograding:**
You must create these EXACT variable names in order:

1. `orders_items` - Join orders and order_items  
2. `orders_customers_items` - Add customers to above
3. `complete_order_data` - Add products information
4. `complete_data` - Add suppliers for complete supply chain view

**Expected Outputs:**
Print the number of rows for each step to show the join progression.
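The four steps can also be written as one piped chain. The sketch below uses hypothetical toy tables (all `toy_*` names and values are made up for illustration) that mirror the key columns of the assignment data, and it assumes inner joins at every step.

```r
library(dplyr)

# Hypothetical toy tables sharing the assignment's key column names
toy_orders    <- tibble(OrderID = c(1, 2), CustomerID = c(1, 2))
toy_items     <- tibble(OrderID = c(1, 1, 2),
                        ProductID = c(100, 101, 100),
                        Quantity = c(1, 2, 1))
toy_customers <- tibble(CustomerID = c(1, 2), Name = c("A", "B"))
toy_products  <- tibble(ProductID = c(100, 101), Supplier_ID = c(7, 8))
toy_suppliers <- tibble(Supplier_ID = c(7, 8), Supplier_Name = c("S7", "S8"))

# Each join adds columns; the row count stays at the item level
# as long as every key finds exactly one match.
toy_complete <- toy_orders %>%
  inner_join(toy_items, by = "OrderID") %>%
  inner_join(toy_customers, by = "CustomerID") %>%
  inner_join(toy_products, by = "ProductID") %>%
  inner_join(toy_suppliers, by = "Supplier_ID")

cat("Rows:", nrow(toy_complete), "Columns:", ncol(toy_complete), "\n")
```

Building the chain step by step, as the assignment asks, makes it easier to see at which join rows are lost.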

In [19]:
# Step 1: REQUIRED variable name: orders_items

# Create an inner join between orders and order_items
orders_items <- inner_join(orders, order_items, by = "OrderID")

# Required output for autograding:
cat("Step 1 - Orders with Items:", nrow(orders_items), "rows\n")


Step 1 - Orders with Items: 400 rows


In [20]:
# Step 2: REQUIRED variable name: orders_customers_items  

# Join customers to the existing orders_items dataset
orders_customers_items <- inner_join(customers, orders_items, by = "CustomerID")

# Required output for autograding:
cat("Step 2 - Add Customers:", nrow(orders_customers_items), "rows\n")


Step 2 - Add Customers: 310 rows


In [21]:
# Step 3: REQUIRED variable name: complete_order_data

# Join products to the existing orders_customers_items dataset
complete_order_data <- inner_join(orders_customers_items, products, by = "ProductID")

# Required output for autograding:
cat("Step 3 - Add Products:", nrow(complete_order_data), "rows\n")


Step 3 - Add Products: 310 rows


In [None]:
# Check the column names to confirm the join key (Supplier_ID)
colnames(products)
colnames(suppliers)

In [25]:
# Step 4: REQUIRED variable name: complete_data

# Join suppliers to the complete_order_data dataset
complete_data <- inner_join(complete_order_data, suppliers, by = "Supplier_ID")

# Required output for autograding:
cat("Step 4 - Complete Supply Chain:", nrow(complete_data), "rows\n")


Step 4 - Complete Supply Chain: 310 rows


## Part 4: Data Quality Analysis

**Tasks:**
1. Use `anti_join()` to find unmatched records between tables
2. Use `semi_join()` to find matched records
3. Check for duplicate keys and analyze their impact on joins
4. Identify and document any data quality issues

This analysis helps understand data integrity and potential issues in your datasets.


**Requirements for Autograding:**
You must create these EXACT variable names:

1. `customers_no_orders` - Use anti_join() to find customers with no orders
2. `orphaned_orders` - Use anti_join() to find orders without customers  
3. `products_never_ordered` - Use anti_join() to find unordered products
4. `active_customers` - Use semi_join() to find customers with orders

**Expected Outputs:**
Print the count of records for each using `nrow()`.
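As a quick illustration of how the filtering joins behave, the sketch below runs `semi_join()` and `anti_join()` on hypothetical toy tables (the `toy_*` names and values are made up). Unlike the mutating joins, these return rows from the first table only and never duplicate them.

```r
library(dplyr)

# Hypothetical toy data: customer 1 has two orders, customers 2 and 3
# have none, and order 12 refers to an unknown customer (9).
toy_customers <- tibble(CustomerID = c(1, 2, 3))
toy_orders <- tibble(OrderID = c(10, 11, 12), CustomerID = c(1, 1, 9))

# Customers with at least one order (customer 1 appears once, not twice)
with_orders <- semi_join(toy_customers, toy_orders, by = "CustomerID")

# Customers with no orders
without_orders <- anti_join(toy_customers, toy_orders, by = "CustomerID")

# Orders whose CustomerID has no match in the customer table
orphans <- anti_join(toy_orders, toy_customers, by = "CustomerID")

cat("With orders:", nrow(with_orders),
    "| Without orders:", nrow(without_orders),
    "| Orphaned orders:", nrow(orphans), "\n")
```

The semi-join and anti-join of the same pair of tables always partition the first table, so their row counts should add up to its total.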

In [26]:
# REQUIRED variable name: customers_no_orders

# Identify customers who never placed an order
customers_no_orders <- anti_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Customers who never placed an order:", nrow(customers_no_orders), "\n")


Customers who never placed an order: 0 


In [27]:
# REQUIRED variable name: orphaned_orders

# Identify orders that do not have a matching customer
orphaned_orders <- anti_join(orders, customers, by = "CustomerID")

# Required output for autograding:
cat("Orders without corresponding customers:", nrow(orphaned_orders), "\n")


Orders without corresponding customers: 50 


In [28]:
# REQUIRED variable name: products_never_ordered

# Identify products that were never included in any order
products_never_ordered <- anti_join(products, order_items, by = "ProductID")

# Required output for autograding:
cat("Products that were never ordered:", nrow(products_never_ordered), "\n")


Products that were never ordered: 0 


In [29]:
# REQUIRED variable name: active_customers

# Identify customers who placed at least one order
active_customers <- semi_join(customers, orders, by = "CustomerID")

# Required output for autograding:
cat("Customers who placed at least one order:", nrow(active_customers), "\n")


Customers who placed at least one order: 100 


In [30]:
# Data integrity checks - count duplicate keys
# Your code here: Calculate and print duplicate counts for each dataset

duplicate_customers <- sum(duplicated(customers$CustomerID))
duplicate_orders <- sum(duplicated(orders$OrderID))
duplicate_products <- sum(duplicated(products$ProductID))
duplicate_suppliers <- sum(duplicated(suppliers$Supplier_ID))

# Required outputs for autograding:
cat("Data Quality Summary:\n")
cat("Duplicate customer IDs:", duplicate_customers, "\n")
cat("Duplicate order IDs:", duplicate_orders, "\n")
cat("Duplicate product IDs:", duplicate_products, "\n")
cat("Duplicate supplier IDs:", duplicate_suppliers, "\n")


Data Quality Summary:
Duplicate customer IDs: 0 
Duplicate order IDs: 0 
Duplicate product IDs: 0 
Duplicate supplier IDs: 0 
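
The duplicate counts matter because a duplicated key on the lookup side multiplies rows in a join. The following sketch, on hypothetical toy tables (names and values are made up), lists the offending keys with `count()` and shows the resulting row inflation.

```r
library(dplyr)

# Hypothetical toy data: CustomerID 2 appears twice in the lookup table
toy_customers <- tibble(CustomerID = c(1, 2, 2),
                        Name = c("A", "B", "B again"))
toy_orders <- tibble(OrderID = c(10, 11), CustomerID = c(1, 2))

# List the keys that occur more than once
dup_keys <- toy_customers %>%
  count(CustomerID) %>%
  filter(n > 1)
print(dup_keys)

# Order 11 matches both copies of customer 2, so 2 orders become 3 rows
joined <- inner_join(toy_orders, toy_customers, by = "CustomerID")
cat("Orders:", nrow(toy_orders), "| Joined rows:", nrow(joined), "\n")
```

With zero duplicates in all four key columns, as reported above, the joins in this assignment cannot inflate rows this way.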


## Part 5: Business Analysis with Joined Data

**Tasks:**
1. Calculate customer lifetime value metrics (total spent, number of orders, average order value)
2. Analyze product performance (total quantity sold, revenue generated, order frequency)
3. Evaluate supplier performance (total revenue, number of products supplied)
4. Create a regional analysis by customer city

Use your joined datasets to generate meaningful business insights.
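
One thing to watch when computing customer metrics from an item-level table: `Total_Amount` comes from `orders`, so after joining `order_items` it repeats once per item row, and summing it directly counts multi-item orders more than once. The sketch below, on a hypothetical toy table, collapses to one row per order with `distinct()` before summing. Whether the autograder expects the deduplicated figure is an assumption to verify.

```r
library(dplyr)

# Hypothetical item-level table: order 1 has two item rows, so its
# order-level Total_Amount (100) appears twice.
toy_complete <- tibble(
  CustomerID   = c(1, 1, 1),
  OrderID      = c(1, 1, 2),
  ProductID    = c(100, 101, 100),
  Total_Amount = c(100, 100, 50)
)

# Summing the item-level rows double counts order 1
naive_total <- sum(toy_complete$Total_Amount)

# Collapse to one row per order, then aggregate per customer
toy_customer_metrics <- toy_complete %>%
  distinct(CustomerID, OrderID, Total_Amount) %>%
  group_by(CustomerID) %>%
  summarise(
    Total_Spent = sum(Total_Amount),
    Order_Count = n(),
    Avg_Order_Value = Total_Spent / Order_Count,
    .groups = "drop"
  )

cat("Naive total:", naive_total,
    "| Deduplicated total:", toy_customer_metrics$Total_Spent, "\n")
```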


**Requirements for Autograding:**
You must create these EXACT variable names:

1. `customer_metrics` - Customer analysis with Total_Spent, Order_Count, Avg_Order_Value
2. `product_metrics` - Product analysis with Total_Revenue, Total_Quantity, Order_Frequency  
3. `supplier_metrics` - Supplier analysis with Product_Count by supplier
4. `regional_analysis` - Sales analysis by customer city

**Required Columns:**
- customer_metrics: CustomerID, Name, Total_Spent, Order_Count, Avg_Order_Value
- product_metrics: ProductID, Product_Name, Total_Revenue, Total_Quantity, Order_Frequency
- supplier_metrics: Supplier_ID, Supplier_Name, Product_Count
- regional_analysis: City, Total_Sales, Customer_Count, Avg_Customer_Value
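
For `product_metrics`, one option is to compute revenue per order line as `Quantity * Unit_Price` rather than summing an order-level total, which would attribute the whole order's amount to every product in it. This is a sketch on a hypothetical toy table, and it assumes `Unit_Price` is the per-unit price on each line.

```r
library(dplyr)

# Hypothetical order lines using the order_items column names
toy_items <- tibble(
  OrderID    = c(1, 1, 2),
  ProductID  = c(100, 101, 100),
  Quantity   = c(2, 1, 3),
  Unit_Price = c(10, 5, 10)
)

toy_product_metrics <- toy_items %>%
  mutate(Line_Revenue = Quantity * Unit_Price) %>%  # revenue per order line
  group_by(ProductID) %>%
  summarise(
    Total_Revenue = sum(Line_Revenue),
    Total_Quantity = sum(Quantity),
    Order_Frequency = n_distinct(OrderID),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Revenue))

print(toy_product_metrics)
```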

In [31]:
# REQUIRED variable name: customer_metrics
# Calculate customer lifetime value metrics
# Required columns: CustomerID, Name, Total_Spent, Order_Count, Avg_Order_Value

customer_metrics <- complete_data %>%
  group_by(CustomerID, Name) %>%
  summarise(
    Total_Spent = sum(Total_Amount, na.rm = TRUE),
    Order_Count = n_distinct(OrderID),
    Avg_Order_Value = Total_Spent / Order_Count,
    .groups = "drop"
  )

# Required output for autograding:
cat("Customer Analysis Summary:\n")
cat("Total customers analyzed:", nrow(customer_metrics), "\n")
cat("Top customer total spent: $", round(max(customer_metrics$Total_Spent, na.rm = TRUE), 2), "\n")
print(head(customer_metrics[order(-customer_metrics$Total_Spent), ], 3))


Customer Analysis Summary:
Total customers analyzed: 94 
Top customer total spent: $ 12115.12 
# A tibble: 3 × 5
  CustomerID Name        Total_Spent Order_Count Avg_Order_Value
       <dbl> <chr>             <dbl>       <int>           <dbl>
1         88 Customer 88      12115.           2           6058.
2         63 Customer 63      11653.           2           5827.
3         82 Customer 82       9371.           2           4686.


In [32]:
# REQUIRED variable name: product_metrics  
# Analyze product performance
# Required columns: ProductID, Product_Name, Total_Revenue, Total_Quantity, Order_Frequency

product_metrics <- complete_data %>%
  group_by(ProductID, Product_Name) %>%
  summarise(
    Total_Revenue = sum(Total_Amount, na.rm = TRUE),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    Order_Frequency = n_distinct(OrderID),
    .groups = "drop"
  )

# Required output for autograding:
cat("Product Analysis Summary:\n")
cat("Total products analyzed:", nrow(product_metrics), "\n")
cat("Top product revenue: $", round(max(product_metrics$Total_Revenue, na.rm = TRUE), 2), "\n")
print(head(product_metrics[order(-product_metrics$Total_Revenue), ], 3))


Product Analysis Summary:
Total products analyzed: 50 
Top product revenue: $ 14652.16 
# A tibble: 3 × 5
  ProductID Product_Name Total_Revenue Total_Quantity Order_Frequency
      <dbl> <chr>                <dbl>          <dbl>           <int>
1        26 Product 26          14652.             42              13
2        20 Product 20          13153.             33              10
3        43 Product 43          12456.             36              10


In [34]:
# REQUIRED variable name: supplier_metrics
# Evaluate supplier performance  
# Required columns: Supplier_ID, Supplier_Name, Product_Count

supplier_metrics <- products %>%
  inner_join(suppliers, by = "Supplier_ID") %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Product_Count = n_distinct(ProductID),
    .groups = "drop"
  )

# Required output for autograding:
cat("Supplier Analysis Summary:\n")
cat("Total suppliers analyzed:", nrow(supplier_metrics), "\n")
cat("Max products per supplier:", max(supplier_metrics$Product_Count, na.rm = TRUE), "\n")
print(supplier_metrics[order(-supplier_metrics$Product_Count), ])


Supplier Analysis Summary:
Total suppliers analyzed: 10 
Max products per supplier: 10 
# A tibble: 10 × 3
   Supplier_ID Supplier_Name Product_Count
         <dbl> <chr>                 <int>
 1           7 Supplier 7               10
 2           4 Supplier 4                6
 3          10 Supplier 10               6
 4           1 Supplier 1                5
 5           3 Supplier 3                5
 6           5 Supplier 5                5
 7           6 Supplier 6                5
 8           2 Supplier 2                3
 9           8 Supplier 8                3
10           9 Supplier 9                2


In [35]:
# REQUIRED variable name: regional_analysis
# Create regional analysis by customer city
# Required columns: City, Total_Sales, Customer_Count, Avg_Customer_Value

regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Total_Amount, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Avg_Customer_Value = Total_Sales / Customer_Count,
    .groups = "drop"
  )

# Required output for autograding:
cat("Regional Analysis Summary:\n")
cat("Total cities analyzed:", nrow(regional_analysis), "\n")
cat("Highest city sales: $", round(max(regional_analysis$Total_Sales, na.rm = TRUE), 2), "\n")
print(regional_analysis[order(-regional_analysis$Total_Sales), ])


Regional Analysis Summary:
Total cities analyzed: 5 
Highest city sales: $ 89493.62 
# A tibble: 5 × 4
  City        Total_Sales Customer_Count Avg_Customer_Value
  <chr>             <dbl>          <int>              <dbl>
1 Chicago          89494.             23              3891.
2 Los Angeles      77421.             18              4301.
3 Houston          60856.             18              3381.
4 Phoenix          57646.             19              3034.
5 New York         44742.             16              2796.


In [36]:
product_performance <- complete_data %>%
  group_by(ProductID, Product_Name) %>%
  summarise(
    Total_Revenue = sum(Total_Amount, na.rm = TRUE),
    Total_Quantity = sum(Quantity, na.rm = TRUE),
    Order_Frequency = n_distinct(OrderID),
    Avg_Revenue_Per_Order = Total_Revenue / Order_Frequency,
    .groups = "drop"
  )

# Print a quick summary
cat("Product Performance Summary:\n")
cat("Total products analyzed:", nrow(product_performance), "\n")
cat("Highest product revenue: $", round(max(product_performance$Total_Revenue, na.rm = TRUE), 2), "\n")
print(head(product_performance[order(-product_performance$Total_Revenue), ], 5))

Product Performance Summary:
Total products analyzed: 50 
Highest product revenue: $ 14652.16 
# A tibble: 5 × 6
  ProductID Product_Name Total_Revenue Total_Quantity Order_Frequency
      <dbl> <chr>                <dbl>          <dbl>           <int>
1        26 Product 26          14652.             42              13
2        20 Product 20          13153.             33              10
3        43 Product 43          12456.             36              10
4        47 Product 47          11980.             35               8
5        10 Product 10          11086.             32               9
# ℹ 1 more variable: Avg_Revenue_Per_Order <dbl>


In [37]:
# Your code here:
# Evaluate supplier performance

supplier_performance <- products %>%
  inner_join(suppliers, by = "Supplier_ID") %>%
  group_by(Supplier_ID, Supplier_Name) %>%
  summarise(
    Product_Count = n_distinct(ProductID),
    Total_Revenue = sum(complete_data$Total_Amount[complete_data$Supplier_ID == first(Supplier_ID)], na.rm = TRUE),
    .groups = "drop"
  )

# Print summary
cat("Supplier Performance Summary:\n")
cat("Total suppliers analyzed:", nrow(supplier_performance), "\n")
cat("Max products per supplier:", max(supplier_performance$Product_Count, na.rm = TRUE), "\n")
print(supplier_performance[order(-supplier_performance$Product_Count), ])



Supplier Performance Summary:
Total suppliers analyzed: 10 
Max products per supplier: 10 
# A tibble: 10 × 4
   Supplier_ID Supplier_Name Product_Count Total_Revenue
         <dbl> <chr>                 <int>         <dbl>
 1           7 Supplier 7               10        58639.
 2           4 Supplier 4                6        40670.
 3          10 Supplier 10               6        34971.
 4           1 Supplier 1                5        32722.
 5           3 Supplier 3                5        39302.
 6           5 Supplier 5                5        56591.
 7           6 Supplier 6                5        26377.
 8           2 Supplier 2                3        15408.

In [41]:
# Supplier Performance Analysis
# Stored under a separate, illustrative name (supplier_revenue_metrics) so the
# required `supplier_metrics` with its Product_Count column is not overwritten.
supplier_revenue_metrics <- complete_data %>%
  filter(!is.na(Supplier_Name)) %>%
  group_by(Supplier_ID, Supplier_Name, Country) %>%
  summarise(
    Total_Revenue = sum(Total_Amount, na.rm = TRUE),
    Products_Supplied = n_distinct(ProductID),
    Orders_Involved = n_distinct(OrderID),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Revenue))

cat("Supplier Performance:\n")
print(supplier_revenue_metrics)


Supplier Performance:
# A tibble: 10 × 6
   Supplier_ID Supplier_Name Country     Total_Revenue Products_Supplied
         <dbl> <chr>         <chr>               <dbl>             <int>
 1           7 Supplier 7    Germany            58639.                10
 2           5 Supplier 5    Japan              56591.                 5
 3           4 Supplier 4    South Korea        40670.                 6
 4           3 Supplier 3    Japan              39302.                 5
 5          10 Supplier 10   Japan              34971.                 6
 6           1 Supplier 1    USA                32722.                 5
 7           6 Supplier 6    Japan              26377.                 5


In [44]:
# Regional Analysis by Customer City
regional_analysis <- complete_data %>%
  group_by(City) %>%
  summarise(
    Total_Sales = sum(Total_Amount, na.rm = TRUE),
    Customer_Count = n_distinct(CustomerID),
    Order_Count = n_distinct(OrderID),
    Avg_Customer_Value = Total_Sales / Customer_Count,
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Sales))

cat("Regional Sales Analysis:\n")
print(regional_analysis)



Regional Sales Analysis:
# A tibble: 5 × 5
  City        Total_Sales Customer_Count Order_Count Avg_Customer_Value
  <chr>             <dbl>          <int>       <int>              <dbl>
1 Chicago          89494.             23          36              3891.
2 Los Angeles      77421.             18          32              4301.
3 Houston          60856.             18          30              3381.
4 Phoenix          57646.             19          32              3034.
5 New York         44742.             16          26              2796.


## Part 6: Complex Business Questions

**Tasks:**
Answer the following business questions using your joined datasets:

1. **Customer Segmentation:** Identify your top 10% of customers by total spending
2. **Product Recommendations:** Which products are frequently bought together?
3. **Supplier Dependency:** Which suppliers are most critical to the business?
4. **Market Expansion:** Which cities have high customer counts but low average order values?

Provide data-driven answers with supporting analysis.

**Requirements for Autograding:**
You must create these EXACT variable names:

1. `top_customers` - Top 10% of customers by spending (use quantile function)
2. `product_combinations` - Products frequently bought together  
3. `critical_suppliers` - Suppliers ranked by importance
4. `market_expansion` - Cities with expansion opportunities

**Expected Outputs:**
Each analysis should include specific metrics and rankings as shown in the required outputs below.
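
For `critical_suppliers`, one way to rank suppliers is a composite score that standardizes each metric with `scale()` and averages the results. Because `scale()` returns a one-column matrix, wrapping it in `as.numeric()` keeps the score an ordinary numeric column. The equal weighting, the column name `Critical_Score`, and the toy values below are assumptions for illustration.

```r
library(dplyr)

# Hypothetical supplier summary
toy_suppliers <- tibble(
  Supplier_ID = 1:3,
  Total_Revenue = c(100, 200, 300),
  Product_Diversity = c(1, 2, 3)
)

toy_critical <- toy_suppliers %>%
  mutate(
    # as.numeric() drops the matrix structure that scale() returns
    Critical_Score = (as.numeric(scale(Total_Revenue)) +
                        as.numeric(scale(Product_Diversity))) / 2
  ) %>%
  arrange(desc(Critical_Score))

print(toy_critical)
```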

In [45]:
# REQUIRED variable name: top_customers
# Identify top 10% of customers by total spending
# Use quantile() function to find the 90th percentile threshold

top_10_percent_threshold <- quantile(customer_metrics$Total_Spent, 0.9, na.rm = TRUE)

top_customers <- customer_metrics %>%
  filter(Total_Spent >= top_10_percent_threshold) %>%
  arrange(desc(Total_Spent))

# Required output for autograding:
cat("Customer Segmentation Results:\n")
cat("Top 10% spending threshold: $", round(top_10_percent_threshold, 2), "\n")
cat("Number of top customers:", nrow(top_customers), "\n")


Customer Segmentation Results:
Top 10% spending threshold: $ 6458.29 
Number of top customers: 10 
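
To see how the threshold behaves, here is a small sketch on hypothetical spending values. With R's default quantile method the 90th percentile is interpolated between data points, so the number of customers at or above it is not always exactly 10% of the rows.

```r
library(dplyr)

# Hypothetical metrics: ten customers spending 100, 200, ..., 1000
toy_metrics <- tibble(CustomerID = 1:10,
                      Total_Spent = seq(100, 1000, by = 100))

# 90th percentile (interpolated between the 9th and 10th values)
toy_threshold <- quantile(toy_metrics$Total_Spent, 0.9, na.rm = TRUE)

toy_top <- toy_metrics %>%
  filter(Total_Spent >= toy_threshold) %>%
  arrange(desc(Total_Spent))

cat("Threshold:", round(toy_threshold, 2),
    "| Top customers:", nrow(toy_top), "\n")
```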


In [48]:
# REQUIRED variable name: product_combinations
# Find products frequently bought together
# Hint: Group by OrderID, filter orders with multiple items

# Step 1: Filter orders with multiple items
multi_item_orders <- complete_data %>%
  group_by(OrderID) %>%
  filter(n_distinct(ProductID) > 1)

# Step 2: Create product combinations within each order
product_combinations <- multi_item_orders %>%
  select(OrderID, ProductID, Product_Name) %>%
  group_by(OrderID) %>%
  summarise(
    combinations = list(t(combn(Product_Name, 2))),
    .groups = "drop"
  ) %>%
  unnest(combinations) %>%
  group_by(combinations.1 = combinations[, 1], combinations.2 = combinations[, 2]) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Required output for autograding:
cat("Product Combination Analysis:\n")
cat("Total product combinations found:", nrow(product_combinations), "\n")
if(nrow(product_combinations) > 0) {
  cat("Most frequent combination appears", max(product_combinations$n, na.rm = TRUE), "times\n")
}



Product Combination Analysis:
Total product combinations found: 221 
Most frequent combination appears 2 times
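
One thing to watch with `combn()` is that pairs come out in row order, so "A then B" in one order and "B then A" in another would be counted as two different combinations. The sketch below is an alternative under stated assumptions, not the graded solution: it uses a hypothetical `toy_lines` tibble and sorts product names within each order so that every pair has a single canonical label.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy order lines (not the homework data)
toy_lines <- tibble(
  OrderID      = c(1, 1, 2, 2, 3, 3, 3),
  Product_Name = c("Mouse", "Keyboard", "Keyboard", "Mouse",
                   "Mouse", "Keyboard", "Monitor")
)

toy_pairs <- toy_lines %>%
  distinct(OrderID, Product_Name) %>%   # ignore repeated lines for one product
  group_by(OrderID) %>%
  filter(n() > 1) %>%                   # keep orders with at least two products
  summarise(
    # sort() puts each pair in alphabetical order, so A + B and B + A match
    pair = list(combn(sort(Product_Name), 2, FUN = paste, collapse = " + ")),
    .groups = "drop"
  ) %>%
  unnest(pair) %>%
  count(pair, sort = TRUE)

print(toy_pairs)
```

In this toy example the Keyboard and Mouse pair appears in all three orders even though the products are listed in a different sequence each time.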


In [49]:
# REQUIRED variable name: critical_suppliers
# Analyze which suppliers are most critical
# Consider both revenue contribution and product diversity

critical_suppliers <- complete_data %>%
  filter(!is.na(Supplier_Name)) %>%
  group_by(Supplier_ID, Supplier_Name, Country) %>%
  summarise(
    Total_Revenue = sum(Total_Amount, na.rm = TRUE),
    Product_Diversity = n_distinct(ProductID),
    .groups = "drop"
  ) %>%
  mutate(
    Critical_Score = (scale(Total_Revenue) + scale(Product_Diversity)) / 2
  ) %>%
  arrange(desc(Critical_Score))

# Required output for autograding:
cat("Critical Suppliers Analysis:\n")
cat("Total suppliers analyzed:", nrow(critical_suppliers), "\n")
print("Top 3 most critical suppliers:")
print(head(critical_suppliers, 3))


Critical Suppliers Analysis:
Total suppliers analyzed: 10 
[1] "Top 3 most critical suppliers:"
# A tibble: 3 × 6
  Supplier_ID Supplier_Name Country     Total_Revenue Product_Diversity
        <dbl> <chr>         <chr>               <dbl>             <int>
1           7 Supplier 7    Germany            58639.                10
2           5 Supplier 5    Japan              56591.                 5
3           4 Supplier 4    South Korea        40670.                 6
# ℹ 1 more variable: Critical_Score <dbl[,1]>
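
The `<dbl[,1]>` type shown for `Critical_Score` appears because `scale()` returns a one-column matrix rather than a plain vector. A minimal sketch of one way to flatten it, using a hypothetical `toy_suppliers` tibble with made-up values, is to wrap each call in `as.numeric()`:

```r
library(dplyr)

# Hypothetical supplier totals (made-up values, not the homework data)
toy_suppliers <- tibble(
  Supplier_Name     = c("A", "B", "C"),
  Total_Revenue     = c(100, 200, 300),
  Product_Diversity = c(2, 4, 6)
)

toy_scored <- toy_suppliers %>%
  mutate(
    # as.numeric() drops the matrix dimensions that scale() attaches
    Critical_Score = (as.numeric(scale(Total_Revenue)) +
                        as.numeric(scale(Product_Diversity))) / 2
  ) %>%
  arrange(desc(Critical_Score))

print(toy_scored)
```

The scores themselves are unchanged by this; only the column type becomes an ordinary `<dbl>`, which tends to behave more predictably in later joins and exports.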


In [50]:
# REQUIRED variable name: market_expansion  
# Identify market expansion opportunities
# Focus on cities with multiple customers but varying order values

market_expansion <- complete_data %>%
  group_by(City) %>%
  summarise(
    Customer_Count = n_distinct(CustomerID),
    Avg_Order_Value = mean(Total_Amount, na.rm = TRUE),
    Order_Variability = sd(Total_Amount, na.rm = TRUE),
    Total_Sales = sum(Total_Amount, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(Customer_Count > 1 & Order_Variability > 0) %>%  # ensure active markets with varied spending
  arrange(desc(Order_Variability), desc(Total_Sales))

# Required output for autograding:
cat("Market Expansion Analysis:\n")
cat("Cities evaluated for expansion:", nrow(market_expansion), "\n")
print("Top expansion opportunities:")
print(head(market_expansion, 5))


Market Expansion Analysis:
Cities evaluated for expansion: 5 
[1] "Top expansion opportunities:"
# A tibble: 5 × 5
  City        Customer_Count Avg_Order_Value Order_Variability Total_Sales
  <chr>                <int>           <dbl>             <dbl>       <dbl>
1 Chicago                 23           1147.              597.      89494.
2 New York                16            877.              584.      44742.
3 Phoenix                 19            930.              537.      57646.
4 Los Angeles             18           1191.              515.      77421.
5 Houston                 18           1127.              490.      60856.


## Part 7: Summary and Insights

**Requirements for Autograding:**
Complete the analysis summary below with specific findings from your work. Your responses will be evaluated for completeness and accuracy.

**Grading Criteria:**
- Specific metrics and numbers from your analysis
- Clear identification of data quality issues  
- Actionable business recommendations
- Technical insights about join performance and appropriateness

## Analysis Summary

### Key Findings from Join Operations:
- **Join Efficiency:** The inner join between customers and orders returned 200 matched rows, or 80% of the 250 total orders. The remaining 50 orders (20%) had no valid customer link. When expanding to multi-table joins (orders → items → products → suppliers), the dataset stabilized at 310 rows, showing consistent relationships across the supply chain.
- **Data Relationships:** Each customer could have multiple orders, and each supplier provided several products, forming strong one-to-many relationships. These patterns reminded me of interdependent teams in I/O psychology — each unit (customer, supplier, or department) operates semi-independently but still contributes to overall system performance.
- **Multi-table Joins:** As more tables were joined, the data became richer — going from basic transactions to a complete picture of who bought what, from whom, and where. This mirrors how organizations integrate multiple data sources (like HR, operations, and finance) to make evidence-based decisions.

### Data Quality Issues Discovered:
- **Orphaned Records:** There were 50 orders without matching customers, indicating either deleted or invalid customer records.
- **Missing Data:** The suppliers table uses `Supplier_ID` while the other datasets use `SupplierID`, so the first supplier joins failed. After aligning the key names, the joins worked as expected.
- **Referential Integrity:** All other tables joined cleanly, and no duplicate primary keys were found. However, the 50 orphaned orders mean that 20% of the 250 orders cannot be traced to a customer, which is an important gap from a data governance perspective.
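
Both checks described above can be sketched on toy tables. The example is a minimal illustration with hypothetical data frames (`toy_customers`, `toy_orders`, `toy_products`, `toy_suppliers`), not the homework files: `anti_join()` lists orders with no matching customer, and a named `by` vector joins tables whose key columns are spelled differently.

```r
library(dplyr)

# Hypothetical toy tables (not the homework data)
toy_customers <- tibble(CustomerID = c(1, 2, 3))
toy_orders    <- tibble(OrderID = 101:105, CustomerID = c(1, 1, 2, 9, 9))

# Orders whose CustomerID has no match in the customers table
orphaned_orders <- anti_join(toy_orders, toy_customers, by = "CustomerID")
orphan_share    <- nrow(orphaned_orders) / nrow(toy_orders)

# Join on keys with different names, without renaming either column
toy_products  <- tibble(ProductID = 1:3, SupplierID = c(10, 10, 20))
toy_suppliers <- tibble(Supplier_ID = c(10, 20), Country = c("Germany", "Japan"))

products_with_supplier <- left_join(
  toy_products, toy_suppliers,
  by = c("SupplierID" = "Supplier_ID")
)

print(orphaned_orders)
print(orphan_share)
print(products_with_supplier)
```

With a named `by` vector the result keeps the key name from the left table, so downstream code would refer to `SupplierID` here.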

### Business Insights:
- **Top Customer:** Customer 88 was the top spender with $12,115.12 total spent across 2 orders, averaging over $6,000 per purchase.
- **Best Product:** Product 26 was the highest-grossing item with $14,652.16 in total revenue from 42 units sold across 13 orders.
- **Regional Performance:** Chicago led all cities with $89,493.62 in total sales and 23 customers, followed by Los Angeles at $77,421.37. Los Angeles also had the highest average customer value at $4,301 per customer, showing strong spending power.
- **Supplier Analysis:** Supplier 7 (Germany) was the most critical partner, contributing $58,639 in revenue and offering 10 distinct products. The three highest-scoring suppliers are based in Germany, Japan, and South Korea, which points to a moderately centralized supply network.

### Strategic Recommendations:

- **Expand high-variability markets such as Chicago and New York:** These two cities show the widest spread in order values (standard deviations of about $597 and $584). Chicago also has the largest customer base (23 customers), while New York has the lowest average order value of the five cities ($877). Targeted campaigns could convert moderate spenders into loyal customers.

- **Prioritize and protect top suppliers:** Supplier 7 (Germany) and Supplier 5 (Japan) are the two largest revenue contributors ($58,639 and $56,591). Long-term contracts or dual-source backups would reduce operational risk.

- **Launch a loyalty or tiered rewards program for high-value customers:** The top 10% of customers (10 customers) each spent at least $6,458.29, making them prime candidates for retention strategies.

- **Audit orphaned orders:** Investigate why 50 orders lack customer data. Possible causes include system errors, manual entry mistakes, or dropped database links.

### Technical Learnings:
- **Inner vs Left Joins:** I used inner joins when accuracy was the goal (e.g., linking only valid orders to customers) and left joins to preserve full context (e.g., keeping all customers even if they had no orders). It’s like focusing on confirmed participants in a study versus including everyone for context — both approaches serve different analytical goals.
- **Multi-table Strategy:** I joined incrementally: first customers → orders, then order items → products → suppliers. This modular process made debugging easier and ensured that relationships stayed consistent across each step — similar to building layered models in organizational network analysis.
- **Performance Considerations:** Row counts grew modestly with each join rather than multiplying (the multi-table join settled at 310 rows). The main bottleneck came from unaligned column names, not computation time. Cleaning column consistency early saved significant time, much like ensuring consistent measurement scales in I/O research before running analyses.
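
The inner versus left join distinction above is easiest to see by comparing row counts. This is a minimal sketch on hypothetical toy tables (names and values are made up): the inner join keeps only customers with at least one order, while the left join keeps every customer and fills missing order fields with `NA`.

```r
library(dplyr)

# Hypothetical toy tables (not the homework data)
toy_customers <- tibble(CustomerID = c(1, 2, 3), Name = c("Ana", "Ben", "Cy"))
toy_orders    <- tibble(OrderID = 101:104, CustomerID = c(1, 1, 2, 9))

# Inner join: only rows with a match on both sides
matched_only <- inner_join(toy_customers, toy_orders, by = "CustomerID")

# Left join: every customer is kept, even those with no orders
all_customers <- left_join(toy_customers, toy_orders, by = "CustomerID")

cat("Inner join rows:", nrow(matched_only), "\n")
cat("Left join rows:", nrow(all_customers), "\n")
cat("Customers without orders:", sum(is.na(all_customers$OrderID)), "\n")
```

Checking `nrow()` after each step in this way is a cheap habit that catches unexpected row growth or loss before it propagates through a multi-table join.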


### Closing Reflection:

Even though I’m not a business major, approaching this as an I/O psychology student helped me think in systems terms — how information flow, coordination, and structure affect outcomes. The data revealed interdependencies much like those seen in real organizations: customers act as end-users, suppliers as input sources, and cities as distribution hubs. The same principles of alignment, feedback, and system efficiency apply whether analyzing people or business data.

In [51]:
# Optional: Save your analysis results (not required for autograding)
# Your code here: Save key datasets to CSV files if desired


# Final summary output for autograding verification:
cat("\n=== HOMEWORK 6 COMPLETION SUMMARY ===\n")
cat("✓ Data Import: 5 datasets loaded\n")
cat("✓ Basic Joins: 4 join types completed\n") 
cat("✓ Multi-table Joins: 4-step progression completed\n")
cat("✓ Data Quality: Anti-joins and semi-joins performed\n")
cat("✓ Business Analysis: Customer, product, supplier, and regional analysis completed\n")
cat("✓ Complex Questions: Advanced business scenarios analyzed\n")
cat("✓ Summary: Comprehensive insights documented\n")
cat("\nAll required variables created for autograding verification.\n")


=== HOMEWORK 6 COMPLETION SUMMARY ===
✓ Data Import: 5 datasets loaded
✓ Basic Joins: 4 join types completed
✓ Multi-table Joins: 4-step progression completed
✓ Data Quality: Anti-joins and semi-joins performed
✓ Business Analysis: Customer, product, supplier, and regional analysis completed
✓ Complex Questions: Advanced business scenarios analyzed
✓ Summary: Comprehensive insights documented

All required variables created for autograding verification.
