# Homework Assignment - Lesson 3: Data Transformation with dplyr - Part 1

**Student Name:** Wesley Cook

**Due Date:** 9/21/2025

**Objective:** Learn to use dplyr functions (`select()`, `filter()`, `arrange()`) and the pipe operator (`%>%`) for data transformation and analysis.

---

## Instructions

- Complete all tasks in this notebook
- Use the pipe operator (`%>%`) wherever possible to chain operations
- Ensure your code is well-commented and easy to understand
- Run all cells to verify your code works correctly
- Answer all reflection questions at the end

---

## Part 1: Data Import and Setup

In this section, you'll import the retail transactions dataset and perform initial exploration.

**Dataset:** `retail_transactions.csv` - This dataset contains transaction records from a retail business with information about customers, products, dates, amounts, and quantities.

In [12]:
# Load required libraries
library(tidyverse)

# Set working directory (adjust the path for your environment)
setwd("/workspaces/Wespitory/data")

# Task 1.1: Import the retail_transactions.csv file
# Create a data frame named 'transactions'

# Your code here:
transactions <- read_csv("retail_transactions.csv")

# Display success message
cat("Data imported successfully!\n")
cat("Dataset dimensions:", nrow(transactions), "rows x", ncol(transactions), "columns\n")

Rows: 500 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): CustomerName, CustomerCity, ProductName, ProductCategory
dbl  (4): TransactionID, CustomerID, TotalAmount, Quantity
date (1): TransactionDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Data imported successfully!
Dataset dimensions: 500 rows x 9 columns


In [13]:
# Task 1.2: Initial Exploration

# Display the first 10 rows
cat("First 10 rows of the dataset:\n")
# Your code here:

print(head(transactions, 10))


# Check the structure of the dataset
cat("\nDataset structure:\n")
# Your code here:

str(transactions)

# Display column names and their data types
cat("\nColumn names:\n")
# Your code here:

print(colnames(transactions))
cat("\nData types of each column:\n")
print(sapply(transactions, class))

First 10 rows of the dataset:
# A tibble: 10 × 9
   TransactionID CustomerID CustomerName CustomerCity ProductName    
           <dbl>      <dbl> <chr>        <chr>        <chr>          
 1             1         81 Customer 39  Chicago      Adidas Jacket  
 2             2         13 Customer 63  Philadelphia Samsung TV     
 3             3         18 Customer 98  Chicago      Adidas Jacket  
 4             4         76 Customer 39  Houston      Dell Laptop    
 5             5         86 Customer 45  New York     Nike Shoes     
 6             6         37 Customer 8   Philadelphia Adidas Jacket  
 7             7         45 Customer 83  New York     HP Printer     
 8             8         11 Customer 60  Chicago      Samsung TV     
 9             9         13 Customer 69  Houston      iP

## Part 2: Column Selection with `select()`

Practice different methods of selecting columns from your dataset.

In [14]:
# Task 2.1: Basic Selection
# Create 'basic_info' with TransactionID, CustomerID, ProductName, and TotalAmount

basic_info <- transactions %>%
  # Your code here:

  select(TransactionID, CustomerID, ProductName, TotalAmount)

# Display the result
cat("Basic info dataset (first 5 rows):\n")
head(basic_info, 5)

Basic info dataset (first 5 rows):


TransactionID,CustomerID,ProductName,TotalAmount
<dbl>,<dbl>,<chr>,<dbl>
1,81,Adidas Jacket,632.39
2,13,Samsung TV,114.28
3,18,Adidas Jacket,1289.24
4,76,Dell Laptop,885.4
5,86,Nike Shoes,95.95


In [15]:
# Task 2.2: Range Selection
# Create 'customer_details' with all columns from CustomerID to CustomerCity (inclusive)

customer_details <- transactions %>%
  # Your code here:

  select(CustomerID:CustomerCity)

# Display the result
cat("Customer details (first 5 rows):\n")
head(customer_details, 5)

Customer details (first 5 rows):


CustomerID,CustomerName,CustomerCity
<dbl>,<chr>,<chr>
81,Customer 39,Chicago
13,Customer 63,Philadelphia
18,Customer 98,Chicago
76,Customer 39,Houston
86,Customer 45,New York


In [16]:
# Task 2.3: Pattern-Based Selection

# Create 'date_columns' with columns starting with "Date" or "Time"
date_columns <- transactions %>%
  # Your code here:

  select(starts_with("Date") | starts_with("Time"))

# Create 'amount_columns' with columns containing the word "Amount"
amount_columns <- transactions %>%
  # Your code here:

  select(contains("Amount"))


# Display column names for verification
cat("Date/Time columns:", names(date_columns), "\n")
cat("Amount columns:", names(amount_columns), "\n")

Date/Time columns:  
Amount columns: TotalAmount 
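
The empty `Date/Time columns:` line above is expected for this dataset: the only date column is `TransactionDate`, which ends with "Date" rather than starting with it. As a minimal sketch, assuming a small made-up tibble (`toy` is hypothetical, not the real data) with similar column names, these helpers would pick that column up:

```r
library(dplyr)

# Hypothetical two-row stand-in for the transactions data
toy <- tibble(
  TransactionID   = c(1, 2),
  TotalAmount     = c(120, 60),
  TransactionDate = as.Date(c("2024-08-13", "2024-12-08"))
)

# starts_with() matches name prefixes only, so no column qualifies
prefix_match <- toy %>% select(starts_with("Date") | starts_with("Time"))

# contains() matches the text anywhere in the name; ends_with() matches suffixes
anywhere_match <- toy %>% select(contains("Date") | contains("Time"))
suffix_match   <- toy %>% select(ends_with("Date"))

names(prefix_match)
names(anywhere_match)
names(suffix_match)
```

Which helper is appropriate depends on how the real columns are named, so it is worth checking `names(transactions)` first.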


In [17]:
# Task 2.4: Exclusion Selection
# Create 'no_ids' without TransactionID and CustomerID columns

no_ids <- transactions %>%
  # Your code here:

  select(-TransactionID, -CustomerID)

# Display column names for verification
cat("Columns after removing IDs:", names(no_ids), "\n")
cat("Number of columns:", ncol(no_ids), "\n")

Columns after removing IDs: CustomerName CustomerCity ProductName ProductCategory TotalAmount Quantity TransactionDate 
Number of columns: 7 


## Part 3: Row Filtering with `filter()`

Learn to filter rows based on various conditions.

In [18]:
# Task 3.1: Single Condition Filtering

# Filter transactions with TotalAmount > $100
high_value_transactions <- transactions %>%
  # Your code here:

  filter(TotalAmount > 100)


# Filter transactions from "Electronics" category
electronics_transactions <- transactions %>%
  # Your code here:

  filter(ProductCategory == "Electronics")

# Display results
cat("High value transactions (>$100):", nrow(high_value_transactions), "rows\n")
cat("Electronics transactions:", nrow(electronics_transactions), "rows\n")

High value transactions (>$100): 470 rows
Electronics transactions: 93 rows


In [19]:
# Task 3.2: Multiple Condition Filtering (AND)
# Filter for TotalAmount > $50 AND Quantity > 1 AND CustomerCity == "New York"

ny_bulk_purchases <- transactions %>%
  # Your code here:

  filter(TotalAmount > 50 & Quantity > 1 & CustomerCity == "New York")


# Display results
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "rows\n")
if(nrow(ny_bulk_purchases) > 0) {
  head(ny_bulk_purchases)
}

NY bulk purchases: 75 rows


TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<date>
5,86,Customer 45,New York,Nike Shoes,Computers,95.95,5,2024-08-13
7,45,Customer 83,New York,HP Printer,Clothing,78.71,3,2024-05-02
15,1,Customer 52,New York,Nike Shoes,Clothing,602.79,5,2024-01-05
22,80,Customer 4,New York,iPhone 14,Books,1424.99,7,2024-02-01
25,97,Customer 17,New York,iPhone 14,Movies,999.24,2,2024-02-03
29,3,Customer 23,New York,Samsung TV,Clothing,1392.13,5,2024-04-30


In [20]:
# Task 3.3: Multiple Condition Filtering (OR)
# Filter for ProductCategory = "Books" OR "Music" OR "Movies"

entertainment_transactions <- transactions %>%
  # Your code here:

  filter(ProductCategory %in% c("Books", "Music", "Movies"))

# Display results
cat("Entertainment transactions:", nrow(entertainment_transactions), "rows\n")
if(nrow(entertainment_transactions) > 0) {
  head(entertainment_transactions)
}

Entertainment transactions: 227 rows


TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<date>
2,13,Customer 63,Philadelphia,Samsung TV,Music,114.28,3,2024-12-08
8,11,Customer 60,Chicago,Samsung TV,Music,871.93,3,2024-04-30
9,13,Customer 69,Houston,iPhone 14,Music,1347.56,8,2024-08-08
10,55,Customer 24,Chicago,Sony Headphones,Books,633.51,1,2024-06-23
11,100,Customer 95,Philadelphia,HP Printer,Movies,572.43,6,2024-12-09
14,19,Customer 100,Phoenix,Nike Shoes,Books,32.29,3,2024-12-11


In [28]:
# Task 3.4: Date-Based Filtering
# Filter transactions from March 2024
# Note: Adjust the date format and column name based on your actual data

march_transactions <- transactions %>%
  # Your code here (you may need to convert date format first):

  filter(lubridate::month(TransactionDate) == 3 & lubridate::year(TransactionDate) == 2024)


# Display results
cat("March 2024 transactions:", nrow(march_transactions), "rows\n")

March 2024 transactions: 41 rows


In [31]:
# Task 3.5: Advanced Filtering Challenge
# Find customers who made purchases in both "Electronics" AND "Clothing" categories
# Hint: This requires identifying customers who appear in both categories

# Step 1: Find customers who bought Electronics
electronics_customers <- transactions %>%
  # Your code here:

  filter(ProductCategory == "Electronics") 

# Step 2: Find customers who bought Clothing
clothing_customers <- transactions %>%
  # Your code here:

  filter(ProductCategory == "Clothing") 

# Step 3: Find customers who bought both
both_categories_customers <- # Your code here:

  intersect(electronics_customers$CustomerID, clothing_customers$CustomerID)


# Display results
cat("Customers who bought both Electronics and Clothing:", length(both_categories_customers), "customers\n")

Customers who bought both Electronics and Clothing: 38 customers
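
The same question can also be answered in a single pipeline instead of intersecting two vectors. The sketch below is one possible alternative, not the required solution; `toy` and `n_categories` are hypothetical names, and the data is made up for illustration:

```r
library(dplyr)

# Hypothetical purchases: customers 1 and 3 bought in both categories
toy <- tibble(
  CustomerID      = c(1, 1, 2, 3, 3, 3),
  ProductCategory = c("Electronics", "Clothing", "Electronics",
                      "Clothing", "Clothing", "Electronics")
)

both <- toy %>%
  filter(ProductCategory %in% c("Electronics", "Clothing")) %>%
  distinct(CustomerID, ProductCategory) %>%      # one row per customer/category pair
  count(CustomerID, name = "n_categories") %>%   # how many of the two categories
  filter(n_categories == 2) %>%                  # keep customers present in both
  pull(CustomerID)

both
```

The `distinct()` step matters: without it, a customer with two Clothing purchases and no Electronics purchase would also reach a count of 2.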


## Part 4: Data Sorting with `arrange()`

Practice sorting data by single and multiple columns.

In [32]:
# Task 4.1: Single Column Sorting

# Sort by TotalAmount ascending
transactions_by_amount_asc <- transactions %>%
  # Your code here:

  arrange(TotalAmount)

# Sort by TotalAmount descending
transactions_by_amount_desc <- transactions %>%
  # Your code here:

  arrange(desc(TotalAmount))

# Display top 5 of each
cat("Lowest amounts:\n")
head(transactions_by_amount_asc %>% select(CustomerName, ProductName, TotalAmount), 5)

cat("\nHighest amounts:\n")
head(transactions_by_amount_desc %>% select(CustomerName, ProductName, TotalAmount), 5)

Lowest amounts:


CustomerName,ProductName,TotalAmount
<chr>,<chr>,<dbl>
Customer 95,Adidas Jacket,27.66
Customer 100,Nike Shoes,32.29
Customer 50,Adidas Jacket,35.01
Customer 83,Samsung TV,36.37
Customer 69,Dell Laptop,37.33



Highest amounts:


CustomerName,ProductName,TotalAmount
<chr>,<chr>,<dbl>
Customer 60,Sony Headphones,1499.52
Customer 28,Dell Laptop,1491.96
Customer 81,Sony Headphones,1491.62
Customer 79,iPhone 14,1488.95
Customer 20,HP Printer,1487.44


In [33]:
# Task 4.2: Multiple Column Sorting
# Sort by CustomerCity (ascending), then by TotalAmount (descending)

transactions_by_city_amount <- transactions %>%
  # Your code here:

  arrange(CustomerCity, desc(TotalAmount))

# Display first 10 rows
cat("Transactions sorted by city, then amount:\n")
head(transactions_by_city_amount %>% select(CustomerCity, CustomerName, ProductName, TotalAmount), 10)

Transactions sorted by city, then amount:


CustomerCity,CustomerName,ProductName,TotalAmount
<chr>,<chr>,<chr>,<dbl>
Chicago,Customer 18,iPhone 14,1487.43
Chicago,Customer 20,HP Printer,1476.49
Chicago,Customer 97,Dell Laptop,1459.42
Chicago,Customer 49,Dell Laptop,1453.94
Chicago,Customer 70,Nike Shoes,1428.35
Chicago,Customer 71,Samsung TV,1424.38
Chicago,Customer 62,Sony Headphones,1416.35
Chicago,Customer 11,Nike Shoes,1407.51
Chicago,Customer 99,HP Printer,1388.73
Chicago,Customer 49,Nike Shoes,1370.57


In [34]:
# Task 4.3: Date-Based Sorting
# Sort by TransactionDate chronologically (oldest first)

transactions_chronological <- transactions %>%
  # Your code here:

  arrange(TransactionDate)  


# Display first 5 transactions chronologically
cat("Earliest transactions:\n")
head(transactions_chronological %>% select(TransactionDate, CustomerName, ProductName, TotalAmount), 5)

Earliest transactions:


TransactionDate,CustomerName,ProductName,TotalAmount
<date>,<chr>,<chr>,<dbl>
2024-01-01,Customer 62,Sony Headphones,1416.35
2024-01-01,Customer 99,Sony Headphones,1326.27
2024-01-02,Customer 83,Nike Shoes,808.93
2024-01-02,Customer 61,Adidas Jacket,502.72
2024-01-02,Customer 69,Adidas Jacket,277.7


## Part 5: Chaining Operations

Combine multiple dplyr operations using the pipe operator.

In [35]:
# Task 5.1: Simple Chain
# Filter TotalAmount > $75, select specific columns, arrange by TotalAmount descending

premium_purchases <- transactions %>%
  # Your code here:

  filter(TotalAmount > 75) %>%
  select(TransactionID, CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Display results
cat("Premium purchases (>$75):\n")
head(premium_purchases, 10)

Premium purchases (>$75):


TransactionID,CustomerName,ProductName,TotalAmount
<dbl>,<chr>,<chr>,<dbl>
150,Customer 60,Sony Headphones,1499.52
61,Customer 28,Dell Laptop,1491.96
154,Customer 81,Sony Headphones,1491.62
187,Customer 79,iPhone 14,1488.95
136,Customer 20,HP Printer,1487.44
297,Customer 18,iPhone 14,1487.43
378,Customer 83,iPhone 14,1484.31
16,Customer 20,HP Printer,1476.49
479,Customer 10,Dell Laptop,1473.27
84,Customer 14,iPhone 14,1471.59


In [37]:
# Task 5.2: Complex Chain
# Filter for Electronics/Computers, select columns, arrange by date/amount, keep top 20

recent_tech_purchases <- transactions %>%
  # Your code here:

  filter(ProductCategory %in% c("Electronics", "Computers")) %>%
  arrange(TransactionDate, desc(TotalAmount)) %>%  
  head(20) %>%
  select(TransactionID, TransactionDate, CustomerName, ProductName, TotalAmount)
   

# Display results
cat("Recent tech purchases (top 20):\n")
print(recent_tech_purchases)

Recent tech purchases (top 20):
# A tibble: 20 × 5
   TransactionID TransactionDate CustomerName ProductName     TotalAmount
           <dbl> <date>          <chr>        <chr>                 <dbl>
 1            32 2024-01-01      Customer 62  Sony Headphones      1416. 
 2           133 2024-01-01      Customer 99  Sony Headphones      1326. 
 3           251 2024-01-02      Customer 69  Adidas Jacket         278. 
 4           137 2024-01-03      Customer 2   HP Printer            463. 
 5           399 2024-01-08      Customer 76  Sony Headphones       463. 
 6           370 2024-01-08      Customer 56  HP Printer            186. 
 7           167 2024-01-12      Customer 76  Sony Headphones       997. 
 8            81 2024-01-14      Customer 98  Dell Laptop            96.8

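Note that `arrange(TransactionDate, ...)` sorts oldest first, so the rows kept by `head(20)` are the earliest purchases, as the January dates in the output show. If "recent" is meant literally, the date would need `desc()`. A minimal sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical transactions with mixed dates
toy <- tibble(
  TransactionID   = 1:4,
  TransactionDate = as.Date(c("2024-01-01", "2024-12-09",
                              "2024-12-09", "2024-06-23")),
  TotalAmount     = c(100, 50, 80, 60)
)

# Newest date first; within a date, largest amount first
most_recent <- toy %>%
  arrange(desc(TransactionDate), desc(TotalAmount)) %>%
  head(2)

most_recent$TransactionID
```

Whether ascending or descending order is intended depends on how the task defines "recent", so this is a suggestion rather than a correction.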
In [39]:
# Task 5.3: Business Intelligence Chain
# Identify high-value repeat customers (TotalAmount > $200)

high_value_customers <- transactions %>%
  # Your code here:

  filter(TotalAmount > 200) %>%
  group_by(CustomerID, CustomerName) %>%
  summarise(TotalSpent = sum(TotalAmount), Transactions = n(), .groups = 'drop') %>%
  filter(Transactions > 1) %>%
  arrange(desc(TotalSpent))
  


# Display results
cat("High-value customers:\n")
head(high_value_customers, 15)

High-value customers:


CustomerID,CustomerName,TotalSpent,Transactions
<dbl>,<chr>,<dbl>,<int>
7,Customer 18,2782.54,2
77,Customer 99,2628.75,2
72,Customer 13,1986.1,2
70,Customer 49,1823.74,2
69,Customer 14,1095.99,2
87,Customer 87,717.69,2


## Part 6: Data Analysis Questions

Answer the following questions using the datasets you've created.

In [40]:
# Question 6.1: Transaction Volume
# Count transactions in each filtered dataset

cat("Transaction counts by dataset:\n")
cat("High value transactions:", nrow(high_value_transactions), "\n")
cat("Electronics transactions:", nrow(electronics_transactions), "\n")
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "\n")
cat("Entertainment transactions:", nrow(entertainment_transactions), "\n")
cat("March transactions:", nrow(march_transactions), "\n")
cat("Premium purchases:", nrow(premium_purchases), "\n")
cat("Recent tech purchases:", nrow(recent_tech_purchases), "\n")
cat("High value customers:", nrow(high_value_customers), "\n")

Transaction counts by dataset:
High value transactions: 470 
Electronics transactions: 93 
NY bulk purchases: 75 
Entertainment transactions: 227 
March transactions: 41 
Premium purchases: 483 
Recent tech purchases: 20 
High value customers: 6 


In [42]:
# Question 6.2: Top Customers
# Find the customer who appears most frequently in high_value_customers

if(nrow(high_value_customers) > 0) {
  customer_frequency <- high_value_customers %>%
    # Rank by number of qualifying transactions; break ties by total spend
    arrange(desc(Transactions), desc(TotalSpent)) %>%
    slice(1)
    
  
  cat("Most frequent high-value customer:\n")
  print(customer_frequency)
} else {
  cat("No high-value customers found\n")
}

Most frequent high-value customer:
# A tibble: 1 × 4
  CustomerID CustomerName TotalSpent Transactions
       <dbl> <chr>             <dbl>        <int>
1          7 Customer 18       2783.            2


In [43]:
# Question 6.3: Product Analysis
# Find top 5 most expensive transactions in entertainment_transactions

if(nrow(entertainment_transactions) > 0) {
  top_entertainment <- entertainment_transactions %>%
    # Your code here:

    arrange(desc(TotalAmount)) %>%
    head(5) %>%
    select(TransactionID, CustomerName, ProductName, TotalAmount)
    
  
  cat("Top 5 most expensive entertainment transactions:\n")
  print(top_entertainment)
} else {
  cat("No entertainment transactions found\n")
}

Top 5 most expensive entertainment transactions:
# A tibble: 5 × 4
  TransactionID CustomerName ProductName     TotalAmount
          <dbl> <chr>        <chr>                 <dbl>
1           154 Customer 81  Sony Headphones       1492.
2           378 Customer 83  iPhone 14             1484.
3           479 Customer 10  Dell Laptop           1473.
4           384 Customer 97  Dell Laptop           1459.
5           468 Customer 47  Dell Laptop           1457.


In [44]:
# Question 6.4: Geographic Analysis
# Find the city with the highest single transaction amount

highest_transaction_by_city <- transactions_by_city_amount %>%
  # Your code here:
  slice(1) %>%
  select(CustomerCity, CustomerName, ProductName, TotalAmount)


cat("City with highest single transaction:\n")
print(highest_transaction_by_city)

City with highest single transaction:
# A tibble: 1 × 4
  CustomerCity CustomerName ProductName TotalAmount
  <chr>        <chr>        <chr>             <dbl>
1 Chicago      Customer 18  iPhone 14         1487.

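Because `transactions_by_city_amount` is sorted by city first, `slice(1)` returns the largest transaction in the alphabetically first city (Chicago), which is not necessarily the largest transaction overall. The Task 4.1 output lists a 1499.52 transaction, which exceeds Chicago's 1487.43. A minimal sketch of the difference, using made-up cities and amounts (`toy` is hypothetical), with `slice_max()` as one way to take the overall maximum:

```r
library(dplyr)

# Hypothetical data: the largest amount is NOT in the alphabetically first city
toy <- tibble(
  CustomerCity = c("Chicago", "Chicago", "Houston", "New York"),
  TotalAmount  = c(100, 90, 250, 50)
)

# City-first sort, then first row: top transaction of the first city only
first_row_of_city_sort <- toy %>%
  arrange(CustomerCity, desc(TotalAmount)) %>%
  slice(1)

# Overall maximum, regardless of city
overall_max <- toy %>%
  slice_max(TotalAmount, n = 1)

first_row_of_city_sort$CustomerCity
overall_max$CustomerCity
```

Applied to the real data, the second pattern would be run on `transactions` directly; the resulting city is not shown here because it depends on the actual file.
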

## Part 7: Reflection Questions

Please answer the following questions in the markdown cells below.

### Question 7.1: Pipe Operator Benefits

**How does using the pipe operator (`%>%`) improve code readability compared to nested function calls? Provide a specific example from your homework.**

Your answer here:

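The pipe operator improves readability by presenting data transformations as a
linear sequence of steps that reads top to bottom. Nested function calls are
harder to read because each command is composed inside another, so they must be
read from the inside out. For example, the Task 5.1 chain reads in the order it
runs: filter by TotalAmount, then select columns, then arrange by amount.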
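
To make the comparison concrete, here is a sketch of the Task 5.1 steps written both ways. It assumes a small made-up tibble (`toy` is hypothetical) in place of the real `transactions` data:

```r
library(dplyr)

# Hypothetical stand-in for the transactions data
toy <- tibble(
  TransactionID = 1:4,
  CustomerName  = c("Customer A", "Customer B", "Customer C", "Customer D"),
  ProductName   = c("Jacket", "TV", "Laptop", "Shoes"),
  TotalAmount   = c(120, 60, 300, 90)
)

# Nested version: the first step (filter) is buried in the middle
nested <- arrange(
  select(
    filter(toy, TotalAmount > 75),
    TransactionID, CustomerName, ProductName, TotalAmount
  ),
  desc(TotalAmount)
)

# Piped version: steps appear in the order they are applied
piped <- toy %>%
  filter(TotalAmount > 75) %>%
  select(TransactionID, CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Both forms should produce the same result
all.equal(nested, piped)
```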

### Question 7.2: Filtering Strategy

**When filtering data for business analysis, what are the trade-offs between being very specific (many conditions) versus being more general (fewer conditions)? How might this affect your insights?**

Your answer here:

With more specific conditions, you can target exactly the data you need, which
supports precise insights. The trade-off is that very narrow filters can leave
too few rows to draw conclusions from, and the choice of conditions can
introduce bias. More general conditions do the opposite: they provide a larger
pool of data to analyze, but the results are less precise, which can lead to
shallow reports.

### Question 7.3: Sorting Importance

**Why is data sorting important in business analytics? Provide three specific business scenarios where sorting data would be crucial for decision-making.**

Your answer here:

1. You want to identify the most expensive items sold and how many people bought
   them.
2. You want to analyze employee performance and rank them based on sales or
   customer ratings.
3. You want to keep track of stock levels being brought to a store and sort them
   based on how quickly they are selling.

### Question 7.4: Real-World Application

**Describe a real business scenario where you might need to combine `select()`, `filter()`, and `arrange()` operations. What insights would you be trying to gain?**

Your answer here:

When drafting athletes for a team, one might use these functions to filter
players by position and performance level, select the columns of interest (such
as name, position, and key stats), and then arrange them in descending order by
those stats so the top-performing players appear first.
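
A rough sketch of that scenario, using an entirely hypothetical `players` table (the column names, positions, and numbers are invented for illustration):

```r
library(dplyr)

# Hypothetical scouting data
players <- tibble(
  name            = c("Player A", "Player B", "Player C", "Player D"),
  position        = c("Guard", "Guard", "Center", "Guard"),
  points_per_game = c(18.2, 9.5, 21.0, 24.7),
  jersey_number   = c(4, 11, 23, 30)
)

draft_board <- players %>%
  filter(position == "Guard", points_per_game >= 10) %>%  # rows: position and performance
  select(name, position, points_per_game) %>%             # columns: only what is needed
  arrange(desc(points_per_game))                          # best performers first

draft_board
```

Here `filter()` narrows the rows, `select()` narrows the columns, and `arrange(desc(...))` puts the strongest candidates at the top of the list.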

## Summary and Submission

### What You've Learned

In this homework, you've practiced:
- Using `select()` for column selection with various methods
- Using `filter()` for row filtering with single and multiple conditions
- Using `arrange()` for sorting data by single and multiple columns
- Chaining operations with the pipe operator (`%>%`)
- Analyzing business data to generate insights

### Submission Checklist

Before submitting, ensure you have:
- [ ] Completed all code tasks
- [ ] Run all cells successfully
- [ ] Answered all reflection questions
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator where appropriate
- [ ] Verified your results make sense

### Next Steps

In the next lesson, you'll learn about:
- `mutate()` for creating new columns
- `summarize()` for calculating summary statistics
- `group_by()` for grouped operations
- Advanced data transformation techniques