# Homework Assignment - Lesson 3: Data Transformation with dplyr - Part 1

**Student Name:** [Alexander Weis]

**Due Date:** [September 21, 2025]

**Objective:** Learn to use dplyr functions (`select()`, `filter()`, `arrange()`) and the pipe operator (`%>%`) for data transformation and analysis.

---

## Instructions

- Complete all tasks in this notebook
- Use the pipe operator (`%>%`) wherever possible to chain operations
- Ensure your code is well-commented and easy to understand
- Run all cells to verify your code works correctly
- Answer all reflection questions at the end

---

## Part 1: Data Import and Setup

In this section, you'll import the retail transactions dataset and perform initial exploration.

**Dataset:** `retail_transactions.csv` - This dataset contains transaction records from a retail business with information about customers, products, dates, amounts, and quantities.
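
A minimal sketch of the import step, assuming readr 2.0 or later (where literal CSV text can be passed through `I()`). The three rows and the name `sample_tx` are hypothetical stand-ins so the snippet runs without the actual file. `show_col_types = FALSE` is the option the readr message suggests for silencing the column specification printout.

```r
library(readr)

# Hypothetical sample in the same layout as retail_transactions.csv
# (only a subset of the columns, for illustration)
csv_text <- I("TransactionID,CustomerCity,TotalAmount,TransactionDate
1,Chicago,632.39,2024-03-09
2,Philadelphia,114.28,2024-12-08
3,Chicago,1289.24,2024-01-22")

# read_csv() guesses column types; ISO dates are parsed as Date
sample_tx <- read_csv(csv_text, show_col_types = FALSE)

# Inspect the guessed column types
spec(sample_tx)
```

With the real file, the same call would take the file path in place of `csv_text`.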

In [5]:
# Load required libraries
library(tidyverse)

# Set working directory if needed
setwd("/workspaces/assignment-1-Alex-Weis4-2/data")

# Task 1.1: Import the retail_transactions.csv file
# Create a data frame named 'transactions'

# Your code here:
transactions <- read_csv("retail_transactions.csv")

# Display success message
cat("Data imported successfully!\n")
cat("Dataset dimensions:", nrow(transactions), "rows x", ncol(transactions), "columns\n")

Rows: 500 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): CustomerName, CustomerCity, ProductName, ProductCategory
dbl  (4): TransactionID, CustomerID, TotalAmount, Quantity
date (1): TransactionDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Data imported successfully!
Dataset dimensions: 500 rows x 9 columns


In [8]:
# Task 1.2: Initial Exploration

# Display the first 10 rows
cat("First 10 rows of the dataset:\n")
# Your code here:
head(transactions, 10)

# Check the structure of the dataset
cat("\nDataset structure:\n")
# Your code here:
str(transactions)

# Display column names and their data types
cat("\nColumn names:\n")
# Your code here:
summary(transactions)

First 10 rows of the dataset:


TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<date>
1,81,Customer 39,Chicago,Adidas Jacket,Clothing,632.39,3,2024-03-09
2,13,Customer 63,Philadelphia,Samsung TV,Music,114.28,3,2024-12-08
3,18,Customer 98,Chicago,Adidas Jacket,Computers,1289.24,7,2024-01-22
4,76,Customer 39,Houston,Dell Laptop,Computers,885.4,2,2024-07-02
5,86,Customer 45,New York,Nike Shoes,Computers,95.95,5,2024-08-13
6,37,Customer 8,Philadelphia,Adidas Jacket,Electronics,1126.34,2,2024-04-15
7,45,Customer 83,New York,HP Printer,Clothing,78.71,3,2024-05-02
8,11,Customer 60,Chicago,Samsung TV,Music,871.93,3,2024-04-30
9,13,Customer 69,Houston,iPhone 14,Music,1347.56,8,2024-08-08
10,55,Customer 24,Chicago,Sony Headphones,Books,633.51,1,2024-06-23



Dataset structure:
spc_tbl_ [500 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ TransactionID  : num [1:500] 1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID     : num [1:500] 81 13 18 76 86 37 45 11 13 55 ...
 $ CustomerName   : chr [1:500] "Customer 39" "Customer 63" "Customer 98" "Customer 39" ...
 $ CustomerCity   : chr [1:500] "Chicago" "Philadelphia" "Chicago" "Houston" ...
 $ ProductName    : chr [1:500] "Adidas Jacket" "Samsung TV" "Adidas Jacket" "Dell Laptop" ...
 $ ProductCategory: chr [1:500] "Clothing" "Music" "Computers" "Computers" ...
 $ TotalAmount    : num [1:500] 632 114 1289 885 96 ...
 $ Quantity       : num [1:500] 3 3 7 2 5 2 3 3 8 1 ...
 $ TransactionDate: Date[1:500], format: "2024-03-09" "2024-12-08" ...
 - attr(*, "spec")=
  .. cols(
  ..   TransactionID = col_double(),
  ..   CustomerID = col_double(),
  ..   CustomerName = col_character(),
  ..   CustomerCity = col_character(),
  ..   ProductName = col_character(),


 TransactionID     CustomerID     CustomerName       CustomerCity      
 Min.   :  1.0   Min.   :  1.00   Length:500         Length:500        
 1st Qu.:125.8   1st Qu.: 27.00   Class :character   Class :character  
 Median :250.5   Median : 52.00   Mode  :character   Mode  :character  
 Mean   :250.5   Mean   : 51.26                                        
 3rd Qu.:375.2   3rd Qu.: 77.00                                        
 Max.   :500.0   Max.   :100.00                                        
 ProductName        ProductCategory     TotalAmount         Quantity   
 Length:500         Length:500         Min.   :  27.66   Min.   :1.00  
 Class :character   Class :character   1st Qu.: 393.81   1st Qu.:2.00  
 Mode  :character   Mode  :character   Median : 763.64   Median :4.00  
                                       Mean   : 758.69   Mean   :4.44  
                                       3rd Qu.:1137.10   3rd Qu.:7.00  
                                       Max.   :1499.52   Max.   

## Part 2: Column Selection with `select()`

Practice different methods of selecting columns from your dataset.
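
The selection methods practiced in this part can be sketched side by side. This is an illustration on a hypothetical two-row tibble (`sample_tx`) that borrows column names from the dataset; it is not the assignment solution.

```r
library(dplyr)

# Hypothetical sample with a subset of the dataset's column names
sample_tx <- tibble(
  TransactionID   = c(1, 2),
  CustomerID      = c(81, 13),
  CustomerName    = c("Customer 39", "Customer 63"),
  CustomerCity    = c("Chicago", "Philadelphia"),
  TotalAmount     = c(632.39, 114.28),
  TransactionDate = as.Date(c("2024-03-09", "2024-12-08"))
)

by_name    <- sample_tx %>% select(TransactionID, TotalAmount)     # explicit names
by_range   <- sample_tx %>% select(CustomerID:CustomerCity)        # inclusive range
by_prefix  <- sample_tx %>% select(starts_with("Customer"))        # name prefix
by_pattern <- sample_tx %>% select(contains("Date"))               # substring match
by_exclude <- sample_tx %>% select(-TransactionID, -CustomerID)    # drop columns

# starts_with() only matches the beginning of a name, so "Date" finds nothing
# here: TransactionDate ends with "Date" rather than starting with it
no_match <- sample_tx %>% select(starts_with("Date"))
```

A helper that matches nothing returns a tibble with zero columns rather than raising an error, so it is worth checking `names()` on the result.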

In [11]:
# Task 2.1: Basic Selection
# Create 'basic_info' with TransactionID, CustomerID, ProductName, and TotalAmount

basic_info <- transactions %>%
  select(TransactionID, CustomerID, ProductName, TotalAmount)

# Display the result
cat("Basic info dataset (first 5 rows):\n")
head(basic_info, 5)

Basic info dataset (first 5 rows):


TransactionID,CustomerID,ProductName,TotalAmount
<dbl>,<dbl>,<chr>,<dbl>
1,81,Adidas Jacket,632.39
2,13,Samsung TV,114.28
3,18,Adidas Jacket,1289.24
4,76,Dell Laptop,885.4
5,86,Nike Shoes,95.95


In [12]:
# Task 2.2: Range Selection
# Create 'customer_details' with all columns from CustomerID to CustomerCity (inclusive)

customer_details <- transactions %>%
  # Range selection keeps every column from CustomerID through CustomerCity
  select(CustomerID:CustomerCity)

# Display the result
cat("Customer details (first 5 rows):\n")
head(customer_details, 5)

Customer details (first 5 rows):


CustomerID,CustomerName,CustomerCity
<dbl>,<chr>,<chr>
81,Customer 39,Chicago
13,Customer 63,Philadelphia
18,Customer 98,Chicago
76,Customer 39,Houston
86,Customer 45,New York


In [19]:
# Task 2.3: Pattern-Based Selection

# Create 'date_columns' with columns starting with "Date" or "Time"
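# Note: no column name in this dataset begins with "Date" or "Time"
# (TransactionDate ends with "Date"), so this selection comes back empty
date_columns <- transactions %>%
  select(starts_with("Date") | starts_with("Time"))

# Create 'amount_columns' with columns containing the word "Amount"
amount_columns <- transactions %>%
  select(contains("Amount"))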

# Display column names for verification
cat("Date/Time columns:", names(date_columns), "\n")
cat("Amount columns:", names(amount_columns), "\n")

Date/Time columns:  
Amount columns: TotalAmount 


In [None]:
# Task 2.4: Exclusion Selection
# Create 'no_ids' without TransactionID and CustomerID columns

no_ids <- transactions %>%
  # Negative selection drops the named columns and keeps the rest
  select(-TransactionID, -CustomerID)

# Display column names for verification
cat("Columns after removing IDs:", names(no_ids), "\n")
cat("Number of columns:", ncol(no_ids), "\n")

Columns after removing IDs: CustomerName CustomerCity ProductName ProductCategory TotalAmount Quantity TransactionDate 
Number of columns: 7 


## Part 3: Row Filtering with `filter()`

Learn to filter rows based on various conditions.
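
The kinds of conditions used in this part (single, AND, OR, and date range) can be sketched on a small example. The tibble `sample_tx` and its four rows are hypothetical, chosen only so each filter keeps a different subset.

```r
library(dplyr)

# Hypothetical sample with a subset of the dataset's column names
sample_tx <- tibble(
  CustomerCity    = c("Chicago", "New York", "New York", "Houston"),
  ProductCategory = c("Clothing", "Books", "Electronics", "Music"),
  TotalAmount     = c(632.39, 45.00, 871.93, 1347.56),
  Quantity        = c(3, 1, 3, 8),
  TransactionDate = as.Date(c("2024-03-09", "2024-03-28", "2024-04-30", "2024-08-08"))
)

# Single condition
over_100 <- sample_tx %>% filter(TotalAmount > 100)

# AND: comma-separated conditions must all hold
ny_bulk <- sample_tx %>% filter(CustomerCity == "New York", Quantity > 1)

# OR across several values of one column: %in% is shorter than chained |
books_or_music <- sample_tx %>% filter(ProductCategory %in% c("Books", "Music"))

# Date range: compare a Date column against Date values
march <- sample_tx %>%
  filter(
    TransactionDate >= as.Date("2024-03-01"),
    TransactionDate < as.Date("2024-04-01")
  )
```

Using a half-open range (`>=` the first day, `<` the first day of the next month) avoids having to remember how many days the month has.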

In [None]:
# Task 3.1: Single Condition Filtering

# Filter transactions with TotalAmount > $100
high_value_transactions <- transactions %>%
  filter(TotalAmount > 100)

# Filter transactions from "Electronics" category
electronics_transactions <- transactions %>%
  filter(ProductCategory == "Electronics")


# Display results
cat("High value transactions (>$100):", nrow(high_value_transactions), "rows\n")
cat("Electronics transactions:", nrow(electronics_transactions), "rows\n")

In [None]:
# Task 3.2: Multiple Condition Filtering (AND)
# Filter for TotalAmount > $50 AND Quantity > 1 AND CustomerCity == "New York"

ny_bulk_purchases <- transactions %>%
  filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York")


# Display results
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "rows\n")
if(nrow(ny_bulk_purchases) > 0) {
  head(ny_bulk_purchases)
}

In [None]:
# Task 3.3: Multiple Condition Filtering (OR)
# Filter for ProductCategory = "Books" OR "Music" OR "Movies"

entertainment_transactions <- transactions %>%
  filter(ProductCategory %in% c("Books", "Music", "Movies"))


# Display results
cat("Entertainment transactions:", nrow(entertainment_transactions), "rows\n")
if(nrow(entertainment_transactions) > 0) {
  head(entertainment_transactions)
}

In [None]:
# Task 3.4: Date-Based Filtering
# Filter transactions from March 2024
# Note: read_csv() already parsed TransactionDate as a Date, so no conversion is needed

march_transactions <- transactions %>%
  filter(
    TransactionDate >= as.Date("2024-03-01"),
    TransactionDate < as.Date("2024-04-01")
  )

# Display results
cat("March 2024 transactions:", nrow(march_transactions), "rows\n")

In [None]:
# Task 3.5: Advanced Filtering Challenge
# Find customers who made purchases in both "Electronics" AND "Clothing" categories
# Hint: This requires identifying customers who appear in both categories

# Step 1: Find customers who bought Electronics
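electronics_customers <- transactions %>%
  filter(ProductCategory == "Electronics") %>%
  pull(CustomerID) %>%
  unique()

# Step 2: Find customers who bought Clothing
clothing_customers <- transactions %>%
  filter(ProductCategory == "Clothing") %>%
  pull(CustomerID) %>%
  unique()

# Step 3: Find customers who bought both
# (customers are matched on CustomerID, which is an assumption)
both_categories_customers <- intersect(electronics_customers, clothing_customers)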


# Display results
cat("Customers who bought both Electronics and Clothing:", length(both_categories_customers), "customers\n")

## Part 4: Data Sorting with `arrange()`

Practice sorting data by single and multiple columns.
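
As a minimal sketch of the sorting patterns in this part, the example below uses a hypothetical four-row tibble (`sample_tx`). `arrange()` sorts ascending by default, `desc()` reverses one column, and later columns break ties in earlier ones.

```r
library(dplyr)

# Hypothetical sample with a subset of the dataset's column names
sample_tx <- tibble(
  CustomerCity    = c("Houston", "Chicago", "Chicago", "Houston"),
  TotalAmount     = c(885.40, 632.39, 1289.24, 1347.56),
  TransactionDate = as.Date(c("2024-07-02", "2024-03-09", "2024-01-22", "2024-08-08"))
)

# Single column, ascending (the default) and descending
ascending  <- sample_tx %>% arrange(TotalAmount)
descending <- sample_tx %>% arrange(desc(TotalAmount))

# City A-Z, then largest amount first within each city
city_then_amount <- sample_tx %>% arrange(CustomerCity, desc(TotalAmount))

# Date columns sort chronologically, oldest first
chronological <- sample_tx %>% arrange(TransactionDate)
```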

In [None]:
# Task 4.1: Single Column Sorting

# Sort by TotalAmount ascending
transactions_by_amount_asc <- transactions %>%
  arrange(TotalAmount)

# Sort by TotalAmount descending
transactions_by_amount_desc <- transactions %>%
  arrange(desc(TotalAmount))


# Display top 5 of each
cat("Lowest amounts:\n")
head(transactions_by_amount_asc %>% select(CustomerName, ProductName, TotalAmount), 5)

cat("\nHighest amounts:\n")
head(transactions_by_amount_desc %>% select(CustomerName, ProductName, TotalAmount), 5)

In [None]:
# Task 4.2: Multiple Column Sorting
# Sort by CustomerCity (ascending), then by TotalAmount (descending)

transactions_by_city_amount <- transactions %>%
  arrange(CustomerCity, desc(TotalAmount))


# Display first 10 rows
cat("Transactions sorted by city, then amount:\n")
head(transactions_by_city_amount %>% select(CustomerCity, CustomerName, ProductName, TotalAmount), 10)

In [None]:
# Task 4.3: Date-Based Sorting
# Sort by TransactionDate chronologically (oldest first)

transactions_chronological <- transactions %>%
  arrange(TransactionDate)


# Display first 5 transactions chronologically
cat("Earliest transactions:\n")
head(transactions_chronological %>% select(TransactionDate, CustomerName, ProductName, TotalAmount), 5)

## Part 5: Chaining Operations

Combine multiple dplyr operations using the pipe operator.
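
To illustrate what chaining buys, the sketch below writes the same filter, select, and arrange sequence twice: once as nested calls and once with the pipe. The tibble `sample_tx` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical sample with a subset of the dataset's column names
sample_tx <- tibble(
  CustomerName = c("Customer 39", "Customer 63", "Customer 98", "Customer 45"),
  ProductName  = c("Adidas Jacket", "Samsung TV", "Adidas Jacket", "Nike Shoes"),
  TotalAmount  = c(632.39, 114.28, 1289.24, 60.00)
)

# Nested form: has to be read from the innermost call outward
nested <- arrange(
  select(filter(sample_tx, TotalAmount > 75), CustomerName, TotalAmount),
  desc(TotalAmount)
)

# Piped form: reads top to bottom in the order the steps run
piped <- sample_tx %>%
  filter(TotalAmount > 75) %>%
  select(CustomerName, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Check that both forms return the same result
identical(nested, piped)
```

Each `%>%` passes the result on its left as the first argument of the function on its right, which is why the data frame never has to be named again inside the chain.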

In [None]:
# Task 5.1: Simple Chain
# Filter TotalAmount > $75, select specific columns, arrange by TotalAmount descending

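premium_purchases <- transactions %>%
  filter(TotalAmount > 75) %>%
  # Column choice is an assumption; the task does not name the columns
  select(CustomerName, ProductName, ProductCategory, TotalAmount) %>%
  arrange(desc(TotalAmount))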


# Display results
cat("Premium purchases (>$75):\n")
head(premium_purchases, 10)

In [None]:
# Task 5.2: Complex Chain
# Filter for Electronics/Computers, select columns, arrange by date/amount, keep top 20

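recent_tech_purchases <- transactions %>%
  filter(ProductCategory %in% c("Electronics", "Computers")) %>%
  # Column choice and sort direction are assumptions (newest first, then largest amount)
  select(TransactionDate, CustomerName, ProductName, ProductCategory, TotalAmount) %>%
  arrange(desc(TransactionDate), desc(TotalAmount)) %>%
  slice_head(n = 20)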


# Display results
cat("Recent tech purchases (top 20):\n")
print(recent_tech_purchases)

In [None]:
# Task 5.3: Business Intelligence Chain
# Identify high-value repeat customers (TotalAmount > $200)

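high_value_customers <- transactions %>%
  filter(TotalAmount > 200) %>%
  # Assumption: keep one row per qualifying transaction so that repeat
  # customers can be counted afterwards; sorting by CustomerID groups them
  select(CustomerID, CustomerName, TotalAmount, TransactionDate) %>%
  arrange(CustomerID, desc(TotalAmount))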


# Display results
cat("High-value customers:\n")
head(high_value_customers, 15)

## Part 6: Data Analysis Questions

Answer the following questions using the datasets you've created.
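
Some of these questions involve counting appearances or picking out a top row. One possible approach, sketched here on a hypothetical tibble (`sample_tx`), uses dplyr's `count()` and `slice_max()`; the verbs covered earlier (`arrange()` followed by `head()`) would work as well.

```r
library(dplyr)

# Hypothetical sample: six transactions from three customers
sample_tx <- tibble(
  CustomerID  = c(13, 81, 13, 55, 13, 81),
  TotalAmount = c(114.28, 632.39, 1347.56, 633.51, 250.00, 900.00)
)

# Rows per customer, most frequent first (the count lands in column n)
customer_frequency <- sample_tx %>% count(CustomerID, sort = TRUE)

# The single most frequent customer
top_customer <- customer_frequency %>% slice_head(n = 1)

# The row with the largest TotalAmount
largest_transaction <- sample_tx %>% slice_max(TotalAmount, n = 1)
```

Note that `slice_max()` keeps ties by default, so it can return more than one row when several rows share the maximum.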

In [None]:
# Question 6.1: Transaction Volume
# Count transactions in each filtered dataset

cat("Transaction counts by dataset:\n")
cat("High value transactions:", nrow(high_value_transactions), "\n")
cat("Electronics transactions:", nrow(electronics_transactions), "\n")
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "\n")
cat("Entertainment transactions:", nrow(entertainment_transactions), "\n")
cat("March transactions:", nrow(march_transactions), "\n")
cat("Premium purchases:", nrow(premium_purchases), "\n")
cat("Recent tech purchases:", nrow(recent_tech_purchases), "\n")
cat("High value customers:", nrow(high_value_customers), "\n")

In [None]:
# Question 6.2: Top Customers
# Find the customer who appears most frequently in high_value_customers

if(nrow(high_value_customers) > 0) {
  customer_frequency <- high_value_customers %>%
    # Assumes high_value_customers still contains the CustomerID column
    count(CustomerID, sort = TRUE) %>%
    slice_head(n = 1)
  
  
  cat("Most frequent high-value customer:\n")
  print(customer_frequency)
} else {
  cat("No high-value customers found\n")
}

In [None]:
# Question 6.3: Product Analysis
# Find top 5 most expensive transactions in entertainment_transactions

if(nrow(entertainment_transactions) > 0) {
  top_entertainment <- entertainment_transactions %>%
    arrange(desc(TotalAmount)) %>%
    slice_head(n = 5)
  
  
  cat("Top 5 most expensive entertainment transactions:\n")
  print(top_entertainment)
} else {
  cat("No entertainment transactions found\n")
}

In [None]:
# Question 6.4: Geographic Analysis
# Find the city with the highest single transaction amount

highest_transaction_by_city <- transactions_by_city_amount %>%
  arrange(desc(TotalAmount)) %>%
  slice_head(n = 1) %>%
  select(CustomerCity, TotalAmount)


cat("City with highest single transaction:\n")
print(highest_transaction_by_city)

## Part 7: Reflection Questions

Please answer the following questions in the markdown cells below.

### Question 7.1: Pipe Operator Benefits

**How does using the pipe operator (`%>%`) improve code readability compared to nested function calls? Provide a specific example from your homework.**

Your answer here:


### Question 7.2: Filtering Strategy

**When filtering data for business analysis, what are the trade-offs between being very specific (many conditions) versus being more general (fewer conditions)? How might this affect your insights?**

Your answer here:


### Question 7.3: Sorting Importance

**Why is data sorting important in business analytics? Provide three specific business scenarios where sorting data would be crucial for decision-making.**

Your answer here:

1. 
2. 
3. 

### Question 7.4: Real-World Application

**Describe a real business scenario where you might need to combine `select()`, `filter()`, and `arrange()` operations. What insights would you be trying to gain?**

Your answer here:


## Summary and Submission

### What You've Learned

In this homework, you've practiced:
- Using `select()` for column selection with various methods
- Using `filter()` for row filtering with single and multiple conditions
- Using `arrange()` for sorting data by single and multiple columns
- Chaining operations with the pipe operator (`%>%`)
- Analyzing business data to generate insights

### Submission Checklist

Before submitting, ensure you have:
- [ ] Completed all code tasks
- [ ] Run all cells successfully
- [ ] Answered all reflection questions
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator where appropriate
- [ ] Verified your results make sense

### Next Steps

In the next lesson, you'll learn about:
- `mutate()` for creating new columns
- `summarize()` for calculating summary statistics
- `group_by()` for grouped operations
- Advanced data transformation techniques