# Homework Assignment - Lesson 3: Data Transformation with dplyr - Part 1

**Student Name:** Michael Alexander

**Due Date:** 09/21/2025

**Objective:** Learn to use dplyr functions (`select()`, `filter()`, `arrange()`) and the pipe operator (`%>%`) for data transformation and analysis.

---

## Instructions

- Complete all tasks in this notebook
- Use the pipe operator (`%>%`) wherever possible to chain operations
- Ensure your code is well-commented and easy to understand
- Run all cells to verify your code works correctly
- Answer all reflection questions at the end

---

## Part 1: Data Import and Setup

In this section, you'll import the retail transactions dataset and perform initial exploration.

**Dataset:** `retail_transactions.csv` - This dataset contains transaction records from a retail business with information about customers, products, dates, amounts, and quantities.

In [1]:
# Load tidyverse, installing it first if it is missing
if (!require(tidyverse, quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cran.rstudio.com/")
  library(tidyverse)
}

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
# Load required libraries
library(tidyverse)

# Set working directory if needed
# setwd("path/to/your/data")

# Task 1.1: Import the retail_transactions.csv file
# Create a data frame named 'transactions'

# Your code here:
transactions <- read_csv("retail_transactions.csv")

# Display success message
cat("Data imported successfully!\n")
cat("Dataset dimensions:", nrow(transactions), "rows x", ncol(transactions), "columns\n")

Rows: 15 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): TransactionID, CustomerID, CustomerName, CustomerCity, ProductID, ...
dbl  (3): Quantity, UnitPrice, TotalAmount
date (1): TransactionDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Data imported successfully!
Dataset dimensions: 15 rows x 11 columns
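
The column specification message above appears because `read_csv()` guessed the column types. A minimal sketch of declaring the types up front instead, so the import is explicit and the message is not printed. It reads a small inline sample through `I()` (an assumption made so the sketch runs without the file, and it covers only four of the 11 columns); for the real import you would pass `"retail_transactions.csv"` and list every column.

```r
library(readr)

# Inline two-row sample so the sketch runs without retail_transactions.csv
sample_csv <- I("TransactionID,Quantity,TotalAmount,TransactionDate
T001,1,999.99,2024-03-15
T002,2,59.98,2024-03-16
")

sample_transactions <- read_csv(
  sample_csv,
  col_types = cols(
    TransactionID   = col_character(),
    Quantity        = col_double(),
    TotalAmount     = col_double(),
    TransactionDate = col_date()
  )
)

# Show the first class of each column to confirm the declared types were applied
sapply(sample_transactions, function(col) class(col)[1])
```

Declaring types this way also makes the import fail loudly if a column later arrives in an unexpected format, rather than silently changing type.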


In [3]:
# Task 1.2: Initial Exploration

# Display the first 10 rows
cat("First 10 rows of the dataset:\n")
# Your code here:
head(transactions, 10)

# Check the structure of the dataset
cat("\nDataset structure:\n")
# Your code here:
str(transactions)

# Display column names (the data types are already shown by str() above)
cat("\nColumn names:\n")
# Your code here:
colnames(transactions)

First 10 rows of the dataset:


TransactionID,CustomerID,CustomerName,CustomerCity,ProductID,ProductName,ProductCategory,Quantity,UnitPrice,TotalAmount,TransactionDate
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<date>
T001,C001,John Smith,New York,P001,Laptop,Electronics,1,999.99,999.99,2024-03-15
T002,C002,Sarah Johnson,Los Angeles,P002,Book: Data Science,Books,2,29.99,59.98,2024-03-16
T003,C003,Mike Brown,Chicago,P003,Headphones,Electronics,1,199.99,199.99,2024-03-17
T004,C001,John Smith,New York,P004,T-Shirt,Clothing,3,19.99,59.97,2024-03-18
T005,C004,Emily Davis,Houston,P005,Coffee Maker,Electronics,1,79.99,79.99,2024-03-19
T006,C002,Sarah Johnson,Los Angeles,P006,Movie: Inception,Movies,1,14.99,14.99,2024-03-20
T007,C005,David Wilson,Phoenix,P007,Jeans,Clothing,2,49.99,99.98,2024-03-21
T008,C003,Mike Brown,Chicago,P008,Music Album,Music,1,12.99,12.99,2024-03-22
T009,C006,Lisa Anderson,Philadelphia,P009,Smartphone,Electronics,1,699.99,699.99,2024-03-23
T010,C007,Robert Taylor,San Antonio,P010,Cooking Book,Books,1,24.99,24.99,2024-03-24



Dataset structure:
spc_tbl_ [15 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ TransactionID  : chr [1:15] "T001" "T002" "T003" "T004" ...
 $ CustomerID     : chr [1:15] "C001" "C002" "C003" "C001" ...
 $ CustomerName   : chr [1:15] "John Smith" "Sarah Johnson" "Mike Brown" "John Smith" ...
 $ CustomerCity   : chr [1:15] "New York" "Los Angeles" "Chicago" "New York" ...
 $ ProductID      : chr [1:15] "P001" "P002" "P003" "P004" ...
 $ ProductName    : chr [1:15] "Laptop" "Book: Data Science" "Headphones" "T-Shirt" ...
 $ ProductCategory: chr [1:15] "Electronics" "Books" "Electronics" "Clothing" ...
 $ Quantity       : num [1:15] 1 2 1 3 1 1 2 1 1 1 ...
 $ UnitPrice      : num [1:15] 1000 30 200 20 80 ...
 $ TotalAmount    : num [1:15] 1000 60 200 60 80 ...
 $ TransactionDate: Date[1:15], format: "2024-03-15" "2024-03-16" ...
 - attr(*, "spec")=
  .. cols(
  ..   TransactionID = col_character(),
  ..   CustomerID = col_character(),
  ..   CustomerName = col_c

## Part 2: Column Selection with `select()`

Practice different methods of selecting columns from your dataset.

In [4]:
# Task 2.1: Basic Selection
# Create 'basic_info' with TransactionID, CustomerID, ProductName, and TotalAmount

basic_info <- transactions %>%
  # Your code here:
  select(TransactionID, CustomerID, ProductName, TotalAmount)

# Display the result
cat("Basic info dataset (first 5 rows):\n")
head(basic_info, 5)

Basic info dataset (first 5 rows):


TransactionID,CustomerID,ProductName,TotalAmount
<chr>,<chr>,<chr>,<dbl>
T001,C001,Laptop,999.99
T002,C002,Book: Data Science,59.98
T003,C003,Headphones,199.99
T004,C001,T-Shirt,59.97
T005,C004,Coffee Maker,79.99


In [5]:
# Task 2.2: Range Selection
# Create 'customer_details' with all columns from CustomerID to CustomerCity (inclusive)

customer_details <- transactions %>%
  # Your code here:
  select(CustomerID:CustomerCity)

# Display the result
cat("Customer details (first 5 rows):\n")
head(customer_details, 5)

Customer details (first 5 rows):


CustomerID,CustomerName,CustomerCity
<chr>,<chr>,<chr>
C001,John Smith,New York
C002,Sarah Johnson,Los Angeles
C003,Mike Brown,Chicago
C001,John Smith,New York
C004,Emily Davis,Houston


In [6]:
# Task 2.3: Pattern-Based Selection

# Create 'date_columns' with columns starting with "Date" or "Time" or containing "Date"
date_columns <- transactions %>%
  # Your code here:
  select(starts_with("Date") | starts_with("Time") | contains("Date"))

# Create 'amount_columns' with columns containing the word "Amount"
amount_columns <- transactions %>%
  # Your code here:
  select(contains("Amount"))

# Display column names for verification
cat("Date/Time columns:", names(date_columns), "\n")
cat("Amount columns:", names(amount_columns), "\n")

Date/Time columns: TransactionDate 
Amount columns: TotalAmount 


In [7]:
# Task 2.4: Exclusion Selection
# Create 'no_ids' without TransactionID and CustomerID columns

no_ids <- transactions %>%
  # Your code here:
  select(-TransactionID, -CustomerID)

# Display column names for verification
cat("Columns after removing IDs:", names(no_ids), "\n")
cat("Number of columns:", ncol(no_ids), "\n")

Columns after removing IDs: CustomerName CustomerCity ProductID ProductName ProductCategory Quantity UnitPrice TotalAmount TransactionDate 
Number of columns: 9 
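
The tasks in this part select columns by name, range, and name pattern. As a complementary sketch, `select()` can also pick columns by type with `where()`, and exclusions can use a helper instead of listing each column. The one-row `demo` tibble is a stand-in that reuses some of the dataset's column names; it is not the real data.

```r
library(dplyr)

# One-row stand-in that reuses column names from `transactions`
demo <- tibble(
  TransactionID   = "T001",
  CustomerID      = "C001",
  CustomerName    = "John Smith",
  Quantity        = 1,
  UnitPrice       = 999.99,
  TotalAmount     = 999.99,
  TransactionDate = as.Date("2024-03-15")
)

# Select by column type rather than by name
numeric_cols <- demo %>% select(where(is.numeric))

# Drop every column whose name ends in "ID"
no_id_cols <- demo %>% select(-ends_with("ID"))

names(numeric_cols)
names(no_id_cols)
```

Helper-based exclusion keeps working if more ID columns are added later, whereas `select(-TransactionID, -CustomerID)` would need to be updated by hand.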


## Part 3: Row Filtering with `filter()`

Learn to filter rows based on various conditions.

In [8]:
# Task 3.1: Single Condition Filtering

# Filter transactions with TotalAmount > $100
high_value_transactions <- transactions %>%
  # Your code here:
  filter(TotalAmount > 100)

# Filter transactions from "Electronics" category
electronics_transactions <- transactions %>%
  # Your code here:
  filter(ProductCategory == "Electronics")

# Display results
cat("High value transactions (>$100):", nrow(high_value_transactions), "rows\n")
cat("Electronics transactions:", nrow(electronics_transactions), "rows\n")

High value transactions (>$100): 5 rows
Electronics transactions: 6 rows


In [9]:
# Task 3.2: Multiple Condition Filtering (AND)
# Filter for TotalAmount > $50 AND Quantity > 1 AND CustomerCity == "New York"

ny_bulk_purchases <- transactions %>%
  # Your code here:
  filter(TotalAmount > 50 & Quantity > 1 & CustomerCity == "New York")

# Display results
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "rows\n")
if(nrow(ny_bulk_purchases) > 0) {
  head(ny_bulk_purchases)
}

NY bulk purchases: 1 rows


TransactionID,CustomerID,CustomerName,CustomerCity,ProductID,ProductName,ProductCategory,Quantity,UnitPrice,TotalAmount,TransactionDate
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<date>
T004,C001,John Smith,New York,P004,T-Shirt,Clothing,3,19.99,59.97,2024-03-18
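
`filter()` treats comma-separated conditions as a logical AND, so the `&` chain in Task 3.2 can also be written with commas. A short sketch on a three-row stand-in tibble (rows modeled on the dataset, not the full file) suggests the two forms return the same rows:

```r
library(dplyr)

# Three-row stand-in modeled on rows of `transactions`
demo <- tibble(
  TransactionID = c("T001", "T004", "T007"),
  CustomerCity  = c("New York", "New York", "Phoenix"),
  Quantity      = c(1, 3, 2),
  TotalAmount   = c(999.99, 59.97, 99.98)
)

# Conditions joined with &
with_ampersand <- demo %>%
  filter(TotalAmount > 50 & Quantity > 1 & CustomerCity == "New York")

# The same conditions passed as separate arguments
with_commas <- demo %>%
  filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York")

identical(with_ampersand, with_commas)
```

The comma form puts one condition per argument, which can be easier to scan and to comment out while debugging.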


In [10]:
# Task 3.3: Multiple Condition Filtering (OR)
# Filter for ProductCategory = "Books" OR "Music" OR "Movies"

entertainment_transactions <- transactions %>%
  # Your code here:
  filter(ProductCategory %in% c("Books", "Music", "Movies"))

# Display results
cat("Entertainment transactions:", nrow(entertainment_transactions), "rows\n")
if(nrow(entertainment_transactions) > 0) {
  head(entertainment_transactions)
}

Entertainment transactions: 6 rows


TransactionID,CustomerID,CustomerName,CustomerCity,ProductID,ProductName,ProductCategory,Quantity,UnitPrice,TotalAmount,TransactionDate
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<date>
T002,C002,Sarah Johnson,Los Angeles,P002,Book: Data Science,Books,2,29.99,59.98,2024-03-16
T006,C002,Sarah Johnson,Los Angeles,P006,Movie: Inception,Movies,1,14.99,14.99,2024-03-20
T008,C003,Mike Brown,Chicago,P008,Music Album,Music,1,12.99,12.99,2024-03-22
T010,C007,Robert Taylor,San Antonio,P010,Cooking Book,Books,1,24.99,24.99,2024-03-24
T013,C009,William Garcia,Dallas,P013,Guitar,Music,1,199.99,199.99,2024-03-27
T015,C001,John Smith,New York,P015,Python Programming Book,Books,1,39.99,39.99,2024-03-29
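
The `%in%` operator used in Task 3.3 is shorthand for a chain of `==` comparisons joined with `|`, and negating it gives the complementary set of rows. A minimal sketch on a four-row stand-in tibble (not the real file):

```r
library(dplyr)

# Four-row stand-in with one product per category
demo <- tibble(
  ProductName     = c("Laptop", "Cooking Book", "Guitar", "Jeans"),
  ProductCategory = c("Electronics", "Books", "Music", "Clothing")
)

entertainment <- c("Books", "Music", "Movies")

# Membership test
with_in <- demo %>% filter(ProductCategory %in% entertainment)

# Equivalent chain of OR conditions
with_or <- demo %>%
  filter(ProductCategory == "Books" |
           ProductCategory == "Music" |
           ProductCategory == "Movies")

# Everything that is NOT entertainment
not_entertainment <- demo %>% filter(!ProductCategory %in% entertainment)

identical(with_in, with_or)
not_entertainment$ProductName
```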


In [11]:
# Task 3.4: Date-Based Filtering
# Filter transactions from March 2024
# Note: Adjust the date format and column name based on your actual data

march_transactions <- transactions %>%
  # Your code here (you may need to convert date format first):
  filter(format(TransactionDate, "%Y-%m") == "2024-03")

# Display results
cat("March 2024 transactions:", nrow(march_transactions), "rows\n")

March 2024 transactions: 15 rows
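
Because every row in this dataset falls in March 2024, the filter above cannot show that it excludes anything. The sketch below adds two hypothetical out-of-range rows (IDs `T000` and `T016` are invented for illustration) and compares the `format()` approach with a plain date-range comparison, which avoids converting dates to text:

```r
library(dplyr)

# Stand-in data: T000 and T016 are hypothetical rows outside March 2024
demo <- tibble(
  TransactionID   = c("T000", "T001", "T015", "T016"),
  TransactionDate = as.Date(c("2024-02-29", "2024-03-15", "2024-03-29", "2024-04-01"))
)

# Approach used in Task 3.4: compare the formatted year-month string
march_by_format <- demo %>%
  filter(format(TransactionDate, "%Y-%m") == "2024-03")

# Alternative: half-open date range [2024-03-01, 2024-04-01)
march_by_range <- demo %>%
  filter(TransactionDate >= as.Date("2024-03-01"),
         TransactionDate < as.Date("2024-04-01"))

identical(march_by_format, march_by_range)
march_by_range$TransactionID
```

The range form extends naturally to periods that are not whole months, such as a two-week promotion window.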


In [12]:
# Task 3.5: Advanced Filtering Challenge
# Find customers who made purchases in both "Electronics" AND "Clothing" categories
# Hint: This requires identifying customers who appear in both categories

# Step 1: Find customers who bought Electronics
electronics_customers <- transactions %>%
  # Your code here:
  filter(ProductCategory == "Electronics") %>%
  pull(CustomerID) %>%
  unique()

# Step 2: Find customers who bought Clothing
clothing_customers <- transactions %>%
  # Your code here:
  filter(ProductCategory == "Clothing") %>%
  pull(CustomerID) %>%
  unique()

# Step 3: Find customers who bought both
both_categories_customers <- intersect(electronics_customers, clothing_customers)

# Display results
cat("Customers who bought both Electronics and Clothing:", length(both_categories_customers), "customers\n")

Customers who bought both Electronics and Clothing: 1 customers
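
The `intersect()` result above is a vector of IDs, so it reports how many customers qualify but not who they are. One possible variation, sketched on a four-row stand-in tibble, filters the Clothing rows down to customers who also appear among the Electronics buyers and keeps their names:

```r
library(dplyr)

# Four-row stand-in modeled on `transactions`
demo <- tibble(
  CustomerID      = c("C001", "C001", "C003", "C005"),
  CustomerName    = c("John Smith", "John Smith", "Mike Brown", "David Wilson"),
  ProductCategory = c("Electronics", "Clothing", "Electronics", "Clothing")
)

# IDs of customers with at least one Electronics purchase
electronics_ids <- demo %>%
  filter(ProductCategory == "Electronics") %>%
  pull(CustomerID)

# Clothing buyers who are also Electronics buyers, one row per customer
both_categories <- demo %>%
  filter(ProductCategory == "Clothing", CustomerID %in% electronics_ids) %>%
  distinct(CustomerID, CustomerName)

both_categories
```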


## Part 4: Data Sorting with `arrange()`

Practice sorting data by single and multiple columns.

In [13]:
# Task 4.1: Single Column Sorting

# Sort by TotalAmount ascending
transactions_by_amount_asc <- transactions %>%
  # Your code here:
  arrange(TotalAmount)

# Sort by TotalAmount descending
transactions_by_amount_desc <- transactions %>%
  # Your code here:
  arrange(desc(TotalAmount))

# Display top 5 of each
cat("Lowest amounts:\n")
head(transactions_by_amount_asc %>% select(CustomerName, ProductName, TotalAmount), 5)

cat("\nHighest amounts:\n")
head(transactions_by_amount_desc %>% select(CustomerName, ProductName, TotalAmount), 5)

Lowest amounts:


CustomerName,ProductName,TotalAmount
<chr>,<chr>,<dbl>
Mike Brown,Music Album,12.99
Sarah Johnson,Movie: Inception,14.99
Robert Taylor,Cooking Book,24.99
John Smith,Python Programming Book,39.99
John Smith,T-Shirt,59.97



Highest amounts:


CustomerName,ProductName,TotalAmount
<chr>,<chr>,<dbl>
John Smith,Laptop,999.99
Lisa Anderson,Smartphone,699.99
Emily Davis,Tablet,299.99
Mike Brown,Headphones,199.99
William Garcia,Guitar,199.99
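
Note that the fourth and fifth highest amounts above are tied at 199.99, so a "top N" cut made with `arrange()` plus `head()` can split a tie arbitrarily. As a sketch of one alternative, `slice_max()` sorts and truncates in one step and keeps tied rows by default. The six-row tibble is a stand-in built from amounts shown in this notebook:

```r
library(dplyr)

# Stand-in built from amounts that appear in this notebook
demo <- tibble(
  ProductName = c("Laptop", "Smartphone", "Tablet", "Headphones", "Guitar", "Jeans"),
  TotalAmount = c(999.99, 699.99, 299.99, 199.99, 199.99, 99.98)
)

# Default with_ties = TRUE: rows tied with the 4th value are all kept
top4_with_ties <- demo %>% slice_max(TotalAmount, n = 4)

# with_ties = FALSE: exactly 4 rows, the tie is cut
top4_strict <- demo %>% slice_max(TotalAmount, n = 4, with_ties = FALSE)

nrow(top4_with_ties)
nrow(top4_strict)
```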


In [14]:
# Task 4.2: Multiple Column Sorting
# Sort by CustomerCity (ascending), then by TotalAmount (descending)

transactions_by_city_amount <- transactions %>%
  # Your code here:
  arrange(CustomerCity, desc(TotalAmount))

# Display first 10 rows
cat("Transactions sorted by city, then amount:\n")
head(transactions_by_city_amount %>% select(CustomerCity, CustomerName, ProductName, TotalAmount), 10)

Transactions sorted by city, then amount:


CustomerCity,CustomerName,ProductName,TotalAmount
<chr>,<chr>,<chr>,<dbl>
Chicago,Mike Brown,Headphones,199.99
Chicago,Mike Brown,Music Album,12.99
Dallas,William Garcia,Guitar,199.99
Houston,Emily Davis,Tablet,299.99
Houston,Emily Davis,Coffee Maker,79.99
Los Angeles,Sarah Johnson,Book: Data Science,59.98
Los Angeles,Sarah Johnson,Movie: Inception,14.99
New York,John Smith,Laptop,999.99
New York,John Smith,T-Shirt,59.97
New York,John Smith,Python Programming Book,39.99


In [15]:
# Task 4.3: Date-Based Sorting
# Sort by TransactionDate chronologically (oldest first)

transactions_chronological <- transactions %>%
  # Your code here:
  arrange(TransactionDate)

# Display first 5 transactions chronologically
cat("Earliest transactions:\n")
head(transactions_chronological %>% select(TransactionDate, CustomerName, ProductName, TotalAmount), 5)

Earliest transactions:


TransactionDate,CustomerName,ProductName,TotalAmount
<date>,<chr>,<chr>,<dbl>
2024-03-15,John Smith,Laptop,999.99
2024-03-16,Sarah Johnson,Book: Data Science,59.98
2024-03-17,Mike Brown,Headphones,199.99
2024-03-18,John Smith,T-Shirt,59.97
2024-03-19,Emily Davis,Coffee Maker,79.99


## Part 5: Chaining Operations

Combine multiple dplyr operations using the pipe operator.

In [16]:
# Task 5.1: Simple Chain
# Filter TotalAmount > $75, select specific columns, arrange by TotalAmount descending

premium_purchases <- transactions %>%
  # Your code here:
  filter(TotalAmount > 75) %>%
  select(CustomerName, ProductName, ProductCategory, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Display results
cat("Premium purchases (>$75):\n")
head(premium_purchases, 10)

Premium purchases (>$75):


CustomerName,ProductName,ProductCategory,TotalAmount
<chr>,<chr>,<chr>,<dbl>
John Smith,Laptop,Electronics,999.99
Lisa Anderson,Smartphone,Electronics,699.99
Emily Davis,Tablet,Electronics,299.99
Mike Brown,Headphones,Electronics,199.99
William Garcia,Guitar,Music,199.99
David Wilson,Jeans,Clothing,99.98
Jennifer Martinez,Sneakers,Clothing,89.99
Emily Davis,Coffee Maker,Electronics,79.99
Michelle Rodriguez,Laptop Stand,Electronics,79.98
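
The order of steps in a chain matters: each verb only sees the columns and rows that the previous step passed along. The sketch below, on a two-row stand-in tibble, shows that filtering on a column works before `select()` drops it and fails afterwards. The error is caught with `tryCatch()` so the example runs to completion.

```r
library(dplyr)

# Two-row stand-in modeled on `transactions`
demo <- tibble(
  CustomerName = c("John Smith", "Mike Brown"),
  ProductName  = c("Laptop", "Music Album"),
  TotalAmount  = c(999.99, 12.99)
)

# Works: TotalAmount still exists when filter() runs
filter_then_select <- demo %>%
  filter(TotalAmount > 75) %>%
  select(CustomerName, ProductName)

# Fails: select() has already dropped TotalAmount
select_then_filter <- tryCatch(
  demo %>%
    select(CustomerName, ProductName) %>%
    filter(TotalAmount > 75),
  error = function(e) "error: TotalAmount is no longer available"
)

filter_then_select
select_then_filter
```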


In [17]:
# Task 5.2: Complex Chain
# Filter for Electronics/Computers, select columns, arrange by date/amount, keep top 20

recent_tech_purchases <- transactions %>%
  # Your code here:
  filter(ProductCategory %in% c("Electronics", "Computers")) %>%
  select(TransactionDate, CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TransactionDate), desc(TotalAmount)) %>%
  slice_head(n = 20)

# Display results
cat("Recent tech purchases (top 20):\n")
print(recent_tech_purchases)

Recent tech purchases (top 20):
# A tibble: 6 × 4
  TransactionDate CustomerName       ProductName  TotalAmount
  <date>          <chr>              <chr>              <dbl>
1 2024-03-28      Michelle Rodriguez Laptop Stand        80.0
2 2024-03-25      Emily Davis        Tablet             300. 
3 2024-03-23      Lisa Anderson      Smartphone         700. 
4 2024-03-19      Emily Davis        Coffee Maker        80.0
5 2024-03-17      Mike Brown         Headphones         200. 
6 2024-03-15      John Smith         Laptop            1000. 


In [18]:
# Task 5.3: Business Intelligence Chain
# Identify customers with high-value transactions (TotalAmount > $200)
# Note: this filter alone does not check whether a customer is a repeat buyer

high_value_customers <- transactions %>%
  # Your code here:
  filter(TotalAmount > 200) %>%
  select(CustomerID, CustomerName, ProductName, TotalAmount, TransactionDate) %>%
  arrange(CustomerName, desc(TotalAmount))

# Display results
cat("High-value customers:\n")
head(high_value_customers, 15)

High-value customers:


CustomerID,CustomerName,ProductName,TotalAmount,TransactionDate
<chr>,<chr>,<chr>,<dbl>,<date>
C004,Emily Davis,Tablet,299.99,2024-03-25
C001,John Smith,Laptop,999.99,2024-03-15
C006,Lisa Anderson,Smartphone,699.99,2024-03-23
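
Each customer appears only once in the result above, so none of them is a repeat high-value buyer in this dataset. A hedged sketch of how the "repeat" condition could be made explicit with `add_count()`, which is not one of this lesson's three verbs. The stand-in tibble includes one hypothetical second purchase for C001 (the 250.00 row is invented) so that the filter has something to return:

```r
library(dplyr)

# Stand-in data: the 250.00 purchase for C001 is hypothetical
demo <- tibble(
  CustomerID   = c("C001", "C001", "C004", "C006"),
  CustomerName = c("John Smith", "John Smith", "Emily Davis", "Lisa Anderson"),
  TotalAmount  = c(999.99, 250.00, 299.99, 699.99)
)

repeat_high_value <- demo %>%
  filter(TotalAmount > 200) %>%
  # Attach each customer's number of high-value transactions
  add_count(CustomerID, name = "n_high_value") %>%
  # Keep only customers with more than one such transaction
  filter(n_high_value > 1) %>%
  arrange(CustomerName, desc(TotalAmount))

repeat_high_value
```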


## Part 6: Data Analysis Questions

Answer the following questions using the datasets you've created.

In [19]:
# Question 6.1: Transaction Volume
# Count transactions in each filtered dataset

cat("Transaction counts by dataset:\n")
cat("High value transactions:", nrow(high_value_transactions), "\n")
cat("Electronics transactions:", nrow(electronics_transactions), "\n")
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "\n")
cat("Entertainment transactions:", nrow(entertainment_transactions), "\n")
cat("March transactions:", nrow(march_transactions), "\n")
cat("Premium purchases:", nrow(premium_purchases), "\n")
cat("Recent tech purchases:", nrow(recent_tech_purchases), "\n")
cat("High value customers:", nrow(high_value_customers), "\n")

Transaction counts by dataset:
High value transactions: 5 
Electronics transactions: 6 
NY bulk purchases: 1 
Entertainment transactions: 6 
March transactions: 15 
Premium purchases: 9 
Recent tech purchases: 6 
High value customers: 3 


In [20]:
# Question 6.2: Top Customers
# Find the customer who appears most frequently in high_value_customers

if(nrow(high_value_customers) > 0) {
  customer_frequency <- high_value_customers %>%
    # Your code here to count customer appearances:
    count(CustomerName, sort = TRUE)
  
  cat("Most frequent high-value customer:\n")
  print(customer_frequency)
} else {
  cat("No high-value customers found\n")
}

Most frequent high-value customer:
# A tibble: 3 × 2
  CustomerName      n
  <chr>         <int>
1 Emily Davis       1
2 John Smith        1
3 Lisa Anderson     1


In [21]:
# Question 6.3: Product Analysis
# Find top 5 most expensive transactions in entertainment_transactions

if(nrow(entertainment_transactions) > 0) {
  top_entertainment <- entertainment_transactions %>%
    # Your code here:
    select(CustomerName, ProductName, ProductCategory, TotalAmount) %>%
    arrange(desc(TotalAmount)) %>%
    slice_head(n = 5)
  
  cat("Top 5 most expensive entertainment transactions:\n")
  print(top_entertainment)
} else {
  cat("No entertainment transactions found\n")
}

Top 5 most expensive entertainment transactions:
# A tibble: 5 × 4
  CustomerName   ProductName             ProductCategory TotalAmount
  <chr>          <chr>                   <chr>                 <dbl>
1 William Garcia Guitar                  Music                 200. 
2 Sarah Johnson  Book: Data Science      Books                  60.0
3 John Smith     Python Programming Book Books                  40.0
4 Robert Taylor  Cooking Book            Books                  25.0
5 Sarah Johnson  Movie: Inception        Movies                 15.0


In [22]:
# Question 6.4: Geographic Analysis
# Find the city with the highest single transaction amount

highest_transaction_by_city <- transactions_by_city_amount %>%
  # Your code here:
  # Re-sort by amount first: the input is ordered by city, so its first row is
  # the top transaction of the alphabetically first city, not the overall maximum
  arrange(desc(TotalAmount)) %>%
  slice_head(n = 1) %>%
  select(CustomerCity, CustomerName, ProductName, TotalAmount)

cat("City with highest single transaction:\n")
print(highest_transaction_by_city)

City with highest single transaction:
# A tibble: 1 × 4
  CustomerCity CustomerName ProductName TotalAmount
  <chr>        <chr>        <chr>             <dbl>
1 New York     John Smith   Laptop            1000.


## Part 7: Reflection Questions

Please answer the following questions in the markdown cells below.

### Question 7.1: Pipe Operator Benefits

**How does using the pipe operator (`%>%`) improve code readability compared to nested function calls? Provide a specific example from your homework.**

The pipe operator (`%>%`) improves readability by letting operations read in the order they are executed, one step per line, like a step-by-step process. Nested calls must instead be read from the innermost function outward, which becomes hard to follow as more steps are added.

**Example from the homework:**

Without pipes (nested approach):
```r
premium_purchases <- arrange(select(filter(transactions, TotalAmount > 75), CustomerName, ProductName, ProductCategory, TotalAmount), desc(TotalAmount))
```

With pipes (readable approach):
```r
premium_purchases <- transactions %>%
  filter(TotalAmount > 75) %>%
  select(CustomerName, ProductName, ProductCategory, TotalAmount) %>%
  arrange(desc(TotalAmount))
```

The piped version clearly shows the workflow: start with transactions → filter for high amounts → select specific columns → sort by amount. This makes the code much easier to read, debug, and modify.

### Question 7.2: Filtering Strategy

**When filtering data for business analysis, what are the trade-offs between being very specific (many conditions) versus being more general (fewer conditions)? How might this affect your insights?**

**Very Specific Filtering (Many Conditions):**
- **Pros:** Provides highly targeted insights, reduces noise, focuses on exact scenarios (like our NY bulk purchases filter)
- **Cons:** May exclude relevant data, could result in very small sample sizes, might miss broader patterns
- **Example:** Filtering for `TotalAmount > 50 & Quantity > 1 & CustomerCity == "New York"` returned only one of the 15 transactions, which is too few to generalize from

**More General Filtering (Fewer Conditions):**
- **Pros:** Captures broader trends, larger sample sizes, reveals overall patterns in the business
- **Cons:** May include irrelevant data, insights could be too general to be actionable
- **Example:** Filtering just for `ProductCategory == "Electronics"` returned 6 transactions, ranging from a $79.98 laptop stand to a $999.99 laptop

**Impact on Insights:** The key is finding the right balance. Too specific and you might miss important trends; too general and insights become meaningless. For business decisions, it's often better to start broad to understand overall patterns, then drill down with specific filters to investigate particular segments or anomalies.
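
One way to make this trade-off visible is to count the rows that survive as each condition is added. The sketch below does this on a five-row stand-in tibble (rows modeled on the dataset, not the full file), reusing the thresholds from the NY bulk purchases filter:

```r
library(dplyr)

# Five-row stand-in modeled on `transactions`
demo <- tibble(
  CustomerCity = c("New York", "New York", "New York", "Chicago", "Phoenix"),
  Quantity     = c(1, 3, 1, 1, 2),
  TotalAmount  = c(999.99, 59.97, 39.99, 199.99, 99.98)
)

# Row counts after each additional condition
funnel <- c(
  all_rows       = nrow(demo),
  amount_over_50 = demo %>% filter(TotalAmount > 50) %>% nrow(),
  plus_quantity  = demo %>% filter(TotalAmount > 50, Quantity > 1) %>% nrow(),
  plus_new_york  = demo %>%
    filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York") %>%
    nrow()
)

funnel
```

Seeing where the count drops sharply shows which condition is doing most of the narrowing, and whether the final subset is large enough to support a conclusion.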

### Question 7.3: Sorting Importance

**Why is data sorting important in business analytics? Provide three specific business scenarios where sorting data would be crucial for decision-making.**

Data sorting is crucial for identifying patterns, priorities, and outliers in business data. It helps decision-makers quickly focus on the most important information.

1. **Sales Performance Analysis:** Sorting sales data by revenue (highest to lowest) helps identify top-performing products, customers, or sales representatives. This enables management to recognize high performers, allocate resources effectively, and replicate successful strategies.

2. **Customer Service Priority:** Sorting customer complaints by date and severity helps support teams prioritize urgent issues and ensure timely responses. Chronological sorting also reveals trends in customer satisfaction over time.

3. **Inventory Management:** Sorting products by stock level (lowest to highest) immediately highlights items that need reordering, preventing stockouts. Sorting by sales velocity helps distinguish fast-moving items that require more frequent restocking from slow-moving inventory that might need promotional pricing.
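
As an illustration of the inventory scenario, the sketch below sorts a small made-up inventory table so that the items closest to a stockout come first, with faster sellers breaking ties. The `inventory` tibble, its column names (`stock_level`, `weekly_sales`), and all values are hypothetical; none of them come from the retail dataset.

```r
library(dplyr)

# Hypothetical inventory table; columns and values are invented for illustration
inventory <- tibble(
  product      = c("Laptop", "T-Shirt", "Jeans", "Guitar"),
  stock_level  = c(4, 120, 0, 15),
  weekly_sales = c(3, 10, 5, 1)
)

# Lowest stock first; among equal stock levels, fastest sellers first
reorder_queue <- inventory %>%
  arrange(stock_level, desc(weekly_sales))

reorder_queue
```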

### Question 7.4: Real-World Application

**Describe a real business scenario where you might need to combine `select()`, `filter()`, and `arrange()` operations. What insights would you be trying to gain?**

**Scenario: E-commerce Customer Retention Analysis**

Imagine you're working for an online retailer and need to identify customers at risk of churning to create a targeted retention campaign.

**Combined Operations:**
```r
at_risk_customers <- customer_data %>%
  filter(last_purchase_days > 90 & total_spent > 500 & num_purchases > 3) %>%
  # Keep last_purchase_days: arrange() below can only sort by columns that survive select()
  select(customer_id, customer_name, email, total_spent, last_purchase_days, last_purchase_date, preferred_category) %>%
  arrange(desc(total_spent), desc(last_purchase_days))
```

**Insights Being Gained:**
- **Filter:** Identifies valuable customers (high spending, multiple purchases) who haven't bought recently (potential churn risk)
- **Select:** Focuses on essential information needed for the retention campaign (contact info, spending patterns, preferences)
- **Arrange:** Prioritizes customers by value (highest spenders first) and risk level (longest time since purchase)

This analysis would help the marketing team create personalized retention offers for high-value customers who show signs of disengagement, maximizing the ROI of retention efforts by focusing on the most valuable at-risk customers first.
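
To make the scenario concrete, here is a runnable sketch on a made-up `customer_data` tibble. Every customer, column value, and email address is hypothetical, and the column set is simplified. The sort is placed before `select()` so that the sorting columns are guaranteed to exist when `arrange()` runs.

```r
library(dplyr)

# Hypothetical customer table; all rows are invented for illustration
customer_data <- tibble(
  customer_id        = c("C101", "C102", "C103", "C104"),
  customer_name      = c("Ana", "Ben", "Cara", "Dev"),
  email              = c("ana@example.com", "ben@example.com",
                         "cara@example.com", "dev@example.com"),
  total_spent        = c(1200, 800, 300, 1200),
  num_purchases      = c(6, 4, 10, 5),
  last_purchase_days = c(120, 200, 150, 95)
)

at_risk_customers <- customer_data %>%
  # Valuable customers who have not purchased in over 90 days
  filter(last_purchase_days > 90, total_spent > 500, num_purchases > 3) %>%
  # Highest spenders first; longer inactivity breaks ties
  arrange(desc(total_spent), desc(last_purchase_days)) %>%
  # Keep only what the campaign needs
  select(customer_id, customer_name, email, total_spent, last_purchase_days)

at_risk_customers
```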

## Summary and Submission

### What You've Learned

In this homework, you've practiced:
- Using `select()` for column selection with various methods
- Using `filter()` for row filtering with single and multiple conditions
- Using `arrange()` for sorting data by single and multiple columns
- Chaining operations with the pipe operator (`%>%`)
- Analyzing business data to generate insights

### Submission Checklist

Before submitting, ensure you have:
- [ ] Completed all code tasks
- [ ] Run all cells successfully
- [ ] Answered all reflection questions
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator where appropriate
- [ ] Verified your results make sense

### Next Steps

In the next lesson, you'll learn about:
- `mutate()` for creating new columns
- `summarize()` for calculating summary statistics
- `group_by()` for grouped operations
- Advanced data transformation techniques