# Homework Assignment - Lesson 3: Data Transformation with dplyr - Part 1

**Student Name:** Sean Johnson

**Due Date:** 09/21/2025

**Objective:** Learn to use dplyr functions (`select()`, `filter()`, `arrange()`) and the pipe operator (`%>%`) for data transformation and analysis.

---

## Instructions

- Complete all tasks in this notebook
- Use the pipe operator (`%>%`) wherever possible to chain operations
- Ensure your code is well-commented and easy to understand
- Run all cells to verify your code works correctly
- Answer all reflection questions at the end

---

## Part 1: Data Import and Setup

In this section, you'll import the retail transactions dataset and perform initial exploration.

**Dataset:** `retail_transactions.csv` - This dataset contains transaction records from a retail business with information about customers, products, dates, amounts, and quantities.

In [1]:
# Load required libraries
library(tidyverse)

# Set working directory to your workspace root
setwd("/workspaces/assignment-1-version-3-seanjohnson04")

# Task 1.1: Import the retail_transactions.csv file
# If the file is not in this folder, move it here or update the path accordingly.
transactions <- read_csv("data/retail_transactions.csv")

# Display success message
cat("Data imported successfully!\n")
cat("Dataset dimensions:", nrow(transactions), "rows x", ncol(transactions), "columns\n")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 500 Columns: 9
── Column specification ────────────────────────────────────────────────────────

Data imported successfully!
Dataset dimensions: 500 rows x 9 columns


In [2]:
# Task 1.2: Initial Exploration

# Display the first 10 rows
cat("First 10 rows of the dataset:\n")
print(head(transactions, 10))

# Check the structure of the dataset
cat("\nDataset structure:\n")
str(transactions)

# Display column names and their data types
cat("\nColumn names and data types:\n")
print(sapply(transactions, class))

First 10 rows of the dataset:
# A tibble: 10 × 9
   TransactionID CustomerID CustomerName CustomerCity ProductName    
           <dbl>      <dbl> <chr>        <chr>        <chr>          
 1             1         81 Customer 39  Chicago      Adidas Jacket  
 2             2         13 Customer 63  Philadelphia Samsung TV     
 3             3         18 Customer 98  Chicago      Adidas Jacket  
 4             4         76 Customer 39  Houston      Dell Laptop    
 5             5         86 Customer 45  New York     Nike Shoes     
 6             6         37 Customer 8   Philadelphia Adidas Jacket  
 7             7         45 Customer 83  New York     HP Printer     
 8             8         11 Customer 60  Chicago      Samsung TV     
 9             9         13 Customer 69  Houston      iP

## Part 2: Column Selection with `select()`

Practice different methods of selecting columns from your dataset.

In [12]:
# Task 2.1: Basic Selection
# Create 'basic_info' with TransactionID, CustomerID, ProductName, and TotalAmount

basic_info <- transactions %>%
  select(TransactionID, CustomerID, ProductName, TotalAmount)

# Display the result
cat("Basic info dataset (first 5 rows):\n")
print(head(basic_info, 5))

Basic info dataset (first 5 rows):
# A tibble: 5 × 4
  TransactionID CustomerID ProductName   TotalAmount
          <dbl>      <dbl> <chr>               <dbl>
1             1         81 Adidas Jacket       632. 
2             2         13 Samsung TV          114. 
3             3         18 Adidas Jacket      1289. 
4             4         76 Dell Laptop         885. 
5             5         86 Nike Shoes           96.0

In [13]:
# Task 2.2: Range Selection
# Create 'customer_details' with all columns from CustomerID to CustomerCity (inclusive)

customer_details <- transactions %>%
  select(CustomerID:CustomerCity)

# Display the result
cat("Customer details (first 5 rows):\n")
print(head(customer_details, 5))

Customer details (first 5 rows):
# A tibble: 5 × 3
  CustomerID CustomerName CustomerCity
       <dbl> <chr>        <chr>       
1         81 Customer 39  Chicago     
2         13 Customer 63  Philadelphia
3         18 Customer 98  Chicago     
4         76 Customer 39  Houston     
5         86 Customer 45  New York    


In [14]:
# Task 2.3: Pattern-Based Selection

# Create 'date_columns' with columns starting with "Date" or "Time"
date_columns <- transactions %>%
  select(starts_with("Date"), starts_with("Time"))

# Create 'amount_columns' with columns containing the word "Amount"
amount_columns <- transactions %>%
  select(contains("Amount"))

# Display column names for verification
cat("Date/Time columns:", names(date_columns), "\n")
cat("Amount columns:", names(amount_columns), "\n")

Date/Time columns:  
Amount columns: TotalAmount 
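
The empty `Date/Time columns:` line appears to reflect that no column name begins with "Date" or "Time"; the only date column shown in the outputs is `TransactionDate`. The following is a minimal sketch, on a small made-up tibble (`toy` and its rows are hypothetical, only the column names come from the notebook), of two selection helpers that would match that column: `contains()` by name and `where()` by type.

```r
library(dplyr)

# Hypothetical three-row stand-in for the transactions data
toy <- tibble(
  TransactionID   = 1:3,
  TotalAmount     = c(632.1, 114.3, 1289.0),
  TransactionDate = as.Date(c("2024-01-05", "2024-03-12", "2024-03-30"))
)

# starts_with("Date") matches nothing: the name starts with "Transaction"
by_prefix <- toy %>% select(starts_with("Date"), starts_with("Time"))

# contains() matches the substring anywhere in the column name
by_name <- toy %>% select(contains("Date"))

# where() selects by column type instead of by name
by_type <- toy %>% select(where(~ inherits(.x, "Date")))

cat("Prefix match:", names(by_prefix), "\n")
cat("Name match:", names(by_name), "\n")
cat("Type match:", names(by_type), "\n")
```

Whether name-based or type-based selection is the better fit depends on what the task intends; the task wording asks for the prefix form used above.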


In [15]:
# Task 2.4: Exclusion Selection
# Create 'no_ids' without TransactionID and CustomerID columns

no_ids <- transactions %>%
  select(-TransactionID, -CustomerID)

# Display column names for verification
cat("Columns after removing IDs:", names(no_ids), "\n")
cat("Number of columns:", ncol(no_ids), "\n")

Columns after removing IDs: CustomerName CustomerCity ProductName ProductCategory TotalAmount Quantity TransactionDate 
Number of columns: 7 


## Part 3: Row Filtering with `filter()`

Learn to filter rows based on various conditions.

In [16]:
# Task 3.1: Single Condition Filtering

# Filter transactions with TotalAmount > $100
high_value_transactions <- transactions %>%
  filter(TotalAmount > 100)

# Filter transactions from "Electronics" category
electronics_transactions <- transactions %>%
  filter(ProductCategory == "Electronics")

# Display results
cat("High value transactions (>$100):", nrow(high_value_transactions), "rows\n")
cat("Electronics transactions:", nrow(electronics_transactions), "rows\n")

High value transactions (>$100): 470 rows
Electronics transactions: 93 rows


In [17]:
# Task 3.2: Multiple Condition Filtering (AND)
# Filter for TotalAmount > $50 AND Quantity > 1 AND CustomerCity == "New York"

ny_bulk_purchases <- transactions %>%
  filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York")

# Display results
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "rows\n")
if(nrow(ny_bulk_purchases) > 0) {
  print(head(ny_bulk_purchases))
}

NY bulk purchases: 75 rows
# A tibble: 6 × 9
  TransactionID CustomerID CustomerName CustomerCity ProductName ProductCategory
          <dbl>      <dbl> <chr>        <chr>        <chr>       <chr>          
1             5         86 Customer 45  New York     Nike Shoes  Computers      
2             7         45 Customer 83  New York     HP Printer  Clothing       
3            15          1 Customer 52  New York     Nike Shoes  Clothing       
4            22         80 Customer 4   New York     iPhone 14   Books          
5            25         97 Customer 17  New York     iPhone 14   Movies         
6            29          3 Customer 23  New York     Samsung TV  Clothing       
# ℹ 3 more variables: TotalAmount <dbl>, Quantity <dbl>, TransactionDate <date>

In [18]:
# Task 3.3: Multiple Condition Filtering (OR)
# Filter for ProductCategory = "Books" OR "Music" OR "Movies"

entertainment_transactions <- transactions %>%
  filter(ProductCategory %in% c("Books", "Music", "Movies"))

# Display results
cat("Entertainment transactions:", nrow(entertainment_transactions), "rows\n")
if(nrow(entertainment_transactions) > 0) {
  print(head(entertainment_transactions))
}

Entertainment transactions: 227 rows
# A tibble: 6 × 9
  TransactionID CustomerID CustomerName CustomerCity ProductName ProductCategory
          <dbl>      <dbl> <chr>        <chr>        <chr>       <chr>          
1             2         13 Customer 63  Philadelphia Samsung TV  Music          
2             8         11 Customer 60  Chicago      Samsung TV  Music          
3             9         13 Customer 69  Houston      iPhone 14   Music          
4            10         55 Customer 24  Chicago      Sony Headp… Books          
5            11        100 Customer 95  Philadelphia HP Printer  Movies         
6            14         19 Customer 100 Phoenix      Nike Shoes  Books          
# ℹ 3 more variables: TotalAmount <dbl>, Quantity <dbl>, TransactionDate <date>

In [19]:
# Task 3.4: Date-Based Filtering
# Filter transactions from March 2024
# Note: Adjust the date format and column name based on your actual data

# Assuming the date column is named 'TransactionDate' and is in YYYY-MM-DD format
march_transactions <- transactions %>%
  filter(format(as.Date(TransactionDate), "%Y-%m") == "2024-03")

# Display results
cat("March 2024 transactions:", nrow(march_transactions), "rows\n")

March 2024 transactions: 41 rows
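
Because the tibble output lists `TransactionDate` as `<date>`, the same month filter can also be written as a date range, which avoids converting dates to strings. This is a sketch on hypothetical rows (`toy`, `march_start`, and `april_start` are illustrative names, not part of the assignment), using a half-open interval so the last day of March is included and April 1 is not.

```r
library(dplyr)

# Hypothetical rows straddling the month boundaries
toy <- tibble(
  TransactionID   = 1:4,
  TransactionDate = as.Date(c("2024-02-29", "2024-03-01", "2024-03-31", "2024-04-01"))
)

march_start <- as.Date("2024-03-01")
april_start <- as.Date("2024-04-01")

# Keep dates on or after March 1 and strictly before April 1
march_toy <- toy %>%
  filter(TransactionDate >= march_start, TransactionDate < april_start)

print(march_toy)
```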


In [20]:
# Task 3.5: Advanced Filtering Challenge
# Find customers who made purchases in both "Electronics" AND "Clothing" categories
# Hint: This requires identifying customers who appear in both categories

# Step 1: Find customers who bought Electronics
electronics_customers <- transactions %>%
  filter(ProductCategory == "Electronics") %>%
  pull(CustomerID) %>%
  unique()

# Step 2: Find customers who bought Clothing
clothing_customers <- transactions %>%
  filter(ProductCategory == "Clothing") %>%
  pull(CustomerID) %>%
  unique()

# Step 3: Find customers who bought both
both_categories_customers <- intersect(electronics_customers, clothing_customers)

# Display results
cat("Customers who bought both Electronics and Clothing:", length(both_categories_customers), "customers\n")

Customers who bought both Electronics and Clothing: 38 customers
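
The intersected ID vector can be fed back into `filter()` with `%in%` to retrieve the matching rows, which is one way to spot-check the count. The sketch below repeats the three steps on a small hypothetical tibble (`toy` and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical purchases: customers 1 and 3 bought in both categories
toy <- tibble(
  CustomerID      = c(1, 1, 2, 3, 3, 3),
  ProductCategory = c("Electronics", "Clothing", "Electronics",
                      "Clothing", "Books", "Electronics")
)

electronics_ids <- toy %>%
  filter(ProductCategory == "Electronics") %>%
  pull(CustomerID) %>%
  unique()

clothing_ids <- toy %>%
  filter(ProductCategory == "Clothing") %>%
  pull(CustomerID) %>%
  unique()

both_ids <- intersect(electronics_ids, clothing_ids)

# Pull the rows behind the result to verify it by eye
both_rows <- toy %>%
  filter(CustomerID %in% both_ids,
         ProductCategory %in% c("Electronics", "Clothing")) %>%
  arrange(CustomerID, ProductCategory)

print(both_rows)
```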


## Part 4: Data Sorting with `arrange()`

Practice sorting data by single and multiple columns.

In [21]:
# Task 4.1: Single Column Sorting

# Sort by TotalAmount ascending
transactions_by_amount_asc <- transactions %>%
  arrange(TotalAmount)

# Sort by TotalAmount descending
transactions_by_amount_desc <- transactions %>%
  arrange(desc(TotalAmount))

# Display top 5 of each
cat("Lowest amounts:\n")
print(head(transactions_by_amount_asc %>% select(CustomerName, ProductName, TotalAmount), 5))

cat("\nHighest amounts:\n")
print(head(transactions_by_amount_desc %>% select(CustomerName, ProductName, TotalAmount), 5))

Lowest amounts:
# A tibble: 5 × 3
  CustomerName ProductName   TotalAmount
  <chr>        <chr>               <dbl>
1 Customer 95  Adidas Jacket        27.7
2 Customer 100 Nike Shoes           32.3
3 Customer 50  Adidas Jacket        35.0
4 Customer 83  Samsung TV           36.4
5 Customer 69  Dell Laptop          37.3

Highest amounts:
# A tibble: 5 × 3
  CustomerName ProductName     TotalAmount
  <chr>       

In [22]:
# Task 4.2: Multiple Column Sorting
# Sort by CustomerCity (ascending), then by TotalAmount (descending)

transactions_by_city_amount <- transactions %>%
  arrange(CustomerCity, desc(TotalAmount))

# Display first 10 rows
cat("Transactions sorted by city, then amount:\n")
head(transactions_by_city_amount %>% select(CustomerCity, CustomerName, ProductName, TotalAmount), 10)

Transactions sorted by city, then amount:


CustomerCity CustomerName ProductName     TotalAmount
<chr>        <chr>        <chr>                 <dbl>
Chicago      Customer 18  iPhone 14           1487.43
Chicago      Customer 20  HP Printer          1476.49
Chicago      Customer 97  Dell Laptop         1459.42
Chicago      Customer 49  Dell Laptop         1453.94
Chicago      Customer 70  Nike Shoes          1428.35
Chicago      Customer 71  Samsung TV          1424.38
Chicago      Customer 62  Sony Headphones     1416.35
Chicago      Customer 11  Nike Shoes          1407.51
Chicago      Customer 99  HP Printer          1388.73
Chicago      Customer 49  Nike Shoes          1370.57


In [23]:
# Task 4.3: Date-Based Sorting
# Sort by TransactionDate chronologically (oldest first)

transactions_chronological <- transactions %>%
  arrange(as.Date(TransactionDate))

# Display first 5 transactions chronologically
cat("Earliest transactions:\n")
head(transactions_chronological %>% select(TransactionDate, CustomerName, ProductName, TotalAmount), 5)

Earliest transactions:


TransactionDate CustomerName ProductName     TotalAmount
<date>          <chr>        <chr>                 <dbl>
2024-01-01      Customer 62  Sony Headphones     1416.35
2024-01-01      Customer 99  Sony Headphones     1326.27
2024-01-02      Customer 83  Nike Shoes           808.93
2024-01-02      Customer 61  Adidas Jacket        502.72
2024-01-02      Customer 69  Adidas Jacket         277.7


## Part 5: Chaining Operations

Combine multiple dplyr operations using the pipe operator.

In [24]:
# Task 5.1: Simple Chain
# Filter TotalAmount > $75, select specific columns, arrange by TotalAmount descending

premium_purchases <- transactions %>%
  filter(TotalAmount > 75) %>%
  select(TransactionID, CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Display results
cat("Premium purchases (>$75):\n")
head(premium_purchases, 10)

Premium purchases (>$75):


TransactionID CustomerName ProductName     TotalAmount
        <dbl> <chr>        <chr>                 <dbl>
          150 Customer 60  Sony Headphones     1499.52
           61 Customer 28  Dell Laptop         1491.96
          154 Customer 81  Sony Headphones     1491.62
          187 Customer 79  iPhone 14           1488.95
          136 Customer 20  HP Printer          1487.44
          297 Customer 18  iPhone 14           1487.43
          378 Customer 83  iPhone 14           1484.31
           16 Customer 20  HP Printer          1476.49
          479 Customer 10  Dell Laptop         1473.27
           84 Customer 14  iPhone 14           1471.59


In [25]:
# Task 5.2: Complex Chain
# Filter for Electronics/Computers, select columns, arrange by date/amount, keep top 20

recent_tech_purchases <- transactions %>%
  filter(ProductCategory %in% c("Electronics", "Computers")) %>%
  select(TransactionDate, CustomerName, ProductName, ProductCategory, TotalAmount) %>%
  arrange(desc(as.Date(TransactionDate)), desc(TotalAmount)) %>%
  head(20)

# Display results
cat("Recent tech purchases (top 20):\n")
print(recent_tech_purchases)

Recent tech purchases (top 20):
# A tibble: 20 × 5
   TransactionDate CustomerName ProductName     ProductCategory TotalAmount
   <date>          <chr>        <chr>           <chr>                 <dbl>
 1 2024-12-30      Customer 27  Sony Headphones Electronics            540.
 2 2024-12-30      Customer 33  Sony Headphones Computers              207.
 3 2024-12-25      Customer 39  Samsung TV      Computers              121.
 4 2024-12-24      Customer 25  Sony Headphones Electronics            700.
 5 2024-12-23      Customer 85  iPhone 14       Computers             1433.
 6 2024-12-19      Customer 73  Dell Laptop     Computers             1221.
 7 2024-12-13      Customer 17  Dell Laptop     Computers              141.
 8 2024-12-12      Customer 6   Nike Shoes      Computers     

In [26]:
# Task 5.3: Business Intelligence Chain
# Identify high-value repeat customers (TotalAmount > $200)

high_value_customers <- transactions %>%
  filter(TotalAmount > 200) %>%
  group_by(CustomerID, CustomerName) %>%
  summarise(TotalPurchases = n(), TotalSpent = sum(TotalAmount), .groups = 'drop') %>%
  filter(TotalPurchases > 1) %>%
  arrange(desc(TotalSpent))

# Display results
cat("High-value customers:\n")
head(high_value_customers, 15)

High-value customers:


CustomerID CustomerName TotalPurchases TotalSpent
     <dbl> <chr>                 <int>      <dbl>
         7 Customer 18               2    2782.54
        77 Customer 99               2    2628.75
        72 Customer 13               2     1986.1
        70 Customer 49               2    1823.74
        69 Customer 14               2    1095.99
        87 Customer 87               2     717.69
87,Customer 87,2,717.69


## Part 6: Data Analysis Questions

Answer the following questions using the datasets you've created.

In [27]:
# Question 6.1: Transaction Volume
# Count transactions in each filtered dataset

cat("Transaction counts by dataset:\n")
cat("High value transactions:", nrow(high_value_transactions), "\n")
cat("Electronics transactions:", nrow(electronics_transactions), "\n")
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "\n")
cat("Entertainment transactions:", nrow(entertainment_transactions), "\n")
cat("March transactions:", nrow(march_transactions), "\n")
cat("Premium purchases:", nrow(premium_purchases), "\n")
cat("Recent tech purchases:", nrow(recent_tech_purchases), "\n")
cat("High value customers:", nrow(high_value_customers), "\n")

Transaction counts by dataset:
High value transactions: 470 
Electronics transactions: 93 
NY bulk purchases: 75 
Entertainment transactions: 227 
March transactions: 41 
Premium purchases: 483 
Recent tech purchases: 20 
High value customers: 6 


In [29]:
# Question 6.2: Top Customers
# Find the customer who appears most frequently in high_value_customers

if(nrow(high_value_customers) > 0) {
  customer_frequency <- high_value_customers %>%
    arrange(desc(TotalPurchases), desc(TotalSpent)) %>%
    head(1)
  
  cat("Most frequent high-value customer:\n")
  print(customer_frequency)
} else {
  cat("No high-value customers found\n")
}

Most frequent high-value customer:
# A tibble: 1 × 4
  CustomerID CustomerName TotalPurchases TotalSpent
       <dbl> <chr>                 <int>      <dbl>
1          7 Customer 18               2      2783.


In [30]:
# Question 6.3: Product Analysis
# Find top 5 most expensive transactions in entertainment_transactions

if(nrow(entertainment_transactions) > 0) {
  top_entertainment <- entertainment_transactions %>%
    arrange(desc(TotalAmount)) %>%
    head(5)
  
  cat("Top 5 most expensive entertainment transactions:\n")
  print(top_entertainment)
} else {
  cat("No entertainment transactions found\n")
}

Top 5 most expensive entertainment transactions:
# A tibble: 5 × 9
  TransactionID CustomerID CustomerName CustomerCity ProductName ProductCategory
          <dbl>      <dbl> <chr>        <chr>        <chr>       <chr>          
1           154         18 Customer 81  Phoenix      Sony Headp… Books          
2           378         53 Customer 83  Philadelphia iPhone 14   Movies         
3           479         83 Customer 10  Houston      Dell Laptop Books          
4           384         42 Customer 97  Chicago      Dell Laptop Books          
5           468         32 Customer 47  Los Angeles  Dell Laptop Books          
# ℹ 3 more variables: TotalAmount <dbl>, Quantity <dbl>, TransactionDate <date>


In [31]:
# Question 6.4: Geographic Analysis
# Find the city with the highest single transaction amount

highest_transaction_by_city <- transactions_by_city_amount %>%
  arrange(desc(TotalAmount)) %>%
  select(CustomerCity, CustomerName, ProductName, TotalAmount) %>%
  head(1)

cat("City with highest single transaction:\n")
print(highest_transaction_by_city)

City with highest single transaction:
# A tibble: 1 × 4
  CustomerCity CustomerName ProductName     TotalAmount
  <chr>        <chr>        <chr>                 <dbl>
1 New York     Customer 60  Sony Headphones       1500.


## Part 7: Reflection Questions

Please answer the following questions in the markdown cells below.

### Question 7.1: Pipe Operator Benefits

**How does using the pipe operator (`%>%`) improve code readability compared to nested function calls? Provide a specific example from your homework.**

The pipe operator (`%>%`) makes code easier to read because each operation appears in the order it runs, from top to bottom. Instead of untangling nested function calls from the inside out, you follow the data through one step at a time, which makes the code easier to understand, debug, and change.

**Example:**
```r
premium_purchases <- transactions %>%
  filter(TotalAmount > 75) %>%
  select(TransactionID, CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TotalAmount))
```
The piped version is easier to read than the nested equivalent, which has to be read from the innermost call outward:
```r
premium_purchases <- arrange(select(filter(transactions, TotalAmount > 75), TransactionID, CustomerName, ProductName, TotalAmount), desc(TotalAmount))
```

### Question 7.2: Filtering Strategy

**When filtering data for business analysis, what are the trade-offs between being very specific (many conditions) versus being more general (fewer conditions)? How might this affect your insights?**

Using lots of specific filters helps you focus on a narrow set of data, which is good for answering detailed questions or spotting small trends. But it might leave out important data and give you a limited view.

Using fewer, broader filters gives you a bigger picture and helps find overall trends, but it can bring in too much extra data, making it harder to find clear insights.

**Trade-offs:**
- Specific filters: More precise answers, but risk of missing important outliers or related cases.
- General filters: Broader insights, but risk of including too much irrelevant data.

The choice depends on your business goal: sometimes you need detail, and other times you need a high-level overview of the data.
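
The trade-off shows up directly in row counts. The following sketch uses a hypothetical six-row tibble (`toy` and its values are made up; the column names follow the notebook) to compare how many rows survive a broad filter versus a narrow one.

```r
library(dplyr)

# Hypothetical transactions for illustration only
toy <- tibble(
  CustomerCity    = c("New York", "New York", "Chicago", "Houston", "New York", "Chicago"),
  ProductCategory = c("Electronics", "Books", "Electronics", "Clothing", "Electronics", "Music"),
  Quantity        = c(1, 3, 2, 5, 4, 1),
  TotalAmount     = c(40, 120, 75, 300, 900, 20)
)

# Broad: one condition, keeps most rows
broad <- toy %>%
  filter(TotalAmount > 50)

# Narrow: four conditions, keeps only rows that satisfy all of them
narrow <- toy %>%
  filter(TotalAmount > 50,
         Quantity > 1,
         CustomerCity == "New York",
         ProductCategory == "Electronics")

cat("Broad filter keeps", nrow(broad), "of", nrow(toy), "rows\n")
cat("Narrow filter keeps", nrow(narrow), "of", nrow(toy), "rows\n")
```

Checking `nrow()` after each added condition is a simple way to notice when a filter has become too restrictive to support a conclusion.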

### Question 7.3: Sorting Importance

**Why is data sorting important in business analytics? Provide three specific business scenarios where sorting data would be crucial for decision-making.**

Sorting data is important in business analytics because it helps you quickly spot patterns, unusual values, and what matters most. It makes it easier to focus on the right information and make better decisions.

**Three business scenarios where sorting is crucial:**

- **Sales focus:** Sort customers by how much they have spent to find the top buyers for upselling or retention offers.
- **Managing stock:** Sort products by units sold or units left in stock to restock popular items or clear out slow ones.
- **Customer help:** Sort support requests by urgency or date so the most important ones are resolved first.
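
Each of the three scenarios reduces to an `arrange()` call. The sketch below uses three small hypothetical tibbles; every table, column name, and value is made up for illustration, and it assumes that an `Urgency` of 1 means most urgent.

```r
library(dplyr)

# Hypothetical data for each scenario
customers <- tibble(
  CustomerName = c("A", "B", "C"),
  TotalSpent   = c(250, 1200, 640)
)
inventory <- tibble(
  ProductName  = c("Nike Shoes", "Dell Laptop", "HP Printer"),
  UnitsInStock = c(40, 3, 12)
)
tickets <- tibble(
  TicketID = 1:3,
  Urgency  = c(2, 1, 1),  # assumption: 1 = most urgent
  Opened   = as.Date(c("2024-03-02", "2024-03-05", "2024-03-01"))
)

# Sales focus: biggest spenders first
top_spenders <- customers %>% arrange(desc(TotalSpent))

# Managing stock: lowest stock first
low_stock <- inventory %>% arrange(UnitsInStock)

# Customer help: most urgent first, oldest first within the same urgency
ticket_queue <- tickets %>% arrange(Urgency, Opened)

print(top_spenders)
print(low_stock)
print(ticket_queue)
```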

### Question 7.4: Real-World Application

**Describe a real business scenario where you might need to combine `select()`, `filter()`, and `arrange()` operations. What insights would you be trying to gain?**

A real example in business could be finding your best repeat customers for a loyalty program. You could:

- Use `filter()` to keep only purchases from the past year and above a certain amount.
- Use `select()` to focus on key details like `CustomerID`, `CustomerName`, and `TotalAmount`.
- Use `arrange()` to sort customers by how much they spent, from highest to lowest.

This makes it easy to see which customers bring in the most money and should get special offers, helping keep them loyal and boosting your business.
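
The loyalty scenario can be sketched as a single chain. The rows, the `cutoff_date`, and the `min_amount` threshold below are hypothetical values chosen for illustration; only the column names come from the notebook.

```r
library(dplyr)

# Hypothetical purchases
toy <- tibble(
  CustomerID      = c(7, 13, 81, 45),
  CustomerName    = c("Customer 18", "Customer 63", "Customer 39", "Customer 83"),
  TotalAmount     = c(1500, 90, 632, 410),
  TransactionDate = as.Date(c("2024-11-02", "2024-06-15", "2023-01-20", "2024-09-09"))
)

cutoff_date <- as.Date("2024-01-01")  # assumed start of the "past year" window
min_amount  <- 200                    # assumed spending threshold

loyalty_candidates <- toy %>%
  filter(TransactionDate >= cutoff_date, TotalAmount > min_amount) %>%  # recent, large purchases
  select(CustomerID, CustomerName, TotalAmount) %>%                     # key columns only
  arrange(desc(TotalAmount))                                            # biggest spenders first

print(loyalty_candidates)
```

Filtering before selecting matters here: `TransactionDate` is needed for the filter, so it can only be dropped afterwards.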

## Summary and Submission

### What You've Learned

In this homework, you've practiced:
- Using `select()` for column selection with various methods
- Using `filter()` for row filtering with single and multiple conditions
- Using `arrange()` for sorting data by single and multiple columns
- Chaining operations with the pipe operator (`%>%`)
- Analyzing business data to generate insights

### Submission Checklist

Before submitting, ensure you have:
- [ ] Completed all code tasks
- [ ] Run all cells successfully
- [ ] Answered all reflection questions
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator where appropriate
- [ ] Verified your results make sense

### Next Steps

In the next lesson, you'll learn about:
- `mutate()` for creating new columns
- `summarize()` for calculating summary statistics
- `group_by()` for grouped operations
- Advanced data transformation techniques