# Homework Assignment - Lesson 3: Data Transformation with dplyr - Part 1

**Student Name:** Alejandro De Santiago Palomares Salinas

**Due Date:** 09/21/2025

**Objective:** Learn to use dplyr functions (`select()`, `filter()`, `arrange()`) and the pipe operator (`%>%`) for data transformation and analysis.

---

## Instructions

- Complete all tasks in this notebook
- Use the pipe operator (`%>%`) wherever possible to chain operations
- Ensure your code is well-commented and easy to understand
- Run all cells to verify your code works correctly
- Answer all reflection questions at the end

---

## Part 1: Data Import and Setup

In this section, you'll import the retail transactions dataset and perform initial exploration.

**Dataset:** `retail_transactions.csv` - This dataset contains transaction records from a retail business with information about customers, products, dates, amounts, and quantities.

In [9]:
# Load required libraries
library(tidyverse)

# Set working directory if needed
setwd("/workspaces/assignment-2-version3-Aledesan-utsa/data")

# Task 1.1: Import the retail_transactions.csv file
# Create a data frame named 'transactions'

# Your code here:
transactions <- read.csv("retail_transactions.csv")

# Display success message
cat("Data imported successfully!\n")
cat("Dataset dimensions:", nrow(transactions), "rows x", ncol(transactions), "columns\n")

Data imported successfully!
Dataset dimensions: 500 rows x 9 columns


In [10]:
# Task 1.2: Initial Exploration

# Display the first 10 rows
cat("First 10 rows of the dataset:\n")
head(transactions, 10)


# Check the structure of the dataset
cat("\nDataset structure:\n")
str(transactions)


# Display column names and their data types
cat("\nColumn names:\n")
colnames(transactions)

First 10 rows of the dataset:


,TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<int>,<chr>
1,1,81,Customer 39,Chicago,Adidas Jacket,Clothing,632.39,3,2024-03-09
2,2,13,Customer 63,Philadelphia,Samsung TV,Music,114.28,3,2024-12-08
3,3,18,Customer 98,Chicago,Adidas Jacket,Computers,1289.24,7,2024-01-22
4,4,76,Customer 39,Houston,Dell Laptop,Computers,885.4,2,2024-07-02
5,5,86,Customer 45,New York,Nike Shoes,Computers,95.95,5,2024-08-13
6,6,37,Customer 8,Philadelphia,Adidas Jacket,Electronics,1126.34,2,2024-04-15
7,7,45,Customer 83,New York,HP Printer,Clothing,78.71,3,2024-05-02
8,8,11,Customer 60,Chicago,Samsung TV,Music,871.93,3,2024-04-30
9,9,13,Customer 69,Houston,iPhone 14,Music,1347.56,8,2024-08-08
10,10,55,Customer 24,Chicago,Sony Headphones,Books,633.51,1,2024-06-23



Dataset structure:
'data.frame':	500 obs. of  9 variables:
 $ TransactionID  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CustomerID     : int  81 13 18 76 86 37 45 11 13 55 ...
 $ CustomerName   : chr  "Customer 39" "Customer 63" "Customer 98" "Customer 39" ...
 $ CustomerCity   : chr  "Chicago" "Philadelphia" "Chicago" "Houston" ...
 $ ProductName    : chr  "Adidas Jacket" "Samsung TV" "Adidas Jacket" "Dell Laptop" ...
 $ ProductCategory: chr  "Clothing" "Music" "Computers" "Computers" ...
 $ TotalAmount    : num  632 114 1289 885 96 ...
 $ Quantity       : int  3 3 7 2 5 2 3 3 8 1 ...
 $ TransactionDate: chr  "2024-03-09" "2024-12-08" "2024-01-22" "2024-07-02" ...

Column names:


## Part 2: Column Selection with `select()`

Practice different methods of selecting columns from your dataset.

In [12]:
# Task 2.1: Basic Selection
# Create 'basic_info' with TransactionID, CustomerID, ProductName, and TotalAmount

basic_info <- transactions %>%
  select(TransactionID, CustomerID, ProductName, TotalAmount)

print("Selected Columns (TransactionID, CustomerID, ProductName,TotalAmount):")
print(basic_info)               # Result has only 4 columns instead of the original 9


# Display the result
cat("Basic info dataset (first 5 rows):\n")
head(basic_info, 5)

[1] "Selected Columns (TransactionID, CustomerID, ProductName,TotalAmount):"
    TransactionID CustomerID     ProductName TotalAmount
1               1         81   Adidas Jacket      632.39
2               2         13      Samsung TV      114.28
3               3         18   Adidas Jacket     1289.24
4               4         76     Dell Laptop      885.40
5               5         86      Nike Shoes       95.95
6               6         37   Adidas Jacket     1126.34
7               7         45      HP Printer       78.71
8               8         11      Samsung TV      871.93
9               9         13       iPhone 14     1347.56
10             10         55 Sony Headphones      633.51
11             11        100      HP Printer      572.43
12             12         89   Adidas Jacket     1208.43
13             13         56     Dell Laptop     1337.94
14             14         19      Nike Shoes       32.29
15             15          1      Nike Shoes      602.79
16         

,TransactionID,CustomerID,ProductName,TotalAmount
,<int>,<int>,<chr>,<dbl>
1,1,81,Adidas Jacket,632.39
2,2,13,Samsung TV,114.28
3,3,18,Adidas Jacket,1289.24
4,4,76,Dell Laptop,885.4
5,5,86,Nike Shoes,95.95


In [14]:
# Task 2.2: Range Selection
# Create 'customer_details' with all columns from CustomerID to CustomerCity (inclusive)

customer_details <- transactions %>%
  select(CustomerID:CustomerCity)
print("Selected Columns by Range (CustomerID to CustomerCity):")
print(customer_details)


# Display the result
cat("Customer details (first 5 rows):\n")
head(customer_details, 5)

[1] "Selected Columns by Range (CustomerID to CustomerCity):"
    CustomerID CustomerName CustomerCity
1           81  Customer 39      Chicago
2           13  Customer 63 Philadelphia
3           18  Customer 98      Chicago
4           76  Customer 39      Houston
5           86  Customer 45     New York
6           37   Customer 8 Philadelphia
7           45  Customer 83     New York
8           11  Customer 60      Chicago
9           13  Customer 69      Houston
10          55  Customer 24      Chicago
11         100  Customer 95 Philadelphia
12          89  Customer 22 Philadelphia
13          56  Customer 62      Phoenix
14          19 Customer 100      Phoenix
15           1  Customer 52     New York
16          78  Customer 20      Chicago
17           2  Customer 38     New York
18           5  Customer 15      Phoenix
19           7   Customer 1      Phoenix
20          27  Customer 57      Chicago
21          31   Customer 1      Houston
22          80   Customer 4     New 

,CustomerID,CustomerName,CustomerCity
,<int>,<chr>,<chr>
1,81,Customer 39,Chicago
2,13,Customer 63,Philadelphia
3,18,Customer 98,Chicago
4,76,Customer 39,Houston
5,86,Customer 45,New York


In [15]:
# Task 2.3: Pattern-Based Selection

# Create 'date_columns' with columns starting with "Date" or "Time"
date_columns <- transactions %>%
  select(starts_with("Date"), starts_with("Time"))  # selects columns starting with "Date" or "Time"
print("Selected Date/Time Columns:")
print(date_columns)

# Create 'amount_columns' with columns containing the word "Amount"
amount_columns <- transactions %>%
  select(contains("Amount"))     
print("Selected Columns Containing 'Amount':")
print(amount_columns)        


# Display column names for verification
cat("Date/Time columns:", names(date_columns), "\n")
cat("Amount columns:", names(amount_columns), "\n")

[1] "Selected Date/Time Columns:"
data frame with 0 columns and 500 rows
[1] "Selected Columns Containing 'Amount':"
    TotalAmount
1        632.39
2        114.28
3       1289.24
4        885.40
5         95.95
6       1126.34
7         78.71
8        871.93
9       1347.56
10       633.51
11       572.43
12      1208.43
13      1337.94
14        32.29
15       602.79
16      1476.49
17       520.91
18      1012.57
19       834.08
20      1176.07
21       259.44
22      1424.99
23       897.63
24      1168.05
25       999.24
26       504.96
27        88.45
28       703.88
29      1392.13
30       294.82
31      1219.37
32      1416.35
33        82.98
34       887.44
35       536.69
36       871.96
37      1080.28
38        42.27
39        50.25
40       744.85
41       247.98
42      1158.09
43       817.22
44      1136.57
45      1206.70
46       229.81
47        57.22
48       935.30
49      1220.59
50       808.93
51      1397.43
52      1246.39
53       502.72
54       475.40
55 
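
The `starts_with("Date")` call above returns zero columns because the only date column, `TransactionDate`, starts with "Transaction", not "Date". Here is a minimal sketch of how the selection helpers differ, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Hypothetical three-row stand-in that reuses some of the dataset's column names
toy <- data.frame(
  TransactionID   = 1:3,
  TotalAmount     = c(10.5, 20.0, 30.25),
  TransactionDate = c("2024-01-01", "2024-01-02", "2024-01-03")
)

# starts_with() matches a prefix only, so "Date" finds nothing here
prefix_match <- toy %>% select(starts_with("Date"))

# ends_with() and contains() both pick up TransactionDate
suffix_match   <- toy %>% select(ends_with("Date"))
anywhere_match <- toy %>% select(contains("Date"))

names(prefix_match)    # character(0)
names(suffix_match)    # "TransactionDate"
names(anywhere_match)  # "TransactionDate"
```

If the task wording must be followed literally, the empty result is expected for this dataset. `contains()` or `ends_with()` would be the helpers to reach for if the goal is to capture the date column.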

In [16]:
# Task 2.4: Exclusion Selection
# Create 'no_ids' without TransactionID and CustomerID columns

no_ids <- transactions %>%
  select(-TransactionID, -CustomerID)       # drop the two ID columns
print("Selected All Columns Except TransactionID and CustomerID:")
print(no_ids)         


# Display column names for verification
cat("Columns after removing IDs:", names(no_ids), "\n")
cat("Number of columns:", ncol(no_ids), "\n")

[1] "Selected All Columns Except TransactionID and CustomerID:"
    CustomerName CustomerCity     ProductName ProductCategory TotalAmount
1    Customer 39      Chicago   Adidas Jacket        Clothing      632.39
2    Customer 63 Philadelphia      Samsung TV           Music      114.28
3    Customer 98      Chicago   Adidas Jacket       Computers     1289.24
4    Customer 39      Houston     Dell Laptop       Computers      885.40
5    Customer 45     New York      Nike Shoes       Computers       95.95
6     Customer 8 Philadelphia   Adidas Jacket     Electronics     1126.34
7    Customer 83     New York      HP Printer        Clothing       78.71
8    Customer 60      Chicago      Samsung TV           Music      871.93
9    Customer 69      Houston       iPhone 14           Music     1347.56
10   Customer 24      Chicago Sony Headphones           Books      633.51
11   Customer 95 Philadelphia      HP Printer          Movies      572.43
12   Customer 22 Philadelphia   Adidas Jacket   

## Part 3: Row Filtering with `filter()`

Learn to filter rows based on various conditions.

In [19]:
# Task 3.1: Single Condition Filtering

# Filter transactions with TotalAmount > $100
high_value_transactions <- transactions %>%
  filter(TotalAmount > 100)                
print("Items with TotalAmount > 100:")
print(high_value_transactions)          


# Filter transactions from "Electronics" category
electronics_transactions <- transactions %>%
  filter(ProductCategory == "Electronics")
print("Items in Electronics Category:")
print(electronics_transactions)


# Display results
cat("High value transactions (>$100):", nrow(high_value_transactions), "rows\n")
cat("Electronics transactions:", nrow(electronics_transactions), "rows\n")

[1] "Items with TotalAmount > 100:"
    TransactionID CustomerID CustomerName CustomerCity     ProductName
1               1         81  Customer 39      Chicago   Adidas Jacket
2               2         13  Customer 63 Philadelphia      Samsung TV
3               3         18  Customer 98      Chicago   Adidas Jacket
4               4         76  Customer 39      Houston     Dell Laptop
5               6         37   Customer 8 Philadelphia   Adidas Jacket
6               8         11  Customer 60      Chicago      Samsung TV
7               9         13  Customer 69      Houston       iPhone 14
8              10         55  Customer 24      Chicago Sony Headphones
9              11        100  Customer 95 Philadelphia      HP Printer
10             12         89  Customer 22 Philadelphia   Adidas Jacket
11             13         56  Customer 62      Phoenix     Dell Laptop
12             15          1  Customer 52     New York      Nike Shoes
13             16         78  Customer 20

In [20]:
# Task 3.2: Multiple Condition Filtering (AND)
# Filter for TotalAmount > $50 AND Quantity > 1 AND CustomerCity == "New York"

ny_bulk_purchases <- transactions %>%
  filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York")


# Display results
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "rows\n")
if(nrow(ny_bulk_purchases) > 0) {
  head(ny_bulk_purchases)
}

NY bulk purchases: 75 rows


,TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<int>,<chr>
1,5,86,Customer 45,New York,Nike Shoes,Computers,95.95,5,2024-08-13
2,7,45,Customer 83,New York,HP Printer,Clothing,78.71,3,2024-05-02
3,15,1,Customer 52,New York,Nike Shoes,Clothing,602.79,5,2024-01-05
4,22,80,Customer 4,New York,iPhone 14,Books,1424.99,7,2024-02-01
5,25,97,Customer 17,New York,iPhone 14,Movies,999.24,2,2024-02-03
6,29,3,Customer 23,New York,Samsung TV,Clothing,1392.13,5,2024-04-30


In [21]:
# Task 3.3: Multiple Condition Filtering (OR)
# Filter for ProductCategory = "Books" OR "Music" OR "Movies"

entertainment_transactions <- transactions %>%
  filter(ProductCategory %in% c("Books", "Music", "Movies"))


# Display results
cat("Entertainment transactions:", nrow(entertainment_transactions), "rows\n")
if(nrow(entertainment_transactions) > 0) {
  head(entertainment_transactions)
}

Entertainment transactions: 227 rows


,TransactionID,CustomerID,CustomerName,CustomerCity,ProductName,ProductCategory,TotalAmount,Quantity,TransactionDate
,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<int>,<chr>
1,2,13,Customer 63,Philadelphia,Samsung TV,Music,114.28,3,2024-12-08
2,8,11,Customer 60,Chicago,Samsung TV,Music,871.93,3,2024-04-30
3,9,13,Customer 69,Houston,iPhone 14,Music,1347.56,8,2024-08-08
4,10,55,Customer 24,Chicago,Sony Headphones,Books,633.51,1,2024-06-23
5,11,100,Customer 95,Philadelphia,HP Printer,Movies,572.43,6,2024-12-09
6,14,19,Customer 100,Phoenix,Nike Shoes,Books,32.29,3,2024-12-11


In [None]:
# Task 3.4: Date-Based Filtering
# Filter transactions from March 2024
# Note: Adjust the date format and column name based on your actual data

march_transactions <- transactions %>%
  # TransactionDate is stored as character, so convert it before comparing
  filter(as.Date(TransactionDate) >= as.Date("2024-03-01"),
         as.Date(TransactionDate) <= as.Date("2024-03-31"))



# Display results
cat("March 2024 transactions:", nrow(march_transactions), "rows\n")

March 2024 transactions: 41 rows
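
Because `TransactionDate` is read in as character (`<chr>`), a month filter can also be written by converting the column and comparing its year-month part. This is a sketch on hypothetical rows (`toy` is made up for illustration), not the required solution:

```r
library(dplyr)

# Hypothetical rows that straddle the edges of March 2024
toy <- data.frame(
  TransactionID   = 1:4,
  TransactionDate = c("2024-02-29", "2024-03-01", "2024-03-31", "2024-04-01")
)

march_toy <- toy %>%
  # Convert to Date, format as "YYYY-MM", and keep only March 2024
  filter(format(as.Date(TransactionDate), "%Y-%m") == "2024-03")

march_toy$TransactionID  # 2 3
```

This form avoids spelling out the last day of the month, which helps when the same filter is reused for months of different lengths.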


In [23]:
# Task 3.5: Advanced Filtering Challenge
# Find customers who made purchases in both "Electronics" AND "Clothing" categories
# Hint: This requires identifying customers who appear in both categories

# Step 1: Find customers who bought Electronics
electronics_customers <- transactions %>%
  filter(ProductCategory == "Electronics")


# Step 2: Find customers who bought Clothing
clothing_customers <- transactions %>%
  filter(ProductCategory == "Clothing")


# Step 3: Find customers who bought both
both_categories_customers <- transactions %>%
  filter(CustomerID %in% electronics_customers$CustomerID & 
         CustomerID %in% clothing_customers$CustomerID)


# Display results
# length() on a data frame returns its number of columns (9 here), not the
# number of customers, so count distinct CustomerID values instead
cat("Customers who bought both Electronics and Clothing:",
    n_distinct(both_categories_customers$CustomerID), "customers\n")


## Part 4: Data Sorting with `arrange()`

Practice sorting data by single and multiple columns.

In [25]:
# Task 4.1: Single Column Sorting

# Sort by TotalAmount ascending
transactions_by_amount_asc <- transactions %>%
  arrange(TotalAmount)                      
print("Arranged by TotalAmount (Ascending):")
print(transactions_by_amount_asc)


# Sort by TotalAmount descending
transactions_by_amount_desc <- transactions %>%
  arrange(desc(TotalAmount))               
print("Arranged by TotalAmount (Descending):")
print(transactions_by_amount_desc)


# Display top 5 of each
cat("Lowest amounts:\n")
head(transactions_by_amount_asc %>% select(CustomerName, ProductName, TotalAmount), 5)

cat("\nHighest amounts:\n")
head(transactions_by_amount_desc %>% select(CustomerName, ProductName, TotalAmount), 5)

[1] "Arranged by TotalAmount (Ascending):"
    TransactionID CustomerID CustomerName CustomerCity     ProductName
1             315          1  Customer 95 Philadelphia   Adidas Jacket
2              14         19 Customer 100      Phoenix      Nike Shoes
3             477         72  Customer 50     New York   Adidas Jacket
4             275        100  Customer 83 Philadelphia      Samsung TV
5             282         48  Customer 69      Chicago     Dell Laptop
6             473         76  Customer 29     New York      HP Printer
7             325         26  Customer 43  Los Angeles      Samsung TV
8              38         55  Customer 61     New York   Adidas Jacket
9             464         84  Customer 83 Philadelphia      HP Printer
10             39         36  Customer 83  Los Angeles Sony Headphones
11            296         56  Customer 68      Chicago      Samsung TV
12             47         87  Customer 87      Chicago      HP Printer
13            406          7  Cust

    TransactionID CustomerID CustomerName CustomerCity     ProductName
1             150         64  Customer 60     New York Sony Headphones
2              61         87  Customer 28     New York     Dell Laptop
3             154         18  Customer 81      Phoenix Sony Headphones
4             187         64  Customer 79      Phoenix       iPhone 14
5             136         99  Customer 20      Phoenix      HP Printer
6             297          7  Customer 18      Chicago       iPhone 14
7             378         53  Customer 83 Philadelphia       iPhone 14
8              16         78  Customer 20      Chicago      HP Printer
9             479         83  Customer 10      Houston     Dell Laptop
10             84         41  Customer 14      Phoenix       iPhone 14
11            232         91  Customer 11     New York     Dell Laptop
12            341         47  Customer 36      Houston Sony Headphones
13            384         42  Customer 97      Chicago     Dell Laptop
14    

,CustomerName,ProductName,TotalAmount
,<chr>,<chr>,<dbl>
1,Customer 95,Adidas Jacket,27.66
2,Customer 100,Nike Shoes,32.29
3,Customer 50,Adidas Jacket,35.01
4,Customer 83,Samsung TV,36.37
5,Customer 69,Dell Laptop,37.33



Highest amounts:


,CustomerName,ProductName,TotalAmount
,<chr>,<chr>,<dbl>
1,Customer 60,Sony Headphones,1499.52
2,Customer 28,Dell Laptop,1491.96
3,Customer 81,Sony Headphones,1491.62
4,Customer 79,iPhone 14,1488.95
5,Customer 20,HP Printer,1487.44


In [26]:
# Task 4.2: Multiple Column Sorting
# Sort by CustomerCity (ascending), then by TotalAmount (descending)

transactions_by_city_amount <- transactions %>%
  arrange(CustomerCity, desc(TotalAmount))


# Display first 10 rows
cat("Transactions sorted by city, then amount:\n")
head(transactions_by_city_amount %>% select(CustomerCity, CustomerName, ProductName, TotalAmount), 10)

Transactions sorted by city, then amount:


,CustomerCity,CustomerName,ProductName,TotalAmount
,<chr>,<chr>,<chr>,<dbl>
1,Chicago,Customer 18,iPhone 14,1487.43
2,Chicago,Customer 20,HP Printer,1476.49
3,Chicago,Customer 97,Dell Laptop,1459.42
4,Chicago,Customer 49,Dell Laptop,1453.94
5,Chicago,Customer 70,Nike Shoes,1428.35
6,Chicago,Customer 71,Samsung TV,1424.38
7,Chicago,Customer 62,Sony Headphones,1416.35
8,Chicago,Customer 11,Nike Shoes,1407.51
9,Chicago,Customer 99,HP Printer,1388.73
10,Chicago,Customer 49,Nike Shoes,1370.57


In [29]:
# Task 4.3: Date-Based Sorting
# Sort by TransactionDate chronologically (oldest first)

transactions_chronological <- transactions %>%
  arrange(TransactionDate)


# Display first 5 transactions chronologically
cat("Earliest transactions:\n")
head(transactions_chronological %>% select(TransactionDate, CustomerName, ProductName, TotalAmount), 5)

Earliest transactions:


,TransactionDate,CustomerName,ProductName,TotalAmount
,<chr>,<chr>,<chr>,<dbl>
1,2024-01-01,Customer 62,Sony Headphones,1416.35
2,2024-01-01,Customer 99,Sony Headphones,1326.27
3,2024-01-02,Customer 83,Nike Shoes,808.93
4,2024-01-02,Customer 61,Adidas Jacket,502.72
5,2024-01-02,Customer 69,Adidas Jacket,277.7
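
Sorting `TransactionDate` as character works here because ISO `YYYY-MM-DD` strings sort in the same order as the dates they represent. That would not hold for a format such as `MM/DD/YYYY`. Here is a sketch with hypothetical values (the `us_date` column is made up, not part of the dataset):

```r
library(dplyr)

# Hypothetical dates written month-first
toy <- data.frame(
  us_date = c("12/01/2023", "01/15/2024", "03/02/2024")
)

# Character sort: compares text, so "01/15/2024" comes before "12/01/2023"
as_text <- toy %>% arrange(us_date)

# Date sort: convert inside arrange() to order chronologically
as_date <- toy %>% arrange(as.Date(us_date, format = "%m/%d/%Y"))

as_text$us_date  # "01/15/2024" "03/02/2024" "12/01/2023"
as_date$us_date  # "12/01/2023" "01/15/2024" "03/02/2024"
```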


## Part 5: Chaining Operations

Combine multiple dplyr operations using the pipe operator.

In [30]:
# Task 5.1: Filter for TotalAmount > $75, select key columns, arrange by amount (descending)

premium_purchases <- transactions %>%
  filter(TotalAmount > 75) %>%
  select(CustomerName, ProductName, TotalAmount) %>%
  arrange(desc(TotalAmount))

# Display results
cat("Premium purchases (>$75):\n")
head(premium_purchases, 10)

Premium purchases (>$75):


,CustomerName,ProductName,TotalAmount
,<chr>,<chr>,<dbl>
1,Customer 60,Sony Headphones,1499.52
2,Customer 28,Dell Laptop,1491.96
3,Customer 81,Sony Headphones,1491.62
4,Customer 79,iPhone 14,1488.95
5,Customer 20,HP Printer,1487.44
6,Customer 18,iPhone 14,1487.43
7,Customer 83,iPhone 14,1484.31
8,Customer 20,HP Printer,1476.49
9,Customer 10,Dell Laptop,1473.27
10,Customer 14,iPhone 14,1471.59


In [31]:
# Task 5.2: Complex Chain
# Filter for Electronics/Computers, select columns, arrange by date/amount, keep top 20

recent_tech_purchases <- transactions %>%
  filter(ProductCategory %in% c("Electronics", "Computers")) %>%  # Step 1: Filter for tech categories
  select(TransactionDate, CustomerName, ProductName, TotalAmount) %>%  # Step 2: Select specific columns
  arrange(desc(TransactionDate), desc(TotalAmount)) %>%  # Step 3: Sort by date (newest first), then amount (highest first)
  head(20) 


# Display results
cat("Recent tech purchases (top 20):\n")
print(recent_tech_purchases)

Recent tech purchases (top 20):
   TransactionDate CustomerName     ProductName TotalAmount
1       2024-12-30  Customer 27 Sony Headphones      539.52
2       2024-12-30  Customer 33 Sony Headphones      206.62
3       2024-12-25  Customer 39      Samsung TV      121.22
4       2024-12-24  Customer 25 Sony Headphones      700.12
5       2024-12-23  Customer 85       iPhone 14     1433.31
6       2024-12-19  Customer 73     Dell Laptop     1220.59
7       2024-12-13  Customer 17     Dell Laptop      140.76
8       2024-12-12   Customer 6      Nike Shoes      246.96
9       2024-12-11  Customer 54   Adidas Jacket     1231.83
10      2024-12-10  Customer 19      Nike Shoes      192.98
11      2024-12-07  Customer 12       iPhone 14      314.66
12      2024-12-06  Customer 85      HP Printer      929.88
13      2024-12-04  Customer 39     Dell Laptop      957.29
14      2024-12-04  Customer 12      Samsung TV      940.10
15      2024-12-02  Customer 90   Adidas Jacket     1445.64
16      

In [32]:
# Task 5.3: Business Intelligence Chain
# Identify high-value repeat customers (TotalAmount > $200)

high_value_customers <- transactions %>%
  filter(TotalAmount > 200) %>%
  select(CustomerName, ProductName, TotalAmount, TransactionDate) %>%
  arrange(desc(TotalAmount))


# Display results
cat("High-value customers:\n")
head(high_value_customers, 15)

High-value customers:


,CustomerName,ProductName,TotalAmount,TransactionDate
,<chr>,<chr>,<dbl>,<chr>
1,Customer 60,Sony Headphones,1499.52,2024-03-03
2,Customer 28,Dell Laptop,1491.96,2024-08-15
3,Customer 81,Sony Headphones,1491.62,2024-10-18
4,Customer 79,iPhone 14,1488.95,2024-10-23
5,Customer 20,HP Printer,1487.44,2024-05-02
6,Customer 18,iPhone 14,1487.43,2024-07-15
7,Customer 83,iPhone 14,1484.31,2024-12-24
8,Customer 20,HP Printer,1476.49,2024-05-29
9,Customer 10,Dell Laptop,1473.27,2024-04-14
10,Customer 14,iPhone 14,1471.59,2024-11-16


## Part 6: Data Analysis Questions

Answer the following questions using the datasets you've created.

In [None]:
# Question 6.1: Transaction Volume
# Count transactions in each filtered dataset

cat("Transaction counts by dataset:\n")
cat("High value transactions:", nrow(high_value_transactions), "\n")
cat("Electronics transactions:", nrow(electronics_transactions), "\n")
cat("NY bulk purchases:", nrow(ny_bulk_purchases), "\n")
cat("Entertainment transactions:", nrow(entertainment_transactions), "\n")
cat("March transactions:", nrow(march_transactions), "\n")
cat("Premium purchases:", nrow(premium_purchases), "\n")
cat("Recent tech purchases:", nrow(recent_tech_purchases), "\n")
cat("High value customers:", nrow(high_value_customers), "\n")

In [33]:
# Question 6.2: Top Customers
# Find the customer who appears most frequently in high_value_customers

if(nrow(high_value_customers) > 0) {
  customer_frequency <- high_value_customers %>%
    group_by(CustomerName) %>%
    summarise(frequency = n()) %>%
    arrange(desc(frequency))   
  
  
  cat("Most frequent high-value customer:\n")
  print(customer_frequency)
} else {
  cat("No high-value customers found\n")
}

Most frequent high-value customer:
# A tibble: 98 × 2
   CustomerName frequency
   <chr>            <int>
 1 Customer 18         12
 2 Customer 25         10
 3 Customer 87         10
 4 Customer 11          8
 5 Customer 53          8
 6 Customer 98          8
 7 Customer 1           7
 8 Customer 20          7
 9 Customer 47          7
10 Customer 74          7
# ℹ 88 more rows


In [None]:
# Question 6.3: Product Analysis
# Find top 5 most expensive transactions in entertainment_transactions

if(nrow(entertainment_transactions) > 0) {
  top_entertainment <- entertainment_transactions %>%
    arrange(desc(TotalAmount)) %>%  # most expensive first
    head(5)                         # keep the top 5

  cat("Top 5 most expensive entertainment transactions:\n")
  print(top_entertainment)
} else {
  cat("No entertainment transactions found\n")
}

In [35]:
# Question 6.4: Geographic Analysis
# Find the city with the highest single transaction amount

highest_transaction_by_city <- transactions_by_city_amount %>%
  arrange(desc(TotalAmount)) %>%
  head(5)

cat("City with highest single transaction:\n")
print(highest_transaction_by_city)

City with highest single transaction:
  TransactionID CustomerID CustomerName CustomerCity     ProductName
1           150         64  Customer 60     New York Sony Headphones
2            61         87  Customer 28     New York     Dell Laptop
3           154         18  Customer 81      Phoenix Sony Headphones
4           187         64  Customer 79      Phoenix       iPhone 14
5           136         99  Customer 20      Phoenix      HP Printer
  ProductCategory TotalAmount Quantity TransactionDate
1     Electronics     1499.52        2      2024-03-03
2       Computers     1491.96        7      2024-08-15
3           Books     1491.62        3      2024-10-18
4       Computers     1488.95        4      2024-10-23
5     Electronics     1487.44        1      2024-05-02
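
Since the question asks for a single city, the chain could stop at the one largest row instead of the top five. Here is a sketch using `slice_max()` (available in dplyr 1.0.0 and later) on a hypothetical three-row data frame, whose amounts are the per-city maxima visible in the outputs above:

```r
library(dplyr)

# Hypothetical stand-in: one top transaction per city
toy <- data.frame(
  CustomerCity = c("Chicago", "New York", "Phoenix"),
  TotalAmount  = c(1487.43, 1499.52, 1491.62)
)

top_city <- toy %>%
  slice_max(TotalAmount, n = 1) %>%   # keep the row with the largest amount
  select(CustomerCity, TotalAmount)

top_city$CustomerCity  # "New York"
```

By default `slice_max()` keeps ties, so more than one row can come back if two transactions share the top amount.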


## Part 7: Reflection Questions

Please answer the following questions in the markdown cells below.

### Question 7.1: Pipe Operator Benefits

**How does using the pipe operator (`%>%`) improve code readability compared to nested function calls? Provide a specific example from your homework.**

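Your answer here: The pipe operator lets each step read top to bottom in the order it runs, so the code is more organized and easier to follow. Nested function calls have to be read from the inside out, which looks messy and makes it harder to debug and locate problems. For example, the Task 5.2 chain reads as four ordered steps (`filter()`, `select()`, `arrange()`, `head(20)`) instead of four calls wrapped inside one another.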
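
As one concrete illustration, a chain shaped like Task 5.2 can be written both ways. The sketch below uses a small hypothetical data frame (`toy`) rather than the real dataset and checks that both forms give the same result:

```r
library(dplyr)

# Hypothetical rows, not taken from retail_transactions.csv
toy <- data.frame(
  ProductCategory = c("Electronics", "Books", "Computers", "Electronics"),
  CustomerName    = c("A", "B", "C", "D"),
  TotalAmount     = c(120, 40, 900, 300)
)

# Nested: has to be read from the inside out
nested <- head(
  arrange(
    select(
      filter(toy, ProductCategory %in% c("Electronics", "Computers")),
      CustomerName, TotalAmount
    ),
    desc(TotalAmount)
  ),
  2
)

# Piped: read top to bottom, one step per line
piped <- toy %>%
  filter(ProductCategory %in% c("Electronics", "Computers")) %>%
  select(CustomerName, TotalAmount) %>%
  arrange(desc(TotalAmount)) %>%
  head(2)

identical(nested, piped)  # TRUE
```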


### Question 7.2: Filtering Strategy

**When filtering data for business analysis, what are the trade-offs between being very specific (many conditions) versus being more general (fewer conditions)? How might this affect your insights?**

Your answer here: Very specific filters show only a narrow slice of the data, as if through a microscope, so other important insights can be missed. More general filters show the bigger picture, but they can hide details that are crucial for business decisions.
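
A small sketch of that trade-off, assuming a hypothetical five-row data frame (`toy`): each added condition narrows the result, so the row count shows how much of the data a filter still covers.

```r
library(dplyr)

# Hypothetical rows, not taken from the dataset
toy <- data.frame(
  CustomerCity = c("New York", "New York", "Chicago", "New York", "Houston"),
  TotalAmount  = c(30, 120, 500, 800, 60),
  Quantity     = c(1, 2, 3, 1, 4)
)

# General: one condition keeps most rows
general <- toy %>% filter(TotalAmount > 50)

# Specific: three conditions keep a single row
specific <- toy %>%
  filter(TotalAmount > 50, Quantity > 1, CustomerCity == "New York")

nrow(general)   # 4
nrow(specific)  # 1
```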


### Question 7.3: Sorting Importance

**Why is data sorting important in business analytics? Provide three specific business scenarios where sorting data would be crucial for decision-making.**

Your answer here:

1. Managing inventory by sorting products by sales (bestsellers vs. slow sellers) to identify which items need restocking.
2. Comparing the sales performance of a set of products by sorting them from highest to lowest sales.
3. Customer Relationship Management (CRM) by sorting customers based on their purchase history and frequency to identify high-value customers for targeted marketing campaigns.

### Question 7.4: Real-World Application

**Describe a real business scenario where you might need to combine `select()`, `filter()`, and `arrange()` operations. What insights would you be trying to gain?**

Your answer here: A real business scenario could be analyzing customer purchase behavior during a holiday sale. By selecting relevant columns such as CustomerName, ProductName, TotalAmount, and TransactionDate, filtering for transactions that occurred during the sale period and had a TotalAmount greater than a certain threshold, and arranging the results by TotalAmount in descending order, we could identify high-value customers and popular products. This insight would help in tailoring marketing strategies and inventory management for future sales events.
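
The scenario could be sketched as follows. The sale window, the spend threshold, and the `toy` rows are all hypothetical values chosen for illustration:

```r
library(dplyr)

# Hypothetical sale window and spend threshold
sale_start <- as.Date("2024-11-29")
sale_end   <- as.Date("2024-12-02")
min_amount <- 100

# Hypothetical transactions, not taken from the dataset
toy <- data.frame(
  CustomerName    = c("A", "B", "C", "D"),
  ProductName     = c("TV", "Shoes", "Laptop", "Jacket"),
  TotalAmount     = c(900, 80, 1200, 300),
  TransactionDate = as.Date(c("2024-11-29", "2024-11-30", "2024-12-01", "2024-12-15")),
  Quantity        = c(1, 2, 1, 3)
)

holiday_top <- toy %>%
  select(CustomerName, ProductName, TotalAmount, TransactionDate) %>%  # relevant columns
  filter(TransactionDate >= sale_start,
         TransactionDate <= sale_end,
         TotalAmount > min_amount) %>%                                # sale period, high value
  arrange(desc(TotalAmount))                                          # largest purchases first

holiday_top$CustomerName  # "C" "A"
```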


## Summary and Submission

### What You've Learned

In this homework, you've practiced:
- Using `select()` for column selection with various methods
- Using `filter()` for row filtering with single and multiple conditions
- Using `arrange()` for sorting data by single and multiple columns
- Chaining operations with the pipe operator (`%>%`)
- Analyzing business data to generate insights

### Submission Checklist

Before submitting, ensure you have:
- [ ] Completed all code tasks
- [ ] Run all cells successfully
- [ ] Answered all reflection questions
- [ ] Used proper commenting in your code
- [ ] Used the pipe operator where appropriate
- [ ] Verified your results make sense

### Next Steps

In the next lesson, you'll learn about:
- `mutate()` for creating new columns
- `summarize()` for calculating summary statistics
- `group_by()` for grouped operations
- Advanced data transformation techniques