# Lesson 4: Data Transformation with dplyr - Part 2

**Topic:** Advanced Data Transformation with `mutate()`, `summarize()`, and `group_by()`

**Learning Objectives:**
- Use `mutate()` to create new variables and transform existing ones
- Apply `summarize()` to generate summary statistics
- Utilize `group_by()` to perform operations by groups
- Combine functions to calculate Key Performance Indicators (KPIs)
- Practice chaining operations with the pipe operator (`%>%`)

---

## Overview

In this lesson, we'll build upon the foundation from Lesson 3 and explore more advanced data transformation techniques. These functions are essential for:
- Creating calculated fields and derived metrics
- Generating business reports and summaries
- Performing grouped analysis by categories
- Building dashboards and KPI reports

---

## Setup and Data Preparation

Before we dive into advanced data transformation techniques, we need to prepare our working environment and understand our dataset. This section will:

**🔧 Environment Setup:**
- Load the tidyverse package collection, which includes dplyr and other essential data science tools
- Understand what capabilities each package provides for our analysis work

**📊 Dataset Creation:**
- Create a realistic customer transaction dataset that mirrors real business scenarios
- Understand the structure and meaning of each data field
- Explore the relationships between customers, products, and regions

**💡 Why This Matters:**
In real-world data analysis, you'll often work with transactional data similar to what we're creating here. Understanding how to manipulate this type of data is crucial for:
- E-commerce analytics
- Retail performance analysis
- Customer behavior studies
- Financial reporting
- Business intelligence dashboards

Let's start by setting up our environment and examining our sample data.

In [None]:
# Load necessary packages for data manipulation and analysis
library(tidyverse)    # This loads multiple packages including:
                      # - dplyr: for data manipulation (mutate, summarize, group_by, etc.)
                      # - ggplot2: for data visualization
                      # - readr: for reading data files
                      # - and several others

# Confirm successful loading with user-friendly messages
cat("Tidyverse loaded successfully!\n")
cat("This includes: dplyr, ggplot2, readr, and more\n")

In [None]:
# Create a sample customer transaction dataset for demonstration
customer_transactions <- data.frame(
  # Customer identification numbers (1-5, with repeats showing multiple purchases)
  CustomerID = c(1, 1, 2, 2, 3, 3, 1, 4, 4, 5),
  # Unique transaction identifiers (sequential from 1001-1010)
  TransactionID = 1001:1010,
  # Product categories showing business diversity
  ProductCategory = c("Electronics", "Books", "Books", "Electronics", "Home Goods",
                      "Electronics", "Books", "Home Goods", "Books", "Electronics"),
  # Transaction amounts in dollars (varying from $20 to $300)
  Amount = c(120.50, 35.00, 50.00, 250.75, 80.00, 150.00, 20.00, 100.00, 45.00, 300.00),
  # Number of items purchased per transaction
  Quantity = c(1, 2, 1, 1, 1, 1, 1, 2, 1, 1),
  # Transaction dates in March 2024 (using as.Date() to ensure proper date format)
  TransactionDate = as.Date(c("2024-03-01", "2024-03-05", "2024-03-02", "2024-03-06", "2024-03-03",
                              "2024-03-07", "2024-03-08", "2024-03-04", "2024-03-09", "2024-03-10")),
  # Geographic regions for business location analysis
  Region = c("North", "North", "South", "South", "East", "East", "North", "West", "West", "East")
)

# Display the complete dataset to understand our data structure
print("Original Customer Transactions Data:")
print(customer_transactions)

# Generate quick statistics about our dataset dimensions and scope
cat("\nDataset Overview:\n")
cat("Rows:", nrow(customer_transactions), "\n")  # Count total number of transactions
cat("Columns:", ncol(customer_transactions), "\n")  # Count number of data fields
# Calculate and display the date range of our transaction data
cat("Date range:", as.character(min(customer_transactions$TransactionDate)), "to", 
    as.character(max(customer_transactions$TransactionDate)), "\n")

## Part 1: Creating New Variables with `mutate()`

The `mutate()` function is one of the most frequently used functions in dplyr. It adds new columns to a data frame or overwrites existing ones while keeping every row. It returns a modified copy, so the original data frame stays unchanged unless you assign the result back to it.

**🎯 Core Capabilities:**
- **Create calculated fields** - Combine existing columns with mathematical operations
- **Transform existing data** - Convert units, apply formulas, or change data types
- **Add conditional variables** - Create flags, categories, or segments based on business rules
- **Build derived metrics** - Calculate KPIs, ratios, and performance indicators

**💼 Business Applications:**
In real-world scenarios, `mutate()` is essential for:
- **Financial Analysis:** Calculating profit margins, tax amounts, discounts
- **Customer Analytics:** Creating customer segments, lifetime value calculations
- **Inventory Management:** Computing stock ratios, reorder points, turnover rates
- **Performance Metrics:** Developing KPIs, conversion rates, growth percentages
- **Data Cleaning:** Standardizing formats, creating consistent categories

**🔧 Common Patterns:**
- Simple arithmetic: `mutate(total = price * quantity)`
- Conditional logic: `mutate(category = if_else(amount > 100, "High", "Low"))`
- Complex conditions: `mutate(segment = case_when(...))`
- Date manipulation: `mutate(month = month(date))` (`month()` comes from lubridate, which `library(tidyverse)` attaches as of tidyverse 2.0.0)
- Text processing: `mutate(clean_name = str_to_title(name))`

**Syntax:** `mutate(new_column = calculation)`

Let's explore these capabilities with practical examples that mirror real business scenarios.
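
Before working with the transaction data, the common patterns listed above can be tried together on a tiny table. This is a minimal sketch: the `orders` data frame and its column names are made up for illustration and are not part of the lesson's dataset.

```r
library(dplyr)
library(stringr)    # provides str_to_title()
library(lubridate)  # provides month()

# Hypothetical mini order table, used only for this illustration
orders <- data.frame(
  name     = c("alice SMITH", "bob jones", "carla diaz"),
  price    = c(40, 120, 75),
  quantity = c(2, 1, 3),
  date     = as.Date(c("2024-01-15", "2024-02-20", "2024-03-05"))
)

orders_enriched <- orders %>%
  mutate(
    total      = price * quantity,                    # simple arithmetic
    category   = if_else(total > 100, "High", "Low"), # two-way conditional
    segment    = case_when(                           # multi-way conditional
      total < 100 ~ "Small",
      total < 200 ~ "Medium",
      TRUE        ~ "Large"
    ),
    month      = month(date),                         # numeric month (1-12)
    clean_name = str_to_title(name)                   # "alice SMITH" -> "Alice Smith"
  )

print(orders_enriched)
```

Note that later arguments inside one `mutate()` call can use columns created by earlier arguments, which is why `category` and `segment` can refer to `total`.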

In [None]:
# Example 1: Calculate TotalPrice for each transaction
# Using mutate() to create a new column by multiplying existing columns
transactions_with_total_price <- customer_transactions %>%  # Start with original data
  mutate(TotalPrice = Amount * Quantity)  # Create new column: TotalPrice = Amount × Quantity

print("Transactions with TotalPrice:")
print(transactions_with_total_price)

# Show just the relevant columns to focus on the calculation
cat("\nAmount, Quantity, and calculated TotalPrice:\n")
transactions_with_total_price %>%
  select(TransactionID, Amount, Quantity, TotalPrice) %>%  # Select specific columns for clarity
  print()

In [None]:
# Example 2: Create a categorical variable based on conditions
# Using mutate() with logical comparison to create TRUE/FALSE flags
transactions_with_value_flag <- customer_transactions %>%  # Start with original data
  mutate(IsHighValue = Amount > 100)  # Create boolean column: TRUE if Amount > $100, FALSE otherwise

print("Transactions with HighValue Flag:")
transactions_with_value_flag %>%
  select(TransactionID, Amount, IsHighValue) %>%  # Show only relevant columns
  print()

# Count how many high-value transactions we have using sum() on logical values
# sum() treats TRUE as 1 and FALSE as 0, so this counts TRUE values
high_value_count <- sum(transactions_with_value_flag$IsHighValue)
cat("\nNumber of high-value transactions (>$100):", high_value_count, "\n")

In [None]:
# Example 3: Transform an existing column (currency conversion)
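# Using mutate() to create a transformed version of an existing column
# For illustration only: pretend Amount is recorded in another currency and
# apply an assumed exchange rate of 1.05 (not a real market rate)
transactions_converted_amount <- customer_transactions %>%  # Start with original data
  mutate(AmountUSD = Amount * 1.05)  # Converted amount = Amount × assumed rate of 1.05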

print("Transactions with Converted Amount:")
transactions_converted_amount %>%
  select(TransactionID, Amount, AmountUSD) %>%  # Show original and converted amounts
  print()

# Explain the conversion formula used
cat("\nConversion applied: Original Amount × 1.05 = USD Amount\n")

In [None]:
# Example 4: Multiple mutations in one step
# Demonstrating how to create multiple new columns in a single mutate() call
transactions_multiple_columns <- customer_transactions %>%  # Start with original data
  mutate(
    # Simple calculation: multiply amount by quantity
    TotalPrice = Amount * Quantity,
    # Logical condition: create boolean for high-value transactions
    IsHighValue = Amount > 100,
    # Complex conditional logic using case_when() for multiple categories
    PriceCategory = case_when(
      Amount < 50 ~ "Low",      # If amount is less than $50, categorize as "Low"
      Amount < 150 ~ "Medium",  # If amount is $50 up to (but not including) $150, categorize as "Medium"
      TRUE ~ "High"             # All other cases (≥$150) categorize as "High"
    ),
    # Date manipulation: extract month name from transaction date
    Month = format(TransactionDate, "%B")
  )

print("Transactions with Multiple New Columns:")
transactions_multiple_columns %>%
  # Select specific columns to show the results of our mutations
  select(TransactionID, Amount, TotalPrice, IsHighValue, PriceCategory, Month) %>%
  print()

In [None]:
# Example 5: Financial Metrics for Business Analysis
# Creating realistic business financial metrics using mutate()
# Adding cost data to demonstrate profit, ROI, and margin calculations
set.seed(42)  # Fix the random seed so the simulated costs are reproducible between runs
transactions_with_financial_metrics <- customer_transactions %>%  # Start with original data
  mutate(
    # Simulate cost data (typically 60-80% of revenue for this example)
    Cost = Amount * runif(n(), 0.6, 0.8),  # Random cost between 60-80% of amount
    # Calculate basic financial metrics
    Profit = Amount - Cost,                 # Profit = Revenue - Cost
    Profit_Margin = (Profit / Amount) * 100,  # Profit margin as percentage
    ROI = (Profit / Cost) * 100,           # Return on Investment as percentage
    Cost_Ratio = (Cost / Amount) * 100,    # Cost as percentage of revenue
    # Create revenue size categories for business segmentation
    Revenue_Size = case_when(
      Amount > 200 ~ "Large",              # Transactions over $200
      Amount > 100 ~ "Medium",             # Transactions over $100, up to $200
      TRUE ~ "Small"                       # Transactions of $100 or less
    )
  )

print("Financial Metrics Example:")
transactions_with_financial_metrics %>%
  select(TransactionID, Amount, Cost, Profit, Profit_Margin, ROI, Revenue_Size) %>%
  mutate(
    # Round financial metrics for cleaner display
    Cost = round(Cost, 2),
    Profit = round(Profit, 2),
    Profit_Margin = round(Profit_Margin, 1),
    ROI = round(ROI, 1)
  ) %>%
  print()

cat("\nKey Financial Metrics Explained:\n")
cat("- Profit = Revenue - Cost\n")
cat("- Profit Margin = (Profit / Revenue) × 100\n")
cat("- ROI = (Profit / Cost) × 100\n")
cat("- Cost Ratio = (Cost / Revenue) × 100\n")

## Part 2: Generating Summary Statistics with `summarize()`

The `summarize()` function is the cornerstone of data aggregation in dplyr. Unlike `mutate()` which adds columns to existing data, `summarize()` condenses your entire dataset (or groups within it) into summary statistics that answer key business questions.

**🎯 Core Purpose:**
`summarize()` transforms detailed transactional data into high-level insights by calculating aggregate metrics. This is essential for executive reporting, performance monitoring, and strategic decision-making.

**📊 Types of Summary Statistics:**
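- **Central Tendency:** mean(), median(), mode (base R has no built-in statistical mode function; `mode()` returns an object's storage type) - "What's typical?"
- **Spread:** sd(), var(), IQR(), `max(x) - min(x)` (note that range() returns two values, the minimum and maximum) - "How varied is our data?"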
- **Totals:** sum() - "What's the overall total?"
- **Counts:** n(), n_distinct() - "How many records/unique values?"
- **Extremes:** min(), max() - "What are the boundaries?"
- **Percentiles:** quantile() - "Where do specific values rank?"

**💼 Business Applications:**
- **Financial Reporting:** Total revenue, average order value, profit margins
- **Operational Metrics:** Daily transaction counts, customer acquisition rates
- **Performance Dashboards:** Monthly sales summaries, regional comparisons
- **Quality Control:** Error rates, compliance percentages, defect counts
- **Customer Analytics:** Customer lifetime value, average spend per visit

**🔍 Common Business Questions Answered:**
- "What was our total revenue this quarter?"
- "What's the average transaction amount by product category?"
- "How many unique customers did we serve?"
- "What's the typical order size?"
- "Which region has the highest sales variance?"

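**Syntax:** `summarize(metric_name = summary_function(column))`, where `summary_function` stands for any function that returns a single value, such as `mean()` or `sum()`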

The power of `summarize()` becomes even greater when combined with `group_by()`, allowing you to calculate these metrics for different segments of your business.
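
The spread, percentile, and mode entries in the list above need a little care inside `summarize()`, because each summary should return exactly one value. The sketch below uses the `Amount` and `Quantity` values from the lesson's dataset. `stat_mode()` is a hypothetical helper written for this example, not a base R or dplyr function.

```r
library(dplyr)

# Amount and Quantity values copied from the lesson's customer_transactions data
amounts <- data.frame(
  Amount   = c(120.50, 35.00, 50.00, 250.75, 80.00, 150.00, 20.00, 100.00, 45.00, 300.00),
  Quantity = c(1, 2, 1, 1, 1, 1, 1, 2, 1, 1)
)

# Hypothetical helper: most frequent value (first one wins in case of a tie)
stat_mode <- function(x) {
  unique_values <- unique(x)
  unique_values[which.max(tabulate(match(x, unique_values)))]
}

spread_summary <- amounts %>%
  summarize(
    Range_Amount  = max(Amount) - min(Amount),               # one value, unlike range()
    IQR_Amount    = IQR(Amount),                             # interquartile range
    P25_Amount    = quantile(Amount, 0.25, names = FALSE),   # 25th percentile
    P75_Amount    = quantile(Amount, 0.75, names = FALSE),   # 75th percentile
    Mode_Quantity = stat_mode(Quantity)                      # most common quantity
  )

print(spread_summary)
```

Passing a single probability to `quantile()` keeps the result to one value per column. The mode helper is only meaningful for discrete values such as `Quantity`.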

In [None]:
# Example 1: Calculate overall summary statistics
# Using summarize() to reduce the entire dataset to key aggregate metrics
overall_summary <- customer_transactions %>%  # Start with original data
  summarize(
    # Calculate the mean (average) of all transaction amounts
    AverageAmount = mean(Amount),
    # Sum all amounts to get total revenue across all transactions
    TotalRevenue = sum(Amount),
    # Sum all quantities to get total items sold
    TotalQuantity = sum(Quantity),
    # n() counts the number of rows (transactions) in the dataset
    NumberOfTransactions = n(),
    # n_distinct() counts unique values - here, unique customers
    UniqueCustomers = n_distinct(CustomerID),
    # Find the smallest transaction amount
    MinAmount = min(Amount),
    # Find the largest transaction amount
    MaxAmount = max(Amount)
  )

print("Overall Summary Statistics:")
print(overall_summary)

In [None]:
# Example 2: More detailed summary with additional metrics
# Demonstrating more advanced statistical measures and data quality checks
detailed_summary <- customer_transactions %>%  # Start with original data
  summarize(
    # CENTRAL TENDENCY MEASURES
    # round() ensures clean output with 2 decimal places
    Mean_Amount = round(mean(Amount), 2),    # Average transaction amount
    Median_Amount = median(Amount),          # Middle value when amounts are sorted
    
    # VARIABILITY MEASURES
    # Standard deviation shows how spread out the amounts are
    SD_Amount = round(sd(Amount), 2),
    
    # BUSINESS TOTALS
    Total_Revenue = sum(Amount),             # Sum of all transaction amounts
    Total_Items = sum(Quantity),             # Sum of all items sold
    
    # COUNTING METRICS
    Transaction_Count = n(),                           # Total number of transactions
    Customer_Count = n_distinct(CustomerID),          # Number of unique customers
    Product_Categories = n_distinct(ProductCategory), # Number of different product types
    
    # DATE ANALYSIS
    # Calculate the span of days covered by our transaction data
    Date_Range_Days = as.numeric(max(TransactionDate) - min(TransactionDate)) + 1
  )

print("Detailed Summary Statistics:")
print(detailed_summary)

## Part 3: Grouped Operations with `group_by()`

The `group_by()` function unlocks the true analytical power of dplyr by allowing you to perform operations on subsets of your data. This is where data analysis moves from simple overall summaries to sophisticated comparative analysis that drives business insights.

**🎯 Core Concept:**
Instead of calculating one summary for your entire dataset, `group_by()` splits your data into meaningful categories and performs calculations separately for each group. This enables you to compare performance across different segments of your business.

**🔄 How It Works:**
1. **Split:** `group_by()` divides your data into groups based on one or more columns
2. **Apply:** Functions like `summarize()` or `mutate()` operate on each group independently
3. **Combine:** Results are assembled into a new dataset showing metrics by group
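
The split-apply-combine steps behave differently depending on which function follows `group_by()`. As a rough sketch with a made-up `sales` table (not the lesson's dataset): `summarize()` returns one row per group, while `mutate()` keeps every row and attaches the group-level value to each of them.

```r
library(dplyr)

# Hypothetical sales table for illustration
sales <- data.frame(
  Region = c("North", "North", "South", "South", "East"),
  Amount = c(100, 300, 50, 150, 80)
)

# group_by() + summarize(): one row per region
region_totals <- sales %>%
  group_by(Region) %>%
  summarize(RegionTotal = sum(Amount), .groups = "drop")

# group_by() + mutate(): all five rows kept, totals computed within each region
sales_with_share <- sales %>%
  group_by(Region) %>%
  mutate(
    RegionTotal   = sum(Amount),           # repeated on every row of the region
    ShareOfRegion = Amount / RegionTotal   # each sale's share of its region
  ) %>%
  ungroup()                                # remove grouping before further steps

print(region_totals)
print(sales_with_share)
```

Calling `ungroup()` after a grouped `mutate()` is a defensive habit: it prevents later steps from silently operating per group.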

**💼 Strategic Business Applications:**
- **Market Segmentation:** Compare performance across customer segments, regions, or product lines
- **Performance Benchmarking:** Identify top and bottom performers in any category
- **Trend Analysis:** Track metrics over time periods (monthly, quarterly, yearly)
- **Resource Allocation:** Determine where to focus marketing, inventory, or staff resources
- **Competitive Analysis:** Compare different product categories, channels, or business units

**📊 Common Grouping Dimensions:**
- **Geographic:** Region, country, state, city, store location
- **Temporal:** Year, quarter, month, day of week, hour
- **Product:** Category, brand, price tier, supplier
- **Customer:** Segment, loyalty level, acquisition channel
- **Operational:** Sales channel, payment method, promotion type

**🔍 Questions Group Analysis Answers:**
- "Which product category generates the most revenue?"
- "How does customer behavior vary by region?"
- "What's our monthly sales trend?"
- "Which customer segment has the highest average order value?"
- "How do weekend sales compare to weekday sales?"

**Syntax:** `group_by(column_name) %>% summarize(...)`

The combination of `group_by()` and `summarize()` forms the backbone of business intelligence and reporting in data science.
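
The examples in this part pass `.groups = 'drop'` to `summarize()`. The sketch below, using a small made-up `purchases` table, illustrates why that matters when grouping by more than one column: with `"drop_last"` (the default in this situation) the result stays grouped by the first variable, which changes what a later `sum()` means.

```r
library(dplyr)

# Hypothetical purchases table for illustration
purchases <- data.frame(
  CustomerID      = c(1, 1, 2, 2, 3),
  ProductCategory = c("Books", "Electronics", "Books", "Books", "Home Goods"),
  Amount          = c(35, 120.5, 50, 20, 80)
)

# "drop_last" removes only the last grouping variable (ProductCategory)
kept <- purchases %>%
  group_by(CustomerID, ProductCategory) %>%
  summarize(TotalSpent = sum(Amount), .groups = "drop_last")

# "drop" removes all grouping
dropped <- purchases %>%
  group_by(CustomerID, ProductCategory) %>%
  summarize(TotalSpent = sum(Amount), .groups = "drop")

print(group_vars(kept))     # still grouped by CustomerID
print(group_vars(dropped))  # no grouping variables left

# The same mutate() gives different answers depending on the grouping:
per_customer_share <- kept %>%
  mutate(Share = TotalSpent / sum(TotalSpent))   # share within each customer
overall_share <- dropped %>%
  mutate(Share = TotalSpent / sum(TotalSpent))   # share of the grand total

print(per_customer_share)
print(overall_share)
```

Dropping the groups explicitly makes the follow-up calculation unambiguous, at the cost of having to re-group when a per-group step is actually wanted.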

In [None]:
# Example 1: Group by ProductCategory and summarize
# Using group_by() to split data into categories, then summarize each group
summary_by_category <- customer_transactions %>%  # Start with original data
  group_by(ProductCategory) %>%    # Split data into groups by product category
  summarize(
    # For each product category, calculate these metrics:
    TotalSales = sum(Amount),                        # Sum all amounts in this category
    AverageAmount = round(mean(Amount), 2),          # Average transaction amount
    AverageQuantity = round(mean(Quantity), 2),      # Average items per transaction
    NumTransactions = n(),                           # Count transactions in this category
    UniqueCustomers = n_distinct(CustomerID),       # Count unique customers per category
    .groups = 'drop' # Important: remove grouping structure after summarizing
  )

print("Summary by Product Category:")
print(summary_by_category)

# Sort by total sales to see top categories
cat("\nTop categories by total sales:\n")
summary_by_category %>%
  arrange(desc(TotalSales)) %>%  # Sort in descending order by TotalSales
  print()

In [None]:
# Example 2: Group by Region and summarize
# Analyzing business performance across different geographic regions
summary_by_region <- customer_transactions %>%    # Start with original data
  group_by(Region) %>%     # Split data into groups by geographic region
  summarize(
    # Calculate key metrics for each region
    TotalSales = sum(Amount),                              # Sum all transaction amounts per region
    AverageAmountPerTransaction = round(mean(Amount), 2),  # Average transaction value per region
    UniqueCustomers = n_distinct(CustomerID),             # Count unique customers per region
    TransactionCount = n(),                               # Count total transactions per region
    .groups = 'drop'    # Remove grouping structure after calculation
  )

print("Summary by Region:")
print(summary_by_region)

# Calculate percentage of total sales by region for market share analysis
total_sales <- sum(summary_by_region$TotalSales)    # Calculate overall total sales
summary_by_region_with_pct <- summary_by_region %>%
  mutate(PercentageOfTotal = round((TotalSales / total_sales) * 100, 1))  # Calculate each region's percentage

cat("\nRegional sales with percentage of total:\n")
print(summary_by_region_with_pct)

In [None]:
# Example 3: Group by multiple columns (CustomerID and ProductCategory)
# Analyzing customer behavior across different product categories
summary_by_customer_product <- customer_transactions %>%    # Start with original data
  group_by(CustomerID, ProductCategory) %>%    # Group by BOTH customer AND product category
  summarize(
    # Calculate metrics for each customer-category combination
    TotalSpent = sum(Amount),                              # How much each customer spent per category
    TotalItems = sum(Quantity),                           # How many items each customer bought per category
    NumTransactions = n(),                                # Number of transactions per customer-category
    AvgTransactionAmount = round(mean(Amount), 2),        # Average transaction size per customer-category
    .groups = 'drop'    # Remove the multi-level grouping structure
  )

print("Summary by Customer and Product Category:")
print(summary_by_customer_product)

# Find customers who shop in multiple categories (cross-selling analysis)
customers_multiple_categories <- summary_by_customer_product %>%
  group_by(CustomerID) %>%    # Re-group by customer only
  summarize(
    # Count how many different categories each customer shops in
    CategoriesShoppedIn = n(),
    # Calculate total spending across all categories per customer
    TotalSpentAllCategories = sum(TotalSpent),
    .groups = 'drop'
  ) %>%
  filter(CategoriesShoppedIn > 1)    # Keep only customers who shop in multiple categories

cat("\nCustomers who shop in multiple categories:\n")
print(customers_multiple_categories)

## Part 4: Counting Observations with `count()`

The `count()` function is a powerful yet simple tool that serves as a convenient shortcut for one of the most common data analysis tasks: understanding the frequency distribution of categorical variables in your dataset.

**🎯 Why Counting Matters in Business:**
Frequency analysis is fundamental to understanding your business patterns. Before diving into complex analytics, you need to know the basic distribution of your data - how many transactions by category, how many customers per region, what's the distribution of order sizes, etc.

**⚡ Efficiency Benefits:**
While you could achieve the same results using `group_by() %>% summarize(n = n())`, the `count()` function provides:
- **Cleaner syntax** - Fewer lines of code, easier to read and write
- **Built-in sorting** - Optional `sort = TRUE` parameter automatically orders results
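- **No leftover grouping** - The counted variables are not left as groups in the result, so no `.groups` argument or `ungroup()` is needed (performance is comparable, because `count()` wraps the same grouped summary)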
- **Less error-prone** - Reduces chances of syntax mistakes in complex grouping operations

**📊 Common Business Applications:**
- **Market Share Analysis** - Count transactions by product category to understand market positioning
- **Geographic Distribution** - Count customers or sales by region to guide expansion decisions
- **Inventory Planning** - Count orders by product to forecast demand
- **Customer Behavior** - Count visits, purchases, or interactions by customer segment
- **Quality Control** - Count defects, returns, or complaints by category
- **Seasonality Analysis** - Count transactions by time period to identify patterns

**🔍 What Count Analysis Reveals:**
- **Volume Patterns** - Which categories dominate your business
- **Market Concentration** - Is business evenly distributed or concentrated in few areas
- **Operational Focus** - Where to allocate resources and attention
- **Growth Opportunities** - Underperforming categories that might need investment
- **Risk Assessment** - Over-dependence on specific segments

**Note:** `count()` is equivalent to `group_by() %>% summarize(n = n())` but with cleaner syntax.

**Common Pattern:** `count(variable, sort = TRUE)` gives you an immediate ranked frequency table.
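
To make the equivalence concrete, here is a minimal sketch on a small made-up table `tx` (not the lesson's dataset). It also shows the `wt` and `name` arguments of `count()`, which sum a column per group instead of counting rows.

```r
library(dplyr)

# Hypothetical transactions for illustration
tx <- data.frame(
  ProductCategory = c("Electronics", "Books", "Books", "Electronics", "Home Goods", "Books"),
  Amount          = c(120.5, 35, 50, 250.75, 80, 20)
)

# Shortcut version
by_count <- tx %>%
  count(ProductCategory, sort = TRUE)

# Long-hand version that produces the same counts
by_group <- tx %>%
  group_by(ProductCategory) %>%
  summarize(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Weighted count: sums Amount per category instead of counting rows
amount_by_category <- tx %>%
  count(ProductCategory, wt = Amount, name = "TotalAmount", sort = TRUE)

print(by_count)
print(by_group)
print(amount_by_category)
```

The two count tables hold the same values, though the long-hand version returns a tibble while `count()` keeps the class of its input.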

In [None]:
# Example 1: Count transactions per product category
# Using count() as a convenient shortcut for frequency analysis
count_by_product <- customer_transactions %>%    # Start with original data
  count(ProductCategory, sort = TRUE)    # Count occurrences and sort by frequency (highest first)
                                         # This is equivalent to: group_by(ProductCategory) %>% summarize(n = n()) %>% arrange(desc(n))

print("Transaction Count by Product Category:")
print(count_by_product)

# Add percentage calculations to show market share by transaction volume
count_by_product_with_pct <- count_by_product %>%
  mutate(Percentage = round((n / sum(n)) * 100, 1))    # Calculate percentage of total transactions

cat("\nWith percentages:\n")
print(count_by_product_with_pct)

In [None]:
# Example 2: Count transactions per region
# Using count() as a shortcut for group_by() %>% summarize(n = n())
count_by_region <- customer_transactions %>%  # Start with original data
  count(Region, sort = TRUE)    # Count occurrences of each region, sort by count (descending)

print("Transaction Count by Region:")
print(count_by_region)

# Compare with unique customers by region to get customer distribution
unique_customers_by_region <- customer_transactions %>%  # Start with original data
  group_by(Region) %>%     # Group by geographic region
  summarize(UniqueCustomers = n_distinct(CustomerID), .groups = 'drop')  # Count unique customers per region

cat("\nUnique Customers by Region:\n")
print(unique_customers_by_region)

# Combine both metrics using join operations for comprehensive regional analysis
region_analysis <- count_by_region %>%      # Start with transaction counts
  left_join(unique_customers_by_region, by = "Region") %>%    # Add customer counts by matching Region
  mutate(TransactionsPerCustomer = round(n / UniqueCustomers, 2)) %>%   # Calculate average transactions per customer
  rename(TotalTransactions = n)    # Rename 'n' to more descriptive name

cat("\nCombined Regional Analysis:\n")
print(region_analysis)

In [None]:
# Example 3: Cross-tabulation using table() function
# Understanding relationships between two categorical variables
cat("Cross-Tabulation Analysis:\n")
cat("═══════════════════════════\n")

# Create cross-tabulation of Region vs ProductCategory using base R table()
region_product_crosstab <- table(customer_transactions$Region, customer_transactions$ProductCategory)

cat("Region vs Product Category Cross-Tabulation:\n")
print(region_product_crosstab)

# Add row and column totals for comprehensive analysis
cat("\nWith Row and Column Totals:\n")
print(addmargins(region_product_crosstab))  # print() so the table is shown even when the code is run via source()

# Alternative: Using count() for cross-tabulation (dplyr approach)
region_product_count <- customer_transactions %>%
  count(Region, ProductCategory, sort = TRUE)

cat("\nSame counts in long format using dplyr count() (empty combinations are omitted):\n")
print(region_product_count)

# Convert count results to percentage for market share analysis
region_product_pct <- region_product_count %>%
  mutate(
    Total = sum(n),
    Percentage = round((n / Total) * 100, 1)
  ) %>%
  select(-Total)

cat("\nCross-Tabulation with Percentages:\n")
print(region_product_pct)

cat("\nKey Cross-Tabulation Uses:\n")
cat("- Market segmentation analysis\n")
cat("- Geographic product performance\n") 
cat("- Customer behavior patterns\n")
cat("- Resource allocation decisions\n")

## Part 5: Combining Functions for Advanced Analysis

The real power of dplyr emerges when we combine multiple functions together to create sophisticated business analytics. In this section, we'll learn how to chain together `mutate()`, `group_by()`, and `summarize()` to build comprehensive reports that would typically require multiple separate operations.

### What You'll Learn:
- **Advanced Chaining**: How to connect multiple dplyr functions using the pipe operator (`%>%`)
- **KPI Calculations**: Creating key performance indicators by combining calculated fields with grouping
- **Multi-Level Analysis**: Building dashboards that combine different metrics for business insights
- **Customer Analytics**: Understanding customer behavior through transaction frequency and value metrics

### Business Context:
In real-world analytics, we rarely use just one function in isolation. Business stakeholders need comprehensive reports that combine multiple metrics to tell a complete story. For example, knowing total revenue by region is useful, but combining it with customer counts, transaction frequency, and average values provides much deeper insights for decision-making.

The examples below demonstrate how to build these multi-dimensional analyses step by step.

In [None]:
# Example 1: Calculate KPIs by region
# Combining mutate(), group_by(), and summarize() for comprehensive business metrics
kpis_by_region <- customer_transactions %>%    # Start with original data
  # STEP 1: Create calculated fields using mutate()
  mutate(TotalPrice = Amount * Quantity) %>%   # Calculate total value per transaction
  # STEP 2: Group data by geographic region
  group_by(Region) %>%
  # STEP 3: Calculate key performance indicators for each region
  summarize(
    TotalRevenue = sum(TotalPrice),              # Sum of all transaction values
    AvgTransactionValue = round(mean(TotalPrice), 2), # Average total value per transaction (same basis as TotalRevenue)
    TotalTransactions = n(),                     # Count of transactions
    UniqueCustomers = n_distinct(CustomerID),   # Count of unique customers
    # Calculate average transactions per customer (frequency metric)
    AvgTransactionsPerCustomer = round(n() / n_distinct(CustomerID), 2),
    .groups = 'drop'  # Remove grouping structure
  ) %>%
  # STEP 4: Add percentage calculations using mutate() on the summarized data
  mutate(
    # Calculate each region's percentage of total revenue
    RevenuePercentage = round((TotalRevenue / sum(TotalRevenue)) * 100, 1)
  ) %>%
  # STEP 5: Sort by revenue to show top-performing regions first
  arrange(desc(TotalRevenue))

print("Key Performance Indicators (KPIs) by Region:")
print(kpis_by_region)

In [None]:
# Example 2: Customer segmentation analysis
# Creating customer profiles for targeted marketing and business strategy
customer_segmentation <- customer_transactions %>%    # Start with original data
  group_by(CustomerID) %>%    # Group by individual customers
  summarize(
    # Calculate key customer metrics
    TotalSpent = sum(Amount),                              # Total money spent by each customer
    TransactionCount = n(),                               # Number of transactions per customer
    AvgTransactionAmount = round(mean(Amount), 2),        # Average spend per transaction
    CategoriesShoppedIn = n_distinct(ProductCategory),    # Number of different categories purchased
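    # Days between each customer's first and last purchase (0 for one-time buyers);
    # despite the column name, this is not the time elapsed since the first purchase up to today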
    DaysSinceFirstPurchase = as.numeric(max(TransactionDate) - min(TransactionDate)),
    .groups = 'drop'
  ) %>%
  # Create customer segments using business rules
  mutate(
    # Segment by spending level using case_when() for multiple conditions
    CustomerSegment = case_when(
      TotalSpent >= 300 ~ "High Value",     # Big spenders
      TotalSpent >= 150 ~ "Medium Value",   # Moderate spenders
      TRUE ~ "Low Value"                    # Light spenders (all other cases)
    ),
    # Segment by shopping diversity
    ShoppingBehavior = case_when(
      CategoriesShoppedIn >= 3 ~ "Diverse Shopper",    # Shops across many categories
      CategoriesShoppedIn == 2 ~ "Moderate Shopper",   # Shops in 2 categories
      TRUE ~ "Focused Shopper"                          # Shops in only 1 category
    )
  )

print("Customer Segmentation Analysis:")
print(customer_segmentation)

# Create a summary of segment distributions for strategic planning
segment_summary <- customer_segmentation %>%
  count(CustomerSegment, ShoppingBehavior) %>%    # Count customers in each segment combination
  arrange(CustomerSegment, ShoppingBehavior)      # Sort for better readability

cat("\nCustomer Segment Distribution:\n")
print(segment_summary)

In [None]:
# Example 3: Time-based analysis (by date)
# Analyzing business performance patterns over time and by day of week
daily_analysis <- customer_transactions %>%    # Start with original data
  mutate(
    # Extract day of week from transaction date for pattern analysis
    # (weekdays() returns day names in the current session's locale)
    DayOfWeek = weekdays(TransactionDate),
    # Calculate total transaction value (amount × quantity)
    TotalValue = Amount * Quantity
  ) %>%
  # Group by both date and day of week for comprehensive time analysis
  group_by(TransactionDate, DayOfWeek) %>%
  summarize(
    # Calculate daily business metrics
    DailyRevenue = sum(TotalValue),                  # Total revenue per day
    DailyTransactions = n(),                         # Number of transactions per day
    UniqueCustomers = n_distinct(CustomerID),       # Number of unique customers per day
    AvgTransactionSize = round(mean(Amount), 2),     # Average transaction amount per day
    .groups = 'drop'
  ) %>%
  arrange(TransactionDate)    # Sort chronologically

print("Daily Business Analysis:")
print(daily_analysis)

# Create weekly performance patterns by aggregating daily data
weekly_summary <- daily_analysis %>%
  group_by(DayOfWeek) %>%    # Group by day of week only
  summarize(
    # Calculate average performance metrics across all instances of each day
    AvgDailyRevenue = round(mean(DailyRevenue), 2),
    AvgDailyTransactions = round(mean(DailyTransactions), 2),
    .groups = 'drop'
  )

cat("\nAverage Performance by Day of Week:\n")
print(weekly_summary)
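
Because `weekdays()` returns plain text, the day-of-week summary above sorts alphabetically (Friday, Monday, ...) and its labels depend on the system locale. One possible workaround, sketched here on made-up data (`daily_demo`, `day_levels`, and `DayNumber` are illustrative names), is to derive the ISO weekday number and turn the labels into a factor with an explicit order.

```r
library(dplyr)

# Hypothetical daily totals covering one full week (2024-01-01 was a Monday)
daily_demo <- tibble(
  TransactionDate = as.Date("2024-01-01") + 0:6,
  DailyRevenue = c(100, 120, 90, 110, 150, 200, 180)
)

day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

ordered_summary <- daily_demo %>%
  mutate(
    # "%u" gives the ISO weekday number (1 = Monday), independent of locale
    DayNumber = as.integer(format(TransactionDate, "%u")),
    # A factor with explicit levels makes group_by() sort in calendar order
    DayOfWeek = factor(day_levels[DayNumber], levels = day_levels)
  ) %>%
  group_by(DayOfWeek) %>%
  summarize(AvgDailyRevenue = mean(DailyRevenue), .groups = "drop")

print(ordered_summary)
```

The same idea should carry over to `daily_analysis` if you want the weekly table to read Monday through Sunday.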

## Part 6: Business Intelligence Dashboard Summary

In this final section, we'll build a business intelligence summary that combines the dplyr functions from this lesson into a single report formatted for an executive audience.

### What You'll Learn:
- **Professional Reporting**: How to format output for executive consumption using `cat()` and structured headers
- **Multi-Section Analysis**: Breaking down complex business questions into digestible sections
- **Comprehensive Metrics**: Combining overall, categorical, and regional analysis in one cohesive report
- **Executive Summary Skills**: Presenting technical analysis in business-friendly language

### Business Context:
Real-world business intelligence dashboards are the culmination of data analysis work. They need to be:
- **Clear and Professional**: Easy for non-technical stakeholders to understand
- **Comprehensive**: Covering multiple business dimensions (financial, operational, geographic)
- **Actionable**: Providing insights that can drive business decisions
- **Well-Structured**: Organized in logical sections that tell a story

### Dashboard Structure:
Our dashboard will include three key sections:
1. **Overall Business Metrics**: High-level KPIs that provide the big picture
2. **Category Performance**: Product-level insights for inventory and marketing decisions
3. **Regional Analysis**: Geographic performance for expansion and resource allocation

This approach mirrors what you'd see in professional business intelligence tools like Tableau, Power BI, or executive reporting systems.

In [None]:
# Create a comprehensive business dashboard
# This demonstrates how to combine all dplyr functions to create executive reporting

# Create a professional header; strrep() builds the rule without the spaces cat() inserts between arguments
cat("\n", strrep("=", 51), "\n", sep = "")
cat("       BUSINESS INTELLIGENCE DASHBOARD\n")
cat(strrep("=", 51), "\n\n", sep = "")

# SECTION 1: Calculate overall business metrics using summarize()
cat("📊 OVERALL BUSINESS METRICS\n")
cat("─────────────────────────────\n")
overall_metrics <- customer_transactions %>%    # Start with original data
  summarize(
    # Key financial metrics
    total_revenue = sum(Amount),                 # Sum all transaction amounts
    total_transactions = n(),                    # Count total transactions
    unique_customers = n_distinct(CustomerID),   # Count unique customers
    avg_transaction = round(mean(Amount), 2),    # Calculate average transaction value
    # Business period covered by the data
    date_range = paste(min(TransactionDate), "to", max(TransactionDate))
  )

# Display metrics using cat() for clean business report formatting
cat("💰 Total Revenue: $", overall_metrics$total_revenue, "\n", sep = "")   # sep = "" avoids "$ 123"
cat("🛒 Total Transactions:", overall_metrics$total_transactions, "\n")
cat("👥 Unique Customers:", overall_metrics$unique_customers, "\n")
cat("📈 Average Transaction: $", overall_metrics$avg_transaction, "\n", sep = "")
cat("📅 Date Range:", overall_metrics$date_range, "\n\n")

# SECTION 2: Analyze top performing categories using group_by() and summarize()
cat("🏆 TOP PERFORMING CATEGORIES\n")
cat("──────────────────────────────\n")
top_categories <- customer_transactions %>%     # Start with original data
  group_by(ProductCategory) %>%    # Group by product type
  summarize(
    revenue = sum(Amount),         # Calculate total revenue per category
    transactions = n(),            # Count transactions per category
    .groups = 'drop'               # Remove grouping structure
  ) %>%
  arrange(desc(revenue))           # Sort by revenue (highest first)

# Use a loop to display results in a formatted business report style
# seq_len() is safe when the table has zero rows (1:0 would loop over c(1, 0))
for (i in seq_len(nrow(top_categories))) {
  cat(i, ". ", top_categories$ProductCategory[i], ": $", top_categories$revenue[i],
      " (", top_categories$transactions[i], " transactions)\n", sep = "")
}

# SECTION 3: Analyze regional performance using group_by() and summarize()
cat("\n🌍 REGIONAL PERFORMANCE\n")
cat("─────────────────────\n")
regional_perf <- customer_transactions %>%     # Start with original data
  group_by(Region) %>%      # Group by geographic region
  summarize(
    # Calculate key regional metrics
    revenue = sum(Amount),                       # Total revenue per region
    customers = n_distinct(CustomerID),         # Unique customers per region
    .groups = 'drop'
  ) %>%
  arrange(desc(revenue))    # Sort regions by revenue (highest first)

# Use a loop to display regional performance in business report format
# seq_len() is safe when the table has zero rows (1:0 would loop over c(1, 0))
for (i in seq_len(nrow(regional_perf))) {
  cat("📍 ", regional_perf$Region[i], ": $", regional_perf$revenue[i],
      " (", regional_perf$customers[i], " customers)\n", sep = "")
}

# Close the dashboard with a professional footer
cat("\n", strrep("=", 51), "\n", sep = "")
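
Dashboard figures are usually easier to read with thousands separators and a fixed number of decimals. The following is a minimal sketch using base R only; `fmt_currency()` is a hypothetical helper written for this example, and the revenue figure is made up.

```r
# Hypothetical helper: format a number as US-style currency text
fmt_currency <- function(x) {
  paste0("$", formatC(x, format = "f", digits = 2, big.mark = ","))
}

# strrep() repeats a string without the spaces cat() puts between arguments
rule <- strrep("=", 51)

# Made-up revenue figure, used only to show the formatting
report_line <- paste0("Total Revenue: ", fmt_currency(1234567.891))

cat(rule, "\n", report_line, "\n", rule, "\n", sep = "")
```

A helper like this could be applied to `overall_metrics$total_revenue` or to the per-category revenue values before they are passed to `cat()`.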

## Data Validation and Quality Assurance

**Why Validation Matters in Business Analytics:**
Before presenting results to stakeholders or making business decisions, data analysts must verify that their calculations are correct and their data is reliable. Poor data quality can lead to incorrect insights and costly business mistakes.

**🔍 Types of Validation:**
- **Calculation Verification**: Ensure derived metrics are computed correctly
- **Range Validation**: Check that values fall within expected ranges
- **Consistency Checks**: Verify that related metrics align logically
- **Outlier Detection**: Identify unusual values that might indicate errors
- **Missing Data Analysis**: Check for gaps that could affect analysis
- **Business Logic Validation**: Ensure results make business sense

**⚠️ Common Data Quality Issues:**
- Mathematical errors in calculated fields
- Extreme outliers that skew results
- Missing values in critical business metrics
- Inconsistent data formats or units
- Impossible values (negative quantities, future dates for historical data)

**🛠️ Validation Techniques:**
- Use `isTRUE(all.equal(x, y))` to verify calculations within floating-point tolerance
- Apply `summary()` to spot outliers and impossible values
- Use logical tests to check business rules
- Compare totals and subtotals for consistency
- Check data types and formats
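
As a small illustration of the first technique, the sketch below shows why `all.equal()` is preferred over `==` for calculated numbers, and why its result is best wrapped in `isTRUE()`. The variable names are only for this example.

```r
x <- 0.1 + 0.2

# Exact comparison fails because of binary floating-point rounding
exact_match <- x == 0.3

# all.equal() compares within a small tolerance
tolerant_match <- isTRUE(all.equal(x, 0.3))

# When values differ, all.equal() returns a character description, not FALSE
mismatch <- all.equal(1, 1.5)

# isTRUE() turns that description into a plain FALSE that is safe in if()/ifelse()
safe_mismatch <- isTRUE(mismatch)

print(c(exact = exact_match, tolerant = tolerant_match, safe = safe_mismatch))
print(mismatch)
```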

Let's demonstrate these validation techniques with practical examples.

In [None]:
# Data Validation Examples
cat("🔍 DATA VALIDATION DEMONSTRATION\n")
cat("══════════════════════════════════\n\n")

# First, let's create some data with calculated fields to validate
set.seed(42)  # Fix the random seed so the simulated costs are reproducible across runs
validation_data <- customer_transactions %>%
  mutate(
    # Add simulated cost data for profit calculations
    Cost = Amount * runif(n(), 0.6, 0.8),  # Cost is 60-80% of amount
    Profit = Amount - Cost,                 # Calculate profit
    Profit_Margin = (Profit / Amount) * 100, # Calculate profit margin percentage
    ROI = (Profit / Cost) * 100,           # Return on investment
    # Create some categorical variables
    Value_Category = case_when(
      Amount > 200 ~ "High",
      Amount > 100 ~ "Medium", 
      TRUE ~ "Low"
    )
  )

# VALIDATION 1: Verify calculated fields are correct
cat("1. CALCULATION VERIFICATION\n")
cat("──────────────────────────\n")

# Check if Profit calculation is correct (Profit = Amount - Cost)
# all.equal() returns TRUE or a character description of the difference, so wrap it in isTRUE()
profit_check <- isTRUE(all.equal(validation_data$Profit, validation_data$Amount - validation_data$Cost))
cat("Profit calculation check:", ifelse(profit_check, "✅ PASSED", "❌ FAILED"), "\n")

# Check if Profit_Margin calculation is correct
expected_margin <- (validation_data$Profit / validation_data$Amount) * 100
margin_check <- isTRUE(all.equal(validation_data$Profit_Margin, expected_margin))
cat("Profit margin calculation check:", ifelse(margin_check, "✅ PASSED", "❌ FAILED"), "\n")

# VALIDATION 2: Range and outlier detection
cat("\n2. RANGE AND OUTLIER VALIDATION\n")
cat("────────────────────────────────\n")

# Flag suspicious values (na.rm = TRUE keeps a missing value from turning the count into NA)
negative_profits <- sum(validation_data$Profit < 0, na.rm = TRUE)
extreme_margins <- sum(validation_data$Profit_Margin > 100 | validation_data$Profit_Margin < -50, na.rm = TRUE)

cat("Transactions with negative profit:", negative_profits, "\n")
cat("Transactions with extreme profit margins (>100% or <-50%):", extreme_margins, "\n")

# Display summary statistics to spot outliers
cat("\nProfit Margin Summary (looking for outliers):\n")
print(summary(validation_data$Profit_Margin))

# VALIDATION 3: Business logic validation
cat("\n3. BUSINESS LOGIC VALIDATION\n")
cat("────────────────────────────\n")

# Check that all amounts are positive (business rule)
non_positive_amounts <- sum(validation_data$Amount <= 0, na.rm = TRUE)
cat("Transactions with non-positive amounts:", non_positive_amounts, "\n")

# Check that quantities are positive integers
invalid_quantities <- sum(validation_data$Quantity <= 0 | validation_data$Quantity != round(validation_data$Quantity))
cat("Transactions with invalid quantities:", invalid_quantities, "\n")

# VALIDATION 4: Consistency checks
cat("\n4. CONSISTENCY VALIDATION\n")
cat("─────────────────────────\n")

# Verify that category assignments are consistent
high_value_count <- sum(validation_data$Value_Category == "High")
high_value_amount_count <- sum(validation_data$Amount > 200)
cat("High value category count:", high_value_count, "\n")
cat("Amounts > $200 count:", high_value_amount_count, "\n")
cat("Category assignment consistency:", ifelse(high_value_count == high_value_amount_count, "✅ CONSISTENT", "❌ INCONSISTENT"), "\n")

# VALIDATION 5: Missing data check
cat("\n5. MISSING DATA VALIDATION\n")
cat("──────────────────────────\n")

# Check for missing values in critical columns
missing_amounts <- sum(is.na(validation_data$Amount))
missing_costs <- sum(is.na(validation_data$Cost))
missing_profits <- sum(is.na(validation_data$Profit))

cat("Missing amounts:", missing_amounts, "\n")
cat("Missing costs:", missing_costs, "\n") 
cat("Missing profits:", missing_profits, "\n")

cat("\n📋 VALIDATION SUMMARY\n")
cat("──────────────────────\n")
cat("Data validation is complete. Review any issues above before proceeding with analysis.\n")
cat("Always validate your data before presenting results to stakeholders!\n")
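
The closing summary above prints the same message whether or not any check failed. One way to make it reflect the actual results, sketched here under the assumption of a small made-up table (`demo`) and illustrative check names, is to collect each check as a named logical value and report from that vector.

```r
library(dplyr)

# Hypothetical transactions with a derived Profit column
demo <- tibble(
  Amount = c(120, 80, 250),
  Cost = c(84, 60, 170),
  Quantity = c(1, 2, 1)
) %>%
  mutate(Profit = Amount - Cost)

# Each element is TRUE when the rule holds for every row
checks <- c(
  profit_matches_formula = isTRUE(all.equal(demo$Profit, demo$Amount - demo$Cost)),
  amounts_positive = all(demo$Amount > 0),
  quantities_whole = all(demo$Quantity > 0 & demo$Quantity == round(demo$Quantity)),
  no_missing_values = !any(is.na(demo))
)

# Report each check, then an overall count
for (name in names(checks)) {
  cat(if (checks[[name]]) "PASSED" else "FAILED", "-", name, "\n")
}
cat(sum(checks), "of", length(checks), "checks passed\n")
```

If a report should never be produced from data that fails a rule, `stopifnot(all(checks))` could follow the loop to halt the script.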

## Key Takeaways and Best Practices

### 🎯 **Key Functions Learned:**

1. **`mutate()`** - Create new variables and transform existing ones
2. **`summarize()`** - Generate aggregate statistics and summaries
3. **`group_by()`** - Perform operations on grouped data
4. **`count()`** - Quick frequency counting

### 💡 **Best Practices:**

- Set `.groups = 'drop'` inside `summarize()` when the data is grouped, so the result does not carry unexpected grouping into later steps
- Use `round()` for cleaner numeric outputs
- Combine functions with the pipe operator (`%>%`) for readable code
- Use meaningful variable names in your mutations and summaries
- Consider using `case_when()` for complex conditional logic
- **Always validate your calculations** before presenting results
- **Check for outliers and impossible values** in your derived metrics
- **Verify business logic** - ensure categorical assignments match their criteria
- **Document your validation process** for stakeholder confidence
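
To see why the `.groups` advice matters, here is a minimal sketch on made-up data (`sales` and the result names are assumptions for illustration). With two grouping variables, `summarize()` by default drops only the last one, so a later `mutate()` still works within the remaining group.

```r
library(dplyr)

# Hypothetical sales records
sales <- tibble(
  Region = c("North", "North", "South", "South"),
  ProductCategory = c("Books", "Toys", "Books", "Books"),
  Amount = c(10, 20, 30, 40)
)

# Default behavior: the result stays grouped by Region
still_grouped <- sales %>%
  group_by(Region, ProductCategory) %>%
  summarize(revenue = sum(Amount))

# Explicit drop: the result is an ordinary ungrouped tibble
fully_dropped <- sales %>%
  group_by(Region, ProductCategory) %>%
  summarize(revenue = sum(Amount), .groups = "drop")

# The same mutate() gives different answers depending on leftover grouping
share_within_region <- still_grouped %>% mutate(share = revenue / sum(revenue))
share_overall <- fully_dropped %>% mutate(share = revenue / sum(revenue))

print(group_vars(still_grouped))
print(share_within_region)
print(share_overall)
```

In the grouped result, shares are computed within each region; in the ungrouped result, they are computed across all rows, which is usually what a company-wide KPI needs.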

### 🔒 **Data Validation Checklist:**

- ✅ **Mathematical Accuracy**: Use `isTRUE(all.equal(x, y))` to verify calculations
- ✅ **Range Validation**: Check that values fall within business-logical ranges
- ✅ **Missing Data**: Identify and address gaps in critical columns
- ✅ **Outlier Detection**: Use `summary()` to spot extreme or impossible values
- ✅ **Business Rules**: Ensure categorical variables match their criteria
- ✅ **Consistency**: Verify that related metrics align (e.g., percentages sum to 100%)
- ✅ **Stakeholder Review**: Present validation results alongside your analysis
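
The consistency item in the checklist can be sketched as follows, assuming a made-up category table (`category_share` and its values are illustrative). The example also shows why the check is better run before rounding for display.

```r
library(dplyr)

# Hypothetical revenue by category, chosen so each share is one third
category_share <- tibble(
  ProductCategory = c("Electronics", "Clothing", "Books"),
  revenue = c(100, 100, 100)
) %>%
  mutate(
    share_pct = revenue / sum(revenue) * 100,   # unrounded share
    share_pct_display = round(share_pct, 1)     # rounded for reporting
  )

# Unrounded shares sum to 100 within floating-point tolerance
shares_sum_to_100 <- isTRUE(all.equal(sum(category_share$share_pct), 100))

# Rounded shares can drift (33.3 + 33.3 + 33.3 = 99.9)
rounded_sum_ok <- isTRUE(all.equal(sum(category_share$share_pct_display), 100))

print(category_share)
print(c(unrounded = shares_sum_to_100, rounded = rounded_sum_ok))
```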

### 🔍 **Common Use Cases:**

- **Financial Analysis**: Revenue calculations, profit margins, cost analysis
- **Customer Analytics**: Segmentation, lifetime value, behavior analysis
- **Product Performance**: Sales by category, inventory analysis
- **Regional Analysis**: Geographic performance comparisons
- **Time Series**: Daily, weekly, monthly trend analysis

### 🚀 **Next Steps:**

In future lessons, you'll learn about:
- Joining data from multiple sources
- Advanced data visualization with ggplot2
- Working with dates and times
- Advanced statistical analysis techniques