# Lesson 8: Advanced Data Wrangling Techniques & Best Practices

**Topic:** Complex Workflows, Chaining, and Professional Data Analysis

**Time:** 60 minutes

---

## Background: The Capstone of Data Wrangling

### Why This Lesson Is Different

This is your **capstone lesson** - where everything comes together. Unlike previous lessons that focused on specific skills, this lesson teaches you to:
- **Integrate multiple techniques** into sophisticated workflows
- **Think like a professional analyst** about data quality and reproducibility
- **Build production-ready code** that others can use and maintain
- **Communicate insights** effectively to business stakeholders

### Real-World Professional Context

**What Professional Data Analysis Actually Looks Like:**

In real business environments, you'll rarely use just one technique. Instead, you'll:
1. **Import** data from multiple sources (CSV, Excel, databases)
2. **Validate** data quality (missing values, outliers, business rules)
3. **Clean** and standardize (text, dates, categories)
4. **Transform** with complex logic (segmentation, scoring, classification)
5. **Aggregate** and summarize (group by multiple dimensions)
6. **Analyze** patterns and trends
7. **Communicate** findings to stakeholders
8. **Document** for reproducibility

### Business Impact of Advanced Skills

**Revenue Growth:**
- Identify high-value customer segments for targeted marketing
- Optimize product mix based on profitability analysis
- Predict churn and implement retention strategies
- Discover cross-sell and upsell opportunities

**Cost Reduction:**
- Automate manual data processing (saving hours/week)
- Improve operational efficiency through data-driven insights
- Reduce errors from manual data handling
- Optimize inventory to reduce carrying costs

**Risk Mitigation:**
- Detect data quality issues before they impact decisions
- Validate business rules automatically
- Create audit trails for compliance
- Identify anomalies and outliers

**Strategic Planning:**
- Provide evidence-based recommendations
- Identify market opportunities and threats
- Support long-term growth strategies
- Enable data-driven decision making

### Professional Skills You'll Master

**Technical Skills:**
- Complex pipeline construction (chaining 5+ operations)
- Advanced conditional logic with `case_when()`
- Data validation and quality checks
- Creating reusable functions
- Reproducible analysis workflows

**Business Skills:**
- Customer segmentation (RFM, value scoring)
- KPI calculation and tracking
- Executive reporting and visualization
- Translating technical findings to business language
- Making actionable recommendations

**Professional Practices:**
- Code organization and documentation
- Version control readiness
- Error handling and edge cases
- Performance optimization
- Collaboration and maintainability

### What Success Looks Like

By the end of this lesson, you'll be able to:
- Build complex analysis pipelines that combine filtering, transformation, grouping, and summarization
- Implement sophisticated business logic using `case_when()`
- Validate data quality automatically
- Create reusable analysis functions
- Generate executive-ready reports
- Write code that other analysts can understand and maintain

---

## Learning Objectives

By the end of this lesson, you will be able to:
1. Chain multiple dplyr operations for complex workflows
2. Use `case_when()` for sophisticated conditional logic
3. Implement data validation and quality checks
4. Create reproducible analysis workflows
5. Apply best practices for professional data analysis
6. Build reusable functions for common analyses
7. Generate business-ready reports and summaries

---

## Part 1: Setup and Complex Sample Data

### Understanding Complex Business Data

Real business analysis requires working with datasets that have:
- **Multiple dimensions** (product, region, customer type, time)
- **Various data types** (numeric, categorical, dates)
- **Relationships** between different attributes
- **Enough volume** to reveal patterns

### The Dataset We'll Analyze

We'll create a realistic sales dataset with 50 transactions that includes:
- **Products:** Laptop, Mouse, Keyboard, Monitor, Webcam, Headphones
- **Categories:** Electronics, Peripherals
- **Regions:** North, South, East, West
- **Customer Types:** New, Returning, VIP
- **Time Period:** Q1 2024 (January - March)
- **Metrics:** Sales amount, quantity, dates

### Business Questions We'll Answer

1. Which regions and categories drive the most revenue?
2. How do we segment customers by value?
3. What are the transaction patterns by day of week?
4. Which customers need follow-up?
5. What are the key performance indicators (KPIs)?

### Why Use set.seed()?

`set.seed(123)` ensures reproducibility:
- Same random data every time you run the code
- Others can verify your results
- Essential for professional analysis
- Required for debugging and testing
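
As a small illustration of the point above, the sketch below re-seeds the generator before two draws and compares them. The variable names are hypothetical and exist only for this example.

```r
# Re-seeding before each draw reproduces the same "random" values
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)
second_draw <- sample(1:100, 5)

# Same seed, same sequence
identical(first_draw, second_draw)

# Without re-seeding, the generator simply continues from its current state
third_draw <- sample(1:100, 5)
third_draw
```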

In [None]:
# Load necessary packages
library(tidyverse)
library(lubridate)

# Set seed for reproducibility
set.seed(123)

cat("Packages loaded successfully!\n")
cat("Ready for advanced data wrangling!\n")
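
The cells that follow expect a `sales_data` data frame with the columns `OrderID`, `Product`, `Category`, `Region`, `CustomerType`, `OrderDate`, `Sales`, and `Quantity`. A minimal sketch of how such a dataset could be generated is shown here. The base prices, the product-to-category mapping, and the sampling scheme are illustrative assumptions, not values taken from the course.

```r
library(dplyr)

set.seed(123)
n_orders <- 50

# Hypothetical catalogue: category mapping and base prices are assumptions
product_catalog <- tibble(
  Product = c("Laptop", "Monitor", "Webcam", "Headphones", "Mouse", "Keyboard"),
  Category = c("Electronics", "Electronics", "Electronics", "Electronics",
               "Peripherals", "Peripherals"),
  Base_Price = c(1200, 350, 80, 150, 25, 60)
)

# All days in Q1 2024
q1_days <- seq(as.Date("2024-01-01"), as.Date("2024-03-31"), by = "day")

sales_data <- tibble(
  OrderID = 1:n_orders,
  Product = sample(product_catalog$Product, n_orders, replace = TRUE),
  Region = sample(c("North", "South", "East", "West"), n_orders, replace = TRUE),
  CustomerType = sample(c("New", "Returning", "VIP"), n_orders, replace = TRUE),
  OrderDate = sample(q1_days, n_orders, replace = TRUE),
  Quantity = sample(1:5, n_orders, replace = TRUE)
) %>%
  # Attach category and base price, then vary the price by +/- 10%
  left_join(product_catalog, by = "Product") %>%
  mutate(Sales = round(Base_Price * runif(n(), 0.9, 1.1), 2)) %>%
  select(OrderID, Product, Category, Region, CustomerType,
         OrderDate, Sales, Quantity)

glimpse(sales_data)
```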

## Part 2: Complex Chained Operations

### The Power of the Pipe (%>%)

**Why Chaining Matters:**
- **Readability:** Code reads like a story (do this, then this, then this)
- **Efficiency:** No intermediate variables cluttering your environment
- **Maintainability:** Easy to add/remove steps
- **Professional:** Industry standard for R data analysis

### Anatomy of a Complex Pipeline

A professional analysis pipeline typically follows this structure:

```r
result <- data %>%
  # 1. FILTER: Remove invalid/unwanted data
  filter(condition) %>%
  
  # 2. MUTATE: Create new calculated fields
  mutate(new_column = calculation) %>%
  
  # 3. GROUP: Define analysis dimensions
  group_by(dimension1, dimension2) %>%
  
  # 4. SUMMARIZE: Calculate metrics
  summarize(metric = aggregation) %>%
  
  # 5. MUTATE: Add derived metrics
  mutate(share = metric / sum(metric)) %>%
  
  # 6. ARRANGE: Sort for presentation
  arrange(desc(metric))
```

### Best Practices for Chaining

**Do:**
- Put each operation on its own line
- Add comments explaining business logic
- Use meaningful variable names
- Test incrementally (run partial pipeline)
- Keep pipelines focused (< 10 steps ideal)

**Don't:**
- Chain unrelated operations
- Create overly complex one-liners
- Forget to handle edge cases
- Skip validation steps

### Business Context: Regional Performance Analysis

**Business Question:** Which region-category combinations drive the most revenue?

**Why This Matters:**
- Allocate marketing budget effectively
- Optimize inventory by region
- Identify growth opportunities
- Set regional sales targets
- Plan expansion strategies

# BUSINESS USE CASE: Regional performance analysis for strategic planning
#
# Problem: Need to understand which region-category combinations perform best
# Impact: Can't allocate resources effectively or identify growth opportunities
# Solution: Build a complex pipeline that filters, calculates, groups, and analyzes

regional_analysis <- sales_data %>%
  # Step 1: FILTER - Remove low-value transactions
  # Why: Focus on significant sales, reduce noise
  # Business rule: Only analyze sales > $100
  filter(Sales > 100) %>%
  
  # Step 2: MUTATE - Calculate revenue and add time dimensions
  # Revenue = Sales * Quantity (total transaction value)
  # Time dimensions enable temporal analysis
  mutate(
    Revenue = Sales * Quantity,
    Month = month(OrderDate, label = TRUE),  # "Jan", "Feb", "Mar"
    Quarter = quarter(OrderDate)              # 1 for Q1
  ) %>%
  
  # Step 3: GROUP - Define analysis dimensions
  # Why: We want metrics BY region AND category
  # This creates groups for each unique combination
  group_by(Region, Category) %>%
  
  # Step 4: SUMMARIZE - Calculate key metrics for each group
  # These are the KPIs that matter to the business
  summarize(
    Total_Revenue = sum(Revenue),        # Total $ generated
    Avg_Sale = mean(Sales),              # Average transaction size
    Order_Count = n(),                   # Number of transactions
    Total_Units = sum(Quantity),         # Total items sold
    .groups = 'drop'                     # Remove grouping after summarize
  ) %>%
  
  # Step 5: MUTATE - Calculate revenue share (% of total)
  # Why: Executives want to see percentages, not just absolute numbers
  # sum(Total_Revenue) calculates across all rows
  mutate(
    Revenue_Share = (Total_Revenue / sum(Total_Revenue)) * 100
  ) %>%
  
  # Step 6: ARRANGE - Sort by revenue (highest first)
  # Why: Put most important results at the top
  # desc() means descending order
  arrange(desc(Total_Revenue))

cat("📊 REGIONAL PERFORMANCE ANALYSIS\n")
cat("(Complex Pipeline: Filter → Mutate → Group → Summarize → Mutate → Arrange)\n\n")

print(regional_analysis)

cat("\n💡 Business Insights:\n")
cat("  • Top 3 region-category combinations drive", 
    round(sum(head(regional_analysis$Revenue_Share, 3)), 1), "% of revenue\n")
cat("  • Use this to allocate marketing budget\n")
cat("  • Identify underperforming combinations for improvement\n")
cat("  • Set realistic targets based on historical performance\n")

cat("\n🎯 Action Items:\n")
cat("  1. Increase inventory in top-performing region-categories\n")
cat("  2. Investigate why some combinations underperform\n")
cat("  3. Replicate success factors from top performers\n")

## Part 3: Advanced Conditional Logic with case_when()

### Why case_when() Is a Game-Changer

**The Problem with if-else:**
```r
# Verbose, and if/else tests a single value, so it cannot classify a whole column:
if (x < 100) {
  "Low"
} else if (x < 500) {
  "Medium"
} else {
  "High"
}
```

**The case_when() Solution:**
```r
# Clean, readable, vectorized:
case_when(
  x < 100 ~ "Low",
  x < 500 ~ "Medium",
  TRUE ~ "High"
)
```

### How case_when() Works

**Syntax:**
```r
case_when(
  condition1 ~ result1,
  condition2 ~ result2,
  condition3 ~ result3,
  TRUE ~ default_result  # Always include a default!
)
```

**Key Points:**
- Evaluates conditions **in order** (first match wins)
- Use `~` (tilde) to separate condition from result
- `TRUE ~` creates a default case (like "else")
- Works with vectors (entire columns at once)
- Can combine multiple conditions with `&` (AND) and `|` (OR)
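
The key points above can be seen on a short vector. This is a minimal sketch with made-up values; note in particular that a missing value fails every comparison and would otherwise fall through to the `TRUE ~` default, so it is handled first here.

```r
library(dplyr)

sales <- c(50, 250, 900, NA)

# Conditions are checked top to bottom; the first one that is TRUE wins.
# Testing is.na() first keeps missing values out of the default bucket.
tiers <- case_when(
  is.na(sales) ~ "Unknown",
  sales < 100  ~ "Low",
  sales < 500  ~ "Medium",
  TRUE         ~ "High"
)
tiers

# Without a default, values that match no condition become NA
no_default <- case_when(
  sales < 100 ~ "Low",
  sales < 500 ~ "Medium"
)
no_default
```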

### Business Applications

**Customer Segmentation:**
```r
case_when(
  customer_type == "VIP" & revenue > 1000 ~ "Platinum",
  customer_type == "VIP" | revenue > 1000 ~ "Gold",
  revenue > 500 ~ "Silver",
  TRUE ~ "Bronze"
)
```

**Product Classification:**
```r
case_when(
  price < 50 ~ "Budget",
  price < 200 ~ "Mid-Range",
  price < 1000 ~ "Premium",
  TRUE ~ "Luxury"
)
```

**Priority Assignment:**
```r
case_when(
  days_since_contact > 30 & value_score == "High" ~ "Urgent",
  days_since_contact > 14 ~ "High",
  days_since_contact > 7 ~ "Medium",
  TRUE ~ "Low"
)
```

### Best Practices

**Do:**
- Always include a `TRUE ~` default case
- Order conditions from most specific to most general
- Use clear, descriptive result values
- Test edge cases
- Comment complex logic

**Don't:**
- Forget the default case (leads to NA values)
- Make conditions too complex (break into multiple case_when if needed)
- Overlap conditions unintentionally
- Mix result types (keep every result text, or every result numeric)
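
To see why ordering from most specific to most general matters, the sketch below applies the same two rules in both orders to a made-up `customers` table. When the broad rule comes first, the specific rule can never match.

```r
library(dplyr)

# Illustrative data only
customers <- tibble(
  CustomerType = c("VIP", "VIP", "New"),
  Revenue = c(1500, 300, 1200)
)

compared <- customers %>%
  mutate(
    # Specific rule first: VIP with high revenue reaches "Platinum"
    Correct = case_when(
      CustomerType == "VIP" & Revenue > 1000 ~ "Platinum",
      CustomerType == "VIP" | Revenue > 1000 ~ "Gold",
      TRUE ~ "Bronze"
    ),
    # General rule first: it captures every VIP, so "Platinum" is unreachable
    Misordered = case_when(
      CustomerType == "VIP" | Revenue > 1000 ~ "Gold",
      CustomerType == "VIP" & Revenue > 1000 ~ "Platinum",
      TRUE ~ "Bronze"
    )
  )
compared
```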

### Real-World Example: Customer Value Scoring

**Business Need:** Segment customers for targeted marketing

**Factors to Consider:**
- Customer type (New, Returning, VIP)
- Transaction value (Revenue)
- Purchase frequency
- Recency

**Outcome:** Assign each customer to a tier (Platinum, Gold, Silver, Bronze)

**Business Impact:**
- Platinum: White-glove service, exclusive offers
- Gold: Priority support, early access
- Silver: Standard benefits, occasional perks
- Bronze: Basic service, growth potential

# BUSINESS USE CASE: Multi-dimensional customer value scoring
#
# Problem: Need to segment customers for targeted marketing and service levels
# Impact: Can't personalize experience, missing revenue from high-value customers
# Solution: Use case_when() to create sophisticated segmentation logic

sales_classified <- sales_data %>%
  mutate(
    # Calculate total revenue (will use for classification)
    Revenue = Sales * Quantity,
    
    # CLASSIFICATION 1: Sales Tier (simple, single condition)
    # Purpose: Quick categorization of transaction size
    Sales_Tier = case_when(
      Sales < 200 ~ "Low",           # Small transactions
      Sales >= 200 & Sales < 800 ~ "Medium",  # Mid-size transactions
      Sales >= 800 ~ "High",         # Large transactions
      TRUE ~ "Unknown"               # Safety net: catches rows where Sales is NA
    ),
    
    # CLASSIFICATION 2: Customer Value Score (complex, multiple factors)
    # Purpose: Identify most valuable customers for VIP treatment
    # Logic: Combines customer type AND revenue
    Value_Score = case_when(
      # Platinum: VIP customers with high revenue
      # & means AND (both conditions must be true)
      CustomerType == "VIP" & Revenue > 1000 ~ "Platinum",
      
      # Gold: Either VIP OR high revenue (but not both)
      # | means OR (at least one condition must be true)
      CustomerType == "VIP" | Revenue > 1000 ~ "Gold",
      
      # Silver: Returning customers with decent revenue
      CustomerType == "Returning" & Revenue > 500 ~ "Silver",
      
      # Bronze: Everyone else (new customers, low revenue)
      TRUE ~ "Bronze"
    ),
    
    # CLASSIFICATION 3: Follow-up Priority (derived from Value_Score)
    # Purpose: Prioritize sales team outreach
    # %in% checks if value is in a vector
    Follow_Up_Priority = case_when(
      Value_Score %in% c("Platinum", "Gold") ~ "High",    # Top customers
      Value_Score == "Silver" ~ "Medium",                  # Good customers
      TRUE ~ "Low"                                         # Standard customers
    )
  )

cat("🎯 CUSTOMER SEGMENTATION WITH case_when()\n\n")

# Show sample of classifications
sales_classified %>%
  select(OrderID, Sales, Revenue, Sales_Tier, CustomerType, Value_Score, Follow_Up_Priority) %>%
  head(10) %>%
  print()

cat("\n💡 Business Logic Explained:\n")
cat("  Platinum: VIP + Revenue > $1000 (highest-value segment)\n")
cat("  Gold: VIP OR Revenue > $1000 (high-value segment)\n")
cat("  Silver: Returning + Revenue > $500 (loyal customers)\n")
cat("  Bronze: All others (growth potential)\n")

cat("\n🎯 Action Plan by Segment:\n")
cat("  Platinum: Dedicated account manager, exclusive offers\n")
cat("  Gold: Priority support, early product access\n")
cat("  Silver: Loyalty rewards, upgrade incentives\n")
cat("  Bronze: Nurture campaigns, education content\n")

## Part 4: Data Validation and Quality Checks

### Why Data Validation Is Critical

**The Cost of Bad Data:**
- **Wrong decisions:** Garbage in, garbage out
- **Lost revenue:** Missed opportunities from inaccurate analysis
- **Wasted time:** Hours debugging issues caused by data problems
- **Damaged credibility:** Stakeholders lose trust in your analysis
- **Compliance risks:** Regulatory issues from data quality problems

**Industry Statistics:**
- Poor data quality costs organizations an average of $12.9 million annually (Gartner)
- 27% of business leaders are unsure of data accuracy (Harvard Business Review)
- Data scientists spend 60% of time cleaning and organizing data (Forbes)

### Types of Data Quality Issues

**1. Missing Values:**
- NULL, NA, blank cells
- Impact: Can't calculate metrics, skewed averages
- Check: `sum(is.na(column))`

**2. Invalid Values:**
- Negative sales, zero quantities
- Future dates, dates before business started
- Impact: Incorrect calculations, failed business rules
- Check: `sum(sales < 0)`, `sum(date > today())`

**3. Duplicates:**
- Same transaction recorded multiple times
- Impact: Inflated metrics, double-counting
- Check: `sum(duplicated(data))`

**4. Outliers:**
- Extreme values that may be errors
- Impact: Skewed averages, misleading trends
- Check: IQR method, z-scores

**5. Inconsistencies:**
- "USA" vs "US" vs "United States"
- Impact: Failed joins, incorrect grouping
- Check: `n_distinct()`, manual inspection
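
Checks 3 and 5 above can be sketched on a small made-up table. The `raw_orders` data and the list of accepted spellings are assumptions chosen for illustration.

```r
library(dplyr)

# Illustrative data with one repeated row and inconsistent country labels
raw_orders <- tibble(
  OrderID = c(1, 2, 2, 3, 4),
  Country = c("USA", "US", "US", "United States", "usa"),
  Sales = c(100, 250, 250, 75, 300)
)

# Duplicates: count rows that repeat an earlier row exactly
duplicate_rows <- sum(duplicated(raw_orders))
duplicate_rows

# Inconsistencies: list every distinct spelling and how often it occurs
raw_orders %>% count(Country)

# Drop the repeated row and map known spellings to one label
cleaned <- raw_orders %>%
  distinct() %>%
  mutate(
    Country = case_when(
      toupper(Country) %in% c("USA", "US", "UNITED STATES") ~ "USA",
      TRUE ~ Country
    )
  )
n_distinct(cleaned$Country)
```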

### Professional Validation Workflow

**Step 1: Completeness Checks**
```r
# Check for missing values
sales_data %>%
  summarize(
    Missing_Sales = sum(is.na(Sales)),
    Missing_Dates = sum(is.na(OrderDate))
  )
```

**Step 2: Business Rule Validation**
```r
# Check business logic
sales_data %>%
  summarize(
    Negative_Sales = sum(Sales < 0),
    Zero_Quantity = sum(Quantity <= 0),
    Future_Dates = sum(OrderDate > today())
  )
```

**Step 3: Statistical Validation**
```r
# Check for outliers
sales_data %>%
  summarize(
    Mean = mean(Sales),
    SD = sd(Sales),
    Q1 = quantile(Sales, 0.25),
    Q3 = quantile(Sales, 0.75),
    IQR = Q3 - Q1
  )
```

### Best Practices

**Do:**
- Validate data BEFORE analysis
- Document all validation checks
- Create automated validation scripts
- Set up alerts for data quality issues
- Track data quality metrics over time

**Don't:**
- Assume data is clean
- Skip validation to save time
- Ignore validation failures
- Delete outliers without investigation
- Forget to communicate data quality issues
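
An automated validation script can be as simple as one function that is run before every analysis. The sketch below is one possible shape for it: `check_sales_quality()` is a hypothetical helper, and the required column names are assumed to match the lesson's sales data.

```r
library(dplyr)

# Hypothetical helper: returns a one-row tibble of issue counts
check_sales_quality <- function(data) {
  required <- c("Sales", "Quantity", "OrderDate")
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }

  data %>%
    summarize(
      Missing_Values = sum(is.na(Sales) | is.na(Quantity) | is.na(OrderDate)),
      # na.rm = TRUE so a missing value does not turn a count into NA
      Negative_Sales = sum(Sales < 0, na.rm = TRUE),
      Invalid_Quantity = sum(Quantity <= 0, na.rm = TRUE),
      Future_Dates = sum(OrderDate > Sys.Date(), na.rm = TRUE)
    )
}

# Made-up rows containing one problem of each of the first three kinds
test_orders <- tibble(
  Sales = c(120, -5, NA, 300),
  Quantity = c(1, 2, 0, 3),
  OrderDate = as.Date(c("2024-01-05", "2024-02-10", "2024-03-01", "2024-03-15"))
)

issues <- check_sales_quality(test_orders)
issues

if (sum(issues) > 0) {
  message("Data quality issues found: review before analysis")
}
```

Returning counts rather than stopping on the first problem keeps the full picture visible, which makes it easier to decide whether to flag, fix, or remove the affected rows.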

### When to Flag vs Fix vs Remove

**Flag (Keep but mark):**
- Potential outliers that might be legitimate
- Records with minor quality issues
- Data that needs manual review

**Fix (Correct the data):**
- Known systematic errors
- Standardization issues (case, format)
- Calculable missing values

**Remove (Exclude from analysis):**
- Clearly invalid data
- Duplicates
- Test records
- Data outside analysis scope
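
The three options can be combined in one pipeline. The sketch below uses made-up data; the $10,000 review threshold and the `capitalize()` helper are assumptions for illustration.

```r
library(dplyr)

# Illustrative data: one exact duplicate, one negative sale,
# inconsistent capitalization, and one very large sale
orders <- tibble(
  OrderID = c(1, 2, 2, 3, 4),
  Region = c("north", "South", "South", "EAST", "West"),
  Sales = c(200, 150, 150, 25000, -40)
)

# Hypothetical helper: first letter upper case, the rest lower case
capitalize <- function(x) {
  paste0(toupper(substr(x, 1, 1)), tolower(substr(x, 2, nchar(x))))
}

handled <- orders %>%
  distinct() %>%                     # REMOVE: exact duplicates
  filter(Sales >= 0) %>%             # REMOVE: clearly invalid values
  mutate(
    Region = capitalize(Region),     # FIX: standardize capitalization
    Needs_Review = Sales > 10000     # FLAG: keep the row, mark it for review
  )
handled
```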

# BUSINESS USE CASE: Automated data quality checks
#
# Problem: Need to ensure data quality before analysis
# Impact: Bad data leads to wrong decisions and wasted time
# Solution: Implement systematic validation checks

# CHECK 1: Missing Values
# Why: Missing data can skew calculations and cause errors
missing_check <- sales_data %>%
  summarize(
    Missing_Sales = sum(is.na(Sales)),
    Missing_Quantity = sum(is.na(Quantity)),
    Missing_OrderDate = sum(is.na(OrderDate)),
    Total_Rows = n(),
    # Calculate percentage missing
    Pct_Missing_Sales = (Missing_Sales / Total_Rows) * 100
  )

cat("✅ MISSING VALUE CHECK:\n\n")
print(missing_check)

if (sum(missing_check[1:3]) == 0) {
  cat("\n✅ No missing values detected!\n")
} else {
  cat("\n⚠️  Missing values found - investigate before analysis!\n")
}

cat("\n💡 What to do if missing values found:\n")
cat("  • < 5% missing: Consider imputation or removal\n")
cat("  • 5-20% missing: Investigate pattern, may indicate systematic issue\n")
cat("  • > 20% missing: Serious data quality problem, contact data source\n")

In [None]:
# CHECK 2: Business Rule Validation
# Why: Ensure data follows business logic and constraints

validation_results <- sales_data %>%
  summarize(
    # na.rm = TRUE keeps a missing value from turning a count into NA
    # (missing values themselves are counted in CHECK 1)

    # Check for negative sales (should NEVER happen)
    Negative_Sales = sum(Sales < 0, na.rm = TRUE),
    
    # Check for zero or negative quantity (invalid)
    Zero_Quantity = sum(Quantity <= 0, na.rm = TRUE),
    
    # Check for future dates (impossible)
    Future_Dates = sum(OrderDate > today(), na.rm = TRUE),
    
    # Check for extreme outliers (sales > $10,000)
    # May be legitimate but worth investigating
    Extreme_Sales = sum(Sales > 10000, na.rm = TRUE),
    
    # Check for very old dates (before business started)
    # Assuming business started in 2020
    Old_Dates = sum(OrderDate < as.Date("2020-01-01"), na.rm = TRUE)
  )

cat("✅ BUSINESS RULE VALIDATION:\n\n")
print(validation_results)

# Calculate total issues
total_issues <- sum(validation_results)

if (total_issues == 0) {
  cat("\n✅ All validation checks passed! Data is clean.\n")
} else {
  cat("\n⚠️ ", total_issues, " validation issue(s) found!\n")
  cat("\n🔍 Next steps:\n")
  cat("  1. Investigate root cause of issues\n")
  cat("  2. Contact data source if systematic problem\n")
  cat("  3. Document any data cleaning decisions\n")
  cat("  4. Consider flagging vs removing problematic records\n")
}

cat("\n💡 Validation Rules Explained:\n")
cat("  Negative Sales: Physically impossible, indicates data error\n")
cat("  Zero Quantity: Business rule violation, can't sell 0 items\n")
cat("  Future Dates: Logically impossible, system error\n")
cat("  Extreme Sales: May be legitimate but worth manual review\n")
cat("  Old Dates: May indicate test data or migration issues\n")

In [None]:
# Statistical validation: Check for outliers
outlier_analysis <- sales_data %>%
  summarize(
    Mean_Sales = mean(Sales),
    SD_Sales = sd(Sales),
    Min_Sales = min(Sales),
    Max_Sales = max(Sales),
    Q1 = quantile(Sales, 0.25),
    Median = median(Sales),
    Q3 = quantile(Sales, 0.75),
    IQR = Q3 - Q1
  ) %>%
  mutate(
    Lower_Fence = Q1 - 1.5 * IQR,
    Upper_Fence = Q3 + 1.5 * IQR
  )

print("Sales Distribution Analysis:")
print(outlier_analysis)
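
Once the fences are known, they can be used to flag individual rows for review rather than deleting them. The sketch below shows the idea on a made-up `sales_sample` table; it is an illustration of the IQR method, not part of the lesson's dataset.

```r
library(dplyr)

# Illustrative values with one obvious extreme
sales_sample <- tibble(
  OrderID = 1:8,
  Sales = c(120, 150, 180, 200, 220, 250, 300, 5000)
)

# Compute the IQR fences (same formulas as the summary above)
fences <- sales_sample %>%
  summarize(
    Q1 = quantile(Sales, 0.25, names = FALSE),
    Q3 = quantile(Sales, 0.75, names = FALSE)
  ) %>%
  mutate(
    IQR = Q3 - Q1,
    Lower_Fence = Q1 - 1.5 * IQR,
    Upper_Fence = Q3 + 1.5 * IQR
  )

# Flag rows outside the fences; keep them in the data for manual review
flagged <- sales_sample %>%
  mutate(
    Is_Outlier = Sales < fences$Lower_Fence | Sales > fences$Upper_Fence
  )

flagged %>% filter(Is_Outlier)
```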

## Part 5: Reproducible Workflows

Creating analysis that others can understand, verify, and update.

In [None]:
# Create a reusable analysis function
#
# Analyze sales data by any grouping variable.
#
# Args:
#   data: Sales data frame
#   group_var: Column name to group by (as string)
#
# Returns:
#   Summary data frame with key metrics
analyze_sales_by_dimension <- function(data, group_var) {
  data %>%
    mutate(Revenue = Sales * Quantity) %>%
    group_by(across(all_of(group_var))) %>%
    summarize(
      Total_Revenue = sum(Revenue),
      Avg_Sale = mean(Sales),
      Order_Count = n(),
      Total_Units = sum(Quantity),
      .groups = 'drop'
    ) %>%
    mutate(
      Revenue_Share = (Total_Revenue / sum(Total_Revenue)) * 100
    ) %>%
    arrange(desc(Total_Revenue))
}

# Use the function for different analyses
# cat() interprets "\n" as a line break; print() would show it literally
cat("Analysis by Region:\n")
analyze_sales_by_dimension(sales_data, "Region") %>% print()

cat("\nAnalysis by Product:\n")
analyze_sales_by_dimension(sales_data, "Product") %>% head(5) %>% print()

cat("\nAnalysis by Customer Type:\n")
analyze_sales_by_dimension(sales_data, "CustomerType") %>% print()
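
The function above takes the column name as a string. An alternative style, sketched here, uses dplyr's embrace operator `{{ }}` so the caller can pass a bare column name, as with `group_by()` itself. `summarize_by()` and the `demo` data are hypothetical names introduced for this example.

```r
library(dplyr)

# Hypothetical variant: group_col is passed unquoted and forwarded with {{ }}
summarize_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(
      Total_Revenue = sum(Sales * Quantity),
      Order_Count = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(Total_Revenue))
}

# Made-up data for demonstration
demo <- tibble(
  Region = c("North", "South", "North"),
  Sales = c(100, 200, 300),
  Quantity = c(1, 2, 2)
)

by_region <- summarize_by(demo, Region)
by_region
```

The string-based version is convenient when column names come from configuration or user input, while the embraced version reads more like ordinary dplyr code.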

## Part 6: Comprehensive Business Analysis

Putting it all together: A complete analysis workflow.

In [None]:
# Complete analysis pipeline
comprehensive_analysis <- sales_data %>%
  # Step 1: Data validation and cleaning
  filter(
    Sales > 0,
    Quantity > 0,
    !is.na(OrderDate)
  ) %>%
  # Step 2: Feature engineering
  mutate(
    Revenue = Sales * Quantity,
    Month = month(OrderDate, label = TRUE),
    Quarter = paste0("Q", quarter(OrderDate)),
    Weekday = wday(OrderDate, label = TRUE),
    Is_Weekend = wday(OrderDate) %in% c(1, 7),
    # Classification
    Revenue_Tier = case_when(
      Revenue < 500 ~ "Low",
      Revenue < 2000 ~ "Medium",
      TRUE ~ "High"
    )
  ) %>%
  # Step 3: Group and summarize
  group_by(Quarter, Region, Revenue_Tier) %>%
  summarize(
    Total_Revenue = sum(Revenue),
    Order_Count = n(),
    Avg_Revenue = mean(Revenue),
    .groups = 'drop'
  ) %>%
  # Step 4: Calculate shares
  group_by(Quarter) %>%
  mutate(
    Quarter_Share = (Total_Revenue / sum(Total_Revenue)) * 100
  ) %>%
  ungroup() %>%
  # Step 5: Final sorting
  arrange(Quarter, desc(Total_Revenue))

print("Comprehensive Quarterly Analysis:")
print(comprehensive_analysis)
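
The pipeline above creates weekday features but aggregates by quarter, region, and tier. For the day-of-week question from Part 1, the same features can be grouped directly. This is a minimal sketch on made-up `orders`; it sets `week_start = 1` explicitly so that Monday is day 1 and the weekend is days 6 and 7, regardless of local settings.

```r
library(dplyr)
library(lubridate)

# Made-up orders: 2024-01-01 and 2024-01-08 are Mondays,
# 2024-01-06 is a Saturday and 2024-01-07 is a Sunday
orders <- tibble(
  OrderDate = as.Date(c("2024-01-01", "2024-01-06", "2024-01-07", "2024-01-08")),
  Sales = c(100, 200, 300, 400),
  Quantity = c(1, 1, 2, 1)
)

weekday_patterns <- orders %>%
  mutate(
    Revenue = Sales * Quantity,
    Day_Number = wday(OrderDate, week_start = 1),  # Monday = 1, Sunday = 7
    Is_Weekend = Day_Number >= 6
  ) %>%
  group_by(Is_Weekend) %>%
  summarize(
    Order_Count = n(),
    Total_Revenue = sum(Revenue),
    .groups = "drop"
  )
weekday_patterns
```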

In [None]:
# Executive summary: Key metrics
executive_summary <- sales_data %>%
  mutate(Revenue = Sales * Quantity) %>%
  summarize(
    Total_Orders = n(),
    Total_Revenue = sum(Revenue),
    Avg_Order_Value = mean(Revenue),
    Total_Units_Sold = sum(Quantity),
    Unique_Products = n_distinct(Product),
    Date_Range = paste(min(OrderDate), "to", max(OrderDate)),
    # Counts VIP orders: the data has no customer ID, so repeat customers cannot be de-duplicated
    VIP_Customers = sum(CustomerType == "VIP"),
    VIP_Percentage = (VIP_Customers / n()) * 100
  )

cat("\n=== EXECUTIVE SUMMARY ===\n")
cat("Total Orders:", executive_summary$Total_Orders, "\n")
cat("Total Revenue: $", format(executive_summary$Total_Revenue, big.mark = ","), "\n", sep = "")
cat("Average Order Value: $", round(executive_summary$Avg_Order_Value, 2), "\n", sep = "")
cat("Total Units Sold:", executive_summary$Total_Units_Sold, "\n")
cat("Unique Products:", executive_summary$Unique_Products, "\n")
cat("Date Range:", executive_summary$Date_Range, "\n")
cat("VIP Orders:", executive_summary$VIP_Customers, 
    "(", round(executive_summary$VIP_Percentage, 1), "%)\n")

## Summary: Best Practices for Professional Data Analysis

### 1. Complex Workflows:
- Chain operations logically (filter → mutate → group → summarize → arrange)
- Use meaningful names for results and new columns
- Comment each major step

### 2. Conditional Logic:
- Use `case_when()` for multiple conditions
- Always include a default case (`TRUE ~ ...`)
- Test edge cases

### 3. Data Validation:
- Check for missing values
- Validate business rules
- Identify outliers
- Document assumptions

### 4. Reproducibility:
- Set random seeds for consistency
- Create reusable functions
- Document your code
- Use version control

### 5. Professional Reporting:
- Create executive summaries
- Format numbers appropriately
- Provide context for metrics
- Make insights actionable
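
For the "format numbers appropriately" point, small base R helpers can keep report output consistent. The two functions below are hypothetical helpers written for this sketch, using only `formatC()` and `paste0()`.

```r
# Hypothetical helper: dollar sign, thousands separators, two decimals
format_currency <- function(x) {
  paste0("$", formatC(x, format = "f", digits = 2, big.mark = ","))
}

# Hypothetical helper: fixed number of decimals followed by a percent sign
format_percent <- function(x, digits = 1) {
  paste0(formatC(x, format = "f", digits = digits), "%")
}

cat("Total Revenue:", format_currency(1234567.891), "\n")
cat("VIP Share:", format_percent(32.456), "\n")
```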

### Key Functions Mastered:
- `case_when()` - Complex conditional logic
- `across()` - Apply functions to multiple columns
- `all_of()` - Select columns programmatically
- `n_distinct()` - Count unique values
- Chaining with `%>%` - Build complex pipelines

### You're Now Ready For:
- Real-world business analytics projects
- Complex data transformation challenges
- Professional data analysis workflows
- Advanced R programming

**Congratulations on completing the Data Wrangling course!** 🎉

---

**End of Lesson 8**