# Lesson 5: Data Reshaping with tidyr

**Topic:** Data Reshaping and Tidy Data Principles with `pivot_longer()` and `pivot_wider()`

**Learning Objectives:**
- Understand the principles of tidy data and when data reshaping is necessary
- Master `pivot_longer()` to convert wide data to long format for analysis
- Apply `pivot_wider()` to convert long data to wide format for reporting
- Recognize business scenarios where data reshaping improves analysis
- Practice reshaping real-world business datasets for different analytical purposes

---

## Overview

Data rarely comes in the exact format we need for analysis. One of the most common data preparation tasks is **reshaping** - transforming data between different structural formats to match the requirements of your analysis or visualization.

In this lesson, we'll explore the powerful **tidyr** package, which provides intuitive functions for data reshaping. You'll learn when and how to transform data between "wide" and "long" formats, and understand how proper data structure can dramatically improve your analytical capabilities.

**Why Data Reshaping Matters:**
- **Analysis Requirements**: Different statistical methods and visualizations require specific data structures
- **Database Integration**: Systems often store data differently than we need for analysis
- **Reporting Flexibility**: Stakeholders may need the same data in multiple formats
- **Visualization Optimization**: Chart types often require specific data arrangements

---

## Understanding Tidy Data Principles

Before diving into reshaping techniques, it's crucial to understand the concept of **tidy data** - a fundamental principle that guides how we structure data for effective analysis.

### What is Tidy Data?

**Tidy data** follows three fundamental rules that make analysis more straightforward and consistent:

1. **Each variable forms a column** - Every measurement type has its own column
2. **Each observation forms a row** - Every individual record occupies one row
3. **Each type of observational unit forms a table** - Each kind of entity (for example, customers or orders) is stored in its own table

### Why Tidy Data Matters in Business:

**🎯 Analytical Efficiency:**
- **Consistent Structure**: Once data is tidy, most analysis functions work seamlessly
- **Reduced Errors**: Clear structure minimizes data manipulation mistakes
- **Scalability**: Tidy datasets handle growth and complexity better

**📊 Business Intelligence Benefits:**
- **Faster Insights**: Less time spent reformatting data means more time for analysis
- **Easier Automation**: Tidy data structures support automated reporting pipelines
- **Better Collaboration**: Team members can easily understand and work with consistently structured data

### Common Business Data Structure Challenges:

**Wide Format Issues:**
- Survey data with each question as a separate column
- Financial data with months or years as column headers
- Performance metrics with different time periods as columns

**Long Format Issues:**
- Difficulty creating summary tables for executive reporting
- Challenges in comparing values side-by-side
- Complex formatting for certain types of business dashboards

The goal of data reshaping is to move between these formats strategically, choosing the structure that best serves your analytical or reporting needs.

## Setup and Package Loading

Before we begin reshaping data, let's prepare our environment with the necessary tools for data manipulation and restructuring.

**📦 Package Overview:**
- **tidyverse**: Our comprehensive data science toolkit that includes tidyr
- **tidyr**: Specialized package for data reshaping and tidying operations

**🔧 What tidyr Provides:**
- `pivot_longer()`: Transform wide data to long format (many columns → fewer columns, more rows)
- `pivot_wider()`: Transform long data to wide format (fewer columns → many columns, fewer rows)
- Additional functions for data separation, combination, and missing value handling

**💼 Business Context:**
In professional data analysis, you'll frequently encounter data in formats that aren't optimal for your needs. For example:
- **Excel exports** often come in wide format for human readability
- **Database extracts** might be in long format for efficient storage
- **Survey platforms** typically export wide format with one column per question
- **Time series systems** may store data in either format depending on their design

Learning to reshape data efficiently is essential for any business analyst or data scientist.

In [1]:
# Load the comprehensive tidyverse package collection
library(tidyverse)    # This automatically loads several packages including:
                      # - dplyr: for data manipulation (select, filter, mutate, etc.)
                      # - tidyr: for data reshaping (pivot_longer, pivot_wider)
                      # - ggplot2: for data visualization
                      # - readr: for reading data files
                      # - and several other essential data science tools

# Confirm successful loading with informative message
cat("✅ Tidyverse loaded successfully!\n")
cat("📦 Available reshaping functions: pivot_longer(), pivot_wider()\n")
cat("🎯 Ready for data reshaping operations!\n")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


✅ Tidyverse loaded successfully!
📦 Available reshaping functions: pivot_longer(), pivot_wider()
🎯 Ready for data reshaping operations!


## Part 1: Converting Wide to Long Format with `pivot_longer()`

The `pivot_longer()` function is one of the most frequently used data reshaping tools in business analytics. It transforms "wide" data (many columns, fewer rows) into "long" data (fewer columns, more rows), which is often the preferred format for analysis and visualization.

### What You'll Learn:
- **Wide vs Long Concepts**: Understanding when each format is appropriate
- **Business Applications**: Real scenarios where pivot_longer() solves problems
- **Function Syntax**: Mastering the key parameters for effective reshaping
- **Data Quality**: Maintaining data integrity during transformation

### When to Use pivot_longer():
**📈 Time Series Analysis:**
- Converting yearly/monthly columns into a single time variable for trend analysis
- Preparing data for time series forecasting and seasonal pattern detection

**📊 Data Visualization:**
- Creating datasets suitable for ggplot2 visualizations (especially line charts and grouped bar charts)
- Enabling easy color-coding and grouping in charts

**🔍 Statistical Analysis:**
- Preparing data for statistical tests that require long format
- Setting up data for regression analysis with categorical variables

**💻 Database Operations:**
- Converting spreadsheet-style data into normalized database format
- Preparing data for joining with other long-format datasets

### Key Parameters of pivot_longer():
- `cols`: Which columns to reshape (can use selection helpers like `starts_with()`)
- `names_to`: Name for the new column that will contain the old column names
- `values_to`: Name for the new column that will contain the values
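- `names_prefix`: A regular expression matched at the start of each column name and removed (e.g., "Sales_" from "Sales_2022")
- `names_pattern`: A regular expression with capture groups that extracts the part of each column name to keep (e.g., `"(.*)_Sales"` turns "ProductA_Sales" into "ProductA"); use this to drop a suffix, since `pivot_longer()` has no `names_suffix` argument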

The examples below demonstrate these concepts with realistic business scenarios.

In [2]:
# Example 1: Regional Sales Data - Wide to Long Transformation
# Creating a realistic wide-format sales dataset typical of business reports
sales_wide <- data.frame(
  Region = c("North", "South", "East", "West"),           # Geographic regions for business analysis
  Sales_2022 = c(1000, 1200, 900, 1500),                # Sales figures for 2022 by region
  Sales_2023 = c(1100, 1300, 950, 1600),                # Sales figures for 2023 by region  
  Sales_2024 = c(1200, 1400, 1000, 1700)                # Sales figures for 2024 by region
)

print("Original Wide Sales Data (typical Excel/reporting format):")
print(sales_wide)

# Explain the business problem this wide format creates
cat("\n📊 Business Challenge with Wide Format:\n")
cat("- Difficult to create time series charts\n")
cat("- Hard to calculate year-over-year growth rates\n") 
cat("- Cannot easily group or filter by year\n")
cat("- Statistical analysis functions expect long format\n")

[1] "Original Wide Sales Data (typical Excel/reporting format):"
  Region Sales_2022 Sales_2023 Sales_2024
1  North       1000       1100       1200
2  South       1200       1300       1400
3   East        900        950       1000
4   West       1500       1600       1700

📊 Business Challenge with Wide Format:
- Difficult to create time series charts
- Hard to calculate year-over-year growth rates
- Cannot easily group or filter by year
- Statistical analysis functions expect long format


In [3]:
# Transform wide sales data to long format using pivot_longer()
sales_long <- sales_wide %>%                    # Start with the wide sales data
  pivot_longer(
    cols = starts_with("Sales_"),              # Select all columns beginning with "Sales_"
                                               # This targets Sales_2022, Sales_2023, Sales_2024
    names_to = "Year",                         # Create new column "Year" to hold former column names
    names_prefix = "Sales_",                   # Remove "Sales_" prefix from values in Year column
                                               # "Sales_2022" becomes "2022"
    values_to = "SalesAmount"                  # Create new column "SalesAmount" for the actual values
  )

print("Converted Long Sales Data (analysis-ready format):")
print(sales_long)

# Demonstrate the analytical advantages of long format
cat("\n✅ Benefits of Long Format:\n")
cat("- Easy to filter data by specific years: filter(Year == '2023')\n")
cat("- Simple to calculate growth rates: group_by(Region) %>% arrange(Year)\n")
cat("- Perfect for time series visualization: ggplot with Year on x-axis\n")
cat("- Enables statistical analysis: correlation between regions over time\n")

[1] "Converted Long Sales Data (analysis-ready format):"
# A tibble: 12 × 3
   Region Year  SalesAmount
   <chr>  <chr>       <dbl>
 1 North  2022         1000
 2 North  2023         1100
 3 North  2024         1200
 4 South  2022         1200
 5 South  2023         1300
 6 South  2024         1400
 7 East   2022          900
 8 East   2023          950
 9 East   2024         1000
10 West   2022         1500
11 West   2023         1600
12 West   2024         1700

✅ Benefits of Long Format:
- Easy to filter data by specific years: filter(Year == '2023')
- Simple to calculate growth rates: group_by(Region) %>% arrange(Year)
- Perfect for time series visualization: ggplot with Year on x-axis
- Enables statistical analysis: correlation between regions over time
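
Note that `Year` comes out of `pivot_longer()` as a character column (`<chr>`). If numeric comparisons on the year are needed, one option is the `names_transform` argument. The sketch below assumes a reasonably recent tidyr (the session above uses 1.3.1); the names `sales_long_int` and `sales_2023_on` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Same wide sales data as in Example 1
sales_wide <- data.frame(
  Region = c("North", "South", "East", "West"),
  Sales_2022 = c(1000, 1200, 900, 1500),
  Sales_2023 = c(1100, 1300, 950, 1600),
  Sales_2024 = c(1200, 1400, 1000, 1700)
)

# names_transform converts the new Year column from character to integer
sales_long_int <- sales_wide %>%
  pivot_longer(
    cols = starts_with("Sales_"),
    names_to = "Year",
    names_prefix = "Sales_",
    names_transform = list(Year = as.integer),
    values_to = "SalesAmount"
  )

# With an integer Year, numeric comparisons work as expected
sales_2023_on <- sales_long_int %>%
  filter(Year >= 2023)

print(sales_2023_on)
```

With the character version, `filter(Year == '2023')` still works, but range filters and arithmetic on the year are easier once the column is numeric.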

In [4]:
# Example 2: Survey Response Data - Multiple Questions to Long Format
# Creating survey data typical of customer satisfaction or employee engagement surveys
survey_wide <- data.frame(
  ParticipantID = 1:3,                        # Unique identifier for each survey respondent
  Q1_Score = c(5, 4, 3),                      # Responses to Question 1 (e.g., "Rate our service")
  Q2_Score = c(4, 5, 4),                      # Responses to Question 2 (e.g., "Rate our products")
  Q3_Score = c(3, 3, 5)                       # Responses to Question 3 (e.g., "Rate our support")
)

print("Original Wide Survey Data (one column per question):")
print(survey_wide)

# Transform survey data to long format for statistical analysis
survey_long <- survey_wide %>%                # Start with wide survey data
  pivot_longer(
    cols = starts_with("Q"),                  # Select all columns starting with "Q"
                                              # This captures Q1_Score, Q2_Score, Q3_Score
    names_to = "Question",                    # New column to hold question identifiers
    values_to = "Score"                       # New column to hold the actual scores
  )

print("Converted Long Survey Data (statistical analysis ready):")
print(survey_long)

# Explain business value of this transformation
cat("\n💼 Business Applications of Long Survey Data:\n")
cat("- Calculate average scores by question: group_by(Question) %>% summarise(avg = mean(Score))\n")
cat("- Identify response patterns: analyze score distributions across questions\n")
cat("- Create comprehensive visualizations: box plots, violin plots by question\n")
cat("- Perform statistical tests: ANOVA to compare question scores\n")

[1] "Original Wide Survey Data (one column per question):"
  ParticipantID Q1_Score Q2_Score Q3_Score
1             1        5        4        3
2             2        4        5        3
3             3        3        4        5
[1] "Converted Long Survey Data (statistical analysis ready):"
# A tibble: 9 × 3
  ParticipantID Question Score
          <int> <chr>    <dbl>
1             1 Q1_Score     5
2             1 Q2_Score     4
3             1 Q3_Score     3
4             2 Q1_Score     4
5             2 Q2_Score     5
6             2 Q3_Score     3
7             3 Q1_Score     3
8             3 Q2_Score     4
9             3 Q3_Score     5

💼 Business Applications of Long Survey Data:
- Calculate average scores by question: group_by(Question) %>% summarise(avg = mean(Score))
- Identify response patterns: analyze score distributions across questions
- Create comprehensive visualizations: box plots, violin plots by question
- Perform statistical tests: ANOVA to compare question scores
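
The first suggestion in that list can be sketched as follows. This is a minimal illustration, not the only way to do it: it rebuilds the same survey data, uses `names_pattern` to shorten the question labels to "Q1", "Q2", "Q3", and stores the result in a hypothetical object called `question_summary`.

```r
library(dplyr)
library(tidyr)

# Same wide survey data as in Example 2
survey_wide <- data.frame(
  ParticipantID = 1:3,
  Q1_Score = c(5, 4, 3),
  Q2_Score = c(4, 5, 4),
  Q3_Score = c(3, 3, 5)
)

# Reshape to long format, keeping only the part of the name before "_Score",
# then compute the average score and response count per question
question_summary <- survey_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "Question",
    names_pattern = "(.*)_Score",
    values_to = "Score"
  ) %>%
  group_by(Question) %>%
  summarise(avg = mean(Score), n = n())

print(question_summary)
```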

In [5]:
# Example 3: Advanced pivot_longer() with Business Context
# Creating monthly product sales data typical of retail or e-commerce businesses
monthly_sales_wide <- data.frame(
  Month = c("Jan", "Feb", "Mar"),             # Time period for business tracking
  ProductA_Sales = c(100, 110, 105),         # Sales figures for Product A each month
  ProductB_Sales = c(200, 210, 205),         # Sales figures for Product B each month
  ProductC_Sales = c(150, 155, 160)          # Sales figures for Product C each month
)

print("Monthly Product Sales (Wide Format - typical business report):")
print(monthly_sales_wide)

# Transform to long format suitable for comprehensive business analysis
monthly_sales_long <- monthly_sales_wide %>%  # Start with wide monthly sales data
  pivot_longer(
    cols = starts_with("Product"),            # Select all product sales columns
    names_to = "Product",                     # Create column for product names
    names_pattern = "(.*)_Sales",             # Extract product name, removing "_Sales" suffix
                                              # This regex captures everything before "_Sales"
    values_to = "Sales"                       # Create column for sales values
  )

print("Monthly Product Sales (Long Format - analysis optimized):")
print(monthly_sales_long)

# Demonstrate analytical power of long format
cat("\n🎯 Strategic Business Analysis Enabled:\n")
cat("- Product performance comparison: group_by(Product) %>% summarise(total = sum(Sales))\n")
cat("- Trend analysis by product: filter(Product == 'ProductA') %>% arrange(Month)\n")
cat("- Market share calculations: mutate(share = Sales / sum(Sales) * 100)\n")
cat("- Visualization readiness: ggplot(aes(x = Month, y = Sales, color = Product))\n")

[1] "Monthly Product Sales (Wide Format - typical business report):"
  Month ProductA_Sales ProductB_Sales ProductC_Sales
1   Jan            100            200            150
2   Feb            110            210            155
3   Mar            105            205            160
[1] "Monthly Product Sales (Long Format - analysis optimized):"
# A tibble: 9 × 3
  Month Product  Sales
  <chr> <chr>    <dbl>
1 Jan   ProductA   100
2 Jan   ProductB   200
3 Jan   ProductC   150
4 Feb   ProductA   110
5 Feb   ProductB   210
6 Feb   ProductC   155
7 Mar   ProductA   105
8 Mar   ProductB   205
9 Mar   ProductC   160

🎯 Strategic Business Analysis Enabled:
- Product performance comparison: group_by(Product) %>% summarise(total = sum(Sales))
- Trend analysis by product: filter(Product == 'ProductA') %>% arrange(Month)
- Market share calculations: mutate(share = Sales / sum(Sales) * 100)
- Visualization readiness: ggplot(aes(x = Month, y = Sales, color = Product))
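
One caveat when following these suggestions: `Month` is a character column, so `arrange(Month)` sorts alphabetically (Feb, Jan, Mar) rather than by calendar order. A possible workaround, sketched below, is to convert `Month` to a factor whose levels follow base R's built-in `month.abb` vector. The sketch also computes each product's share of monthly sales; `monthly_share` is an illustrative name, not part of the lesson's code.

```r
library(dplyr)
library(tidyr)

# Same wide monthly sales data as in Example 3
monthly_sales_wide <- data.frame(
  Month = c("Jan", "Feb", "Mar"),
  ProductA_Sales = c(100, 110, 105),
  ProductB_Sales = c(200, 210, 205),
  ProductC_Sales = c(150, 155, 160)
)

monthly_share <- monthly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Product"),
    names_to = "Product",
    names_pattern = "(.*)_Sales",
    values_to = "Sales"
  ) %>%
  # month.abb is the constant "Jan", "Feb", ..., "Dec", so sorting follows the calendar
  mutate(Month = factor(Month, levels = month.abb)) %>%
  # Share of each product within its month, in percent
  group_by(Month) %>%
  mutate(Share = Sales / sum(Sales) * 100) %>%
  ungroup() %>%
  arrange(Product, Month)

print(monthly_share)
```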

## Part 2: Converting Long to Wide Format with `pivot_wider()`

The `pivot_wider()` function performs the reverse operation of `pivot_longer()`, transforming "long" data (fewer columns, more rows) into "wide" data (many columns, fewer rows). This transformation is particularly valuable for creating summary reports, comparison tables, and formats that stakeholders expect.

### What You'll Learn:
- **Strategic Widening**: When wide format serves business needs better
- **Executive Reporting**: Creating stakeholder-friendly data presentations  
- **Comparison Analysis**: Side-by-side metric comparisons
- **Data Export**: Preparing data for Excel reports and dashboards

### When to Use pivot_wider():
**📋 Executive Reporting:**
- Creating summary tables where executives can easily compare metrics side-by-side
- Building dashboard-ready datasets with clear column headers

**📊 Comparison Analysis:**
- Facilitating direct comparisons between categories, time periods, or business units
- Creating matrices for correlation analysis or heat map visualizations

**📤 Data Export:**
- Preparing data for Excel reports where wide format is more readable
- Creating datasets for business users who prefer spreadsheet-style layouts

**🔄 Data Integration:**
- Matching the format expected by other systems or analysis tools
- Creating lookup tables or reference datasets

### Key Parameters of pivot_wider():
- `names_from`: Column whose values will become the new column names
- `values_from`: Column whose values will fill the new columns
- `names_prefix`: Text to add before new column names (e.g., "Sales_")
- `values_fill`: Value to use for missing combinations (e.g., 0 for missing sales)
- `names_sep`: Separator when creating names from multiple columns
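
As a small illustration of `values_fill`, the sketch below uses hypothetical data (`orders_long`) in which one region has no record for one year. Without `values_fill` the missing combination becomes `NA`; with `values_fill = 0` it becomes zero. Whether zero is the right replacement depends on whether a missing record really means "no orders".

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the South region has no 2023 record
orders_long <- data.frame(
  Region = c("North", "North", "South"),
  Year = c("2022", "2023", "2022"),
  Orders = c(10, 12, 7)
)

# Default behavior: the missing Region/Year combination is filled with NA
orders_na <- orders_long %>%
  pivot_wider(
    names_from = Year,
    values_from = Orders,
    names_prefix = "Orders_"
  )

# values_fill replaces missing combinations with an explicit value
orders_zero <- orders_long %>%
  pivot_wider(
    names_from = Year,
    values_from = Orders,
    names_prefix = "Orders_",
    values_fill = 0
  )

print(orders_na)
print(orders_zero)
```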

The examples below show how pivot_wider() solves real business reporting challenges.

In [6]:
# Example 1: Converting Long Sales Data Back to Wide Format for Executive Reporting
# Using the sales_long data we created earlier for demonstration
print("Starting with Long Sales Data (from previous example):")
print(sales_long)

# Convert back to wide format for executive dashboard or report
sales_re_wide <- sales_long %>%               # Start with long-format sales data
  pivot_wider(
    names_from = Year,                        # Use Year column values as new column names
                                              # "2022", "2023", "2024" become column headers
    values_from = SalesAmount,                # Use SalesAmount values to fill new columns
    names_prefix = "Sales_"                   # Add "Sales_" prefix to new column names
                                              # Results in "Sales_2022", "Sales_2023", etc.
  )

print("Converted Back to Wide Sales Data (executive report format):")
print(sales_re_wide)

# Explain the business value of this wide format
cat("\n💼 Executive Reporting Benefits:\n")
cat("- Easy year-over-year comparison at a glance\n")
cat("- Familiar spreadsheet-like format for stakeholders\n")
cat("- Simple to add calculated columns (e.g., growth rates)\n")
cat("- Perfect for inclusion in PowerPoint presentations\n")

[1] "Starting with Long Sales Data (from previous example):"
# A tibble: 12 × 3
   Region Year  SalesAmount
   <chr>  <chr>       <dbl>
 1 North  2022         1000
 2 North  2023         1100
 3 North  2024         1200
 4 South  2022         1200
 5 South  2023         1300
 6 South  2024         1400
 7 East   2022          900
 8 East   2023          950
 9 East   2024         1000
10 West   2022         1500
11 West   2023         1600
12 West   2024         1700
[1] "Converted Back to Wide Sales Data (executive report format):"
# A tibble: 4 × 4
  Region Sales_2022 Sales_2023 Sales_2024
  <chr>       <dbl>      <dbl>      <dbl>
1 North        1000       1100       1200
2 South        1200       1300       1400
3 East          900        950       1000
4 West         1500       1600       1700

💼 Executive Reporting Benefits:
- Easy year-over-year comparison at a glance
- Familiar spreadsheet-like format for stakeholders
- Simple to add calculated columns (e.g., growth rates)
- Perfect for inclusion in PowerPoint presentations

In [7]:
# Example 2: Product Features Database to Comparison Table
# Creating a product features dataset typical of e-commerce or catalog systems
product_features_long <- data.frame(
  ProductID = c(1, 1, 1, 2, 2, 2),           # Product identifiers (each product has multiple features)
  Feature = c("Color", "Size", "Material", "Color", "Size", "Material"),  # Feature types
  Value = c("Red", "M", "Cotton", "Blue", "L", "Polyester")              # Feature values
)

print("Original Long Product Features Data (database format):")
print(product_features_long)

# Transform to wide format for product comparison table
product_features_wide <- product_features_long %>%  # Start with long features data
  pivot_wider(
    names_from = Feature,                     # Use Feature column values as new column names
                                              # "Color", "Size", "Material" become columns
    values_from = Value                       # Use Value column to fill the new columns
  )

print("Converted Wide Product Features Data (comparison table):")
print(product_features_wide)

# Demonstrate business applications
cat("\n🛒 E-commerce Business Applications:\n")
cat("- Product comparison tables on website\n")
cat("- Inventory management with clear specifications\n")
cat("- Easy filtering: filter(Color == 'Red' & Size == 'M')\n")
cat("- Supplier communication with structured product specs\n")

[1] "Original Long Product Features Data (database format):"
  ProductID  Feature     Value
1         1    Color       Red
2         1     Size         M
3         1 Material    Cotton
4         2    Color      Blue
5         2     Size         L
6         2 Material Polyester
[1] "Converted Wide Product Features Data (comparison table):"
# A tibble: 2 × 4
  ProductID Color Size  Material
      <dbl> <chr> <chr> <chr>
1         1 Red   M     Cotton
2         2 Blue  L     Polyester

🛒 E-commerce Business Applications:
- Product comparison tables on website
- Inventory management with clear specifications
- Easy filtering: filter(Color == 'Red' & Size == 'M')
- Supplier communication with structured product specs
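
One caveat when widening database extracts: `pivot_wider()` works cleanly only when each ID/name combination appears once. If a product has the same feature recorded twice, the result contains list-columns and tidyr issues a warning. The sketch below uses hypothetical data (`features_dup`) and the `values_fn` argument to collapse duplicates into a single string. How duplicates should actually be resolved is a business decision, so treat this as one possible approach.

```r
library(dplyr)
library(tidyr)

# Hypothetical extract: product 1 has two Color records, product 2 has no Size
features_dup <- data.frame(
  ProductID = c(1, 1, 1, 2),
  Feature = c("Color", "Color", "Size", "Color"),
  Value = c("Red", "Crimson", "M", "Blue")
)

# values_fn is applied to the values behind every ProductID/Feature cell;
# here it pastes multiple values together with " / "
features_wide <- features_dup %>%
  pivot_wider(
    names_from = Feature,
    values_from = Value,
    values_fn = list(Value = function(x) paste(x, collapse = " / "))
  )

print(features_wide)
```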


In [8]:
# Example 3: Customer Survey Responses - Long to Wide for Analysis
# Creating customer feedback data typical of satisfaction surveys
customer_feedback_long <- data.frame(
  CustomerID = c(101, 101, 101, 102, 102, 102, 103, 103, 103),  # Customer identifiers
  Metric = c("Satisfaction", "Likelihood_to_Recommend", "Value_Rating",  # Survey metrics
             "Satisfaction", "Likelihood_to_Recommend", "Value_Rating",
             "Satisfaction", "Likelihood_to_Recommend", "Value_Rating"),
  Score = c(8, 7, 6, 9, 8, 7, 7, 6, 8)      # Customer scores for each metric
)

print("Customer Feedback Data (Long Format - database storage):")
print(customer_feedback_long)

# Convert to wide format for customer profile analysis
customer_profiles_wide <- customer_feedback_long %>%  # Start with long feedback data
  pivot_wider(
    names_from = Metric,                      # Use survey metrics as column names
    values_from = Score,                      # Use scores to fill the new columns
    names_prefix = "Score_"                   # Add prefix to distinguish score columns
  )

print("Customer Profiles (Wide Format - analysis ready):")
print(customer_profiles_wide)

# Show analytical advantages of wide format for this use case
cat("\n📊 Customer Analytics Benefits:\n")
cat("- Direct correlation analysis: cor(Score_Satisfaction, Score_Likelihood_to_Recommend)\n")
cat("- Easy customer segmentation based on multiple scores\n")
cat("- Simplified statistical modeling with predictor variables\n")
cat("- Clear overview of each customer's complete feedback profile\n")

[1] "Customer Feedback Data (Long Format - database storage):"
  CustomerID                  Metric Score
1        101            Satisfaction     8
2        101 Likelihood_to_Recommend     7
3        101            Value_Rating     6
4        102            Satisfaction     9
5        102 Likelihood_to_Recommend     8
6        102            Value_Rating     7
7        103            Satisfaction     7
8        103 Likelihood_to_Recommend     6
9        103            Value_Rating     8
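[1] "Customer Profiles (Wide Format - analysis ready):"
# A tibble: 3 × 4
  CustomerID Score_Satisfaction Score_Likelihood_to_Recommend Score_Value_Rating
       <dbl>              <dbl>                         <dbl>              <dbl>
1        101                  8                             7                  6
2        102                  9                             8                  7
3        103                  7                             6                  8

📊 Customer Analytics Benefits:
- Direct correlation analysis: cor(Score_Satisfaction, Score_Likelihood_to_Recommend)
- Easy customer segmentation based on multiple scores
- Simplified statistical modeling with predictor variables
- Clear overview of each customer's complete feedback profile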

## Part 3: Business Applications and Use Cases

Understanding when to reshape data is crucial for effective business analytics. Different analytical goals and stakeholder needs require different data structures. This section explores real-world scenarios where data reshaping provides significant business value.

### Strategic Data Reshaping Decisions

**🎯 Analysis-Driven Reshaping:**
Choose your data structure based on your analytical objectives, not just the format you received the data in.

**📊 Stakeholder-Focused Reshaping:**
Consider who will consume your analysis and what format best serves their decision-making needs.

**🔄 Workflow-Optimized Reshaping:**
Sometimes you'll reshape data multiple times within a single analysis - wide for reporting, long for analysis, then back to wide for presentation.
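
A round trip of this kind might look like the sketch below. The data (`report_wide`) and all object names are hypothetical: the report arrives wide, is reshaped to long to compute each team's share of the quarterly total, and is widened again for presentation.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide report: one column per quarter
report_wide <- data.frame(
  Team = c("A", "B"),
  Q1 = c(10, 20),
  Q2 = c(15, 18)
)

# Step 1: long format for analysis (share of each quarter's total per team)
shares_long <- report_wide %>%
  pivot_longer(cols = c(Q1, Q2), names_to = "Quarter", values_to = "Value") %>%
  group_by(Quarter) %>%
  mutate(Share = Value / sum(Value)) %>%
  ungroup()

# Step 2: back to wide format for presentation (one share column per quarter)
shares_wide <- shares_long %>%
  select(Team, Quarter, Share) %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Share,
    names_prefix = "Share_"
  )

print(shares_wide)
```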

### Common Business Scenarios

**1. Time Series Analysis (Long Format Preferred):**
- Financial forecasting with historical revenue data
- Sales trend analysis across multiple product lines
- Performance tracking over quarters or years

**2. Executive Dashboards (Wide Format Preferred):**
- KPI summary tables for board presentations
- Budget vs. actual comparison reports
- Regional performance scorecards

**3. Statistical Modeling (Format Depends on Analysis):**
- Long format for regression analysis with categorical predictors
- Wide format for correlation analysis between variables
- Mixed approaches for complex multi-level modeling

**4. Data Visualization (Format Depends on Chart Type):**
- Long format for line charts, grouped bar charts, box plots
- Wide format for correlation heat maps, parallel coordinate plots
- Strategic reshaping to enable specific visualization requirements

The examples below demonstrate these principles with practical business scenarios.

In [9]:
# Business Use Case 1: Quarterly Performance Analysis
# Creating quarterly business metrics typical of financial reporting
quarterly_metrics_wide <- data.frame(
  Department = c("Sales", "Marketing", "Operations", "Customer_Service"),  # Business departments
  Q1_Revenue = c(250000, 75000, 50000, 25000),     # Q1 revenue contribution by department
  Q2_Revenue = c(275000, 80000, 55000, 27000),     # Q2 revenue contribution by department
  Q3_Revenue = c(300000, 85000, 60000, 30000),     # Q3 revenue contribution by department
  Q4_Revenue = c(325000, 90000, 65000, 32000)      # Q4 revenue contribution by department
)

print("Quarterly Department Performance (Wide Format - CFO Report):")
print(quarterly_metrics_wide)

# Transform to long format for trend analysis and forecasting
quarterly_metrics_long <- quarterly_metrics_wide %>%  # Start with wide quarterly data
  pivot_longer(
    cols = starts_with("Q"),                  # Select all quarter columns
    names_to = "Quarter",                     # Create Quarter column for time dimension
    names_pattern = "(.*)_Revenue",           # Extract quarter identifier, remove "_Revenue"
    values_to = "Revenue"                     # Create Revenue column for values
  )

print("Quarterly Department Performance (Long Format - Trend Analysis Ready):")
print(quarterly_metrics_long)

# Demonstrate business analytics enabled by long format
cat("\n💹 Financial Analysis Capabilities:\n")
cat("- Quarter-over-quarter growth: group_by(Department) %>% mutate(growth = Revenue/lag(Revenue) - 1)\n")
cat("- Seasonal pattern detection: group_by(Quarter) %>% summarise(total = sum(Revenue))\n")
cat("- Department performance ranking: group_by(Quarter) %>% mutate(rank = rank(-Revenue))\n")
cat("- Time series forecasting: filter(Department == 'Sales') for predictive modeling\n")

[1] "Quarterly Department Performance (Wide Format - CFO Report):"
        Department Q1_Revenue Q2_Revenue Q3_Revenue Q4_Revenue
1            Sales     250000     275000     300000     325000
2        Marketing      75000      80000      85000      90000
3       Operations      50000      55000      60000      65000
4 Customer_Service      25000      27000      30000      32000
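[1] "Quarterly Department Performance (Long Format - Trend Analysis Ready):"
# A tibble: 16 × 3
   Department       Quarter Revenue
   <chr>            <chr>     <dbl>
 1 Sales            Q1       250000
 2 Sales            Q2       275000
 3 Sales            Q3       300000
 4 Sales            Q4       325000
 5 Marketing        Q1        75000
 6 Marketing        Q2        80000
 7 Marketing        Q3        85000
 8 Marketing        Q4        90000
 9 Operations       Q1        50000
10 Operations       Q2        55000
11 Operations       Q3        60000
12 Operations       Q4        65000
13 Customer_Service Q1        25000
14 Customer_Service Q2        27000
15 Customer_Service Q3        30000
16 Customer_Service Q4        32000

💹 Financial Analysis Capabilities:
- Quarter-over-quarter growth: group_by(Department) %>% mutate(growth = Revenue/lag(Revenue) - 1)
- Seasonal pattern detection: group_by(Quarter) %>% summarise(total = sum(Revenue))
- Department performance ranking: group_by(Quarter) %>% mutate(rank = rank(-Revenue))
- Time series forecasting: filter(Department == 'Sales') for predictive modeling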

In [10]:
# Business Use Case 2: Customer Satisfaction Matrix
# Creating customer satisfaction data across different service touchpoints
satisfaction_long <- data.frame(
  CustomerSegment = rep(c("Enterprise", "SMB", "Consumer"), each = 4),  # Customer types
  TouchPoint = rep(c("Phone_Support", "Online_Chat", "Email_Support", "Self_Service"), 3),  # Service channels
  SatisfactionScore = c(8.5, 7.2, 6.8, 9.1,        # Enterprise satisfaction scores
                       7.8, 8.1, 7.5, 8.7,         # SMB satisfaction scores  
                       6.9, 8.9, 6.2, 9.3)         # Consumer satisfaction scores
)

print("Customer Satisfaction Data (Long Format - Survey Results):")
print(satisfaction_long)

# Transform to wide format for executive satisfaction matrix
satisfaction_matrix <- satisfaction_long %>%    # Start with long satisfaction data
  pivot_wider(
    names_from = TouchPoint,                  # Use touchpoints as column headers
    values_from = SatisfactionScore,          # Use satisfaction scores as values
    names_prefix = "Satisfaction_"            # Add prefix for clarity in matrix
  )

print("Customer Satisfaction Matrix (Wide Format - Executive Dashboard):")
print(satisfaction_matrix)

# Explain strategic business value
cat("\n🎯 Strategic Customer Experience Insights:\n")
cat("- Easy identification of satisfaction gaps across touchpoints\n")
cat("- Quick comparison of service quality by customer segment\n")
cat("- Perfect format for heat map visualization in dashboards\n")
cat("- Enables correlation analysis: which touchpoints align with overall satisfaction?\n")

[1] "Customer Satisfaction Data (Long Format - Survey Results):"
   CustomerSegment    TouchPoint SatisfactionScore
1       Enterprise Phone_Support               8.5
2       Enterprise   Online_Chat               7.2
3       Enterprise Email_Support               6.8
4       Enterprise  Self_Service               9.1
5              SMB Phone_Support               7.8
6              SMB   Online_Chat               8.1
7              SMB Email_Support               7.5
8              SMB  Self_Service               8.7
9         Consumer Phone_Support               6.9
10        Consumer   Online_Chat               8.9
11        Consumer Email_Support               6.2
12        Consumer  Self_Service               9.3
[1] "Customer Satisfaction Matrix (Wide Format - Executive Dashboard):"
# A tibble: 3 × 5
  CustomerSegment Satisfaction_Phone_Support Satisfaction_Online_Chat
  <chr>                                <dbl>

In [11]:
# Business Use Case 3: Sales Channel Performance Analysis
# Creating sales performance data across different channels and time periods
channel_performance <- data.frame(
  Date = rep(c("2024-01", "2024-02", "2024-03"), each = 3),  # Monthly tracking periods
  Channel = rep(c("Online", "Retail", "Partner"), 3),        # Sales channels
  Revenue = c(150000, 200000, 100000,     # January revenue by channel
              160000, 210000, 105000,     # February revenue by channel
              170000, 220000, 110000),    # March revenue by channel
  Units_Sold = c(1500, 800, 500,         # January units sold by channel
                 1600, 850, 525,         # February units sold by channel
                 1700, 900, 550)         # March units sold by channel
)

print("Sales Channel Performance (Long Format - Operational Data):")
print(channel_performance)

# Create wide format for channel comparison analysis
revenue_by_channel <- channel_performance %>%  # Start with long performance data
  select(Date, Channel, Revenue) %>%           # Focus on revenue metrics
  pivot_wider(
    names_from = Channel,                      # Use sales channels as column names
    values_from = Revenue,                     # Use revenue values to fill columns
    names_prefix = "Revenue_"                  # Add prefix to identify revenue columns
  )

print("Revenue by Channel (Wide Format - Channel Comparison Table):")
print(revenue_by_channel)

# Demonstrate cross-channel business analysis
cat("\n🏪 Multi-Channel Business Insights:\n")
cat("- Channel contribution analysis: mutate(Total = Revenue_Online + Revenue_Retail + Revenue_Partner)\n")
cat("- Market share by channel: mutate(Online_Share = Revenue_Online / Total * 100)\n")
cat("- Channel performance trends: easy to spot which channels are growing/declining\n")
cat("- Resource allocation decisions: identify which channels need more investment\n")

[1] "Sales Channel Performance (Long Format - Operational Data):"
     Date Channel Revenue Units_Sold
1 2024-01  Online  150000       1500
2 2024-01  Retail  200000        800
3 2024-01 Partner  100000        500
4 2024-02  Online  160000       1600
5 2024-02  Retail  210000        850
6 2024-02 Partner  105000        525
7 2024-03  Online  170000       1700
8 2024-03  Retail  220000        900
9 2024-03 Partner  110000        550
[1] "Revenue by Channel (Wide Format - Channel Comparison Table):"
# A tibble: 3 × 4
  Date    Revenue_Online Revenue_Retail Revenue_Partner
  <chr>            <dbl>          <dbl>           <dbl>
1 2024-01         150000         200000          100000
2 2024-02         160000         210000          10

## Part 4: Advanced Reshaping Techniques and Best Practices

As your data reshaping skills develop, you'll encounter more complex scenarios that require advanced techniques and careful consideration of data quality. This section covers sophisticated reshaping strategies and professional best practices.

### Advanced Reshaping Scenarios

**🔧 Complex Data Structures:**
- Multiple value columns that need different treatment
- Hierarchical data with nested categories
- Mixed data types requiring careful handling

**🎯 Performance Optimization:**
- Efficient reshaping of large datasets
- Memory management during transformation
- Choosing the right approach for your data size

**✅ Data Quality Assurance:**
- Maintaining data integrity during reshaping
- Handling missing values appropriately
- Validating transformation results

### Professional Best Practices

**1. Plan Your Reshaping Strategy:**
- Understand your end goal before starting
- Consider the full analytical workflow
- Plan for data validation at each step

**2. Handle Missing Values Thoughtfully:**
- Decide whether missing values should become NA or 0
- Use the `values_fill` argument of `pivot_wider()` when a fill value such as 0 is justified
- Document your missing value decisions

**3. Maintain Data Lineage:**
- Keep track of transformation steps
- Document business logic behind reshaping decisions
- Create reproducible transformation scripts

**4. Validate Your Results:**
- Check that row/column counts make sense
- Verify that totals are preserved where expected
- Test edge cases and boundary conditions
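
As a sketch of practice 4, the checks on counts and totals could be bundled into a small helper. `check_wide_reshape()` is a hypothetical function written for this illustration, not part of tidyr, and it assumes a single identifier column and a numeric value column.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of tidyr): compares a long table with its
# pivot_wider() result and returns a named list of logical checks
check_wide_reshape <- function(long_df, wide_df, id_col, names_col, value_col) {
  value_cols <- setdiff(names(wide_df), id_col)
  list(
    # One wide row per distinct identifier
    rows_match   = nrow(wide_df) == n_distinct(long_df[[id_col]]),
    # One wide value column per distinct name
    cols_match   = length(value_cols) == n_distinct(long_df[[names_col]]),
    # Grand total is unchanged by the reshape
    totals_match = isTRUE(all.equal(
      sum(long_df[[value_col]], na.rm = TRUE),
      sum(unlist(wide_df[value_cols]), na.rm = TRUE)
    ))
  )
}

long_df <- data.frame(
  Category = rep(c("A", "B"), each = 2),
  Period   = rep(c("P1", "P2"), 2),
  Amount   = c(100, 200, 300, 400)
)
wide_df <- long_df %>% pivot_wider(names_from = Period, values_from = Amount)

checks <- check_wide_reshape(long_df, wide_df, "Category", "Period", "Amount")
print(unlist(checks))
```

Note that the totals check only holds when missing cells are left as `NA` or filled with 0; any other fill value changes the sum by design.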

The examples below demonstrate these advanced concepts in action.

In [12]:
# Advanced Example 1: Handling Multiple Value Columns
# Creating complex business data with multiple metrics per observation
multi_metric_long <- data.frame(
  Region = rep(c("North", "South", "East", "West"), each = 6),  # Geographic regions
  Month = rep(rep(c("Jan", "Feb", "Mar"), each = 2), 4),        # Time periods
  Metric = rep(c("Revenue", "Units_Sold"), 12),                # Multiple business metrics
  Value = c(100000, 1000, 110000, 1100, 120000, 1200,         # North region data
            90000, 900, 95000, 950, 100000, 1000,             # South region data  
            80000, 800, 85000, 850, 90000, 900,               # East region data
            120000, 1200, 130000, 1300, 140000, 1400)         # West region data
)

print("Multi-Metric Business Data (Long Format - Database Extract):")
print(head(multi_metric_long, 12))  # Show first 12 rows for clarity

# Transform to wide format with multiple metrics as separate columns
multi_metric_wide <- multi_metric_long %>%  # Start with long multi-metric data
  pivot_wider(
    names_from = Metric,                     # Use metric names as column headers
    values_from = Value,                     # Use values to fill new columns
    names_prefix = "Monthly_"                # Add prefix to identify monthly metrics
  )

print("Multi-Metric Business Data (Wide Format - Analysis Ready):")
print(multi_metric_wide)

# Demonstrate advanced analytical capabilities
cat("\n📊 Advanced Business Analytics Enabled:\n")
cat("- Revenue per unit calculation: mutate(RPU = Monthly_Revenue / Monthly_Units_Sold)\n")
cat("- Efficiency metrics: compare revenue and units across regions\n")
cat("- Correlation analysis: cor(Monthly_Revenue, Monthly_Units_Sold)\n")
cat("- Multi-dimensional performance tracking: both financial and operational metrics\n")

[1] "Multi-Metric Business Data (Long Format - Database Extract):"
   Region Month     Metric  Value
1   North   Jan    Revenue 100000
2   North   Jan Units_Sold   1000
3   North   Feb    Revenue 110000
4   North   Feb Units_Sold   1100
5   North   Mar    Revenue 120000
6   North   Mar Units_Sold   1200
7   South   Jan    Revenue  90000
8   South   Jan Units_Sold    900
9   South   Feb    Revenue  95000
10  South   Feb Units_Sold    950
11  South   Mar    Revenue 100000
12  South   Mar Units_Sold   1000
[1] "Multi-Metric Business Data (Wide Format - Analysis Ready):"
# A tibble: 12 × 4
   Region Month Monthly_Revenue Monthly_Units_Sold
   <chr>  <chr>           <dbl>              <dbl>
 1 North  Jan            100000               1000
 2 North  Feb            110000               1100
 3 North  Mar

In [13]:
# Advanced Example 2: Handling Missing Values and Data Quality
# Creating realistic business data with some missing observations
incomplete_sales_data <- data.frame(
  Product = c("A", "A", "A", "B", "B", "C", "C"),              # Product identifiers
  Quarter = c("Q1", "Q2", "Q4", "Q1", "Q3", "Q2", "Q4"),      # Note: missing Q3 for A; Q2 and Q4 for B; Q1 and Q3 for C
  Sales = c(1000, 1200, 1100, 800, 900, 600, 650)             # Sales figures
)

print("Incomplete Sales Data (Long Format - Real-world with gaps):")
print(incomplete_sales_data)

# Transform to wide format, explicitly handling missing values
sales_matrix_filled <- incomplete_sales_data %>%  # Start with incomplete data
  pivot_wider(
    names_from = Quarter,                     # Use quarters as column names
    values_from = Sales,                      # Use sales values to fill columns
    values_fill = 0                           # Fill missing combinations with 0
                                              # This assumes missing = no sales (business decision)
  )

print("Sales Matrix with Missing Values Filled (Wide Format - Complete Analysis Grid):")
print(sales_matrix_filled)

# Alternative approach: Keep missing values as NA for different analytical needs
sales_matrix_na <- incomplete_sales_data %>%    # Start with incomplete data again
  pivot_wider(
    names_from = Quarter,                       # Use quarters as column names  
    values_from = Sales                         # Use sales values, leave missing as NA
    # No values_fill specified - missing combinations remain NA
  )

print("Sales Matrix with NA Values (Wide Format - Preserves Missing Data Context):")
print(sales_matrix_na)

# Explain business implications of missing value decisions
cat("\n⚠️  Missing Value Strategy - Business Considerations:\n")
cat("- values_fill = 0: Assumes missing means no sales occurred (common for new products)\n")
cat("- No values_fill (the default): Missing combinations stay NA (useful when data was not collected)\n")
cat("- Business rule: Choose based on what missing data actually means in your context\n")
cat("- Documentation: Always document your missing value assumptions for stakeholders\n")

[1] "Incomplete Sales Data (Long Format - Real-world with gaps):"
  Product Quarter Sales
1       A      Q1  1000
2       A      Q2  1200
3       A      Q4  1100
4       B      Q1   800
5       B      Q3   900
6       C      Q2   600
7       C      Q4   650
[1] "Sales Matrix with Missing Values Filled (Wide Format - Complete Analysis Grid):"
# A tibble: 3 × 5
  Product    Q1    Q2    Q4    Q3
  <chr>   <dbl> <dbl> <dbl> <dbl>
1 A        1000  1200  1100     0
2 B         800     0     0   900
3 C           0   600   650     0
[1] "Sales Matrix with NA Values (Wide Format - Preserves Missing Data Context):"
# A tibble: 3 × 5
  Product    Q1    Q2    Q4    Q3
  <chr>   <dbl> <dbl> <dbl> <dbl>
1 A        100

In [14]:
# Advanced Example 3: Data Validation After Reshaping
# Demonstrating how to verify that reshaping preserved data integrity
validation_data_long <- data.frame(
  Category = rep(c("A", "B", "C"), each = 4),                  # Product categories
  Period = rep(c("P1", "P2", "P3", "P4"), 3),                 # Time periods
  Amount = c(100, 200, 150, 250,    # Category A amounts
             300, 400, 350, 450,    # Category B amounts  
             200, 300, 250, 350)    # Category C amounts
)

print("Original Long Data for Validation Demo:")
print(validation_data_long)

# Calculate total before reshaping for validation
original_total <- sum(validation_data_long$Amount)
original_count <- nrow(validation_data_long)

cat("📊 Original Data Metrics:\n")
cat("Total Amount:", original_total, "\n")
cat("Number of Records:", original_count, "\n")

# Reshape to wide format
validation_data_wide <- validation_data_long %>%  # Start with long validation data
  pivot_wider(
    names_from = Period,                      # Use periods as column names
    values_from = Amount                      # Use amounts to fill columns
  )

print("Reshaped Wide Data:")
print(validation_data_wide)

# Validate the reshaping preserved data integrity
wide_total <- sum(validation_data_wide[, -1], na.rm = TRUE)  # Sum all numeric columns (excluding Category)
wide_records <- nrow(validation_data_wide) * (ncol(validation_data_wide) - 1)  # Categories × Periods

cat("\n✅ Data Integrity Validation:\n")
cat("Original Total Amount:", original_total, "\n")
cat("Reshaped Total Amount:", wide_total, "\n")
cat("Totals Match:", original_total == wide_total, "\n")
cat("Expected Records after Reshaping:", nrow(validation_data_wide), "categories ×", 
    (ncol(validation_data_wide) - 1), "periods\n")

# Demonstrate reverse validation by converting back to long
validation_back_to_long <- validation_data_wide %>%  # Start with wide data
  pivot_longer(
    cols = -Category,                         # Reshape all columns except Category
    names_to = "Period",                      # Create Period column
    values_to = "Amount"                      # Create Amount column
  ) %>%
  arrange(Category, Period)                   # Sort to match original order

print("Converted Back to Long (Should Match Original):")
print(validation_back_to_long)

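# Final validation check
back_total <- sum(validation_back_to_long$Amount, na.rm = TRUE)
# Compare as plain data frames: the original is a data.frame and the round-trip
# result is a tibble, so all.equal() would otherwise report a class mismatch
data_integrity_check <- isTRUE(all.equal(
  as.data.frame(validation_data_long %>% arrange(Category, Period)),
  as.data.frame(validation_back_to_long)
))

cat("\n🔍 Round-trip Validation:\n")
cat("Back-to-Long Total:", back_total, "\n")
cat("Perfect Round-trip:", data_integrity_check, "\n")
# Only report success when the check actually passed
if (data_integrity_check) cat("✅ Data reshaping preserved all information correctly!\n")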

[1] "Original Long Data for Validation Demo:"
   Category Period Amount
1         A     P1    100
2         A     P2    200
3         A     P3    150
4         A     P4    250
5         B     P1    300
6         B     P2    400
7         B     P3    350
8         B     P4    450
9         C     P1    200
10        C     P2    300
11        C     P3    250
12        C     P4    350
📊 Original Data Metrics:
Total Amount: 3300 
Number of Records: 12 
[1] "Reshaped Wide Data:"
# A tibble: 3 × 5
  Category    P1    P2    P3    P4
  <chr>    <dbl> <dbl> <dbl> <dbl>
1 A          100   200   150   250
2 B          300   400   350   450
3 C          200   300   250   350

✅ Data Integrity Validation:
Original Total Amount: 3300 
Reshaped Total Amount: 3300 
Totals Match: TRUE 
Expected Records after Reshaping: 3 categories × 4 periods
[1] "Converted Back to Long

## Practical Exercise: Monthly Sales Analysis

**Business Scenario:** You're a business analyst tasked with preparing monthly sales data for two different purposes:
1. **Trend Analysis**: Create visualizations showing how each product performs over time
2. **Executive Summary**: Create a comparison table for the monthly business review meeting

**Your Task:** Transform the provided monthly sales data to support both analytical needs, demonstrating when and why to use each format.

**Dataset Context:** The data represents three months of sales for different product lines in a retail business. You'll practice both reshaping directions and explain the business value of each transformation.

**Learning Objectives:**
- Apply `pivot_longer()` for time series analysis preparation
- Apply `pivot_wider()` for executive reporting
- Understand strategic data formatting decisions
- Practice data validation techniques

**Business Impact:** Your analysis will inform inventory planning, marketing budget allocation, and product strategy decisions.

In [15]:
# Exercise Setup: Monthly Product Sales Data
# Creating realistic monthly sales data typical of retail business reports
monthly_sales_exercise <- data.frame(
  Month = c("January", "February", "March"),          # Business tracking periods
  ProductA_Sales = c(15000, 18000, 16500),          # Product A monthly sales figures
  ProductB_Sales = c(22000, 25000, 23500),          # Product B monthly sales figures  
  ProductC_Sales = c(12000, 14500, 13800),          # Product C monthly sales figures
  ProductD_Sales = c(8000, 9200, 8800)              # Product D monthly sales figures
)

print("📊 EXERCISE DATA: Monthly Product Sales (Wide Format)")
print("Source: Retail Business Monthly Report")
print(monthly_sales_exercise)

# Exercise Part 1: Transform for Trend Analysis
cat("\n🎯 PART 1: PREPARE DATA FOR TREND ANALYSIS\n")
cat("Business Need: Create time series charts showing product performance trends\n")
cat("Required Format: Long format for ggplot2 visualization\n\n")

# TODO for students: Transform to long format
# Hint: Use pivot_longer() with cols = starts_with("Product")
# Create columns: Month, Product, Sales

# SOLUTION (for instructor reference):
trend_analysis_data <- monthly_sales_exercise %>%  # Start with wide sales data
  pivot_longer(
    cols = starts_with("Product"),            # Select all product sales columns
    names_to = "Product",                     # Create Product column for product names
    names_pattern = "(.*)_Sales",             # Extract product name, remove "_Sales" suffix  
    values_to = "Sales"                       # Create Sales column for values
  )

print("✅ SOLUTION - Trend Analysis Data (Long Format):")
print(trend_analysis_data)

# Demonstrate analytical capabilities enabled
cat("\n📈 Analysis Capabilities Enabled:\n")
cat("- Time series visualization: ggplot(aes(x = Month, y = Sales, color = Product))\n")
cat("- Growth rate calculation: group_by(Product) %>% mutate(growth = Sales/lag(Sales) - 1)\n")
cat("- Product ranking by month: group_by(Month) %>% mutate(rank = rank(-Sales))\n")
cat("- Statistical analysis: correlation between products over time\n")

[1] "📊 EXERCISE DATA: Monthly Product Sales (Wide Format)"
[1] "Source: Retail Business Monthly Report"
     Month ProductA_Sales ProductB_Sales ProductC_Sales ProductD_Sales
1  January          15000          22000          12000           8000
2 February          18000          25000          14500           9200
3    March          16500          23500          13800           8800

🎯 PART 1: PREPARE DATA FOR TREND ANALYSIS
Business Need: Create time series charts showing product performance trends
Required Format: Long format for ggplot2 visualization

[1] "✅ SOLUTION - Trend Analysis Data (Long Format):"
# A tibble: 12 × 3
   Month    Product  Sales
   <chr>    <chr>    <dbl>
 1 January  ProductA 15000
 2 January  ProductB 22000
 3 January  ProductC 12000
 4 January  ProductD  8000
 5 February ProductA

In [16]:
# Exercise Part 2: Transform for Executive Summary
cat("\n🎯 PART 2: PREPARE DATA FOR EXECUTIVE SUMMARY\n")
cat("Business Need: Create comparison table for monthly business review meeting\n")
cat("Required Format: Wide format for easy side-by-side comparison\n")
cat("Additional Requirement: Add total and percentage columns\n\n")

# Start with our long format data from Part 1
print("Starting with Long Format Data:")
print(trend_analysis_data)

# TODO for students: Transform back to wide format with enhancements
# Hint: Use pivot_wider() with names_from = Product, values_from = Sales
# Then add Total and percentage calculations

# SOLUTION (for instructor reference):
executive_summary <- trend_analysis_data %>%     # Start with long trend data
  pivot_wider(
    names_from = Product,                        # Use product names as column headers
    values_from = Sales,                         # Use sales values to fill columns
    names_prefix = "Sales_"                      # Add prefix for clarity
  ) %>%
  mutate(
    # Add business intelligence calculations
    Total_Sales = Sales_ProductA + Sales_ProductB + Sales_ProductC + Sales_ProductD,  # Monthly totals
    ProductA_Share = round((Sales_ProductA / Total_Sales) * 100, 1),    # Market share percentages
    ProductB_Share = round((Sales_ProductB / Total_Sales) * 100, 1),
    ProductC_Share = round((Sales_ProductC / Total_Sales) * 100, 1),
    ProductD_Share = round((Sales_ProductD / Total_Sales) * 100, 1)
  )

print("✅ SOLUTION - Executive Summary Table (Enhanced Wide Format):")
print(executive_summary)

# Explain executive value
cat("\n💼 Executive Dashboard Benefits:\n")
cat("- Quick visual comparison of product performance across months\n")
cat("- Clear total sales tracking for overall business health\n")
cat("- Market share percentages for product portfolio analysis\n")
cat("- Ready for PowerPoint inclusion in board presentations\n")


🎯 PART 2: PREPARE DATA FOR EXECUTIVE SUMMARY
Business Need: Create comparison table for monthly business review meeting
Required Format: Wide format for easy side-by-side comparison
Additional Requirement: Add total and percentage columns

[1] "Starting with Long Format Data:"
# A tibble: 12 × 3
   Month    Product  Sales
   <chr>    <chr>    <dbl>
 1 January  ProductA 15000
 2 January  ProductB 22000
 3 January  ProductC 12000
 4 January  ProductD  8000
 5 February ProductA 18000
 6 February ProductB 25000
 7 February ProductC 14500
 8 February ProductD  9200
 9 March    ProductA 16500
10 March    ProductB 23500
11 March    ProductC 13

In [17]:
# Exercise Part 3: Data Validation and Business Insights
cat("\n🎯 PART 3: VALIDATION AND BUSINESS INSIGHTS\n")
cat("Professional Practice: Always validate your transformations\n\n")

# Validation 1: Check that totals are preserved
original_grand_total <- sum(monthly_sales_exercise[, -1])  # Sum all sales columns except Month
long_format_total <- sum(trend_analysis_data$Sales)
wide_format_total <- sum(executive_summary$Total_Sales)

cat("📊 Data Integrity Validation:\n")
cat("Original Data Total:", original_grand_total, "\n")
cat("Long Format Total:", long_format_total, "\n") 
cat("Wide Format Total:", wide_format_total, "\n")
cat("All Totals Match:", original_grand_total == long_format_total && long_format_total == wide_format_total, "\n\n")

# Validation 2: Check percentage calculations
march_percentages <- executive_summary$ProductA_Share[3] + executive_summary$ProductB_Share[3] + 
                    executive_summary$ProductC_Share[3] + executive_summary$ProductD_Share[3]

cat("🔍 Business Logic Validation:\n")
cat("March Percentages Sum to:", march_percentages, "% (should be ~100%)\n")
cat("Percentage Calculation Valid:", abs(march_percentages - 100) < 0.1, "\n\n")

# Business Insights Generation
cat("💡 KEY BUSINESS INSIGHTS FROM EXERCISE:\n")
cat("════════════════════════════════════════\n")

# Find best performing product
best_product_march <- names(which.max(executive_summary[3, c("Sales_ProductA", "Sales_ProductB", "Sales_ProductC", "Sales_ProductD")]))
best_product_march <- gsub("Sales_", "", best_product_march)

# Calculate growth rates
feb_to_march_growth <- ((executive_summary$Total_Sales[3] - executive_summary$Total_Sales[2]) / 
                       executive_summary$Total_Sales[2]) * 100

cat("1. Top Performer in March:", best_product_march, "\n")
cat("2. February to March Growth:", round(feb_to_march_growth, 1), "%\n")
cat("3. Strongest Product (March):", best_product_march, "with", 
    max(executive_summary[3, c("ProductA_Share", "ProductB_Share", "ProductC_Share", "ProductD_Share")]), 
    "% market share\n")

cat("\n🎯 Strategic Recommendations:\n")
cat("- Focus marketing budget on top-performing products\n")
cat("- Investigate the drivers of the February-to-March sales decline\n")
cat("- Consider product portfolio optimization based on market share analysis\n")
cat("- Use trend data for inventory planning and demand forecasting\n")


🎯 PART 3: VALIDATION AND BUSINESS INSIGHTS
Professional Practice: Always validate your transformations

📊 Data Integrity Validation:
Original Data Total: 186300 
Long Format Total: 186300 
Wide Format Total: 186300 
All Totals Match: TRUE 

🔍 Business Logic Validation:
March Percentages Sum to: 100 % (should be ~100%)
Percentage Calculation Valid: TRUE 

💡 KEY BUSINESS INSIGHTS FROM EXERCISE:
════════════════════════════════════════
1. Top Performer in March: ProductB 
2. February to March Growth: -6.1 %
3. Strongest Product (March): ProductB with 37.5 % market share

🎯 Strategic Recommendations:
- Focus marketing budget on top-performing products
- Investigate the drivers of the February-to-March sales decline
- Consider product portfolio optimization based on market share analysis
- Use trend data for inventory planning and demand forecasting


## Key Takeaways and Best Practices

### 🎯 **Core Functions Mastered:**

1. **`pivot_longer()`** - Transform wide data to long format for analysis
2. **`pivot_wider()`** - Transform long data to wide format for reporting
3. **Strategic reshaping** - Choose format based on analytical goals

### 💡 **Professional Best Practices:**

- **Plan before reshaping**: Understand your end goal and choose the appropriate format
- **Validate transformations**: Always check that data integrity is preserved
- **Handle missing values thoughtfully**: Use the `values_fill` argument when a fill value is justified; otherwise leave gaps as `NA`
- **Document your decisions**: Explain business logic behind reshaping choices
- **Consider your audience**: Stakeholders often prefer wide format for readability
- **Test with small datasets**: Validate your reshaping logic before applying to large data

### 🔍 **Strategic Reshaping Guidelines:**

**Use Long Format For:**
- Time series analysis and forecasting
- Statistical modeling with categorical variables
- Most ggplot2 visualizations (line charts, grouped bars)
- Data that needs to be filtered or grouped by the reshaped variable

**Use Wide Format For:**
- Executive reports and summary tables
- Side-by-side comparisons
- Correlation analysis between variables
- Data export to Excel or other business tools
- Heat map tables and matrix-based heat maps (ggplot2's `geom_tile()` still expects long format)

### 🚀 **Next Steps:**

In future lessons, you'll learn about:
- Joining datasets from multiple sources
- Advanced data cleaning techniques  
- Handling complex nested data structures
- Automated data processing pipelines
- Advanced statistical analysis methods

### 📋 **Reshaping Checklist:**

- ✅ **Understand the business need** - Why are you reshaping?
- ✅ **Plan the transformation** - What should the end result look like?
- ✅ **Handle missing values** - What do gaps in your data mean?
- ✅ **Validate the results** - Are totals preserved? Do percentages add up?
- ✅ **Document your process** - Can others understand and reproduce your work?
- ✅ **Consider the audience** - What format best serves your stakeholders?

Remember: Data reshaping is not just a technical skill; it is a strategic tool that enables better analysis and clearer communication of business insights. With `pivot_longer()` and `pivot_wider()` in hand, you can move data into whichever structure your analysis or audience requires.