# Homework Assignment - Lesson 5: Data Reshaping with tidyr

**Due Date:** [Insert Due Date Here]

**Instructions:**

- Complete the following tasks in this R notebook
- Use the pipe operator (`%>%`) and chain operations wherever possible
- Ensure your code is well-commented and easy to understand
- Submit your completed notebook file

---

## Part 1: Data Import and Setup

1. **Data Import:**
   - Download the following files from the course materials:
     - `quarterly_sales_wide.csv` - Sales data in wide format with quarters as columns
     - `survey_responses_long.csv` - Survey data in long format
     - `employee_skills_wide.csv` - Employee skills matrix in wide format
   - Import each file into appropriately named data frames.
   - Load the `tidyverse` package.

2. **Initial Exploration:**
   - Examine the structure of each dataset using `str()` and `head()`.
   - Identify which datasets are in "wide" format and which are in "long" format.
   - Note any patterns in column names that might be useful for reshaping.
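
As a hedged illustration of the last point, column-name patterns can be checked programmatically before reshaping. The sketch below uses base R only. The small data frame `sales_demo` and the regular expression are assumptions chosen to mirror the `Q1_2023` naming style; they are not part of the course files.

```r
# Hypothetical stand-in for quarterly_sales_wide (two regions only)
sales_demo <- data.frame(
  Region  = c("North", "South"),
  Q1_2023 = c(45000, 52000),
  Q2_2023 = c(48000, 55000)
)

# Columns named like "Q<quarter>_<year>" are candidates for pivot_longer()
quarter_cols <- grep("^Q[1-4]_[0-9]{4}$", names(sales_demo), value = TRUE)

# Everything else is treated as an identifying column to keep
id_cols <- setdiff(names(sales_demo), quarter_cols)

print(quarter_cols)
print(id_cols)
```

A consistent separator (here `_`) suggests `names_sep` will work later, while irregular names usually call for `names_pattern`.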

---

In [4]:
# Load the tidyverse (provides tidyr, dplyr, and the %>% pipe used below)
library(tidyverse)

# Create sample quarterly_sales_wide dataset
quarterly_sales_wide <- data.frame(
  Region = c("North", "South", "East", "West", "Central"),
  Q1_2023 = c(45000, 52000, 48000, 41000, 39000),
  Q2_2023 = c(48000, 55000, 51000, 43000, 42000),
  Q3_2023 = c(52000, 58000, 53000, 46000, 44000),
  Q4_2023 = c(55000, 61000, 56000, 48000, 47000)
)

# Create sample survey_responses_long dataset
survey_responses_long <- data.frame(
  RespondentID = rep(1:5, each = 3),
  Question = rep(c("Satisfaction", "Quality", "Value"), 5),
  Response = c(4, 5, 4, 3, 4, 3, 5, 5, 4, 4, 4, 3, 3, 3, 2)
)

# Create sample employee_skills_wide dataset
employee_skills_wide <- data.frame(
  EmployeeID = c(101, 102, 103, 104, 105),
  Name = c("John Smith", "Jane Doe", "Bob Johnson", "Alice Williams", "Charlie Brown"),
  Python = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  R = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  SQL = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  Excel = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Now run your exploration code
cat("=== QUARTERLY SALES DATA ===\n")
str(quarterly_sales_wide)
print(head(quarterly_sales_wide))

cat("\n=== SURVEY RESPONSES DATA ===\n")
str(survey_responses_long)
print(head(survey_responses_long))

cat("\n=== EMPLOYEE SKILLS DATA ===\n")
str(employee_skills_wide)
print(head(employee_skills_wide))

=== QUARTERLY SALES DATA ===
'data.frame':	5 obs. of  5 variables:
 $ Region : chr  "North" "South" "East" "West" ...
 $ Q1_2023: num  45000 52000 48000 41000 39000
 $ Q2_2023: num  48000 55000 51000 43000 42000
 $ Q3_2023: num  52000 58000 53000 46000 44000
 $ Q4_2023: num  55000 61000 56000 48000 47000
   Region Q1_2023 Q2_2023 Q3_2023 Q4_2023
1   North   45000   48000   52000   55000
2   South   52000   55000   58000   61000
3    East   48000   51000   53000   56000
4    West   41000   43000   46000   48000
5 Central   39000   42000   44000   47000

=== SURVEY RESPONSES DATA ===
'data.frame':	15 obs. of  3 variables:
 $ RespondentID: int  1 1 1 2 2 2 3 3 3 4 ...
 $ Question    : chr  "Satisfaction" "Quality" "Value" "Satisfaction" ...
 $ Response    : num  4 5 4 3 4 3 5 5 4 4 ...
  RespondentID     Question Response
1            1 Satisfaction        4
2            1      Quality        5
3            1        Value        4
4            2 Satisfaction        3
5            2      Q

## Part 2: Converting Wide to Long with `pivot_longer()`

1. **Basic Wide to Long Conversion:**
   - Using the `quarterly_sales_wide` dataset, convert it from wide to long format:
     - The quarter columns (e.g., `Q1_2023`, `Q2_2023`, etc.) should become values in a new column called `Quarter`
     - The sales values should go into a new column called `Sales_Amount`
     - Keep all other identifying columns (e.g., `Region`, `Product_Category`)
   - Store the result in a data frame called `quarterly_sales_long`.

2. **Advanced Wide to Long with Name Parsing:**
   - If the quarter columns contain both year and quarter information (e.g., `Q1_2023`, `Q2_2023`), use `names_sep` or `names_pattern` to separate this into two columns: `Quarter` and `Year`.
   - Store the result in a data frame called `quarterly_sales_parsed`.

3. **Employee Skills Conversion:**
   - Using the `employee_skills_wide` dataset, convert it from wide to long format:
     - Skill columns (e.g., `R_Programming`, `Excel`, `SQL`) should become values in a column called `Skill`
     - The proficiency levels should go into a column called `Proficiency_Level`
     - Keep employee identifying information
   - Store the result in a data frame called `employee_skills_long`.
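
Task 2 mentions `names_pattern` as an alternative to `names_sep`. The following is a minimal sketch of that variant, assuming column names of the form `Q1_2023`. The data frame `sales_demo` and the result name `sales_parsed_demo` are hypothetical, and `names_transform` is used here only to show one way of getting an integer `Year` directly.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for quarterly_sales_wide (two regions only)
sales_demo <- data.frame(
  Region  = c("North", "South"),
  Q1_2023 = c(45000, 52000),
  Q2_2023 = c(48000, 55000)
)

sales_parsed_demo <- sales_demo %>%
  pivot_longer(
    cols = -Region,
    names_to = c("Quarter", "Year"),
    # Each pair of parentheses captures one output column
    names_pattern = "(Q[1-4])_([0-9]{4})",
    # Convert the captured year from character to integer
    names_transform = list(Year = as.integer),
    values_to = "Sales_Amount"
  )

print(sales_parsed_demo)
```

`names_sep` is simpler when a single separator splits every name cleanly. `names_pattern` is worth the extra regular expression when names contain the separator more than once or have prefixes to discard.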

---

In [5]:
# Task 2.1: Basic Wide to Long Conversion - Quarterly Sales
quarterly_sales_long <- quarterly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "Quarter",
    values_to = "Sales"
  )

print("Quarterly Sales - Long Format:")
print(head(quarterly_sales_long))

[1] "Quarterly Sales - Long Format:"
# A tibble: 6 × 3
  Region Quarter Sales
  <chr>  <chr>   <dbl>
1 North  Q1_2023 45000
2 North  Q2_2023 48000
3 North  Q3_2023 52000
4 North  Q4_2023 55000
5 South  Q1_2023 52000
6 South  Q2_2023 55000


In [6]:
# Task 2.2: Advanced Wide to Long with Name Parsing
quarterly_sales_parsed <- quarterly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),              # Select all columns starting with "Q"
    names_to = c("Quarter", "Year"),      # Split into Quarter and Year columns
    names_sep = "_",                      # Separator is underscore
    values_to = "Sales"                   # Sales values go here
  )

print("Quarterly Sales - Parsed Format:")
print(head(quarterly_sales_parsed))

[1] "Quarterly Sales - Parsed Format:"
# A tibble: 6 × 4
  Region Quarter Year  Sales
  <chr>  <chr>   <chr> <dbl>
1 North  Q1      2023  45000
2 North  Q2      2023  48000
3 North  Q3      2023  52000
4 North  Q4      2023  55000
5 South  Q1      2023  52000
6 South  Q2      2023  55000


In [7]:
# Task 2.3: Employee Skills Wide to Long
employee_skills_long <- employee_skills_wide %>%
  pivot_longer(
    cols = c(Python, R, SQL, Excel),    # Select all skill columns
    names_to = "Skill",                 # Name of the skill
    values_to = "HasSkill"              # TRUE/FALSE proficiency value
  )

print("Employee Skills - Long Format:")
print(head(employee_skills_long))

[1] "Employee Skills - Long Format:"
# A tibble: 6 × 4
  EmployeeID Name       Skill  HasSkill
       <dbl> <chr>      <chr>  <lgl>
1        101 John Smith Python TRUE
2        101 John Smith R      TRUE
3        101 John Smith SQL    TRUE
4        101 John Smith Excel  TRUE
5        102 Jane Doe   Python TRUE
6        102 Jane Doe   R      FALSE


## Part 3: Converting Long to Wide with `pivot_wider()`

1. **Basic Long to Wide Conversion:**
   - Using the `survey_responses_long` dataset (which should have columns like `Respondent_ID`, `Question`, `Response`), convert it to wide format:
     - Each unique question should become a separate column
     - The responses should fill the cells
     - Each row should represent one respondent
   - Store the result in a data frame called `survey_responses_wide`.

2. **Aggregated Long to Wide:**
   - Using your `quarterly_sales_long` data from Part 2, create a wide format where:
     - Each region becomes a column
     - Each row represents a quarter-year combination
     - The values are the total sales for that region in that quarter
   - Store the result in a data frame called `sales_by_region_wide`.

3. **Skills Matrix Creation:**
   - Using your `employee_skills_long` data from Part 2, create a skills matrix where:
     - Each skill becomes a column
     - Each row represents an employee
     - The values are the proficiency levels
   - Store the result in a data frame called `skills_matrix`.
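
Task 2 asks for total sales per region and quarter. When the long data has more than one row per region/quarter pair, `pivot_wider()` needs to be told how to combine them. The sketch below shows one way to do that with `values_fn`. The data frame `sales_dup` is invented for illustration and assumes duplicate rows exist, which is not the case in the sample data used in this notebook.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: North has two rows for Q1
sales_dup <- data.frame(
  Quarter = c("Q1", "Q1", "Q1", "Q2", "Q2"),
  Region  = c("North", "North", "South", "North", "South"),
  Sales   = c(100, 50, 200, 120, 210)
)

sales_totals_wide <- sales_dup %>%
  pivot_wider(
    names_from = Region,
    values_from = Sales,
    values_fn = sum      # Collapse duplicate Quarter/Region rows by summing
  )

print(sales_totals_wide)
```

Without `values_fn`, duplicate keys produce list-columns and a warning, so summing explicitly (or calling `group_by()` and `summarise()` before pivoting) keeps the result a plain numeric table.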

---

In [8]:
# Task 3.1: Survey Responses Long to Wide
survey_responses_wide <- survey_responses_long %>%
  pivot_wider(
    names_from = Question,      # Column containing question names
    values_from = Response      # Column containing the response values
  )

print("Survey Responses - Wide Format:")
print(head(survey_responses_wide))

[1] "Survey Responses - Wide Format:"
# A tibble: 5 × 4
  RespondentID Satisfaction Quality Value
         <int>        <dbl>   <dbl> <dbl>
1            1            4       5     4
2            2            3       4     3
3            3            5       5     4
4            4            4       4     3
5            5            3       3     2

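One practical reason to pivot survey data wide is that each question becomes its own numeric column, which is the shape `cor()` expects. The sketch below rebuilds the sample survey data under the assumed name `survey_long_demo` and computes a question-by-question correlation matrix. With only five respondents the coefficients are illustrative rather than meaningful.

```r
library(dplyr)
library(tidyr)

# Same values as the sample survey_responses_long data in this notebook
survey_long_demo <- data.frame(
  RespondentID = rep(1:5, each = 3),
  Question = rep(c("Satisfaction", "Quality", "Value"), 5),
  Response = c(4, 5, 4, 3, 4, 3, 5, 5, 4, 4, 4, 3, 3, 3, 2)
)

question_cor <- survey_long_demo %>%
  pivot_wider(names_from = Question, values_from = Response) %>%
  select(-RespondentID) %>%   # Keep only the numeric question columns
  cor()                       # Pearson correlation between questions

print(round(question_cor, 2))
```
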

In [9]:
# Task 3.2: Aggregated Long to Wide - Sales by Region
sales_by_region_wide <- quarterly_sales_long %>%
  pivot_wider(
    names_from = Region,      # Column containing region names
    values_from = Sales       # Column containing sales values
  )

print("Sales by Region - Wide Format:")
print(head(sales_by_region_wide))

[1] "Sales by Region - Wide Format:"
# A tibble: 4 × 6
  Quarter North South  East  West Central
  <chr>   <dbl> <dbl> <dbl> <dbl>   <dbl>
1 Q1_2023 45000 52000 48000 41000   39000
2 Q2_2023 48000 55000 51000 43000   42000
3 Q3_2023 52000 58000 53000 46000   44000
4 Q4_2023 55000 61000 56000 48000   47000


In [10]:
# Task 3.3: Skills Matrix Creation
skills_matrix <- employee_skills_long %>%
  pivot_wider(
    names_from = Skill,        # Column containing skill names
    values_from = HasSkill     # Column containing TRUE/FALSE values
  )

print("Skills Matrix:")
print(head(skills_matrix))

[1] "Skills Matrix:"
# A tibble: 5 × 6
  EmployeeID Name           Python R     SQL   Excel
       <dbl> <chr>          <lgl>  <lgl> <lgl> <lgl>
1        101 John Smith     TRUE   TRUE  TRUE  TRUE
2        102 Jane Doe       TRUE   FALSE TRUE  TRUE
3        103 Bob Johnson    FALSE  TRUE  TRUE  TRUE
4        104 Alice Williams TRUE   TRUE  FALSE TRUE
5        105 Charlie Brown  FALSE  TRUE  TRUE  FALSE


## Part 4: Complex Reshaping Scenarios

1. **Multiple Value Columns:**
   - Create a dataset that has both `Sales_Amount` and `Profit_Amount` for each quarter and region.
   - Convert this to long format where you have separate rows for sales and profit, with a column indicating the metric type.
   - Then convert it back to wide format with quarters as columns.

2. **Handling Missing Values in Reshaping:**
   - When reshaping your data, some combinations might not exist (e.g., an employee might not have a rating for every skill).
   - Demonstrate how `pivot_wider()` handles missing values and how you can control this behavior using the `values_fill` argument.

3. **Nested Reshaping:**
   - Take your `quarterly_sales_long` data and create a summary that shows:
     - Average sales by product category and quarter
     - Convert this to wide format with quarters as columns
     - Then convert back to long format but group quarters into "H1" (Q1, Q2) and "H2" (Q3, Q4)
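
The H1/H2 grouping in Task 3 can be sketched as follows. The sample data in this notebook has no product category column, so this example groups by `Region` instead. The data frame `sales_wide_demo`, the `Half` column, and the `Avg_Sales` summary are all names assumed for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide summary with one column per quarter
sales_wide_demo <- data.frame(
  Region = c("North", "South"),
  Q1 = c(45000, 52000),
  Q2 = c(48000, 55000),
  Q3 = c(52000, 58000),
  Q4 = c(55000, 61000)
)

half_year_sales <- sales_wide_demo %>%
  # Back to long format: one row per region and quarter
  pivot_longer(cols = Q1:Q4, names_to = "Quarter", values_to = "Sales") %>%
  # Map quarters onto half-years
  mutate(Half = if_else(Quarter %in% c("Q1", "Q2"), "H1", "H2")) %>%
  group_by(Region, Half) %>%
  summarise(Avg_Sales = mean(Sales), .groups = "drop")

print(half_year_sales)
```
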

---

In [12]:
# Task 4.1: Multiple Value Columns
# Create dataset with Sales_Amount and Profit_Amount
quarterly_financials_wide <- data.frame(
  Region = c("North", "South", "East", "West", "Central"),
  Q1_Sales = c(45000, 52000, 48000, 41000, 39000),
  Q1_Profit = c(9000, 10400, 9600, 8200, 7800),
  Q2_Sales = c(48000, 55000, 51000, 43000, 42000),
  Q2_Profit = c(9600, 11000, 10200, 8600, 8400),
  Q3_Sales = c(52000, 58000, 53000, 46000, 44000),
  Q3_Profit = c(10400, 11600, 10600, 9200, 8800),
  Q4_Sales = c(55000, 61000, 56000, 48000, 47000),
  Q4_Profit = c(11000, 12200, 11200, 9600, 9400)
)

print("Original Wide Format with Multiple Value Columns:")
print(quarterly_financials_wide)

# Reshape to long format with multiple value columns
quarterly_financials_long <- quarterly_financials_wide %>%
  pivot_longer(
    cols = -Region,                           # All columns except Region
    names_to = c("Quarter", ".value"),        # Quarter and metric type
    names_sep = "_"                           # Separator between Q1 and Sales/Profit
  )

print("\nLong Format with Multiple Value Columns:")
print(head(quarterly_financials_long, 10))

# Alternative approach: separate names_to columns
quarterly_financials_long_alt <- quarterly_financials_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = c("Quarter", "Metric"),
    names_sep = "_",
    values_to = "Amount"
  )

print("\nAlternative Long Format:")
print(head(quarterly_financials_long_alt, 10))


[1] "Original Wide Format with Multiple Value Columns:"
   Region Q1_Sales Q1_Profit Q2_Sales Q2_Profit Q3_Sales Q3_Profit Q4_Sales
1   North    45000      9000    48000      9600    52000     10400    55000
2   South    52000     10400    55000     11000    58000     11600    61000
3    East    48000      9600    51000     10200    53000     10600    56000
4    West    41000      8200    43000      8600    46000      9200    48000
5 Central    39000      7800    42000      8400    44000      8800    47000
  Q4_Profit
1     11000
2     12200
3     11200
4      9600
5      9400
[1] "\nLong Format with Multiple Value Columns:"
# A tibble: 10 × 4
   Region Quarter Sales Profit
   <chr>  <chr>   <dbl>  <dbl>
 1 North  Q1      45000   9000
 2 North  Q2      48000   9600
 3 North  Q3      52000  1

In [13]:
# Task 4.2: Handling Missing Values
# Create dataset with missing values
incomplete_sales <- data.frame(
  Region = c("North", "North", "North", "South", "South", "East", "East", "East", "West"),
  Quarter = c("Q1", "Q2", "Q4", "Q1", "Q3", "Q2", "Q3", "Q4", "Q1"),
  Sales = c(45000, 48000, 55000, 52000, 58000, 51000, 53000, 56000, 41000)
)

print("Original Data with Missing Quarters:")
print(incomplete_sales)

# Default behavior - NAs for missing values
sales_wide_default <- incomplete_sales %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales
  )

print("\nDefault Behavior (NAs for missing values):")
print(sales_wide_default)

# Using values_fill to replace NAs with 0
sales_wide_filled <- incomplete_sales %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales,
    values_fill = 0                    # Fill missing values with 0
  )

print("\nWith values_fill = 0:")
print(sales_wide_filled)

# Using the named-list form of values_fill (one entry per value column)
sales_wide_filled_alt <- incomplete_sales %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales,
    values_fill = list(Sales = -999)   # Use -999 to indicate missing data
  )

print("\nWith values_fill = -999 (missing data indicator):")
print(sales_wide_filled_alt)

# Check for missing values
print("\nMissing value summary (default):")
print(colSums(is.na(sales_wide_default)))

print("\nMissing value summary (filled):")
print(colSums(is.na(sales_wide_filled)))

[1] "Original Data with Missing Quarters:"
  Region Quarter Sales
1  North      Q1 45000
2  North      Q2 48000
3  North      Q4 55000
4  South      Q1 52000
5  South      Q3 58000
6   East      Q2 51000
7   East      Q3 53000
8   East      Q4 56000
9   West      Q1 41000
[1] "\nDefault Behavior (NAs for missing values):"
# A tibble: 4 × 5
  Region    Q1    Q2    Q4    Q3
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 North  45000 48000 55000    NA
2 South  52000    NA    NA 58000
3 East      NA 51000 56000 53000
4 West   41000    NA    NA    NA
[1] "\nWith values_fill = 0:"
# A tibble: 4 × 5
  Region    Q1    Q2    Q

In [14]:
# Task 4.3: Nested Reshaping
# Create a complex dataset with multiple dimensions
complex_sales <- data.frame(
  Store = rep(c("Store_A", "Store_B", "Store_C"), each = 8),
  Year = rep(rep(c(2022, 2023), each = 4), 3),
  Quarter = rep(c("Q1", "Q2", "Q3", "Q4"), 6),
  Sales = c(
    # Store A 2022
    120000, 125000, 130000, 135000,
    # Store A 2023
    140000, 145000, 150000, 155000,
    # Store B 2022
    95000, 98000, 102000, 105000,
    # Store B 2023
    110000, 115000, 118000, 122000,
    # Store C 2022
    85000, 88000, 91000, 94000,
    # Store C 2023
    97000, 100000, 103000, 106000
  ),
  Profit = c(
    # Store A 2022
    24000, 25000, 26000, 27000,
    # Store A 2023
    28000, 29000, 30000, 31000,
    # Store B 2022
    19000, 19600, 20400, 21000,
    # Store B 2023
    22000, 23000, 23600, 24400,
    # Store C 2022
    17000, 17600, 18200, 18800,
    # Store C 2023
    19400, 20000, 20600, 21200
  )
)

print("Original Complex Long Format:")
print(head(complex_sales, 12))

# Step 1: Stack Sales and Profit into one column, then spread to one column
# per Quarter/Metric pair (Q1_Sales, Q1_Profit, ...)
sales_wide_by_quarter <- complex_sales %>%
  pivot_longer(
    cols = c(Sales, Profit),
    names_to = "Metric",
    values_to = "Amount"
  ) %>%
  pivot_wider(
    names_from = c(Quarter, Metric),
    names_sep = "_",
    values_from = Amount
  )

print("\nStep 1 - Wide by Quarter and Metric:")
print(head(sales_wide_by_quarter))

# Step 2: One row per store, one column per metric/year/quarter (e.g. Sales_2022_Q1)
sales_wide_nested <- complex_sales %>%
  pivot_wider(
    names_from = c(Year, Quarter),
    names_sep = "_",
    values_from = c(Sales, Profit)
  )

print("\nStep 2 - Nested Wide Format (Year_Quarter):")
print(sales_wide_nested)

# Step 3: Same layout as Step 2, built by first uniting Year and Quarter into one key column
sales_matrix <- complex_sales %>%
  unite("Year_Quarter", Year, Quarter, sep = "_") %>%
  pivot_wider(
    names_from = Year_Quarter,
    values_from = c(Sales, Profit),
    names_sep = "_"
  )

print("\nStep 3 - Complex Matrix Format:")
print(sales_matrix)

# Step 4: Reverse nested reshaping - back to long format
sales_back_to_long <- sales_wide_nested %>%
  pivot_longer(
    cols = -Store,
    names_to = c(".value", "Year", "Quarter"),
    names_sep = "_"
  ) %>%
  mutate(Year = as.numeric(Year))  # Convert Year back to numeric

print("\nStep 4 - Back to Long Format:")
print(head(sales_back_to_long, 12))

# Step 5: Calculate profit margin and reshape
sales_with_margin <- complex_sales %>%
  mutate(Profit_Margin = round((Profit / Sales) * 100, 2)) %>%
  pivot_wider(
    names_from = Quarter,
    values_from = c(Sales, Profit, Profit_Margin),
    names_glue = "{Quarter}_{.value}"
  )

print("\nStep 5 - Wide Format with Calculated Metrics:")
print(head(sales_with_margin))


[1] "Original Complex Long Format:"
     Store Year Quarter  Sales Profit
1  Store_A 2022      Q1 120000  24000
2  Store_A 2022      Q2 125000  25000
3  Store_A 2022      Q3 130000  26000
4  Store_A 2022      Q4 135000  27000
5  Store_A 2023      Q1 140000  28000
6  Store_A 2023      Q2 145000  29000
7  Store_A 2023      Q3 150000  30000
8  Store_A 2023      Q4 155000  31000
9  Store_B 2022      Q1  95000  19000
10 Store_B 2022      Q2  98000  19600
11 Store_B 2022      Q3 102000  20400
12 Store_B 2022      Q4 105000  21000
[1] "\nStep 1 - Wide by Quarter and Metric:"
# A tibble: 6 × 10
  Store   Year Q1_Sales Q1_Profit Q2_Sales Q2_Profit Q3_Sales Q3_Profit Q4_Sales
  <chr>  <dbl>    <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 Store…  2022   120

## Part 5: Business Applications

1. **Time Series Analysis Preparation:**
   - Using your `quarterly_sales_long` data, prepare it for time series analysis by:
     - Ensuring it's in proper long format with a date/time column
     - Creating a complete time series (filling in any missing quarters with 0 sales)
     - Adding calculated columns for year-over-year growth rates

2. **Dashboard Data Preparation:**
   - Create a wide format dataset suitable for a business dashboard that shows:
     - Rows: Product categories
     - Columns: Quarters
     - Values: Total sales
     - Additional columns for year-over-year comparisons

3. **Survey Analysis:**
   - Using your `survey_responses_wide` data, create summary statistics:
     - Calculate average scores for each question
     - Identify questions with the highest and lowest satisfaction
     - Create a correlation matrix between different survey questions
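
For the "complete time series" step in Task 1, one option is `tidyr::complete()`, which adds rows for key combinations that are absent from the data. The sketch below is a minimal example under assumed data: `incomplete_demo` is a hypothetical long table with gaps, and the full set of quarters is supplied explicitly as `Q1` to `Q4`.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data with missing Region/Quarter combinations
incomplete_demo <- data.frame(
  Region  = c("North", "North", "South"),
  Quarter = c("Q1", "Q3", "Q2"),
  Sales   = c(45000, 52000, 55000)
)

complete_demo <- incomplete_demo %>%
  complete(
    Region,
    Quarter = paste0("Q", 1:4),   # Every quarter, even ones absent from the data
    fill = list(Sales = 0)        # Missing combinations get 0 sales
  )

print(complete_demo)
```

Filling with 0 follows the task wording, but it makes any growth rate computed from a filled quarter undefined or infinite. Leaving the gaps as `NA` may be the safer choice when the quarter was not recorded rather than genuinely empty.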

---

In [15]:
# Task 5.1: Time Series Analysis Preparation
# Prepare quarterly sales data for time series analysis

# Start with the parsed quarterly sales data
print("Starting Data:")
print(head(quarterly_sales_parsed))

# Step 1: Clean and prepare the data
sales_timeseries <- quarterly_sales_parsed %>%
  # Remove 'Q' from Quarter column and convert to numeric
  mutate(
    Quarter_Num = as.numeric(gsub("Q", "", Quarter)),
    Year = as.numeric(Year)
  ) %>%
  # Create a proper date column (first day of each quarter)
  mutate(
    Month = case_when(
      Quarter_Num == 1 ~ 1,
      Quarter_Num == 2 ~ 4,
      Quarter_Num == 3 ~ 7,
      Quarter_Num == 4 ~ 10
    ),
    Date = as.Date(paste(Year, Month, "01", sep = "-"))
  ) %>%
  # Create time period identifier
  mutate(
    Time_Period = paste0(Year, "-", Quarter)
  ) %>%
  # Arrange by date
  arrange(Region, Date)

print("\nTime Series Prepared Data:")
print(head(sales_timeseries, 12))

# Step 2: Calculate period-over-period changes
sales_timeseries <- sales_timeseries %>%
  group_by(Region) %>%
  mutate(
    Previous_Sales = lag(Sales),
    Sales_Change = Sales - Previous_Sales,
    Sales_Change_Pct = round((Sales_Change / Previous_Sales) * 100, 2)
  ) %>%
  ungroup()

print("\nWith Period-over-Period Changes:")
print(head(sales_timeseries, 12))

# Step 3: Calculate moving averages
sales_timeseries <- sales_timeseries %>%
  group_by(Region) %>%
  mutate(
    Moving_Avg_2Q = (Sales + lag(Sales, 1)) / 2,
    Moving_Avg_3Q = (Sales + lag(Sales, 1) + lag(Sales, 2)) / 3
  ) %>%
  ungroup()

print("\nWith Moving Averages:")
print(sales_timeseries %>% select(Region, Time_Period, Sales, Moving_Avg_2Q, Moving_Avg_3Q))

# Step 4: Create wide format for comparison across regions
sales_comparison <- sales_timeseries %>%
  select(Time_Period, Region, Sales) %>%
  pivot_wider(
    names_from = Region,
    values_from = Sales
  ) %>%
  arrange(Time_Period)

print("\nRegion Comparison (Wide Format):")
print(sales_comparison)

# Step 5: Calculate summary statistics by region
sales_summary <- sales_timeseries %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = round(mean(Sales), 2),
    Min_Sales = min(Sales),
    Max_Sales = max(Sales),
    Sales_Growth = round(((last(Sales) - first(Sales)) / first(Sales)) * 100, 2),
    Quarters = n()
  )

print("\nSummary Statistics by Region:")
print(sales_summary)

# Step 6: Identify trends - quarters with highest growth
top_growth_quarters <- sales_timeseries %>%
  filter(!is.na(Sales_Change_Pct)) %>%
  arrange(desc(Sales_Change_Pct)) %>%
  select(Region, Time_Period, Sales, Previous_Sales, Sales_Change_Pct) %>%
  head(10)

print("\nTop 10 Quarters by Growth Rate:")
print(top_growth_quarters)

# Step 7: Create analysis-ready format for visualization
sales_for_plotting <- sales_timeseries %>%
  select(Region, Date, Time_Period, Quarter, Year, Sales, Moving_Avg_2Q, Sales_Change_Pct) %>%
  arrange(Region, Date)

print("\nAnalysis-Ready Format for Visualization:")
print(head(sales_for_plotting, 12))


[1] "Starting Data:"
# A tibble: 6 × 4
  Region Quarter Year  Sales
  <chr>  <chr>   <chr> <dbl>
1 North  Q1      2023  45000
2 North  Q2      2023  48000
3 North  Q3      2023  52000
4 North  Q4      2023  55000
5 South  Q1      2023  52000
6 South  Q2      2023  55000
[1] "\nTime Series Prepared Data:"
# A tibble: 12 × 8
   Region  Quarter  Year Sales Quarter_Num Month Date       Time_Period
   <chr>   <chr>   <dbl> <dbl>       <dbl> <dbl> <date>     <chr>
 1 Central Q1       2023 39000           1     1 2023-01-01 2023-Q1

In [18]:
# Task 5.2: Dashboard Data Preparation
# Create dashboard-ready dataset with multiple summary views

# Create a comprehensive dashboard dataset
print("=== DASHBOARD DATA PREPARATION ===\n")

# View 1: Executive Summary - Overall KPIs
executive_summary <- quarterly_sales_parsed %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Quarterly_Sales = round(mean(Sales), 2),
    Total_Regions = n_distinct(Region),
    Total_Quarters = n_distinct(paste(Quarter, Year)),
    Best_Quarter = paste(Quarter[which.max(Sales)], Year[which.max(Sales)]),
    Best_Region = Region[which.max(Sales)],
    Peak_Sales = max(Sales)
  )

print("View 1 - Executive Summary:")
print(executive_summary)

# View 2: Regional Performance Table
regional_performance <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = round(mean(Sales), 2),
    Min_Sales = min(Sales),
    Max_Sales = max(Sales),
    Sales_Range = Max_Sales - Min_Sales,
    Growth_Rate = round(((last(Sales) - first(Sales)) / first(Sales)) * 100, 2),
    Quarters_Tracked = n()
  ) %>%
  arrange(desc(Total_Sales)) %>%
  mutate(Rank = row_number())

print("\nView 2 - Regional Performance:")
print(regional_performance)

# View 3: Quarterly Trends Summary
quarterly_trends <- quarterly_sales_parsed %>%
  group_by(Quarter, Year) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Regional_Sales = round(mean(Sales), 2),
    Best_Region = Region[which.max(Sales)],
    Worst_Region = Region[which.min(Sales)],
    Regional_Variance = round(sd(Sales), 2),
    .groups = "drop"
  ) %>%
  arrange(Year, Quarter) %>%
  mutate(
    Period = paste0(Year, "-", Quarter),
    QoQ_Change = Total_Sales - lag(Total_Sales),
    QoQ_Change_Pct = round((QoQ_Change / lag(Total_Sales)) * 100, 2)
  )

print("\nView 3 - Quarterly Trends:")
print(quarterly_trends)

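# View 4: Pivot Table - Sales by Region and Quarter
pivot_table <- quarterly_sales_parsed %>%
  mutate(Quarter_Year = paste0(Quarter, "_", Year)) %>%
  pivot_wider(
    # One row per region. This also drops Quarter and Year; otherwise
    # starts_with("Q") below would match the character column Quarter and
    # c_across() could not combine it with the numeric quarter columns.
    id_cols = Region,
    names_from = Quarter_Year,
    values_from = Sales
  ) %>%
  rowwise() %>%
  mutate(
    Total = sum(c_across(starts_with("Q")), na.rm = TRUE),
    Average = round(Total / 4, 2)
  ) %>%
  ungroup() %>%
  arrange(desc(Total))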

print("\nView 4 - Pivot Table (Region x Quarter):")
print(pivot_table)

# View 5: Top/Bottom Performers
top_bottom_performers <- quarterly_sales_parsed %>%
  mutate(Period = paste0(Year, "-", Quarter)) %>%
  group_by(Period) %>%
  arrange(desc(Sales)) %>%
  slice(c(1, n())) %>%
  mutate(Performance = c("Top", "Bottom")) %>%
  select(Period, Performance, Region, Sales) %>%
  ungroup()

print("\nView 5 - Top & Bottom Performers by Quarter:")
print(head(top_bottom_performers, 12))

# View 6: Market Share Analysis
market_share <- quarterly_sales_parsed %>%
  group_by(Quarter, Year) %>%
  mutate(
    Quarter_Total = sum(Sales),
    Market_Share_Pct = round((Sales / Quarter_Total) * 100, 2)
  ) %>%
  ungroup() %>%
  select(Region, Quarter, Year, Sales, Market_Share_Pct) %>%
  arrange(Year, Quarter, desc(Market_Share_Pct))

print("\nView 6 - Market Share by Quarter:")
print(head(market_share, 12))

# View 7: Heatmap-ready data
heatmap_data <- quarterly_sales_parsed %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales
  ) %>%
  select(Region, Q1, Q2, Q3, Q4)

print("\nView 7 - Heatmap Ready Data:")
print(heatmap_data)

# View 8: Scorecard Metrics by Region
scorecard_metrics <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Current_Quarter_Sales = last(Sales),
    Previous_Quarter_Sales = nth(Sales, n() - 1),
    QoQ_Change = Current_Quarter_Sales - Previous_Quarter_Sales,
    QoQ_Change_Pct = round((QoQ_Change / Previous_Quarter_Sales) * 100, 2),
    YTD_Total = sum(Sales),
    Avg_Quarter = round(mean(Sales), 2),
    Performance_vs_Avg = round(((Current_Quarter_Sales - Avg_Quarter) / Avg_Quarter) * 100, 2)
  ) %>%
  mutate(
    Trend = case_when(
      QoQ_Change_Pct > 5 ~ "Strong Growth",
      QoQ_Change_Pct > 0 ~ "Moderate Growth",
      QoQ_Change_Pct > -5 ~ "Slight Decline",
      TRUE ~ "Significant Decline"
    )
  )

print("\nView 8 - Scorecard Metrics:")
print(scorecard_metrics)

# View 9: Quick Reference Summary
quick_summary <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Q1 = Sales[Quarter == "Q1"],
    Q2 = Sales[Quarter == "Q2"],
    Q3 = Sales[Quarter == "Q3"],
    Q4 = Sales[Quarter == "Q4"],
    Total = sum(Sales),
    Growth = round(((Q4 - Q1) / Q1) * 100, 2)
  ) %>%
  arrange(desc(Total))

print("\nView 9 - Quick Reference Summary:")
print(quick_summary)

print("\n=== DASHBOARD PREPARATION COMPLETE ===")


[1] "=== DASHBOARD DATA PREPARATION ===\n"
[1] "View 1 - Executive Summary:"
# A tibble: 1 × 7
  Total_Sales Avg_Quarterly_Sales Total_Regions Total_Quarters Best_Quarter
        <dbl>               <dbl>         <int>          <int> <chr>
1      984000               49200             5              4 Q4 2023
# ℹ 2 more variables: Best_Region <chr>, Peak_Sales <dbl>
[1] "\nView 2 - Regional Performance:"
# A tibble: 5 × 9
  Region  Total_Sales Avg_Sales Min_Sales Max_Sales Sales_Range Growth_Rate
  <chr>         <dbl>     <dbl>     <dbl>     <dbl>       <dbl>       <dbl>
1 South        226000     56500     5

ERROR: Error in `mutate()`:
ℹ In argument: `Total = sum(c_across(starts_with("Q")), na.rm = TRUE)`.
ℹ In row 1.
Caused by error in `vec_c()`:
! Can't combine `Quarter` <character> and `Q1_2023` <double>.


In [19]:
# Task 5.3: Survey Analysis
# Analyze survey responses in wide format

print("=== SURVEY ANALYSIS ===\n")

# Step 1: Create wide format from long format
print("Original Long Format:")
print(head(survey_responses_long, 9))

survey_wide <- survey_responses_long %>%
  pivot_wider(
    names_from = Question,
    values_from = Response
  )

print("\nStep 1 - Wide Format:")
print(survey_wide)

# Step 2: Calculate overall statistics
overall_stats <- survey_wide %>%
  summarise(
    Avg_Satisfaction = round(mean(Satisfaction, na.rm = TRUE), 2),
    Avg_Quality = round(mean(Quality, na.rm = TRUE), 2),
    Avg_Value = round(mean(Value, na.rm = TRUE), 2),
    SD_Satisfaction = round(sd(Satisfaction, na.rm = TRUE), 2),
    SD_Quality = round(sd(Quality, na.rm = TRUE), 2),
    SD_Value = round(sd(Value, na.rm = TRUE), 2),
    Total_Respondents = n()
  )

print("\nStep 2 - Overall Statistics:")
print(overall_stats)

# Step 3: Calculate composite scores
survey_with_scores <- survey_wide %>%
  mutate(
    Overall_Score = round((Satisfaction + Quality + Value) / 3, 2),
    Weighted_Score = round((Satisfaction * 0.4 + Quality * 0.4 + Value * 0.2), 2),
    Satisfaction_Level = case_when(
      Overall_Score >= 4.5 ~ "Highly Satisfied",
      Overall_Score >= 3.5 ~ "Satisfied",
      Overall_Score >= 2.5 ~ "Neutral",
      Overall_Score >= 1.5 ~ "Dissatisfied",
      TRUE ~ "Highly Dissatisfied"
    )
  )

print("\nStep 3 - Survey with Composite Scores:")
print(survey_with_scores)

# Step 4: Categorize respondents
respondent_categories <- survey_with_scores %>%
  mutate(
    Quality_Gap = Quality - Satisfaction,
    Value_Gap = Value - Satisfaction,
    Category = case_when(
      Satisfaction >= 4 & Quality >= 4 & Value >= 4 ~ "Promoter",
      Satisfaction <= 2 | Quality <= 2 | Value <= 2 ~ "Detractor",
      TRUE ~ "Passive"
    )
  )

print("\nStep 4 - Respondent Categories:")
print(respondent_categories)

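# Step 5: NPS-style analysis
# (Standard Net Promoter Score uses a single 0-10 "likelihood to recommend" question;
#  here the same idea is adapted to the custom categories from Step 4.)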
nps_analysis <- respondent_categories %>%
  summarise(
    Promoters = sum(Category == "Promoter"),
    Passives = sum(Category == "Passive"),
    Detractors = sum(Category == "Detractor"),
    Total = n(),
    Promoter_Pct = round((Promoters / Total) * 100, 2),
    Detractor_Pct = round((Detractors / Total) * 100, 2),
    NPS = Promoter_Pct - Detractor_Pct
  )

print("\nStep 5 - NPS Analysis:")
print(nps_analysis)

# Step 6: Distribution analysis
score_distribution <- survey_wide %>%
  pivot_longer(
    cols = c(Satisfaction, Quality, Value),
    names_to = "Metric",
    values_to = "Score"
  ) %>%
  group_by(Metric, Score) %>%
  summarise(
    Count = n(),
    .groups = "drop"
  ) %>%
  group_by(Metric) %>%
  mutate(
    Percentage = round((Count / sum(Count)) * 100, 2)
  ) %>%
  ungroup() %>%  # drop the grouping so later steps are not grouped by accident
  arrange(Metric, desc(Score))

print("\nStep 6 - Score Distribution:")
print(score_distribution)

# Step 7: Correlation analysis
correlation_matrix <- survey_wide %>%
  select(Satisfaction, Quality, Value) %>%
  cor() %>%
  round(3)

print("\nStep 7 - Correlation Matrix:")
print(correlation_matrix)

# Step 8: Top and bottom respondents
top_respondents <- survey_with_scores %>%
  arrange(desc(Overall_Score)) %>%
  head(3) %>%
  select(RespondentID, Satisfaction, Quality, Value, Overall_Score, Satisfaction_Level)

bottom_respondents <- survey_with_scores %>%
  arrange(Overall_Score) %>%
  head(3) %>%
  select(RespondentID, Satisfaction, Quality, Value, Overall_Score, Satisfaction_Level)

print("\nStep 8a - Top 3 Respondents:")
print(top_respondents)

print("\nStep 8b - Bottom 3 Respondents:")
print(bottom_respondents)

# Step 9: Question performance ranking
question_performance <- survey_wide %>%
  summarise(
    Satisfaction_Avg = mean(Satisfaction, na.rm = TRUE),
    Quality_Avg = mean(Quality, na.rm = TRUE),
    Value_Avg = mean(Value, na.rm = TRUE)
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Question",
    values_to = "Average_Score"
  ) %>%
  mutate(
    Question = gsub("_Avg", "", Question),
    Average_Score = round(Average_Score, 2)
  ) %>%
  arrange(desc(Average_Score)) %>%
  mutate(Rank = row_number())

print("\nStep 9 - Question Performance Ranking:")
print(question_performance)

# Step 10: Actionable insights summary
insights_summary <- survey_with_scores %>%
  summarise(
    Total_Responses = n(),
    Highly_Satisfied = sum(Satisfaction_Level == "Highly Satisfied"),
    Needs_Improvement = sum(Satisfaction_Level %in% c("Dissatisfied", "Highly Dissatisfied")),
    Avg_Overall = round(mean(Overall_Score), 2),
    Low_Satisfaction = sum(Satisfaction <= 2),
    Low_Quality = sum(Quality <= 2),
    Low_Value = sum(Value <= 2)
  ) %>%
  mutate(
    Priority_Area = case_when(
      Low_Value >= Low_Quality & Low_Value >= Low_Satisfaction ~ "Value",
      Low_Quality >= Low_Satisfaction ~ "Quality",
      TRUE ~ "Satisfaction"
    )
  )

print("\nStep 10 - Actionable Insights Summary:")
print(insights_summary)

# Step 11: Export-ready summary table
summary_table <- survey_with_scores %>%
  select(RespondentID, Satisfaction, Quality, Value, Overall_Score, Satisfaction_Level) %>%
  arrange(desc(Overall_Score))

print("\nStep 11 - Export-Ready Summary Table:")
print(summary_table)

print("\n=== SURVEY ANALYSIS COMPLETE ===")


[1] "=== SURVEY ANALYSIS ===\n"
[1] "Original Long Format:"
  RespondentID     Question Response
1            1 Satisfaction        4
2            1      Quality        5
3            1        Value        4
4            2 Satisfaction        3
5            2      Quality        4
6            2        Value        3
7            3 Satisfaction        5
8            3      Quality        5
9            3        Value        4
[1] "\nStep 1 - Wide Format:"
# A tibble: 5 × 4
  RespondentID Satisfaction Quality Value
         <int>        <dbl>   <dbl> <dbl>
1            1            4       5     4
2            2            3       4     3
3            3            5       5     4
4            4            4       4     3
5            5            3       3     2
[1] "\nStep 2 - Overall Statistics:"
# A tibble: 1 × 7
  Avg_Satisfaction Avg_Qual

## Part 6: Data Validation and Quality Checks

1. **Reshape Validation:**
   - After each major reshaping operation, verify that:
     - The total number of data points is preserved (accounting for the different structure)
     - No data was lost or duplicated unexpectedly
     - The relationships between variables are maintained

2. **Tidy Data Assessment:**
   - For each of your final datasets, assess whether they meet the criteria for "tidy data":
     - Each variable forms a column
     - Each observation forms a row
     - Each type of observational unit forms a table
   - Identify which format (wide or long) is more "tidy" for each specific analysis purpose.
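
One way to check the "duplicated unexpectedly" point before widening is to confirm that each combination of ID and name appears only once. The sketch below uses a hypothetical `survey_dup` data frame (not one of the course files) in which respondent 1 answered `Quality` twice. In that case `pivot_wider()` falls back to list-columns instead of plain numeric columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data with one duplicated key (RespondentID 1, Quality)
survey_dup <- tibble(
  RespondentID = c(1, 1, 1, 2, 2),
  Question     = c("Satisfaction", "Quality", "Quality", "Satisfaction", "Quality"),
  Response     = c(4, 5, 3, 3, 4)
)

# Count rows per key; anything with n > 1 would collide when widened
duplicate_keys <- survey_dup %>%
  count(RespondentID, Question) %>%
  filter(n > 1)

print(duplicate_keys)

# Widening anyway raises a warning and produces list-columns
wide_dup <- suppressWarnings(
  survey_dup %>%
    pivot_wider(names_from = Question, values_from = Response)
)

print(sapply(wide_dup, class))
```

An empty `duplicate_keys` result suggests the keys are unique and the reshape should not silently merge or nest values.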

---

In [20]:
# Task 6.1: Reshape Validation
# Implement validation checks for your reshaping operations

print("=== RESHAPE VALIDATION ===\n")

# Validation 1: Check dimensions before and after reshaping
print("Validation 1 - Dimension Check:")
print("Original wide format dimensions:")
print(paste("Rows:", nrow(quarterly_sales_wide), "Columns:", ncol(quarterly_sales_wide)))

quarterly_sales_long_test <- quarterly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "Quarter",
    values_to = "Sales"
  )

print("After pivot_longer dimensions:")
print(paste("Rows:", nrow(quarterly_sales_long_test), "Columns:", ncol(quarterly_sales_long_test)))

# Expected rows: one per original row per quarter column (ncol - 1 excludes Region)
expected_rows <- nrow(quarterly_sales_wide) * (ncol(quarterly_sales_wide) - 1)
print(paste("Expected rows:", expected_rows))
print(paste("Validation:", ifelse(nrow(quarterly_sales_long_test) == expected_rows, "PASS ✓", "FAIL ✗")))

# Validation 2: Check for data loss (sum of values)
print("\nValidation 2 - Data Integrity Check:")
original_sum <- sum(quarterly_sales_wide[, -1], na.rm = TRUE)  # Exclude Region column
reshaped_sum <- sum(quarterly_sales_long_test$Sales, na.rm = TRUE)

print(paste("Original data sum:", original_sum))
print(paste("Reshaped data sum:", reshaped_sum))
print(paste("Validation:", ifelse(original_sum == reshaped_sum, "PASS ✓ (No data loss)", "FAIL ✗ (Data loss detected)")))

# Validation 3: Check for missing values
print("\nValidation 3 - Missing Values Check:")
missing_original <- sum(is.na(quarterly_sales_wide))
missing_reshaped <- sum(is.na(quarterly_sales_long_test))

print(paste("Missing values in original:", missing_original))
print(paste("Missing values in reshaped:", missing_reshaped))
print(paste("Validation:", ifelse(missing_original == missing_reshaped, "PASS ✓", "FAIL ✗")))

# Validation 4: Reverse transformation check
print("\nValidation 4 - Reverse Transformation Check:")
sales_back_to_wide <- quarterly_sales_long_test %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales
  )

print("Reversed back to wide format:")
print(sales_back_to_wide)

# Check if we get back the original structure
columns_match <- setequal(names(quarterly_sales_wide), names(sales_back_to_wide))
print(paste("Column names match:", ifelse(columns_match, "PASS ✓", "FAIL ✗")))

# Check if values match
values_match <- all.equal(
  quarterly_sales_wide[order(quarterly_sales_wide$Region), ],
  sales_back_to_wide[order(sales_back_to_wide$Region), ],
  check.attributes = FALSE
)
print(paste("Values match:", ifelse(isTRUE(values_match), "PASS ✓", "FAIL ✗")))

# Validation 5: Check unique identifiers
print("\nValidation 5 - Unique Identifier Check:")
unique_regions_original <- n_distinct(quarterly_sales_wide$Region)
unique_regions_reshaped <- n_distinct(quarterly_sales_long_test$Region)

print(paste("Unique regions in original:", unique_regions_original))
print(paste("Unique regions in reshaped:", unique_regions_reshaped))
print(paste("Validation:", ifelse(unique_regions_original == unique_regions_reshaped, "PASS ✓", "FAIL ✗")))

# Validation 6: Check for duplicates
print("\nValidation 6 - Duplicate Check:")
duplicates_long <- quarterly_sales_long_test %>%
  group_by(Region, Quarter) %>%
  filter(n() > 1) %>%
  nrow()

print(paste("Duplicate rows in long format:", duplicates_long))
print(paste("Validation:", ifelse(duplicates_long == 0, "PASS ✓ (No duplicates)", "FAIL ✗ (Duplicates found)")))

# Validation 7: Data type consistency
print("\nValidation 7 - Data Type Check:")
print("Original data types:")
str(quarterly_sales_wide)  # str() prints directly; wrapping it in print() also emits a stray NULL
print("\nReshaped data types:")
str(quarterly_sales_long_test)

sales_numeric_check <- is.numeric(quarterly_sales_long_test$Sales)
print(paste("Sales column is numeric:", ifelse(sales_numeric_check, "PASS ✓", "FAIL ✗")))

# Validation 8: Value range check
print("\nValidation 8 - Value Range Check:")
original_min <- min(quarterly_sales_wide[, -1], na.rm = TRUE)
original_max <- max(quarterly_sales_wide[, -1], na.rm = TRUE)
reshaped_min <- min(quarterly_sales_long_test$Sales, na.rm = TRUE)
reshaped_max <- max(quarterly_sales_long_test$Sales, na.rm = TRUE)

print(paste("Original range:", original_min, "to", original_max))
print(paste("Reshaped range:", reshaped_min, "to", reshaped_max))
range_match <- (original_min == reshaped_min) && (original_max == reshaped_max)
print(paste("Validation:", ifelse(range_match, "PASS ✓", "FAIL ✗")))

# Validation 9: Survey data validation (long to wide to long)
print("\nValidation 9 - Survey Data Round-Trip Validation:")
print("Original survey long format:")
print(head(survey_responses_long))

survey_wide_test <- survey_responses_long %>%
  pivot_wider(
    names_from = Question,
    values_from = Response
  )

survey_back_to_long <- survey_wide_test %>%
  pivot_longer(
    cols = c(Satisfaction, Quality, Value),
    names_to = "Question",
    values_to = "Response"
  ) %>%
  arrange(RespondentID, Question)

original_sorted <- survey_responses_long %>%
  arrange(RespondentID, Question)

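rows_match <- nrow(original_sorted) == nrow(survey_back_to_long)
print(paste("Row count matches:", ifelse(rows_match, "PASS ✓", "FAIL ✗")))

# Row counts alone cannot detect changed values, so also compare the contents
survey_values_match <- isTRUE(all.equal(
  as.data.frame(original_sorted),
  as.data.frame(survey_back_to_long),
  check.attributes = FALSE
))
print(paste("Values match:", ifelse(survey_values_match, "PASS ✓", "FAIL ✗")))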

# Validation 10: Comprehensive validation function
# Assumes reshaped_data came from pivot_longer() with the values column last
validate_reshape <- function(original_data, reshaped_data, id_cols, value_cols) {
  results <- list()
  
  # Check 1: Row count (one long row per original row per value column)
  expected_rows <- nrow(original_data) * length(value_cols)
  results$row_count <- nrow(reshaped_data) == expected_rows
  
  # Check 2: No data loss (compare totals within a small tolerance)
  original_sum <- sum(original_data[, value_cols], na.rm = TRUE)
  reshaped_sum <- sum(reshaped_data[[ncol(reshaped_data)]], na.rm = TRUE)
  results$no_data_loss <- abs(original_sum - reshaped_sum) < 0.01
  
  # Check 3: ID preservation (works for one or several ID columns)
  results$id_preserved <- nrow(unique(original_data[id_cols])) ==
    nrow(unique(reshaped_data[id_cols]))
  
  return(results)
}

print("\nValidation 10 - Comprehensive Validation Function:")
validation_results <- validate_reshape(
  quarterly_sales_wide, 
  quarterly_sales_long_test,
  "Region",
  c("Q1_2023", "Q2_2023", "Q3_2023", "Q4_2023")
)
print(validation_results)

# Summary Report
print("\n=== VALIDATION SUMMARY REPORT ===")
all_checks <- c(
  "Dimension Check" = expected_rows == nrow(quarterly_sales_long_test),
  "Data Integrity" = original_sum == reshaped_sum,
  "Missing Values" = missing_original == missing_reshaped,
  "Reverse Transform" = isTRUE(values_match),
  "Unique IDs" = unique_regions_original == unique_regions_reshaped,
  "No Duplicates" = duplicates_long == 0,
  "Data Types" = sales_numeric_check,
  "Value Range" = range_match
)

passed <- sum(all_checks)
total <- length(all_checks)

print(paste("\nTotal Validations:", total))
print(paste("Passed:", passed))
print(paste("Failed:", total - passed))
print(paste0("Success Rate: ", round((passed / total) * 100, 1), "%"))

if (passed == total) {
  print("\n✓ ALL VALIDATIONS PASSED - Data reshaping is reliable!")
} else {
  print("\n✗ SOME VALIDATIONS FAILED - Review reshaping operations!")
}

print("\n=== VALIDATION COMPLETE ===")


[1] "=== RESHAPE VALIDATION ===\n"
[1] "Validation 1 - Dimension Check:"
[1] "Original wide format dimensions:"
[1] "Rows: 5 Columns: 5"
[1] "After pivot_longer dimensions:"
[1] "Rows: 20 Columns: 3"
[1] "Expected rows: 20"
[1] "Validation: PASS ✓"
[1] "\nValidation 2 - Data Integrity Check:"
[1] "Original data sum: 984000"
[1] "Reshaped data sum: 984000"
[1] "Validation: PASS ✓ (No data loss)"
[1] "\nValidation 3 - Missing Values Check:"
[1] "Missing values in original: 0"
[1] "Missing values in reshaped: 0"
[1] "Validation: PASS ✓"
[1] "\nValidation 4 - Reverse Transformation Check:"
[1] "Reversed back to wide format:"
# A tibble: 5 × 5
  Region  Q1_2023 Q2_2023 Q3_2023 Q4_2023
  <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 North     45000   48000   52000   55000
2 South     52

In [21]:
# Task 6.2: Tidy Data Assessment
# Assess which formats are more "tidy" for different purposes

print("=== TIDY DATA ASSESSMENT ===\n")

# Reminder: Tidy Data Principles
print("Tidy Data Principles:")
print("1. Each variable forms a column")
print("2. Each observation forms a row")
print("3. Each type of observational unit forms a table\n")

# Assessment 1: Quarterly Sales Data
print("Assessment 1 - Quarterly Sales Data")
print("\nWIDE FORMAT:")
print(quarterly_sales_wide)

print("\nLONG FORMAT:")
print(head(quarterly_sales_long_test, 10))

# Evaluate tidiness
print("\nTIDINESS EVALUATION - Quarterly Sales:")
print("Wide Format:")
print("  - Each variable a column? NO ✗ (Q1, Q2, Q3, Q4 are values, not variables)")
print("  - Each observation a row? NO ✗ (Multiple quarters in one row)")
print("  - Is it tidy? NO ✗")
print("  - Best for: Spreadsheet viewing, cross-quarter comparison")

print("\nLong Format:")
print("  - Each variable a column? YES ✓ (Region, Quarter, Sales are variables)")
print("  - Each observation a row? YES ✓ (One quarter's sales per row)")
print("  - Is it tidy? YES ✓")
print("  - Best for: Analysis, visualization, modeling, aggregation")

# Assessment 2: Survey Responses
print("\n\nAssessment 2 - Survey Responses")
print("\nLONG FORMAT:")
print(head(survey_responses_long, 9))

print("\nWIDE FORMAT:")
survey_wide_test <- survey_responses_long %>%
  pivot_wider(names_from = Question, values_from = Response)
print(survey_wide_test)

print("\nTIDINESS EVALUATION - Survey Responses:")
print("Long Format:")
print("  - Each variable a column? MAYBE ~ (Question and Response are variables)")
print("  - Each observation a row? DEPENDS (Is one response an observation?)")
print("  - Is it tidy? YES ✓ (If each answer is considered an observation)")
print("  - Best for: Filtering by question, analyzing question patterns")

print("\nWide Format:")
print("  - Each variable a column? YES ✓ (Each question type is a variable)")
print("  - Each observation a row? YES ✓ (One respondent = one observation)")
print("  - Is it tidy? YES ✓ (If each respondent is the observation unit)")
print("  - Best for: Respondent analysis, calculating composite scores")

# Assessment 3: Employee Skills
print("\n\nAssessment 3 - Employee Skills")
print("\nWIDE FORMAT:")
print(employee_skills_wide)

print("\nLONG FORMAT:")
print(head(employee_skills_long, 12))

print("\nTIDINESS EVALUATION - Employee Skills:")
print("Wide Format:")
print("  - Each variable a column? NO ✗ (Python, R, SQL, Excel are values, not variables)")
print("  - Each observation a row? YES ✓ (One employee per row)")
print("  - Is it tidy? NO ✗")
print("  - Best for: Skills matrix view, hiring decisions")

print("\nLong Format:")
print("  - Each variable a column? YES ✓ (EmployeeID, Name, Skill, HasSkill)")
print("  - Each observation a row? YES ✓ (One skill assessment per row)")
print("  - Is it tidy? YES ✓")
print("  - Best for: Skill counting, filtering by skill, training analysis")

# Assessment 4: Context-Dependent Tidiness
print("\n\nAssessment 4 - Context-Dependent Tidiness")

# Example: Financial data with multiple metrics
financial_wide <- data.frame(
  Region = c("North", "South", "East"),
  Q1_Sales = c(45000, 52000, 48000),
  Q1_Profit = c(9000, 10400, 9600),
  Q2_Sales = c(48000, 55000, 51000),
  Q2_Profit = c(9600, 11000, 10200)
)

print("\nFINANCIAL DATA - Wide Format:")
print(financial_wide)

# Option 1: Fully long
financial_fully_long <- financial_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = c("Quarter", "Metric"),
    names_sep = "_",
    values_to = "Amount"
  )

print("\nOption 1 - Fully Long:")
print(head(financial_fully_long, 8))
print("Tidiness: YES ✓ (If each Region-Quarter-Metric amount is the observation)")
print("Best for: Filtering by metric type, general analysis")

# Option 2: Semi-long (separate Sales and Profit columns)
financial_semi_long <- financial_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = c("Quarter", ".value"),
    names_sep = "_"
  )

print("\nOption 2 - Semi-Long:")
print(financial_semi_long)
print("Tidiness: YES ✓ (If each Region-Quarter is the observation, with Sales and Profit as variables)")
print("Best for: Calculating profit margins, ratio analysis")

# Assessment 5: Use Case Analysis
print("\n\nAssessment 5 - Use Case Analysis")

use_cases <- data.frame(
  Use_Case = c(
    "ggplot2 visualization",
    "dplyr group_by operations",
    "Statistical modeling",
    "Excel export for managers",
    "Database storage",
    "Filtering specific values",
    "Calculating row-wise metrics",
    "Time series analysis",
    "Pivot tables",
    "Machine learning features"
  ),
  Preferred_Format = c(
    "LONG",
    "LONG",
    "LONG",
    "WIDE",
    "LONG",
    "LONG",
    "WIDE",
    "LONG",
    "WIDE",
    "WIDE"
  ),
  Reason = c(
    "ggplot works best with tidy long data",
    "Group by categories naturally",
    "Each row is independent observation",
    "Easier for humans to read",
    "Normalized structure, efficient storage",
    "Filter by single column easily",
    "Multiple columns available for calculation",
    "One time point per row",
    "Cross-tabulation format",
    "Each feature needs its own column"
  )
)

print(use_cases)

# Assessment 6: Practical Examples
print("\n\nAssessment 6 - Practical Transformation Examples")

# When LONG is better
print("\nExample A - Visualization (LONG is better):")
print("Task: Create a line chart showing sales trends by region")
print("Format needed: LONG (Region, Quarter, Sales)")
print("Why: ggplot2 aesthetics map best to long format")

# When WIDE is better
print("\nExample B - Comparison Table (WIDE is better):")
print("Task: Compare Q1 vs Q4 sales for each region")
print("Format needed: WIDE (Region, Q1, Q4)")
print("Why: Side-by-side columns easier to compare")

# Assessment 7: Tidiness Scoring Function
print("\n\nAssessment 7 - Tidiness Scoring Function")

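# Heuristic only: a rough screen for common untidy patterns, not a formal test.
# The `purpose` argument is currently unused and does not affect the score.
assess_tidiness <- function(data, purpose) {
  score <- list()
  
  # Check column names for embedded values (digits such as "Q1" or "2023")
  col_names <- names(data)
  has_value_in_names <- any(grepl("[0-9]", col_names))
  score$column_purity <- !has_value_in_names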
  
  # Check for multiple value columns
  numeric_cols <- sum(sapply(data, is.numeric))
  score$single_value_measure <- numeric_cols <= 2
  
  # Overall assessment
  score$tidy_score <- mean(c(score$column_purity, score$single_value_measure))
  
  score$recommendation <- ifelse(
    score$tidy_score >= 0.75,
    "Format is tidy - good for analysis",
    "Consider reshaping for analysis tasks"
  )
  
  return(score)
}

print("Tidiness Score for Wide Format:")
print(assess_tidiness(quarterly_sales_wide, "analysis"))

print("\nTidiness Score for Long Format:")
print(assess_tidiness(quarterly_sales_long_test, "analysis"))

# Assessment 8: Decision Matrix
print("\n\nAssessment 8 - Format Selection Decision Matrix")

decision_matrix <- data.frame(
  Scenario = c(
    "Data entry/collection",
    "Exploratory analysis",
    "Creating visualizations",
    "Statistical tests",
    "Reporting to executives",
    "Storing in database",
    "Merging with other data",
    "Calculating new variables",
    "Filtering/subsetting",
    "Aggregating data"
  ),
  Choose_Wide = c(
    "✓ (easier to enter)",
    "✗",
    "✗",
    "Depends",
    "✓ (easier to read)",
    "✗",
    "Depends",
    "✓ (row-wise calc)",
    "✗",
    "✗"
  ),
  Choose_Long = c(
    "✗",
    "✓",
    "✓",
    "✓ (usually)",
    "✗",
    "✓",
    "✓ (usually)",
    "Depends",
    "✓",
    "✓"
  )
)

print(decision_matrix)

# Summary
print("\n\n=== TIDY DATA ASSESSMENT SUMMARY ===")
print("\nKey Insights:")
print("1. LONG format is usually 'tidier' by Hadley Wickham's definition when column names hold values")
print("2. WIDE format is more readable for humans")
print("3. Tidiness depends on what you consider an 'observation'")
print("4. The BEST format depends on your immediate task")
print("5. You can (and should) switch between formats as needed")
print("\nBest Practice: Store in LONG, transform to WIDE when needed")

print("\n=== ASSESSMENT COMPLETE ===")


[1] "=== TIDY DATA ASSESSMENT ===\n"
[1] "Tidy Data Principles:"
[1] "1. Each variable forms a column"
[1] "2. Each observation forms a row"
[1] "3. Each type of observational unit forms a table\n"
[1] "Assessment 1 - Quarterly Sales Data"
[1] "\nWIDE FORMAT:"
   Region Q1_2023 Q2_2023 Q3_2023 Q4_2023
1   North   45000   48000   52000   55000
2   South   52000   55000   58000   61000
3    East   48000   51000   53000   56000
4    West   41000   43000   46000   48000
5 Central   39000   42000   44000   47000
[1] "\nLONG FORMAT:"
# A tibble: 10 × 3
   Region Quarter Sales
   <chr>  <chr>   <dbl>
 1 North  Q1_2023 45000
 2 North  Q2_2023 48000
 3 North  Q3_2023 52000
 4 North  Q4_2023 55000
 5 South  Q1_2023 52000
 6 South  Q2_2023 55000
 7 South  

## Part 7: Visualization Preparation

1. **ggplot2 Preparation:**
   - Prepare your `quarterly_sales_long` data for creating a line chart showing sales trends over time for each region.
   - Prepare your `employee_skills_long` data for creating a heatmap showing skill proficiency across employees.

2. **Comparison Visualization:**
   - Create a dataset that allows you to compare the same metric (e.g., sales) across different dimensions (e.g., regions, quarters) in a single visualization.
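
As a rough illustration of why the long format suits ggplot2, the sketch below maps one column to each aesthetic. The small `sales_long` and `skills_long` data frames are hypothetical stand-ins for the prepared homework data, and the column names `Skill` and `HasSkill` are assumptions about how the skills data was reshaped.

```r
library(ggplot2)

# Hypothetical stand-in for quarterly sales in long format
sales_long <- data.frame(
  Region  = rep(c("North", "South"), each = 4),
  Quarter = factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 2),
                   levels = c("Q1", "Q2", "Q3", "Q4")),
  Sales   = c(45000, 48000, 52000, 55000, 52000, 55000, 58000, 61000)
)

# Line chart: x = time, y = value, one line per region
line_plot <- ggplot(sales_long, aes(x = Quarter, y = Sales,
                                    group = Region, colour = Region)) +
  geom_line() +
  geom_point()

# Hypothetical stand-in for the employee skills data in long format
skills_long <- data.frame(
  Name     = rep(c("John Smith", "Jane Doe"), each = 2),
  Skill    = rep(c("Python", "R"), times = 2),
  HasSkill = c(TRUE, TRUE, TRUE, FALSE)
)

# Heatmap: one tile per employee-skill pair, filled by the TRUE/FALSE flag
heatmap_plot <- ggplot(skills_long, aes(x = Skill, y = Name, fill = HasSkill)) +
  geom_tile(colour = "white")

print(line_plot)
print(heatmap_plot)
```

Storing `Quarter` as an ordered factor keeps the x-axis in calendar order rather than alphabetical order, which matters once labels such as "Q1 2023" and "Q1 2024" are mixed.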

---

In [22]:
# Task 7.1: ggplot2 Data Preparation
# Prepare data for line chart and heatmap visualizations

print("=== GGPLOT2 DATA PREPARATION ===\n")

# Preparation 1: Line Chart - Sales Trends Over Time
print("Preparation 1 - Line Chart Data")

# Line charts in ggplot2 need:
# - Time variable (x-axis)
# - Value variable (y-axis)
# - Grouping variable (for multiple lines)

line_chart_data <- quarterly_sales_parsed %>%
  mutate(
    # Create proper ordering for quarters
    Quarter_Num = as.numeric(gsub("Q", "", Quarter)),
    Year = as.numeric(Year),
    # Create time period for x-axis
    Time_Period = paste0(Year, "-", Quarter),
    # Create date for better plotting
    Month = case_when(
      Quarter_Num == 1 ~ 1,
      Quarter_Num == 2 ~ 4,
      Quarter_Num == 3 ~ 7,
      Quarter_Num == 4 ~ 10
    ),
    Date = as.Date(paste(Year, Month, "01", sep = "-"))
  ) %>%
  arrange(Region, Date) %>%
  select(Region, Quarter, Year, Time_Period, Date, Sales)

print("Line Chart Ready Data:")
print(head(line_chart_data, 12))

print("\nData structure check:")
print(paste("- X-axis variable (Date):", class(line_chart_data$Date)))
print(paste("- Y-axis variable (Sales):", class(line_chart_data$Sales)))
print(paste("- Group variable (Region):", class(line_chart_data$Region)))
print("✓ Ready for ggplot2 line chart!")

# Preparation 2: Heatmap - Sales by Region and Quarter
print("\n\nPreparation 2 - Heatmap Data")

# Heatmaps need:
# - Row variable (y-axis)
# - Column variable (x-axis)
# - Fill variable (color intensity)

heatmap_data <- quarterly_sales_parsed %>%
  mutate(
    Quarter_Year = paste0(Quarter, " ", Year)
  ) %>%
  select(Region, Quarter_Year, Sales)

print("Heatmap Ready Data (Long Format):")
print(head(heatmap_data, 12))

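# Alternative: wide format, e.g. for a printed table or a matrix passed to base R's heatmap()
# (geom_tile() itself needs the long format prepared above)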
heatmap_data_wide <- quarterly_sales_parsed %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales
  ) %>%
  select(Region, Q1, Q2, Q3, Q4)

print("\nHeatmap Ready Data (Wide Format - Alternative):")
print(heatmap_data_wide)

print("\nData structure check:")
print(paste("- Row variable (Region):", class(heatmap_data$Region)))
print(paste("- Column variable (Quarter_Year):", class(heatmap_data$Quarter_Year)))
print(paste("- Fill variable (Sales):", class(heatmap_data$Sales)))
print("✓ Ready for ggplot2 heatmap!")

# Preparation 3: Bar Chart - Regional Comparison
print("\n\nPreparation 3 - Bar Chart Data")

bar_chart_data <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = mean(Sales),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Sales))

print("Bar Chart Ready Data:")
print(bar_chart_data)

print("\nData structure check:")
print(paste("- Category variable (Region):", class(bar_chart_data$Region)))
print(paste("- Value variable (Total_Sales):", class(bar_chart_data$Total_Sales)))
print("✓ Ready for ggplot2 bar chart!")

# Preparation 4: Stacked Bar Chart - Quarterly Breakdown
print("\n\nPreparation 4 - Stacked Bar Chart Data")

stacked_bar_data <- quarterly_sales_parsed %>%
  mutate(Quarter = factor(Quarter, levels = c("Q1", "Q2", "Q3", "Q4")))

print("Stacked Bar Chart Ready Data:")
print(head(stacked_bar_data, 12))

print("\nData structure check:")
print(paste("- Position variable (Region):", class(stacked_bar_data$Region)))
print(paste("- Fill variable (Quarter):", class(stacked_bar_data$Quarter)))
print(paste("- Value variable (Sales):", class(stacked_bar_data$Sales)))
print("✓ Ready for stacked bar chart!")

# Preparation 5: Faceted Visualization - Small Multiples
print("\n\nPreparation 5 - Faceted Visualization Data")

facet_data <- quarterly_sales_parsed %>%
  mutate(
    Quarter = factor(Quarter, levels = c("Q1", "Q2", "Q3", "Q4")),
    Year = as.factor(Year)
  )

print("Faceted Visualization Ready Data:")
print(head(facet_data, 12))

print("\nData structure check:")
print(paste("- Facet variable (Quarter):", class(facet_data$Quarter)))
print(paste("- Group variable (Region):", class(facet_data$Region)))
print("✓ Ready for faceted charts!")

# Preparation 6: Scatter Plot - Relationship Analysis
print("\n\nPreparation 6 - Scatter Plot Data")

scatter_data <- quarterly_sales_parsed %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Sales
  ) %>%
  select(Region, Q1, Q2, Q3, Q4)

print("Scatter Plot Ready Data:")
print(scatter_data)

print("\nData structure check:")
print("✓ Ready for Q1 vs Q4 comparison scatter plot!")

# Preparation 7: Box Plot - Distribution Analysis
print("\n\nPreparation 7 - Box Plot Data")

boxplot_data <- quarterly_sales_parsed %>%
  select(Region, Quarter, Sales)

print("Box Plot Ready Data:")
print(head(boxplot_data, 12))

print("\nData structure check:")
print(paste("- Category variable (Region):", class(boxplot_data$Region)))
print(paste("- Value variable (Sales):", class(boxplot_data$Sales)))
print("✓ Ready for box plot!")

# Preparation 8: Area Chart - Cumulative Sales
print("\n\nPreparation 8 - Area Chart Data")

area_chart_data <- quarterly_sales_parsed %>%
  mutate(
    Quarter_Num = as.numeric(gsub("Q", "", Quarter)),
    Year = as.numeric(Year)
  ) %>%
  arrange(Region, Year, Quarter_Num) %>%
  group_by(Region) %>%
  mutate(
    Cumulative_Sales = cumsum(Sales),
    Time_Period = paste0(Year, "-", Quarter)
  ) %>%
  ungroup()

print("Area Chart Ready Data:")
print(head(area_chart_data, 12))

print("\nData structure check:")
print("✓ Ready for area chart with cumulative values!")

# Preparation 9: Survey Data for Grouped Bar Chart
print("\n\nPreparation 9 - Survey Grouped Bar Chart Data")

survey_chart_data <- survey_responses_long %>%
  group_by(Question) %>%
  summarise(
    Avg_Response = mean(Response),
    Min_Response = min(Response),
    Max_Response = max(Response),
    .groups = "drop"
  )

print("Survey Chart Ready Data:")
print(survey_chart_data)

print("\nData structure check:")
print("✓ Ready for grouped bar chart of survey metrics!")

# Preparation 10: Data Quality Check Summary
print("\n\nPreparation 10 - Data Quality Summary")

quality_checks <- data.frame(
  Dataset = c(
    "Line Chart Data",
    "Heatmap Data",
    "Bar Chart Data",
    "Stacked Bar Data",
    "Facet Data",
    "Scatter Data",
    "Box Plot Data",
    "Area Chart Data",
    "Survey Chart Data"
  ),
  Rows = c(
    nrow(line_chart_data),
    nrow(heatmap_data),
    nrow(bar_chart_data),
    nrow(stacked_bar_data),
    nrow(facet_data),
    nrow(scatter_data),
    nrow(boxplot_data),
    nrow(area_chart_data),
    nrow(survey_chart_data)
  ),
  Missing_Values = c(
    sum(is.na(line_chart_data)),
    sum(is.na(heatmap_data)),
    sum(is.na(bar_chart_data)),
    sum(is.na(stacked_bar_data)),
    sum(is.na(facet_data)),
    sum(is.na(scatter_data)),
    sum(is.na(boxplot_data)),
    sum(is.na(area_chart_data)),
    sum(is.na(survey_chart_data))
  ),
  Ready_for_ggplot = c(
    "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓", "✓"
  )
)

print(quality_checks)

# Bonus: Export all datasets for visualization
print("\n\nBonus - Creating Export List")

viz_data_export <- list(
  line_chart = line_chart_data,
  heatmap = heatmap_data,
  bar_chart = bar_chart_data,
  stacked_bar = stacked_bar_data,
  facet = facet_data,
  scatter = scatter_data,
  boxplot = boxplot_data,
  area_chart = area_chart_data,
  survey_chart = survey_chart_data
)

print("Available visualization datasets:")
print(names(viz_data_export))

print("\n=== DATA PREPARATION COMPLETE ===")
print("\nAll datasets are now ready for ggplot2 visualizations!")
print("\nKey principles applied:")
print("1. ✓ Long format for most visualizations")
print("2. ✓ Proper date/time formatting")
print("3. ✓ Factor levels ordered logically")
print("4. ✓ No missing values in key variables")
print("5. ✓ Consistent naming conventions")
print("6. ✓ Variables properly typed (numeric, factor, date)")


[1] "=== GGPLOT2 DATA PREPARATION ===\n"
[1] "Preparation 1 - Line Chart Data"
[1] "Line Chart Ready Data:"
# A tibble: 12 × 6
   Region  Quarter  Year Time_Period Date       Sales
   <chr>   <chr>   <dbl> <chr>       <date>     <dbl>
 1 Central Q1       2023 2023-Q1     2023-01-01 39000
 2 Central Q2       2023 2023-Q2     2023-04-01 42000
 3 Central Q3       2023 2023-Q3     2023-07-01 44000
 4 Central Q4       2023 2023-Q4     2023-10-01 47000
 5 East    Q1       2023 2023-Q1     2023-01-01 48000
 6 East    Q2       2023 2023-Q2     2023-04-01 51000
 7 East    Q3       2023 2023-Q3     2023-07-01 53000

In [23]:
# Task 7.2: Comparison Visualization Data
# Create dataset for comparison visualization

print("=== COMPARISON VISUALIZATION DATA ===\n")

# Comparison 1: Year-over-Year Comparison (if multi-year data available)
print("Comparison 1 - Year-over-Year Sales Comparison")

# Create multi-year sample data for demonstration
multi_year_sales <- quarterly_sales_parsed %>%
  # Add 2022 data by reducing 2023 values
  bind_rows(
    quarterly_sales_parsed %>%
      mutate(
        Year = "2022",
        Sales = round(Sales * 0.85)  # 2022 was 15% lower
      )
  ) %>%
  arrange(Region, Year, Quarter)

yoy_comparison <- multi_year_sales %>%
  pivot_wider(
    names_from = Year,
    values_from = Sales,
    names_prefix = "Year_"
  ) %>%
  mutate(
    YoY_Change = Year_2023 - Year_2022,
    YoY_Change_Pct = round((YoY_Change / Year_2022) * 100, 2)
  )

print("Year-over-Year Comparison Data:")
print(head(yoy_comparison, 12))

print("\nVisualization Type: Grouped Bar Chart or Slope Chart")
print("✓ Ready for comparing 2022 vs 2023 performance")

# Comparison 2: Regional Performance Ranking
print("\n\nComparison 2 - Regional Performance Ranking")

regional_ranking <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = round(mean(Sales), 2),
    Growth_Rate = round(((last(Sales) - first(Sales)) / first(Sales)) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Sales)) %>%
  mutate(
    Rank = row_number(),
    Performance_Tier = case_when(
      Rank <= 2 ~ "Top Performer",
      Rank <= 4 ~ "Average Performer",
      TRUE ~ "Needs Improvement"
    )
  )

print("Regional Ranking Data:")
print(regional_ranking)

print("\nVisualization Type: Horizontal Bar Chart with Colors")
print("✓ Ready for ranking visualization")

# Comparison 3: Quarter-to-Quarter Growth Comparison
print("\n\nComparison 3 - Quarter-to-Quarter Growth Analysis")

qtq_comparison <- quarterly_sales_parsed %>%
  arrange(Region, Quarter) %>%
  group_by(Region) %>%
  mutate(
    Previous_Quarter_Sales = lag(Sales),
    QoQ_Change = Sales - Previous_Quarter_Sales,
    QoQ_Change_Pct = round((QoQ_Change / Previous_Quarter_Sales) * 100, 2),
    Growth_Status = case_when(
      is.na(QoQ_Change_Pct) ~ "First Quarter",
      QoQ_Change_Pct > 5 ~ "Strong Growth",
      QoQ_Change_Pct > 0 ~ "Moderate Growth",
      QoQ_Change_Pct > -5 ~ "Slight Decline",
      TRUE ~ "Significant Decline"
    )
  ) %>%
  ungroup()

print("Quarter-to-Quarter Comparison Data:")
print(qtq_comparison)

print("\nVisualization Type: Waterfall Chart or Line Chart with Annotations")
print("✓ Ready for growth trend visualization")

# Comparison 4: Top vs Bottom Performers
print("\n\nComparison 4 - Top vs Bottom Performers by Quarter")

top_bottom_comparison <- quarterly_sales_parsed %>%
  group_by(Quarter) %>%
  arrange(desc(Sales)) %>%
  mutate(
    Rank = row_number(),
    Performance = case_when(
      Rank == 1 ~ "Top Performer",
      Rank == n() ~ "Bottom Performer",
      TRUE ~ "Middle"
    )
  ) %>%
  filter(Performance != "Middle") %>%
  select(Quarter, Region, Sales, Performance) %>%
  ungroup()

print("Top vs Bottom Performers:")
print(top_bottom_comparison)

print("\nVisualization Type: Dumbbell Chart or Grouped Bar Chart")
print("✓ Ready for extremes comparison")

# Comparison 5: Market Share Comparison
print("\n\nComparison 5 - Market Share by Quarter")

market_share_comparison <- quarterly_sales_parsed %>%
  group_by(Quarter) %>%
  mutate(
    Quarter_Total = sum(Sales),
    Market_Share_Pct = round((Sales / Quarter_Total) * 100, 2)
  ) %>%
  ungroup() %>%
  arrange(Quarter, desc(Market_Share_Pct))

print("Market Share Comparison Data:")
print(head(market_share_comparison, 12))

print("\nVisualization Type: Stacked Bar Chart (100%) or Pie Charts")
print("✓ Ready for market share visualization")

# Comparison 6: Best Quarter Comparison Across Regions
print("\n\nComparison 6 - Best Quarter Identification by Region")

best_quarter_comparison <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  mutate(
    Is_Best_Quarter = Sales == max(Sales),
    Best_Quarter = Quarter[which.max(Sales)],
    Best_Sales = max(Sales),
    Performance_vs_Best = round((Sales / max(Sales)) * 100, 2)
  ) %>%
  ungroup()

print("Best Quarter Comparison Data:")
print(head(best_quarter_comparison, 12))

# Summary by region
best_quarters_summary <- best_quarter_comparison %>%
  group_by(Region) %>%
  summarise(
    Best_Quarter = first(Best_Quarter),
    Best_Sales = first(Best_Sales),
    Worst_Quarter = Quarter[which.min(Sales)],
    Worst_Sales = min(Sales),
    Range = Best_Sales - Worst_Sales,
    .groups = "drop"
  )

print("\nBest Quarter Summary by Region:")
print(best_quarters_summary)

print("\nVisualization Type: Heatmap with Highlights")
print("✓ Ready for peak performance visualization")

# Comparison 7: Quartile-based Performance Comparison
print("\n\nComparison 7 - Quartile-Based Performance Analysis")

quartile_comparison <- quarterly_sales_parsed %>%
  mutate(
    Sales_Quartile = ntile(Sales, 4),
    Quartile_Label = case_when(
      Sales_Quartile == 4 ~ "Q4 (Top 25%)",
      Sales_Quartile == 3 ~ "Q3 (Above Average)",
      Sales_Quartile == 2 ~ "Q2 (Below Average)",
      Sales_Quartile == 1 ~ "Q1 (Bottom 25%)"
    )
  ) %>%
  group_by(Quartile_Label) %>%
  mutate(
    Quartile_Avg = mean(Sales),
    Count = n()
  ) %>%
  ungroup()

print("Quartile Comparison Data:")
print(head(quartile_comparison, 12))

quartile_summary <- quartile_comparison %>%
  group_by(Quartile_Label, Quartile_Avg, Count) %>%
  summarise(
    Regions = paste(unique(Region), collapse = ", "),
    .groups = "drop"
  ) %>%
  arrange(desc(Quartile_Avg))

print("\nQuartile Summary:")
print(quartile_summary)

print("\nVisualization Type: Box Plot or Violin Plot")
print("✓ Ready for distribution comparison")

# Comparison 8: Side-by-Side Metric Comparison
print("\n\nComparison 8 - Multi-Metric Comparison")

multi_metric_comparison <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = mean(Sales),
    Max_Sales = max(Sales),
    Min_Sales = min(Sales),
    Volatility = sd(Sales),
    .groups = "drop"
  ) %>%
  pivot_longer(
    cols = -Region,
    names_to = "Metric",
    values_to = "Value"
  ) %>%
  group_by(Metric) %>%
  mutate(
    Normalized_Value = round((Value - min(Value)) / (max(Value) - min(Value)) * 100, 2)
  ) %>%
  ungroup()

print("Multi-Metric Comparison Data:")
print(head(multi_metric_comparison, 15))

print("\nVisualization Type: Radar Chart or Parallel Coordinates")
print("✓ Ready for multi-dimensional comparison")

# Comparison 9: Benchmark Comparison
print("\n\nComparison 9 - Performance vs Benchmark")

benchmark_comparison <- quarterly_sales_parsed %>%
  mutate(
    Overall_Avg = mean(Sales),
    Variance_from_Avg = Sales - Overall_Avg,
    Pct_vs_Avg = round((Variance_from_Avg / Overall_Avg) * 100, 2),
    Performance_vs_Benchmark = case_when(
      Pct_vs_Avg >= 10 ~ "Well Above Average",
      Pct_vs_Avg >= 0 ~ "Above Average",
      Pct_vs_Avg >= -10 ~ "Below Average",
      TRUE ~ "Well Below Average"
    )
  )

print("Benchmark Comparison Data:")
print(head(benchmark_comparison, 12))

benchmark_summary <- benchmark_comparison %>%
  group_by(Performance_vs_Benchmark) %>%
  summarise(
    Count = n(),
    Avg_Pct_Deviation = round(mean(Pct_vs_Avg), 2),
    .groups = "drop"
  )

print("\nBenchmark Summary:")
print(benchmark_summary)

print("\nVisualization Type: Diverging Bar Chart")
print("✓ Ready for benchmark comparison")

# Comparison 10: Comprehensive Comparison Dashboard
print("\n\nComparison 10 - Dashboard Comparison Dataset")

dashboard_comparison <- quarterly_sales_parsed %>%
  group_by(Region) %>%
  summarise(
    Q1 = Sales[Quarter == "Q1"],
    Q2 = Sales[Quarter == "Q2"],
    Q3 = Sales[Quarter == "Q3"],
    Q4 = Sales[Quarter == "Q4"],
    Total = sum(Sales),
    Average = round(mean(Sales), 2),
    Growth = round(((Q4 - Q1) / Q1) * 100, 2),
    .groups = "drop"
  ) %>%
  mutate(
    Rank = rank(desc(Total)),
    Best_Quarter = apply(.[, c("Q1", "Q2", "Q3", "Q4")], 1, function(x) names(which.max(x)))
  ) %>%
  arrange(Rank)

print("Dashboard Comparison Data:")
print(dashboard_comparison)

print("\nVisualization Type: Multi-Panel Dashboard")
print("✓ Ready for comprehensive comparison view")

# Export all comparison datasets
print("\n\nExporting All Comparison Datasets:")

comparison_datasets <- list(
  yoy_comparison = yoy_comparison,
  regional_ranking = regional_ranking,
  qtq_comparison = qtq_comparison,
  top_bottom = top_bottom_comparison,
  market_share = market_share_comparison,
  best_quarter = best_quarter_comparison,
  quartile = quartile_comparison,
  multi_metric = multi_metric_comparison,
  benchmark = benchmark_comparison,
  dashboard = dashboard_comparison
)

print("Available comparison datasets:")
print(names(comparison_datasets))

print("\n=== COMPARISON DATA PREPARATION COMPLETE ===")
print("\nComparison Types Prepared:")
print("1. ✓ Year-over-Year")
print("2. ✓ Regional Rankings")
print("3. ✓ Quarter-to-Quarter Growth")
print("4. ✓ Top vs Bottom Performers")
print("5. ✓ Market Share")
print("6. ✓ Best Quarter Analysis")
print("7. ✓ Quartile Distribution")
print("8. ✓ Multi-Metric Comparison")
print("9. ✓ Benchmark Comparison")
print("10. ✓ Dashboard Overview")
print("\nAll datasets optimized for comparative visualizations! 📊")


[1] "=== COMPARISON VISUALIZATION DATA ===\n"
[1] "Comparison 1 - Year-over-Year Sales Comparison"
[1] "Year-over-Year Comparison Data:"
# A tibble: 12 × 6
   Region  Quarter Year_2022 Year_2023 YoY_Change YoY_Change_Pct
   <chr>   <chr>       <dbl>     <dbl>      <dbl>          <dbl>
 1 Central Q1          33150     39000       5850           17.6
 2 Central Q2          35700     42000       6300           17.6
 3 Central Q3          37400     44000       6600           17.6
 4 Central Q4          39950     47000       7050           17.6
 5 East    Q1          40800     48000       7200

## Part 8: Analysis Questions

Using the reshaped datasets you've created, answer the following questions:

1. **Trend Analysis:** What trends do you observe in quarterly sales across different regions? Which region shows the most consistent growth?

2. **Skills Gap Analysis:** Based on the employee skills data, what are the most common skill gaps in the organization? Which skills are most prevalent?

3. **Survey Insights:** What are the key findings from the survey data? Are there any patterns in responses that suggest areas for improvement?

4. **Data Structure Impact:** How did reshaping the data change your ability to answer these business questions? Provide specific examples.

---

In [24]:
# Analysis Question 1: Trend Analysis
# Analyze quarterly sales trends

print("=== QUARTERLY SALES TREND ANALYSIS ===\n")

# Analysis 1.1: Overall Trend Direction
print("Analysis 1.1 - Overall Trend Direction")

overall_trend <- quarterly_sales_parsed %>%
  group_by(Quarter) %>%
  summarise(
    Total_Sales = sum(Sales),
    Avg_Sales = round(mean(Sales), 2),
    Median_Sales = median(Sales),
    SD_Sales = round(sd(Sales), 2),
    .groups = "drop"
  ) %>%
  mutate(Quarter_Num = as.numeric(gsub("Q", "", Quarter))) %>%
  # Sort by quarter before lag() so each change compares against the previous quarter
  arrange(Quarter_Num) %>%
  mutate(
    QoQ_Change = Total_Sales - lag(Total_Sales),
    QoQ_Change_Pct = round((QoQ_Change / lag(Total_Sales)) * 100, 2)
  )

print(overall_trend)

print("\nKey Findings:")
print(paste("- Q1 Total Sales:", overall_trend$Total_Sales[1]))
print(paste("- Q4 Total Sales:", overall_trend$Total_Sales[4]))
overall_growth <- round(((overall_trend$Total_Sales[4] - overall_trend$Total_Sales[1]) / 
                          overall_trend$Total_Sales[1]) * 100, 2)
print(paste("- Overall Growth Q1 to Q4:", overall_growth, "%"))
print(paste("- Average Quarterly Growth:", round(mean(overall_trend$QoQ_Change_Pct, na.rm = TRUE), 2), "%"))

# Determine trend
if (overall_growth > 10) {
  trend_direction <- "Strong Upward Trend"
} else if (overall_growth > 0) {
  trend_direction <- "Moderate Upward Trend"
} else if (overall_growth > -10) {
  trend_direction <- "Slight Downward Trend"
} else {
  trend_direction <- "Strong Downward Trend"
}
print(paste("- Trend Direction:", trend_direction))

# Analysis 1.2: Regional Trend Patterns
print("\n\nAnalysis 1.2 - Regional Trend Patterns")

regional_trends <- quarterly_sales_parsed %>%
  mutate(Quarter_Num = as.numeric(gsub("Q", "", Quarter))) %>%
  arrange(Region, Quarter_Num) %>%
  group_by(Region) %>%
  mutate(
    QoQ_Change = Sales - lag(Sales),
    QoQ_Change_Pct = round((QoQ_Change / lag(Sales)) * 100, 2),
    Cumulative_Sales = cumsum(Sales)
  ) %>%
  ungroup()

print("Regional Quarterly Performance:")
print(regional_trends)

# Summarize regional trends
regional_trend_summary <- regional_trends %>%
  group_by(Region) %>%
  summarise(
    Q1_Sales = first(Sales),
    Q4_Sales = last(Sales),
    Total_Growth = round(((Q4_Sales - Q1_Sales) / Q1_Sales) * 100, 2),
    Avg_QoQ_Growth = round(mean(QoQ_Change_Pct, na.rm = TRUE), 2),
    Total_Sales = sum(Sales),
    Best_Quarter = Quarter[which.max(Sales)],
    Worst_Quarter = Quarter[which.min(Sales)],
    Volatility = round(sd(Sales), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Growth))

print("\nRegional Trend Summary:")
print(regional_trend_summary)

print("\nRegional Performance Insights:")
best_growing <- regional_trend_summary$Region[1]
worst_growing <- regional_trend_summary$Region[nrow(regional_trend_summary)]
print(paste("- Fastest Growing Region:", best_growing, 
            "(", regional_trend_summary$Total_Growth[1], "% growth)"))
print(paste("- Slowest Growing Region:", worst_growing,
            "(", regional_trend_summary$Total_Growth[nrow(regional_trend_summary)], "% growth)"))

# Analysis 1.3: Seasonal Patterns
print("\n\nAnalysis 1.3 - Seasonal Pattern Analysis")

seasonal_analysis <- quarterly_sales_parsed %>%
  group_by(Quarter) %>%
  summarise(
    Avg_Sales = round(mean(Sales), 2),
    Min_Sales = min(Sales),
    Max_Sales = max(Sales),
    Sales_Range = Max_Sales - Min_Sales,
    .groups = "drop"
  ) %>%
  mutate(
    Quarter_Num = as.numeric(gsub("Q", "", Quarter)),
    Pct_of_Annual = round((Avg_Sales / sum(Avg_Sales)) * 100, 2)
  ) %>%
  arrange(Quarter_Num)

print("Seasonal Pattern by Quarter:")
print(seasonal_analysis)

print("\nSeasonal Insights:")
strongest_quarter <- seasonal_analysis$Quarter[which.max(seasonal_analysis$Avg_Sales)]
weakest_quarter <- seasonal_analysis$Quarter[which.min(seasonal_analysis$Avg_Sales)]
print(paste("- Strongest Quarter:", strongest_quarter, 
            "with avg sales of", seasonal_analysis$Avg_Sales[which.max(seasonal_analysis$Avg_Sales)]))
print(paste("- Weakest Quarter:", weakest_quarter,
            "with avg sales of", seasonal_analysis$Avg_Sales[which.min(seasonal_analysis$Avg_Sales)]))

seasonality_strength <- round(((max(seasonal_analysis$Avg_Sales) - min(seasonal_analysis$Avg_Sales)) / 
                                 mean(seasonal_analysis$Avg_Sales)) * 100, 2)
print(paste("- Seasonality Strength:", seasonality_strength, "%"))

if (seasonality_strength > 20) {
  print("- Pattern: Strong seasonal effect detected")
} else if (seasonality_strength > 10) {
  print("- Pattern: Moderate seasonal effect")
} else {
  print("- Pattern: Weak or no seasonal effect")
}

# Analysis 1.4: Linear Trend Analysis
print("\n\nAnalysis 1.4 - Linear Trend Analysis")

# Prepare data for linear regression
trend_data <- quarterly_sales_parsed %>%
  mutate(Quarter_Num = as.numeric(gsub("Q", "", Quarter))) %>%
  group_by(Region) %>%
  mutate(Time = row_number()) %>%
  ungroup()

# Calculate linear trends for each region
linear_trends <- trend_data %>%
  group_by(Region) %>%
  summarise(
    Slope = round(coef(lm(Sales ~ Time))[2], 2),
    Intercept = round(coef(lm(Sales ~ Time))[1], 2),
    R_Squared = round(summary(lm(Sales ~ Time))$r.squared, 3),
    Trend_Quality = case_when(
      R_Squared >= 0.9 ~ "Excellent Fit",
      R_Squared >= 0.7 ~ "Good Fit",
      R_Squared >= 0.5 ~ "Moderate Fit",
      TRUE ~ "Poor Fit"
    ),
    .groups = "drop"
  ) %>%
  arrange(desc(Slope))

print("Linear Trend Analysis by Region:")
print(linear_trends)

print("\nLinear Trend Insights:")
print(paste("- Most Consistent Growth:", linear_trends$Region[which.max(linear_trends$R_Squared)],
            "(R² =", linear_trends$R_Squared[which.max(linear_trends$R_Squared)], ")"))
print(paste("- Steepest Growth Trajectory:", linear_trends$Region[1],
            "(", linear_trends$Slope[1], "per quarter)"))

# Analysis 1.5: Growth Acceleration/Deceleration
print("\n\nAnalysis 1.5 - Growth Acceleration Analysis")

acceleration_analysis <- regional_trends %>%
  arrange(Region, Quarter) %>%
  group_by(Region) %>%
  mutate(
    Previous_Growth = lag(QoQ_Change_Pct),
    Growth_Acceleration = QoQ_Change_Pct - Previous_Growth,
    Acceleration_Type = case_when(
      is.na(Growth_Acceleration) ~ "N/A",
      Growth_Acceleration > 2 ~ "Accelerating",
      Growth_Acceleration < -2 ~ "Decelerating",
      TRUE ~ "Stable"
    )
  ) %>%
  ungroup()

print("Growth Acceleration by Region and Quarter:")
print(acceleration_analysis %>% 
        select(Region, Quarter, QoQ_Change_Pct, Growth_Acceleration, Acceleration_Type))

acceleration_summary <- acceleration_analysis %>%
  filter(!is.na(Growth_Acceleration)) %>%
  group_by(Region) %>%
  summarise(
    Avg_Acceleration = round(mean(Growth_Acceleration, na.rm = TRUE), 2),
    Accelerating_Qtrs = sum(Acceleration_Type == "Accelerating"),
    Decelerating_Qtrs = sum(Acceleration_Type == "Decelerating"),
    Momentum = case_when(
      Avg_Acceleration > 1 ~ "Building Momentum",
      Avg_Acceleration < -1 ~ "Losing Momentum",
      TRUE ~ "Stable Momentum"
    ),
    .groups = "drop"
  )

print("\nAcceleration Summary:")
print(acceleration_summary)

# Analysis 1.6: Comparative Trend Performance
print("\n\nAnalysis 1.6 - Comparative Regional Trend Performance")

comparative_trends <- regional_trends %>%
  group_by(Quarter) %>%
  mutate(
    Rank_in_Quarter = rank(desc(Sales)),
    Pct_of_Quarter_Total = round((Sales / sum(Sales)) * 100, 2)
  ) %>%
  ungroup() %>%
  arrange(Region, Quarter)

print("Comparative Rankings:")
print(comparative_trends %>% 
        select(Region, Quarter, Sales, Rank_in_Quarter, Pct_of_Quarter_Total))

# Track rank changes
rank_changes <- comparative_trends %>%
  group_by(Region) %>%
  summarise(
    Q1_Rank = Rank_in_Quarter[Quarter == "Q1"],
    Q4_Rank = Rank_in_Quarter[Quarter == "Q4"],
    Rank_Change = Q1_Rank - Q4_Rank,
    Rank_Trend = case_when(
      Rank_Change > 0 ~ "Improved Ranking",
      Rank_Change < 0 ~ "Declined Ranking",
      TRUE ~ "Maintained Ranking"
    ),
    .groups = "drop"
  ) %>%
  arrange(desc(Rank_Change))

print("\nRanking Changes (Q1 to Q4):")
print(rank_changes)

# Analysis 1.7: Overall Trend Summary
print("\n\n=== COMPREHENSIVE TREND ANALYSIS SUMMARY ===")

print("\n1. OVERALL MARKET TREND:")
print(paste("   - Direction:", trend_direction))
print(paste("   - Total Growth:", overall_growth, "%"))
print(paste("   - Avg Quarterly Growth:", round(mean(overall_trend$QoQ_Change_Pct, na.rm = TRUE), 2), "%"))

print("\n2. REGIONAL PERFORMANCE:")
print(paste("   - Best Performer:", best_growing))
print(paste("   - Growth Leader:", linear_trends$Region[1], 
            "(+", linear_trends$Slope[1], "per quarter)"))

print("\n3. SEASONAL PATTERNS:")
print(paste("   - Peak Quarter:", strongest_quarter))
print(paste("   - Seasonality Strength:", seasonality_strength, "%"))

print("\n4. TREND QUALITY:")
consistent_regions <- sum(linear_trends$R_Squared >= 0.7)
print(paste("   - Regions with Consistent Trends:", consistent_regions, "out of", nrow(linear_trends)))

print("\n5. MOMENTUM:")
building_momentum <- sum(acceleration_summary$Momentum == "Building Momentum")
print(paste("   - Regions Building Momentum:", building_momentum))
print(paste("   - Regions Losing Momentum:", 
            sum(acceleration_summary$Momentum == "Losing Momentum")))

print("\n6. KEY RECOMMENDATIONS:")
if (overall_growth > 15) {
  print("   ✓ Strong growth trajectory - maintain current strategies")
} else if (overall_growth > 0) {
  print("   ⚠ Moderate growth - explore acceleration opportunities")
} else {
  print("   ✗ Negative growth - urgent intervention needed")
}

if (seasonality_strength > 20) {
  print("   ⚠ High seasonality - develop strategies to smooth demand")
}

print(paste("   → Focus resources on:", best_growing, "(highest growth potential)"))
print(paste("   → Investigate challenges in:", worst_growing, "(needs improvement)"))

print("\n=== TREND ANALYSIS COMPLETE ===")


[1] "=== QUARTERLY SALES TREND ANALYSIS ===\n"
[1] "Analysis 1.1 - Overall Trend Direction"
# A tibble: 4 × 8
  Quarter Total_Sales Avg_Sales Median_Sales SD_Sales Quarter_Num QoQ_Change
  <chr>         <dbl>     <dbl>        <dbl>    <dbl>       <dbl>      <dbl>
1 Q1           225000     45000        45000    5244.           1         NA
2 Q2           239000     47800        48000    5450.           2      14000
3 Q3           253000     50600        52000    5639.           3      14000
4 Q4           267000

ℹ In argument: `R_Squared = round(summary(lm(Sales ~ Time))$r.squared, 3)`.
ℹ In group 4: `Region = "South"`.
! essentially perfect fit: summary may be unreliable


[1] "Linear Trend Analysis by Region:"
# A tibble: 5 × 5
  Region  Slope Intercept R_Squared Trend_Quality
  <chr>   <dbl>     <dbl>     <dbl> <chr>
1 North    3400     41500     0.997 Excellent Fit
2 South    3000     49000     1     Excellent Fit
3 Central  2600     36500     0.994 Excellent Fit
4 East     2600     45500     0.994 Excellent Fit
5 West     2400     38500     0.993 Excellent Fit
[1] "\nLinear Trend Insights:"
[1] "- Most Consistent Growth: South (R² = 1 )"
[1] "- Steepest Growth Trajectory: North ( 3400 per quarter)"
[1] "\n\nAnalysis 1.5 - Growth Acceleration Analysis"
[1] "Growth Acceleration by Region and Quarter:"
# A tibble: 20 × 5
   Region  Quarter QoQ_Change_

In [None]:
# Analysis Question 2: Skills Gap Analysis  
# Analyze employee skills gaps

ERROR: Error in parse(text = input): <text>:84:57: unexpected numeric constant
83: critical_gaps <- importance_proficiency %>%
84:   filter(Importance == "Critical" & Avg_Proficiency < 4)5
                                                            ^
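
The parse error above points at a stray `5` after the closing parenthesis of `filter()` on line 84 of the cell; removing it should let the cell parse. Because the rest of that cell is not shown here, what follows is only a minimal sketch of a skills-gap summary, not the original code. It assumes the `employee_skills_wide` layout from Part 1 (`EmployeeID`, `Name`, and one logical column per skill) and repeats only the four skill columns visible there; `skills_long` and `skill_coverage` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Sample data mirroring the Part 1 skills matrix (one logical flag per skill)
employee_skills_wide <- data.frame(
  EmployeeID = c(101, 102, 103, 104, 105),
  Name = c("John Smith", "Jane Doe", "Bob Johnson", "Alice Williams", "Charlie Brown"),
  Python = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  R = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  SQL = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  Excel = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Reshape to one row per employee-skill pair
skills_long <- employee_skills_wide %>%
  pivot_longer(
    cols = -c(EmployeeID, Name),
    names_to = "Skill",
    values_to = "Has_Skill"
  )

# Share of employees holding each skill; the gap is the remainder
skill_coverage <- skills_long %>%
  group_by(Skill) %>%
  summarise(
    Employees_With_Skill = sum(Has_Skill),
    Coverage_Pct = round(mean(Has_Skill) * 100, 1),
    Gap_Pct = 100 - Coverage_Pct,
    .groups = "drop"
  ) %>%
  arrange(desc(Gap_Pct))

print(skill_coverage)
```

Sorting by `Gap_Pct` puts the least-covered skill first, which is one simple way to read off the largest gap.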


In [30]:
# Analysis Question 3: Survey Insights
# Analyze survey response patterns

# Load necessary libraries
library(tidyverse)

# Example: assuming your dataset is called 'survey_data'
# and contains columns like: respondent_id, age, gender, satisfaction, etc.

# 1. View basic structure and missing data
glimpse(survey_data)
summary(survey_data)
colSums(is.na(survey_data))

# 2. Frequency of responses for categorical questions
survey_data %>%
  select(where(is.factor)) %>%
  map(~ table(.x, useNA = "ifany")) 

# 3. Descriptive stats for numeric variables
survey_data %>%
  select(where(is.numeric)) %>%
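  # Pass na.rm inside each lambda; supplying it through across()'s `...` is deprecated
  summarise(across(
    everything(),
    list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE))
  ))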

# 4. Analyze response patterns by demographic group
survey_data %>%
  group_by(gender) %>%
  summarise(
    avg_satisfaction = mean(satisfaction, na.rm = TRUE),
    count = n()
  )

# 5. Visualize response distributions
survey_data %>%
  ggplot(aes(x = satisfaction, fill = gender)) +
  geom_histogram(position = "dodge", bins = 10) +
  labs(title = "Satisfaction by Gender", x = "Satisfaction Score", y = "Count")

# 6. Correlation heatmap (if numeric survey items)
numeric_data <- survey_data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data, use = "complete.obs")

# Optional visualization
library(corrplot)
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black")

# 7. Open-ended responses (if applicable)
# For a column like 'comments', get word frequencies
library(tidytext)

survey_data %>%
  unnest_tokens(word, comments) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stop_words$word) %>%
  head(20)


ERROR: Error: object 'survey_data' not found
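
The error above arises because `survey_data` is a placeholder that was never created; the survey data set up in Part 1 is `survey_responses_long`, with columns `RespondentID`, `Question`, and `Response`. A minimal sketch of a comparable summary against that data frame might look like the following. The sample rows repeat the Part 1 definition, and `question_summary` and `respondent_wide` are illustrative names, not part of the assignment.

```r
library(dplyr)
library(tidyr)

# Sample data repeating the Part 1 survey definition
survey_responses_long <- data.frame(
  RespondentID = rep(1:5, each = 3),
  Question = rep(c("Satisfaction", "Quality", "Value"), 5),
  Response = c(4, 5, 4, 3, 4, 3, 5, 5, 4, 4, 4, 3, 3, 3, 2)
)

# Per-question summary, lowest average first
question_summary <- survey_responses_long %>%
  group_by(Question) %>%
  summarise(
    Avg_Response = round(mean(Response, na.rm = TRUE), 2),
    SD_Response = round(sd(Response, na.rm = TRUE), 2),
    Low_Scores = sum(Response <= 3, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(Avg_Response)

print(question_summary)

# Widen to one row per respondent so questions can be correlated
respondent_wide <- survey_responses_long %>%
  pivot_wider(names_from = Question, values_from = Response)

print(cor(select(respondent_wide, -RespondentID)))
```

The long format suits the grouped summary, while the correlation step needs one column per question, so the same data is widened first.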


## Part 9: Reflection Questions

Answer the following questions in your submission:

1. **Tidy Data Philosophy:** Explain the concept of "tidy data" in your own words. Why is this concept important for data analysis, and how does it relate to the reshaping operations you performed?

   Tidy data is a standardized way of organizing data where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. This structure simplifies data manipulation, visualization, and modeling. The reshaping operations in this assignment are how data gets into that structure: for example, `pivot_longer()` turns the quarter columns of `quarterly_sales_wide` into a single quarter variable, with one row per region and quarter.

2. **Format Selection:** For each of the datasets you worked with, explain when you would prefer the wide format versus the long format. What factors influence this decision?

   I would prefer the wide format when I need to compare multiple categories side by side, such as in dashboards or summary tables. I would prefer the long format for grouped summaries and for most `ggplot2` visualizations, which expect one observation per row. The main factors are who or what consumes the result (a person reading a table or a function operating on columns) and whether the column names themselves hold data, such as quarters or skill names.

3. **Business Context:** Describe three real-world business scenarios where data reshaping would be essential. For each scenario, explain what format the data might start in and what format would be needed for analysis.

   First, sales reporting: the data typically starts in wide format, with one column per period so a trend can be read across a row, and needs to be converted to long format for analysis. Second, customer feedback surveys: responses stored in long format need to be converted to wide format for summary statistics. Third, employee skills tracking: a skills matrix starts in wide format and is converted to long format for analysis.

4. **Tool Integration:** How do the reshaping capabilities of `tidyr` complement the data manipulation functions of `dplyr`? Provide examples of analyses that require both types of operations.

   The reshaping capabilities of `tidyr` complement the data manipulation functions of `dplyr` by allowing users to easily switch between wide and long formats, which is often necessary for different types of analyses. For example, one might filter and summarize data before reshaping it for visualization. Conversely, after reshaping data, one might group and summarize the reshaped data for reporting purposes.

5. **Data Pipeline:** In a typical business analytics workflow, at what stage would you perform data reshaping? How does this fit into the overall data wrangling process?

   In a typical business analytics workflow, data reshaping is performed during the data wrangling stage, after initial data cleaning and before analysis or visualization. This ensures that the data is in the appropriate format for the specific analytical methods or visualizations being used.

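
To make the `tidyr` and `dplyr` interplay described in these answers concrete, here is a minimal sketch using the Part 1 sales sample: reshape with `pivot_longer()`, summarise with `dplyr`, then widen again for a report-style table. The names `sales_long`, `region_growth`, and `report_table` are illustrative and do not come from the assignment.

```r
library(dplyr)
library(tidyr)

# Sample data repeating the Part 1 quarterly sales definition
quarterly_sales_wide <- data.frame(
  Region = c("North", "South", "East", "West", "Central"),
  Q1_2023 = c(45000, 52000, 48000, 41000, 39000),
  Q2_2023 = c(48000, 55000, 51000, 43000, 42000),
  Q3_2023 = c(52000, 58000, 53000, 46000, 44000),
  Q4_2023 = c(55000, 61000, 56000, 48000, 47000)
)

# tidyr: wide -> long, splitting names like "Q1_2023" into Quarter and Year
sales_long <- quarterly_sales_wide %>%
  pivot_longer(
    cols = -Region,
    names_to = c("Quarter", "Year"),
    names_sep = "_",
    values_to = "Sales"
  )

# dplyr: grouped summary on the long data
region_growth <- sales_long %>%
  arrange(Region, Quarter) %>%
  group_by(Region) %>%
  summarise(
    Total_Sales = sum(Sales),
    Growth_Pct = round((last(Sales) - first(Sales)) / first(Sales) * 100, 2),
    .groups = "drop"
  ) %>%
  arrange(desc(Growth_Pct))

# tidyr again: back to wide for a side-by-side report table
report_table <- sales_long %>%
  pivot_wider(names_from = Quarter, values_from = Sales)

print(region_growth)
print(report_table)
```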
---

**Submission Checklist:**

- [ ] R notebook with all code and outputs completed
- [ ] All required data reshaping operations completed successfully
- [ ] Data validation checks performed and documented
- [ ] Datasets prepared for visualization and further analysis
- [ ] Answers to analysis questions with supporting evidence
- [ ] Answers to reflection questions
- [ ] Code is well-commented and demonstrates understanding of when to use each reshaping function

Good luck!