# Homework Assignment - Lesson 5: Data Reshaping with tidyr

**Due Date:** 9/28/2025 

**Instructions:**

- Complete the following tasks in this R notebook
- Use the pipe operator (`%>%`) and chain operations wherever possible
- Ensure your code is well-commented and easy to understand
- Submit your completed notebook file

---

## Part 1: Data Import and Setup

1. **Data Import:**
   - Download the following files from the course materials:
     - `quarterly_sales_wide.csv` - Sales data in wide format with quarters as columns
     - `survey_responses_long.csv` - Survey data in long format
     - `employee_skills_wide.csv` - Employee skills matrix in wide format
   - Import each file into appropriately named data frames.
   - Load the `tidyverse` package.

2. **Initial Exploration:**
   - Examine the structure of each dataset using `str()` and `head()`.
   - Identify which datasets are in "wide" format and which are in "long" format.
   - Note any patterns in column names that might be useful for reshaping.

---

In [68]:
# Load necessary packages
library(tidyverse) # includes tidyr

# Import the required datasets
quarterly_sales_wide <- read.csv("quarterly_sales_wide.csv", stringsAsFactors = FALSE)
survey_responses_long <- read.csv("survey_responses_long.csv", stringsAsFactors = FALSE)
employee_skills_wide <- read.csv("employee_skills_wide.csv", stringsAsFactors = FALSE)

# Initial exploration
cat("=== QUARTERLY SALES DATA ===\n")
str(quarterly_sales_wide)
print(head(quarterly_sales_wide))

cat("\n=== SURVEY RESPONSES DATA ===\n")
str(survey_responses_long)
print(head(survey_responses_long))

cat("\n=== EMPLOYEE SKILLS DATA ===\n")
str(employee_skills_wide)
print(head(employee_skills_wide))

=== QUARTERLY SALES DATA ===
'data.frame':	4 obs. of  8 variables:
 $ Region          : chr  "North" "South" "East" "West"
 $ Product_Category: chr  "Electronics" "Clothing" "Electronics" "Clothing"
 $ Q1_2023         : int  45000 32000 38000 28000
 $ Q2_2023         : int  48000 35000 41000 31000
 $ Q3_2023         : int  46000 33000 39000 29000
 $ Q4_2023         : int  52000 38000 44000 34000
 $ Q1_2024         : int  50000 36000 42000 32000
 $ Q2_2024         : int  54000 40000 46000 36000
  Region Product_Category Q1_2023 Q2_2023 Q3_2023 Q4_2023 Q1_2024 Q2_2024
1  North      Electronics   45000   48000   46000   52000   50000   54000
2  South         Clothing   32000   35000   33000   38000   36000   40000
3   East      Electronics   38000   41000   39000   44000   42000   46000
4   West         Clothing   28000   31000   29000   34000   32000   36000

=== SURVEY RESPONSES DATA ===
'data.frame':	250 obs. of  3 variables:
 $ Respondent_ID: int  1 1 1 1 1 2 2 2 2 2 ...
 $ Question  

  Employee_ID Employee_Name Department R_Programming Excel SQL Python Tableau
1           1    Employee 1  Marketing             4     4   4      2       4
2           2    Employee 2    Finance             3     5   2      4       2
3           3    Employee 3    Finance             1     2   1      4       4
4           4    Employee 4         IT             4     5   3      5       2
5           5    Employee 5    Finance             1     2   1      2       1
6           6    Employee 6         IT             5     2   1      4       1


## Part 2: Converting Wide to Long with `pivot_longer()`

1. **Basic Wide to Long Conversion:**
   - Using the `quarterly_sales_wide` dataset, convert it from wide to long format:
     - The quarter columns (e.g., `Q1_2023`, `Q2_2023`, etc.) should become values in a new column called `Quarter`
     - The sales values should go into a new column called `Sales_Amount`
     - Keep all other identifying columns (e.g., `Region`, `Product_Category`)
   - Store the result in a data frame called `quarterly_sales_long`.

2. **Advanced Wide to Long with Name Parsing:**
   - If the quarter columns contain both year and quarter information (e.g., `Q1_2023`, `Q2_2023`), use `names_sep` or `names_pattern` to separate this into two columns: `Quarter` and `Year`.
   - Store the result in a data frame called `quarterly_sales_parsed`.

3. **Employee Skills Conversion:**
   - Using the `employee_skills_wide` dataset, convert it from wide to long format:
     - Skill columns (e.g., `R_Programming`, `Excel`, `SQL`) should become values in a column called `Skill`
     - The proficiency levels should go into a column called `Proficiency_Level`
     - Keep employee identifying information
   - Store the result in a data frame called `employee_skills_long`.
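
When a table has many measurement columns, listing them by name makes it easy to leave one out. A minimal sketch of the alternative, assuming a small hypothetical table (`skills_toy`, not one of the course files): select everything except the identifier columns, so every skill column is pivoted.

```r
library(tidyr)
library(dplyr)

# Hypothetical two-employee skills table (illustrative only, not course data)
skills_toy <- tibble(
  Employee_ID   = c(1L, 2L),
  Department    = c("Marketing", "Finance"),
  R_Programming = c(4L, 3L),
  Excel         = c(4L, 5L),
  Python        = c(2L, 4L)
)

# Pivot every column that is NOT an identifier, so no skill column is left behind
skills_toy_long <- skills_toy %>%
  pivot_longer(
    cols = -c(Employee_ID, Department),
    names_to = "Skill",
    values_to = "Proficiency_Level"
  )

print(skills_toy_long)
```

The negative selection keeps working if more skill columns are added later, whereas an explicit list would need updating.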

---

In [69]:
# Check if file exists and load data
if (file.exists("quarterly_sales_wide.csv")) {
  quarterly_sales_wide <- read.csv("quarterly_sales_wide.csv", stringsAsFactors = FALSE)
  print("quarterly_sales_wide loaded successfully.")
  print(head(quarterly_sales_wide))
} else {
  stop("File quarterly_sales_wide.csv not found in the working directory.")
}

[1] "quarterly_sales_wide loaded successfully."
  Region Product_Category Q1_2023 Q2_2023 Q3_2023 Q4_2023 Q1_2024 Q2_2024
1  North      Electronics   45000   48000   46000   52000   50000   54000
2  South         Clothing   32000   35000   33000   38000   36000   40000
3   East      Electronics   38000   41000   39000   44000   42000   46000
4   West         Clothing   28000   31000   29000   34000   32000   36000


In [70]:
# Task 2.1: Basic Wide to Long Conversion - Quarterly Sales
quarterly_sales_long <- quarterly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),       # quarter columns (Q1_2023, Q2_2023, ...)
    names_to = "Quarter",          # new column holding the former column names
    values_to = "Sales_Amount"     # new column holding the sales values
  )

print("Quarterly Sales - Long Format:")
print(head(quarterly_sales_long))

[1] "Quarterly Sales - Long Format:"
# A tibble: 6 × 4
  Region Product_Category Quarter Sales_Amount
  <chr>  <chr>            <chr>          <int>
1 North  Electronics      Q1_2023        45000
2 North  Electronics      Q2_2023        48000
3 North  Electronics      Q3_2023        46000
4 North  Electronics      Q4_2023        52000
5 North  Electronics      Q1_2024        50000
6 North  Electronics      Q2_2024        54000


In [71]:
# Task 2.2: Advanced Wide to Long with Name Parsing
quarterly_sales_parsed <- quarterly_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = c("Quarter", "Year"),  # split each column name into two new columns
    names_sep = "_",                  # "Q1_2023" splits at the underscore into "Q1" and "2023"
    values_to = "Sales_Amount"        # new column holding the sales values
  )

print("Quarterly Sales - Parsed Format:")
print(head(quarterly_sales_parsed))

[1] "Quarterly Sales - Parsed Format:"
# A tibble: 6 × 5
  Region Product_Category Quarter Year  Sales_Amount
  <chr>  <chr>            <chr>   <chr>        <int>
1 North  Electronics      Q1      2023         45000
2 North  Electronics      Q2      2023         48000
3 North  Electronics      Q3      2023         46000
4 North  Electronics      Q4      2023         52000
5 North  Electronics      Q1      2024         50000
6 North  Electronics      Q2      2024         54000


In [72]:
# Task 2.3: Employee Skills Wide to Long
employee_skills_long <- employee_skills_wide %>%
  pivot_longer(
    # Only these three skill columns are pivoted; Python and Tableau remain ordinary columns
    cols = c(R_Programming, Excel, SQL),
    names_to = "Skill",                     # new column holding the skill names
    values_to = "Proficiency_Level"         # new column holding the ratings
  )

print("Employee Skills - Long Format:")
print(head(employee_skills_long))

[1] "Employee Skills - Long Format:"
# A tibble: 6 × 7
  Employee_ID Employee_Name Department Python Tableau Skill    Proficiency_Level
        <int> <chr>         <chr>       <int>   <int> <chr>                <int>
1           1 Employee 1    Marketing       2       4 R_Progr…                 4
2           1 Employee 1    Marketing       2       4 Excel                    4
3           1 Employee 1    Marketing       2       4 SQL                      4
4           2 Employee 2    Finance         4       2 R_Progr…                 3
5           2 Employee 2    Finance         4       2 Excel                    5
6           2 Employee 2    Finance         4       2 SQL                      2


## Part 3: Converting Long to Wide with `pivot_wider()`

1. **Basic Long to Wide Conversion:**
   - Using the `survey_responses_long` dataset (which should have columns like `Respondent_ID`, `Question`, `Response`), convert it to wide format:
     - Each unique question should become a separate column
     - The responses should fill the cells
     - Each row should represent one respondent
   - Store the result in a data frame called `survey_responses_wide`.

2. **Aggregated Long to Wide:**
   - Using your `quarterly_sales_long` data from Part 2, create a wide format where:
     - Each region becomes a column
     - Each row represents a quarter-year combination
     - The values are the total sales for that region in that quarter
   - Store the result in a data frame called `sales_by_region_wide`.

3. **Skills Matrix Creation:**
   - Using your `employee_skills_long` data from Part 2, create a skills matrix where:
     - Each skill becomes a column
     - Each row represents an employee
     - The values are the proficiency levels
   - Store the result in a data frame called `skills_matrix`.

---

In [73]:
# Task 3.1: Survey Responses Long to Wide
survey_responses_wide <- survey_responses_long %>%
  pivot_wider(
    names_from = Question,      # each unique question becomes a column
    values_from = Response      # responses fill the cells
  )

print("Survey Responses - Wide Format:")
print(head(survey_responses_wide))

[1] "Survey Responses - Wide Format:"
# A tibble: 6 × 6
  Respondent_ID Product_Quality Customer_Service Value_for_Money Delivery_Speed
          <int>           <int>            <int>           <int>          <int>
1             1               5                4               3              4
2             2               1                3               2              3
3             3               3                3               2              3
4             4               3                5               4              1
5             5               5                1               4              4
6             6               2                1               4              4
# ℹ 1 more variable: Overall_Satisfaction <int>


In [74]:
# Task 3.2: Aggregated Long to Wide - Sales by Region
sales_by_region_wide <- quarterly_sales_long %>%
  pivot_wider(
    id_cols = Quarter,             # one row per quarter-year combination
    names_from = Region,           # each region becomes a column
    values_from = Sales_Amount,    # cells hold the sales values
    values_fn = sum                # total sales if a region has several rows per quarter
  )

print("Sales by Region - Wide Format:")
print(head(sales_by_region_wide))

[1] "Sales by Region - Wide Format:"
# A tibble: 6 × 5
  Quarter North South  East  West
  <chr>   <int> <int> <int> <int>
1 Q1_2023 45000 32000 38000 28000
2 Q2_2023 48000 35000 41000 31000
3 Q3_2023 46000 33000 39000 29000
4 Q4_2023 52000 38000 44000 34000
5 Q1_2024 50000 36000 42000 32000
6 Q2_2024 54000 40000 46000 36000


In [75]:
# Task 3.3: Skills Matrix Creation
skills_matrix <- employee_skills_long %>%
  pivot_wider(
    names_from = Skill,                  # each skill becomes a column
    values_from = Proficiency_Level      # proficiency levels fill the cells
  )

print("Skills Matrix:")
print(head(skills_matrix))

[1] "Skills Matrix:"
# A tibble: 6 × 8
  Employee_ID Employee_Name Department Python Tableau R_Programming Excel   SQL
        <int> <chr>         <chr>       <int>   <int>         <int> <int> <int>
1           1 Employee 1    Marketing       2       4             4     4     4
2           2 Employee 2    Finance         4       2             3     5     2
3           3 Employee 3    Finance         4       4             1     2     1
4           4 Employee 4    IT              5       2             4     5     3
5           5 Employee 5    Finance         2       1             1     2     1
6           6 Employee 6    IT              4       1             5     2     1


## Part 4: Complex Reshaping Scenarios

1. **Multiple Value Columns:**
   - Create a dataset that has both `Sales_Amount` and `Profit_Amount` for each quarter and region.
   - Convert this to long format where you have separate rows for sales and profit, with a column indicating the metric type.
   - Then convert it back to wide format with quarters as columns.

2. **Handling Missing Values in Reshaping:**
   - When reshaping your data, some combinations might not exist (e.g., an employee might not have a rating for every skill).
   - Demonstrate how `pivot_wider()` handles missing values and how you can control this behavior using the `values_fill` argument.

3. **Nested Reshaping:**
   - Take your `quarterly_sales_long` data and create a summary that shows:
     - Average sales by product category and quarter
     - Convert this to wide format with quarters as columns
     - Then convert back to long format but group quarters into "H1" (Q1, Q2) and "H2" (Q3, Q4)
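
The first two tasks can be sketched on a small hypothetical table (`sales_profit`, with invented numbers, not taken from the course files). It assumes one row per region and quarter with two value columns, stacks them into a `Metric` column, spreads the quarters back out, and then shows how `values_fill` replaces the `NA` that appears when a combination is absent.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: one row per region and quarter, two value columns
sales_profit <- tibble(
  Region        = c("North", "North", "South", "South"),
  Quarter       = c("Q1_2023", "Q2_2023", "Q1_2023", "Q2_2023"),
  Sales_Amount  = c(45000, 48000, 32000, 35000),
  Profit_Amount = c(9000, 9600, 6400, 7000)
)

# Long: one row per region, quarter and metric
sales_profit_long <- sales_profit %>%
  pivot_longer(
    cols = c(Sales_Amount, Profit_Amount),
    names_to = "Metric",
    values_to = "Amount"
  )

# Wide again: quarters as columns, one row per region and metric
sales_profit_by_quarter <- sales_profit_long %>%
  pivot_wider(names_from = Quarter, values_from = Amount)

# Remove one combination so it is implicitly missing, then fill the gap with 0
sales_profit_filled <- sales_profit_long %>%
  filter(!(Region == "South" & Quarter == "Q2_2023" & Metric == "Profit_Amount")) %>%
  pivot_wider(names_from = Quarter, values_from = Amount, values_fill = 0)

print(sales_profit_by_quarter)
print(sales_profit_filled)
```

Without `values_fill`, the removed South profit cell for `Q2_2023` would be `NA`. Whether 0 is an appropriate replacement depends on what a missing value means in the data.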

---

In [76]:
# Task 4.1: Multiple Value Columns
# Profit_Amount is a synthetic column here: 20% of Q1 2023 sales.
# It is added inside the pipeline so quarterly_sales_wide itself is left unchanged.
multi_value_long <- quarterly_sales_wide %>%
  mutate(Profit_Amount = round(Q1_2023 * 0.2, 2)) %>%
  pivot_longer(
    cols = matches("Q[1-4]_\\d{4}|Profit_Amount"),
    names_to = "Metric",
    values_to = "Amount"
  )
print("Multi-value Long Format:")
print(head(multi_value_long))

multi_value_wide <- multi_value_long %>%
  pivot_wider(
    names_from = Metric,
    values_from = Amount
  )

print("Multi-value Wide Format:")
print(head(multi_value_wide))

[1] "Multi-value Long Format:"
# A tibble: 6 × 4
  Region Product_Category Metric  Amount
  <chr>  <chr>            <chr>    <dbl>
1 North  Electronics      Q1_2023  45000
2 North  Electronics      Q2_2023  48000
3 North  Electronics      Q3_2023  46000
4 North  Electronics      Q4_2023  52000
5 North  Electronics      Q1_2024  50000
6 North  Electronics      Q2_2024  54000
[1] "Multi-value Wide Format:"
# A tibble: 4 × 9
  Region Product_Category Q1_2023 Q2_2023 Q3_2023 Q4_2023 Q1_2024 Q2_2024
  <chr>  <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1

In [77]:
# Task 4.2: Handling Missing Values
skills_matrix_default <- employee_skills_long %>%
  pivot_wider(
    names_from = Skill,
    values_from = Proficiency_Level
  )

print("Skills Matrix - Default (NA for missing):")
print(head(skills_matrix_default))

skills_matrix_filled <- employee_skills_long %>%
  pivot_wider(
    names_from = Skill,
    values_from = Proficiency_Level,
    values_fill = 0
  )

print("Skills Matrix - Filled (0 for missing):")
print(head(skills_matrix_filled))

[1] "Skills Matrix - Default (NA for missing):"
# A tibble: 6 × 8
  Employee_ID Employee_Name Department Python Tableau R_Programming Excel   SQL
        <int> <chr>         <chr>       <int>   <int>         <int> <int> <int>
1           1 Employee 1    Marketing       2       4             4     4     4
2           2 Employee 2    Finance         4       2             3     5     2
3           3 Employee 3    Finance         4       4             1     2     1
4           4 Employee 4    IT              5       2             4     5     3
5           5 Employee 5    Finance         2       1             1     2     1
6           6 Employee 6    IT              4       1             5     2     1
[1] "Skills Matrix - Filled (0 for missing):"
# A tibble: 6 × 8

In [78]:
# Task 4.3: Nested Reshaping
avg_sales <- quarterly_sales_long %>%
  group_by(Product_Category, Quarter) %>%
  summarise(Average_Sales = mean(Sales_Amount, na.rm = TRUE))

avg_sales_wide <- avg_sales %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Average_Sales
  )

print("Average Sales by Product Category and Quarter (Wide):")
print(head(avg_sales_wide))

avg_sales_long <- avg_sales_wide %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "Quarter",
    values_to = "Average_Sales"
  ) %>%
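  # Derive the half-year from the quarter label so every year is covered, not only 2023
  mutate(Half = case_when(
    str_detect(Quarter, "^Q[12]_") ~ paste0("H1_", str_sub(Quarter, -4)),
    str_detect(Quarter, "^Q[34]_") ~ paste0("H2_", str_sub(Quarter, -4)),
    TRUE ~ NA_character_
  ))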

print("Average Sales by Product Category and Half (Long):")
print(head(avg_sales_long))

`summarise()` has grouped output by 'Product_Category'. You can override using
the `.groups` argument.
[1] "Average Sales by Product Category and Quarter (Wide):"
# A tibble: 2 × 7
# Groups:   Product_Category [2]
  Product_Category Q1_2023 Q1_2024 Q2_2023 Q2_2024 Q3_2023 Q4_2023
  <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Clothing           30000   34000   33000   38000   31000   36000
2 Electronics        41500   46000   44500   50000   42500   48000
[1] "Average Sales by Product Category and Half (Long):"
# A tibble: 6 × 4
# Groups:   Product_Category [1]
  Product_Category Quarter Average_Sales Half   
  <chr>            <chr>

## Part 5: Business Applications

1. **Time Series Analysis Preparation:**
   - Using your `quarterly_sales_long` data, prepare it for time series analysis by:
     - Ensuring it's in proper long format with a date/time column
     - Creating a complete time series (filling in any missing quarters with 0 sales)
     - Adding calculated columns for year-over-year growth rates

2. **Dashboard Data Preparation:**
   - Create a wide format dataset suitable for a business dashboard that shows:
     - Rows: Product categories
     - Columns: Quarters
     - Values: Total sales
     - Additional columns for year-over-year comparisons

3. **Survey Analysis:**
   - Using your `survey_responses_wide` data, create summary statistics:
     - Calculate average scores for each question
     - Identify questions with the highest and lowest satisfaction
     - Create a correlation matrix between different survey questions
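
An identifier such as `Respondent_ID` is numeric, so a blanket "all numeric columns" summary would treat it as a question. A minimal sketch of one way to avoid that, assuming a small hypothetical survey table (`survey_toy`, invented values): drop the ID column before averaging, then pick the highest and lowest question from a named vector.

```r
library(dplyr)

# Hypothetical survey data in wide format (illustrative values only)
survey_toy <- tibble(
  Respondent_ID   = 1:4,
  Product_Quality = c(5, 3, 4, 2),
  Delivery_Speed  = c(4, 4, 5, 3)
)

# Average score per question, excluding the identifier column
question_means_toy <- survey_toy %>%
  select(-Respondent_ID) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Convert the one-row result to a named numeric vector for which.max()/which.min()
means_vec <- unlist(question_means_toy)
highest <- names(means_vec)[which.max(means_vec)]
lowest  <- names(means_vec)[which.min(means_vec)]

cat("Highest:", highest, "\n")
cat("Lowest:", lowest, "\n")
```

The same `select(-Respondent_ID)` step would apply before computing a correlation matrix, since correlations with an arbitrary ID carry no meaning.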

---

In [79]:
# Task 5.1: Time Series Analysis Preparation
# Map each quarter label (e.g. "Q1_2023") to the first day of that quarter
quarterly_sales_long <- quarterly_sales_long %>%
  mutate(Date = as.Date(paste0(
    sub("Q[1-4]_(\\d{4})", "\\1", Quarter), "-",
    c("01", "04", "07", "10")[as.integer(sub("Q([1-4])_.*", "\\1", Quarter))],
    "-01"
  )))

# Fill in any missing quarter for an existing region/category pair with 0 sales
complete_sales <- quarterly_sales_long %>%
  complete(
    nesting(Region, Product_Category),
    nesting(Quarter, Date),
    fill = list(Sales_Amount = 0)
  )

# Year-over-year growth: compare each quarter with the same quarter one year (4 rows) earlier
complete_sales <- complete_sales %>%
  group_by(Region, Product_Category) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(YoY_Growth = (Sales_Amount - lag(Sales_Amount, 4)) / lag(Sales_Amount, 4) * 100)

print("Quarterly Sales Data Prepared for Time Series Analysis:")
print(head(complete_sales))

[1] "Quarterly Sales Data Prepared for Time Series Analysis:"
# A tibble: 6 × 6
# Groups:   Region, Product_Category [1]
  Region Product_Category Quarter Date       Sales_Amount YoY_Growth
  <chr>  <chr>            <chr>   <date>            <int>      <dbl>
1 East   Electronics      Q1_2023 2023-01-01        38000       NA  
2 East   Electronics      Q2_2023 2023-04-01        41000       NA  
3 East   Electronics      Q3_2023 2023-07-01        39000       NA  
4 East   Electronics      Q4_2023 2023-10-01        44000       NA  
5 East   Electronics      Q1_2024 2024-01-01        42000       10.5
6 East   Electronics      Q2_2024 2024-04-01        46000       12.2


In [80]:
# Task 5.2: Dashboard Data Preparation
# Total sales per product category, with one column per quarter
dashboard_data <- quarterly_sales_long %>%
  group_by(Product_Category, Quarter) %>%
  summarise(Total_Sales = sum(Sales_Amount, na.rm = TRUE)) %>%
  pivot_wider(
    names_from = Quarter,
    values_from = Total_Sales
  )

# Year-over-year change: Q1 2024 compared with Q1 2023, in percent
dashboard_data <- dashboard_data %>%
  mutate(Q1_YoY = (Q1_2024 - Q1_2023) / Q1_2023 * 100)

print("Dashboard-ready dataset:")
print(head(dashboard_data))

`summarise()` has grouped output by 'Product_Category'. You can override using
the `.groups` argument.
[1] "Dashboard-ready dataset:"
# A tibble: 2 × 8
# Groups:   Product_Category [2]
  Product_Category Q1_2023 Q1_2024 Q2_2023 Q2_2024 Q3_2023 Q4_2023 Q1_YoY
  <chr>              <int>   <int>   <int>   <int>   <int>   <int>  <dbl>
1 Clothing           60000   68000   66000   76000   62000   72000   13.3
2 Electronics        83000   92000   89000  100000   85000   96000   10.8


In [81]:
# Task 5.3: Survey Analysis
question_means <- survey_responses_wide %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print("Average scores for each question:")
print(question_means)

max_question <- names(question_means)[which.max(question_means)]
min_question <- names(question_means)[which.min(question_means)]

cat("Highest satisfaction question:", max_question, "\n")
cat("Lowest satisfaction question:", min_question, "\n")

cor_matrix <- survey_responses_wide %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

print("Correlation matrix between survey questions:")
print(cor_matrix)

[1] "Average scores for each question:"
# A tibble: 1 × 6
  Respondent_ID Product_Quality Customer_Service Value_for_Money Delivery_Speed
          <dbl>           <dbl>            <dbl>           <dbl>          <dbl>
1          25.5            3.14             3.04             2.9           3.36
# ℹ 1 more variable: Overall_Satisfaction <dbl>
Highest satisfaction question: Respondent_ID 
Lowest satisfaction question: Value_for_Money 
[1] "Correlation matrix between survey questions:"
                     Respondent_ID Product_Quality Customer_Service
Respondent_ID            1.0000000     -0.10533860      -0.10574640
Product_Quality         -0.1053386      1.00000000       0.22285138
Customer_Service        -0.1057464      0.22285138       1.00000000
Value_for_Money         -0.2375157      0.37750376       0.08411789
Delivery_Speed           0.1585383     -0.11432290      -0.09463263
Overall_Satisfact

## Part 6: Data Validation and Quality Checks

1. **Reshape Validation:**
   - After each major reshaping operation, verify that:
     - The total number of data points is preserved (accounting for the different structure)
     - No data was lost or duplicated unexpectedly
     - The relationships between variables are maintained

2. **Tidy Data Assessment:**
   - For each of your final datasets, assess whether they meet the criteria for "tidy data":
     - Each variable forms a column
     - Each observation forms a row
     - Each type of observational unit forms a table
   - Identify which format (wide or long) is more "tidy" for each specific analysis purpose.
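
The reshape checks listed above can be written as assertions. The sketch below assumes a tiny hypothetical wide table (`wide_toy`, invented values) and uses base R `stopifnot()` so that a failed check stops the notebook instead of printing a number that has to be inspected by eye.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table (illustrative values only)
wide_toy <- tibble(
  Region  = c("North", "South"),
  Q1_2023 = c(45000, 32000),
  Q2_2023 = c(48000, 35000)
)

long_toy <- wide_toy %>%
  pivot_longer(cols = starts_with("Q"), names_to = "Quarter", values_to = "Sales_Amount")

# 1. Row count: rows in the wide table times the number of pivoted columns
stopifnot(nrow(long_toy) == nrow(wide_toy) * 2)

# 2. Totals are preserved
stopifnot(sum(long_toy$Sales_Amount) == sum(wide_toy$Q1_2023, wide_toy$Q2_2023))

# 3. No key combination appears twice
stopifnot(!any(duplicated(long_toy[c("Region", "Quarter")])))

# 4. Round trip: pivoting back reproduces the original table
round_trip <- long_toy %>%
  pivot_wider(names_from = Quarter, values_from = Sales_Amount)
stopifnot(isTRUE(all.equal(as.data.frame(round_trip), as.data.frame(wide_toy))))

cat("All reshape checks passed\n")
```

The round-trip comparison is the strictest of the four, since it covers values, column names and row order at once.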

---

In [82]:
# Task 6.1: Reshape Validation
# Totals should match between the long data and the wide dashboard table
total_sales_long <- sum(quarterly_sales_long$Sales_Amount, na.rm = TRUE)

# Keep only the quarter columns (drops the grouping column and any YoY column)
quarter_totals <- dashboard_data %>%
  ungroup() %>%
  select(matches("^Q[1-4]_\\d{4}$"))
total_sales_dashboard <- sum(unlist(quarter_totals), na.rm = TRUE)

cat("Total sales in long format:", total_sales_long, "\n")
cat("Total sales in dashboard wide format:", total_sales_dashboard, "\n")

cat("Duplicated rows in quarterly_sales_long:", sum(duplicated(quarterly_sales_long)), "\n")
cat("Duplicated rows in dashboard_data:", sum(duplicated(dashboard_data)), "\n")

cat("Missing values in dashboard_data quarter columns:", sum(is.na(quarter_totals)), "\n")

Total sales in long format: 949000 
Total sales in dashboard wide format: 949000 
Duplicated rows in quarterly_sales_long: 0 
Duplicated rows in dashboard_data: 0 
Missing values in dashboard_data quarter columns: 0 


In [83]:
# Task 6.2: Tidy Data Assessment
cat("quarterly_sales_long is tidy if:\n")
cat("- Each variable (Region, Product_Category, Quarter, Sales_Amount) forms a column\n")
cat("- Each observation (a sale for a region/category/quarter) forms a row\n")
cat("- Each type of observational unit (sales transaction) forms a table\n\n")

cat("dashboard_data is tidy for dashboard/reporting if:\n")
cat("- Each variable (Product_Category, Q1_2023, Q2_2023, ...) forms a column\n")
cat("- Each observation (a product category) forms a row\n")
cat("- Wide format is better for summary tables and dashboards\n\n")

cat("employee_skills_long is tidy if:\n")
cat("- Each variable (Employee_ID, Skill, Proficiency_Level) forms a column\n")
cat("- Each observation (an employee's skill rating) forms a row\n")
cat("- Long format is better for analysis and plotting\n\n")

cat("skills_matrix is tidy for matrix-style analysis if:\n")
cat("- Each variable (Employee_ID, R_Programming, Excel, SQL, ...) forms a column\n")
cat("- Each observation (an employee) forms a row\n")
cat("- Wide format is better for comparing employees across skills\n\n")


quarterly_sales_long is tidy if:
- Each variable (Region, Product_Category, Quarter, Sales_Amount) forms a column
- Each observation (a sale for a region/category/quarter) forms a row
- Each type of observational unit (sales transaction) forms a table

dashboard_data is tidy for dashboard/reporting if:
- Each variable (Product_Category, Q1_2023, Q2_2023, ...) forms a column
- Each observation (a product category) forms a row
- Wide format is better for summary tables and dashboards

employee_skills_long is tidy if:
- Each variable (Employee_ID, Skill, Proficiency_Level) forms a column
- Each observation (an employee's skill rating) forms a row
- Long format is better for analysis and plotting

skills_matrix is tidy for matrix-style analysis if:
- Each variable (Employee_ID, R_Programming, Excel, SQL, ...) forms a column
- Each observation (an employee) forms a row
- Wide format is better for comparing employees across skills



## Part 7: Visualization Preparation

1. **ggplot2 Preparation:**
   - Prepare your `quarterly_sales_long` data for creating a line chart showing sales trends over time for each region.
   - Prepare your `employee_skills_long` data for creating a heatmap showing skill proficiency across employees.

2. **Comparison Visualization:**
   - Create a dataset that allows you to compare the same metric (e.g., sales) across different dimensions (e.g., regions, quarters) in a single visualization.
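
Quarter labels such as `Q1_2023` sort alphabetically, which puts `Q1_2024` ahead of `Q2_2023`, and ggplot2 orders a discrete axis by factor levels. A minimal sketch of one way to get a chronological axis, assuming a small hypothetical trend table (`trend_toy`, invented values): sort by year and quarter number, then turn `Quarter` into a factor whose levels follow that order.

```r
library(dplyr)

# Hypothetical trend data with quarter labels in alphabetical order
trend_toy <- tibble(
  Region      = "North",
  Quarter     = c("Q1_2023", "Q1_2024", "Q2_2023", "Q3_2023", "Q4_2023"),
  Total_Sales = c(45000, 50000, 48000, 46000, 52000)
)

trend_toy <- trend_toy %>%
  mutate(
    Year        = as.integer(substr(Quarter, 4, 7)),  # "Q1_2023" -> 2023
    Quarter_Num = as.integer(substr(Quarter, 2, 2))   # "Q1_2023" -> 1
  ) %>%
  arrange(Year, Quarter_Num) %>%
  # Factor levels follow the sorted row order, so plots run chronologically
  mutate(Quarter = factor(Quarter, levels = unique(Quarter)))

print(levels(trend_toy$Quarter))
```

With the factor in place, mapping `Quarter` to the x aesthetic of a line chart or heatmap should keep 2024 quarters after all 2023 quarters.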

---

In [84]:
# Task 7.1: ggplot2 Data Preparation
sales_trend_data <- quarterly_sales_long %>%
  group_by(Region, Quarter) %>%
  summarise(Total_Sales = sum(Sales_Amount, na.rm = TRUE)) %>%
  ungroup()

skills_heatmap_data <- employee_skills_long %>%
  group_by(Employee_ID, Skill) %>%
  summarise(Avg_Proficiency = mean(Proficiency_Level, na.rm = TRUE)) %>%
  ungroup()

`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'Employee_ID'. You can override using the
`.groups` argument.


In [85]:
# Task 7.2: Comparison Visualization Data
comparison_data <- quarterly_sales_long %>%
  group_by(Region, Quarter) %>%
  summarise(Total_Sales = sum(Sales_Amount, na.rm = TRUE)) %>%
  ungroup()

print("Comparison dataset for sales across regions and quarters:")
print(head(comparison_data))

`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
[1] "Comparison dataset for sales across regions and quarters:"
# A tibble: 6 × 3
  Region Quarter Total_Sales
  <chr>  <chr>         <int>
1 East   Q1_2023       38000
2 East   Q1_2024       42000
3 East   Q2_2023       41000
4 East   Q2_2024       46000
5 East   Q3_2023       39000
6 East   Q4_2023       44000


## Part 8: Analysis Questions

Using the reshaped datasets you've created, answer the following questions:

1. **Trend Analysis:** What trends do you observe in quarterly sales across different regions? Which region shows the most consistent growth?

2. **Skills Gap Analysis:** Based on the employee skills data, what are the most common skill gaps in the organization? Which skills are most prevalent?

3. **Survey Insights:** What are the key findings from the survey data? Are there any patterns in responses that suggest areas for improvement?

4. **Data Structure Impact:** How did reshaping the data change your ability to answer these business questions? Provide specific examples.

---

In [86]:
# Analysis Question 1: Trend Analysis
sales_trends <- quarterly_sales_long %>%
  group_by(Region, Quarter) %>%
  summarise(Total_Sales = sum(Sales_Amount, na.rm = TRUE)) %>%
  ungroup()

print("Quarterly sales trends by region:")
print(sales_trends)

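region_growth <- sales_trends %>%
  group_by(Region) %>%
  # Quarter is text such as "Q1_2023", so sort by year and then by quarter;
  # sorting the raw labels would place Q1_2024 before Q2_2023
  arrange(substr(Quarter, 4, 7), substr(Quarter, 1, 2), .by_group = TRUE) %>%
  mutate(Growth = Total_Sales - lag(Total_Sales)) %>%
  summarise(Avg_Growth = mean(Growth, na.rm = TRUE),
            Growth_SD = sd(Growth, na.rm = TRUE)) %>%
  ungroup()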

print("Average growth and growth variability by region:")
print(region_growth)

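# A low Growth_SD means quarter-to-quarter changes are similar in size;
# read it together with a positive Avg_Growth to judge consistent growth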

[1m[22m`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.


[1] "Quarterly sales trends by region:"
# A tibble: 24 × 3
   Region Quarter Total_Sales
   <chr>  <chr>         <int>
 1 East   Q1_2023       38000
 2 East   Q1_2024       42000
 3 East   Q2_2023       41000
 4 East   Q2_2024       46000
 5 East   Q3_2023       39000
 6 East   Q4_2023       44000
 7 North  Q1_2023       45000
 8 North  Q1_2024       50000
 9 North  Q2_2023       48000
10 North  Q2_2024       54000
# ℹ 14 more rows
[1] "Average growth and growth variability by region:"
# A tibble: 4 × 3
  Region Avg_Growth Growth_SD
  <chr>       <dbl>     <dbl>
1

In [87]:
# Analysis Question 2: Skills Gap Analysis  
skill_summary <- employee_skills_long %>%
  group_by(Skill) %>%
  summarise(
    Avg_Proficiency = mean(Proficiency_Level, na.rm = TRUE),
    Missing_Count = sum(is.na(Proficiency_Level)),
    Below_Threshold = sum(Proficiency_Level < 3, na.rm = TRUE) # Example threshold for skill gap
  ) %>%
  arrange(Avg_Proficiency)

print("Skill gap summary (lower average = bigger gap):")
print(skill_summary)

# skill gaps: lowest Avg_Proficiency, high Below_Threshold
# prevalent skills: highest Avg_Proficiency

[1] "Skill gap summary (lower average = bigger gap):"
# A tibble: 3 × 4
  Skill         Avg_Proficiency Missing_Count Below_Threshold
  <chr>                   <dbl>         <int>           <int>
1 Excel                    2.87             0              14
2 SQL                      3.03             0              14
3 R_Programming            3.1              0              11


In [88]:
# Analysis Question 3: Survey Insights
question_means <- survey_responses_wide %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print("Average scores for each question:")
print(question_means)

max_question <- names(question_means)[which.max(question_means)]
min_question <- names(question_means)[which.min(question_means)]

cat("Highest satisfaction question:", max_question, "\n")
cat("Lowest satisfaction question:", min_question, "\n")

cor_matrix <- survey_responses_wide %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")

print("Correlation matrix between survey questions:")
print(cor_matrix)


[1] "Average scores for each question:"
# A tibble: 1 × 6
  Respondent_ID Product_Quality Customer_Service Value_for_Money Delivery_Speed
          <dbl>           <dbl>            <dbl>           <dbl>          <dbl>
1          25.5            3.14             3.04             2.9           3.36
# ℹ 1 more variable: Overall_Satisfaction <dbl>
Highest satisfaction question: Respondent_ID 
Lowest satisfaction question: Value_for_Money 
[1] "Correlation matrix between survey questions:"
                     Respondent_ID Product_Quality Customer_Service
Respondent_ID            1.0000000     -0.10533860      -0.10574640
Product_Quality         -0.1053386      1.00000000       0.22285138
Customer_Service        -0.1057464      0.22285138       1.00000000
Value_for_Money         -0.2375157      0.37750376       0.08411789
Delivery_Speed           0.1585383     -0.114
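
In the output above, `Respondent_ID` is reported as the highest-scoring question because the identifier column is numeric and gets averaged along with the ratings. A hedged sketch of one way to exclude it, shown on a made-up tibble (`toy_survey` and its values are hypothetical; only the column names are taken from the output above):

```r
library(dplyr)

# Hypothetical toy data standing in for survey_responses_wide
toy_survey <- tibble(
  Respondent_ID    = 1:4,
  Product_Quality  = c(3, 4, 2, 5),
  Customer_Service = c(2, 3, 3, 4),
  Value_for_Money  = c(1, 3, 2, 4)
)

toy_means <- toy_survey %>%
  select(-Respondent_ID) %>%   # drop the identifier before averaging
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# unlist() turns the one-row tibble into a named numeric vector
mean_vec <- unlist(toy_means)

cat("Highest satisfaction question:", names(mean_vec)[which.max(mean_vec)], "\n")
cat("Lowest satisfaction question:", names(mean_vec)[which.min(mean_vec)], "\n")
```

The same `select(-Respondent_ID)` step would also keep the identifier out of the correlation matrix, where its correlations with the ratings carry no meaning.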

In [89]:
# Analysis Question 4: Data Structure Impact
cat("Reshaping the data made it much easier to answer business questions:\n")
cat("- By converting sales data from wide to long format, I could group and summarize sales by region and quarter, enabling trend analysis and growth calculations.\n")
cat("- Transforming employee skills data to long format allowed quick identification of skill gaps and proficiency averages across the organization.\n")
cat("- Pivoting survey responses to wide format made it simple to compute averages and correlations between questions, revealing satisfaction patterns.\n")
cat("- Overall, reshaping ensured each variable was in its own column and each observation in its own row, supporting flexible analysis and visualization.\n")
cat("For example, calculating growth rates or skill gaps would be difficult or error-prone in wide format, but straightforward in tidy long format.\n")

Reshaping the data made it much easier to answer business questions:
- By converting sales data from wide to long format, I could group and summarize sales by region and quarter, enabling trend analysis and growth calculations.
- Transforming employee skills data to long format allowed quick identification of skill gaps and proficiency averages across the organization.
- Pivoting survey responses to wide format made it simple to compute averages and correlations between questions, revealing satisfaction patterns.
- Overall, reshaping ensured each variable was in its own column and each observation in its own row, supporting flexible analysis and visualization.
For example, calculating growth rates or skill gaps would be difficult or error-prone in wide format, but straightforward in tidy long format.
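
As a rough illustration of the growth-rate point, the sketch below pivots a small made-up wide table (`toy_wide` and its figures are hypothetical) to long format and computes quarter-over-quarter growth per region. It assumes quarter labels of the form `Q1_2023`, as in the sales data, and splits them into a quarter part and a year part so that rows sort chronologically.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per quarter
toy_wide <- tibble(
  Region  = c("East", "West"),
  Q1_2023 = c(100, 200),
  Q2_2023 = c(110, 190),
  Q1_2024 = c(130, 210)
)

toy_growth <- toy_wide %>%
  pivot_longer(
    cols      = starts_with("Q"),
    names_to  = c("Qtr", "Year"),   # "Q1_2023" -> Qtr = "Q1", Year = "2023"
    names_sep = "_",
    values_to = "Sales_Amount"
  ) %>%
  mutate(Year = as.integer(Year)) %>%
  group_by(Region) %>%
  arrange(Year, Qtr, .by_group = TRUE) %>%   # chronological order within region
  mutate(Growth = Sales_Amount - lag(Sales_Amount)) %>%
  ungroup()

print(toy_growth)
```

In the wide layout the same calculation would need one subtraction per pair of adjacent quarter columns, which has to be rewritten whenever a quarter is added.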


## Part 9: Reflection Questions

Answer the following questions in your submission:

1. **Tidy Data Philosophy:** Explain the concept of "tidy data" in your own words. Why is this concept important for data analysis, and how does it relate to the reshaping operations you performed?
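   :Tidy data is a way of structuring a dataset so that each variable has its own column, each observation has its own row, and each value has its own cell. It matters because grouping, summarizing, and plotting tools expect this layout, and the reshaping operations performed here (pivoting longer and wider) are how each dataset was moved into the layout a given analysis needed.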

2. **Format Selection:** For each of the datasets you worked with, explain when you would prefer the wide format versus the long format. What factors influence this decision?
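   :Wide format is better for presentation tables, side-by-side comparisons, and column-wise calculations such as the correlations between survey questions. Long format is preferred for grouped analysis, modeling, and plotting, as with the quarterly sales and employee skills data. The deciding factors are what the next step expects as input and whether the result is meant for people to read or for code to process.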

3. **Business Context:** Describe three real-world business scenarios where data reshaping would be essential. For each scenario, explain what format the data might start in and what format would be needed for analysis.
   :1. Sales performance across regions. Starting from a wide table with one column per quarter, convert to long format for trend or time-series analysis.
   :2. Skill assessment planning. Convert to long format for distribution analysis, then convert back to wide format for a dashboard.
   :3. Customer feedback. Convert to long format to analyze or visualize response distributions.

4. **Tool Integration:** How do the reshaping capabilities of `tidyr` complement the data manipulation functions of `dplyr`? Provide examples of analyses that require both types of operations.
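   :`tidyr` reshapes data, for example by pivoting longer or wider to give it a better structure for analysis. `dplyr` filters, summarizes, and mutates data, so it can be applied after reshaping to calculate summaries. The quarterly sales trend analysis is one example: pivot longer first, then group and summarise by region and quarter.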

5. **Data Pipeline:** In a typical business analytics workflow, at what stage would you perform data reshaping? How does this fit into the overall data wrangling process?
   :Data reshaping is performed after cleaning and before analysis and visualization. It ensures the data is in the correct format for analysis and reporting.
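
The skill-assessment scenario and the `tidyr`/`dplyr` pairing described in the answers above could look roughly like the following. This is only a sketch on a made-up skills matrix (`toy_skills`, its `Department` column, and its values are hypothetical): pivot to long format to summarise, then pivot back to wide for a dashboard-style table.

```r
library(dplyr)
library(tidyr)

# Hypothetical skills matrix in wide format: one column per skill
toy_skills <- tibble(
  Employee_ID = c(1, 2, 3, 4),
  Department  = c("Ops", "Ops", "IT", "IT"),
  Excel       = c(2, 4, 3, 5),
  SQL         = c(1, 3, 4, 4)
)

dept_dashboard <- toy_skills %>%
  # tidyr: wide -> long, one row per employee and skill
  pivot_longer(c(Excel, SQL), names_to = "Skill", values_to = "Proficiency_Level") %>%
  # dplyr: average proficiency per department and skill
  group_by(Department, Skill) %>%
  summarise(Avg_Proficiency = mean(Proficiency_Level, na.rm = TRUE), .groups = "drop") %>%
  # tidyr: long -> wide, one column per skill for display
  pivot_wider(names_from = Skill, values_from = Avg_Proficiency)

print(dept_dashboard)
```

The long intermediate form is what makes the grouped summary a single `summarise()` call; the final wide form is only for presentation.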

---

**Submission Checklist:**

- [ ] R notebook with all code and outputs completed
- [ ] All required data reshaping operations completed successfully
- [ ] Data validation checks performed and documented
- [ ] Datasets prepared for visualization and further analysis
- [ ] Answers to analysis questions with supporting evidence
- [ ] Answers to reflection questions
- [ ] Code is well-commented and demonstrates understanding of when to use each reshaping function

Good luck!