# Homework 4: Data Transformation with dplyr - Part 2 (Mutate, Summarize, Group_by, Count)

Welcome to Homework 4! This assignment builds on your foundational dplyr skills by introducing advanced data transformation techniques essential for business analytics.

## Learning Objectives

By completing this homework, you will:
- **Master mutate()**: Create new variables and calculated fields for business insights
- **Apply summarize()**: Generate aggregate statistics and key performance indicators (KPIs)
- **Utilize group_by()**: Perform grouped analysis across business dimensions
- **Implement count()**: Conduct frequency analysis and cross-tabulations
- **Develop business intelligence**: Combine functions to create comprehensive analytics

## Business Context

You'll be working with real company sales data to perform the type of analysis that drives business decisions. This includes:
- **Financial Analysis**: Calculate profit margins, ROI, and efficiency metrics
- **Performance Segmentation**: Categorize transactions and customers by performance
- **Regional Analysis**: Compare performance across geographic regions
- **Product Analysis**: Evaluate product category performance
- **Time-based Analysis**: Identify trends and seasonal patterns
- **Sales Rep Performance**: Assess individual representative effectiveness

## Dataset Information

You'll be working with `company_sales_data.csv` which contains:
- **Sales transactions** with revenue, cost, and unit data
- **Geographic information** with regional breakdowns
- **Product categories** for portfolio analysis
- **Sales representative** performance data
- **Date information** for trend analysis
- **Customer metrics** for segmentation analysis

## Instructions

Complete each section by writing R code in the designated areas. Focus on creating clean, well-commented code that demonstrates your understanding of both the technical concepts and business applications.

**Remember**: In real business analytics, you're not just manipulating data - you're uncovering insights that drive strategic decisions!

## Part 1: Setup and Data Import

**Task**: Load the necessary packages and import your dataset.

**Business Context**: Every analysis begins with proper data setup. In professional environments, data scientists spend significant time ensuring data is properly loaded and validated before analysis begins.

**What you need to do**:
1. Load the `tidyverse` package (which includes dplyr)
2. Import the `company_sales_data.csv` file 
3. Examine the structure and basic properties of your data
4. Perform initial data validation

**Hint**: Use `read_csv()` for better data type detection, and always examine your data structure with `str()`, `summary()`, and `head()` functions.

In [4]:
# Load required packages
# TODO: Load tidyverse
library(tidyverse)

# Import the company sales data
# TODO: Read the company_sales_data.csv file into a variable called company_data
company_data <- read_csv("/workspaces/Fall2025-MS3083-Base_Template/data/company_sales_data.csv")

# Examine the dataset structure
# TODO: Use str() to examine the data structure
str(company_data)

# Display the first few rows
# TODO: Use head() to show the first 6 rows
print(head(company_data, 10))

# Generate summary statistics
# TODO: Use summary() to get basic statistics for all columns
summary(company_data)

# Check dataset dimensions
# TODO: Display the number of rows and columns using nrow() and ncol()
cat("Dataset dimensions:", nrow(company_data), "rows x", ncol(company_data), "columns\n")

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr    [39m 1.1.4     [32m✔[39m [34mreadr    [39m 2.1.5
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mstringr  [39m 1.5.2
[32m✔[39m [34mggplot2  [39m 4.0.0     [32m✔[39m [34mtibble   [39m 3.3.0
[32m✔[39m [34mlubridate[39m 1.9.4     [32m✔[39m [34mtidyr    [39m 1.3.1
[32m✔[39m [34mpurrr    [39m 1.1.0     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors
[1mRows: [22m[34m300[39m [1mColumns: [22m[34m8[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m


spc_tbl_ [300 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ TransactionID   : num [1:300] 1 2 3 4 5 6 7 8 9 10 ...
 $ Sales_Rep_Name  : chr [1:300] "Carol Davis" "Carol Davis" "Carol Davis" "Bob Smith" ...
 $ Region          : chr [1:300] "Latin America" "Europe" "Europe" "Europe" ...
 $ Product_Category: chr [1:300] "Services" "Hardware" "Services" "Hardware" ...
 $ Revenue         : num [1:300] 20751 32360 39268 28865 3932 ...
 $ Cost            : num [1:300] 12253 24595 23291 12429 1778 ...
 $ Units_Sold      : num [1:300] 78 13 34 90 63 26 25 1 20 15 ...
 $ Sale_Date       : Date[1:300], format: "2023-04-24" "2023-06-09" ...
 - attr(*, "spec")=
  .. cols(
  ..   TransactionID = [32mcol_double()[39m,
  ..   Sales_Rep_Name = [31mcol_character()[39m,
  ..   Region = [31mcol_character()[39m,
  ..   Product_Category = [31mcol_character()[39m,
  ..   Revenue = [32mcol_double()[39m,
  ..   Cost = [32mcol_double()[39m,
  ..   Units_Sold = [32mcol_double()[39m,
  ..   Sale_D

 TransactionID    Sales_Rep_Name        Region          Product_Category  
 Min.   :  1.00   Length:300         Length:300         Length:300        
 1st Qu.: 75.75   Class :character   Class :character   Class :character  
 Median :150.50   Mode  :character   Mode  :character   Mode  :character  
 Mean   :150.50                                                           
 3rd Qu.:225.25                                                           
 Max.   :300.00                                                           
    Revenue           Cost         Units_Sold       Sale_Date         
 Min.   : 1032   Min.   :  567   Min.   :  1.00   Min.   :2023-01-02  
 1st Qu.:15034   1st Qu.: 7216   1st Qu.: 28.75   1st Qu.:2023-04-07  
 Median :26062   Median :13956   Median : 56.00   Median :2023-07-05  
 Mean   :25906   Mean   :14547   Mean   : 53.90   Mean   :2023-07-07  
 3rd Qu.:37708   3rd Qu.:21714   3rd Qu.: 80.00   3rd Qu.:2023-10-08  
 Max.   :49956   Max.   :37811   Max.   :100.00  

Dataset dimensions: 300 rows x 8 columns


## Part 2: Creating New Variables with mutate()

**Task**: Use `mutate()` to create new calculated fields for business analysis.

**Business Context**: Raw data rarely tells the complete story. Business analysts must create derived metrics like profit margins, efficiency ratios, and performance categories to generate actionable insights.

**Variables to create**:
1. **Profit**: Revenue minus Cost
2. **Profit_Margin**: (Profit / Revenue) × 100
3. **Cost_Ratio**: (Cost / Revenue) × 100  
4. **Revenue_Per_Unit**: Revenue divided by Units_Sold
5. **Cost_Per_Unit**: Cost divided by Units_Sold
6. **ROI**: (Profit / Cost) × 100

**Hint**: Use the pipe operator (`%>%`) to chain multiple mutate operations together. Remember to create meaningful variable names that clearly indicate what each metric represents.

In [5]:
# Create basic financial metrics using mutate()
company_data <- company_data %>%
  mutate(
    # TODO: Create Profit (Revenue - Cost)
    Profit = Revenue - Cost,
    # TODO: Create Profit_Margin ((Profit / Revenue) * 100)
    Profit_Margin = (Profit / Revenue) * 100,
    # TODO: Create Cost_Ratio ((Cost / Revenue) * 100)
    Cost_Ratio = (Cost / Revenue) * 100,
    # TODO: Create Revenue_Per_Unit (Revenue / Units_Sold)
    Revenue_Per_Unit = Revenue / Units_Sold,
    # TODO: Create Cost_Per_Unit (Cost / Units_Sold)
    Cost_Per_Unit = Cost / Units_Sold,
    # TODO: Create ROI ((Profit / Cost) * 100)
    ROI = (Profit / Cost) * 100
  )

# Display the new financial metrics
# TODO: Select and display columns: Revenue, Cost, Profit, Profit_Margin, ROI
company_data %>%
select(Revenue, Cost, Profit, Profit_Margin, ROI) %>%
head(10)


Revenue,Cost,Profit,Profit_Margin,ROI
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
20750.92,12253.36,8497.56,40.95028,69.34882
32359.98,24595.2,7764.78,23.99501,31.57031
39268.4,23291.09,15977.31,40.68745,68.59838
28865.09,12428.74,16436.35,56.94197,132.2447
3932.36,1778.18,2154.18,54.78084,121.14522
48209.75,26052.04,22157.71,45.96106,85.05173
33055.35,20432.24,12623.11,38.1878,61.78035
49856.54,37118.63,12737.91,25.54913,34.31676
13477.13,8685.85,4791.28,35.55119,55.1619
21188.96,16757.06,4431.9,20.91608,26.44796


## Part 3: Creating Categorical Variables

**Task**: Create categorical variables for business segmentation using conditional logic.

**Business Context**: Segmentation is crucial for targeted business strategies. By categorizing transactions and customers, businesses can tailor their approaches to different performance levels and customer types.

**Categories to create**:
1. **Performance_Category**: Based on Profit_Margin (High: >50%, Medium: 30-50%, Low: <30%)
2. **Revenue_Size**: Based on Revenue (Large: >30000, Medium: 15000-30000, Small: <15000)
3. **Deal_Size**: Based on Units_Sold (Bulk: >50, Standard: 20-50, Small: <20)
4. **High_Value_Customer**: Flag for Revenue > 25000
5. **Profitable_Deal**: Flag for Profit_Margin > 40%

**Hint**: Use `case_when()` for multiple conditions or `ifelse()` for simple binary classifications.

In [6]:
# Create categorical variables for business segmentation
company_data <- company_data %>%
  mutate(
    # TODO: Create Performance_Category using case_when()
    # High: Profit_Margin > 50, Medium: Profit_Margin > 30, Low: everything else
    Performance_Category = case_when( Profit_Margin > 50 ~ "High", Profit_Margin > 30 ~ "Medium", TRUE ~ "Low"),
    # TODO: Create Revenue_Size category
    # Large: Revenue > 30000, Medium: Revenue > 15000, Small: everything else
    Revenue_Size = case_when(Revenue > 3000 ~ "Large", Revenue > 15000 ~ "Medium", TRUE ~ "Small"),    
    # TODO: Create Deal_Size category  
    # Bulk: Units_Sold > 50, Standard: Units_Sold > 20, Small: everything else
    Deal_Size = case_when(Units_Sold >50 ~ "Bulk", Units_Sold > 20 ~ "Standard", TRUE ~ "Small"),
    # TODO: Create High_Value_Customer flag (Yes/No for Revenue > 25000)
    High_Value_Customer = ifelse(Revenue > 25000, "Yes", "No"),
    # TODO: Create Profitable_Deal flag (Yes/No for Profit_Margin > 40)
    Profitable_Deal = ifelse(Profit_Margin > 40, "Yes", "No")
  )

# Examine the distribution of categorical variables
# TODO: Use table() to show the distribution of Performance_Category
cat("Performance Category Distribution:\n")
print(table(company_data$Performance_Category))

# TODO: Use table() to show the distribution of Revenue_Size
cat("\nRevenue Side Distribution:\n")
print(table(company_data$Revenue_Size))

# TODO: Use table() to show the distribution of High_Value_Customer
cat("\nHigh Value Customer Distribution:\n")
print(table(company_data$High_Value_Customer))

Performance Category Distribution:

  High    Low Medium 
   113     66    121 



Revenue Side Distribution:

Large Small 
  286    14 

High Value Customer Distribution:

 No Yes 
146 154 


## Part 4: Summary Statistics with summarize()

**Task**: Use `summarize()` to calculate key business metrics and overall performance indicators.

**Business Context**: Executive dashboards and business reports rely on aggregate statistics to provide high-level insights. These summary metrics help stakeholders quickly understand overall business performance.

**Metrics to calculate**:
1. **Total Revenue** and **Total Profit**
2. **Average Profit Margin** and **Average ROI**
3. **Total Units Sold** and **Total Transactions**
4. **Revenue Statistics**: Min, Max, Mean, Median, Standard Deviation
5. **Profit Margin Statistics**: Quartiles and distribution metrics

**Hint**: You can calculate multiple summary statistics in a single `summarize()` call. Use functions like `sum()`, `mean()`, `median()`, `min()`, `max()`, `sd()`, and `n()`.

In [20]:
# Calculate overall business summary statistics
business_summary <- company_data %>%
  summarize(
    # TODO: Calculate total_revenue (sum of Revenue)
    total_revenue = sum(Revenue, na.rm = TRUE),
    # TODO: Calculate total_profit (sum of Profit)
    total_profit = sum(Profit, na.rm = TRUE),
    # TODO: Calculate avg_profit_margin (mean of Profit_Margin)
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    # TODO: Calculate avg_roi (mean of ROI)
    avg_roi = mean(ROI, na.rm = TRUE),
    # TODO: Calculate total_units (sum of Units_Sold)
    total_units = sum(Units_Sold, na.rm = TRUE),
    # TODO: Calculate transaction_count (use n())
    transaction_count = n(),
    # TODO: Calculate avg_revenue_per_transaction (mean of Revenue)
    avg_revenue_per_transaction = mean(Revenue, na.rm = TRUE)
  )

# Display the business summary
print("Overall Business Performance Summary:")
print(business_summary)

# Calculate detailed revenue statistics
revenue_statistics <- company_data %>%
  summarize(
    # TODO: Calculate revenue statistics: min, max, mean, median, sd
    min_revenue = min(Revenue, na.rm = TRUE),
    max_revenue = max(Revenue, na.rm = TRUE),
    mena_revenue = mean(Revenue, na.rm = TRUE),
    median_revenue = median(Revenue, na.rm = TRUE),
    sd_revenue = sd(Revenue, na.rm = TRUE)
  )

print("Detailed Revenue Statistics:")
print(revenue_statistics)

[1] "Overall Business Performance Summary:"
[90m# A tibble: 1 × 7[39m
  total_revenue total_profit avg_profit_margin avg_roi total_units
          [3m[90m<dbl>[39m[23m        [3m[90m<dbl>[39m[23m             [3m[90m<dbl>[39m[23m   [3m[90m<dbl>[39m[23m       [3m[90m<dbl>[39m[23m
[90m1[39m      7[4m7[24m[4m7[24m[4m1[24m711.     3[4m4[24m[4m0[24m[4m7[24m512.              44.2    93.5       [4m1[24m[4m6[24m169
[90m# ℹ 2 more variables: transaction_count <int>,[39m
[90m#   avg_revenue_per_transaction <dbl>[39m
[1] "Detailed Revenue Statistics:"
[90m# A tibble: 1 × 5[39m
  min_revenue max_revenue mena_revenue median_revenue sd_revenue
        [3m[90m<dbl>[39m[23m       [3m[90m<dbl>[39m[23m        [3m[90m<dbl>[39m[23m          [3m[90m<dbl>[39m[23m      [3m[90m<dbl>[39m[23m
[90m1[39m       [4m1[24m032.      [4m4[24m[4m9[24m956.       [4m2[24m[4m5[24m906.         [4m2[24m[4m6[24m062.     [4m1[24m[4m3[24m944.


## Part 5: Grouped Analysis with group_by()

**Task**: Use `group_by()` combined with `summarize()` to analyze performance across different business dimensions.

**Business Context**: Understanding how different segments perform is crucial for strategic decision-making. Regional analysis helps with resource allocation, product analysis guides inventory decisions, and sales rep analysis informs performance management.

**Analyses to perform**:
1. **Regional Analysis**: Performance by Region
2. **Product Category Analysis**: Performance by Product_Category  
3. **Performance Category Analysis**: Compare High/Medium/Low performers
4. **Sales Rep Analysis**: Individual representative performance

**Metrics for each group**:
- Total Revenue and Profit
- Average Profit Margin
- Transaction Count
- Total Units Sold
- Revenue Share (percentage of total)

**Hint**: Use `group_by()` followed by `summarize()`, and don't forget to use `arrange(desc())` to sort by key metrics.

In [8]:
# Regional Analysis
regional_performance <- company_data %>%
  group_by(Region) %>%
  summarize(
    # TODO: Calculate total_revenue, total_profit, avg_profit_margin, 
    # transaction_count, total_units for each region
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_profit = sum(Profit, na.rm = TRUE),
    avg_profit_margin = mean(Profit_Margin, na.rm= TRUE),
    transaction_count = n(),
    total_units = sum(Units_Sold, na.rm = TRUE),
    .groups = 'drop'  # Remove grouping
  ) %>%
  # TODO: Add revenue_share calculation (total_revenue / sum(total_revenue) * 100)
  mutate(
    rvenue_share = (total_revenue / sum(total_revenue)) * 100
  ) %>%
  # TODO: Arrange by total_revenue in descending order
  arrange(desc(total_revenue))

print("Regional Performance Analysis:")
print(regional_performance)

[1] "Regional Performance Analysis:"


[90m# A tibble: 4 × 7[39m
  Region        total_revenue total_profit avg_profit_margin transaction_count
  [3m[90m<chr>[39m[23m                 [3m[90m<dbl>[39m[23m        [3m[90m<dbl>[39m[23m             [3m[90m<dbl>[39m[23m             [3m[90m<int>[39m[23m
[90m1[39m Europe             2[4m2[24m[4m2[24m[4m4[24m182.     1[4m0[24m[4m0[24m[4m6[24m807.              45.5                82
[90m2[39m Latin America      2[4m1[24m[4m1[24m[4m2[24m037.      [4m8[24m[4m9[24m[4m1[24m481.              43.0                83
[90m3[39m Asia Pacific       1[4m8[24m[4m0[24m[4m4[24m243.      [4m8[24m[4m2[24m[4m7[24m243.              46.0                67
[90m4[39m North America      1[4m6[24m[4m3[24m[4m1[24m248.      [4m6[24m[4m8[24m[4m1[24m981.              42.4                68
[90m# ℹ 2 more variables: total_units <dbl>, rvenue_share <dbl>[39m


In [9]:
# Product Category Analysis
category_performance <- company_data %>%
  group_by(Product_Category) %>%  # TODO: Group by Product_Category
  summarize(
    # TODO: Calculate the same metrics as regional analysis
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_pofit = sum(Profit, na.rm = TRUE),
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    transaction_count = n(),
    total_units = sum(Units_Sold, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # TODO: Add revenue_share and arrange by total_revenue
  mutate(
    revenue_share = (total_revenue / sum(total_revenue)) * 100
  ) %>%
  arrange(desc(total_revenue))

print("Product Category Performance Analysis:")
print(category_performance)

[1] "Product Category Performance Analysis:"
[90m# A tibble: 4 × 7[39m
  Product_Category total_revenue total_pofit avg_profit_margin transaction_count
  [3m[90m<chr>[39m[23m                    [3m[90m<dbl>[39m[23m       [3m[90m<dbl>[39m[23m             [3m[90m<dbl>[39m[23m             [3m[90m<int>[39m[23m
[90m1[39m Consulting            1[4m9[24m[4m7[24m[4m8[24m840.     [4m8[24m[4m8[24m[4m4[24m265.              44.4                76
[90m2[39m Services              1[4m9[24m[4m6[24m[4m1[24m565.     [4m8[24m[4m1[24m[4m4[24m908.              42.3                72
[90m3[39m Hardware              1[4m9[24m[4m5[24m[4m1[24m325.     [4m8[24m[4m5[24m[4m8[24m246.              43.5                73
[90m4[39m Software              1[4m8[24m[4m7[24m[4m9[24m981.     [4m8[24m[4m5[24m[4m0[24m092.              46.4                79
[90m# ℹ 2 more variables: total_units <dbl>, revenue_share <dbl>[39m


In [10]:
# Performance Category Analysis (High/Medium/Low performers)
performance_analysis <- company_data %>%
  group_by(Performance_Category) %>%  # TODO: Group by Performance_Category
  summarize(
    # TODO: Calculate summary metrics for each performance level
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_profit = sum(Profit, na.rm = TRUE),
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    transaction_count = n(),
    total_units = sum(Units_Sold, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # TODO: Add revenue_share calculation
  mutate(
    revenue_share = (total_revenue / sum(total_revenue)) * 100
  ) %>%
  arrange(desc(total_revenue))

print("Performance Category Analysis:\n")
print(performance_analysis)

[1] "Performance Category Analysis:\n"


[90m# A tibble: 3 × 7[39m
  Performance_Category total_revenue total_profit avg_profit_margin
  [3m[90m<chr>[39m[23m                        [3m[90m<dbl>[39m[23m        [3m[90m<dbl>[39m[23m             [3m[90m<dbl>[39m[23m
[90m1[39m Medium                    3[4m1[24m[4m8[24m[4m7[24m386.     1[4m2[24m[4m6[24m[4m7[24m202.              39.8
[90m2[39m High                      2[4m7[24m[4m7[24m[4m1[24m248.     1[4m6[24m[4m8[24m[4m2[24m947.              60.1
[90m3[39m Low                       1[4m8[24m[4m1[24m[4m3[24m077.      [4m4[24m[4m5[24m[4m7[24m363.              25.3
[90m# ℹ 3 more variables: transaction_count <int>, total_units <dbl>,[39m
[90m#   revenue_share <dbl>[39m


## Part 6: Advanced Grouping and Cross-Tabulation

**Task**: Perform multi-dimensional analysis and frequency counting using `count()` and advanced grouping.

**Business Context**: Real business questions often require analysis across multiple dimensions simultaneously. For example, "Which product categories perform best in each region?" requires cross-tabulation analysis.

**Analyses to perform**:
1. **Multi-dimensional grouping**: Region × Product Category performance
2. **Count analysis**: Frequency distributions using `count()`
3. **Cross-tabulation**: Performance Category vs Revenue Size
4. **Time-based analysis**: Performance by time periods (if date data available)

**Hint**: Use `count()` for frequency analysis and multiple variables in `group_by()` for cross-dimensional analysis.

In [11]:
# Multi-dimensional analysis: Region and Product Category
region_category_analysis <- company_data %>%
  group_by(Region, Product_Category) %>%  # TODO: Group by Region and Product_Category
  summarize(
    # TODO: Calculate total_revenue, transaction_count, avg_profit_margin
    total_revenue = sum(Revenue, na.rm = TRUE),
    transaction_count = n(),
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # TODO: Arrange by total_revenue descending
  arrange(desc(total_revenue))

print("Top 10 Region-Product Category Combinations:")
print(head(region_category_analysis, 10))

[1] "Top 10 Region-Product Category Combinations:"


[90m# A tibble: 10 × 5[39m
   Region     Product_Category total_revenue transaction_count avg_profit_margin
   [3m[90m<chr>[39m[23m      [3m[90m<chr>[39m[23m                    [3m[90m<dbl>[39m[23m             [3m[90m<int>[39m[23m             [3m[90m<dbl>[39m[23m
[90m 1[39m Europe     Hardware               [4m7[24m[4m7[24m[4m7[24m044.                27              45.6
[90m 2[39m Asia Paci… Consulting             [4m7[24m[4m5[24m[4m9[24m641.                27              47.5
[90m 3[39m Latin Ame… Services               [4m6[24m[4m4[24m[4m4[24m772.                22              43.9
[90m 4[39m Latin Ame… Software               [4m5[24m[4m5[24m[4m9[24m611.                24              44.2
[90m 5[39m Europe     Software               [4m5[24m[4m4[24m[4m2[24m961.                23              49.6
[90m 6[39m Europe     Services               [4m5[24m[4m1[24m[4m3[24m507.                18              42.8
[90m 

In [12]:
# Frequency analysis using count()
print("Frequency Analysis:")

# TODO: Count by Performance_Category
performance_counts <- company_data %>%
  count(Performance_Category)

print("Performance Category Distribution:")
print(performance_counts)

# TODO: Count by Revenue_Size
revenue_size_counts <- company_data %>%
  count(Revenue_Size)

print("Revenue Size Distribution:")
print(revenue_size_counts)

# TODO: Count by Deal_Size
deal_size_counts <- company_data %>%
  count(Deal_Size)

print("Deal Size Distribution:")
print(deal_size_counts)

[1] "Frequency Analysis:"
[1] "Performance Category Distribution:"
[90m# A tibble: 3 × 2[39m
  Performance_Category     n
  [3m[90m<chr>[39m[23m                [3m[90m<int>[39m[23m
[90m1[39m High                   113
[90m2[39m Low                     66
[90m3[39m Medium                 121
[1] "Revenue Size Distribution:"
[90m# A tibble: 2 × 2[39m
  Revenue_Size     n
  [3m[90m<chr>[39m[23m        [3m[90m<int>[39m[23m
[90m1[39m Large          286
[90m2[39m Small           14
[1] "Deal Size Distribution:"
[90m# A tibble: 3 × 2[39m
  Deal_Size     n
  [3m[90m<chr>[39m[23m     [3m[90m<int>[39m[23m
[90m1[39m Bulk        159
[90m2[39m Small        51
[90m3[39m Standard     90


In [24]:
# Cross-tabulation analysis
print("Cross-Tabulation Analysis:")

# TODO: Create a cross-tabulation of Performance_Category vs Revenue_Size
# Hint: Use count() with two variables, then consider using pivot_wider() or table()
cross_tab_performance_revenue <- company_data %>%
  count(Performance_Category, Revenue_Size) %>%
  pivot_wider(names_from = Revenue_Size, values_from = n, values_fill = 0)

print("Performance Category vs Revenue Size:")
print(cross_tab_performance_revenue)

# Alternative using base R table function
print("Using table() function:")
# TODO: Create table(company_data$Performance_Category, company_data$Revenue_Size)
performance_revenue_table <- table(company_data$Performance_Category, company_data$Revenue_Size)
print(performance_revenue_table)

[1] "Cross-Tabulation Analysis:"


[1] "Performance Category vs Revenue Size:"
[90m# A tibble: 3 × 3[39m
  Performance_Category Large Small
  [3m[90m<chr>[39m[23m                [3m[90m<int>[39m[23m [3m[90m<int>[39m[23m
[90m1[39m High                   107     6
[90m2[39m Low                     65     1
[90m3[39m Medium                 114     7
[1] "Using table() function:"
        
         Large Small
  High     107     6
  Low       65     1
  Medium   114     7


## Part 7: Business Intelligence Dashboard

**Task**: Create a comprehensive business intelligence summary that combines all your analysis techniques.

**Business Context**: Executive dashboards consolidate multiple analyses into actionable insights. This section simulates creating a report that would be presented to business stakeholders for strategic decision-making.

**Dashboard sections to create**:
1. **Key Performance Indicators (KPIs)**: Overall business health metrics
2. **Top Performers**: Best regions, categories, and sales representatives
3. **Performance Insights**: Distribution of high/medium/low performers
4. **Efficiency Metrics**: Cost ratios, ROI, and operational efficiency
5. **Strategic Recommendations**: Data-driven business recommendations

**Hint**: Use `cat()` or `print()` statements to create formatted output that resembles a professional business report.

In [14]:
# Create a comprehensive business intelligence dashboard
cat("\n", "=", rep("=", 60), "\n")
cat("HALE'S BUSINESS INTELLIGENCE DASHBOARD\n")
cat("=", rep("=", 60), "\n\n")

# Section 1: Key Performance Indicators
cat("📊 KEY PERFORMANCE INDICATORS\n")
cat("─────────────────────────────────\n")

# TODO: Calculate and display overall KPIs
# - Total Revenue, Total Profit, Overall Profit Margin
# - Total Transactions, Average Deal Size
# - High-Value Customer Rate, Profitable Deal Rate

kpi_summary <- company_data %>%
  summarize(
    # TODO: Add your KPI calculations here
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_profit = sum(Profit, na.rm = TRUE),
    overall_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    total_transactions = n(),
    avg_deal_size = mean(Units_Sold, na.rm = TRUE),
    high_value_customers = sum(High_Value_Customer == "Yes"),
    profitable_deals = sum(Profitable_Deal == "Yes"),
    high_value_customer_rate = sum(High_Value_Customer == "Yes") / n() * 100,
    profitable_deal_rate = sum(Profitable_Deal == "Yes") / n() * 100
  )

# TODO: Display KPIs using cat() or print() with proper formatting
cat("Total Revenue: $", round(kpi_summary$total_revenue, 2), "\n")
cat("Total Profit: $", round(kpi_summary$total_profit, 2), "\n")
cat("Overall Profit Margin: ", round(kpi_summary$overall_profit_margin, 2), "%\n")
cat("Total Transactions: ", kpi_summary$total_transactions, "\n")
cat("Average Deal Size (Units Sold): ", round(kpi_summary$avg_deal_size, 2), "\n")
cat("High-Value Customers: ", kpi_summary$high_value_customers, "(", round(kpi_summary$high_value_customer_rate, 2), "%)\n", sep = "")
cat("Profitable Deals: ", kpi_summary$profitable_deals, "(", round(kpi_summary$profitable_deal_rate, 2), "%)\n", sep = "")

# Section 2: Top Performers
cat("\n🏆 TOP PERFORMERS\n")
cat("─────────────────\n")

# TODO: Display top performing region
top_region <- company_data %>%
group_by(Region) %>%
summarize(total_revenue = sum(Revenue, na.rm = TRUE)) %>%
arrange(desc(total_revenue)) %>%
head(1)
cat("Top Region: ", top_region$Region, " ($", round(top_region$total_revenue, 2), ")\n", sep="")
# TODO: Get the region with highest total revenue

# TODO: Display top performing product category  
top_category <- company_data %>%
group_by(Product_Category) %>%
summarize(total_revenue = sum(Revenue, na.rm = TRUE)) %>%
arrange(desc(total_revenue)) %>%
head(1)
cat("\nTop Product Category: ", top_category$Product_Category, " ($", round(top_category$total_revenue, 2), ")\n", sep="")
# TODO: Get the product category with highest total revenue

# TODO: Display performance distribution
cat("\n📈 PERFORMANCE DISTRIBUTION\n")
cat("──────────────────────────\n")
# TODO: Show the count and percentage of High/Medium/Low performers
performance_distribution <- company_data %>%
count(Performance_Category) %>%
mutate(percent = round(n / sum(n) * 100, 1))

print(performance_distribution)


 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 
HALE'S BUSINESS INTELLIGENCE DASHBOARD
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = 

📊 KEY PERFORMANCE INDICATORS
─────────────────────────────────


Total Revenue: $ 7771711 
Total Profit: $ 3407512 
Overall Profit Margin:  44.22 %
Total Transactions:  300 
Average Deal Size (Units Sold):  53.9 
High-Value Customers: 154(51.33%)
Profitable Deals: 168(56%)

🏆 TOP PERFORMERS
─────────────────
Top Region: Europe ($2224182)

Top Product Category: Consulting ($1978840)

📈 PERFORMANCE DISTRIBUTION
──────────────────────────
[90m# A tibble: 3 × 3[39m
  Performance_Category     n percent
  [3m[90m<chr>[39m[23m                [3m[90m<int>[39m[23m   [3m[90m<dbl>[39m[23m
[90m1[39m High                   113    37.7
[90m2[39m Low                     66    22  
[90m3[39m Medium                 121    40.3


## Part 8: Data Validation and Quality Checks

**Task**: Validate your calculated fields and check for data quality issues.

**Business Context**: Data quality is paramount in business analytics. Before presenting results to stakeholders, analysts must verify that calculations are correct and identify any data anomalies that could affect decision-making.

## Part 8: Data Validation and Quality Checks

**Task**: Validate your calculated fields and check for data quality issues.

**Business Context**: Data quality is paramount in business analytics. Before presenting results to stakeholders, analysts must verify that calculations are correct and identify any data anomalies that could affect decision-making. In professional settings, data validation is often the difference between accurate insights and costly business mistakes.

**Why Validation is Critical:**
- **Stakeholder Trust**: Executives need confidence in your analysis
- **Decision Impact**: Business decisions based on incorrect data can be costly
- **Regulatory Compliance**: Many industries require data accuracy documentation
- **Professional Credibility**: Accurate analysis establishes your reputation as a reliable analyst

**Validation Methodology:**
1. **Calculation Verification**: Mathematical accuracy of derived metrics
2. **Range Validation**: Ensure values fall within business-logical ranges
3. **Outlier Detection**: Identify unusual values that might indicate data issues
4. **Consistency Checks**: Verify that related metrics align logically
5. **Missing Value Analysis**: Check for gaps in critical business data
6. **Business Logic Testing**: Ensure results make practical business sense

**Professional Validation Techniques:**
- Use `all.equal()` for precise mathematical verification
- Apply `summary()` to identify outliers and impossible values
- Implement logical tests for business rule compliance
- Cross-check totals and subtotals for internal consistency
- Validate categorical assignments against their criteria

**Hint**: Think like a skeptical executive - question every calculation and assumption. If you can't defend your numbers, neither should your stakeholders trust them.

In [15]:
# Comprehensive Data Validation and Quality Assurance
cat("📋 COMPREHENSIVE DATA VALIDATION REPORT\n")
cat("════════════════════════════════════════\n\n")

# SECTION 1: Mathematical Calculation Verification
cat("🧮 SECTION 1: CALCULATION VERIFICATION\n")
cat("─────────────────────────────────────────\n")

# 1. Validate profit calculation (Profit = Revenue - Cost)
# TODO: Check if Profit equals Revenue - Cost using all.equal()
profit_validation <- all.equal(company_data$Profit, company_data$Revenue - company_data$Cost)
cat("✓ Profit calculation (Revenue - Cost): ",
ifelse(isTRUE(profit_validation), "✅ PASSED", "❌ FAILED"), "\n")

# 2. Validate profit margin calculation ((Profit/Revenue)*100)
# TODO: Check if Profit_Margin equals (Profit/Revenue)*100
expected_profit_margin <- (company_data$Profit / company_data$Revenue) * 100
margin_validation <- all.equal(company_data$Profit_Margin, expected_profit_margin)
cat("✓ Profit Margin calculation: ",
ifelse(isTRUE(margin_validation), "✅ PASSED", "❌ FAILED"), "\n")

# 3. Validate ROI calculation ((Profit/Cost)*100)
# TODO: Check if ROI equals (Profit/Cost)*100
expected_roi <- (company_data$Profit / company_data$Cost) * 100
roi_validation <- all.equal(company_data$ROI, expected_roi)
cat("✓ ROI calculation: ",
ifelse(isTRUE(roi_validation), "✅ PASSED", "❌ FAILED"), "\n")

# SECTION 2: Range and Outlier Analysis
cat("\n📊 SECTION 2: RANGE AND OUTLIER VALIDATION\n")
cat("──────────────────────────────────────────────\n")

# 4. Check for extreme profit margins (business logic: should rarely exceed 100% or be below -50%)
# TODO: Count transactions with extreme profit margins (>100% or <-50%)
extreme_margins <- sum(company_data$Profit_Margin > 100 | company_data$Profit_Margin < -50, na.rm = TRUE)
cat("⚠️  Extreme profit margins (>100% or <-50%): ", extreme_margins, " transactions\n")

# 5. Check for impossible business values
# TODO: Count negative revenues (impossible in normal business)
negative_revenue <- sum(company_data$Revenue < 0, na.rm = TRUE)
cat("⚠️  Negative revenue values: ", negative_revenue, " transactions\n")

# TODO: Count zero or negative units sold (business logic violation)
invalid_units <- sum(company_data$Units_Sold <= 0, na.rm = TRUE)
cat("⚠️  Invalid units sold (≤0): ", invalid_units, " transactions\n")

# SECTION 3: Missing Data Analysis
cat("\n🔍 SECTION 3: MISSING DATA ANALYSIS\n")
cat("──────────────────────────────────────\n")

# 6. Check for missing values in critical business columns
# TODO: Count missing values in Revenue, Cost, and Units_Sold
cat("Missing value analysis:\n")
missing_revenue <- sum(is.na(company_data$Revenue))
missing_cost <- sum(is.na(company_data$Cost))
missing_units <- sum(is.na(company_data$Units_Sold))
missing_profit <- sum(is.na(company_data$Profit))


cat("  • Missing Revenue values: ", missing_revenue, "\n")
cat("  • Missing Cost values: ", missing_cost, "\n") 
cat("  • Missing Units_Sold values: ", missing_units, "\n")
cat("  • Missing Profit values: ", missing_profit, "\n")


# SECTION 4: Categorical Variable Validation
cat("\n🏷️  SECTION 4: CATEGORICAL VARIABLE VALIDATION\n")
cat("─────────────────────────────────────────────────\n")

# 7. Validate Performance_Category assignments
# TODO: Check if High performers actually have Profit_Margin > 50
high_perf_correct <- sum(company_data$Performance_Category == "High" & company_data$Profit_Margin > 50, na.rm = TRUE)
high_perf_total <- sum(company_data$Performance_Category == "High", na.rm = TRUE)
cat("✓ High Performance category accuracy: ", high_perf_correct, "/", high_perf_total, " correctly assigned\n")

# TODO: Check if Revenue_Size categories match their criteria
large_revenue_correct <- sum(company_data$Revenue_Size == "Large" & company_data$Revenue > 30000, na.rm = TRUE)
large_revenue_total <- sum(company_data$Revenue_Size == "Large", na.rm = TRUE)
cat("✓ Large Revenue_Size category accuracy: ", large_revenue_correct, "/", large_revenue_total, " correctly assigned\n")

# SECTION 5: Summary Statistics Review
cat("\n📈 SECTION 5: SUMMARY STATISTICS REVIEW\n")
cat("─────────────────────────────────────────\n")

cat("Key financial metrics summary (check for reasonableness):\n")
# TODO: Display summary statistics for Profit_Margin and ROI
cat("\nProfit Margin Distribution:\n")
print(summary(company_data$Profit_Margin))

cat("\nROI Distribution:\n")
print(summary(company_data$ROI))

# SECTION 6: Business Logic Validation
cat("\n💼 SECTION 6: BUSINESS LOGIC VALIDATION\n")
cat("─────────────────────────────────────────\n")

# 8. Check if Cost_Ratio + Profit_Margin approximately equals 100%
# TODO: Verify that Cost_Ratio + Profit_Margin ≈ 100% (allowing for rounding)
total_percentage_check <- abs((company_data$Cost_Ratio + company_data$Profit_Margin) - 100)
percentage_errors <- sum(total_percentage_check > 1, na.rm = TRUE)  # Allow 1% tolerance for rounding
cat("✓ Cost_Ratio + Profit_Margin = 100% check: ", percentage_errors, " transactions outside tolerance\n")

# 9. Verify Revenue_Per_Unit and Cost_Per_Unit calculations
# TODO: Check if Revenue_Per_Unit = Revenue / Units_Sold
expected_revenue_per_unit <- company_data$Revenue / company_data$Units_Sold
revenue_per_unit_check <- all.equal(company_data$Revenue_Per_Unit, expected_revenue_per_unit)
cat("✓ Revenue_Per_Unit calculation: ")
cat(ifelse(revenue_per_unit_check == TRUE, "✅ PASSED", "❌ FAILED"), "\n")

# VALIDATION SUMMARY
cat("\n📋 VALIDATION SUMMARY REPORT\n")
cat("═══════════════════════════════\n")
cat("Data validation completed. Review all sections above before proceeding.\n")
cat("\n🎯 Next Steps:\n")
cat("1. Investigate any failed validations or extreme values\n")
cat("2. Document validation results for stakeholder presentation\n")
cat("3. Consider data cleaning if significant issues are found\n")
cat("4. Proceed with analysis only after validation passes\n")
cat("\n⚠️  Remember: Never present unvalidated results to business stakeholders!\n")

📋 COMPREHENSIVE DATA VALIDATION REPORT
════════════════════════════════════════

🧮 SECTION 1: CALCULATION VERIFICATION
─────────────────────────────────────────
✓ Profit calculation (Revenue - Cost):  ✅ PASSED 
✓ Profit Margin calculation:  ✅ PASSED 
✓ ROI calculation:  ✅ PASSED 

📊 SECTION 2: RANGE AND OUTLIER VALIDATION
──────────────────────────────────────────────
⚠️  Extreme profit margins (>100% or <-50%):  0  transactions
⚠️  Negative revenue values:  0  transactions
⚠️  Invalid units sold (≤0):  0  transactions

🔍 SECTION 3: MISSING DATA ANALYSIS
──────────────────────────────────────
Missing value analysis:
  • Missing Revenue values:  0 
  • Missing Cost values:  0 
  • Missing Units_Sold values:  0 
  • Missing Profit values:  0 

🏷️  SECTION 4: CATEGORICAL VARIABLE VALIDATION
─────────────────────────────────────────────────
✓ High Performance category accuracy:  113 / 113  correctly assigned


✓ Large Revenue_Size category accuracy:  122 / 286  correctly assigned

📈 SECTION 5: SUMMARY STATISTICS REVIEW
─────────────────────────────────────────
Key financial metrics summary (check for reasonableness):

Profit Margin Distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.13   31.17   43.70   44.22   56.55   69.93 

ROI Distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.20   45.28   77.63   93.49  130.16  232.57 

💼 SECTION 6: BUSINESS LOGIC VALIDATION
─────────────────────────────────────────
✓ Cost_Ratio + Profit_Margin = 100% check:  0  transactions outside tolerance
✓ Revenue_Per_Unit calculation: ✅ PASSED 

📋 VALIDATION SUMMARY REPORT
═══════════════════════════════
Data validation completed. Review all sections above before proceeding.

🎯 Next Steps:
1. Investigate any failed validations or extreme values
2. Document validation results for stakeholder presentation
3. Consider data cleaning if significant issues are found
4. Proceed with analy

## Part 9: Business Insights and Recommendations

**Task**: Synthesize your analysis into actionable business insights and strategic recommendations.

**Business Context**: The ultimate goal of data analysis is to drive business decisions. This section requires you to think like a business consultant, translating your technical findings into strategic recommendations that stakeholders can act upon.

**Required deliverables**:
1. **Key Findings**: Top 3-5 most important insights from your analysis
2. **Performance Gaps**: Areas where the business is underperforming
3. **Opportunities**: Segments or regions with growth potential  
4. **Strategic Recommendations**: Specific, actionable suggestions
5. **Risk Factors**: Data-driven identification of potential business risks

**Hint**: Frame your insights in business terms, focusing on revenue impact, efficiency improvements, and strategic opportunities rather than just statistical findings.

In [43]:
# Generate business insights and recommendations
cat("💡 BUSINESS INSIGHTS & RECOMMENDATIONS\n")
cat("════════════════════════════════════════\n\n")

cat("🔍 KEY FINDINGS:\n")
cat("─────────────────\n")

# TODO: Identify and present your top 3-5 key findings
# For example:
# 1. Which region generates the most revenue?
# 2. What percentage of deals are highly profitable?
# 3. Which product category has the best margins?
# 4. How many transactions fall into each performance category?

# Find top performing region
top_region_data <- regional_performance %>% slice(1)
cat("1. Top Region: ", top_region_data$Region, " generates $", 
    format(top_region_data$total_revenue, big.mark = ","), 
    " (", round(top_region_data$total_revenue, 1), "% of total revenue)\n")

cat("\n2. Profitable Deals: ", kpi_summary$profitable_deals, "  (", round(kpi_summary$profitable_deal_rate, 2), "% of total deals)\n", sep = "")

# TODO: Display top performing product category  
top_category <- company_data %>%
group_by(Product_Category) %>%
summarize(total_revenue = sum(Revenue, na.rm = TRUE)) %>%
arrange(desc(total_revenue)) %>%
head(1)
cat("\n3. Top Product Category: ", top_category$Product_Category, "  generates $",
format(top_category$total_revenue, big.mark = ","), 
" (", round(top_category$total_revenue, 1), "% of total revenue)\n")

# Product Category Analysis
category_performance <- company_data %>%
  group_by(Product_Category) %>%  # TODO: Group by Product_Category
  summarize(
    # TODO: Calculate the same metrics as regional analysis
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_pofit = sum(Profit, na.rm = TRUE),
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    transaction_count = n(),
    total_units = sum(Units_Sold, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_revenue))
  best_margin_category <- category_performance %>% slice(1)
  cat("\n4. Best Margin Category: ", best_margin_category$Product_Category, " with avg margin of ", round(best_margin_category$avg_profit_margin, 2), "%\n", sep = "")

cat("\n5. Performance Category Distribution:\n")
print(table(company_data$Performance_Category))

# TODO: Display performance distribution
cat("\n📈 PERFORMANCE DISTRIBUTION\n")
cat("──────────────────────────\n")
# TODO: Show the count and percentage of High/Medium/Low performers
cat("1. Performance Summary:\n")
performance_distribution <- company_data %>%
count(Performance_Category) %>%
mutate(percent = round(n / sum(n) * 100, 1))
print(performance_distribution)

# Product Category Analysis
category_performance <- company_data %>%
  group_by(Product_Category) %>%  # TODO: Group by Product_Category
  summarize(
    # TODO: Calculate the same metrics as regional analysis
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_profit = sum(Profit, na.rm = TRUE),
    avg_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    transaction_count = n(),
    total_units = sum(Units_Sold, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # TODO: Add revenue_share and arrange by total_revenue
  mutate(
    revenue_share = (total_revenue / sum(total_revenue)) * 100
  ) %>%
  arrange(desc(total_revenue))
cat("\n2. Product Category Performance Analysis:")
print(category_performance)

# TODO: Display summary statistics for Profit_Margin and ROI
cat("\n3. Profit Margin Distribution:\n")
print(summary(company_data$Profit_Margin))

cat("\n4. ROI Distribution:\n")
print(summary(company_data$ROI))

cat("\n5. Key Performance Indicators: \n")
kpi_summary <- company_data %>%
  summarize(
    # TODO: Add your KPI calculations here
    total_revenue = sum(Revenue, na.rm = TRUE),
    total_profit = sum(Profit, na.rm = TRUE),
    overall_profit_margin = mean(Profit_Margin, na.rm = TRUE),
    total_transactions = n(),
    avg_deal_size = mean(Units_Sold, na.rm = TRUE),
    high_value_customers = sum(High_Value_Customer == "Yes"),
    profitable_deals = sum(Profitable_Deal == "Yes"),
    high_value_customer_rate = sum(High_Value_Customer == "Yes") / n() * 100,
    profitable_deal_rate = sum(Profitable_Deal == "Yes") / n() * 100
  )

# TODO: Display KPIs using cat() or print() with proper formatting
cat("Total Revenue: $", round(kpi_summary$total_revenue, 2), "\n")
cat("Total Profit: $", round(kpi_summary$total_profit, 2), "\n")
cat("Overall Profit Margin: ", round(kpi_summary$overall_profit_margin, 2), "%\n")
cat("Total Transactions: ", kpi_summary$total_transactions, "\n")
cat("Average Deal Size (Units Sold): ", round(kpi_summary$avg_deal_size, 2), "\n")
cat("High-Value Customers: ", kpi_summary$high_value_customers, "(", round(kpi_summary$high_value_customer_rate, 2), "%)\n", sep = "")
cat("Profitable Deals: ", kpi_summary$profitable_deals, "(", round(kpi_summary$profitable_deal_rate, 2), "%)\n", sep = "")

# TODO: Add 3-4 more key findings using your analysis results


cat("\n💰 PERFORMANCE ANALYSIS:\n")
cat("────────────────────────\n")

# TODO: Analyze the distribution of performance categories
# What percentage are high/medium/low performers?
performance_distribution <- company_data %>%
count(Performance_Category) %>%
mutate(percent = round(n / sum(n) * 100, 1))
print(performance_distribution)

cat("\n🎯 STRATEGIC RECOMMENDATIONS:\n") 
cat("──────────────────────────────\n")

# TODO: Provide 4-5 specific, actionable recommendations based on your analysis
cat("1. Focus expansion efforts on top-performing regions and product categories\n")
cat("2. Investigate success factors from high-performing segments for replication\n")
cat("3. Come up with a way to help medium performers do better so more of them become high performers\n")
cat("4. See if low margin products can be improved by changing prices\n")
cat("5. Make a loyalty program for top customers to keep them coming back and spending more\n")
# TODO: Add 3 more recommendations


cat("\n⚠️  RISK FACTORS & OPPORTUNITIES:\n")
cat("──────────────────────────────────\n")

# TODO: Identify potential risks and opportunities based on your data analysis

cat("Risks: \n")
cat("1. We depend a lot on a few top regions and products so if they dont do well revenue could drop\n")
cat("2. New products dont make much of a profit\n")
cat("3. Sales could change with seasons\n")

cat("\nOpportunities: \n")
cat("1. Try new products and new regions\n")
cat("2. Create seasonal events to boost sales\n")
cat("3. Add more variations of products to increase revenue")


💡 BUSINESS INSIGHTS & RECOMMENDATIONS
════════════════════════════════════════



🔍 KEY FINDINGS:
─────────────────
1. Top Region:  Europe  generates $ 2,224,182  ( 2224182 % of total revenue)

2. Profitable Deals: 168  (56% of total deals)

3. Top Product Category:  Consulting   generates $ 1,978,840  ( 1978840 % of total revenue)

4. Best Margin Category: Consulting with avg margin of 44.45%

5. Performance Category Distribution:

  High    Low Medium 
   113     66    121 

📈 PERFORMANCE DISTRIBUTION
──────────────────────────
1. Performance Summary:
[90m# A tibble: 3 × 3[39m
  Performance_Category     n percent
  [3m[90m<chr>[39m[23m                [3m[90m<int>[39m[23m   [3m[90m<dbl>[39m[23m
[90m1[39m High                   113    37.7
[90m2[39m Low                     66    22  
[90m3[39m Medium                 121    40.3

2. Product Category Performance Analysis:[90m# A tibble: 4 × 7[39m
  Product_Category total_revenue total_profit avg_profit_margin
  [3m[90m<chr>[39m[23m                    [3m[90m<dbl>[39m[23m        [3m[90m<

## Reflection Questions

**Task**: Answer the following questions to demonstrate your understanding of the business applications of data transformation.

**Business Context**: Reflection is a critical part of the learning process in data analytics. Understanding not just how to perform analyses, but when and why to use different techniques, distinguishes proficient analysts from experts.

**Questions to address**:

1. **Technical Mastery**: Which dplyr function (mutate, summarize, group_by, count) did you find most powerful for business analysis, and why?
   1. I think summarize() was the most useful, because when it is used with group(), you can see so many different patters in specific parts of the data next to each other like in the product category analysis in part 9.
2. **Business Impact**: What was the most surprising or valuable insight you discovered in your analysis?
   1. The most surprising thing I saw was that most of the revenue comes from a few products. The top products make up most of the sales. It made me think about how many companies have the same thing happening to them?

3. **Strategic Thinking**: If you were presenting these findings to a company executive, which 3 insights would you prioritize and why?
   1. If I was presenting to an executive I would choose to focus on top regions, amount of profitable deals, top product category. I would focus on these because the top region brings in the most money, if the amount of profitable deals was increased, and the top products were focused on a little bit more they could be more variety in that specific item. Overall, I would focus on increasing profits.

4. **Methodology**: How did grouping and segmentation help reveal patterns that weren't visible in the raw data?
   1. Grouping helped gather all the data together into different segments which helped reveal some patterns. It could be used to see with regions are doing well and which aren't doing so well. That way the company can dig deeper to see why a region isn't doing as good as others.

5. **Future Applications**: What additional business questions could you answer with this dataset using the techniques you've learned?
   1. I could answer questions like Which months had the most sales. Which sales rep makes the most sales, how much revenue then bring in, or even which product they eat make the most revenue off of. Which in turn could help get rid of low performing employees, and maybe even reward some the the higher performing employees to keep them motivated, and feel appreciated. 

**Instructions**: Write thoughtful, paragraph-length responses that demonstrate both technical understanding and business acumen.

### Reflection Responses

**1. Most Powerful Function for Business Analysis:**

I think summarize() was the most useful, because when it is used with group(), you can see so many different patters in specific parts of the data next to each other like in the product category analysis in part 9.

**2. Most Valuable Business Insight:**

I would focus on the few products that bring in the most revenue. If the top products were focused on a little bit more they could be more variety in that specific item, which could lead to more sales.

**3. Top 3 Insights for Executive Presentation:**

I would value top regions since they bring in the most money, top products since they also bring in the most money, and look further into medium and low performers. That way we aren't just focusing on the categories that are already the top performers, but focus on some of the categories that are low performers. Even though, a product and region makes the most money, there is always room for improvement.

**4. Value of Grouping and Segmentation:**

Grouping data helps reveal patterns tht aren't easily noticeable in the raw data. Grouping helps compare specific groups of data, and helps make calculating averages, totals,and percentages easier.

**5. Future Business Applications:**

An additional analysis could focus on the employees performance as well. Focusing on revenue in a business is good, but you also are investing in your employees. Finding the lower performing employees could help you terminate them, and finding the employees that make the most revenue could get a raise, or some type of gift that shows you appreciate them. A little bit of kindness can go a long way with employee performance.

## Submission Checklist

Before submitting your homework, ensure you have completed all sections:

### Technical Requirements
- [ ] **Part 1**: Successfully imported data and performed initial exploration
- [ ] **Part 2**: Created all required financial metrics using mutate()
- [ ] **Part 3**: Developed categorical variables for business segmentation  
- [ ] **Part 4**: Generated comprehensive summary statistics
- [ ] **Part 5**: Performed grouped analysis by multiple business dimensions
- [ ] **Part 6**: Conducted frequency analysis and cross-tabulations
- [ ] **Part 7**: Created a professional business intelligence dashboard
- [ ] **Part 8**: Validated data quality and calculation accuracy
- [ ] **Part 9**: Synthesized findings into business insights and recommendations

### Code Quality
- [ ] All code sections are completed with working R code
- [ ] Code is well-commented and follows best practices
- [ ] Variable names are meaningful and consistent
- [ ] Proper use of the pipe operator (%>%) for readable code chains
- [ ] Appropriate use of .groups = 'drop' after group_by operations

### Business Analysis
- [ ] Analysis demonstrates understanding of business context
- [ ] Insights are presented in business-friendly language
- [ ] Recommendations are specific and actionable
- [ ] Cross-dimensional analysis reveals meaningful patterns
- [ ] Dashboard provides comprehensive performance overview

### Reflection and Learning
- [ ] Reflection questions answered thoughtfully and completely
- [ ] Responses demonstrate both technical and business understanding
- [ ] Examples from your own analysis support your reflections

**Final Note**: This homework simulates real-world business analytics work. Focus not just on getting the right technical results, but on developing the analytical thinking skills that drive business value. In professional settings, your ability to translate data into actionable insights is what makes you valuable to an organization.

**Good luck, and remember**: Great analysts don't just manipulate data—they uncover the stories that data tells about business performance and opportunities!

---

## 🚀 Ready to Submit?

### Easy Submission Steps (No Command Line Required!):

1. **Save this notebook** (Ctrl+S or File → Save)

2. **Use VS Code Source Control**:
   - Click the **Source Control** icon in the left sidebar (tree branch symbol)
   - Click the **"+"** button next to your notebook file
   - Type a message: `Submit homework 2 - Data Cleaning - [Your Name]`
   - Click **"Commit"** 
   - Click **"Sync Changes"** or **"Push"**

3. **Verify on GitHub**: Go to your repository online and confirm your notebook appears with your completed work

**📖 Need help?** See [GITHUB_CLASSROOM_SUBMISSION.md](../../GITHUB_CLASSROOM_SUBMISSION.md) for detailed instructions.
