# Homework Assignment - Lesson 2: Data Cleaning - Handling Missing Values and Outliers

**Student Name:** [YOUR NAME HERE]  
**Date:** [TODAY'S DATE]  
**Course:** Data Management  

---

## Instructions

Complete all the tasks below by adding your R code in the code cells and your written responses in markdown cells. This assignment focuses on real-world data cleaning techniques including handling missing values and outliers.

**💡 Key Learning Goals:**
- Identify and handle missing values using multiple strategies
- Detect and treat outliers using statistical methods
- Make informed decisions about data quality trade-offs
- Document your data cleaning process and reasoning

**📋 SUBMISSION**: When you're done, see [GITHUB_CLASSROOM_SUBMISSION.md](../../GITHUB_CLASSROOM_SUBMISSION.md) for complete submission instructions.

---

### Part 1: Data Import and Initial Assessment

In this section, you'll import a "messy" dataset that contains missing values and outliers, simulating real-world data quality challenges.

#### 1.1 Environment Setup

Load the required packages for data cleaning and analysis.

In [28]:
# 1.1 — Environment Setup
# (Skip tidyverse to avoid install prompts)
# library(tidyverse)

# Confirm where this notebook runs (should end with /Homework)
getwd()


#### 1.2 Import Messy Dataset

Import the provided messy sales dataset that contains real-world data quality issues including missing values, outliers, and inconsistencies.

In [29]:
# 1.2 — Import Messy Dataset
df <- read.csv("../../data/messy_sales_data.csv")  # go up two levels, then into data/
messy_sales <- df  # alias: the template cells further down refer to the data as messy_sales
cat("Loaded ../../data/messy_sales_data.csv | shape:", nrow(df), "rows x", ncol(df), "cols\n")
head(df); str(df)


Loaded ../../data/messy_sales_data.csv | shape: 200 rows x 6 cols


| | TransactionID | Customer_Name | Product_Category | Sales_Amount | Purchase_Date | Quantity |
|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<int>` |
| 1 | 1 | | Home | 362.3175 | | 2 |
| 2 | 2 | Alice Brown | Clothing | 573.0791 | 2023-10-21 | 3 |
| 3 | 3 | Jane Doe | Electronics | 487.6874 | 2023-12-28 | -1 |
| 4 | 4 | Jane Doe | Electronics | 5000.0 | 2023-06-16 | 7 |
| 5 | 5 | John Smith | Books | 344.1746 | 2023-05-05 | 100 |
| 6 | 6 | John Smith | Books | 434.9527 | 2023-11-28 | 4 |


'data.frame':	200 obs. of  6 variables:
 $ TransactionID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Customer_Name   : chr  "" "Alice Brown" "Jane Doe" "Jane Doe" ...
 $ Product_Category: chr  "Home" "Clothing" "Electronics" "Electronics" ...
 $ Sales_Amount    : num  362 573 488 5000 344 ...
 $ Purchase_Date   : chr  "" "2023-10-21" "2023-12-28" "2023-06-16" ...
 $ Quantity        : int  2 3 -1 7 100 4 0 7 3 2 ...
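
The `str()` output above shows empty strings (`""`) in `Customer_Name` and `Purchase_Date`. `is.na()` does not count those, so NA-based missing-value counts can understate the problem. The sketch below is one possible way to convert blanks to `NA`. The `toy` data frame and the `blank_to_na()` helper are hypothetical names introduced only for illustration; the values are copied from the first rows printed above.

```r
# Toy stand-in for the first rows shown above (not the real file)
toy <- data.frame(
  Customer_Name = c("", "Alice Brown", "Jane Doe"),
  Purchase_Date = c("", "2023-10-21", "2023-12-28"),
  Quantity      = c(2L, 3L, -1L),
  stringsAsFactors = FALSE
)

# is.na() does not treat empty strings as missing
sum(is.na(toy))

# Hypothetical helper: turn blank or whitespace-only strings
# in character columns into NA, leaving other columns untouched
blank_to_na <- function(d) {
  is_chr <- vapply(d, is.character, logical(1))
  d[is_chr] <- lapply(d[is_chr], function(x) {
    x[!is.na(x) & trimws(x) == ""] <- NA_character_
    x
  })
  d
}

toy_na <- blank_to_na(toy)
colSums(is.na(toy_na))
```

An alternative, assuming you are free to change the import call, is to pass `na.strings = c("", "NA")` to `read.csv()` so that blanks are read as `NA` from the start.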


#### 1.3 Initial Data Assessment

Perform a comprehensive inspection of the messy dataset to understand its structure and identify data quality issues.

In [30]:
# 1.3 — Initial Data Assessment (uses df loaded in 1.2)

# Basic shape
cat("Rows:", nrow(df), " | Cols:", ncol(df), "\n")

# Structure and quick summary
str(df)
summary(df)

# Missing values per column (count + percent)
miss_by_col <- data.frame(
  column = names(df),
  n_missing = colSums(is.na(df)),
  pct_missing = round(100 * colSums(is.na(df)) / nrow(df), 2)
)
miss_by_col <- miss_by_col[order(-miss_by_col$pct_missing), ]
print(miss_by_col, row.names = FALSE)

# Duplicate rows
cat("Duplicate rows:", sum(duplicated(df)), "\n")

# Numeric columns: distribution stats
num_cols <- names(df)[sapply(df, is.numeric)]
if (length(num_cols) > 0) {
  num_summary <- data.frame(
    column = num_cols,
    mean   = sapply(df[num_cols], function(x) mean(x, na.rm=TRUE)),
    sd     = sapply(df[num_cols], function(x) sd(x, na.rm=TRUE)),
    min    = sapply(df[num_cols], function(x) min(x, na.rm=TRUE)),
    q1     = sapply(df[num_cols], function(x) quantile(x, 0.25, na.rm=TRUE)),
    median = sapply(df[num_cols], function(x) median(x, na.rm=TRUE)),
    q3     = sapply(df[num_cols], function(x) quantile(x, 0.75, na.rm=TRUE)),
    max    = sapply(df[num_cols], function(x) max(x, na.rm=TRUE))
  )
  print(num_summary, row.names = FALSE)
}

In [None]:
# Structure and summary of the data
print("=== DATA STRUCTURE ===")
str(messy_sales)

print("=== SUMMARY STATISTICS ===")
summary(messy_sales)

**Data Quality Assessment:**

Based on the imported messy_sales dataset, document all the data quality issues you observe:

1. **Missing Values:** [Look for NA values - which columns have missing data?]

2. **Potential Outliers:** [Check Sales_Amount and Quantity - do any values seem extreme?]

3. **Data Inconsistencies:** [Look at Product_Category - are there inconsistent naming conventions?]

4. **Data Types:** [Are Purchase_Date and Sales_Amount using appropriate data types?]

5. **Invalid Values:** [Are there any logically impossible values like negative quantities?]
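
Checks 3 and 5 in the list above can be done programmatically. The following is a minimal sketch on a hypothetical `toy` data frame. The inconsistent category spellings are invented for illustration, and treating a quantity of zero as invalid is an assumption you should confirm against the business rules.

```r
# Hypothetical sample: category spellings are made up for illustration
toy <- data.frame(
  Product_Category = c("Home", "home ", "Electronics", "ELECTRONICS", "Books"),
  Quantity         = c(2L, -1L, 7L, 0L, 100L),
  stringsAsFactors = FALSE
)

# Categories that collapse to the same label after trimming and lower-casing
normalized <- tolower(trimws(toy$Product_Category))
variants <- tapply(toy$Product_Category, normalized,
                   function(x) length(unique(x)))
inconsistent <- names(variants)[as.vector(variants > 1)]
inconsistent

# Rows with non-positive quantities (assumption: zero counts as invalid)
invalid_qty <- toy[toy$Quantity <= 0, ]
invalid_qty
```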

**YOUR OBSERVATIONS:**

[Write your detailed observations here after running the code above]

---

### Part 2: Missing Value Analysis and Treatment

In this section, you'll identify missing values and apply different strategies to handle them.



#### 2.1 Missing Value Identification

Complete the following tasks to thoroughly understand the missing value patterns in your dataset.

In [None]:
# 2.1 — Quantify missingness
col_missing <- colSums(is.na(df))
miss_summary <- data.frame(
  column = names(col_missing),
  n_missing = as.integer(col_missing),
  pct_missing = round(100 * col_missing / nrow(df), 2)
)[order(-col_missing), ]
print(miss_summary, row.names = FALSE)
cat("TOTAL missing cells:", sum(is.na(df)),
    " | fully-complete rows:", sum(complete.cases(df)), "\n")

# artifact for grading
write.csv(miss_summary, "missing_values_summary.csv", row.names = FALSE)


#### 2.2 Missing Value Treatment - Option A (Removal)

In [None]:
# 2.2 — Removal (listwise deletion)
df_removed <- na.omit(df)
cat("After removal:", nrow(df_removed), "rows x", ncol(df_removed), "cols\n")
write.csv(df_removed, "clean_removed_missing.csv", row.names = FALSE)


In [None]:
# TODO: Remove all rows with missing values
sales_removed_na <- # YOUR CODE HERE

# Compare dimensions
print("Original dataset dimensions:")
print(dim(messy_sales))
print("After removing NA rows:")
print(dim(sales_removed_na))
print(paste("Rows lost:", nrow(messy_sales) - nrow(sales_removed_na)))

#### 2.3 Missing Value Treatment - Option B (Imputation)


Apply appropriate imputation strategies for different types of variables.

In [None]:
# 2.3 — Simple imputation: mean for numeric, mode otherwise
Mode <- function(x){ ux <- unique(x[!is.na(x)]); ux[which.max(tabulate(match(x, ux)))] }
impute_simple <- function(d){
  out <- d
  for(nm in names(out)){
    x <- out[[nm]]
    if(is.numeric(x)){
      out[[nm]][is.na(x)] <- mean(x, na.rm = TRUE)
    } else {
      if(!all(is.na(x))) out[[nm]][is.na(x)] <- Mode(x)
    }
  }
  stopifnot(sum(is.na(out)) == 0)  # must end with zero NAs
  out
}
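df_imputed <- impute_simple(df)
sales_imputed <- df_imputed  # alias: the template cells below refer to the imputed data as sales_imputed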
cat("After imputation:", nrow(df_imputed), "rows x", ncol(df_imputed), "cols; NAs:", sum(is.na(df_imputed)), "\n")
write.csv(df_imputed, "clean_imputed.csv", row.names = FALSE)



After imputation: 200 rows x 6 cols; NAs: 0 


In [None]:
### TODO: Create a mode function for categorical variables
get_mode <- function(v) {
  ### YOUR CODE HERE
  ### Hint: Use unique(), tabulate(), match(), and which.max()
}

In [None]:
### TODO: Impute Customer_Name with mode (for categorical missing values)
sales_imputed$Customer_Name <- # YOUR CODE HERE


In [None]:
### To practice median imputation, try it on Quantity column
### TODO: Impute Quantity with median (alternative approach for numeric data)
sales_imputed$Quantity <- # YOUR CODE HERE
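
As a rough illustration of why the median is often suggested for a skewed column such as `Quantity`, here is a minimal sketch on a hypothetical vector (the values are invented and are not the real column). One extreme value pulls the mean far above the typical quantity, while the median stays close to it.

```r
# Hypothetical quantities with two gaps and one extreme value
quantity <- c(2, 3, NA, 7, 100, 4, NA)

# Median of the observed values, used as the fill value
med <- median(quantity, na.rm = TRUE)
quantity_imputed <- ifelse(is.na(quantity), med, quantity)

med                           # typical value, robust to the 100
mean(quantity, na.rm = TRUE)  # pulled upward by the 100
quantity_imputed
```

Here the median of the observed values is 4 and the mean is 23.2, so the two fill strategies would insert very different numbers.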

In [None]:
### Verify imputation success
print("Missing values after imputation:")
print(colSums(is.na(sales_imputed)))

#### 2.4 Compare Missing Value Strategies

Analyze the impact of different missing value treatment approaches.
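
One way to see the trade-off is to put both strategies side by side on a small example. The sketch below uses a hypothetical numeric vector `x`, not the assignment data. Under these assumptions, listwise removal shrinks the sample, while mean imputation keeps every row but reduces the spread.

```r
# Hypothetical values with two missing entries
x <- c(10, 12, NA, 15, NA, 11, 14, 13)

x_removed <- x[!is.na(x)]                                # listwise removal
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # mean imputation

comparison <- data.frame(
  strategy = c("removal", "mean imputation"),
  n        = c(length(x_removed), length(x_imputed)),
  mean     = c(mean(x_removed), mean(x_imputed)),
  sd       = c(sd(x_removed), sd(x_imputed))
)
print(comparison)
```

The mean is identical in both rows, but the standard deviation is smaller after mean imputation because the filled-in values sit exactly at the center of the distribution.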

In [None]:
# Compare summary statistics
print("=== ORIGINAL DATA ===")
summary(messy_sales$Sales_Amount)

In [None]:
print("=== AFTER REMOVING NAs ===")
summary(sales_removed_na$Sales_Amount)

In [None]:
print("=== AFTER IMPUTATION ===")
summary(sales_imputed$Sales_Amount)


**Analysis Questions:**

1. **Which approach would you recommend for this dataset and why?**

[YOUR ANSWER HERE]

2. **What are the trade-offs between removal and imputation?**

[YOUR ANSWER HERE]

---

### Part 3: Outlier Detection and Treatment

Using your imputed dataset, identify and handle outliers in the Sales_Amount variable.

In [None]:
# Quartiles & IQR for Sales_Amount
stopifnot("Sales_Amount" %in% names(df_imputed))
Q1_sales <- quantile(df_imputed$Sales_Amount, 0.25, na.rm = TRUE)
Q3_sales <- quantile(df_imputed$Sales_Amount, 0.75, na.rm = TRUE)
IQR_sales <- Q3_sales - Q1_sales
Q1_sales; Q3_sales; IQR_sales


In [None]:
### TODO: Calculate quartiles and IQR for Sales_Amount
Q1_sales <- # YOUR CODE HERE
Q3_sales <- # YOUR CODE HERE  
IQR_sales <- # YOUR CODE HERE

In [None]:
### TODO: Calculate outlier thresholds
upper_threshold <- # YOUR CODE HERE
lower_threshold <- # YOUR CODE HERE

In [None]:
### TODO: Identify outliers
outliers <- # YOUR CODE HERE

print(paste("Q1:", Q1_sales))
print(paste("Q3:", Q3_sales))
print(paste("IQR:", IQR_sales))
print(paste("Lower threshold:", lower_threshold))
print(paste("Upper threshold:", upper_threshold))
print(paste("Number of outliers found:", nrow(outliers)))
print("Outlier rows:")
print(outliers)
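
The threshold and outlier steps above follow the usual 1.5 × IQR rule. A minimal sketch on a hypothetical `toy` data frame is shown below. The amounts are loosely based on the rows printed in Part 1 plus two invented values, and the names `lower`, `upper`, and `outlier_rows` are illustrative only.

```r
# Hypothetical sample, not the real dataset
toy <- data.frame(
  TransactionID = 1:8,
  Sales_Amount  = c(362, 573, 488, 5000, 344, 435, 410, 520)
)

# Quartiles and interquartile range
q1  <- quantile(toy$Sales_Amount, 0.25, names = FALSE)
q3  <- quantile(toy$Sales_Amount, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Rows whose Sales_Amount falls outside the fences
outlier_rows <- toy[toy$Sales_Amount < lower | toy$Sales_Amount > upper, ]
outlier_rows
```

In this toy sample only the 5000 transaction falls outside the fences.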

### 3.2 Outlier Visualization

Create a boxplot to visualize the outliers in Sales_Amount.

In [None]:
### TODO: Create a boxplot for Sales_Amount
# Use ggplot2 to create a boxplot showing outliers
boxplot_sales <- # YOUR CODE HERE
# Hint: ggplot(sales_imputed, aes(y = Sales_Amount)) + geom_boxplot() + ggtitle("Sales Amount Outliers")

# Display the plot
print(boxplot_sales)

In [None]:
# 3.2 — Boxplot for Sales_Amount (recomputes IQR thresholds)
Q1  <- quantile(df_imputed$Sales_Amount, 0.25, na.rm = TRUE)
Q3  <- quantile(df_imputed$Sales_Amount, 0.75, na.rm = TRUE)
IQRv <- Q3 - Q1
lower_threshold <- Q1 - 1.5 * IQRv
upper_threshold <- Q3 + 1.5 * IQRv

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  boxplot_sales <- ggplot(df_imputed, aes(y = Sales_Amount)) +
    geom_boxplot(outlier.alpha = 0.6) +
    labs(title = "Sales_Amount (imputed)", y = "Sales_Amount")
  print(boxplot_sales)
} else {
  # fallback: base R
  boxplot_sales <- boxplot(df_imputed$Sales_Amount,
                           main = "Sales_Amount (imputed)",
                           ylab = "Sales_Amount")
}


### 3.3 Outlier Treatment - Option A (Removal)

Remove rows containing outliers and assess the impact.

In [None]:
# 3.3 - Remove rows with Sales_Amount outside IQR bounds
keep <- df_imputed$Sales_Amount >= lower_threshold & df_imputed$Sales_Amount <= upper_threshold
sales_outliers_removed <- df_imputed[keep, , drop = FALSE]

cat("Removed", sum(!keep), "rows; kept", nrow(sales_outliers_removed), "rows\n")
write.csv(sales_outliers_removed, "clean_sales_amount_outliers_removed.csv", row.names = FALSE)


### 3.4 Outlier Treatment - Option B (Capping)

Apply capping/winsorization to handle outliers while preserving data points.


In [None]:
### TODO: Create a capped version of the dataset
sales_outliers_capped <- sales_imputed

In [None]:
### TODO: Apply capping to Sales_Amount
sales_outliers_capped$Sales_Amount <- # YOUR CODE HERE
### Hint: Use ifelse() to replace values above/below thresholds

In [None]:
### Verify capping worked
print("Sales_Amount range after capping:")
print(range(sales_outliers_capped$Sales_Amount, na.rm = TRUE))

In [None]:
### Check for remaining outliers
remaining_outliers <- # YOUR CODE HERE
print(paste("Remaining outliers after capping:", nrow(remaining_outliers)))
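
The capping step can be sketched as follows. This version uses `pmax()` and `pmin()` rather than the `ifelse()` approach suggested in the hint; both should give the same result. The `sales` vector is hypothetical and the fences are recomputed from it, so nothing here depends on objects defined earlier in the notebook.

```r
# Hypothetical amounts, not the real column
sales <- c(362, 573, 488, 5000, 344, 435, 410, 520)

# IQR fences, same rule as in the detection step
q1    <- quantile(sales, 0.25, names = FALSE)
q3    <- quantile(sales, 0.75, names = FALSE)
lower <- q1 - 1.5 * (q3 - q1)
upper <- q3 + 1.5 * (q3 - q1)

# Clamp every value into [lower, upper]; no rows are dropped
capped <- pmin(pmax(sales, lower), upper)

range(capped)
sum(capped < lower | capped > upper)  # remaining outliers relative to the original fences
```

Note that the remaining-outlier check uses the fences computed before capping. Recomputing the quartiles on the capped data can produce new, tighter fences.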

---

## Part 4: Final Data Quality Assessment and Decision Making

Choose your final cleaned dataset and justify your decision based on the analysis you've completed.

In [None]:
# TODO: Choose your final cleaned dataset
final_dataset <- # Choose one: messy_sales, sales_removed_na, sales_imputed, sales_outliers_removed, or sales_outliers_capped

print("=== FINAL DATASET SUMMARY ===")
print(dim(final_dataset))
summary(final_dataset$Sales_Amount)

**Justification for Your Choice:**

[Explain why you chose this particular cleaned dataset. Consider factors like:
- Sample size preservation
- Data quality improvements
- Business impact
- Analysis requirements]

**YOUR JUSTIFICATION:**

[Write your detailed reasoning here]

### 4.2 Create Comparison Summary

Create a comprehensive comparison of your original and final datasets.

In [None]:
# Create comparison summary
comparison_summary <- data.frame(
  Metric = c("Number of Rows", "Missing Values", "Mean Sales_Amount", "Median Sales_Amount", "Outliers"),
  Original_Data = c(
    nrow(messy_sales),
    sum(is.na(messy_sales)),
    round(mean(messy_sales$Sales_Amount, na.rm = TRUE), 2),
    round(median(messy_sales$Sales_Amount, na.rm = TRUE), 2),
    "Check manually" # TODO: Calculate this
  ),
  Final_Data = c(
    nrow(final_dataset),
    sum(is.na(final_dataset)),
    round(mean(final_dataset$Sales_Amount, na.rm = TRUE), 2),
    round(median(final_dataset$Sales_Amount, na.rm = TRUE), 2),
    "Check manually" # TODO: Calculate this
  )
)

print("=== DATA CLEANING COMPARISON ===")
print(comparison_summary)
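
The two "Check manually" entries in the table above could be filled with a small helper that counts values outside the 1.5 × IQR fences. The sketch below is one possible version. `count_iqr_outliers()` is a hypothetical name, and the `original` and `cleaned` vectors are invented stand-ins for the `Sales_Amount` columns.

```r
# Hypothetical helper: number of values outside the 1.5 * IQR fences
count_iqr_outliers <- function(x) {
  x <- x[!is.na(x)]                                # ignore missing values
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Invented stand-ins for the original and cleaned Sales_Amount columns
original <- c(362, 573, 488, 5000, 344, 435, NA, 520)
cleaned  <- c(362, 573, 488, 344, 435, 520)

n_out_original <- count_iqr_outliers(original)
n_out_cleaned  <- count_iqr_outliers(cleaned)
c(original = n_out_original, cleaned = n_out_cleaned)
```

Because `c()` in the comparison table mixes numbers with text, every entry in those columns is stored as character. Replacing the placeholders with numeric counts keeps each column numeric.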

---

## Part 5: Reflection Questions

Answer the following questions to demonstrate your understanding of data cleaning concepts and their business implications.

### Question 1: Missing Value Strategy

In what business scenarios would you prefer removing rows with missing values versus imputing them? Provide specific examples.

**YOUR ANSWER:**

[Write your detailed response here]

### Question 2: Outlier Interpretation  

You identified outliers in the Sales_Amount column. In a real business context, what could these outliers represent? Should they always be removed or treated? Explain your reasoning.

**YOUR ANSWER:**

[Write your detailed response here]

### Question 3: Data Quality Impact

How might the presence of missing values and outliers affect common business analytics tasks such as calculating average sales, identifying top-performing products, or forecasting future sales?

**YOUR ANSWER:**

[Write your detailed response here]

### Question 4: Ethical Considerations

What are the ethical implications of removing or modifying data during the cleaning process? How can analysts ensure transparency and maintain data integrity?

**YOUR ANSWER:**

[Write your detailed response here]

---

## Submission Checklist

Before submitting, make sure you have:

- [ ] **Part 1**: Imported and inspected the messy dataset
- [ ] **Part 2**: Completed missing value identification and treatment
- [ ] **Part 3**: Detected and treated outliers using IQR method  
- [ ] **Part 4**: Chosen and justified your final cleaned dataset
- [ ] **Part 4**: Created comparison summary table
- [ ] **Part 5**: Answered all reflection questions thoroughly
- [ ] **Code Quality**: All TODO sections completed with working code
- [ ] **Documentation**: Added your name and date at the top
- [ ] **Testing**: Run all cells to verify output
- [ ] **Submission**: Committed and pushed to GitHub

**Great work mastering data cleaning techniques! 🧹✨**

---

## 🚀 Ready to Submit?

### Easy Submission Steps (No Command Line Required!):

1. **Save this notebook** (Ctrl+S or File → Save)

2. **Use VS Code Source Control**:
   - Click the **Source Control** icon in the left sidebar (tree branch symbol)
   - Click the **"+"** button next to your notebook file
   - Type a message: `Submit homework 2 - Data Cleaning - [Your Name]`
   - Click **"Commit"** 
   - Click **"Sync Changes"** or **"Push"**

3. **Verify on GitHub**: Go to your repository online and confirm your notebook appears with your completed work

**📖 Need help?** See [GITHUB_CLASSROOM_SUBMISSION.md](../../GITHUB_CLASSROOM_SUBMISSION.md) for detailed instructions.

**🎉 Congratulations on completing your data cleaning assignment!**