# Day 1 Exercise: Cleaning Messy Cafe Sales Data

**Name:** _[Your name here]_  
**Date:** October 8, 2025

---

## Objective

Transform a messy cafe sales dataset into a tidy format, designate and validate a primary key, and create summary tables.

## Dataset

**File:** `../data/day1/dirty_cafe_sales.csv`  
**Rows:** 10,000 cafe transactions  
**Data Dictionary:** See `../data/day1/README.md`

## Deliverable

This notebook should **"Restart & Run All"** successfully when you're done!

---

## Section 1: Setup and Data Loading

### TODO 1: Import libraries

In [None]:
# TODO: Import pandas and numpy
# Also import warnings to suppress FutureWarnings for cleaner output
# Your code here:
# import pandas as pd
# import numpy as np
# import warnings
# warnings.filterwarnings('ignore', category=FutureWarning)


### TODO 2: Load the data

In [None]:
# TODO: Load the dirty cafe sales data from ../data/day1/dirty_cafe_sales.csv
# Hint: Use a relative path from the notebooks/ directory
# df = pd.read_csv(...)


---

## Section 2: Initial Exploration

Before cleaning, let's understand what we have.

### TODO 3: Display basic information

In [None]:
# TODO: Display the shape of the dataframe
# print(f"Dataset shape: ...")


In [None]:
# TODO: Display the first 10 rows


In [None]:
# TODO: Display column names and types
# print(df.dtypes)


### TODO 4: Check for missing values

In [None]:
# TODO: Count missing values (NaN) in each column
# print(df.isnull().sum())


### TODO 5: Check for sentinel values

Look for "ERROR" and "UNKNOWN" in the data.

In [None]:
# TODO: Count "ERROR" values in each column
# Hint: (df == 'ERROR').sum() counts the ERROR occurrences in each column


In [None]:
# TODO: Count "UNKNOWN" values in each column

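As a minimal sketch of the sentinel check, the cell below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the cafe dataset). It shows the element-wise comparison from the hint and, as an alternative, `isin()` to count several sentinels at once.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "Tea", "UNKNOWN"],
    "Payment Method": ["Cash", "UNKNOWN", "ERROR", "ERROR"],
})

# Element-wise comparison gives a boolean frame; sum() counts True per column
error_counts = (toy == "ERROR").sum()
unknown_counts = (toy == "UNKNOWN").sum()

# isin() checks for several sentinel strings in a single pass
sentinel_counts = toy.isin(["ERROR", "UNKNOWN"]).sum()
print(sentinel_counts)
```

The same calls should work unchanged on `df` once it is loaded.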

### Reflection: What Issues Did You Find?

**TODO:** Write 2-3 sentences describing the data quality issues you observed.

_[Your reflection here]_

---

## Section 3: Is This Data Tidy?

### TODO 6: Evaluate against tidy data principles

**The Three Rules:**
1. Each variable is a column
2. Each observation is a row
3. Each value is a cell

**Questions to answer in markdown:**

1. What is the unit of observation in this dataset? (What does each row represent?)

_[Your answer]_

2. Does each variable have its own column?

_[Your answer]_

3. Is this dataset tidy? Why or why not?

_[Your answer]_

---

## Section 4: Identify and Validate Primary Key

### TODO 7: Identify the primary key candidate

In [None]:
# TODO: Check if 'Transaction ID' is unique
# Hint: df['Transaction ID'].is_unique


In [None]:
# TODO: Check for any NULL values in 'Transaction ID'
# Hint: df['Transaction ID'].isnull().sum()


In [None]:
# TODO: If there are duplicates, find them
# duplicates = df[df.duplicated(subset=['Transaction ID'], keep=False)]
# print(duplicates)


### TODO 8: Write validation assertions

Once you've confirmed (or fixed) the primary key, write assertions to prove it.

In [None]:
# TODO: Add assertions to validate primary key
# assert df['Transaction ID'].is_unique, "Duplicate transaction IDs found"
# assert df['Transaction ID'].notna().all(), "NULL transaction IDs found"
# print("✅ Transaction ID is a valid primary key")

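The primary key checks can be tried out on a toy frame first. This is a sketch under assumptions: `toy` is hypothetical data, and dropping rows without an ID and keeping the first duplicate is only one possible repair, not necessarily the right one for the cafe data.

```python
import pandas as pd

# Hypothetical toy data with one repeated ID and one missing ID
toy = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_2", None],
    "Item": ["Coffee", "Tea", "Tea", "Cake"],
})

key = toy["Transaction ID"]
print("Unique:", key.is_unique)
print("Missing IDs:", key.isnull().sum())

# keep=False flags every row in a duplicate group, not just the later copies
duplicates = toy[toy.duplicated(subset=["Transaction ID"], keep=False)]
print(duplicates)

# One possible repair (an assumption): drop rows without an ID,
# then keep the first row of each duplicate group
fixed = (
    toy.dropna(subset=["Transaction ID"])
       .drop_duplicates(subset=["Transaction ID"], keep="first")
)

assert fixed["Transaction ID"].is_unique, "Duplicate transaction IDs found"
assert fixed["Transaction ID"].notna().all(), "NULL transaction IDs found"
```

Whatever repair you choose, record the number of rows it removes in your reflection.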

### Reflection: Primary Key

**TODO:** Explain what you found and any decisions you made.

_[Your reflection here: Is Transaction ID a good primary key? Did you find any issues? How did you handle them?]_

---

## Section 5: Handle Missing Values

### TODO 9: Standardize missing value representations

Convert "ERROR", "UNKNOWN", and empty strings to NaN.

In [None]:
# TODO: Replace sentinel values with NaN
# Hint: df = df.replace(['ERROR', 'UNKNOWN', ''], np.nan)
# Or replace column by column for more control


In [None]:
# TODO: Check missing values again after standardization
# print(df.isnull().sum())

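To see what the standardization does before applying it to `df`, here is a small sketch on hypothetical data (`toy` and `SENTINELS` are illustrative names). It compares the missing-value counts before and after the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing sentinel strings, an empty string, and a real None
toy = pd.DataFrame({
    "Item": ["Coffee", "ERROR", "", "Tea"],
    "Location": ["UNKNOWN", "In-store", "Takeaway", None],
})

SENTINELS = ["ERROR", "UNKNOWN", ""]

# replace() matches whole cell values, so "ERROR" inside a longer string is untouched
cleaned = toy.replace(SENTINELS, np.nan)

before = toy.isnull().sum()
after = cleaned.isnull().sum()
print(pd.DataFrame({"before": before, "after": after}))
```

Comparing `before` and `after` makes it easy to report how many values each sentinel was hiding.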

### TODO 10: Decide how to handle missing values

**Options:**
- Drop rows with missing values in critical columns
- Fill with default values
- Keep as NaN (document impact on analysis)

**Your strategy:**

_[Write your strategy here. Example: "I will keep NaN for Payment Method because it represents missing data at point of sale. These transactions will be excluded from payment method analysis but included in overall sales totals."]_

In [None]:
# TODO: Implement your missing value strategy
# Example:
# df = df.dropna(subset=['Transaction ID'])  # Drop if no ID
# df['Payment Method'] = df['Payment Method'].fillna('Unknown')  # Or keep as NaN

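The three options listed above can be compared side by side on a toy frame. This is only a sketch: `toy` is hypothetical, treating `Item` as the critical column is an assumption, and the label `"Not recorded"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3", "TXN_4"],
    "Item": ["Coffee", np.nan, "Tea", "Cake"],
    "Payment Method": ["Cash", "Credit Card", np.nan, np.nan],
})

# Option 1: drop rows missing a column treated as critical (assumed here: Item)
dropped = toy.dropna(subset=["Item"])

# Option 2: fill with an explicit label so the gap stays visible in summaries
filled = toy.copy()
filled["Payment Method"] = filled["Payment Method"].fillna("Not recorded")

# Option 3: keep NaN and document the share of missing values per column
missing_share = toy.isnull().mean()

print(f"Rows kept after dropping: {len(dropped)} of {len(toy)}")
print(missing_share)
```

Printing the rows kept and the missing share gives you concrete numbers to cite in your strategy write-up.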

---

## Section 6: Fix Type Issues

### TODO 11: Convert Quantity to integer

In [None]:
# TODO: Convert Quantity to integer
# Hint: You may need to handle NaN first
# df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce').astype('Int64')  # Int64 allows missing values

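A column read from a messy CSV usually holds strings, so one cautious route is to convert to numbers first and to the nullable integer type second. The sketch below uses a hypothetical series `raw`; the value `"three"` stands in for any text that cannot be parsed.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: numeric strings, a missing value, and unparseable text
raw = pd.Series(["2", "5", np.nan, "three"], name="Quantity")

# Step 1: errors="coerce" turns anything unparseable into NaN (result is float64)
numeric = pd.to_numeric(raw, errors="coerce")

# Step 2: the nullable Int64 dtype (capital I) stores integers alongside <NA>
quantity = numeric.astype("Int64")

print(quantity)
print(quantity.dtype)
```

Note that coercion silently discards bad values, so count the NaN values before and after to see how many were lost.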

### TODO 12: Convert prices to float

In [None]:
# TODO: Convert 'Price Per Unit' to float
# Hint: May need to handle non-numeric values first
# df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'], errors='coerce')


In [None]:
# TODO: Convert 'Total Spent' to float


### TODO 13: Convert Transaction Date to datetime

In [None]:
# TODO: Parse Transaction Date as datetime
# df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')

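The effect of `errors='coerce'` on dates is easiest to see on a few hand-picked strings. The values in `raw` are hypothetical and the ISO-style format is an assumption, so check the actual format in the dataset.

```python
import pandas as pd

# Hypothetical raw dates: one valid, one impossible (Feb 30), one missing, one text
raw = pd.Series(["2023-01-15", "2023-02-30", None, "not a date"])

# Unparseable or impossible dates become NaT instead of raising an error
dates = pd.to_datetime(raw, errors="coerce")

print(dates)
print("Parsed successfully:", dates.notna().sum())
print("Is datetime:", pd.api.types.is_datetime64_any_dtype(dates))
```

Once the column is datetime, accessors such as `dates.dt.month` become available for later grouping.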

### TODO 14: Verify types

In [None]:
# TODO: Display dtypes to verify conversions worked
# print(df.dtypes)


### TODO 15: Write type assertions

In [None]:
# TODO: Add assertions to validate types
# assert pd.api.types.is_integer_dtype(df['Quantity']), "Quantity should be integer"
# assert pd.api.types.is_float_dtype(df['Price Per Unit']), "Price should be float"
# assert pd.api.types.is_datetime64_any_dtype(df['Transaction Date']), "Date should be datetime"
# print("✅ All types are correct")


---

## Section 7: Validate Data Integrity

### TODO 16: Check if Total Spent = Quantity × Price Per Unit

In [None]:
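# TODO: Calculate expected total
# Cast Quantity to float64 so the product stays a plain float column, even if Quantity is nullable Int64
# df['Calculated Total'] = df['Quantity'].astype('float64') * df['Price Per Unit']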


In [None]:
# TODO: Compare with actual Total Spent (use np.isclose for float comparison)
# mask = df['Total Spent'].notna() & df['Calculated Total'].notna()
# mismatches = ~np.isclose(df.loc[mask, 'Total Spent'], df.loc[mask, 'Calculated Total'])
# print(f"Mismatches found: {mismatches.sum()}")

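Here is a sketch of the integrity check on hypothetical data (`toy`, `calculated`, and `comparable` are illustrative names). It only compares rows where both totals are present, since a missing value cannot be called a mismatch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has a wrong total, rows 2 and 3 have gaps
toy = pd.DataFrame({
    "Quantity": pd.array([2, 3, None, 1], dtype="Int64"),
    "Price Per Unit": [3.0, 1.5, 2.0, 4.0],
    "Total Spent": [6.0, 5.0, 4.0, np.nan],
})

# Cast to float64 so the product is an ordinary float column with NaN for gaps
calculated = toy["Quantity"].astype("float64") * toy["Price Per Unit"]

# Only rows with both values present can be compared
comparable = toy["Total Spent"].notna() & calculated.notna()

# np.isclose tolerates tiny floating-point differences that == would flag
matches = np.isclose(toy.loc[comparable, "Total Spent"], calculated[comparable])
mismatch_rows = toy.loc[comparable][~matches]

print(f"Comparable rows: {comparable.sum()}, mismatches: {len(mismatch_rows)}")
print(mismatch_rows)
```

Inspecting `mismatch_rows` directly helps you decide whether to trust the recorded total or the recalculated one.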

### TODO 17: Check for impossible values

In [None]:
# TODO: Check for negative or zero prices
# Hint: df[df['Price Per Unit'] <= 0]


In [None]:
# TODO: Check for zero or negative quantities

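Range checks interact with missing values, which this sketch illustrates on hypothetical data. A comparison on a nullable `Int64` column yields `<NA>` for missing entries, so the example fills those with `False` before filtering. That treats missing quantities as "not known to be invalid", which is an assumption you may want to revisit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with zero, negative, and missing values
toy = pd.DataFrame({
    "Quantity": pd.array([2, 0, None, -1], dtype="Int64"),
    "Price Per Unit": [3.0, -1.0, np.nan, 2.0],
})

# For float columns, NaN <= 0 evaluates to False, so missing prices are not flagged
bad_prices = toy[toy["Price Per Unit"] <= 0]

# For nullable integers the comparison contains <NA>; fill it before masking
bad_quantities = toy[(toy["Quantity"] <= 0).fillna(False)]

print(bad_prices)
print(bad_quantities)
```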

### Reflection: Data Integrity

**TODO:** What did you find? How did you handle integrity issues?

_[Your reflection here]_

---

## Section 8: Create Summary Tables

Now that the data is clean, answer some business questions!

### TODO 18: Total sales by payment method

In [None]:
# TODO: Calculate total revenue and transaction count by payment method
# Hint: df.groupby('Payment Method').agg({'Total Spent': ['sum', 'count'], ...})

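As an alternative to the dictionary form in the hint, named aggregation produces flat, readable column names. The sketch below runs on hypothetical data; `total_revenue` and `transactions` are made-up output names, and `dropna=False` (available in pandas 1.1 and later) keeps transactions with a missing payment method visible as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one transaction has no recorded payment method
toy = pd.DataFrame({
    "Payment Method": ["Cash", "Cash", "Credit Card", np.nan],
    "Total Spent": [4.0, 6.0, 5.0, 3.0],
})

summary = (
    toy.groupby("Payment Method", dropna=False)
       .agg(
           total_revenue=("Total Spent", "sum"),   # revenue per method
           transactions=("Total Spent", "size"),   # row count, including NaN totals
       )
       .sort_values("total_revenue", ascending=False)
)
print(summary)
```

Using `"size"` counts every row in the group, while `"count"` would skip rows where `Total Spent` is missing. Pick whichever matches how you define a transaction.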

### TODO 19: Most popular items

In [None]:
# TODO: Find most popular items by quantity sold
# Hint: df.groupby('Item')['Quantity'].sum().sort_values(ascending=False)


In [None]:
# TODO: Find highest revenue items
# Hint: df.groupby('Item')['Total Spent'].sum().sort_values(ascending=False)


### TODO 20: Location comparison

In [None]:
# TODO: Compare transaction volume and average transaction value by location
# df.groupby('Location').agg({
#     'Transaction ID': 'count',
#     'Total Spent': ['sum', 'mean']
# })


---

## Section 9: Final Validation

### TODO 21: Run all validations

In [None]:
# TODO: Gather all your assertions in one cell to prove data quality

print("Running final validation...\n")

# Primary key
# assert df['Transaction ID'].is_unique, "Duplicate transaction IDs"
# assert df['Transaction ID'].notna().all(), "NULL transaction IDs"
# print("✅ Primary key validated")

# Types
# assert pd.api.types.is_integer_dtype(df['Quantity']), "Quantity type wrong"
# assert pd.api.types.is_float_dtype(df['Price Per Unit']), "Price type wrong"
# assert pd.api.types.is_datetime64_any_dtype(df['Transaction Date']), "Date type wrong"
# print("✅ Types validated")

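# Data ranges (adjust based on your data)
# dropna() skips values you chose to keep as NaN; without it, a NaN price fails the check
# assert (df['Quantity'].dropna() > 0).all(), "Invalid quantities found"
# assert (df['Price Per Unit'].dropna() > 0).all(), "Invalid prices found"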
# print("✅ Data ranges validated")

# print("\n✅ All validations passed!")

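If you prefer to see every failed check at once rather than stopping at the first failed `assert`, the checks can be collected in a function. This is one possible design, not a required part of the exercise: `validate_sales` is a hypothetical helper, and the frames `good` and `bad` are made-up data.

```python
import pandas as pd

def validate_sales(frame: pd.DataFrame) -> list:
    """Return messages for failed checks; an empty list means all checks passed."""
    problems = []
    key = frame["Transaction ID"]
    if not key.is_unique:
        problems.append("Duplicate transaction IDs")
    if key.isna().any():
        problems.append("NULL transaction IDs")
    if not pd.api.types.is_integer_dtype(frame["Quantity"]):
        problems.append("Quantity type wrong")
    # dropna() ignores values deliberately kept as missing
    if not (frame["Quantity"].dropna() > 0).all():
        problems.append("Invalid quantities found")
    if not (frame["Price Per Unit"].dropna() > 0).all():
        problems.append("Invalid prices found")
    return problems

# Hypothetical frames: one clean, one with a repeated ID
good = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2"],
    "Quantity": pd.array([1, 2], dtype="Int64"),
    "Price Per Unit": [2.0, 3.5],
})
bad = good.assign(**{"Transaction ID": ["TXN_1", "TXN_1"]})

print(validate_sales(good))
print(validate_sales(bad))
```

To keep "Restart & Run All" strict, finish the cell with `assert not validate_sales(df)` so that any remaining problem still stops the notebook.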

---

## Section 10: Documentation

### TODO 22: Document your data cleaning process

Write a brief summary (8-10 sentences) of:
1. What problems you found
2. What decisions you made
3. What the implications are for analysis
4. What a stakeholder should know about this data

---

## Data Cleaning Summary

_[Your summary here]_

### Issues Found
- _[List major issues]_

### Actions Taken
- _[List your cleaning steps]_

### Assumptions Made
- _[List key assumptions]_

### Implications for Analysis
- _[What should analysts know?]_

### Data Quality Assessment
- _[Overall, how clean is this data now? What percentage is usable?]_

---

## Congratulations!

You've successfully cleaned a real, messy dataset using tidy data principles!

**Final check:** Can you **"Restart & Run All"** successfully? That's the gold standard!

**Reflection:** What was the hardest part? What did you learn?