# Day 1, Block A: Data Thinking & Tidy Foundations

**Duration:** 90 minutes (13:30–15:10)  
**Course:** ECBS5294 - Introduction to Data Science: Working with Data  
**Instructor:** Eduardo Ariño de la Rubia

---

## Learning Objectives

By the end of this session, you will be able to:

1. Articulate the **strategic value** of data structure
2. State the **three rules** of tidy data and recognize violations
3. Identify the **five common messiness patterns** in real datasets
4. Designate and validate a **primary key (UID)** for a dataset
5. Recognize common **type pitfalls** (dates, floats, strings-as-numbers)
6. Handle **missing values** appropriately
7. Transform a messy dataset into a tidy format

---

## 1. The Power of Data Structure

### Why Does Data Structure Matter?

> **"When you define how data is structured, you're not just organizing information—you're defining the language your entire organization will use to make decisions."**

Think about it:

- Every **"customer," "order," "product"** in your database becomes a **noun** the business rallies around
- Every **"purchase," "return," "review"** becomes a **verb** that drives metrics
- You're laying the foundation for how people **think, talk, and measure success**

### Example: What Is a Customer?

This isn't just a technical question:
- One per **email address**?
- One per **household**?
- One per **device**?

**Each choice has business implications:**
- Affects customer count metrics
- Changes how you measure retention
- Impacts marketing campaigns
- Determines privacy/GDPR compliance

**You're not just structuring data—you're making strategic business decisions.**
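
To make the point concrete, here is a small sketch of how the customer count changes with the definition. The `signups` table and its column names (`email`, `household_id`, `device_id`) are invented for illustration; they are not part of the course data.

```python
import pandas as pd

# Hypothetical sign-up records: same people, three possible "customer" definitions
signups = pd.DataFrame({
    'email': ['ana@example.com', 'ana@example.com', 'ben@example.com', 'cara@example.com'],
    'household_id': ['H1', 'H1', 'H1', 'H2'],
    'device_id': ['D1', 'D2', 'D3', 'D4'],
})

# Count distinct "customers" under each definition
customer_counts = {col: signups[col].nunique() for col in signups.columns}
print(customer_counts)
```

Same four rows, three different answers to "how many customers do we have?"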

---

### A Brief History: Why Do Databases Exist?

#### Before Computers (Pre-1960s)
- Data stored in **filing cabinets**
- Physical cards, journals, ledgers
- Problems: Space, hard to search, hard to back up

#### Early Databases (1960s)
- First computerized databases emerged
- **Navigational databases**: Hierarchical and network models
- Problem: Had to know the question in advance to design the structure

#### The Revolution: E.F. Codd's Relational Model (1970)
- IBM computer scientist Edgar F. Codd published a groundbreaking paper
- **Key Innovation:** Build cross-linked tables where you **store each fact once**

**The Problems Codd Solved:**
1. **Disk space was expensive** → Eliminate redundancy
2. **Flexibility was limited** → Answer **any question** if the data exists
3. **Schema independence** → Logical structure separate from physical storage

> **"Store each fact once, answer any question."**

#### Modern Era (1970s-Today)
- **1979:** Oracle - first commercial relational database
- **1980s-90s:** Relational databases became dominant
- **SQL** (Structured Query Language) became the standard

**Why This Matters:**
When you learn SQL and work with databases, you're using technology that solved one of computing's fundamental problems. Understanding **tables, keys, and relationships** is the foundation.

---

## 2. The Three Rules of Tidy Data

### Who is Hadley Wickham?

- Statistician at Posit (formerly RStudio)
- Created the "tidyverse" - influential R packages
- Published influential paper: **"Tidy Data"** (2014)
- These principles work across **all tools** (Python, R, SQL, even Excel!)

### The Three Rules

A dataset is **tidy** if:

1. **Each variable is a column**
2. **Each observation is a row**
3. **Each value is a cell**

That's it! Simple to state, but powerful in practice.

### Why These Rules Matter

- **Consistency** makes tools easier to learn and use
- **Vectorized operations** work naturally (add up a column)
- **Analysis becomes intuitive** (filter rows, group by columns)
- **Joining data becomes possible** (need consistent row-level observations)

### Example: Messy vs. Tidy

Let's look at some sales data to see what "tidy" really means.

---

In [1]:
# Setup
import pandas as pd
import numpy as np
import warnings

# Suppress FutureWarnings for cleaner output in a teaching environment
warnings.filterwarnings('ignore', category=FutureWarning)

# Messy Example: Years as column names
messy_sales = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    '2021': [100, 150, 200],
    '2022': [120, 160, 190],
    '2023': [140, 180, 210]
})

print("MESSY: Years as column names")
print(messy_sales)
print("\n" + "="*50 + "\n")

# Tidy Example: Year as a variable
tidy_sales = pd.DataFrame({
    'product': ['Widget', 'Widget', 'Widget', 'Gadget', 'Gadget', 'Gadget', 'Doohickey', 'Doohickey', 'Doohickey'],
    'year': [2021, 2022, 2023, 2021, 2022, 2023, 2021, 2022, 2023],
    'sales': [100, 120, 140, 150, 160, 180, 200, 190, 210]
})

print("TIDY: Year as a variable (column)")
print(tidy_sales)

MESSY: Years as column names
     product  2021  2022  2023
0     Widget   100   120   140
1     Gadget   150   160   180
2  Doohickey   200   190   210


TIDY: Year as a variable (column)
     product  year  sales
0     Widget  2021    100
1     Widget  2022    120
2     Widget  2023    140
3     Gadget  2021    150
4     Gadget  2022    160
5     Gadget  2023    180
6  Doohickey  2021    200
7  Doohickey  2022    190
8  Doohickey  2023    210


### Let's Check the Rules

**Messy version:**
- ❌ **Rule 1 violated**: Column headers (2021, 2022, 2023) are **values**, not **variable names**
- ❌ **Analysis problem**: How do you add 2024 data? How do you filter by year? How do you plot trends?

**Tidy version:**
- ✅ **Rule 1**: Each variable (product, year, sales) is a column
- ✅ **Rule 2**: Each observation (product-year combination) is a row
- ✅ **Rule 3**: Each value is in its own cell
- ✅ **Analysis is easy**: Filter by year, group by product, plot trends, add new years

---

In [None]:
# Example: Why tidy is better for analysis
print("Calculate average sales per year (try this with messy version!):")
print(tidy_sales.groupby('year')['sales'].mean())

print("\nFilter to just 2023:")
print(tidy_sales[tidy_sales['year'] == 2023])

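print("\nAdd 2024 data: Just add new rows!")
# Build the new row(s) and append into a new variable so tidy_sales stays unchanged
new_data = pd.DataFrame({'product': ['Widget'], 'year': [2024], 'sales': [150]})
tidy_sales_2024 = pd.concat([tidy_sales, new_data], ignore_index=True)
print(tidy_sales_2024.tail(3))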

### Interactive Check: Is This Tidy?

Look at the following examples and ask yourself:
1. Is each variable a column?
2. Is each observation a row?
3. Is each value in its own cell?

We'll discuss these together!

---

In [None]:
# Example 2: Is this tidy?
example2 = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'math_score': [85, 92, 78],
    'english_score': [90, 88, 95]
})

print("Example 2:")
print(example2)
print("\nIs this tidy? Think about it...")

**Answer:** It depends on your analysis!

- If each student is an observation → **Tidy** ✅
- If each test score is an observation → **Not tidy** ❌ (subjects are column names, not a variable)

**Tidy version for score-level analysis:**

In [None]:
example2_tidy = pd.DataFrame({
    'student_id': [1, 1, 2, 2, 3, 3],
    'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie'],
    'subject': ['math', 'english', 'math', 'english', 'math', 'english'],
    'score': [85, 90, 92, 88, 78, 95]
})

print("Tidy version (if each score is an observation):")
print(example2_tidy)
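
You rarely need to type the long version by hand. As a minimal sketch, the same reshape can be done with `melt()`; the intermediate name `long_scores` is just a placeholder chosen here.

```python
import pandas as pd

example2 = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'math_score': [85, 92, 78],
    'english_score': [90, 88, 95]
})

# Unpivot the two score columns into (subject, score) pairs
long_scores = example2.melt(
    id_vars=['student_id', 'name'],
    var_name='subject',
    value_name='score'
)

# Column names like 'math_score' become values; strip the suffix to get the subject
long_scores['subject'] = long_scores['subject'].str.replace('_score', '', regex=False)
long_scores = long_scores.sort_values(['student_id', 'subject']).reset_index(drop=True)

print(long_scores)
```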

### Key Insight

> **"Tidy datasets are all alike, but every messy dataset is messy in its own way."** – Hadley Wickham

---

## 3. The Five Common Problems

Hadley Wickham identified five common ways datasets become messy:

### Problem 1: Column Headers are Values, Not Variable Names

**Example:** Years, categories, or measurements as column names

In [None]:
# Problem 1 Example
problem1 = pd.DataFrame({
    'religion': ['Buddhist', 'Catholic', 'Protestant'],
    '<$10k': [27, 418, 732],
    '$10-20k': [21, 617, 670],
    '$20-30k': [30, 732, 638]
})

print("Problem 1: Income brackets as column names")
print(problem1)
print("\nIssue: Can't easily filter by income bracket, plot income distribution, etc.")

**Solution:** Make income bracket a variable

In [None]:
# Solution using pd.melt()
problem1_tidy = problem1.melt(
    id_vars=['religion'],
    var_name='income_bracket',
    value_name='count'
)

print("Solution: Income bracket as a variable")
print(problem1_tidy)

### Problem 2: Multiple Variables Stored in One Column

**Example:** Combining attributes into single column names

In [None]:
# Problem 2 Example
problem2 = pd.DataFrame({
    'country': ['USA', 'Canada', 'Mexico'],
    'male_under_18': [1000, 800, 600],
    'male_over_18': [2000, 1500, 1200],
    'female_under_18': [950, 780, 620],
    'female_over_18': [2100, 1550, 1180]
})

print("Problem 2: Gender and age combined in column names")
print(problem2)
print("\nIssue: Can't analyze by gender alone or age alone easily")

**Solution:** Separate gender and age into distinct variables

In [None]:
# Solution: Melt then split
problem2_melted = problem2.melt(id_vars=['country'], var_name='demographic', value_name='count')
problem2_melted[['gender', 'age_group']] = problem2_melted['demographic'].str.split('_', n=1, expand=True)
problem2_tidy = problem2_melted[['country', 'gender', 'age_group', 'count']]

print("Solution: Gender and age as separate variables")
print(problem2_tidy.head(8))
print("\nNow we can analyze by gender, age, or both!")

### Problem 3: Variables Stored in Both Rows AND Columns

**Example:** Matrix-style data where both axes represent variables

This is less common and more complex. Key idea: If you have a "pivot table" style layout, you likely need to unpivot it.

**We'll skip detailed examples** for now (confusing for day 1). Just know: If your data looks like a matrix or spreadsheet pivot table, it probably needs tidying.

---

### Problem 4: Multiple Types of Observational Units in Same Table

**Example:** Mixing customer info with transaction info

In [None]:
# Problem 4 Example
problem4 = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'customer_name': ['Alice', 'Bob', 'Alice'],
    'customer_email': ['alice@example.com', 'bob@example.com', 'alice@example.com'],
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': [10.00, 15.00, 20.00]
})

print("Problem 4: Customer info repeated on every transaction")
print(problem4)
print("\nIssues:")
print("- Alice's email is stored twice (redundancy)")
print("- If Alice changes email, must update multiple rows")
print("- Wasted storage space")
print("- Risk of inconsistency (what if one row has old email?)")

**Solution:** Separate into two tables linked by customer_id

In [None]:
# Solution: Two tables
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'customer_name': ['Alice', 'Bob'],
    'customer_email': ['alice@example.com', 'bob@example.com']
})

transactions = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'customer_id': [1, 2, 1],
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': [10.00, 15.00, 20.00]
})

print("Solution: Separate tables")
print("\nCustomers table:")
print(customers)
print("\nTransactions table:")
print(transactions)
print("\nAlice's email stored ONCE. Link via customer_id. This is 'normalization'!")
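
Splitting the data does not lose anything: a join brings the customer details back whenever you need them. The sketch below uses `merge()` with `validate='many_to_one'`, which asks pandas to raise an error if `customer_id` is not unique in the customers table. The variable name `joined` is an arbitrary choice.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2],
    'customer_name': ['Alice', 'Bob'],
    'customer_email': ['alice@example.com', 'bob@example.com']
})

transactions = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'customer_id': [1, 2, 1],
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': [10.00, 15.00, 20.00]
})

# Many transactions point to one customer; validate= checks that assumption
joined = transactions.merge(
    customers,
    on='customer_id',
    how='left',
    validate='many_to_one'
)

print(joined)
```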

### Problem 5: Single Observational Unit Spread Across Multiple Tables

**Example:** Data split by time period or category into separate files

In [None]:
# Problem 5 Example
sales_jan = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'sales': [100, 150]
})

sales_feb = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'sales': [120, 160]
})

print("Problem 5: Data split across files")
print("\nsales_jan.csv:")
print(sales_jan)
print("\nsales_feb.csv:")
print(sales_feb)
print("\nIssue: Can't easily query across months, calculate totals, etc.")

**Solution:** Combine into one table with a month variable

In [None]:
# Solution: Combine with month column
sales_jan['month'] = 'January'
sales_feb['month'] = 'February'
sales_all = pd.concat([sales_jan, sales_feb], ignore_index=True)

print("Solution: One table with month variable")
print(sales_all)
print("\nNow we can easily:")
print("- Query any month: sales_all[sales_all['month'] == 'January']")
print("- Calculate totals: sales_all.groupby('product')['sales'].sum()")
print("- Add March data: Just append new rows!")
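
With more than two tables, adding the label column one frame at a time gets tedious. One possible pattern, sketched here with a hypothetical `monthly_frames` dictionary, is to label and combine in a single step. Using `assign()` also leaves the original tables unmodified.

```python
import pandas as pd

# Hypothetical mapping of month label -> that month's table
monthly_frames = {
    'January': pd.DataFrame({'product': ['Widget', 'Gadget'], 'sales': [100, 150]}),
    'February': pd.DataFrame({'product': ['Widget', 'Gadget'], 'sales': [120, 160]}),
}

# assign() returns a copy with the new column, so the inputs are not changed
sales_all = pd.concat(
    [frame.assign(month=month) for month, frame in monthly_frames.items()],
    ignore_index=True
)

totals = sales_all.groupby('product')['sales'].sum()
print(sales_all)
print(totals)
```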

### Summary of Five Problems

1. **Column headers are values** → Make them a variable
2. **Multiple variables in one column** → Split into separate variables
3. **Variables in rows AND columns** → Reshape (complex, skip for now)
4. **Multiple observational units in one table** → Split into separate tables
5. **Single observational unit across multiple tables** → Combine into one table

---

## 4. Primary Keys & Identity

### What is a Primary Key?

> **"Every entity needs a unique identifier—a primary key (UID). This is the backbone of data integrity."**

A **primary key** uniquely identifies each row in your dataset.

### What Makes a Good Primary Key?

- **Unique**: No duplicates
- **Non-null**: Every row must have one
- **Stable**: Doesn't change over time
- **Single-purpose**: Exists to identify, not to describe

### Types of Primary Keys

**Natural Key:** Something inherent to the entity
- Email address (for users)
- ISBN (for books)
- SSN (for people—privacy concerns!)
- License plate (for vehicles)

**Surrogate Key:** Made up for database purposes
- customer_id: 1, 2, 3, ...
- order_id: ORD001, ORD002, ...
- Auto-incrementing integers

**Composite Key:** Multiple columns together
- (store_id, date) for daily store sales
- (student_id, course_id) for enrollments
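
A composite key is checked the same way as a single-column key, just across several columns at once. Here is a minimal sketch with made-up daily store sales; the `daily_sales` table and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily sales: neither store_id nor date is unique alone
daily_sales = pd.DataFrame({
    'store_id': [1, 1, 2, 2],
    'date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02'],
    'revenue': [500.0, 620.0, 480.0, 510.0],
})

key = ['store_id', 'date']

# Each column repeats on its own...
print(f"store_id unique alone? {daily_sales['store_id'].is_unique}")
print(f"date unique alone? {daily_sales['date'].is_unique}")

# ...but the combination should identify exactly one row
has_duplicates = daily_sales.duplicated(subset=key).any()
has_nulls = daily_sales[key].isna().any().any()
print(f"Duplicate (store_id, date) pairs? {has_duplicates}")
print(f"NULLs in key columns? {has_nulls}")
```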

### Why Primary Keys Matter

**Business Impact:**
- Foundation for joining tables (which customer made which purchase?)
- Ensures you can uniquely identify each observation
- Prevents duplicate records (customer entered twice?)
- Essential for updates (which record do I change?)
- Enables tracking over time (same customer, different transactions)

**Red Flags:**
- No obvious UID → Need to create one or investigate grain of data
- Multiple rows with same UID → Data quality problem, investigate!
- UID with NULLs → Missing identifiers, data incomplete

### Example: Validating a Primary Key

In [None]:
# Example transactions with good primary key
good_transactions = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'customer_id': [101, 102, 101],
    'amount': [50.00, 75.00, 30.00]
})

print("Good transactions table:")
print(good_transactions)

# Validate primary key
print("\n=== Primary Key Validation ===")
print(f"Is transaction_id unique? {good_transactions['transaction_id'].is_unique}")
print(f"Any NULLs in transaction_id? {good_transactions['transaction_id'].isna().any()}")
print("✅ transaction_id is a valid primary key!")

In [None]:
# Example with BAD primary key (duplicates)
bad_transactions = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN001'],  # Duplicate!
    'customer_id': [101, 102, 103],
    'amount': [50.00, 75.00, 30.00]
})

print("\nBad transactions table (has duplicate IDs):")
print(bad_transactions)

# Validate
print("\n=== Primary Key Validation ===")
print(f"Is transaction_id unique? {bad_transactions['transaction_id'].is_unique}")
print("❌ PROBLEM: Duplicate transaction IDs found!")

# Find duplicates
duplicates = bad_transactions[bad_transactions.duplicated(subset=['transaction_id'], keep=False)]
print("\nDuplicate rows:")
print(duplicates)

### Always Validate With Assertions

**Best practice:** Add assertions to your code to prove data quality

In [None]:
# Example: Assertions for primary key validation
def validate_primary_key(df, key_column):
    """
    Validate that a column is a proper primary key.
    Raises AssertionError if validation fails.
    """
    # Check for uniqueness
    assert df[key_column].is_unique, f"Duplicate values found in {key_column}"
    
    # Check for NULLs
    assert df[key_column].notna().all(), f"NULL values found in {key_column}"
    
    print(f"✅ {key_column} is a valid primary key")
    print(f"   - {len(df)} unique values")
    print(f"   - No NULLs")

# Test it
try:
    validate_primary_key(good_transactions, 'transaction_id')
except AssertionError as e:
    print(f"❌ Validation failed: {e}")

print("\n" + "="*50 + "\n")

try:
    validate_primary_key(bad_transactions, 'transaction_id')
except AssertionError as e:
    print(f"❌ Validation failed: {e}")

### Key Takeaway

**Always designate and validate your primary key!**

- Use assertions to prove uniqueness and non-null
- If you don't have a natural key, create a surrogate key
- Document what the key represents

---

## 5. Types & Common Pitfalls

> **"Computers don't understand context—you must tell them what each column means. The wrong type silently breaks calculations."**

### Why Types Matter

- Enable correct operations (math on numbers, not strings)
- Enable validation (dates must be valid dates)
- Enable efficiency (numbers stored efficiently, not as text)
- Enable analysis (group by categories, aggregate numbers)

### Common Type Pitfalls

---

### Pitfall 1: Dates

Dates are surprisingly hard!

In [None]:
# Pitfall 1: Mixed date formats
messy_dates = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'date': ['2024-01-15', '01/16/2024', 'January 17, 2024']
})

print("Messy dates (multiple formats):")
print(messy_dates)
print(f"\nDate column type: {messy_dates['date'].dtype}")
print("Problem: These are strings, not dates! Can't do date math or filtering.")

In [None]:
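# Solution: Standardize to one format (ISO), then parse explicitly
# Note: Parsing can fail or mis-read values when formats are mixed in one column,
# so this example starts from dates already rewritten as YYYY-MM-DD.
# In real life, you'd need to handle each format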

clean_dates = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003'],
    'date': ['2024-01-15', '2024-01-16', '2024-01-17']
})

clean_dates['date'] = pd.to_datetime(clean_dates['date'], format='%Y-%m-%d')

print("Clean dates:")
print(clean_dates)
print(f"\nDate column type: {clean_dates['date'].dtype}")
print("✅ Now we can do date math!")

# Examples of what we can do now
print(f"\nLatest date: {clean_dates['date'].max()}")
print(f"Days between first and last: {(clean_dates['date'].max() - clean_dates['date'].min()).days}")
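
For a column that really does mix formats, one possible approach is to try each known format in turn and keep the first successful parse. This is a sketch, not the only way to do it: the list `candidate_formats` is an assumption you would build by inspecting your own data, and `%B` relies on English month names.

```python
import pandas as pd

raw = pd.Series(['2024-01-15', '01/16/2024', 'January 17, 2024'])

# Formats we expect to see (assumed; inspect your data to build this list)
candidate_formats = ['%Y-%m-%d', '%m/%d/%Y', '%B %d, %Y']

# Start with all-missing, then fill in whatever each format manages to parse
parsed = pd.Series(pd.NaT, index=raw.index, dtype='datetime64[ns]')
for fmt in candidate_formats:
    attempt = pd.to_datetime(raw, format=fmt, errors='coerce')  # non-matching -> NaT
    parsed = parsed.fillna(attempt)

# Anything still missing matched none of the formats and needs a manual look
unparsed = raw[parsed.isna()]

print(parsed)
print(f"Unparsed values: {len(unparsed)}")
```

Listing the formats explicitly avoids guessing whether `01/02/2024` means January 2 or February 1.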

**Key Lessons for Dates:**
- Always parse explicitly with `pd.to_datetime()`
- Standardize format (prefer ISO: YYYY-MM-DD)
- Watch for timezone issues
- Excel date serials (numbers like 44562) need special handling
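
As an illustration of the last point, Excel serials from the default 1900 date system can be converted by treating them as day counts from 1899-12-30. The sketch assumes that date system; workbooks using the 1904 system need a different origin.

```python
import pandas as pd

# Serial numbers as they might appear after a careless export
serials = pd.Series([44562, 44563])

# Interpret as days since 1899-12-30 (Excel's 1900 date system)
dates = pd.to_datetime(serials, unit='D', origin='1899-12-30')

print(dates)
```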

---

### Pitfall 2: Floating-Point Precision

**Surprise:** Binary floating-point numbers can't represent most decimal fractions exactly!

In [None]:
# The famous 0.1 + 0.2 problem
result = 0.1 + 0.2
print(f"0.1 + 0.2 = {result}")
print(f"Does 0.1 + 0.2 == 0.3? {result == 0.3}")
print(f"\nWhy? Binary representation limitation.")
print(f"Actual value: {result:.20f}")

**When This Matters:**
- Financial calculations (money)
- Equality checks (`==` can fail)
- Accumulating small amounts (errors compound)

**Solutions:**
- Use `Decimal` type for money
- Use `np.isclose()` or `round()` for comparisons
- Round final results appropriately
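
Here is a short sketch of the `Decimal` option from Python's standard library. Note that values are built from strings: constructing a `Decimal` from a float carries the float's error along with it.

```python
from decimal import Decimal

# Floats: binary rounding error shows up in the comparison
float_total = 0.1 + 0.2
print(float_total == 0.3)

# Decimals built from strings: exact decimal arithmetic
decimal_total = Decimal('0.10') + Decimal('0.20')
print(decimal_total == Decimal('0.30'))

# Pitfall: building a Decimal from a float keeps the float's imprecision
from_float = Decimal(0.1)
print(from_float == Decimal('0.1'))
```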

In [None]:
# Better: Use np.isclose() for float comparisons
print(f"np.isclose(0.1 + 0.2, 0.3): {np.isclose(0.1 + 0.2, 0.3)}")

# Or round before comparing
print(f"round(0.1 + 0.2, 2) == 0.3: {round(0.1 + 0.2, 2) == 0.3}")

---

### Pitfall 3: Numbers Stored as Strings

**Very common in real-world data!**

In [None]:
# Numbers with formatting stored as strings
messy_numbers = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': ['$12.50', '$1,234.56', '45.00']
})

print("Messy numbers:")
print(messy_numbers)
print(f"\nPrice column type: {messy_numbers['price'].dtype}")
print("\nTry to calculate total:")
try:
    total = messy_numbers['price'].sum()
    print(f"Total: {total}")
    print("Uh oh... that's not right!")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Solution: Clean and convert
clean_numbers = messy_numbers.copy()

# Remove $ and , then convert to float
clean_numbers['price'] = clean_numbers['price'].str.replace('$', '', regex=False)
clean_numbers['price'] = clean_numbers['price'].str.replace(',', '', regex=False)
clean_numbers['price'] = clean_numbers['price'].astype(float)

print("Clean numbers:")
print(clean_numbers)
print(f"\nPrice column type: {clean_numbers['price'].dtype}")
print(f"Total: ${clean_numbers['price'].sum():,.2f}")
print("✅ Now math works!")

**Other Examples:**
- Percentages: "5%" → 0.05
- Units embedded: "42.5 kg" → 42.5
- Phone numbers: "(555) 123-4567" → Keep as string or extract digits
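
The first two cases can be handled with pandas string methods. This is a minimal sketch on invented values; the column names `discount` and `weight` are assumptions, and the regular expression only covers plain non-negative numbers.

```python
import pandas as pd

raw = pd.DataFrame({
    'discount': ['5%', '12.5%', '0%'],
    'weight': ['42.5 kg', '3 kg', '10.25 kg']
})

# Percentages: drop the % sign, convert, then scale to a fraction
raw['discount_rate'] = raw['discount'].str.rstrip('%').astype(float) / 100

# Embedded units: pull out the leading number and convert
raw['weight_kg'] = (
    raw['weight']
    .str.extract(r'(\d+(?:\.\d+)?)', expand=False)
    .astype(float)
)

print(raw)
print(raw.dtypes)
```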

---

### Pitfall 4: Booleans

Many ways to represent True/False!

In [None]:
# Many representations of boolean
bool_mess = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'is_active': ['Yes', 'No', 'Y', 'N']
})

print("Messy booleans:")
print(bool_mess)
print(f"\nis_active type: {bool_mess['is_active'].dtype}")

In [None]:
# Solution: Standardize
bool_clean = bool_mess.copy()
bool_clean['is_active'] = bool_clean['is_active'].map({
    'Yes': True,
    'Y': True,
    'No': False,
    'N': False
})

print("Clean booleans:")
print(bool_clean)
print(f"\nis_active type: {bool_clean['is_active'].dtype}")
print(f"\nActive customers: {bool_clean['is_active'].sum()}")
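
One caveat with `.map()`: any value missing from the dictionary silently becomes NaN. A slightly more defensive sketch normalizes case and whitespace first, then reports what could not be mapped. The sample values and the `mapping` dictionary here are illustrative assumptions.

```python
import pandas as pd

raw = pd.Series(['Yes', ' no', 'Y', 'N', 'TRUE', 'maybe'])

# Lowercase keys only; normalize the data to match
mapping = {
    'yes': True, 'y': True, 'true': True,
    'no': False, 'n': False, 'false': False
}

normalized = raw.str.strip().str.lower()
flags = normalized.map(mapping)

# Values that matched nothing: review these instead of ignoring them
unmapped = raw[flags.isna()]

print(flags)
print(f"Unmapped values: {unmapped.tolist()}")
```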

### Summary: Type Pitfalls

1. **Dates**: Parse explicitly, standardize format
2. **Floats**: Watch for precision issues, don't use `==`
3. **Numbers as strings**: Clean formatting, then convert
4. **Booleans**: Standardize representations

**Always check types** with `df.dtypes` after loading data!

---

## 6. Missing Values

> **"Missing data IS data—but different representations mean different things. Your choice matters."**

### Common Representations of Missing Data

In [None]:
# Many ways to represent "missing"
missing_examples = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, np.nan, 30, -999, 35],
    'city': ['NYC', '', 'Boston', 'N/A', 'Chicago'],
    'income': [50000, 60000, 0, 70000, np.nan]
})

print("Dataset with various missing value representations:")
print(missing_examples)
print("\nWhat's missing here?")
print("- Bob's age: NaN (proper missing)")
print("- Bob's city: empty string ('')")
print("- Charlie's income: 0 (is this missing or actually zero?)")
print("- Diana's age: -999 (sentinel value)")
print("- Diana's city: 'N/A' (string sentinel)")
print("- Eve's income: NaN")

### Why It Matters: Different Representations, Different Results

In [None]:
# Example: How NULL vs 0 affects calculations
with_null = pd.Series([10, 20, np.nan])
with_zero = pd.Series([10, 20, 0])

print("=== Impact on Aggregations ===")
print(f"\nWith NULL: [10, 20, NaN]")
print(f"  Mean: {with_null.mean():.2f}  (NULLs excluded)")
print(f"  Count: {with_null.count()}  (counts non-NULL)")

print(f"\nWith Zero: [10, 20, 0]")
print(f"  Mean: {with_zero.mean():.2f}  (zero included!)")
print(f"  Count: {with_zero.count()}  (counts all)")

print("\n⚠️  Same data, but 0 vs NaN gives different results!")

### Decision Framework: What Does "Missing" Mean?

Before handling missing values, ask:

1. **"Not applicable"**
   - This field doesn't apply to this row
   - Example: "spouse_name" for single person
   - Strategy: Leave as NULL or N/A

2. **"Unknown"**
   - We don't know the value but it exists
   - Example: Customer age not collected
   - Strategy: NULL or impute (fill with median/mean)

3. **"Not collected yet"**
   - Temporal gap, data will arrive later
   - Example: Survey response pending
   - Strategy: Mark as pending, don't analyze yet

**Your representation choice should reflect the meaning!**

### Best Practices

In [None]:
# Example: Standardizing missing values
messy_missing = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],
    'price': [10.0, np.nan, 'N/A', '']
})

print("Before standardization:")
print(messy_missing)
print(f"\nMissing values detected: {messy_missing['price'].isna().sum()}")
print("Problem: NaN, 'N/A', and '' are all missing, but pandas only sees NaN")

In [None]:
# Solution: Standardize to NULL (NaN)
clean_missing = messy_missing.copy()

# Replace common sentinel values with NaN
clean_missing['price'] = clean_missing['price'].replace(['N/A', '', 'Unknown', 'ERROR'], np.nan)

print("After standardization:")
print(clean_missing)
print(f"\nMissing values detected: {clean_missing['price'].isna().sum()}")
print("✅ Now all missing values are represented consistently")
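
When the data arrives as a CSV file, sentinels can also be declared at load time. The sketch below uses the `na_values` parameter of `pd.read_csv()` on a small in-memory CSV. The sentinel lists are assumptions you would take from the data dictionary. Empty fields and `N/A` are already treated as missing by default.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a real file
csv_text = """customer,age,city
Alice,25,NYC
Bob,,Unknown
Diana,-999,N/A
"""

# Per-column sentinels, added on top of the pandas defaults
df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'age': ['-999'], 'city': ['Unknown']}
)

print(df)
print(df.isna().sum())
```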

### Handling Strategies

Once standardized, you have options:

1. **Drop rows** with missing values
   - `df.dropna(subset=['column'])`
   - Use when: Missing data is small percentage
   - Risk: Losing information

2. **Drop columns** with too many missing values
   - `df.dropna(axis=1, thresh=int(len(df) * 0.5))` (keeps only columns with at least 50% non-missing values)
   - Use when: Column is mostly empty

3. **Fill with default value**
   - `df['column'].fillna(0)` or `.fillna('Unknown')`
   - Use when: There's a meaningful default
   - Document your choice!

4. **Impute with statistics**
   - `df['age'].fillna(df['age'].median())`
   - Use when: Missing at random
   - Risk: Creating fake data

5. **Keep as NULL**
   - Leave as NaN
   - Use when: Missing is meaningful (not applicable)
   - Be aware how it affects aggregations

In [None]:
# Example: Different strategies
data_with_missing = pd.DataFrame({
    'transaction_id': ['TXN001', 'TXN002', 'TXN003', 'TXN004'],
    'amount': [100, 200, np.nan, 150],
    'notes': ['Paid', np.nan, 'Paid', np.nan]
})

print("Original data:")
print(data_with_missing)

# Strategy 1: Fill amount with median
strategy1 = data_with_missing.copy()
strategy1['amount'] = strategy1['amount'].fillna(strategy1['amount'].median())
print("\nStrategy 1: Fill amount with median")
print(strategy1)

# Strategy 2: Fill notes with 'No notes'
strategy2 = data_with_missing.copy()
strategy2['notes'] = strategy2['notes'].fillna('No notes')
print("\nStrategy 2: Fill notes with 'No notes'")
print(strategy2)

# Strategy 3: Drop rows with any missing
strategy3 = data_with_missing.dropna()
print("\nStrategy 3: Drop rows with ANY missing values")
print(strategy3)
print(f"Rows remaining: {len(strategy3)} out of {len(data_with_missing)}")

### Document Your Choices!

**Example documentation:**

> **Handling missing payment_method values:** Found 2,579 transactions (25.8%) with NULL payment method. These represent transactions where payment method was not recorded at point of sale. 
>
> **Decision:** Convert to NaN and exclude from payment method analysis. Our "sales by payment method" report will only include transactions with known payment methods (74.2% of data).
>
> **Alternative considered:** Could create "Unknown" category, but this might mislead stakeholders into thinking it's a valid payment option. 
>
> **Implication:** Total sales across all payment methods will be less than total sales overall. Need to clearly communicate this in reporting.

### Summary: Missing Values

1. **Standardize** to NULL/NaN first
2. **Understand** what "missing" means for each column
3. **Choose** a strategy (drop, fill, impute, keep)
4. **Document** your choice and implications
5. **Communicate** how it affects downstream analysis

---

## Summary: Key Takeaways

### 1. Data Structure is Strategic
- You're defining the language the business uses
- Databases solve: "Store each fact once, answer any question"

### 2. Tidy Data Principles
- Each variable is a column
- Each observation is a row
- Each value is a cell

### 3. Five Common Problems
1. Column headers are values
2. Multiple variables in one column
3. Variables in rows AND columns
4. Multiple observational units in one table
5. Single observational unit across multiple tables

### 4. Primary Keys
- Every dataset needs a unique identifier
- Always validate with assertions
- Foundation for joins and data integrity

### 5. Types Matter
- Dates: Parse explicitly
- Floats: Watch precision, don't use `==`
- Strings-as-numbers: Clean then convert
- Booleans: Standardize representations

### 6. Missing Values
- Different representations mean different things
- Standardize to NULL/NaN
- Document your handling strategy
- Understand impact on aggregations

---

## Next Steps

**In-Class Exercise:**
- Apply these concepts to real messy data
- Practice identifying and fixing problems
- Build your data cleaning skills

**Notebook:** `day1_exercise_tidy.ipynb`  
**Dataset:** `data/day1/dirty_cafe_sales.csv`  
**Data Dictionary:** `data/day1/README.md`

**Let's practice!** 🚀