# Homework 3: Multi-City Data Integration & Pipeline

**Course:** ECBS5294 - Introduction to Data Science: Working with Data
**Due:** Wednesday, October 29, 23:59
**Total Points:** 100

---

## Student Information

**Name:** [Your name here]
**Date:** [Today's date]

---

## Assignment Overview

You're a Data Analyst at PolicyMetrics, a consulting firm. Your task: integrate business licensing data from Chicago (CSV) and building permit data from NYC (JSON) into a clean analytical database.

**Complete all 6 parts below.**

**See full instructions:** `assignments/hw3/README.md`

---

In [None]:
# Setup (PROVIDED - don't modify)
import pandas as pd
import duckdb
import json
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')

# Create DuckDB connection
con = duckdb.connect(':memory:')

print("✅ Setup complete")
print(f"\nPandas version: {pd.__version__}")
print(f"DuckDB version: {duckdb.__version__}")

---

## Part 1: Data Ingestion & Exploration (15 points)

**Load both datasets and profile them.**

**TODO:**
1. Load Chicago business licenses CSV
2. Load NYC building permits JSON
3. Display first rows of each
4. Document row counts, columns, data types
5. Note any obvious data quality issues

---

### 1.1: Load Chicago Business Licenses (CSV)

In [None]:
# TODO: Load Chicago CSV

# Load the CSV file
chicago_df = pd.read_csv('../../data/day3/hw3_data_pack/chicago_business_licenses.csv')

# Display basic info
print("=== Chicago Business Licenses ===\n")
print(f"Rows: {len(chicago_df)}")
print(f"Columns: {len(chicago_df.columns)}")
print(f"\nColumn names:\n{list(chicago_df.columns)}")

# Display first few rows
print("\nFirst 5 rows:")
display(chicago_df.head())

# Check data types
print("\nData types:")
print(chicago_df.dtypes)

### 1.2: Load NYC Building Permits (JSON)

In [None]:
# TODO: Load NYC JSON

# Load JSON file
with open('../../data/day3/hw3_data_pack/nyc_building_permits.json', 'r') as f:
    nyc_data = json.load(f)

# Convert to DataFrame
nyc_df = pd.DataFrame(nyc_data)

# Display basic info
print("=== NYC Building Permits ===\n")
print(f"Rows: {len(nyc_df)}")
print(f"Columns: {len(nyc_df.columns)}")
print(f"\nColumn names:\n{list(nyc_df.columns)}")

# Display first few rows
print("\nFirst 5 rows:")
display(nyc_df.head())

# Check data types
print("\nData types:")
print(nyc_df.dtypes)

### 1.3: Initial Data Quality Observations

**TODO:** Document what you notice about the data.

(Double-click to edit this markdown cell and add your observations)

**Chicago observations:**
- [TODO: What do you notice? Missing values? Data types? Anything surprising?]

**NYC observations:**
- [TODO: What do you notice? Especially check lat/lon data types!]

**Cross-dataset observations:**
- [TODO: Any challenges you foresee in working with both datasets?]
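
One way to back these observations with numbers is to tabulate the dtype and missing-value share of every column. The sketch below is only one possible approach and is not part of the provided template; `profile_columns` and the toy DataFrame are hypothetical names and values used for illustration.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count, and null share (%)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "pct_null": (df.isna().mean() * 100).round(2),
    })

# Toy data standing in for a real dataset (hypothetical values)
toy = pd.DataFrame({
    "id": ["A1", "A2", None, "A4"],
    "latitude": ["41.88", "bad", None, "41.90"],  # numbers stored as strings
})

profile = profile_columns(toy)
print(profile)
```

Assuming the loading cells above have run, calling `profile_columns(chicago_df)` and `profile_columns(nyc_df)` should give a side-by-side view of where NULLs concentrate and which numeric-looking columns arrived as text.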

---

## Part 2: Bronze Layer (20 points)

**Load raw data into bronze tables (preserve as-is).**

**TODO:**
1. Create `bronze_chicago_licenses` table
2. Create `bronze_nyc_permits` table
3. Verify row counts match source files

---

In [None]:
# TODO: Create bronze tables

print("=== BRONZE LAYER ===\n")

# Bronze Chicago
con.execute("CREATE OR REPLACE TABLE bronze_chicago_licenses AS SELECT * FROM chicago_df")
chicago_count = con.execute("SELECT COUNT(*) FROM bronze_chicago_licenses").fetchone()[0]
print(f"✓ Created bronze_chicago_licenses: {chicago_count} rows")

# Bronze NYC
con.execute("CREATE OR REPLACE TABLE bronze_nyc_permits AS SELECT * FROM nyc_df")
nyc_count = con.execute("SELECT COUNT(*) FROM bronze_nyc_permits").fetchone()[0]
print(f"✓ Created bronze_nyc_permits: {nyc_count} rows")

# Verify counts match
assert chicago_count == len(chicago_df), "Chicago row count mismatch!"
assert nyc_count == len(nyc_df), "NYC row count mismatch!"

print("\n✅ Bronze layer complete: Raw data preserved")

---

## Part 3: Silver Layer - Normalization (25 points)

**Transform into analysis-ready format.**

**TODO:**
1. Create clean silver tables with proper types
2. Fix dates (parse to datetime)
3. Convert NYC lat/lon from strings to floats
4. Handle missing values appropriately
5. Remove rows with NULL in critical fields
6. Document cleaning decisions
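
When converting strings to numbers with `errors='coerce'`, unparseable values silently become NaN, so it helps to count how many were lost before documenting the decision. The following is a minimal sketch under that assumption; `coerce_numeric_with_report` and the sample values are hypothetical and not part of the assignment data.

```python
import pandas as pd

def coerce_numeric_with_report(values: pd.Series):
    """Convert to numeric; return the result and how many non-null values failed."""
    converted = pd.to_numeric(values, errors="coerce")
    # Failed = present before conversion but NaN afterwards
    n_failed = int((values.notna() & converted.isna()).sum())
    return converted, n_failed

# Hypothetical raw latitude strings, including one bad value and one missing value
raw_lat = pd.Series(["40.7128", "40.7306", "N/A", None], dtype="object")

lat, n_failed = coerce_numeric_with_report(raw_lat)
print(f"dtype after conversion: {lat.dtype}")
print(f"values that failed to parse: {n_failed}")
```

The same idea applies to `TRY_CAST` in SQL: comparing the NULL count before and after the cast shows how many values could not be converted.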

---

### 3.1: Silver - Chicago Licenses

In [None]:
# TODO: Create silver Chicago table

print("=== SILVER: Chicago Licenses ===\n")

con.execute("""
    CREATE OR REPLACE TABLE silver_chicago_licenses AS
    SELECT
        license_id,
        account_number,
        legal_name,
        doing_business_as_name,
        address,
        city,
        state,
        zip_code,
        ward,
        license_code,
        license_description,
        business_activity,
        application_type,
        TRY_CAST(application_created_date AS TIMESTAMP) as application_created_date,
        TRY_CAST(license_start_date AS TIMESTAMP) as license_start_date,
        TRY_CAST(expiration_date AS TIMESTAMP) as expiration_date,
        license_status,
        TRY_CAST(latitude AS DOUBLE) as latitude,
        TRY_CAST(longitude AS DOUBLE) as longitude
    FROM bronze_chicago_licenses
    WHERE license_id IS NOT NULL  -- Remove rows without license ID
""")

# Check results
silver_chicago_count = con.execute("SELECT COUNT(*) FROM silver_chicago_licenses").fetchone()[0]
removed_chicago = chicago_count - silver_chicago_count

print(f"Silver Chicago rows: {silver_chicago_count}")
print(f"Removed {removed_chicago} rows with NULL license_id ({removed_chicago/chicago_count*100:.2f}%)")

# Verify types
sample = con.execute("SELECT * FROM silver_chicago_licenses LIMIT 1").df()
print(f"\nDate type check:")
print(f"  application_created_date: {sample['application_created_date'].dtype}")
print(f"  license_start_date: {sample['license_start_date'].dtype}")

print("\n✓ Silver Chicago complete")

### 3.2: Silver - NYC Permits

In [None]:
# TODO: Create silver NYC table

print("=== SILVER: NYC Permits ===\n")

# Critical: Convert lat/lon from strings to floats BEFORE creating table.
# Read from the bronze table so silver is derived from bronze (matches the data dictionary).
nyc_clean = con.execute("SELECT * FROM bronze_nyc_permits").df()
nyc_clean['gis_latitude'] = pd.to_numeric(nyc_clean['gis_latitude'], errors='coerce')
nyc_clean['gis_longitude'] = pd.to_numeric(nyc_clean['gis_longitude'], errors='coerce')

# Register the cleaned DataFrame so DuckDB can query it by name
con.register('nyc_clean', nyc_clean)

con.execute("""
    CREATE OR REPLACE TABLE silver_nyc_permits AS
    SELECT
        job__ as job_number,
        borough,
        bin__ as bin,
        house__,
        street_name,
        zip_code,
        job_type,
        permit_status,
        TRY_CAST(filing_date AS TIMESTAMP) as filing_date,
        TRY_CAST(issuance_date AS TIMESTAMP) as issuance_date,
        TRY_CAST(expiration_date AS TIMESTAMP) as expiration_date,
        owner_s_business_name,
        owner_s_first_name,
        owner_s_last_name,
        gis_latitude as latitude,  -- Already converted to float
        gis_longitude as longitude  -- Already converted to float
    FROM nyc_clean
    WHERE job__ IS NOT NULL  -- Remove rows without job number
""")

# Check results
silver_nyc_count = con.execute("SELECT COUNT(*) FROM silver_nyc_permits").fetchone()[0]
removed_nyc = nyc_count - silver_nyc_count

print(f"Silver NYC rows: {silver_nyc_count}")
print(f"Removed {removed_nyc} rows with NULL job number ({removed_nyc/nyc_count*100:.2f}%)")

# Verify lat/lon are numeric
sample_nyc = con.execute("SELECT * FROM silver_nyc_permits LIMIT 1").df()
print(f"\nLat/lon type check:")
print(f"  latitude: {sample_nyc['latitude'].dtype}")
print(f"  longitude: {sample_nyc['longitude'].dtype}")

print("\n✓ Silver NYC complete")

### 3.3: Document Cleaning Decisions

**TODO:** Document what you did and why.

(Double-click to edit)

**Chicago cleaning:**
- Removed [X] rows with NULL license_id
- Parsed dates: application_created_date, license_start_date, expiration_date
- Converted lat/lon to DOUBLE
- [TODO: Any other decisions?]

**NYC cleaning:**
- Removed [X] rows with NULL job number
- **Critical:** Converted gis_latitude and gis_longitude from strings to floating-point numbers (stored as DOUBLE)
- Parsed dates: filing_date, issuance_date, expiration_date
- [TODO: Any other decisions?]

**Missing values:**
- [TODO: How did you handle NULLs in non-critical fields?]

---

## Part 4: Silver Layer - Validations (15 points)

**Write at least 3 assertions to validate data quality.**

**Requirements:**
- At least 3 validations (5 pts each)
- Use assertions (raise an error on failure)
- Include clear error messages
- Validate something meaningful

**Suggestions:**
1. Primary key uniqueness
2. Required fields non-null
3. Data types correct
4. Date ranges reasonable
5. Business rules (if applicable)

---

In [None]:
# TODO: Write validations

print("=== VALIDATIONS ===\n")

# Validation 1: TODO - Add your validation
# Example: Primary key uniqueness for Chicago
chicago_total = con.execute("SELECT COUNT(*) FROM silver_chicago_licenses").fetchone()[0]
chicago_unique = con.execute("SELECT COUNT(DISTINCT license_id) FROM silver_chicago_licenses").fetchone()[0]

print(f"✓ Validation 1: Chicago license_id uniqueness")
print(f"  Total rows: {chicago_total}")
print(f"  Unique license_ids: {chicago_unique}")
assert chicago_total == chicago_unique, f"Duplicate license_ids! {chicago_total} rows but {chicago_unique} unique IDs"
print("  ✅ PASS\n")

# Validation 2: TODO - Add your validation
# Example: Primary key uniqueness for NYC


# Validation 3: TODO - Add your validation
# Example: Date types or ranges


# Add more validations if desired (up to 5 total)


print("="*60)
print("✅ ALL VALIDATIONS PASSED")
print("="*60)

---

## Part 5: Gold Layer - Analytics (15 points)

**Create 5-7 business KPIs for the steering committee.**

**Requirements:**
- At least 5 KPIs (max 7)
- Clear business context for each
- Use appropriate aggregations
- Display results clearly

**Suggestions:**
- Top license types (Chicago)
- License status distribution (Chicago)
- Permits by borough (NYC)
- Time from filing to issuance (NYC)
- Geographic analysis
- Trend over time
- Cross-city comparisons (creative!)

---

### KPI 1: [Your KPI Title]

**Business Question:** [Why does this matter?]

In [None]:
# TODO: KPI 1

print("=== KPI 1: [Your Title] ===\n")

result = con.execute("""
    -- TODO: Write your query
    SELECT
        'example' as placeholder,
        COUNT(*) as count
    FROM silver_chicago_licenses
    LIMIT 10
""").df()

display(result)

print("\nInsight: [TODO: What does this tell us?]")

### KPI 2: [Your KPI Title]

**Business Question:** [Why does this matter?]

In [None]:
# TODO: KPI 2

print("=== KPI 2: [Your Title] ===\n")

result = con.execute("""
    -- TODO: Write your query
""").df()

display(result)

print("\nInsight: [TODO: What does this tell us?]")

### KPI 3: [Your KPI Title]

**Business Question:** [Why does this matter?]

In [None]:
# TODO: KPI 3

print("=== KPI 3: [Your Title] ===\n")

result = con.execute("""
    -- TODO: Write your query
""").df()

display(result)

print("\nInsight: [TODO: What does this tell us?]")

### KPI 4: [Your KPI Title]

**Business Question:** [Why does this matter?]

In [None]:
# TODO: KPI 4

print("=== KPI 4: [Your Title] ===\n")

result = con.execute("""
    -- TODO: Write your query
""").df()

display(result)

print("\nInsight: [TODO: What does this tell us?]")

### KPI 5: [Your KPI Title]

**Business Question:** [Why does this matter?]

In [None]:
# TODO: KPI 5

print("=== KPI 5: [Your Title] ===\n")

result = con.execute("""
    -- TODO: Write your query
""").df()

display(result)

print("\nInsight: [TODO: What does this tell us?]")

### (Optional) KPI 6 & 7

Add more KPIs if desired (max 7 total).

---

## Part 6: Documentation (10 points)

**Two deliverables:**
- A. Data Dictionary (5 pts)
- B. Stakeholder Note (5 pts)

---

### A. Data Dictionary

**TODO:** Document all tables you created.

(Double-click to edit)

#### TABLE: bronze_chicago_licenses
- **Description:** Raw Chicago business license data as received
- **Row count:** [TODO]
- **Source:** ../../data/day3/hw3_data_pack/chicago_business_licenses.csv

**Key columns:**
- license_id (VARCHAR): [TODO: description]
- legal_name (VARCHAR): [TODO: description]
- ... [TODO: add more]

---

#### TABLE: bronze_nyc_permits
- **Description:** Raw NYC building permit data as received
- **Row count:** [TODO]
- **Source:** ../../data/day3/hw3_data_pack/nyc_building_permits.json

**Key columns:**
- job__ (VARCHAR): [TODO: description]
- borough (VARCHAR): [TODO: description]
- ... [TODO: add more]

---

#### TABLE: silver_chicago_licenses
- **Description:** Cleaned Chicago licenses, analysis-ready
- **Row count:** [TODO]
- **Source:** bronze_chicago_licenses
- **Cleaning:** Removed [X] rows with NULL license_id, parsed dates

**Columns:**
- license_id (VARCHAR): Unique license identifier [PK]
- application_created_date (TIMESTAMP): When application was created
- license_start_date (TIMESTAMP): When license became effective
- ... [TODO: document all columns]

---

#### TABLE: silver_nyc_permits
- **Description:** Cleaned NYC permits, analysis-ready
- **Row count:** [TODO]
- **Source:** bronze_nyc_permits
- **Cleaning:** Removed [X] rows with NULL job number, converted lat/lon from strings to DOUBLE, parsed dates

**Columns:**
- job_number (VARCHAR): Unique job identifier [PK]
- borough (VARCHAR): NYC borough
- filing_date (TIMESTAMP): When permit application was filed
- latitude (DOUBLE): Latitude coordinate
- longitude (DOUBLE): Longitude coordinate
- ... [TODO: document all columns]

---

[TODO: Add any gold tables you created for KPIs]

---

### B. Stakeholder Note

**TODO:** Write 8-10 sentences for non-technical executives.

**Audience:** City officials who need to make policy decisions (assume they do not know SQL).

**Include:**
1. What the data shows (high-level findings)
2. What assumptions you made
3. What limitations exist
4. What questions the data CAN'T answer
5. Recommendations for next steps

(Double-click to edit)

---

**Executive Summary: Multi-City Business Regulation Analysis**

[TODO: Write your 8-10 sentence stakeholder note here]

[Start with: "This analysis integrated X Chicago business licenses and Y NYC building permits..."]

[Mention key findings from your KPIs]

[Note assumptions: "We assumed all licenses in 'AAC' status are currently active..."]

[Note limitations: "This data cannot answer questions about business closures or economic impact..."]

[Conclude with recommendations: "We recommend the steering committee focus on..."]

---

## ✅ Submission Checklist

Before submitting, verify:

- [ ] All 6 parts completed
- [ ] Notebook runs end-to-end (Kernel → Restart & Run All)
- [ ] All datasets load correctly with **relative paths**
- [ ] All assertions pass (validations succeed)
- [ ] At least 5 KPIs created and displayed
- [ ] Data dictionary complete (all tables documented)
- [ ] Stakeholder note written (8-10 sentences)
- [ ] Student name filled in at top
- [ ] File renamed: `hw3_[your_name].ipynb`

**Submit on Moodle by Wednesday, October 29, 23:59**

---

## 🎉 Congratulations!

You've completed the final project for ECBS5294!

You've demonstrated:
- ✅ Multi-format data ingestion (CSV + JSON)
- ✅ Pipeline design (bronze → silver → gold)
- ✅ Data validation (assertions)
- ✅ SQL analytics (aggregations, KPIs)
- ✅ Professional documentation

**This is portfolio-worthy work.** You're ready to be a data professional!

---