# Silver Data Cleaning

**Purpose:** Clean raw data from the [Bronze](./1_bronze.ipynb) layer to create a unified data asset. This includes column standardization, data type enforcement, value harmonization, deduplication, and provenance tracking.

**Transformations Applied:**
- **Standardize** column names to lowercase snake_case
- **Tag** each row with its source region for provenance
- **Harmonize** categorical values across data sources
- **Enforce** consistent data types

The Silver output feeds the [Gold](./3_gold.ipynb) layer, where tailored data assets are built to answer specific questions efficiently.


For more on Medallion Architecture, see [Databricks Glossary: Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

---

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from https://www.databricks.com/glossary/medallion-architecture


---

## Table of Contents

TBD. To be organized once the notebook's changes are finalized.

-----

## 1. Setup

**Purpose:**  
Ensure the environment has all necessary libraries installed and imported.  
- `%pip install -r ../../requirements.txt` installs dependencies. 

> **Note:** we use a project-wide `requirements.txt` for consistency

In [17]:
%pip install -r ../../requirements.txt

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/pip/__main__.py", line 8, in <module>
    if sys.path[0] in ("", os.getcwd()):
                           ^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory
Note: you may need to restart the kernel to use updated packages.


In [18]:
import os
import pandas as pd

## 2. Configuration

Below, we define the Bronze data assets and load each one into a DataFrame. The mappings used to process the data and combine the DataFrames are defined in the sections that follow.

In [19]:
# Data source configurations
BRONZE_DIR = "../../data-assets/bronze"
BRONZE_FILE_NAME = "{}_df.parquet"

# Load all the Bronze datasets
BRONZE_FILES = ["dallas", "san_jose", "soco"]
BRONZE_FILE_PATHS = {
    file: os.path.join(BRONZE_DIR, BRONZE_FILE_NAME.format(file)) for file in BRONZE_FILES 
}
BRONZE_DFS = {
    file: pd.read_parquet(path) for file, path in BRONZE_FILE_PATHS.items()
}
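
Because `pd.read_parquet` fails on the first missing file, it can help to check all expected paths before loading. The sketch below is one possible pre-flight check; `find_missing_files` is a hypothetical helper that is not part of this notebook, and the temporary directory only stands in for `BRONZE_DIR`.

```python
import tempfile
from pathlib import Path


def find_missing_files(directory: str, names: list[str],
                       pattern: str = "{}_df.parquet") -> list[str]:
    """Return the dataset names whose expected file is absent (hypothetical helper)."""
    base = Path(directory)
    return [name for name in names if not (base / pattern.format(name)).is_file()]


# Demonstration against a throwaway directory instead of the real Bronze folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dallas_df.parquet").touch()  # only one of the two files exists
    missing = find_missing_files(tmp, ["dallas", "soco"])

print(missing)  # ['soco']
```

In the notebook, the same call could be made with `BRONZE_DIR` and `BRONZE_FILES` before building `BRONZE_DFS`.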

## 3. Data Cleaning & Standardization

**Purpose:**  
Clean up and harmonize column names across sources:  
- Apply a single `COLUMN_MAP` dict.  
- Lowercase everything for consistency.  
- Explicitly state each column's data type.
- Bucket high-cardinality columns such as `primary_color`.
- Create an `age_category` that buckets each species by maturity level.
- Standardize explicit naming conventions to create usable data points for analysis.
- Drop unintended duplicates.  

This ensures downstream steps can assume a uniform schema.

In [20]:
# ─── Data Cleaning ───

# Function to apply the column mapping 
def standardize_columns(source: str, df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """
    Standardize DataFrame column names.

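    Parameters
    ----------
    source : str
        Label for the data source (e.g. "dallas"); used only in the printed
        summary of the resulting column names.
    df : pandas.DataFrame
        The raw DataFrame whose columns need standardization to enable better
        analysis.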
    mapping : dict
        A dict where keys are original column names (exact match) and
        values are the desired standardized names (snake_case).

    Returns
    -------
    pandas.DataFrame
        A copy of `df` with:
        1. Columns renamed according to `mapping`.
        2. All column names converted to lowercase.
        3. Any duplicate column names (arising when multiple originals map
           to the same new name) removed—only the first occurrence is kept.

    Notes
    -----
    - Columns not present in `mapping` are left unchanged (apart from lowercasing).
    - Renaming happens before lowercasing, so mapping keys are case-sensitive.
    - Dropping duplicate columns avoids collisions in downstream code.
    """
    # Apply the renaming mapping
    df = df.rename(columns=mapping)
    # Convert all column names to lowercase
    df.columns = df.columns.str.lower()
    # Remove duplicate columns, keeping the first occurrence
    df = df.loc[:, ~df.columns.duplicated()]
    print(f" - {source}: {list(df.columns)}")
    return df
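
As a quick illustration of the three steps inside `standardize_columns` (rename, lowercase, drop duplicate columns), the sketch below runs them on a made-up frame. The column names and `toy_mapping` are assumptions for demonstration only and are not taken from the Bronze data.

```python
import pandas as pd

# Two source columns map to the same target, which creates a duplicate name.
toy = pd.DataFrame({
    "AnimalID": ["A1"],
    "Animal_Id": ["A1"],
    "Kennel_Number": ["K9"],
})
toy_mapping = {"AnimalID": "animal_id", "Animal_Id": "animal_id"}

renamed = toy.rename(columns=toy_mapping)            # exact, case-sensitive match
renamed.columns = renamed.columns.str.lower()        # unmapped names are only lowercased
deduped = renamed.loc[:, ~renamed.columns.duplicated()]  # keep the first duplicate

print(list(deduped.columns))  # ['animal_id', 'kennel_number']
```

Note that names missing from the mapping are lowercased but not otherwise converted, so a name like `AnimalName` would become `animalname` rather than `animal_name`.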

In [21]:
# Here we will define the full column mapping for all the DataFrames:
COLUMN_MAP = {
    # Animal identification
    **{col: "animal_id" for col in ["AnimalID", "Animal_Id", "Animal ID"]}, # `**` unpacks each alias dict into COLUMN_MAP
    **{col: "animal_type" for col in ["AnimalType", "Animal_Type", "Type"]},
    
    # Animal characteristics
    **{col: "breed" for col in ["PrimaryBreed", "Animal_Breed", "Breed"]},
    **{col: "primary_color" for col in ["PrimaryColor", "Color"]},
    "Age": "age",
    "Date Of Birth": "date_of_birth",
    "Sex": "sex",
    
    # Intake information
    **{col: "intake_type" for col in ["IntakeType", "Intake_type", "Intake Type"]},
    **{col: "intake_condition" for col in ["IntakeCondition", "Intake_Condition", "Intake Condition"]},
    **{col: "intake_reason" for col in ["IntakeReason", "Reason"]},
    **{col: "intake_date" for col in ["IntakeDate", "Intake_Date", "Intake Date"]},
    
    # Outcome information
    **{col: "outcome_type" for col in ["OutcomeType", "outcome_type", "Outcome Type"]},
    **{col: "outcome_date" for col in ["OutcomeDate", "Outcome_Date", "Outcome Date"]}
}
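
The `**{...}` entries in `COLUMN_MAP` rely on dictionary unpacking: each comprehension builds a small alias dict, and `**` merges it into the enclosing literal. A minimal sketch of the same pattern follows; `alias_map` is a hypothetical helper introduced here for readability and does not exist in this notebook.

```python
def alias_map(target: str, aliases: list[str]) -> dict[str, str]:
    """Map every alias to the same standardized name (hypothetical helper)."""
    return {alias: target for alias in aliases}


toy_map = {
    **alias_map("animal_id", ["AnimalID", "Animal ID"]),
    **alias_map("breed", ["PrimaryBreed", "Breed"]),
    "Sex": "sex",  # plain entries can sit alongside unpacked ones
}

print(toy_map["Animal ID"])  # animal_id
print(len(toy_map))          # 5
```

If the same key appears in more than one unpacked dict, the last occurrence silently wins, so aliases should be unique across target columns.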

In [22]:
# Apply standardization (renaming and lowercasing) to all DataFrames
print("Column standardization starting...\n")
CLEAN_DFS = {
    source: standardize_columns(source, df, COLUMN_MAP)
    for source, df in BRONZE_DFS.items()
}
print("\n---\nColumn standardization complete.")


Column standardization starting...

 - dallas: ['animal_id', 'animal_type', 'breed', 'kennel_number', 'kennel_status', 'tag_type', 'activity_number', 'activity_sequence', 'source_id', 'census_tract', 'council_district', 'intake_type', 'intake_subtype', 'intake_total', 'intake_reason', 'staff_id', 'intake_date', 'intake_time', 'due_out', 'intake_condition', 'hold_request', 'outcome_type', 'outcome_subtype', 'outcome_date', 'outcome_time', 'receipt_number', 'impound_number', 'service_request_number', 'outcome_condition', 'chip_status', 'animal_origin', 'additional_information', 'month', 'year']
 - san_jose: ['_id', 'animal_id', 'animalname', 'animal_type', 'primary_color', 'secondarycolor', 'breed', 'sex', 'dob', 'age', 'intake_date', 'intake_condition', 'intake_type', 'intakesubtype', 'intake_reason', 'outcome_date', 'outcome_type', 'outcomesubtype', 'outcomecondition', 'crossing', 'jurisdiction', 'lastupdate']
 - soco: ['name', 'animal_type', 'breed', 'primary_color', 'sex', 'size', 'd

## Data Type Enforcement & Value Harmonization

Apply consistent data types and standardize categorical values across sources.
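
To make the two kinds of conversion concrete, here is a small sketch on invented rows: a `category` cast for a repeated label and a coercing datetime parse. The sample values are assumptions, not Bronze data.

```python
import pandas as pd

# Toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "intake_type": ["STRAY", "STRAY", "FOSTER"],
    "intake_date": ["2024-01-05", "not a date", None],
})

# Repeated labels are stored once as categories.
toy["intake_type"] = toy["intake_type"].astype("category")
# errors="coerce" turns unparseable or missing entries into NaT instead of raising.
toy["intake_date"] = pd.to_datetime(toy["intake_date"], errors="coerce")

print(toy.dtypes)
print(int(toy["intake_date"].isna().sum()))  # 2
```

Coercion trades loud failures for silent `NaT` values, so counting the resulting missing dates is a cheap sanity check.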

In [23]:
# ─── SILVER DTYPE MAPPING ───
# @Lina Add in your explicit dtypes here 
SILVER_DTYPES = {
    'intake_type'     : 'category',
    'intake_condition': 'category',
    'intake_reason'   : 'object',
    'intake_date'     : 'datetime64[ns]',
    'outcome_type'    : 'category',
    'outcome_date'    : 'datetime64[ns]',
}

VALUE_MAPPINGS = {

    # @ Lina do your bucketing in here

    # ==== INTAKE TYPE ====

    'intake_type': {
    # Born at facility
    "BORN HERE"           : "born_at_shelter",    # SO
    # Confiscated/Legal
    "CONFISCATE"          : "confiscated",        # SJ, SO
    "CONFISCATED"         : "confiscated",        # DA
    # Disposal/Euthanasia requests
    "DISPO REQ"           : "disposal_request",   # SJ
    "DISPOS REQ"          : "disposal_request",   # DA
    "EUTH REQ"            : "euthanasia_request", # SJ
    # Foster
    "FOSTER"              : "foster",             # DA, SJ
    # Protective custody/Quarantine
    "KEEPSAFE"            : "protective_custody", # DA
    "QUARANTINE"          : "protective_custody", # SO
    # Resource/Treatment
    "RESOURCE"            : "treatment",          # DA
    "TREATMENT"           : "treatment",          # DA
    # Return to owner
    "RETURN"              : "return_to_owner",    # SJ
    # Spay/Neuter services
    "NEUTER"              : "spay_neuter",        # SJ
    "S/N CLINIC"          : "spay_neuter",        # SJ
    "SPAY"                : "spay_neuter",        # SJ
    # Stray/TNR
    "STRAY"               : "stray",              # DA, SJ, SO
    "TNR"                 : "stray",              # DA
    # Surrender/Returns
    "ADOPTION RETURN"     : "surrender",          # SO
    "OS APPT"             : "surrender",          # SO
    "OWNER SUR"           : "surrender",          # SJ
    "OWNER SURRENDER"     : "surrender",          # DA, SO
    # Transfer
    "TRANSFER"            : "transfer",           # DA, SJ, SO
    # Wildlife
    "WILDLIFE"            : "wildlife"            # DA, SJ
    },
    
    # ==== INTAKE CONDITION ====

    'intake_condition': {
    # Age-related
    "GERIATRIC"           : "age_related",        # DA
    "UNDERAGE"            : "age_related",        # DA
    # Behavioral
    "AGGRESSIVE"          : "behavioral",         # SJ
    "BEH M"               : "behavioral",         # SJ
    "BEH R"               : "behavioral",         # SJ
    "BEH U"               : "behavioral",         # SJ
    "FERAL"               : "behavioral",         # SJ
    # Critical/Severe
    "CRITICAL"            : "critical",           # DA
    "FATAL"               : "critical",           # DA
    "UNTREATABLE"         : "critical",           # SC
    # Deceased
    "DECEASED"            : "deceased",           # DA
    "DEAD"                : "deceased",           # SJ
    # Healthy/Normal
    "APP WNL"             : "healthy",            # DA
    "NORMAL"              : "healthy",            # DA
    "HEALTHY"             : "healthy",            # SJ/SC
    # Medical
    "APP INJ"             : "medical",            # DA
    "APP SICK"            : "medical",            # DA
    "MED EMERG"           : "medical",            # SJ
    "MED M"               : "medical",            # SJ
    "MED R"               : "medical",            # SJ
    "MED SEV"             : "medical",            # SJ
    "TREATABLE/MANAGEABLE": "medical",            # SC
    "TREATABLE/REHAB"     : "medical",            # SC
    # Reproductive
    "NURSING"             : "reproductive",       # SJ
    "PREGNANT"            : "reproductive",       # SJ
    # Unknown/Other
    "UNKNOWN"             : "unknown"             # SC
    },
    
    # ==== INTAKE REASON ====

    'intake_reason': {
    # Adoption related
    "FOR ADOPT"           : "for_adoption",       # DA
    "FOR PLCMNT"          : "for_adoption",       # DA
    "IP ADOPT"            : "for_adoption",       # SJ
    # Behavioral issues
    "BEHAVIOR"            : "behavior",           # DA
    "AGG ANIMAL"          : "behavior",           # SJ
    "AGG PEOPLE"          : "behavior",           # SJ
    "BITES"               : "behavior",           # SJ
    "CHASES ANI"          : "behavior",           # SJ
    "DESTRUC IN"          : "behavior",           # SJ
    "ESCAPES"             : "behavior",           # SJ
    "HOUSE SOIL"          : "behavior",           # SJ
    "HYPER"               : "behavior",           # SJ
    "NOFRIENDLY"          : "behavior",           # SJ
    "PICA"                : "behavior",           # SJ
    # Breeding restrictions
    "BREED REST"          : "breed_restriction",  # DA
    # Euthanasia/Death
    "OWR REQ EU"          : "owner_requested_euthanasia", # DA
    "IP EUTH"             : "owner_requested_euthanasia", # SJ
    # Medical
    "MEDICAL"             : "medical",            # DA
    "SURGERY"             : "medical",            # DA
    "VET CARE"            : "medical",            # DA
    # Other/Miscellaneous
    "OTHER"               : "other",              # DA
    "OTHRINTAKS"          : "other",              # DA
    # Owner surrender - Housing/Financial
    "CANTAFFORD"          : "owner_surrender",    # DA
    "EVICTION"            : "owner_surrender",    # DA
    "FINANCIAL"           : "owner_surrender",    # DA
    "HOUSING"             : "owner_surrender",    # DA
    "LLCONFLICT"          : "owner_surrender",    # DA
    "LOSSHOUSNG"          : "owner_surrender",    # DA
    "PETDEPFEE"           : "owner_surrender",    # DA
    "LANDLORD"            : "owner_surrender",    # SJ
    "MOVE"                : "owner_surrender",    # SJ
    "NO HOME"             : "owner_surrender",    # SJ
    # Owner surrender - Personal/Life circumstances
    "OWR DEATH"           : "owner_surrender",    # DA
    "PERLIFECNG"          : "owner_surrender",    # DA
    "PERSNLISSU"          : "owner_surrender",    # DA
    "TEMLIFECNG"          : "owner_surrender",    # DA
    "ALLERGIC"            : "owner_surrender",    # SJ
    "CHILD PROB"          : "owner_surrender",    # SJ
    "NO TIME"             : "owner_surrender",    # SJ
    "OWNER DIED"          : "owner_surrender",    # SJ
    "OWNER PROB"          : "owner_surrender",    # SJ
    "TRAVEL"              : "owner_surrender",    # SJ
    # Owner surrender - Pet management
    "NOTRIGHTFT"          : "owner_surrender",    # DA
    "ATTENTION"           : "owner_surrender",    # SJ
    "OTHER PET"           : "owner_surrender",    # SJ
    "TOO BIG"             : "owner_surrender",    # SJ
    "TOO MANY"            : "owner_surrender",    # SJ
    # Stray/Found
    "STRAY"               : "stray",              # DA
    # Temporary/Short-term
    "SHORT-TERM"          : "temporary_care",     # DA
    # TNR/Clinic
    "TNR CLINIC"          : "trap_neuter_return", # DA
    # Transfers
    "TRANSFER"            : "transfer"            # DA
    },

    # ==== OUTCOME TYPE ====
    
    'outcome_type': {
    # Adoption & Rescue
    "ADOPTION"            : "adoption",           # DA, SJ, SO
    "RESCUE"              : "adoption",           # SJ
    # Death/Euthanasia
    "DIED"                : "deceased",           # DA, SJ, SO
    "EUTH"                : "euthanasia",         # SJ
    "EUTHANIZE"           : "euthanasia",         # SO
    "EUTHANIZED"          : "euthanasia",         # DA
    "REQ EUTH"            : "euthanasia",         # SJ
    # Disposal/Other deaths
    "DISPOSAL"            : "disposal",           # DA, SJ, SO
    # Escaped/Missing/Lost
    "ESCAPED/STOLEN"      : "escaped",            # SO
    "FOUND ANIM"          : "found",              # SJ
    "FOUND EXP"           : "found",              # DA
    "LOST EXP"            : "lost",               # DA, SJ
    "MISSING"             : "lost",               # DA, SJ
    # Foster
    "FOSTER"              : "foster",             # DA, SJ
    # Medical/Treatment
    "TREATMENT"           : "treatment",          # DA
    "VET"                 : "treatment",          # SO
    # Other/Closed/Unknown
    "CLOSED"              : "other",              # DA
    "OTHER"               : "other",              # DA
    # Return to Owner
    "RETURN TO OWNER"     : "return_to_owner",    # SO
    "RETURNED TO OWNER"   : "return_to_owner",    # DA
    "RTF"                 : "return_to_field",    # SJ
    "RTO"                 : "return_to_owner",    # SJ
    "RTOS"                : "return_to_owner",    # SO
    # Spay/Neuter Services
    "NEUTER"              : "spay_neuter",        # SJ
    "SNR"                 : "spay_neuter",        # DA
    "SPAY"                : "spay_neuter",        # SJ
    # TNR/Release
    "TNR"                 : "trap_neuter_release",# DA
    # Transfer
    "TRANSFER"            : "transfer",           # DA, SJ, SO
    # Wildlife
    "WILDLIFE"            : "wildlife"            # DA
    }
}


**Important Note:** The cell below reveals inconsistencies across datasets where identical concepts are represented with slight variations (for example, "CONFISCATED" in Dallas versus "CONFISCATE" in San Jose, or "DISPOS REQ" versus "DISPO REQ"). Left unharmonized, these variations would fragment the data in downstream analysis.

The `apply_silver_transforms` function addresses these issues by:

1. **Enforcing uniform data types** across all datasets
2. **Standardizing categorical values** using the VALUE_MAPPINGS dictionary
3. **Validating temporal data** and handling future dates
4. **Gracefully handling missing columns** across different data sources

This ensures all datasets share a common vocabulary and data structure for reliable analysis.
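
The value-standardization step can be sketched on a toy Series: normalize the text, look it up in a mapping, and send anything unmapped to a fallback bucket. `toy_mapping` below is a three-entry excerpt in the spirit of `VALUE_MAPPINGS`, and the raw values are invented.

```python
import pandas as pd

toy_mapping = {
    "CONFISCATE": "confiscated",
    "CONFISCATED": "confiscated",
    "STRAY": "stray",
}
raw = pd.Series([" confiscate", "CONFISCATED", "Stray ", "ALIEN", None])

# Strip whitespace and uppercase so lookups ignore formatting differences.
normalized = raw.astype(str).str.strip().str.upper()
# Values without a mapping entry become NaN, then fall back to "other".
harmonized = normalized.map(toy_mapping).fillna("other")

print(harmonized.tolist())
# ['confiscated', 'confiscated', 'stray', 'other', 'other']
```

One consequence of this approach is that missing values also end up in the fallback bucket, so they are no longer distinguishable from unmapped labels afterwards.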

In [24]:
print("=" * 80)
print("VALUE HARMONIZATION: BEFORE vs AFTER")
print("=" * 80)

categorical_cols = ['intake_type', 'intake_condition', 'intake_reason', 'outcome_type']

for source, df in CLEAN_DFS.items():
    print(f"\n{source.upper()} DATASET:")
    for col in categorical_cols:
        if col in df.columns:
            unique_values = df[col].dropna().unique()
            print(f"   BEFORE {col}: {sorted(unique_values)}")

VALUE HARMONIZATION: BEFORE vs AFTER

DALLAS DATASET:
   BEFORE intake_type: ['CONFISCATED', 'DISPOS REQ', 'FOSTER', 'KEEPSAFE', 'OWNER SURRENDER', 'RESOURCE', 'STRAY', 'TNR', 'TRANSFER', 'TREATMENT', 'WILDLIFE']
   BEFORE intake_condition: ['APP INJ', 'APP SICK', 'APP WNL', 'CRITICAL', 'DECEASED', 'FATAL', 'GERIATRIC', 'NORMAL', 'UNDERAGE']
   BEFORE intake_reason: ['BEHAVIOR', 'BREED REST', 'CANTAFFORD', 'EVICTION', 'FINANCIAL', 'FOR ADOPT', 'FOR PLCMNT', 'HOUSING', 'LLCONFLICT', 'LOSSHOUSNG', 'MEDICAL', 'NOTRIGHTFT', 'OTHER', 'OTHRINTAKS', 'OWR DEATH', 'OWR REQ EU', 'PERLIFECNG', 'PERSNLISSU', 'PETDEPFEE', 'SHORT-TERM', 'STRAY', 'SURGERY', 'TEMLIFECNG', 'TNR CLINIC', 'TRANSFER', 'VET CARE']
   BEFORE outcome_type: ['ADOPTION', 'CLOSED', 'DIED', 'DISPOSAL', 'EUTHANIZED', 'FOSTER', 'FOUND EXP', 'LOST EXP', 'MISSING', 'OTHER', 'RETURNED TO OWNER', 'SNR', 'TNR', 'TRANSFER', 'TREATMENT', 'WILDLIFE']

SAN_JOSE DATASET:
   BEFORE intake_type: ['CONFISCATE', 'DISPO REQ', 'EUTH REQ', 'FOSTER

-----

## 4. Data Merging & Harmonization

**Purpose:**  
Stack our fully-cleaned “Silver” tables into one master table, tag each row with its region, and enforce a consistent column order:
- Ensure correct dtypes.
- Bucket like values to reduce dimensionality.
- Reindex to the common `FINAL_SCHEMA` column list.

Result: a single `silver_df` ready for analysis or Gold-layer transforms.

In [25]:
def apply_silver_transforms(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """
    Apply comprehensive silver-layer transformations to a DataFrame.
    
    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with standardized columns
    source : str
        Source identifier for provenance tracking
        
    Returns
    -------
    pd.DataFrame
        Transformed DataFrame with harmonized values and proper types
    """
    df = df.copy()
    
    # Add provenance
    df['region'] = source
    
    # Ensure intake_reason column exists
    if 'intake_reason' not in df.columns:
        df['intake_reason'] = pd.NA
    
    # Apply data types
    for col, dtype in SILVER_DTYPES.items():
        if col in df.columns:
            if dtype == 'datetime64[ns]':
                df[col] = pd.to_datetime(df[col], errors='coerce')
            else:
                df[col] = df[col].astype(dtype)
    
    # Data validation: Check for future dates
    current_date = pd.Timestamp.now().normalize()
    date_columns = ['intake_date', 'outcome_date']
    
    for col in date_columns:
        if col in df.columns:
            future_dates = df[col] > current_date
            if future_dates.any():
                future_count = future_dates.sum()
                max_future_date = df.loc[future_dates, col].max()
                print(f"WARNING: Found {future_count:,} future dates in {col} for {source}")
                print(f"         Latest future date: {max_future_date.date()}")
                print(f"         Setting future dates to NaT (Not a Time)")
                
                # Set future dates to NaT
                df.loc[future_dates, col] = pd.NaT
    
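    # Harmonize categorical values
    for col, mapping in VALUE_MAPPINGS.items():
        if col in df.columns:
            # Normalize text before mapping
            normalized = df[col].astype(str).str.strip().str.upper()
            # Unmapped and missing values fall back to a default bucket
            default = 'unknown' if col == 'intake_reason' else 'other'
            harmonized = normalized.map(mapping).fillna(default)
            # .map() returns object dtype, so re-apply the declared silver dtype
            df[col] = harmonized.astype(SILVER_DTYPES.get(col, 'object'))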
    
    return df
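
The future-date check inside `apply_silver_transforms` can be tried in isolation. The sketch below uses a fixed, made-up cutoff so the result is reproducible, whereas the function itself compares against the current date.

```python
import pandas as pd

# Fixed cutoff chosen for the example; the notebook uses pd.Timestamp.now().normalize().
cutoff = pd.Timestamp("2025-05-10")
dates = pd.Series(pd.to_datetime(["2024-12-31", "2025-09-27", None]))

is_future = dates > cutoff       # comparisons against NaT evaluate to False
cleaned = dates.copy()
cleaned.loc[is_future] = pd.NaT  # blank out only the future entries

print(int(is_future.sum()))       # 1
print(int(cleaned.isna().sum()))  # 2
```

Rows that were already missing are left untouched, so the count of `NaT` values after cleaning is the original missing count plus the number of future dates.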


In [26]:
# Apply transformations
SILVER_DFS = {
    source: apply_silver_transforms(df, source)
    for source, df in CLEAN_DFS.items()
}

print("Silver transformations applied successfully!")

         Latest future date: 2025-09-27
         Setting future dates to NaT (Not a Time)
Silver transformations applied successfully!


In [None]:
print("\n" + "-" * 50)
print("AFTER VALUE HARMONIZATION:")
print("-" * 50)

for source, df in SILVER_DFS.items():
    print(f"\n{source.upper()} DATASET:")
    for col in categorical_cols:
        if col in df.columns:
            unique_values = df[col].dropna().unique()
            print(f"   AFTER: {col}: {sorted(unique_values)}")

print(f"\nVALUE HARMONIZATION SUMMARY:")
print(f"   - Mapped {len(VALUE_MAPPINGS)} categorical variables to consistent values")
print(f"   - Unified terminology across Dallas, San Jose, and Sonoma County datasets")
print(f"   - Ready for cross-shelter analysis")


--------------------------------------------------
AFTER VALUE HARMONIZATION:
--------------------------------------------------

DALLAS DATASET:
   AFTER: intake_type: ['confiscated', 'disposal_request', 'foster', 'protective_custody', 'stray', 'surrender', 'transfer', 'treatment', 'wildlife']
   AFTER: intake_condition: ['age_related', 'critical', 'deceased', 'healthy', 'medical']
   AFTER: intake_reason: ['behavior', 'breed_restriction', 'for_adoption', 'medical', 'other', 'owner_requested_euthanasia', 'owner_surrender', 'stray', 'temporary_care', 'transfer', 'trap_neuter_return', 'unknown']
   AFTER: outcome_type: ['adoption', 'deceased', 'disposal', 'euthanasia', 'foster', 'found', 'lost', 'other', 'return_to_owner', 'spay_neuter', 'transfer', 'trap_nueter_release', 'treatment', 'wildlife']

SAN_JOSE DATASET:
   AFTER: intake_type: ['confiscated', 'disposal_request', 'euthanasia_request', 'foster', 'return_to_owner', 'spay_neuter', 'stray', 'surrender', 'transfer', 'wildlife']
  

### Data Integration & Quality Checks

Merge all sources into a unified silver dataset and perform quality validation.

In [28]:
FINAL_SCHEMA = [
    "animal_id", "animal_type", "breed", "primary_color", "age", "date_of_birth", "sex",
    "intake_type", "intake_condition", "intake_reason", "intake_date",
    "outcome_type", "outcome_date", "region"
]

In [29]:
def create_silver_dataset(dataframes: dict[str, pd.DataFrame], schema: list[str]) -> pd.DataFrame:
    """
    Combine multiple source DataFrames into unified silver dataset.
    
    Parameters
    ----------
    dataframes : dict[str, pd.DataFrame]
        Source DataFrames to combine
    schema : list[str]
        Final column schema to enforce
        
    Returns
    -------
    pd.DataFrame
        Unified silver dataset
    """
    # Combine all sources
    combined = pd.concat(dataframes.values(), ignore_index=True, sort=False)
    
    # Enforce schema
    return (
        combined
        .reindex(columns=schema)
        # .drop_duplicates()  # TBD: disabled, since dropping duplicates may remove legitimate repeat intakes
        .reset_index(drop=True)
    )
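
The concat-then-reindex pattern used in `create_silver_dataset` can be seen on two toy frames with different columns. The frames and `toy_schema` are invented for illustration.

```python
import pandas as pd

dallas_toy = pd.DataFrame({
    "animal_id": ["A1"], "breed": ["LAB"],
    "kennel_number": ["K1"], "region": ["dallas"],
})
soco_toy = pd.DataFrame({"animal_id": ["B7"], "sex": ["F"], "region": ["soco"]})
toy_schema = ["animal_id", "breed", "sex", "region"]

combined = (
    pd.concat([dallas_toy, soco_toy], ignore_index=True, sort=False)
    # reindex drops columns outside the schema and adds absent ones as all-NaN
    .reindex(columns=toy_schema)
)
print(combined)
```

Here `kennel_number` is dropped because it is not in the schema, while `breed` and `sex` are filled with missing values for the source that lacks them. This is consistent with the high missing rates reported for columns that only some sources provide.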

In [30]:
# Here we create the final silver dataset
silver_df = create_silver_dataset(SILVER_DFS, FINAL_SCHEMA)

print(f"Silver dataset created: {silver_df.shape[0]:,} records × {silver_df.shape[1]} columns")
print(f"Duplicates removed: {sum(df.shape[0] for df in SILVER_DFS.values()) - silver_df.shape[0]:,}")

Silver dataset created: 111,907 records × 14 columns
Duplicates removed: 0
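
Since the `.drop_duplicates()` call is currently disabled, the count above is always zero. One possible way to remove exact repeats while keeping genuine repeat intakes is to deduplicate on a key that includes the intake date. This is only a sketch under that assumption; the key columns chosen here are a suggestion, not a decision made in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A1", "B2"],
    "intake_date": pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-06-01", "2024-01-05"]
    ),
    "region": ["dallas"] * 4,
})

# Assumed key: the same animal, on the same date, in the same region is one intake.
key = ["animal_id", "intake_date", "region"]
deduped = toy.drop_duplicates(subset=key).reset_index(drop=True)

print(len(toy) - len(deduped))  # 1
```

The second intake of `A1` on a later date survives because its key differs, which is the behavior the TBD note is concerned about.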


### Data Quality Assessment

Comprehensive quality checks and data profiling.

In [31]:
def generate_data_overview(df: pd.DataFrame) -> None:
    """
    Generate comprehensive data quality overview.
    
    Parameters
    ----------
    df : pd.DataFrame
        Dataset to profile
    """
    print("=" * 60)
    print("DATA QUALITY PROFILE")
    print("=" * 60)
    
    # Dataset overview
    print(f"\nDATASET OVERVIEW")
    print(f"Total records: {df.shape[0]:,}")
    print(f"Total columns: {df.shape[1]}")
    
    # Missing data analysis
    print(f"\nMISSING DATA ANALYSIS")
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df) * 100).round(3)
    
    for col in missing_data.index:
        if missing_data[col] > 0:
            # TODO: consider printing missing_pct with .4f precision
            print(f"  {col}: {missing_data[col]:,} ({missing_pct[col]:.3f}%)")
    
    # Cardinality analysis
    print(f"\nCARDINALITY ANALYSIS")
    cardinality = df.nunique().sort_values(ascending=False)
    for col, count in cardinality.items():
        print(f"  {col}: {count:,} unique values")
    
    # Categorical distributions
    categorical_cols = ['intake_type', 'intake_condition', 'intake_reason', 'outcome_type', 'animal_type']
    
    for col in categorical_cols:
        if col in df.columns:
            print(f"\n{col.upper()} DISTRIBUTION")
            dist = df[col].value_counts(normalize=True).head(10)
            for value, pct in dist.items():
                print(f"  {value}: {pct:.1%}")
    
    # Temporal analysis
    print(f"\nTEMPORAL ANALYSIS")
    if 'intake_date' in df.columns:
        date_range = df['intake_date'].agg(['min', 'max'])
        print(f"  Intake date range: {date_range['min'].date()} to {date_range['max'].date()}")
        
        # Monthly trends ('ME' = month end; the 'M' alias is deprecated since pandas 2.2)
        monthly = df.set_index('intake_date').resample('ME').size()
        print(f"  Average monthly intake: {monthly.mean():.0f} animals")
        print(f"  Peak month: {monthly.idxmax().strftime('%B %Y')} ({monthly.max():,} animals)")

In [32]:
# Let's generate the data profile for the silver dataset
generate_data_overview(silver_df)

DATA QUALITY PROFILE

DATASET OVERVIEW
Total records: 111,907
Total columns: 14

MISSING DATA ANALYSIS
  breed: 39 (0.035%)
  primary_color: 65,079 (58.155%)
  age: 95,633 (85.458%)
  date_of_birth: 88,783 (79.336%)
  sex: 65,079 (58.155%)
  intake_date: 1 (0.001%)
  outcome_date: 2,134 (1.907%)

CARDINALITY ANALYSIS
  animal_id: 91,523 unique values
  date_of_birth: 6,572 unique values
  intake_date: 3,984 unique values
  outcome_date: 3,599 unique values
  breed: 1,235 unique values
  primary_color: 395 unique values
  age: 64 unique values
  outcome_type: 16 unique values
  intake_type: 13 unique values
  intake_reason: 12 unique values
  sex: 10 unique values
  intake_condition: 8 unique values
  animal_type: 6 unique values
  region: 3 unique values

INTAKE_TYPE DISTRIBUTION
  stray: 59.3%
  surrender: 12.1%
  foster: 11.4%
  confiscated: 6.5%
  treatment: 3.0%
  disposal_request: 2.2%
  wildlife: 1.8%
  protective_custody: 1.7%
  spay_neuter: 0.9%
  transfer: 0.7%

INTAKE_CONDITI



-----