# Silver Data Cleaning

**Purpose:** Clean raw data from the [Bronze](./1_bronze.ipynb) layer into a single, unified data asset. This includes column standardization, data type enforcement, value harmonization, removal of duplicate columns, and provenance tracking.

**Transformations Applied:**
- **Standardize** column names to lowercase snake_case
- **Tag** each row with its source region for provenance
- **Harmonize** categorical values across data sources
- **Enforce** consistent data types

The Silver dataset feeds the [Gold](./3_gold.ipynb) layer, where tailored data assets are built to answer specific questions efficiently.


For more on Medallion Architecture, see [Databricks Glossary: Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

---

### References  
Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from https://www.databricks.com/glossary/medallion-architecture


-----

## Table of Contents

1. [Setup](#setup)  
   Install required packages and import libraries.

2. [Configuration & Data Loading](#configuration--data-loading)  
   Centralize file paths and source names, then ingest the raw Bronze datasets into pandas.

3. [Define Helper Functions](#define-helper-functions)  
   Define all cleaning and enrichment transforms as modular functions—date anomaly filters, age parsers, imputation routines, etc.

4. [Data Cleaning & Standardization](#data-cleaning--standardization)  
   Harmonize column names, drop duplicates, and enforce schema across sources.

5. [Value Mapping & Data Type Enforcement](#value-mapping--data-type-enforcement)  
   Apply categorical/value mappings and cast explicit dtypes for Silver.

6. [Execute Transformations](#execute-transformations)  
   Run each helper function in sequence to clean and enrich the DataFrame.

7. [Create Silver and Exploratory Checks](#create-silver-and-quick-exploratory-checks)  
   Inspect missingness, distributions, date ranges, and trends to validate Silver.

-----

## 1. Setup

**Purpose:**  
Ensure the environment has all necessary libraries installed and imported.  
```python
# Install project-wide dependencies
%pip install -r ../../requirements.txt
``` 

> **Note:** We use a project-wide `requirements.txt` for consistency.

In [38]:
%pip install -r ../../requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [39]:
import os
import pandas as pd
import numpy as np

-----
## 2. Configuration & Data Loading

**Purpose:**
Centralize the Bronze directory, file-name template, and source names, then load every raw Bronze source into pandas.

In [40]:
# Data source configurations
BRONZE_DIR = "../../data-assets/bronze"
BRONZE_FILE_NAME = "{}_df.parquet"

# Load all the Bronze datasets
BRONZE_FILES = ["dallas", "san_jose", "soco"]
BRONZE_FILE_PATHS = {
    file: os.path.join(BRONZE_DIR, BRONZE_FILE_NAME.format(file)) for file in BRONZE_FILES 
}
BRONZE_DFS = {
    file: pd.read_parquet(path) for file, path in BRONZE_FILE_PATHS.items()
}

-----
## 3. Define Helper Functions

**Purpose:**
Define the reusable functions that perform each cleaning, validation, and imputation step.
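
As a minimal sketch of the kind of parsing these helpers perform, the snippet below converts free-text age strings to years on a toy Series. The sample values and the names `raw_age` and `age_years` are illustrative assumptions and are not taken from the Bronze data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real sources may use other formats.
raw_age = pd.Series(["6 MONTHS", "2 YEARS", "1.5 years", None, "unknown"])

# Normalize whitespace and case; missing values stay missing.
age_str = raw_age.str.strip().str.upper()

# Capture the numeric part and an optional unit.
parts = age_str.str.extract(r"(?P<value>\d+\.?\d*)\s*(?P<unit>YEARS?|MONTHS?)?")
value = pd.to_numeric(parts["value"], errors="coerce")

# Months are divided by 12; any other (or missing) unit is treated as years.
is_month = parts["unit"].isin(["MONTH", "MONTHS"])
age_years = pd.Series(np.where(is_month, value / 12, value), index=raw_age.index)

print(age_years.tolist())
```

Values without a leading number (such as `"unknown"`) come out as `NaN`, which leaves them available for later imputation.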

In [52]:
# ─── Data Cleaning ───

# Function to apply the column mapping 
def standardize_columns(source: str, df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """
    Standardize DataFrame column names.

    Parameters
    ----------
    source : str
        Label for the data source; used only in the printed column summary.
    df : pandas.DataFrame
        The raw DataFrame whose columns need standardization to enable better
        analysis.
    mapping : dict
        A dict where keys are original column names (exact match) and
        values are the desired standardized names (snake_case).

    Returns
    -------
    pandas.DataFrame
        A copy of `df` with:
        1. Columns renamed according to `mapping`.
        2. All column names converted to lowercase.
        3. Any duplicate column names (arising when multiple originals map
           to the same new name) removed—only the first occurrence is kept.

    Notes
    -----
    - Columns not present in `mapping` are left unchanged (apart from lowercasing).
    - Renaming happens before lowercasing, so mapping keys are case-sensitive.
    - Dropping duplicate columns avoids collisions in downstream code.
    """
    # Apply the renaming mapping
    df = df.rename(columns=mapping)
    # Convert all column names to lowercase
    df.columns = df.columns.str.lower()
    # Remove duplicate columns, keeping the first occurrence
    df = df.loc[:, ~df.columns.duplicated()]
    print(f" - {source}: {list(df.columns)}")
    return df


def apply_silver_transforms(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """
    Apply comprehensive silver-layer transformations to a DataFrame.
    
    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with standardized columns
    source : str
        Source identifier for provenance tracking
        
    Returns
    -------
    pd.DataFrame
        Transformed DataFrame with harmonized values and proper types
    """
    # Copy to avoid modifying the original DataFrame
    df = df.copy()
    
    # Add provenance
    df['region'] = source
    
    # Ensure intake_reason column exists
    if 'intake_reason' not in df.columns:
        df['intake_reason'] = pd.NA
    
    # Apply data types
    for col, dtype in SILVER_DTYPES.items():
        if col in df.columns:
            if dtype == 'datetime64[ns]':
                df[col] = pd.to_datetime(df[col], errors='coerce')
            elif col == "age":
                # Clean age strings like "6 MONTHS", "2 YEARS", etc.
                age_str = df[col].astype(str).str.strip().str.upper()

                extracted = age_str.str.extract(r'(?P<value>\d+\.?\d*)\s*(?P<unit>YEARS?|MONTHS?)?')
                extracted['value'] = extracted['value'].astype(float)

                # Convert months to years if unit is MONTH(S); any other unit is treated as years
                is_month = extracted['unit'].isin(['MONTH', 'MONTHS'])
                df[col] = extracted['value'].where(~is_month, extracted['value'] / 12)
            else:
                df[col] = df[col].astype(dtype)
    
    # Data validation: Check for future dates
    current_date = pd.Timestamp.now().normalize()
    date_columns = ['intake_date', 'outcome_date']
    
    for col in date_columns:
        if col in df.columns:
            future_dates = df[col] > current_date
            if future_dates.any():
                future_count = future_dates.sum()
                max_future_date = df.loc[future_dates, col].max()
                print(f"WARNING: Found {future_count:,} future dates in {col} for {source}")
                print(f"         Latest future date: {max_future_date.date()}")
                print(f"         Setting future dates to NaT (Not a Time)")
                
                # Set future dates to NaT
                df.loc[future_dates, col] = pd.NaT
    
    # Harmonize categorical values
    for col, mapping in VALUE_MAPPINGS.items():
        if col in df.columns:
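            # Normalize text before mapping. Note: astype(str) turns missing values into
            # literal strings (e.g. "NAN"), so they receive the fallback label as well.
            normalized = df[col].astype(str).str.strip().str.upper()
            fallback = 'unknown' if col == 'intake_reason' else 'other'
            df[col] = normalized.map(mapping).fillna(fallback)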
    
    return df

def create_silver_dataset(dataframes: dict[str, pd.DataFrame], schema: list[str]) -> pd.DataFrame:
    """
    Combine multiple source DataFrames into unified silver dataset.
    
    Parameters
    ----------
    dataframes : dict[str, pd.DataFrame]
        Source DataFrames to combine
    schema : list[str]
        Final column schema to enforce
        
    Returns
    -------
    pd.DataFrame
        Unified silver dataset
    """
    # Combine all sources
    combined = pd.concat(dataframes.values(), ignore_index=True, sort=False)
    
    # Enforce schema
    return (
        combined
        .reindex(columns=schema)
        # .drop_duplicates()  # Disabled for now: it may discard legitimate repeat intakes (TBD)
        .reset_index(drop=True)
    )


def generate_data_overview(df: pd.DataFrame) -> None:
    """
    Generate comprehensive data quality overview.
    
    Parameters
    ----------
    df : pd.DataFrame
        Dataset to profile
    """
    print("=" * 60)
    print("DATA QUALITY PROFILE")
    print("=" * 60)
    
    # Dataset overview
    print(f"\nDATASET OVERVIEW")
    print(f"Total records: {df.shape[0]:,}")
    print(f"Total columns: {df.shape[1]}")
    
    # Missing data analysis
    print(f"\nMISSING DATA ANALYSIS")
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df) * 100).round(3)
    
    for col in missing_data.index:
        if missing_data[col] > 0:
            # TODO: consider reporting missing_pct with .4f precision
            print(f"  {col}: {missing_data[col]:,} ({missing_pct[col]:.3f}%)")
    
    # Cardinality analysis
    print(f"\nCARDINALITY ANALYSIS")
    cardinality = df.nunique().sort_values(ascending=False)
    for col, count in cardinality.items():
        print(f"  {col}: {count:,} unique values")
    
    # Categorical distributions
    categorical_cols = ['intake_type', 'intake_condition', 'intake_reason', 'outcome_type', 'animal_type']
    
    for col in categorical_cols:
        if col in df.columns:
            print(f"\n{col.upper()} DISTRIBUTION")
            dist = df[col].value_counts(normalize=True).head(10)
            for value, pct in dist.items():
                print(f"  {value}: {pct:.1%}")
    
    # Temporal analysis
    print(f"\nTEMPORAL ANALYSIS")
    if 'intake_date' in df.columns:
        date_range = df['intake_date'].agg(['min', 'max'])
        print(f"  Intake date range: {date_range['min'].date()} to {date_range['max'].date()}")
        
        # Monthly trends (pandas >= 2.2 deprecates the 'M' alias in favor of 'ME' for month-end)
        monthly = df.set_index('intake_date').resample('M').size()
        print(f"  Average monthly intake: {monthly.mean():.0f} animals")
        print(f"  Peak month: {monthly.idxmax().strftime('%B %Y')} ({monthly.max():,} animals)")

# Lina: compute age (in years) from intake date and date of birth
def compute_age_from_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing `age` values with (intake_date - date_of_birth), expressed in years."""
    # Copy the DataFrame to avoid modifying the original
    df = df.copy()

    if {"age", "intake_date", "date_of_birth"}.issubset(df.columns):
        # Convert date columns to datetime if not already
        df["intake_date"] = pd.to_datetime(df["intake_date"], errors='coerce')
        df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors='coerce')

        # Create a mask for the rows where age is missing but both intake_date and date_of_birth are available
        mask = df["age"].isna() & df["intake_date"].notna() & df["date_of_birth"].notna()
        
        df.loc[mask, "age"] = (df.loc[mask, "intake_date"] - df.loc[mask, "date_of_birth"]).dt.days / 365.25
        print(f"Computed age for {mask.sum()} rows")
    return df

# Lina: impute missing ages using the species-specific median
def impute_missing_age(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "animal_type" in df.columns and "age" in df.columns:
        for species in df["animal_type"].dropna().unique():
            species_mask = df["animal_type"] == species
            median_age = df.loc[species_mask, "age"].median()
            missing_mask = species_mask & df["age"].isna()
            df.loc[missing_mask, "age"] = median_age
            print(f"Imputed {missing_mask.sum()} missing ages for species: {species} (median={median_age:.2f})")
    return df


# Lina: bin ages by life stage (puppy/kitten, adult, senior)
def bin_age_into_life_stages(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    def categorize(row):
        if pd.isna(row["age"]) or pd.isna(row["animal_type"]):
            return pd.NA
        if row["age"] < 0.5:
            return "puppy" if row["animal_type"] == "dog" else (
                   "kitten" if row["animal_type"] == "cat" else pd.NA)
        elif row["age"] < 7:
            return "adult"
        else:
            return "senior"
    df["age"] = df.apply(categorize, axis=1).astype("category")
    print("Binned age into categories: puppy/kitten, adult, senior")
    return df

# Lina: harmonize male and female labels
def recatogarize_sex(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    male_terms = ['MALE', 'Male', 'NEUTERED', 'Neutered']
    female_terms = ['FEMALE', 'Female', 'SPAYED', 'Spayed']
    
    df["sex"] = df["sex"].apply(
        lambda x: "male" if x in male_terms else
                  "female" if x in female_terms else pd.NA
    )
    return df


# Lina: populate missing "sex" values based on the observed statistical distribution.
# Two key columns define the groups: animal_type and breed. For each row that is missing "sex",
# a precomputed probability dictionary is looked up for that row's (animal_type, breed) group,
# and a value is sampled from that group's distribution. A NumPy random seed keeps the sampling reproducible.

def impute_sex_by_species_and_breed(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    df = df.copy()
    # Set seed for reproducibility (42 is an arbitrary, conventional choice)
    np.random.seed(seed)
    # Get normalized sex distributions per (animal_type, breed)
    sex_probs = (
        df.dropna(subset=["sex"])
        .groupby(["animal_type", "breed"])["sex"]
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
    )

    # Convert to lookup dictionary for speed
    sex_prob_dict = sex_probs.to_dict(orient="index")
    # The sampling and random logic based on distribution of sex (male/female)
    def sample_sex(row):
        if pd.notna(row["sex"]):
            return row["sex"]
        key = (row["animal_type"], row["breed"])
        p = sex_prob_dict.get(key)
        if p and (p.get("male", 0) + p.get("female", 0)) > 0:
            return np.random.choice(["male", "female"], p=[p.get("male", 0), p.get("female", 0)])
        return pd.NA

    df["sex"] = df.apply(sample_sex, axis=1)
    return df

# Lina: the same logic used for sex also works for missing primary_color values.
# Rows are grouped by species and breed, and the colors observed in each group are counted.
# This yields a probability for each color, i.e. a color distribution per species-breed pair.
# Each missing row is then assigned a color sampled at random from its group's distribution.

def impute_primary_color_by_species_and_breed(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    df = df.copy()
    np.random.seed(seed)
    
    # Compute normalized primary_color probabilities per group
    color_probs = (
        df.dropna(subset=["primary_color"])
        .groupby(["animal_type", "breed"])["primary_color"]
        .value_counts(normalize=True)
        .unstack()
        .fillna(0)
    )


    # Convert to lookup dictionary
    color_prob_dict = color_probs.to_dict(orient="index")
    # The sampling and random logic based on distribution of primary color (species/breed to assign color)
    def sample_color(row):
        if pd.notna(row["primary_color"]):
            return row["primary_color"]
        key = (row["animal_type"], row["breed"])
        p = color_prob_dict.get(key)
        if p and sum(p.values()) > 0:
            choices = list(p.keys())
            probabilities = list(p.values())
            return np.random.choice(choices, p=probabilities)
        return pd.NA

    df["primary_color"] = df.apply(sample_color, axis=1)
    return df
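
The two imputation helpers above share one pattern: sample a replacement from the distribution observed within the row's group. The sketch below generalizes that pattern on toy data. The function name `impute_from_group_distribution` and the sample rows are hypothetical, and it swaps the global `np.random.seed` for a local `np.random.default_rng` generator, which is an alternative design and not what the notebook itself uses.

```python
import numpy as np
import pandas as pd


def impute_from_group_distribution(df, target, group_cols, seed=42):
    """Fill missing `target` values by sampling from each group's observed values."""
    rng = np.random.default_rng(seed)  # local generator; global NumPy state is untouched
    out = df.copy()
    observed = out.dropna(subset=[target])
    # Map each group key to the relative frequency of its observed values.
    probs = {
        key: grp[target].value_counts(normalize=True)
        for key, grp in observed.groupby(group_cols)
    }
    for idx in out.index[out[target].isna()]:
        key = tuple(out.loc[idx, group_cols])
        dist = probs.get(key)
        if dist is not None:  # groups with no observed values stay missing
            out.loc[idx, target] = rng.choice(dist.index.to_numpy(), p=dist.to_numpy())
    return out


# Hypothetical rows: each group has at most one observed label, so the result is deterministic.
toy = pd.DataFrame({
    "animal_type": ["dog", "dog", "dog", "cat", "cat", "bird"],
    "breed": ["labrador", "labrador", "labrador", "cat_short_hair", "cat_short_hair", "other"],
    "sex": ["male", "male", None, "female", None, None],
})
filled = impute_from_group_distribution(toy, "sex", ["animal_type", "breed"])
print(filled)
```

A local generator keeps the sampling reproducible without affecting other code that draws random numbers in the same session.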

-----

## 4. Data Cleaning & Standardization

**Purpose:**  
Align all of our sources to a common schema.

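> **Note:** This step renames mapped columns to snake_case, lowercases every remaining column name, and drops duplicate column names that arise when several originals map to the same target.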
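
To illustrate the rename, lowercase, and de-duplicate sequence, here is a minimal sketch on a toy frame. The headers and values are made up for the example; only the three pandas steps mirror `standardize_columns`.

```python
import pandas as pd

# Toy frame with hypothetical headers; two of them map to the same target name.
raw = pd.DataFrame({
    "Animal ID": ["A1"],
    "AnimalID": ["A1-dup"],
    "Intake Date": ["2024-01-05"],
    "Kennel_Number": [7],
})
mapping = {"Animal ID": "animal_id", "AnimalID": "animal_id", "Intake Date": "intake_date"}

renamed = raw.rename(columns=mapping)                     # 1. apply the mapping
renamed.columns = renamed.columns.str.lower()             # 2. lowercase unmapped names
deduped = renamed.loc[:, ~renamed.columns.duplicated()]   # 3. keep the first of each duplicate

print(list(deduped.columns))
```

Unmapped columns are only lowercased, so a header such as `Kennel_Number` becomes `kennel_number`, while one without separators stays a single lowercase word.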

In [53]:
# Here we will define the full column mapping for all the DataFrames:
COLUMN_MAP = {
    # Animal identification
    **{col: "animal_id" for col in ["AnimalID", "Animal_Id", "Animal ID"]}, # Dictionary unpacking (**) merges each group of aliases into one mapping
    **{col: "animal_type" for col in ["AnimalType", "Animal_Type", "Type"]},
    
    # Animal characteristics
    **{col: "breed" for col in ["PrimaryBreed", "Animal_Breed", "Breed"]},
    **{col: "primary_color" for col in ["PrimaryColor", "Color"]},
    "Age": "age",
    "Date Of Birth": "date_of_birth",
    "Sex": "sex",
    
    # Intake information
    **{col: "intake_type" for col in ["IntakeType", "Intake_type", "Intake Type"]},
    **{col: "intake_condition" for col in ["IntakeCondition", "Intake_Condition", "Intake Condition"]},
    **{col: "intake_reason" for col in ["IntakeReason", "Reason"]},
    **{col: "intake_date" for col in ["IntakeDate", "Intake_Date", "Intake Date"]},
    
    # Outcome information
    **{col: "outcome_type" for col in ["OutcomeType", "outcome_type", "Outcome Type"]},
    **{col: "outcome_date" for col in ["OutcomeDate", "Outcome_Date", "Outcome Date"]}
}

In [54]:
# Apply standardization (renaming and lowercasing) to all DataFrames
print("Column standardization starting...\n")
CLEAN_DFS = {
    source: standardize_columns(source, df, COLUMN_MAP)
    for source, df in BRONZE_DFS.items()
}
print("\n---\nColumn standardization complete.")


Column standardization starting...

 - dallas: ['animal_id', 'animal_type', 'breed', 'kennel_number', 'kennel_status', 'tag_type', 'activity_number', 'activity_sequence', 'source_id', 'census_tract', 'council_district', 'intake_type', 'intake_subtype', 'intake_total', 'intake_reason', 'staff_id', 'intake_date', 'intake_time', 'due_out', 'intake_condition', 'hold_request', 'outcome_type', 'outcome_subtype', 'outcome_date', 'outcome_time', 'receipt_number', 'impound_number', 'service_request_number', 'outcome_condition', 'chip_status', 'animal_origin', 'additional_information', 'month', 'year']
 - san_jose: ['_id', 'animal_id', 'animalname', 'animal_type', 'primary_color', 'secondarycolor', 'breed', 'sex', 'dob', 'age', 'intake_date', 'intake_condition', 'intake_type', 'intakesubtype', 'intake_reason', 'outcome_date', 'outcome_type', 'outcomesubtype', 'outcomecondition', 'crossing', 'jurisdiction', 'lastupdate']
 - soco: ['name', 'animal_type', 'breed', 'primary_color', 'sex', 'size', 'd

-----

## 5. Value Mapping & Data Type Enforcement

**Purpose:**  
Convert raw categorical codes into clean, analysis-ready categories and cast explicit dtypes.  

> **Note:** The `category` dtype reduces memory use and can speed up grouping, mainly for columns with few distinct values. High-cardinality columns such as identifiers gain much less.
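
As a rough illustration of the memory effect, the sketch below compares the footprint of a repeated-label column stored as plain strings and as `category`. The labels and row count are arbitrary assumptions; exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical low-cardinality column: three labels repeated many times.
labels = pd.Series(["stray", "surrender", "transfer"] * 10_000)
as_category = labels.astype("category")

string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings:  {string_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```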



In [55]:
# ─── SILVER DTYPE MAPPING ───
# Define explicit pandas dtypes for key columns
SILVER_DTYPES = {
    'animal_id'       : 'category',
    'animal_type'     : 'category',
    'breed'           : 'category',
    'primary_color'   : 'category',
    'age'             : 'float',
    'sex'             : 'category',
    'intake_type'     : 'category',
    'intake_condition': 'category',
    'intake_reason'   : 'object',
    'intake_date'     : 'datetime64[ns]',
    'outcome_type'    : 'category',
    'outcome_date'    : 'datetime64[ns]',
    'region'          : 'category'
}

# ─── VALUE MAPPINGS ───

# animal_type mapping
ANIMAL_TYPE_MAP = {
    # focus will be on dogs and cats, all other species will be labeled as "other"
    "DOG"                 : "dog",
    "CAT"                 : "cat",
    "BIRD"                : "other",
    "LIVESTOCK"           : "other",
    "WILDLIFE"            : "other",
    "OTHER"               : "other"
    }

# breed mapping
BREED_MAP = {
    # Logic: single-breed entries keep their breed; anything indicating a dual breed or mix
    # (all entries containing '/' or 'MIX') is classified as "mixed".
    # Cats are classified by coat length (short, medium, or long hair) within the "breed" column.
    # ==== MIXED ====
    "ABYSSINIAN/DOMESTIC SH"   : "mixed",   # soco
    "ABYSSINIAN/MIX"           : "mixed",   # soco
    "AFFENPINSCHER/MIX"        : "mixed",   # soco
    "ALASKAN HUSKY/LABRADOR RETR": "mixed", # soco
    "GERM SHEPHERD/CHOW CHOW"  : "mixed",   # soco
    "LABRADOR RETR/MIX"        : "mixed",   # soco
    "PIT BULL/MIX"             : "mixed",   # soco
    # ==== PIT BULL ====
    "PIT BULL"                 : "pit_bull",        # dallas
    "AM PIT BULL TER"         : "pit_bull",         # soco
    # ==== LABRADOR ====
    "LABRADOR RETR"           : "labrador",         # soco, dallas
    "LAB"                     : "labrador",         # san_jose
    # ==== GERMAN SHEPHERD ====
    "GERM SHEPHERD"           : "german_shepherd",  # soco
    "GERMAN SHEPHERD"         : "german_shepherd",  # dallas
    # ==== AKITA ====
    "AKITA"                   : "akita",            # soco, dallas
    # ==== HUSKY ====
    "ALASK MALAMUTE"          : "husky",            # soco
    "ALASKAN HUSKY"           : "husky",            # soco
    # ==== CHIHUAHUA ====
    "CHIHUAHUA"               : "chihuahua",        # dallas
    # ==== BOXER ====
    "BOXER"                   : "boxer",            # dallas
    # ==== POODLE ====
    "POODLE"                  : "poodle",           # dallas
    # ==== BEAGLE ====
    "BEAGLE"                  : "beagle",           # dallas
    # ==== SHIH TZU ====
    "SHIH TZU"                : "shih_tzu",         # soco
    # ==== TERRIER ====
    "AIREDALE TERR"           : "terrier",          # soco
    "BULL TERRIER"            : "terrier",          # dallas
    "AFFENPINSCHER"           : "terrier",          # soco
    # ==== CAT DOMESTIC ====
    "DOMESTIC SH"             : "cat_short_hair",     # soco, dallas
    "DOMESTIC LH"             : "cat_long_hair",     # dallas
    "DOMESTIC MH"             : "cat_medium_hair",     # dallas
    # ==== UNKNOWN ====
    "UNKNOWN"                 : "unknown",          # san_jose
    }

# Primary color mapping
PRIMARY_COLOR_MAP = {
    # Grouping logic: keep as much pattern detail as possible while standardizing the color groups.
    # ==== BLACK VARIANTS ====
    "BLACK"                  : "black",           # soco, san_jose, dallas
    "BLACK/WHITE"            : "black",           # soco
    "BLACK/BLUE MERLE"       : "black_merle",     # soco
    "BLACK/BRINDLE"          : "black_brindle",   # soco
    "BLACK/TABBY"            : "black_tabby",     # soco
    "BLACK/TRICOLOR"         : "black_tricolor",  # soco

    # ==== WHITE ====
    "WHITE"                  : "white",           # soco, san_jose
    "WHITE/BLACK"            : "white",           # soco
    "WHITE/GRAY"             : "white",           # soco

    # ==== BROWN FAMILY ====
    "BROWN"                  : "brown",           # soco
    "BROWN/WHITE"            : "brown",           # soco
    "CHOCOLATE"              : "brown",           # soco
    "CHOCOLATE/TABBY"        : "brown_tabby",     # soco
    "BRINDLE/BROWN"          : "brown_brindle",   # soco

    # ==== GRAY / GREY ====
    "GRAY"                   : "gray",            # soco
    "GREY"                   : "gray",            # san_jose
    "GRAY TABBY"             : "gray_tabby",      # soco

    # ==== BLUE FAMILY ====
    "BLUE"                   : "blue",            # soco, dallas
    "BLUE MERLE"             : "blue_merle",      # soco
    "BLUE CREAM"             : "blue",            # soco
    "BLUE/WHITE"             : "blue",            # soco

    # ==== ORANGE ====
    "ORANGE"                 : "orange",          # soco
    "ORANGE/TABBY"           : "orange_tabby",    # soco

    # ==== CREAM / FAWN ====
    "CREAM"                  : "cream",           # soco, san_jose
    "FAWN"                   : "fawn",            # dallas
    "CREAM/TABBY"            : "cream_tabby",     # soco

    # ==== CALICO / TORTIE ====
    "CALICO"                 : "calico",          # soco
    "TORTIE"                 : "tortie",          # soco
    "TORTIE/TABBY"           : "tortie_tabby",    # soco

    # ==== OTHER SPECIAL PATTERNS ====
    "TABBY/WHITE"            : "tabby",           # soco
    "TRICOLOR"               : "tricolor",        # soco
    "SMOKE"                  : "smoke",           # soco
    "TIGER/GRAY"             : "gray_tiger",      # soco
    "POINT"                  : "point",           # soco
    "TICK"                   : "tick",            # soco

    # ==== RARE OR UNKNOWN ====
    "AGOUTI"                 : "other",           # soco
    "AGOUTI/BRN TABBY"       : "other",           # soco
    "0"                      : "other",           # san_jose
    }


# 1. intake_type mapping
INTAKE_TYPE_MAP = {
    "BORN HERE":           "born_at_shelter",
    "CONFISCATE":          "confiscated",
    "CONFISCATED":         "confiscated",
    "DISPO REQ":           "disposal_request",
    "DISPOS REQ":          "disposal_request",
    "EUTH REQ":            "euthanasia_request",
    "FOSTER":              "foster",
    "KEEPSAFE":            "protective_custody",
    "QUARANTINE":          "protective_custody",
    "RESOURCE":            "treatment",
    "TREATMENT":           "treatment",
    "RETURN":              "return_to_owner",
    "NEUTER":              "spay_neuter",
    "S/N CLINIC":          "spay_neuter",
    "SPAY":                "spay_neuter",
    "STRAY":               "stray",
    "TNR":                 "stray",
    "ADOPTION RETURN":     "surrender",
    "OS APPT":             "surrender",
    "OWNER SUR":           "surrender",
    "OWNER SURRENDER":     "surrender",
    "TRANSFER":            "transfer",
    "WILDLIFE":            "wildlife",
}

# 2. intake_condition mapping
INTAKE_CONDITION_MAP = {
    "GERIATRIC":            "age_related",
    "UNDERAGE":             "age_related",
    "AGGRESSIVE":           "behavioral",
    "BEH M":                "behavioral",
    "BEH R":                "behavioral",
    "BEH U":                "behavioral",
    "FERAL":                "behavioral",
    "CRITICAL":             "critical",
    "FATAL":                "critical",
    "UNTREATABLE":          "critical",
    "DECEASED":             "deceased",
    "DEAD":                 "deceased",
    "APP WNL":              "healthy",
    "NORMAL":               "healthy",
    "HEALTHY":              "healthy",
    "APP INJ":              "medical",
    "APP SICK":             "medical",
    "MED EMERG":            "medical",
    "MED M":                "medical",
    "MED R":                "medical",
    "MED SEV":              "medical",
    "TREATABLE/MANAGEABLE":"medical",
    "TREATABLE/REHAB":      "medical",
    "NURSING":              "reproductive",
    "PREGNANT":             "reproductive",
    "UNKNOWN":              "unknown",
}

# 3. intake_reason mapping
INTAKE_REASON_MAP = {
    "FOR ADOPT":              "for_adoption",
    "FOR PLCMNT":             "for_adoption",
    "IP ADOPT":               "for_adoption",
    "BEHAVIOR":               "behavior",
    "AGG ANIMAL":             "behavior",
    "AGG PEOPLE":             "behavior",
    "BITES":                  "behavior",
    "CHASES ANI":             "behavior",
    "DESTRUC IN":             "behavior",
    "ESCAPES":                "behavior",
    "HOUSE SOIL":             "behavior",
    "HYPER":                  "behavior",
    "NOFRIENDLY":             "behavior",
    "PICA":                   "behavior",
    "BREED REST":             "breed_restriction",
    "OWR REQ EU":             "owner_requested_euthanasia",
    "IP EUTH":                "owner_requested_euthanasia",
    "MEDICAL":                "medical",
    "SURGERY":                "medical",
    "VET CARE":               "medical",
    "OTHER":                  "other",
    "OTHRINTAKS":             "other",
    "CANTAFFORD":             "owner_surrender",
    "EVICTION":               "owner_surrender",
    "FINANCIAL":              "owner_surrender",
    "HOUSING":                "owner_surrender",
    "LLCONFLICT":             "owner_surrender",
    "LOSSHOUSNG":             "owner_surrender",
    "PETDEPFEE":              "owner_surrender",
    "LANDLORD":               "owner_surrender",
    "MOVE":                   "owner_surrender",
    "NO HOME":                "owner_surrender",
    "OWR DEATH":              "owner_surrender",
    "PERLIFECNG":             "owner_surrender",
    "PERSNLISSU":             "owner_surrender",
    "TEMLIFECNG":             "owner_surrender",
    "ALLERGIC":               "owner_surrender",
    "CHILD PROB":             "owner_surrender",
    "NO TIME":                "owner_surrender",
    "OWNER DIED":             "owner_surrender",
    "OWNER PROB":             "owner_surrender",
    "TRAVEL":                 "owner_surrender",
    "NOTRIGHTFT":             "owner_surrender",
    "ATTENTION":              "owner_surrender",
    "OTHER PET":              "owner_surrender",
    "TOO BIG":                "owner_surrender",
    "TOO MANY":               "owner_surrender",
    "SHORT-TERM":             "temporary_care",
    "TNR CLINIC":             "trap_neuter_return",
    "TRANSFER":               "transfer",
}

# 4. outcome_type mapping
OUTCOME_TYPE_MAP = {
    "ADOPTION":               "adoption",
    "RESCUE":                 "adoption",
    "DIED":                   "deceased",
    "EUTH":                   "euthanasia",
    "EUTHANIZE":              "euthanasia",
    "EUTHANIZED":             "euthanasia",
    "REQ EUTH":               "euthanasia",
    "DISPOSAL":               "disposal",
    "ESCAPED/STOLEN":         "escaped",
    "FOUND ANIM":             "found",
    "FOUND EXP":              "found",
    "LOST EXP":               "lost",
    "MISSING":                "lost",
    "FOSTER":                 "foster",
    "TREATMENT":              "treatment",
    "VET":                    "treatment",
    "CLOSED":                 "other",
    "OTHER":                  "other",
    "RETURN TO OWNER":        "return_to_owner",
    "RETURNED TO OWNER":      "return_to_owner",
    "RTF":                    "return_to_field",
    "RTO":                    "return_to_owner",
    "RTOS":                   "return_to_owner",
    "NEUTER":                 "spay_neuter",
    "SNR":                    "spay_neuter",
    "SPAY":                   "spay_neuter",
    "TNR":                    "trap_neuter_release",
    "TRANSFER":               "transfer",
    "WILDLIFE":               "wildlife",
}

# Final assembly
VALUE_MAPPINGS = {
    "animal_type"     : ANIMAL_TYPE_MAP,
    "breed"           : BREED_MAP,
    "primary_color"   : PRIMARY_COLOR_MAP,
    "intake_type"     : INTAKE_TYPE_MAP,
    "intake_condition": INTAKE_CONDITION_MAP,
    "intake_reason"   : INTAKE_REASON_MAP,
    "outcome_type"    : OUTCOME_TYPE_MAP,
}


**Important Note:** The cell below reveals inconsistencies across datasets where identical concepts are represented with slight variations (Example: "CONFISCATED", "CONFISCATE", "CONFISCTED"). These inconsistencies would create data fragmentation in downstream analysis.

The `apply_silver_transforms` function addresses these issues by:

1. **Enforcing uniform data types** across all datasets
2. **Standardizing categorical values** using the VALUE_MAPPINGS dictionary
3. **Validating temporal data** and handling future dates
4. **Gracefully handling missing columns** across different data sources

This ensures all datasets share a common vocabulary and data structure for reliable analysis.
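
The harmonization step can be sketched on a toy Series as follows. The mini-mapping `toy_map` and the sample values are hypothetical; the normalize, map, and fill sequence follows what `apply_silver_transforms` does for each mapped column.

```python
import pandas as pd

# Hypothetical mini-mapping with the same shape as the *_MAP dictionaries.
toy_map = {"CONFISCATE": "confiscated", "CONFISCATED": "confiscated", "STRAY": "stray"}
raw = pd.Series([" confiscate ", "CONFISCATED", "Stray", "CONFISCTED", None])

# Normalize case and whitespace so lookups are not sensitive to formatting.
normalized = raw.astype(str).str.strip().str.upper()

# Values absent from the mapping (typos, missing entries) receive the fallback label.
harmonized = normalized.map(toy_map).fillna("other")

print(harmonized.tolist())
```

Because unmapped spellings and missing entries both end up as the fallback label, comparing the value lists before and after harmonization is a practical way to find variants that still need a mapping entry.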

In [56]:
print("=" * 30)
print("BEFORE VALUE HARMONIZATION:")
print("=" * 30)

categorical_cols = ['animal_type'
                    , 'breed'
                    , 'primary_color'
                    , 'intake_type'
                    , 'intake_condition'
                    , 'intake_reason'
                    , 'outcome_type']

for source, df in CLEAN_DFS.items():
    print(f"\n{source.upper()} DATASET:")
    for col in categorical_cols:
        if col in df.columns:
            unique_values = df[col].dropna().unique()
            print(f"   BEFORE {col}: {sorted(unique_values)}")

BEFORE VALUE HARMONIZATION:

DALLAS DATASET:
   BEFORE animal_type: ['BIRD', 'CAT', 'DOG', 'LIVESTOCK', 'WILDLIFE']
   BEFORE breed: ['ABYSSINIAN', 'AFFENPINSCHER', 'AFGHAN HOUND', 'AIREDALE TERR', 'AKBASH', 'AKITA', 'ALASK KLEE KAI', 'ALASK MALAMUTE', 'ALASKAN HUSKY', 'ALLIGATOR', 'AM PIT BULL TER', 'AMER BULLDOG', 'AMER CURL LH', 'AMER CURL SH', 'AMER ESKIMO', 'AMER FOXHOUND', 'AMER SH', 'AMER WIREHAIR', 'AMERICAN', 'AMERICAN STAFF', 'ANATOL SHEPHERD', 'ANGORA', 'ARMADILLO', 'AUST CATTLE DOG', 'AUST KELPIE', 'AUST SHEPHERD', 'AUST TERRIER', 'BALINESE', 'BASENJI', 'BASSET HOUND', 'BAT', 'BEAGLE', 'BEARDED COLLIE', 'BEAUCERON', 'BELG LAEKENOIS', 'BELG MALINOIS', 'BENGAL', 'BERNESE MTN DOG', 'BICHON FRISE', 'BIRMAN', 'BLACK MOUTH CUR', 'BLACK/TAN HOUND', 'BLACKBIRD', 'BLOODHOUND', 'BLUE LACY', 'BLUEBIRD', 'BLUETICK HOUND', 'BOERBOEL', 'BOMBAY', 'BORDER COLLIE', 'BORDER TERRIER', 'BOSTON TERRIER', 'BOXER', 'BOYKIN SPAN', 'BRITISH BLUE', 'BRITISH SH', 'BRITTANY', 'BRUSS GRIFFON', 'BULL TE

-----

## 6. Execute Transformations

**Purpose:**  Orchestrate the cleaning in a single, easy-to-read cell.

In [57]:
# Apply transformations
SILVER_DFS = {
    source: apply_silver_transforms(df, source)
    for source, df in CLEAN_DFS.items()
}

print("Silver transformations applied successfully!")

         Latest future date: 2025-09-27
         Setting future dates to NaT (Not a Time)
Silver transformations applied successfully!
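The output above reports that future dates were set to `NaT`. A minimal sketch of how such a check could be written follows. The function name `null_future_dates_sketch`, its `now` parameter, and the message format are hypothetical, and the notebook's own logic inside `apply_silver_transforms` may differ.

```python
import pandas as pd


def null_future_dates_sketch(df, date_cols, now=None):
    """Illustrative only: replace dates later than `now` with NaT."""
    cutoff = pd.Timestamp.now().normalize() if now is None else pd.Timestamp(now)
    out = df.copy()
    for col in date_cols:
        if col not in out.columns:
            continue  # tolerate sources that lack this date column
        parsed = pd.to_datetime(out[col], errors="coerce")  # unparseable -> NaT
        future = parsed > cutoff  # comparisons against NaT evaluate to False
        if future.any():
            print(
                f"{col}: latest future date {parsed[future].max().date()}, "
                f"setting {int(future.sum())} value(s) to NaT"
            )
        out[col] = parsed.mask(future)  # masked positions become NaT
    return out


example = pd.DataFrame({"intake_date": ["2024-01-05", "2999-01-01", None]})
cleaned = null_future_dates_sketch(example, ["intake_date", "outcome_date"], now="2025-01-01")
print(cleaned)
```

Passing the cutoff explicitly, as in the example call, makes the result reproducible across runs. Defaulting to the current date means the same input can yield different output on different days.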


In [58]:
print("\n" + "-" * 30)
print("AFTER VALUE HARMONIZATION:")
print("-" * 30)

for source, df in SILVER_DFS.items():
    print(f"\n{source.upper()} DATASET:")
    for col in categorical_cols:
        if col in df.columns:
            unique_values = df[col].dropna().unique()
            print(f"   AFTER: {col}: {sorted(unique_values)}")

# print(f"\nVALUE HARMONIZATION SUMMARY:")
# print(f"   - Mapped {len(VALUE_MAPPINGS)} categorical variables to consistent values")
# print(f"   - Unified terminology across Dallas, San Jose, and Sonoma County datasets")
# print(f"   - Ready for cross-shelter analysis")


------------------------------
AFTER VALUE HARMONIZATION:
------------------------------

DALLAS DATASET:
   AFTER: animal_type: ['cat', 'dog', 'other']
   AFTER: breed: ['akita', 'beagle', 'boxer', 'cat_long_hair', 'cat_medium_hair', 'cat_short_hair', 'german_shepherd', 'husky', 'labrador', 'other', 'pit_bull', 'shih_tzu', 'terrier']
   AFTER: intake_type: ['confiscated', 'disposal_request', 'foster', 'protective_custody', 'stray', 'surrender', 'transfer', 'treatment', 'wildlife']
   AFTER: intake_condition: ['age_related', 'critical', 'deceased', 'healthy', 'medical']
   AFTER: intake_reason: ['behavior', 'breed_restriction', 'for_adoption', 'medical', 'other', 'owner_requested_euthanasia', 'owner_surrender', 'temporary_care', 'transfer', 'trap_neuter_return', 'unknown']
   AFTER: outcome_type: ['adoption', 'deceased', 'disposal', 'euthanasia', 'foster', 'found', 'lost', 'other', 'return_to_owner', 'spay_neuter', 'transfer', 'trap_neuter_release', 'treatment', 'wildlife']

SAN_JOSE 

-----
## 7. Create Silver & Quality Checks

**Purpose:**  
Combine each source’s cleaned DataFrame into the final `silver_df` according to `FINAL_SCHEMA` and, optionally, run a data quality assessment.

In [59]:
FINAL_SCHEMA = [
    "animal_id", "animal_type", "breed", "primary_color", "age", "date_of_birth", "sex",
    "intake_type", "intake_condition", "intake_reason", "intake_date",
    "outcome_type", "outcome_date", "region"
]

In [60]:
# Here we create the final silver dataset
silver_df = create_silver_dataset(SILVER_DFS, FINAL_SCHEMA)
print(f"Silver dataset created: {silver_df.shape[0]:,} records × {silver_df.shape[1]} columns")
print(f"Duplicates removed: {sum(df.shape[0] for df in SILVER_DFS.values()) - silver_df.shape[0]:,}")

Silver dataset created: 111,907 records × 14 columns
Duplicates removed: 0
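For readers who want a feel for what the combination step involves, here is a minimal sketch, assuming the step amounts to aligning every source to the schema, concatenating, and dropping exact duplicate rows. `combine_sources_sketch` is a hypothetical stand-in and not the notebook's `create_silver_dataset`, which is defined with the other helper functions and may do more.

```python
import pandas as pd


def combine_sources_sketch(dfs, schema):
    """Illustrative only: align each source to `schema`, stack, deduplicate."""
    # reindex adds missing schema columns as NaN, drops extra columns,
    # and puts the columns in schema order
    frames = [df.reindex(columns=schema) for df in dfs.values()]
    combined = pd.concat(frames, ignore_index=True)
    # Exact duplicate rows are removed; NaN values compare equal here
    return combined.drop_duplicates().reset_index(drop=True)


sources = {
    "dallas": pd.DataFrame(
        {"animal_id": ["A1", "A1"], "region": ["dallas", "dallas"], "extra": [1, 1]}
    ),
    "san_jose": pd.DataFrame({"animal_id": ["B1"], "region": ["san_jose"]}),
}
combined = combine_sources_sketch(sources, ["animal_id", "sex", "region"])
print(combined)
```

Under this approach the "duplicates removed" figure is the row count before deduplication minus the row count after it, which is the same difference the cell above prints.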


Compute the age from intake date and date of birth.

In [61]:
# Enhance silver layer with computed age from dates
print("Computing age from intake_date and date_of_birth...")
silver_df = compute_age_from_dates(silver_df)
print(f"✓ Age computed for {len(silver_df)} records")

Computing age from intake_date and date_of_birth...
Computed age for 23124 rows
✓ Age computed for 111907 records
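The sketch below shows one way an age could be derived from the two date columns. It is a hedged illustration and not the notebook's `compute_age_from_dates`: the function name, the `age_years` output column, the use of years as the unit, and the choice to leave negative ages as missing are all assumptions.

```python
import pandas as pd


def age_in_years_sketch(df, birth_col="date_of_birth", intake_col="intake_date"):
    """Illustrative only: age at intake in years, NaN where not computable."""
    out = df.copy()
    birth = pd.to_datetime(out[birth_col], errors="coerce")
    intake = pd.to_datetime(out[intake_col], errors="coerce")
    # Day difference divided by the average year length
    age_years = (intake - birth).dt.days / 365.25
    # A birth date after the intake date is treated as invalid (assumption)
    out["age_years"] = age_years.where(age_years >= 0)
    print(f"Computed age for {int(out['age_years'].notna().sum())} rows")
    return out


example = pd.DataFrame(
    {
        "date_of_birth": ["2020-01-01", None, "2024-01-01"],
        "intake_date": ["2022-01-01", "2022-01-01", "2023-01-01"],
    }
)
print(age_in_years_sketch(example))
```

Writing the result to a separate column, rather than overwriting `age`, keeps the source-reported value available for comparison.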



#### Data Quality Assessment

Comprehensive quality checks and data profiling.
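The missingness and cardinality parts of such a profile can be approximated with a few pandas calls. The sketch below is only an illustration under that assumption. `missing_and_cardinality_sketch` is a hypothetical name, and the notebook's `generate_data_overview` also reports distributions that this example does not cover.

```python
import pandas as pd


def missing_and_cardinality_sketch(df):
    """Illustrative only: per-column missing counts, percentages, unique counts."""
    summary = pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(3),
            "unique": df.nunique(dropna=True),  # nulls are not counted as a value
        }
    )
    # Highest-cardinality columns first
    return summary.sort_values("unique", ascending=False)


example = pd.DataFrame({"a": [1, 1, None, 2], "b": ["x", "x", "x", "x"]})
print(missing_and_cardinality_sketch(example))
```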

In [13]:
# Let's generate the data profile for the silver dataset
generate_data_overview(silver_df)

DATA QUALITY PROFILE

DATASET OVERVIEW
Total records: 111,907
Total columns: 14

MISSING DATA ANALYSIS
  breed: 39 (0.035%)
  primary_color: 65,079 (58.155%)
  age: 95,633 (85.458%)
  date_of_birth: 88,783 (79.336%)
  sex: 65,079 (58.155%)
  intake_date: 1 (0.001%)
  outcome_date: 2,134 (1.907%)

CARDINALITY ANALYSIS
  animal_id: 91,523 unique values
  date_of_birth: 6,572 unique values
  intake_date: 3,984 unique values
  outcome_date: 3,599 unique values
  breed: 1,235 unique values
  primary_color: 395 unique values
  age: 64 unique values
  outcome_type: 16 unique values
  intake_type: 13 unique values
  intake_reason: 11 unique values
  sex: 10 unique values
  intake_condition: 8 unique values
  animal_type: 6 unique values
  region: 3 unique values

INTAKE_TYPE DISTRIBUTION
  stray: 59.3%
  surrender: 12.1%
  foster: 11.4%
  confiscated: 6.5%
  treatment: 3.0%
  disposal_request: 2.2%
  wildlife: 1.8%
  protective_custody: 1.7%
  spay_neuter: 0.9%
  transfer: 0.7%

INTAKE_CONDITI



-----