# Silver Data Load

In this notebook, we filter, clean, and augment the Bronze data to create the Silver layer.

This layer brings data from various sources into an "Enterprise View" that supports further analysis and lets us answer one of our research questions:

> Do certain intake attributes make an animal more likely to be adopted?

For more information on Medallion Architecture, see [Databricks Glossary](https://www.databricks.com/glossary/medallion-architecture) (Databricks, n.d.).

---

> **Note:** This notebook assumes that you have already loaded the data by running [`notebooks/elt/1_bronze.ipynb`](./1_bronze.ipynb).

### References

Databricks. (n.d.). *Medallion Architecture*. Retrieved May 10, 2025, from [https://www.databricks.com/glossary/medallion-architecture](https://www.databricks.com/glossary/medallion-architecture)

## Checklist

1. **Date Anomalies**  
   - Identify any `intake_date` or `outcome_date` > today’s date  
     - ▢ Flag them with a boolean column (e.g. `is_future_intake`)  
     - ▢ Decide whether to drop or correct (e.g. trim off future records)  
   - Check for missing or null dates  
     - ▢ Log how many, then drop or impute if necessary  
   - Validate date formats & dtypes  
     - ▢ Ensure all date columns are `datetime64`  
     - ▢ Normalize any stray string dates via `pd.to_datetime(..., errors='coerce')`

2. **Clean the `age` Column**  
   - Extract number + unit from raw string (e.g. “7 MONTHS”, “2 WEEKS”)  
     - ▢ Write a parser (regex or split) to separate `value` and `unit`  
   - Convert to a consistent numeric metric (`age_years`)  
     - ▢ DAYS → days/365  
     - ▢ WEEKS → (weeks×7)/365  
     - ▢ MONTHS → months/12  
     - ▢ YEARS → years  
   - Handle invalid or missing formats  
     - ▢ Coerce unrecognized strings to `NaN`  
     - ▢ Optionally, add an `age_was_missing` indicator  
   - Add & validate new column  
     - ▢ Create `age_years` in the DataFrame  
     - ▢ Inspect its distribution (`.describe()`, histograms) for outliers
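
The two checklist items above can be sketched roughly as follows. This is a minimal sketch, not the notebook's final implementation: the helper names (`flag_date_anomalies`, `parse_age_years`, `add_age_years`), the `*_date_missing` flag columns, and the assumption that the raw column is called `age` with unit spellings like `DAY(S)`, `WEEK(S)`, `MONTH(S)`, `YEAR(S)` are all hypothetical and should be adjusted to the actual Bronze schema.

```python
import re

import numpy as np
import pandas as pd

# Assumed column names and unit spellings; adjust to the real Bronze schema.
DATE_COLS = ["intake_date", "outcome_date"]
YEARS_PER_UNIT = {"DAY": 1 / 365, "WEEK": 7 / 365, "MONTH": 1 / 12, "YEAR": 1.0}
AGE_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(DAY|WEEK|MONTH|YEAR)S?\s*$", re.IGNORECASE
)


def flag_date_anomalies(df, today=None):
    """Coerce date columns to datetime64 and flag future or missing values."""
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    out = df.copy()
    for col in DATE_COLS:
        # Unparseable strings become NaT instead of raising.
        out[col] = pd.to_datetime(out[col], errors="coerce")
        name = col.removesuffix("_date")  # requires Python 3.9+
        # Comparisons against NaT evaluate to False, so missing dates are not flagged here.
        out[f"is_future_{name}"] = out[col] > today
        out[f"{name}_date_missing"] = out[col].isna()
    return out


def parse_age_years(raw):
    """Convert a string such as '7 MONTHS' to years; return NaN if unrecognized."""
    if not isinstance(raw, str):
        return np.nan
    match = AGE_PATTERN.match(raw)
    if match is None:
        return np.nan
    value, unit = float(match.group(1)), match.group(2).upper()
    return value * YEARS_PER_UNIT[unit]


def add_age_years(df, age_col="age"):
    """Add `age_years` plus an indicator for missing or unparseable ages."""
    out = df.copy()
    out["age_years"] = out[age_col].map(parse_age_years)
    # True for both null inputs and strings the parser could not recognize.
    out["age_was_missing"] = out["age_years"].isna()
    return out
```

Assuming the Bronze data is in a DataFrame named `bronze`, the helpers could be chained as `silver = add_age_years(flag_date_anomalies(bronze))`. Logging `silver["intake_date_missing"].sum()` and inspecting `silver["age_years"].describe()` then gives the counts and distribution needed to decide whether to drop, correct, or impute.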

In [None]:
%pip install -r ../../requirements.txt

Note: you may need to restart the kernel to use updated packages.

In [84]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Show every column when displaying wide DataFrames.
pd.set_option('display.max_columns', None)