# **Project Goal**

The goal of this project was to build a machine learning model that predicts whether an animal at the Austin Animal Center (an animal shelter in Austin, Texas) is likely to be euthanized. The model is meant to support shelter decision-making and enable early intervention, so it prioritizes recall: identifying as many at-risk animals as possible, even at the cost of some false alarms.

# Load And Prepare Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
shelter = pd.read_csv('clean_shelter.csv')

In [3]:
shelter.head()

Unnamed: 0,animal_id,name,date,time,date_of_birth,outcome_type,age_upon_outcome_days,age_upon_outcome_years,animal_type,sex_upon_outcome,breed,color
0,A882831,Hamilton,07/01/2023,06:12 PM,2023-03-25,Adoption,90.0,0.25,Cat,Neutered Male,Domestic Shorthair,Black/White
1,A794011,Chunk,05/08/2019,06:20 PM,2017-05-02,Foster Care,730.0,2.0,Cat,Neutered Male,Domestic Shorthair,Brown Tabby/White
2,A776359,Gizmo,07/18/2018,04:02 PM,2017-07-12,Adoption,365.0,1.0,Dog,Neutered Male,Chihuahua Shorthair,White/Brown
3,A821648,Unknown,08/16/2020,11:38 AM,2019-08-16,Euthanasia,365.0,1.0,Other,Unknown,Raccoon,Gray
4,A720371,Moose,02/13/2016,05:59 PM,2015-10-08,Adoption,120.0,0.33,Dog,Neutered Male,Anatol Shepherd/Labrador Retriever,Buff


In [4]:
(shelter['outcome_type'] == 'Lost').value_counts()

outcome_type,count
False,66359


In [5]:
shelter['outcome_type'].value_counts()

outcome_type,count
Adoption,32031
Transfer,18724
Return to Owner,9975
Euthanasia,4130
Died,646
Foster Care,461
Disposal,339
Missing,37
Relocate,12
Stolen,3


In [6]:
# Fold 'Lost' and 'Stolen' into 'Missing' (no 'Lost' rows exist, per the check in cell 4)
shelter['outcome_type'] = shelter['outcome_type'].replace(['Lost', 'Stolen'], 'Missing')

# Replace 'Relocate' with 'Transfer'
shelter['outcome_type'] = shelter['outcome_type'].replace('Relocate', 'Transfer')


In [7]:

#  Count how many times each animal_id appears
visit_counts = shelter['animal_id'].value_counts()

# Create a dictionary: animal_id → count
visit_map = visit_counts.to_dict()

# Map that count back to the DataFrame
shelter['visit_count'] = shelter['animal_id'].map(visit_map)


In [8]:
shelter['euthanized'] = np.where(shelter['outcome_type'] == 'Euthanasia', 1, 0)

In [9]:
shelter.head()

Unnamed: 0,animal_id,name,date,time,date_of_birth,outcome_type,age_upon_outcome_days,age_upon_outcome_years,animal_type,sex_upon_outcome,breed,color,visit_count,euthanized
0,A882831,Hamilton,07/01/2023,06:12 PM,2023-03-25,Adoption,90.0,0.25,Cat,Neutered Male,Domestic Shorthair,Black/White,1,0
1,A794011,Chunk,05/08/2019,06:20 PM,2017-05-02,Foster Care,730.0,2.0,Cat,Neutered Male,Domestic Shorthair,Brown Tabby/White,1,0
2,A776359,Gizmo,07/18/2018,04:02 PM,2017-07-12,Adoption,365.0,1.0,Dog,Neutered Male,Chihuahua Shorthair,White/Brown,1,0
3,A821648,Unknown,08/16/2020,11:38 AM,2019-08-16,Euthanasia,365.0,1.0,Other,Unknown,Raccoon,Gray,1,1
4,A720371,Moose,02/13/2016,05:59 PM,2015-10-08,Adoption,120.0,0.33,Dog,Neutered Male,Anatol Shepherd/Labrador Retriever,Buff,2,0


In [10]:
shelter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66359 entries, 0 to 66358
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   animal_id               66359 non-null  object 
 1   name                    66358 non-null  object 
 2   date                    66359 non-null  object 
 3   time                    66359 non-null  object 
 4   date_of_birth           66358 non-null  object 
 5   outcome_type            66358 non-null  object 
 6   age_upon_outcome_days   66358 non-null  float64
 7   age_upon_outcome_years  66358 non-null  float64
 8   animal_type             66358 non-null  object 
 9   sex_upon_outcome        66358 non-null  object 
 10  breed                   66358 non-null  object 
 11  color                   66358 non-null  object 
 12  visit_count             66359 non-null  int64  
 13  euthanized              66359 non-null  int64  
dtypes: float64(2), int64(2), object(10)

In [11]:
from sklearn.model_selection import train_test_split
X = shelter.drop(columns=['euthanized', 'animal_id'])
y = shelter['euthanized']
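# NOTE: train_size=0.2 trains on 20% of the rows and holds out the other 80% for testing
# (hence the test support of 53,088 in the classification report); test_size=0.2 is the
# more common choice. stratify=y keeps the euthanasia rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)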

# Column Selections

> TODO: complete forward and backward feature selection



In [12]:
#begin feature selection
date_cols = ['date']
cat_cols = ['animal_type', 'sex_upon_outcome','breed', 'color']
num_cols = ['age_upon_outcome_years', 'visit_count']


# Custom Transformer

In [13]:
from sklearn.base import BaseEstimator, TransformerMixin

class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, date_cols):
        # store date_cols as a list
        self.date_cols = date_cols
        # set date_column name using the first element of date_cols list
        self.date_column = self.date_cols[0]

    def fit(self, X, y=None):
        self.min_date = pd.to_datetime(X[self.date_column]).min()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.date_column] = pd.to_datetime(X[self.date_column])
        X['year'] = X[self.date_column].dt.year
        X['month'] = X[self.date_column].dt.month
        X['dayofweek'] = X[self.date_column].dt.dayofweek
        X['days_since_start'] = (X[self.date_column] - self.min_date).dt.days
        # Drop the raw date column by its configured name rather than a hardcoded 'date'
        X = X.drop(columns=[self.date_column])
        return X[['year', 'month', 'dayofweek', 'days_since_start']]
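
As a quick illustration of what the transformer produces, the sketch below derives the same four features with plain pandas on three dates copied from the `shelter.head()` output. It assumes the `date` strings are in `MM/DD/YYYY` form, which is why an explicit `format` is passed; the variable names (`sample`, `features`) are illustrative only and not part of the notebook.

```python
import pandas as pd

# Three dates copied from the shelter.head() output; assumed to be MM/DD/YYYY.
sample = pd.DataFrame({'date': ['07/01/2023', '05/08/2019', '02/13/2016']})

# An explicit format avoids ambiguous day/month parsing.
dates = pd.to_datetime(sample['date'], format='%m/%d/%Y')
min_date = dates.min()  # plays the role of the value learned in fit()

features = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,              # Monday=0 ... Sunday=6
    'days_since_start': (dates - min_date).dt.days,
})
print(features)
```

The earliest date gets `days_since_start == 0`; dates earlier than the training minimum would come out negative at transform time.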

# Pipelines

In [14]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

date_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent').set_output(transform='pandas'),
    DateFeatureExtractor(date_cols),
    StandardScaler(),
)

num_pipe = make_pipeline(
    SimpleImputer(strategy='median').set_output(transform='pandas'),
    StandardScaler(),
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False),
)

In [15]:
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer([
    ('date', date_pipe, date_cols),
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

In [16]:
X_train_processed = preprocessing.fit_transform(X_train)
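
One caveat with fitting `preprocessing` on all of `X_train` before cross-validation is that the imputers, scalers, and encoder see every fold, including the rows later used for validation. A common alternative is to wrap preprocessing and model in a single `Pipeline` so that each fold refits both. The sketch below shows the pattern on a small synthetic frame; the data, column names, and all `toy_` variables are made up for illustration and are not the shelter data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
n = 300
toy_X = pd.DataFrame({
    'age_years': rng.uniform(0, 15, n),
    'animal_type': rng.choice(['Cat', 'Dog', 'Other'], n),
})
# Synthetic target: 'Other' animals and older animals are more likely positive.
risk = 0.05 + 0.5 * (toy_X['animal_type'] == 'Other') + 0.02 * toy_X['age_years']
toy_y = (rng.uniform(0, 1, n) < risk).astype(int)

toy_preprocessing = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()),
     ['age_years']),
    ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                          OneHotEncoder(handle_unknown='ignore')),
     ['animal_type']),
])

# Preprocessing and classifier live in one estimator, so cross_val_score
# refits the imputers, scaler, and encoder on each training fold only.
toy_model = Pipeline([
    ('preprocessing', toy_preprocessing),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_recall = cross_val_score(toy_model, toy_X, toy_y, cv=toy_cv, scoring='recall')
print(toy_recall.mean())
```

The same structure would let `GridSearchCV` tune the classifier with parameter names such as `classifier__max_depth`, again without the preprocessing steps seeing validation rows.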

# Model Selection

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_predict, cross_val_score, cross_validate,
)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import (
    classification_report, confusion_matrix, precision_recall_curve, auc,
    roc_auc_score, f1_score, precision_score, recall_score,
)

lr = LogisticRegression(solver='saga', max_iter=50000)
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
# 'logloss' is the binary log-loss metric; 'mlogloss' is its multiclass counterpart
xgb = XGBClassifier(eval_metric='logloss', scale_pos_weight=10, random_state=42)

In [18]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [19]:
f1_lr = cross_val_score(lr, X_train_processed, y_train, cv=cv, scoring='f1')
precision_lr = cross_val_score(lr, X_train_processed, y_train, cv=cv, scoring='precision')
recall_lr = cross_val_score(lr, X_train_processed, y_train, cv=cv, scoring='recall')


In [20]:
print(f1_lr.mean())
print(precision_lr.mean())
print(recall_lr.mean())

0.6073743770198767
0.7296408142367695
0.5205841548010223


In [21]:
f1_rf = cross_val_score(rf,X_train_processed, y_train, cv=cv, scoring='f1')
precision_rf = cross_val_score(rf,X_train_processed, y_train, cv=cv, scoring='precision')
recall_rf = cross_val_score(rf, X_train_processed, y_train, cv=cv, scoring='recall')


In [22]:
print(f1_rf.mean())
print(precision_rf.mean())
print(recall_rf.mean())

0.5979454452783493
0.7619675159665321
0.4927418765972983


In [23]:
f1_xgb = cross_val_score(xgb, X_train_processed, y_train, cv=cv, scoring='f1')
precision_xgb = cross_val_score(xgb, X_train_processed, y_train, cv=cv, scoring='precision')
recall_xgb = cross_val_score(xgb, X_train_processed, y_train, cv=cv, scoring='recall')


In [24]:
print(f1_xgb.mean())
print(precision_xgb.mean())
print(recall_xgb.mean())

0.5522010238771357
0.5130102496605792
0.5980941949616648
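
The three separate `cross_val_score` calls per model refit each model three times. A possible shortcut, sketched below, is `cross_validate` with a list of scorers, which fits once per fold and reports all metrics together. The data here comes from `make_classification` and the `demo_` names are placeholders, so the numbers it prints say nothing about the shelter data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Imbalanced synthetic data (about 10% positives), for illustration only.
demo_X, demo_y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_validate(
    LogisticRegression(max_iter=1000, class_weight='balanced'),
    demo_X, demo_y,
    cv=demo_cv,
    scoring=['f1', 'precision', 'recall'],  # one fit per fold, three metrics
)

for metric in ('f1', 'precision', 'recall'):
    print(metric, demo_scores[f'test_{metric}'].mean())
```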


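I selected XGBoost for tuning because it had the highest cross-validated recall (0.60, versus 0.52 for logistic regression and 0.49 for random forest). The goal is to flag as many animals at risk of euthanasia as possible so they can receive additional care, so recall matters more here than precision.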

In [25]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1],
    'colsample_bytree': [0.8, 1],
    'scale_pos_weight': [1, 3, 5]  # good for class imbalance
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring='recall',
    cv=5,
    n_jobs=-1,
    verbose=2
)

# Fit to your processed training data
grid_search.fit(X_train_processed, y_train)


Fitting 5 folds for each of 324 candidates, totalling 1620 fits


In [26]:
# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best recall score
print("Best Recall Score:", grid_search.best_score_)


Best Parameters: {'colsample_bytree': 1, 'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 50, 'scale_pos_weight': 5, 'subsample': 0.8}
Best Recall Score: 0.6404381161007666


In [27]:
# Final model using best parameters
final_model = grid_search.best_estimator_

# Process the test data
X_test_processed = preprocessing.transform(X_test)

# Predict on the processed test data
y_pred = final_model.predict(X_test_processed)



In [33]:
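# Find the source of the UserWarning raised when transforming the test set.

# Get the fitted categorical pipeline (a new name avoids shadowing the unfitted cat_pipe)
fitted_cat_pipe = preprocessing.named_transformers_['cat']

# Get the fitted OneHotEncoder; make_pipeline names each step after its lowercased class name
encoder = fitted_cat_pipe.named_steps['onehotencoder']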

# Get category mappings from training
known_categories = encoder.categories_

# Now compare each test column's values to what's known

for i, col in enumerate(cat_cols):  # your list of categorical column names
    train_cats = set(known_categories[i])
    test_cats = set(X_test[col].dropna().unique())
    unseen = test_cats - train_cats
    if unseen:
        print(f"Unseen category in column '{col}': {unseen}")


Unseen category in column 'breed': {'Landseer/English Setter', 'Staffordshire/Basset Hound', 'St. Bernard Rough Coat/Great Pyrenees', 'Rottweiler/Beagle', 'Dachshund/Cardigan Welsh Corgi', 'Pembroke Welsh Corgi/Anatol Shepherd', 'Japanese Chin ', 'Cocker Spaniel/Dachshund Longhair', 'Australian Cattle Dog/Akita', 'Beagle/Miniature Pinscher', 'Bearded Collie', 'American Staffordshire Terrier/Boxer', 'Pointer/Basenji', 'Pit Bull/Pointer', 'Bulldog/American Bulldog', 'Basset Hound/English Pointer', 'Belgian Malinois/German Shepherd', 'Harrier/Labrador Retriever', 'Manx/Domestic Longhair', 'Basenji/Australian Cattle Dog', 'Chow Chow/Doberman Pinsch', 'Miniature Pinscher/Yorkshire Terrier', 'Cane Corso/Mastiff', 'Collie Rough/Anatol Shepherd', 'Collie Rough/German Shepherd', 'Wirehaired Pointing Griffon', 'Black/Tan Hound/Black Mouth Cur', 'Pomeranian/Chihuahua Shorthair', 'Shih Tzu/Boston Terrier', 'Potbelly Pig', 'Labrador Retriever/Brittany', 'English Coonhound', 'Mastiff/Black Mouth Cur

In [28]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.98      0.96      0.97     49784
           1       0.54      0.66      0.59      3304

    accuracy                           0.94     53088
   macro avg       0.76      0.81      0.78     53088
weighted avg       0.95      0.94      0.95     53088

[[47917  1867]
 [ 1125  2179]]
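
The class 1 figures in the report can be reproduced directly from the confusion matrix, which is a useful sanity check when quoting results. The sketch below copies the matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predictions).

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (0, 1); columns: predicted class (0, 1).
cm = np.array([[47917, 1867],
               [1125, 2179]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of euthanasia cases that were flagged
precision = tp / (tp + fp)       # share of flagged animals that were euthanized
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```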


# **Project Findings, Limitations, and Conclusion**


## **Key Results**
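Test-set results for the euthanized class (class 1), taken from the classification report and confusion matrix above:

Recall: 0.66

Precision: 0.54

F1 Score: 0.59

Accuracy: 94% (note: inflated due to class imbalance)

The model correctly identified 2,179 euthanasia cases, missed 1,125, and produced 1,867 false positives. In other words, it catches roughly two out of three euthanasia cases, and a little under half of the animals it flags are false alarms. These results suggest the model is suitable as a decision-support tool to assist in early intervention.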

## **Limitations**
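The target class (euthanasia) is highly imbalanced: 4,130 of 66,359 records, or about 6%. Class weighting (`scale_pos_weight`) was used to compensate, but no resampling was applied in this notebook, which may limit sensitivity.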

The Breed and Color columns include hundreds of unique values, many of which are rare, leading to sparse feature representations.

Several unseen categories appeared in the test set, which triggered warnings during transformation.
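
One way to reduce the sparsity from rare breed and color values is to group infrequent categories under a single label before encoding. The sketch below is a minimal pandas version under stated assumptions: the helper names (`fit_frequent_categories`, `group_rare`), the `'Rare'` label, and the `min_count` cutoff are all hypothetical choices, not something the notebook uses.

```python
import pandas as pd

def fit_frequent_categories(train_values, min_count):
    """Return the set of categories seen at least min_count times in training."""
    counts = train_values.value_counts()
    return set(counts[counts >= min_count].index)

def group_rare(values, frequent, rare_label='Rare'):
    """Replace anything outside `frequent` (including NaN) with rare_label."""
    return values.where(values.isin(frequent), rare_label)

# Tiny illustrative sample; breed names are taken from the outputs above.
train_breeds = pd.Series(
    ['Domestic Shorthair'] * 3 + ['Chihuahua Shorthair'] * 2 + ['Raccoon']
)
test_breeds = pd.Series(['Domestic Shorthair', 'Potbelly Pig'])

# Learn the frequent set on training data only, then apply it to both splits.
frequent_breeds = fit_frequent_categories(train_breeds, min_count=2)
train_grouped = group_rare(train_breeds, frequent_breeds)
test_grouped = group_rare(test_breeds, frequent_breeds)
print(test_grouped.tolist())
```

Because the frequent set is learned from the training split, a breed that only appears in the test split lands in the same `'Rare'` bucket as rare training breeds instead of being an unseen category.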

## **Handling of Unseen Categories**
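To keep predictions from failing on categories that were not seen during training, the categorical pipeline was configured with:

OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)

With handle_unknown='ignore', unseen categories are encoded as rows of zeros instead of raising an error at inference time. Because drop='first' is also set, an all-zero row is the same encoding as the dropped baseline category, so unseen breeds and colors are treated as the baseline; this is what the warning during transformation reports. Given the high variability and low frequency of many breed and color combinations, this trade-off favors generalization and model stability over fitting rare or noisy category values.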

## **Conclusion**
Despite class imbalance and high-cardinality categorical features, the final model reached a test recall of 0.66 for the euthanized class at a precision of 0.54. It establishes a baseline for identifying animals at risk of euthanasia.

This model is best suited as a decision-support tool, not a replacement for human judgment. Shelters could use it to:

- Flag animals for closer review based on predicted risk
- Prioritize outreach or intervention efforts for animals most at risk
- Inform resource allocation, such as behavioral support, foster placement, or adoption promotion

Because the model emphasizes recall, it is designed to minimize missed cases, even at the cost of some false positives. Staff should be aware that not every flagged case will result in euthanasia, but that missing a high-risk animal is the greater concern.

Future improvements such as threshold tuning, feature grouping, or the inclusion of behavioral and medical data could further enhance accuracy and practical utility.
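
As a rough sketch of what threshold tuning could look like, the function below picks the highest probability cutoff that still reaches a target recall, which keeps false positives as low as possible for that recall level. The function name `threshold_for_recall`, the toy labels, and the toy scores are all hypothetical; in practice the scores would come from something like `final_model.predict_proba(...)[:, 1]` on a validation split rather than the test set.

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, recall, precision) for the highest cutoff reaching target_recall."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    positives = y_true == 1
    n_pos = positives.sum()
    if n_pos == 0:
        raise ValueError("y_true contains no positive examples")

    # Try candidate cutoffs from highest to lowest; recall only grows as the cutoff drops.
    for t in np.sort(np.unique(scores))[::-1]:
        predicted = scores >= t
        true_pos = (predicted & positives).sum()
        recall = true_pos / n_pos
        if recall >= target_recall:
            precision = true_pos / predicted.sum()
            return float(t), float(recall), float(precision)
    raise ValueError("target_recall must be at most 1.0")

# Toy example: four positives among eight animals.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.9]
cutoff, toy_recall_at_cutoff, toy_precision_at_cutoff = threshold_for_recall(
    toy_labels, toy_scores, target_recall=0.75
)
print(cutoff, toy_recall_at_cutoff, toy_precision_at_cutoff)
```

Lowering `target_recall` raises the cutoff and trims false positives; raising it does the opposite, which is the trade-off shelter staff would need to weigh.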
