# Predicting Shelter Animal Outcomes
## MSDS 699 Final Project
### Jordan Uyeki

For this project, I analyzed the Austin Animal Center Shelter Outcomes dataset that is publicly available on [Kaggle](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and).   

My goal was to build a machine learning model that can predict shelter animal outcomes based on their adoption profiles. 

In [36]:
import pandas as pd
import numpy as np
import re
import math
import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import TransformerMixin, BaseEstimator, clone
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MultiLabelBinarizer

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn import set_config

## The Data

In [2]:
# Reading in data
shelter = pd.read_csv('data/shelter_data.csv')
shelter.head()

| Unnamed: 0 | age_upon_outcome | animal_id | animal_type | breed | color | date_of_birth | datetime | monthyear | name | outcome_subtype | outcome_type | fixed_status | gender |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 weeks | A684346 | Cat | Domestic Shorthair Mix | Orange Tabby | 2014-07-07T00:00:00 | 2014-07-22T16:04:00 | 2014-07-22T16:04:00 |  | Partner | Transfer | Intact | Male |
| 1 | 1 year | A666430 | Dog | Beagle Mix | White/Brown | 2012-11-06T00:00:00 | 2013-11-07T11:47:00 | 2013-11-07T11:47:00 | Lucy | Partner | Transfer | Spayed | Female |
| 2 | 1 year | A675708 | Dog | Pit Bull | Blue/White | 2013-03-31T00:00:00 | 2014-06-03T14:20:00 | 2014-06-03T14:20:00 | *Johnny |  | Adoption | Neutered | Male |
| 3 | 9 years | A680386 | Dog | Miniature Schnauzer Mix | White | 2005-06-02T00:00:00 | 2014-06-15T15:50:00 | 2014-06-15T15:50:00 | Monday | Partner | Transfer | Neutered | Male |
| 4 | 4 months | A664462 | Dog | Leonberger Mix | Brown/White | 2013-06-03T00:00:00 | 2013-10-07T13:06:00 | 2013-10-07T13:06:00 | *Edgar | Partner | Transfer | Intact | Male |


In [3]:
shelter.shape

(56641, 13)

This dataset consists of data on over 50,000 dogs and cats that have been in the Austin Animal Services Center System. There are 12 potential model features that can be used to predict the target variable, outcome_type.

## Feature Engineering

In [4]:
# Separating features and targets 
X = shelter.drop(columns=['outcome_type'])
y = shelter[['outcome_type']].values.ravel()

In [5]:
# Transforming outcome 
le = preprocessing.LabelEncoder()
le.fit(y)
y = le.transform(y) 
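
To make the encoding step concrete, here is a minimal sketch of how `LabelEncoder` maps outcome strings to integers and back. The toy outcome list is invented for illustration and is not drawn from the shelter data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical outcome labels (not taken from the dataset)
toy_outcomes = ["Transfer", "Adoption", "Transfer", "Euthanasia"]

toy_le = LabelEncoder().fit(toy_outcomes)

# Classes are stored in sorted order; each label becomes its index in classes_
encoded = toy_le.transform(toy_outcomes)

# inverse_transform recovers the original strings, which is handy for
# reading predictions back as outcome names
decoded = toy_le.inverse_transform(encoded)

print(list(toy_le.classes_))
print(encoded.tolist())
```

Keeping the fitted encoder around (as `le` is above) is what allows integer predictions to be translated back into outcome names later.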

In [6]:
# Splitting out training/testing sets
# Using 20% for withheld test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Some of the variables in the dataset require feature engineering. To help with the more complex requirements, I borrowed the class definitions from this [Kaggle post](https://www.kaggle.com/jankoch/scikit-learn-pipelines-and-pandas), as they allow custom functions to be applied to specific columns.

In [7]:
# Transformer Class Definitions 
class SelectColumnsTransformer(BaseEstimator, TransformerMixin):
    "A DataFrame transformer that provides column selection" 
    def __init__(self, columns=[]):
        self.columns = columns

    def transform(self, X, **transform_params):
        "Select columns of a DataFrame"
        trans = X[self.columns].copy() 
        return trans

    def fit(self, X, y=None, **fit_params):
        return self
    
class DataFrameFunctionTransformer(BaseEstimator, TransformerMixin):
    "A DataFrame transformer providing imputation or function application"
    def __init__(self, func, impute = False):
        self.func = func
        self.impute = impute
        self.series = pd.Series() 

    def transform(self, X, **transformparams):
        "Transforms a DataFrame"
        if self.impute:
            trans = pd.DataFrame(X).fillna(self.series).copy()
        else:
            trans = pd.DataFrame(X).apply(self.func).copy()
        return trans

    def fit(self, X, y=None, **fitparams):
        "Fixes the values to impute or does nothing"
        if self.impute:
            self.series = pd.DataFrame(X).apply(self.func).copy()
        return self
    
class DataFrameFeatureUnion(BaseEstimator, TransformerMixin):
    "A DataFrame transformer that unites several DataFrame transformers"
    def __init__(self, list_of_transformers):
        self.list_of_transformers = list_of_transformers
        
    def transform(self, X, **transformparams):
        "Applies the fitted transformers on a DataFrame"
        concatted = pd.concat([transformer.transform(X)
                            for transformer in
                            self.fitted_transformers_], axis=1).copy()
        return concatted

    def fit(self, X, y=None, **fitparams):
        "Fits several DataFrame Transformers"
        self.fitted_transformers_ = []
        for transformer in self.list_of_transformers:
            fitted_trans = clone(transformer).fit(X, y=None, **fitparams)
            self.fitted_transformers_.append(fitted_trans)
        return self
    
class ToDummiesTransformer(BaseEstimator, TransformerMixin):
    "A DataFrame transformer that provides dummy variable encoding"
    
    def transform(self, X, **transformparams):
        "Returns a dummy variable encoded version of a DataFrame"
        trans = pd.get_dummies(X).copy()
        return trans

    def fit(self, X, y=None, **fitparams):
        return self

class DropAllZeroTrainColumnsTransformer(BaseEstimator, TransformerMixin):
    "A DataFrame transformer that provides dropping all-zero columns"

    def transform(self, X, **transformparams):
        "Drops certain all-zero columns of X"
        
        trans = X.drop(self.cols_, axis=1).copy()
        return trans

    def fit(self, X, y=None, **fitparams):
        "Determines the all-zero columns of X"
        self.cols_ = X.columns[(X==0).all()]
        return self
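
The `impute=True` mode of `DataFrameFunctionTransformer` follows a fit-then-reuse pattern: `fit` computes one fill value per column, and `transform` applies those stored values with `fillna`. The sketch below shows the same idea in plain pandas, assuming the mean as the fill statistic. The toy frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test splits (values are invented)
train = pd.DataFrame({"month": [1.0, 3.0, np.nan, 8.0]})
test = pd.DataFrame({"month": [np.nan, 12.0]})

# "fit": learn one fill value per column from the training data only
fill_values = train.mean()

# "transform": reuse the stored values on any split, so the test set
# is filled with the training mean rather than its own
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values["month"])
print(test_filled["month"].tolist())
```

Learning the fill value on the training split alone keeps information from the test set out of the preprocessing step.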

In [8]:
# Transforming age_upon_outcome column 
# This column is originally made of strings 
# Want to make this column numeric (need to calculate all ages to be in years)
def age_to_year(x): 
    "Calculates age of animal in years"
    age = []
    for row in x: 
        # splitting number from units 
        age_units = row.split(" ")
        # removing plural from units (eg. year vs years )
        age_units[1] = re.sub('s', '', age_units[1])
        # converting all ages to years 
        if age_units[1] == 'day':
            age.append(int(age_units[0]) / 365)
        elif age_units[1] == 'week':
            age.append(int(age_units[0]) / 52)
        elif age_units[1] == 'month':
            age.append(int(age_units[0]) / 12)
        else: 
            age.append(int(age_units[0]))
    return pd.Series(age)

# Creating pipeline component for age_upon_outcome transformation 
age_pipeline = make_pipeline(
    SelectColumnsTransformer('age_upon_outcome'),
    DataFrameFunctionTransformer(lambda x: age_to_year(x))
)
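
As an alternative to the row-by-row loop, the same unit conversion can be sketched with pandas string methods. This is only an illustration: the names `UNITS_PER_YEAR` and `age_to_years_vectorized` are hypothetical, and the sample ages are invented in the same format as the column.

```python
import pandas as pd

# How many of each unit make up one year (same divisors as age_to_year)
UNITS_PER_YEAR = {"day": 365, "week": 52, "month": 12, "year": 1}

def age_to_years_vectorized(ages: pd.Series) -> pd.Series:
    "Converts strings such as '2 weeks' or '9 years' to ages in years"
    # Split '2 weeks' into a count column and a unit column
    parts = ages.str.split(" ", expand=True)
    count = pd.to_numeric(parts[0])
    # Drop the plural 's' so 'weeks' and 'week' share one lookup key
    unit = parts[1].str.rstrip("s")
    return count / unit.map(UNITS_PER_YEAR)

sample_ages = pd.Series(["2 weeks", "1 year", "4 months", "9 years", "3 days"])
years = age_to_years_vectorized(sample_ages)
print(years.round(3).tolist())
```

Unlike `age_to_year`, this version keeps the index of the input Series, which matters if the result is later aligned with other columns.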

In [9]:
# Transforming monthyear column 
# Want to extract the month to see if time of year impacts outcome 
# Format of column: 2016-09-23T17:09:00
def extract_month(x):
    "Extracts month from column"
    month = []
    for row in x: 
        # Making missing values be NaN for imputation later on 
        if row == 'NA':
            month.append(np.nan)
        # Extracting month if not NA
        else:
            month.append(int(row[5:7]))
    return pd.Series(month)

# Creating pipeline component for month transformation 
month_pipeline = make_pipeline(
    SelectColumnsTransformer('monthyear'),
    DataFrameFunctionTransformer(lambda x: x.fillna('NA')),
    DataFrameFunctionTransformer(lambda x: extract_month(x)),
    DataFrameFunctionTransformer(func = np.mean, impute = True)
)
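
The month extraction and mean imputation can also be sketched with pandas datetime parsing instead of string slicing. The timestamps below are invented examples in the same format as the column, and this is an illustrative alternative rather than the pipeline used here.

```python
import pandas as pd

# Toy timestamps in the column's format; None stands in for a missing value
timestamps = pd.Series(["2014-07-22T16:04:00", None, "2013-11-07T11:47:00"])

# Missing entries become NaT, and their month comes out as NaN
months = pd.to_datetime(timestamps).dt.month

# Fill the gaps with the mean month of the observed values
filled = months.fillna(months.mean())
print(filled.tolist())
```

In the pipeline above the mean is learned during `fit` on the training data; this sketch computes it on the same toy Series only to stay short.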

In [10]:
# Transforming name column
# I am interested to see if having a name (versus not having a name) impacts the outcome 
def has_name(x):
    "Determines if listed animal has a name ('yes' or 'no')"
    name = []
    for row in x: 
        # Does not have name 
        if row == 'NA':
            name.append('no')
        # Does have name 
        else:
            name.append('yes')
    return pd.Series(name)
    
# Creating pipeline component for name transformation 
name_pipeline = make_pipeline(
    SelectColumnsTransformer('name'),
    DataFrameFunctionTransformer(lambda x: x.fillna('NA')),
    DataFrameFunctionTransformer(lambda x: has_name(x)),
    ToDummiesTransformer()
)

In [11]:
# Transforming fixed_status column 
# Current values are Intact, Spayed, Neutered
# Spayed/Neutered could be correlated with the gender column 
# This column will now only have two possible values: 'Intact' or 'Fixed' (for Spayed/Neutered)
def to_fixed(x):
    status = []
    for row in x: 
        if row == 'NA': 
            status.append(row)
        elif row == 'Intact':
            status.append(row)
        else:
            status.append('Fixed')
    return pd.Series(status)

# Creating pipeline component for fixed_status transformation 
fixed_pipeline = make_pipeline(
    SelectColumnsTransformer('fixed_status'),
    DataFrameFunctionTransformer(lambda x: x.fillna('NA')),
    DataFrameFunctionTransformer(lambda x: to_fixed(x)),
    DataFrameFunctionTransformer(lambda x: 'NA', impute = True),
    ToDummiesTransformer()
)

In [12]:
# Imputing missing values on remaining columns 
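def fix_na(x): 
    "Replaces missing entries with the string 'NA'"
    res = []
    for row in x: 
        # Missing values arrive as NaN/None rather than strings
        if not isinstance(row, str) and pd.isna(row):
            res.append('NA')
        # Keep every other value so the output stays aligned with the input
        else: 
            res.append(row)
    return pd.Series(res)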

# Creating pipeline component for remaining column transformations
remaining = ['animal_type', 'gender']
remaining_pipeline = make_pipeline(
    SelectColumnsTransformer(remaining),
    DataFrameFunctionTransformer(lambda x: fix_na(x)), 
    DataFrameFunctionTransformer(lambda x: 'NA', impute = True),
    ToDummiesTransformer()
)

In summary, the following feature engineering was performed:
* age_upon_outcome
    * This feature column initially consisted of strings representing the animal's age in various units (eg. 2 weeks, 1 month, 7 years) 
    * To make this column numerically comparable, I converted all of the ages to the same scale, years
    * The age pipeline above has no imputation step, so it relies on this column having no missing values
* monthyear
    * This feature column initially consisted of datetime strings (eg. 2016-09-23T17:09:00) representing when the animal was listed on the site
    * Since I was most interested in whether or not time of year impacts an animal's chance of adoption (eg. maybe people tend to adopt more in December, around the holidays), I extracted the month from this feature column.
    * Lastly, I imputed missing values using the mean 
* name
    * This feature column initially consisted of the animal's assigned name (or NA if it did not have a name listed)
    * I decided to just use whether or not the animal had a name listed to make it more applicable to the model  
    * Lastly, I imputed missing values and then OneHotEncoded the feature column
* fixed_status
    * This feature column initially consisted of 3 possible values: Intact, Spayed, Neutered
    * I decided to combine the Spayed/Neutered values into a more general group, Fixed, since the initial categories are gender specific
    * Lastly, I imputed missing values and then OneHotEncoded the feature column
* animal_type
    * This feature column consisted of 2 possible values: Dog, Cat
    * There were no major feature engineering requirements but I did impute missing values and then OneHotEncoded the feature column 
* gender
    * This feature column consisted of 2 possible values: Male, Female
    * There were no major feature engineering requirements but I did impute missing values and then OneHotEncoded the feature column 

In [13]:
# Combining feature engineering pipelines into one final preprocessing pipeline
prep_pipeline = DataFrameFeatureUnion([age_pipeline, month_pipeline, name_pipeline, fixed_pipeline, remaining_pipeline])

## Algorithms & Search

I performed a randomized search for the optimal model and its hyperparameters. I included the following models in my search:     
* RandomForestClassifier()  
* RidgeClassifier()     
* LogisticRegression() - multiclass

In [20]:
# Dummy Class to be Used for filler in Pipeline Transformations
class DummyEstimator(BaseEstimator):
    "Pass through class, methods are present but do nothing."
    def fit(self): pass
    def score(self): pass

In [21]:
# Define space of algorithms and hyperparameters to search 
search_space = [
                {'clf': [RandomForestClassifier()],  
                 # Number of trees to include (more may improve generality of model)
                 'clf__n_estimators': [5, 10, 50, 75, 100, 150, 200],
                 # limit tree depth to prevent overfitting 
                 'clf__max_depth': [10, 25, 50, None],
                 # Minimum number of samples needed per leaf (helps prevent overfitting)
                 'clf__min_samples_leaf': [1,2,3,4,5],
                 # Minimum number of samples required to split a node (lower values may lead to overfitting)
                 'clf__min_samples_split': [2,3,4,5],
                 # Can balance class weights to even things out (helpful if there is a huge imbalance)
                 'clf__class_weight': [None, "balanced", "balanced_subsample"]
                },
    
                {'clf': [RidgeClassifier()],
                 # Specifies regularization strength
                 'clf__alpha': [0, 0.2, 0.5, 0.75, 1.0],
                 # So X does not get overwritten when fitting model 
                 'clf__copy_X': [True],
                 # Cap on the number of solver iterations
                 'clf__max_iter': [50, 100, 150, 200, 300]
                },
    
                {'clf': [LogisticRegression()], 
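                 # norm used in regularization 
                 # Note: the default 'lbfgs' solver only supports 'l2' or no penalty, so
                 # 'l1' and 'elasticnet' candidates fail to fit unless solver='saga' is set
                 # (elasticnet also needs l1_ratio)
                 'clf__penalty': ['l1', 'l2', 'elasticnet', 'none'],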
                  # Can balance class weights
                 'clf__class_weight': [None, 'balanced'],
                 # Cap on the number of solver iterations
                 'clf__max_iter': [50, 100, 150, 200, 300],
                  # More than 2 outcomes
                 'clf__multi_class':['multinomial']}
                ]

In [22]:
# Set up for random search 
pipe = Pipeline([('clf', DummyEstimator())]) 
clf_algos_rand = RandomizedSearchCV(estimator=pipe, 
                                    param_distributions=search_space, 
                                    n_iter=25,
                                    cv=5, 
                                    n_jobs=-1,
                                    verbose=1)

In [23]:
# Perform search 
best_model = clf_algos_rand.fit(prep_pipeline.fit_transform(X_train), y_train);
# View best model (from search results)
best_model.best_estimator_.get_params()['clf']

Fitting 5 folds for each of 25 candidates, totalling 125 fits


RandomForestClassifier(max_depth=10, min_samples_leaf=4, min_samples_split=3,
                       n_estimators=150)

The best model from the candidate search is the RandomForestClassifier() with the above hyperparameters (default ones not shown).
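
To see how the other sampled candidates compared, the fitted search object also exposes `cv_results_`, which can be ranked by mean cross-validated score. The sketch below runs a much smaller search on synthetic data from `make_classification`. The dataset, the `_toy` names, and the tiny search space are all assumptions made for illustration, not the search used in this project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed training data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# The 'clf' step is a placeholder that each search-space entry replaces
pipe_toy = Pipeline([("clf", LogisticRegression())])
space_toy = [
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [10, 25]},
    {"clf": [LogisticRegression(max_iter=500)],
     "clf__C": [0.1, 1.0]},
]

search_toy = RandomizedSearchCV(pipe_toy, space_toy, n_iter=3, cv=3,
                                random_state=0)
search_toy.fit(X_toy, y_toy)

# One row per sampled candidate, best-ranked first
results = pd.DataFrame(search_toy.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["param_clf", "mean_test_score", "std_test_score"]
]
print(ranking)
```

Looking at the full ranking, rather than only the winner, shows whether the best candidate is clearly ahead or within noise of the alternatives.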

## Final Model and Discussion

In [28]:
# Final Model with optimal parameters from Randomized Search
final_model_pipe = Pipeline([('preprocessing', prep_pipeline),
                    ('rf', RandomForestClassifier(n_estimators=150,
                                                  max_depth = 10,
                                                  min_samples_leaf=4,
                                                  min_samples_split=3))])

In [29]:
# Visualizing final pipeline
set_config(display='diagram')
final_model_pipe

In [30]:
# Fitting Model 
final_model_pipe.fit(X_train, y_train)

# Making predictions
final_pred  = final_model_pipe.predict(X_test)

#### Evaluation Metric 1: Accuracy Score

I chose Accuracy Score as one of my evaluation metrics because it tells you about the proportion of times that the model is able to get the classification exactly right.  This is a quick and straightforward way to evaluate the model's overall accuracy.

In [41]:
# Accuracy Score of Final Model on Testing Data 
accuracy_score(y_test, final_pred)

0.7750904757701474

In [43]:
# Accuracy Score of Final Model on Training Data
accuracy_score(y_train, final_model_pipe.predict(X_train))

0.7830155367231638

The Accuracy Score means that my RandomForestClassifier() machine learning model is able to correctly predict the shelter animal's outcome 77.5% of the time on the test dataset.  I also computed the Accuracy for the training data, and it indicates that the model is able to correctly predict the shelter animal's outcome 78.3% of the time on the training dataset. Since the accuracy scores on the training and testing sets are so close, this gives me confidence in the generality of the model. 

#### Evaluation Metric 2: F1 Score

I chose the F1 Score as my second evaluation metric because it combines precision and recall into a single number: for each class, it is the harmonic mean of that class's precision and recall. With average='macro', the per-class scores are averaged with equal weight, so rare outcomes count as much as common ones. 

In [44]:
# F1 Score of Final Model
f1_score(y_test, final_pred, average='macro')

0.23716605183484238

In [46]:
# F1 Score of Final Model on Training Data
f1_score(y_train, final_model_pipe.predict(X_train), average='macro')

0.25454762621541915

The F1 score of 0.237 indicates that the model has poor precision and recall on at least some of the outcome classes. Because the macro average weights every class equally, the gap between the accuracy (0.775) and the macro F1 score suggests that the model does well on the most common outcomes but poorly on the rarer ones. Like the accuracy score, the F1 score on the training data (0.255) is barely better than the F1 score on the testing data, which further validates the generality of the model.   
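
A small worked example shows how a high accuracy can sit alongside a low macro F1 score when classes are imbalanced. The sketch below computes both metrics by hand on invented labels. Per-class F1 is taken as 2TP / (2TP + FP + FN), set to 0 when a class has no true or predicted members, and the macro score is the unweighted mean over classes.

```python
def accuracy(y_true, y_pred):
    "Fraction of predictions that match the true label"
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    "Unweighted mean of the per-class F1 scores"
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented labels: 8 animals in a common class, 2 in a rare class,
# and a model that always predicts the common class
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 10

print(accuracy(y_true_toy, y_pred_toy))   # 0.8
print(round(macro_f1(y_true_toy, y_pred_toy), 3))   # 0.444
```

Here the rare class scores an F1 of 0 and pulls the macro average down to about 0.44, even though 80% of predictions are correct.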

I will discuss possible model modifications to improve the evaluation metrics in the next section.

## Summary

In [50]:
# Final Model 
best_model.best_params_['clf']

In future iterations of the project, I would like to make improvements in the following areas:
* I would like to perform a more in-depth analysis of feature importance. I did some EDA and visualizations to explore the impact of the different data features on shelter animal outcomes, but I would like to use sklearn's permutation_importance() functionality to automate this and make the evaluation more robust. 
* The target variable consisted of many closely related outcomes; the current possible outcomes include 'Transfer', 'Adoption', 'Died', 'Euthanasia', 'Missing', 'Disposal', 'Rto-Adopt'. In the future, I might consider combining some of these into one group.  I think that this might improve the model's evaluation metrics because closely related outcomes would be treated as the same class. For example, I might try turning this into a binary classification model by aggregating the current target outcomes into 2 classes, positive outcome and negative outcome. 
* I would also like to incorporate other variables that were available in the dataset, specifically Breed and Fur Color. Since these values are manually inputted, there are way too many unique values to include in the model. I would like to look into more advanced feature engineering techniques, possibly incorporating some NLP, to extract the underlying information from these columns. 
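
As a rough idea of what the feature importance analysis in the first bullet could look like, here is a minimal sketch using `permutation_importance` from `sklearn.inspection`. It runs on synthetic data from `make_classification`, and the `_toy` names, model settings, and feature indices are assumptions for illustration only, not results for the shelter data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed shelter features
X_toy, y_toy = make_classification(n_samples=300, n_features=5,
                                   n_informative=2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25,
                                            random_state=0)

model_toy = RandomForestClassifier(n_estimators=50, random_state=0)
model_toy.fit(X_tr, y_tr)

# Shuffle each feature in turn on held-out data and record the score drop
result = permutation_importance(model_toy, X_val, y_val, n_repeats=10,
                                random_state=0)

# Features sorted from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Measuring the drop on held-out data, rather than on the training set, gives a view of which features the model relies on for generalization.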

With further improvements, this model could be very useful in the animal welfare sector. Thousands of dogs and cats are registered in rescue/shelter/pound systems each year. Many of these animals are fortunate enough to find forever homes, but others are not so lucky. According to the ASPCA, over 1.5 million shelter animals are euthanized each year due to overcapacity. To help lower this number, this model could be used as a way to evaluate shelter animal adoption profiles to optimize their chance of finding a family of their own. 