# 3. Modeling
___
## 3.2 Logistic Regression Model Tuning

In this notebook we will be tuning Logistic Regression models on the Austin and Dallas datasets separately to predict the outcome probabilities for each city. The main metrics we're looking at is specificity, with accuracy and precision as secondary metrics.

In [1]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, confusion_matrix, ConfusionMatrixDisplay

## Austin Shelter Model

In [16]:
# read in Austin data
animals_austin = pd.read_csv('../data/austin-data.csv')
animals_austin.head()

Unnamed: 0,animal_id,outcome_time,date_of_birth,outcome_type,outcome_gender,outcome_age,intake_time,found_location,intake_type,intake_condition,animal_type,intake_gender,intake_age,breed,color,stay,repeat,animal_stay,stay_duration,spay_neuter
0,A912799,2024-10-17 13:07:00,2024-07-21,Adoption,Spayed Female,2.0,2024-09-05 14:57:00,7201 Levander Loop in Austin (TX),Abandoned,Normal,Cat,Intact Female,1.0,Domestic Shorthair,Brown Tabby,1,0,A912799-1,41,1
1,A912055,2024-10-17 12:25:00,2023-10-25,Adoption,Neutered Male,11.0,2024-08-25 08:20:00,1800 Fairlawn Lane in Austin (TX),Stray,Injured,Cat,Intact Male,10.0,Domestic Shorthair,Brown Tabby/White,1,0,A912055-1,53,1
2,A915002,2024-10-17 12:21:00,2023-10-10,Return to Owner,Intact Male,12.0,2024-10-10 12:10:00,Austin (TX),Public Assist,Normal,Dog,Intact Male,12.0,German Shepherd Mix,Tan,1,0,A915002-1,7,0
3,A912548,2024-10-17 11:45:00,2021-09-02,Adoption,Neutered Male,36.0,2024-09-02 22:31:00,6900 Bryn Mawr in Austin (TX),Stray,Normal,Dog,Intact Male,36.0,Siberian Husky Mix,Black/White,1,0,A912548-1,44,1
4,A915279,2024-10-17 00:00:00,2022-10-14,Transfer,Intact Female,24.0,2024-10-14 11:47:00,14514 Highsmith Street in Austin (TX),Stray,Normal,Cat,Intact Female,24.0,Domestic Shorthair,Black,1,0,A915279-1,2,0


In [17]:
# convert target variable to 0's and 1's
animals_austin['outcome_type'] = animals_austin['outcome_type'].map(lambda x: 1 if x in ['Adoption', 'Return to Owner'] else 0)

# dummify cats and dogs
animals_austin['animal_type'] = animals_austin['animal_type'].map({'Cat': 0, 'Dog': 1})

# see distribution of target variable
animals_austin['outcome_type'].value_counts(normalize=True)

outcome_type
1    0.660305
0    0.339695
Name: proportion, dtype: float64

In [4]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration', 'breed']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [5]:
# count vectorize features
cv = CountVectorizer(stop_words = ['mix'])

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [6]:
# pipeline with standard scaler and logistic regression
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('logr', LogisticRegression())
])

# parameters for grid search
params = {
    'logr__C': np.linspace(0.00001, 0.001, 100)
}

gs = GridSearchCV(pipe, param_grid=params, n_jobs=-1)

gs.fit(X_train_combined, y_train)

In [7]:
gs.best_params_, gs.best_score_

({'logr__C': 0.00039999999999999996}, 0.7696959542724551)

In [8]:
print(f'Logistic Regression Training Accuracy: {gs.score(X_train_combined, y_train)}')
print(f'Logistic Regression Testing Acuracy: {gs.score(X_test_combined, y_test)}')

# confusion matrix values to calculate specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs.predict(X_test_combined))}')

Logistic Regression Training Accuracy: 0.7710251979568198
Logistic Regression Testing Acuracy: 0.7695177874619044
Specificity: 0.645104222752096
Precision: 0.8014971605575633


*Certainly beating the baseline accuracy of 63%, but specificity is low. Precision is decent, meaning when we predict an animal is adopted they are 80% of the time, but going to try and improve this as well. Trying model without removing mix from the breeds.*

In [9]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration', 'breed']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [10]:
# count vectorize features
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [12]:
# new pipeline
pipe2 = Pipeline([
    ('ss', StandardScaler()),
    ('logr', LogisticRegression())
])

# parameters for grid search
params2 = {
    'logr__C': np.linspace(0.0003, 0.001, 50)
}

gs2 = GridSearchCV(pipe2, param_grid=params2, n_jobs=-1)

gs2.fit(X_train_combined, y_train)

In [13]:
gs2.best_params_, gs2.best_score_

({'logr__C': 0.00034285714285714285}, 0.7684901712382967)

In [14]:
print(f'Training Accuracy: {gs2.score(X_train_combined, y_train)}')
print(f'Testing Accuracy: {gs2.score(X_test_combined, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs2.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs2.predict(X_test_combined))}')

Training Accuracy: 0.769458633195982
Testing Accuracy: 0.7676664103221397
Specificity: 0.6362587493269748
Precision: 0.7979836814900252


*This model did slightly worse. Trying replacing any mixed breed with just mix next.*

In [21]:
animals_austin['breed'] = animals_austin['breed'].map(lambda x: 'mix' if 'Mix' in x else x)

In [5]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration', 'breed']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [6]:
# count vectorize features
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [7]:
# new pipeline
pipe3 = Pipeline([
    ('ss', StandardScaler()),
    ('logr', LogisticRegression())
])

# parameters for grid search
params3 = {
    'logr__C': np.linspace(0.0003, 0.001, 50)
}

gs3 = GridSearchCV(pipe3, param_grid=params3, n_jobs=-1)

gs3.fit(X_train_combined, y_train)

In [8]:
gs3.best_params_, gs3.best_score_

({'logr__C': 0.0003}, 0.7706169235751534)

In [9]:
print(f'Training Accuracy: {gs3.score(X_train_combined, y_train)}')
print(f'Testing Accuracy: {gs3.score(X_test_combined, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs3.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs3.predict(X_test_combined))}')

Training Accuracy: 0.7670470728974802
Testing Accuracy: 0.7660998604346464
Specificity: 0.6547188677794016
Precision: 0.8037509836495584


*Accuracy is about the same, but got a slight improvement on Specificity. Now going to try simplifying breeds even more to either mix or purebred.*

In [23]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration', 'breed']].copy()
y = animals_austin['outcome_type']

# map 1 for purebred and 0 for mix breeds
X['breed'] = X['breed'].map(lambda x: 0 if x == 'mix' else 1)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [16]:
# new pipeline
pipe4 = Pipeline([
    ('ss', StandardScaler()),
    ('logr', LogisticRegression())
])

# parameters for grid search
params4 = {
    'logr__C': np.linspace(0.0001, 0.001, 50)
}

gs4 = GridSearchCV(pipe4, param_grid=params4, n_jobs=-1)

gs4.fit(X_train, y_train)

In [17]:
gs4.best_params_, gs4.best_score_

({'logr__C': 0.00028367346938775514}, 0.7726961627882791)

In [18]:
print(f'Training Accuracy: {gs4.score(X_train, y_train)}')
print(f'Testing Accuracy: {gs4.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs4.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs4.predict(X_test))}')

Training Accuracy: 0.7662495490192355
Testing Accuracy: 0.7659574468085106
Specificity: 0.6410276132605184
Precision: 0.7990527448869752


*Worse than the previous model. Going to try adding intake type and intake condition.*

In [5]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration',
                    'breed', 'intake_type', 'intake_condition']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [6]:
# count vectorize breeds
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [29]:
# onehotencoder for intake type and condition features
ohe = OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False)

# new pipeline
pipe5 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', ohe, ['intake_type', 'intake_condition'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('ss', StandardScaler()),
    ('logr', LogisticRegression())
])

# parameters for grid search
params5 = {
    'logr__C': np.linspace(0.0001, 0.001, 50)
}

gs5 = GridSearchCV(pipe5, param_grid=params5, n_jobs=-1)

gs5.fit(X_train_combined, y_train)

In [30]:
gs5.best_params_, gs5.best_score_

({'logr__C': 0.00013673469387755102}, 0.799802489340465)

In [31]:
print(f'Training Accuracy: {gs5.score(X_train_combined, y_train)}')
print(f'Testing Accuracy: {gs5.score(X_test_combined, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs5.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs5.predict(X_test_combined))}')

Training Accuracy: 0.794561646696922
Testing Accuracy: 0.7915634167877182
Specificity: 0.6215675717252519
Precision: 0.8002436053593179


*This model had the best accuracy so far, but also had the worst specificity. Next going to try some boosted and random forest models to look into stacking.*

In [10]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration',
                    'breed', 'intake_type', 'intake_condition']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [11]:
# count vectorize breeds
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [15]:
# onehotencoder for intake type and condition features
ohe = OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False)

# new pipeline for random forest
pipe6 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', ohe, ['intake_type', 'intake_condition'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('rf', RandomForestClassifier(random_state = 42))
])

# parameters for grid search
params6 = {
    'rf__max_depth': [None, *range(2, 15, 2)],
    'rf__min_samples_split': [2, 3, 4, 5],
    'rf__min_samples_leaf': range(1, 7)
}

gs6 = GridSearchCV(pipe6, param_grid=params6, n_jobs=-1)

gs6.fit(X_train_combined, y_train)

160 fits failed out of a total of 1120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
160 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 475, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "C:\Users\_Cramer_\anaconda3\Lib\site-packages\skle

In [16]:
gs6.best_params_, gs6.best_score_

({'rf__max_depth': None,
  'rf__min_samples_leaf': 2,
  'rf__min_samples_split': 5},
 0.8277348166373409)

In [18]:
print(f'Training Accuracy: {gs6.score(X_train_combined, y_train)}')
print(f'Testing Accuracy: {gs6.score(X_test_combined, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs6.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs6.predict(X_test_combined))}')

Training Accuracy: 0.8515181436682301
Testing Accuracy: 0.8253154461818907
Specificity: 0.6637950926851781
Precision: 0.823158150260954


*Improvement on every metric we're following with this Random Forest Model. Trying an AdaBoost Classifier next.*

In [94]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration',
                    'breed', 'intake_type', 'intake_condition']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [27]:
# count vectorize breeds
cv = CountVectorizer()

X_train_cv = cv.fit_transform(X_train['breed'])
X_test_cv = cv.transform(X_test['breed'])

X_train_cv = pd.DataFrame(X_train_cv.todense(), columns=cv.get_feature_names_out())
X_test_cv = pd.DataFrame(X_test_cv.todense(), columns=cv.get_feature_names_out())

# recombine countvectorized breeds with other features
X_train_combined = pd.concat([X_train.reset_index(drop=True), X_train_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')
X_test_combined = pd.concat([X_test.reset_index(drop=True), X_test_cv.reset_index(drop=True)], axis = 1).drop(columns='breed')

In [24]:
# onehotencoder for intake type and condition features
ohe = OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False)

# new pipeline for adaboost
pipe7 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', ohe, ['intake_type', 'intake_condition'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('ada', AdaBoostClassifier(n_estimators = 100, random_state = 42, algorithm = 'SAMME'))
])

pipe7.fit(X_train_combined, y_train)

In [25]:
print(f'Training Accuracy: {pipe7.score(X_train_combined, y_train)}')
print(f'Testing Accuracy: {pipe7.score(X_test_combined, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, pipe7.predict(X_test_combined)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, pipe7.predict(X_test_combined))}')

Training Accuracy: 0.7803486318667755
Testing Accuracy: 0.7790595004129995
Specificity: 0.6751019152372895
Precision: 0.8147287161717619


*AdaBoost Classifier had the highest specificity so far, but lower accuracy than the RandomForest Classifier. The next step is to stack these last two and the fifth Logistic Regression from above because even though the others had higher specificity, this model used the same features as the RandomForest and AdaBoost models.*

In [18]:
# set up X and y
X = animals_austin[['animal_type', 'intake_age', 'spay_neuter', 'stay_duration',
                    'breed', 'intake_type', 'intake_condition']]
y = animals_austin['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [19]:
level1_estimators = [
    ('logr_pipe', Pipeline([
        ('ctx', ColumnTransformer([
            ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False),
             ['intake_type', 'intake_condition']),
            ('cv', CountVectorizer(), 'breed'),
            ('ss', StandardScaler(), ['intake_age', 'stay_duration'])
        ], remainder = 'passthrough', verbose_feature_names_out = False)),
        ('logr', LogisticRegression(C = 0.0001367))
    ])),
    ('rf_pipe', Pipeline([
            ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False),
            ['intake_type', 'intake_condition']),
        ('cv', CountVectorizer(), 'breed')
        ], remainder = 'passthrough', verbose_feature_names_out = False)),
        ('rf', RandomForestClassifier(max_depth = None, min_samples_leaf = 2,
                                      min_samples_split = 5, random_state = 42))
    ])),
    ('ada_pipe', Pipeline([
            ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False),
            ['intake_type', 'intake_condition']),
        ('cv', CountVectorizer(), 'breed')
        ], remainder = 'passthrough', verbose_feature_names_out = False)),
        ('ada', AdaBoostClassifier(n_estimators = 100, random_state = 42, algorithm = 'SAMME'))
    ]))]

stacked_model = StackingClassifier(estimators = level1_estimators,
                                   final_estimator=LogisticRegression(),
                                   n_jobs = -1)

stacked_model.fit(X_train, y_train)

In [20]:
# suppressing passthrough warnings
import warnings
warnings.filterwarnings('ignore', category = UserWarning)


print(f'Training Accuracy: {stacked_model.score(X_train, y_train)}')
print(f'Testing Accuracy: {stacked_model.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, stacked_model.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, stacked_model.predict(X_test))}')

Training Accuracy: 0.8600912355954995
Testing Accuracy: 0.8345759076750242
Specificity: 0.665238632093198
Precision: 0.8425592082007777


*This will be the final Logistic Regression model for Austin animal shelters moving forward, it has the best accuracy of 83.46% and second best specificty of 66.52%.*

In [119]:
# saving model
with open('../models/stacked_logr_austin_model.pkl', 'wb') as file:
    pickle.dump(stacked_model, file)

---
## Dallas Shelter Model

In [2]:
# read in dallas shelter data
animals_dallas = pd.read_csv('../data/dallas-combined-shelter-data.csv')
animals_dallas.head()

Unnamed: 0,animal_id,animal_type,animal_breed,intake_type,reason,intake_date,intake_condition,outcome_type,outcome_date,outcome_condition,stay_duration
0,A1229376,CAT,DOMESTIC SH,DISPOS REQ,OTHRINTAKS,2024-10-04,DECEASED,DISPOSAL,2027-10-04,DECEASED,1095
1,A1229851,DOG,MIXED BREED,STRAY,OTHRINTAKS,2024-10-09,APP WNL,ADOPTION,2024-10-27,APP WNL,18
2,A1225816,CAT,DOMESTIC SH,FOSTER,SURGERY,2024-10-26,APP WNL,ADOPTION,2024-10-27,APP WNL,1
3,A1204135,DOG,MIXED BREED,FOSTER,FOR ADOPT,2024-10-27,APP WNL,ADOPTION,2024-10-27,APP WNL,0
4,A1231147,DOG,CHIHUAHUA SH,OWNER SURRENDER,PERSNLISSU,2024-10-24,APP WNL,ADOPTION,2024-10-27,APP WNL,3


In [3]:
# convert target variable to 0's and 1's
animals_dallas['outcome_type'] = animals_dallas['outcome_type'].map(lambda x: 1 if x in ['ADOPTION', 'RETURNED TO OWNER', 'FOSTER'] else 0)

# see distribution of target variable
animals_dallas['outcome_type'].value_counts(normalize=True)

outcome_type
0    0.549091
1    0.450909
Name: proportion, dtype: float64

In [57]:
animals_dallas['animal_breed'].nunique()

284

*Dallas shelters don't label as specifically on breeds, so will just one hot encode breed instead of count vectorizing.*

In [67]:
# set X and y
X = animals_dallas[['animal_type', 'animal_breed', 'intake_type', 'reason', 'stay_duration']]
y = animals_dallas['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [68]:
pipe8 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
        ['animal_type', 'animal_breed', 'intake_type', 'reason']),
        ('ss', StandardScaler(), ['stay_duration'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('logr', LogisticRegression())
])

params8 = {
    'logr__C': np.linspace(0.0001, 0.01, 50)
}

gs8 = GridSearchCV(pipe8, param_grid=params8, n_jobs = -1)

gs8.fit(X_train, y_train)

In [76]:
gs8.best_params_, gs8.best_score_

({'logr__C': 0.00979795918367347}, 0.6858432972607369)

In [11]:
# suppressing warnings for unknown categories below
import warnings
warnings.filterwarnings('ignore', category = UserWarning)

In [73]:
print(f'Training Accuracy: {gs8.score(X_train, y_train)}')
print(f'Testing Accuracy: {gs8.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs8.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs8.predict(X_test))}')

Training Accuracy: 0.6875690838182236
Testing Accuracy: 0.6868234999687167
Specificity: 0.7837629967241134
Precision: 0.6835369158294076


*The accuracy for the Dallas shelter data is lower than Austin, but much better with specificity. Trying without stay duration.*

In [74]:
# set X and y
X = animals_dallas[['animal_type', 'animal_breed', 'intake_type', 'reason']]
y = animals_dallas['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [77]:
pipe9 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
        ['animal_type', 'animal_breed', 'intake_type', 'reason'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('logr', LogisticRegression())
])

params9 = {
    'logr__C': np.linspace(0.0001, 0.01, 50)
}

gs9 = GridSearchCV(pipe9, param_grid=params9, n_jobs = -1)

gs9.fit(X_train, y_train)

In [78]:
gs9.best_params_, gs9.best_score_

({'logr__C': 0.01}, 0.6783248972626508)

In [79]:
print(f'Training Accuracy: {gs9.score(X_train, y_train)}')
print(f'Testing Accuracy: {gs9.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs9.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs9.predict(X_test))}')

Training Accuracy: 0.6787159273394648
Testing Accuracy: 0.6794406556966778
Specificity: 0.7960974220196553
Precision: 0.6839595567133206


*Removing stay duration as a feature dropped accuracy by 1% and raised specificity by 1%. Going to move forward with RandomForest Classifier and AdaBoost Classifier keeping stay_duration as a feature.*

In [80]:
# set X and y
X = animals_dallas[['animal_type', 'animal_breed', 'intake_type', 'reason', 'stay_duration']]
y = animals_dallas['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [81]:
pipe10 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
        ['animal_type', 'animal_breed', 'intake_type', 'reason']),
        ('ss', StandardScaler(), ['stay_duration'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('rf', RandomForestClassifier(random_state = 42))
])

params10 = {
    'rf__max_depth': [None, *range(2, 15, 2)],
    'rf__min_samples_split': [2, 3, 4, 5],
    'rf__min_samples_leaf': range(1, 7)
}

gs10 = GridSearchCV(pipe10, param_grid=params10, n_jobs = -1)

gs10.fit(X_train, y_train)

In [82]:
gs10.best_params_, gs10.best_score_

({'rf__max_depth': None,
  'rf__min_samples_leaf': 2,
  'rf__min_samples_split': 5},
 0.7393115671301483)

In [83]:
print(f'Training Accuracy: {gs10.score(X_train, y_train)}')
print(f'Testing Accuracy: {gs10.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, gs10.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, gs10.predict(X_test))}')

Training Accuracy: 0.761288035204071
Testing Accuracy: 0.7392229243571294
Specificity: 0.7397806580259222
Precision: 0.6997534921939195


*Large improvement on accuracy with RandomForest Classifier, but has a lower specificity.*

In [7]:
# set X and y
X = animals_dallas[['animal_type', 'animal_breed', 'intake_type', 'reason', 'stay_duration']]
y = animals_dallas['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [9]:
pipe11 = Pipeline([
    ('ctx', ColumnTransformer([
        ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
        ['animal_type', 'animal_breed', 'intake_type', 'reason']),
        ('ss', StandardScaler(), ['stay_duration'])
    ], remainder = 'passthrough', verbose_feature_names_out = False)),
    ('ada', AdaBoostClassifier(n_estimators = 100, algorithm = 'SAMME', random_state = 42))
])

pipe11.fit(X_train, y_train)

In [12]:
print(f'Training Accuracy: {pipe11.score(X_train, y_train)}')
print(f'Testing Accuracy: {pipe11.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, pipe11.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, pipe11.predict(X_test))}')

Training Accuracy: 0.6473961917871072
Testing Accuracy: 0.6486735906901082
Specificity: 0.7439111237715426
Precision: 0.6307401626550563


*Lower on accuracy again with AdaBoost Classifier. Going to stack the three different models again.*

In [124]:
# set X and y
X = animals_dallas[['animal_type', 'animal_breed', 'intake_type', 'reason', 'stay_duration', 'intake_condition']]
y = animals_dallas['outcome_type']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [125]:
level1_estimators_dallas = [
    ('logr_pipe', Pipeline([
        ('ctx', ColumnTransformer([
            ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
            ['animal_type', 'animal_breed', 'intake_type', 'reason'])
        ], remainder = StandardScaler(), verbose_feature_names_out = False)),
        ('logr', LogisticRegression(C = 0.00979796))
    ])),
     ('rf_pipe', Pipeline([
        ('ctx', ColumnTransformer([
            ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
            ['animal_type', 'animal_breed', 'intake_type', 'reason']),
    ], remainder = StandardScaler(), verbose_feature_names_out = False)),
    ('rf', RandomForestClassifier(min_samples_leaf = 2, min_samples_split = 5, random_state = 42))
    ])),
      ('ada_pipe', Pipeline([
        ('ctx', ColumnTransformer([
            ('ohe', OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False), 
            ['animal_type', 'animal_breed', 'intake_type', 'reason']),
        ], remainder = StandardScaler(), verbose_feature_names_out = False)),
        ('ada', AdaBoostClassifier(n_estimators = 100, algorithm = 'SAMME', random_state = 42))
    ]))]

stacked_model_dallas = StackingClassifier(estimators = level1_estimators_dallas,
                                          final_estimator = LogisticRegression(),
                                          n_jobs = -1)

stacked_model_dallas.fit(X_train, y_train)

In [126]:
print(f'Training Accuracy: {stacked_model_dallas.score(X_train, y_train)}')
print(f'Testing Accuracy: {stacked_model_dallas.score(X_test, y_test)}')

# calculating specificity
tn, fp, fn, tp = confusion_matrix(y_test, stacked_model_dallas.predict(X_test)).ravel()

print(f'Specificity: {tn / (tn + fp)}')
print(f'Precision: {precision_score(y_test, stacked_model_dallas.predict(X_test))}')

Training Accuracy: 0.7623308098187658
Testing Accuracy: 0.7397390977914033
Specificity: 0.7455917960404501
Precision: 0.7027953410981698


*Accuracy for this stacked model is the same as just the RandomForest Classifier, but improved specificity and precision so this will be the model to move forward with.*

In [127]:
# saving model
with open('../models/stacked_logr_dallas_model.pkl', 'wb') as file:
    pickle.dump(stacked_model_dallas, file)

___
## Insights

There were changes made to how we cleaned the original Austin shelter data by not dropping duplicate animal ID observations. This happened well into the project workflow so we just refit the best performing model with the newly cleaned data.

The Austin data trains models with higher accuracy while the Dallas data trains models with higher specificity.