# Animal Shelter Outcomes
---

Problem Statement...

### Background  

In 2023 6.5 million cats and dogs entered animal shelters in the US.
    3.3 million cats
    3.2 million dogs
4.8 million were adopted
    2.6 million cats
    2.2 million dogs
850,000 animals were euthanized, lost, or died in care.
690,000 animals were euhtanized.

There are 14,429 animal shelters in America.  9514 are resuce organizations and 4915 are government funded animal shelters.

https://www.shelteranimalscount.org/stats

Some of the biggest problems facing these organizations are overcrowding(1), lack of access to medical services (1), funding shortfalls (2) and staffing shortages (3).

(1)https://www.bissellpetfoundation.org/news/shelter-crisis-2022/#
(2)https://pmc.ncbi.nlm.nih.gov/articles/PMC3398531/#:~:text=Insufficient%20levels%20of%20funding%20support,standardized%20inspection%20of%20shelter%20facilities. 
(3)https://www.nationalgeographic.com/animals/article/why-animal-shelters-are-facing-a-new-crisis#:~:text=Many%20shelters%20helping%20dogs%2C%20cats,didn't%20deserve%20this.%22&text=Newly%20washed%20dog%20bowls%20and,prep%20room%20at%20the%20shelter.


While we can't affect monetary issues (funding and staffing shortages) we have some recommendations to address all of these issues.  

By understanding which animals need the most assistance, resources can be pooled by transfering an animal earlier.

Improving coordination with vetinary schools to increase access to medical services.

By examining different data collection processed we can find what data collection improves an animals outcome.




...

---
## Data Cleaning

In [1]:
# imports
import pandas as pd
import numpy as np

### Austin Shelter Intakes

In [2]:
# reading in shelter intakes
intakes =  pd.read_csv('../data/austin_animal_center_intakes_20241017.csv')
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A857105,Johnny Ringo,05/12/2022 12:23:00 AM,May 2022,4404 Sarasota Drive in Austin (TX),Public Assist,Normal,Cat,Neutered Male,2 years,Domestic Shorthair,Orange Tabby


In [3]:
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168040 entries, 0 to 168039
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         168040 non-null  object
 1   Name              119647 non-null  object
 2   DateTime          168040 non-null  object
 3   MonthYear         168040 non-null  object
 4   Found Location    168040 non-null  object
 5   Intake Type       168040 non-null  object
 6   Intake Condition  168040 non-null  object
 7   Animal Type       168040 non-null  object
 8   Sex upon Intake   168038 non-null  object
 9   Age upon Intake   168039 non-null  object
 10  Breed             168040 non-null  object
 11  Color             168040 non-null  object
dtypes: object(12)
memory usage: 15.4+ MB


*The Name column has a lot of null values and we're assuming a pet's name won't affect their chances of adoption, so going to drop this column. Also dropping MonthYear because that information is also in the DateTime column.*

In [4]:
# dropping inconsequential columns
intakes.drop(columns=['Name', 'MonthYear'], inplace=True)

In [5]:
# renaming columns to be intake specific and snake case
columns = {
    'Animal ID': 'animal_id',
    'DateTime': 'intake_time',
    'Found Location': 'found_location',
    'Intake Type': 'intake_type',
    'Intake Condition': 'intake_condition',
    'Animal Type': 'animal_type',
    'Sex upon Intake': 'intake_gender',
    'Age upon Intake': 'intake_age',
    'Breed': 'intake_breed',
    'Color': 'intake_color'    
}

intakes = intakes.rename(columns=columns)

In [6]:
# converting intake_time column to datetime format
intakes['intake_time'] = pd.to_datetime(intakes['intake_time'], format='%m/%d/%Y %I:%M:%S %p').dt.normalize()

In [7]:
intakes.nunique()

animal_id           151007
intake_time           4032
found_location       68164
intake_type              6
intake_condition        20
animal_type              5
intake_gender            5
intake_age              55
intake_breed          2969
intake_color           651
dtype: int64

*There are many animals that have more than one stay at a shelter. Detecting any common characteristics of these animals will help identify these animals earlier on.  Hopefully this allows shelters to appropriately address and apply resources for these animals to reduce the chances of them returning.*

In [8]:
# Creating a dataframe for repeat animals
repeat_intakes = intakes[intakes.duplicated(subset=['animal_id'], keep=False)].sort_values(by=['animal_id'])
f'There are {len(repeat_intakes)} animals that have made repeated intakes in Austin Animal Shelters.'

'There are 30119 animals that have made repeated intakes in Austin Animal Shelters.'

### Austin Shelter Outcomes

In [9]:
# reading in shelter outcomes
outcomes = pd.read_csv('../data/austin_animal_center_outcomes_20241017.csv')
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A882831,*Hamilton,07/01/2023 06:12:00 PM,Jul 2023,03/25/2023,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair Mix,Black/White
1,A794011,Chunk,05/08/2019 06:20:00 PM,May 2019,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
2,A776359,Gizmo,07/18/2018 04:02:00 PM,Jul 2018,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
3,A821648,,08/16/2020 11:38:00 AM,Aug 2020,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
4,A720371,Moose,02/13/2016 05:59:00 PM,Feb 2016,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff


In [10]:
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167942 entries, 0 to 167941
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         167942 non-null  object
 1   Name              119733 non-null  object
 2   DateTime          167942 non-null  object
 3   MonthYear         167942 non-null  object
 4   Date of Birth     167942 non-null  object
 5   Outcome Type      167896 non-null  object
 6   Outcome Subtype   77144 non-null   object
 7   Animal Type       167942 non-null  object
 8   Sex upon Outcome  167940 non-null  object
 9   Age upon Outcome  167926 non-null  object
 10  Breed             167942 non-null  object
 11  Color             167942 non-null  object
dtypes: object(12)
memory usage: 15.4+ MB


*Dropping Name and MonthYear columns for outcomes data as well as Outcome Subtype. This column has even more null values than Name and we'll be focusing on the primary Outcome Type only.*

In [11]:
# dropping inconsequential columns
outcomes.drop(columns=['Name', 'MonthYear', 'Outcome Subtype'], inplace=True)

In [12]:
# renaming columns to be outcome specific and snake case
outcome_columns = {
    'Animal ID': 'animal_id',
    'DateTime': 'outcome_time',
    'Date of Birth': 'date_of_birth',
    'Outcome Type': 'outcome_type',
    'Animal Type': 'outcome_animal_type',
    'Sex upon Outcome': 'outcome_gender',
    'Age upon Outcome': 'outcome_age',
    'Breed': 'outcome_breed',
    'Color': 'outcome_color'    
}

outcomes = outcomes.rename(columns=outcome_columns)

In [13]:
# converting outcome_time and date of birth columns to datetime format
outcomes['outcome_time'] = pd.to_datetime(outcomes['outcome_time'], format='%m/%d/%Y %I:%M:%S %p').dt.normalize()
outcomes['date_of_birth'] = pd.to_datetime(outcomes['date_of_birth'], format='%m/%d/%Y')

In [14]:
outcomes.nunique()

animal_id              150912
outcome_time             4020
date_of_birth            8501
outcome_type               11
outcome_animal_type         5
outcome_gender              5
outcome_age                55
outcome_breed            2969
outcome_color             653
dtype: int64

*Identifying duplicate animal_id's for outcomes.  Unfortunately, the datasets are separated and while the animal IDs crossover, there isn't a unique identifier linking the intake and outcome observations.

For the purpose of predictive modeling, this information might be pertinet, but could also be absent as there are other characteristics to consider.  For the sake of the unknown the data will be preserved as reasonably as practical.

This means that intakes and outcomes will be sequentially linked together.  This is not perfect as some intakes and outcomes are missing as well as the presence of typos.*

In [15]:
repeat_outcomes = outcomes[outcomes.duplicated(subset=['animal_id'], keep=False)].sort_values(by=['animal_id'])
f'There are {len(repeat_outcomes)} animals that have made repeated outcomes in Austin Animal Shelters.'

'There are 30117 animals that have made repeated outcomes in Austin Animal Shelters.'

### Merge Austin DataFrames

In [16]:
# The two dataframes are concat together.  This allows to sort all observations sequentially.
# A column is added to create a date column.
df = pd.concat([repeat_intakes, repeat_outcomes])
df['sequential_date'] = df['intake_time']
df['sequential_date'] = df['sequential_date'].fillna(df['outcome_time'])

df.sort_values(by=['animal_id','sequential_date'], inplace=True)
df.head(10)

Unnamed: 0,animal_id,intake_time,found_location,intake_type,intake_condition,animal_type,intake_gender,intake_age,intake_breed,intake_color,outcome_time,date_of_birth,outcome_type,outcome_animal_type,outcome_gender,outcome_age,outcome_breed,outcome_color,sequential_date
113926,A006100,2014-03-07,8700 Research in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6 years,Spinone Italiano Mix,Yellow/White,NaT,NaT,,,,,,,2014-03-07
145647,A006100,NaT,,,,,,,,,2014-03-08,2007-07-09,Return to Owner,Dog,Neutered Male,6 years,Spinone Italiano Mix,Yellow/White,2014-03-08
5413,A006100,2014-12-19,8700 Research Blvd in Austin (TX),Public Assist,Normal,Dog,Neutered Male,7 years,Spinone Italiano Mix,Yellow/White,NaT,NaT,,,,,,,2014-12-19
71767,A006100,NaT,,,,,,,,,2014-12-20,2007-07-09,Return to Owner,Dog,Neutered Male,7 years,Spinone Italiano Mix,Yellow/White,2014-12-20
25241,A006100,2017-12-07,Colony Creek And Hunters Trace in Austin (TX),Stray,Normal,Dog,Neutered Male,10 years,Spinone Italiano Mix,Yellow/White,NaT,NaT,,,,,,,2017-12-07
128276,A006100,NaT,,,,,,,,,2017-12-07,2007-07-09,Return to Owner,Dog,Neutered Male,10 years,Spinone Italiano Mix,Yellow/White,2017-12-07
123976,A245945,2014-07-03,Garden And Mildred in Austin (TX),Stray,Normal,Dog,Neutered Male,14 years,Labrador Retriever Mix,Tan,NaT,NaT,,,,,,,2014-07-03
114595,A245945,NaT,,,,,,,,,2014-07-04,2000-05-23,Return to Owner,Dog,Neutered Male,14 years,Labrador Retriever Mix,Tan,2014-07-04
152866,A245945,2015-05-20,7403 Blessing Ave in Austin (TX),Stray,Normal,Dog,Neutered Male,15 years,Labrador Retriever Mix,Tan,NaT,NaT,,,,,,,2015-05-20
87896,A245945,NaT,,,,,,,,,2015-05-25,2000-05-23,Transfer,Dog,Neutered Male,15 years,Labrador Retriever Mix,Tan,2015-05-25


In [17]:
df.tail()

Unnamed: 0,animal_id,intake_time,found_location,intake_type,intake_condition,animal_type,intake_gender,intake_age,intake_breed,intake_color,outcome_time,date_of_birth,outcome_type,outcome_animal_type,outcome_gender,outcome_age,outcome_breed,outcome_color,sequential_date
167957,A913872,2024-10-16,2001 Hwy 71 in Austin (TX),Public Assist,Normal,Dog,Intact Male,3 years,Bulldog,Blue Merle/White,NaT,NaT,,,,,,,2024-10-16
167154,A913892,2024-09-23,3201 Burleson Rd in Austin (TX),Stray,Normal,Cat,Intact Female,2 months,Domestic Shorthair,Brown Tabby,NaT,NaT,,,,,,,2024-09-23
167410,A913892,NaT,,,,,,,,,2024-09-27,2024-06-27,Adoption,Cat,Spayed Female,3 months,Domestic Shorthair,Brown Tabby,2024-09-27
167153,A913892,2024-10-04,Austin (TX),Owner Surrender,Normal,Cat,Spayed Female,3 months,Domestic Shorthair,Brown Tabby,NaT,NaT,,,,,,,2024-10-04
167431,A913892,NaT,,,,,,,,,2024-10-04,2024-06-27,Adoption,Cat,Spayed Female,3 months,Domestic Shorthair,Brown Tabby,2024-10-04


In [18]:
# creating a df for intakes and dropping empty intake columns created on the concat
df1_index = df[df['intake_time'].isna()].index
df1 = df.drop(labels=df1_index)
df1.dropna(axis=1, inplace=True)

In [19]:
# creating a unique identifier, animal_stay, to tie with outcomes
df1.sort_values(by=['animal_id', 'sequential_date'], inplace=True)
df1['stay'] = df1.groupby('animal_id').cumcount() +1 
df1['stay'] = df1['stay'].astype('str')
df1['animal_stay'] = df1['animal_id'] + '-' + df1['stay']
df1.shape

(24708, 13)

In [20]:
# dropping empty outcome columns created on the concat
df2_index = df[df['outcome_time'].isna()].index
df2 = df.drop(labels=df2_index)
df2.dropna(axis=1, inplace=True)

In [21]:
# creating a unique identifier, animal_stay, to tie with intakes
df2.sort_values(by=['animal_id', 'sequential_date'], inplace=True)
df2['stay'] = df2.groupby('animal_id').cumcount() +1 
df2['stay'] = df2['stay'].astype('str')
df2['animal_stay'] = df2['animal_id'] + '-' + df2['stay']
df2.shape

(24706, 11)

In [22]:
# merging the two dataframes
df3 = pd.merge(left=df2, right=df1, how='inner', on='animal_stay')

In [23]:
# calculate the duration in the shelter
df3['stay_duration'] = df3['outcome_time'] - df3['intake_time']

# convert stay_duration to number of days
df3['stay_duration'] = df3['stay_duration'].dt.days.astype(int)

# check for negative stay_duration
df3[df3['stay_duration'] < 0]['stay_duration'].sort_values()

4350    -3467
1808    -3401
5796    -3099
4036    -2837
2640    -2819
         ... 
17880      -1
4676       -1
17820      -1
9115       -1
13311      -1
Name: stay_duration, Length: 2414, dtype: int32

In [24]:
df3[df3['stay_duration'] == -1]['stay_duration'].sort_values()

168     -1
16172   -1
15848   -1
15786   -1
15735   -1
        ..
8584    -1
8446    -1
8305    -1
8001    -1
20827   -1
Name: stay_duration, Length: 130, dtype: int32

In [25]:
df3.shape

(20920, 24)

There are a lot of observations where the stay is -1 days. We are making the assumption this is because of reporting errors between AM and PM and are going to assume they are all zero. There are 2284 additional negative day stay durations.  Instead of removing confidence from the exisiting data (by imputing), we will drop these observations.  While roughly 11% of the observations are alot, there will be over 18,000 observations remaining.  Making assumptions or simple imputations here will add too much uncertainty into the data.  Most appear to be of clerical mistakes, resulting in Missing at Random entries, entering the wrong year, month, etc OR they were already in the system when the records start BUT there is no way of knowing.

In [26]:
# change -1 day stay durations to 0
df3['stay_duration'] = df3['stay_duration'].map(lambda x: 0 if x == -1 else x)

# drop observations with a negative stay duration
print(df3.shape)
df3 = df3[df3['stay_duration'] >= 0]
print(df3.shape)

(20920, 24)
(18636, 24)


In [27]:
# Calculate upper boundary
upper_boundary = df3['stay_duration'].quantile(0.75) + (1.5 * (df3['stay_duration'].quantile(0.75) - df3['stay_duration'].quantile(0.25)))

While normally outliers above an upper boundary would be considered for removal, in the context of a pets duration in the animal shelter system, this would distort one of the main objectives, finding out which animals are prone to return and place additional stress on the system as a whole.

Additionally, the city of Austin has a threshhold for animals length of stay.  Once an animal has been in the shelter for a year it is placed on an urgent placement list.  It appears from there website there is currently one animal above 365 days, 2 approaching 365 days, and another 6 animals over 180 days.

In this context, we will remove stay durations about 365 as they appear to be definitionally exceptions and or clerical mistakes.

https://www.austintexas.gov/page/urgent-placement#:~:text=Terminology&text=Currently%2C%20any%20dog%20that%20has,automatically%20added%20to%20this%20list.&text=When%20we%20reach%20a%20level,only%20provide%20so%20much%20relief 


In [28]:
# drop observations with a stay duration excedding 365
print(df3.shape)
df3 = df3[df3['stay_duration'] <= 365]
print(df3.shape)

(18636, 24)
(17912, 24)


In [29]:
# compare intake/outcome animal_type, gender, breed, and color
print(f'Number of animal type changes: {df3[df3['animal_type'] != df3['outcome_animal_type']].shape[0]}')
print(f'Number of neuters/spays: {df3[df3['intake_gender'] != df3['outcome_gender']].shape[0]}')
print(f'Number of breed changes: {df3[df3['intake_breed'] != df3['outcome_breed']].shape[0]}')
print(f'Number of color changes: {df3[df3['intake_color'] != df3['outcome_color']].shape[0]}')

Number of animal type changes: 0
Number of neuters/spays: 6153
Number of breed changes: 0
Number of color changes: 0


*No changes in animal type, breed, or color from intake to outcome, dropping the duplicate column. Also making column showing if an animal is neutered or spayed while in the shelter.*

In [30]:
# dropping duplicate columns
df3.drop(columns=['outcome_animal_type', 'outcome_breed', 'outcome_color'], inplace=True)

# renaming columns without intake specifier
df3.rename(columns={'intake_breed': 'breed', 'intake_color': 'color', 'animal_id_x': 'animal_id'}, inplace=True)

In [31]:
df3['spay_neuter'] = (df3['intake_gender'] != df3['outcome_gender']).astype(int)

In [32]:
# checking remaining null values
df3.isnull().sum()

animal_id            0
outcome_time         0
date_of_birth        0
outcome_gender       0
outcome_age          0
sequential_date_x    0
stay_x               0
animal_stay          0
animal_id_y          0
intake_time          0
found_location       0
intake_type          0
intake_condition     0
animal_type          0
intake_gender        0
intake_age           0
breed                0
color                0
sequential_date_y    0
stay_y               0
stay_duration        0
spay_neuter          0
dtype: int64

In [33]:
# function for converting age columns to age in months
def convert_age(age): 
    '''
    Convert an age in string to age in months.
   
    Argument:
    age(str): The animal age in str format, eg. '7 years'.
   
    Return:
    float: The age converted to months. Return 0 if the unit is not list int the fucntion.
    '''
    value, unit = age.split()
    value = abs(int(value)) # assume the nagetive age is typo 
    
    if 'year' in unit:
        return value * 12
    elif 'month' in unit:
        return value
    elif 'week' in unit:
        return round(float(value * 0.23), 2)
    elif 'day' in unit:
        return round(float(value * 0.033), 2)
    else:
        return 0

In [34]:
df3['intake_age'] = df3['intake_age'].map(convert_age)
df3['outcome_age'] = df3['outcome_age'].map(convert_age)

In [35]:
# checking animal types
df3['animal_type'].value_counts(normalize=True)

animal_type
Dog      0.805549
Cat      0.192106
Other    0.002345
Name: proportion, dtype: float64

In [36]:
# see breeds under Other animal type
df3[df3['animal_type'] == 'Other']['breed'].unique()

array(['Tortoise Mix', 'Bat', 'Skunk Mix', 'Bat Mix', 'Rat Mix',
       'Rabbit Sh Mix', 'Hotot Mix', 'Chinchilla-Stnd Mix', 'Ferret Mix',
       'Lionhead', 'Californian Mix', 'Lop-English Mix', 'American',
       'Guinea Pig', 'Rabbit Sh', 'Californian', 'Florida White'],
      dtype=object)

*There are a few observations labeled as Other. Animals labeled as Other contains some household pets but also a lot of wildlife that doesn't pertain to our problem statement. Dropping everything that isn't a cat or dog.*

In [37]:
# dropping obsevations that aren't cats or dogs
print(df3.shape)
df3 = df3[df3['animal_type'] != 'Other']
df3.shape

(17912, 22)


(17870, 22)

In [38]:
df3['repeat'] = 1

In [39]:
# review remaining columns
df3.columns

Index(['animal_id', 'outcome_time', 'date_of_birth', 'outcome_gender',
       'outcome_age', 'sequential_date_x', 'stay_x', 'animal_stay',
       'animal_id_y', 'intake_time', 'found_location', 'intake_type',
       'intake_condition', 'animal_type', 'intake_gender', 'intake_age',
       'breed', 'color', 'sequential_date_y', 'stay_y', 'stay_duration',
       'spay_neuter', 'repeat'],
      dtype='object')

In [40]:
# Dropping duplicate columns and no longer needed
df3.drop(columns=['animal_id_y', 'sequential_date_x', 'sequential_date_y', 'stay_x', 'stay_y'], inplace=True)

In [41]:
# final check of column names
df3.columns

Index(['animal_id', 'outcome_time', 'date_of_birth', 'outcome_gender',
       'outcome_age', 'animal_stay', 'intake_time', 'found_location',
       'intake_type', 'intake_condition', 'animal_type', 'intake_gender',
       'intake_age', 'breed', 'color', 'stay_duration', 'spay_neuter',
       'repeat'],
      dtype='object')

In [42]:
# saving combined data to use in other notebooks
df3.to_csv('../data/austin-repeats-combined-shelter-data.csv', index=False)

---
## Data Dictionary

All intake data is from [Austin Animal Center Intakes](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm/about_data) and outcome data is from [Austin Animal Center Outcomes](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/about_data).

|feature|type|description|
|---|---|---|
|**animal_id**|*str*|Unique animal ID|
|**outcome_time**|*datetime*|Day and time of animal outcome|
|**date_of_birth**|*datetime*|Animal's date of birth|
|**outcome_gender**|*str*|Neuter/spay status at outcome|
|**outcome_age**|*float*|Animal age at outcome|
|**animal_stay**|*str*|Animal ID and stay count|
|**intake_time**|*datetime*|Day and time animal is taken in by shelter|
|**found_location**|*str*|Where animal is found|
|**intake_type**|*str*|How animal is taken in|
|**intake_condition**|*str*|Animal's health condition upon intake|
|**animal_type**|*str*|Type of animal|
|**intake_gender**|*str*|Neuter/spay status upon intake|
|**intake_age**|*float*|Animal age in months upon intake|
|**breed**|*str*|Animal breed|
|**color**|*str*|Animal color|
|**stay_duration**|*str*|How many days animal is in shelter before outcome|
|**spay_neuter**|*int*|1 if an animal is spayed or neutered while in the shelter|
|**repeat**|*int*|1 if an animal will enter animal shelters more than once|

---
## Dallas Shelters

In [41]:
# reading in data for 2014-2015
dallas_2014 = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_2014_-_2015_20241028.csv')

  dallas_2014 = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_2014_-_2015_20241028.csv')


In [42]:
pd.set_option('display.max_columns', 35)
dallas_2014.head()

Unnamed: 0,Animal Id,Animal Type,Animal Breed,Kennel Number,Kennel Status,Tag Type,Activity Number,Activity Sequence,Source Id,Census Tract,Council District,Intake Type,Intake Subtype,Intake Total,Reason,Staff Id,Intake Date,Intake Time,Due Out,Intake Condition,Hold Request,Outcome Type,Outcome Date,Outcome Time,Receipt Number,Impound Number,Service Request Number,Outcome Condition,Chip Status,Animal Origin,Additional Information,Month,Year
0,A0000575,CAT,DOMESTIC SH,AC 035,UNAVAILABLE,,,1,P0671044,W,W,STRAY,CONFINED,1,,SN,10/02/2014 12:00:00 AM,12/31/1899 11:56:00 AM,10/06/2014 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,ADOPTION,ADOPTION,10/12/2014 12:00:00 AM,12/31/1899 03:25:00 PM,R14-372380,K14-297573,,TREATABLE REHABILITABLE NON-CONTAGIOUS,SCAN NO CHIP,OVER THE COUNTER,ADOPTED,OCT.2014,FY2015
1,A0008962,DOG,LABRADOR RETR,LFD 088,LAB,,,1,P0053980,75218,18,CONFISCATED,KEEP SAFE,1,,MB,09/24/2015 12:00:00 AM,12/31/1899 03:50:00 PM,10/03/2015 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,,EUTHANIZED,10/04/2015 12:00:00 AM,12/31/1899 12:22:00 PM,,K15-328347,442631.0,TREATABLE MANAGEABLE NON-CONTAGIOUS,SCAN NO CHIP,FIELD,,SEP.2015,FY2015
2,A0121376,DOG,GERM SHEPHERD,LFD 042,LAB,,,1,P0661191,39A,9A,STRAY,CONFINED,1,,MB,05/01/2015 12:00:00 AM,12/31/1899 12:09:00 PM,05/02/2015 12:00:00 AM,TREATABLE MANAGEABLE NON-CONTAGIOUS,,EUTHANIZED,05/03/2015 12:00:00 AM,12/31/1899 11:53:00 AM,,K15-314218,,TREATABLE MANAGEABLE NON-CONTAGIOUS,SCAN CHIP,FIELD,,MAY.2015,FY2015
3,A0129114,CAT,DOMESTIC SH,PSCAT 11,UNAVAILABLE,,,1,P0055049,75243,43,OWNER SURRENDER,GENERAL,1,ALLERGIC,CBM/JS,09/19/2015 12:00:00 AM,12/31/1899 04:46:00 PM,09/22/2015 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,EVERYDAY ADOPTION CENTER,ADOPTION,10/26/2015 12:00:00 AM,12/31/1899 02:09:00 PM,R15-425259,K15-327996,,TREATABLE REHABILITABLE NON-CONTAGIOUS,SCAN CHIP,OVER THE COUNTER,VOMIT 5X 9/20,SEP.2015,FY2015
4,A0157434,DOG,ROTTWEILER,FREEZER,UNAVAILABLE,,,1,P0093154,38G,8G,OWNER SURRENDER,- DEAD ON ARRIVAL,1,,DD,12/03/2014 12:00:00 AM,12/31/1899 08:06:00 PM,12/03/2014 12:00:00 AM,UNHEALTHY UNTREATABLE NON-CONTAGIOUS,,DEAD ON ARRIVAL,12/04/2014 12:00:00 AM,12/31/1899 12:00:00 PM,,K14-302641,,UNHEALTHY UNTREATABLE NON-CONTAGIOUS,SCAN NO CHIP,FIELD,,DEC.2014,FY2015


*The time of intake and outcome are split between two columns, one for the data and one for the time of day. Since we're only evaluating how long an animal is in a shelter with how many days we'll only convert the date to datetime and sort by that. Also converting column names to snake case.*

In [43]:
# function to convert column names
def to_snake_case(columns):
    '''
    Convert dataframe column names to snake case

    Keyword arguments:
    columns -- original column names of dataframe

    Returns a dictionary to pass into pandas rename function
    '''
    return {column: column.lower().replace(' ', '_') for column in columns}

In [44]:
dallas_2014.rename(columns = to_snake_case(dallas_2014.columns), inplace = True)

In [45]:
# converting intake_date and outcome_date to datetime
dallas_2014['intake_date'] = pd.to_datetime(dallas_2014['intake_date'], format='%m/%d/%Y %I:%M:%S %p')
dallas_2014['outcome_date'] = pd.to_datetime(dallas_2014['outcome_date'], format='%m/%d/%Y %I:%M:%S %p')

# sorting by outcome_date with most recent first
dallas_2014.sort_values(by = 'outcome_date', ascending = False, inplace = True)

*Will read in the datasets from other years and combine before looking at nulls, inconsequential columns, and duplicates similar to the Austin data.*

In [46]:
# function to read in data from remainging years
def read_in_dallas_data(year):
    '''
    Function to read in csv data for animal shelter data by year

    Keyword arguments:
    year -- which year to bring in data from

    Returns dataframe with column names converted to snake case,
    intake and outcome dates converted to datetime,
    and sorted by outcome date descending.
    '''
    year = str(year)
    # read in data
    if year != '2023':
        df = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_'+year+'_-_'+str(int(year)+1)+'_20241028.csv', low_memory=False)
    else:
        df = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_'+year+'_-_'+str(int(year)+2)+'_20241028.csv', low_memory=False)
    # convert column names
    df.rename(columns=to_snake_case(df.columns), inplace = True)
    # convert intake and outcome dates to datetime
    df['intake_date'] = pd.to_datetime(df['intake_date'], format='mixed')
    df['outcome_date'] = pd.to_datetime(df['outcome_date'], format='mixed')
    # sort by outcome_date
    df.sort_values(by = 'outcome_date', ascending = False, inplace = True)
    return df

In [47]:
dallas_2015 = read_in_dallas_data(2015)
dallas_2016 = read_in_dallas_data(2016)
dallas_2017 = read_in_dallas_data(2017)
dallas_2018 = read_in_dallas_data(2018)
dallas_2019 = read_in_dallas_data(2019)
dallas_2020 = read_in_dallas_data(2020)
dallas_2021 = read_in_dallas_data(2021)
dallas_2022 = read_in_dallas_data(2022)
dallas_2023 = read_in_dallas_data(2023)

In [48]:
dallas_combined = pd.concat([dallas_2014, dallas_2015, dallas_2016, dallas_2017, dallas_2018,
                             dallas_2019, dallas_2020, dallas_2021, dallas_2022, dallas_2023])

In [49]:
# resorting combined dataframe by outcome date
dallas_combined.sort_values(by = 'outcome_date', ascending = False, inplace = True)

# dropping duplicates by animal_id to focus on most recent observation per unique animal
print(dallas_combined.shape)
#dallas_combined.drop_duplicates(subset = 'animal_id', inplace = True)
print(dallas_combined.shape)
#dallas_combined.head()

(344923, 34)
(344923, 34)


In [52]:
dallas_repeats = dallas_combined[dallas_combined['animal_id'].duplicated()]
dallas_repeats['intake_type'].value_counts(normalize=True)
#dallas_repeats.shape

intake_type
STRAY              0.573325
OWNER SURRENDER    0.194051
FOSTER             0.096768
TREATMENT          0.093715
CONFISCATED        0.026188
TRANSFER           0.006436
KEEPSAFE           0.005089
LOST REPORT        0.002279
TNR                0.001018
FOUND REPORT       0.000602
RESOURCE           0.000358
WILDLIFE           0.000172
Name: proportion, dtype: float64

In [66]:
dallas_repeats

Index(['animal_id', 'animal_type', 'animal_breed', 'kennel_status',
       'intake_type', 'reason', 'intake_date', 'intake_condition',
       'outcome_type', 'outcome_date', 'outcome_condition', 'chip_status',
       'animal_origin'],
      dtype='object')

In [59]:
# dropping duplicates by animal_id to focus on most recent observation per unique animal
print(dallas_combined.shape)
dallas_combined.drop_duplicates(subset = 'animal_id', inplace = True)
print(dallas_combined.shape)
#dallas_combined.head()

(344923, 34)
(275158, 34)


In [60]:
dallas_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 275158 entries, 42903 to 44634
Data columns (total 34 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   animal_id               275158 non-null  object        
 1   animal_type             275157 non-null  object        
 2   animal_breed            275042 non-null  object        
 3   kennel_number           275158 non-null  object        
 4   kennel_status           275158 non-null  object        
 5   tag_type                2 non-null       object        
 6   activity_number         114170 non-null  object        
 7   activity_sequence       275158 non-null  int64         
 8   source_id               275158 non-null  object        
 9   census_tract            228721 non-null  object        
 10  council_district        228721 non-null  object        
 11  intake_type             275158 non-null  object        
 12  intake_subtype          270833 n

*tag_type and service_request_number are primarily null values, going to drop these columns. Also going to drop several columns that don't have an impact on an animal's outcome type.*

In [61]:
dallas_combined.drop(columns = ['kennel_number', 'tag_type', 'activity_number', 'activity_sequence',
                                'source_id', 'census_tract', 'council_district', 'intake_subtype',
                                'intake_total', 'staff_id', 'intake_time', 'due_out', 'hold_request',
                                'outcome_time', 'receipt_number', 'impound_number', 'service_request_number',
                                'additional_information', 'month', 'year', 'outcome_subtype'], inplace = True)

In [48]:
# see animal types
dallas_combined['animal_id'].duplicated()

animal_type
DOG          191178
CAT           65699
WILDLIFE       9724
BIRD           8275
LIVESTOCK       280
D                 1
Name: count, dtype: int64

*Same as the data from Austin shelters, we are going to ignore anything that isn't a cat or dog.*

In [49]:
# see remainging null values
dallas_combined.isnull().sum()

animal_id                 0
animal_type               1
animal_breed            116
kennel_status             0
intake_type               0
reason               127247
intake_date               0
intake_condition          0
outcome_type              0
outcome_date           1139
outcome_condition     22376
chip_status           18293
animal_origin         18379
dtype: int64

*Going to create a duration column for how long an animal is in the shelter. Dropping observations with no outcome date first.*

In [50]:
# drop null outcome date rows
dallas_combined.dropna(subset = 'outcome_date', inplace = True)

# column for stay duration
dallas_combined['stay_duration'] = dallas_combined['outcome_date'] - dallas_combined['intake_date']
# convert to number of days
dallas_combined['stay_duration'] = dallas_combined['stay_duration'].dt.days.astype(int)

# check for negative stay_duration
dallas_combined[dallas_combined['stay_duration'] < 0]['stay_duration'].value_counts()

stay_duration
-333      3
-30       2
-332      2
-330      1
-331      1
-300      1
-55       1
-698      1
-272      1
-19963    1
Name: count, dtype: int64

*Just a few instances with negative stays, just dropping these rows.*

In [51]:
dallas_combined = dallas_combined[dallas_combined['stay_duration'] >= 0]

*Looking at the columns with remaing null values.*

In [52]:
dallas_combined['reason'].unique()

array(['OTHRINTAKS', 'SURGERY', 'FOR ADOPT', 'PERSNLISSU', 'OTHER',
       'NOTRIGHTFT', nan, 'SHORT-TERM', 'TRANSFER', 'BEHAVIOR', 'MEDICAL',
       'HOUSING', 'FINANCIAL', 'EVICTION', 'TNR CLINIC', 'STRAY',
       'TOO MANY', 'OWNER PROBLEM', 'AGGRESSIVE - PEOPLE',
       'AGGRESSIVE - ANIMAL', 'HOUSE SOIL', 'EUTHANASIA ILL', 'NO TIME',
       'UNKNOWN', 'DESTRUCTIVE AT HOME', 'OTHER PET', 'ESCAPES',
       'CAUTIONCAT', 'CHILD PROBLEM', 'INJURED', 'ILL', 'MOVE',
       'LANDLORD', 'VOCAL', 'COST', 'ALLERGIC', 'BITES', 'HYPER',
       'KILLED ANOTHER ANIMAL', 'NEW BABY', 'TOO BIG', 'MOVE APT',
       'DEAD ON ARRIVAL', 'OWNER DIED', 'TRAVEL', 'ATTENTION',
       'DESTRUCTIVE OUTSIDE', 'FOUND ANIM', 'RESPONSIBLE', 'TOO OLD',
       'EUTHANASIA OLD', 'ABANDON', 'AFRAID', 'FOSTER', 'BLIND/DEAF',
       'CRUELTY', 'NOFRIENDLY', 'NO HOME', 'DEAF', 'NO YARD', 'BLIND',
       'CHASES PEOPLE', 'DISOBIDIEN', 'FENCE', 'QUARANTINE',
       'EUTHANASIA BEHAV', 'ZONE', 'WRONG SEX', 'JUMPS UP', 'R

In [53]:
dallas_combined['outcome_condition'].unique()

array(['DECEASED', 'APP WNL', nan, 'APP SICK', 'CRITICAL', 'APP INJ',
       'UNDERAGE', 'GERIATRIC', 'FATAL', 'UNKNOWN', 'DEAD',
       'TREATABLE REHABILITABLE NON-CONTAGIOUS',
       'UNHEALTHY UNTREATABLE NON-CONTAGIOUS', 'HEALTHY',
       'TREATABLE MANAGEABLE NON-CONTAGIOUS',
       'UNHEALTHY UNTREATABLE CONTAGIOUS',
       'TREATABLE REHABILITABLE CONTAGIOUS',
       'TREATABLE MANAGEABLE CONTAGIOUS'], dtype=object)

In [54]:
dallas_combined['chip_status'].value_counts()

chip_status
SCAN NO CHIP                 172114
SCAN CHIP                     58552
UNABLE TO SCAN                20560
WILDLIFE - UNABLE TO SCAN      3830
WILDLIFE - UNABEL TO SCAN      1038
Name: count, dtype: int64

In [55]:
dallas_combined['animal_origin'].unique()

array(['OVER THE COUNTER', 'AGGOPS', 'COM CAT', nan, 'FIELD', 'BITE',
       'HART', 'PSPICKUP', 'OPS', 'AGGDD', 'CARE', 'SWEEP', 'RAPID',
       'NIGHT DROP'], dtype=object)

* reason: These could be pertinent to an animal being adopted or not, will fill nulls with 'none' that no reason was given.
* outcome_condition: We are going to assume if nothing was entered the animals was healthy.
* chip_status: We are assuming this will relate somewhat to outcome type, that if they are able to scan a chip they will get returned to owner. Dropping this column.
* animal_origin: Not even sure what all of these entries are, dropping this column.

In [56]:
# drop chip_status and animal_origin columns
dallas_combined = dallas_combined.drop(columns=['chip_status', 'animal_origin'])

# fill nulls for reason and outcome_condition
dallas_combined = dallas_combined.fillna({'reason': 'NONE', 'outcome_condition': 'HEALTHY'})

In [57]:
dallas_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 274005 entries, 42903 to 1377
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   animal_id          274005 non-null  object        
 1   animal_type        274004 non-null  object        
 2   animal_breed       273889 non-null  object        
 3   kennel_status      274005 non-null  object        
 4   intake_type        274005 non-null  object        
 5   reason             274005 non-null  object        
 6   intake_date        274005 non-null  datetime64[ns]
 7   intake_condition   274005 non-null  object        
 8   outcome_type       274005 non-null  object        
 9   outcome_date       274005 non-null  datetime64[ns]
 10  outcome_condition  274005 non-null  object        
 11  stay_duration      274005 non-null  int32         
dtypes: datetime64[ns](2), int32(1), object(9)
memory usage: 26.1+ MB


In [58]:
# saving combined data
dallas_combined.to_csv('../data/dallas-combined-shelter-data.csv', index=False)