# Animal Shelter Outcomes
---

Problem Statement...

### Background
...

---
## Data Cleaning

In [2]:
# imports
import pandas as pd
import numpy as np

### Austin Shelter Intakes

In [3]:
# reading in shelter intakes
intakes =  pd.read_csv('../data/austin_animal_center_intakes_20241017.csv')
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A857105,Johnny Ringo,05/12/2022 12:23:00 AM,May 2022,4404 Sarasota Drive in Austin (TX),Public Assist,Normal,Cat,Neutered Male,2 years,Domestic Shorthair,Orange Tabby


In [4]:
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168040 entries, 0 to 168039
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         168040 non-null  object
 1   Name              119647 non-null  object
 2   DateTime          168040 non-null  object
 3   MonthYear         168040 non-null  object
 4   Found Location    168040 non-null  object
 5   Intake Type       168040 non-null  object
 6   Intake Condition  168040 non-null  object
 7   Animal Type       168040 non-null  object
 8   Sex upon Intake   168038 non-null  object
 9   Age upon Intake   168039 non-null  object
 10  Breed             168040 non-null  object
 11  Color             168040 non-null  object
dtypes: object(12)
memory usage: 15.4+ MB


*The Name column has many null values, and we assume a pet's name does not affect its chances of adoption, so we drop this column. We also drop MonthYear because that information is already in the DateTime column.*

In [5]:
# dropping inconsequential columns
intakes.drop(columns=['Name', 'MonthYear'], inplace=True)

In [6]:
# renaming columns to be intake specific and snake case
columns = {
    'Animal ID': 'animal_id',
    'DateTime': 'intake_time',
    'Found Location': 'found_location',
    'Intake Type': 'intake_type',
    'Intake Condition': 'intake_condition',
    'Animal Type': 'animal_type',
    'Sex upon Intake': 'intake_gender',
    'Age upon Intake': 'intake_age',
    'Breed': 'intake_breed',
    'Color': 'intake_color'    
}

intakes = intakes.rename(columns=columns)

In [7]:
# converting intake_time column to datetime format
intakes['intake_time'] = pd.to_datetime(intakes['intake_time'], format='%m/%d/%Y %I:%M:%S %p')

In [8]:
intakes.nunique()

animal_id           151007
intake_time         115852
found_location       68164
intake_type              6
intake_condition        20
animal_type              5
intake_gender            5
intake_age              55
intake_breed          2969
intake_color           651
dtype: int64

*Many animals have more than one stay at the shelter. To get an unambiguous merge between intakes and outcomes, we keep a single row per animal: sorting by intake time (most recent first) and then dropping duplicate animal IDs keeps the most recent intake for each animal.*

In [9]:
# sort intakes by most recent intakes first
intakes.sort_values(by=['intake_time', 'animal_id'], inplace=True, ascending = False)

# drop duplicate observations
intakes.drop_duplicates(subset='animal_id', inplace = True)
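
*A minimal sketch of the sort-then-deduplicate step on toy data. The frame `toy` and its values are made up for illustration. `drop_duplicates` keeps the first occurrence by default, so sorting newest-first is what makes the most recent stay survive.*

```python
import pandas as pd

# toy intakes: animal A1 has two stays, A2 has one (illustrative values only)
toy = pd.DataFrame({
    'animal_id': ['A1', 'A1', 'A2'],
    'intake_time': pd.to_datetime(['2015-07-05 12:59', '2019-01-03 16:19', '2016-04-14 18:43']),
})

# newest rows first, then keep the first (most recent) row per animal_id
latest = (toy.sort_values(by='intake_time', ascending=False)
             .drop_duplicates(subset='animal_id', keep='first'))

print(latest)
```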

### Austin Shelter Outcomes

In [10]:
# reading in shelter outcomes
outcomes = pd.read_csv('../data/austin_animal_center_outcomes_20241017.csv')
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A882831,*Hamilton,07/01/2023 06:12:00 PM,Jul 2023,03/25/2023,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair Mix,Black/White
1,A794011,Chunk,05/08/2019 06:20:00 PM,May 2019,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
2,A776359,Gizmo,07/18/2018 04:02:00 PM,Jul 2018,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
3,A821648,,08/16/2020 11:38:00 AM,Aug 2020,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
4,A720371,Moose,02/13/2016 05:59:00 PM,Feb 2016,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff


In [11]:
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167942 entries, 0 to 167941
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         167942 non-null  object
 1   Name              119733 non-null  object
 2   DateTime          167942 non-null  object
 3   MonthYear         167942 non-null  object
 4   Date of Birth     167942 non-null  object
 5   Outcome Type      167896 non-null  object
 6   Outcome Subtype   77144 non-null   object
 7   Animal Type       167942 non-null  object
 8   Sex upon Outcome  167940 non-null  object
 9   Age upon Outcome  167926 non-null  object
 10  Breed             167942 non-null  object
 11  Color             167942 non-null  object
dtypes: object(12)
memory usage: 15.4+ MB


*Dropping the Name and MonthYear columns from the outcomes data as well, along with Outcome Subtype. Outcome Subtype has even more null values than Name, and we will focus on the primary Outcome Type only.*

In [12]:
# dropping inconsequential columns
outcomes.drop(columns=['Name', 'MonthYear', 'Outcome Subtype'], inplace=True)

In [13]:
# renaming columns to be outcome specific and snake case
outcome_columns = {
    'Animal ID': 'animal_id',
    'DateTime': 'outcome_time',
    'Date of Birth': 'date_of_birth',
    'Outcome Type': 'outcome_type',
    'Animal Type': 'outcome_animal_type',
    'Sex upon Outcome': 'outcome_gender',
    'Age upon Outcome': 'outcome_age',
    'Breed': 'outcome_breed',
    'Color': 'outcome_color'    
}

outcomes = outcomes.rename(columns=outcome_columns)

In [14]:
# converting outcome_time and date of birth columns to datetime format
outcomes['outcome_time'] = pd.to_datetime(outcomes['outcome_time'], format='%m/%d/%Y %I:%M:%S %p')
outcomes['date_of_birth'] = pd.to_datetime(outcomes['date_of_birth'], format='%m/%d/%Y')

In [15]:
outcomes.nunique()

animal_id              150912
outcome_time           140118
date_of_birth            8501
outcome_type               11
outcome_animal_type         5
outcome_gender              5
outcome_age                55
outcome_breed            2969
outcome_color             653
dtype: int64

*Dropping duplicate animal IDs for outcomes as well, keeping the most recent outcome for each animal.*

In [16]:
# sort outcomes by most recent outcomes first
outcomes.sort_values(by=['outcome_time', 'animal_id'], inplace=True, ascending = False)

# drop duplicate observations
outcomes.drop_duplicates(subset='animal_id', inplace = True)

### Merge Austin DataFrames

In [17]:
# merging intakes and outcomes
intakes_outcomes = pd.merge(left=outcomes, right=intakes, how='inner', on='animal_id')
intakes_outcomes.head()

Unnamed: 0,animal_id,outcome_time,date_of_birth,outcome_type,outcome_animal_type,outcome_gender,outcome_age,outcome_breed,outcome_color,intake_time,found_location,intake_type,intake_condition,animal_type,intake_gender,intake_age,intake_breed,intake_color
0,A915546,2024-10-17 13:50:00,2024-09-28,Transfer,Cat,Intact Male,,Domestic Shorthair,Black Smoke,2024-10-17 07:52:00,Pleasant Valley Rd/Teri Rd in Austin (TX),Stray,Sick,Cat,Intact Male,2 weeks,Domestic Shorthair,Black Smoke
1,A912799,2024-10-17 13:07:00,2024-07-21,Adoption,Cat,Spayed Female,2 months,Domestic Shorthair,Brown Tabby,2024-09-05 14:57:00,7201 Levander Loop in Austin (TX),Abandoned,Normal,Cat,Intact Female,1 month,Domestic Shorthair,Brown Tabby
2,A912797,2024-10-17 13:07:00,2024-07-27,,Cat,Neutered Male,2 months,Domestic Shorthair,Black,2024-09-05 14:57:00,7201 Levander Loop in Austin (TX),Abandoned,Normal,Cat,Intact Male,1 month,Domestic Shorthair,Black
3,A912055,2024-10-17 12:25:00,2023-10-25,Adoption,Cat,Neutered Male,11 months,Domestic Shorthair,Brown Tabby/White,2024-08-25 08:20:00,1800 Fairlawn Lane in Austin (TX),Stray,Injured,Cat,Intact Male,10 months,Domestic Shorthair,Brown Tabby/White
4,A915002,2024-10-17 12:21:00,2023-10-10,Return to Owner,Dog,Intact Male,1 year,German Shepherd Mix,Tan,2024-10-10 12:10:00,Austin (TX),Public Assist,Normal,Dog,Intact Male,1 year,German Shepherd Mix,Tan
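
*Because both frames were reduced to one row per animal_id before merging, the merge is expected to be one-to-one. A hedged sketch of how that expectation could be enforced with the `validate` argument of `pd.merge`, using made-up toy frames. Note that this only checks key uniqueness. It does not verify that an animal's most recent intake and most recent outcome belong to the same stay.*

```python
import pandas as pd

# toy frames with one row per animal_id (illustrative values only)
toy_outcomes = pd.DataFrame({'animal_id': ['A1', 'A2'],
                             'outcome_type': ['Adoption', 'Transfer']})
toy_intakes = pd.DataFrame({'animal_id': ['A1', 'A2', 'A3'],
                            'intake_type': ['Stray', 'Stray', 'Public Assist']})

# validate='one_to_one' raises MergeError if either side has repeated keys
merged = pd.merge(left=toy_outcomes, right=toy_intakes, how='inner',
                  on='animal_id', validate='one_to_one')

# the same merge fails loudly when a duplicate animal_id slips through
intakes_with_dup = pd.concat([toy_intakes, toy_intakes.iloc[[0]]])
try:
    pd.merge(toy_outcomes, intakes_with_dup, on='animal_id', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged.shape, raised)
```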


*A few columns appear in both the intake and outcome data (animal type, gender, breed, and color), so we will compare them. We will also convert the age columns to age in months. First, we add a column for how long an animal stays in the shelter.*

In [18]:
intakes_outcomes['stay_duration'] = intakes_outcomes['outcome_time'] - intakes_outcomes['intake_time']

# convert stay_duration to number of days
intakes_outcomes['stay_duration'] = intakes_outcomes['stay_duration'].dt.days.astype(int)

# check for negative stay_duration
intakes_outcomes[intakes_outcomes['stay_duration'] < 0]['stay_duration'].value_counts()

stay_duration
-1      617
-7        5
-4        5
-18       3
-23       3
       ... 
-227      1
-245      1
-80       1
-76       1
-42       1
Name: count, Length: 94, dtype: int64

In [19]:
intakes_outcomes[intakes_outcomes['stay_duration'] == -1]['outcome_type'].value_counts()

outcome_type
Transfer           401
Euthanasia          97
Return to Owner     55
Died                34
Disposal            14
Adoption            11
Relocate             2
Missing              1
Name: count, dtype: int64

*There are many observations (617) where the stay is -1 days. We assume these come from AM/PM reporting errors on same-day stays and set them all to zero. The other negative stay durations are rare (93 distinct values across 114 observations), so rather than make assumptions about them we drop those observations.*
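
*A small illustration of why a same-day timing error shows up as -1 rather than 0. The `.days` component of a negative timedelta is floored, so an outcome recorded a few hours before the intake yields -1. The timestamps below are made up. The calendar-date comparison at the end is only one possible alternative, not what this notebook does.*

```python
import pandas as pd

# outcome recorded three hours before intake on the same calendar day
intake = pd.Timestamp('2020-01-01 15:00')
outcome = pd.Timestamp('2020-01-01 12:00')

delta = outcome - intake
floored_days = delta.days  # -3 hours is stored as -1 day + 21 hours

# comparing calendar dates instead ignores the time of day
same_day_diff = (outcome.normalize() - intake.normalize()).days

print(delta, floored_days, same_day_diff)
```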

In [20]:
# change -1 day stay durations to 0
intakes_outcomes['stay_duration'] = intakes_outcomes['stay_duration'].map(lambda x: 0 if x == -1 else x)

# drop observations with a negative stay duration
print(intakes_outcomes.shape)
intakes_outcomes = intakes_outcomes[intakes_outcomes['stay_duration'] >= 0]
print(intakes_outcomes.shape)

(150098, 19)
(149984, 19)


In [21]:
# compare intake/outcome animal_type, gender, breed, and color
# (outer double quotes so the nested single quotes also parse on Python versions before 3.12)
print(f"Number of animal type changes: {intakes_outcomes[intakes_outcomes['animal_type'] != intakes_outcomes['outcome_animal_type']].shape[0]}")
print(f"Number of neuters/spays: {intakes_outcomes[intakes_outcomes['intake_gender'] != intakes_outcomes['outcome_gender']].shape[0]}")
print(f"Number of breed changes: {intakes_outcomes[intakes_outcomes['intake_breed'] != intakes_outcomes['outcome_breed']].shape[0]}")
print(f"Number of color changes: {intakes_outcomes[intakes_outcomes['intake_color'] != intakes_outcomes['outcome_color']].shape[0]}")

Number of animal type changes: 0
Number of neuters/spays: 60506
Number of breed changes: 0
Number of color changes: 0


*No changes in animal type, breed, or color from intake to outcome, so we drop the duplicate outcome columns. We also add a column showing whether an animal was spayed or neutered while in the shelter.*

In [22]:
# confirm that each animal_id appears only once after the merge
intakes_outcomes.duplicated(subset='animal_id').sum()

0

In [None]:
# dropping duplicate columns
intakes_outcomes.drop(columns=['outcome_animal_type', 'outcome_breed', 'outcome_color'], inplace=True)

# renaming columns without intake specifier
intakes_outcomes.rename(columns={'intake_breed': 'breed', 'intake_color': 'color'}, inplace=True)

In [23]:
intakes_outcomes['spay_neuter'] = (intakes_outcomes['intake_gender'] != intakes_outcomes['outcome_gender']).astype(int)
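
*One caveat worth sketching: an element-wise `!=` treats two missing values as different, so rows where gender is null on both sides are flagged as a change until the nulls are dropped. The toy frame below is made up, and the stricter `strict` flag is an illustrative alternative rather than the notebook's method.*

```python
import numpy as np
import pandas as pd

# toy data: one real neuter, one unchanged animal, one with gender missing on both sides
toy = pd.DataFrame({
    'intake_gender': ['Intact Male', 'Intact Female', np.nan],
    'outcome_gender': ['Neutered Male', 'Intact Female', np.nan],
}, dtype=object)

# same comparison as the notebook: NaN != NaN evaluates to True
naive = (toy['intake_gender'] != toy['outcome_gender']).astype(int)

# stricter variant: only flag a change when both values are present
both_known = toy[['intake_gender', 'outcome_gender']].notna().all(axis=1)
strict = ((toy['intake_gender'] != toy['outcome_gender']) & both_known).astype(int)

print(naive.tolist(), strict.tolist())
```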

In [24]:
# checking remaining null values
intakes_outcomes.isnull().sum()

animal_id            0
outcome_time         0
date_of_birth        0
outcome_type        38
outcome_gender       2
outcome_age         16
intake_time          0
found_location       0
intake_type          0
intake_condition     0
animal_type          0
intake_gender        2
intake_age           1
breed                0
color                0
stay_duration        0
spay_neuter          0
dtype: int64

*At most 59 observations contain null values, so we drop these rows.*

In [25]:
print(intakes_outcomes.shape)
intakes_outcomes.dropna(inplace=True)
print(intakes_outcomes.shape)

(149984, 17)
(149930, 17)


In [26]:
# function for converting age columns to age in months
def convert_age(age): 
    '''
    Convert an age string to an age in months.
   
    Argument:
    age (str): The animal's age as a string, e.g. '7 years'.
   
    Return:
    int or float: The age converted to months. Returns 0 if the unit is not handled by the function.
    '''
    value, unit = age.split()
    value = abs(int(value)) # assume a negative age is a typo
    
    if 'year' in unit:
        return value * 12
    elif 'month' in unit:
        return value
    elif 'week' in unit:
        return round(float(value * 0.23), 2)
    elif 'day' in unit:
        return round(float(value * 0.033), 2)
    else:
        return 0
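
*For reference, a hedged sketch of a vectorized variant that applies the same conversion factors to a whole column at once. The name `convert_age_vectorized`, the `MONTHS_PER_UNIT` mapping, and the sample strings are assumptions made for illustration. The notebook itself uses `convert_age` with `.map`.*

```python
import pandas as pd

# same approximate factors as convert_age (months per unit)
MONTHS_PER_UNIT = {'year': 12, 'month': 1, 'week': 0.23, 'day': 0.033}

def convert_age_vectorized(ages):
    '''Convert a Series of strings like "7 years" to ages in months.'''
    # split each string into a signed integer and a unit without its plural "s"
    parts = ages.str.extract(r'(?P<value>-?\d+)\s+(?P<unit>[a-z]+?)s?$')
    value = parts['value'].astype(float).abs()  # negative ages treated as typos
    factor = parts['unit'].map(MONTHS_PER_UNIT).fillna(0)  # unknown unit -> 0
    return (value * factor).round(2)

sample = pd.Series(['2 years', '11 months', '4 weeks', '3 days', '-1 years'])
months = convert_age_vectorized(sample)
print(months.tolist())
```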

In [27]:
intakes_outcomes['intake_age'] = intakes_outcomes['intake_age'].map(convert_age)
intakes_outcomes['outcome_age'] = intakes_outcomes['outcome_age'].map(convert_age)

In [28]:
# checking animal types
intakes_outcomes['animal_type'].value_counts(normalize=True)

animal_type
Dog          0.515080
Cat          0.421590
Other        0.057534
Bird         0.005603
Livestock    0.000193
Name: proportion, dtype: float64

In [29]:
# see breeds under Other animal type
intakes_outcomes[intakes_outcomes['animal_type'] == 'Other']['breed'].unique()

array(['Guinea Pig', 'Bat', 'Raccoon', 'Squirrel', 'Bat/Mex Free-Tail',
       'Opossum', 'Fox', 'Lizard/Gecko', 'Tortoise', 'Rabbit Sh', 'Deer',
       'Skunk', 'Rat', 'Jersey Wooly', 'Ringtail', 'Snake', 'Lop-French',
       'Ferret', 'Cottontail', 'Angora-English Mix',
       'Turtle/Redeared Slider', 'Lizard', 'Florida White',
       'Rabbit Sh Mix', 'Chinchilla', 'Coyote', 'Hamster',
       'Lizard/Bearded Dragon', 'Hedgehog', 'Flemish Giant',
       'Californian', 'Lop-Holland', 'Rex Mix', 'Lop-Holland Mix',
       'Californian Mix', 'Lionhead', 'Lop-Mini', 'Himalayan', 'Turtle',
       'Lionhead Mix', 'Dutch Mix', 'Rabbit Sh/Dwarf Hotot', 'Gerbil',
       'Rabbit Lh', 'Lop-English Mix', 'Angora-English', 'Snake/Python',
       'New Zealand Wht/Lop-Holland', 'Lop-Amer Fuzzy', 'Rex',
       'English Spot Mix', 'Hotot', 'New Zealand Wht', 'Harlequin Mix',
       'Armadillo', 'Rex-Mini', 'Dutch', 'English Spot', 'Cold Water',
       'Chinchilla-Stnd', 'Lop-Mini/Hotot', 'Mouse', 'Dwa

*There are very few observations labeled as Bird or Livestock. Animals labeled as Other include some household pets but also a lot of wildlife, which doesn't pertain to our problem statement. Dropping everything that isn't a cat or dog.*

In [30]:
# dropping observations that aren't cats or dogs
print(intakes_outcomes.shape)
intakes_outcomes = intakes_outcomes[intakes_outcomes['animal_type'] != 'Other']
intakes_outcomes = intakes_outcomes[intakes_outcomes['animal_type'] != 'Bird']
intakes_outcomes = intakes_outcomes[intakes_outcomes['animal_type'] != 'Livestock']
intakes_outcomes.shape

(149930, 17)


(140435, 17)

In [31]:
intakes_outcomes.head()

Unnamed: 0,animal_id,outcome_time,date_of_birth,outcome_type,outcome_gender,outcome_age,intake_time,found_location,intake_type,intake_condition,animal_type,intake_gender,intake_age,breed,color,stay_duration,spay_neuter
1,A912799,2024-10-17 13:07:00,2024-07-21,Adoption,Spayed Female,2.0,2024-09-05 14:57:00,7201 Levander Loop in Austin (TX),Abandoned,Normal,Cat,Intact Female,1.0,Domestic Shorthair,Brown Tabby,41,1
3,A912055,2024-10-17 12:25:00,2023-10-25,Adoption,Neutered Male,11.0,2024-08-25 08:20:00,1800 Fairlawn Lane in Austin (TX),Stray,Injured,Cat,Intact Male,10.0,Domestic Shorthair,Brown Tabby/White,53,1
4,A915002,2024-10-17 12:21:00,2023-10-10,Return to Owner,Intact Male,12.0,2024-10-10 12:10:00,Austin (TX),Public Assist,Normal,Dog,Intact Male,12.0,German Shepherd Mix,Tan,7,0
5,A832172,2024-10-17 12:20:00,2021-01-24,Return to Owner,Neutered Male,36.0,2024-10-10 12:10:00,Austin (TX),Public Assist,Normal,Dog,Neutered Male,36.0,Pit Bull Mix,Brown/White,7,0
6,A912548,2024-10-17 11:45:00,2021-09-02,Adoption,Neutered Male,36.0,2024-09-02 22:31:00,6900 Bryn Mawr in Austin (TX),Stray,Normal,Dog,Intact Male,36.0,Siberian Husky Mix,Black/White,44,1


In [32]:
# saving combined data to use in other notebooks
intakes_outcomes.to_csv('../data/austin-combined-shelter-data.csv', index=False)
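
*Since CSV files store datetimes as plain text, a notebook that loads this file would need to parse the date columns again. A minimal sketch using `parse_dates`, with an in-memory sample standing in for the saved file. The sample row reuses values shown above, and only a subset of columns is included for brevity.*

```python
import io
import pandas as pd

# in-memory stand-in for the saved CSV (subset of columns, for illustration)
sample_csv = io.StringIO(
    "animal_id,outcome_time,date_of_birth,intake_time,stay_duration\n"
    "A912799,2024-10-17 13:07:00,2024-07-21,2024-09-05 14:57:00,41\n"
)

# parse_dates restores the datetime dtype that is lost when writing to CSV
reloaded = pd.read_csv(sample_csv,
                       parse_dates=['outcome_time', 'date_of_birth', 'intake_time'])

print(reloaded.dtypes)
```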

---
## Data Dictionary

All intake data is from [Austin Animal Center Intakes](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm/about_data) and outcome data is from [Austin Animal Center Outcomes](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/about_data).

|feature|type|description|
|---|---|---|
|**animal_id**|*str*|Unique animal ID|
|**outcome_time**|*datetime*|Day and time of animal outcome|
|**date_of_birth**|*datetime*|Animal's date of birth|
|**outcome_type**|*str*|Outcome of animal|
|**outcome_gender**|*str*|Sex and spay/neuter status at outcome|
|**outcome_age**|*float*|Animal age in months at outcome|
|**intake_time**|*datetime*|Day and time animal is taken in by shelter|
|**found_location**|*str*|Where animal is found|
|**intake_type**|*str*|How animal is taken in|
|**intake_condition**|*str*|Animal's health condition upon intake|
|**animal_type**|*str*|Type of animal|
|**intake_gender**|*str*|Sex and spay/neuter status upon intake|
|**intake_age**|*float*|Animal age in months upon intake|
|**breed**|*str*|Animal breed|
|**color**|*str*|Animal color|
|**stay_duration**|*int*|How many days animal is in shelter before outcome|
|**spay_neuter**|*int*|1 if an animal is spayed or neutered while in the shelter, 0 otherwise|

---
## Dallas Shelters

In [33]:
# reading in data for 2014-2015
dallas_2014 = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_2014_-_2015_20241028.csv')


In [34]:
pd.set_option('display.max_columns', 35)
dallas_2014.head()

Unnamed: 0,Animal Id,Animal Type,Animal Breed,Kennel Number,Kennel Status,Tag Type,Activity Number,Activity Sequence,Source Id,Census Tract,Council District,Intake Type,Intake Subtype,Intake Total,Reason,Staff Id,Intake Date,Intake Time,Due Out,Intake Condition,Hold Request,Outcome Type,Outcome Date,Outcome Time,Receipt Number,Impound Number,Service Request Number,Outcome Condition,Chip Status,Animal Origin,Additional Information,Month,Year
0,A0000575,CAT,DOMESTIC SH,AC 035,UNAVAILABLE,,,1,P0671044,W,W,STRAY,CONFINED,1,,SN,10/02/2014 12:00:00 AM,12/31/1899 11:56:00 AM,10/06/2014 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,ADOPTION,ADOPTION,10/12/2014 12:00:00 AM,12/31/1899 03:25:00 PM,R14-372380,K14-297573,,TREATABLE REHABILITABLE NON-CONTAGIOUS,SCAN NO CHIP,OVER THE COUNTER,ADOPTED,OCT.2014,FY2015
1,A0008962,DOG,LABRADOR RETR,LFD 088,LAB,,,1,P0053980,75218,18,CONFISCATED,KEEP SAFE,1,,MB,09/24/2015 12:00:00 AM,12/31/1899 03:50:00 PM,10/03/2015 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,,EUTHANIZED,10/04/2015 12:00:00 AM,12/31/1899 12:22:00 PM,,K15-328347,442631.0,TREATABLE MANAGEABLE NON-CONTAGIOUS,SCAN NO CHIP,FIELD,,SEP.2015,FY2015
2,A0121376,DOG,GERM SHEPHERD,LFD 042,LAB,,,1,P0661191,39A,9A,STRAY,CONFINED,1,,MB,05/01/2015 12:00:00 AM,12/31/1899 12:09:00 PM,05/02/2015 12:00:00 AM,TREATABLE MANAGEABLE NON-CONTAGIOUS,,EUTHANIZED,05/03/2015 12:00:00 AM,12/31/1899 11:53:00 AM,,K15-314218,,TREATABLE MANAGEABLE NON-CONTAGIOUS,SCAN CHIP,FIELD,,MAY.2015,FY2015
3,A0129114,CAT,DOMESTIC SH,PSCAT 11,UNAVAILABLE,,,1,P0055049,75243,43,OWNER SURRENDER,GENERAL,1,ALLERGIC,CBM/JS,09/19/2015 12:00:00 AM,12/31/1899 04:46:00 PM,09/22/2015 12:00:00 AM,TREATABLE REHABILITABLE NON-CONTAGIOUS,EVERYDAY ADOPTION CENTER,ADOPTION,10/26/2015 12:00:00 AM,12/31/1899 02:09:00 PM,R15-425259,K15-327996,,TREATABLE REHABILITABLE NON-CONTAGIOUS,SCAN CHIP,OVER THE COUNTER,VOMIT 5X 9/20,SEP.2015,FY2015
4,A0157434,DOG,ROTTWEILER,FREEZER,UNAVAILABLE,,,1,P0093154,38G,8G,OWNER SURRENDER,- DEAD ON ARRIVAL,1,,DD,12/03/2014 12:00:00 AM,12/31/1899 08:06:00 PM,12/03/2014 12:00:00 AM,UNHEALTHY UNTREATABLE NON-CONTAGIOUS,,DEAD ON ARRIVAL,12/04/2014 12:00:00 AM,12/31/1899 12:00:00 PM,,K14-302641,,UNHEALTHY UNTREATABLE NON-CONTAGIOUS,SCAN NO CHIP,FIELD,,DEC.2014,FY2015


*The time of intake and outcome are each split between two columns, one for the date and one for the time of day. Since we only measure an animal's stay in whole days, we convert only the date columns to datetime and sort by those. Also converting column names to snake case.*

In [39]:
# function to convert column names
def to_snake_case(columns):
    '''
    Convert dataframe column names to snake case

    Keyword arguments:
    columns -- original column names of dataframe

    Returns a dictionary to pass into pandas rename function
    '''
    return {column: column.lower().replace(' ', '_') for column in columns}

In [40]:
dallas_2014.rename(columns = to_snake_case(dallas_2014.columns), inplace = True)

In [41]:
# converting intake_date and outcome_date to datetime
dallas_2014['intake_date'] = pd.to_datetime(dallas_2014['intake_date'], format='%m/%d/%Y %I:%M:%S %p')
dallas_2014['outcome_date'] = pd.to_datetime(dallas_2014['outcome_date'], format='%m/%d/%Y %I:%M:%S %p')

# sorting by outcome_date with most recent first
dallas_2014.sort_values(by = 'outcome_date', ascending = False, inplace = True)

*Will read in the datasets from other years and combine before looking at nulls, inconsequential columns, and duplicates similar to the Austin data.*

In [42]:
# function to read in data from remaining years
def read_in_dallas_data(year):
    '''
    Function to read in csv data for animal shelter data by year

    Keyword arguments:
    year -- which year to bring in data from

    Returns dataframe with column names converted to snake case,
    intake and outcome dates converted to datetime,
    and sorted by outcome date descending.
    '''
    year = str(year)
    # read in data
    if year != '2023':
        df = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_'+year+'_-_'+str(int(year)+1)+'_20241028.csv', low_memory=False)
    else:
        df = pd.read_csv('../data/Dallas_Animal_Shelter_Data_Fiscal_Year_'+year+'_-_'+str(int(year)+2)+'_20241028.csv', low_memory=False)
    # convert column names
    df.rename(columns=to_snake_case(df.columns), inplace = True)
    # convert intake and outcome dates to datetime
    df['intake_date'] = pd.to_datetime(df['intake_date'], format='mixed')
    df['outcome_date'] = pd.to_datetime(df['outcome_date'], format='mixed')
    # sort by outcome_date
    df.sort_values(by = 'outcome_date', ascending = False, inplace = True)
    return df
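
*The file-naming rule used in the function above can be sketched on its own, which makes the special case for 2023 easier to see. The helper name `dallas_csv_path` and its parameters are hypothetical. The path pattern and the 2023 - 2025 label are taken from the function above.*

```python
def dallas_csv_path(year, data_dir='../data', stamp='20241028'):
    '''Build the CSV path for the fiscal-year file starting in `year` (hypothetical helper).'''
    year = int(year)
    # per the function above, the file starting in 2023 is labelled 2023 - 2025
    end_year = year + 2 if year == 2023 else year + 1
    return f'{data_dir}/Dallas_Animal_Shelter_Data_Fiscal_Year_{year}_-_{end_year}_{stamp}.csv'

# paths for the nine files read by read_in_dallas_data
paths = [dallas_csv_path(year) for year in range(2015, 2024)]
for path in paths:
    print(path)
```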

In [43]:
dallas_2015 = read_in_dallas_data(2015)
dallas_2016 = read_in_dallas_data(2016)
dallas_2017 = read_in_dallas_data(2017)
dallas_2018 = read_in_dallas_data(2018)
dallas_2019 = read_in_dallas_data(2019)
dallas_2020 = read_in_dallas_data(2020)
dallas_2021 = read_in_dallas_data(2021)
dallas_2022 = read_in_dallas_data(2022)
dallas_2023 = read_in_dallas_data(2023)

In [44]:
dallas_combined = pd.concat([dallas_2014, dallas_2015, dallas_2016, dallas_2017, dallas_2018,
                             dallas_2019, dallas_2020, dallas_2021, dallas_2022, dallas_2023])

In [45]:
# resorting combined dataframe by outcome date
dallas_combined.sort_values(by = 'outcome_date', ascending = False, inplace = True)

# dropping duplicates by animal_id to focus on most recent observation per unique animal
print(dallas_combined.shape)
dallas_combined.drop_duplicates(subset = 'animal_id', inplace = True)
print(dallas_combined.shape)
dallas_combined.head()

(344923, 34)
(275158, 34)


Unnamed: 0,animal_id,animal_type,animal_breed,kennel_number,kennel_status,tag_type,activity_number,activity_sequence,source_id,census_tract,council_district,intake_type,intake_subtype,intake_total,reason,staff_id,intake_date,intake_time,due_out,intake_condition,hold_request,outcome_type,outcome_date,outcome_time,receipt_number,impound_number,service_request_number,outcome_condition,chip_status,animal_origin,additional_information,month,year,outcome_subtype
42903,A1229376,CAT,DOMESTIC SH,FREEZER,UNAVAILABLE,,,1,P1110761,3902.0,7.0,DISPOS REQ,OTC,1.0,OTHRINTAKS,JVW,2024-10-04,17:06:00,10/04/2024,DECEASED,ADOP RESCU,DISPOSAL,2027-10-04,18:00:00,,K24-644567,,DECEASED,SCAN NO CHIP,OVER THE COUNTER,,FY2024,FY2024,DISPOSAL
43357,A1229851,DOG,MIXED BREED,B17,AVAILABLE,,,1,P1111387,5200.0,1.0,STRAY,AT LARGE,1.0,OTHRINTAKS,JVW,2024-10-09,17:42:00,10/15/2024,APP WNL,ADOP RESCU,ADOPTION,2024-10-27,17:50:00,R24-622569,K24-645135,,APP WNL,SCAN CHIP,OVER THE COUNTER,,FY2024,FY2024,WESTMORELD
39489,A1225816,CAT,DOMESTIC SH,413,UNAVAILABLE,,,1,P1101968,,,FOSTER,APPOINT,1.0,SURGERY,BEC,2024-10-26,10:35:00,10/26/2024,APP WNL,ADOP RESCU,ADOPTION,2024-10-27,14:15:00,R24-622542,K24-646788,,APP WNL,SCAN CHIP,OVER THE COUNTER,,FY2024,FY2024,WESTMORELD
16153,A1204135,DOG,MIXED BREED,FOSTER,PENDING,,,1,P1083409,,,FOSTER,APPOINT,1.0,FOR ADOPT,JLC,2024-10-27,12:15:00,10/27/2024,APP WNL,ADOP RESCU,ADOPTION,2024-10-27,12:15:00,R24-622533,K24-646916,,APP WNL,SCAN CHIP,OVER THE COUNTER,,FY2024,FY2024,BY FOSTER
44442,A1231147,DOG,CHIHUAHUA SH,K03,AVAILABLE,,,1,P1113395,12900.0,9.0,OWNER SURRENDER,WALK IN,1.0,PERSNLISSU,GRA,2024-10-24,18:14:00,10/24/2024,APP WNL,ADOP RESCU,ADOPTION,2024-10-27,14:53:00,R24-622444,K24-646666,,APP WNL,SCAN NO CHIP,OVER THE COUNTER,BBW,FY2024,FY2024,WESTMORELD


In [46]:
dallas_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 275158 entries, 42903 to 44634
Data columns (total 34 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   animal_id               275158 non-null  object        
 1   animal_type             275157 non-null  object        
 2   animal_breed            275042 non-null  object        
 3   kennel_number           275158 non-null  object        
 4   kennel_status           275158 non-null  object        
 5   tag_type                2 non-null       object        
 6   activity_number         114170 non-null  object        
 7   activity_sequence       275158 non-null  int64         
 8   source_id               275158 non-null  object        
 9   census_tract            228718 non-null  object        
 10  council_district        228718 non-null  object        
 11  intake_type             275158 non-null  object        
 12  intake_subtype          270831 n

*tag_type and service_request_number are primarily null values, so we drop these columns. We also drop several columns that we don't expect to affect an animal's outcome type.*

In [47]:
dallas_combined.drop(columns = ['kennel_number', 'tag_type', 'activity_number', 'activity_sequence',
                                'source_id', 'census_tract', 'council_district', 'intake_subtype',
                                'intake_total', 'staff_id', 'intake_time', 'due_out', 'hold_request',
                                'outcome_time', 'receipt_number', 'impound_number', 'service_request_number',
                                'additional_information', 'month', 'year', 'outcome_subtype'], inplace = True)

In [48]:
# see animal types
dallas_combined['animal_type'].value_counts()

animal_type
DOG          191178
CAT           65699
WILDLIFE       9724
BIRD           8275
LIVESTOCK       280
D                 1
Name: count, dtype: int64

*Same as the data from Austin shelters, we are going to ignore anything that isn't a cat or dog.*
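
*The cells in this section do not show the cat/dog filter itself, so here is a minimal sketch of what it could look like, using `isin` on a made-up toy frame. Whether the single 'D' entry is a mistyped 'DOG' is unknown, so the sketch simply excludes it along with the null.*

```python
import pandas as pd

# toy frame covering the animal_type values seen above, plus a null
toy = pd.DataFrame({
    'animal_id': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7'],
    'animal_type': ['DOG', 'CAT', 'WILDLIFE', 'BIRD', 'LIVESTOCK', 'D', None],
})

# keep only cats and dogs; everything else (including null and 'D') is dropped
cats_dogs = toy[toy['animal_type'].isin(['CAT', 'DOG'])]

print(cats_dogs)
```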

In [49]:
# see remaining null values
dallas_combined.isnull().sum()

animal_id                 0
animal_type               1
animal_breed            116
kennel_status             0
intake_type               0
reason               127247
intake_date               0
intake_condition          0
outcome_type              0
outcome_date           1139
outcome_condition     22376
chip_status           18293
animal_origin         18379
dtype: int64

*Going to create a duration column for how long an animal is in the shelter. Dropping observations with no outcome date first.*

In [50]:
# drop null outcome date rows
dallas_combined.dropna(subset = 'outcome_date', inplace = True)

# column for stay duration
dallas_combined['stay_duration'] = dallas_combined['outcome_date'] - dallas_combined['intake_date']
# convert to number of days
dallas_combined['stay_duration'] = dallas_combined['stay_duration'].dt.days.astype(int)

# check for negative stay_duration
dallas_combined[dallas_combined['stay_duration'] < 0]['stay_duration'].value_counts()

stay_duration
-333      3
-30       2
-332      2
-330      1
-331      1
-300      1
-55       1
-698      1
-272      1
-19963    1
Name: count, dtype: int64

*Only a few instances have negative stays, so we drop these rows.*

In [51]:
dallas_combined = dallas_combined[dallas_combined['stay_duration'] >= 0]

*Looking at the columns with remaining null values.*

In [52]:
dallas_combined['reason'].unique()

array(['OTHRINTAKS', 'SURGERY', 'FOR ADOPT', 'PERSNLISSU', 'OTHER',
       'NOTRIGHTFT', nan, 'SHORT-TERM', 'TRANSFER', 'BEHAVIOR', 'MEDICAL',
       'HOUSING', 'FINANCIAL', 'EVICTION', 'TNR CLINIC', 'STRAY',
       'TOO MANY', 'OWNER PROBLEM', 'AGGRESSIVE - PEOPLE',
       'AGGRESSIVE - ANIMAL', 'HOUSE SOIL', 'EUTHANASIA ILL', 'NO TIME',
       'UNKNOWN', 'DESTRUCTIVE AT HOME', 'OTHER PET', 'ESCAPES',
       'CAUTIONCAT', 'CHILD PROBLEM', 'INJURED', 'ILL', 'MOVE',
       'LANDLORD', 'VOCAL', 'COST', 'ALLERGIC', 'BITES', 'HYPER',
       'KILLED ANOTHER ANIMAL', 'NEW BABY', 'TOO BIG', 'MOVE APT',
       'DEAD ON ARRIVAL', 'OWNER DIED', 'TRAVEL', 'ATTENTION',
       'DESTRUCTIVE OUTSIDE', 'FOUND ANIM', 'RESPONSIBLE', 'TOO OLD',
       'EUTHANASIA OLD', 'ABANDON', 'AFRAID', 'FOSTER', 'BLIND/DEAF',
       'CRUELTY', 'NOFRIENDLY', 'NO HOME', 'DEAF', 'NO YARD', 'BLIND',
       'CHASES PEOPLE', 'DISOBIDIEN', 'FENCE', 'QUARANTINE',
       'EUTHANASIA BEHAV', 'ZONE', 'WRONG SEX', 'JUMPS UP', 'R

In [53]:
dallas_combined['outcome_condition'].unique()

array(['DECEASED', 'APP WNL', nan, 'APP SICK', 'CRITICAL', 'APP INJ',
       'UNDERAGE', 'GERIATRIC', 'FATAL', 'UNKNOWN', 'DEAD',
       'TREATABLE REHABILITABLE NON-CONTAGIOUS',
       'UNHEALTHY UNTREATABLE NON-CONTAGIOUS', 'HEALTHY',
       'TREATABLE MANAGEABLE NON-CONTAGIOUS',
       'UNHEALTHY UNTREATABLE CONTAGIOUS',
       'TREATABLE REHABILITABLE CONTAGIOUS',
       'TREATABLE MANAGEABLE CONTAGIOUS'], dtype=object)

In [54]:
dallas_combined['chip_status'].value_counts()

chip_status
SCAN NO CHIP                 172114
SCAN CHIP                     58552
UNABLE TO SCAN                20560
WILDLIFE - UNABLE TO SCAN      3830
WILDLIFE - UNABEL TO SCAN      1038
Name: count, dtype: int64

In [55]:
dallas_combined['animal_origin'].unique()

array(['OVER THE COUNTER', 'AGGOPS', 'COM CAT', nan, 'FIELD', 'BITE',
       'HART', 'PSPICKUP', 'OPS', 'AGGDD', 'CARE', 'SWEEP', 'RAPID',
       'NIGHT DROP'], dtype=object)

* reason: These could be pertinent to whether an animal is adopted, so we fill nulls with 'NONE' to indicate that no reason was given.
* outcome_condition: We assume that if nothing was entered, the animal was healthy, and fill nulls with 'HEALTHY'.
* chip_status: We assume this is closely tied to outcome type, since an animal whose chip can be scanned is likely to be returned to its owner. Dropping this column.
* animal_origin: It is unclear what all of these entries mean, so we drop this column.

In [56]:
# drop chip_status and animal_origin columns
dallas_combined = dallas_combined.drop(columns=['chip_status', 'animal_origin'])

# fill nulls for reason and outcome_condition
dallas_combined = dallas_combined.fillna({'reason': 'NONE', 'outcome_condition': 'HEALTHY'})

In [57]:
dallas_combined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 274005 entries, 42903 to 1377
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   animal_id          274005 non-null  object        
 1   animal_type        274004 non-null  object        
 2   animal_breed       273889 non-null  object        
 3   kennel_status      274005 non-null  object        
 4   intake_type        274005 non-null  object        
 5   reason             274005 non-null  object        
 6   intake_date        274005 non-null  datetime64[ns]
 7   intake_condition   274005 non-null  object        
 8   outcome_type       274005 non-null  object        
 9   outcome_date       274005 non-null  datetime64[ns]
 10  outcome_condition  274005 non-null  object        
 11  stay_duration      274005 non-null  int32         
dtypes: datetime64[ns](2), int32(1), object(9)
memory usage: 26.1+ MB


In [58]:
# saving combined data
dallas_combined.to_csv('../data/dallas-combined-shelter-data.csv', index=False)