## Austin Animal Shelter Outcomes - Basic Data Cleaning and ML

The Austin Animal Shelter Outcomes dataset (https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and) contains information about animals that left the shelter's care between 2013 and 2018. In this notebook, I will do a quick cleaning of the dataset and then train some machine learning models to predict what outcome an animal will have (e.g. adopted, transferred to another shelter, euthanised), based on the features in the dataset as well as some new ones that will be created.

Some visualisations of the data have been created with Tableau and can be found on Tableau Public here:

The first two cells contain code that helps to import the data into a Kaggle notebook using their online server facility. We do not need this code here, apart from the imports of the pandas and numpy libraries. The scikit-learn library will be imported later.

In [43]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [44]:
#df = pd.read_csv("/kaggle/input/austin-animal-center-shelter-outcomes-and/aac_shelter_outcomes.csv")
#df

In [45]:
# Read in our dataset to the variable df
df = pd.read_csv("aac_shelter_outcomes.csv")

First we will get a feel for the data by looking at its shape, and looking at some of the features in a little more detail.

In [46]:
# Check out the shape of the dataset in terms of the number of rows and columns it contains
df.shape

(78256, 12)

In [47]:
# Look at the datatypes 
df.dtypes

age_upon_outcome    object
animal_id           object
animal_type         object
breed               object
color               object
date_of_birth       object
datetime            object
monthyear           object
name                object
outcome_subtype     object
outcome_type        object
sex_upon_outcome    object
dtype: object

Every column has been read in as type object; no other datatypes have been defined. We will eventually need to decide whether these features are categorical or numerical.

In [48]:
# Check out the different animal types
df["animal_type"].unique()

array(['Cat', 'Dog', 'Other', 'Bird', 'Livestock'], dtype=object)

Each animal_id is a unique identifier, so we will remove duplicate rows from the dataset based on this column.

In [49]:
df.drop_duplicates(subset=['animal_id'], inplace=True)
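Before dropping rows, it can help to check how many rows repeat an earlier animal_id, because drop_duplicates keeps only the first row per ID by default. In this notebook the step reduces the frame from 78256 to 70855 rows. Below is a minimal sketch on a made-up three-row frame (the toy values are an assumption, not taken from the dataset):

```python
import pandas as pd

# Toy frame (made-up values): animal A1 appears twice
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A1"],
    "outcome_type": ["Return to Owner", "Adoption", "Adoption"],
})

# Count rows that repeat an earlier animal_id
n_repeats = int(toy["animal_id"].duplicated().sum())

# keep="first" is the default: the earliest row per animal_id survives
deduped = toy.drop_duplicates(subset=["animal_id"])

print(n_repeats)
print(deduped)
```

Note that the later "Adoption" row for A1 is discarded here, so the outcome that survives for a returning animal is its first recorded one.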

In [50]:
df

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
78251,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04T00:00:00,2018-02-01T18:26:00,2018-02-01T18:26:00,,Foster,Adoption,Spayed Female
78252,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30T00:00:00,2018-02-01T18:06:00,2018-02-01T18:06:00,Max,,Adoption,Neutered Male
78253,,A766098,Other,Bat Mix,Brown,2017-02-01T00:00:00,2018-02-01T18:08:00,2018-02-01T18:08:00,,Rabies Risk,Euthanasia,Unknown
78254,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13T00:00:00,2018-02-01T18:32:00,2018-02-01T18:32:00,,,Adoption,Spayed Female


In [51]:
df[df["animal_type"] == "Other"].nunique()

age_upon_outcome      37
animal_id           4235
animal_type            1
breed                 94
color                110
date_of_birth       1626
datetime            3696
monthyear           3696
name                 409
outcome_subtype       13
outcome_type           7
sex_upon_outcome       5
dtype: int64

In [52]:
# Checking for null values: this is important as it may show us features we have to drop or modify
df.isnull().sum()

age_upon_outcome        8
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23653
outcome_subtype     36369
outcome_type            8
sex_upon_outcome        2
dtype: int64

Eight values are missing from the "age_upon_outcome" column. We should be able to calculate these based on other existing columns.

In [53]:
# List the rows in the dataset where the "age_upon_outcome" column consists of a null value
df[df["age_upon_outcome"].isna()]

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
68246,,A737705,Dog,Labrador Retriever Mix,Black/White,2013-11-02T00:00:00,2016-11-19T16:35:00,2016-11-19T16:35:00,*Heddy,,,
76825,,A764319,Dog,Pit Bull Mix,Black/White,2016-12-27T00:00:00,2017-12-30T16:47:00,2017-12-30T16:47:00,*Emma,,,Intact Female
77976,,A765547,Bird,Leghorn Mix,White/Red,2017-01-22T00:00:00,2018-01-25T13:23:00,2018-01-25T13:23:00,,Partner,Transfer,Intact Female
78081,,A765899,Dog,Miniature Poodle Mix,Black,2011-01-29T00:00:00,2018-01-29T15:49:00,2018-01-29T15:49:00,,Suffering,Euthanasia,Neutered Male
78114,,A765914,Cat,Domestic Shorthair Mix,Lynx Point,2017-01-29T00:00:00,2018-01-29T18:08:00,2018-01-29T18:08:00,,Suffering,Euthanasia,Intact Male
78162,,A765901,Dog,Maltese Mix,Buff,2017-01-29T00:00:00,2018-01-31T08:14:00,2018-01-31T08:14:00,,Partner,Transfer,Intact Male
78208,,A765960,Dog,Beagle/Catahoula,Tan/White,2010-02-01T00:00:00,2018-02-01T09:21:00,2018-02-01T09:21:00,,Suffering,Euthanasia,Intact Male
78253,,A766098,Other,Bat Mix,Brown,2017-02-01T00:00:00,2018-02-01T18:08:00,2018-02-01T18:08:00,,Rabies Risk,Euthanasia,Unknown


We can calculate these values by finding the difference between the "date_of_birth" and "datetime" columns: this will give us the number of days between the date of birth of the animal and the date the animal leaves the shelter. When we get these values, they will be formatted as a number of days, i.e. "15 days", so we will convert and enter them into the dataset in a similar fashion to the format they already appear, i.e. "1 week", "1 month" etc.

We saw previously that all features in this dataset are type "object". To make calculations between these two features, we need to convert both of them into "datetime" format.

In [54]:
# Converting "datetime" and "date_of_birth" features into datetime format
df['datetime'] = pd.to_datetime(df['datetime'])
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])

In [55]:
# Check that our calculation works and gives us the number of days between the two dates
df['datetime'] - df['date_of_birth']

0         15 days 16:04:00
1        366 days 11:47:00
2        429 days 14:20:00
3       3300 days 15:50:00
4        181 days 14:04:00
               ...        
78251     59 days 18:26:00
78252   1129 days 18:06:00
78253    365 days 18:08:00
78254     80 days 18:32:00
78255     80 days 18:44:00
Length: 70855, dtype: timedelta64[ns]

In [56]:
# List the eight figures we need to impute
ageuponoutcome_nanvalues = df[df["age_upon_outcome"].isna()]
ageuponoutcome_nanvalues['datetime'] - ageuponoutcome_nanvalues['date_of_birth']

68246   1113 days 16:35:00
76825    368 days 16:47:00
77976    368 days 13:23:00
78081   2557 days 15:49:00
78114    365 days 18:08:00
78162    367 days 08:14:00
78208   2922 days 09:21:00
78253    365 days 18:08:00
dtype: timedelta64[ns]

As there are only eight values, we can fill these in manually into our original dataframe.

In [57]:
# Manually locating animals by animal_id in order to fill in the missing values
df.loc[df["animal_id"] == "A737705", 'age_upon_outcome'] = "3 years"
df.loc[df["animal_id"] == "A764319", 'age_upon_outcome'] = "1 year"
df.loc[df["animal_id"] == "A765547", 'age_upon_outcome'] = "1 year"
df.loc[df["animal_id"] == "A765899", 'age_upon_outcome'] = "7 years"
df.loc[df["animal_id"] == "A765914", 'age_upon_outcome'] = "1 year"
df.loc[df["animal_id"] == "A765901", 'age_upon_outcome'] = "1 year"
df.loc[df["animal_id"] == "A765960", 'age_upon_outcome'] = "8 years"
df.loc[df["animal_id"] == "A766098", 'age_upon_outcome'] = "1 year"
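The manual step above could also be scripted. The sketch below uses a hypothetical helper, days_to_age_label, which is not part of the original notebook; the cut-offs (365 days per year, 30 per month, 7 per week, rounded down) are assumptions chosen so that the eight labels match the ones entered by hand.

```python
def days_to_age_label(days: int) -> str:
    """Turn an age in days into a label such as '3 years' (illustrative helper)."""
    # Assumed cut-offs: 365 days per year, 30 per month, 7 per week
    if days >= 365:
        value, unit = days // 365, "year"
    elif days >= 30:
        value, unit = days // 30, "month"
    elif days >= 7:
        value, unit = days // 7, "week"
    else:
        value, unit = days, "day"
    # Singular unit for exactly 1, plural otherwise
    return f"{value} {unit}" if value == 1 else f"{value} {unit}s"

# The eight whole-day counts computed in the previous cell
missing_days = [1113, 368, 368, 2557, 365, 367, 2922, 365]
labels = [days_to_age_label(d) for d in missing_days]
print(labels)
```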

In [58]:
df.loc[df["animal_id"] == "A764319"]

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
76825,1 year,A764319,Dog,Pit Bull Mix,Black/White,2016-12-27,2017-12-30 16:47:00,2017-12-30T16:47:00,*Emma,,,Intact Female


In [59]:
# Check that our null values for age_upon_outcome are all filled
# Also identify other columns where we need to address null values
df.isnull().sum()

age_upon_outcome        0
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23653
outcome_subtype     36369
outcome_type            8
sex_upon_outcome        2
dtype: int64

In [60]:
# Identify the number of unique names in the dataset
df["name"].nunique()

14574

23653 names are missing. This column can be deleted, as the names that are present are mostly unique (14574 distinct values) and tell us nothing about the animal's condition.
Over half of the outcome_subtype values are missing, so this column will be deleted.
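To put the missing counts in proportion, the short sketch below divides them by the number of rows left after removing duplicates. The figures are copied from the outputs above; on the real frame, df.isnull().mean() should give the same shares directly.

```python
# Counts taken from the outputs above
n_rows = 70855  # rows remaining after drop_duplicates
missing = {"name": 23653, "outcome_subtype": 36369}

# Fraction of rows with a missing value in each column
missing_share = {col: n / n_rows for col, n in missing.items()}
for col, share in missing_share.items():
    print(f"{col}: {share:.1%}")
```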

The outcome_type and sex_upon_outcome columns have values missing for 8 and 2 rows respectively. These will all be changed to the value "Unknown", and then the rows for which the outcome_type feature is "Unknown" will be deleted.


In [61]:
# Drop name and outcome_subtype columns
df = df.drop(columns=["name","outcome_subtype"])

In [62]:
# Change NaN values in outcome_type and sex_upon_outcome columns to string "Unknown"
values = {'sex_upon_outcome': "Unknown", 'outcome_type': "Unknown"}
df = df.fillna(value=values)

# Remove rows where outcome_type is "Unknown" (exact match rather than substring search)
df = df[df["outcome_type"] != "Unknown"]
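An equivalent route, shown here only as a sketch on made-up rows, is to drop the rows with no recorded outcome first and then label the remaining gaps. The toy frame is an assumption; the column names are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in both columns
toy = pd.DataFrame({
    "outcome_type": ["Adoption", np.nan, "Transfer"],
    "sex_upon_outcome": ["Intact Male", "Intact Female", np.nan],
})

# Drop rows with no recorded outcome, then label the unknown sexes
cleaned = toy.dropna(subset=["outcome_type"])
cleaned = cleaned.fillna({"sex_upon_outcome": "Unknown"})
print(cleaned)
```

This avoids creating a temporary "Unknown" class in the target column.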

In [63]:
# Review our changes
df.head(5)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22 16:04:00,2014-07-22T16:04:00,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07 11:47:00,2013-11-07T11:47:00,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03 14:20:00,2014-06-03T14:20:00,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15 15:50:00,2014-06-15T15:50:00,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07,2014-07-07 14:04:00,2014-07-07T14:04:00,Euthanasia,Unknown


Let's have another look at our null values to see if we have any more features to address:

In [64]:
df.isnull().sum()

age_upon_outcome    0
animal_id           0
animal_type         0
breed               0
color               0
date_of_birth       0
datetime            0
monthyear           0
outcome_type        0
sex_upon_outcome    0
dtype: int64

This amount of cleaning is the bare minimum needed to train an ML model. We will save a copy of our dataframe to a new .csv file as a first draft, but we will continue cleaning before we build our model, as there is more that can be done.

In [65]:
df.to_csv("aac_shelter_outcomes_firstdraft.csv", index=False)

In [66]:
df = pd.read_csv("aac_shelter_outcomes_firstdraft.csv")

In [67]:
df

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07 00:00:00,2014-07-22 16:04:00,2014-07-22T16:04:00,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06 00:00:00,2013-11-07 11:47:00,2013-11-07T11:47:00,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31 00:00:00,2014-06-03 14:20:00,2014-06-03T14:20:00,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02 00:00:00,2014-06-15 15:50:00,2014-06-15T15:50:00,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07 00:00:00,2014-07-07 14:04:00,2014-07-07T14:04:00,Euthanasia,Unknown
...,...,...,...,...,...,...,...,...,...,...
70842,1 month,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04 00:00:00,2018-02-01 18:26:00,2018-02-01T18:26:00,Adoption,Spayed Female
70843,3 years,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30 00:00:00,2018-02-01 18:06:00,2018-02-01T18:06:00,Adoption,Neutered Male
70844,1 year,A766098,Other,Bat Mix,Brown,2017-02-01 00:00:00,2018-02-01 18:08:00,2018-02-01T18:08:00,Euthanasia,Unknown
70845,2 months,A765858,Dog,Standard Schnauzer,Red,2017-11-13 00:00:00,2018-02-01 18:32:00,2018-02-01T18:32:00,Adoption,Spayed Female
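Writing to CSV stores dates as text, which is why date_of_birth and datetime have to be converted again after the file is read back in. As a sketch, the parse_dates argument of pd.read_csv can do that conversion while loading. The in-memory CSV below is a stand-in for the saved file and holds two rows copied from the table above.

```python
import io
import pandas as pd

# Stand-in for the saved file: two rows copied from the table above
csv_text = (
    "animal_id,date_of_birth,datetime\n"
    "A684346,2014-07-07 00:00:00,2014-07-22 16:04:00\n"
    "A666430,2012-11-06 00:00:00,2013-11-07 11:47:00\n"
)

# parse_dates converts the listed columns while reading
reloaded = pd.read_csv(io.StringIO(csv_text),
                       parse_dates=["datetime", "date_of_birth"])
print(reloaded.dtypes)
```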


There are a few more features we can look at which we can make improvements to, and possibly new features to be created from those that already exist.

We will start with the age_upon_outcome feature. Earlier, we calculated the ages that were missing for 8 rows in the dataset and entered them manually as strings in the same format as the raw dataset. As none of the datetime columns contain NaN values, we can make the same calculation for every row and replace the original values. This has two benefits:
- The newly calculated ages should be more precise, as each is expressed as a number of days rather than in mixed units such as weeks, months and years
- The new values can be cast as type INT

In [68]:
# As done previously, we have to convert our 'datetime' and 'date_of_birth' features
# into datetime format before making our calculations
df['datetime'] = pd.to_datetime(df['datetime'])
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
df['datetime'] - df['date_of_birth']

0         15 days 16:04:00
1        366 days 11:47:00
2        429 days 14:20:00
3       3300 days 15:50:00
4        181 days 14:04:00
               ...        
70842     59 days 18:26:00
70843   1129 days 18:06:00
70844    365 days 18:08:00
70845     80 days 18:32:00
70846     80 days 18:44:00
Length: 70847, dtype: timedelta64[ns]

We will save the above in the place of the values of the age_upon_outcome feature, remove the text after the integer value we need, and then convert the column to type INT.

In [69]:
# Overwrite the original values in age_upon_outcome with our new values
df['age_upon_outcome'] = df['datetime'] - df['date_of_birth']

# Change the age_upon_outcome datatype back to object, 
# then convert it into a string and strip the last 14 characters
df['age_upon_outcome'] = df['age_upon_outcome'].astype(object)
df['age_upon_outcome'] = df['age_upon_outcome'].astype(str).str[:-14]
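Slicing a fixed number of characters depends on how pandas prints a timedelta. As an alternative sketch, the .dt.days accessor returns the whole-day part as an integer directly. The two rows below are copied from the dataset; the toy frame itself is only for illustration.

```python
import pandas as pd

# Two rows copied from the dataset
toy = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["2014-07-07", "2012-11-06"]),
    "datetime": pd.to_datetime(["2014-07-22 16:04:00", "2013-11-07 11:47:00"]),
})

# .dt.days returns the whole-day part of each timedelta as an integer
age_days = (toy["datetime"] - toy["date_of_birth"]).dt.days
print(age_days.tolist())
```

The result is already numeric, so no string conversion or later cast to int is needed.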

In [70]:
# Look at the first few rows of the dataset to review the changes made
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_type,sex_upon_outcome
0,15,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22 16:04:00,2014-07-22T16:04:00,Transfer,Intact Male
1,366,A666430,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07 11:47:00,2013-11-07T11:47:00,Transfer,Spayed Female
2,429,A675708,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03 14:20:00,2014-06-03T14:20:00,Adoption,Neutered Male
3,3300,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15 15:50:00,2014-06-15T15:50:00,Transfer,Neutered Male
4,181,A683115,Other,Bat Mix,Brown,2014-01-07,2014-07-07 14:04:00,2014-07-07T14:04:00,Euthanasia,Unknown


We need to ensure that all values have been replaced successfully and that there are no null values. We should also ensure none of the resulting values are negative.

In [71]:
# Retrieve sum of null values in age_upon_outcome column
df["age_upon_outcome"].isna().sum()

0

In [72]:
# Convert age_upon_outcome to datatype int so it can be compared with 0
df["age_upon_outcome"] = df["age_upon_outcome"].astype(int)
cols = ['age_upon_outcome']
# Note: this expression does not count negative values. It keeps values greater
# than 0 and shows every other value (zero or negative) as NaN in the output
df[df[cols] > 0][cols]

Unnamed: 0,age_upon_outcome
0,15.0
1,366.0
2,429.0
3,3300.0
4,181.0
...,...
70842,59.0
70843,1129.0
70844,365.0
70845,80.0
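A more direct way to test for impossible ages is to count the rows that fail the condition. This is a minimal sketch on made-up values; on the real frame the same comparison would be applied to df["age_upon_outcome"].

```python
import pandas as pd

# Toy ages in days (made-up values), including one impossible negative age
ages = pd.Series([15, 366, -3, 0], name="age_upon_outcome")

n_negative = int((ages < 0).sum())  # rows with a negative age
n_zero = int((ages == 0).sum())     # rows where the outcome falls on the birth date
print(n_negative, n_zero)
```

A result of 0 for n_negative would confirm that every outcome date falls on or after the date of birth.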


From the datetime column, we can work out the month and year of outcome for each animal. This lets us check whether the month of outcome has any influence on the outcome_type, and having the year as well will help with visualisation later.

We can extract the month by slicing off the final 12 characters and then the first 5 characters of each datetime value.

In [73]:
# First make sure our feature is converted to a string so we can perform the slice
# Do each slice manually to keep things simple
df['month_of_outcome'] = df['datetime'].astype(str).str[:-12]
df['month_of_outcome'] = df['month_of_outcome'].str[5:]

In [74]:
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_type,sex_upon_outcome,month_of_outcome
0,15,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22 16:04:00,2014-07-22T16:04:00,Transfer,Intact Male,07
1,366,A666430,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07 11:47:00,2013-11-07T11:47:00,Transfer,Spayed Female,11
2,429,A675708,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03 14:20:00,2014-06-03T14:20:00,Adoption,Neutered Male,06
3,3300,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15 15:50:00,2014-06-15T15:50:00,Transfer,Neutered Male,06
4,181,A683115,Other,Bat Mix,Brown,2014-01-07,2014-07-07 14:04:00,2014-07-07T14:04:00,Euthanasia,Unknown,07


We can now do something similar to extract the year, by removing all but the first four characters of the values from the 'datetime' column.

In [75]:
# Keep only the first four characters of each datetime value (the year)
# Make sure the datatype is string first
df['year_of_outcome'] = df['datetime'].astype(str).str[:4]
df

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_type,sex_upon_outcome,month_of_outcome,year_of_outcome
0,15,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07,2014-07-22 16:04:00,2014-07-22T16:04:00,Transfer,Intact Male,07,2014
1,366,A666430,Dog,Beagle Mix,White/Brown,2012-11-06,2013-11-07 11:47:00,2013-11-07T11:47:00,Transfer,Spayed Female,11,2013
2,429,A675708,Dog,Pit Bull,Blue/White,2013-03-31,2014-06-03 14:20:00,2014-06-03T14:20:00,Adoption,Neutered Male,06,2014
3,3300,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02,2014-06-15 15:50:00,2014-06-15T15:50:00,Transfer,Neutered Male,06,2014
4,181,A683115,Other,Bat Mix,Brown,2014-01-07,2014-07-07 14:04:00,2014-07-07T14:04:00,Euthanasia,Unknown,07,2014
...,...,...,...,...,...,...,...,...,...,...,...,...
70842,59,A764894,Dog,Golden Retriever/Labrador Retriever,Brown/White,2017-12-04,2018-02-01 18:26:00,2018-02-01T18:26:00,Adoption,Spayed Female,02,2018
70843,1129,A764468,Dog,Mastiff Mix,Blue/White,2014-12-30,2018-02-01 18:06:00,2018-02-01T18:06:00,Adoption,Neutered Male,02,2018
70844,365,A766098,Other,Bat Mix,Brown,2017-02-01,2018-02-01 18:08:00,2018-02-01T18:08:00,Euthanasia,Unknown,02,2018
70845,80,A765858,Dog,Standard Schnauzer,Red,2017-11-13,2018-02-01 18:32:00,2018-02-01T18:32:00,Adoption,Spayed Female,02,2018
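Because the datetime column already holds datetime values at this point, the month and year can also be read with the .dt accessor instead of string slicing. This is a sketch on two timestamps copied from the dataset:

```python
import pandas as pd

# Two outcome timestamps copied from the dataset
stamps = pd.Series(pd.to_datetime(["2014-07-22 16:04:00", "2013-11-07 11:47:00"]))

# Integer month and year read directly from the datetime values
month_of_outcome = stamps.dt.month
year_of_outcome = stamps.dt.year
print(month_of_outcome.tolist(), year_of_outcome.tolist())
```

One difference to keep in mind: these are integers, whereas the sliced values are strings such as "07". pd.get_dummies one-hot encodes string columns but leaves numeric columns as they are, so adding .astype(str) would be needed to keep the encoding used later in this notebook.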


We'll now remove the columns we will not use for making predictions (datetime, monthyear, date_of_birth, animal_id).

In [76]:
df = df.drop(columns=["datetime","monthyear","animal_id","date_of_birth"])

In [77]:
df

Unnamed: 0,age_upon_outcome,animal_type,breed,color,outcome_type,sex_upon_outcome,month_of_outcome,year_of_outcome
0,15,Cat,Domestic Shorthair Mix,Orange Tabby,Transfer,Intact Male,07,2014
1,366,Dog,Beagle Mix,White/Brown,Transfer,Spayed Female,11,2013
2,429,Dog,Pit Bull,Blue/White,Adoption,Neutered Male,06,2014
3,3300,Dog,Miniature Schnauzer Mix,White,Transfer,Neutered Male,06,2014
4,181,Other,Bat Mix,Brown,Euthanasia,Unknown,07,2014
...,...,...,...,...,...,...,...,...
70842,59,Dog,Golden Retriever/Labrador Retriever,Brown/White,Adoption,Spayed Female,02,2018
70843,1129,Dog,Mastiff Mix,Blue/White,Adoption,Neutered Male,02,2018
70844,365,Other,Bat Mix,Brown,Euthanasia,Unknown,02,2018
70845,80,Dog,Standard Schnauzer,Red,Adoption,Spayed Female,02,2018


In [78]:
df.nunique()

age_upon_outcome    4086
animal_type            5
breed               2128
color                525
outcome_type           9
sex_upon_outcome       5
month_of_outcome      12
year_of_outcome        6
dtype: int64

Now we will train some models using a subset of the features we have kept up until this point. 

We are not going to use the "breed" or "color" features as the cardinality is too high: these two features require deeper cleaning that goes beyond the scope of this particular notebook.

The animal_id, datetime, monthyear and date_of_birth columns have already been removed above.

In [79]:
from sklearn.model_selection import train_test_split

# Get dummies for features we're keeping (Except target feature)
features = ["age_upon_outcome", "animal_type", "sex_upon_outcome", "month_of_outcome", "year_of_outcome"] 
new_df = pd.get_dummies(df[features])

# Separate the descriptive features (X) from the target feature (Y)
X = new_df
Y = np.array(df['outcome_type'])

# Split the data into training and test sets
# (no random_state is set, so the split differs on every run)
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    shuffle=True,
                                                    stratify=Y,
                                                    train_size=0.7)
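The stratify argument keeps the class proportions the same in the training and test parts, which matters here because some outcome types are very rare. The sketch below shows this on made-up data and also passes random_state so the split can be repeated. The seed value 0 and the toy arrays are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made-up): 100 rows, 80% class "a" and 20% class "b"
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["a"] * 80 + ["b"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.7,
    stratify=y_toy,   # keep the class ratio in both parts
    random_state=0,   # assumed seed; any fixed integer makes the split repeatable
)

share_b_test = float((y_te == "b").mean())
print(len(y_tr), len(y_te), share_b_test)
```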

In [80]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Define and train a Random Forest Classifier model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)

# Make predictions based on the test set and save the accuracy score
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

Accuracy: 0.6314278993178076


  _warn_prf(average, modifier, msg_start, len(result))


                 precision    recall  f1-score   support

       Adoption       0.60      0.95      0.74      8824
           Died       0.00      0.00      0.00       202
       Disposal       0.00      0.00      0.00        91
     Euthanasia       0.81      0.49      0.62      1775
        Missing       0.00      0.00      0.00        12
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.00      0.00      0.00      3549
      Rto-Adopt       0.00      0.00      0.00        38
       Transfer       0.66      0.62      0.64      6759

       accuracy                           0.63     21255
      macro avg       0.23      0.23      0.22     21255
   weighted avg       0.53      0.63      0.56     21255
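The rows of 0.00 and the _warn_prf line above arise when a class is never predicted, so its precision is undefined. The sketch below reproduces this on four made-up labels and uses the zero_division argument of classification_report to report 0 without a warning. It also shows why the macro average (every class counts equally) sits well below the weighted average (classes count by support).

```python
from sklearn.metrics import classification_report

# Toy labels (made-up): the classifier never predicts "Died"
y_true = ["Adoption", "Adoption", "Transfer", "Died"]
y_hat = ["Adoption", "Adoption", "Transfer", "Adoption"]

# zero_division=0 reports 0.0 for an undefined precision without a warning
report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)
print(report["Died"])
print(report["macro avg"]["recall"], report["weighted avg"]["recall"])
```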



In [81]:
import xgboost as xgb

model = xgb.XGBClassifier(objective="multi:softprob", random_state=42)
model.fit(X_train, y_train)

# Make predictions based on the test set and save the accuracy score
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))





Accuracy: 0.7188426252646436
                 precision    recall  f1-score   support

       Adoption       0.72      0.89      0.79      8824
           Died       0.50      0.02      0.05       202
       Disposal       0.56      0.21      0.30        91
     Euthanasia       0.78      0.58      0.67      1775
        Missing       0.00      0.00      0.00        12
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.64      0.57      0.60      3549
      Rto-Adopt       0.00      0.00      0.00        38
       Transfer       0.75      0.65      0.70      6759

       accuracy                           0.72     21255
      macro avg       0.44      0.32      0.35     21255
   weighted avg       0.72      0.72      0.71     21255
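Depending on the installed XGBoost version, XGBClassifier may reject string class labels and expect integer codes starting at 0. If that happens, one option is to encode the target with scikit-learn's LabelEncoder and decode the predictions afterwards. The sketch below shows only the encoding round trip on a few labels from the report above; the model fitting itself is left out.

```python
from sklearn.preprocessing import LabelEncoder

# A few of the outcome labels from the report above
y_text = ["Transfer", "Adoption", "Euthanasia", "Adoption"]

encoder = LabelEncoder()
y_codes = encoder.fit_transform(y_text)  # classes are numbered in sorted order

# ... a classifier would be fitted on y_codes and would predict integer codes ...

y_back = encoder.inverse_transform(y_codes)  # map codes back to label strings
print(list(y_codes), list(y_back))
```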



Now we will train the models again using the same algorithms, but this time we will remove the "month_of_outcome" and "year_of_outcome" features.

In [82]:
# Get dummies for features we're keeping (Except target feature)
# This time we drop the features "month_of_outcome" and "year_of_outcome"
features = ["age_upon_outcome", "animal_type", "sex_upon_outcome"] 
new_df = pd.get_dummies(df[features])

# Separate the descriptive features (X) from the target feature (Y)
X = new_df
Y = np.array(df['outcome_type'])

# Split the data into training and test sets
# (no random_state is set, so the split differs on every run)
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    shuffle=True,
                                                    stratify=Y,
                                                    train_size=0.7)

In [83]:
# Define and train a Random Forest Classifier model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)

# Make predictions based on the test set and save the accuracy score
y_pred=model.predict(X_test)
accuracy = metrics.accuracy_score(y_test,y_pred)

# Print results including accuracy, precision, recall, f1-score and support
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

Accuracy: 0.6448365090566925


  _warn_prf(average, modifier, msg_start, len(result))


                 precision    recall  f1-score   support

       Adoption       0.61      0.95      0.74      8824
           Died       0.00      0.00      0.00       202
       Disposal       0.00      0.00      0.00        91
     Euthanasia       0.79      0.51      0.62      1775
        Missing       0.00      0.00      0.00        12
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.52      0.10      0.16      3549
      Rto-Adopt       0.00      0.00      0.00        38
       Transfer       0.72      0.60      0.65      6759

       accuracy                           0.64     21255
      macro avg       0.29      0.24      0.24     21255
   weighted avg       0.63      0.64      0.60     21255



In [84]:
# Define and train an XGBoost Classifier model
model = xgb.XGBClassifier(objective="multi:softprob", random_state=42)
model.fit(X_train, y_train)

# Make predictions based on the test set and save the accuracy score
y_pred=model.predict(X_test)
accuracy = metrics.accuracy_score(y_test,y_pred)

# Print results including accuracy, precision, recall, f1-score and support
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))





Accuracy: 0.7165843330980946
                 precision    recall  f1-score   support

       Adoption       0.72      0.89      0.79      8824
           Died       0.54      0.03      0.07       202
       Disposal       0.40      0.07      0.11        91
     Euthanasia       0.77      0.58      0.66      1775
        Missing       0.00      0.00      0.00        12
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.63      0.56      0.59      3549
      Rto-Adopt       0.00      0.00      0.00        38
       Transfer       0.75      0.65      0.70      6759

       accuracy                           0.72     21255
      macro avg       0.42      0.31      0.32     21255
   weighted avg       0.71      0.72      0.71     21255



The results show that while features such as "month_of_outcome" and "year_of_outcome" are useful for data exploration and visualisation, they have little or no impact on the accuracy of the models trained to predict the outcome of an animal in a shelter: the Random Forest scores 0.631 with them and 0.645 without, and XGBoost scores 0.719 with them and 0.717 without. For this reason, we will keep the model trained with these features removed, as it has fewer input columns and should be smaller and cheaper to run when deployed, with little or no change in classification accuracy.
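The comparison can be summarised by subtracting the accuracies reported in the four runs above. The dictionary layout and key names in this sketch are illustrative; the numbers are copied from the outputs.

```python
# Accuracies reported in the four runs above
accuracy = {
    "Random Forest": {"with_dates": 0.6314278993178076, "without_dates": 0.6448365090566925},
    "XGBoost": {"with_dates": 0.7188426252646436, "without_dates": 0.7165843330980946},
}

# Change in accuracy after dropping month_of_outcome and year_of_outcome
change = {name: acc["without_dates"] - acc["with_dates"] for name, acc in accuracy.items()}
for name, delta in change.items():
    print(f"{name}: {delta:+.4f}")
```

Because train_test_split was called without a fixed random_state, the two experiments used different splits, so differences of this size may partly reflect split-to-split variation rather than the features themselves.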