In [1]:
#import necessary tools to begin
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
#open, read, and explore data set
df = pd.read_csv("aac_shelter_outcomes.csv")
df.head(20)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male
6,1 year,A693700,Other,Squirrel Mix,Tan,2013-12-13T00:00:00,2014-12-13T12:20:00,2014-12-13T12:20:00,,Suffering,Euthanasia,Unknown
7,3 years,A692618,Dog,Chihuahua Shorthair Mix,Brown,2011-11-23T00:00:00,2014-12-08T15:55:00,2014-12-08T15:55:00,*Ella,Partner,Transfer,Spayed Female
8,1 month,A685067,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-06-16T00:00:00,2014-08-14T18:45:00,2014-08-14T18:45:00,Lucy,,Adoption,Intact Female
9,3 months,A678580,Cat,Domestic Shorthair Mix,White/Black,2014-03-26T00:00:00,2014-06-29T17:45:00,2014-06-29T17:45:00,*Frida,Offsite,Adoption,Spayed Female


In [3]:
df.shape

(78256, 12)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age_upon_outcome  78248 non-null  object
 1   animal_id         78256 non-null  object
 2   animal_type       78256 non-null  object
 3   breed             78256 non-null  object
 4   color             78256 non-null  object
 5   date_of_birth     78256 non-null  object
 6   datetime          78256 non-null  object
 7   monthyear         78256 non-null  object
 8   name              54370 non-null  object
 9   outcome_subtype   35963 non-null  object
 10  outcome_type      78244 non-null  object
 11  sex_upon_outcome  78254 non-null  object
dtypes: object(12)
memory usage: 7.2+ MB


In [5]:
#What are the types of outcomes possible and how do we categorize them?
df.groupby(["outcome_type"]).agg({"outcome_type":"count"})
#Adopted = adoption, transfer 
#not adopted = died, disposal, euthanasia, missing, relocate, return to owner, Rto-Adopt

Unnamed: 0_level_0,outcome_type
outcome_type,Unnamed: 1_level_1
Adoption,33112
Died,680
Disposal,307
Euthanasia,6080
Missing,46
Relocate,16
Return to Owner,14354
Rto-Adopt,150
Transfer,23499


In [6]:
df.groupby(["outcome_subtype"]).agg({"outcome_subtype":"count"})


Unnamed: 0_level_0,outcome_subtype
outcome_subtype,Unnamed: 1_level_1
Aggressive,506
At Vet,59
Barn,3
Behavior,142
Court/Investigation,18
Enroute,45
Foster,5558
In Foster,182
In Kennel,343
In Surgery,16


In [7]:
#look for missing variables/values
df.isnull().sum()

age_upon_outcome        8
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

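**How do we deal with these missing values?**

age_upon_outcome: only 8 values are missing, so we can fill each one with an age worked out from date_of_birth and datetime. Later we need to convert the column to a number.

name: this isn't an important feature, so we can drop the entire column.

outcome_subtype: we cannot infer the missing values, since this is a categorical variable with many levels. Instead, we turn it into dummies, which we will need for modeling later anyway. Rows with a missing subtype get 0 in every dummy column.

outcome_type: the 12 missing values are counted as "Not Adopted" when we recode the column into adopted vs not adopted, and the result is then mapped to a binary dummy.

sex_upon_outcome: only 2 values are missing, which is too few to bias the outcomes. We turn the column into dummies, so those 2 rows get 0 in every sex dummy column.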
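
The age fill described above can also be computed instead of typed in by hand. This is a minimal sketch on a made-up three-row frame (the `toy` frame and the `label` helper are illustrative names, not part of the notebook). It assumes whole years are the right unit, which only holds for animals that are at least one year old.

```python
import pandas as pd

# Made-up rows that mimic the shelter columns; two ages are missing.
toy = pd.DataFrame({
    "age_upon_outcome": [None, "2 weeks", None],
    "date_of_birth": ["2013-11-02T00:00:00", "2014-07-07T00:00:00", "2016-12-27T00:00:00"],
    "datetime": ["2016-11-19T16:35:00", "2014-07-22T16:04:00", "2017-12-30T16:47:00"],
})

born = pd.to_datetime(toy["date_of_birth"])
outcome = pd.to_datetime(toy["datetime"])

# Whole years elapsed, approximated as 365 days per year (assumption).
years = (outcome - born).dt.days // 365


def label(n):
    """Format a year count the same way the dataset does (hypothetical helper)."""
    return f"{n} year" if n == 1 else f"{n} years"


# Only overwrite the rows where the age is actually missing.
missing = toy["age_upon_outcome"].isna()
toy.loc[missing, "age_upon_outcome"] = years[missing].map(label)
print(toy["age_upon_outcome"].tolist())
```

Animals younger than one year would come out as "0 years" here, so a fuller version would need to fall back to months, weeks, or days.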

In [8]:
qry = "age_upon_outcome == True"
df.isnull().query(qry)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
68246,True,False,False,False,False,False,False,False,False,True,True,True
76825,True,False,False,False,False,False,False,False,False,True,True,False
77976,True,False,False,False,False,False,False,False,True,False,False,False
78081,True,False,False,False,False,False,False,False,True,False,False,False
78114,True,False,False,False,False,False,False,False,True,False,False,False
78162,True,False,False,False,False,False,False,False,True,False,False,False
78208,True,False,False,False,False,False,False,False,True,False,False,False
78253,True,False,False,False,False,False,False,False,True,False,False,False
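
The query above returns a table of True/False flags. If the goal is to see the affected rows themselves, a boolean mask on the original frame is a common alternative. A small sketch on made-up data (the `toy` frame is illustrative):

```python
import pandas as pd

# Made-up frame with one missing age.
toy = pd.DataFrame({
    "age_upon_outcome": ["2 weeks", None, "1 year"],
    "animal_type": ["Cat", "Dog", "Dog"],
})

# Keep only the rows where age_upon_outcome is missing, with their real values.
missing_age = toy[toy["age_upon_outcome"].isna()]
print(missing_age)
```

This keeps the birth date and outcome date visible next to each missing age, which is what the manual fill needs.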


In [9]:
#locate missing values and replace with new age based on birth date and datetime
df.iloc[68246]
df.loc[68246, "age_upon_outcome"] = "3 years"
df.iloc[68246]

age_upon_outcome                   3 years
animal_id                          A737705
animal_type                            Dog
breed               Labrador Retriever Mix
color                          Black/White
date_of_birth          2013-11-02T00:00:00
datetime               2016-11-19T16:35:00
monthyear              2016-11-19T16:35:00
name                                *Heddy
outcome_subtype                        NaN
outcome_type                           NaN
sex_upon_outcome                       NaN
Name: 68246, dtype: object

In [10]:
df.iloc[76825]
df.loc[76825, "age_upon_outcome"] = "1 year"
df.iloc[76825]

age_upon_outcome                 1 year
animal_id                       A764319
animal_type                         Dog
breed                      Pit Bull Mix
color                       Black/White
date_of_birth       2016-12-27T00:00:00
datetime            2017-12-30T16:47:00
monthyear           2017-12-30T16:47:00
name                              *Emma
outcome_subtype                     NaN
outcome_type                        NaN
sex_upon_outcome          Intact Female
Name: 76825, dtype: object

In [11]:
df.iloc[77976]
df.loc[77976, "age_upon_outcome"] = "1 year"
df.iloc[77976]

age_upon_outcome                 1 year
animal_id                       A765547
animal_type                        Bird
breed                       Leghorn Mix
color                         White/Red
date_of_birth       2017-01-22T00:00:00
datetime            2018-01-25T13:23:00
monthyear           2018-01-25T13:23:00
name                                NaN
outcome_subtype                 Partner
outcome_type                   Transfer
sex_upon_outcome          Intact Female
Name: 77976, dtype: object

In [12]:
df.iloc[78081]
df.loc[78081, "age_upon_outcome"] = "7 years"
df.iloc[78081]

age_upon_outcome                 7 years
animal_id                        A765899
animal_type                          Dog
breed               Miniature Poodle Mix
color                              Black
date_of_birth        2011-01-29T00:00:00
datetime             2018-01-29T15:49:00
monthyear            2018-01-29T15:49:00
name                                 NaN
outcome_subtype                Suffering
outcome_type                  Euthanasia
sex_upon_outcome           Neutered Male
Name: 78081, dtype: object

In [13]:
df.iloc[78114]
df.loc[78114, "age_upon_outcome"] = "1 year"
df.iloc[78114]


age_upon_outcome                    1 year
animal_id                          A765914
animal_type                            Cat
breed               Domestic Shorthair Mix
color                           Lynx Point
date_of_birth          2017-01-29T00:00:00
datetime               2018-01-29T18:08:00
monthyear              2018-01-29T18:08:00
name                                   NaN
outcome_subtype                  Suffering
outcome_type                    Euthanasia
sex_upon_outcome               Intact Male
Name: 78114, dtype: object

In [14]:
df.iloc[78162]
df.loc[78162, "age_upon_outcome"] = "1 year"
df.iloc[78162]


age_upon_outcome                 1 year
animal_id                       A765901
animal_type                         Dog
breed                       Maltese Mix
color                              Buff
date_of_birth       2017-01-29T00:00:00
datetime            2018-01-31T08:14:00
monthyear           2018-01-31T08:14:00
name                                NaN
outcome_subtype                 Partner
outcome_type                   Transfer
sex_upon_outcome            Intact Male
Name: 78162, dtype: object

In [15]:
df.iloc[78208]
df.loc[78208, "age_upon_outcome"] = "8 years"
df.iloc[78208]

age_upon_outcome                8 years
animal_id                       A765960
animal_type                         Dog
breed                  Beagle/Catahoula
color                         Tan/White
date_of_birth       2010-02-01T00:00:00
datetime            2018-02-01T09:21:00
monthyear           2018-02-01T09:21:00
name                                NaN
outcome_subtype               Suffering
outcome_type                 Euthanasia
sex_upon_outcome            Intact Male
Name: 78208, dtype: object

In [16]:
df.iloc[78253]
df.loc[78253, "age_upon_outcome"] = "1 year"
df.iloc[78253]

age_upon_outcome                 1 year
animal_id                       A766098
animal_type                       Other
breed                           Bat Mix
color                             Brown
date_of_birth       2017-02-01T00:00:00
datetime            2018-02-01T18:08:00
monthyear           2018-02-01T18:08:00
name                                NaN
outcome_subtype             Rabies Risk
outcome_type                 Euthanasia
sex_upon_outcome                Unknown
Name: 78253, dtype: object

In [17]:
#check to make sure missing values were replaced
qry = "age_upon_outcome == True"
df.isnull().query(qry)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome


In [18]:
#name: this isn't an important feature. we can drop this entire column
df = df.drop(columns=["name"])

In [19]:
#check to see that "name" was dropped
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Rabies Risk,Euthanasia,Unknown


In [20]:
#outcome_subtype: we cannot infer the missing values, as it is a categorical variable with many levels. instead, we make dummies, since we'll need them for modeling later anyway.
outcome_sub_dum = pd.get_dummies(df["outcome_subtype"], prefix = "outcome_sub")
df = pd.concat([df, outcome_sub_dum], axis=1)
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_subtype,outcome_type,...,outcome_sub_In Surgery,outcome_sub_Medical,outcome_sub_Offsite,outcome_sub_Partner,outcome_sub_Possible Theft,outcome_sub_Rabies Risk,outcome_sub_SCRP,outcome_sub_Snr,outcome_sub_Suffering,outcome_sub_Underage
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Partner,Transfer,...,0,0,0,1,0,0,0,0,0,0
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Partner,Transfer,...,0,0,0,1,0,0,0,0,0,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,,Adoption,...,0,0,0,0,0,0,0,0,0,0
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Partner,Transfer,...,0,0,0,1,0,0,0,0,0,0
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Rabies Risk,Euthanasia,...,0,0,0,0,0,1,0,0,0,0
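
A small illustration of how `pd.get_dummies` treats a missing subtype, using a made-up three-value series. By default the missing row gets 0 in every dummy column. `dummy_na=True` is an optional alternative (not used in this notebook) that adds an explicit indicator column for missing values.

```python
import pandas as pd

# Made-up subtypes; the middle value is missing.
s = pd.Series(["Partner", None, "Foster"], name="outcome_subtype")

# Default behaviour: the missing row is all zeros.
plain = pd.get_dummies(s, prefix="outcome_sub")

# Optional alternative: one extra column that flags the missing row.
with_na = pd.get_dummies(s, prefix="outcome_sub", dummy_na=True)

print(plain)
print(with_na)
```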


In [21]:
#sex_upon_outcome: only 2 values are missing, too few to bias outcomes. make dummies; the 2 rows with missing sex get 0 in every sex dummy.
sex_dum = pd.get_dummies(df["sex_upon_outcome"], prefix = "sex")
df = pd.concat([df, sex_dum], axis=1)
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_subtype,outcome_type,...,outcome_sub_Rabies Risk,outcome_sub_SCRP,outcome_sub_Snr,outcome_sub_Suffering,outcome_sub_Underage,sex_Intact Female,sex_Intact Male,sex_Neutered Male,sex_Spayed Female,sex_Unknown
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Partner,Transfer,...,0,0,0,0,0,0,1,0,0,0
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Partner,Transfer,...,0,0,0,0,0,0,0,0,1,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,,Adoption,...,0,0,0,0,0,0,0,1,0,0
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Partner,Transfer,...,0,0,0,0,0,0,0,1,0,0
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Rabies Risk,Euthanasia,...,1,0,0,0,0,0,0,0,0,1


In [22]:
#check to see if dummies worked
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_subtype,outcome_type,...,outcome_sub_Rabies Risk,outcome_sub_SCRP,outcome_sub_Snr,outcome_sub_Suffering,outcome_sub_Underage,sex_Intact Female,sex_Intact Male,sex_Neutered Male,sex_Spayed Female,sex_Unknown
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,Partner,Transfer,...,0,0,0,0,0,0,1,0,0,0
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Partner,Transfer,...,0,0,0,0,0,0,0,0,1,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,,Adoption,...,0,0,0,0,0,0,0,1,0,0
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Partner,Transfer,...,0,0,0,0,0,0,0,1,0,0
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,Rabies Risk,Euthanasia,...,1,0,0,0,0,0,0,0,0,1


In [23]:
#turn outcomes to str so we can work with them and replace them with "adopted" vs "not adopted"
df["outcome_type"] = df["outcome_type"].astype(str)

In [24]:
#check if dtype change worked
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 35 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   age_upon_outcome                 78256 non-null  object
 1   animal_id                        78256 non-null  object
 2   animal_type                      78256 non-null  object
 3   breed                            78256 non-null  object
 4   color                            78256 non-null  object
 5   date_of_birth                    78256 non-null  object
 6   datetime                         78256 non-null  object
 7   monthyear                        78256 non-null  object
 8   outcome_subtype                  35963 non-null  object
 9   outcome_type                     78256 non-null  object
 10  sex_upon_outcome                 78254 non-null  object
 11  outcome_sub_Aggressive           78256 non-null  uint8 
 12  outcome_sub_At Vet              

In [25]:
#outcome_type: first need to make them into adopted vs not adopted
#Adopted = adoption, transfer ==1
#not adopted = died, disposal, euthanasia, missing, relocate, return to owner, Rto-Adopt ==0
#missing outcomes became the string "nan" in the astype(str) step and are counted as not adopted
outcome_map = {
    "Adoption": "Adopted",
    "Transfer": "Adopted",
    "Died": "Not Adopted",
    "Disposal": "Not Adopted",
    "Euthanasia": "Not Adopted",
    "Missing": "Not Adopted",
    "Relocate": "Not Adopted",
    "Return to Owner": "Not Adopted",
    "Rto-Adopt": "Not Adopted",
    "nan": "Not Adopted",
}
#assign the result back instead of calling replace(..., inplace=True) on a column selection
df["outcome_type"] = df["outcome_type"].replace(outcome_map)


df["outcome_type"].isnull().sum()


0

In [26]:
#check to see if new categories worked
df.groupby(["outcome_type"]).agg({"outcome_type":"count"})


Unnamed: 0_level_0,outcome_type
outcome_type,Unnamed: 1_level_1
Adopted,56611
Not Adopted,21645


In [27]:
#now can turn them into binary dummies for regression later
df["outcome_dum"]= df.outcome_type.map({"Not Adopted":0, "Adopted": 1})
df[["outcome_type", "outcome_dum"]].head()

Unnamed: 0,outcome_type,outcome_dum
0,Adopted,1
1,Adopted,1
2,Adopted,1
3,Adopted,1
4,Not Adopted,0


In [28]:
#check to see if dummies worked
df.sample(20)

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,outcome_subtype,outcome_type,...,outcome_sub_SCRP,outcome_sub_Snr,outcome_sub_Suffering,outcome_sub_Underage,sex_Intact Female,sex_Intact Male,sex_Neutered Male,sex_Spayed Female,sex_Unknown,outcome_dum
63296,2 months,A697166,Dog,Labrador Retriever/Anatol Shepherd,White/Black,2015-02-18T00:00:00,2015-05-02T21:32:00,2015-05-02T21:32:00,Offsite,Adopted,...,0,0,0,0,0,0,0,1,0,1
16616,3 years,A693074,Cat,Domestic Longhair Mix,Black/White,2011-12-01T00:00:00,2014-12-10T11:12:00,2014-12-10T11:12:00,,Not Adopted,...,0,0,0,0,0,0,1,0,0,0
44317,1 year,A707384,Dog,Pit Bull Mix,White/Black,2014-07-12T00:00:00,2015-10-21T18:08:00,2015-10-21T18:08:00,,Adopted,...,0,0,0,0,0,0,0,1,0,1
49502,2 months,A705524,Cat,Domestic Shorthair Mix,Black,2015-04-11T00:00:00,2015-06-22T16:09:00,2015-06-22T16:09:00,,Adopted,...,0,0,0,0,0,0,1,0,0,1
60170,1 month,A742827,Cat,Domestic Shorthair Mix,Black,2016-11-29T00:00:00,2017-01-29T19:14:00,2017-01-29T19:14:00,,Adopted,...,0,0,0,0,0,1,0,0,0,1
24026,2 years,A695083,Dog,Pit Bull Mix,Tan/White,2013-01-08T00:00:00,2015-01-13T13:25:00,2015-01-13T13:25:00,,Not Adopted,...,0,0,0,0,0,0,1,0,0,0
45288,15 years,A708799,Cat,Domestic Shorthair Mix,Gray Tabby/White,2000-08-01T00:00:00,2015-08-01T10:26:00,2015-08-01T10:26:00,Suffering,Not Adopted,...,0,0,1,0,0,0,1,0,0,0
21374,1 year,A664560,Other,Rabbit Sh Mix,White,2012-10-05T00:00:00,2014-02-19T17:17:00,2014-02-19T17:17:00,Partner,Adopted,...,0,0,0,0,0,1,0,0,0,1
72552,1 year,A714708,Dog,Labrador Retriever/German Shepherd,Black/White,2014-04-27T00:00:00,2015-11-03T17:42:00,2015-11-03T17:42:00,,Adopted,...,0,0,0,0,0,0,1,0,0,1
44163,1 year,A700043,Dog,Miniature Poodle Mix,Cream,2014-04-06T00:00:00,2015-04-15T13:35:00,2015-04-15T13:35:00,Partner,Adopted,...,0,0,0,0,0,0,1,0,0,1


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 36 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   age_upon_outcome                 78256 non-null  object
 1   animal_id                        78256 non-null  object
 2   animal_type                      78256 non-null  object
 3   breed                            78256 non-null  object
 4   color                            78256 non-null  object
 5   date_of_birth                    78256 non-null  object
 6   datetime                         78256 non-null  object
 7   monthyear                        78256 non-null  object
 8   outcome_subtype                  35963 non-null  object
 9   outcome_type                     78256 non-null  object
 10  sex_upon_outcome                 78254 non-null  object
 11  outcome_sub_Aggressive           78256 non-null  uint8 
 12  outcome_sub_At Vet              

In [29]:
#check to see if missing values are all taken care of
df.isnull().sum()

age_upon_outcome                       0
animal_id                              0
animal_type                            0
breed                                  0
color                                  0
date_of_birth                          0
datetime                               0
monthyear                              0
outcome_subtype                    42293
outcome_type                           0
sex_upon_outcome                       2
outcome_sub_Aggressive                 0
outcome_sub_At Vet                     0
outcome_sub_Barn                       0
outcome_sub_Behavior                   0
outcome_sub_Court/Investigation        0
outcome_sub_Enroute                    0
outcome_sub_Foster                     0
outcome_sub_In Foster                  0
outcome_sub_In Kennel                  0
outcome_sub_In Surgery                 0
outcome_sub_Medical                    0
outcome_sub_Offsite                    0
outcome_sub_Partner                    0
outcome_sub_Poss

**Missing values conclusion**

For age, we found the 8 missing data points and filled them in based on date_of_birth and datetime.

For name, we dropped the column because it doesn't matter for our analysis; it won't be a predictor.

For sex, we turned the categories into dummy columns.

For outcome_subtype, we turned the categories into dummy columns. The original outcome_subtype column still shows 42293 missing values, but it is not used for modeling.

For outcome_type, we recoded the values into adopted vs not adopted and then mapped them to a binary dummy (outcome_dum).

In [31]:
#need to make final dataframe first that we will use for analysis
#age and unit are not selected here because they are only created further down, when age_upon_outcome is split
df=df[["age_upon_outcome","animal_id","animal_type","breed","color","date_of_birth","datetime","monthyear","outcome_sub_Aggressive","outcome_sub_At Vet","outcome_sub_Barn","outcome_sub_Behavior","outcome_sub_Court/Investigation","outcome_sub_Enroute","outcome_sub_Foster","outcome_sub_In Foster","outcome_sub_In Kennel","outcome_sub_In Surgery","outcome_sub_Medical","outcome_sub_Offsite","outcome_sub_Partner","outcome_sub_Possible Theft","outcome_sub_Rabies Risk",
"outcome_sub_SCRP","outcome_sub_Snr","outcome_sub_Suffering","outcome_sub_Underage","sex_Intact Female","sex_Intact Male","sex_Neutered Male","sex_Spayed Female","sex_Unknown","outcome_dum"]].copy()

In [None]:
#should have 33 columns
df.head()

In [None]:
df.groupby(["age_upon_outcome"]).agg({"age_upon_outcome":"count"})


In [None]:
#need to split age column into units and numbers for regression later
df[["age","unit"]]= df.age_upon_outcome.str.split(" ", expand=True)

In [None]:
#need to make new age column a number. assign the result back so the conversion is kept
df["age"] = pd.to_numeric(df["age"])

In [None]:
#then turn it into a float to be able to plot it
df["age"] = df["age"].astype(float)

In [None]:
#check to make sure dtype is now float
df["age"].describe()

In [None]:
#check to make sure new columns were made
df.head()

In [None]:
#look for outliers - only for continuous data
#the only continuous data is age
df["age"].hist()

**How do we deal with outliers here? Drop or keep?**
The data is right-skewed.
There don't seem to be any obvious outliers.
However, it's hard to tell what an outlier would actually be, since the ages are in different units (days, weeks, months, years) and the histogram plots the raw numbers regardless of unit.
Because of this, we shouldn't drop any outliers at this time.
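
Because the units differ, one option (not what this notebook does) is to convert every age to a single unit before looking for outliers. This is a minimal sketch on made-up values, assuming roughly 30 days per month and 365 days per year. The `DAYS_PER_UNIT` table and the `age_days` name are illustrative.

```python
import pandas as pd

# Made-up ages in the same "<number> <unit>" format as age_upon_outcome.
ages = pd.Series(["2 weeks", "1 year", "5 months", "9 years", "3 days"],
                 name="age_upon_outcome")

# Split into the number and the unit, as the notebook does further down.
parts = ages.str.split(" ", expand=True)
number = pd.to_numeric(parts[0])
unit = parts[1].str.rstrip("s")  # "weeks" -> "week", "years" -> "year"

# Approximate conversion factors (assumption, not from the dataset).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

age_days = number * unit.map(DAYS_PER_UNIT)
print(age_days.tolist())
```

With every age in days, a histogram would compare like with like, and a value such as 9 would no longer be ambiguous between 9 days and 9 years.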

In [None]:
#create density plot for continuous data - the only continuous variable is age
#then center and scale dataset as needed
df["age"].plot.density()

In [None]:
#import tools to be able to do preprocessing of scaling and centering
from sklearn import preprocessing as prep

In [None]:
#scale data for age
df1 = df[["age"]]
scaled=prep.scale(df1)

In [None]:
#put back into a dataframe
scaled_data = pd.DataFrame(scaled,index=df1.index, columns = df1.columns)

In [None]:
#now the data is scaled around 0. 
scaled_data.hist()

Numeric variables are often on different scales. Centering and scaling (standardizing) subtracts the mean and divides by the standard deviation, so the new mean is 0 and the standard deviation is 1, and no single variable dominates the others just because of its scale. Note that this does not reconcile the mixed units within age (days, weeks, months, years): a value of 2 is treated the same whether it means 2 weeks or 2 years.
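
For reference, the standardization can be written out by hand. A minimal sketch with made-up ages, using NumPy only; `sklearn.preprocessing.scale` applies the same formula with the population standard deviation.

```python
import numpy as np

# Made-up ages, right-skewed like the real column.
age = np.array([1.0, 2.0, 3.0, 10.0])

# z-score: subtract the mean, divide by the (population) standard deviation.
standardized = (age - age.mean()) / age.std()

print(standardized.mean())  # very close to 0
print(standardized.std())   # very close to 1
```

The values keep their order and relative spacing, so the histogram has the same shape as before and only the axis changes.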

In [None]:
#transform the data as needed
#centering and scaling does not change the shape, so the data is still skewed
#reduce the skew with a square root or a log transformation
sqrt_transform = df1.apply(np.sqrt)
sqrt_transform.hist()

In [None]:
log_transformed = (df1+1).apply(np.log) 
log_transformed.hist()

**Justification for transformation**

Skewed data has extreme values. These extreme values in the "tail" can have a disproportionate influence on the model, which could hurt its performance. Reducing the skew brings the distribution closer to normal and can improve the model outcomes. Taking the square root of each data point or taking the natural logarithm of each data point (here after adding 1, which keeps the logarithm defined if an age is 0) are transformations that can reduce skew. We will go with the natural log transformation since its histogram appears closest to normal.
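
The choice between the two transformations can also be checked numerically instead of by eye. A small sketch on made-up right-skewed values, comparing the sample skewness before and after each transformation:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed ages (all values are at least 1).
age = pd.Series([1, 1, 2, 2, 3, 4, 5, 8, 12, 20], dtype=float)

skew_raw = age.skew()
skew_sqrt = np.sqrt(age).skew()
skew_log = np.log(age + 1).skew()  # same +1 shift as the notebook

print("raw :", skew_raw)
print("sqrt:", skew_sqrt)
print("log :", skew_log)
```

A skewness closer to 0 means a more symmetric distribution. On this toy data the log transformation reduces the skew the most, but the real age column should be checked the same way.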

In [None]:
#features/predictors for model: age, sex, animal_type
#response: adopted/not adopted --> outcome_dum
#we need to make animal_type numerical with dummies

In [None]:
#animal_type- need to see what the values are before making dummies
df.groupby(["animal_type"]).agg({"animal_type":"count"})


In [None]:
#assign dummies for animal types
animal_dum = pd.get_dummies(df["animal_type"],prefix="animal")
animal_dum.head()
#put back into original dataframe
df = pd.concat([df,animal_dum],axis=1)
df.head()


**Explanation for feature selection**

The response variable will be adopted vs not adopted (outcome_dum).
Because of this, we need to pick features that would influence an adopter.
Some that commonly come to mind are sex, animal type, age, breed, and color.
Breed and color were not included because they have too many possible values, which makes it difficult to assign dummies to make them numeric for the models below. Therefore, sex, animal type, and age were chosen as predictors.

In [None]:
df.head()

In [33]:
#make final df for training that has X and y variables in it
#run this after the animal_type dummies and log_transformed cells above, otherwise the animal_* columns do not exist yet
df_final = df[["outcome_dum","animal_Bird","animal_Cat","animal_Dog","animal_Livestock","animal_Other","sex_Intact Female","sex_Intact Male","sex_Neutered Male","sex_Spayed Female","sex_Unknown"]]
df_final = pd.concat([df_final,log_transformed],axis=1)
df_final.head()

In [32]:
df_final.info()

**NAIVE BAYES MODEL**

In [None]:
#define x and y
X = df_final[["animal_Bird","animal_Cat","animal_Dog","animal_Livestock","animal_Other","sex_Intact Female","sex_Intact Male","sex_Neutered Male","sex_Spayed Female","sex_Unknown","age"]]
#use a 1-D Series for y; a one-column DataFrame triggers a shape warning in sklearn
y = df_final["outcome_dum"]


In [None]:
#split x and y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state = 1)


In [None]:
#train with bayes
from sklearn.naive_bayes import GaussianNB  
nb = GaussianNB() 
nb.fit(X_train,y_train) 


In [None]:
y_pred = nb.predict(X_test)
pd.Series(y_pred) 


In [None]:
y_prob = nb.predict_proba(X_test)[:,1]

In [None]:
#calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
#calculate AUC score
print(metrics.roc_auc_score(y_test,y_prob))

**SVM** 

In [None]:
#import svm tools
from sklearn import svm
model = svm.SVC()


In [None]:
#fit and train model
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
#accuracy
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
#AUC
print(metrics.roc_auc_score(y_test, y_pred))
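
The AUC above is computed from hard 0/1 predictions, while the Bayes AUC was computed from predicted probabilities. To put the SVM on the same footing, the continuous output of `decision_function` can be passed to `roc_auc_score` instead. This is a sketch on synthetic data from `make_classification`, not on the shelter data, so the numbers it prints say nothing about the results reported below.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any binary classification set works here).
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

model = svm.SVC()
model.fit(X_train, y_train)

# AUC from hard class labels: the ROC curve has a single operating point.
auc_from_labels = metrics.roc_auc_score(y_test, model.predict(X_test))

# AUC from continuous scores: the ROC curve uses every threshold.
auc_from_scores = metrics.roc_auc_score(y_test, model.decision_function(X_test))

print("AUC from labels:", auc_from_labels)
print("AUC from scores:", auc_from_scores)
```

For the KNN model the equivalent score would be `model.predict_proba(X_test)[:,1]`, the same call used for the Bayes model.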

**KNN**

In [None]:
#train and fit KNN model
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5) 
model.fit(X_train, y_train)


In [None]:
y_pred = model.predict(X_test)

In [None]:
#accuracy
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
#AUC
print(metrics.roc_auc_score(y_test, y_pred))

**COMPARISON OF OUTCOMES**

Bayes:
    
    - accuracy = 72.95%
    - AUC = 71.11%
    
SVM:
    
    - accuracy = 76.49%
    - AUC = 60.31%
    
KNN:
    
    - accuracy = 74.87%
    - AUC = 62.43%

Classification accuracy = % of correct predictions made by model

AUC = the fraction of the ROC plot that is underneath the curve. The higher the value, the better the classifier. If we randomly choose 1 positive and 1 negative observation, AUC represents the likelihood that the classifier will assign a higher predicted probability to the positive observation. AUC measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. For example, with a perfect AUC of 1, the classifier ranks every positive point above every negative point. An AUC of 0 would mean that the classifier ranks every negative above every positive, so its predictions are exactly reversed, making it a bad prediction model. When AUC is >0.5 but <1, the classifier ranks a positive above a negative more often than not, so it distinguishes the classes better than chance. At 0.5 (50%), it can't distinguish between positive and negative any better than random guessing.
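
The "random positive vs random negative" reading of AUC can be computed directly by counting pairs. A minimal pure-Python sketch on made-up labels and scores (the `pairwise_auc` helper is a hypothetical name, written only for illustration):

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 3 positives, 3 negatives, one positive ranked too low.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = pairwise_auc(y_true, scores)
print(auc)  # 8 of the 9 pairs are ordered correctly
```

This double loop is only practical for small examples; on real data `metrics.roc_auc_score` gives the same quantity efficiently.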

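We can see above that SVM has the highest accuracy but Bayes has the highest AUC. Seeing as the Bayes model's accuracy is close to the SVM's and its AUC is much better than the SVM's, I would suggest that the Bayes model performed the best. One caveat: the Bayes AUC was computed from predicted probabilities, while the SVM and KNN AUCs were computed from hard 0/1 predictions, so part of the AUC gap may come from that difference rather than from the models themselves.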


The model that was deemed "best" was the Bayes model. It is 73% accurate, meaning that it made a correct prediction for 73% of the test observations. The AUC is decent at about 71%. This is >0.5 and <1, meaning that if we pick one adopted and one not-adopted animal at random, the model gives the adopted one the higher predicted probability about 71% of the time. Overall, I would say that this model is moderate. Accuracy in the 70s isn't amazing, but it's not bad. A "good" model in my opinion would have accuracy and AUC more in the mid 80s to mid 90s.
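
As a rough sanity check on these accuracies (an addition, not part of the original analysis), it can help to compare them with a baseline that always predicts the larger class. The counts below are the "Adopted" and "Not Adopted" totals from the full dataset shown earlier; the test split will differ slightly.

```python
# Class counts from the recoded outcome_type column (full dataset).
adopted = 56611
not_adopted = 21645

# Accuracy of a model that always predicts the majority class.
baseline = max(adopted, not_adopted) / (adopted + not_adopted)
print(f"majority-class baseline accuracy: {baseline:.2%}")
```

If a model's accuracy is close to this baseline, AUC is the more informative of the two metrics for judging it.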