First, load the data into a pandas DataFrame:

In [6]:
import pandas as pd

ShelterDF = pd.read_csv('aac_shelter_outcomes.csv')
ShelterDF.head(20)

,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male
6,1 year,A693700,Other,Squirrel Mix,Tan,2013-12-13T00:00:00,2014-12-13T12:20:00,2014-12-13T12:20:00,,Suffering,Euthanasia,Unknown
7,3 years,A692618,Dog,Chihuahua Shorthair Mix,Brown,2011-11-23T00:00:00,2014-12-08T15:55:00,2014-12-08T15:55:00,*Ella,Partner,Transfer,Spayed Female
8,1 month,A685067,Cat,Domestic Shorthair Mix,Blue Tabby/White,2014-06-16T00:00:00,2014-08-14T18:45:00,2014-08-14T18:45:00,Lucy,,Adoption,Intact Female
9,3 months,A678580,Cat,Domestic Shorthair Mix,White/Black,2014-03-26T00:00:00,2014-06-29T17:45:00,2014-06-29T17:45:00,*Frida,Offsite,Adoption,Spayed Female


The basic question we are trying to answer is: what are the major factors for an animal's outcome?  
The intended audience/client is the animal shelter and anyone involved with animal welfare.

A few ideas for what factors will best predict an animal's outcome:
Age, Animal Type, Breed, Color, Name, Outcome Subtype, and Sex

Pull up the basic info about this DataFrame to see how many blank or "null" entries we are dealing with:

In [7]:
ShelterDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78256 entries, 0 to 78255
Data columns (total 12 columns):
age_upon_outcome    78248 non-null object
animal_id           78256 non-null object
animal_type         78256 non-null object
breed               78256 non-null object
color               78256 non-null object
date_of_birth       78256 non-null object
datetime            78256 non-null object
monthyear           78256 non-null object
name                54370 non-null object
outcome_subtype     35963 non-null object
outcome_type        78244 non-null object
sex_upon_outcome    78254 non-null object
dtypes: object(12)
memory usage: 7.2+ MB


Another way to look at it:

In [8]:
print("Total number of Null Values for Each Column:")
ShelterDF.isnull().sum()

Total number of Null Values for Each Column:


age_upon_outcome        8
animal_id               0
animal_type             0
breed                   0
color                   0
date_of_birth           0
datetime                0
monthyear               0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

So we have 8 missing values for "age_upon_outcome", 23886 for "name", 42293 for "outcome_subtype", 12 for "outcome_type", and 2 for "sex_upon_outcome".
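Raw counts are easier to judge as a share of all rows. With 78256 entries, 23886 missing names is roughly 31% and 42293 missing subtypes is roughly 54%. The sketch below shows one way to get those percentages; the `sample` frame is a made-up stand-in for ShelterDF, so its numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for ShelterDF (assumption: the real frame is loaded from the CSV).
sample = pd.DataFrame({
    "name": ["Lucy", None, "*Johnny", None],
    "outcome_subtype": ["Partner", "Partner", None, None],
    "outcome_type": ["Transfer", "Transfer", "Adoption", "Euthanasia"],
})

# isnull().mean() is the fraction of missing entries per column.
null_share = sample.isnull().mean().mul(100).round(1)
print(null_share.sort_values(ascending=False))
```

On the real data, the same two lines could be run on ShelterDF in place of `sample`.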

Before we deal with the nulls, let's drop a few columns we know we're not going to use:

In [9]:
ShelterDropDF = ShelterDF[['age_upon_outcome', 'animal_type', 'breed', 'color', 'name', 'outcome_subtype', 'outcome_type', 'sex_upon_outcome']]
ShelterDropDF.head()

,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,Cat,Domestic Shorthair Mix,Orange Tabby,,Partner,Transfer,Intact Male
1,1 year,Dog,Beagle Mix,White/Brown,Lucy,Partner,Transfer,Spayed Female
2,1 year,Dog,Pit Bull,Blue/White,*Johnny,,Adoption,Neutered Male
3,9 years,Dog,Miniature Schnauzer Mix,White,Monday,Partner,Transfer,Neutered Male
4,5 months,Other,Bat Mix,Brown,,Rabies Risk,Euthanasia,Unknown


We won't be using animal_id, date_of_birth (since we already have the age at outcome), datetime, or monthyear (it is unclear what these last two columns represent, and Kaggle's metadata provides no description).
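Since age_upon_outcome is stored as text such as "2 weeks" or "9 years", it may help to convert it to a number before using it as a feature. The sketch below is one possible approach, not part of the original notebook. The helper name `age_to_days` and the conversion factors (30 days per month, 365 per year) are assumptions, and it expects every non-null value to have the form "<number> <unit>".

```python
import pandas as pd

# Assumed conversion factors; months and years are approximations.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Convert strings such as '2 weeks' or '1 year' to an approximate day count."""
    if pd.isnull(age):
        return float("nan")
    number, unit = age.split()
    # Strip the plural 's' so 'weeks' and 'week' map to the same key.
    return int(number) * DAYS_PER_UNIT[unit.rstrip("s")]

ages = pd.Series(["2 weeks", "1 year", "9 years", "5 months", None])
age_days = ages.apply(age_to_days)
print(age_days.tolist())
```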

Now let's have another look at the nulls for our updated DataFrame:

In [10]:
print("Total number of Null Values for Each Column:")
ShelterDropDF.isnull().sum()

Total number of Null Values for Each Column:


age_upon_outcome        8
animal_type             0
breed                   0
color                   0
name                23886
outcome_subtype     42293
outcome_type           12
sex_upon_outcome        2
dtype: int64

Let's drop all null rows for outcome_type:

In [11]:
ShelterDropDF = ShelterDropDF.dropna(subset=['outcome_type']) 
ShelterDropDF.isnull().sum()

age_upon_outcome        6
animal_type             0
breed                   0
color                   0
name                23881
outcome_subtype     42281
outcome_type            0
sex_upon_outcome        1
dtype: int64

Now, let's run a correlation chart to see if we can find some insights through our EDA:

In [12]:
import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)
s = ShelterDropDF.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1).style.background_gradient(cmap=cm)
s

,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome
age_upon_outcome,1.0,-0.0770082,-0.00531003,0.0131546,-0.0217777,-0.0424632,-0.00820382,0.0197621
animal_type,-0.0770082,1.0,0.257343,0.0267001,-0.0312165,-0.188544,0.311228,0.0538892
breed,-0.00531003,0.257343,1.0,0.0757076,0.0382518,-0.0780958,0.062102,-0.0248905
color,0.0131546,0.0267001,0.0757076,1.0,0.00172539,0.00380671,0.00566022,0.00731644
name,-0.0217777,-0.0312165,0.0382518,0.00172539,1.0,-0.131042,0.0555968,-0.090606
outcome_subtype,-0.0424632,-0.188544,-0.0780958,0.00380671,-0.131042,1.0,-0.137239,0.0835363
outcome_type,-0.00820382,0.311228,0.062102,0.00566022,0.0555968,-0.137239,1.0,-0.045623
sex_upon_outcome,0.0197621,0.0538892,-0.0248905,0.00731644,-0.090606,0.0835363,-0.045623,1.0


It looks like animal_type shows the strongest positive correlation with outcome_type (about 0.31). The rest are close to zero.

The problem with this method, however, is that it replaces the categorical values with integer codes (missing values become -1), which imply a rank (such as 1, 2, 3) when there is no justification for one:

In [13]:
RankExampleDF = ShelterDropDF.apply(lambda x : pd.factorize(x)[0])
RankExampleDF.head()

,age_upon_outcome,animal_type,breed,color,name,outcome_subtype,outcome_type,sex_upon_outcome
0,0,0,0,0,-1,0,0,0
1,1,1,1,1,0,0,0,1
2,1,1,2,2,1,-1,1,2
3,2,1,3,3,2,0,0,2
4,3,2,4,4,-1,1,2,3
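A measure that does not depend on the arbitrary order of the integer codes is Cramér's V, which is built from the chi-squared statistic of a contingency table. The sketch below is an alternative to the Pearson approach used above, not the notebook's method. The helper `cramers_v` and the toy data are assumptions for illustration. It computes the uncorrected statistic and assumes each column has at least two categories.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical Series (0 = none, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy().astype(float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data in which animal_type fully determines outcome_type.
toy = pd.DataFrame({
    "animal_type": ["Cat", "Cat", "Dog", "Dog", "Other", "Other"],
    "outcome_type": ["Transfer", "Transfer", "Adoption", "Adoption",
                     "Euthanasia", "Euthanasia"],
})
print(cramers_v(toy["animal_type"], toy["outcome_type"]))
```

Unlike the factorized Pearson value, this number stays the same if the category labels are reordered.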


Next, let's run some graphs to get a visual for our data and see how each of our feature variables compares to our target variable, outcome_type:

In [14]:
print("Outcome Type Variables:")
ShelterDropDF.outcome_type.unique()

Outcome Type Variables:


array(['Transfer', 'Adoption', 'Euthanasia', 'Return to Owner', 'Died',
       'Disposal', 'Relocate', 'Missing', 'Rto-Adopt'], dtype=object)

In [15]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('animal_type').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('animal_type').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('animal_type').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('animal_type').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('animal_type').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('animal_type').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('animal_type').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('animal_type').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('animal_type').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Animal Type Vs. Outcome Type')

[Figure: bar chart "Animal Type Vs. Outcome Type", outcome counts per animal type]

It looks like Dogs were most likely to be adopted, followed by Cats. Dogs were also by far the most likely to be returned to their original owner. Cats appeared most likely to be transferred, followed closely by Dogs. "Other" animals were most likely to be euthanized.
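The chart above plots raw counts, so "most likely" statements also reflect how many animals of each type passed through the shelter. A row-normalized crosstab gives the share of each outcome within an animal type and replaces the nine separate groupby lines. The sketch below uses a small made-up frame in place of ShelterDropDF.

```python
import pandas as pd

# Toy stand-in for ShelterDropDF (values are illustrative only).
toy = pd.DataFrame({
    "animal_type": ["Cat", "Dog", "Dog", "Dog", "Other", "Cat"],
    "outcome_type": ["Transfer", "Transfer", "Adoption", "Transfer",
                     "Euthanasia", "Adoption"],
})

# Raw counts: one row per animal type, one column per outcome type.
counts = pd.crosstab(toy["animal_type"], toy["outcome_type"])
# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["animal_type"], toy["outcome_type"], normalize="index")
print(counts)
print(shares.round(2))
# counts.plot.bar(title="Animal Type Vs. Outcome Type") would draw the chart
# (requires matplotlib).
```

The same pattern would apply to sex_upon_outcome and outcome_subtype by swapping the first argument.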

In [16]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('sex_upon_outcome').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('sex_upon_outcome').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('sex_upon_outcome').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('sex_upon_outcome').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('sex_upon_outcome').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('sex_upon_outcome').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('sex_upon_outcome').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('sex_upon_outcome').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('sex_upon_outcome').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Animal Sex Vs. Outcome Type')

[Figure: bar chart "Animal Sex Vs. Outcome Type", outcome counts per sex_upon_outcome value]

It looks like fixed (spayed or neutered) animals were far more likely to be adopted than any others. Animals of unknown sex were most likely to be euthanized. This may be because most of the Unknowns belonged to the "Other" category, which contains uncommon animals whose sex is difficult to determine and which were unlikely to be adopted anyway.
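The guess that most Unknowns belong to the "Other" category can be checked directly rather than assumed. The sketch below filters to rows with an Unknown sex and looks at the animal_type shares. The toy rows are invented, so the real proportions may differ.

```python
import pandas as pd

# Toy stand-in for ShelterDropDF (values are illustrative only).
toy = pd.DataFrame({
    "animal_type": ["Other", "Other", "Cat", "Dog", "Cat"],
    "sex_upon_outcome": ["Unknown", "Unknown", "Unknown",
                         "Neutered Male", "Intact Male"],
})

# Keep only animals of unknown sex, then see which animal types they are.
unknown = toy[toy["sex_upon_outcome"] == "Unknown"]
share_by_type = unknown["animal_type"].value_counts(normalize=True)
print(share_by_type)
```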

In [17]:
Adopted = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Adoption'].groupby('outcome_subtype').size()
Transferred = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Transfer'].groupby('outcome_subtype').size()
Euthanized = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Euthanasia'].groupby('outcome_subtype').size()
Returned = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Return to Owner'].groupby('outcome_subtype').size()
Died = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Died'].groupby('outcome_subtype').size()
Disposed = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Disposal'].groupby('outcome_subtype').size()
Relocated = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Relocate'].groupby('outcome_subtype').size()
Missing = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Missing'].groupby('outcome_subtype').size()
Rto_Adopt = ShelterDropDF[ShelterDropDF['outcome_type'] == 'Rto-Adopt'].groupby('outcome_subtype').size()

data = pd.concat([Adopted, Transferred, Euthanized, Returned, Died, Disposed, Relocated, Missing, Rto_Adopt], axis=1)
data.columns = ['Adopted', 'Transferred', 'Euthanized', 'Returned', 'Died', 'Disposed', 'Relocated', 'Missing', 'Rto_Adopt']
data.plot.bar(title='Outcome Subtype Vs. Outcome Type')

[Figure: bar chart "Outcome Subtype Vs. Outcome Type", outcome counts per outcome subtype]

Based on this data, we find that animals with a Partner subtype were most likely to be transferred, while Foster animals were most likely to be adopted. Animals with a Rabies Risk or Suffering subtype were most likely to be euthanized.

In [18]:
print("Breed Variables w/ Number of those Variables in DataFrame:")
ShelterDropDF.breed.value_counts()

Breed Variables w/ Number of those Variables in DataFrame:


Domestic Shorthair Mix                      23332
Pit Bull Mix                                 6133
Chihuahua Shorthair Mix                      4733
Labrador Retriever Mix                       4607
Domestic Medium Hair Mix                     2323
German Shepherd Mix                          1892
Bat Mix                                      1283
Domestic Longhair Mix                        1228
Australian Cattle Dog Mix                    1059
Siamese Mix                                   998
Bat                                           799
Dachshund Mix                                 798
Boxer Mix                                     674
Miniature Poodle Mix                          648
Border Collie Mix                             646
Catahoula Mix                                 476
Raccoon Mix                                   465
Rat Terrier Mix                               456
Australian Shepherd Mix                       454
Yorkshire Terrier Mix                         437


In [19]:
print("Color Variables w/ Number of those Variables in DataFrame:")
ShelterDropDF.color.value_counts()

Color Variables w/ Number of those Variables in DataFrame:


Black/White                  8151
Black                        6600
Brown Tabby                  4445
Brown                        3483
White                        2784
Brown/White                  2444
Tan/White                    2394
Brown Tabby/White            2338
Orange Tabby                 2180
White/Black                  2100
Blue/White                   2081
Tricolor                     1982
Tan                          1963
Black/Tan                    1829
White/Brown                  1577
Black/Brown                  1532
Brown Brindle/White          1353
Tortie                       1340
Calico                       1338
Blue                         1325
Brown/Black                  1322
White/Tan                    1160
Blue Tabby                   1130
Orange Tabby/White           1095
Red                          1029
Red/White                     860
Torbie                        845
Brown Brindle                 715
Tan/Black                     607
Chocolate/Whit

The rest of the columns have too many distinct values to show in a graph, so we won't visualize them for now.
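One common way to make a high-cardinality column such as breed or color usable is to keep the most frequent values and lump the rest into a single bucket. The sketch below shows the idea. The helper `lump_rare` and the label "Other" are assumptions, and missing values also end up in the bucket.

```python
import pandas as pd

def lump_rare(series, top_n, other_label="Other"):
    """Keep the top_n most frequent categories; replace everything else."""
    # value_counts() sorts by frequency, so the first top_n labels are the keepers.
    top = series.value_counts().index[:top_n]
    return series.where(series.isin(top), other_label)

breeds = pd.Series(
    ["Domestic Shorthair Mix"] * 3 + ["Pit Bull Mix"] * 2 + ["Bat Mix", "Raccoon Mix"]
)
lumped = lump_rare(breeds, top_n=2)
print(lumped.value_counts())
```

The lumped column has only `top_n + 1` distinct values, which is small enough to plot or one-hot encode.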

I think that we're ready to go into machine learning now:  

So my idea is to create multiple models.  Some will contain only one feature variable, and some will contain multiple feature variables.  

Here are the feature variables of each model I will create:
1) (all feature variables)
2) animal_type
3) breed
4) color 
5) name
6) outcome_subtype
7) sex_upon_outcome

Then, I will combine the best performing single feature variable models to create an optimized multiple variable model, which will hopefully work the best.

On second thought, there are just too many distinct values for breed, name, and color, so I'll drop them.

In [20]:
# Keep only the low-cardinality columns (this drops age_upon_outcome, breed, color, and name):
ShelterDropDF = ShelterDropDF[['animal_type', 'outcome_subtype', 'outcome_type', 'sex_upon_outcome']]
#Creating the different DataFrames I'll be building my models off of:
MLAll = ShelterDropDF.dropna()
MLAnimal = ShelterDropDF[['animal_type', 'outcome_type']].dropna()
#I only drop rows on an 'as needed basis' so that I have every variable filled in for the other models:
MLSubtype = ShelterDropDF[['outcome_subtype', 'outcome_type']].dropna()
MLSex = ShelterDropDF[['sex_upon_outcome', 'outcome_type']].dropna()

# Let's Create Model 1:

First, we are going to create a Decision Tree Classifier:

In [21]:
df = pd.get_dummies(MLAll[['animal_type', 'outcome_subtype', 'sex_upon_outcome']])
X = df
y = MLAll.outcome_type

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [23]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

0.9540272499768282

In [24]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

   Adoption       1.00      0.93      0.96      1778
       Died       0.00      0.00      0.00       178
 Euthanasia       0.78      1.00      0.88      1773
    Missing       0.00      0.00      0.00        10
   Transfer       1.00      0.97      0.99      7050

avg / total       0.95      0.95      0.95     10789
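Accuracy alone can hide classes that the model never predicts: Died and Missing both have a recall of 0.00 here. A useful reference point is the accuracy of always predicting the most common class, which can be read off the support column. The sketch below computes that baseline from the counts in the report above.

```python
import pandas as pd

# Support counts copied from the classification report above.
support = pd.Series({
    "Adoption": 1778,
    "Died": 178,
    "Euthanasia": 1773,
    "Missing": 10,
    "Transfer": 7050,
})

# Always predicting the largest class is right this fraction of the time.
baseline_accuracy = support.max() / support.sum()
print(support.idxmax(), round(baseline_accuracy, 3))
# prints: Transfer 0.653
```

Against that baseline of about 0.65, the tree's 0.95 is a real improvement, though it comes entirely from the three large classes.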





Now, let's try a Random Forest Classifier:

In [25]:
from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [26]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

0.99860969505978314

In [27]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))

             precision    recall  f1-score   support

   Adoption       1.00      1.00      1.00      1778
       Died       0.94      0.99      0.96       178
 Euthanasia       1.00      1.00      1.00      1773
    Missing       1.00      0.30      0.46        10
   Transfer       1.00      1.00      1.00      7050

avg / total       1.00      1.00      1.00     10789
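The cells for Models 2 to 4 repeat the same encode, split, fit, and score steps with a different feature column each time. A small helper could run those steps in one call. The sketch below is one way to write it. The function name `evaluate_features` is hypothetical, and it fixes the tree's `random_state` for repeatability, which the original cells do not do.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_features(frame, feature_cols, target_col="outcome_type"):
    """One-hot encode feature_cols, fit a depth-3 tree, return test accuracy."""
    # Drop rows with nulls only in the columns this model actually uses.
    data = frame[feature_cols + [target_col]].dropna()
    X = pd.get_dummies(data[feature_cols])
    y = data[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Toy data in which animal_type fully determines the outcome.
toy = pd.DataFrame({
    "animal_type": ["Cat", "Dog"] * 20,
    "outcome_type": ["Transfer", "Adoption"] * 20,
})
print(evaluate_features(toy, ["animal_type"]))
```

On the real data, a call such as `evaluate_features(ShelterDropDF, ["sex_upon_outcome"])` would stand in for a whole model section.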



# Model 2 (Animal Type Feature Variable):

In [28]:
df = pd.get_dummies(MLAnimal[['animal_type']])
X = df
y = MLAnimal.outcome_type

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [29]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

0.46489733321973248

In [30]:
print(classification_report(y_test, predictions))

                 precision    recall  f1-score   support

       Adoption       0.45      0.61      0.52      9934
           Died       0.00      0.00      0.00       204
       Disposal       0.00      0.00      0.00        92
     Euthanasia       0.69      0.49      0.57      1824
        Missing       0.00      0.00      0.00        14
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.00      0.00      0.00      4306
      Rto-Adopt       0.00      0.00      0.00        45
       Transfer       0.45      0.56      0.50      7050

    avg / total       0.38      0.46      0.41     23474





In [31]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [32]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

0.46489733321973248

In [33]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))

                 precision    recall  f1-score   support

       Adoption       0.45      0.61      0.52      9934
           Died       0.00      0.00      0.00       204
       Disposal       0.00      0.00      0.00        92
     Euthanasia       0.69      0.49      0.57      1824
        Missing       0.00      0.00      0.00        14
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.00      0.00      0.00      4306
      Rto-Adopt       0.00      0.00      0.00        45
       Transfer       0.45      0.56      0.50      7050

    avg / total       0.38      0.46      0.41     23474





# Model 3 (Outcome Subtype Feature Variable):

In [34]:
df = pd.get_dummies(MLSubtype[['outcome_subtype']])
X = df
y = MLSubtype.outcome_type

clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [35]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

0.9540272499768282

In [36]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

   Adoption       1.00      0.93      0.96      1778
       Died       0.00      0.00      0.00       178
 Euthanasia       0.78      1.00      0.88      1773
    Missing       0.00      0.00      0.00        10
   Transfer       1.00      0.97      0.99      7050

avg / total       0.95      0.95      0.95     10789





In [37]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [38]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

0.99879506905181203

In [39]:
rfclf_predictions = rfclf.predict(X_test)

print(classification_report(y_test, rfclf_predictions))

             precision    recall  f1-score   support

   Adoption       1.00      1.00      1.00      1778
       Died       0.94      1.00      0.97       178
 Euthanasia       1.00      1.00      1.00      1773
    Missing       1.00      0.30      0.46        10
   Transfer       1.00      1.00      1.00      7050

avg / total       1.00      1.00      1.00     10789



# Model 4 (Animal Sex Feature Variable):

In [40]:
df = pd.get_dummies(MLSex[['sex_upon_outcome']])
X = df
y = MLSex.outcome_type

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)
y_hat = clf.predict(df)
accuracy_score(y, y_hat)

0.59607888245593854

In [41]:
clf = DecisionTreeClassifier(max_depth=3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [42]:
predictions = clf.predict(X_test)

accuracy_score(y_test, predictions)

0.59740978997145655

In [43]:
print(classification_report(y_test, predictions))

                 precision    recall  f1-score   support

       Adoption       0.60      0.95      0.74      9933
           Died       0.00      0.00      0.00       204
       Disposal       0.00      0.00      0.00        92
     Euthanasia       0.48      0.53      0.50      1824
        Missing       0.00      0.00      0.00        14
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.00      0.00      0.00      4306
      Rto-Adopt       0.00      0.00      0.00        45
       Transfer       0.63      0.51      0.56      7050

    avg / total       0.48      0.60      0.52     23473





In [44]:
rfclf = RandomForestClassifier()
rfclf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [45]:
rfclf_predictions = rfclf.predict(X_test)

accuracy_score(y_test, rfclf_predictions)

0.59740978997145655

In [46]:
print(classification_report(y_test, rfclf_predictions))

                 precision    recall  f1-score   support

       Adoption       0.60      0.95      0.74      9933
           Died       0.00      0.00      0.00       204
       Disposal       0.00      0.00      0.00        92
     Euthanasia       0.48      0.53      0.50      1824
        Missing       0.00      0.00      0.00        14
       Relocate       0.00      0.00      0.00         5
Return to Owner       0.00      0.00      0.00      4306
      Rto-Adopt       0.00      0.00      0.00        45
       Transfer       0.63      0.51      0.56      7050

    avg / total       0.48      0.60      0.52     23473





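It looks like outcome_subtype performs the best and is the most reliable feature variable for predicting outcome_type, according to these machine learning models. Note that the subtype models were evaluated only on rows with a non-null outcome_subtype (10789 test rows across 5 outcome types, versus roughly 23470 rows across 9 outcome types for animal_type and sex_upon_outcome), so the scores are not directly comparable.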
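One possible reason for the very high subtype scores is that outcome_subtype is recorded together with the outcome itself, so each subtype may map onto a single outcome type. This is a hypothesis, not something the notebook verified. The sketch below shows a quick check that counts the distinct outcome types per subtype. The toy rows reuse pairings seen in the sample output and the charts, but they are illustrative only.

```python
import pandas as pd

# Toy stand-in for MLSubtype (values are illustrative only).
toy = pd.DataFrame({
    "outcome_subtype": ["Partner", "Partner", "Rabies Risk",
                        "Suffering", "Offsite", "Foster"],
    "outcome_type": ["Transfer", "Transfer", "Euthanasia",
                     "Euthanasia", "Adoption", "Adoption"],
})

# A value of 1 means the subtype always appears with the same outcome type.
types_per_subtype = toy.groupby("outcome_subtype")["outcome_type"].nunique()
print(types_per_subtype)
```

If most subtypes show a count of 1 on the real data, the subtype is acting as a restatement of the target rather than as a factor known before the outcome.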