# Shelter Animal Outcomes 3

About this attempt:

- Uses XGBoost.
- Predicts from AnimalType, SexuponOutcome, AgeuponOutcome, and Breed. (Dropping Breed had improved results with RFC, so I tried dropping it here too, but XGBoost did worse without it.)
- Drops training rows with missing values.
- Fills missing ages in the test data with the median age from the training data.


For more information see [this link](https://www.kaggle.com/c/shelter-animal-outcomes).

In [264]:
from collections import defaultdict

import pandas as pd
import numpy as np

import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')


## Preprocessing

In [265]:
train_data = pd.read_csv('../DATA/train.csv')
class_headers = list(np.unique(train_data.OutcomeType))
class_headers

['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

In [266]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


Notes on the columns:

- Color is questionable.
- OutcomeSubtype is useless.
- Uncertain about DateTime.
- Name seems too difficult to work with for now.
- Breed also seems difficult to work with; may consider removing it.

In [267]:
useful_columns = ['OutcomeType','AnimalType',
                  'SexuponOutcome','AgeuponOutcome','Breed']
train_data = train_data[useful_columns]

In [268]:
train_data.isnull().sum()

OutcomeType        0
AnimalType         0
SexuponOutcome     1
AgeuponOutcome    18
Breed              0
dtype: int64

In [269]:
train_data = train_data[train_data.AgeuponOutcome.notnull()]
train_data = train_data[train_data.SexuponOutcome.notnull()]

In [270]:
train_data.isnull().sum()

OutcomeType       0
AnimalType        0
SexuponOutcome    0
AgeuponOutcome    0
Breed             0
dtype: int64

In [271]:
train_data.head()

Unnamed: 0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,Dog,Neutered Male,1 year,Shetland Sheepdog Mix
1,Euthanasia,Cat,Spayed Female,1 year,Domestic Shorthair Mix
2,Adoption,Dog,Neutered Male,2 years,Pit Bull Mix
3,Transfer,Cat,Intact Male,3 weeks,Domestic Shorthair Mix
4,Transfer,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle


In [272]:
train_data.groupby('Breed').count()

Breed,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome
Abyssinian Mix,2,2,2,2
Affenpinscher Mix,6,6,6,6
Afghan Hound Mix,1,1,1,1
Airedale Terrier,1,1,1,1
Airedale Terrier Mix,5,5,5,5
Airedale Terrier/Labrador Retriever,1,1,1,1
Airedale Terrier/Miniature Schnauzer,1,1,1,1
Akita,3,3,3,3
Akita Mix,11,11,11,11
Akita/Australian Cattle Dog,1,1,1,1


This time we will reduce the breed data to a single flag:

- 1 = purebred
- 0 = mixed breed (the name contains "Mix" or a "/")

In [273]:
def is_purebreed(breed):
    if 'mix' in breed.lower() or '/' in breed:
        return 0
    else:
        return 1

In [274]:
train_data.Breed = train_data.Breed.apply(is_purebreed)
train_data.head()

Unnamed: 0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,Dog,Neutered Male,1 year,0
1,Euthanasia,Cat,Spayed Female,1 year,0
2,Adoption,Dog,Neutered Male,2 years,0
3,Transfer,Cat,Intact Male,3 weeks,0
4,Transfer,Dog,Neutered Male,2 years,0


Now let's convert the ages to a number of days:

In [275]:
train_data.groupby('AgeuponOutcome').count()

AgeuponOutcome,OutcomeType,AnimalType,SexuponOutcome,Breed
0 years,22,22,22,22
1 day,66,66,66,66
1 month,1281,1281,1281,1281
1 week,146,146,146,146
1 weeks,171,171,171,171
1 year,3969,3969,3969,3969
10 months,457,457,457,457
10 years,446,446,446,446
11 months,166,166,166,166
11 years,126,126,126,126


Many a puppy on this list

In [276]:
def convert_date(age):
    # Convert an age string such as '3 weeks' to an approximate number of days.
    # Months count as 30 days and years as 365 days.
    age = age.split()
    age[0] = int(age[0])
    if age[1] in ('day', 'days'):
        return age[0]
    elif age[1] in ('week', 'weeks'):
        return age[0]*7
    elif age[1] in ('month', 'months'):
        return age[0]*30
    elif age[1] in ('year', 'years'):
        return age[0]*365
    
train_data.AgeuponOutcome = train_data.AgeuponOutcome.apply(convert_date)

In [277]:
train_data.groupby('AgeuponOutcome').count()

AgeuponOutcome,OutcomeType,AnimalType,SexuponOutcome,Breed
0,22,22,22,22
1,66,66,66,66
2,99,99,99,99
3,109,109,109,109
4,50,50,50,50
5,24,24,24,24
6,50,50,50,50
7,317,317,317,317
14,529,529,529,529
21,659,659,659,659


It may be important to note that many animals (1,089 rows) have an unknown sex.

In [278]:
train_data.groupby('SexuponOutcome').count()

SexuponOutcome,OutcomeType,AnimalType,AgeuponOutcome,Breed
Intact Female,3504,3504,3504,3504
Intact Male,3519,3519,3519,3519
Neutered Male,9779,9779,9779,9779
Spayed Female,8819,8819,8819,8819
Unknown,1089,1089,1089,1089


It may also be important to note that the classes are quite imbalanced: Died has only 197 rows, against 10,769 for Adoption.

In [279]:
train_data.groupby('OutcomeType').count()

OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
Adoption,10769,10769,10769,10769
Died,197,197,197,197
Euthanasia,1553,1553,1553,1553
Return_to_owner,4785,4785,4785,4785
Transfer,9406,9406,9406,9406
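
One common way to account for this kind of imbalance is to weight each row inversely to its class frequency. The sketch below only computes such weights from the counts shown above; it was not used for this submission, and `balanced_weights` is a hypothetical helper. The formula `n_samples / (n_classes * class_count)` is the usual "balanced" heuristic.

```python
# Class counts copied from the groupby output above.
class_counts = {
    "Adoption": 10769,
    "Died": 197,
    "Euthanasia": 1553,
    "Return_to_owner": 4785,
    "Transfer": 9406,
}


def balanced_weights(counts):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * c) for name, c in counts.items()}


weights = balanced_weights(class_counts)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Per-row weights built this way could be passed to `xgb.DMatrix` through its `weight` argument. Whether that would help the log-loss score here is untested.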


The string columns need to be converted to integer codes.

In [280]:
# This function operates in place.
# np.unique returns sorted values, so the codes follow alphabetical order.
def map_str_to_int(df,col_name):
    categories = list(enumerate(np.unique(df[col_name])))
    map_dict = { name : i for i, name in categories }
    df[col_name] = df[col_name].map(map_dict).astype(int)

In [281]:
#map_str_to_int(train_data,'Breed')
map_str_to_int(train_data,'AnimalType')
map_str_to_int(train_data,'SexuponOutcome')
map_str_to_int(train_data,'OutcomeType')

In [282]:
train_data.head()

Unnamed: 0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,3,1,2,365,0
1,2,0,3,365,0
2,0,1,2,730,0
3,4,0,1,21,0
4,4,1,2,730,0
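
Because `map_str_to_int` derives its codes from whatever values are present in the frame it is given, calling it separately on the train and test data only produces matching codes when both contain the same set of categories. A hedged sketch of one alternative is to build the mapping once and reuse it. `build_mapping` and `encode` are hypothetical names, and the sample values are illustrative rather than the real columns.

```python
def build_mapping(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {name: code for code, name in enumerate(sorted(set(values)))}


def encode(values, mapping, unseen=-1):
    """Encode values with an existing mapping; unseen categories get `unseen`."""
    return [mapping.get(v, unseen) for v in values]


# Illustrative values only, not the real columns.
train_sex = ["Neutered Male", "Spayed Female", "Intact Male",
             "Unknown", "Intact Female"]
test_sex = ["Intact Female", "Spayed Female", "Neutered Male"]

mapping = build_mapping(train_sex)
shared_codes = encode(test_sex, mapping)                   # training mapping
separate_codes = encode(test_sex, build_mapping(test_sex))  # test mapped alone

print(shared_codes)
print(separate_codes)
```

With pandas the same idea is `df[col].map(mapping)`, which leaves categories missing from the mapping as `NaN`.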


## Predict on test data

In [283]:
#[['AnimalType','SexuponOutcome','AgeuponOutcome','Breed']]
train_data2 = np.array(train_data)
train_data = train_data2[:,1:]
label = train_data2[:,0]
label

array([3, 2, 0, ..., 0, 4, 4])

Import and process test data

In [284]:
original_test_data = test_data = pd.read_csv('../DATA/test.csv')

useful_columns = ['AnimalType','SexuponOutcome','AgeuponOutcome','Breed']

test_data2 = test_data[useful_columns].copy() # copy so the assignments below do not write to a view

test_data2.loc[test_data2.AgeuponOutcome.isnull(),'AgeuponOutcome'] = '1 year' # median age of the training data
test_data2.AgeuponOutcome = test_data2.AgeuponOutcome.apply(convert_date)
test_data2.Breed = test_data2.Breed.apply(is_purebreed)

map_str_to_int(test_data2,'AnimalType')
map_str_to_int(test_data2,'SexuponOutcome')

test_data2.head()


Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,1,0,300,0
1,1,3,730,0
2,0,2,365,0
3,1,1,120,0
4,1,2,730,0
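
The `'1 year'` fill value above is hard-coded. A minimal sketch of deriving it from the converted training ages instead, using illustrative numbers rather than the real column (`fill_missing_with_median` is a hypothetical helper):

```python
from statistics import median


def fill_missing_with_median(test_ages, train_ages):
    """Replace None entries in test_ages with the median of train_ages."""
    fill = median(train_ages)
    return [fill if age is None else age for age in test_ages]


# Illustrative ages in days, not the real data.
train_ages = [21, 30, 365, 365, 730, 1095, 3650]
test_ages = [300, None, 730, None]

filled = fill_missing_with_median(test_ages, train_ages)
print(filled)
```

On numeric pandas Series the equivalent would be `test_ages.fillna(train_ages.median())`.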


In [285]:
test_data = np.array(test_data2)

#### XGBOOST

In [286]:
dtrain = xgb.DMatrix(train_data,label=label)
dtest = xgb.DMatrix(test_data)
param = {'max_depth':6, #depth of tree
         'eta':0.1,  #step shrinkage size
         'silent':0,  #print messages (newer XGBoost releases use 'verbosity' instead)
         'objective':'multi:softprob',  #soft max for multiclass with prob of each class
         'num_class':5, #number of classes
         'eval_metric':'mlogloss', #evaluate with logloss for multiclass
         'nthread':4 #use 4 threads
         } 
#consider playing with gamma in the above parameters

num_round = 200
watchlist  = [(dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)



[0]	train-mlogloss:1.511683
[1]	train-mlogloss:1.432516
[2]	train-mlogloss:1.366634
[3]	train-mlogloss:1.310697
[4]	train-mlogloss:1.262502
[5]	train-mlogloss:1.220779
[6]	train-mlogloss:1.184210
[7]	train-mlogloss:1.152169
[8]	train-mlogloss:1.123914
[9]	train-mlogloss:1.098784
[10]	train-mlogloss:1.076475
[11]	train-mlogloss:1.056597
[12]	train-mlogloss:1.038864
[13]	train-mlogloss:1.022989
[14]	train-mlogloss:1.008775
[15]	train-mlogloss:0.995996
[16]	train-mlogloss:0.984534
[17]	train-mlogloss:0.974175
[18]	train-mlogloss:0.964834
[19]	train-mlogloss:0.956391
[20]	train-mlogloss:0.948774
[21]	train-mlogloss:0.941831
[22]	train-mlogloss:0.935530
[23]	train-mlogloss:0.929821
[24]	train-mlogloss:0.924596
[25]	train-mlogloss:0.919817
[26]	train-mlogloss:0.915443
[27]	train-mlogloss:0.911449
[28]	train-mlogloss:0.907777
[29]	train-mlogloss:0.904414
[30]	train-mlogloss:0.901303
[31]	train-mlogloss:0.898452
[32]	train-mlogloss:0.895772
[33]	train-mlogloss:0.893298
[34]	train-mlogloss:0.89
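
For reference, `mlogloss` is the average negative log of the probability assigned to the true class. Below is a small sketch of that computation on made-up predictions; the helper name `multiclass_log_loss` is an assumption for illustration, not an XGBoost function.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean of -log(p[true class]) over all rows."""
    total = 0.0
    for row, label in zip(probs, labels):
        # Clip so that a probability of exactly 0 does not produce log(0).
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(labels)


# A model that always predicts 1/5 for each of the five classes.
uniform = [[0.2] * 5 for _ in range(3)]
baseline = multiclass_log_loss(uniform, [0, 3, 4])
print(baseline)  # equals ln(5), about 1.609
```

That uniform baseline of ln(5) ≈ 1.609 is a natural reference point for the training log above, which starts just below it at 1.51.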

In [287]:
preds = bst.predict(dtest)
result = pd.DataFrame(preds,columns=class_headers, index=original_test_data.ID).reset_index()
result.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.032904,0.001728,0.015115,0.095234,0.855018
1,2,0.536967,0.001835,0.027061,0.255714,0.178422
2,3,0.399488,0.001729,0.022943,0.109058,0.466782
3,4,0.027344,0.009229,0.077792,0.091067,0.794568
4,5,0.452625,0.001646,0.039326,0.303438,0.202965
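
Since `multi:softprob` returns one probability per class, each row of the submission should sum to 1. A quick sanity check along those lines, using only the five rows printed above (`CLASSES`, `rows` and `most_likely` are just local names for this sketch):

```python
CLASSES = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]

# The five prediction rows shown above, keyed by ID.
rows = {
    1: [0.032904, 0.001728, 0.015115, 0.095234, 0.855018],
    2: [0.536967, 0.001835, 0.027061, 0.255714, 0.178422],
    3: [0.399488, 0.001729, 0.022943, 0.109058, 0.466782],
    4: [0.027344, 0.009229, 0.077792, 0.091067, 0.794568],
    5: [0.452625, 0.001646, 0.039326, 0.303438, 0.202965],
}

most_likely = {}
for animal_id, probs in rows.items():
    # Allow for rounding in the printed values.
    assert abs(sum(probs) - 1.0) < 1e-4, animal_id
    most_likely[animal_id] = CLASSES[probs.index(max(probs))]

print(most_likely)
```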


In [288]:
result.to_csv('./solution003.csv',index=False)

### Outcome

We improved a lot with this one. The most important thing to note is that, with XGBoost, results were much better with Breed included than without it, which is the opposite of what happened with RFC.

Score: 0.87055
