# Shelter Animal Outcomes 4

About this attempt:

- Uses XGBoost
- Predicts from Name, DateTime, AnimalType, SexuponOutcome, AgeuponOutcome, Breed
- Excludes training rows with a missing age or sex
- Fills missing ages in the test data with the median age from the training data


For more information see [this link](https://www.kaggle.com/c/shelter-animal-outcomes).

In [116]:
from collections import defaultdict
from math import isnan
from datetime import datetime
import time

import pandas as pd
import numpy as np

import xgboost as xgb

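# Note: sklearn.grid_search and sklearn.cross_validation were removed in
# scikit-learn 0.20; this notebook targets the older API.
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import auc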

import warnings
warnings.filterwarnings('ignore')


## Preprocessing

In [84]:
train_data = pd.read_csv('../DATA/train.csv')
class_headers = list(np.unique(train_data.OutcomeType))
class_headers

['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

In [85]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


In [86]:
useful_columns = ['OutcomeType','Name','DateTime','AnimalType',\
                  'SexuponOutcome','AgeuponOutcome','Breed'
                 ]
train_data = train_data[useful_columns]

In [87]:
train_data.isnull().sum()

OutcomeType          0
Name              7691
DateTime             0
AnimalType           0
SexuponOutcome       1
AgeuponOutcome      18
Breed                0
dtype: int64

In [88]:
train_data = train_data[train_data.AgeuponOutcome.notnull()]
train_data = train_data[train_data.SexuponOutcome.notnull()]

In [89]:
train_data.isnull().sum()

OutcomeType          0
Name              7673
DateTime             0
AnimalType           0
SexuponOutcome       0
AgeuponOutcome       0
Breed                0
dtype: int64

In [90]:
train_data.head()

Unnamed: 0,OutcomeType,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,Hambone,2014-02-12 18:22:00,Dog,Neutered Male,1 year,Shetland Sheepdog Mix
1,Euthanasia,Emily,2013-10-13 12:44:00,Cat,Spayed Female,1 year,Domestic Shorthair Mix
2,Adoption,Pearce,2015-01-31 12:28:00,Dog,Neutered Male,2 years,Pit Bull Mix
3,Transfer,,2014-07-11 19:09:00,Cat,Intact Male,3 weeks,Domestic Shorthair Mix
4,Transfer,,2013-11-15 12:52:00,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle


This time we will recode the Breed column as follows:

- 1 = purebred
- 0 = mixed breed

And the Name column as follows:

- 1 = has a name
- 0 = no name

We will also convert DateTime to a Unix timestamp of the date (the time of day is dropped).

In [91]:
def is_purebreed(breed):
    if 'mix' in breed.lower() or '/' in breed:
        return 0
    else:
        return 1

def has_name(name):
    # pandas reads a missing name as NaN; anything else counts as a name
    return 0 if pd.isnull(name) else 1

def convert_datetime(dt):
    t = datetime.strptime(dt.split()[0],'%Y-%m-%d') #split and just take date
    d = t.timetuple()
    return time.mktime(d) #seconds since the epoch, interpreted in the local timezone

In [92]:
train_data.Breed = train_data.Breed.apply(is_purebreed)
train_data.Name = train_data.Name.apply(has_name)
train_data.DateTime = train_data.DateTime.apply(convert_datetime)
train_data.head()

Unnamed: 0,OutcomeType,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,1,1392181200,Dog,Neutered Male,1 year,0
1,Euthanasia,1,1381636800,Cat,Spayed Female,1 year,0
2,Adoption,1,1422680400,Dog,Neutered Male,2 years,0
3,Transfer,0,1405051200,Cat,Intact Male,3 weeks,0
4,Transfer,0,1384491600,Dog,Neutered Male,2 years,0
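
The DateTime values above depend on the timezone of the machine that ran the notebook, because `time.mktime` interprets the date as local time. The first row, for example, is 18000 seconds later than UTC midnight on 2014-02-12, which is consistent with a UTC-5 local clock. A minimal sketch of a timezone-independent variant follows; `convert_datetime_utc` is a hypothetical helper name that does not appear in the notebook, and it would produce different numbers from the ones shown above.

```python
import calendar
from datetime import datetime

def convert_datetime_utc(dt):
    """Hypothetical variant of convert_datetime that treats the date as UTC."""
    day = datetime.strptime(dt.split()[0], '%Y-%m-%d')  # keep only the date part
    return calendar.timegm(day.timetuple())  # seconds since the epoch at UTC midnight

print(convert_datetime_utc('2014-02-12 18:22:00'))
```

Since tree models only compare feature values against thresholds, a constant shift of a few hours is unlikely to matter much, but a UTC-based conversion makes the notebook reproducible across machines.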


Now let's convert the ages to days:

In [93]:
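def convert_date(age):
    # Convert an age string such as '3 weeks' to an approximate number of days
    # (a month counts as 30 days, a year as 365)
    age = age.split()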
    age[0] = int(age[0])
    if age[1] == 'day' or age[1] == 'days':
        return age[0]
    elif age[1] == 'week' or age[1] == 'weeks':
        return age[0]*7
    elif age[1] == 'month' or age[1] == 'months':
        return age[0]*30
    elif age[1] == 'year' or age[1] == 'years':
        return age[0]*365
    
train_data.AgeuponOutcome = train_data.AgeuponOutcome.apply(convert_date)
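
As a quick check of the conversion rules above, the same arithmetic can be sketched with a lookup table. `age_to_days` and `DAYS_PER_UNIT` are hypothetical names used only for illustration. Unlike `convert_date`, this sketch raises an error on an unrecognised unit instead of silently returning `None`.

```python
# Approximate days per unit, matching the factors used in convert_date
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(age):
    """Convert a string such as '3 weeks' to an approximate number of days."""
    count, unit = age.split()
    unit = unit.rstrip('s')  # 'years' -> 'year', 'weeks' -> 'week'
    if unit not in DAYS_PER_UNIT:
        raise ValueError('unrecognised age unit: %r' % age)
    return int(count) * DAYS_PER_UNIT[unit]

for sample in ['1 year', '2 years', '3 weeks', '10 months']:
    print(sample, '->', age_to_days(sample))
```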

It may be important to note that the sex of many animals is unknown:

In [94]:
train_data.groupby('SexuponOutcome').count()

SexuponOutcome,OutcomeType,Name,DateTime,AnimalType,AgeuponOutcome,Breed
Intact Female,3504,3504,3504,3504,3504,3504
Intact Male,3519,3519,3519,3519,3519,3519
Neutered Male,9779,9779,9779,9779,9779,9779
Spayed Female,8819,8819,8819,8819,8819,8819
Unknown,1089,1089,1089,1089,1089,1089


It may also be important to note that the classes are quite imbalanced:

In [95]:
train_data.groupby('OutcomeType').count()

OutcomeType,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
Adoption,10769,10769,10769,10769,10769,10769
Died,197,197,197,197,197,197
Euthanasia,1553,1553,1553,1553,1553,1553
Return_to_owner,4785,4785,4785,4785,4785,4785
Transfer,9406,9406,9406,9406,9406,9406
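
To put the imbalance in numbers, the sketch below turns the counts above into class shares and computes the log loss of a naive model that always predicts those shares. This is only a reference point for judging later scores, not part of the original pipeline; the variable names are illustrative.

```python
from math import log

# Row counts per outcome, copied from the table above
counts = {'Adoption': 10769, 'Died': 197, 'Euthanasia': 1553,
          'Return_to_owner': 4785, 'Transfer': 9406}

total = sum(counts.values())
priors = {name: n / total for name, n in counts.items()}

for name, share in priors.items():
    print('%-16s %.4f' % (name, share))

# Always predicting the class shares gives a log loss equal to their entropy
baseline = -sum(p * log(p) for p in priors.values())
print('baseline log loss: %.4f' % baseline)
```

Adoption accounts for roughly 40% of the rows and Died for under 1%, so a useful model should score well below this baseline.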


We need to convert the string columns to integer codes:

In [96]:
# This function operates in place
def map_str_to_int(df,col_name):
    categories = list(enumerate(np.unique(df[col_name])))
    map_dict = { name : i for i, name in categories }              
    df[col_name] = df[col_name].map( lambda x: map_dict[x]).astype(int)

In [97]:
#map_str_to_int(train_data,'Breed')
map_str_to_int(train_data,'AnimalType')
map_str_to_int(train_data,'SexuponOutcome')
map_str_to_int(train_data,'OutcomeType')
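
Note that `map_str_to_int` derives its codes from whichever frame it is given, so the training and test codes agree only when both frames contain the same set of values. A more defensive approach, sketched below under that assumption, is to fit the mapping once on the training values and reuse it. `fit_category_map` and `apply_category_map` are hypothetical helpers, and the sample lists are made up for illustration.

```python
def fit_category_map(values):
    """Map each distinct value to an integer code, in sorted order."""
    return {name: i for i, name in enumerate(sorted(set(values)))}

def apply_category_map(values, mapping, unknown=-1):
    """Encode values with an existing mapping; unseen values get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

# Illustrative samples, not the real columns
train_sex = ['Neutered Male', 'Spayed Female', 'Intact Male',
             'Unknown', 'Intact Female']
test_sex = ['Intact Female', 'Spayed Female', 'Neutered Male']

sex_map = fit_category_map(train_sex)
print(apply_category_map(test_sex, sex_map))
```

With this split, a test set that lacks one category (for example no `Unknown` rows) still receives the same codes as the training set.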

In [98]:
train_data.head()

Unnamed: 0,OutcomeType,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,3,1,1392181200,1,2,365,0
1,2,1,1381636800,0,3,365,0
2,0,1,1422680400,1,2,730,0
3,4,0,1405051200,0,1,21,0
4,4,0,1384491600,1,2,730,0


## Predict on test data

In [99]:
# Column 0 is the OutcomeType label; the remaining columns are the features
train_data2 = np.array(train_data)
train_data = train_data2[:,1:]
label = train_data2[:,0]
label

array([ 3.,  2.,  0., ...,  0.,  4.,  4.])

Import and process test data

In [100]:
original_test_data = test_data = pd.read_csv('../DATA/test.csv')
test_data.head()

Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,Summer,2015-10-12 12:15:00,Dog,Intact Female,10 months,Labrador Retriever Mix,Red/White
1,2,Cheyenne,2014-07-26 17:59:00,Dog,Spayed Female,2 years,German Shepherd/Siberian Husky,Black/Tan
2,3,Gus,2016-01-13 12:20:00,Cat,Neutered Male,1 year,Domestic Shorthair Mix,Brown Tabby
3,4,Pongo,2013-12-28 18:12:00,Dog,Intact Male,4 months,Collie Smooth Mix,Tricolor
4,5,Skooter,2015-09-24 17:59:00,Dog,Neutered Male,2 years,Miniature Poodle Mix,White


In [101]:
test_data.isnull().sum()

ID                   0
Name              3225
DateTime             0
AnimalType           0
SexuponOutcome       0
AgeuponOutcome       6
Breed                0
Color                0
dtype: int64

In [102]:
useful_columns = ['Name','DateTime','AnimalType','SexuponOutcome','AgeuponOutcome','Breed']

test_data2 = test_data[useful_columns].copy() #copy so the assignments below do not write to a view

test_data2.loc[test_data2.AgeuponOutcome.isnull(),'AgeuponOutcome'] = '1 year' #median age of the training data
test_data2.AgeuponOutcome = test_data2.AgeuponOutcome.apply(convert_date)
test_data2.Breed = test_data2.Breed.apply(is_purebreed)
test_data2.Name = test_data2.Name.apply(has_name)
test_data2.DateTime = test_data2.DateTime.apply(convert_datetime)

map_str_to_int(test_data2,'AnimalType')
map_str_to_int(test_data2,'SexuponOutcome')

test_data2.head()


Unnamed: 0,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,1,1444622400,1,0,300,0
1,1,1406347200,1,3,730,0
2,1,1452661200,0,2,365,0
3,1,1388206800,1,1,120,0
4,1,1443067200,1,2,730,0


In [103]:
test_data = np.array(test_data2)

#### XGBoost

In [113]:
dtrain = xgb.DMatrix(train_data,label=label)
dtest = xgb.DMatrix(test_data)
param = {'max_depth':6, #depth of tree
         'eta':.1,  #step shrinkage size
         'silent':1,  #1 = silent mode, 0 = print running messages
         'objective':'multi:softprob',  #soft max for multiclass with prob of each class
         'num_class':5, #number of classes
         'eval_metric':'mlogloss', #evaluate with logloss for multiclass
         'nthread':4 #use 4 threads
         } 
#consider playing with gamma in the above parameters

num_round = 200
watchlist  = [(dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)



[0]	train-mlogloss:1.508326
[1]	train-mlogloss:1.426109
[2]	train-mlogloss:1.357440
[3]	train-mlogloss:1.299060
[4]	train-mlogloss:1.248829
[5]	train-mlogloss:1.205102
[6]	train-mlogloss:1.166604
[7]	train-mlogloss:1.132723
[8]	train-mlogloss:1.102603
[9]	train-mlogloss:1.075911
[10]	train-mlogloss:1.052022
[11]	train-mlogloss:1.030734
[12]	train-mlogloss:1.011690
[13]	train-mlogloss:0.994520
[14]	train-mlogloss:0.978730
[15]	train-mlogloss:0.964430
[16]	train-mlogloss:0.951613
[17]	train-mlogloss:0.940010
[18]	train-mlogloss:0.929499
[19]	train-mlogloss:0.919961
[20]	train-mlogloss:0.911219
[21]	train-mlogloss:0.903158
[22]	train-mlogloss:0.895824
[23]	train-mlogloss:0.889042
[24]	train-mlogloss:0.882595
[25]	train-mlogloss:0.876777
[26]	train-mlogloss:0.871391
[27]	train-mlogloss:0.866376
[28]	train-mlogloss:0.861738
[29]	train-mlogloss:0.857281
[30]	train-mlogloss:0.853266
[31]	train-mlogloss:0.849385
[32]	train-mlogloss:0.845901
[33]	train-mlogloss:0.842466
[34]	train-mlogloss:0.83

##### Cross validation grid search

In [143]:
gbm = xgb.XGBClassifier()
grid_params = {'max_depth':list(range(3,11)), #depth of tree
             'learning_rate':[0,0.1,0.5,1],  #step shrinkage size (eta); a value of 0 means the trees contribute nothing
             'silent':[1],  #1 = silent mode, 0 = print running messages
             'objective':['multi:softprob'],  #soft max for multiclass with prob of each class
             'n_estimators':[30,100,200], #number of trees
             #'num_class':[5], #number of classes
             #'eval_metric':['mlogloss'], #evaluate with logloss for multiclass
             'nthread':[4] #use 4 threads
             } 
cv = StratifiedKFold(label,n_folds=5,shuffle=True)
grid = GridSearchCV(gbm, grid_params,scoring='log_loss',cv=cv, verbose=1, refit=True)
grid.fit(train_data, label)

# The 'log_loss' scorer reports the negated log loss, so the largest score is the best
best_parameters, score, _ = max(grid.grid_scores_, key=lambda x: x[1])
print('Log Loss score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:  2.2min
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed: 13.3min
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed: 29.8min
[Parallel(n_jobs=1)]: Done 480 out of 480 | elapsed: 32.7min finished


Log Loss score: -0.829475968846
learning_rate: 0.1
max_depth: 6
n_estimators: 200
nthread: 4
objective: 'multi:softprob'
silent: 1
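
The "96 candidates, totalling 480 fits" line in the output follows directly from the grid: every combination of the multi-valued parameters is one candidate, and each candidate is fitted once per fold. A small sketch of that count, with the single-valued parameters left out because they do not change the product:

```python
from itertools import product

# Only the parameters that take more than one value in grid_params
grid = {'max_depth': list(range(3, 11)),
        'learning_rate': [0, 0.1, 0.5, 1],
        'n_estimators': [30, 100, 200]}

# One dict per candidate, in the style of a parameter grid
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
n_folds = 5

print(len(candidates), 'candidates,', len(candidates) * n_folds, 'fits')
```

Because the number of fits grows multiplicatively, adding one more multi-valued parameter (such as gamma) would multiply the half-hour run time reported above.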


In [144]:
preds = bst.predict(dtest) #bst was trained above with the same settings the grid search selected
result = pd.DataFrame(preds,columns=class_headers, index=original_test_data.ID).reset_index()
result.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.042894,0.001633,0.020324,0.154239,0.780909
1,2,0.604256,0.000777,0.030595,0.229634,0.134738
2,3,0.616,0.00114,0.003323,0.138099,0.241437
3,4,0.244507,0.000879,0.0265,0.126168,0.601946
4,5,0.586514,0.001957,0.015831,0.225554,0.170144
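
With the `multi:softprob` objective, each row of `preds` is a probability distribution over the five classes. The sketch below checks this on the first three rows shown above (copied by hand, so rounded to six decimals) and picks the most likely outcome for each; it is a sanity check, not part of the submission.

```python
class_headers = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

# First three prediction rows, copied from the table above
rows = [
    [0.042894, 0.001633, 0.020324, 0.154239, 0.780909],
    [0.604256, 0.000777, 0.030595, 0.229634, 0.134738],
    [0.616, 0.00114, 0.003323, 0.138099, 0.241437],
]

most_likely = []
for probs in rows:
    # Each row should sum to 1, up to the rounding in the printed table
    assert abs(sum(probs) - 1.0) < 1e-4
    most_likely.append(class_headers[probs.index(max(probs))])

print(most_likely)
```

The competition is scored on the full probability rows rather than on the single most likely class, which is why the submission file keeps all five columns.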


In [112]:
result.to_csv('./solution004.csv',index=False)

### Outcome

This attempt improved on the previous ones.

Score: 0.81651 (multiclass log loss, lower is better)
