# Shelter Animal Outcomes 4

About this attempt:

- Uses XGBoost, with a random forest for comparison
- Predicts on features derived from Name, DateTime, AnimalType, SexuponOutcome, AgeuponOutcome, and Breed
- Drops the one training row with a missing SexuponOutcome
- Fills missing ages in both train and test data with the median training age (1 year)


For more information see [this link](https://www.kaggle.com/c/shelter-animal-outcomes).

In [685]:
from collections import defaultdict
from math import isnan
from datetime import datetime
import time

import pandas as pd
import numpy as np

import xgboost as xgb

from sklearn.ensemble import RandomForestClassifier
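# Note: sklearn.grid_search and sklearn.cross_validation are legacy modules
# (removed in scikit-learn 0.20 in favor of sklearn.model_selection).
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import cross_val_score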
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')


## Preprocessing

In [686]:
train_data = pd.read_csv('../DATA/train.csv')
class_headers = list(np.unique(train_data.OutcomeType))
class_headers

['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

In [687]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


In [688]:
train_data.isnull().sum()

AnimalID              0
Name               7691
DateTime              0
OutcomeType           0
OutcomeSubtype    13612
AnimalType            0
SexuponOutcome        1
AgeuponOutcome       18
Breed                 0
Color                 0
dtype: int64

In [689]:
train_data.loc[train_data.AgeuponOutcome.isnull(),'AgeuponOutcome'] = '1 year' #median age

In [690]:
train_data = train_data[train_data.SexuponOutcome.notnull()]

In [691]:
train_data.isnull().sum()

AnimalID              0
Name               7691
DateTime              0
OutcomeType           0
OutcomeSubtype    13611
AnimalType            0
SexuponOutcome        0
AgeuponOutcome        0
Breed                 0
Color                 0
dtype: int64

In [692]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


This time we will rework the breed data as follows:

- 1 = purebred
- 0 = mixed breed

And the Name data as follows:

- 1 = has a name
- 0 = no name

We will also convert DateTime to a Unix timestamp.

In [693]:
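def is_purebreed(breed):
    # A breed is treated as mixed if it says "mix" or lists two breeds with "/"
    if 'mix' in breed.lower() or '/' in breed:
        return 0
    else:
        return 1

def has_name(name):
    # Missing names are read as NaN; anything else counts as a name
    return 0 if pd.isnull(name) else 1

def convert_datetime(dt):
    # time.mktime interprets the parsed time in the machine's local timezone
    t = datetime.strptime(dt,'%Y-%m-%d %H:%M:%S')
    d = t.timetuple()
    return time.mktime(d)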

In [694]:
train_data['IsPurebreed'] = train_data.Breed.apply(is_purebreed)
train_data['HasName'] = train_data.Name.apply(has_name)
train_data.DateTime = train_data.DateTime.apply(convert_datetime)
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,IsPurebreed,HasName
0,A671945,Hambone,1392247320,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White,0,1
1,A656520,Emily,1381682640,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby,0,1
2,A686464,Pearce,1422725280,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White,0,1
3,A683430,,1405120140,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream,0,0
4,A667013,,1384537920,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan,0,0
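
The same two flags can also be computed without `apply`, using pandas string and null checks. The sketch below is only an illustration on a small hypothetical frame (the `sample` rows are invented for the example), not the code used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows mirroring the Breed and Name columns above.
sample = pd.DataFrame({
    'Breed': ['Shetland Sheepdog Mix', 'Lhasa Apso/Miniature Poodle', 'Beagle'],
    'Name': ['Hambone', np.nan, 'Rex'],
})

# A breed counts as mixed if it mentions "mix" or lists two breeds with "/".
is_mixed = sample['Breed'].str.lower().str.contains('mix|/', regex=True)
sample['IsPurebreed'] = (~is_mixed).astype(int)

# Missing names are NaN after read_csv, so notnull() flags named animals.
sample['HasName'] = sample['Name'].notnull().astype(int)

print(sample)
```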


Now let's convert the ages to days:

In [695]:
def convert_age(age):
    # Convert strings such as '3 weeks' to an approximate age in days
    count, unit = age.split()
    count = int(count)
    if unit in ('day', 'days'):
        return count
    elif unit in ('week', 'weeks'):
        return count*7
    elif unit in ('month', 'months'):
        return count*30
    elif unit in ('year', 'years'):
        return count*365
    # Fail loudly instead of silently returning None for an unexpected unit
    raise ValueError('Unrecognized age unit: %r' % unit)
    
train_data.AgeuponOutcome = train_data.AgeuponOutcome.apply(convert_age)

It may be important to note that there are many animals of unknown sex:

In [696]:
train_data.groupby('SexuponOutcome').count()

SexuponOutcome,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,AgeuponOutcome,Breed,Color,IsPurebreed,HasName
Intact Female,3511,1462,3511,3511,3015,3511,3511,3511,3511,3511,3511
Intact Male,3525,1526,3525,3525,2899,3525,3525,3525,3525,3525,3525
Neutered Male,9779,8434,9779,9779,3225,9779,9779,9779,9779,9779,9779
Spayed Female,8820,7578,8820,8820,2902,8820,8820,8820,8820,8820,8820
Unknown,1093,37,1093,1093,1076,1093,1093,1093,1093,1093,1093


In [697]:
train_data.groupby('AgeuponOutcome').count()

AgeuponOutcome,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,Breed,Color,IsPurebreed,HasName
0,22,0,22,22,22,22,22,22,22,22,22
1,66,0,66,66,60,66,66,66,66,66,66
2,99,0,99,99,99,99,99,99,99,99,99
3,109,5,109,109,109,109,109,109,109,109,109
4,50,2,50,50,50,50,50,50,50,50,50
5,24,0,24,24,24,24,24,24,24,24,24
6,50,2,50,50,50,50,50,50,50,50,50
7,317,17,317,317,315,317,317,317,317,317,317
14,529,35,529,529,511,529,529,529,529,529,529
21,659,21,659,659,657,659,659,659,659,659,659


In [698]:
## Let's make more categories based on age
def in_age_range(val,low,high):
    if low <= val <= high:
        return 1
    else:
        return 0
    
train_data['LessThanMonthOld'] = train_data.AgeuponOutcome.apply(lambda x:in_age_range(x,0,30))
train_data['31To365'] = train_data.AgeuponOutcome.apply(lambda x:in_age_range(x,31,365))
train_data['YearTo1000'] = train_data.AgeuponOutcome.apply(lambda x:in_age_range(x,366,1000))
train_data['1000to4000'] = train_data.AgeuponOutcome.apply(lambda x:in_age_range(x,1001,4000))
train_data['OlderThan4000'] = train_data.AgeuponOutcome.apply(lambda x:in_age_range(x,4001,12000))
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,IsPurebreed,HasName,LessThanMonthOld,31To365,YearTo1000,1000to4000,OlderThan4000
0,A671945,Hambone,1392247320,Return_to_owner,,Dog,Neutered Male,365,Shetland Sheepdog Mix,Brown/White,0,1,0,1,0,0,0
1,A656520,Emily,1381682640,Euthanasia,Suffering,Cat,Spayed Female,365,Domestic Shorthair Mix,Cream Tabby,0,1,0,1,0,0,0
2,A686464,Pearce,1422725280,Adoption,Foster,Dog,Neutered Male,730,Pit Bull Mix,Blue/White,0,1,0,0,1,0,0
3,A683430,,1405120140,Transfer,Partner,Cat,Intact Male,21,Domestic Shorthair Mix,Blue Cream,0,0,1,0,0,0,0
4,A667013,,1384537920,Transfer,Partner,Dog,Neutered Male,730,Lhasa Apso/Miniature Poodle,Tan,0,0,0,0,1,0,0
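
As an alternative to five separate `apply` calls, the same buckets could be produced in one pass with `pd.cut`. This is a sketch under the assumption that ages are whole days; the sample ages are made up, and the bin edges are chosen to reproduce the inclusive ranges used above:

```python
import pandas as pd

# Hypothetical ages in days.
ages_in_days = pd.Series([21, 365, 730, 3650, 5475])

# Edges matching the inclusive ranges above:
# 0-30, 31-365, 366-1000, 1001-4000, 4001-12000.
edges = [-1, 30, 365, 1000, 4000, 12000]
names = ['LessThanMonthOld', '31To365', 'YearTo1000', '1000to4000', 'OlderThan4000']

# pd.cut uses right-closed intervals by default, e.g. (30, 365].
bucket = pd.cut(ages_in_days, bins=edges, labels=names)

# One 0/1 indicator column per bucket.
indicators = pd.get_dummies(bucket).astype(int)
print(indicators)
```

Ages outside the outermost edges would become NaN in `bucket`, which mirrors the original code assigning 0 to every indicator for such rows.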


In [699]:
## Create Category for male/female
def male_or_female(val):
    if 'Female' in val:
        return 1
    elif 'Unknown' in val:
        return 2
    else:
        return 0
    
train_data['Female'] = train_data.SexuponOutcome.apply(male_or_female)

In [700]:
## Intact or not
def intact(val):
    if 'Unknown' in val:
        return 2
    if 'Intact' in val:
        return 1
    else:
        return 0
    
train_data['Intact'] = train_data.SexuponOutcome.apply(intact)

In [701]:
## Breeds
def parse_breed(val,contains):
    for i in contains:
        if i in val.lower():
            return 1
    return 0

train_data['IsMix'] = train_data['Breed'].apply(lambda x:parse_breed(x, ["mix"]))
train_data['Cross'] = train_data['Breed'].apply(lambda x:parse_breed(x, ["/"]))
train_data['Miniature'] = train_data['Breed'].apply(lambda x:parse_breed(x, ["miniature"]))
train_data['IsShihTzu'] = train_data['Breed'].apply(lambda x:parse_breed(x, ["shih tzu"]))
train_data['IsAggressive'] = train_data['Breed'].apply(lambda x:parse_breed(x, ["rottweiler", "pit bull", "siberian husky"]))

In [702]:
## colors
def parse_color(val, color):
    if color in val:
        return 1
    else:
        return 0
    
colors = ['Black','White','Brown','Gray','Yellow','Red','Blue',\
         'Orange','Calico','Chocolate','Gold','Tan','Tortie','Cream',\
         'Silver','Buff','Liver','Lilac','Tabby','Tricolor','Smoke',\
         'Brindle','Fawn','Flame','Point']

for color in colors:
    train_data['Has'+color] = train_data['Color'].apply(lambda x: parse_color(x,color))

train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,...,HasBuff,HasLiver,HasLilac,HasTabby,HasTricolor,HasSmoke,HasBrindle,HasFawn,HasFlame,HasPoint
0,A671945,Hambone,1392247320,Return_to_owner,,Dog,Neutered Male,365,Shetland Sheepdog Mix,Brown/White,...,0,0,0,0,0,0,0,0,0,0
1,A656520,Emily,1381682640,Euthanasia,Suffering,Cat,Spayed Female,365,Domestic Shorthair Mix,Cream Tabby,...,0,0,0,1,0,0,0,0,0,0
2,A686464,Pearce,1422725280,Adoption,Foster,Dog,Neutered Male,730,Pit Bull Mix,Blue/White,...,0,0,0,0,0,0,0,0,0,0
3,A683430,,1405120140,Transfer,Partner,Cat,Intact Male,21,Domestic Shorthair Mix,Blue Cream,...,0,0,0,0,0,0,0,0,0,0
4,A667013,,1384537920,Transfer,Partner,Dog,Neutered Male,730,Lhasa Apso/Miniature Poodle,Tan,...,0,0,0,0,0,0,0,0,0,0


It may also be important to note that the classes are quite imbalanced:

In [703]:
train_data.groupby('OutcomeType').count()

OutcomeType,AnimalID,Name,DateTime,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,IsPurebreed,...,HasBuff,HasLiver,HasLilac,HasTabby,HasTricolor,HasSmoke,HasBrindle,HasFawn,HasFlame,HasPoint
Adoption,10769,9091,10769,1966,10769,10769,10769,10769,10769,10769,...,10769,10769,10769,10769,10769,10769,10769,10769,10769,10769
Died,197,77,197,181,197,197,197,197,197,197,...,197,197,197,197,197,197,197,197,197,197
Euthanasia,1555,740,1555,1554,1555,1555,1555,1555,1555,1555,...,1555,1555,1555,1555,1555,1555,1555,1555,1555,1555
Return_to_owner,4785,4632,4785,0,4785,4785,4785,4785,4785,4785,...,4785,4785,4785,4785,4785,4785,4785,4785,4785,4785
Transfer,9422,4497,9422,9416,9422,9422,9422,9422,9422,9422,...,9422,9422,9422,9422,9422,9422,9422,9422,9422,9422
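
To put the imbalance in numbers, the class counts from the table above can be turned into proportions. The sketch below also computes the log loss that a constant prediction of those proportions would score on the training set, which gives a rough baseline for the models later on. The baseline is an added illustration, not something computed in the original notebook:

```python
import numpy as np

# Class counts taken from the AnimalID column of the table above.
counts = {
    'Adoption': 10769, 'Died': 197, 'Euthanasia': 1555,
    'Return_to_owner': 4785, 'Transfer': 9422,
}
total = sum(counts.values())
priors = {name: n / total for name, n in counts.items()}

# Always predicting the class priors gives a log loss equal to the
# entropy of the class distribution.
p = np.array(list(priors.values()))
baseline_log_loss = float(-(p * np.log(p)).sum())

for name, share in priors.items():
    print(f'{name:16s} {share:.3f}')
print('baseline log loss:', round(baseline_log_loss, 3))
```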


We need to convert the string data to numerical codes:

In [704]:
# This function operates in place
def map_str_to_int(df,col_name):
    categories = list(enumerate(np.unique(df[col_name])))
    map_dict = { name : i for i, name in categories }              
    df[col_name] = df[col_name].map( lambda x: map_dict[x]).astype(int)

In [705]:
#map_str_to_int(train_data,'Breed')
map_str_to_int(train_data,'AnimalType')
map_str_to_int(train_data,'SexuponOutcome')
map_str_to_int(train_data,'OutcomeType')
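
One caveat with `map_str_to_int`: it builds its mapping from whatever frame it is given, so train and test only receive the same codes when both contain exactly the same set of categories. A possible safeguard, sketched here with a hypothetical helper `fit_category_map` and toy columns, is to build the mapping once from the training column and reuse it:

```python
import pandas as pd

def fit_category_map(values):
    """Build a value -> integer code mapping from a column (hypothetical helper)."""
    return {name: code for code, name in enumerate(sorted(pd.unique(values)))}

# Hypothetical toy columns; the test column lacks one training category.
train_col = pd.Series(['Intact Male', 'Spayed Female', 'Unknown', 'Neutered Male'])
test_col = pd.Series(['Spayed Female', 'Neutered Male', 'Unknown'])

mapping = fit_category_map(train_col)
train_codes = train_col.map(mapping)
test_codes = test_col.map(mapping)  # same codes as in training

# For contrast: fitting on the test column alone shifts every code.
independent_codes = test_col.map(fit_category_map(test_col))

print(mapping)
print(test_codes.tolist(), independent_codes.tolist())
```

With a shared mapping, a category that never appears in training would map to NaN, which at least makes the mismatch visible.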

In [706]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,...,HasBuff,HasLiver,HasLilac,HasTabby,HasTricolor,HasSmoke,HasBrindle,HasFawn,HasFlame,HasPoint
0,A671945,Hambone,1392247320,3,,1,2,365,Shetland Sheepdog Mix,Brown/White,...,0,0,0,0,0,0,0,0,0,0
1,A656520,Emily,1381682640,2,Suffering,0,3,365,Domestic Shorthair Mix,Cream Tabby,...,0,0,0,1,0,0,0,0,0,0
2,A686464,Pearce,1422725280,0,Foster,1,2,730,Pit Bull Mix,Blue/White,...,0,0,0,0,0,0,0,0,0,0
3,A683430,,1405120140,4,Partner,0,1,21,Domestic Shorthair Mix,Blue Cream,...,0,0,0,0,0,0,0,0,0,0
4,A667013,,1384537920,4,Partner,1,2,730,Lhasa Apso/Miniature Poodle,Tan,...,0,0,0,0,0,0,0,0,0,0


In [707]:
a = 1392247320

print(datetime.utcfromtimestamp(a).hour)
print(datetime.utcfromtimestamp(a).year)

23
2014
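
Note that the first training row was recorded at 18:22, yet the check above prints hour 23. This appears to come from mixing `time.mktime` (which interprets the time as local) with `utcfromtimestamp` (which decodes as UTC), so the extracted hours are shifted by the machine's UTC offset. A timezone-consistent round trip could look like the following sketch, where `to_utc_timestamp` is a hypothetical replacement for `convert_datetime`:

```python
import calendar
from datetime import datetime, timezone

def to_utc_timestamp(dt_string):
    """Parse 'YYYY-MM-DD HH:MM:SS' and treat the wall-clock time as UTC."""
    parsed = datetime.strptime(dt_string, '%Y-%m-%d %H:%M:%S')
    return calendar.timegm(parsed.timetuple())

ts = to_utc_timestamp('2014-02-12 18:22:00')

# Decoding with the same timezone recovers the original wall-clock hour.
recovered = datetime.fromtimestamp(ts, tz=timezone.utc)
print(ts, recovered.hour, recovered.year)
```

Because tree models only split on thresholds, a constant shift mostly relabels the hour feature, but the shifted values can also move late-evening events to the next day, month, or year.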


In [708]:
# convert the timestamp to year, month, day, day-of-week and hour variables
def get_year(timestamp):
    return datetime.utcfromtimestamp(timestamp).year

def get_month(timestamp):
    return datetime.utcfromtimestamp(timestamp).month

def get_day(timestamp):
    return datetime.utcfromtimestamp(timestamp).day

def get_day_of_week(timestamp):
    return datetime.utcfromtimestamp(timestamp).weekday()

def get_hour_of_day(timestamp):
    return datetime.utcfromtimestamp(timestamp).hour

train_data['Month'] = train_data.DateTime.apply(get_month)
train_data['Day'] = train_data.DateTime.apply(get_day)
train_data['DayOfWeek'] = train_data.DateTime.apply(get_day_of_week)
train_data['Hour'] = train_data.DateTime.apply(get_hour_of_day)
train_data['Year'] = train_data.DateTime.apply(get_year)

train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,...,HasSmoke,HasBrindle,HasFawn,HasFlame,HasPoint,Month,Day,DayOfWeek,Hour,Year
0,A671945,Hambone,1392247320,3,,1,2,365,Shetland Sheepdog Mix,Brown/White,...,0,0,0,0,0,2,12,2,23,2014
1,A656520,Emily,1381682640,2,Suffering,0,3,365,Domestic Shorthair Mix,Cream Tabby,...,0,0,0,0,0,10,13,6,16,2013
2,A686464,Pearce,1422725280,0,Foster,1,2,730,Pit Bull Mix,Blue/White,...,0,0,0,0,0,1,31,5,17,2015
3,A683430,,1405120140,4,Partner,0,1,21,Domestic Shorthair Mix,Blue Cream,...,0,0,0,0,0,7,11,4,23,2014
4,A667013,,1384537920,4,Partner,1,2,730,Lhasa Apso/Miniature Poodle,Tan,...,0,0,0,0,0,11,15,4,17,2013


In [709]:
train_data.columns

Index(['AnimalID', 'Name', 'DateTime', 'OutcomeType', 'OutcomeSubtype',
       'AnimalType', 'SexuponOutcome', 'AgeuponOutcome', 'Breed', 'Color',
       'IsPurebreed', 'HasName', 'LessThanMonthOld', '31To365', 'YearTo1000',
       '1000to4000', 'OlderThan4000', 'Female', 'Intact', 'IsMix', 'Cross',
       'Miniature', 'IsShihTzu', 'IsAggressive', 'HasBlack', 'HasWhite',
       'HasBrown', 'HasGray', 'HasYellow', 'HasRed', 'HasBlue', 'HasOrange',
       'HasCalico', 'HasChocolate', 'HasGold', 'HasTan', 'HasTortie',
       'HasCream', 'HasSilver', 'HasBuff', 'HasLiver', 'HasLilac', 'HasTabby',
       'HasTricolor', 'HasSmoke', 'HasBrindle', 'HasFawn', 'HasFlame',
       'HasPoint', 'Month', 'Day', 'DayOfWeek', 'Hour', 'Year'],
      dtype='object')

In [710]:
useful_columns = ['OutcomeType',\
                'DateTime', \
               'AnimalType', \
                  'AgeuponOutcome', \
               'IsPurebreed', \
                  'HasName', \
                  'LessThanMonthOld', \
                  '31To365', \
                  'YearTo1000',\
               '1000to4000', \
                  'OlderThan4000', \
                  'Female', \
                  'Intact', \
         #         'IsMix', \
         #         'Cross',\
         #      'Miniature', \
         #         'IsShihTzu', \
         #         'IsAggressive', \
         #         'HasBlack', \
         #         'HasWhite',\
         #      'HasBrown', \
         #         'HasGray', \
         #         'HasYellow', \
         #         'HasRed', \
         #         'HasBlue', \
         #         'HasOrange',\
         #      'HasCalico', \
         #         'HasChocolate', \
         #         'HasGold', \
         #         'HasTan', \
         #         'HasTortie',\
         #      'HasCream', \
         #         'HasSilver', \
         #         'HasBuff', \
         #         'HasLiver', \
         #         'HasLilac', \
         #         'HasTabby',\
         #      'HasTricolor', \
         #         'HasSmoke', \
         #         'HasBrindle', \
         #         'HasFawn', \
         #         'HasFlame',\
         #      'HasPoint', \
                  'Year',\
                  'Month', \
                  'Day', \
                  'DayOfWeek',\
                 'Hour']

#uc = useful_columns
#useful_columns = uc[:17] + uc[18:23] + uc[24:27] + uc[28:29]+uc[30:31]+uc[32:33]+uc[37:39]+uc[40:41]+uc[44:45]+uc[47:]
#useful_columns = uc[:35]+uc[37:39]+uc[40:42]+uc[43:45]+uc[47:]

train_data=train_data[useful_columns]

## Predict on test data

In [711]:
#[['AnimalType','SexuponOutcome','AgeuponOutcome','Breed']]
train_data = np.array(train_data)
label = train_data[:,0]
train_data = train_data[:,1:]
label

array([ 3.,  2.,  0., ...,  0.,  4.,  4.])

Import and process test data

In [712]:
original_test_data = test_data = pd.read_csv('../DATA/test.csv')
test_data.head()

Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,Summer,2015-10-12 12:15:00,Dog,Intact Female,10 months,Labrador Retriever Mix,Red/White
1,2,Cheyenne,2014-07-26 17:59:00,Dog,Spayed Female,2 years,German Shepherd/Siberian Husky,Black/Tan
2,3,Gus,2016-01-13 12:20:00,Cat,Neutered Male,1 year,Domestic Shorthair Mix,Brown Tabby
3,4,Pongo,2013-12-28 18:12:00,Dog,Intact Male,4 months,Collie Smooth Mix,Tricolor
4,5,Skooter,2015-09-24 17:59:00,Dog,Neutered Male,2 years,Miniature Poodle Mix,White


In [713]:
test_data.isnull().sum()

ID                   0
Name              3225
DateTime             0
AnimalType           0
SexuponOutcome       0
AgeuponOutcome       6
Breed                0
Color                0
dtype: int64

In [714]:
useful_columns = ['ID', \
                  'DateTime', \
               'AnimalType', \
                  'AgeuponOutcome', \
               'IsPurebreed', \
                  'HasName', \
                  'LessThanMonthOld', \
                  '31To365', \
                  'YearTo1000',\
               '1000to4000', \
                  'OlderThan4000', \
                  'Female', \
                  'Intact', \
         #         'IsMix', \
         #         'Cross',\
         #      'Miniature', \
         #         'IsShihTzu', \
         #         'IsAggressive', \
         #         'HasBlack', \
         #         'HasWhite',\
         #      'HasBrown', \
         #         'HasGray', \
         #         'HasYellow', \
         #         'HasRed', \
         #         'HasBlue', \
         #         'HasOrange',\
         #      'HasCalico', \
         #         'HasChocolate', \
         #         'HasGold', \
         #         'HasTan', \
         #         'HasTortie',\
         #      'HasCream', \
         #         'HasSilver', \
         #         'HasBuff', \
         #         'HasLiver', \
         #         'HasLilac', \
         #         'HasTabby',\
         #      'HasTricolor', \
         #         'HasSmoke', \
         #         'HasBrindle', \
         #         'HasFawn', \
         #         'HasFlame',\
         #      'HasPoint', \
                  'Year',\
                  'Month', \
                  'Day', \
                  'DayOfWeek',\
                 'Hour']

#uc = useful_columns
#useful_columns = uc[:17] + uc[18:23] + uc[24:27] + uc[28:29]+uc[30:31]+uc[32:33]+uc[37:39]+uc[40:41]+uc[44:45]+uc[47:]
#useful_columns = uc[:35]+uc[37:39]+uc[40:42]+uc[43:45]+uc[47:]

test_data.loc[test_data.AgeuponOutcome.isnull(),'AgeuponOutcome'] = '1 year' #median age
test_data.AgeuponOutcome = test_data.AgeuponOutcome.apply(convert_age)
test_data['IsPurebreed'] = test_data.Breed.apply(is_purebreed)
test_data['HasName'] = test_data.Name.apply(has_name)
test_data.DateTime = test_data.DateTime.apply(convert_datetime)
test_data['Year'] = test_data.DateTime.apply(get_year)
test_data['Month'] = test_data.DateTime.apply(get_month)
test_data['Day'] = test_data.DateTime.apply(get_day)
test_data['DayOfWeek'] = test_data.DateTime.apply(get_day_of_week)
test_data['Hour'] = test_data.DateTime.apply(get_hour_of_day)

test_data['LessThanMonthOld'] = test_data.AgeuponOutcome.apply(lambda x:in_age_range(x,0,30))
test_data['31To365'] = test_data.AgeuponOutcome.apply(lambda x:in_age_range(x,31,365))
test_data['YearTo1000'] = test_data.AgeuponOutcome.apply(lambda x:in_age_range(x,366,1000))
test_data['1000to4000'] = test_data.AgeuponOutcome.apply(lambda x:in_age_range(x,1001,4000))
test_data['OlderThan4000'] = test_data.AgeuponOutcome.apply(lambda x:in_age_range(x,4001,12000))

test_data['Female'] = test_data.SexuponOutcome.apply(male_or_female)
test_data['Intact'] = test_data.SexuponOutcome.apply(intact)

test_data['IsMix'] = test_data['Breed'].apply(lambda x:parse_breed(x, ["mix"]))
test_data['Cross'] = test_data['Breed'].apply(lambda x:parse_breed(x, ["/"]))
test_data['Miniature'] = test_data['Breed'].apply(lambda x:parse_breed(x, ["miniature"]))
test_data['IsShihTzu'] = test_data['Breed'].apply(lambda x:parse_breed(x, ["shih tzu"]))
test_data['IsAggressive'] = test_data['Breed'].apply(lambda x:parse_breed(x, ["rottweiler", "pit bull", "siberian husky"]))

for color in colors:
    test_data['Has'+color] = test_data['Color'].apply(lambda x: parse_color(x,color))

map_str_to_int(test_data,'AnimalType')
map_str_to_int(test_data,'SexuponOutcome')

test_data.head()


Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,IsPurebreed,HasName,...,HasBuff,HasLiver,HasLilac,HasTabby,HasTricolor,HasSmoke,HasBrindle,HasFawn,HasFlame,HasPoint
0,1,Summer,1444666500,1,0,300,Labrador Retriever Mix,Red/White,0,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Cheyenne,1406411940,1,3,730,German Shepherd/Siberian Husky,Black/Tan,0,1,...,0,0,0,0,0,0,0,0,0,0
2,3,Gus,1452705600,0,2,365,Domestic Shorthair Mix,Brown Tabby,0,1,...,0,0,0,1,0,0,0,0,0,0
3,4,Pongo,1388272320,1,1,120,Collie Smooth Mix,Tricolor,0,1,...,0,0,0,0,1,0,0,0,0,0
4,5,Skooter,1443131940,1,2,730,Miniature Poodle Mix,White,0,1,...,0,0,0,0,0,0,0,0,0,0


In [715]:
test_data = test_data[useful_columns]
test_data = np.array(test_data)
test_data = test_data[:,1:]  # drop the ID column

In [716]:
print(test_data.shape,train_data.shape)

(11456, 17) (26728, 17)


### XGBoost

##### Cross validation grid search

In [725]:
gbm = xgb.XGBClassifier()
grid_params = {'max_depth':list(range(3,15)), #depth of tree
             'learning_rate':[0,0.1,0.5,1],  #step shrinkage size (eta); 0 effectively disables learning
             'silent':[1],  #1 = suppress messages, 0 = print them
             'objective':['multi:softprob'],  #soft max for multiclass with prob of each class
             'n_estimators':[30,100,150], #number of trees
             #'num_class':[5], #number of classes
             #'eval_metric':['mlogloss'], #evaluate with logloss for multiclass
             'nthread':[4] #use 4 threads
             } 
cv = StratifiedKFold(label,n_folds=5,shuffle=True)
grid = GridSearchCV(gbm, grid_params,scoring='log_loss',cv=cv, verbose=1, refit=True)
grid.fit(train_data, label)

best_parameters, score, _ = max(grid.grid_scores_, key=lambda x: x[1])
print('Log Loss score:', score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
    


Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:  3.3min
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed: 30.2min
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed: 66.1min
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed: 111.3min finished


Log Loss score: -0.74762453996
learning_rate: 0.1
max_depth: 9
n_estimators: 100
nthread: 4
objective: 'multi:softprob'
silent: 1
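
The grid search above relies on `sklearn.grid_search`, `sklearn.cross_validation`, the `'log_loss'` scorer name, and `grid_scores_`, none of which exist in current scikit-learn releases. A rough equivalent with `sklearn.model_selection` is sketched below. To keep it runnable without xgboost, it uses synthetic data and a small random forest as stand-ins; both are assumptions of the example, and `xgb.XGBClassifier()` with the original grid could be substituted:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data; in the notebook this would be train_data and label.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Stand-in estimator so the sketch runs without xgboost installed.
model = RandomForestClassifier(n_estimators=20, random_state=0)
param_grid = {'max_depth': [3, 5]}

# The splitter no longer takes the labels; they are passed to fit().
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(model, param_grid, scoring='neg_log_loss', cv=cv, refit=True)
grid.fit(X, y)

# neg_log_loss is the negated log loss, so higher (closer to 0) is better.
print('Best CV log loss:', -grid.best_score_)
print('Best parameters:', grid.best_params_)
```

`best_score_` and `best_params_` replace the manual `max(grid.grid_scores_, ...)` lookup, and the negative sign matches the negative score printed above.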


##### Actual prediction

In [726]:
dtrain = xgb.DMatrix(train_data,label=label)
dtest = xgb.DMatrix(test_data)
param = {'max_depth':9, #depth of tree
         'eta':.1,  #step shrinkage size
         'silent':1,  #1 = suppress messages, 0 = print them
         'objective':'multi:softprob',  #soft max for multiclass with prob of each class
         'num_class':5, #number of classes
         'eval_metric':'mlogloss', #evaluate with logloss for multiclass
         'nthread':4 #use 4 threads
         } 
#consider playing with gamma in the above parameters

num_round = 100
watchlist  = [(dtrain,'train')]
bst = xgb.train(param, dtrain, num_round, watchlist)



[0]	train-mlogloss:1.490407
[1]	train-mlogloss:1.393340
[2]	train-mlogloss:1.312164
[3]	train-mlogloss:1.242590
[4]	train-mlogloss:1.182133
[5]	train-mlogloss:1.129309
[6]	train-mlogloss:1.082794
[7]	train-mlogloss:1.040827
[8]	train-mlogloss:1.002981
[9]	train-mlogloss:0.969260
[10]	train-mlogloss:0.938881
[11]	train-mlogloss:0.911583
[12]	train-mlogloss:0.887003
[13]	train-mlogloss:0.864424
[14]	train-mlogloss:0.843738
[15]	train-mlogloss:0.824329
[16]	train-mlogloss:0.806838
[17]	train-mlogloss:0.790893
[18]	train-mlogloss:0.776152
[19]	train-mlogloss:0.762455
[20]	train-mlogloss:0.749500
[21]	train-mlogloss:0.737658
[22]	train-mlogloss:0.726777
[23]	train-mlogloss:0.716612
[24]	train-mlogloss:0.707140
[25]	train-mlogloss:0.697619
[26]	train-mlogloss:0.688812
[27]	train-mlogloss:0.680432
[28]	train-mlogloss:0.672425
[29]	train-mlogloss:0.665079
[30]	train-mlogloss:0.657940
[31]	train-mlogloss:0.651092
[32]	train-mlogloss:0.644615
[33]	train-mlogloss:0.638760
[34]	train-mlogloss:0.63
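
The `train-mlogloss` values are multi-class log loss: the mean negative log of the probability assigned to the true class. A minimal NumPy sketch of that metric follows; `multiclass_log_loss` is a hypothetical helper written for illustration, not part of xgboost:

```python
import numpy as np

def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability of the true class (hypothetical helper)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)  # renormalise after clipping
    picked = probs[np.arange(len(labels)), labels]    # probability of the true class
    return float(-np.log(picked).mean())

labels = np.array([0, 4, 2])

# A uniform guess over 5 classes scores ln(5), regardless of the labels.
uniform = np.full((3, 5), 0.2)
uniform_loss = multiclass_log_loss(uniform, labels)

# Confident, correct predictions score close to 0.
confident = np.full((3, 5), 0.01)
confident[np.arange(3), labels] = 0.96
confident_loss = multiclass_log_loss(confident, labels)

print(uniform_loss, confident_loss)
```

The first boosting round above reports about 1.49, slightly below ln(5) ≈ 1.61, which is consistent with the model starting near a uniform guess. Since the watchlist contains only the training set, these values say nothing about generalisation.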

In [727]:
preds = bst.predict(dtest)
result = pd.DataFrame(preds,columns=class_headers, index=original_test_data.ID).reset_index()
result.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.017896,0.00147,0.031945,0.158744,0.789946
1,2,0.750767,0.000946,0.011322,0.187868,0.049097
2,3,0.364796,0.001243,0.002932,0.082237,0.548791
3,4,0.255249,0.002037,0.009468,0.072842,0.660403
4,5,0.393921,0.003838,0.005396,0.506308,0.090537


In [728]:
result.to_csv('./solution005.csv',index=False)

### Random Forest

##### RFC Cross validation

In [721]:
# Disabled cross-validation run:
# print(sum(cross_val_score(RandomForestClassifier(1000), \
#                           train_data, label, scoring='log_loss', cv=20, verbose=1))/20)

##### RFC prediction

In [722]:
rfc = RandomForestClassifier(1000)
rfc.fit(train_data,label)
preds = rfc.predict_proba(test_data)
result = pd.DataFrame(preds,columns=class_headers, index=original_test_data.ID).reset_index()
result.head()


Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.007,0.0,0.051,0.082,0.86
1,2,0.721,0.002,0.001,0.259,0.017
2,3,0.665,0.0,0.008,0.094,0.233
3,4,0.179,0.0,0.043,0.118,0.66
4,5,0.779,0.001,0.0,0.174,0.046


In [723]:
rfc.feature_importances_

array([ 0.1685871 ,  0.02783846,  0.11763537,  0.00977408,  0.04471705,
        0.01555646,  0.02019466,  0.00437529,  0.01017138,  0.00459591,
        0.02611971,  0.12101298,  0.02569305,  0.08015352,  0.12349312,
        0.07606448,  0.12401739])
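
The raw array is easier to read when paired with feature names. The sketch below assumes the columns are in the order of `useful_columns` with `OutcomeType` removed, which is how `train_data` was built above; the importance values are copied from the output:

```python
# Feature order assumed to follow useful_columns without 'OutcomeType'.
feature_names = [
    'DateTime', 'AnimalType', 'AgeuponOutcome', 'IsPurebreed', 'HasName',
    'LessThanMonthOld', '31To365', 'YearTo1000', '1000to4000', 'OlderThan4000',
    'Female', 'Intact', 'Year', 'Month', 'Day', 'DayOfWeek', 'Hour',
]

# Values copied from rfc.feature_importances_ above.
importances = [
    0.1685871, 0.02783846, 0.11763537, 0.00977408, 0.04471705,
    0.01555646, 0.02019466, 0.00437529, 0.01017138, 0.00459591,
    0.02611971, 0.12101298, 0.02569305, 0.08015352, 0.12349312,
    0.07606448, 0.12401739,
]

# Sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name:18s} {score:.4f}')
```

Under that assumption, the raw timestamp, the hour and day of the event, the intact flag, and the age carry most of the weight, while the age-bucket indicators contribute little beyond the numeric age.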

In [724]:
result.to_csv('./solution006.csv',index=False)

### Outcome

We improved significantly this time. The scores below are multi-class log loss (lower is better):

XGB - 0.73162

RFC - 0.79567
