# Shelter Animal Outcomes 1

About this attempt:

- Uses a random forest
- Predicts from AnimalType, SexuponOutcome, and AgeuponOutcome only
- Excludes training rows with missing values
- Fills missing ages in the test data with the median age of the training data


For more information see [this link](https://www.kaggle.com/c/shelter-animal-outcomes).

In [87]:
from collections import defaultdict

import pandas as pd
import numpy as np

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
                            accuracy_score,
                            precision_score,
                            recall_score,
                            f1_score,
                            roc_curve,
                            roc_auc_score,
                            confusion_matrix,
                            classification_report
                            )
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

import warnings
warnings.filterwarnings('ignore')


## Preprocessing

In [36]:
train_data = pd.read_csv('../DATA/train.csv')

In [37]:
train_data.head()

Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


- Color is of questionable value
- OutcomeSubtype is not useful as a feature (it is absent from the test data)
- DateTime is uncertain
- Name seems too difficult to work with for now
- Breed also seems difficult to work with and may be removed later

In [38]:
useful_columns = ['OutcomeType', 'AnimalType',
                  'SexuponOutcome', 'AgeuponOutcome', 'Breed']
train_data = train_data[useful_columns]

In [39]:
train_data.isnull().sum()

OutcomeType        0
AnimalType         0
SexuponOutcome     1
AgeuponOutcome    18
Breed              0
dtype: int64

In [40]:
train_data = train_data[train_data.AgeuponOutcome.notnull()]
train_data = train_data[train_data.SexuponOutcome.notnull()]

In [41]:
train_data.isnull().sum()

OutcomeType       0
AnimalType        0
SexuponOutcome    0
AgeuponOutcome    0
Breed             0
dtype: int64

In [42]:
train_data.head()

Unnamed: 0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,Dog,Neutered Male,1 year,Shetland Sheepdog Mix
1,Euthanasia,Cat,Spayed Female,1 year,Domestic Shorthair Mix
2,Adoption,Dog,Neutered Male,2 years,Pit Bull Mix
3,Transfer,Cat,Intact Male,3 weeks,Domestic Shorthair Mix
4,Transfer,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle


In [43]:
train_data.groupby('Breed').count()

Unnamed: 0_level_0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome
Breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Abyssinian Mix,2,2,2,2
Affenpinscher Mix,6,6,6,6
Afghan Hound Mix,1,1,1,1
Airedale Terrier,1,1,1,1
Airedale Terrier Mix,5,5,5,5
Airedale Terrier/Labrador Retriever,1,1,1,1
Airedale Terrier/Miniature Schnauzer,1,1,1,1
Akita,3,3,3,3
Akita Mix,11,11,11,11
Akita/Australian Cattle Dog,1,1,1,1


As noted above, the breed data may need reworking. For now, let's convert the ages to a common unit (days):

In [48]:
train_data.groupby('AgeuponOutcome').count()

Unnamed: 0_level_0,OutcomeType,AnimalType,SexuponOutcome,Breed
AgeuponOutcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0 years,22,22,22,22
1 day,66,66,66,66
1 month,1281,1281,1281,1281
1 week,146,146,146,146
1 weeks,171,171,171,171
1 year,3969,3969,3969,3969
10 months,457,457,457,457
10 years,446,446,446,446
11 months,166,166,166,166
11 years,126,126,126,126


Many a puppy on this list

In [49]:
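def convert_date(age):
    """Convert an age string such as '3 weeks' to an approximate number of days."""
    count, unit = age.split()
    count = int(count)
    if unit in ('day', 'days'):
        return count
    elif unit in ('week', 'weeks'):
        return count * 7
    elif unit in ('month', 'months'):
        return count * 30   # months approximated as 30 days
    elif unit in ('year', 'years'):
        return count * 365  # leap years ignored
    # Fail loudly instead of silently returning None for an unexpected unit
    raise ValueError('Unrecognised age unit: %r' % unit)

train_data.AgeuponOutcome = train_data.AgeuponOutcome.apply(convert_date)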

In [50]:
train_data.groupby('AgeuponOutcome').count()

Unnamed: 0_level_0,OutcomeType,AnimalType,SexuponOutcome,Breed
AgeuponOutcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,22,22,22,22
1,66,66,66,66
2,99,99,99,99
3,109,109,109,109
4,50,50,50,50
5,24,24,24,24
6,50,50,50,50
7,317,317,317,317
14,529,529,529,529
21,659,659,659,659


Note that a sizeable number of animals (1,089) have an Unknown sex:

In [54]:
train_data.groupby('SexuponOutcome').count()

Unnamed: 0_level_0,OutcomeType,AnimalType,AgeuponOutcome,Breed
SexuponOutcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Intact Female,3504,3504,3504,3504
Intact Male,3519,3519,3519,3519
Neutered Male,9779,9779,9779,9779
Spayed Female,8819,8819,8819,8819
Unknown,1089,1089,1089,1089


The classes are also heavily imbalanced: Adoption and Transfer dominate, while Died has only 197 rows.

In [52]:
train_data.groupby('OutcomeType').count()

Unnamed: 0_level_0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
OutcomeType,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adoption,10769,10769,10769,10769
Died,197,197,197,197
Euthanasia,1553,1553,1553,1553
Return_to_owner,4785,4785,4785,4785
Transfer,9406,9406,9406,9406
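
To put the imbalance in numbers, the counts above can be turned into class shares. The share of the largest class is also the accuracy that a model always predicting that class would reach, which gives a floor for the comparisons later on. A small sketch using only the counts from the table:

```python
# Class counts copied from the OutcomeType table above.
counts = {
    "Adoption": 10769,
    "Died": 197,
    "Euthanasia": 1553,
    "Return_to_owner": 4785,
    "Transfer": 9406,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Always predicting the most frequent class gives this accuracy.
majority_class = max(counts, key=counts.get)
baseline_accuracy = shares[majority_class]

for name, share in shares.items():
    print(f"{name:16s}{share:.3f}")
print("majority baseline:", majority_class, round(baseline_accuracy, 3))
```

The baseline comes out at roughly 0.40, so any model worth keeping should score clearly above that.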


The string columns need to be converted to numeric codes:

In [61]:
# This function operates in place: it replaces each string in df[col_name]
# with the index of that string among the column's sorted unique values.
def map_str_to_int(df, col_name):
    map_dict = {name: i for i, name in enumerate(np.unique(df[col_name]))}
    df[col_name] = df[col_name].map(map_dict).astype(int)

In [62]:
map_str_to_int(train_data,'Breed')
map_str_to_int(train_data,'AnimalType')
map_str_to_int(train_data,'SexuponOutcome')

In [63]:
train_data.head()

Unnamed: 0,OutcomeType,AnimalType,SexuponOutcome,AgeuponOutcome,Breed
0,Return_to_owner,1,2,365,1221
1,Euthanasia,0,3,365,640
2,Adoption,1,2,730,1066
3,Transfer,0,1,21,640
4,Transfer,1,2,730,914
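
Integer codes impose an arbitrary order on the categories (for example `Intact Female` < `Intact Male` < `Neutered Male`), which distance-based models such as k-nearest neighbours treat as meaningful. One possible alternative, not used in this notebook, is one-hot encoding with `pandas.get_dummies`. The small frame below is made-up data for illustration:

```python
import pandas as pd

# Toy rows for illustration only; not taken from the competition files.
toy = pd.DataFrame({
    "AnimalType": ["Dog", "Cat", "Dog"],
    "SexuponOutcome": ["Neutered Male", "Spayed Female", "Unknown"],
    "AgeuponOutcome": [365, 21, 730],
})

# One indicator column per category; the numeric column passes through unchanged.
encoded = pd.get_dummies(toy, columns=["AnimalType", "SexuponOutcome"])
print(encoded.columns.tolist())
```

This would be impractical for Breed as it stands, since that column has well over a thousand distinct values.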


## Compare models

In [64]:
X = train_data[['AnimalType','SexuponOutcome','AgeuponOutcome','Breed']]
y = train_data['OutcomeType']

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4444)

In [88]:
def compare_models(X,y):
    model_list = [\
                  KNeighborsClassifier(n_neighbors=1), \
                  KNeighborsClassifier(n_neighbors=2), \
                  KNeighborsClassifier(n_neighbors=3), \
                  KNeighborsClassifier(n_neighbors=4), \
                  KNeighborsClassifier(n_neighbors=5), \
                  KNeighborsClassifier(n_neighbors=6), \
                  KNeighborsClassifier(n_neighbors=7), \
                  SVC(gamma=1, C=10, kernel='rbf'), \
                  SVC(),\
                  BernoulliNB(),\
                  RandomForestClassifier(n_estimators=100), \
                  DecisionTreeClassifier() \
                  ]

    index_func = [\
                  'KNeighborsClassifier(n_neighbors=1)', \
                  'KNeighborsClassifier(n_neighbors=2)', \
                  'KNeighborsClassifier(n_neighbors=3)', \
                  'KNeighborsClassifier(n_neighbors=4)', \
                  'KNeighborsClassifier(n_neighbors=5)', \
                  'KNeighborsClassifier(n_neighbors=6)', \
                  'KNeighborsClassifier(n_neighbors=7)', \
                  'SVC(gamma=1, C=10, kernel=\'rbf\')', \
                  'SVC()',\
                  'BernoulliNB()',\
                  'RandomForestClassifier(n_estimators=100)', \
                  'DecisionTreeClassifier()' \
                  ]
    
    scores_arr = [[],[],[],[]]
    scorers = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    for model in model_list:
        for i,score in enumerate(scorers):
            s = np.mean(cross_val_score(model, X,y, scoring=score, cv=5))
            scores_arr[i].append(s)
            
    score_dict = defaultdict(list)
    for i, score in enumerate(scorers):
        score_dict[score] = scores_arr[i]
        
    df = pd.DataFrame(score_dict, index=index_func)
    return df

In [89]:
compare_models(X,y)

Unnamed: 0,accuracy,f1_weighted,precision_weighted,recall_weighted
KNeighborsClassifier(n_neighbors=1),0.522614,0.521064,0.521724,0.522614
KNeighborsClassifier(n_neighbors=2),0.538114,0.529641,0.566539,0.538114
KNeighborsClassifier(n_neighbors=3),0.565593,0.553174,0.555047,0.565593
KNeighborsClassifier(n_neighbors=4),0.575704,0.561243,0.563467,0.575704
KNeighborsClassifier(n_neighbors=5),0.571921,0.557952,0.55932,0.571921
KNeighborsClassifier(n_neighbors=6),0.577537,0.560344,0.560878,0.577537
KNeighborsClassifier(n_neighbors=7),0.577687,0.561545,0.562111,0.577687
"SVC(gamma=1, C=10, kernel='rbf')",0.598577,0.573149,0.577512,0.598577
SVC(),0.599663,0.574594,0.580199,0.599663
BernoulliNB(),0.491315,0.390195,0.439509,0.491315
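
`compare_models` calls `cross_val_score` once per metric, so every model is refit for each of the four scorers. A possible alternative, assuming scikit-learn 0.19 or later, is `cross_validate`, which accepts several scorers and fits each fold only once. The sketch below uses synthetic data from `make_classification` as a stand-in, not the shelter data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y; not the shelter data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0)

models = {
    "KNeighborsClassifier(n_neighbors=5)": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier()": DecisionTreeClassifier(random_state=0),
}
scorers = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]

rows = {}
for name, model in models.items():
    # One call evaluates all scorers on the same five folds.
    cv = cross_validate(model, X_demo, y_demo, scoring=scorers, cv=5)
    rows[name] = {s: cv["test_" + s].mean() for s in scorers}

results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```

Weighted recall is equal to accuracy by definition, which is why those two columns coincide in the table above.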


In [76]:
def compare_models2(X,y):
    model_list = [\
                  SVC(gamma=1, C=10, kernel='rbf'), \
                  SVC(),\
                  BernoulliNB(),\
                  RandomForestClassifier(n_estimators=100), \
                  DecisionTreeClassifier() \
                  ]

    index_func = [\
                  'SVC(gamma=1, C=10, kernel=\'rbf\')', \
                  'SVC()',\
                  'BernoulliNB()',\
                  'RandomForestClassifier(n_estimators=100)', \
                  'DecisionTreeClassifier()' \
                  ]
    
    scores_arr = [[],[],[],[]]
    scorers = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    for model in model_list:
        for i,score in enumerate(scorers):
            s = np.mean(cross_val_score(model, X,y, scoring=score, cv=5))
            scores_arr[i].append(s)
            
    score_dict = defaultdict(list)
    for i, score in enumerate(scorers):
        score_dict[score] = scores_arr[i]
        
    df = pd.DataFrame(score_dict, index=index_func)
    return df

In [90]:
X2 = train_data[['AnimalType','SexuponOutcome','AgeuponOutcome']]
y2 = train_data['OutcomeType']

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=4444)

In [91]:
compare_models2(X2,y2)

Unnamed: 0,accuracy,f1_weighted,precision_weighted,recall_weighted
"SVC(gamma=1, C=10, kernel='rbf')",0.633695,0.614349,0.620342,0.633695
SVC(),0.632871,0.612747,0.61913,0.632871
BernoulliNB(),0.491315,0.390195,0.439509,0.491315
RandomForestClassifier(n_estimators=100),0.633808,0.615266,0.62214,0.634482
DecisionTreeClassifier(),0.633995,0.616115,0.622947,0.633995


Without Breed, SVC accuracy rises from about 0.60 to about 0.63, so Breed is left out of the model for now.

## Predicting on test data

Let's import the test data and apply the same transformations.

In [144]:
test_data = pd.read_csv('../DATA/test.csv')
test_data.head()

Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,Summer,2015-10-12 12:15:00,Dog,Intact Female,10 months,Labrador Retriever Mix,Red/White
1,2,Cheyenne,2014-07-26 17:59:00,Dog,Spayed Female,2 years,German Shepherd/Siberian Husky,Black/Tan
2,3,Gus,2016-01-13 12:20:00,Cat,Neutered Male,1 year,Domestic Shorthair Mix,Brown Tabby
3,4,Pongo,2013-12-28 18:12:00,Dog,Intact Male,4 months,Collie Smooth Mix,Tricolor
4,5,Skooter,2015-09-24 17:59:00,Dog,Neutered Male,2 years,Miniature Poodle Mix,White


In [145]:
test_data.isnull().sum()

ID                   0
Name              3225
DateTime             0
AnimalType           0
SexuponOutcome       0
AgeuponOutcome       6
Breed                0
Color                0
dtype: int64

In [146]:
train_data.AgeuponOutcome.median()

365.0

The median age in the training data is 365 days (1 year), so test animals with a missing age are assigned '1 year'.

In [147]:
useful_columns = ['AnimalType','SexuponOutcome','AgeuponOutcome']

# Copy so that the assignments below modify an independent frame, not a view of test_data
test_data2 = test_data[useful_columns].copy()

test_data2.loc[test_data.AgeuponOutcome.isnull(),'AgeuponOutcome'] = '1 year'
test_data2.AgeuponOutcome = test_data2.AgeuponOutcome.apply(convert_date)
test_data2.isnull().sum()



AnimalType        0
SexuponOutcome    0
AgeuponOutcome    0
dtype: int64
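
The same imputation can also be done after the ages are numeric, by filling gaps with the training median directly instead of hard-coding `'1 year'`. A sketch on toy values, assuming the ages have already been converted to days; the series names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy ages in days; NaN marks a missing age in the test set.
train_age_days = pd.Series([21, 365, 365, 730, 3650])
test_age_days = pd.Series([300, np.nan, 120])

# The statistic comes from the training data only, then is applied to the test data.
train_median = train_age_days.median()
test_filled = test_age_days.fillna(train_median)
print(test_filled.tolist())
```

Computing the fill value from the training data keeps the test set from influencing the preprocessing.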

In [148]:
test_data2.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome
0,Dog,Intact Female,300
1,Dog,Spayed Female,730
2,Cat,Neutered Male,365
3,Dog,Intact Male,120
4,Dog,Neutered Male,730


In [149]:
#map_str_to_int(test_data,'Breed')
map_str_to_int(test_data2,'AnimalType')
map_str_to_int(test_data2,'SexuponOutcome')
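
Calling `map_str_to_int` separately on the test frame only reproduces the training codes if both frames contain exactly the same set of categories. A more defensive sketch fits the mapping once on the training column and reuses it on the test column; `fit_mapping` and `apply_mapping` are hypothetical helpers, and the frames are toy data:

```python
import numpy as np
import pandas as pd

def fit_mapping(series):
    """Map each distinct value to its position in sorted order."""
    return {name: i for i, name in enumerate(np.unique(series))}

def apply_mapping(series, mapping):
    """Encode with an existing mapping; fail loudly on unseen categories."""
    unseen = set(series) - set(mapping)
    if unseen:
        raise ValueError(f"categories not seen in training: {sorted(unseen)}")
    return series.map(mapping).astype(int)

# Toy frames: the test column is missing one of the training categories.
train_toy = pd.DataFrame(
    {"SexuponOutcome": ["Unknown", "Intact Male", "Spayed Female"]})
test_toy = pd.DataFrame({"SexuponOutcome": ["Spayed Female", "Unknown"]})

mapping = fit_mapping(train_toy["SexuponOutcome"])
test_codes = apply_mapping(test_toy["SexuponOutcome"], mapping)
print(mapping)
print(test_codes.tolist())
```

In the toy example, encoding the test column on its own would assign `Spayed Female` the code 0 rather than the code it received in training, so the model would see mislabelled inputs.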

In [150]:
a = RandomForestClassifier()
a.fit(X2,y2)
y_pred = a.predict(test_data2)
a.predict_proba(test_data2)

array([[ 0.        ,  0.        ,  0.        ,  0.13838578,  0.86161422],
       [ 0.54621082,  0.00190222,  0.02653727,  0.25617599,  0.16917369],
       [ 0.39066498,  0.        ,  0.02474976,  0.09129258,  0.49329267],
       ..., 
       [ 0.00249257,  0.00761462,  0.09366181,  0.00890271,  0.8873283 ],
       [ 0.2993767 ,  0.        ,  0.06043234,  0.44668282,  0.19350814],
       [ 0.02534453,  0.        ,  0.20169133,  0.49282503,  0.28013911]])

In [156]:
headers = a.classes_

In [153]:
prediction_data = pd.DataFrame(test_data.ID)

In [159]:
prediction_data2 = pd.DataFrame(a.predict_proba(test_data2), columns=headers)

In [163]:
prediction_data = prediction_data.join(prediction_data2)
prediction_data.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.0,0.0,0.0,0.138386,0.861614
1,2,0.546211,0.001902,0.026537,0.256176,0.169174
2,3,0.390665,0.0,0.02475,0.091293,0.493293
3,4,0.014781,0.025144,0.075033,0.070751,0.814291
4,5,0.439183,0.00153,0.037597,0.320959,0.200731


In [164]:
prediction_data.to_csv('./solution001.csv',index=False)

### Outcome

Rank at submission: 222/318

Score: 1.07691
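
The competition is scored with multi-class logarithmic loss (lower is better), which is why the submission contains class probabilities rather than labels. A minimal NumPy sketch of the metric follows. It clips probabilities because a predicted probability of exactly 0 for the true class would otherwise give an infinite loss; the function name, the `eps` value, and the toy rows are assumptions for illustration:

```python
import numpy as np

def multiclass_log_loss(y_true_idx, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalise each row
    rows = np.arange(len(y_true_idx))
    return -np.mean(np.log(proba[rows, y_true_idx]))

# Toy example with 5 classes where the true class is index 0.
uniform = np.full((1, 5), 0.2)
confident_wrong = np.array([[0.0, 0.0, 0.0, 0.0, 1.0]])

loss_uniform = multiclass_log_loss([0], uniform)
loss_confident_wrong = multiclass_log_loss([0], confident_wrong)
print(loss_uniform)
print(loss_confident_wrong)
```

A uniform guess over five classes scores ln 5, about 1.609, so 1.07691 is better than guessing uniformly. The exact zeros in the random forest's `predict_proba` output are risky under this metric, because each one that lands on the true class incurs a very large penalty.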