# Shelter Animal Outcomes
- Mon 21 Mar 2016 - Sun 31 Jul 2016

Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we're asking Kagglers to predict the outcome for each animal.

https://www.kaggle.com/c/shelter-animal-outcomes

In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd

In [2]:
datapath = "F:/ShelterAnimalOutcomes/"
train_file = pd.read_csv(datapath+"train.csv")
test_file = pd.read_csv(datapath+"test.csv")

### Some Feature Engineering

In [3]:
#Removing Names and Subtypes of Outcome, for now
train_file.drop(["Name", "OutcomeSubtype"], axis=1, inplace=True)
test_file.drop(["Name"], axis=1, inplace=True)

In [4]:
#Converting Dates to categorical Year, Month and Day of the Week

from datetime import datetime
def convert_date(dt):
    d = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    return d.year, d.month, d.isoweekday()

train_file["Year"], train_file["Month"], train_file["WeekDay"] = zip(*train_file["DateTime"].map(convert_date))
test_file["Year"], test_file["Month"], test_file["WeekDay"] = zip(*test_file["DateTime"].map(convert_date))

train_file.drop(["DateTime"], axis=1, inplace=True)
test_file.drop(["DateTime"], axis=1, inplace=True)
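
The same three features can also be derived without a per-row `strptime` call. A minimal sketch, assuming the `DateTime` strings all follow the `%Y-%m-%d %H:%M:%S` format used above; the toy frame `df` and its two timestamps are hypothetical stand-ins for `train_file` / `test_file`. Note that `dt.dayofweek` is zero-based (Monday = 0), so 1 is added to match `isoweekday()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_file / test_file
df = pd.DataFrame({"DateTime": ["2014-02-12 18:22:00", "2015-10-12 12:15:00"]})

# Parse the whole column at once instead of calling strptime per row
parsed = pd.to_datetime(df["DateTime"], format="%Y-%m-%d %H:%M:%S")

df["Year"] = parsed.dt.year
df["Month"] = parsed.dt.month
# dayofweek runs 0 (Monday) to 6 (Sunday); isoweekday() runs 1 to 7
df["WeekDay"] = parsed.dt.dayofweek + 1

df = df.drop(columns=["DateTime"])
print(df)
```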

In [5]:
#Separating IDs
train_id = train_file[["AnimalID"]]
test_id = test_file[["ID"]]
train_file.drop(["AnimalID"], axis=1, inplace=True)
test_file.drop(["ID"], axis=1, inplace=True)

In [6]:
#Target variable
train_outcome = train_file["OutcomeType"]
train_file.drop(["OutcomeType"], axis=1, inplace=True)

In [7]:
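#Converting Age to weeks
def age_to_weeks(age1):
    #Missing ages get a default of 25 weeks
    if pd.isnull(age1):
        return 25.0
    parts = age1.split()
    #Ages recorded as 0 (e.g. "0 years") get a default of 10 weeks
    if parts[0] == '0':
        return 10.0
    number = float(parts[0])
    #Strip the plural "s" so "1 week" and "2 weeks" are treated alike
    unit = parts[1].rstrip("s")
    if unit == "day":
        return number / 7
    elif unit == "week":
        return number
    elif unit == "month":
        return number * 4
    else:
        #Years
        return number * 52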
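
A parser keyed on the unit word is only as good as the list of units it anticipates, so it is worth listing the units that actually occur before mapping. A minimal sketch on hypothetical toy values; on the real data the same two lines would run on `train_file["AgeuponOutcome"]` before the conversion is applied.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw age strings, including a missing value
ages = pd.Series(["1 year", "2 years", "3 weeks", "1 month", "5 days", "0 years", np.nan])

# Second token of each string is the unit; missing ages stay NaN
units = ages.str.split().str[1]

# Count every distinct unit, singular and plural forms separately
print(units.value_counts(dropna=False))
```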

In [8]:
train_file["AgeuponOutcome"] = train_file["AgeuponOutcome"].map(age_to_weeks)
test_file["AgeuponOutcome"] = test_file["AgeuponOutcome"].map(age_to_weeks)

In [9]:
#Checking that train and test sets are similar
print(train_file.head(1))
print(test_file.head(1))

  AnimalType SexuponOutcome  AgeuponOutcome                  Breed  \
0        Dog  Neutered Male              52  Shetland Sheepdog Mix   

         Color  Year  Month  WeekDay  
0  Brown/White  2014      2        3  
  AnimalType SexuponOutcome  AgeuponOutcome                   Breed  \
0        Dog  Intact Female              40  Labrador Retriever Mix   

       Color  Year  Month  WeekDay  
0  Red/White  2015     10        1  


### One-hot (binary) encoding of the categorical variables
For the encoding to be consistent, every category must map to the same column in both sets.
To guarantee this, we concatenate the training and test sets, encode them together, and then split them again.

In [10]:
categorical_variables = ['AnimalType', 'SexuponOutcome', 'Breed', 'Color', 'Year', 'Month', 'WeekDay']

In [11]:
#Mark the training set
train_file["Train"] = 1
test_file["Train"] = 0

#Concatenate the sets
conjunto = pd.concat([train_file, test_file])

In [12]:
#Get the encoded set
conjunto_encoded = pd.get_dummies(conjunto, columns=categorical_variables)

In [13]:
#Separate the sets
train = conjunto_encoded[conjunto_encoded["Train"] == 1]
test = conjunto_encoded[conjunto_encoded["Train"] == 0]
train = train.drop(["Train"], axis=1)
test = test.drop(["Train"], axis=1)
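
Why the concatenation matters can be seen on a toy example. This is a sketch with hypothetical frames `toy_train` and `toy_test`, not the competition data: encoded separately, the two sets end up with different columns; encoded together with the same `Train` marker trick as above, they share one layout.

```python
import pandas as pd

# Hypothetical toy sets: "White" only in train, "Red" only in test
toy_train = pd.DataFrame({"Color": ["Black", "White"]})
toy_test = pd.DataFrame({"Color": ["Black", "Red"]})

# Encoding each set on its own gives mismatched columns
sep_train = pd.get_dummies(toy_train, columns=["Color"])
sep_test = pd.get_dummies(toy_test, columns=["Color"])
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding the concatenation gives both sets the same columns
toy_train["Train"] = 1
toy_test["Train"] = 0
both = pd.get_dummies(pd.concat([toy_train, toy_test]), columns=["Color"])
enc_train = both[both["Train"] == 1].drop(["Train"], axis=1)
enc_test = both[both["Train"] == 0].drop(["Train"], axis=1)
print(list(enc_train.columns) == list(enc_test.columns))
```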

### Parameter Search

In [14]:
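#GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV

In [20]:
from sklearn.ensemble import RandomForestClassifier
params = {"n_estimators": [500, 1000, 1500, 2000]}
est = RandomForestClassifier()

In [21]:
#The log loss scorer is named "neg_log_loss"; scores are negated so that higher is better
model = GridSearchCV(estimator=est, param_grid=params, scoring="neg_log_loss", n_jobs=4)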

In [None]:
model.fit(train, train_outcome)

In [19]:
print("The best parameters are %s with a score of %0.2f" % (model.best_params_, model.best_score_))

The best parameters are {'n_estimators': 1500} with a score of -1.38
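
The score is negative because scikit-learn negates loss metrics so that a higher score is always better; -1.38 corresponds to a log loss of 1.38. For a sense of scale, a model that predicts 0.2 for each of the five outcome classes scores ln 5, about 1.61. The sketch below checks that baseline with plain NumPy; the helper `multiclass_log_loss` and the labels in `true_idx` are hypothetical, written only for this illustration.

```python
import numpy as np

def multiclass_log_loss(true_idx, proba):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(proba, 1e-15, 1.0)  # avoid log(0)
    rows = np.arange(len(true_idx))
    return -np.mean(np.log(proba[rows, true_idx]))

# Hypothetical true labels, given as class indices 0..4
true_idx = np.array([0, 4, 3, 0, 2])

# Uniform prediction: probability 0.2 for each of the five classes
uniform = np.full((len(true_idx), 5), 0.2)

baseline = multiclass_log_loss(true_idx, uniform)
print("uniform baseline: %.4f" % baseline)
```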


### Obtaining the submission

In [None]:
#Getting predicted probabilities
y_pred = model.predict_proba(test)

In [None]:
results = pd.read_csv(datapath+"sample_submission.csv")

In [None]:
results['Adoption'], results['Died'], results['Euthanasia'], results['Return_to_owner'], results['Transfer'] = y_pred[:,0], y_pred[:,1], y_pred[:,2], y_pred[:,3], y_pred[:,4]
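
The assignment above relies on the columns of `predict_proba` being in alphabetical class order. scikit-learn classifiers expose that order as `classes_`, so the columns can be labelled from it instead of by position. A minimal sketch with stand-in arrays: in the notebook `classes` would be `model.classes_` and `y_pred` would be `model.predict_proba(test)`; the probabilities and IDs here are made up.

```python
import numpy as np
import pandas as pd

# Stand-ins for model.classes_ and model.predict_proba(test)
classes = np.array(["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"])
y_pred = np.array([[0.5, 0.0, 0.1, 0.2, 0.2],
                   [0.1, 0.1, 0.1, 0.3, 0.4]])

# Label each probability column with the class it belongs to
proba = pd.DataFrame(y_pred, columns=classes)

# Hypothetical IDs; the real ones come from the test set
results = pd.concat([pd.DataFrame({"ID": [1, 2]}), proba], axis=1)
print(results)
```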

In [None]:
#Submission file (the name says "adaboost", but the predictions come from the random forest fitted above)
results.to_csv(datapath+"adaboost_submission.csv", index=False)