# MACHINE LEARNING OF TRAINING DATA

Data source: https://www.kaggle.com/c/shelter-animal-outcomes

The objective of this investigation is to predict the outcomes experienced by shelter animals (Adoption, Died, Euthanasia, Return_to_owner, Transfer) from information such as animal type, sex and age. I have performed a separate exploratory analysis of the data to identify which features are likely to be most predictive.

I will begin by pre-processing a training dataset provided by Kaggle, carrying out the necessary steps to get that dataset ready for input into a Decision Tree Classifier. 

Once my classifier has been trained, I will import and preprocess the testing dataset provided by Kaggle, and use the classifier to predict outcomes for the animals in that dataset.

# PROCESSING OF TRAINING DATA

In [2]:
#Import CSV file as a pandas dataframe.

import numpy as np
import pandas as pd

animals = pd.read_csv("train.csv")

print animals.head()
print animals.count()

  AnimalID     Name             DateTime      OutcomeType OutcomeSubtype  \
0  A671945  Hambone  2014-02-12 18:22:00  Return_to_owner            NaN   
1  A656520    Emily  2013-10-13 12:44:00       Euthanasia      Suffering   
2  A686464   Pearce  2015-01-31 12:28:00         Adoption         Foster   
3  A683430      NaN  2014-07-11 19:09:00         Transfer        Partner   
4  A667013      NaN  2013-11-15 12:52:00         Transfer        Partner   

  AnimalType SexuponOutcome AgeuponOutcome                        Breed  \
0        Dog  Neutered Male         1 year        Shetland Sheepdog Mix   
1        Cat  Spayed Female         1 year       Domestic Shorthair Mix   
2        Dog  Neutered Male        2 years                 Pit Bull Mix   
3        Cat    Intact Male        3 weeks       Domestic Shorthair Mix   
4        Dog  Neutered Male        2 years  Lhasa Apso/Miniature Poodle   

         Color  
0  Brown/White  
1  Cream Tabby  
2   Blue/White  
3   Blue Cream  
4      

In [3]:
#Drop columns not being considered in current investigation
animals_clean = animals.drop("AnimalID", axis=1)
animals_clean = animals_clean.drop("Name", axis=1)
animals_clean = animals_clean.drop("OutcomeSubtype", axis=1)
animals_clean = animals_clean.drop("Color", axis=1)
animals_clean = animals_clean.drop("Breed", axis=1)

print animals_clean.head()

              DateTime      OutcomeType AnimalType SexuponOutcome  \
0  2014-02-12 18:22:00  Return_to_owner        Dog  Neutered Male   
1  2013-10-13 12:44:00       Euthanasia        Cat  Spayed Female   
2  2015-01-31 12:28:00         Adoption        Dog  Neutered Male   
3  2014-07-11 19:09:00         Transfer        Cat    Intact Male   
4  2013-11-15 12:52:00         Transfer        Dog  Neutered Male   

  AgeuponOutcome  
0         1 year  
1         1 year  
2        2 years  
3        3 weeks  
4        2 years  
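
The five separate `drop` calls in the cell above can also be written as a single call that takes a list of column names. The sketch below runs on a made-up two-row frame (the values are invented for illustration; only the column names come from train.csv), and avoids `print` so that it runs under either Python 2 or Python 3.

```python
import pandas as pd

# Toy frame with the same column names as train.csv (values are made up).
toy = pd.DataFrame({
    "AnimalID": ["A1", "A2"],
    "Name": ["Rex", None],
    "DateTime": ["2014-02-12 18:22:00", "2013-10-13 12:44:00"],
    "OutcomeType": ["Adoption", "Transfer"],
    "OutcomeSubtype": [None, "Partner"],
    "AnimalType": ["Dog", "Cat"],
    "SexuponOutcome": ["Neutered Male", "Spayed Female"],
    "AgeuponOutcome": ["1 year", "3 weeks"],
    "Breed": ["Pit Bull Mix", "Domestic Shorthair Mix"],
    "Color": ["Blue/White", "Cream Tabby"],
})

# Columns not considered in this investigation, dropped in one call.
unused_columns = ["AnimalID", "Name", "OutcomeSubtype", "Color", "Breed"]
toy_clean = toy.drop(unused_columns, axis=1)

remaining = sorted(toy_clean.columns)
```

Keeping the unused column names in one list also makes it easy to reuse the same list when cleaning another dataframe.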


In [4]:
#Remove NaNs from all remaining columns
def nan_remover(dataframe):
    list_of_columns = dataframe.columns.tolist()
    for column in list_of_columns:
        dataframe = dataframe[dataframe[column].notnull()]
    return dataframe

animals_clean = nan_remover(animals_clean)

print animals_clean.count()

DateTime          26710
OutcomeType       26710
AnimalType        26710
SexuponOutcome    26710
AgeuponOutcome    26710
dtype: int64
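
As a cross-check on `nan_remover`, the sketch below compares it with pandas' built-in `dropna()` on a small invented frame. It also shows that the surviving rows keep their original index labels, which matters later when a freshly indexed Series is built from the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one NaN in each column.
toy = pd.DataFrame({
    "SexuponOutcome": ["Intact Male", np.nan, "Spayed Female"],
    "AgeuponOutcome": ["3 weeks", "2 years", np.nan],
})

def nan_remover(dataframe):
    # Same logic as the notebook cell: filter column by column.
    for column in dataframe.columns.tolist():
        dataframe = dataframe[dataframe[column].notnull()]
    return dataframe

looped = nan_remover(toy)
built_in = toy.dropna()          # drops any row containing a NaN

same_rows = looped.equals(built_in)
surviving_index = list(built_in.index)   # original labels are kept
```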


In [5]:
#Create two empty lists, one to contain the machine learning features, the other the labels
#Features and labels will be stored as tuples in the form indicated below
#(Name_as_string, values_as_pd_series)

list_of_features = []
list_of_labels = []

In [6]:
#Function that takes a categorical column from the animals dataframe, returns multiple dummy columns (listing 1 
#where condition true, 0 where false), and appends those columns to the appropriate list - features or labels

def add_dummies_to_list(column_to_dummy, target_list):
    dummies_df = pd.get_dummies(column_to_dummy)
    for column in dummies_df:
        target_list.append((column, dummies_df[str(column)]))
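
To make the (name, series) tuples concrete, here is a minimal sketch of what `pd.get_dummies` produces for a three-row toy column and how the loop above would store it. The toy values are invented. Depending on the pandas version, the dummy values may be 0/1 integers or booleans, so the sketch converts them with `int` before inspecting them.

```python
import pandas as pd

toy_column = pd.Series(["Dog", "Cat", "Dog"])
dummies_df = pd.get_dummies(toy_column)   # one column per category

toy_list = []
for column in dummies_df:
    # Each entry is (Name_as_string, values_as_pd_series)
    toy_list.append((column, dummies_df[column]))

names = [name for name, _ in toy_list]
dog_flags = [int(value) for value in toy_list[1][1]]
```

The categories come out in sorted order, so `names` is `["Cat", "Dog"]` and `dog_flags` marks rows 0 and 2.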



In [7]:
#Create dummies for OutcomeType (label), and SexuponOutcome and AnimalType (features). Append dummies to correct list.

add_dummies_to_list(animals_clean["OutcomeType"], list_of_labels)
add_dummies_to_list(animals_clean["SexuponOutcome"], list_of_features)
add_dummies_to_list(animals_clean["AnimalType"], list_of_features)

In [8]:
#Function that takes the DateTime entry and returns an integer between 1 and 12, to represent the calendar month.

def month_slicer(entry):
    entry = int(entry[5:7])
    return entry

animals_clean["Month"] = animals_clean["DateTime"].apply(month_slicer)

print animals_clean["Month"].head()
print animals_clean["Month"].count()

0     2
1    10
2     1
3     7
4    11
Name: Month, dtype: int64
26710
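
`month_slicer` relies on the month always sitting at characters 5 and 6 of the timestamp string. A hedged alternative, assuming the timestamps parse cleanly, is to let pandas parse the column and read the month from the `.dt` accessor. The sketch below checks on two sample timestamps that both approaches agree.

```python
import pandas as pd

toy_dates = pd.Series(["2014-02-12 18:22:00", "2013-10-13 12:44:00"])

def month_slicer(entry):
    # Same slicing as the notebook cell: "YYYY-MM-DD ..." -> MM
    return int(entry[5:7])

sliced = toy_dates.apply(month_slicer).tolist()
parsed = pd.to_datetime(toy_dates).dt.month.tolist()
```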


In [9]:
#Add newly created Month feature to the list of features

list_of_features.append(("Month", animals_clean["Month"]))

In [10]:
#Define function to convert age column to a whole number of days (fractional days are truncated by int())
#Note: the returned series has a fresh 0..n-1 index rather than the index of the input column
def age_converter(age_column):
    #Conversion factor from each time unit to days
    days_per_unit = [("year", 365.0), ("month", 365.0/12), ("week", 7.0), ("day", 1.0)]
    list_of_converted_ages = []
    for item in age_column.values.tolist():
        for unit, days in days_per_unit:
            #Substring match, so plurals such as "years" and "months" are caught too
            if unit in item:
                #The number is everything before the first space, e.g. "3" in "3 weeks"
                cut_item = item[:item.index(" ")]
                list_of_converted_ages.append(int(float(cut_item)*days))
    return pd.Series(list_of_converted_ages)

series_of_converted_ages = age_converter(animals_clean["AgeuponOutcome"])
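
The arithmetic behind the conversion is easier to follow on single values. The helper below, `age_to_days`, is a hypothetical stand-alone version written for illustration (it is not part of the notebook); it uses the same factors of 365 days per year, 365/12 days per month and 7 days per week, and truncates with `int()` in the same way.

```python
# Days per unit, matching the factors used in age_converter.
DAYS_PER_UNIT = {"year": 365.0, "month": 365.0 / 12, "week": 7.0, "day": 1.0}

def age_to_days(age_string):
    """Convert a string such as '3 weeks' to a whole number of days."""
    number, unit = age_string.split(" ")
    unit = unit.rstrip("s")                      # "years" -> "year"
    return int(float(number) * DAYS_PER_UNIT[unit])

samples = ["1 year", "2 years", "10 months", "4 months", "3 weeks", "5 days"]
examples = [age_to_days(sample) for sample in samples]
```

"4 months" is 121.67 days before truncation and comes out as 121, which shows that the result is rounded down rather than to the nearest day.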


In [11]:
#Add newly created Age feature to the list of features

list_of_features.append(("Age", series_of_converted_ages))

In [12]:
#Create master function that takes a list_of_inputs as its input, and returns an array that is ready
#for use in the machine learning pipeline.
def create_array_for_machine_learning(list_of_inputs):
    length_checker(list_of_inputs)
    ml_array = ml_array_maker(list_of_inputs)
    return ml_array

#The master function first runs a length checker, which confirms that every series in the list_of_inputs
#is the same length, or else prints an error message.
def length_checker(list_of_inputs):
    length = list_of_inputs[0][1].count()
    for item in list_of_inputs:
        new_length = item[1].count()
        if new_length != length:
            print "Inconsistent column length at:"
            print item[0]
            return
        else:
            length = new_length
    print "Consistent column lengths."

#The master function then runs the array maker, which converts each series in the list_of_inputs to
#a numpy array, reshapes that numpy array, and then concatenates all the 1D arrays into a single 2D array.
#If the process somehow creates any NaNs, an error message is printed and the function returns None.
def ml_array_maker(list_of_inputs):
    list_of_arrays = []
    for item in list_of_inputs:
        new_array = np.array(item[1])
        new_array = new_array[:,None]
        if np.isnan(np.sum(new_array)):
            print item[0]
            print "NAN!"
            return
        list_of_arrays.append(new_array)
    ml_array = np.concatenate(list_of_arrays, axis =1)
    print "Array completed with no NaNs."
    return ml_array
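
The reshape-then-concatenate step in `ml_array_maker` can be checked on a toy list of inputs. The sketch below repeats those two steps by hand and compares the result with `np.column_stack`, which does the same job in one call. The three toy series are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_inputs = [
    ("Cat", pd.Series([0, 1, 0])),
    ("Dog", pd.Series([1, 0, 1])),
    ("Age", pd.Series([365, 21, 730])),
]

# Same steps as ml_array_maker: reshape each series to (n, 1), then join on axis 1.
manual = np.concatenate([np.array(s)[:, None] for _, s in toy_inputs], axis=1)

# np.column_stack treats each 1D array as a column and joins them directly.
stacked = np.column_stack([np.array(s) for _, s in toy_inputs])

second_row = manual[1].tolist()   # one animal: Cat flag, Dog flag, age in days
```

Because `np.array` discards the pandas index, rows are matched purely by position, so every series must list the animals in the same order.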

In [13]:
#Check we have the features we want
print len(list_of_features)

for feature in list_of_features:
    print feature[0]

9
Intact Female
Intact Male
Neutered Male
Spayed Female
Unknown
Cat
Dog
Month
Age


In [19]:
#Create 2D feature array
features = create_array_for_machine_learning(list_of_features)

Consistent column lengths.
Array completed with no NaNs.


In [20]:
#Check we have the labels we want

print len(list_of_labels)

for label in list_of_labels:
    print label[0]

5
Adoption
Died
Euthanasia
Return_to_owner
Transfer


In [21]:
#Create 2D label array
labels = create_array_for_machine_learning(list_of_labels)

Consistent column lengths.
Array completed with no NaNs.


# MACHINE LEARNING FROM TRAINING DATA

In [39]:
#Split data into training and testing sets

from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = 0.1)
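
The split above is unseeded, so each run holds out a different 10% of the rows and the final score will vary slightly between runs. `train_test_split` accepts a `random_state` argument for repeatable splits. The sketch below is not scikit-learn's implementation; it is a plain NumPy illustration, on invented arrays and with an arbitrary seed of 0, of what a seeded 90/10 split amounts to.

```python
import numpy as np

rng = np.random.RandomState(0)            # fixed seed so the split is repeatable
toy_features = np.arange(40).reshape(20, 2)
toy_labels = np.arange(20)[:, None]

shuffled = rng.permutation(len(toy_features))
n_test = int(round(0.1 * len(toy_features)))   # hold out 10% of the rows
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Features and labels are indexed with the same positions so rows stay paired.
toy_features_train, toy_features_test = toy_features[train_idx], toy_features[test_idx]
toy_labels_train, toy_labels_test = toy_labels[train_idx], toy_labels[test_idx]
```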

In [40]:
#Use Grid Search to optimize parameters used in Decision Tree Classifier
#Print the best parameter combination and the accuracy score on the held-out test portion of the split

from sklearn import grid_search
from sklearn.tree import DecisionTreeClassifier

param_grid = [{"min_samples_split":[10,1000,10000], 
               "criterion": ["gini", "entropy"], 
               "splitter": ["best", "random"]}]

dtc = DecisionTreeClassifier()
clf = grid_search.GridSearchCV(dtc, param_grid)
clf.fit(features_train, labels_train)

print clf.best_params_

print clf.score(features_test, labels_test)

{'min_samples_split': 10000, 'splitter': 'random', 'criterion': 'entropy'}
0.560089853987
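
Because the labels are a 2D array of five 0/1 columns, the score above is, to my understanding of scikit-learn's behaviour for indicator-style labels, a subset accuracy: a row counts as correct only if all five columns match. The sketch below computes that measure by hand on invented predictions and contrasts it with the more forgiving per-cell accuracy.

```python
import numpy as np

true_labels = np.array([[1, 0, 0, 0, 0],
                        [0, 0, 0, 0, 1],
                        [0, 0, 1, 0, 0],
                        [0, 0, 0, 1, 0]])

predicted = np.array([[1, 0, 0, 0, 0],    # exact match
                      [0, 0, 0, 0, 1],    # exact match
                      [0, 0, 0, 0, 0],    # no outcome predicted
                      [1, 0, 0, 1, 0]])   # two outcomes predicted

# A row is correct only when every one of its five columns agrees.
row_matches = (true_labels == predicted).all(axis=1)
subset_accuracy = row_matches.mean()

# Per-cell accuracy counts each of the 20 cells separately.
per_cell_accuracy = (true_labels == predicted).mean()
```

Here two of the four rows match exactly, giving a subset accuracy of 0.5, even though 18 of the 20 individual cells are correct.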


# PROCESSING OF TESTING DATA

In [41]:
#Import test CSV (from Kaggle competition) as pandas dataframe

test_data = pd.read_csv("test.csv")

print test_data.head()
print test_data.count()

   ID      Name             DateTime AnimalType SexuponOutcome AgeuponOutcome  \
0   1    Summer  2015-10-12 12:15:00        Dog  Intact Female      10 months   
1   2  Cheyenne  2014-07-26 17:59:00        Dog  Spayed Female        2 years   
2   3       Gus  2016-01-13 12:20:00        Cat  Neutered Male         1 year   
3   4     Pongo  2013-12-28 18:12:00        Dog    Intact Male       4 months   
4   5   Skooter  2015-09-24 17:59:00        Dog  Neutered Male        2 years   

                            Breed        Color  
0          Labrador Retriever Mix    Red/White  
1  German Shepherd/Siberian Husky    Black/Tan  
2          Domestic Shorthair Mix  Brown Tabby  
3               Collie Smooth Mix     Tricolor  
4            Miniature Poodle Mix        White  
ID                11456
Name               8231
DateTime          11456
AnimalType        11456
SexuponOutcome    11456
AgeuponOutcome    11450
Breed             11456
Color             11456
dtype: int64


In [42]:
#The AgeuponOutcome column contains NaNs. These rows cannot simply be removed, because every ID must be present
#for the Kaggle scoring process. Instead, the NaNs will be replaced using the mean age found in the training data set.

print (series_of_converted_ages.mean())/365

2.17650390034


In [43]:
#Round the mean down to the closest value actually found in the data set, "2 years", and replace NaNs with this value

test_data["Age no Nans"] = test_data["AgeuponOutcome"].fillna(value="2 years")
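
A small sketch of the fill step on an invented three-row column: `fillna` returns a new Series, so the original column keeps its NaN while the filled copy has a value in every row.

```python
import numpy as np
import pandas as pd

toy_ages = pd.Series(["10 months", np.nan, "1 year"])   # made-up values
filled = toy_ages.fillna(value="2 years")

missing_before = int(toy_ages.isnull().sum())
missing_after = int(filled.isnull().sum())
filled_values = filled.tolist()
```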


In [44]:
#Drop columns that were not used to train the machine learning algorithm

test_clean = test_data.drop("Name", axis=1)
test_clean = test_clean.drop("Color", axis=1)
test_clean = test_clean.drop("Breed", axis=1)
test_clean = test_clean.drop("AgeuponOutcome", axis=1)

print test_clean.head()

   ID             DateTime AnimalType SexuponOutcome Age no Nans
0   1  2015-10-12 12:15:00        Dog  Intact Female   10 months
1   2  2014-07-26 17:59:00        Dog  Spayed Female     2 years
2   3  2016-01-13 12:20:00        Cat  Neutered Male      1 year
3   4  2013-12-28 18:12:00        Dog    Intact Male    4 months
4   5  2015-09-24 17:59:00        Dog  Neutered Male     2 years


In [45]:
#Extract test-id from dataframe for later use

test_id = np.array(test_clean["ID"])[:,None]
print test_id[:10]


[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]


In [46]:
#Create empty list to contain tuples of test features

list_of_test_features = []

In [47]:
#Create dummy columns and append to list of test features

add_dummies_to_list(test_clean["SexuponOutcome"], list_of_test_features)
add_dummies_to_list(test_clean["AnimalType"], list_of_test_features)

In [48]:
#Convert DateTime to integer month and append to list of test features

test_clean["Month"] = test_clean["DateTime"].apply(month_slicer)

list_of_test_features.append(("Month", test_clean["Month"]))

In [49]:
#Convert ages to integer number of days, and append to list of test features

#A separate variable name is used so that the training-set ages are not overwritten
test_series_of_converted_ages = age_converter(test_clean["Age no Nans"])

list_of_test_features.append(("Age", test_series_of_converted_ages))

In [50]:
#Check we have same features as used to train the machine learning algorithm

print len(list_of_test_features)

for feature in list_of_test_features:
    print feature[0]

9
Intact Female
Intact Male
Neutered Male
Spayed Female
Unknown
Cat
Dog
Month
Age
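
The check above passes because every category seen in training also appears in the test file. If a category were missing from one file, `get_dummies` would produce fewer columns and the feature arrays would no longer line up. The sketch below shows one possible guard on invented data: reindexing the test dummies against the training columns and filling any missing column with 0.

```python
import pandas as pd

train_sex = pd.Series(["Intact Male", "Spayed Female", "Unknown"])
test_sex = pd.Series(["Spayed Female", "Intact Male"])   # no "Unknown" here

train_dummies = pd.get_dummies(train_sex)
test_dummies = pd.get_dummies(test_sex)

# Force the test dummies into the training column order; absent columns become 0.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

columns_before = list(test_dummies.columns)
columns_after = list(aligned.columns)
```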


# MAKE PREDICTIONS FROM TESTING DATA

In [51]:
#Create 2D feature array from list of test_features
test_features = create_array_for_machine_learning(list_of_test_features)

Consistent column lengths.
Array completed with no NaNs.


In [52]:
#Create 2D prediction array
predictions_array = clf.predict(test_features)

#Concatenate test_id column with prediction array
predictions_array = np.concatenate((test_id,predictions_array),axis=1)

#Convert array to dataframe, name columns as per kaggle requirements, set integer data type as per kaggle requirements
predictions_df = pd.DataFrame(predictions_array, dtype=int)
predictions_df.rename(columns={0: "ID", 1: "Adoption", 2: "Died", 3: "Euthanasia", 4: "Return_to_owner", 5: "Transfer"},
                      inplace=True)

#Check number of entries, datatypes
print predictions_df.head()
print predictions_df.count()
print predictions_df.dtypes


   ID  Adoption  Died  Euthanasia  Return_to_owner  Transfer
0   1         0     0           0                0         1
1   2         1     0           0                0         0
2   3         1     0           0                0         0
3   4         0     0           0                0         1
4   5         1     0           0                0         0
ID                 11456
Adoption           11456
Died               11456
Euthanasia         11456
Return_to_owner    11456
Transfer           11456
dtype: int64
ID                 int64
Adoption           int64
Died               int64
Euthanasia         int64
Return_to_owner    int64
Transfer           int64
dtype: object
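
Since each outcome column is predicted as a separate 0/1 value, nothing forces a row to contain exactly one 1. A quick sanity check is to sum across the outcome columns and count the rows that have no predicted outcome or more than one. The sketch below does this on an invented prediction array; the same row sums could be taken over the five outcome columns of the real predictions.

```python
import numpy as np

toy_predictions = np.array([[0, 0, 0, 0, 1],
                            [1, 0, 0, 0, 0],
                            [0, 0, 0, 0, 0],    # no outcome predicted
                            [1, 0, 0, 1, 0]])   # two outcomes predicted

outcomes_per_row = toy_predictions.sum(axis=1)
rows_without_outcome = int((outcomes_per_row == 0).sum())
rows_with_several = int((outcomes_per_row > 1).sum())
```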


In [53]:
#Write dataframe to csv file

predictions_df.to_csv("ml_output_080816.csv", index=False)