# Shelter Animal Outcomes
- Mon 21 Mar 2016 - Sun 31 Jul 2016

Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we're asking Kagglers to predict the outcome for each animal.

https://www.kaggle.com/c/shelter-animal-outcomes

### The Data

In [None]:
import numpy as np
import pandas as pd

In [None]:
datapath = "G:/KagglePast/ShelterAnimalOutcomes/"
train_file = pd.read_csv(datapath+"train.csv")
test_file = pd.read_csv(datapath+"test.csv")

In [None]:
train_file.head(3)

### Some Feature Engineering

In [None]:
#Removing Names and Subtypes of Outcome
train_file.drop(["Name", "OutcomeSubtype"], axis=1, inplace=True)
test_file.drop(["Name"], axis=1, inplace=True)

In [None]:
#Converting Dates to categorical Year, Month and Day of the Week

from datetime import datetime
def convert_date(dt):
    d = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")
    return d.year, d.month, d.isoweekday()

train_file["Year"], train_file["Month"], train_file["WeekDay"] = zip(*train_file["DateTime"].map(convert_date))
test_file["Year"], test_file["Month"], test_file["WeekDay"] = zip(*test_file["DateTime"].map(convert_date))

train_file.drop(["DateTime"], axis=1, inplace=True)
test_file.drop(["DateTime"], axis=1, inplace=True)

In [None]:
#Separating IDs
train_id = train_file[["AnimalID"]]
test_id = test_file[["ID"]]
train_file.drop(["AnimalID"], axis=1, inplace=True)
test_file.drop(["ID"], axis=1, inplace=True)

In [None]:
#Target variable
train_outcome = train_file["OutcomeType"]
train_file.drop(["OutcomeType"], axis=1, inplace=True)

In [None]:
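#Converting Age to weeks
def age_to_weeks(age1):
    #Missing ages get a default of 25 weeks
    if pd.isnull(age1):
        return 25.0
    parts = age1.split()
    #Ages recorded with a leading 0 get a default of 10 weeks
    if parts[0] == '0':
        return 10.0
    number, unit = float(parts[0]), parts[1]
    #startswith also matches singular units ("1 week", "1 month", "1 day")
    if unit.startswith("day"):
        return number / 7
    elif unit.startswith("week"):
        return number
    elif unit.startswith("month"):
        return number * 4
    else:
        return number * 52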

In [None]:
train_file["AgeuponOutcome"] = train_file["AgeuponOutcome"].map(age_to_weeks)
test_file["AgeuponOutcome"] = test_file["AgeuponOutcome"].map(age_to_weeks)

In [None]:
#Checking that train and test sets are similar
print(train_file.head(1))
print(test_file.head(1))

### One-hot encoding of the categorical variables
To encode the variables correctly, the train and test sets must end up with the same dummy columns for every category.
To guarantee this, we concatenate both sets, encode the result with `pd.get_dummies`, and then split it back into train and test.
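
As a minimal sketch of why the concatenation matters, the toy example below uses two hypothetical frames (`toy_train` and `toy_test`, not part of the competition data) in which the test set contains a color the training set lacks. Encoding them separately gives mismatched columns, while encoding the concatenation gives both parts the same columns.

```python
import pandas as pd

# Hypothetical toy data: "Tan" appears only in the test set.
toy_train = pd.DataFrame({"Color": ["Black", "White"]})
toy_test = pd.DataFrame({"Color": ["Black", "Tan"]})

# Encoding each set on its own produces different dummy columns.
separate_train_cols = list(pd.get_dummies(toy_train, columns=["Color"]).columns)
separate_test_cols = list(pd.get_dummies(toy_test, columns=["Color"]).columns)

# Mark each set, concatenate, encode once, then split again.
toy_train["Train"] = 1
toy_test["Train"] = 0
toy_all = pd.get_dummies(
    pd.concat([toy_train, toy_test], ignore_index=True), columns=["Color"]
)
joint_train = toy_all[toy_all["Train"] == 1].drop(columns=["Train"])
joint_test = toy_all[toy_all["Train"] == 0].drop(columns=["Train"])

print(separate_train_cols)
print(separate_test_cols)
print(list(joint_train.columns))
```

A model fitted on `joint_train` can then score `joint_test` directly, because both frames share the same columns in the same order.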

In [None]:
categorical_variables = ['AnimalType', 'SexuponOutcome', 'Breed', 'Color', 'Year', 'Month', 'WeekDay']

In [None]:
#Mark the training set
train_file["Train"] = 1
test_file["Train"] = 0

#Concatenate the sets
conjunto = pd.concat([train_file, test_file])

In [None]:
#Get the encoded set
conjunto_encoded = pd.get_dummies(conjunto, columns=categorical_variables)

In [None]:
#Separate the sets
train = conjunto_encoded[conjunto_encoded["Train"] == 1]
test = conjunto_encoded[conjunto_encoded["Train"] == 0]
train = train.drop(["Train"], axis=1)
test = test.drop(["Train"], axis=1)

In [None]:
#Separating a validation set
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train, train_outcome, test_size=0.2)

### Training Models

In [None]:
#First Model: Logistic Regression
#http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression

In [None]:
model1 = LogisticRegression()

In [None]:
model1.fit(X_train, y_train)

In [None]:
y_pred = model1.predict(X_val)

In [None]:
#Evaluating the Model: Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred)

In [None]:
#A prettier representation
print("Model Accuracy: {:.2%}".format(accuracy_score(y_val, y_pred)))

In [None]:
#Evaluating the Model: the Confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_val, y_pred)

### Exercises

Compute the precision and recall metrics. What do these functions measure?

Extract the coefficients of the variables from the trained model. How could they be interpreted?

Train other models (for example, Decision Trees and Random Forest) and compute the same metrics.

For one of the models suggested in the previous item, obtain the importance of each variable for the trained model.
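
As a possible starting point for the precision and recall exercise, here is a small sketch on hypothetical toy labels (`y_true` and `y_hat` are invented for illustration and are not model output). It uses `precision_score` and `recall_score` from `sklearn.metrics` with `average=None`, which returns one value per class, in the order given by `labels`.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels, not taken from the dataset.
y_true = ["Adoption", "Adoption", "Adoption", "Transfer", "Transfer"]
y_hat = ["Adoption", "Transfer", "Transfer", "Transfer", "Transfer"]

labels = ["Adoption", "Transfer"]

# average=None returns one score per class instead of a single aggregate.
precision = precision_score(y_true, y_hat, labels=labels, average=None)
recall = recall_score(y_true, y_hat, labels=labels, average=None)

for name, p, r in zip(labels, precision, recall):
    print("{}: precision={:.2f}, recall={:.2f}".format(name, p, r))
```

The same calls should work with `y_val` and `y_pred` from the validation split, assuming `labels` is replaced by the outcome classes present in the data.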