# About this notebook 

#### Feature: Health (vaccinated, dewormed, sterilized, health)

This is one of a series of notebooks, one for each feature of interest. Each notebook explores its feature for missing data, data characteristics, correlation with adoption speed (the variable to be predicted), and any other points of interest that are helpful to know, and deal with, before machine learning.

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> A pet's "Health" is represented across 4 features: vaccinated, dewormed, sterilized, and health.
    <br>
    Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
    <br>
    Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
    <br>
    Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
    <br>
    Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
    <br>
The approach for this feature set is to compare pets that have a perfect health score (a value of 1 for all four features, for a total of 4) with pets that do not have a good health score, defined here as a total greater than 5, i.e. at least two points above the perfect total. Note that a total of 6 can come from two features scored 2 or from a single feature scored 3.
    <br>
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>
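
The scoring rule described above can be sketched on a few made-up rows. The toy values and the `group` label are illustrative assumptions and do not come from the dataset. The last row shows that a single "Not Sure" answer (3) is enough to move a pet's total from 4 to 6.

```python
import pandas as pd

# Toy rows invented for illustration; values follow the dataset's codes.
toy = pd.DataFrame({
    'Vaccinated': [1, 1, 2, 3],
    'Dewormed':   [1, 1, 2, 1],
    'Sterilized': [1, 2, 2, 1],
    'Health':     [1, 1, 1, 1],
})

health_cols = ['Vaccinated', 'Dewormed', 'Sterilized', 'Health']
toy['healthsum'] = toy[health_cols].sum(axis=1)

# Perfect score: every health feature equals 1 (total of 4).
perfect = (toy[health_cols] == 1).all(axis=1)
# Poor score: total greater than 5.
poor = toy['healthsum'] > 5

# Hypothetical label column, used only to display the grouping.
toy['group'] = 'in between'
toy.loc[perfect, 'group'] = 'perfect'
toy.loc[poor, 'group'] = 'poor'
print(toy)
```

Pets with a total of exactly 5 fall into neither group under this rule.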

<div class="span5 alert alert-success">
<p> <I> Predictor (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed.    

</p>
</div>

In [1]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Import the csv file
dfi = pd.read_csv('train.csv')
dfi.head(1)

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,1,Lil Milo,2,0,26,2,2,0,0,2,...,1,1,0,41326,1a2113010d6048d5410b265347b35c91,0,Milo went missing after a week with her new ad...,375905770,3,3


<div class="span5 alert alert-info">
<p> <B>  Missing Data: </B>  </p>
</div>

In [4]:
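# Create the health dataframe (copy so the new column is not written to a view of dfi)
dfa = dfi[['Vaccinated', 'Dewormed', 'Sterilized', 'Health', 'AdoptionSpeed']].copy()

# Total health score: 4 is perfect, higher totals are worse
dfa['healthsum'] = dfa.Vaccinated + dfa.Dewormed + dfa.Sterilized + dfa.Health

# Use lowercase column names for the rest of the notebook
dfa.columns = ['vaccinated', 'dewormed', 'sterilized', 'health', 'adoptionspeed', 'healthsum']

dfa = dfa.sort_values('healthsum')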

In [5]:
# Percentage of missing values in each column
pd.DataFrame(
    dfa.isnull().mean() * 100,
    columns=['% Missing Values']
).transpose()

Unnamed: 0,vaccinated,dewormed,sterilized,health,adoptionspeed,healthsum
% Missing Values,0.0,0.0,0.0,0.0,0.0,0.0
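
`isnull()` only detects true NaN values. Per the feature description, uncertainty is encoded as ordinary integers (3 = Not Sure for the first three features, 0 = Not Specified for health), so a separate count of those codes may be worth running. A minimal sketch on invented rows, assuming the lowercase column names used above; `unknown_codes` is a hypothetical helper mapping:

```python
import pandas as pd

# Toy rows invented for illustration.
toy = pd.DataFrame({
    'vaccinated': [1, 3, 2, 3],
    'dewormed':   [1, 3, 2, 1],
    'sterilized': [2, 3, 1, 1],
    'health':     [1, 1, 0, 2],
})

# Code that means "unknown" in each column, per the feature description.
unknown_codes = {'vaccinated': 3, 'dewormed': 3, 'sterilized': 3, 'health': 0}

# Percentage of rows holding the "unknown" code in each column.
unknown_pct = pd.Series(
    {col: (toy[col] == code).mean() * 100 for col, code in unknown_codes.items()}
)
print(unknown_pct)
```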


<div class="span5 alert alert-info">
<p> <B>  Characteristics of the data: </B>  </p>
</div>

In [6]:
#Create a dataframe of pets with perfect health (value of 1 in all health categories)
dfph = dfa[(dfa.vaccinated == 1) & (dfa.dewormed == 1) & (dfa.sterilized == 1) & (dfa.health == 1)]
dfph = dfph.reset_index()

dfph.describe()

Unnamed: 0,index,vaccinated,dewormed,sterilized,health,adoptionspeed,healthsum
count,2377.0,2377.0,2377.0,2377.0,2377.0,2377.0,2377.0
mean,7009.518721,1.0,1.0,1.0,1.0,2.91544,4.0
std,4612.085069,0.0,0.0,0.0,0.0,1.12666,0.0
min,6.0,1.0,1.0,1.0,1.0,0.0,4.0
25%,2371.0,1.0,1.0,1.0,1.0,2.0,4.0
50%,7117.0,1.0,1.0,1.0,1.0,3.0,4.0
75%,10983.0,1.0,1.0,1.0,1.0,4.0,4.0
max,14987.0,1.0,1.0,1.0,1.0,4.0,4.0


In [7]:
#Create a dataframe of pets that don't have a reasonable health score (total health score > 5)
dfbadh = dfa[(dfa.healthsum > 5)]
dfbadh = dfbadh.reset_index()

dfbadh.describe()

Unnamed: 0,index,vaccinated,dewormed,sterilized,health,adoptionspeed,healthsum
count,9516.0,9516.0,9516.0,9516.0,9516.0,9516.0,9516.0
mean,7516.92644,2.128731,1.876524,2.147646,1.051807,2.429487,7.204708
std,4149.228501,0.497319,0.694692,0.460478,0.237224,1.187325,1.257546
min,2.0,1.0,1.0,1.0,1.0,0.0,6.0
25%,3996.75,2.0,1.0,2.0,1.0,1.0,6.0
50%,7352.5,2.0,2.0,2.0,1.0,2.0,7.0
75%,11073.25,2.0,2.0,2.0,1.0,4.0,7.0
max,14992.0,3.0,3.0,3.0,3.0,4.0,12.0
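
One way to compare the two groups more directly than with `describe()` is to place their adoption-speed distributions side by side. The sketch below uses invented values; `perfect` and `poor` are stand-ins for `dfph` and `dfbadh`, and nothing here reflects the real data.

```python
import pandas as pd

# Invented adoption speeds standing in for the two groups.
perfect = pd.DataFrame({'adoptionspeed': [4, 3, 4, 2, 3, 4, 1, 3]})
poor = pd.DataFrame({'adoptionspeed': [1, 2, 4, 0, 2, 3, 1, 2]})

speeds = [0, 1, 2, 3, 4]

# Share of each group falling into each adoption-speed class.
comparison = pd.DataFrame({
    'perfect': perfect['adoptionspeed'].value_counts(normalize=True),
    'poor': poor['adoptionspeed'].value_counts(normalize=True),
}).reindex(speeds).fillna(0.0)

print(comparison.round(3))
```

Using proportions rather than raw counts keeps the comparison meaningful when the groups differ in size, as they do here (2377 versus 9516 pets).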


<div class="span5 alert alert-info">
<p> <B>  Correlation with the Adoption Rate: </B>  </p>
</div>

In [8]:
# Calculate the Pearson correlation between healthsum and adoption speed.
def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two arrays."""
    corr_mat = np.corrcoef(x, y)

    return corr_mat[0, 1]

# Compute Pearson correlation coefficient
r = pearson_r(dfa.healthsum,dfa.adoptionspeed)

# Print the result
print('Correlation value: ' + str(round(r,2)*100) + '%')

Correlation value: -5.0%
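
Because both `healthsum` and `adoptionspeed` are ordinal codes rather than continuous measurements, a rank-based coefficient is one possible cross-check on Pearson's r. The sketch below computes Spearman's rho as the Pearson correlation of the average ranks. It is an assumption about a useful follow-up, not the notebook's method, and it runs on invented values, so it implies no result for the real data.

```python
import numpy as np
import pandas as pd

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    rx = pd.Series(x).rank()  # ties receive their average rank
    ry = pd.Series(y).rank()
    return np.corrcoef(rx, ry)[0, 1]

# Invented values for illustration only.
healthsum = [4, 4, 5, 6, 7, 7, 8, 9]
adoptionspeed = [4, 3, 3, 2, 2, 1, 2, 0]

rho = spearman_rho(healthsum, adoptionspeed)
print(f'Spearman rho: {rho:.2f}')
```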


<div class="span5 alert alert-info">
<p> <B>  Other Points of Interest: </B>  </p>
</div>