# About this notebook 

#### Feature: Type

This is one of a series of notebooks, one for each feature of interest. Each notebook explores its feature for missing data, data characteristics, correlation with adoption speed (the target variable), and other points worth knowing and handling before machine learning.

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> The "Type" feature indicates whether the pet is a dog (value of 1) or a cat (value of 2).  
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Target (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed (no pets in this dataset waited between 90 and 100 days).    

</p>
</div>
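
When building tables or plots, it can help to translate the AdoptionSpeed codes above into short labels. The snippet below is only a sketch: the name `ADOPTION_SPEED_LABELS` and the label wording are assumptions made for illustration, and the values are toy data rather than rows from `train.csv`.

```python
import pandas as pd

# Hypothetical label mapping for the AdoptionSpeed codes described above
ADOPTION_SPEED_LABELS = {
    0: 'same day',
    1: '1-7 days',
    2: '8-30 days',
    3: '31-90 days',
    4: 'not adopted after 100 days',
}

# Toy values for illustration only (not taken from train.csv)
speeds = pd.Series([0, 4, 2], name='adoptionspeed')

# Series.map looks each code up in the dictionary
labels = speeds.map(ADOPTION_SPEED_LABELS)
print(labels.tolist())
```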

In [1]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Import the csv file

dfi = pd.read_csv('train.csv')
dfi.head(1)

,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,1,Lil Milo,2,0,26,2,2,0,0,2,...,1,1,0,41326,1a2113010d6048d5410b265347b35c91,0,Milo went missing after a week with her new ad...,375905770,3,3


<div class="span5 alert alert-info">
<p> <B>  Missing Data: </B>  </p>
</div>

In [4]:
#Create Type Dataframe

dfa = dfi[['Type','AdoptionSpeed']]
dfa.columns = ['type','adoptionspeed']

In [5]:
# Percentage of missing values in each column
# (multiply by 100 so the values match the '% Missing Values' label)
pd.DataFrame(
    dfa.isnull().sum() / len(dfa) * 100,
    columns=['% Missing Values']
).transpose()

,type,adoptionspeed
% Missing Values,0.0,0.0


<div class="span5 alert alert-info">
<p> <B>  Characteristics of the data: </B>  </p>
</div>

In [6]:
#Function to designate 'dog' or 'cat' text in dataframe
def dogorcat(intype):
    #Type 1 is a dog; any other value is treated as a cat
    return 'dog' if intype == 1 else 'cat'

In [7]:
#Create a dataframe of types

dfac = dfa['type'].value_counts()
dfac = dfac.reset_index()
dfac.columns = ['type','typecount']

dfac['typetext'] = dfac.type.apply(dogorcat)

dfac['typepercent'] = round(dfac.typecount/dfac.typecount.sum(),2)

dfac = dfac[['typetext','typecount','typepercent']]
dfac

,typetext,typecount,typepercent
0,dog,8132,0.54
1,cat,6861,0.46


In [8]:
#Average adoption speed for a dog (higher values mean slower adoption)
dfdogs = dfa[dfa.type == 1]

dogsavgadoptionspeed = round(dfdogs.adoptionspeed.mean(),2)
print('Dogs average adoption speed: ' + str(dogsavgadoptionspeed))


Dogs average adoption speed: 2.62


In [9]:
#Average adoption speed for a cat (higher values mean slower adoption)
dfcats = dfa[dfa.type == 2]

catsavgadoptionspeed = round(dfcats.adoptionspeed.mean(),2)
print('Cats average adoption speed: ' + str(catsavgadoptionspeed))


Cats average adoption speed: 2.4
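
The two per-type averages can also be computed in one step with a `groupby`. The following is a minimal sketch on a small made-up frame, not the notebook's data, and the function name `mean_speed_by_type` is hypothetical. Since higher AdoptionSpeed codes mean slower adoption, a lower mean indicates faster adoption.

```python
import pandas as pd

def mean_speed_by_type(df):
    """Return the mean adoptionspeed per pet type (1 = dog, 2 = cat), rounded to 2 places."""
    return df.groupby('type')['adoptionspeed'].mean().round(2)

# Toy data for illustration only (not taken from train.csv)
toy = pd.DataFrame({
    'type':          [1, 1, 1, 2, 2],
    'adoptionspeed': [4, 3, 2, 1, 2],
})

result = mean_speed_by_type(toy)
print(result)
```

On the real data this would be called as `mean_speed_by_type(dfa)`, assuming `dfa` has the `type` and `adoptionspeed` columns created earlier.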


<div class="span5 alert alert-info">
<p> <B>  Correlation with the Adoption Rate: </B>  </p>
</div>

In [10]:
#Create a dataframe to calculate correlation
dfaa = dfi[['Type','AdoptionSpeed']]
dfaa.columns = ['type','adoptionspeed']

In [11]:
#Calculate pearson correlation between pet type and adoption speed
def pearson_r(x,y):
    corr_mat = np.corrcoef(x,y)

    return corr_mat[0,1]

# Compute Pearson correlation coefficient
r = pearson_r(dfaa.type,dfaa.adoptionspeed)

# Print the result (Pearson r ranges from -1 to 1, so report it as a coefficient)
print('Correlation value: ' + str(round(r,2)))

Correlation value: -0.09
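
A single correlation coefficient can hide how the five adoption speed categories are distributed within each pet type. One possible complement is a row-normalized cross-tabulation. The sketch below uses toy data, and the helper name `speed_share_by_type` is an assumption for illustration.

```python
import pandas as pd

def speed_share_by_type(df):
    """Share of each adoptionspeed code within each pet type (each row sums to 1)."""
    return pd.crosstab(df['type'], df['adoptionspeed'], normalize='index')

# Toy data for illustration only (not taken from train.csv)
toy = pd.DataFrame({
    'type':          [1, 1, 1, 1, 2, 2],
    'adoptionspeed': [4, 4, 2, 1, 1, 2],
})

shares = speed_share_by_type(toy)
print(shares.round(2))
```

Applied to the notebook's `dfaa` frame, each row would show what fraction of dogs or cats fell into each adoption speed category.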


<div class="span5 alert alert-info">
<p> <B>  Other Points of Interest: </B>  </p>
</div>

In [12]:
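#Create a dataframe for pivot table for dogs and for cats
#(.copy() avoids a SettingWithCopyWarning when the new column is added)
dfaa = dfi[['Type','Age','AdoptionSpeed']].copy()
dfaa.columns = ['type','ageinmonths','adoptionspeed']
#Age is recorded in months; round to the nearest whole year
dfaa['ageinyears'] = round(dfaa.ageinmonths/12)
#Keep only pets younger than 13 years
dfaa = dfaa[dfaa.ageinyears < 13]
dfaa = dfaa[['type','ageinyears','adoptionspeed','ageinmonths']]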
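
One detail of the age conversion is worth keeping in mind: calling `round` on a pandas Series rounds exact halves to the nearest even number, so 6 months becomes 0 years while 18 months becomes 2 years. The sketch below, on toy ages, compares this with floor division, which gives completed years instead. The floor-division variant is offered as an alternative and is not the method used in this notebook.

```python
import pandas as pd

# Toy ages in months, for illustration only (not taken from train.csv)
ageinmonths = pd.Series([5, 6, 7, 18, 30])

# Nearest-year rounding as in the cell above; exact halves round to the even year
nearest = round(ageinmonths / 12)

# Completed years via floor division (an alternative, not the notebook's method)
completed = ageinmonths // 12

comparison = pd.DataFrame({
    'ageinmonths': ageinmonths,
    'nearest': nearest,
    'completed': completed,
})
print(comparison)
```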

In [13]:
#Create a pivot table of adoption speed vs age for dogs and cats
#(counts of pets per age/type row and adoption speed column, with totals)
dfasa = dfaa.pivot_table(
    columns='adoptionspeed',
    index=['ageinyears','type'],
    values='ageinmonths',
    aggfunc='count',
    margins=True
)
dfasa.columns = ['oneday','oneweek','onemonth','threemonths','notadopted','totals']

dfasa['%adoptedinoneday'] = round(dfasa.oneday/dfasa.totals,2)
dfasa['%notadopted'] = round(dfasa.notadopted/dfasa.totals,2)
dfasa

ageinyears,type,oneday,oneweek,onemonth,threemonths,notadopted,totals,%adoptedinoneday,%notadopted
0.0,1.0,92.0,995.0,1564.0,1329.0,1254.0,5234,0.02,0.24
0.0,2.0,182.0,1362.0,1479.0,972.0,985.0,4980,0.04,0.2
1.0,1.0,32.0,130.0,235.0,241.0,491.0,1129,0.03,0.43
1.0,2.0,36.0,182.0,254.0,210.0,498.0,1180,0.03,0.42
2.0,1.0,17.0,110.0,108.0,149.0,274.0,658,0.03,0.42
2.0,2.0,18.0,65.0,71.0,77.0,170.0,401,0.04,0.42
3.0,1.0,9.0,68.0,74.0,61.0,147.0,359,0.03,0.41
3.0,2.0,,19.0,27.0,15.0,67.0,128,,0.52
4.0,1.0,5.0,52.0,53.0,52.0,93.0,255,0.02,0.36
4.0,2.0,1.0,12.0,18.0,8.0,28.0,67,0.01,0.42
