# About this notebook 

#### Feature: Maturity Size

This is one of a series of notebooks (one for each feature of interest) that explores the feature for missing data, data characteristics, correlation with the rate of adoption (predictor variable) and other points of interest that might be helpful to know (and deal with) prior to machine learning.

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> The "Maturity Size" data represents the size of the pet as an adult. The values are...   
    <br>
    0 = Not Specified   
    1 = Small   
    2 = Medium   
    3 = Large   
    4 = Extra Large   
    <br>
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Predictor (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed.    

</p>
</div>

In [12]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [13]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [14]:
#Import the csv file

dfi = pd.read_csv('train.csv')
dfi.head(1)

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,1,Lil Milo,2,0,26,2,2,0,0,2,...,1,1,0,41326,1a2113010d6048d5410b265347b35c91,0,Milo went missing after a week with her new ad...,375905770,3,3


<div class="span5 alert alert-info">
<p> <B>  Missing Data: </B>  </p>
</div>

In [15]:
#Create Type Dataframe

dfa = dfi[['MaturitySize','AdoptionSpeed']]
dfa.columns = ['matsize','adoptionspeed']

In [16]:
# Percentage of missing values in each column
pd.DataFrame(
    dfa.isnull().sum() / len(dfa),
    columns=['% Missing Values']
).transpose()

Unnamed: 0,matsize,adoptionspeed
% Missing Values,0.0,0.0


<div class="span5 alert alert-info">
<p> <B>  Characteristics of the data: </B>  </p>
</div>

In [17]:
#Function to designate maturity size in dataframe
def dogorcat(inmatsize):
    petmatsize = ''
    
    if inmatsize == 0:
        petmatsize = 'notspecified'
    elif inmatsize == 1:
        petmatsize = 'small'
    elif inmatsize == 2:
        petmatsize = 'medium'
    elif inmatsize == 3:
        petmatsize = 'large'
    else:
        petmatsize = 'extralarge'
        
    return petmatsize  

In [18]:
#Create a dataframe of matsizes

dfac = dfa['matsize'].value_counts()
dfac = dfac.reset_index()
dfac.columns = ['matsize','matsizecount']

dfac['matsizetext'] = dfac.matsize.apply(dogorcat)

dfac['matsizepercent'] = round(dfac.matsizecount/dfac.matsizecount.sum(),2)

dfac['matsizeadoptionspeedmean'] = dfa.groupby('matsize')['adoptionspeed'].mean()

dfac = dfac[['matsize','matsizetext','matsizecount','matsizepercent','matsizeadoptionspeedmean']].sort_values('matsize')
dfac

Unnamed: 0,matsize,matsizetext,matsizecount,matsizepercent,matsizeadoptionspeedmean
1,1,small,3395,0.23,2.357879
0,2,medium,10305,0.69,
2,3,large,1260,0.08,2.576904
3,4,extralarge,33,0.0,2.45873


<div class="span5 alert alert-info">
<p> <B>  Correlation with the Adoption Rate: </B>  </p>
</div>

In [19]:
#Create a dataframe to calculate correlation
dfaa = dfi[['MaturitySize','AdoptionSpeed']]
dfaa.columns = ['matsize','adoptionspeed']

In [20]:
#Calculate pearson correlation between pet type and adoption speed
def pearson_r(x,y):
    corr_mat = np.corrcoef(x,y)

    return corr_mat[0,1]

# Compute Pearson correlation coefficient
r = pearson_r(dfaa.matsize,dfaa.adoptionspeed)

# Print the result
print('Correlation value: ' + str(round(r,2)*100) + '%')

Correlation value: 5.0%


<div class="span5 alert alert-info">
<p> <B>  Other Points of Interest: </B>  </p>
</div>

In [21]:
#Create a dataframe for pivot table for maturity size
dfaa = dfi[['MaturitySize','Age','AdoptionSpeed']]
dfaa['ageinyears'] = round(dfaa.Age/12)
dfaa.columns = ['matsize','ageinmonths','adoptionspeed','ageinyears']
dfaa = dfaa[dfaa.ageinyears < 13]
dfaa = dfaa[['matsize','ageinyears','adoptionspeed','ageinmonths']]

In [22]:
#Create a pivot table of adoption speed vs age for maturity size
dfasa = dfaa.pivot_table(columns='adoptionspeed', index=['ageinyears','matsize'], values='ageinmonths', aggfunc='count',margins=True)
dfasa.columns = ['oneday','oneweek','onemonth','threemonths','notadopted','totals']

dfasa['%adoptedinoneday'] = round(dfasa.oneday/dfasa.totals,2)
dfasa['%notadopted'] = round(dfasa.notadopted/dfasa.totals,2)
dfasa

Unnamed: 0_level_0,Unnamed: 1_level_0,oneday,oneweek,onemonth,threemonths,notadopted,totals,%adoptedinoneday,%notadopted
ageinyears,matsize,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0.0,1.0,99.0,673.0,651.0,463.0,639.0,2525,0.04,0.25
0.0,2.0,155.0,1544.0,2236.0,1728.0,1511.0,7174,0.02,0.21
0.0,3.0,19.0,138.0,154.0,109.0,89.0,509,0.04,0.17
0.0,4.0,1.0,2.0,2.0,1.0,,6,0.17,
1.0,1.0,15.0,64.0,76.0,72.0,118.0,345,0.04,0.34
1.0,2.0,41.0,184.0,345.0,328.0,771.0,1669,0.02,0.46
1.0,3.0,11.0,63.0,66.0,50.0,100.0,290,0.04,0.34
1.0,4.0,1.0,1.0,2.0,1.0,,5,0.2,
2.0,1.0,8.0,43.0,44.0,38.0,32.0,165,0.05,0.19
2.0,2.0,18.0,90.0,107.0,151.0,349.0,715,0.03,0.49
