# About this notebook 

#### Feature: Name

This is one of a series of notebooks, one per feature of interest. Each explores a feature for missing data, data characteristics, correlation with adoption speed (the target variable), and other points that are helpful to know and deal with prior to machine learning.

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> The "Name" field contains the pet's name, or is blank if the pet is not named.  
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Target (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed.    

</p>
</div>
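For readable tables and plot labels, the integer codes above can be mapped to short text labels. The following is a minimal sketch: the dictionary name `ADOPTION_SPEED_LABELS` and the label wording are illustrative assumptions, and the sample values are made up rather than taken from `train.csv`.

```python
import pandas as pd

# Hypothetical lookup table; the labels paraphrase the competition description.
ADOPTION_SPEED_LABELS = {
    0: "same day",
    1: "1-7 days",
    2: "8-30 days",
    3: "31-90 days",
    4: "not adopted after 100 days",
}

# Toy values for illustration only, not taken from train.csv.
speeds = pd.Series([3, 0, 4, 2], name="adoptionspeed")

# Series.map replaces each code with its label from the dictionary.
labels = speeds.map(ADOPTION_SPEED_LABELS)
print(labels.tolist())
```

Because the codes are ordinal, the integer column should still be used for any numeric summary; the labels are only for display.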

In [1]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Import the csv file

dfi = pd.read_csv('train.csv')
dfi.head(1)

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,1,Lil Milo,2,0,26,2,2,0,0,2,...,1,1,0,41326,1a2113010d6048d5410b265347b35c91,0,Milo went missing after a week with her new ad...,375905770,3,3


<div class="span5 alert alert-info">
<p> <B>  Missing Data: </B>  </p>
</div>

In [4]:
#Create the Name dataframe (copy so that renaming columns does not touch a view of dfi)

dfa = dfi[['Name','AdoptionSpeed']].copy()
dfa.columns = ['name','adoptionspeed']

In [5]:
# Fraction of missing values in each column (multiply by 100 for a percentage)
pd.DataFrame(
    dfa.isnull().sum() / len(dfa),
    columns=['% Missing Values']
).transpose()

Unnamed: 0,name,adoptionspeed
% Missing Values,0.083839,0.0
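The value reported for `name` is a fraction, so roughly 8.4% of the listings have no name. An equivalent and slightly shorter way to get the figure directly as a percentage is sketched below. The `toy` frame is a made-up stand-in for `dfa`, so the numbers it prints are not the notebook's results.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dfa; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["Milo", np.nan, "Kitty", np.nan],
    "adoptionspeed": [3, 2, 4, 1],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction missing, and multiplying by 100 gives a percentage.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```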


<div class="span5 alert alert-info">
<p> <B>  Characteristics of the data: </B>  </p>
</div>

In [6]:
#Flag whether a pet is named: 1 = has a name, 0 = the 'not named' placeholder
def funcnameornoname(invalue):
    # Missing names are filled with 'not named' before this function is applied
    return 0 if invalue == 'not named' else 1

In [7]:
#Create a dataframe of names, including a column designating named or not named
dfname = dfa[['name','adoptionspeed']]
dfname = dfname.reset_index(drop=True)

dfname = dfname.fillna('not named')

dfname['nameornoname'] = dfname.name.apply(funcnameornoname)
dfname.head(1)

Unnamed: 0,name,adoptionspeed,nameornoname
0,Lil Milo,3,1


In [8]:
#Average adoption speed for pets with names
dfhavename = dfname[dfname.nameornoname == 1]

havenameavgadoptionspeed = round(dfhavename.adoptionspeed.mean(),2)
print('Have name average adoption speed: ' + str(havenameavgadoptionspeed))


Have name average adoption speed: 2.51


In [9]:
#Average adoption speed for pets without names
dfnoname = dfname[dfname.nameornoname == 0]

nonameavgadoptionspeed = round(dfnoname.adoptionspeed.mean(),2)
print('No name average adoption speed: ' + str(nonameavgadoptionspeed))


No name average adoption speed: 2.6
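The two averages can also be computed in one step with `groupby`. Because adoption speed is an ordinal class label rather than a true quantity, it may also help to look at the share of each class within each group. The sketch below assumes a frame shaped like `dfname`; the `toy` data is invented for illustration and does not reproduce the 2.51 and 2.6 figures.

```python
import pandas as pd

# Toy frame shaped like dfname; the values are made up for illustration.
toy = pd.DataFrame({
    "nameornoname": [1, 1, 1, 1, 0, 0],
    "adoptionspeed": [1, 2, 3, 4, 2, 4],
})

# Mean adoption speed per group (0 = not named, 1 = named).
group_means = toy.groupby("nameornoname")["adoptionspeed"].mean()

# Share of each adoption-speed class within each group; each row sums to 1.
class_shares = pd.crosstab(
    toy["nameornoname"], toy["adoptionspeed"], normalize="index"
)

print(group_means)
print(class_shares)
```

Comparing the rows of `class_shares` shows whether a gap in the means comes from one class (for example, more pets never adopted) or from a shift across all classes.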


<div class="span5 alert alert-info">
<p> <B>  Correlation with Adoption Speed: </B>  </p>
</div>

In [10]:
#Create a dataframe to calculate correlation
dfaa = dfi[['Name','AdoptionSpeed']]
dfaa.columns = ['name','adoptionspeed']

In [11]:
#Calculate Pearson correlation between the named/not-named flag and adoption speed
def pearson_r(x,y):
    # np.corrcoef returns the 2x2 correlation matrix; [0,1] is the x-y coefficient
    corr_mat = np.corrcoef(x,y)

    return corr_mat[0,1]

# Compute Pearson correlation coefficient
r = pearson_r(dfname.nameornoname,dfaa.adoptionspeed)

# Print the result (r is a coefficient between -1 and 1, not a percentage)
print('Correlation value: ' + str(round(r,2)))

Correlation value: -0.02
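When one variable is a 0/1 flag, the Pearson coefficient is the point-biserial correlation, which can be written in terms of the two group means: r = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the mean adoption speeds of named and unnamed pets, s is the population standard deviation of adoption speed, and p is the share of named pets. This explains why the coefficient is so close to zero here: the gap between the group means is small, and unnamed pets are a small share of the data. The sketch below checks the identity on made-up values; the variable names are illustrative and the numbers are not the notebook's results.

```python
import numpy as np

# Toy data for illustration only: 1 = named, 0 = not named.
flag = np.array([1, 1, 1, 1, 0, 0])
speed = np.array([1, 2, 3, 4, 2, 4], dtype=float)

# Pearson correlation, computed the same way as pearson_r in the notebook.
r_pearson = np.corrcoef(flag, speed)[0, 1]

# Point-biserial form: group-mean gap, scaled by the population standard
# deviation (NumPy's default ddof=0) and by sqrt(p * (1 - p)).
p = flag.mean()
mean_named = speed[flag == 1].mean()
mean_unnamed = speed[flag == 0].mean()
r_from_means = (mean_named - mean_unnamed) / speed.std() * np.sqrt(p * (1 - p))

print(round(r_pearson, 4), round(r_from_means, 4))
```

A negative value means named pets have a lower average adoption speed code, that is, they tend to be adopted sooner.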


<div class="span5 alert alert-info">
<p> <B>  Other Points of Interest: </B>  </p>
</div>