# Pet Adoption Prediction

The purpose of this project is to predict how quickly a pet is adopted, based on data extracted from the pet’s listing on PetFinder.my (Malaysia’s leading online animal welfare platform) and available at https://www.kaggle.com/c/petfinder-adoption-prediction/data.

The data includes text, tabular and image data for the pets. Exploratory data analysis can be used to derive relationships between the adoption speed and the various parameters available in the pet’s profile, and to suggest improvements to the profiles that would increase the animal’s adoptability.
Some profiles represent a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted.

## Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese
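
The integer codes listed above can be turned into readable labels before tabulating or plotting. The following is a minimal sketch that uses only the codes documented in this section; `CODE_LABELS`, `decode_columns` and the small `sample` frame are illustrative names and data, not part of the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the field descriptions above.
CODE_LABELS = {
    "Type": {1: "Dog", 2: "Cat"},
    "Gender": {1: "Male", 2: "Female", 3: "Mixed"},
    "Vaccinated": {1: "Yes", 2: "No", 3: "Not Sure"},
    "Dewormed": {1: "Yes", 2: "No", 3: "Not Sure"},
    "Sterilized": {1: "Yes", 2: "No", 3: "Not Sure"},
}


def decode_columns(df, code_labels=CODE_LABELS):
    """Return a copy of df with coded columns replaced by text labels."""
    decoded = df.copy()
    for column, labels in code_labels.items():
        if column in decoded.columns:
            # Codes that are not in the mapping become NaN, which makes them easy to spot.
            decoded[column] = decoded[column].map(labels)
    return decoded


# Tiny hand-made frame standing in for df_train (assumption for illustration).
sample = pd.DataFrame({"Type": [1, 2], "Gender": [2, 3], "Vaccinated": [3, 1]})
print(decode_columns(sample))
```

The helper returns a copy, so the original integer codes remain available for modelling.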

## Understanding the data

In [1]:
# Import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Define files and directories
os.chdir('C:/Users/isado/Documents/SpringBoard/Capstone1/')
         
# Create DataFrame with training and testing data 
df_train = pd.read_csv('./train/train.csv')
df_test = pd.read_csv('./test/test.csv')

# Import Malaysian states demographic summary 
state_data = pd.read_csv('MalayDemographics2010.csv')

# First look at imported data: select sample of 3 random rows 
df_train.sample(n=3)


Index,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
7580,1,Mama Brown,36,307,0,2,1,2,5,2,...,1,1,0,41326,f45d8c72a87f27427cd28fc3cd2d7ce3,0,Mama Brown was rescued in Klang. She was suffe...,71e96f977,3.0,4
10051,1,Brandy,1,307,0,2,5,0,0,2,...,1,1,0,41326,6f669dd57da5683477ce15d7067540b7,0,"1 mth old female pup for adoption. Healthy, ve...",e6ff63097,4.0,3
8424,2,,2,264,0,3,3,5,6,2,...,1,11,30,41326,232b1c56ed1c7c38753b91e97a390ac7,0,I have 9 kittens and 2 adult mother cat. Very ...,f90722a00,6.0,4


In [2]:
# Print DataFrame summary
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
Type             14993 non-null int64
Name             13736 non-null object
Age              14993 non-null int64
Breed1           14993 non-null int64
Breed2           14993 non-null int64
Gender           14993 non-null int64
Color1           14993 non-null int64
Color2           14993 non-null int64
Color3           14993 non-null int64
MaturitySize     14993 non-null int64
FurLength        14993 non-null int64
Vaccinated       14993 non-null int64
Dewormed         14993 non-null int64
Sterilized       14993 non-null int64
Health           14993 non-null int64
Quantity         14993 non-null int64
Fee              14993 non-null int64
State            14993 non-null int64
RescuerID        14993 non-null object
VideoAmt         14993 non-null int64
Description      14981 non-null object
PetID            14993 non-null object
PhotoAmt         14993 non-null float64
AdoptionSpe

In [3]:
# Print DataFrame summary
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3972 entries, 0 to 3971
Data columns (total 23 columns):
Type            3972 non-null int64
Name            3561 non-null object
Age             3972 non-null int64
Breed1          3972 non-null int64
Breed2          3972 non-null int64
Gender          3972 non-null int64
Color1          3972 non-null int64
Color2          3972 non-null int64
Color3          3972 non-null int64
MaturitySize    3972 non-null int64
FurLength       3972 non-null int64
Vaccinated      3972 non-null int64
Dewormed        3972 non-null int64
Sterilized      3972 non-null int64
Health          3972 non-null int64
Quantity        3972 non-null int64
Fee             3972 non-null int64
State           3972 non-null int64
RescuerID       3972 non-null object
VideoAmt        3972 non-null int64
Description     3971 non-null object
PetID           3972 non-null object
PhotoAmt        3972 non-null float64
dtypes: float64(1), int64(18), object(4)
memory usage: 713.
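
The training summary reports 24 columns and the test summary 23. One way to confirm which column differs is to compare the column indexes directly. The sketch below uses two toy frames with only a few columns (an assumption for illustration); with the real data, `df_train` and `df_test` would take their place.

```python
import pandas as pd

# Toy frames standing in for df_train and df_test (only a few columns shown).
train = pd.DataFrame({"PetID": ["a"], "Age": [3], "AdoptionSpeed": [2]})
test = pd.DataFrame({"PetID": ["b"], "Age": [5]})

# Columns present in one frame but absent from the other.
only_in_train = train.columns.difference(test.columns).tolist()
only_in_test = test.columns.difference(train.columns).tolist()
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```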

In [4]:
df_train.describe()

Statistic,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,VideoAmt,PhotoAmt,AdoptionSpeed
count,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0
mean,1.457614,10.452078,265.272594,74.009738,1.776162,2.234176,3.222837,1.882012,1.862002,1.467485,1.731208,1.558727,1.914227,1.036617,1.576069,21.259988,41346.028347,0.05676,3.889215,2.516441
std,0.498217,18.15579,60.056818,123.011575,0.681592,1.745225,2.742562,2.984086,0.547959,0.59907,0.667649,0.695817,0.566172,0.199535,1.472477,78.414548,32.444153,0.346185,3.48781,1.177265
min,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,41324.0,0.0,0.0,0.0
25%,1.0,2.0,265.0,0.0,1.0,1.0,0.0,0.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,2.0,2.0
50%,1.0,3.0,266.0,0.0,2.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,3.0,2.0
75%,2.0,12.0,307.0,179.0,2.0,3.0,6.0,5.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,0.0,41401.0,0.0,5.0,4.0
max,2.0,255.0,307.0,307.0,3.0,7.0,7.0,7.0,4.0,3.0,3.0,3.0,3.0,3.0,20.0,3000.0,41415.0,8.0,30.0,4.0


In [5]:
state_data.describe()

Statistic,"Area (1,000 km)",Total Population,Population Density,Average Annual Population Growth Rate %,Distribution of Population by State %,Urban Population %,Median Age,Dependency Ratio,Sex Ratio,Ethnic Groups Bumiputera %,Ethnic Groups Chinese %,Ethnic Groups Indians %,Ethnic Groups Others %
count,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0,17.0
mean,38918.0,3333428.0,781.294118,2.729412,11.782353,70.935294,26.011765,49.370588,103.823529,72.788235,20.629412,5.788235,0.770588
std,81896.305116,6583168.0,1645.852468,3.905407,23.227199,18.149722,1.876793,6.424539,5.615106,17.28713,13.452823,5.172389,0.733796
min,49.0,72413.0,20.0,1.2,0.3,42.4,22.8,36.6,89.0,43.6,0.7,0.2,0.0
25%,1048.0,1021064.0,86.0,1.5,3.7,54.0,25.2,46.3,101.0,58.9,12.8,0.9,0.4
50%,9500.0,1561383.0,174.0,1.7,5.5,69.7,26.2,48.2,104.0,74.8,23.2,6.2,0.5
75%,21035.0,2471140.0,676.0,2.0,8.7,86.5,27.0,53.5,107.0,84.8,28.6,10.3,0.8
max,330803.0,28334140.0,6891.0,17.8,100.0,100.0,29.6,61.1,113.0,98.0,45.6,15.2,2.4


# Check for missing values

In [6]:
# Function that prints the percentage of missing data for each column with missing values
def missingData(dataframe):
    # Series.items() is used because iteritems() was removed in pandas 2.0
    for column, n_missing in dataframe.isnull().sum().items():
        if n_missing > 0:
            print('Missing Data percentage for {} is {:2.2%}'.format(column, n_missing / dataframe.shape[0]))

missingData(df_train)

Missing Data percentage for Name is 8.38%
Missing Data percentage for Description is 0.08%


In [7]:
missingData(df_test)

Missing Data percentage for Name is 10.35%
Missing Data percentage for Description is 0.03%


In [14]:
missingData(state_data)
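
The same check can also be written without an explicit loop, which returns the result as a Series that can be sorted or plotted. This is a sketch of an alternative, not the function used above; `missing_share` and the toy frame are illustrative assumptions.

```python
import pandas as pd


def missing_share(dataframe):
    """Return the fraction of missing values per column, keeping only columns with gaps."""
    share = dataframe.isnull().mean()  # mean of booleans = fraction of True values
    return share[share > 0].sort_values(ascending=False)


# Toy frame standing in for df_train: Name has 1 of 4 values missing, Description 2 of 4.
toy = pd.DataFrame({
    "Name": ["Brandy", None, "Mama Brown", "Milo"],
    "Age": [1, 2, 36, 4],
    "Description": ["a", "b", None, None],
})
print(missing_share(toy).map("{:.2%}".format))
```

A column with no missing values, such as `Age` here, is dropped from the result, which matches the silent output of `missingData` for a complete frame.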

# Cross Tabulation

In [8]:
df_train.columns

Index(['Type', 'Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'RescuerID',
       'VideoAmt', 'Description', 'PetID', 'PhotoAmt', 'AdoptionSpeed'],
      dtype='object')

In [9]:
pd.crosstab(df_train.AdoptionSpeed, df_train.Type)  # 1 is Dog, 2 is Cat

AdoptionSpeed \ Type,1,2
0,170,240
1,1435,1655
2,2164,1873
3,1949,1310
4,2414,1783
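
Raw counts are hard to compare because there are more dog profiles than cat profiles. Dividing each column by its total gives the share of each adoption speed class within each animal type. The sketch below rebuilds the table from the counts shown above; on the real data, `pd.crosstab(df_train.AdoptionSpeed, df_train.Type, normalize='columns')` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the cross tabulation above (rows: AdoptionSpeed, columns: Type).
counts = pd.DataFrame(
    {1: [170, 1435, 2164, 1949, 2414], 2: [240, 1655, 1873, 1310, 1783]},
    index=pd.Index(range(5), name="AdoptionSpeed"),
)
counts.columns.name = "Type"

# Divide each column by its total to get the share of each speed class per type.
shares = counts / counts.sum(axis=0)
print(shares.round(3))
```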


In [10]:
pd.crosstab(df_train.AdoptionSpeed, df_train.Age)

AdoptionSpeed \ Age,0,1,2,3,4,5,6,7,8,9,...,132,135,144,147,156,168,180,212,238,255
0,10,54,115,44,24,7,20,7,4,5,...,0,0,0,0,0,0,0,0,0,0
1,50,643,865,408,211,100,80,43,36,24,...,1,0,0,1,0,0,0,0,0,0
2,44,754,1120,586,265,157,117,59,68,51,...,1,0,1,0,1,0,0,0,0,0
3,39,511,783,458,260,135,115,62,70,31,...,3,1,1,0,0,1,1,2,0,0
4,36,342,620,470,349,196,226,110,131,73,...,3,0,2,0,0,0,1,1,1,2


The cross tabulation has 106 age columns, so it may be beneficial to convert the pet's age into years, rounded to the nearest 0.5.
A maximum age of 255 months (21.25 years) seems very old, but the oldest recorded cat lived to 38 years and the oldest recorded dog to 29 years, so we cannot rule it out.
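
The conversion to half-year steps can be sketched as follows. The helper name `months_to_half_years` is an assumption for illustration; it doubles the age in years, rounds to a whole number, and halves the result.

```python
def months_to_half_years(months):
    """Convert an age in months to years, rounded to the nearest 0.5."""
    # Doubling before rounding and halving afterwards snaps the value to a 0.5 grid.
    # Python's round() sends exact ties to the nearest even number.
    return round(months / 12 * 2) / 2


for months in (2, 4, 12, 30, 250):
    print(months, "months ->", months_to_half_years(months), "years")
```

On the training data this could be applied with `df_train['Age'].map(months_to_half_years)` before building the cross tabulation.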


In [11]:
pd.crosstab(df_train.AdoptionSpeed, df_train.Health)

AdoptionSpeed \ Health,1,2,3
0,392,17,1
1,2999,89,2
2,3925,106,6
3,3150,98,11
4,4012,171,14
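
One rough way to summarize this table is the average adoption speed code per health category. The sketch below computes it from the counts shown above. Treating the ordinal `AdoptionSpeed` codes as numbers is a simplification, and the serious-injury group (Health = 3) contains only 34 profiles, so the result should be read with caution.

```python
import pandas as pd

# Counts copied from the cross tabulation above (rows: AdoptionSpeed, columns: Health).
counts = pd.DataFrame(
    {1: [392, 2999, 3925, 3150, 4012], 2: [17, 89, 106, 98, 171], 3: [1, 2, 6, 11, 14]},
    index=pd.Index(range(5), name="AdoptionSpeed"),
)

# Weighted mean of the AdoptionSpeed codes for each health category.
speeds = counts.index.to_numpy()
mean_speed = counts.mul(speeds, axis=0).sum() / counts.sum()
print(mean_speed.round(2))
```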


In [12]:
#sns.boxplot()
#sns.jointplot(x='hp', y='mpg', data=auto)
# Generate a violin plot of 'hp' grouped horizontally by 'cyl'
#plt.subplot(2,1,1)
#sns.violinplot(x='cyl', y='hp', data=auto)
#sns.pairplot(data=auto, hue='origin', kind='reg')
## Visualize the covariance matrix using a heatmap
#sns.heatmap(cov_matrix)