# Data Dictionary

Source: https://www.kaggle.com/c/petfinder-adoption-prediction/data

PetID - Unique hash ID of pet profile

AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See the AdoptionSpeed section below for details.

Type - Type of animal (1 = Dog, 2 = Cat)

Name - Name of pet (Empty if not named)

Age - Age of pet when listed, in months

Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)

Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)

Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)

Color1 - Color 1 of pet (Refer to ColorLabels dictionary)

Color2 - Color 2 of pet (Refer to ColorLabels dictionary)

Color3 - Color 3 of pet (Refer to ColorLabels dictionary)

MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)

FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)

Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)

Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)

Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)

Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)

Quantity - Number of pets represented in profile

Fee - Adoption fee (0 = Free)

State - State location in Malaysia (Refer to StateLabels dictionary)

RescuerID - Unique hash ID of rescuer

VideoAmt - Total uploaded videos for this pet

PhotoAmt - Total uploaded photos for this pet

Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

AdoptionSpeed

Contestants are required to predict this value, which is determined by how quickly, if at all, a pet is adopted:

0 - Pet was adopted on the same day as it was listed.

1 - Pet was adopted between 1 and 7 days (1st week) after being listed.

2 - Pet was adopted between 8 and 30 days (1st month) after being listed.

3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.

4 - No adoption after 100 days of being listed.

(There are no pets in this dataset that waited between 90 and 100 days.)
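The class boundaries above can be sketched as a small function. This is only an illustration: `days_to_adoption` is a hypothetical input (the dataset ships the final class, not the raw day count), and treating anything beyond 90 days as class 4 is an assumption based on the note that no pet waited between 90 and 100 days.

```python
from typing import Optional


def adoption_speed(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class.

    `days_to_adoption` is hypothetical; None means the pet was not adopted.
    """
    if days_to_adoption is None or days_to_adoption > 90:
        return 4  # assumption: no adoption within the observed window
    if days_to_adoption == 0:
        return 0  # same day
    if days_to_adoption <= 7:
        return 1  # 1st week
    if days_to_adoption <= 30:
        return 2  # 1st month
    return 3      # 31 to 90 days


speeds = [adoption_speed(d) for d in (0, 3, 7, 8, 30, 31, 90, None)]
print(speeds)
```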


# Import Libraries

In [1]:
# import the library
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn :: utils
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# sklearn :: models
from sklearn.neighbors import KNeighborsClassifier

# sklearn :: evaluation metrics
from sklearn.metrics import cohen_kappa_score

sns.set_style('whitegrid')

# Problem definition

Apply classification models to predict the adoption speed of pets (the AdoptionSpeed class, 0 to 4).

# Load the data

In [2]:
#source: https://www.kaggle.com/c/cebd-1260-spring-2019-classification/data
df_train = pd.read_csv('train.csv')
df_test_true = pd.read_csv('test.csv')

df_state_lbl = pd.read_csv('state_labels.csv')
df_breed_lbl = pd.read_csv('breed_labels.csv')
df_color_lbl = pd.read_csv('color_labels.csv')

In [3]:
print(df_state_lbl.columns)
print('')
print(df_breed_lbl.columns)
print('')
print(df_color_lbl.columns)
print('')
# print(df_gender.columns)

Index(['StateID', 'StateName'], dtype='object')

Index(['BreedID', 'Type', 'BreedName'], dtype='object')

Index(['ColorID', 'ColorName'], dtype='object')
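The label tables map the numeric codes in the main table to readable names. A minimal sketch of such a lookup, using small made-up frames in place of `train.csv` and `state_labels.csv` (the rows below are illustrative, not read from the files):

```python
import pandas as pd

# Toy stand-ins for the pet table and the state label table (illustrative values)
pets = pd.DataFrame({'PetID': ['a1', 'b2', 'c3'],
                     'State': [41326, 41401, 41326]})
state_lbl = pd.DataFrame({'StateID': [41326, 41401],
                          'StateName': ['Selangor', 'Kuala Lumpur']})

# A left join keeps every pet, even if a state code has no matching label
pets_named = pets.merge(state_lbl, left_on='State', right_on='StateID', how='left')
print(pets_named[['PetID', 'StateName']])
```

The same pattern would apply to `BreedID` and `ColorID`, joined against `Breed1`/`Breed2` and `Color1`/`Color2`/`Color3` respectively.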



In [4]:
df = df_train.copy()
print(df.columns)
df.T

Index(['Type', 'Name', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'RescuerID',
       'VideoAmt', 'Description', 'PetID', 'PhotoAmt', 'AdoptionSpeed'],
      dtype='object')


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
Type,1,2,2,1,2,1,1,1,2,1,...,1,1,2,2,2,1,1,2,1,2
Name,♥♥♥ Lily ♥♥♥,Cookie,Favour Speedy Abundance And Courage,,Abandoned Kitty,Duke,Lila,Doggie2_Selangor Area,,Brother,...,Amos,Minnie,Meow Zai,SOS Owner Leaving Msia,Levi,Dawn,,Isabella,Brisco,
Age,36,3,7,3,1,3,2,8,1,6,...,3,1,2,24,6,1,5,2,1,1
Breed1,307,266,250,307,266,218,307,307,243,307,...,239,307,299,266,285,307,307,265,307,266
Breed2,0,0,252,0,0,0,0,0,245,0,...,307,0,0,0,265,0,0,0,0,0
Gender,2,1,1,1,1,1,2,2,2,1,...,1,2,1,2,1,2,2,2,1,2
Color1,2,6,1,2,1,3,1,6,1,2,...,1,1,4,2,6,2,5,3,2,1
Color2,7,7,2,0,6,5,7,0,2,7,...,2,2,6,3,7,5,0,6,7,2
Color3,0,0,0,0,7,0,0,0,7,0,...,0,0,0,7,0,7,0,7,0,0
MaturitySize,2,2,2,3,1,2,1,2,1,1,...,2,2,2,2,2,2,2,2,2,1


In [5]:
# check for NaNs
df.isnull().sum(axis = 0)

Type               0
Name             842
Age                0
Breed1             0
Breed2             0
Gender             0
Color1             0
Color2             0
Color3             0
MaturitySize       0
FurLength          0
Vaccinated         0
Dewormed           0
Sterilized         0
Health             0
Quantity           0
Fee                0
State              0
RescuerID          0
VideoAmt           0
Description        8
PetID              0
PhotoAmt           0
AdoptionSpeed      0
dtype: int64

# Feature Engineering 

#### Remove columns 

In [6]:
# drop the free-text and identifier columns that are not used as features
del df['Description']
del df['RescuerID']
del df['Name']

In [7]:
df.head().T

Unnamed: 0,0,1,2,3,4
Type,1,2,2,1,2
Age,36,3,7,3,1
Breed1,307,266,250,307,266
Breed2,0,0,252,0,0
Gender,2,1,1,1,1
Color1,2,6,1,2,1
Color2,7,7,2,0,6
Color3,0,0,0,0,7
MaturitySize,2,2,2,3,1
FurLength,2,1,1,1,1


In [8]:
print(df.columns)

Index(['Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'VideoAmt', 'PetID',
       'PhotoAmt', 'AdoptionSpeed'],
      dtype='object')


#### Rename columns 

In [9]:
df.columns = ['Type', 'Age', 'Breed1', 'Breed2', 
              'GenderID', 
              'Color1', 'Color2','Color3', 
              'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
              'Sterilized', 'Health', 'Quantity', 'Fee', 
              'StateID', 'VideoAmt','PetID','PhotoAmt', 'AdoptionSpeed']


#### Get dummies 

In [11]:
# Loop over the categorical columns and one-hot encode each of them
# Ref.: https://github.com/arybressane/CEBD1260-BIG-DATA-ANALYTICS/blob/master/week6/classification-credit-card-sklearn-extended.ipynb

for col in [
    'Type', 'GenderID', 'MaturitySize','FurLength', 'Vaccinated', 
    'Dewormed','Sterilized', 'Health','StateID','Quantity','Fee'
    ]:
    
    df_dummies = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, df_dummies], axis=1)
    # Remove the original columns
    del df[col]
    
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
Age,36,3,7,3,1,3,2,8,1,6,...,3,1,2,24,6,1,5,2,1,1
Breed1,307,266,250,307,266,218,307,307,243,307,...,239,307,299,266,285,307,307,265,307,266
Breed2,0,0,252,0,0,0,0,0,245,0,...,307,0,0,0,265,0,0,0,0,0
Color1,2,6,1,2,1,3,1,6,1,2,...,1,1,4,2,6,2,5,3,2,1
Color2,7,7,2,0,6,5,7,0,2,7,...,2,2,6,3,7,5,0,6,7,2
Color3,0,0,0,0,7,0,0,0,7,0,...,0,0,0,7,0,7,0,7,0,0
VideoAmt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PetID,3f8824a3b,9238eb7fc,f0a1f2b90,7d028bdea,8377bfe97,965b31ba7,3760c73b1,f41a7de83,7b660c6af,f94c2a347,...,b09fa9385,b32339429,6dbb13673,9e7aa5866,9e11a3158,b05d0484f,b586e1ea4,13264f4f4,3422e4906,8991aff61
PhotoAmt,1,1,2,4,0,2,1,2,4,2,...,2,5,3,7,5,11,3,5,7,4
AdoptionSpeed,4,2,4,2,2,1,4,4,3,4,...,3,4,4,1,2,1,4,3,3,3


In [12]:
print(df.columns)

Index(['Age', 'Breed1', 'Breed2', 'Color1', 'Color2', 'Color3', 'VideoAmt',
       'PetID', 'PhotoAmt', 'AdoptionSpeed',
       ...
       'Fee_500', 'Fee_550', 'Fee_600', 'Fee_650', 'Fee_700', 'Fee_750',
       'Fee_800', 'Fee_1000', 'Fee_2000', 'Fee_3000'],
      dtype='object', length=132)


In [13]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
Age,36,3,7,3,1,3,2,8,1,6,...,3,1,2,24,6,1,5,2,1,1
Breed1,307,266,250,307,266,218,307,307,243,307,...,239,307,299,266,285,307,307,265,307,266
Breed2,0,0,252,0,0,0,0,0,245,0,...,307,0,0,0,265,0,0,0,0,0
Color1,2,6,1,2,1,3,1,6,1,2,...,1,1,4,2,6,2,5,3,2,1
Color2,7,7,2,0,6,5,7,0,2,7,...,2,2,6,3,7,5,0,6,7,2
Color3,0,0,0,0,7,0,0,0,7,0,...,0,0,0,7,0,7,0,7,0,0
VideoAmt,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PetID,3f8824a3b,9238eb7fc,f0a1f2b90,7d028bdea,8377bfe97,965b31ba7,3760c73b1,f41a7de83,7b660c6af,f94c2a347,...,b09fa9385,b32339429,6dbb13673,9e7aa5866,9e11a3158,b05d0484f,b586e1ea4,13264f4f4,3422e4906,8991aff61
PhotoAmt,1,1,2,4,0,2,1,2,4,2,...,2,5,3,7,5,11,3,5,7,4
AdoptionSpeed,4,2,4,2,2,1,4,4,3,4,...,3,4,4,1,2,1,4,3,3,3


In [14]:
X_columns = ['Age', 
             'Breed1', 'Breed2', 
             'Color1', 'Color2', 'Color3', 
             'Quantity_1', 
             'Fee_0', 
             #'VideoAmt', 
             #'PhotoAmt', 
             'Type_1', 'Type_2',
             'GenderID_1', 'GenderID_2', 'GenderID_3', 
             #'MaturitySize_1','MaturitySize_2', 'MaturitySize_3', 'MaturitySize_4', 
             'FurLength_1','FurLength_2', 'FurLength_3', 
             'Vaccinated_1', 'Vaccinated_2','Vaccinated_3', 
             'Dewormed_1', 'Dewormed_2', 'Dewormed_3',
             'Sterilized_1', 'Sterilized_2', 'Sterilized_3', 
             'Health_1', 'Health_2','Health_3',
             'StateID_41324', 'StateID_41325', 'StateID_41326', 'StateID_41327', 
             'StateID_41330', 'StateID_41332', 'StateID_41335', 'StateID_41336', 
             'StateID_41342', 'StateID_41345', 'StateID_41361',
             'StateID_41367', 'StateID_41401', 'StateID_41415'
            ]

# X_columns = [x for x in df.columns 
#              if x != 'AdoptionSpeed' and x != 'PetID']


y_column = ['AdoptionSpeed']

In [15]:
# list(X_columns)

# Model Training

In [16]:
# Ref.: https://github.com/arybressane/CEBD1260-BIG-DATA-ANALYTICS/blob/master/week6/kaggle_submission_pets.ipynb

In [17]:
# split the data

threshold = 0.8
X = df[X_columns]
y = df[y_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1.0-threshold, shuffle=True)

print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (8000, 42)
y_train (8000, 1)
X_test (2000, 42)
y_test (2000, 1)


In [18]:
# train a KNN Classifier
model = KNeighborsClassifier()
model.fit(X_train, y_train.values.ravel())
y_pred = model.predict(X_test)
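KNN is distance-based, so columns with a wide numeric range (such as the breed codes) can dominate the 0/1 dummy columns. One possible variation, not what the notebook does, is to standardize features inside a pipeline. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `scaled_knn` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features: one wide-range column and one 0/1 flag
X_demo = np.column_stack([rng.integers(0, 300, size=200),
                          rng.integers(0, 2, size=200)])
y_demo = X_demo[:, 1]  # the label depends only on the 0/1 flag

# The scaler is fitted on the training rows only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_demo[:150], y_demo[:150])
acc = scaled_knn.score(X_demo[150:], y_demo[150:])
print('held-out accuracy with scaling:', acc)
```

Scaling does not make nominal codes such as `Breed1` meaningful as distances; it only puts the columns on a comparable scale.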

# Model Training / Evaluation - Using Split

In [19]:
kappa = cohen_kappa_score(y_test, y_pred, weights ='quadratic')
print('kappa', round(kappa, 4))
print(confusion_matrix(y_test, y_pred))

kappa 0.2489
[[  5  21  11  10   6]
 [  9 166 133  60  50]
 [ 12 150 213  95  89]
 [  6  89 143  96  80]
 [  8 103 134 111 200]]
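The quadratic weighted kappa reported above can also be derived from the confusion matrix: disagreements are penalized by the squared distance between the true and predicted class, and the total is compared with what independent labels would give. A minimal sketch on made-up labels, checked against `cohen_kappa_score` (the helper `quadratic_kappa` is a hypothetical name):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix


def quadratic_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa computed from the confusion matrix."""
    observed = confusion_matrix(y_true, y_pred,
                                labels=list(range(n_classes))).astype(float)
    idx = np.arange(n_classes)
    # Penalty grows with the squared distance between class indices
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected counts if true and predicted labels were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


y_true_demo = [0, 1, 2, 3, 4, 4, 2, 1]
y_pred_demo = [0, 2, 2, 3, 3, 4, 1, 1]
manual = quadratic_kappa(y_true_demo, y_pred_demo)
reference = cohen_kappa_score(y_true_demo, y_pred_demo, weights='quadratic')
print(round(manual, 4), round(reference, 4))
```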


In [20]:
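# KNeighborsClassifier exposes neither feature_importances_ nor coef_,
# so for the model above this cell prints only a blank line.
importance = []
if hasattr(model, 'feature_importances_'):
    print('Feature Importance')
    for i in range(len(X_columns)):
        importance.append([X_columns[i], model.feature_importances_[i]])
    print(pd.DataFrame(importance).sort_values(by=1, ascending=False).head(10))
elif hasattr(model, 'coef_'):
    print('Feature Importance')
    # coef_ is 2-D (classes x features) for multi-class linear models:
    # average the absolute weights over classes to get one score per feature
    coef = np.abs(np.atleast_2d(model.coef_)).mean(axis=0)
    for i in range(len(X_columns)):
        importance.append([X_columns[i], coef[i]])
    print(pd.DataFrame(importance).sort_values(by=1, ascending=False).head(10))

print('')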
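Since `KNeighborsClassifier` has neither `feature_importances_` nor `coef_`, a model-agnostic alternative is permutation importance: shuffle one column at a time on held-out data and measure the drop in score. This is a sketch on synthetic data, not part of the original notebook; `X_demo`, `y_demo`, and `knn` are hypothetical names.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)  # only feature 0 carries signal

knn = KNeighborsClassifier().fit(X_demo[:200], y_demo[:200])
# Importance = mean drop in accuracy when a column is shuffled on held-out rows
result = permutation_importance(knn, X_demo[200:], y_demo[200:],
                                n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print('features ranked by permutation importance:', ranking)
```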





# Model Training / Evaluation - Cross Validation

In [21]:
k = 10
results = []
kf = KFold(n_splits=k)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    kappa = cohen_kappa_score(y_test, y_pred, weights ='quadratic')
    results.append(round(kappa, 4))

print('Kappa for each fold:', results)
print('AVG(kappa)', round(np.mean(results), 4))
print('STD(kappa)', round(np.std(results), 4))

Kappa for each fold: [0.23, 0.2117, 0.2241, 0.2453, 0.1458, 0.191, 0.1853, 0.2184, 0.2025, 0.2245]
AVG(kappa) 0.2079
STD(kappa) 0.0269
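The loop above uses plain `KFold` without shuffling. A possible variation, shown here as a sketch on synthetic data rather than the pet dataset, is to keep class proportions similar across folds with `StratifiedKFold` and let `cross_val_score` run the loop with a quadratic-kappa scorer. The names `X_demo`, `y_demo`, and `qwk` are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
# Synthetic ordinal target with classes 0..4, loosely tied to the first feature
y_demo = np.clip(np.round(X_demo[:, 0] + 2), 0, 4).astype(int)

# Wrap the metric so cross_val_score can call it with weights='quadratic'
qwk = make_scorer(cohen_kappa_score, weights='quadratic')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=cv, scoring=qwk)
print('kappa per fold:', np.round(scores, 4))
print('AVG(kappa)', round(scores.mean(), 4))
```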


# Prepare submission

In [22]:
del df_test_true['Description']
del df_test_true['RescuerID']
del df_test_true['Name']
df_test_true.columns

Index(['Type', 'Age', 'Breed1', 'Breed2', 'Gender', 'Color1', 'Color2',
       'Color3', 'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
       'Sterilized', 'Health', 'Quantity', 'Fee', 'State', 'VideoAmt', 'PetID',
       'PhotoAmt'],
      dtype='object')

In [23]:
df_test_true.columns = ['Type', 'Age', 'Breed1', 'Breed2', 
              'GenderID', 
              'Color1', 'Color2','Color3', 
              'MaturitySize', 'FurLength', 'Vaccinated', 'Dewormed',
              'Sterilized', 'Health', 'Quantity', 'Fee', 
              'StateID', 'VideoAmt','PetID','PhotoAmt']

In [24]:
for col in [
    'Type', 'GenderID', 'MaturitySize','FurLength', 'Vaccinated', 
    'Dewormed','Sterilized', 'Health','StateID','Quantity','Fee'
    ]:
    
    df_dummies = pd.get_dummies(df_test_true[col], prefix=col)
    df_test_true = pd.concat([df_test_true, df_dummies], axis=1)
    # Remove the original columns
    del df_test_true[col]
    
df_test_true.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4983,4984,4985,4986,4987,4988,4989,4990,4991,4992
Age,1,6,2,10,12,1,4,3,2,2,...,4,5,1,18,3,48,3,12,5,16
Breed1,266,307,307,128,307,266,266,307,307,265,...,307,266,266,266,307,189,307,266,266,266
Breed2,266,0,0,0,0,0,0,0,0,0,...,307,0,0,0,0,307,0,0,0,0
Color1,1,2,3,7,2,1,2,1,1,4,...,2,7,1,6,2,1,5,7,1,1
Color2,7,0,0,0,0,2,3,0,2,7,...,0,0,7,7,7,0,7,0,4,7
Color3,0,0,0,0,0,0,7,0,0,0,...,0,0,0,0,0,0,0,0,5,0
VideoAmt,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
PetID,f42161740,0118db3a8,e5164d828,5335bfb38,ff2cf88a0,1d13441b9,7d835cf7c,577d15fea,91736f444,db194aec8,...,878e36da4,46cf6356e,ef50eab33,a40116bf1,7ed14130d,3b2b69d20,360215e23,0c612e8df,9b1241ef0,ffe7f0b70
PhotoAmt,10,2,2,0,2,1,9,1,4,6,...,3,1,4,2,2,4,3,1,2,5
Type_1,0,1,1,1,1,0,0,1,1,0,...,1,0,0,0,1,1,1,0,0,0


In [25]:
print(df_test_true.columns)

Index(['Age', 'Breed1', 'Breed2', 'Color1', 'Color2', 'Color3', 'VideoAmt',
       'PetID', 'PhotoAmt', 'Type_1',
       ...
       'Fee_390', 'Fee_400', 'Fee_450', 'Fee_500', 'Fee_550', 'Fee_599',
       'Fee_600', 'Fee_688', 'Fee_700', 'Fee_750'],
      dtype='object', length=115)
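The test frame ends up with 115 columns while the training frame has 132, because `get_dummies` only creates columns for values that actually occur (for example `Fee_599` appears here but not in the training output). The hand-picked `X_columns` happen to exist in both frames. A more general way to guard against mismatches, sketched on toy data, is to reindex the test dummies onto the training columns:

```python
import pandas as pd

train_demo = pd.DataFrame({'Fee': [0, 50, 100]})
test_demo = pd.DataFrame({'Fee': [0, 599]})  # 599 is unseen; 50 and 100 are absent

train_dummies = pd.get_dummies(train_demo['Fee'], prefix='Fee')
test_dummies = pd.get_dummies(test_demo['Fee'], prefix='Fee')

# Force the test frame onto the training columns: unseen levels are dropped,
# levels absent from the test set are added as all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```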


In [26]:
print(X_columns)

['Age', 'Breed1', 'Breed2', 'Color1', 'Color2', 'Color3', 'Quantity_1', 'Fee_0', 'Type_1', 'Type_2', 'GenderID_1', 'GenderID_2', 'GenderID_3', 'FurLength_1', 'FurLength_2', 'FurLength_3', 'Vaccinated_1', 'Vaccinated_2', 'Vaccinated_3', 'Dewormed_1', 'Dewormed_2', 'Dewormed_3', 'Sterilized_1', 'Sterilized_2', 'Sterilized_3', 'Health_1', 'Health_2', 'Health_3', 'StateID_41324', 'StateID_41325', 'StateID_41326', 'StateID_41327', 'StateID_41330', 'StateID_41332', 'StateID_41335', 'StateID_41336', 'StateID_41342', 'StateID_41345', 'StateID_41361', 'StateID_41367', 'StateID_41401', 'StateID_41415']


In [27]:
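# Note: `model` is used as last fitted in the cross-validation loop above,
# i.e. on the training rows of the final fold, not on the full training set
df_prediction = df_test_true[X_columns]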
df_test_true['AdoptionSpeed'] = model.predict(df_prediction)
df_test_true[['PetID', 'AdoptionSpeed']]

Unnamed: 0,PetID,AdoptionSpeed
0,f42161740,2
1,0118db3a8,4
2,e5164d828,2
3,5335bfb38,1
4,ff2cf88a0,3
5,1d13441b9,2
6,7d835cf7c,2
7,577d15fea,3
8,91736f444,4
9,db194aec8,1


In [28]:
df_test_true[
    ['PetID', 'AdoptionSpeed']].to_csv('submission_classification_modified.csv', index=False)