# Introduction

In this competition you will predict the speed at which a pet is adopted, based on the pet's listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data includes text, tabular, and image data. See below for details. 
This is a Kernels-only competition. At the end of the competition, test data will be replaced in their entirety with new data of approximately the same size, and your kernels will be rerun on the new data.

In [None]:
import numpy as np 
import pandas as pd 
import json
import os
import time
import random

import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, cohen_kappa_score


# File descriptions

* train.csv - Tabular/text data for the training set
* test.csv - Tabular/text data for the test set
* sample_submission.csv - A sample submission file in the correct format
* breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
* color_labels.csv - Contains ColorName for each ColorID
* state_labels.csv - Contains StateName for each StateID

In [None]:
train_df = pd.read_csv('../input/train/train.csv')
test_df = pd.read_csv('../input/test/test.csv')

breed_labels = pd.read_csv('../input/breed_labels.csv')
color_labels = pd.read_csv('../input/color_labels.csv')
state_labels = pd.read_csv('../input/state_labels.csv')

train_sen = os.listdir('../input/train_sentiment/')
train_meta = os.listdir('../input/train_metadata/')

test_sen = os.listdir('../input/test_sentiment/')
test_meta = os.listdir('../input/test_metadata/')

train = 'train'
test = 'test'

print(train_df.shape)
print(test_df.shape)
print(len(train_sen))
print(len(test_sen))
print(len(train_meta))
print(len(test_meta))


# Data Fields

* PetID - Unique hash ID of pet profile
* AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
* Type - Type of animal (1 = Dog, 2 = Cat)
* Name - Name of pet (Empty if not named)
* Age - Age of pet when listed, in months
* Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
* Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
* Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
* Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
* Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
* Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
* MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
* FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
* Quantity - Number of pets represented in profile
* Fee - Adoption fee (0 = Free)
* State - State location in Malaysia (Refer to StateLabels dictionary)
* RescuerID - Unique hash ID of rescuer
* VideoAmt - Total uploaded videos for this pet
* PhotoAmt - Total uploaded photos for this pet
* Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

In [None]:
train_df.head()

In [None]:
breed_labels.head()

In [None]:
state_labels.head()

In [None]:
color_labels.head()

Now let's take a look at how many animals of each type are listed for adoption.

In [None]:
train_df['Type'].value_counts().plot.bar()

So the values are 1 and 2. From the Data Fields list above, *1 --> Dogs* and *2 --> Cats*.

# New Features

# Sentiment Data
They have run each pet profile's description through Google's Natural Language API, providing analysis on sentiment and key entities. You may optionally utilize this supplementary information for your pet description analysis. There are some descriptions that the API could not analyze. As such, there are fewer sentiment files than there are rows in the dataset.

File name format is PetID.json.

Google Natural Language API reference: https://cloud.google.com/natural-language/docs/basics

I'm extracting two features from the **Sentiment Data**:

   * Sentiment Score
   * Sentiment Magnitude

After extracting these features I add two new columns (sen_score, sen_mag) to both the train and test data.
   

In [None]:
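def sen_score(df, sen_source, test_train):
    """Return the document-level sentiment score for each PetID (0 if no file)."""
    sen = []
    for i in df['PetID']:
        a = i + '.json'
        if a in sen_source:
            x = '../input/%s_sentiment/%s' % (test_train, a)
            with open(x, 'r') as f:
                sentiment = json.load(f)
            y = sentiment['documentSentiment']['score']
        else:
            # The API could not analyze this description, so fall back to 0.
            y = 0
        sen.append(y)
    return sen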
    
train_df['sen_score'] = sen_score(train_df, train_sen, train)
test_df['sen_score'] = sen_score(test_df, test_sen, test)

In [None]:
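def sen_mag(df, sen_source, test_train):
    """Return the document-level sentiment magnitude for each PetID (0 if no file)."""
    sen = []
    for i in df['PetID']:
        a = i + '.json'
        if a in sen_source:
            x = '../input/%s_sentiment/%s' % (test_train, a)
            with open(x, 'r') as f:
                sentiment = json.load(f)
            y = sentiment['documentSentiment']['magnitude']
        else:
            # The API could not analyze this description, so fall back to 0.
            y = 0
        sen.append(y)
    return sen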
    
train_df['sen_mag'] = sen_mag(train_df, train_sen, train)
test_df['sen_mag'] = sen_mag(test_df, test_sen, test)
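
The two functions above differ only in the key they read, so each sentiment file is opened twice. Here is a minimal sketch of a single pass that returns both values. The name `load_sentiment` and its `sentiment_dir` and `default` arguments are hypothetical (they are not part of the competition data or any library), and the sketch assumes every file has a `documentSentiment` entry with `score` and `magnitude`, as the functions above do.

```python
import json
import os


def load_sentiment(pet_ids, sentiment_dir, default=0.0):
    """Return (scores, magnitudes) lists aligned with pet_ids.

    sentiment_dir is a hypothetical argument: the folder holding PetID.json files.
    """
    # A set makes the "is there a file for this pet?" check fast.
    available = set(os.listdir(sentiment_dir))
    scores, magnitudes = [], []
    for pet_id in pet_ids:
        name = pet_id + '.json'
        if name in available:
            with open(os.path.join(sentiment_dir, name), 'r') as f:
                doc = json.load(f)['documentSentiment']
            scores.append(doc['score'])
            magnitudes.append(doc['magnitude'])
        else:
            # No sentiment file for this description: use the default.
            scores.append(default)
            magnitudes.append(default)
    return scores, magnitudes
```

With the notebook's paths this could be called as `load_sentiment(train_df['PetID'], '../input/train_sentiment/')`, and the two returned lists assigned to `sen_score` and `sen_mag`.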

In [None]:
train_df['sen_score'].plot.hist()


In [None]:
train_df['sen_mag'].plot.hist()

# Image Metadata

They have run the images through Google's Vision API, providing analysis on Face Annotation, Label Annotation, Text Annotation and Image Properties. You may optionally utilize this supplementary information for your image analysis.

File name format is **PetID-ImageNumber.json**.

Some properties will not exist in JSON file if not present, i.e. Face Annotation. Text Annotation has been simplified to just 1 entry of the entire text description (instead of the detailed JSON result broken down by individual characters and words). Phone numbers and emails are already anonymized in Text Annotation.

Google Vision API reference: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate

As listed above, the file name is in the format PetID-ImageNumber, so we should check how many profiles have zero images.

In [None]:
train_df['PhotoAmt'].value_counts().plot.bar()

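It seems like every profile has at least 1 image, so we are good to go. (The functions below still fall back to 0 when a pet's first-image metadata file is missing.)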

Now from the metadata I've extracted the following features (*for beginners' sake I've only used the first image of each pet*):

   * Red
   * Green
   * Blue
   * Score
   * PixelFraction
   * Vertex X
   * Vertex Y
   * Confidence

Then I made a column for each of them in both the training and testing data.

In [None]:
def meta_red(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['red']
        else:
            y = 0

        meta.append(y)
    return meta
 
train_df['meta_red'] = meta_red(train_df, train_meta, train)
test_df['meta_red'] = meta_red(test_df, test_meta, test)



In [None]:
def meta_green(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['green']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_green'] = meta_green(train_df, train_meta, train)
test_df['meta_green'] = meta_green(test_df, test_meta, test)

In [None]:
def meta_blue(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['imagePropertiesAnnotation']['dominantColors']['colors'][0]['color']['blue']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_blue'] = meta_blue(train_df, train_meta, train)
test_df['meta_blue'] = meta_blue(test_df, test_meta, test)

In [None]:
def meta_score(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['imagePropertiesAnnotation']['dominantColors']['colors'][0]['score']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_score'] = meta_score(train_df, train_meta, train)
test_df['meta_score'] = meta_score(test_df, test_meta, test)

In [None]:
def meta_pixelfraction(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['imagePropertiesAnnotation']['dominantColors']['colors'][0]['pixelFraction']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_pixelfraction'] = meta_pixelfraction(train_df, train_meta, train)
test_df['meta_pixelfraction'] = meta_pixelfraction(test_df, test_meta, test)

In [None]:
def meta_ver_x(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][1]['x']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_ver_x'] = meta_ver_x(train_df, train_meta, train)
test_df['meta_ver_x'] = meta_ver_x(test_df, test_meta, test)

In [None]:
def meta_ver_y(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['cropHintsAnnotation']['cropHints'][0]['boundingPoly']['vertices'][3]['y']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_ver_y'] = meta_ver_y(train_df, train_meta, train)
test_df['meta_ver_y'] = meta_ver_y(test_df, test_meta, test)

In [None]:
def meta_conf(df, meta_source, train_test = 'train_test'):    
    meta = []
    for i in df['PetID']:
        a = i+'-1.json'
        if a in meta_source:
                x = '../input/%s_metadata/%s' % (train_test, a)
                with open(x, 'r') as f:
                    meta_r = json.load(f)
                    y = meta_r['cropHintsAnnotation']['cropHints'][0]['confidence']
        else:
            y = 0

        meta.append(y)
    return meta

train_df['meta_conf'] = meta_conf(train_df, train_meta, train)
test_df['meta_conf'] = meta_conf(test_df, test_meta, test)
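
The eight functions above share the same loop and differ only in the JSON path they read. As noted earlier, some properties are left out of a JSON file when they are not present, and in that case a direct lookup raises a `KeyError`. Below is a minimal sketch of one helper that reads the first image's file once and returns all eight values, falling back to 0 for anything missing. The names `dig`, `META_PATHS`, `first_image_features` and the `meta_dir` argument are hypothetical; the paths themselves are copied from the functions above.

```python
import json
import os


def dig(obj, path, default=0):
    """Follow keys/indices into nested JSON; return default if any step is missing."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Column name -> path inside the metadata JSON (same paths as the functions above).
_COLOR = ['imagePropertiesAnnotation', 'dominantColors', 'colors', 0]
_CROP = ['cropHintsAnnotation', 'cropHints', 0]
META_PATHS = {
    'meta_red': _COLOR + ['color', 'red'],
    'meta_green': _COLOR + ['color', 'green'],
    'meta_blue': _COLOR + ['color', 'blue'],
    'meta_score': _COLOR + ['score'],
    'meta_pixelfraction': _COLOR + ['pixelFraction'],
    'meta_ver_x': _CROP + ['boundingPoly', 'vertices', 1, 'x'],
    'meta_ver_y': _CROP + ['boundingPoly', 'vertices', 3, 'y'],
    'meta_conf': _CROP + ['confidence'],
}


def first_image_features(pet_id, meta_dir):
    """Return a dict of the eight features for PetID-1.json (all 0 if the file is absent)."""
    path = os.path.join(meta_dir, pet_id + '-1.json')
    if not os.path.exists(path):
        return {name: 0 for name in META_PATHS}
    with open(path, 'r') as f:
        meta = json.load(f)
    return {name: dig(meta, keys) for name, keys in META_PATHS.items()}
```

A list of these dicts, one per PetID, can be turned into columns with `pd.DataFrame(...)` and joined to `train_df` or `test_df`.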

### Our New Dataset

In [None]:
train_df.head()

## Age

The age of each pet is listed in months, so we convert it to whole years (integer division by 12, so any pet under 12 months gets 0).

In [None]:
train_df['Age_in_yrs'] = [i//12 for i in train_df['Age'] ]
train_df['Age_in_yrs'].value_counts().plot.bar()

In [None]:
test_df['Age_in_yrs'] = [i//12 for i in test_df['Age']]
test_df['Age_in_yrs'].value_counts().plot.bar()

Done... Let's check out our new df.

In [None]:
train_df.head()

## Cross Breed

Hmm... everything seems fine. Can we get more insight? We have two features called Breed1 and Breed2, so let's create a new feature to check whether a given pet is a cross breed or not.

In [None]:
def Cross(df):
    # 1 when both Breed1 and Breed2 are set (non-zero), else 0.
    df['Cross_Y/N'] = ((df['Breed1'] != 0) & (df['Breed2'] != 0)).astype(int)
    return df

train_df = Cross(train_df)
test_df  = Cross(test_df)

 

In [None]:
def cross(df):    
    cross = []
    a = 0
    for i in df['Breed1']:
        cross.append(i*(df['Breed2'][a]+1))
        a += 1
    return cross

train_df['Cross_BreedScore'] = cross(train_df)
test_df['Cross_BreedScore']  = cross(test_df)
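
To make the two breed features above concrete, here is a small worked example in plain Python. The breed IDs are made up for illustration and the helper names `cross_flag` and `cross_breed_score` are hypothetical; the formulas are the ones used in `Cross` and `cross`. One thing the example shows is that different breed pairs can land on the same Cross_BreedScore, so the score is not a unique identifier for a pair.

```python
def cross_flag(breed1, breed2):
    """1 when both breed IDs are non-zero, else 0 (same rule as Cross_Y/N)."""
    return int(breed1 != 0 and breed2 != 0)


def cross_breed_score(breed1, breed2):
    """Breed1 * (Breed2 + 1), the same formula as Cross_BreedScore."""
    return breed1 * (breed2 + 1)


# (Breed1, Breed2) pairs; the IDs are illustrative only.
examples = [(307, 0), (265, 307), (6, 1), (4, 2)]
for b1, b2 in examples:
    print(b1, b2, cross_flag(b1, b2), cross_breed_score(b1, b2))
```

When Breed2 is 0 the score is simply Breed1, and the pairs (6, 1) and (4, 2) both give 12.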

Our new df

In [None]:
train_df.head()

# Missing values

Okay, now let's check whether there are any null values in our df. 

In [None]:
train_df.isnull().sum().plot.bar()

From the graph above it is clear that the Name and Description columns have some null values. Name is not an important feature, so we can drop it, and the sentiment analysis has already given us the important features from Description. So we can drop 'em both. 

Like Name, RescuerID and PetID are not important features; rather, they could affect our model in a bad way.
As we have created synthetic features for Age, Breed1 & Breed2, we can count them out too.

In [None]:
drop = ['Name', 'Age', 'Breed1','Breed2', 'RescuerID', 'PetID', 'Description']
train = train_df.drop(drop, axis = 1)
test  = test_df.drop(drop, axis = 1)

X = train.drop(['AdoptionSpeed'], axis = 1)
y = train.AdoptionSpeed

# AdoptionSpeed

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
* 0 - Pet was adopted on the same day as it was listed. 
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

In [None]:
y.value_counts().plot.bar()

Checking our preprocessed df.

In [None]:
test.head()

In [None]:
train.head()

# Model Building

In [None]:
def train_model(clf, X, y, val_x, val_y):
    model = clf
    model.fit(X, y)
    pred = model.predict(val_x)
    # scikit-learn metrics take (y_true, y_pred) in that order.
    score = accuracy_score(val_y, pred)
    ck_score = cohen_kappa_score(val_y, pred)
    # 10-fold cross-validated accuracy on the data passed in as X, y.
    cross_score = cross_val_score(clf, X, y, scoring='accuracy', cv=10)

    print("Model : %s" % clf)
    print("Accuracy : %s" % score)
    print("Cohen_Kappa : %s" % ck_score)
    print("Cross_Val_Score : %s" % cross_score.mean())
    

Splitting our data into training and validation sets

In [None]:
#Train-Test Split

train_X , val_X, train_y, val_y = train_test_split(X, y, random_state = 2, test_size = 0.2)


I will run my current data through three models, pick the best one, and tune it for my final model. The models that I will use are

* RandomForestClassifier
* AdaBoostClassifier
* XGBClassifier

In [None]:
clf1 = RandomForestClassifier()
clf2 = AdaBoostClassifier()
clf3 = xgb.XGBClassifier()

In [None]:
train_model(clf1, train_X, train_y, val_X, val_y)


In [None]:
train_model(clf2, train_X, train_y, val_X, val_y)

In [None]:
train_model(clf3, train_X, train_y, val_X, val_y)

After the evaluation, XGBClassifier outperforms the other two classifiers, so I'll go with XGB.

# Tuning model

You can tune the parameters for a better model. I've done a small example below of how to tune them. You can save time and get a more accurate model if you use a search method such as GridSearchCV. If you want a better tutorial on that, please let me know in the comments.

In [None]:
# You can try different combinations and check each score until you are satisfied.
model = xgb.XGBClassifier(
    base_score=0.5, booster='gbtree', colsample_bylevel=1,
    colsample_bytree=1, gamma=0.2, learning_rate=0.1, max_delta_step=0,
    max_depth=5, min_child_weight=1, missing=None, n_estimators=140,
    n_jobs=1, nthread=4, objective='multi:softprob', random_state=42,
    reg_alpha=0.001, reg_lambda=1, scale_pos_weight=1, seed=27,
    silent=True, subsample=1)

In [None]:
model.fit(train_X, train_y)

In [None]:
pred = model.predict(val_X)
score = cohen_kappa_score(pred, val_y)
acc = accuracy_score(pred, val_y) 

print("Cohen Kappa : %s" % score) # 0.2175871739419205
print("Accuracy : %s" % acc) # 0.4171390463487829
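
The kappa printed above is the plain (unweighted) Cohen's kappa, which treats every wrong class as equally wrong. To my knowledge the competition leaderboard uses the quadratic weighted variant, which penalises a prediction more the further it is from the true AdoptionSpeed. scikit-learn's `cohen_kappa_score` accepts `weights='quadratic'` for this. The sketch below uses toy labels invented for illustration, not results from this notebook.

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 2, 3, 4, 4]
near = [1, 0, 1, 2, 3, 4, 3]   # two mistakes, each off by one class
far = [4, 0, 1, 2, 3, 4, 0]    # two mistakes, each off by four classes

# Unweighted kappa: only counts whether a prediction is right or wrong.
plain_near = cohen_kappa_score(y_true, near)
plain_far = cohen_kappa_score(y_true, far)

# Quadratic weighted kappa: the penalty grows with the squared distance.
qwk_near = cohen_kappa_score(y_true, near, weights='quadratic')
qwk_far = cohen_kappa_score(y_true, far, weights='quadratic')

print("plain    near/far:", plain_near, plain_far)
print("weighted near/far:", qwk_near, qwk_far)
```

Both toy predictions get five of seven labels right, yet the weighted score is much lower for the one whose mistakes are far from the true class.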

In [None]:
n = random.randint(0, 100) # just for checking how my model works
print(list(val_y)[n])
print(list(pred)[n])

In [None]:
cross_val_score(model, X, y, scoring = 'accuracy', cv= 10).mean() #0.39845585964987384
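
The GridSearchCV approach mentioned at the start of this section could look roughly like the sketch below. To keep it small and quick it runs on synthetic data from `make_classification` and uses a RandomForestClassifier rather than the XGBClassifier tuned above. The grid values are arbitrary examples, not recommendations. With the notebook's data you would pass `train_X` and `train_y` instead of the `X_demo` and `y_demo` placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 5-class data standing in for the real features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=5, random_state=0)

# Every combination in this grid is cross-validated.
param_grid = {'n_estimators': [10, 20], 'max_depth': [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='accuracy',  # same metric as the cross_val_score calls above
    cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)  # best combination found in the grid
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` is the model refit on all the data passed to `fit` using the best combination.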

# Final Model

Once you are satisfied with your tuning results, you can fit that model on the entire dataset and predict the result.

In [None]:
model.fit(X, y)

In [None]:
result = model.predict(test)
result

# Submission

Ahh... We have trained a model and successfully obtained predictions from it, and it's time to submit them.

In [None]:
submission = pd.DataFrame({'PetID' : test_df.PetID, 'AdoptionSpeed' : result })
submission.to_csv('submission.csv', index=False)

In [None]:
submission['AdoptionSpeed'].value_counts().plot.bar()

I am a newbie, so if there are any corrections please give them in the comments so that I can improve. And if you like the kernel please upvote it ;)