This competition is interesting because several types of data are available: tabular data with general information about each pet, plus images, sentiment (text) and image metadata. I think the best option for the competition is to combine them all, but as a first approach I decided to use only the general (tabular) information about the pets.

In [None]:
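#packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# Submodules are imported explicitly so that sklearn.ensemble etc. resolve below
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import SMOTE
import imblearn
import imblearn.ensemble
import seaborn as sns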

## Dataset Description

In [None]:
#Reading 
train = pd.read_csv('../input/train/train.csv')
test = pd.read_csv('../input/test/test.csv')
print(train.head())
train.describe()

We can see that the target variable is **AdoptionSpeed**. The other variables are the following:

*Data Fields*

**PetID** - Unique hash ID of pet profile

**AdoptionSpeed** - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.

**Type** - Type of animal (1 = Dog, 2 = Cat)

**Name** - Name of pet (Empty if not named)

**Age** - Age of pet when listed, in months

**Breed1** - Primary breed of pet (Refer to BreedLabels dictionary)

**Breed2** - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)

**Gender** - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)

**Color1** - Color 1 of pet (Refer to ColorLabels dictionary)

**Color2** - Color 2 of pet (Refer to ColorLabels dictionary)

**Color3** - Color 3 of pet (Refer to ColorLabels dictionary)

**MaturitySize** - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)

**FurLength** - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)

**Vaccinated** - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)

**Dewormed** - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)

**Sterilized** - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)

**Health** - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)

**Quantity** - Number of pets represented in profile

**Fee** - Adoption fee (0 = Free)

**State** - State location in Malaysia (Refer to StateLabels dictionary)

**RescuerID** - Unique hash ID of rescuer

**VideoAmt** - Total uploaded videos for this pet

**PhotoAmt** - Total uploaded photos for this pet

**Description** - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

The very first thing every data scientist has to ask is: are there any missing values, or is our data complete?
A quick check shows that there are missing values in **Name** and **Description**. In this kernel we will not look at the description, so we simply drop that column. As for **Name**: is the **Name** important?

In [None]:
print(train.isnull().sum(0))
test.isnull().sum(0)
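
One way to probe the question about **Name** is to turn the missing values into a binary flag and compare the target distribution for named and unnamed pets. The sketch below uses a tiny made-up frame (the `toy` data and the `HasName` column are illustrative assumptions, not part of the dataset); the same two lines could be applied to `train`.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration only.
toy = pd.DataFrame({
    "Name": ["Milo", None, "Luna", None, "Rex", "Kiki"],
    "AdoptionSpeed": [1, 4, 2, 4, 0, 3],
})

# HasName (hypothetical helper column): 1 if the profile has a non-empty name.
toy["HasName"] = toy["Name"].fillna("").str.strip().ne("").astype(int)

# Share of each adoption speed within the unnamed (0) and named (1) groups.
name_vs_speed = pd.crosstab(toy["HasName"], toy["AdoptionSpeed"], normalize="index")
print(name_vs_speed)
```

If the two rows of the table look alike on the real data, the name is probably not worth keeping as a feature.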

## Mutual Information (feature importance)

Let's look at how informative each feature is from a statistical point of view. We use mutual information, a non-negative score that measures how much each variable depends on the target variable: 0 means the two are independent, and larger values mean a stronger dependence. Notice that this measure is intrinsic to the data, so it does not have to match the behaviour of a particular classifier. We compute it for both the classification case and the regression case.

**To be exact**: we are tackling an ordinal regression problem, i.e. a classification problem where the classes are ordered. It is not exactly either classification or regression.
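
As a small sanity check of how the mutual information score behaves, the sketch below runs `mutual_info_classif` on synthetic data (not the competition data): one feature is a copy of a balanced 5-class target, the other is unrelated noise. Passing `discrete_features=True` is a choice made for this toy example because both features are integer codes; the cells in this kernel use the default setting.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = np.tile(np.arange(5), 200)            # balanced 5-class target, 1000 rows
X = np.column_stack([
    y,                                    # feature that fully determines the target
    rng.integers(0, 5, size=y.size),      # feature unrelated to the target
])

# Both columns are integer codes, so they are treated as discrete here.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=123)
print(dict(zip(["copy_of_target", "noise"], mi.round(3))))
print("entropy of the target (nats):", round(float(np.log(5)), 3))
```

The score of the copied feature matches the entropy of the target, which is above 1 for five balanced classes, while the noise feature stays close to 0.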

In [None]:
target = train.loc[:,'AdoptionSpeed']
subtrain = train.drop(columns=['Name', 'RescuerID', 'Description', 'PetID', 'AdoptionSpeed'])
MI_clas = mutual_info_classif(subtrain, target, random_state = 123)
MI_reg = mutual_info_regression(subtrain, target, random_state = 123)
idx_clas = np.argsort(MI_clas)[::-1]
idx_reg = np.argsort(MI_reg)[::-1]
column_names = subtrain.columns.values
MI = pd.DataFrame(data = {'Column_clas': column_names[idx_clas], 'MI_clas': MI_clas[idx_clas], 'Column_reg': column_names[idx_reg], 'MI_reg': MI_reg[idx_reg] })


In [None]:
#Show mutual information
MI

Statistically, the most significant features are **Age**, **Breed1**, **Sterilized**, **State** and **PhotoAmt**. On the other hand, **Health**, **Gender**, **Color1**, **VideoAmt** and **Color3** are the least significant. These results are plausible and may well reflect reality. We cannot conclude anything yet; this is only a first intuition.

## Is the dataset balanced?

In [None]:
# Plot target variables
target.value_counts().sort_index(ascending=False).plot(kind='barh',figsize=(15,6))
plt.title('Adoption Speed (Target Variable)', fontsize='xx-large')

Clearly, the dataset is imbalanced: class 0 is the minority class. This could hurt us because the model will tend to avoid predicting class 0. We can address the problem with techniques such as SMOTE or bagging. Let's see how to balance the data with SMOTETomek (it creates synthetic minority-class samples with SMOTE and cleans the data by removing Tomek links).

In [None]:
sme = SMOTETomek(random_state=42)
X_res, y_res = sme.fit_resample(subtrain, target.values)

In [None]:
y_res.shape
unique, counts = np.unique(y_res, return_counts=True)
count = dict(zip(unique, counts))
plt.bar(range(len(count)), list(count.values()), align='center')
plt.xticks(range(len(count)), list(count.keys()))
plt.title('Balance dataset with SMOTETomek')

In [None]:
MI_clas = mutual_info_classif(X_res, y_res, random_state = 123)
MI_reg = mutual_info_regression(X_res, y_res, random_state = 123)
idx_clas = np.argsort(MI_clas)[::-1]
idx_reg = np.argsort(MI_reg)[::-1]
column_names = subtrain.columns.values
MI = pd.DataFrame(data = {'Column_clas': column_names[idx_clas], 'MI_clas': MI_clas[idx_clas], 'Column_reg': column_names[idx_reg], 'MI_reg': MI_reg[idx_reg] })

In [None]:
MI

With the balanced dataset, **Age** and **PhotoAmt** still seem important, but the **Color** features seem to have gained importance.

## Random Forest classification

Let's try a Random Forest classifier. In a future kernel I want to explore regression and ordinal regression.

In [None]:
# Columns that hold category codes rather than quantities
cat_cols = ["Breed1", "Breed2", "Gender", "Color1", "Color2", "Color3", "MaturitySize",
            "FurLength", "Vaccinated", "Dewormed", "Sterilized", "Health", "State"]
subtrain[cat_cols] = subtrain[cat_cols].astype('category')
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(subtrain, target, test_size=0.33, random_state=42)

clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators=200, max_depth=20,
                             random_state=0)
clf1.fit(X_train, y_train)
pred = clf1.predict(X_test)
rf = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF classifier: ', rf)

In [None]:
importance = clf1.feature_importances_
idx = np.argsort(importance)[::-1]
column_names = subtrain.columns.values
MI = pd.DataFrame(data = {'Column_clas': column_names[idx], 'RF_importance': importance[idx]})
MI

On the imbalanced dataset regression did not work well for us, so classification seems the better choice. Furthermore, the RF feature importances tell us which variables the classifier relies on most. Et voilà, **Age** and **PhotoAmt** are the most important, as we guessed. However, the color features also turn out to be important. This raises a question: how good would the Random Forest be with only **Age** and **PhotoAmt**?

In [None]:
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=100, max_depth=20,
                             random_state=0)
clf.fit(X_train.loc[:, ['Age', 'PhotoAmt']], y_train)
pred = clf.predict(X_test.loc[:, ['Age', 'PhotoAmt']])
print('RF using only Age and PhotoAmt: ', sklearn.metrics.cohen_kappa_score(y_test, pred))

The quality of the model has decreased, so although these two features are important, they are not the only ones needed for an accurate prediction.

## Balanced Random Forest (SMOTETomek, SMOTEENN and SMOTE)

SMOTE is a method for imbalanced datasets. It creates synthetic samples of the minority classes so that all classes end up balanced. Variants such as SMOTETomek and SMOTEENN additionally remove noise. In this case, SMOTEENN is not a very good option because it removes too many samples that it considers noise, which unbalances the data again.
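
To make the idea of "synthetic samples" concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE. It is an illustration under simplifying assumptions, not the `imblearn` implementation: the function name `smote_like_samples`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances inside the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours of each point

    base = rng.integers(0, len(X_min), size=n_new)             # randomly chosen base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base point
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two numeric features (made-up values).
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.0]])
X_new = smote_like_samples(X_min, n_new=6)
print(X_new.round(2))
```

Each new point lies on the segment between a minority sample and one of its neighbours. Note that interpolating integer-coded categories such as **Breed1** yields fractional codes; `imblearn` also offers `SMOTENC` for mixed numeric and categorical data, which may be worth trying.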

In [None]:
sme = SMOTETomek(random_state=42)
X_train_B, y_train_B = sme.fit_resample(X_train, y_train)

clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators=200, max_depth=30,
                             random_state=0)
clf1.fit(X_train_B, y_train_B)
pred = clf1.predict(X_test)
smotetomek = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF-SMOTETomek classifier: ', smotetomek)

In [None]:
sme = SMOTEENN(random_state=42)
X_train_B, y_train_B = sme.fit_resample(X_train, y_train)

clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators=200, max_depth=30,
                             random_state=0)
clf1.fit(X_train_B, y_train_B)
pred = clf1.predict(X_test)
smoteenn = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF-SMOTEENN classifier: ', smoteenn)

In [None]:
sme = SMOTE(random_state=42)
X_train_B, y_train_B = sme.fit_resample(X_train, y_train)

clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators=200, max_depth=30,
                             random_state=0)
clf1.fit(X_train_B, y_train_B)
pred = clf1.predict(X_test)
smote = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF-SMOTE  classifier: ', smote)

## Balanced Random Forest (undersampling)

This is a modified version of RF that balances the data in the bootstrap sample of each tree by undersampling.
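
The per-tree balancing can be pictured with a short NumPy sketch. It only illustrates the idea of drawing the same number of rows from every class; it is not how `imblearn` implements it internally, and `balanced_bootstrap_indices` and the toy labels are hypothetical.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw a bootstrap sample with the same number of rows for every class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                           # size of the smallest class
    parts = [
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=True)
        for c in classes
    ]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
y = np.array([0] * 5 + [1] * 40 + [2] * 55)        # made-up imbalanced labels
idx = balanced_bootstrap_indices(y, rng)
values, counts_after = np.unique(y[idx], return_counts=True)
print(dict(zip(values.tolist(), counts_after.tolist())))
```

Every tree would be trained on such a sample, so each one sees all classes equally often, at the cost of discarding most majority-class rows in any single tree.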

In [None]:
clf1 = imblearn.ensemble.BalancedRandomForestClassifier(n_estimators = 200, max_depth=30, random_state=0)
clf1.fit(X_train, y_train)
pred = clf1.predict(X_test)
under = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF - undersampling: ', under)

How about combining SMOTE and Balanced Random Forest?

In [None]:
clf1 = imblearn.ensemble.BalancedRandomForestClassifier(n_estimators = 200, max_depth=20, random_state=0)
clf1.fit(X_train_B, y_train_B)
pred = clf1.predict(X_test)
under_smote = sklearn.metrics.cohen_kappa_score(y_test, pred)
print('RF - undersampling and SMOTE: ', under_smote)

In [None]:
importance = clf1.feature_importances_
idx = np.argsort(importance)[::-1]
column_names = subtrain.columns.values
MI = pd.DataFrame(data = {'Column_clas': column_names[idx], 'RF-under-smote_importance': importance[idx]})
MI

### Kappa-metrics of all methods

In [None]:
count = {'SMOTEENN': smoteenn, 'SMOTETomek': smotetomek, 'SMOTE': smote, 'RF': rf, 'under': under, 'under-SMOTE': under_smote}
plt.barh(range(len(count)), list(count.values()), align='center')
plt.yticks(range(len(count)), list(count.keys()))
plt.title('Metrics')
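
Because the classes are ordered, it may also be worth reporting a weighted kappa next to the unweighted one used above. The sketch below uses synthetic labels (not the competition data) in which two sets of predictions make the same number of mistakes, but one misses by a single class and the other by two or three. `weights="quadratic"` is an existing option of `cohen_kappa_score`; treating it as the relevant metric for this problem is an assumption here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.repeat(np.arange(5), 20)               # 5 balanced ordered classes
wrong = np.arange(y_true.size) % 2 == 1            # every second prediction is wrong

near = np.where(y_true < 4, y_true + 1, y_true - 1)   # off by one class
far = (y_true + 2) % 5                                # off by two or three classes
pred_near = np.where(wrong, near, y_true)
pred_far = np.where(wrong, far, y_true)

for name, pred in [("near misses", pred_near), ("far misses", pred_far)]:
    plain = cohen_kappa_score(y_true, pred)
    quad = cohen_kappa_score(y_true, pred, weights="quadratic")
    print(f"{name}: plain={plain:.3f} quadratic={quad:.3f}")
```

The unweighted kappa cannot tell the two sets of predictions apart, while the quadratic version rewards the one whose errors stay close to the true class.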

## Conclusions

In [None]:
age = pd.DataFrame(data = {'Age' : subtrain['Age'].values, 'target': target.values})
# Iterate through the five adoption speeds
for speed in [0, 1, 2, 3, 4]:
    # Subset to the pets with this adoption speed
    subset = age[age['target'] == speed]
    
    # Draw the density plot
    sns.distplot(subset['Age'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3}, label = speed)
    
# Plot formatting
plt.legend(prop={'size': 16}, title = 'AdoptionSpeed')
plt.title('Density Plot')
plt.xlabel('Age')
plt.ylabel('Density')
plt.xlim(0, 50)


In [None]:
photo = pd.DataFrame(data = {'photo' : subtrain['PhotoAmt'].values, 'target': target.values})
# Iterate through the five adoption speeds
for speed in [0, 1, 2, 3, 4]:
    # Subset to the pets with this adoption speed
    subset = photo[photo['target'] == speed]
    
    # Draw the density plot
    sns.distplot(subset['photo'], hist = False, kde = True,
                 kde_kws = {'linewidth': 3}, label = speed)
    
# Plot formatting
plt.legend(prop={'size': 16}, title = 'AdoptionSpeed')
plt.title('Density Plot')
plt.xlabel('Photos')
plt.ylabel('Density')
plt.xlim(0, 20)


In [None]:
photo_age = pd.DataFrame(data = {'Age': subtrain['Age'].values, 'Photos' : subtrain['PhotoAmt'].values, 'target': target.values})
ax = sns.scatterplot(x= "Age", y="Photos", data=photo_age, hue="target", palette='Spectral')

* We have seen that, both statistically (intrinsic to the data) and with a Random Forest classifier, the most important features are **Age** and **PhotoAmt**. Although these features need others for a more accurate prediction, they are the most discriminative.
* The data is imbalanced, and this could be a disadvantage for a tree-based model like Random Forest. Although RF does not seem to be badly harmed by the imbalance, we tried SMOTE-based methods to deal with it and obtained a very good result by combining SMOTE with a balanced RF. Furthermore, we explored other SMOTE variants that integrate noise removal, among which SMOTETomek gives a very good result.
* *What information have we extracted?* Young pets with few photos are less likely to be adopted quickly. Looking at both features together, we can see that older pets have fewer photos and are indeed adopted more slowly. So, PetFinder.my, PLEASE, for every old pet that you have, TAKE MORE PHOTOS OF THEM and put more *emphasis* on them.
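
The relation between age and number of photos read off the scatter plot can also be checked numerically. The following sketch groups pets into age brackets and averages **PhotoAmt** and **AdoptionSpeed** per bracket. The function `mean_photos_by_age_bin`, the bracket edges and the toy values are assumptions made for illustration; on the real data one would pass `train` instead of `toy`.

```python
import pandas as pd

def mean_photos_by_age_bin(df, bins=(0, 3, 12, 36, 255)):
    """Average PhotoAmt and AdoptionSpeed per age bracket (age in months)."""
    brackets = pd.cut(df["Age"], bins=list(bins), include_lowest=True)
    return df.groupby(brackets, observed=True)[["PhotoAmt", "AdoptionSpeed"]].mean()

# Made-up rows, only to show the shape of the output.
toy = pd.DataFrame({
    "Age": [1, 2, 6, 10, 24, 30, 60, 120],
    "PhotoAmt": [5, 7, 4, 4, 3, 1, 2, 0],
    "AdoptionSpeed": [1, 0, 2, 2, 3, 4, 4, 4],
})
summary = mean_photos_by_age_bin(toy)
print(summary)
```

A table like this makes it easy to see whether the average number of photos really drops, and the average adoption speed really rises, as pets get older.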

## Test Prediction (Submission)

In [None]:
test2 = test.drop(columns=['Name', 'RescuerID', 'Description', 'PetID'])
sme = SMOTE(random_state=42)
X_balance, y_balance =  sme.fit_resample(subtrain, target)
clf1 = imblearn.ensemble.BalancedRandomForestClassifier(n_estimators = 200, max_depth=30, random_state=0)
clf1.fit(X_balance, y_balance)
pred = clf1.predict(test2)

#Write submission
submission_df = pd.DataFrame(data={'PetID' : test['PetID'], 
                                   'AdoptionSpeed' : pred})
submission_df.to_csv('under-SMOTE.csv', index=False)

In [None]:
clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators = 200, max_depth=30, random_state=0)
clf1.fit(subtrain, target)
# Apply the same categorical dtypes to the test set
cat_cols = ["Breed1", "Breed2", "Gender", "Color1", "Color2", "Color3", "MaturitySize",
            "FurLength", "Vaccinated", "Dewormed", "Sterilized", "Health", "State"]
test2[cat_cols] = test2[cat_cols].astype('category')
pred = clf1.predict(test2)

#Write submission
submission_df = pd.DataFrame(data={'PetID' : test['PetID'], 
                                   'AdoptionSpeed' : pred})
submission_df.to_csv('RF.csv', index=False)

In [None]:
sme = SMOTETomek(random_state=42)
X_balance, y_balance =  sme.fit_resample(subtrain, target)
clf1 = sklearn.ensemble.RandomForestClassifier(n_estimators = 200, max_depth=30, random_state=0)
clf1.fit(X_balance, y_balance)
pred = clf1.predict(test2)

#Write submission
submission_df = pd.DataFrame(data={'PetID' : test['PetID'], 
                                   'AdoptionSpeed' : pred})
submission_df.to_csv('SMOTETomek.csv', index=False)
