# Pets: Definitive CatBoost parameter tuning
We'll tune the parameters of the [CatBoost](https://catboost.ai/) algorithm using [Pandas](https://pandas.pydata.org/) and [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). I'm a huge fan of pipelines because they ensure that (1) one **never** needs to modify the training data in place and (2) all preprocessing steps can be parametrized and therefore tuned with, e.g., `GridSearchCV` (see the [example](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html)).
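
As a minimal sketch of point (2), the snippet below tunes a preprocessing parameter together with a model parameter. The toy data, the step names `select` and `clf`, and the choice of `SelectKBest` and `LogisticRegression` are assumptions made for illustration only; they are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for a real training set (illustrative assumption)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

toy_pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif)),  # Preprocessing step with a tunable `k`
    ('clf', LogisticRegression(max_iter=1000)),
])

# `<step name>__<parameter>` addresses a parameter of a single pipeline step
toy_search = GridSearchCV(
    toy_pipeline,
    param_grid={'select__k': [2, 4, 8], 'clf__C': [0.1, 1.0]},
    cv=3,
)
toy_search.fit(X_toy, y_toy)
print(toy_search.best_params_)
```

Because the feature selection lives inside the pipeline, it is refitted on each cross-validation training fold and never sees the corresponding validation fold.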

Comments on this notebook are most welcome!

### TODO
- [ ] More tuning

### Define imports and load data

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import catboost
import tensorflow as tf  # Just for checking if GPU is available :)

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, cohen_kappa_score, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

%matplotlib inline
plt.rc('figure', figsize=(20.0, 10.0))
GPU_AVAILABLE = tf.test.is_gpu_available()
QUADRATIC_WEIGHT_SCORER = make_scorer(cohen_kappa_score, weights='quadratic')
print("GPU available:", GPU_AVAILABLE)
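
To see why we score with the quadratically weighted kappa instead of plain accuracy, here is a small sketch on made-up labels (the arrays below are illustrative assumptions, not competition data). Both prediction vectors contain exactly one mistake, so their accuracy is identical, but the quadratic weights penalize the prediction that is four classes off much more than the one that is a single class off.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 3, 4, 4]
y_near = [0, 1, 2, 3, 4, 3]  # One mistake, off by one class
y_far = [0, 1, 2, 3, 4, 0]   # One mistake, off by four classes

acc_near = accuracy_score(y_true, y_near)
acc_far = accuracy_score(y_true, y_far)
kappa_near = cohen_kappa_score(y_true, y_near, weights='quadratic')
kappa_far = cohen_kappa_score(y_true, y_far, weights='quadratic')

print("Accuracy (near, far):", acc_near, acc_far)
print("Quadratic kappa (near, far):", kappa_near, kappa_far)
```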

In [None]:
INPUT_DIR = "../input"
print(os.listdir(INPUT_DIR))

## Data description (copied from [competition description](https://www.kaggle.com/c/petfinder-adoption-prediction/data))

<i>
In this competition you will predict the speed at which a pet is adopted, based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. See below for details. 
This is a Kernels-only competition. At the end of the competition, test data will be replaced in their entirety with new data of approximately the same size, and your kernels will be rerun on the new data.

### File descriptions
- train.csv - Tabular/text data for the training set
- test.csv - Tabular/text data for the test set
- sample_submission.csv - A sample submission file in the correct format
- breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID
- state_labels.csv - Contains StateName for each StateID

### Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
- AdoptionSpeed - Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:
    - 0 - Pet was adopted on the same day as it was listed.
    - 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
    - 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
    - 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
    - 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

### Images

For pets that have photos, they will be named in the format of PetID-ImageNumber.jpg. Image 1 is the profile (default) photo set for the pet. For privacy purposes, faces, phone numbers and emails have been masked.

### Image Metadata
We have run the images through Google's Vision API, providing analysis on Face Annotation, Label Annotation, Text Annotation and Image Properties. You may optionally utilize this supplementary information for your image analysis.

File name format is PetID-ImageNumber.json.

Some properties will not exist in JSON file if not present, i.e. Face Annotation. Text Annotation has been simplified to just 1 entry of the entire text description (instead of the detailed JSON result broken down by individual characters and words). Phone numbers and emails are already anonymized in Text Annotation.

Google Vision API reference: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate

### Sentiment Data
We have run each pet profile's description through Google's Natural Language API, providing analysis on sentiment and key entities. You may optionally utilize this supplementary information for your pet description analysis. There are some descriptions that the API could not analyze. As such, there are fewer sentiment files than there are rows in the dataset.

File name format is PetID.json.

Google Natural Language API reference: https://cloud.google.com/natural-language/docs/basics

### What will change in the 2nd stage of the competition?
In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

- test.zip including test.csv and sample_submission.csv
- test_images.zip
- test_metadata.zip
- test_sentiment.zip

In stage 2, all data will be replaced with approximately the same amount of different data. The stage 1 test data will not be available when kernels are rerun in stage 2.
</i>

Read the training and test CSV files into pandas DataFrames, marking the categorical columns with the `category` dtype.

In [None]:
def read_csv_to_pandas(is_train=True):
    path = os.path.join(INPUT_DIR, 'train', 'train.csv') if is_train else os.path.join(INPUT_DIR, 'test', 'test.csv')
    return pd.read_csv(path, dtype={
        'Type': 'category',
        'Name': 'category',
        'Breed1': 'category',
        'Breed2': 'category',
        'Gender': 'category',
        'Color1': 'category',
        'Color2': 'category',
        'Color3': 'category',
        'MaturitySize': 'category',
        'FurLength': 'category',
        'Vaccinated': 'category',
        'Dewormed': 'category',
        'Sterilized': 'category',
        'Health': 'category',
        'State': 'category',
        'RescuerID': 'category'
    })

train_df = read_csv_to_pandas(is_train=True)
X_test = read_csv_to_pandas(is_train=False)
train_df.info()
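
One detail worth knowing about `dtype='category'` in `read_csv`: the categories are parsed as strings, even when the CSV holds integer codes such as `1` and `2`. The sketch below shows this on a tiny in-memory CSV (the CSV content is an illustrative assumption).

```python
import io

import pandas as pd

csv_text = "Type,Age\n1,3\n2,12\n1,24\n"
toy_df = pd.read_csv(io.StringIO(csv_text), dtype={'Type': 'category'})

print(toy_df.dtypes)
# The categories are the strings '1' and '2', not the integers 1 and 2
print(list(toy_df['Type'].cat.categories))
```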

## Preprocessing

### Split training data into training and validation set
We split `train_df` into two sets: `X_train` is used for cross-validation, and `X_val` is used at the end of the notebook to estimate the generalization error.

In [None]:
from sklearn.model_selection import train_test_split

def to_features_and_labels(df):
    y = df['AdoptionSpeed'].values
    X = df.drop('AdoptionSpeed', axis=1)
    return X, y

X_train_val, y_train_val = to_features_and_labels(train_df)

X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42,
                                                  stratify=y_train_val)

print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)
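
The `stratify` argument keeps the class proportions the same in both parts of the split. A minimal sketch on an imbalanced toy label vector (an illustrative assumption) makes the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 samples of class 0 and 20 samples of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42,
                                      stratify=y_toy)

# Both parts contain 20 % of class 1, just like the full data
print("Share of class 1 in training part:", y_a.mean())
print("Share of class 1 in validation part:", y_b.mean())
```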

### Define transformers
We'll use [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to define our data preprocessing transforms. We'll use a few custom transformers for the purpose:
- `DataFrameColumnMapper`: Map DataFrame column to a new column (similar to `DataFrameMapper` from [sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas))
- `CategoricalTruncator`: Keep only `N` most frequent categories for a given column, replace others with "Other"
- `CategoricalToOneHotEncoder`: One-hot encode columns
- `DataFrameColumnDropper`: Drop given columns
- `DataFrameToValuesTransformer`: Map DataFrame to NumPy array, used before predictors

In [None]:
class DataFrameColumnMapper(BaseEstimator, TransformerMixin):
    """
    Map DataFrame column to a new column (similar to DataFrameMapper from sklearn-pandas)
    
    Attributes:
        column_name (str): Column name to transform
        mapping_func (func): Function to apply to given column values
        new_column_name (str): Name for the new column, leave empty if replacing `column_name`
        drop_original (bool): Drop original column if true and new_column_name != column_name
    """
    def __init__(self, column_name, mapping_func, new_column_name=None, drop_original=True):
        self.column_name = column_name
        self.mapping_func = mapping_func
        self.new_column_name = new_column_name if new_column_name is not None else self.column_name
        self.drop_original = drop_original

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Apply the mapping function to the values of the column
        transformed_column = X[self.column_name].apply(self.mapping_func)
        # `assign` returns a new DataFrame, so the input X is never modified
        Y = X.assign(**{self.new_column_name: transformed_column})
        if self.column_name != self.new_column_name and self.drop_original:
            Y = Y.drop(self.column_name, axis=1)
        return Y

class CategoricalToOneHotEncoder(BaseEstimator, TransformerMixin):
    """
    One-hot encode given columns.
    
    Attributes:
        columns (List[str]): Columns to one-hot encode. If None, all non-numerical columns are encoded.
        columns_ (List[str]): Columns picked for encoding during fit
        mappings_ (Dict[str, Dict]): Mapping from original column name to the one-hot-encoded column names
    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Pick all non-numerical attributes if no columns to transform were specified. The result is
        # stored in `columns_` so that the constructor parameter `columns` is left untouched.
        if self.columns is None:
            self.columns_ = list(X.select_dtypes(exclude='number').columns)
        else:
            self.columns_ = list(self.columns)
        
        # Keep track of which categorical attributes are assigned to which integer. This is important 
        # when transforming the test set.
        mappings = {}
        
        for col in self.columns_:
            labels, uniques = X.loc[:, col].factorize() # Assigns unique integers for all categories
            int_and_cat = list(enumerate(uniques))
            cat_and_int = [(x[1], x[0]) for x in int_and_cat]
            mappings[col] = {'int_to_cat': dict(int_and_cat), 'cat_to_int': dict(cat_and_int)}
    
        self.mappings_ = mappings
        return self

    def transform(self, X):
        Y = X.copy()
        for col in self.columns_:
            values = Y.loc[:, col].astype(object)
            for key in self.mappings_[col]['cat_to_int']:
                # Compare against the category itself: values not seen during fit get zeros in
                # every one-hot column instead of raising a KeyError
                one_hot = (values == key).astype(int)
                Y = Y.assign(**{'{}_{}'.format(col, key): one_hot})
            Y = Y.drop(col, axis=1)
        return Y

class CategoricalTruncator(BaseEstimator, TransformerMixin):
    """
    Keep only N most frequent categories for a given column, replace others with "Other"
    
    Attributes:
        column_name (str): Column for which to truncate categories
        n_values_to_keep (int): How many of the most frequent values to keep (1 for keeping only most frequent, etc.)
        values_ (List[str]): List of category names to keep, others are replaced with "Other"
    """
    def __init__(self, column_name, n_values_to_keep=5):
        self.column_name = column_name
        self.n_values_to_keep = n_values_to_keep

    def fit(self, X, y=None):
        # Remember the most frequent values seen in the training data so that the validation and
        # test sets are truncated to exactly the same categories later on
        self.values_ = list(X[self.column_name].value_counts().index[:self.n_values_to_keep])
        return self

    def transform(self, X):
        column = X[self.column_name].astype(object)
        truncated = column.where(column.isin(self.values_), 'Other')
        # `assign` returns a new DataFrame, so the input X is never modified
        return X.assign(**{self.column_name: truncated})

class DataFrameColumnDropper(BaseEstimator, TransformerMixin):
    """
    Drop given columns.
    
    Attributes:
        column_names (List[Str]): List of columns to drop
    """
    def __init__(self, column_names):
        self.column_names = column_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # `drop` already returns a new DataFrame, so no explicit copy is needed
        return X.drop(self.column_names, axis=1)

class DataFrameToValuesTransformer(BaseEstimator, TransformerMixin):
    """
    Transform DataFrame to NumPy array.
    
    Attributes:
        attributes_ (List[str]): List of DataFrame column names
    """
    def __init__(self):
        self.attributes_ = None
    def fit(self, X, y=None):
        # Remember the order of attributes before converting to NumPy to ensure the columns
        # are included in the same order when transforming validation or test dataset
        self.attributes_ = list(X)
        return self
    def transform(self, X):
        return X.loc[:, self.attributes_].values
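
The key pattern in these transformers is that everything data-dependent is learned in `fit` and only replayed in `transform`. The following self-contained sketch illustrates this with `TopCategoryKeeper`, a hypothetical and simplified stand-in for `CategoricalTruncator`; the toy breed values are assumptions for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TopCategoryKeeper(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified transformer: keep the N most frequent values of one column."""
    def __init__(self, column_name, n_values_to_keep=2):
        self.column_name = column_name
        self.n_values_to_keep = n_values_to_keep

    def fit(self, X, y=None):
        # Learn the values to keep from the data passed to `fit` only
        counts = X[self.column_name].value_counts()
        self.values_ = list(counts.index[:self.n_values_to_keep])
        return self

    def transform(self, X):
        column = X[self.column_name].astype(object)
        truncated = column.where(column.isin(self.values_), 'Other')
        return X.assign(**{self.column_name: truncated})  # Returns a new DataFrame

toy_train = pd.DataFrame({'Breed1': ['a', 'a', 'a', 'b', 'b', 'c']})
toy_test = pd.DataFrame({'Breed1': ['a', 'c', 'd']})

keeper = TopCategoryKeeper('Breed1', n_values_to_keep=2).fit(toy_train)
toy_test_truncated = keeper.transform(toy_test)
print(toy_test_truncated['Breed1'].tolist())  # ['a', 'Other', 'Other']
```

The value `'d'` never occurs in the training data and `'c'` is not among the two most frequent training values, so both are mapped to `'Other'`, while `toy_test` itself stays unchanged.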

### Define preprocessing pipeline
Build a [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) mapping `X_train` to `X_train_preprocessed`. Note that we **never** modify `X_train`: this ensures that our results are independent of the order in which cells are executed (as long as all variables and functions are defined).

Because CatBoost handles categorical features quite well, the only preprocessing we do is to drop some of the columns.

In [None]:
def build_preprocessing_pipeline() -> Pipeline:
    return Pipeline([
        ('drop_unused_columns', DataFrameColumnDropper(
            column_names=['PetID', 'Description', 'RescuerID', 'Name'])
        )
    ])

preprocessing_pipeline = build_preprocessing_pipeline()
X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train, y_train)

X_train_preprocessed.head(10)
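
The claim that the input is never modified can be checked directly. Here is a minimal sketch using scikit-learn's `FunctionTransformer` on a toy frame; the helper `drop_id_column` and the toy values are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def drop_id_column(df):
    # `DataFrame.drop` returns a new DataFrame and leaves `df` untouched
    return df.drop(columns=['PetID'])

toy_X = pd.DataFrame({'PetID': ['x1', 'x2'], 'Age': [3, 12]})
toy_pipeline = Pipeline([('drop_id', FunctionTransformer(drop_id_column))])
toy_out = toy_pipeline.fit_transform(toy_X)

print("Input columns after transform:", list(toy_X.columns))
print("Output columns:", list(toy_out.columns))
```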

### Print the columns:

In [None]:
print("Number of features:", len(list(X_train_preprocessed)))
print("")

print("Numerical columns:", list(X_train_preprocessed.select_dtypes(include="number")))
print("")

print("Categorical columns:", list(X_train_preprocessed.select_dtypes(include="category")))

## CatBoost classifier
We define a small helper class called `CatBoostPandasClassifier` for training `CatBoostClassifier` with Pandas dataframes. The helper class correctly passes the categorical columns as `cat_features` to CatBoostClassifier's `fit` method and ensures that the order of the columns in fitting and prediction phases matches.

In [None]:
from catboost import CatBoostClassifier

class CatBoostPandasClassifier:
    """
    Helper class for training `CatBoostClassifier` with Pandas dataframes. The class
    passes the columns of type "category" as `cat_features` to `CatBoostClassifier`'s fit method and ensures
    that the order of the columns in the fitting and prediction phases matches.
    
    Author: Kimmo Sääskilahti
    """
    def __init__(self, *args, **kwargs):
        self.catboost_classifier = CatBoostClassifier(*args, **kwargs)
        self.columns = None
        
    def fit(self, X, y, *args, **kwargs):
        self.columns = list(X)
        cat_columns = list(X.select_dtypes(include='category'))
        cat_features = [X.columns.get_loc(name) for name in cat_columns]  # Indices of categorical features
        return self.catboost_classifier.fit(X.values, y, *args, cat_features=cat_features, **kwargs)
        
    def copy(self, *args, **kwargs):
        returned_classifier = CatBoostPandasClassifier()
        returned_classifier.catboost_classifier = self.catboost_classifier.copy()
        returned_classifier.columns = self.columns
        return returned_classifier
        
    def predict(self, X, *args, **kwargs):
        X_copy = X.loc[:, self.columns]
        return self.catboost_classifier.predict(X_copy.values, *args, **kwargs)
    
    def predict_proba(self, X, *args, **kwargs):
        X_copy = X.loc[:, self.columns]
        return self.catboost_classifier.predict_proba(X_copy.values, *args, **kwargs)
        
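    def __getattr__(self, attr):
        """
        Pass all other method calls to self.catboost_classifier.
        """
        # Guard against infinite recursion when `catboost_classifier` has not been set yet
        # (e.g. while the object is being copied or unpickled)
        if attr == 'catboost_classifier':
            raise AttributeError(attr)
        return getattr(self.catboost_classifier, attr)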

# Example usage
catboost_pandas_clf = CatBoostPandasClassifier(iterations=10, learning_rate=0.1, loss_function='MultiClass',
                                               allow_writing_files=False)
# catboost_pandas_clf.fit(X_train_preprocessed, y_train)
cross_val_score(catboost_pandas_clf, X_train_preprocessed, y_train, cv=5, scoring=QUADRATIC_WEIGHT_SCORER)
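
The index lookup inside `fit` can be tried out in isolation. The sketch below repeats it on a small toy frame (the column values are illustrative assumptions) and also shows why positional indices are needed: once `.values` is taken, the column names and dtypes are gone.

```python
import pandas as pd

toy_X = pd.DataFrame({
    'Type': pd.Series(['1', '2', '1'], dtype='category'),
    'Age': [3, 12, 24],
    'Gender': pd.Series(['1', '2', '3'], dtype='category'),
})

cat_columns = list(toy_X.select_dtypes(include='category'))
cat_features = [toy_X.columns.get_loc(name) for name in cat_columns]

print("Categorical columns:", cat_columns)          # ['Type', 'Gender']
print("Indices of categorical features:", cat_features)  # [0, 2]
print("dtype of the NumPy array:", toy_X.values.dtype)   # object
```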

## Parameter tuning

Let us first define a few helper functions for parameter tuning:

In [None]:
def build_search(pipeline, param_distributions, n_iter=10):
    """
    Builder function for RandomizedSearch.
    """
    return RandomizedSearchCV(pipeline, param_distributions=param_distributions, 
                              cv=5, return_train_score=True, refit='cohen_kappa_quadratic',
                              n_iter=n_iter,
                              n_jobs=None,
                              scoring={
                                    'accuracy': make_scorer(accuracy_score),
                                    'cohen_kappa_quadratic': QUADRATIC_WEIGHT_SCORER
                               },
                              verbose=1,
                              random_state=42)

def pretty_cv_results(cv_results, 
                      sort_by='rank_test_cohen_kappa_quadratic',
                      sort_ascending=True,
                      n_rows=10):
    """
    Return pretty Pandas dataframe from the `cv_results_` attribute of finished parameter search,
    ranking by test performance and only keeping the columns of interest.
    """
    df = pd.DataFrame(cv_results)
    cols_of_interest = [key for key in df.keys() if key.startswith('param_') 
                        or key.startswith('mean_train') 
                        or key.startswith('mean_test_')
                        or key.startswith('mean_fit_time')
                        or key.startswith('rank')]
    return df.loc[:, cols_of_interest].sort_values(by=sort_by, ascending=sort_ascending).head(n_rows)

def run_search(search):
    search.fit(X_train, y_train)
    print('Best score is:', search.best_score_)
    return pretty_cv_results(search.cv_results_)
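
To get a feeling for what a multi-metric search returns, here is a sketch of the same `RandomizedSearchCV` setup on toy data. The `DecisionTreeClassifier`, its parameter values and the generated data are assumptions chosen only to keep the example fast; they are not part of this notebook's model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)

toy_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5]},
    n_iter=4, cv=3, random_state=42,
    scoring={
        'accuracy': make_scorer(accuracy_score),
        'cohen_kappa_quadratic': make_scorer(cohen_kappa_score, weights='quadratic'),
    },
    refit='cohen_kappa_quadratic',  # Metric used to pick `best_estimator_`
)
toy_search.fit(X_toy, y_toy)

# Each scorer name shows up as a suffix in the `cv_results_` keys
toy_results = pd.DataFrame(toy_search.cv_results_)
print(toy_results[['params', 'mean_test_accuracy', 'mean_test_cohen_kappa_quadratic',
                   'rank_test_cohen_kappa_quadratic']])
```

These suffixed column names are what `pretty_cv_results` relies on when it sorts by `rank_test_cohen_kappa_quadratic`.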

### Tune CatBoostPandasClassifier

In [None]:
task_type = "GPU" if GPU_AVAILABLE else "CPU"

catboost_pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline),
    ('classifier', CatBoostPandasClassifier(verbose=0, loss_function='MultiClass',
                                            allow_writing_files=False, task_type=task_type))
])

param_distributions = {
    'classifier__iterations': [500],  # Sets the value of `iterations` to the pipeline step `classifier` in parameter search
    'classifier__learning_rate': [0.03, 0.10, 0.20],
    'classifier__max_depth': [4, 6, 8],
    'classifier__one_hot_max_size': [10, 30],
    'classifier__l2_leaf_reg': [1, 3, 5]
}

catboost_search = build_search(catboost_pipeline, param_distributions=param_distributions, n_iter=10)
catboost_cv_results = run_search(search=catboost_search)
catboost_cv_results
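
With `n_iter=10`, the randomized search only visits part of the search space. As a quick check, scikit-learn's `ParameterGrid` and `ParameterSampler` can count and sample the combinations without training anything:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_distributions = {
    'classifier__iterations': [500],
    'classifier__learning_rate': [0.03, 0.10, 0.20],
    'classifier__max_depth': [4, 6, 8],
    'classifier__one_hot_max_size': [10, 30],
    'classifier__l2_leaf_reg': [1, 3, 5]
}

# 1 * 3 * 3 * 2 * 3 = 54 possible combinations
n_combinations = len(ParameterGrid(param_distributions))
sampled = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))

print("Combinations in the full grid:", n_combinations)
print("Combinations tried by the randomized search:", len(sampled))
```

Since all values are given as lists, the combinations are sampled without replacement, so the search covers 10 distinct candidates out of 54.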

In [None]:
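best_estimator = catboost_search.best_estimator_
# Because `refit` is set in the search, `best_estimator_` has already been refitted on all of `X_train`,
# so this explicit fit repeats that step
best_estimator.fit(X_train, y_train)
y_val_pred = best_estimator.predict(X_val)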

print("Performance of best estimator on the hold-out set:", cohen_kappa_score(y_val, y_val_pred, weights='quadratic'))

### Train the final estimator with all data available

In [None]:
best_estimator.fit(X_train_val, y_train_val)

Compute predictions for the test set:

In [None]:
def get_predictions(estimator, X):
    predictions = estimator.predict(X).astype(np.int32)
    predictions = np.squeeze(predictions)  # Estimators may return arrays for each prediction
    indices = X.loc[:, 'PetID']
    # Build the submission frame directly, fixing the column order explicitly
    df = pd.DataFrame({'PetID': X.loc[:, 'PetID'].values, 'AdoptionSpeed': predictions},
                      columns=['PetID', 'AdoptionSpeed'])
    return df

predictions = get_predictions(best_estimator, X=X_test)

Write `submission.csv`:

In [None]:
def write_submission(predictions):
    submission_folder = '.'
    dest_file = os.path.join(submission_folder, 'submission.csv')
    predictions.to_csv(dest_file, index=False)
    print("Wrote to {}".format(dest_file))
    
write_submission(predictions)
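
Before submitting, it can be worth reading the file back and checking its format. The sketch below does this on a hypothetical three-row frame written to a temporary directory; the `PetID` values are made up, and the checks reflect the format described in the data description above (columns `PetID` and `AdoptionSpeed`, with speeds between 0 and 4).

```python
import os
import tempfile

import pandas as pd

toy_predictions = pd.DataFrame({'PetID': ['a1', 'b2', 'c3'], 'AdoptionSpeed': [0, 4, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'submission.csv')
    toy_predictions.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The column order and the value range should survive the round trip
columns_ok = list(reloaded.columns) == ['PetID', 'AdoptionSpeed']
values_ok = bool(reloaded['AdoptionSpeed'].between(0, 4).all())
print("Columns OK:", columns_ok)
print("Values OK:", values_ok)
```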