# Feature importance exploration and scikit-learn baseline
We'll explore feature importances for the training data CSV with [Pandas](https://pandas.pydata.org/) and [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). The benefit of pipelines is that we **never** modify our training data in-place. Therefore, our results are independent of the order in which the cells are executed (as long as all variables, functions and imports are defined). One can also easily tune the parameters of the preprocessing steps (see the [example](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html)).
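
As a minimal sketch of the "never modify in-place" property, the snippet below runs a toy pipeline and checks that the input DataFrame is unchanged afterwards. The toy data and the step names `log_age` and `scale` are assumptions made for illustration only; they are not part of the pipeline built later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data standing in for two numeric columns of train.csv (values are made up)
df = pd.DataFrame({"Age": [1.0, 3.0, 12.0, 24.0], "Fee": [0.0, 50.0, 0.0, 100.0]})
snapshot = df.copy()

pipeline = Pipeline([
    # DataFrame.assign returns a new DataFrame, so the input is left untouched
    ("log_age", FunctionTransformer(lambda X: X.assign(Age=np.log1p(X["Age"])))),
    ("scale", StandardScaler()),
])

scaled = pipeline.fit_transform(df)

# The original DataFrame still holds the raw values
input_unchanged = df.equals(snapshot)
print("Input unchanged:", input_unchanged)
print("Output shape:", scaled.shape)
```

Because every step returns a new object, re-running the cell gives the same result regardless of what was executed before it.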

### TODO
- [ ] Compute quadratic weighted kappa score in parameter tuning
- [x] Write a transformer for picking the best features from feature importances
- [ ] Sweep over preprocessing parameters in parameter search
- [ ] Automatically choose the estimators for the voting classifier

### Define imports and load data

In [33]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.metrics import make_scorer, cohen_kappa_score, accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

from xgboost import XGBClassifier

%matplotlib inline
plt.rc('figure', figsize=(20.0, 10.0))

In [2]:
INPUT_DIR = "../input"
print(os.listdir(INPUT_DIR))

In [3]:
train_df = pd.read_csv(os.path.join(INPUT_DIR, 'train', 'train.csv'))
X_test = pd.read_csv(os.path.join(INPUT_DIR, 'test', 'test.csv'))

## Data description (copied from [competition description](https://www.kaggle.com/c/petfinder-adoption-prediction/data))

<i>
In this competition you will predict the speed at which a pet is adopted, based on the pet’s listing on PetFinder. Sometimes a profile represents a group of pets. In this case, the speed of adoption is determined by the speed at which all of the pets are adopted. The data included text, tabular, and image data. See below for details. 
This is a Kernels-only competition. At the end of the competition, test data will be replaced in their entirety with new data of approximately the same size, and your kernels will be rerun on the new data.

### File descriptions
- train.csv - Tabular/text data for the training set
- test.csv - Tabular/text data for the test set
- sample_submission.csv - A sample submission file in the correct format
- breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID
- state_labels.csv - Contains StateName for each StateID

### Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
- AdoptionSpeed Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
    0 - Pet was adopted on the same day as it was listed. 
    1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
    2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
    3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
    4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

### Images

For pets that have photos, they will be named in the format of PetID-ImageNumber.jpg. Image 1 is the profile (default) photo set for the pet. For privacy purposes, faces, phone numbers and emails have been masked.

### Image Metadata
We have run the images through Google's Vision API, providing analysis on Face Annotation, Label Annotation, Text Annotation and Image Properties. You may optionally utilize this supplementary information for your image analysis.

File name format is PetID-ImageNumber.json.

Some properties will not exist in JSON file if not present, i.e. Face Annotation. Text Annotation has been simplified to just 1 entry of the entire text description (instead of the detailed JSON result broken down by individual characters and words). Phone numbers and emails are already anonymized in Text Annotation.

Google Vision API reference: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate

### Sentiment Data
We have run each pet profile's description through Google's Natural Language API, providing analysis on sentiment and key entities. You may optionally utilize this supplementary information for your pet description analysis. There are some descriptions that the API could not analyze. As such, there are fewer sentiment files than there are rows in the dataset.

File name format is PetID.json.

Google Natural Language API reference: https://cloud.google.com/natural-language/docs/basics

What will change in the 2nd stage of the competition?
In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

test.zip including test.csv and sample_submission.csv
test_images.zip
test_metadata.zip
test_sentiment.zip

In stage 2, all data will be replaced with approximately the same amount of different data. The stage 1 test data will not be available when kernels are rerun in stage 2.
</i>

## Preprocessing

### What we'll do here for each column
- `PetID`: Drop
- `Type`: One-hot encode into `Type_cat` field, drop `Type_dog`
- `Name`: Create a `hasName` field indicating whether a name exists, drop the `Name` column
- `Age`: Leave as-is
- `Breed1`: Keep `N` most frequent categories, replace the rest with "Other", then one-hot encode
- `Breed2`: Keep `N` most frequent categories, replace the rest with "Other", then one-hot encode
- `Gender`: One-hot encode
- `Color1`, `Color2`, `Color3`: One-hot encode all of them
- `MaturitySize`: Leave as-is
- `FurLength`: One-hot encode (should account for zero?)
- `Vaccinated`: One-hot encode (should account for 3?)
- `Dewormed`: One-hot encode (should account for 3?)
- `Sterilized`: One-hot encode (should account for 3?)
- `Health`: One-hot encode, accounting for 0 (not specified)
- `Quantity`: Leave as is
- `Fee`: Leave as is
- `State`: Keep `N` most frequent categories, replace the rest with "Other", then one-hot encode
- `VideoAmt`, `PhotoAmt`: Leave as is
- `Description`: Drop (the text is not used in this baseline)
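
To make a few of the rules above concrete, here is a small pandas-only sketch on made-up rows. It derives a name-exists flag, a free-adoption flag and a cat indicator without touching the input. The toy values are assumptions, and the pipeline defined later does the same job with custom transformers rather than with `assign`.

```python
import numpy as np
import pandas as pd

# Three made-up listings; Type follows the competition coding (1 = Dog, 2 = Cat)
toy = pd.DataFrame({
    "Name": ["Milo", np.nan, "Luna"],
    "Fee": [0, 150, 0],
    "Type": [1, 2, 2],
})

features = toy.assign(
    hasName=toy["Name"].notna().astype(int),   # 1 if the pet has a name
    isFree=(toy["Fee"] < 1).astype(int),       # 1 if there is no adoption fee
    Type_cat=(toy["Type"] == 2).astype(int),   # single indicator, the dog column is dropped
).drop(columns=["Name", "Type"])

print(features)
```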

### Quick look at the distributions

In [4]:
train_df.hist(figsize=(20, 10))
plt.tight_layout()

### Define transformers
We'll use [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to define our data preprocessing transforms. We'll use a few custom transformers for the purpose:
- `DataFrameColumnMapper`: Map DataFrame column to a new column (similar to `DataFrameMapper` from [sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas))
- `CategoricalTruncator`: Keep only `N` most frequent categories for a given column, replace others with "Other"
- `CategoricalToOneHotEncoder`: One-hot encode columns
- `DataFrameColumnDropper`: Drop given columns
- `ColumnByFeatureImportancePicker`: Pick `N` most important columns based on a classifier feature importance
- `DataFrameToValuesTransformer`: Map DataFrame to NumPy array, used before predictors
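
The key idea behind the truncation and one-hot encoding transformers is that the set of categories is learned from the training data only and then reused for validation and test data. A minimal pandas-only sketch of that idea follows; the helper names `fit_top_categories` and `truncate` and the toy values are hypothetical and only illustrate the principle, they are not the classes defined below.

```python
import pandas as pd

def fit_top_categories(series, n_values_to_keep):
    """Learn the most frequent categories from the training column."""
    return list(series.value_counts().index[:n_values_to_keep])

def truncate(series, kept):
    """Replace every category outside `kept` with 'Other'."""
    return series.where(series.isin(kept), other="Other")

train = pd.Series(["a", "a", "a", "b", "b", "c"])
test = pd.Series(["a", "c", "d"])  # 'd' never appears in the training data

kept = fit_top_categories(train, n_values_to_keep=2)

# One-hot encode; the test frame is aligned to the training columns so that
# both end up with the same features in the same order
train_onehot = pd.get_dummies(truncate(train, kept), prefix="Breed1", dtype=int)
test_onehot = (pd.get_dummies(truncate(test, kept), prefix="Breed1", dtype=int)
               .reindex(columns=train_onehot.columns, fill_value=0))

print(test_onehot)
```

Aligning the test columns to the training columns is what keeps a downstream classifier from seeing a different feature layout at prediction time.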

In [5]:
class DataFrameColumnMapper(BaseEstimator, TransformerMixin):
    """
    Map DataFrame column to a new column (similar to DataFrameMapper from sklearn-pandas)
    
    Attributes:
        column_name (str): Column name to transform
        mapping_func (func): Function to apply to given column values
        new_column_name (str): Name for the new column, leave empty if replacing `column_name`
        drop_original (bool): Drop original column if true and new_column_name != column_name
    """
    def __init__(self, column_name, mapping_func, new_column_name=None, drop_original=True):
        """
        """
        self.column_name = column_name
        self.mapping_func = mapping_func
        self.new_column_name = new_column_name if new_column_name is not None else self.column_name
        self.drop_original = drop_original

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        transformed_column = X.transform({self.column_name: self.mapping_func})
        Y = X.copy()
        Y = Y.assign(**{self.new_column_name: transformed_column})
        if self.column_name != self.new_column_name and self.drop_original:
            Y = Y.drop(self.column_name, axis=1)
        return Y

In [6]:
class CategoricalToOneHotEncoder(BaseEstimator, TransformerMixin):
    """
    One-hot encode given columns.
    
    Attributes:
        columns (List[str]): Columns to one-hot encode.
        mappings_ (Dict[str, Dict]): Mapping from original column name to the one-hot-encoded column names
    """
    def __init__(self, columns=None):
        self.columns = columns
        self.mappings_ = None
    def fit(self, X, y=None):
        # Pick all categorical attributes if no columns to transform were specified
        if self.columns is None:
            self.columns = list(X.select_dtypes(exclude='number').columns)
        
        # Keep track of which categorical attributes are assigned to which integer. This is important 
        # when transforming the test set.
        mappings = {}
        
        for col in self.columns:
            labels, uniques = X.loc[:, col].factorize() # Assigns unique integers for all categories
            int_and_cat = list(enumerate(uniques))
            cat_and_int = [(x[1], x[0]) for x in int_and_cat]
            mappings[col] = {'int_to_cat': dict(int_and_cat), 'cat_to_int': dict(cat_and_int)}
    
        self.mappings_ = mappings
        return self

    def transform(self, X):
        Y = X.copy()
        for col in self.columns:
            # Categories not seen during fit map to -1, so all of their one-hot columns are zero
            transformed_col = Y.loc[:, col].map(lambda x: self.mappings_[col]['cat_to_int'].get(x, -1))
            for key, val in self.mappings_[col]['cat_to_int'].items():
                one_hot = (transformed_col == val) + 0 # Cast boolean to int by adding zero
                Y = Y.assign(**{'{}_{}'.format(col, key): one_hot})
            Y = Y.drop(col, axis=1)
        return Y

In [7]:
class CategoricalTruncator(BaseEstimator, TransformerMixin):
    """
    Keep only N most frequent categories for a given column, replace others with "Other"
    
    Attributes:
        column_name (str): Column for which to truncate categories
        n_values_to_keep (int): How many of the most frequent values to keep (1 for keeping only most frequent, etc.)
        values_ (List[str]): List of category names to keep, others are replaced with "Other"
    """
    def __init__(self, column_name, n_values_to_keep=5):
        self.column_name = column_name
        self.n_values_to_keep = n_values_to_keep
        self.values_ = None
    def fit(self, X, y=None):
        # Here we must ensure that the test set is transformed similarly in the later phase and that the same values are kept
        self.values_ = list(X[self.column_name].value_counts()[:self.n_values_to_keep].keys())
        return self
    def transform(self, X):
        transform = lambda x: x if x in self.values_ else 'Other'
        Y = X.copy()
        y = Y.transform({self.column_name: transform})
        return Y.assign(**{self.column_name: y})

In [8]:
class DataFrameColumnDropper(BaseEstimator, TransformerMixin):
    """
    Drop given columns.
    
    Attributes:
        column_names (List[Str]): List of columns to drop
    """
    def __init__(self, column_names):
        self.column_names = column_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.copy().drop(self.column_names, axis=1)

In [9]:
class ColumnByFeatureImportancePicker(BaseEstimator, TransformerMixin):
    """
    Pick columns by feature importance
    Attributes:
        n_features (Optional[int]): How many most important features to keep, None for noop transformation
        classifier (estimator): Classifier, must have `feature_importances_` available after `fit` has been called
    """
    def __init__(self, n_features: int = 20, classifier=RandomForestClassifier(n_estimators=100, random_state=42)):
        self.n_features = n_features
        self.classifier = classifier
        self.attributes_ = None
        
    def fit_and_compute_importances(self, X_df, y):
        """
        :return: Sorted list of tuples containing column name and its feature importance
        """
        X_numeric = X_df.select_dtypes(include='number')
        X = X_numeric.values
        self.classifier.fit(X, y)
        feature_importances = self.classifier.feature_importances_
        feature_names = list(X_numeric)
        feature_importances_with_names = [(feature_name, feature_importance) for feature_name, feature_importance in zip(feature_names, feature_importances)]
        feature_importances_with_names.sort(key=lambda x: x[1], reverse=True)
        return feature_importances_with_names
        
    def fit(self, X, y=None):
        if self.n_features is None:
            # Do nothing but keep the order
            self.attributes_ = list(X)
            return self
        
        assert y is not None, "Feature importances cannot be computed without y!"
        feature_importances_with_names = self.fit_and_compute_importances(X, y)
        self.attributes_ = [feature_name for feature_name, _ in feature_importances_with_names[:self.n_features]]
        return self
    
    def transform(self, X):
        return X.copy().loc[:, self.attributes_]
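
For comparison, scikit-learn ships a built-in selector that can do a similar top-`N` selection on NumPy arrays. The sketch below uses `SelectFromModel` with `threshold=-np.inf` so that only `max_features` decides what is kept. This is a swapped-in alternative shown on synthetic data, not the transformer defined above, and it does not preserve DataFrame column names by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=4,      # keep the 4 most important features
    threshold=-np.inf,   # disable the importance threshold so max_features alone decides
)
X_top = selector.fit_transform(X, y)

kept = np.flatnonzero(selector.get_support())
print("Kept feature indices:", kept)
print("Reduced shape:", X_top.shape)
```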

In [10]:
class DataFrameToValuesTransformer(BaseEstimator, TransformerMixin):
    """
    Transform DataFrame to NumPy array.
    
    Attributes:
        attributes_ (List[str]): List of DataFrame column names
    """
    def __init__(self):
        self.attributes_ = None
    def fit(self, X, y=None):
        # Remember the order of attributes before converting to NumPy to ensure the columns
        # are included in the same order when transforming validation or test dataset
        self.attributes_ = list(X)
        return self
    def transform(self, X):
        return X.loc[:, self.attributes_].values

### Split training data into training and validation set
We split `train_df` into two sets. `X_train` is used for the cross-validation, `X_val` is used at the end of the notebook to estimate the generalization error.
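
Since the adoption-speed classes are imbalanced, a stratified split is a natural choice. The sketch below, on synthetic labels with made-up class counts, checks that `stratify` keeps the class shares roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an imbalanced class distribution (counts are made up)
y = np.repeat([0, 1, 2, 3, 4], [10, 60, 80, 70, 80])
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Share of each class in the two parts
train_share = np.bincount(y_tr) / len(y_tr)
val_share = np.bincount(y_va) / len(y_va)

print("Train shares:", np.round(train_share, 3))
print("Val shares:  ", np.round(val_share, 3))
```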

In [11]:
from sklearn.model_selection import train_test_split

def to_features_and_labels(df):
    y = df['AdoptionSpeed'].values
    X = df.drop('AdoptionSpeed', axis=1)
    return X, y

X_train_val, y_train_val = to_features_and_labels(train_df)

X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42,
                                                  stratify=y_train_val)

print("Shape of X_train:", X_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_val:", y_val.shape)

In [12]:
X_train.head()

### Define preprocessing pipeline
Build a [scikit-learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) mapping `X_train` to `X_train_preprocessed`. Note that we **never** modify `X_train`: this ensures that our results are independent of the order in which cells are executed (as long as all variables and functions are defined).

In [13]:
def has_field_transformer(column_name, new_column_name=None, is_missing_func=pd.notna) -> TransformerMixin:
    return DataFrameColumnMapper(column_name=column_name,
                                 mapping_func=lambda name: int(is_missing_func(name)),  # np.int was removed in NumPy 1.24
                                 drop_original=True,
                                 new_column_name=new_column_name if new_column_name is not None else column_name)

def value_matches_transformer(column_name, new_column_name=None, matches=pd.notna) -> TransformerMixin:
    return DataFrameColumnMapper(column_name=column_name,
                                 mapping_func=lambda value: int(matches(value)),  # np.int was removed in NumPy 1.24
                                 drop_original=False,
                                 new_column_name=new_column_name if new_column_name is not None else column_name)

def map_categories(column_name, mapping_dict) -> TransformerMixin:
    return DataFrameColumnMapper(column_name=column_name,
                                 mapping_func=lambda x: mapping_dict[x])

def onehot_encode(columns) -> TransformerMixin:
    return CategoricalToOneHotEncoder(columns=columns)

def truncate_categorical(column_name, n_values_to_keep=10):
    return CategoricalTruncator(column_name=column_name, n_values_to_keep=n_values_to_keep)

ONEHOT_ENCODED_COLUMNS = ["Type", "Breed1", "Breed2", "Gender", "Color1", "Color2", "Color3", "Health",
                          "FurLength", "Vaccinated", "Dewormed", "Sterilized", "State", "RescuerID"]

def build_preprocessing_pipeline() -> Pipeline:
    return Pipeline([
        ('add_has_name', has_field_transformer(column_name="Name", new_column_name="hasName")),
        ('add_is_free', value_matches_transformer(column_name="Fee", new_column_name="isFree",
                                                  matches=lambda value: value < 1)),
        ('map_type_to_species', map_categories(column_name="Type", mapping_dict={1: 'dog', 2: 'cat'})),
        ('map_gender_to_names', map_categories(column_name="Gender", mapping_dict={1: 'male', 2: 'female', 3: 'mixed'})),
        ('truncate_breed1', truncate_categorical(column_name="Breed1", n_values_to_keep=10)),
        ('truncate_breed2', truncate_categorical(column_name="Breed2", n_values_to_keep=10)),
        ('truncate_state', truncate_categorical(column_name="State", n_values_to_keep=10)),
        ('truncate_rescuer_id', truncate_categorical(column_name="RescuerID", n_values_to_keep=10)),
        ('onehot_encode', CategoricalToOneHotEncoder(columns=ONEHOT_ENCODED_COLUMNS)),
        ('drop_unused_columns', DataFrameColumnDropper(
            column_names=['PetID', 'Description', 'Type_dog'])),
        ('pick_columns_by_importance', ColumnByFeatureImportancePicker(n_features=None))
    ])

preprocessing_pipeline = build_preprocessing_pipeline()
X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train, y_train)
X_val_preprocessed = preprocessing_pipeline.transform(X_val)

X_train_preprocessed.head(10)

### Print the columns:

In [14]:
print("Number of features:", len(list(X_train_preprocessed)))
print("")

print("Numerical columns:", list(X_train_preprocessed.select_dtypes(include="number")))
print("")

print("Non-numerical columns:", list(X_train_preprocessed.select_dtypes(exclude="number")))

### Check that only numerical fields exist in the preprocessed DataFrame

In [15]:
X_train_preprocessed.info()

## Run classifier

First define helper factory functions for building pipelines:

In [16]:
def build_preparation_pipeline():
    return Pipeline([
        ('to_numpy', DataFrameToValuesTransformer()),
        ('scaler', StandardScaler())
    ])

def build_full_pipeline(classifier=None):
    preprocessing_pipeline = build_preprocessing_pipeline()
    preparation_pipeline = build_preparation_pipeline()
    return Pipeline([
        ('preprocessing', preprocessing_pipeline),
        ('preparation', preparation_pipeline),
        ('classifier', classifier)  # Expected to be filled by parameter search
    ])

### Analyze feature importance
Train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to compute `feature_importances_`. Note that one can access the attributes of any transformer in the pipeline via the `named_steps` attribute containing a dictionary of, well, named steps of the pipeline. For example, to access the column names via the `DataFrameToValuesTransformer` class and its `attributes_` attribute:
```python
pipeline.named_steps['preparation'].named_steps['to_numpy'].attributes_
```
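
The same access pattern can be tried on a small, self-contained nested pipeline. The sketch below uses synthetic data and hypothetical feature names `f0` ... `f5`; only the step names `preparation` and `classifier` mirror the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
names = ["f{}".format(i) for i in range(X.shape[1])]  # hypothetical feature names

pipeline = Pipeline([
    ("preparation", Pipeline([("scaler", StandardScaler())])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# Reach into the nested pipeline for a fitted transformer ...
scaler = pipeline.named_steps["preparation"].named_steps["scaler"]
# ... and into the final step for the fitted classifier
importances = pipeline.named_steps["classifier"].feature_importances_

ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("{} -> {:.3f}".format(name, importance))
```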

In [17]:
def compute_feature_importances(classifier):
    """
    :param classifier: Classifier to use for computing feature importances, must have `feature_importances_` attribute
    :return: List of tuples containing column name and its feature importance
    """
    pipeline = build_full_pipeline(classifier=classifier)
    pipeline.fit(X_train, y_train)
    assert hasattr(classifier, 'feature_importances_')
    feature_importances = classifier.feature_importances_
    feature_names = pipeline.named_steps['preparation'].named_steps['to_numpy'].attributes_
    feature_importances_with_names = [(feature_name, feature_importance) for feature_name, feature_importance in zip(feature_names, feature_importances)]
    feature_importances_with_names.sort(key=lambda x: x[1], reverse=True)
    return feature_importances_with_names

rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=None)
feature_importances_with_names = compute_feature_importances(rf_classifier)

N_MOST_IMPORTANT_TO_SHOW = 50
print("Feature importances (top {}):".format(N_MOST_IMPORTANT_TO_SHOW))
for feature_name, feature_importance in feature_importances_with_names[:N_MOST_IMPORTANT_TO_SHOW]:
    print("{} -> {}".format(feature_name, feature_importance))
    
rf_pipeline = build_full_pipeline(classifier=rf_classifier)
cross_val_score(rf_pipeline, X_train, y_train, cv=5, scoring=make_scorer(cohen_kappa_score))

Let us check the cross-validation score for the simple Random Forest classifier with `cohen_kappa_score` (unweighted by default):
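
The TODO list mentions the quadratic weighted kappa. As a sketch of how it differs from the unweighted score, the toy example below compares two predictions that each contain a single mistake: one is off by one class, the other by four. The labels are made up; `weights="quadratic"` is a documented option of `cohen_kappa_score`, and extra keyword arguments to `make_scorer` are forwarded to the metric.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, make_scorer

y_true = np.array([0, 1, 2, 3, 4, 4])
y_near = np.array([0, 1, 2, 3, 4, 3])  # one mistake, off by one class
y_far = np.array([0, 1, 2, 3, 4, 0])   # one mistake, off by four classes

# The unweighted kappa only counts hits and misses
unweighted_near = cohen_kappa_score(y_true, y_near)
unweighted_far = cohen_kappa_score(y_true, y_far)

# The quadratic weighting penalizes distant mistakes more heavily
qwk_near = cohen_kappa_score(y_true, y_near, weights="quadratic")
qwk_far = cohen_kappa_score(y_true, y_far, weights="quadratic")

print("Unweighted:", unweighted_near, unweighted_far)
print("Quadratic: ", qwk_near, qwk_far)

# A scorer that could be passed as `scoring=` to cross_val_score or a search
qwk_scorer = make_scorer(cohen_kappa_score, weights="quadratic")
```

Since `AdoptionSpeed` is ordinal, the quadratic variant rewards predictions that land close to the true class, which the unweighted score cannot distinguish.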

In [18]:
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_pipeline = build_full_pipeline(classifier=rf_classifier)
cross_val_score(rf_pipeline, X_train, y_train, cv=5, scoring=make_scorer(cohen_kappa_score))

### Investigate confusion matrix

In [19]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# From https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    # print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    
    
y_pred = cross_val_predict(rf_pipeline, X=X_train, y=y_train, cv=5)

cnf_matrix = confusion_matrix(y_true=y_train, y_pred=y_pred)

plt.figure(figsize=(15, 7))
plot_confusion_matrix(cnf_matrix, classes=range(0, 5),
                      title='Confusion matrix, without normalization')
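
With imbalanced classes, raw counts can be hard to compare across rows. A minimal sketch of the row normalization used by `normalize=True` above, on made-up labels, shows that the diagonal of the normalized matrix is the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes of different sizes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)

# Divide each row by the number of true samples in that class
cm_normalized = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

# The diagonal now holds the fraction of each class that was predicted correctly
per_class_recall = np.diag(cm_normalized)
print(np.round(cm_normalized, 2))
print("Per-class recall:", np.round(per_class_recall, 2))
```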

## Parameter tuning

The cross-validated performance is clearly quite poor. Let us do some randomized search to see how much we can improve.

First define helper functions:

In [20]:
def build_search(pipeline, param_distributions, n_iter=10):
    return RandomizedSearchCV(pipeline, param_distributions=param_distributions, 
                              cv=5, return_train_score=True, refit='cohen_kappa',
                              n_iter=n_iter,
                              scoring={
                                    'accuracy': make_scorer(accuracy_score),
                                    'cohen_kappa': make_scorer(cohen_kappa_score)
                               },
                              verbose=1, random_state=42)

def pretty_cv_results(cv_results, 
                      sort_by='rank_test_cohen_kappa',
                      sort_ascending=True,
                      n_rows=20):
    df = pd.DataFrame(cv_results)
    cols_of_interest = [key for key in df.keys() if key.startswith('param_') 
                        or key.startswith('mean_train') 
                        or key.startswith('mean_test_')
                        or key.startswith('rank')]
    return df.loc[:, cols_of_interest].sort_values(by=sort_by, ascending=sort_ascending).head(n_rows)

def run_search(search):
    search.fit(X_train, y_train)
    print('Best score is:', search.best_score_)
    return pretty_cv_results(search.cv_results_)
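
The searches below rely on two features of scikit-learn: a whole pipeline step can be replaced through the parameter grid (the `'classifier'` key), and with several scorers `refit` names the one used to pick the best candidate. The following is a small self-contained sketch of that mechanism on synthetic data; the tiny grid and dataset are assumptions chosen only to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# The final step is a placeholder that the search fills in
pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", "passthrough")])

param_distributions = {
    "classifier": [RandomForestClassifier(n_estimators=20, random_state=0)],
    "classifier__max_depth": [3, None],  # applied to the estimator chosen above
}

search = RandomizedSearchCV(
    pipeline, param_distributions=param_distributions, n_iter=2, cv=3,
    scoring={"accuracy": make_scorer(accuracy_score),
             "cohen_kappa": make_scorer(cohen_kappa_score)},
    refit="cohen_kappa",  # the best candidate is chosen by this scorer
    random_state=0)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
print(results[["param_classifier__max_depth", "mean_test_accuracy",
               "mean_test_cohen_kappa", "rank_test_cohen_kappa"]])
```

Each scorer gets its own `mean_test_<name>` and `rank_test_<name>` columns in `cv_results_`, which is what `pretty_cv_results` sorts by.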

Let us first check the effect of the number of features on the performance:

In [None]:
param_distributions = {
        'preprocessing__pick_columns_by_importance__n_features': [None, 50],
        'classifier': [RandomForestClassifier(n_estimators=250, random_state=42, max_depth=10)],
        'classifier__max_depth': [10, None]
    }

rf_feature_search = build_search(build_full_pipeline(), param_distributions=param_distributions, n_iter=4)
rf_feature_cv_results = run_search(search=rf_feature_search)
rf_feature_cv_results

As it's not obvious from this quick parameter search whether the number of features should be restricted, we'll keep it as a search parameter below.

Let's search for the best `RandomForestClassifier`:

In [21]:
param_distributions = {
        'preprocessing__pick_columns_by_importance__n_features': [None],
        'classifier': [RandomForestClassifier(n_estimators=500, random_state=42)],
        'classifier__n_estimators': [500],
        'classifier__max_features': ['sqrt', 'log2'],  # 'sqrt' is what 'auto' meant for classifiers
        'classifier__max_depth': [None, 10],
        'classifier__bootstrap': [False, True],
        'classifier__min_samples_leaf': [1, 5, 10],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__criterion': ['gini', 'entropy'],
    }

rf_search = build_search(build_full_pipeline(), param_distributions=param_distributions, n_iter=50)
rf_cv_results = run_search(search=rf_search)
rf_cv_results

## [Extra-trees classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

In [24]:
param_distributions = {
        'preprocessing__pick_columns_by_importance__n_features': [None],
        'classifier': [ExtraTreesClassifier(n_estimators=500, random_state=42)],
        'classifier__n_estimators': [500],
        'classifier__max_features': ['sqrt', 'log2'],  # 'sqrt' is what 'auto' meant for classifiers
        'classifier__max_depth': [None],
        # 'classifier__bootstrap': [False, True],
        # 'classifier__min_samples_leaf': [0.001, 0.01, 0.05, 1, 5, 10],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__criterion': ['gini', 'entropy'],
    }

et_search = build_search(build_full_pipeline(), param_distributions=param_distributions, n_iter=50)
et_cv_results = run_search(search=et_search)
et_cv_results

## [Logistic regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
from sklearn.linear_model import LogisticRegression

param_distributions = {
        'preprocessing__pick_columns_by_importance__n_features': [None, 50],
        'classifier': [LogisticRegression(solver='lbfgs', random_state=42)],
        'classifier__multi_class': ['ovr', 'multinomial'],
        'classifier__C': np.logspace(-3, 0, 4),
    }

logistic_search = build_search(build_full_pipeline(), param_distributions=param_distributions, n_iter=20)
logistic_cv_results = run_search(search=logistic_search)
logistic_cv_results

## [Multilayer perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)

In [None]:
from sklearn.neural_network import MLPClassifier

param_distributions = {
        'preprocessing__pick_columns_by_importance__n_features': [None, 50],
        'preparation__scaler': [MinMaxScaler()],
        'classifier': [MLPClassifier(hidden_layer_sizes=(100, ), random_state=42)],
        'classifier__hidden_layer_sizes': [[10], [10, 10,], [10, 10, 10]],
        'classifier__alpha': np.logspace(-4, -2, 3),
        'classifier__solver': ['adam'],
        'classifier__tol': np.logspace(-4, -2, 3),
        'classifier__learning_rate_init': np.logspace(-3, -1, 3),
        'classifier__activation': ['relu', 'tanh'],
    }

mlp_search = build_search(build_full_pipeline(), param_distributions=param_distributions, n_iter=5)
mlp_cv_results = run_search(search=mlp_search)
mlp_cv_results

## [Gaussian process classifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html)
This turned out to be too computationally heavy, so it's skipped.

## [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
SVC does not scale well to large datasets, so it is skipped. The search is kept below, commented out, for reference.

In [None]:
"""
from sklearn.svm import SVC
param_distributions = { 
        'preprocessing__pick_columns_by_importance__n_features': [None],
        'classifier': [ SVC(random_state=42, probability=True) ], # Probability to use soft voting later
        'classifier__C': np.logspace(-1, 1, 3),
        'classifier__kernel': ['linear', 'poly', 'rbf'],
        'classifier__gamma': ['auto', 'scale']
    }


svm_search = build_search(pipeline=build_full_pipeline(), param_distributions=param_distributions, n_iter=1)
svm_cv_results = run_search(search=svm_search)
svm_cv_results
"""

## [Gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

param_distributions = { 
        'preprocessing__pick_columns_by_importance__n_features': [None],
        'classifier': [ GradientBoostingClassifier(random_state=42) ],
        'classifier__loss': ['log_loss'],  # called 'deviance' before scikit-learn 1.1
        'classifier__n_estimators': [100, 300],
        'classifier__max_features': ['log2', None],
        'classifier__max_depth': [5, 10],
        'classifier__min_samples_leaf': [1, 5, 10],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__learning_rate': [0.1, 0.2],
        'classifier__subsample': [0.75, 0.90, 1.0]
    }

gb_search = build_search(pipeline=build_full_pipeline(), param_distributions=param_distributions, n_iter=20)
gb_cv_results = run_search(search=gb_search)
gb_cv_results

### [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)


In [None]:
param_distributions = {
    'preprocessing__pick_columns_by_importance__n_features': [None],
    'classifier': [ lgb.LGBMClassifier(random_state=42, objective='multiclass') ],
    'classifier__boosting_type': ['gbdt', 'dart'],
    'classifier__num_leaves': [20, 31, 50],
    'classifier__max_depth': [-1],
    'classifier__learning_rate': [0.1, 0.2],
    'classifier__n_estimators': [100, 300],
    'classifier__subsample': [1.0, 0.9],
    'classifier__reg_alpha': [0.0, *np.logspace(-3, -2, 2)],
    'classifier__reg_lambda': [0.0],
}

# cross_val_score(lgbm_pipeline, X_train, y_train, cv=5, scoring=make_scorer(cohen_kappa_score))
# compute_feature_importances(classifier=lgbm_classifier)

lgbm_search = build_search(pipeline=build_full_pipeline(), param_distributions=param_distributions, n_iter=20)
lgbm_cv_results = run_search(search=lgbm_search)
lgbm_cv_results

## [XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)

In [34]:
param_distributions = {
    'preprocessing__pick_columns_by_importance__n_features': [None],
    'classifier': [ XGBClassifier(random_state=42) ],
    'classifier__max_depth': [3, 5],
    'classifier__learning_rate': [0.1, 0.2],
    'classifier__n_estimators': [100, 200],
    'classifier__reg_alpha': [0, 1e-3],
    'classifier__reg_lambda': [1],  # the scikit-learn wrapper names the L2 term `reg_lambda`
}

# cross_val_score(lgbm_pipeline, X_train, y_train, cv=5, scoring=make_scorer(cohen_kappa_score))
# compute_feature_importances(classifier=lgbm_classifier)

xgb_search = build_search(pipeline=build_full_pipeline(), param_distributions=param_distributions, n_iter=20)
xgb_cv_results = run_search(search=xgb_search)
xgb_cv_results

## Submission

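Create a stacking ensemble from the best estimators: the class probabilities predicted by the first-level estimators become the input features of a second-level classifier.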

In [25]:
from sklearn.model_selection import cross_val_predict

def cross_val_predictions(classifiers, X, y):
    """
    Stack the cross-validated class probabilities of all classifiers into a single matrix.
    Predictions come from `cross_val_predict`, so each row is predicted by a model
    that did not see that row during fitting (out-of-fold predictions).
    """
    return np.hstack([cross_val_predict(classifier, X, y, cv=5, method='predict_proba') for classifier in classifiers])

def first_level_predictions(classifiers, X):
    """
    Stack the class probabilities predicted by the already fitted classifiers into a single matrix.
    """
    return np.hstack([classifier.predict_proba(X) for classifier in classifiers])

best_estimators = [
    rf_search.best_estimator_,
    et_search.best_estimator_,
    gb_search.best_estimator_,
    lgbm_search.best_estimator_,
    xgb_search.best_estimator_
]

X_train_second_level = cross_val_predictions(classifiers=best_estimators, X=X_train, y=y_train)

stacking_classifier = GradientBoostingClassifier(random_state=42)

stacking_classifier.fit(X_train_second_level, y_train)
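
To make the shape of the second-level input concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the two stand-in classifiers are assumptions for illustration, not the estimators found by the searches above. With `k` classifiers and `c` classes the stacked matrix has `k * c` columns, and every row is predicted by models that did not see it during fitting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class problem standing in for the real training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

# Stand-in first-level classifiers (hypothetical choices).
demo_classifiers = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=20, random_state=0),
]

# Out-of-fold class probabilities, one block of columns per classifier.
second_level = np.hstack([
    cross_val_predict(clf, X_demo, y_demo, cv=5, method='predict_proba')
    for clf in demo_classifiers
])

print(second_level.shape)  # (n_samples, n_classifiers * n_classes)
```

Each block of `c` columns is a probability distribution, so its rows sum to one. The second-level classifier only ever sees these probabilities, never the original features.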

Check the performance on the hold-out set `(X_val, y_val)` when trained on `(X_train, y_train)`:

In [26]:
for estimator in best_estimators:
    estimator.fit(X_train, y_train)

X_val_second_level = first_level_predictions(best_estimators, X_val)
y_val_pred = stacking_classifier.predict(X_val_second_level)

print("Performance of stacking classifier on the hold-out set:", cohen_kappa_score(y_val, y_val_pred))
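
The score above is the unweighted kappa, since `cohen_kappa_score` applies no weights by default. The TODO list mentions the quadratic weighted kappa, which penalizes a prediction more the further it lands from the true class. A minimal sketch of the difference, using made-up labels (`y_true`, `y_near` and `y_far` are hypothetical and not taken from the data):

```python
from sklearn.metrics import cohen_kappa_score, make_scorer

# Made-up ordinal labels, for illustration only.
y_true = [0, 1, 2, 3, 4]
y_near = [0, 1, 2, 3, 3]  # one mistake, off by one class
y_far = [0, 1, 2, 3, 0]   # one mistake, off by four classes

for name, y_pred in [('near', y_near), ('far', y_far)]:
    unweighted = cohen_kappa_score(y_true, y_pred)
    quadratic = cohen_kappa_score(y_true, y_pred, weights='quadratic')
    print(name, 'unweighted:', round(unweighted, 3), 'quadratic:', round(quadratic, 3))

# A scorer that could be passed as `scoring=` to a search or to `cross_val_score`.
qwk_scorer = make_scorer(cohen_kappa_score, weights='quadratic')
```

Both predictions get one label wrong out of five, but the quadratic weighting scores the near miss far higher than the distant one.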

Train the stacking classifier with all available data:

In [29]:
X_train_val_second_level = cross_val_predictions(best_estimators, X=X_train_val, y=y_train_val)
stacking_classifier.fit(X_train_val_second_level, y_train_val)

for estimator in best_estimators:
    estimator.fit(X_train_val, y_train_val)

X_test_second_level = first_level_predictions(best_estimators, X=X_test)

Compute predictions for the test set:

In [30]:
def get_predictions(estimator, X, pet_ids):
    """
    Predict `AdoptionSpeed` for the rows of `X` and pair each prediction with its `PetID`.
    """
    return pd.DataFrame({'PetID': np.asarray(pet_ids), 'AdoptionSpeed': estimator.predict(X)},
                        columns=['PetID', 'AdoptionSpeed'])

predictions = get_predictions(stacking_classifier, X=X_test_second_level, pet_ids=X_test['PetID'])

Write `submission.csv`:

In [31]:
def write_submission(predictions):
    submission_folder = '.'
    dest_file = os.path.join(submission_folder, 'submission.csv')
    predictions.to_csv(dest_file, index=False)
    print("Wrote to {}".format(dest_file))
    
write_submission(predictions)
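
Before uploading, it can be worth reading the file back and checking its layout. The sketch below does this on a tiny made-up frame written to a temporary directory. The `PetID` values are hypothetical, and the 0 to 4 label range is an assumption based on the competition's data description:

```python
import os
import tempfile

import pandas as pd

# Made-up predictions with the same two columns as the real submission.
demo_predictions = pd.DataFrame({
    'PetID': ['pet_a', 'pet_b', 'pet_c'],
    'AdoptionSpeed': [4, 2, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'submission.csv')
    demo_predictions.to_csv(demo_file, index=False)
    read_back = pd.read_csv(demo_file)

# Layout checks: column order, no missing values, unique ids, labels in range.
assert list(read_back.columns) == ['PetID', 'AdoptionSpeed']
assert not read_back.isnull().values.any()
assert read_back['PetID'].is_unique
assert read_back['AdoptionSpeed'].between(0, 4).all()
```

The same four checks could be run against the real `submission.csv` by pointing `pd.read_csv` at it.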