# To run the notebook

**Run the other notebooks (categorical.ipynb, noisy-images.ipynb, noisy-text.ipynb) before this one!**

Use Google Colab, preferably with a GPU runtime: training is very slow without one, and the CatBoost cells are configured with `task_type='GPU'`. Before running any code, upload the predictions from the other notebooks

*   categorical_train_pred.csv
*   categorical_test_pred.csv
*   image_train_pred.csv
*   image_test_pred.csv
*   desc_train_pred.csv
*   desc_test_pred.csv

as well as *train.csv* and *test.csv* to the filesystem. Then run all the cells. 
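A quick way to confirm that the upload is complete before running the cells is sketched below. The helper name `missing_files` is my own and not part of the notebook; the file names are the ones listed above.

```python
from pathlib import Path

# File names taken from the upload list above.
REQUIRED_FILES = [
    "categorical_train_pred.csv",
    "categorical_test_pred.csv",
    "image_train_pred.csv",
    "image_test_pred.csv",
    "desc_train_pred.csv",
    "desc_test_pred.csv",
    "train.csv",
    "test.csv",
]


def missing_files(directory="."):
    """Return the required files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_files()` in the first cell returns an empty list when everything has been uploaded to the working directory.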

After the last cell runs, a file called *catboost_pred.csv* is saved to the working directory. This is the final prediction file that I submit to Kaggle.

# Summary of techniques

For this notebook I use ensemble learning to combine the results of my other 3 notebooks.

I combine two ensemble ideas: merging the outputs of several models, and boosting. I take the training predictions of my 3 other models, add them to the training data as extra categorical features, and pass everything to CatBoost, a library that performs gradient boosting on decision trees.

I refer to the combining step as bagging (hence the `bagging` flag in the code), although feeding the predictions of separately trained models into a second-stage model is more precisely called stacking. I don't want to just take a weighted average of the 3 models, since there may be situations where one model should be trusted more than usual.
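To illustrate the limitation of a fixed combination rule, here is a minimal sketch of an unweighted vote over three models. The category names and predictions are made up for illustration, and `majority_vote` is a hypothetical helper that the notebook does not use.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-sample class predictions with an unweighted vote.

    Ties are resolved in favour of the earliest model in the argument list.
    """
    combined = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote (in model order) that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical predictions for four items from the three base models.
categorical = ["Shoes", "Bags", "Shoes", "Watches"]
image = ["Shoes", "Shoes", "Bags", "Watches"]
text = ["Bags", "Bags", "Bags", "Shoes"]

print(majority_vote(categorical, image, text))
# → ['Shoes', 'Bags', 'Bags', 'Watches']
```

Every model gets the same say on every item here. A second-stage learner can instead learn, for example, to trust the image model more for some colours or usages than for others.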

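For that, I use the CatBoost decision trees, which also take in the categorical data and the text descriptions (processed with several techniques, including bag of words, naive Bayes and BM25). The trees can then learn which model's prediction to rely on in which situation. The boosting builds the trees sequentially, each one correcting the errors of the previous ones, which should improve accuracy over a fixed combination rule.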
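Adding the base-model predictions as extra feature columns can be sketched as follows. The notebook assigns these columns by row position. This sketch shows an alternative that joins on `id`, under the assumption (not confirmed by the notebook) that each prediction file carries an `id` column. The helper `add_model_predictions` and the toy data are hypothetical.

```python
import pandas as pd


def add_model_predictions(features, predictions_by_name, key="id"):
    """Append each model's predicted category as a new column, matched on `key`."""
    out = features.copy()
    for column, preds in predictions_by_name.items():
        renamed = preds[[key, "category"]].rename(columns={"category": column})
        # validate= raises if an id is duplicated on either side.
        out = out.merge(renamed, on=key, how="left", validate="one_to_one")
    return out


# Toy data: the image predictions are deliberately in a different row order.
train = pd.DataFrame({"id": [10, 11, 12], "gender": ["Men", "Women", "Men"]})
image_pred = pd.DataFrame({"id": [12, 10, 11], "category": ["Bags", "Shoes", "Shoes"]})
text_pred = pd.DataFrame({"id": [10, 11, 12], "category": ["Shoes", "Bags", "Bags"]})

stacked = add_model_predictions(train, {"c_image": image_pred, "c_text": text_pred})
print(stacked)
```

Joining on a key keeps each prediction attached to the right row even if the training data is shuffled or the prediction files are ordered differently.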

I also used hyperparameter tuning (with the hyperopt library) to choose the learning rate and the L2 leaf regularization. Each candidate is scored by its accuracy under 3-fold cross validation.
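The tuning cell defines the `l2_leaf_reg` search space as `hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1)`. The hyperopt documentation describes this distribution as `round(exp(uniform(low, high)) / q) * q`. The sketch below reproduces that rule with the standard library only, to show which values it can produce. It is not hyperopt's own sampler, and `sample_qloguniform` is a name I made up.

```python
import math
import random


def sample_qloguniform(low, high, q, rng):
    """Draw round(exp(uniform(low, high)) / q) * q."""
    return round(math.exp(rng.uniform(low, high)) / q) * q


rng = random.Random(0)
samples = [sample_qloguniform(0, 2, 1, rng) for _ in range(1000)]

# exp(0) = 1 and exp(2) is about 7.39, so every draw is an integer from 1 to 7.
print(sorted(set(samples)))
```

Because the draw is uniform in log space, small regularization values are sampled more often than large ones.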

I then take the optimal hyperparameters and pass them into a model. I run 3-fold cross validation to estimate its accuracy, train the model on the full training set, and then generate my predictions and save them to catboost_pred.csv.
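For readers unfamiliar with k-fold cross validation, the idea is sketched below in plain Python. This is only an illustration: the notebook relies on CatBoost's `cv` function, whereas this sketch uses contiguous, unshuffled folds and a toy "model" that always predicts the most common training label. All names and labels are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(labels, fit_predict, k=3):
    """Average held-out accuracy over k folds.

    fit_predict(train_idx, test_idx) must return one prediction per test index.
    """
    folds = k_fold_indices(len(labels), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        preds = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[j] for p, j in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Toy labels and a stand-in model that predicts the majority training class.
labels = ["Shoes", "Shoes", "Bags", "Shoes", "Bags", "Shoes", "Shoes", "Shoes", "Bags"]


def majority_class(train_idx, test_idx):
    most_common = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
    return [most_common] * len(test_idx)


print(cross_validated_accuracy(labels, majority_class))
```

Each sample is used for evaluation exactly once, so the averaged score is less dependent on one particular train/validation split.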

I experimented with many different hyperparameters. I was able to get my optimal Kaggle submission, but after changing a few different things I can no longer seem to replicate the same accuracy (although it is still close).

In [1]:
!pip install catboost
!pip install scikit-learn
!pip install hyperopt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
     76.6/76.6 MB 10.9 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from catboost import Pool, CatBoostClassifier, metrics, cv
import hyperopt
import numpy as np
import os
import pandas as pd
import shutil
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [3]:
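df = pd.read_csv('train.csv')

cat_features = ['gender', 'baseColour', 'season', 'usage']
text_features = ['noisyTextDescription']

# When True, the predictions of the three other notebooks are added as
# extra categorical features.
bagging = True

if bagging:
    # Attach the predictions before shuffling: the assignments below align on
    # the row index, so this assumes each prediction file lists its rows in
    # the same order as train.csv. Shuffling first would pair each row with
    # another row's predictions.
    df_cat = pd.read_csv('categorical_train_pred.csv')
    df['c_cat'] = df_cat['category']
    df_image = pd.read_csv('image_train_pred.csv')
    df['c_image'] = df_image['category']
    df_text = pd.read_csv('desc_train_pred.csv')
    df['c_text'] = df_text['category']

    cat_features.extend(['c_cat', 'c_image', 'c_text'])

# Shuffle the rows (no seed is set, so the order differs between runs).
df = df.sample(frac=1).reset_index(drop=True)

X = df.drop(['category', 'id'], axis=1)
y = df.category

# X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=42)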

In [4]:
train_pool = Pool(
    X,
    y,
    cat_features=cat_features,
    text_features=text_features,
    feature_names=list(X)
)

In [5]:
def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=1000,
        eval_metric=metrics.Accuracy(),
        verbose=False,
        loss_function=metrics.MultiClass(),
        text_processing='NaiveBayes+Word|BoW+Word|BM25+Word',
        # text_processing='NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word',
        task_type='GPU',
        devices='0:1',
    )
    
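    # 3-fold cross validation (3 is CatBoost's default, stated explicitly here)
    cv_data = cv(
        train_pool,
        model.get_params(),
        fold_count=3,
        logging_level='Silent',
    )

    # Remove CatBoost's log directory between trials
    shutil.rmtree('catboost_info', ignore_errors=True)
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])

    return 1 - best_accuracy  # hyperopt minimises, so return the error rate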

In [6]:
%%time

params_space = {
    # round(exp(uniform(0, 2))): integer-valued, between 1 and 7
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

if os.path.exists('catboost_info'):
    shutil.rmtree('catboost_info')

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=10,
    trials=trials,
)

print(best)

100%|██████████| 10/10 [18:06<00:00, 108.67s/trial, best loss: 0.17025078286935713]
{'l2_leaf_reg': 1.0, 'learning_rate': 0.16664747799956628}
CPU times: user 19min 51s, sys: 9min 57s, total: 29min 48s
Wall time: 18min 6s


In [7]:
%%time

if os.path.exists('catboost_info'):
    shutil.rmtree('catboost_info')

model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=2000,
    eval_metric=metrics.Accuracy(),
    verbose=False,
    loss_function=metrics.MultiClass(),
    text_processing='NaiveBayes+Word|BoW+Word|BM25+Word',
    # text_processing='NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word',
    task_type='GPU',
    devices='0:1'
)

# model = CatBoostClassifier(
#     learning_rate=0.3,
#     iterations=1200,
#     eval_metric=metrics.Accuracy(),
#     verbose=False,
#     loss_function=metrics.MultiClass(),
#     # text_processing='NaiveBayes+Word|BoW+Word|BM25+Word',
#     text_processing='NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word',
#     task_type='GPU',
#     devices='0:1'
# )

cv_data = cv(train_pool, model.get_params())

Training on fold [0/3]
bestTest = 0.8348573012
bestIteration = 1993
Training on fold [1/3]
bestTest = 0.8269070735
bestIteration = 1887
Training on fold [2/3]
bestTest = 0.831782192
bestIteration = 1354
CPU times: user 3min 52s, sys: 1min 57s, total: 5min 49s
Wall time: 3min 32s


In [8]:
model.fit(train_pool, verbose=100)

0:	learn: 0.6246359	total: 31.9ms	remaining: 1m 3s
100:	learn: 0.8154159	total: 3.43s	remaining: 1m 4s
200:	learn: 0.8282240	total: 8.37s	remaining: 1m 14s
300:	learn: 0.8393212	total: 11.5s	remaining: 1m 4s
400:	learn: 0.8448236	total: 14.5s	remaining: 57.8s
500:	learn: 0.8507884	total: 19.4s	remaining: 57.9s
600:	learn: 0.8562907	total: 22.4s	remaining: 52.3s
700:	learn: 0.8608221	total: 25.5s	remaining: 47.3s
800:	learn: 0.8647986	total: 28.6s	remaining: 42.7s
900:	learn: 0.8691913	total: 33.3s	remaining: 40.6s
1000:	learn: 0.8731216	total: 36.3s	remaining: 36.3s
1100:	learn: 0.8769594	total: 39.4s	remaining: 32.2s
1200:	learn: 0.8815370	total: 44.2s	remaining: 29.4s
1300:	learn: 0.8855135	total: 47.3s	remaining: 25.4s
1400:	learn: 0.8895362	total: 50.5s	remaining: 21.6s
1500:	learn: 0.8932353	total: 53.6s	remaining: 17.8s
1600:	learn: 0.8982753	total: 58.6s	remaining: 14.6s
1700:	learn: 0.9018357	total: 1m 1s	remaining: 10.8s
1800:	learn: 0.9054885	total: 1m 4s	remaining: 7.15s
190

<catboost.core.CatBoostClassifier at 0x7f9dff621d00>

In [9]:
# train_pool = Pool(
#     X_train,
#     y_train,
#     cat_features=cat_features,
#     text_features=text_features,
#     feature_names=list(X_train)
# )

# val_pool = Pool(
#     X_val,
#     y_val,
#     cat_features=cat_features,
#     text_features=text_features,
#     feature_names=list(X_train)
# )

# catboost_default_params = {
#         'iterations': 1000,
#         'learning_rate': 0.03,
#         'eval_metric': 'Accuracy',
#         'l2_leaf_reg': 2.0,
#         'text_processing': [
#             'NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word'
#         ],
#         'task_type': 'GPU',
#         'devices': '0:1',
#         'use_best_model': True
#     }

# model = CatBoostClassifier(**catboost_default_params)
# model.fit(train_pool, eval_set=val_pool, verbose=100)

In [10]:
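df_test = pd.read_csv('test.csv')

# Keep 'id' for the submission only: it was dropped from the training
# features, so it is dropped from the test features as well.
X_test = df_test.drop(['id'], axis=1)

if bagging:
    # As with the training data, this assumes each prediction file lists its
    # rows in the same order as test.csv.
    df_cat = pd.read_csv('categorical_test_pred.csv')
    X_test['c_cat'] = df_cat['category']
    df_image = pd.read_csv('image_test_pred.csv')
    X_test['c_image'] = df_image['category']
    df_text = pd.read_csv('desc_test_pred.csv')
    X_test['c_text'] = df_text['category']

test_pool = Pool(
    data=X_test,
    cat_features=cat_features,
    text_features=text_features,
    feature_names=list(X_test)
)

submission = pd.DataFrame()
submission['id'] = df_test['id']
# predict() returns one label per row in a 2-D array for MultiClass; flatten it
submission['category'] = np.ravel(model.predict(test_pool))

submission.to_csv('catboost_pred.csv', index=False)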
