Here's my notebook for the February 2022 Tabular Playground Series. Nothing too special here, just an ExtraTreesClassifier.

I learnt a lot looking through others' notebooks. In particular, the following had lots of good ideas (some of which I borrowed for my entries):

https://www.kaggle.com/ayoubchaoui/extratreesclassifier-vs-randomforestclassifier

https://www.kaggle.com/munumbutt/extratrees-stratifiedkfold-memory-optimization

https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees

https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense

https://www.kaggle.com/kotrying/extra-blender-addition

https://www.kaggle.com/ambrosm/tpsfeb22-03-clustering-improves-the-predictions

Script parameters:

In [1]:
num_estimators = 80
num_splits = 10

Library imports and data importing:

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv
/kaggle/input/tabular-playground-series-feb-2022/train.csv
/kaggle/input/tabular-playground-series-feb-2022/test.csv


In [3]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv',
                   index_col='row_id')
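# Encode the string class labels as integers. The encoder is kept so that
# predictions can be mapped back to the original label names later.
from sklearn.preprocessing import LabelEncoder
target_encoder = LabelEncoder()
y = pd.Series(target_encoder.fit_transform(train["target"]))
X = train.drop(labels=['target'], axis=1)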

In [4]:
test = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv',
                    index_col='row_id')
print(test.head())

           A0T0G0C10  A0T0G1C9      A0T0G2C8  A0T0G3C7  A0T0G4C6  A0T0G5C5  \
row_id                                                                       
200000 -9.536743e-07 -0.000002 -9.153442e-07  0.000024  0.000034 -0.000002   
200001 -9.536743e-07 -0.000010 -4.291534e-05 -0.000114  0.001800 -0.000240   
200002  4.632568e-08  0.000003  8.465576e-08 -0.000014  0.000007 -0.000005   
200003 -9.536743e-07 -0.000008  8.084656e-06  0.000216  0.000420  0.000514   
200004 -9.536743e-07 -0.000010 -4.291534e-05 -0.000114 -0.000200 -0.000240   

        A0T0G6C4  A0T0G7C3  A0T0G8C2  A0T0G9C1  ...  A8T0G0C2  A8T0G1C1  \
row_id                                          ...                       
200000  0.000021  0.000024 -0.000009 -0.000008  ...  0.000039  0.000085   
200001  0.001800 -0.000114  0.000957 -0.000010  ... -0.000043  0.000914   
200002 -0.000004  0.000003  0.000004 -0.000008  ...  0.000041  0.000102   
200003  0.000452  0.000187 -0.000005 -0.000008  ...  0.000069  0.000158   
200

Train an ExtraTreesClassifier. I also tried RandomForest and a range of hyperparameters (evaluated with sklearn.model_selection.GridSearchCV, among other tools), but nothing gave much improvement over an ExtraTreesClassifier.
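
For reference, a grid search of this kind might look like the sketch below. The synthetic data from make_classification, the grid values and the 3-fold setting are all assumptions chosen to keep the example small; they are not the settings used for the competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic stand-in for the competition data (hypothetical sizes).
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [20, 40],
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(
    estimator=ExtraTreesClassifier(random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

# Best setting and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)
```

Swapping ExtraTreesClassifier for RandomForestClassifier in the estimator argument would give the equivalent search for a random forest.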

In each cross-validation fold, the model's predicted class probabilities for the test set are saved so that they can be averaged later.

In [5]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import time

fold_probs = [] # Store the probabilities from each fold for later use
                # The final predicted value is determined by the
                # average across all cross-validation folds

# evaluate the model using Stratified K-Fold cross validation:
for fold, (train_id, valid_id) in enumerate(StratifiedKFold(n_splits=num_splits,
                                                            shuffle=True,
                                                            random_state=456).split(X, y)):

    # Training and validation subsets for this fold
    Xt = X.iloc[train_id]
    yt = y.iloc[train_id]
    Xv = X.iloc[valid_id]
    yv = y.iloc[valid_id]
    model = ExtraTreesClassifier(n_estimators = num_estimators)
    start = time.time()

    model.fit(Xt, yt)
    
    end = time.time()
    
    valid_pred = model.predict(Xv)
    valid_score = accuracy_score(yv, valid_pred)
    
    print("Fold:", fold + 1, "Accuracy:", valid_score, 'Time (min.):', (end - start)/60)
    
    
    fold_probs.append(model.predict_proba(test))
    

Fold: 1 Accuracy: 0.99555 Time (min.): 0.9133802612622579
Fold: 2 Accuracy: 0.9949 Time (min.): 0.9318466265996297
Fold: 3 Accuracy: 0.99515 Time (min.): 0.9136050780614217
Fold: 4 Accuracy: 0.99625 Time (min.): 0.938936996459961
Fold: 5 Accuracy: 0.99545 Time (min.): 0.9208490173021953
Fold: 6 Accuracy: 0.99545 Time (min.): 0.9412370800971985
Fold: 7 Accuracy: 0.99525 Time (min.): 0.9276675899823507
Fold: 8 Accuracy: 0.99605 Time (min.): 0.9280829389890035
Fold: 9 Accuracy: 0.99495 Time (min.): 0.9220328609148661
Fold: 10 Accuracy: 0.9959 Time (min.): 0.9335474252700806
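
The per-fold accuracies printed above can be condensed into a single cross-validation score. A minimal sketch, with the values copied by hand from the output (the variable names are just for illustration):

```python
from statistics import mean, stdev

# Validation accuracies for folds 1 to 10, copied from the output above.
fold_accuracies = [0.99555, 0.9949, 0.99515, 0.99625, 0.99545,
                   0.99545, 0.99525, 0.99605, 0.99495, 0.9959]

cv_mean = mean(fold_accuracies)   # average accuracy across folds
cv_std = stdev(fold_accuracies)   # sample standard deviation across folds

print(f"CV accuracy: {cv_mean:.5f} +/- {cv_std:.5f}")
```

The mean works out to 0.99549, with very little spread between folds.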


Although the cross-validated accuracy is quite high (above 99% in every fold), the accuracy on the test set is a lot lower due to target drift. Essentially, bacteria mutated between the training and test set, so decision boundaries learned from the training set are less accurate on the test set. This is explained in more detail (with figures) in AmbrosM's notebooks (linked above) and elsewhere. I spent a bit of time exploring this, but ran out of time before coming up with a novel way to improve the test-set predictions.
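
One generic way to check for this kind of train/test mismatch is adversarial validation: train a classifier to tell training rows from test rows and look at its AUC. This is not the approach taken in this notebook or in AmbrosM's notebooks, just a sketch on synthetic data. The helper name domain_auc, the Gaussian toy data and the choice of logistic regression are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_auc(train_features, test_features):
    """Cross-validated AUC of a classifier that guesses which set a row came from.

    An AUC near 0.5 means the two sets look alike to this classifier;
    an AUC well above 0.5 means they can be told apart.
    """
    features = np.vstack([train_features, test_features])
    is_test = np.concatenate([np.zeros(len(train_features)),
                              np.ones(len(test_features))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, is_test, cv=5, scoring="roc_auc").mean()

# Toy data (hypothetical): one "test set" from the same distribution as the
# training data, and one whose feature means are shifted by 1.0.
rng = np.random.default_rng(0)
train_demo = rng.normal(size=(500, 5))
test_same = rng.normal(size=(500, 5))
test_shifted = rng.normal(loc=1.0, size=(500, 5))

auc_same = domain_auc(train_demo, test_same)
auc_shifted = domain_auc(train_demo, test_shifted)
print("AUC, same distribution:", auc_same)
print("AUC, shifted test set: ", auc_shifted)
```

A linear model like this only picks up shifts in the feature means, so a tree-based classifier may be a better choice on real data.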

Next, we average the class probabilities across the cross-validation folds and take the most probable class for each row as the final prediction:

In [6]:
mean_prob = sum(fold_probs) / len(fold_probs) # Mean probability for each row
print(mean_prob)

mean_pred = target_encoder.inverse_transform(np.argmax(mean_prob, axis=1))
print(mean_pred)

[[0.      0.      0.      ... 0.      0.      0.     ]
 [0.03625 0.00125 0.00625 ... 0.00125 0.00875 0.0025 ]
 [0.00625 0.00125 0.95375 ... 0.      0.00625 0.03125]
 ...
 [0.4     0.0425  0.0775  ... 0.05125 0.12    0.08625]
 [0.39    0.0575  0.085   ... 0.075   0.11125 0.08   ]
 [0.      0.      0.01375 ... 0.      0.0075  0.97875]]
['Escherichia_fergusonii' 'Salmonella_enterica' 'Enterococcus_hirae' ...
 'Bacteroides_fragilis' 'Bacteroides_fragilis' 'Streptococcus_pyogenes']
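
Averaging the probabilities like this amounts to soft voting, which can give a different answer from a majority vote over each fold's predicted label. A toy sketch with made-up numbers (three folds, two rows, three hypothetical classes "A", "B", "C"):

```python
import numpy as np

classes = np.array(["A", "B", "C"])  # hypothetical class labels

# Made-up per-fold probabilities, each of shape (n_rows, n_classes).
fold_probs_demo = [
    np.array([[0.90, 0.10, 0.00], [0.20, 0.50, 0.30]]),
    np.array([[0.45, 0.55, 0.00], [0.10, 0.60, 0.30]]),
    np.array([[0.45, 0.55, 0.00], [0.30, 0.30, 0.40]]),
]
stacked = np.stack(fold_probs_demo)  # shape (n_folds, n_rows, n_classes)

# Soft voting: average the probabilities, then take the most probable class.
mean_prob_demo = stacked.mean(axis=0)
soft_vote = classes[mean_prob_demo.argmax(axis=1)]

# Hard voting: each fold votes for its own top class, majority wins.
fold_labels = stacked.argmax(axis=2)  # shape (n_folds, n_rows)
hard_vote = classes[[np.bincount(votes, minlength=len(classes)).argmax()
                     for votes in fold_labels.T]]

print("soft:", soft_vote)
print("hard:", hard_vote)
```

For the first row, one very confident fold outweighs two lukewarm ones under soft voting, so the averaged prediction is "A" while the majority vote is "B".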


In [7]:
output = pd.DataFrame(data = {'row_id': test.index, 'target': mean_pred})
print(output)
output.to_csv('submission.csv', index=False)

       row_id                    target
0      200000    Escherichia_fergusonii
1      200001       Salmonella_enterica
2      200002        Enterococcus_hirae
3      200003       Salmonella_enterica
4      200004     Staphylococcus_aureus
...       ...                       ...
99995  299995  Streptococcus_pneumoniae
99996  299996      Bacteroides_fragilis
99997  299997      Bacteroides_fragilis
99998  299998      Bacteroides_fragilis
99999  299999    Streptococcus_pyogenes

[100000 rows x 2 columns]
