# Postprocessing for the February TPS

Most of this code has been copied from @[maxencefzr](https://www.kaggle.com/maxencefzr)'s [notebook](https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees). I've added the postprocessing and the code for dealing with duplicate training samples. 

Release notes:
- V3: Using `scipy.optimize` to optimize the postprocessing (didn't improve the LB score)
- V4: Dealing with duplicate training data and sample weights

In [1]:
#%%capture

# Intel® Extension for Scikit-learn installation:
!pip install scikit-learn-intelex

import os
import warnings

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import mode
from tqdm import tqdm
from pathlib import Path

from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Mute warnings
warnings.filterwarnings("ignore")



Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
data_dir = Path('../input/tabular-playground-series-feb-2022')

df_train = pd.read_csv(data_dir / 'train.csv', index_col='row_id')
df_test  = pd.read_csv(data_dir / 'test.csv', index_col='row_id')

TARGET = df_train.columns.difference(df_test.columns)[0]
features = df_train.columns[df_train.columns != TARGET]

# Deduplicating the training data

Among the 200000 training samples, there are 76007 duplicates. These duplicates are an issue for two reasons:
1. They make training times unnecessarily long (OK, if you have enough patience, this could be a non-issue).
2. They inflate the CV scores if not handled correctly, because copies of the same row can end up in both the training and the validation fold.

We must not simply drop the duplicates because this would change the probability distribution. After all, if one particular measurement outcome has been measured 18 times, it should have higher weight than an outcome which has been measured only once. Fortunately, the `fit()` method of most scikit-learn estimators has an optional parameter `sample_weight` for this purpose.

In the following, we convert the training dataframe to a new dataframe without the duplicated rows. To compensate for dropping the duplicates, we add a column `sample_weight` to the dataframe.

In [3]:
# Count the duplicates in the training data
df_train.duplicated().sum()

76007

In [4]:
# Create a new dataframe without duplicates, but with an additional sample_weight column
vc = df_train.value_counts()
dedup_train = pd.DataFrame([list(tup) for tup in vc.index.values], columns=df_train.columns)
dedup_train['sample_weight'] = vc.values
dedup_train

Unnamed: 0,A0T0G0C10,A0T0G1C9,A0T0G2C8,A0T0G3C7,A0T0G4C6,A0T0G5C5,A0T0G6C4,A0T0G7C3,A0T0G8C2,A0T0G9C1,...,A8T0G2C0,A8T1G0C1,A8T1G1C0,A8T2G0C0,A9T0G0C1,A9T0G1C0,A9T1G0C0,A10T0G0C0,target,sample_weight
0,-9.536743e-07,-0.000010,-0.000043,-0.000114,-0.000200,-0.000240,-0.000200,-0.000114,-4.291534e-05,-0.000010,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.000010,-0.000010,-0.000010,-9.536743e-07,Escherichia_coli,18
1,-9.536743e-07,-0.000010,-0.000043,0.000886,-0.000200,0.000760,-0.000200,0.000886,-4.291534e-05,-0.000010,...,-0.000043,-0.000086,-0.000086,-0.000043,-0.000010,-0.000010,0.000990,-9.536743e-07,Salmonella_enterica,17
2,-9.536743e-07,-0.000010,-0.000043,-0.000114,-0.000200,-0.000240,-0.000200,-0.000114,-4.291534e-05,-0.000010,...,-0.000043,-0.000086,0.000914,0.002957,-0.000010,-0.000010,-0.000010,-9.536743e-07,Staphylococcus_aureus,17
3,-9.536743e-07,-0.000010,-0.000043,-0.000114,-0.000200,-0.000240,-0.000200,-0.000114,-4.291534e-05,-0.000010,...,-0.000043,-0.000086,0.009914,-0.000043,-0.000010,-0.000010,-0.000010,-9.536743e-07,Bacteroides_fragilis,16
4,-9.536743e-07,-0.000010,-0.000043,-0.000114,-0.000200,-0.000240,-0.000200,-0.000114,-4.291534e-05,-0.000010,...,0.000957,0.001914,0.000914,-0.000043,-0.000010,-0.000010,-0.000010,-9.536743e-07,Campylobacter_jejuni,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123988,-9.536743e-07,-0.000006,0.000003,0.000059,0.000078,0.000033,0.000051,0.000049,-4.915344e-06,-0.000007,...,0.000051,0.000117,0.000135,0.000102,0.000012,0.000012,0.000032,4.632568e-08,Escherichia_coli,1
123989,-9.536743e-07,-0.000006,0.000003,0.000059,0.000072,0.000033,0.000034,0.000050,-5.915344e-06,-0.000008,...,0.000067,0.000140,0.000123,0.000097,0.000013,0.000011,0.000030,4.632568e-08,Escherichia_coli,1
123990,-9.536743e-07,-0.000006,0.000003,0.000059,0.000063,0.000023,0.000036,0.000041,8.465576e-08,-0.000008,...,0.000042,0.000124,0.000130,0.000088,0.000008,0.000015,0.000026,4.632568e-08,Escherichia_coli,1
123991,-9.536743e-07,-0.000006,0.000003,0.000058,0.000074,0.000041,0.000070,0.000051,-5.915344e-06,-0.000008,...,0.000054,0.000141,0.000130,0.000103,0.000011,0.000012,0.000024,4.632568e-08,Escherichia_coli,1


Let's do a quick check for correctness. The first row of `dedup_train` has a sample_weight of 18. If everything is correct, the original dataframe should have 18 rows with the same data:

In [5]:
(df_train[features].values == dedup_train[features].iloc[0].values.reshape(1, -1)).all(axis=1).sum()

18
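
The deduplication idea can also be sketched on a toy dataframe. This is only an illustration: the column names `f1`, `f2` and the toy values are made up, and `groupby(...).size()` is used here as an alternative to the `value_counts()` approach above. It assumes there are no missing values, since `groupby` drops rows with NaN keys by default.

```python
import pandas as pd

# Toy frame with hypothetical columns 'f1', 'f2' and 'target'
toy = pd.DataFrame({
    'f1':     [0.1, 0.1, 0.1, 0.2, 0.2, 0.3],
    'f2':     [1.0, 1.0, 1.0, 2.0, 2.0, 3.0],
    'target': ['a', 'a', 'a', 'b', 'b', 'a'],
})

# Count identical rows; every distinct row keeps its multiplicity as weight
dedup_toy = (
    toy.groupby(list(toy.columns), sort=False)
       .size()
       .reset_index(name='sample_weight')
)

print(dedup_toy)

# The weights add up to the original number of rows
print(dedup_toy['sample_weight'].sum() == len(toy))
```

The sum of the weights equals the number of original rows, so no measurement is lost: it is only stored more compactly.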

# Training, cross-validation & inference

After deduplicating the training data, we apply two small changes to the training loop:
1. When calling `fit()`, we add the sample weights of the training data.
2. When calling `accuracy_score()`, we add the sample weights of the validation data.
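
Why the validation weights matter can be seen in a small sketch with made-up labels and weights (none of these numbers come from the competition data). A weighted accuracy on the deduplicated rows should match the plain accuracy on the original rows with all their duplicates:

```python
import numpy as np

# Hypothetical labels and predictions for three distinct rows
y_true = np.array([0, 1, 1])
y_pred = np.array([0, 1, 0])
weights = np.array([3, 2, 1])  # how often each row occurred originally

# Weighted accuracy on the deduplicated rows
acc_weighted = np.average(y_true == y_pred, weights=weights)

# Plain accuracy after expanding every row back to its duplicates
y_true_full = np.repeat(y_true, weights)
y_pred_full = np.repeat(y_pred, weights)
acc_full = np.mean(y_true_full == y_pred_full)

# Unweighted accuracy on the deduplicated rows, for comparison
acc_unweighted = np.mean(y_true == y_pred)

print(acc_weighted, acc_full, acc_unweighted)
```

In this toy case the weighted and the expanded accuracy agree (5/6), whereas the unweighted accuracy on the deduplicated rows (2/3) would describe a different distribution.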

In [6]:
from sklearn.preprocessing import LabelEncoder

# Encoding the target labels as integers
le = LabelEncoder()

X = dedup_train[features]
y = pd.DataFrame(le.fit_transform(dedup_train[TARGET]), columns=[TARGET])
sample_weight = dedup_train['sample_weight']

In [7]:
#%%time

N_SPLITS = 10
folds = StratifiedKFold(n_splits=N_SPLITS, shuffle=True)
y_pred_list, y_proba_list, scores = [], [], []

for fold, (train_id, valid_id) in enumerate(tqdm(folds.split(X, y), total=N_SPLITS)):
    print('####### Fold: ', fold)
    
    # Splitting
    X_train, y_train, sample_weight_train = X.iloc[train_id], y.iloc[train_id], sample_weight.iloc[train_id]
    X_valid, y_valid, sample_weight_valid = X.iloc[valid_id], y.iloc[valid_id], sample_weight.iloc[valid_id]
    
    # Model
    model = ExtraTreesClassifier(
        n_estimators=300,
        n_jobs=-1,
        verbose=0,
        random_state=1
    )

    # Training
    model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weight_train)
        
    # Validation
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred, sample_weight=sample_weight_valid)
    print(f'Accuracy score: {valid_score:.6f}\n')
    scores.append(valid_score)
    
    # Prediction for submission
    y_pred_list.append(model.predict(df_test))
    y_proba_list.append(model.predict_proba(df_test))
    
score = np.array(scores).mean()
print(f'Mean accuracy score: {score:.6f}')

####### Fold:  0
Accuracy score: 0.957644

####### Fold:  1
Accuracy score: 0.959443

####### Fold:  2
Accuracy score: 0.954496

####### Fold:  3
Accuracy score: 0.956148

####### Fold:  4
Accuracy score: 0.957416

####### Fold:  5
Accuracy score: 0.956583

####### Fold:  6
Accuracy score: 0.958146

####### Fold:  7
Accuracy score: 0.953858

####### Fold:  8
Accuracy score: 0.954004

####### Fold:  9
Accuracy score: 0.955283

100%|██████████| 10/10 [08:50<00:00, 53.03s/it]

Mean accuracy score: 0.956302

# Ensembling

We are happy about the high cv score and ensemble the ten predictions by majority vote:

In [8]:
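# Majority vote across the folds
# (np.ravel gives a 1-D result for both older and newer SciPy output shapes)
y_pred = np.ravel(mode(np.vstack(y_pred_list), axis=0).mode)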
y_pred = le.inverse_transform(y_pred)
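
For readers who want to see what a majority vote does, here is a minimal NumPy-only sketch. The predictions in `fold_preds` are invented for illustration, and counting the votes with `np.bincount` is just one possible way to do it, not the method used in the cell above.

```python
import numpy as np

# Hypothetical integer-encoded predictions of three folds for four test rows
fold_preds = np.array([
    [0, 2, 1, 1],
    [0, 2, 2, 1],
    [1, 2, 2, 0],
])
n_classes = 3

# Count the votes per class for every test row;
# votes has shape (n_classes, n_test_rows)
votes = np.apply_along_axis(np.bincount, 0, fold_preds, minlength=n_classes)

# The class with the most votes wins; ties go to the smallest label
majority = votes.argmax(axis=0)
print(majority)  # [0 2 2 1]
```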

# The surprise

Let's compare the distribution of classes in training and in our predictions. Something went wrong:

In [9]:
target_distrib = pd.DataFrame({
    'count': df_train[TARGET].value_counts(),
    'share': df_train[TARGET].value_counts() / df_train.shape[0] * 100
})

target_distrib['pred_count'] = pd.Series(y_pred, index=df_test.index).value_counts()
target_distrib['pred_share'] = target_distrib['pred_count'] / len(df_test) * 100
target_distrib.sort_index()

Unnamed: 0,count,share,pred_count,pred_share
Bacteroides_fragilis,20139,10.0695,10047,10.047
Campylobacter_jejuni,20063,10.0315,10233,10.233
Enterococcus_hirae,19947,9.9735,9715,9.715
Escherichia_coli,19958,9.979,8654,8.654
Escherichia_fergusonii,19937,9.9685,10862,10.862
Klebsiella_pneumoniae,19847,9.9235,10171,10.171
Salmonella_enterica,20030,10.015,10225,10.225
Staphylococcus_aureus,19929,9.9645,9927,9.927
Streptococcus_pneumoniae,20074,10.037,10098,10.098
Streptococcus_pyogenes,20076,10.038,10068,10.068


What went wrong? In the training data, all classes have nearly equal frequencies of about 10 %. In our predictions, *E. coli* is underpredicted with a frequency of only 8.7 %, while *E. fergusonii* is overpredicted with 10.9 %. Two explanations are possible:
1. In the test data, *E. coli* really has a frequency of only 8.7 % and *E. fergusonii* really has a frequency of 10.9 %.
2. Because the bacteria have mutated and changed their DNA, our classifier no longer classifies them correctly.

I think the correct explanation is 2, because the [EDA has already shown that the bacteria mutate between training and test](https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense).

Fortunately, we can account for the mutations with a little postprocessing.

# Postprocessing

Our classifier predicts not only classes, but also probabilities. These probabilities have already been collected in `y_proba_list`. We average them over the ten folds and then tune the result by manually adding a small bias to the probabilities of *Enterococcus hirae* (0.01) and *E. coli* (0.03).

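From these tuned probabilities, we determine new predictions by applying `np.argmax(y_proba, axis=1)`, and we see that the class frequencies are now much closer to the 10 % seen in training: *E. coli* rises from 8.7 % to 9.7 %, and *E. fergusonii* drops from 10.9 % to 10.1 %.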

In [10]:
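# Average the predicted probabilities over all folds
y_proba = sum(y_proba_list) / len(y_proba_list)
# LabelEncoder sorts the classes alphabetically:
# index 2 is Enterococcus_hirae, index 3 is Escherichia_coli
y_proba += np.array([0, 0, 0.01, 0.03, 0, 0, 0, 0, 0, 0])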
y_pred_tuned = le.inverse_transform(np.argmax(y_proba, axis=1))
pd.Series(y_pred_tuned, index=df_test.index).value_counts().sort_index() / len(df_test) * 100

Bacteroides_fragilis        10.015
Campylobacter_jejuni        10.208
Enterococcus_hirae           9.803
Escherichia_coli             9.730
Escherichia_fergusonii      10.077
Klebsiella_pneumoniae       10.104
Salmonella_enterica         10.032
Staphylococcus_aureus        9.911
Streptococcus_pneumoniae    10.057
Streptococcus_pyogenes      10.063
dtype: float64
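
How such a small bias changes predictions can be seen in a toy sketch. The probabilities and the bias below are invented for illustration. The point is that only near-ties flip, while confident predictions stay the same:

```python
import numpy as np

# Hypothetical averaged probabilities for three test rows and three classes
proba = np.array([
    [0.50, 0.48, 0.02],  # near-tie between class 0 and class 1
    [0.70, 0.20, 0.10],  # clear win for class 0
    [0.10, 0.30, 0.60],  # clear win for class 2
])

# Hypothetical bias in favor of class 1
bias = np.array([0.0, 0.03, 0.0])

before = proba.argmax(axis=1)
after = (proba + bias).argmax(axis=1)

print(before)  # [0 0 2]
print(after)   # [1 0 2]
```

Only the first row changes its class, because the bias of 0.03 is larger than the 0.02 gap between its two leading probabilities.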

In [11]:
submission = pd.read_csv(data_dir / 'sample_submission.csv')
submission[TARGET] = y_pred_tuned
submission.to_csv('submission.csv', index=False)
submission

Unnamed: 0,row_id,target
0,200000,Escherichia_fergusonii
1,200001,Salmonella_enterica
2,200002,Enterococcus_hirae
3,200003,Salmonella_enterica
4,200004,Staphylococcus_aureus
...,...,...
99995,299995,Streptococcus_pneumoniae
99996,299996,Bacteroides_fragilis
99997,299997,Bacteroides_fragilis
99998,299998,Bacteroides_fragilis


# Final remark

Understanding a model's weaknesses is part of data science. The present ExtraTreesClassifier has the weakness that it does not take the train-test drift into account.

But please note that the postprocessing in this notebook is not data science. It is a workaround to compensate for the model's weakness. The real data science remains to be done: Create a model for the train-test drift which doesn't need postprocessing workarounds.