# SUMMARY

This notebook reproduces my best submission, which scores **1.31** on the private LB and places **47th**. The notebook implements an ensemble of CNN-LSTM models using model predictions saved as Kaggle datasets. A detailed summary of the solution is published [in this discussion topic](https://www.kaggle.com/c/bms-molecular-translation/discussion/243845). The table with the main model parameters and CV performance (before beam search and normalization) is provided below.

![models](https://i.postimg.cc/cLrTp1Pc/Screen-2021-06-04-at-10-17-02.jpg)

# PREPARATIONS

In [1]:
##### LIBRARIES

import numpy as np 
import pandas as pd 
from tqdm import tqdm 

Below we define a list of seven base models. For each of these models, test molecule predictions are stored as a Kaggle dataset after beam search with `k = 5` (thanks @tugstugi) and `RDKit`-based normalization (thanks @nofreewill). The models are sorted by their performance after beam search and normalization, so that the first model performs best.

In [2]:
##### BASE MODELS

model_list = ['/kaggle/input/bms-norm-v22/submission_norm.csv',
              '/kaggle/input/bms-norm-v17/submission_norm.csv',
              '/kaggle/input/bms-normalization-v21/submission_norm.csv',
              '/kaggle/input/bms-normalization-v2733/submission_norm.csv',
              '/kaggle/input/bms-normalization-v20/submission_norm.csv',
              '/kaggle/input/bms-normalization-v6/submission_norm.csv',
              '/kaggle/input/bms-normalization-public/submission_norm.csv']

In [3]:
##### PREPARE MODEL PREDICTIONS

models = []

for model in model_list:
    sub = pd.read_csv(model)
    sub = sub.sort_values('image_id').reset_index(drop = True)
    print('- {}: {}'.format(model, sub.shape))
    models.append(sub)

- /kaggle/input/bms-norm-v22/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-norm-v17/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-normalization-v21/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-normalization-v2733/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-normalization-v20/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-normalization-v6/submission_norm.csv: (1616107, 3)
- /kaggle/input/bms-normalization-public/submission_norm.csv: (1616107, 3)
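
The ensembling loop further below indexes all data frames by row position, so it relies on every file holding the same `image_id` values in the same order after sorting. The matching shapes printed above are consistent with this but do not prove it. A minimal sketch of a sanity check follows; the helper name `check_alignment` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def check_alignment(frames):
    """Return True if all frames hold identical image_id columns, row by row."""
    reference = frames[0]['image_id'].reset_index(drop=True)
    return all(
        reference.equals(frame['image_id'].reset_index(drop=True))
        for frame in frames[1:]
    )

# toy stand-ins for the sorted submission frames
a = pd.DataFrame({'image_id': ['a1', 'b2'], 'InChI': ['x', 'y']})
b = pd.DataFrame({'image_id': ['a1', 'b2'], 'InChI': ['x', 'z']})
c = pd.DataFrame({'image_id': ['a1', 'c3'], 'InChI': ['x', 'z']})

aligned_ok = check_alignment([a, b])   # same ids in the same order
aligned_bad = check_alignment([a, c])  # second id differs
```

In the notebook, the same check could be run as `check_alignment(models)` right after the loading loop.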


I also import predictions from another model that covers only the 273k molecules that proved harder to translate in my previous experiments. These predictions are produced with beam search using a larger `k`.

In [4]:
##### IMPORT PARTIAL PREDICTIONS

part_sub = pd.read_csv('/kaggle/input/bms-normalization-bad-27/submission_norm.csv')
part_sub = part_sub.sort_values('image_id').reset_index(drop = True)
print(part_sub.shape)

(273054, 3)


Following the idea of @nofreewill, I store three possible events in the process of InChI normalization: `['valid', 'none', 'error']`. The value `valid` means that RDKit was able to convert the prediction to a valid InChI string.

In [5]:
##### CHECK PREDICTION FORMAT

sub = models[0].copy()
display(sub.head())
print('\nEvents:')
display(sub['event'].value_counts())

Unnamed: 0,image_id,InChI,event
0,00000d2a601c,InChI=1S/C10H14BrN5S/c1-6-10(11)9(16(3)14-6)4-...,valid
1,00001f7fc849,InChI=1S/C14H18ClN3/c1-2-7-16-9-13-10-17-14(18...,valid
2,000037687605,InChI=1S/C16H13BrN2O/c1-11(20)12-6-7-13(9-18)1...,valid
3,00004b6d55b6,"InChI=1S/C14H19FN4O/c1-14(2,3)12-13(16)17-18-1...",valid
4,00004df0fe53,InChI=1S/C9H12O2/c1-4-5-2-6-7(3-5)11-9(10)8(4)...,valid



Events:


valid    1495604
none      120478
error         25
Name: event, dtype: int64

# ENSEMBLING

The ensembling is done in the following way:
1. If at least 4 of the 7 models produce the same output, set the final prediction to this mode value.
2. Else:
    - go through the models starting from the best-performing one and take the prediction of the first model with a valid output
    - if there are no valid outputs, set the prediction to the output of the lowest-CV model
    - if, in that case, the output of the partial model is available and valid, overwrite the prediction for this molecule
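
The rule above can be sketched for a single molecule as a small pure-Python function. This is an illustration under assumptions, not the notebook's actual code: the function name `ensemble_one`, its arguments, and the `model_{rank}` labels are hypothetical, and the inputs are assumed to be ordered from the best to the worst base model.

```python
from collections import Counter

def ensemble_one(preds, events, part_pred=None, part_event=None, min_votes=4):
    """Pick the final prediction for one molecule.

    preds and events are ordered from the best to the worst base model.
    Returns a tuple (prediction, source label).
    """
    # step 1: majority vote across the base models
    mode, votes = Counter(preds).most_common(1)[0]
    if votes >= min_votes:
        return mode, 'mode'

    # step 2a: first valid output, starting from the best model
    for rank, (pred, event) in enumerate(zip(preds, events)):
        if event == 'valid':
            return pred, 'model_{}'.format(rank)

    # step 2c: no base model is valid, so prefer a valid partial-model output
    if part_event == 'valid':
        return part_pred, 'part_model'

    # step 2b: otherwise fall back to the best base model
    return preds[0], 'model_0'

distinct = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
by_mode  = ensemble_one(['A', 'A', 'A', 'A', 'B', 'C', 'D'], ['none'] * 7)
by_valid = ensemble_one(distinct, ['none', 'valid', 'valid'] + ['none'] * 4)
by_part  = ensemble_one(distinct, ['none'] * 7, part_pred='P', part_event='valid')
by_best  = ensemble_one(distinct, ['none'] * 7)
```

With a threshold of 4 out of 7 votes, the mode is always unique whenever it is used, so tie-breaking between equally frequent predictions never affects the result.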

In [6]:
##### ENSEMBLING

# placeholders
num_equals  = []
model_preds = []

# loop through test molecules
for i in tqdm(range(len(sub))):
    
    # extract base model predictions and mode
    preds     = [model.iloc[i]['InChI'] for model in models]
    mode      = max(set(preds), key = preds.count)
    num_equal = preds.count(mode)
    num_equals.append(num_equal)
    
    # set prediction to mode
    if num_equal >= 4:
        sub.loc[i, 'InChI'] = mode
        model_preds.append('mode')
        
    else:
        
        # look for valid pred from all models
        valid_pred = False
        for m in range(len(models)):
            if models[m].loc[i, 'event'] == 'valid':
                sub.loc[i, 'InChI'] = models[m].loc[i, 'InChI']
                model_preds.append(model_list[m])
                valid_pred = True
                break
                
        # set preds to lowest-CV model
        if not valid_pred:
            sub.loc[i, 'InChI'] = models[0].loc[i, 'InChI']
            model_preds.append(model_list[0])
                
        # overwrite with the partial model if its output is valid
        # (note: this appends a second label to model_preds for the same molecule)
        if not valid_pred:
            image_id = sub.loc[i, 'image_id']
            if image_id in list(part_sub['image_id'].values):
                if part_sub.loc[part_sub['image_id'] == image_id]['event'].item() == 'valid':
                    sub.loc[i, 'InChI'] = part_sub.loc[part_sub['image_id'] == image_id, 'InChI'].item()
                    model_preds.append('part_model')

100%|██████████| 1616107/1616107 [1:19:30<00:00, 338.79it/s]
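
The progress bar reports roughly 80 minutes for this loop. Part of that cost likely comes from rebuilding `list(part_sub['image_id'].values)` and filtering `part_sub` inside the loop. A hedged sketch of the partial-model lookup with a dictionary built once is shown below; the names `part_lookup` and `lookup_part` and the toy data are assumptions, and no timing was measured.

```python
import pandas as pd

# toy stand-in for part_sub (illustrative values only)
part_sub = pd.DataFrame({
    'image_id': ['a1', 'c3'],
    'InChI':    ['InChI=1S/CH4/h1H4', 'not-an-inchi'],
    'event':    ['valid', 'none'],
})

# build the lookup once, outside the loop: image_id -> (InChI, event)
part_lookup = {
    image_id: (inchi, event)
    for image_id, inchi, event in zip(
        part_sub['image_id'], part_sub['InChI'], part_sub['event']
    )
}

def lookup_part(image_id):
    """Return the partial-model InChI if it exists and is valid, else None."""
    inchi, event = part_lookup.get(image_id, (None, None))
    return inchi if event == 'valid' else None

hit     = lookup_part('a1')  # present and valid
invalid = lookup_part('c3')  # present but not valid
missing = lookup_part('zz')  # not covered by the partial model
```

Inside the loop, `lookup_part(image_id)` would replace the membership test and the two filtered `.item()` calls with a single dictionary access.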


In [7]:
##### CHECK NO. EQUAL PREDS DISTRIBUTION

pd.Series(num_equals).value_counts()

7    1014488
6     202503
5     122019
4     109406
3      71478
1      50641
2      45572
dtype: int64

The models agree with each other in most cases: all seven produce the same output for about 63% of the test molecules, and at least four agree for about 90%.
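
The agreement shares can be recomputed from the counts printed above. The snippet below is a small sketch that copies those counts by hand; the variable names are illustrative.

```python
import pandas as pd

# counts copied from the output above: number of agreeing models -> molecules
agreement = pd.Series({7: 1014488, 6: 202503, 5: 122019, 4: 109406,
                       3: 71478, 2: 45572, 1: 50641})

total = agreement.sum()

# share of molecules where all seven models agree
share_all_agree = agreement[7] / total

# share of molecules decided by the mode (at least four models agree)
share_majority = agreement[agreement.index >= 4].sum() / total
```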

In [8]:
##### CHECK MODEL PREDS DISTRIBUTION

pd.Series(model_preds).value_counts()

mode                                                          1448416
/kaggle/input/bms-norm-v22/submission_norm.csv                 121351
/kaggle/input/bms-norm-v17/submission_norm.csv                  18720
/kaggle/input/bms-normalization-v21/submission_norm.csv         12606
/kaggle/input/bms-normalization-v2733/submission_norm.csv        8409
/kaggle/input/bms-normalization-v20/submission_norm.csv          3764
part_model                                                       2605
/kaggle/input/bms-normalization-v6/submission_norm.csv           1858
/kaggle/input/bms-normalization-public/submission_norm.csv        983
dtype: int64

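As expected, most of the predictions that are not decided by the mode come from model `v22`. Very few molecules are translated by the models at the end of the list, suggesting that usually at least one of the stronger models is able to provide a valid prediction. Note that the counts above sum to 2,605 more than the number of test molecules, because each molecule overwritten by `part_model` is also counted under `v22`.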

# SUBMISSION

In [9]:
##### EXPORT SUBMISSION

sub = sub[['image_id', 'InChI']]
sub.to_csv('submission.csv', index = False)
sub.head()

Unnamed: 0,image_id,InChI
0,00000d2a601c,InChI=1S/C10H14BrN5S/c1-6-10(11)9(16(3)14-6)4-...
1,00001f7fc849,InChI=1S/C14H18ClN3/c1-2-7-16-9-13-10-17-14(18...
2,000037687605,InChI=1S/C16H13BrN2O/c1-11(20)12-6-7-13(9-18)1...
3,00004b6d55b6,"InChI=1S/C14H19FN4O/c1-14(2,3)12-13(16)17-18-1..."
4,00004df0fe53,InChI=1S/C9H12O2/c1-4-5-2-6-7(3-5)11-9(10)8(4)...
