# About this Notebook
---
This notebook is derived from the notebook series by Y. Nakama:
- Preprocessing Notebook: https://www.kaggle.com/yasufuminakama/inchi-preprocess-2
- Training Notebook: https://www.kaggle.com/yasufuminakama/inchi-resnet-lstm-with-attention-starter
- Inference Notebook: https://www.kaggle.com/yasufuminakama/inchi-resnet-lstm-with-attention-inference

The major approach involves:
- PyTorch ResNet + LSTM with attention
- Basic image transformations
- Tokenize by characters and numbers
- Rotate test images upright to follow train set orientation
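
To make "tokenize by characters and numbers" concrete, here is a minimal sketch of how an InChI string might be split into element, number, and layer-prefix tokens. The function name `split_inchi_sketch` and its regular expressions are illustrative assumptions; the notebook's actual `split_form` and `split_form2` live in the `tokenizers` module and may differ.

```python
import re

def split_inchi_sketch(inchi: str) -> str:
    """Hypothetical stand-in for split_form/split_form2: space-separated tokens."""
    # Drop the "InChI=1S" version prefix; the first remaining layer is the formula.
    formula, *layers = inchi.split("/")[1:]
    # Formula layer: element symbols (e.g. "C", "Cl") and integer counts.
    tokens = re.findall(r"[A-Z][a-z]?|\d+", formula)
    for layer in layers:
        # Each later layer starts with a one-letter prefix such as "/c" or "/h".
        tokens.append("/" + layer[0])
        # Integers are kept whole; every other character becomes its own token.
        tokens.extend(re.findall(r"\d+|\D", layer[1:]))
    return " ".join(tokens)

print(split_inchi_sketch("InChI=1S/H2O/h1H2"))
# H 2 O /h 1 H 2
```

Keeping multi-digit numbers as single tokens shortens the target sequence compared with pure character-level tokenization, at the cost of a larger vocabulary.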

The original notebook gets a score of about 20 with 2 epochs of training. I aim to add/test a bunch of improvements over time, including:
- [x] Refactor and simplify
- [ ] Try JIT to speed things up
- [x] Convert code to PyTorch Lightning
- [ ] Add wandb logging
- [ ] Half-precision (16-bit) training
- [ ] Try PyTorch Lightning XLA for TPU training
- [ ] Use EfficientNet80 with the addition of an FC layer
- [ ] Replace the LSTM-with-attention model with an image captioning transformer
- [ ] Better preprocessing (i.e., add precise crop, better normalization, better augmentations)
- [ ] Larger model, train for more epochs
- [ ] Play around with different tokenization methods

## References
- https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
- https://github.com/dacon-ai/LG_SMILES_3rd
- https://www.kaggle.com/kaushal2896/bms-mt-show-attend-and-tell-pytorch-baseline

## Change Log
* reran the notebooks to reproduce the results
* combined notebooks all into one file
* refactored code to put imports, common util functions, CFG, model code, tokenizer class, config, and common variables at the very top so they aren't repeated
    * modified the logger function
    * changed path-related functions to use os.path.join and changed file/dir locations to either come from an input dir or be written to an output dir
    * TestDataset was not the same between inference and training, so the training TestDataset class was renamed to ValidDataset
    * converted all print statements to logger.info
    * after these modifications, the train losses were identical up to step 1000, and the pipeline works in test mode, so we should be good
* converted code to use pytorch lightning
* refactored code into files

# Setup
---

In [29]:
import os
import gc
import sys
import random

import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
import pytorch_lightning as pl

from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

In [24]:
from data import TrainDataset, ValidDataset, TestDataset, BMSCollator, get_transforms
from models import ImageCaptioner
from tokenizers import Tokenizer, split_form, split_form2
from utils import get_data_paths, path_from_image_id, set_seed, get_logger, get_device, get_score

In [3]:
tqdm.pandas()
%load_ext autoreload
%autoreload 2

In [12]:
factor = 1
class CFG:
    input_dir = "../input/bms-molecular-translation"
    output_dir = "models/resnet_attention_baseline_pl_run2"
    debug = False
    max_len = 275
    num_workers = 8
    model_name = "resnet34"
    size = 224
    scheduler = "CosineAnnealingLR"  # ['ReduceLROnPlateau', 'CosineAnnealingLR', 'CosineAnnealingWarmRestarts']
    epochs = 1  # not to exceed 9h
    # factor=0.2 # ReduceLROnPlateau
    # patience=4 # ReduceLROnPlateau
    # eps=1e-6 # ReduceLROnPlateau
    T_max = 4  # CosineAnnealingLR
    # T_0=4 # CosineAnnealingWarmRestarts
    encoder_lr = 1e-4*factor
    decoder_lr = 4e-4*factor
    min_lr = 1e-6*factor
    batch_size = 64*factor
    weight_decay = 1e-6
    gradient_accumulation_steps = 1
    max_grad_norm = 5
    attention_dim = 256
    embed_dim = 256
    decoder_dim = 512
    dropout = 0.5
    seed = 42
    test_size=0.01
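
The `CosineAnnealingLR` settings in `CFG` (`T_max`, `min_lr`, and the base learning rates) can be read through the closed-form cosine annealing curve. The sketch below evaluates that curve in plain Python for the encoder learning rate. The helper name `cosine_annealing_lr` is hypothetical, and whether the model steps the scheduler per epoch or per batch is not shown in this notebook, so `step` is left abstract.

```python
import math

def cosine_annealing_lr(step: int, base_lr: float, min_lr: float, t_max: int) -> float:
    """Closed-form cosine annealing value at `step` (0 <= step <= t_max)."""
    cosine = (1 + math.cos(math.pi * step / t_max)) / 2
    return min_lr + (base_lr - min_lr) * cosine

# Values taken from CFG above with factor = 1: encoder_lr, min_lr, T_max.
for step in range(5):
    print(step, f"{cosine_annealing_lr(step, 1e-4, 1e-6, 4):.3e}")
```

The rate starts at `encoder_lr`, reaches the midpoint between `encoder_lr` and `min_lr` halfway through, and ends at `min_lr` after `T_max` steps.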

In [6]:
set_seed(CFG.seed)
TRAIN_DIR, TEST_DIR, TRAIN_FILE, TEST_FILE = get_data_paths(CFG.input_dir)
os.makedirs(CFG.output_dir, exist_ok=True)  # do not fail when the cell is re-run

In [7]:
LOGGER = get_logger(__name__, os.path.join(CFG.output_dir, "run.log"))
device = get_device()
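
`get_logger` comes from the `utils` module, which is not shown here. As a rough idea of what such a helper could look like, the sketch below builds a logger that writes to both the console and a file, using a format chosen to resemble the log lines in this notebook's output. The name `get_logger_sketch` and the exact format string are assumptions.

```python
import logging
import os
import tempfile

def get_logger_sketch(name: str, log_file: str) -> logging.Logger:
    """Hypothetical logger that writes to both the console and a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid duplicate handlers when a cell is re-run
        formatter = logging.Formatter("[%(asctime)s] %(levelname)s:%(name)s: %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger

# Write to a temporary directory so the sketch does not touch CFG.output_dir.
log_path = os.path.join(tempfile.mkdtemp(), "run.log")
demo_logger = get_logger_sketch("sketch", log_path)
demo_logger.info("train.shape: (2424186, 2)")
```

The handler check matters in notebooks: re-running a cell would otherwise attach a second pair of handlers and print every message twice.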

# Preprocess
---

In [8]:
train = pd.read_csv(TRAIN_FILE)
LOGGER.info(f'train.shape: {train.shape}')

[2021-04-08 13:54:49,544] INFO:__main__: train.shape: (2424186, 2)


In [11]:
train['InChI_1'] = train['InChI'].progress_apply(lambda x: x.split('/')[1])
train['InChI_text'] = train['InChI_1'].progress_apply(split_form) + ' ' + \
                    train['InChI'].apply(lambda x: '/'.join(x.split('/')[2:])).progress_apply(split_form2).values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train['InChI_text'].values)
torch.save(tokenizer, os.path.join(CFG.output_dir, 'tokenizer2.pth'))
LOGGER.info('Saved tokenizer')

lengths = []
tk0 = tqdm(train['InChI_text'].values, total=len(train))
for text in tk0:
    seq = tokenizer.text_to_sequence(text)
    length = len(seq) - 2
    lengths.append(length)
train['InChI_length'] = lengths
train.to_pickle(os.path.join(CFG.output_dir, 'train2.pkl'))
LOGGER.info('Saved preprocessed to ' + os.path.join(CFG.output_dir, "train2.pkl"))

100%|██████████| 2424186/2424186 [00:02<00:00, 1056060.32it/s]
100%|██████████| 2424186/2424186 [00:17<00:00, 140551.11it/s]
100%|██████████| 2424186/2424186 [02:37<00:00, 15347.86it/s]
[2021-04-08 13:58:20,863] INFO:__main__: Saved tokenizer
100%|██████████| 2424186/2424186 [00:24<00:00, 98554.35it/s]
[2021-04-08 13:58:48,815] INFO:__main__: Saved preprocessed to models/resnet_attention_baseline_pl_run2/train2.pkl
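
The preprocessing cell subtracts 2 from each sequence length, which suggests that `text_to_sequence` wraps every sequence in start and end markers. The sketch below shows one way a tokenizer with that behavior could be written. `SketchTokenizer` is a hypothetical stand-in; the real `Tokenizer` class is in the `tokenizers` module and may differ in its details.

```python
class SketchTokenizer:
    """Hypothetical minimal tokenizer with <sos>, <eos>, and <pad> specials."""

    def __init__(self):
        self.stoi = {}
        self.itos = {}

    def fit_on_texts(self, texts):
        # Vocabulary = sorted unique whitespace-separated tokens, then specials.
        vocab = sorted({token for text in texts for token in text.split(" ")})
        vocab += ["<sos>", "<eos>", "<pad>"]
        self.stoi = {token: index for index, token in enumerate(vocab)}
        self.itos = {index: token for token, index in self.stoi.items()}

    def text_to_sequence(self, text):
        # Wrap the token ids in start- and end-of-sequence markers.
        ids = [self.stoi[token] for token in text.split(" ")]
        return [self.stoi["<sos>"]] + ids + [self.stoi["<eos>"]]

    def sequence_to_text(self, sequence):
        # Joining without spaces recovers the original InChI fragment.
        return "".join(self.itos[index] for index in sequence)

sketch_tokenizer = SketchTokenizer()
sketch_tokenizer.fit_on_texts(["H 2 O /h 1 H 2"])
sequence = sketch_tokenizer.text_to_sequence("H 2 O /h 1 H 2")
print(len(sequence) - 2)  # number of real tokens, excluding <sos> and <eos>
```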


# Train
---

In [13]:
train_df = pd.read_pickle(os.path.join(CFG.output_dir, 'train2.pkl'))
train_df['file_path'] = train_df['image_id'].apply(lambda x: path_from_image_id(x, TRAIN_DIR))

LOGGER.info(f'train.shape: {train_df.shape}')
display(train_df.head())

[2021-04-08 14:05:47,956] INFO:__main__: train.shape: (2424186, 6)


| | image_id | InChI | InChI_1 | InChI_text | InChI_length | file_path |
|---|---|---|---|---|---|---|
| 0 | 000011a64c74 | InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12... | C13H20OS | C 13 H 20 O S /c 1 - 9 ( 2 ) 8 - 15 - 13 - 6 -... | 59 | ../input/bms-molecular-translation/train/0/0/0... |
| 1 | 000019cc0cd2 | InChI=1S/C21H30O4/c1-12(22)25-14-6-8-20(2)13(1... | C21H30O4 | C 21 H 30 O 4 /c 1 - 12 ( 22 ) 25 - 14 - 6 - 8... | 108 | ../input/bms-molecular-translation/train/0/0/0... |
| 2 | 0000252b6d2b | InChI=1S/C24H23N5O4/c1-14-13-15(7-8-17(14)28-1... | C24H23N5O4 | C 24 H 23 N 5 O 4 /c 1 - 14 - 13 - 15 ( 7 - 8 ... | 112 | ../input/bms-molecular-translation/train/0/0/0... |
| 3 | 000026b49b7e | InChI=1S/C17H24N2O4S/c1-12(20)18-13(14-7-6-10-... | C17H24N2O4S | C 17 H 24 N 2 O 4 S /c 1 - 12 ( 20 ) 18 - 13 (... | 108 | ../input/bms-molecular-translation/train/0/0/0... |
| 4 | 000026fc6c36 | InChI=1S/C10H19N3O2S/c1-15-10(14)12-8-4-6-13(7... | C10H19N3O2S | C 10 H 19 N 3 O 2 S /c 1 - 15 - 10 ( 14 ) 12 -... | 72 | ../input/bms-molecular-translation/train/0/0/0... |
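
The `file_path` values above begin with `train/0/0/0` for image ids that start with `000`, which suggests the images are sharded by the first three characters of the id. A minimal sketch of a path helper under that assumption follows. The name `path_from_image_id_sketch` and the `.png` extension are assumptions, since the real `path_from_image_id` lives in `utils` and the paths above are truncated.

```python
import os

def path_from_image_id_sketch(image_id: str, image_dir: str) -> str:
    """Hypothetical path builder: <dir>/<c0>/<c1>/<c2>/<image_id>.png."""
    # The first three characters of the id select three nested subdirectories.
    return os.path.join(image_dir, image_id[0], image_id[1], image_id[2], f"{image_id}.png")

print(path_from_image_id_sketch("000011a64c74", "../input/bms-molecular-translation/train"))
```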


In [14]:
tokenizer = torch.load(os.path.join(CFG.output_dir, 'tokenizer2.pth'))
LOGGER.info(f"tokenizer.stoi: {tokenizer.stoi}")

[2021-04-08 14:05:50,741] INFO:__main__: tokenizer.stoi: {'(': 0, ')': 1, '+': 2, ',': 3, '-': 4, '/b': 5, '/c': 6, '/h': 7, '/i': 8, '/m': 9, '/s': 10, '/t': 11, '0': 12, '1': 13, '10': 14, '100': 15, '101': 16, '102': 17, '103': 18, '104': 19, '105': 20, '106': 21, '107': 22, '108': 23, '109': 24, '11': 25, '110': 26, '111': 27, '112': 28, '113': 29, '114': 30, '115': 31, '116': 32, '117': 33, '118': 34, '119': 35, '12': 36, '120': 37, '121': 38, '122': 39, '123': 40, '124': 41, '125': 42, '126': 43, '127': 44, '128': 45, '129': 46, '13': 47, '130': 48, '131': 49, '132': 50, '133': 51, '134': 52, '135': 53, '136': 54, '137': 55, '138': 56, '139': 57, '14': 58, '140': 59, '141': 60, '142': 61, '143': 62, '144': 63, '145': 64, '146': 65, '147': 66, '148': 67, '149': 68, '15': 69, '150': 70, '151': 71, '152': 72, '153': 73, '154': 74, '155': 75, '156': 76, '157': 77, '158': 78, '159': 79, '16': 80, '161': 81, '163': 82, '165': 83, '167': 84, '17': 85, '18': 86, '19': 87, '2': 88, '20': 

In [15]:
train_df['InChI_length'].max()

275
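
Token sequences vary in length up to the maximum shown above, so each training batch has to be padded to a common length before it can be stacked into a tensor. `BMSCollator` receives the `<pad>` index for this purpose, but its code is not shown in this notebook. The following is only a plain-Python sketch of the padding idea, with the hypothetical helper `pad_batch`.

```python
def pad_batch(sequences, pad_idx):
    """Right-pad every sequence in the batch to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # The original lengths let a decoder ignore the padded positions.
    lengths = [len(seq) for seq in sequences]
    return padded, lengths

padded, lengths = pad_batch([[5, 1, 2, 6], [5, 3, 6]], pad_idx=7)
print(padded)   # [[5, 1, 2, 6], [5, 3, 6, 7]]
print(lengths)  # [4, 3]
```

Padding per batch rather than to the global `max_len` keeps short batches small, which saves decoder time steps.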

In [27]:
def train(train_df, valid_df):
    # todo: save best model and last model
    # TODO: fix scheduler situation
    # TODO: add raw sequence accuracy?
    # todo: add get_score for validation accuracy, potentially use https://github.com/1ytic/pytorch-edit-distance
    #       so you can compute the distance without detaching to cpu and converting to numpy.
    #       this would be ideal since then if it doesn't have a large computation impact
    #       we can add it as a metric in the training loop as well.
    #       you'll need to change the valid dataset so it'll pass in the labels
    # todo: tensorboard logging
    # todo: gradient clipping
    # todo: figure out folds situation
    # todo: model saving
    # todo: num workers
    # TODO: replace all instances of CFG
    # TODO: save hparams/config
    # TODO: remove device calls where possible
    valid_labels = valid_df["InChI"].values

    train_dataset = TrainDataset(
        train_df, tokenizer, transform=get_transforms(data="train", size=CFG.size)
    )
    valid_dataset = ValidDataset(valid_df, transform=get_transforms(data="valid", size=CFG.size))

    bms_collator = BMSCollator(pad_idx=tokenizer.stoi["<pad>"])
    train_loader = DataLoader(
        train_dataset,
        batch_size=CFG.batch_size,
        shuffle=True,
        num_workers=CFG.num_workers,
        pin_memory=True,
        drop_last=True,
        collate_fn=bms_collator,
    )
    valid_loader = DataLoader(
        valid_dataset,
        batch_size=CFG.batch_size,
        shuffle=False,
        num_workers=CFG.num_workers,
        pin_memory=True,
        drop_last=False,
    )

    model = ImageCaptioner(
        model_name=CFG.model_name,
        tokenizer=tokenizer,
        encoder_lr=CFG.encoder_lr,
        decoder_lr=CFG.decoder_lr,
        weight_decay=CFG.weight_decay,
        amsgrad=False,
        attention_dim=CFG.attention_dim,
        embed_dim=CFG.embed_dim,
        decoder_dim=CFG.decoder_dim,
        dropout=CFG.dropout,
        max_len=CFG.max_len,
        valid_labels=valid_labels,
        gradient_accumulation_steps=CFG.gradient_accumulation_steps,
        max_grad_norm=CFG.max_grad_norm,
        device=device
    )
    
    from pytorch_lightning.callbacks import LearningRateMonitor, GPUStatsMonitor, ModelCheckpoint
    
    checkpoint_callback = ModelCheckpoint(
        monitor="score",
        dirpath=CFG.output_dir,
        filename="best_model",
        save_last=True,
        save_top_k=1,
        mode="min",
    )

    trainer = pl.Trainer(
        default_root_dir=CFG.output_dir,  # set directory to save weights, logs, etc ...
        num_processes=1,  # number of CPU training processes; DataLoader workers are set separately via num_workers
        gpus=1,  # num gpus to use if using gpu
        tpu_cores=None,  # num tpu cores to use if using tpu
        progress_bar_refresh_rate=1,  # change to 20 if using google colab
        fast_dev_run=False,  # set to True to quickly verify your code works
        # gradient_clip_val=CFG.max_grad_norm,  # NOTE: has no effect here; with manual optimization we must clip gradients ourselves
        # accumulate_grad_batches=CFG.gradient_accumulation_steps,  # NOTE: has no effect here; with manual optimization we must accumulate gradients ourselves
        max_epochs=CFG.epochs,
        min_epochs=1,
        max_steps=None,  # use if you want to train based on step rather than epoch
        min_steps=None,  # use if you want to train based on step rather than epoch
        limit_train_batches=1.0/100,  # fraction of train batches to use (here 1%)
        limit_val_batches=1.0/100,  # fraction of validation batches to use (here 1%)
        limit_test_batches=1.0,  # fraction of test batches to use
        check_val_every_n_epoch=1,  # run validation every n epochs
        val_check_interval=0.20,  # run validation after every 20% of a training epoch
        precision=32,  # use 16 for half-precision training
        resume_from_checkpoint=None,  # place path to checkpoint if resuming training
        auto_lr_find=False,  # set to True to optimize learning rate
        auto_scale_batch_size=False,  # set to True to find largest batch size that fits in hardware
        log_every_n_steps=20,
        callbacks=[checkpoint_callback, LearningRateMonitor("step"), GPUStatsMonitor(temperature=True, fan_speed=True)]
    )
    trainer.fit(model, train_loader, valid_loader)
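
The commented-out `gradient_clip_val` and `accumulate_grad_batches` arguments note that, under manual optimization, clipping and accumulation must be done by hand. The sketch below shows the usual order of operations (accumulate scaled gradients, clip by global norm, then step) using plain SGD on Python lists. It is an illustration only: the function names are hypothetical, and the actual `ImageCaptioner` training step is not shown here and presumably uses PyTorch optimizers.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def accumulate_and_step(weights, micro_batch_grads, lr, accumulation_steps, max_norm):
    """One optimizer step built from several micro-batch gradients."""
    accumulated = [0.0] * len(weights)
    for grads in micro_batch_grads[:accumulation_steps]:
        # Dividing by accumulation_steps averages the micro-batch gradients.
        accumulated = [a + g / accumulation_steps for a, g in zip(accumulated, grads)]
    # Clip once, on the accumulated gradient, right before the update.
    clipped, _ = clip_by_global_norm(accumulated, max_norm)
    return [w - lr * g for w, g in zip(weights, clipped)]

new_weights = accumulate_and_step(
    weights=[1.0, 1.0],
    micro_batch_grads=[[6.0, 8.0], [6.0, 8.0]],
    lr=0.1,
    accumulation_steps=2,
    max_norm=5,  # same value as CFG.max_grad_norm
)
print(new_weights)
```

Here the averaged gradient `[6, 8]` has norm 10, so it is scaled down to roughly `[3, 4]` before the update.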

In [21]:
train_data, valid_data = train_test_split(train_df, shuffle=True, test_size=CFG.test_size, random_state=CFG.seed)
print(len(train_data), len(valid_data))

2399944 24242
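
Together with the `Trainer` limits set in `train`, these split sizes account for the step counts in the training progress bar. The arithmetic below is a sketch that assumes Lightning truncates fractional batch counts with `int()`. Under that assumption it reproduces the 389 total steps and the 3 validation batches per check that appear in the training output.

```python
import math

train_rows, valid_rows = 2399944, 24242   # split sizes printed above
batch_size = 64                           # CFG.batch_size with factor = 1

# drop_last=True for training, drop_last=False for validation.
train_batches = train_rows // batch_size
valid_batches = math.ceil(valid_rows / batch_size)

# Assumption: fractional limits are truncated with int().
limited_train = int(train_batches * (1.0 / 100))   # limit_train_batches
limited_valid = int(valid_batches * (1.0 / 100))   # limit_val_batches
val_every = int(limited_train * 0.20)              # val_check_interval
val_runs = limited_train // val_every

# The progress bar counts training batches plus all validation batches.
total_bar_steps = limited_train + val_runs * limited_valid
print(limited_train, limited_valid, val_every, total_bar_steps)
```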


In [31]:
train(train_data, valid_data)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type                 | Params
---------------------------------------------------
0 | encoder   | Encoder              | 21.3 M
1 | decoder   | DecoderWithAttention | 3.8 M 
2 | critereon | CrossEntropyLoss     | 0     
---------------------------------------------------
25.1 M    Trainable params
0         Non-trainable params
25.1 M    Total params
100.438   Total estimated model params size (MB)


Epoch 0:  19%|█▉        | 74/389 [01:25<06:02,  1.15s/it, v_num=1, score=565.0, loss=2.440]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 0:  20%|█▉        | 76/389 [01:27<05:59,  1.15s/it, v_num=1, score=565.0, loss=2.440]
Validating:  67%|██████▋   | 2/3 [00:02<00:01,  1.07s/it]
Epoch 0:  20%|██        | 78/389 [01:32<06:10,  1.19s/it, v_num=1, score=258.0, loss=2.420]
Epoch 0:  39%|███▉      | 152/389 [02:28<03:52,  1.02it/s, v_num=1, score=258.0, loss=1.850]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 0/3 [00:00<?, ?it/s]
Epoch 0:  40%|███▉      | 154/389 [02:32<03:52,  1.01it/s, v_num=1, score=258.0, loss=1.850]
Validating:  67%|██████▋   | 2/3 [00:03<00:01,  1.64s/it]
Epoch 0:  40%|████      | 156/389 [02:36<03:53,  1.00s/it, v_num=1, score=182.0, loss=1.890]
Epoch 0:  59%|█████▉    | 230/389 [03:24<02:21,  1.12it/s, v_num=1, score=182.0, loss=1.720]
Validating: 0it [00:00, ?it/s]
Validating:   0%|     

Saving latest checkpoint...


Epoch 0: 100%|██████████| 389/389 [05:26<00:00,  1.19it/s, v_num=1, score=98.70, loss=1.630]
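
The `score` shown in the progress bar is monitored with `mode="min"`, which fits an edit-distance metric: the competition ranks submissions by mean Levenshtein distance between predicted and true InChI strings. `get_score` itself is in `utils` and not shown here, so the following is a self-contained sketch of that metric rather than the notebook's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def mean_levenshtein(y_true, y_pred) -> float:
    """Average edit distance over paired label/prediction strings."""
    return sum(levenshtein(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mean_levenshtein(["InChI=1S/H2O/h1H2"], ["InChI=1S/H2O/h1H3"]))  # 1.0
```

A score of 98.70 after this short run therefore means the predictions still differ from the labels by roughly 99 character edits on average.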


# Inference
---

In [32]:
test_df = pd.read_csv(TEST_FILE)
test_df['file_path'] = test_df['image_id'].apply(lambda x: path_from_image_id(x, TEST_DIR))

LOGGER.info(f'test.shape: {test_df.shape}')
display(test_df.head())

[2021-04-08 14:19:57,889] INFO:__main__: test.shape: (1616107, 3)


| | image_id | InChI | file_path |
|---|---|---|---|
| 0 | 00000d2a601c | InChI=1S/H2O/h1H2 | ../input/bms-molecular-translation/test/0/0/0/... |
| 1 | 00001f7fc849 | InChI=1S/H2O/h1H2 | ../input/bms-molecular-translation/test/0/0/0/... |
| 2 | 000037687605 | InChI=1S/H2O/h1H2 | ../input/bms-molecular-translation/test/0/0/0/... |
| 3 | 00004b6d55b6 | InChI=1S/H2O/h1H2 | ../input/bms-molecular-translation/test/0/0/0/... |
| 4 | 00004df0fe53 | InChI=1S/H2O/h1H2 | ../input/bms-molecular-translation/test/0/0/0/... |


In [33]:
tokenizer = torch.load(os.path.join(CFG.output_dir, 'tokenizer2.pth'))
LOGGER.info(f"tokenizer.stoi: {tokenizer.stoi}")

[2021-04-08 14:20:26,189] INFO:__main__: tokenizer.stoi: {'(': 0, ')': 1, '+': 2, ',': 3, '-': 4, '/b': 5, '/c': 6, '/h': 7, '/i': 8, '/m': 9, '/s': 10, '/t': 11, '0': 12, '1': 13, '10': 14, '100': 15, '101': 16, '102': 17, '103': 18, '104': 19, '105': 20, '106': 21, '107': 22, '108': 23, '109': 24, '11': 25, '110': 26, '111': 27, '112': 28, '113': 29, '114': 30, '115': 31, '116': 32, '117': 33, '118': 34, '119': 35, '12': 36, '120': 37, '121': 38, '122': 39, '123': 40, '124': 41, '125': 42, '126': 43, '127': 44, '128': 45, '129': 46, '13': 47, '130': 48, '131': 49, '132': 50, '133': 51, '134': 52, '135': 53, '136': 54, '137': 55, '138': 56, '139': 57, '14': 58, '140': 59, '141': 60, '142': 61, '143': 62, '144': 63, '145': 64, '146': 65, '147': 66, '148': 67, '149': 68, '15': 69, '150': 70, '151': 71, '152': 72, '153': 73, '154': 74, '155': 75, '156': 76, '157': 77, '158': 78, '159': 79, '16': 80, '161': 81, '163': 82, '165': 83, '167': 84, '17': 85, '18': 86, '19': 87, '2': 88, '20': 

In [35]:
model = ImageCaptioner.load_from_checkpoint(
    os.path.join(CFG.output_dir, "last.ckpt"),
    model_name=CFG.model_name,
    tokenizer=tokenizer,
    encoder_lr=CFG.encoder_lr,
    decoder_lr=CFG.decoder_lr,
    weight_decay=CFG.weight_decay,
    amsgrad=False,
    attention_dim=CFG.attention_dim,
    embed_dim=CFG.embed_dim,
    decoder_dim=CFG.decoder_dim,
    dropout=CFG.dropout,
    max_len=CFG.max_len,
    valid_labels=None,
    gradient_accumulation_steps=None,
    max_grad_norm=None,
    device=device
)

In [36]:
test_dataset = TestDataset(test_df.head(2000), transform=get_transforms(data='valid', size=CFG.size))
test_loader = DataLoader(test_dataset, batch_size=512, shuffle=False, num_workers=CFG.num_workers, drop_last=False)
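
The approach list at the top mentions rotating test images upright to match the train set orientation. How `TestDataset` does this is not shown in this notebook, so the sketch below is only a guess at the idea: if an image is taller than it is wide, rotate it by 90 degrees. The function name, the use of nested lists as a stand-in for an image array, and the counter-clockwise direction are all assumptions.

```python
def rotate_upright_sketch(image):
    """Rotate a tall image (list of pixel rows) by 90 degrees; leave wide ones alone."""
    height, width = len(image), len(image[0])
    if height <= width:
        return image
    # Transpose, then flip vertically: a 90-degree counter-clockwise rotation.
    transposed = [list(column) for column in zip(*image)]
    return transposed[::-1]

tall = [[1, 2],
        [3, 4],
        [5, 6]]
print(rotate_upright_sketch(tall))  # [[2, 4, 6], [1, 3, 5]]
```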

In [37]:
model.to(device)
model.eval()

predictions = []
with torch.no_grad():  # gradients are not needed at inference time
    for images in tqdm(test_loader, total=len(test_loader)):
        images = images.to(device)
        predictions.extend(model.predict(images))


  0%|          | 0/4 [00:00<?, ?it/s]
 25%|██▌       | 1/4 [00:04<00:12,  4.30s/it]
 50%|█████     | 2/4 [00:04<00:04,  2.10s/it]
 75%|███████▌  | 3/4 [00:05<00:01,  1.47s/it]
100%|██████████| 4/4 [00:06<00:00,  1.59s/it]


In [38]:
del test_dataset, test_loader, model; gc.collect()

2775

In [40]:
predictions[:10]

['InChI=1S/C13H16N2O2/c1-3-4-7-11(13)11-11(14)12-11(2)4-5-8-11(14)15(16)15/h3-6,8,11H,4-6,8-10H2,1-3H3',
 'InChI=1S/C15H21N2O2/c1-3-3-4-6-11-14(17)16-11-14-14-11-12-11-12-14(17)16-11-14-15/h3-6,10,12,16H,3-5,8-10H2,1-3H3',
 'InChI=1S/C14H17ClN2O/c1-11-11-8-9-11(17)16(18)18-11-8-9-11(17)16-11-11-12(2)3-4-6-11/h1-4,8-10H,1-2H3',
 'InChI=1S/C15H23NO2/c1-3-4-6-11(17)16(17)18-11-8-9-11(17)16(17)18-11-8-5-5-7-11(16)17/h3-7,10-14H,4-6,8-11H2,1-3H3',
 'InChI=1S/C10H12N2O2/c1-4-6-8-9(2)7-8(11)12(2)4-6-8-10(2)4/h3-8,10H,1-2H3,(H,14,,)(,14,,))',
 'InChI=1S/C21H29NO/c1-2-3-4-6-11-14-14-16-17-14-16-17-14-16-17-16-17-16-17-16-17-16-15-16-17/h2-5,8-10,14-14,16H,3-5,8-10,13-14H2,1-2H3',
 'InChI=1S/C14H23N2O/c1-2-3-4-8-11-12-11-12-14(14)14-10-11-12-14-12-14-12-14/h2-6,8,11,14H,3-5,8-10H2,1-2H3',
 'InChI=1S/C24H29N4O5/c1-2-3-4-8-14(24)25(25)25-14-8-9-17(25)24(25)25(25)25-14-8-9-17(25)24(25)25-14-8-9-17(24)25(25)25/h3-7,11-14,17H,4-6,8-11,13-14H2,1-2H3,(H,28,29)(H,28,29)(H,28,29)(H,28,29)(H,28,29)(H,28,2

In [41]:
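# submission
# Only the first len(predictions) test rows were scored above, so align on that subset.
# The predictions printed above already start with "InChI=1S/", so no prefix is added here.
submission = test_df[['image_id']].head(len(predictions)).copy()
submission['InChI'] = predictions
submission.to_csv(os.path.join(CFG.output_dir, 'submission.csv'), index=False)
submission.head()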