# Retraining Parrot

---

We want to investigate whether retraining Parrot improves results on smaller, single-reaction-class datasets. Here we use a pre-processed Suzuki reaction dataset, which is a subset of USPTO-Condition.


In [1]:
import gdown

We start by downloading the pre-trained and trained models into their required directories, and then copy over our Suzuki dataset.


In [2]:
# Download the model checkpoints and the dataset archive using gdown:
files_and_paths = [
    [
        "https://drive.google.com/uc?id=1gFV2KdVKaLCTeb3nrzopyYHXbM0G_cr_",
        "outputs/Parrot_train_in_USPTO_Condition_enhance.zip",
    ],
    [
        "https://drive.google.com/uc?id=1bVB89ByGkYjiUtbvEcp1mgwmoKy5Ka2b",
        "outputs/best_rcm_model_pretrain.zip",
    ],
    [
        "https://drive.google.com/uc?id=1DmHILXSOhUuAzqF0JmRTx1EcOOQ7Bm5O",
        "outputs/best_mlm_model_pretrain.zip",
    ],
    [
        "https://drive.google.com/uc?id=1aX70qzZrJ9TZ9KpqnvUVR8WBxiTwXOsI",
        "dataset/source_dataset/USPTO_condition_final.zip",
    ],
]

# Each entry is a (download URL, local output path) pair
for url, path in files_and_paths:
    gdown.download(url, path, quiet=False)


In [41]:
import zipfile
import os

files_and_output_paths = [
    [
        "outputs/Parrot_train_in_USPTO_Condition_enhance.zip",
        "outputs/Parrot_train_in_USPTO_Condition_enhance",
    ],
    [
        "outputs/best_rcm_model_pretrain.zip",
        "outputs/best_rcm_model_pretrain",
    ],
    [
        "outputs/best_mlm_model_pretrain.zip",
        "outputs/best_mlm_model_pretrain",
    ],
    [
        "dataset/source_dataset/USPTO_condition_final.zip",
        "dataset/source_dataset/USPTO_condition_final",
    ],
]

for zip_file_path, extract_to_path in files_and_output_paths:
    # Create the directory if it does not exist
    os.makedirs(extract_to_path, exist_ok=True)

    # Unzip the file
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(extract_to_path)

    print(f"Files extracted to {extract_to_path}")

# Finally remove the zip files
for zip_file_path, _ in files_and_output_paths:
    os.remove(zip_file_path)
    print(f"Deleted {zip_file_path}")

Files extracted to outputs/Parrot_train_in_USPTO_Condition_enhance
Files extracted to outputs/best_rcm_model_pretrain
Files extracted to outputs/best_mlm_model_pretrain
Files extracted to dataset/source_dataset/USPTO_condition_final
Deleted outputs/Parrot_train_in_USPTO_Condition_enhance.zip
Deleted outputs/best_rcm_model_pretrain.zip
Deleted outputs/best_mlm_model_pretrain.zip
Deleted dataset/source_dataset/USPTO_condition_final.zip


Next we inspect the Suzuki file:


In [3]:
import pandas as pd

df = pd.read_csv("dataset/source_dataset/suzuki/uspto_condition_suzuki_with_ids.csv")
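# Keep only reactions with a Pd catalyst and boron in the reaction centre;
# na=False treats missing values as non-matches
cleaned_df = df[
    df["catalyst1"].str.contains("Pd", na=False)
    & df["rxn_centre_strs"].str.contains("B;", na=False)
]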
cleaned_df.head(3)

Unnamed: 0,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,source,dataset,rxn_category,rxn_class_name,remapped_rxn,rxn_centre_strs,rxn_id,valid_rxn
1,CB1OB(C)OB(C)O1.Cc1cccc(Nc2nc(N[C@@H]3CCCC[C@@...,Cl[Pd]Cl,C1COCCO1,O,O=C([O-])[O-].[K+].[K+],,US20110152273A1,train,3.1,Suzuki coupling,Br[c:26]1[c:10]([NH:11][C@@H:12]2[CH2:13][CH2:...,[O;s>s2>2;][B;s>s3>2;]([O;s>s2>2;])[->.][C;s>s...,1,True
2,Nc1cnc(Br)cn1.OB(O)c1ccc(Br)cc1>>Nc1cnc(-c2ccc...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CCO,CCOC(C)=O,Cc1ccccc1,O=C([O-])[O-].[K+].[K+],US20140221311A1,train,3.1,Suzuki coupling,Br[c:5]1[n:4][cH:3][c:2]([NH2:1])[n:14][cH:13]...,[O;s>s1>1;][B;s>s3>2;]([O;s>s1>1;])[->.][C;a>a...,2,True
3,CN1CCC(Oc2ccc(B3OC(C)(C)C(C)(C)O3)cc2)CC1.O=C(...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CCO,,O=C([O-])[O-].[Na+].[Na+],,US20140051679A1,train,3.1,Suzuki coupling,CC1(C)OB([c:10]2[cH:9][cH:8][c:7]([O:6][CH:5]3...,[O;s>s2>2;][B;s>s3>2;]([O;s>s2>2;])[->.][C;a>a...,3,True
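
To make the effect of this filter concrete, here is a minimal sketch on a toy DataFrame. The three rows are invented for illustration and are not taken from the dataset; only the column names and the two substring checks mirror the cell above.

```python
import pandas as pd

# Toy rows, invented for illustration only (not real dataset entries)
toy_df = pd.DataFrame(
    {
        "catalyst1": ["Cl[Pd]Cl", "[Zn]", None],
        "rxn_centre_strs": [
            "[O;s>s2>2;][B;s>s3>2;]",  # boron atom in the reaction centre
            "[O;s>s2>2;][B;s>s3>2;]",  # boron present, but no Pd catalyst
            "[C;a>a3>3;]",             # no boron, and the catalyst is missing
        ],
    }
)

# na=False makes missing values count as non-matches instead of NaN
has_pd = toy_df["catalyst1"].str.contains("Pd", na=False)
has_boron = toy_df["rxn_centre_strs"].str.contains("B;", na=False)

# Only rows that satisfy both conditions survive
toy_cleaned = toy_df[has_pd & has_boron]
print(toy_cleaned)
```

Only the first toy row passes both checks, so rows with a non-Pd catalyst or a missing catalyst are dropped.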


We compare this to the raw USPTO-Condition dataset so that we can convert the Suzuki dataset into a form that the model can be trained on.


In [4]:
uspto_ref_df = pd.read_csv(
    "/data1/mball/rcr-benchmark/datasets/uspto-condition/parrot/USPTO_condition.csv"
)
uspto_ref_df.head(1)

Unnamed: 0,source,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,dataset
0,US20090239848A1,O=[N+]([O-])c1ccc(N2CCOCC2)cc1>>Nc1ccc(N2CCOCC...,[Zn],C1CCOC1,O,CO,[Cl-].[NH4+],train


OK, so the only columns that we need are the canonical_rxn, condition (catalyst, solvent, reagent) and dataset columns.
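
Selecting the reference columns can be sketched as a small helper that fails loudly when a column is missing. The function name `select_reference_columns` and the toy frames below are hypothetical and only serve the example; the reaction string and patent ID are placeholders.

```python
import pandas as pd


def select_reference_columns(df: pd.DataFrame, reference_columns) -> pd.DataFrame:
    """Return a copy of df restricted to reference_columns, in that order."""
    missing = [col for col in reference_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns missing from the input frame: {missing}")
    # .copy() gives an independent frame that is safe to modify later
    return df[list(reference_columns)].copy()


# Toy frames with placeholder values (not real dataset entries)
reference = pd.DataFrame(columns=["source", "canonical_rxn", "catalyst1", "dataset"])
candidate = pd.DataFrame(
    {
        "canonical_rxn": ["CCO>>CC=O"],
        "catalyst1": ["Cl[Pd]Cl"],
        "rxn_id": [1],  # extra column that the reference frame does not have
        "source": ["PLACEHOLDER_ID"],
        "dataset": ["train"],
    }
)

aligned = select_reference_columns(candidate, reference.columns)
print(list(aligned.columns))
```

The extra `rxn_id` column is dropped and the remaining columns follow the order of the reference frame.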


In [5]:
# Take an explicit copy so that later modifications do not trigger
# pandas' SettingWithCopyWarning
suzuki_cleaned_df = cleaned_df[uspto_ref_df.columns].copy()
# Replace missing conditions with empty strings
suzuki_cleaned_df = suzuki_cleaned_df.fillna("")
suzuki_cleaned_df.to_csv(
    "dataset/source_dataset/suzuki/suzuki_cleaned.csv", index=False
)
suzuki_cleaned_df.head(3)

Unnamed: 0,source,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,dataset
1,US20110152273A1,CB1OB(C)OB(C)O1.Cc1cccc(Nc2nc(N[C@@H]3CCCC[C@@...,Cl[Pd]Cl,C1COCCO1,O,O=C([O-])[O-].[K+].[K+],,train
2,US20140221311A1,Nc1cnc(Br)cn1.OB(O)c1ccc(Br)cc1>>Nc1cnc(-c2ccc...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CCO,CCOC(C)=O,Cc1ccccc1,O=C([O-])[O-].[K+].[K+],train
3,US20140051679A1,CN1CCC(Oc2ccc(B3OC(C)(C)C(C)(C)O3)cc2)CC1.O=C(...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CCO,,O=C([O-])[O-].[Na+].[Na+],,train


Next we need to write our config for the training. We use `configs/config_uspto_condition.yaml` as the starting point; the only modifications are:

- using `outputs/Parrot_train_in_USPTO_Condition_enhance/Parrot_train_in_USPTO_Condition_enhance` for the pretrained_path
- your choice of output directory
- all other configs, including the model args, are left the same
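
As a rough sketch of these modifications, the snippet below applies them to a config held as a plain Python dict. Only `pretrained_path` is named in this notebook; the `output_dir` and `model_args` keys, the placeholder values, and the helper `make_retrain_config` are assumptions for illustration and may not match the layout of the real YAML file.

```python
import copy


def make_retrain_config(base_config: dict, pretrained_path: str, output_dir: str) -> dict:
    """Return a modified copy of base_config; the input dict is left untouched."""
    config = copy.deepcopy(base_config)
    config["pretrained_path"] = pretrained_path
    config["output_dir"] = output_dir  # hypothetical key name
    return config


# Placeholder base config; keys other than pretrained_path are hypothetical
base = {
    "pretrained_path": "path/to/original_pretrained_model",
    "output_dir": "path/to/original_output",
    "model_args": {"hypothetical_arg": 1},
}

retrain = make_retrain_config(
    base,
    pretrained_path=(
        "outputs/Parrot_train_in_USPTO_Condition_enhance/"
        "Parrot_train_in_USPTO_Condition_enhance"
    ),
    output_dir="outputs/my_suzuki_retrain",  # your choice of output directory
)
print(retrain)
```

Working on a deep copy keeps the base config reusable, and everything that is not explicitly overridden is carried over unchanged.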

Note that we don't actually need the USPTO-Condition dataset itself, only its idx.pkl file, which we move into the Suzuki dataset folder.


In [46]:
src_dir = "dataset/source_dataset/USPTO_condition_final/USPTO_condition_final"
dst_dir = "dataset/source_dataset/suzuki"

for file in os.listdir(src_dir):
    # Delete everything except the idx pickles (the label pickles are deleted too)
    if not file.endswith(".pkl") or file.endswith("labels.pkl"):
        os.remove(os.path.join(src_dir, file))
    else:
        print(f"Moving {file} to {dst_dir}/")
        os.rename(os.path.join(src_dir, file), os.path.join(dst_dir, file))

# Remove the two now-empty directories one level at a time. os.removedirs also
# prunes empty parents, so calling it twice raises FileNotFoundError.
os.rmdir(src_dir)
os.rmdir(os.path.dirname(src_dir))

# Remove the idx file that we do not need
os.remove(os.path.join(dst_dir, "USPTO_condition_alldata_idx.pkl"))

# Finally we rename our desired idx file:
os.rename(
    os.path.join(dst_dir, "USPTO_condition_aug_n5_alldata_idx.pkl"),
    os.path.join(dst_dir, "suzuki_cleaned_alldata_idx.pkl"),
)

Before training, we need to make one small adjustment to the best model's config file, setting:

```json
{
  "processing_for_evaluation": false,
  "use_early_stopping": true
}
```

This change is only needed if we want to use early stopping.
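
If you prefer to script this edit instead of changing the file by hand, a minimal sketch using only the standard library could look like the following. The helper `update_json_config` is hypothetical, and the demo runs on a throwaway file with invented starting values, since the real config's location and contents are not shown here.

```python
import json
import os
import tempfile


def update_json_config(config_path: str, updates: dict) -> dict:
    """Load a JSON config, apply the given updates, and write it back."""
    with open(config_path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    config.update(updates)
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config


# Demo on a temporary file; the starting values are invented for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_config.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(
            {
                "processing_for_evaluation": True,
                "use_early_stopping": False,
                "other_setting": 1,
            },
            handle,
        )

    updated = update_json_config(
        demo_path,
        {"processing_for_evaluation": False, "use_early_stopping": True},
    )
    print(updated)
```

Keys that are not listed in the updates are preserved as they were.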

We then train the model with

```bash
python train_parrot_model.py --gpu 0 --config_path configs/config_suzuki_retrain.yaml
```

and then test it with

```bash
python test_parrot_model.py --gpu 0 --config_path ./configs/config_test_suzuki_retrain.yaml --verbose
```

The `suzuki_retrain_topk_accuracy.csv` and `verbose_output.csv` files can then be found in the checkpoint folder of the chosen model.
