# Preparing Data to Re-Train Parrot

---

We want to investigate whether retraining Parrot improves results on smaller, single-reaction-class datasets. Here we use a pre-processed Suzuki reaction dataset, which is a subset of USPTO-Condition.


In [None]:
import os
import zipfile

import gdown
import pandas as pd

We start by downloading the pre-trained and trained models, along with the USPTO-Condition dataset, into their required directories. We then copy over our Suzuki dataset, which was generated in a different repository.


In [2]:
# import files using gdown:
files_and_paths = [
    [
        "https://drive.google.com/uc?id=1gFV2KdVKaLCTeb3nrzopyYHXbM0G_cr_",
        "../outputs/best_uspto_condition.zip",
    ],
    [
        "https://drive.google.com/uc?id=1bVB89ByGkYjiUtbvEcp1mgwmoKy5Ka2b",
        "../outputs/best_rcm_model_pretrain.zip",
    ],
    [
        "https://drive.google.com/uc?id=1DmHILXSOhUuAzqF0JmRTx1EcOOQ7Bm5O",
        "../outputs/best_mlm_model_pretrain.zip",
    ],
    [
        "https://drive.google.com/uc?id=1aX70qzZrJ9TZ9KpqnvUVR8WBxiTwXOsI",
        "../dataset/source_dataset/uspto_condition.zip",
    ],
]

for file, path in files_and_paths:
    gdown.download(file, path, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1gFV2KdVKaLCTeb3nrzopyYHXbM0G_cr_
To: /data1/mball/parrot-verbose/outputs/best_uspto_condition.zip
100%|██████████| 101M/101M [00:01<00:00, 98.4MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1bVB89ByGkYjiUtbvEcp1mgwmoKy5Ka2b
To: /data1/mball/parrot-verbose/outputs/best_rcm_model_pretrain.zip
100%|██████████| 72.4M/72.4M [00:11<00:00, 6.07MB/s]
Downloading...
From: https://drive.google.com/uc?id=1DmHILXSOhUuAzqF0JmRTx1EcOOQ7Bm5O
To: /data1/mball/parrot-verbose/outputs/best_mlm_model_pretrain.zip
100%|██████████| 72.5M/72.5M [00:00<00:00, 98.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1aX70qzZrJ9TZ9KpqnvUVR8WBxiTwXOsI
To: /data1/mball/parrot-verbose/dataset/source_dataset/uspto_condition.zip
100%|██████████| 536M/536M [00:05<00:00, 94.0MB/s] 


In [3]:
files_and_output_paths = [
    [
        "../outputs/best_uspto_condition.zip",
        "../outputs/best_uspto_condition",
    ],
    [
        "../outputs/best_rcm_model_pretrain.zip",
        "../outputs/best_rcm_model_pretrain",
    ],
    [
        "../outputs/best_mlm_model_pretrain.zip",
        "../outputs/best_mlm_model_pretrain",
    ],
    [
        "../dataset/source_dataset/uspto_condition.zip",
        "../dataset/source_dataset/uspto_condition",
    ],
]

for zip_file_path, extract_to_path in files_and_output_paths:
    # Create the directory if it does not exist
    os.makedirs(extract_to_path, exist_ok=True)

    # Unzip the file
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(extract_to_path)

    print(f"Files extracted to {extract_to_path}")

Files extracted to ../outputs/best_uspto_condition
Files extracted to ../outputs/best_rcm_model_pretrain
Files extracted to ../outputs/best_mlm_model_pretrain
Files extracted to ../dataset/source_dataset/uspto_condition


In [4]:
# Finally remove the zip files
for zip_file_path, _ in files_and_output_paths:
    os.remove(zip_file_path)
    print(f"Deleted {zip_file_path}")

Deleted ../outputs/best_uspto_condition.zip
Deleted ../outputs/best_rcm_model_pretrain.zip
Deleted ../outputs/best_mlm_model_pretrain.zip
Deleted ../dataset/source_dataset/uspto_condition.zip
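
The extract and delete steps above can also be combined so that an archive is only removed once it has been checked and extracted. The following is a minimal sketch, not part of the Parrot repository; the helper name `extract_and_remove` is hypothetical, and the demo runs on a throwaway archive rather than the real model files.

```python
import os
import tempfile
import zipfile
from typing import List


def extract_and_remove(zip_path: str, extract_to: str) -> List[str]:
    """Extract zip_path into extract_to, then delete the archive."""
    os.makedirs(extract_to, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in {zip_path}: {bad_member}")
        zip_ref.extractall(extract_to)
        names = zip_ref.namelist()
    # Only reached if the integrity check and extraction both succeeded
    os.remove(zip_path)
    return names


# Demo on a throwaway archive in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("model/config.json", "{}")
    extracted = extract_and_remove(demo_zip, os.path.join(tmp, "demo"))
    archive_still_exists = os.path.exists(demo_zip)
    extracted_file_exists = os.path.exists(
        os.path.join(tmp, "demo", "model", "config.json")
    )
```

If extraction fails partway through, the archive is left in place, so the download does not have to be repeated.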


Next we inspect the Suzuki file. Note that it has already been 'cleaned':


In [None]:
# Use this to set the name of the folder where the data will be saved
folder_name = "uspto_suzuki"

clean_df = pd.read_csv(
    f"../dataset/source_dataset/{folder_name}/uspto_condition_valid_suzuki_no_rc.csv"
)
clean_df.head(3)

Unnamed: 0,source,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,dataset,rxn_category,rxn_class_name,remapped_rxn,rxn_id
0,US20130190293A1,CC1(C)OB(c2ccc(O)nc2)OC1(C)C.Cc1ccc2c(c1)c1c(n...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,COCCOC,,O,O=C([O-])[O-].[K+].[K+],train,3.1,Suzuki coupling,Br[c:15]1[cH:14][cH:13][cH:12][c:11](-[n:10]2[...,8390
1,US20100016368A1,C=C(C)B1OC(C)(C)C(C)(C)O1.CCOC(=O)Cn1ncc2c1CCC...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CN(C)C=O,,[Cl-].[NH4+],CC(C)(C)[O-].[K+],train,3.1,Suzuki coupling,Br[c:4]1[cH:5][c:6]([C:7]([F:8])([F:9])[F:10])...,13508
2,US20140163009A1,CB1OB(C)OB(C)O1.COC(=O)[C@@H]1[C@H](c2ccccc2)[...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,C1COCCO1,O,O=C([O-])[O-].[Cs+].[Cs+],,train,3.1,Suzuki coupling,Br[c:10]1[cH:9][cH:8][c:7]([C@H:6]2[C@H:5]([C:...,13901


We compare this to the raw dataset, so that we can convert the Suzuki data into a form that the model can be trained on.


In [6]:
uspto_ref_df = pd.read_csv(
    "../dataset/source_dataset/uspto_condition/USPTO_condition_final/USPTO_condition.csv"
)
uspto_ref_df.head(1)

Unnamed: 0,source,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,dataset
0,US20090239848A1,O=[N+]([O-])c1ccc(N2CCOCC2)cc1>>Nc1ccc(N2CCOCC...,[Zn],C1CCOC1,O,CO,[Cl-].[NH4+],train


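Ok, so we only need the columns that are present in the raw dataset: most importantly `canonical_rxn`, the condition columns (`catalyst1`, `solvent1`, `solvent2`, `reagent1`, `reagent2`) and `dataset`. The extra columns (`rxn_category`, `rxn_class_name`, `remapped_rxn`, `rxn_id`) can be dropped.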


In [8]:
# Take an explicit copy so that fillna does not act on a slice of clean_df
# (this avoids pandas' SettingWithCopyWarning)
suzuki_cleaned_df = clean_df[uspto_ref_df.columns].copy()
suzuki_cleaned_df = suzuki_cleaned_df.fillna("")
suzuki_cleaned_df.to_csv(
    f"../dataset/source_dataset/{folder_name}/suzuki_cleaned.csv", index=False
)
suzuki_cleaned_df.head(3)


Unnamed: 0,source,canonical_rxn,catalyst1,solvent1,solvent2,reagent1,reagent2,dataset
0,US20130190293A1,CC1(C)OB(c2ccc(O)nc2)OC1(C)C.Cc1ccc2c(c1)c1c(n...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,COCCOC,,O,O=C([O-])[O-].[K+].[K+],train
1,US20100016368A1,C=C(C)B1OC(C)(C)C(C)(C)O1.CCOC(=O)Cn1ncc2c1CCC...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,CN(C)C=O,,[Cl-].[NH4+],CC(C)(C)[O-].[K+],train
2,US20140163009A1,CB1OB(C)OB(C)O1.COC(=O)[C@@H]1[C@H](c2ccccc2)[...,c1ccc([P](c2ccccc2)(c2ccccc2)[Pd]([P](c2ccccc2...,C1COCCO1,O,O=C([O-])[O-].[Cs+].[Cs+],,train
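
The column selection above can be wrapped in a small helper that fails early if the cleaned data lacks a column the reference dataset has. This is only a sketch: the name `align_to_reference` is hypothetical, and the demo frame uses toy placeholder values rather than real reactions.

```python
import pandas as pd


def align_to_reference(df: pd.DataFrame, reference_columns) -> pd.DataFrame:
    """Keep only the reference columns (in order) and replace NaN with ''."""
    missing = [col for col in reference_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns missing from the cleaned data: {missing}")
    # .loc with a list of labels returns a new frame, and fillna returns a copy,
    # so the input frame is left unchanged
    return df.loc[:, list(reference_columns)].fillna("")


# Toy data: one row, one empty solvent slot, one extra column to drop
demo_df = pd.DataFrame(
    {
        "canonical_rxn": ["A.B>>C"],
        "solvent2": [None],
        "dataset": ["train"],
        "rxn_id": [1],
    }
)
aligned = align_to_reference(demo_df, ["canonical_rxn", "solvent2", "dataset"])
```

In the notebook this would be called as `align_to_reference(clean_df, uspto_ref_df.columns)`.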


Next we need to write our config for training. We start from `configs/config_uspto_condition.yaml`; the only modifications are:

- setting `pretrained_path` to `outputs/Parrot_train_in_USPTO_Condition_enhance/Parrot_train_in_USPTO_Condition_enhance`
- your choice of output directory
- all other configs, including the model args, are left the same

We don't actually need the USPTO-Condition dataset itself, only its idx.pkl file, which we move into our dataset folder.


In [None]:
condition_dir = "../dataset/source_dataset/uspto_condition/USPTO_condition_final"

for file in os.listdir(condition_dir):
    # Delete everything except the idx.pkl files (labels.pkl files are not needed)
    if not file.endswith(".pkl") or file.endswith("labels.pkl"):
        os.remove(f"{condition_dir}/{file}")
    else:
        print(f"Moving {file} to dataset/source_dataset/{folder_name}/")
        os.rename(
            f"{condition_dir}/{file}",
            f"../dataset/source_dataset/{folder_name}/{file}",
        )

# os.removedirs also removes the now-empty parent directory (uspto_condition),
# so a second call on the parent is not needed and would raise FileNotFoundError
os.removedirs(condition_dir)

Moving USPTO_condition_aug_n5_alldata_idx.pkl to dataset/source_dataset/suzuki_new/
Moving USPTO_condition_alldata_idx.pkl to dataset/source_dataset/suzuki_new/

In [17]:
os.remove(f"../dataset/source_dataset/{folder_name}/USPTO_condition_alldata_idx.pkl")

# Finally we rename our desired idx file:
os.rename(
    f"../dataset/source_dataset/{folder_name}/USPTO_condition_aug_n5_alldata_idx.pkl",
    f"../dataset/source_dataset/{folder_name}/suzuki_cleaned_alldata_idx.pkl",
)

Before we do anything else, we need to make one small adjustment to the best model config file. This is NOT the config in /configs, but rather the `model_args.json` file in `outputs/best_uspto_condition` or `outputs/best_mlm_model_pretrain`. We change:

```json
{
  "multiprocessing_for_evaluation": false,
  "use_early_stopping": false,
  "early_stopping_consider_epochs": false,
  "evaluate_during_training_silent": true,
  "evaluate_during_training_verbose": false
}
```

This change is only needed if we want to use early stopping.
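
Rather than editing `model_args.json` by hand, the keys shown above can be patched with a few lines of Python. The following is a sketch under assumptions: the helper name `update_model_args` is hypothetical, the override values are copied from the snippet above, and the demo file contains a made-up `some_other_setting` key purely to show that unrelated settings are preserved.

```python
import json
import os
import tempfile


def update_model_args(path: str, overrides: dict) -> dict:
    """Load a model_args.json file, apply overrides, and write it back."""
    with open(path, "r", encoding="utf-8") as fh:
        args = json.load(fh)
    args.update(overrides)  # keys not listed in overrides are left untouched
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(args, fh, indent=2)
    return args


# Keys and values copied from the JSON snippet above
overrides = {
    "multiprocessing_for_evaluation": False,
    "use_early_stopping": False,
    "early_stopping_consider_epochs": False,
    "evaluate_during_training_silent": True,
    "evaluate_during_training_verbose": False,
}

# Demo on a temporary file with made-up starting content
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model_args.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump({"use_early_stopping": True, "some_other_setting": 1}, fh)
    updated = update_model_args(demo_path, overrides)
```

For the real file, `path` would point at the `model_args.json` in `outputs/best_uspto_condition` or `outputs/best_mlm_model_pretrain`.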

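We need to generate the tokens for our data, which we can do by navigating into the `./preprocess_script/uspto_script` directory and running the following, replacing `dataset_dir`, `dataset_fname`, `{folder_name}` and `{file_name}` with your own values: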

```bash
python 5.0.convert_context_tokens.py --source_data_path ../../dataset/source_dataset/ --dataset_dir_name dataset_dir --dataset_fname dataset_fname --idx2data_fpath ../../dataset/source_dataset/{folder_name}/{file_name}_alldata_idx.pkl
```
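
To avoid typos when filling in the placeholders, the command can be assembled in Python first. This is a sketch only: the helper `build_token_command` is hypothetical, and it assumes that `--dataset_dir_name` takes the folder name and that `--dataset_fname` takes the CSV file name. Neither assumption is confirmed here, so check the script's argument definitions before relying on it.

```python
import shlex
from typing import List


def build_token_command(
    folder_name: str,
    file_name: str,
    dataset_fname: str,
    source_data_path: str = "../../dataset/source_dataset/",
) -> List[str]:
    """Assemble the argument list for 5.0.convert_context_tokens.py."""
    idx_path = f"{source_data_path}{folder_name}/{file_name}_alldata_idx.pkl"
    return [
        "python",
        "5.0.convert_context_tokens.py",
        "--source_data_path", source_data_path,
        "--dataset_dir_name", folder_name,  # assumption: the folder name goes here
        "--dataset_fname", dataset_fname,   # assumption: the CSV file name goes here
        "--idx2data_fpath", idx_path,
    ]


cmd = build_token_command(
    folder_name="uspto_suzuki",
    file_name="suzuki_cleaned",
    dataset_fname="suzuki_cleaned.csv",  # assumed value, adjust as needed
)
# Print a copy-pasteable version of the command
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)` from inside the `./preprocess_script/uspto_script` directory.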

We then train the model with

```bash
python train_parrot_model.py --gpu 0 --config_path configs/{config_name}.yaml
```

and test it with

```bash
python test_parrot_model.py --gpu 0 --config_path ./configs/{test_config_name}.yaml --verbose
```

After testing, the `suzuki_retrain_topk_accuracy.csv` and `verbose_output.csv` files can be found in the checkpoint folder of the chosen model.

Note that the config file for testing is different from the one for training (since we are specifying a different trained model).

You should find that your trained models (and checkpoints from each epoch) are under the /out folder. We take the model at the top level of this folder (the one produced after the final training epoch) as the best model. All other models in these folders can be deleted.
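
Before deleting anything, it can help to list which subfolders look like intermediate checkpoints. The sketch below assumes that they are stored as subdirectories whose names start with `checkpoint-`; this naming is an assumption and should be checked against your own output folder. The helper `list_intermediate_checkpoints` is hypothetical and only lists folders, it does not delete them.

```python
import tempfile
from pathlib import Path
from typing import List


def list_intermediate_checkpoints(output_dir: str, prefix: str = "checkpoint-") -> List[str]:
    """Return the sorted names of subdirectories that start with prefix."""
    return sorted(
        p.name
        for p in Path(output_dir).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )


# Demo on a temporary folder with made-up checkpoint names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["checkpoint-200-epoch-2", "checkpoint-100-epoch-1"]:
        (Path(tmp) / name).mkdir()
    # A file at the top level stands in for the final model's files
    (Path(tmp) / "model_args.json").write_text("{}")
    candidates = list_intermediate_checkpoints(tmp)
```

Once the list has been reviewed, the listed folders could be removed with `shutil.rmtree`.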
