
model cannot be trained #28

Closed
yuanqidu opened this issue Dec 7, 2021 · 31 comments

Comments

@yuanqidu

yuanqidu commented Dec 7, 2021

When I attempted to train the model with train.py, using the provided example file it2_tt_0_lowrmsd_valid_mols_test0_1000.types and the crossdock folder, the process froze in the data loader's next_batch step.

Looking forward to your help!

@mattragoza
Owner

The data file that references the structures in the data/crossdock2020/ directory is data/crossdock2020/selected_test_targets.types. This only includes the 10 targets that we selected for test evaluations, so I would not advise training a model on it.

Instructions for downloading the full crossdocked dataset can be found here:
https://github.com/gnina/models/tree/master/data/CrossDocked2020

If you would like to train a model from scratch using the full CrossDocked2020 data set, I can make the required data files (.types files) available.

@yuanqidu
Author

yuanqidu commented Dec 8, 2021

Thanks for your help! Yes, I did try to use crossdock for training, but the code just froze at next_batch inside the libmolgrid library.

@mattragoza
Owner

Can you provide your conda environment?

@yuanqidu
Author

yuanqidu commented Dec 9, 2021

My conda environment uses torch 1.10.0+cu102.

@mattragoza
Owner

How did you install openbabel and libmolgrid?

@RMeli
Contributor

RMeli commented Dec 9, 2021

@yuanqidu posted this error, before editing the message above:

examples = self.ex_provider.next_batch(self.batch_size)
ValueError: Vector typer used with gninatypes files

Isn't this related to a mismatch between the types contained in the .gninatypes files for the CrossDocked2020 dataset (original typing scheme used in GNINA) and the new typing scheme that the refactored version of liGAN uses?

@mattragoza
Owner

@RMeli Yes, that error message indicates that you are trying to train using gninatypes files or a molcache2 file, which are not compatible with the vectorized atom typing used by this project. If that's the case, you have to download the full CrossDocked2020 dataset (the structure files, not the molcaches) and use that instead. You can reference the config/train.config file for how the data paths should be set up.
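For anyone hitting the same error, a quick way to check whether a .types file references the incompatible gninatypes format is to scan the path fields of each row. The helper below is a hypothetical diagnostic sketch, not part of liGAN:

```python
def find_gninatypes_rows(types_file):
    """Return the row numbers in a .types file that reference .gninatypes
    files, which are incompatible with liGAN's vectorized atom typing."""
    bad_rows = []
    with open(types_file) as f:
        for i, line in enumerate(f):
            # Any whitespace-separated field ending in .gninatypes is a
            # path to the old GNINA-typed binary format.
            if any(field.endswith('.gninatypes') for field in line.split()):
                bad_rows.append(i)
    return bad_rows
```

If this returns any rows for your train or test file, you are pointing at the wrong data files.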

@yuanqidu
Author

yuanqidu commented Dec 11, 2021

Thanks for the catch... I did load the wrong files from the crossdock dataset for that run, but that was not the main problem; I corrected it soon after and deleted that part of my message. I am still stuck at the data-loading step...

I installed openbabel via conda install openbabel -c conda-forge and molgrid via pip install molgrid.

Wait a second, I think you are right: I am still getting the "Vector typer used with gninatypes files" error even when I use the full crossdock dataset.

I downloaded the full CrossDocked2020 set and the types files.
I used types/it2_tt_v1.1_0_test0.types as the train and test file.
I set data_root to the CrossDocked2020 dataset,
following the example in the train.config file, but I am still getting this error.
By the way, I didn't see any types files inside the full CrossDocked2020 set, only a separate types folder.

One observation: the above error can be circumvented with the very small test types file provided in this repo, but training still does not proceed and freezes at next_batch.

@mattragoza
Owner

Those train and test files reference the custom gninatypes format that is not compatible with liGAN. You should use the following train and test files:

http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types
http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types

@yuanqidu
Author

yuanqidu commented Dec 11, 2021

Thanks so much, but the problem is still there: when calling next_batch, the process hangs forever and is eventually killed.

[screenshot of the traceback]

It looks like the error occurs inside the libmolgrid library.

@mattragoza
Owner

In that case, this is probably due to a mismatch between the conda-installed openbabel and the version used by the pip-installed molgrid. This is a known issue that can be resolved by building libmolgrid from source against the version of openbabel installed on your system.
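One way to start diagnosing this kind of mismatch is to check where Python loads each package from. This is only a sketch using the standard library; it shows the install locations of the two packages but cannot prove binary compatibility:

```python
import importlib.util

def package_location(name):
    """Return the file a package would be loaded from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return getattr(spec, "origin", None) if spec else None

# Conda-installed openbabel and pip-installed molgrid may link against
# different openbabel shared libraries even when both import fine, which
# is the kind of mismatch that can cause a hang in next_batch.
for name in ("openbabel", "molgrid"):
    print(name, "->", package_location(name))
```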

@yuanqidu
Author

When I tried to build libmolgrid manually, I could not use the command apt install libeigen3-dev libboost-all-dev since I am on CentOS 7. Which openbabel version is compatible with the package? May I reinstall openbabel instead of the molgrid package?

@yuanqidu
Author

Finally, after manually building libmolgrid, I solved the problem.

But I have one more question: the files referenced in the given types files for training are .sdf.gz, while the downloaded dataset contains gninatypes files, so there is a mismatch.

@mattragoza
Owner

Great, I'm glad you were able to manually build libmolgrid.

Please follow the steps in the download_data.sh script to download the full crossdocked dataset, which should include the .sdf.gz files:

#!/bin/bash
wget https://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.1.tgz -P data/crossdock2020/
tar -C data/crossdock2020/ -xzf data/crossdock2020/CrossDocked2020_v1.1.tgz
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types -P data/crossdock2020/
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types -P data/crossdock2020/

@mattragoza
Owner

Also note that the types2xyz.py script that David mentioned in the libmolgrid issue you opened will not work for this project, since we use an alternative typing scheme. The gninatypes files contain atoms that already have a different typing scheme applied, so there is no way to use them for this project. You will need to download the original molecule files to train a generative model.

@yuanqidu
Author

Thanks, I did download the dataset following the bash script. But the folder contains many gninatypes files, while the liGAN code asks for sdf files. I think there is a mismatch, and when I run the code it reports that the files cannot be opened/found.

[screenshots of the error messages]

@mattragoza
Owner

Ah, the problem is that the molecules in the downloaded crossdocked2020 dataset contain multiple poses, but we need them to be split out so that individual poses are in separate files. There is a script to do this but the step is currently missing from the download_data.sh script, my apologies. I will update the script and let you know how to run it ASAP.

@yuanqidu
Author

Yes, thanks, please :)

@mattragoza
Owner

The following commands will split out the poses in the sdf.gz files that you need for training:

# split multi-pose files into single-pose files
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_train0.types data/crossdock2020
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_test0.types data/crossdock2020
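For reference, the splitting step amounts to cutting each multi-pose SDF file at the standard record delimiter ($$$$) and writing one file per pose. The sketch below is a simplified stand-in for scripts/split_sdf.py, not the actual script; the output naming scheme is a guess:

```python
import gzip
import os

def split_sdf(path, out_dir):
    """Split a (possibly gzipped) multi-pose SDF file into one file per pose.

    Poses are separated by the SDF record delimiter '$$$$'. Output files
    are named <base>_0.sdf, <base>_1.sdf, ... (the real script's naming
    may differ).
    """
    opener = gzip.open if path.endswith('.gz') else open
    base = os.path.basename(path).replace('.sdf.gz', '').replace('.sdf', '')
    with opener(path, 'rt') as f:
        records = f.read().split('$$$$\n')
    written = []
    for i, rec in enumerate(r for r in records if r.strip()):
        out_path = os.path.join(out_dir, f'{base}_{i}.sdf')
        with open(out_path, 'w') as out:
            out.write(rec + '$$$$\n')
        written.append(out_path)
    return written
```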

@mattragoza
Owner

Just FYI, there are some issues with the data set that I am working to resolve. The train and test files will have to be updated to remove some bad/missing molecules.

@yuanqidu
Author

Thanks very much for your help!

@yuanqidu
Author

May I just remove the missing files from the types file?

@yuanqidu
Author

I have one more question about the method. Did you first extract the pocket from the protein or did you use the full protein and ligand for conditional generation? If you did extract pocket, how did you do so?

@mattragoza
Owner

mattragoza commented Dec 17, 2021

Yes, you can simply remove the missing data rows from the .types files. We provide the full protein as input to conditional generation, but only the binding pocket will fit in the grid bounds. The grid will be centered on the reference ligand, so it is assumed that the reference ligand is in the binding pocket.
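Removing the rows whose structure files are missing on disk can be done with a short script along these lines. This is a sketch, and it assumes the receptor and ligand paths are the last two whitespace-separated fields of each row:

```python
import os

def filter_types(types_file, data_root, out_file):
    """Write a copy of a .types file keeping only rows whose receptor and
    ligand files both exist under data_root. Returns (kept, dropped)."""
    kept = dropped = 0
    with open(types_file) as f, open(out_file, 'w') as out:
        for line in f:
            # Assumes the last two columns are the receptor and ligand paths.
            paths = line.split()[-2:]
            if all(os.path.isfile(os.path.join(data_root, p)) for p in paths):
                out.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```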

@mattragoza
Owner

There are functions to extract pockets for UFF minimization in the binding pocket context (see liGAN.molecules.get_rd_mol_pocket), but I would not recommend doing this for the proteins that are input to the generative model.

@wxfsd

wxfsd commented Dec 22, 2021

Hello, I ran into the same problem. What is your openbabel version, and how did you build the library manually? @yuanqidu
Looking forward to your help!

@mattragoza
Owner

Hello @yuanqidu @wxfsd, I wanted to update you on this issue as I am actively working to resolve it. The problem is that there is an incompatibility between the conda-installed openbabel and conda/pip-installed molgrid (they are the same binary). I am working on a conda build recipe that will hopefully resolve this issue (https://github.com/mattragoza/conda-molgrid), but it is still under construction. I have provided an environment.yaml file in the conda-molgrid repo that you should be able to use to create a conda environment in which you can successfully build molgrid from source. Please let me know if you run into issues using this conda environment (if you do, please open an issue in the conda-molgrid repo).

@wxfsd

wxfsd commented Dec 27, 2021

Okay, I'm trying it; I will ask on that link (https://github.com/mattragoza/conda-molgrid) if I have any issues. Thank you very much. @mattragoza

@mattragoza
Owner

@yuanqidu Also, I have uploaded new types files that have the problematic poses removed; you can find the links in the download_data.sh script.

@yuanqidu
Author

yuanqidu commented Jan 4, 2022

Thanks! May I ask how many (what percentage of) problematic files were removed?

Happy new year!

@mattragoza
Owner

9 total poses were removed from the data files.
