
model cannot be trained #28

Closed
yuanqidu opened this issue Dec 7, 2021 · 31 comments

Comments

@yuanqidu

yuanqidu commented Dec 7, 2021

When I attempted to train the model with train.py, using the provided example file it2_tt_0_lowrmsd_valid_mols_test0_1000.types and the crossdock folder, the process froze in the data loader's next_batch step.

Looking forward to your help!

@mattragoza
Owner

The data file that references the structures in the data/crossdock2020/ directory is data/crossdock2020/selected_test_targets.types. This only includes the 10 targets that we selected for test evaluations, so I would not advise training a model on it.

Instructions for downloading the full crossdocked dataset can be found here:
https://github.com/gnina/models/tree/master/data/CrossDocked2020

If you would like to train a model from scratch using the full CrossDocked2020 data set, I can make the required data files (.types files) available.

@yuanqidu
Author

yuanqidu commented Dec 8, 2021

Thanks for your help! Yes, I did try to use crossdock for training, but the code just froze at next_batch inside the libmolgrid library.

@mattragoza
Owner

Can you provide your conda environment?

@yuanqidu
Author

yuanqidu commented Dec 9, 2021

My conda environment uses torch 1.10.0+cu102.

@mattragoza
Owner

How did you install openbabel and libmolgrid?

@RMeli
Contributor

RMeli commented Dec 9, 2021

@yuanqidu posted this error, before editing the message above:

examples = self.ex_provider.next_batch(self.batch_size)
ValueError: Vector typer used with gninatypes files

Isn't this related to a mismatch between the types contained in the .gninatypes files for the CrossDocked2020 dataset (original typing scheme used in GNINA) and the new typing scheme that the refactored version of liGAN uses?

@mattragoza
Owner

@RMeli Yes, that error message indicates that you are trying to train using gninatypes files or a molcache2 file, which are not compatible with the vectorized atom typing used by this project. If that's the case, you have to download the full CrossDocked2020 dataset (the structure files, not the molcaches) and use that instead. You can reference the config/train.config file for how the data paths should be set up.
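For anyone hitting the same error, a quick way to check whether a .types file references the incompatible gninatypes format is to scan the path fields of each row. The helper below is a hypothetical diagnostic sketch, not part of liGAN:

```python
def find_gninatypes_rows(types_file):
    """Return the row numbers in a .types file that reference .gninatypes
    files, which are incompatible with liGAN's vectorized atom typing."""
    bad_rows = []
    with open(types_file) as f:
        for i, line in enumerate(f):
            # Any whitespace-separated field ending in .gninatypes is a
            # path to the old GNINA-typed binary format.
            if any(field.endswith('.gninatypes') for field in line.split()):
                bad_rows.append(i)
    return bad_rows
```

If this returns any rows for your train or test file, you are pointing at the wrong data files.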

@yuanqidu
Author

yuanqidu commented Dec 11, 2021

Thanks for the catch... I did load the wrong files from the crossdock dataset for that run, but that was not the main problem; I corrected it soon after and deleted that part of my message. I am still stuck at the data-loading step...

I installed openbabel via conda install openbabel -c conda-forge and molgrid via pip install molgrid.

Wait a second, I think you are right: I am still getting the "Vector typer used with gninatypes files" error even when I use the full crossdock dataset.

I downloaded the full CrossDocked2020 set and the types files.
I used types/it2_tt_v1.1_0_test0.types as the train and test file.
I set data_root to the CrossDocked2020 dataset,
following the example in the train.config file, but I am still getting this error.
By the way, I didn't see any types files inside the full CrossDocked2020 set, only a separate types folder.

One observation: the above error can be circumvented with the very small test types file provided in this repo, but training still does not proceed and freezes at next_batch.

@mattragoza
Owner

Those train and test files reference the custom gninatypes format that is not compatible with liGAN. You should use the following train and test files:

http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types
http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types

@yuanqidu
Author

yuanqidu commented Dec 11, 2021

Thanks so much, but the problem is still there: when calling next_batch, the process hangs forever and is eventually killed.

[screenshot of the traceback]

It looks like the error occurs inside the libmolgrid library.

@mattragoza
Owner

In that case, this is probably due to a mismatch between the conda-installed openbabel and the version used by the pip-installed molgrid. This is a known issue that can be resolved by building libmolgrid from source against the version of openbabel installed on your system.
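One way to start diagnosing this kind of mismatch is to check where Python loads each package from. This is only a sketch using the standard library; it shows the install locations of the two packages but cannot prove binary compatibility:

```python
import importlib.util

def package_location(name):
    """Return the file a package would be loaded from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return getattr(spec, "origin", None) if spec else None

# Conda-installed openbabel and pip-installed molgrid may link against
# different openbabel shared libraries even when both import fine, which
# is the kind of mismatch that can cause a hang in next_batch.
for name in ("openbabel", "molgrid"):
    print(name, "->", package_location(name))
```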

@yuanqidu
Author

When I tried to build libmolgrid manually, I could not use the command apt install libeigen3-dev libboost-all-dev since I am on CentOS 7. Which openbabel version is compatible with the package? May I reinstall openbabel instead of the molgrid package?

@yuanqidu
Author

Finally, after manually building libmolgrid, I solved the problem.

But I have one more question: the files referenced in the given types files for training are .sdf.gz, while the downloaded dataset contains gninatypes files, so there is a mismatch.

@mattragoza
Owner

Great, I'm glad you were able to manually build libmolgrid.

Please follow the steps in the download_data.sh script to download the full crossdocked dataset, which should include the .sdf.gz files:

#!/bin/bash
wget https://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.1.tgz -P data/crossdock2020/
tar -C data/crossdock2020/ -xzf data/crossdock2020/CrossDocked2020_v1.1.tgz
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types -P data/crossdock2020/
wget https://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_test0.types -P data/crossdock2020/

@mattragoza
Owner

Also note that the types2xyz.py script that David mentioned in the libmolgrid issue you opened will not work for this project, since we use an alternative typing scheme. The gninatypes files contain atoms that already have a different typing scheme applied, so there is no way to use them for this project. You will need to download the original molecule files to train a generative model.

@yuanqidu
Author

Thanks, I did download the dataset following the bash script. But the folder contains many gninatypes files, while the liGAN code asks for sdf files. I think there is a mismatch, and when I run the code it reports that the files cannot be opened/found.

[screenshots of the error messages]

@mattragoza
Owner

Ah, the problem is that the molecules in the downloaded crossdocked2020 dataset contain multiple poses, but we need them to be split out so that individual poses are in separate files. There is a script to do this but the step is currently missing from the download_data.sh script, my apologies. I will update the script and let you know how to run it ASAP.

@yuanqidu
Author

Yes, thanks, please :)

@mattragoza
Owner

The following commands will split out the poses in the sdf.gz files that you need for training:

# split multi-pose files into single-pose files
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_train0.types data/crossdock2020
python scripts/split_sdf.py data/crossdock2020/it2_tt_0_lowrmsd_mols_test0.types data/crossdock2020
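For reference, the splitting step amounts to cutting each multi-pose SDF file at the standard record delimiter ($$$$) and writing one file per pose. The sketch below is a simplified stand-in for scripts/split_sdf.py, not the actual script; the output naming scheme is a guess:

```python
import gzip
import os

def split_sdf(path, out_dir):
    """Split a (possibly gzipped) multi-pose SDF file into one file per pose.

    Poses are separated by the SDF record delimiter '$$$$'. Output files
    are named <base>_0.sdf, <base>_1.sdf, ... (the real script's naming
    may differ).
    """
    opener = gzip.open if path.endswith('.gz') else open
    base = os.path.basename(path).replace('.sdf.gz', '').replace('.sdf', '')
    with opener(path, 'rt') as f:
        records = f.read().split('$$$$\n')
    written = []
    for i, rec in enumerate(r for r in records if r.strip()):
        out_path = os.path.join(out_dir, f'{base}_{i}.sdf')
        with open(out_path, 'w') as out:
            out.write(rec + '$$$$\n')
        written.append(out_path)
    return written
```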

@mattragoza
Owner

Just FYI, there are some issues with the data set that I am working to resolve. The train and test files will have to be updated to remove some bad/missing molecules.

@yuanqidu
Author

Thanks very much for your help!

@yuanqidu
Author

May I just remove the missing files from the types file?

@yuanqidu
Author

I have one more question about the method. Did you first extract the pocket from the protein or did you use the full protein and ligand for conditional generation? If you did extract pocket, how did you do so?

@mattragoza
Owner

mattragoza commented Dec 17, 2021

Yes, you can simply remove the missing data rows from the .types files. We provide the full protein as input to conditional generation, but only the binding pocket will fit in the grid bounds. The grid will be centered on the reference ligand, so it is assumed that the reference ligand is in the binding pocket.
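Removing the rows whose structure files are missing on disk can be done with a short script along these lines. This is a sketch, and it assumes the receptor and ligand paths are the last two whitespace-separated fields of each row:

```python
import os

def filter_types(types_file, data_root, out_file):
    """Write a copy of a .types file keeping only rows whose receptor and
    ligand files both exist under data_root. Returns (kept, dropped)."""
    kept = dropped = 0
    with open(types_file) as f, open(out_file, 'w') as out:
        for line in f:
            # Assumes the last two columns are the receptor and ligand paths.
            paths = line.split()[-2:]
            if all(os.path.isfile(os.path.join(data_root, p)) for p in paths):
                out.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```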

@mattragoza
Owner

There are functions to extract pockets for UFF minimization in the binding pocket context (see liGAN.molecules.get_rd_mol_pocket), but I would not recommend doing this for the proteins that are input to the generative model.

@wxfsd

wxfsd commented Dec 22, 2021

Hello, I ran into the same problem. What is your openbabel version, and how did you build the library manually? @yuanqidu
Looking forward to your help!

@mattragoza
Owner

Hello @yuanqidu @wxfsd, I wanted to update you on this issue as I am actively working to resolve it. The problem is that there is an incompatibility between the conda-installed openbabel and conda/pip-installed molgrid (they are the same binary). I am working on a conda build recipe that will hopefully resolve this issue (https://github.com/mattragoza/conda-molgrid), but it is still under construction. I have provided an environment.yaml file in the conda-molgrid repo that you should be able to use to create a conda environment in which you can successfully build molgrid from source. Please let me know if you run into issues using this conda environment (if you do, please open an issue in the conda-molgrid repo).

@wxfsd

wxfsd commented Dec 27, 2021

Okay, I'm trying it; I will ask on that link (https://github.com/mattragoza/conda-molgrid) if I have any issues. Thank you very much. @mattragoza

@mattragoza
Owner

@yuanqidu Also, I have uploaded new types files that have the problematic poses removed; you can find the links in the download_data.sh script.

@yuanqidu
Author

yuanqidu commented Jan 4, 2022

Thanks! May I ask how many (what percentage of) problematic files were removed?

Happy new year!

@mattragoza
Owner

9 total poses were removed from the data files.
