model cannot be trained #28
Comments
The data file that references the structures in the data/crossdock2020/ directory is data/crossdock2020/selected_test_targets.types. This only includes the 10 targets that we selected for test evaluations, so I would not advise training a model on it. Instructions for downloading the full crossdocked dataset can be found here: If you would like to train a model from scratch using the full CrossDocked2020 data set, I can make the required data files (.types files) available. |
Thanks for your help! Yes, I did try to use crossdock for training, but the code froze at next_batch in the libmolgrid library. |
Can you provide your conda environment? |
My conda environment uses torch 1.10.0+cu102. |
How did you install openbabel and libmolgrid? |
@yuanqidu posted this error, before editing the message above:
Isn't this related to a mismatch between the types contained in the |
@RMeli Yes, that error message indicates that you are trying to train using gninatypes files or a molcache2 file, which is not compatible with vectorized atom typing that is used by this project. If that's the case, you have to download the full crossdocked2020 dataset (the structure files, not the molcaches) and use that instead. You can reference config/train.config file for how the data paths should be set up. |
Thanks for the catch. I had loaded the wrong package from the crossdock dataset for that run, but that was not the main cause; I deleted the message because I corrected the mistake very quickly. I am still stuck at the data loading step. I installed openbabel via conda install openbabel -c conda-forge and molgrid via pip install molgrid. Wait a second, I think you are right: I am still getting the "Vector typer used with gninatypes files" error even when I use the full crossdock dataset. I downloaded the full crossdock set and the types files. One observation: the error above can be circumvented by using the very small test types file provided in this repo, but training still cannot proceed and freezes at next_batch. |
Those train and test files reference the custom gninatypes format that is not compatible with liGAN. You should use the following train and test files: http://bits.csb.pitt.edu/files/it2_tt_0_lowrmsd_mols_train0.types |
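A quick way to see whether a given .types file still points at gninatypes files is to scan its rows for that extension. The following is a minimal sketch, not part of liGAN; the function name `find_gninatypes_rows` and the example rows are hypothetical, and it only assumes that each row is whitespace-separated and that file paths appear as individual tokens.

```python
def find_gninatypes_rows(types_lines):
    """Return (line_number, line) pairs for rows that reference .gninatypes files."""
    hits = []
    for number, line in enumerate(types_lines, start=1):
        tokens = line.split()
        # A row is flagged if any whitespace-separated token ends in .gninatypes.
        if any(token.endswith(".gninatypes") for token in tokens):
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical rows for illustration only; real rows come from your .types file.
example_rows = [
    "1 pocketA/rec.pdb pocketA/lig_0.sdf.gz\n",
    "0 pocketB/rec_0.gninatypes pocketB/lig_0.gninatypes\n",
]
flagged = find_gninatypes_rows(example_rows)
```

If the returned list is non-empty, the file likely belongs to the gninatypes-based set and would need to be swapped for one that references the original structure files.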
In that case, this is probably due to a mismatch between the conda-installed openbabel and the version used by pip-installed molgrid. This is a known issue that is resolvable by building libmolgrid from source using the version of openbabel that you have installed on your system. |
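When reporting this kind of mismatch, it helps to list the versions that the active Python environment actually sees. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_versions` is hypothetical, and the distribution names `torch`, `openbabel`, and `molgrid` are assumptions about how the packages are registered in your environment. A conda-installed package may not always expose metadata under the same name, in which case it shows up as None.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if not found."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed, or is registered under another name.
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to match your environment.
print(installed_versions(["torch", "openbabel", "molgrid"]))
```

This only reports what is installed; it cannot tell you which openbabel version a prebuilt molgrid binary was linked against.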
When I tried to build libmolgrid manually, I could not use the command apt install libeigen3-dev libboost-all-dev since I am on CentOS 7. Which openbabel version is compatible with the package? Could I reinstall openbabel instead of the molgrid package? |
Finally, after manually installing libmolgrid, I solved the problem. But I have one more question: the files referenced in the given types files for training are .sdf.gz, while the files in the downloaded dataset are gninatypes, so there is a mismatch. |
Great, I'm glad you were able to manually build libmolgrid. Please follow the steps in the download_data.sh script to download the full crossdocked dataset, which should include .sdf.gz:
|
Also note that the types2xyz.py script that David mentioned in the libmolgrid issue you opened will not work for this project, since we use an alternative typing scheme. The gninatypes files contain atoms that already have a different type scheme applied, so there is no way to use them for this project. You will need to download the original molecule files to train a generative model. |
Ah, the problem is that the molecules in the downloaded crossdocked2020 dataset contain multiple poses, but we need them to be split out so that individual poses are in separate files. There is a script to do this but the step is currently missing from the download_data.sh script, my apologies. I will update the script and let you know how to run it ASAP. |
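To illustrate what splitting out the poses means, here is a rough sketch that is not the project's own script. It relies only on the fact that records in an SDF file are terminated by a line containing `$$$$`; the function name `split_sdf_poses` and the output naming scheme (`<name>_<index>.sdf`) are hypothetical and may not match the file names that the .types files expect.

```python
import gzip
import os


def split_sdf_poses(sdf_gz_path, out_dir):
    """Write each record of a gzipped multi-pose SDF file to its own .sdf file."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(sdf_gz_path)
    if base.endswith(".sdf.gz"):
        base = base[: -len(".sdf.gz")]

    written = []
    record = []
    with gzip.open(sdf_gz_path, "rt") as handle:
        for line in handle:
            record.append(line)
            # "$$$$" on its own line marks the end of one SDF record (one pose).
            if line.strip() == "$$$$":
                # Hypothetical naming scheme: <name>_<pose index>.sdf
                out_path = os.path.join(out_dir, f"{base}_{len(written)}.sdf")
                with open(out_path, "w") as out:
                    out.writelines(record)
                written.append(out_path)
                record = []
    return written
```

Calling `split_sdf_poses("ligand.sdf.gz", "poses")` on a hypothetical file would return the list of per-pose files it wrote. Any trailing text after the last `$$$$` is ignored.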
Yes, thanks, please :) |
The following commands will split out the poses in the sdf.gz files that you need for training:
|
Just FYI, there are some issues with the data set that I am working to resolve. The train and test files will have to be updated to remove some bad/missing molecules. |
Thanks very much for your help! |
May I just remove the rows for the missing files from the types file? |
I have one more question about the method. Did you first extract the pocket from the protein, or did you use the full protein and ligand for conditional generation? If you did extract the pocket, how did you do so? |
Yes, you can simply remove the missing data rows from the .types files. We provide the full protein as input to conditional generation, but only the binding pocket will fit in the grid bounds. The grid will be centered on the reference ligand, so it is assumed that the reference ligand is in the binding pocket. |
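Removing the missing rows can be done with a small filter. This is a minimal sketch under stated assumptions: rows are whitespace-separated, file paths are relative to a data root directory, and you tell the function which columns hold paths. The function name `drop_missing_rows` and the `path_columns` parameter are hypothetical, and the correct column indices depend on the layout of your .types file.

```python
import os


def drop_missing_rows(types_lines, data_root, path_columns):
    """Split rows into (kept, dropped) based on whether all referenced files exist."""
    kept, dropped = [], []
    for line in types_lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        try:
            paths = [os.path.join(data_root, tokens[i]) for i in path_columns]
        except IndexError:
            # The row has fewer columns than expected; treat it as unusable.
            dropped.append(line)
            continue
        if all(os.path.exists(path) for path in paths):
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped
```

The `kept` rows can then be written to a new .types file, and the `dropped` rows are worth inspecting before discarding, since a wrong data root would make every row look missing.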
There are functions to extract pockets for UFF minimization in the binding pocket context (see |
Hello, I ran into the same problem. What is your openbabel version? How do I install the library manually? @yuanqidu |
Hello @yuanqidu @wxfsd, I wanted to update you on this issue as I am actively working to resolve it. The problem is that there is an incompatibility between the conda-installed openbabel and conda/pip-installed molgrid (they are the same binary). I am working on a conda build recipe that will hopefully resolve this issue (https://github.com/mattragoza/conda-molgrid), but it is still under construction. I have provided an environment.yaml file in the conda-molgrid repo that you should be able to use to create a conda environment in which you can successfully build molgrid from source. Please let me know if you run into issues using this conda environment (if you do, please open an issue in the conda-molgrid repo). |
Okay, I'm trying, I will ask on that link (https://github.com/mattragoza/conda-molgrid) if I have any issues. Thank you very much. @mattragoza |
@yuanqidu I have also uploaded new types files that have the problematic poses removed. You can find the links in the download_data.sh script. |
Thanks! May I ask how many problematic files were removed (as a percentage)? Happy new year! |
9 total poses were removed from the data files. |
When I attempted to train the model with train.py, using the provided simple example it2_tt_0_lowrmsd_valid_mols_test0_1000.types and the crossdock folder, training froze in the dataloader at next_batch.
Looking forward to your help!