Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR: Jacobi iterations did not reach desired tolerance #287

Closed
msuruzhon opened this issue Mar 22, 2022 · 31 comments
Closed

ERROR: Jacobi iterations did not reach desired tolerance #287

msuruzhon opened this issue Mar 22, 2022 · 31 comments
Labels

Comments

@msuruzhon
Copy link
Collaborator

Hello,

I recently ran into this very rare issue during atom-atom matching (I have only seen it once so far):

gsl: svd.c:712: ERROR: Jacobi iterations did not reach desired tolerance
Default GSL error handler invoked.

I am attaching a minimal example to reproduce this error. The error seems to arise during this line in _align.py:

molecule0 = molecule0.move().align(molecule1, _SireMol.AtomResultMatcher(sire_mapping)).molecule()

Any thoughts on what is going on here?

Cheers.

@msuruzhon msuruzhon added the bug label Mar 22, 2022
@lohedges
Copy link
Member

Hmmm, I think we have experienced this issue before when trying to align molecules that had been generated from 2D depictions, i.e. all atoms lie in a single plane. I believe the issue only occurred when one coordinate was zero, i.e. the lay exactly on the z plane, and we added a workaround to deal with this (shifting the molecule off axis).

I'll see if something similar is going on.

@lohedges
Copy link
Member

Okay, I'm not sure how we can fix this in an easy way since it doesn't seem possible to change the GSL tolerances. (This is hitting the sweep limit on line 711 here. As a workaround, I'd advise minimising your ligand structures prior to generating the mapping and aligning. For example, the following works for me.

import BioSimSpace as BSS

mol0 = BSS.IO.readMolecules(["lig_67_vac.parm7", "lig_67_vac.rst7"])[0]
mol1 = BSS.IO.readMolecules(["lig_52_vac.parm7", "lig_52_vac.rst7"])[0]

protocol = BSS.Protocol.Minimisation(steps=1000)

process = BSS.Process.OpenMM(mol0.toSystem(), protocol)
process.start()
process.wait()
mol0 = process.getSystem()[0]

process = BSS.Process.OpenMM(mol1.toSystem(), protocol)
process.start()
process.wait()
mol1 = process.getSystem()[0]

print("Calculating MCS...")
mcs = BSS.Align.matchAtoms(mol0, mol1)

@msuruzhon
Copy link
Collaborator Author

Thanks for that @lohedges. Is minimisation expected to always fix this issue? Also, do we know when this issue occurs in general or is that unclear? Finally, would it make sense to optionally disable the scoring_function and return the first match if one needs to (or alternatively add a failsafe mechanism when such an error occurs)?

@lohedges
Copy link
Member

At the moment I'm not 100% sure, since it's not obvious to me what's causing the issue. (I've eyeballed the molecules, but don't see anything obvious.) As you imply, another solution could be to change the scoring function to rmsd or rmsd_flex_align, but then you'd need to make sure that you use the flexAlign function to actually align based on the mapping, since you might trigger the same error with rmsdAlign.

Annoyingly I do wrap the call to the Sire align mover within a try / except block, but the GSL error isn't being caught properly and the script ends up crashing with a segmentation fault. If the error was caught, then we could check for the message that you are seeing and suggest pre-minimisation, or an alternative scoring function as a workaround. I'll try to figure out why the error isn't being caught properly.

@lohedges
Copy link
Member

On another note, I'm also trying to work out why Process.Gromacs.getSystem() is so flaky. It always seems to take two attempts to work (the trjconv frame extraction) even though the command line call is exactly the same in both cases. Also, if I just do it on the command-line (rather than using BioSimSpace) then it works straight away. This was why I used OpenMM rather than GROMACS in the example above.

@lohedges
Copy link
Member

Okay, fixed that by simple re-running the command if the frame extraction fails. (Still don't know why it fails, but its an easy workaround.)

I think I need to figure out where the GSL error is happening in Sire, and catch it properly there.

@lohedges
Copy link
Member

Nope, I'm unable to catch the GSL error in Sire and the segmentation fault still occurs. It looks like GSL itself might not be doing some memory cleanup when handling its own GSL_ERROR during the SVD routine. I'm not sure what else we can do here.

@lohedges
Copy link
Member

Looking at the Sire code, the GSL error should already be caught:

        // calculate single value decomposition of A into V S W^T
        int ok = gsl_linalg_SV_decomp_jacobi(A, W, S);


        if (ok != 0)
            throw SireMaths::domain_error( QObject::tr(
                    "Could not calculate the single value decomposition of %1.")
                        .arg(this->toString()), CODELOC );

When I get a chance I'll check the values of the vector and matrix that are triggering the segmentation fault, since it might be possible to catch the issue before the GSL routine is called.

@lohedges
Copy link
Member

There doesn't appear to be anything particularly unusual about the matrices and vector. I guess we just need to chalk this up as a GSL issue and hope that anyone who comes across this problem will be able to find this thread.

@msuruzhon
Copy link
Collaborator Author

Thanks for investigating @lohedges, maybe we could raise an issue there as well? Is this the correct repo: https://github.com/ampl/gsl?

@lohedges
Copy link
Member

lohedges commented Mar 23, 2022

I think that's just a clone of the original source with a few tweaks to make it easier to integrate into other projects, e.g. using CMake. The main page is here, which contains links to various mailing lists, including one for bugs.

@msuruzhon
Copy link
Collaborator Author

Thanks @lohedges I have just sent them an email and have linked to this issue.

@lohedges
Copy link
Member

Reading the bug tracker it looks like there are a few other open reports of error handling issues , e.g. here. From the comments it looks like there's only a single maintainer for the project. The related issues were posted some time ago and are still open.

@lohedges
Copy link
Member

Okay, I've fixed it. We can just used the following before calling the function:

gsl_set_error_handler_off();

This disables the default GSL error handler, so the function just returns the error code and does nothing else. Following this change, we correctly catch the error code in Sire and I see the following output:

Exception 'SireMaths::domain_error' thrown by the thread 'master'.
Could not calculate the single value decomposition of / 310.229, -200.643, 35.5843 \
| -157.983, 281.687, 44.2894 |
\ -5.87562, 40.6398, 73.2926 /.
Thrown from FILE: /home/lester/Code/Sire/corelib/src/libs/SireMaths/matrix.cpp, LINE: 669, FUNCTION: boost::tuples::tuple<SireMaths::Matrix, SireMaths::Matrix, SireMaths::Matrix> SireMaths::Matrix::singleValueDecomposition() const

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "script.py", line 8, in <module>
    mcs = BSS.Align.matchAtoms(mol0, mol1)
  File "/home/lester/Code/BioSimSpace/python/BioSimSpace/Align/_align.py", line 825, in matchAtoms
    property_map0, property_map1)
  File "/home/lester/Code/BioSimSpace/python/BioSimSpace/Align/_align.py", line 1446, in _score_rdkit_mappings
    raise _AlignmentError(msg) from e
BioSimSpace._Exceptions._exceptions.AlignmentError: Failed to align molecules when scoring based on mapping: {0: 0, 1: 1, 2: 2, 3: 3, 10: 10, 9: 9, 4: 4, 42: 39, 5: 5, 6: 6, 7: 7, 8: 8, 27: 26, 26: 25, 25: 24, 24: 23, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 23: 16, 21:

I can now catch the verbose Sire error message in BioSimSpace and suggest a solution. In this case it's one of the MCS matches that fails, rather than all of them. I'm not sure whether to just discard this result, or suggest a workaround, e.g. minimisation.

@lohedges
Copy link
Member

In this particular case there are of the order of 100+ matches that are being scored and only one fails. I think I'll just warn but not error when this happens during scoring. I can't see how it can happen during alignment itself, unless the user is passing in a custom mapping, i.e. the mapping would have already failed during the scoring stage.

@lohedges
Copy link
Member

Weirdly, if I take the failed mapping from the scoring and just call the rmsdAlign function based on that, then I don't see the error. The call to the underlying Sire move functionality is identical in both cases.

@lohedges
Copy link
Member

lohedges commented Mar 23, 2022

Looks like GitHub actions isn't working. I'll try to re-trigger the Sire build tomorrow, then push the BioSimSpace changes when it's run.

@msuruzhon
Copy link
Collaborator Author

That's great, thanks for fixing this so quickly @lohedges .

@msuruzhon
Copy link
Collaborator Author

Hi again @lohedges, I wasn't sure whether to post a new issue but I have issues importing Sire when gsl=2.7.1 is used (as defined in the anaconda channel). If gsl=2.7 is used instead, it imports fine, so maybe we should pin this dependency? Is it something that you could reproduce as well? Thanks.

@lohedges
Copy link
Member

I'll try to reproduce at my end. Out of interest, what error are you seeing? Since the CI passed, and this is a minor version number change to gsl it must be a conda error, i.e. the changes should be API backwards compatible and the library version should be identical.

@msuruzhon
Copy link
Collaborator Author

Yes, I thought the same but this error is still quite strange for a minor release. That's the error I get:

ImportError: libgsl.so.25: cannot open shared object file: No such file or directory

So this seems to be a packaging issue?

@msuruzhon
Copy link
Collaborator Author

And this is the error I get on macOS, which pretty much boils down to the same thing:

ImportError: dlopen(/Users/msuruzhon/miniconda3/envs/bss_orion_new/lib/python3.7/site-packages/Sire/Base/_Base.so, 2): Library not loaded: @rpath/libgsl.25.dylib
Referenced from: /Users/msuruzhon/miniconda3/envs/bss_orion_new/lib/libSireCAS.2022.1.0.dylib
Reason: image not found

@lohedges
Copy link
Member

Looking at the conda-forge page for gsl here I only see version 2.7, not 2.7.1. Where is it being pulled in from, or have they deleted a package recently?

It sounds like a packaging issue to me, but looking at their feedstock it doesn't seem to have been rebuilt recently. I'll be able to do a more in depth investigation when I'm at the computer.

@msuruzhon
Copy link
Collaborator Author

Yes so the one I am talking about is confusingly enough not available on conda-forge but on anaconda instead: https://anaconda.org/anaconda/gsl. And in some cases that's the one that gets pulled automatically.

@lohedges
Copy link
Member

Ah, okay. Well, that sounds like a packaging issue on behalf on the default conda channel, which is pretty annoying. I assume that 2.7.1 will be available on conda-forge in the near future, which should resolve this.

As a quick test I created a fresh (Linux) conda environment using:

conda create -n biosimspace-dev -c conda-forge -c michellab/label/dev biosimspace

This still pulls in the conda-forge version of gsl, even though the one from pkgs/main is newer, i.e. the channel priority takes higher precedence it seems (at least for a minor version difference). Perhaps your channel priority is different, or you have some configuration settings that mean that it is always taking the latest version.

@lohedges
Copy link
Member

It might also be that you've pulled in the non conda-forge gsl version when installing some other package, so have it in your conda cache. When installing BioSimSpace, the cache may then take precedence over the channel priority, i.e. it tries to avoid downloads and changing the environment where possible.

I've tried using mamba too, and this also pulls in the correct conda-forge gsl when creating a fresh environment.

@msuruzhon
Copy link
Collaborator Author

Yes I think my problem is that the defaults channel is always picked up first. But in any case, once 2.7.1 makes it to conda-forge, then this will break things in general. Would it make sense to pin the dependency as a precautionary measure?

@lohedges
Copy link
Member

lohedges commented Mar 29, 2022

I don't think we need to, since it sounds like a packaging issue on the defaults channel. I don't really want to start pinning since it becomes a maintenance nightmare and isn't the way that conda is meant to work anyway (conda-forge frown at pinning unless required, i..e for API/ABI compatability). If it's still an issue when the conda-forge package is updated, then I can raise an issue on their feedstock page.

Looking at the contents of the two package archives on the Anaconda cloud I see the following in the lib directory (for Linux):

conda-forge gsl 2.7:

ls -l lib
total 12M
drwxr-xr-x 2 lester lester 4.0K Mar 29 10:21 pkgconfig
-rw-r--r-- 1 lester lester 6.8M Jul 15  2021 libgsl.a
-rw-r--r-- 1 lester lester 594K Jul 15  2021 libgslcblas.a
lrwxrwxrwx 1 lester lester   13 Jul 15  2021 libgslcblas.so -> libcblas.so.3
lrwxrwxrwx 1 lester lester   13 Jul 15  2021 libgslcblas.so.0 -> libcblas.so.3
-rwxr-xr-x 1 lester lester 357K Jul 15  2021 libgslcblas.so.0.0.0
lrwxrwxrwx 1 lester lester   16 Jul 15  2021 libgsl.so -> libgsl.so.25.1.0
lrwxrwxrwx 1 lester lester   16 Jul 15  2021 libgsl.so.25 -> libgsl.so.25.1.0
-rwxr-xr-x 1 lester lester 3.4M Jul 15  2021 libgsl.so.25.1.0

defaults gsl 2.7.1:

ls -l lib
total 12M
drwxr-xr-x 2 lester lester 4.0K Mar 29 10:21 pkgconfig
-rw-r--r-- 1 lester lester 6.9M Mar 21 13:33 libgsl.a
-rw-r--r-- 1 lester lester 537K Mar 21 13:33 libgslcblas.a
lrwxrwxrwx 1 lester lester   20 Mar 21 13:33 libgslcblas.so -> libgslcblas.so.0.0.0
lrwxrwxrwx 1 lester lester   20 Mar 21 13:33 libgslcblas.so.0 -> libgslcblas.so.0.0.0
-rwxr-xr-x 1 lester lester 298K Mar 21 13:33 libgslcblas.so.0.0.0
lrwxrwxrwx 1 lester lester   16 Mar 21 13:33 libgsl.so -> libgsl.so.27.0.0
lrwxrwxrwx 1 lester lester   16 Mar 21 13:33 libgsl.so.27 -> libgsl.so.27.0.0
-rwxr-xr-x 1 lester lester 3.4M Mar 21 13:33 libgsl.so.27.0.0

As you can see, the packages have different shared object versions, despite there only being a minor/patch difference in the version number. This shouldn't be the case assuming the same build system and build environment. I imagine that the numbering will be correct when it appears on conda-forge, and if not we can bug them about it. (As I have already had to do for other packages, such as fontconfig here.)

You should be able to remove defaults as follows:

# Check your current channels.
conda config --show channels

# Remove defaults.
conda config --remove channels defaults

Alternatively, if you are creating an environment file, in which BioSimSpace is one of the packages that is installed you can do:

channels:
  - conda-forge
  - michellab/label/dev
  - nodefaults

This would make sure that defaults isn't used, even if the user happened to have it in their ~/.condarc file.

Dealing with all this stuff is one of the things that I hate about conda. Essentially it doesn't work once you start mixing channels, even with large standard channels such as conda-forge and defaults. Since we intend to move to conda-forge in future we have made efforts to make our build process compliant with their own build system. This means that (other than potential packaging issues) our packages should work assuming that you get all dependencies from conda-forge.

@lohedges
Copy link
Member

Thinking about the numbers, it could be that the conda-forge version is wrong to begin with, i.e. I imagine the SO number is the major and minor, so would make sense to be 27, not 25. I'll look on their feedstock to see if anyone has reported this, or to see how the number is generated.

@lohedges
Copy link
Member

No, it doesn't use the MAJOR/MINOR and is just generated by libtool. I imagine the build system is different on the two platforms, hence the different numbering. I would assume that 2.7.1 build on conda-forge will still give libgsl.so.25.

Looking at the conda-forge packages and docs, they also recommend using:

conda config --set channel_priority strict

This means that the channel priority will take precedence over other things that may be used for scoring when trying to resolve an environment change.

@msuruzhon
Copy link
Collaborator Author

Thanks Lester, this makes sense. I think we just have to keep an eye on this library to make sure it's not breaking something in conda-forge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants