Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed bug: unable to read cif that deviates from the pdb_id.cif format. #62

Merged
merged 13 commits into from Oct 28, 2021

Conversation

danny305
Copy link
Contributor

This should close issue #61

It also addresses the issue where sometimes the freesasa_node_atom_residue_number(atom) returns a decimal string. I encountered this when I passed a cif file that I got as an output from the ChargeFW2 repo. I didn't realize that I patched that as well in this PR until now. Let me know if you would like that as a separate PR.

@mittinatten
Copy link
Owner

mittinatten commented Jul 1, 2021 via email

@danny305
Copy link
Contributor Author

I have refactored the code to do essentially what you said. The code no longer depends on the pdb code being part of the filename.

It was a bit more complicated than that. I had to edit the fresasa_structure and the freesasa_node structure properties. I was unsure on how to use gemmi types in your C code (structure.c and node.c) so instead I added a pointer to the filename as a char * and have a map on the C++ side that maps the filename to the gemmi doc.

I will update the PR today or tomorrow. Im cleaning it up first.

Insertion codes are separate fields from the residue number.

@danny305
Copy link
Contributor Author

I also refactored the code so it can now accept cif files that already have FreeSASA columns/data. Essentially, making the code idempotent.

Copy link
Owner

@mittinatten mittinatten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Finally back on track here. This looks good, have you verified that it works for stdin too? (looks like it should, I can check later).

Regarding the pointers: You could probably make the pointer version work with a void pointer and some casts. The benefit of the pointer solution is that main.cc would be the only part of the code that has state, everything else is passed around as arguments. But your solution probably works just as well in practice.

I like the idea of idempotency, could you add a CLI test to verify that? Also a test to check that if you change the parameters, the results are updated.

src/cif.cc Show resolved Hide resolved
src/structure.c Outdated Show resolved Hide resolved
@danny305
Copy link
Contributor Author

I have verified that it works with stdin.

I am unsure on how to edit your test. I know I did it once but that was a long time ago and I am focused on several other things here in the next few weeks.

Maybe you can pull request my pull request like you did in the past?? You could also tweak the pointer solution so only main has to deal with state.

I also wanted to ask you on how to make freesasa deal with cations? The cation values are always 0.000. I guess I can open up another issue for cations.

@mittinatten
Copy link
Owner

mittinatten commented Aug 14, 2021

Sure, I'll have a go!

If the cations are hetatms, you should be able to include them using that option. But if they're some obscure element, the code might default back to radius 0.00. In that case, if you can find a sensible value for their radius, you can add it here: https://github.com/mittinatten/freesasa/blob/master/src/classifier.c#L885

@mittinatten
Copy link
Owner

I had a go at this a few weeks ago and I think that unless the CIF document is stored somewhere at file scope, the pointer will be already be invalid when we're back at the output stage. So then we might as well keep the map from filename to doc as in your solution. This is not ideal, because in a long running process with more and more structures this will essentially be a memory leak. A short term solution can be to add some kind of reset function so that the calling code can clean up manually when appropriate. We could perhaps even consider the map a cache: if the input isn't found in the map (because it was cleaned up) we can load it from disc again.

There are probably better, less stateful solutions, where the code that generates the CIF doc and the freesasa_structure bundle them somehow, but that would require some refactoring (and might require converting more code to C++, not sure). There might be another type of pointer that we can use, but I know too little C++ for that.

So I think for the 2.x version we should go for the solution already in this PR, perhaps enhanced with cache functionality and at least a reset function.

@danny305
Copy link
Contributor Author

danny305 commented Sep 4, 2021

How do you want to split this? What exactly do you want me to do?

I will be traveling for the next 10 days and will not be able to work on this.

I am also unsure how this could lead to a memory leak? If we have a map and there is a long running process with lots of cif files then the map will just grow in size? I don't see how it leaks.

Also, what you are looking for is what is called a shared_pointer in C++.

Also move all declarations of code related to cif document references to freesa_internal.h.
@mittinatten
Copy link
Owner

Sorry if I was vague, I guess I just wanted some feedback on my thoughts 😄

I assume a shared_ptr won't work if the only reference is in C code?

It's not a memory leak in the strictest sense, I guess, since there are still references somewhere to the documents. But for external users of the library it's essentially a leak, since they have now way of keeping the memory use from growing over time.

For the cache idea to work, we would need to pass file paths instead of FILE pointers to the functions in cif.cc. So that they can check if the file is in the cache and get it if not. So I think will skip that for now.

I have sketched a solution, where I use unique integers instead of filenames, as references to the documents, which are generated inside cif.cc. There are two advantages with that:

  1. One less argument to pass around, i.e. the implementation is completely opaque, and calling code will be simpler, less chance of messing things up. It will also work if somehow two separate structures are read from stdin at different times.
  2. No calls to strdup(), i.e. slightly less complex.

I also added the function freesasa_cif_clear_docs() to allow clearing the store of documents if necessary.

For some reason I was allowed to push this to your branch, my intention was to push it as a PR to your PR, as agreed, sorry about that!

I think there are still some wrinkles here to be ironed out. But conceptually I'm more or less happy with this solution. I guess I could use your C++ expertise to make sure my changes don't have any unintended consequences.

Once we've agreed on a solution, I'll add the extra tests and we should be done.

@mittinatten
Copy link
Owner

And no stress if it takes a while before you're available, as you've probably noticed I have limited time to work on this myself at the moment.

@danny305
Copy link
Contributor Author

I haven't forgotten about you. Ill get back to you next week. Putting out fires over here.

@danny305
Copy link
Contributor Author

Yo Simon.

I plan on going over this tomorrow and reviewing the code. Since we are hours apart. Could you point me to exactly what I need to review? It has been some time and I don't want to miss it. I reread your comment above and am happy with the solution as well.

Ill message you after I review it.

@mittinatten
Copy link
Owner

Hi, so if you could check that the C++ I wrote is sound that would be great, and there's the PR to this branch on your fork that needs to merged, if you don't have any objections (danny305#3). Also, as mentioned there, idempotency is broken.

danny305 and others added 2 commits October 20, 2021 12:56
Allow freesasa_structure_free to release CIF doc via function pointer.
@danny305
Copy link
Contributor Author

I forgot to implement idempotency to the freesasa tables. I had just implemented it for the atom_site table. They are implemented now and its working.

@danny305
Copy link
Contributor Author

I think this feature is ready for a merge?

@danny305
Copy link
Contributor Author

By the way your C++ was really good. Nice job refactoring to enable the feature.

@mittinatten
Copy link
Owner

Hi, I'll have one final look during the weekend and merge. I also think we're set.

mittinatten and others added 3 commits October 23, 2021 12:25
Make --separate-models and --separate-chains work again for CIF output.
@mittinatten mittinatten merged commit 6872c21 into mittinatten:master Oct 28, 2021
@mittinatten
Copy link
Owner

Merged. Thanks for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants