New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixed bug: unable to read cif that deviates from the pdb_id.cif format. #62
Conversation
Hi,
As I mentioned I am afk all of July, so excuse any silly errors below, I am
on my phone.
I wonder how this will work if one is analyzing several files with the same
source, e.g. 1abc_a.cif and 1abc_b.cif, which are modified somehow and
FreeSASA is used to compare them. In the CLI this doesn’t matter, because
we only have one file at a time. But in a python application, for instance,
we could easily have many files. It is also an arbitrary and surprising
feature that filenames need to start with the PDB code, which is not always
the case, for instance if they are snapshots from a simulation, the name
might be just a timecode (I guess that would have caused problems before
too).
When we first added this we discussed reentrant versions, but landed on the
simpler solution that works fine with the CLI. I think we should probably
revisit the reentrant ideas. I think adding an optional field to the struct
freesasa_structure with a pointer to the cif doc should work and be simple.
Add a function freesasa_structure_set_cif_doc(structure, doc-pointer) which
is called once before or after the atoms are added and then extracted when
writing the output using freesasa_structure_get_cif_doc(structure).
Regarding the residue numbers, I am not sure they are always numbers? Don’t
remember, are insertion codes a separate field or can it be part of the
residue number? I think this part should be a separate PR since the other
part might need some work still. (Not sure I’ll be able to review it
properly until I’m back home).
ons 30 juni 2021 kl. 22:38 skrev Danny Diaz ***@***.***>:
… This should close issue #61
<#61>
It also addresses the issue where sometimes the
freesasa_node_atom_residue_number(atom) returns a decimal string. I
encountered this when I passed a cif file that I got as an output from the
ChargeFW2 repo. I didn't realize that I patched that as well in this PR
until now. Let me know if you would like that as a separate PR.
------------------------------
You can view, comment on, or merge this pull request online at:
#62
Commit Summary
- fixed bug--unable to read cif that deviates from the pdb_id.cif
format.
File Changes
- *M* src/cif.cc
<https://github.com/mittinatten/freesasa/pull/62/files#diff-5af0d93cd805891147be50a902813773be55f475e104e54388b4e885db61c6fd>
(9)
Patch Links:
- https://github.com/mittinatten/freesasa/pull/62.patch
- https://github.com/mittinatten/freesasa/pull/62.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#62>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAVV4HBBGMCA2SRFAROM6RDTVN6DVANCNFSM47TC3SWA>
.
|
I have refactored the code to do essentially what you said. The code no longer depends on the pdb code being part of the filename. It was a bit more complicated than that. I had to edit the fresasa_structure and the freesasa_node structure properties. I was unsure on how to use gemmi types in your C code (structure.c and node.c) so instead I added a pointer to the filename as a char * and have a map on the C++ side that maps the filename to the gemmi doc. I will update the PR today or tomorrow. Im cleaning it up first. Insertion codes are separate fields from the residue number. |
I also refactored the code so it can now accept cif files that already have FreeSASA columns/data. Essentially, making the code idempotent. |
…ocument for writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! Finally back on track here. This looks good, have you verified that it works for stdin too? (looks like it should, I can check later).
Regarding the pointers: You could probably make the pointer version work with a void pointer and some casts. The benefit of the pointer solution is that main.cc
would be the only part of the code that has state, everything else is passed around as arguments. But your solution probably works just as well in practice.
I like the idea of idempotency, could you add a CLI test to verify that? Also a test to check that if you change the parameters, the results are updated.
I have verified that it works with stdin. I am unsure on how to edit your test. I know I did it once but that was a long time ago and I am focused on several other things here in the next few weeks. Maybe you can pull request my pull request like you did in the past?? You could also tweak the pointer solution so only main has to deal with state. I also wanted to ask you on how to make freesasa deal with cations? The cation values are always 0.000. I guess I can open up another issue for cations. |
Sure, I'll have a go! If the cations are hetatms, you should be able to include them using that option. But if they're some obscure element, the code might default back to radius 0.00. In that case, if you can find a sensible value for their radius, you can add it here: https://github.com/mittinatten/freesasa/blob/master/src/classifier.c#L885 |
I had a go at this a few weeks ago and I think that unless the CIF document is stored somewhere at file scope, the pointer will be already be invalid when we're back at the output stage. So then we might as well keep the map from filename to doc as in your solution. This is not ideal, because in a long running process with more and more structures this will essentially be a memory leak. A short term solution can be to add some kind of reset function so that the calling code can clean up manually when appropriate. We could perhaps even consider the map a cache: if the input isn't found in the map (because it was cleaned up) we can load it from disc again. There are probably better, less stateful solutions, where the code that generates the CIF doc and the freesasa_structure bundle them somehow, but that would require some refactoring (and might require converting more code to C++, not sure). There might be another type of pointer that we can use, but I know too little C++ for that. So I think for the 2.x version we should go for the solution already in this PR, perhaps enhanced with cache functionality and at least a reset function. |
How do you want to split this? What exactly do you want me to do? I will be traveling for the next 10 days and will not be able to work on this. I am also unsure how this could lead to a memory leak? If we have a map and there is a long running process with lots of cif files then the map will just grow in size? I don't see how it leaks. Also, what you are looking for is what is called a shared_pointer in C++. |
Also move all declarations of code related to cif document references to freesa_internal.h.
Sorry if I was vague, I guess I just wanted some feedback on my thoughts 😄 I assume a It's not a memory leak in the strictest sense, I guess, since there are still references somewhere to the documents. But for external users of the library it's essentially a leak, since they have now way of keeping the memory use from growing over time. For the cache idea to work, we would need to pass file paths instead of FILE pointers to the functions in I have sketched a solution, where I use unique integers instead of filenames, as references to the documents, which are generated inside
I also added the function For some reason I was allowed to push this to your branch, my intention was to push it as a PR to your PR, as agreed, sorry about that! I think there are still some wrinkles here to be ironed out. But conceptually I'm more or less happy with this solution. I guess I could use your C++ expertise to make sure my changes don't have any unintended consequences. Once we've agreed on a solution, I'll add the extra tests and we should be done. |
And no stress if it takes a while before you're available, as you've probably noticed I have limited time to work on this myself at the moment. |
I haven't forgotten about you. Ill get back to you next week. Putting out fires over here. |
Yo Simon. I plan on going over this tomorrow and reviewing the code. Since we are hours apart. Could you point me to exactly what I need to review? It has been some time and I don't want to miss it. I reread your comment above and am happy with the solution as well. Ill message you after I review it. |
Hi, so if you could check that the C++ I wrote is sound that would be great, and there's the PR to this branch on your fork that needs to merged, if you don't have any objections (danny305#3). Also, as mentioned there, idempotency is broken. |
Allow freesasa_structure_free to release CIF doc via function pointer.
I forgot to implement idempotency to the freesasa tables. I had just implemented it for the atom_site table. They are implemented now and its working. |
I think this feature is ready for a merge? |
By the way your C++ was really good. Nice job refactoring to enable the feature. |
Hi, I'll have one final look during the weekend and merge. I also think we're set. |
Add test for CIF idempotency.
…d, for CIF output.
Make --separate-models and --separate-chains work again for CIF output.
Merged. Thanks for this! |
This should close issue #61
It also addresses the issue where sometimes the
freesasa_node_atom_residue_number(atom)
returns a decimal string. I encountered this when I passed a cif file that I got as an output from the ChargeFW2 repo. I didn't realize that I patched that as well in this PR until now. Let me know if you would like that as a separate PR.