Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CombinedData.as_dict/from_dict and other methods overwrite; CombinedData.as_lammpsdata deepcopied #2191

Merged
merged 5 commits into from Aug 25, 2021

Conversation

htz1992213
Copy link
Contributor

@htz1992213 htz1992213 commented Jul 3, 2021

Summary

Include a summary of major changes in bullet points:

  • Add as_dict, from_dict for CombinedData to overwrite superclass methods
  • Overwrite structure, disassemble, from_ff_and_topologie, from_structur methods for CombinedData
  • Rewrite as_lammpsdata method for CombinedData so that each attribute is deep copied
  • Update docs
  • Join string frags
  • Added tests

@htz1992213
Copy link
Contributor Author

Fixes #2188

@coveralls
Copy link

coveralls commented Jul 6, 2021

Coverage Status

Coverage decreased (-0.6%) to 83.143% when pulling cab0331 on htz1992213:master into 40817cc on materialsproject:master.

@htz1992213 htz1992213 changed the title [WIP] CombinedData.as_dict/from_dict and other methods overwrite; CombinedData.as_lammpsdata deepcopied CombinedData.as_dict/from_dict and other methods overwrite; CombinedData.as_lammpsdata deepcopied Jul 7, 2021
@htz1992213
Copy link
Contributor Author

Hey @mkhorton, I have added tests for the new methods.

@htz1992213
Copy link
Contributor Author

The serialization of the CombinedData class should work as expected with the update @rkingsbury

@rkingsbury
Copy link
Contributor

The serialization of the CombinedData class should work as expected with the update @rkingsbury

Thanks @htz1992213 ! This looks like a great update that will really improve robustness and clarity of CombinedData.

@@ -351,7 +351,7 @@ def get_string(self, distance=6, velocity=8, charge=4, hybrid=True):
velocity (int): No. of significant figures to output for
velocities. Default to 8.
charge (int): No. of significant figures to output for
charges. Default to 3.
charges. Default to 4.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no significant penalty in terms of size or performance, I suggest we default to 5 decimal places here. The only reason I say that is that for the TIP4P-FB water model I'm using, 5 decimals are required in order to keep the water molecule neutral. Obviously it's not a big deal to set this via kwarg if there are strong reasons to keep only 4 decimals as the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous default was 3, changed to 4 as a safe buffer. Would you prefer to change it to even more? I am actually working on a charge writer class in mdgo.forcefield, it can auto-detect significant digits so that users don't have to specify that explicitly to make it right.

@rkingsbury
Copy link
Contributor

Longer-term, I still think we should work on getting DataFrame supported in monty so that we can do away with custom as_dict and from_dict methods (and this comment applies throughout pymatgen). But for now this is a good solution. CombinedData has been inheriting too many methods from LammpsData that don't make sense, so I'm glad you added these as well as the AttributeError where appropriate.

@mkhorton
Copy link
Member

mkhorton commented Jul 8, 2021

I think we should prefer the monty solution over the custom as_dict/from_dict here actually; let's try and get the monty solution first, and fall back to this if that's impractical? I think it would be an equivalent number of lines of code.

I'm trying to actively discourage custom as_dict, from_dict methods wherever possible, they're creating a lot of technical debt.

@rkingsbury
Copy link
Contributor

I think we should prefer the monty solution over the custom as_dict/from_dict here actually; let's try and get the monty solution first, and fall back to this if that's impractical? I think it would be an equivalent number of lines of code.

I'm trying to actively discourage custom as_dict, from_dict methods wherever possible, they're creating a lot of technical debt.

@mkhorton I agree with you, however this PR fixes several pretty serious flaws in CombinedData that are affecting our workflows, and brings it into line with LammpsData which also has custom as_dict / from_dict. I'd encourage you to consider merging this as a stopgap solution until we get monty patched, which may take more time. Once that happens, we can remove as_dict / from_dict from all the Lammps I/O classes at once.

@mkhorton
Copy link
Member

mkhorton commented Jul 8, 2021

Unfortunately "stopgap" solutions are how we accrue technical debt, because we forget to go back and remove them. I am reluctant to merge this when there has been a clear alternative suggested, with specific code proposed, and that could be implemented on a comparable timeframe.

@mkhorton
Copy link
Member

mkhorton commented Jul 8, 2021

Note that this comment is for the as_dict, from_dict specifically, not for the remainder of the PR which is fine to merge.

@htz1992213
Copy link
Contributor Author

Unfortunately "stopgap" solutions are how we accrue technical debt, because we forget to go back and remove them. I am reluctant to merge this when there has been a clear alternative suggested, with specific code proposed, and that could be implemented on a comparable timeframe.

@mkhorton I think modifying monty is beyond my expertise because it involves making test cases that I may not familiar. I don't think this is a stopgap because it is for sure built on the latest dependencies of pymatgen. But I would be happy to modify both LammpsData/CombinedData immediately once the patch to monty is in place. I am quite actively maintaining this submodule so no worries about forgetting to go back here. And I think the urgent is to make sure the serialization of CombinedData works.

@mkhorton
Copy link
Member

mkhorton commented Jul 8, 2021

@mkhorton I think modifying monty is beyond my expertise because it involves making test cases that I may not familiar.

Ok, that's not a problem -- I'm happy to propose this change as a PR to monty myself but it'll take me a few days until I can get to it.

@htz1992213
Copy link
Contributor Author

Hey @mkhorton @rkingsbury, Shyue merged the Monty update, so I was able to deprecate the as_dict and from_dict methods for LammpsData and CombinedData. The MontyEncoder/Decoder works fine here. One thing I want to add is that, the pandas.DataFrame.to_dict method will convert the int type "index" of the df to str type. This won't cause a fundamental error here because LammpsData will just output that column as str anyway. But I think in general it is something that could be fixed in Monty.

@rkingsbury
Copy link
Contributor

Thanks @htz1992213 ! Good catch re: the index type. I don't have a well-informed opinion of whether that type conversion is a problem or not.

@mkhorton mkhorton merged commit a68137e into materialsproject:master Aug 25, 2021
@mkhorton
Copy link
Member

Thanks @htz1992213! I'll go ahead and merge.

The index conversion is interesting to know -- this is done by pandas JSON export itself rather than monty, I believe?

@htz1992213
Copy link
Contributor Author

htz1992213 commented Aug 25, 2021

@rkingsbury @mkhorton Yes I think pandas export JSON by itself. But I think that default behavior can be changed as described in this answer

@mkhorton
Copy link
Member

@htz1992213 ah, I see. If we make this change it would be a breaking change since it would change the serialization format, would that be ok?

@htz1992213
Copy link
Contributor Author

@mkhorton I made the tests in this PR to only check if the actual values in dfs are the same regardless of the type. So it should be ok updating monty according to the link. I don't think it will break the things here.

@mkhorton
Copy link
Member

I was asking more if it would cause any problems for you for your own work, if you've already been using the functionality, otherwise I will submit another PR to monty ASAP.

@htz1992213
Copy link
Contributor Author

@mkhorton Oh I see. No, I don't think that is a problem for me. I was just discussing if it is a better practice to maintain the int type of the index for serializing a df in Monty.

@mkhorton
Copy link
Member

I don't think it's a big issue, but it does seem best practice for serialization/deserialization to retain type information, and since it's a simple change I think we should do that before it sees widespread usage.

@htz1992213
Copy link
Contributor Author

@mkhorton Yeah I totally agree, a simple fix would be great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants