Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use more compact schema for root directory files #197

Closed
csadorf opened this issue May 21, 2019 · 18 comments
Closed

Use more compact schema for root directory files #197

csadorf opened this issue May 21, 2019 · 18 comments
Labels
enhancement New feature or request pinned Instructs stale bot to ignore this issue.
Milestone

Comments

@csadorf
Copy link
Contributor

csadorf commented May 21, 2019

Based on feature usage, the project root directory currently contains a number of special files. Some of them are hidden, some are not.

  • signac_project_document.json
  • signac.rc
  • .signac_sp_cache.json.gz
  • .signac_history.txt

As suggested by @vyasr we may want to switch to a more compact storage format, for example by bundling all files within a .signac folder.

@csadorf csadorf added the enhancement New feature or request label May 21, 2019
@csadorf csadorf added this to the v2.0.0 milestone May 21, 2019
@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2020

@bdice @mikemhenry thoughts on this issue? I think this would be relatively straightforward to implement, the bigger question is what tradeoffs there are. For example, making this change would effectively discourage direct modification of the signac.rc file, for instance to change the workspace (you would have to go through the config API). Personally I think this is a good thing and would want to follow git's example; you can always modify .git/config if you know what you're doing, but it's better to put up some barriers to this for less-experienced users IMO.

@bdice
Copy link
Member

bdice commented Feb 26, 2020

I propose:

  • Keep signac.rc and project/job state point/document/data files clearly visible with the current name scheme.
  • Place all caches and shell history files in a .signac directory.
  • Recommend adding .signac to .gitignore.

I believe that the clear visibility of signac files (aside from caches/shell history files) is a good thing, especially for new users learning to navigate the data space.

@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2020

I'm mostly in agreement. All JSON document files should be visible, all internally generated caches etc should be hidden. The only file I'm not 100% sure about is signac.rc, since we tend to mirror git a lot and .git is hidden. I could see going either way with this file, so I'll defer to others if you have preferences.

@b-butler
Copy link
Member

I think that having one way to modify the configuration is best for new users. Thus I would suggest putting signac.rc in the .signac folder while exposing the settings through the config API.

@csadorf
Copy link
Contributor Author

csadorf commented Feb 26, 2020

Agreed, I'm not so sure about the config file either.

@mikemhenry
Copy link
Contributor

I will have to think about this a bit. I like how clean it would be to keep signac.rc under source control, and just ignore .signac/. But if we want to 'hide' signac.rc from new users/let people live theirs lives without seeing it, perhaps .signac.rc (or even .signacrc) would be a good compromise. That way the file is still 'hidden' but it would be easy to keep it tracked in source control, and then users can with confidence know anything in the .signac/ folder isn't critical.

@atravitz
Copy link
Contributor

I will have to think about this a bit. I like how clean it would be to keep signac.rc under source control, and just ignore .signac/. But if we want to 'hide' signac.rc from new users/let people live theirs lives without seeing it, perhaps .signac.rc (or even .signacrc) would be a good compromise. That way the file is still 'hidden' but it would be easy to keep it tracked in source control, and then users can with confidence know anything in the .signac/ folder isn't critical.

I like this suggestion a lot. Is there anything blocking us from making this change now?

@csadorf
Copy link
Contributor Author

csadorf commented Nov 12, 2020

I will have to think about this a bit. I like how clean it would be to keep signac.rc under source control, and just ignore .signac/. But if we want to 'hide' signac.rc from new users/let people live theirs lives without seeing it, perhaps .signac.rc (or even .signacrc) would be a good compromise. That way the file is still 'hidden' but it would be easy to keep it tracked in source control, and then users can with confidence know anything in the .signac/ folder isn't critical.

I like this suggestion a lot. Is there anything blocking us from making this change now?

I think what we need are some super basic guidelines for developers on how this folder is to be used. For example, should files related to flow be placed in .signac/flow/ or can they just go into .signac/? What about the "core" files, do they go into the root directory or into .signac/core/? So, I'd suggest that we compile all the files that we currently have, make a complete proposal as to how to structure them, derive some very basic rules regarding namespaces that can be applied in the future, and then we can assess whether we want to immediately implement that in a backwards compatible way or whether we just break with 2.0 and then require the new structure. Either way, we would want to have an automated migration path.

@bdice
Copy link
Member

bdice commented Nov 15, 2020

Here's an attempt at compiling and organizing all the project-wide files. Feel free to edit this comment if I missed something and you have privileges to edit on my behalf.

Path Code/Data .gitignore? Suitable for .signac? Package
.bundles Data Yes Yes flow
.signac_shell_history Data Yes Yes core
.signac_sp_cache.json.gz Data Yes Yes core
signac.rc Code? Yes/No? No core
signac_data.h5 Data Yes No core
signac_project_document.json Data Yes No core
workspace Data Yes No core
signac_statepoints.json Data Yes Yes core

My reasoning for the columns I included: It should be easy to separate code and data. I try keep a relatively strict separation and commit code but do not commit data. I consider the information in signac.rc to be somewhere between the two and I honestly don't know whether to recommend committing it (in practice, I don't think that I usually commit signac.rc). However, there are important semantic differences among the data files, so I broke it out into "Suitable for .signac?". Data managed by the framework (likely intended for long-term archival) shouldn't be treated the same as temporary cache info that could be re-generated.

I propose this set of rules:

  • signac.rc shall contain project configuration and settings (retain current behavior).
  • The .signac directory shall be used for optional caches or state information that would not be recommended to commit to source control.
    • This includes command histories or temporary execution information like .bundles.
    • A hypothetical example would be temporary files or caches generated by signac-dashboard.
  • The .signac directory shall not be used for project data managed by the framework (e.g. the project document or HDF5 data).
  • .signac shall only contain files or subdirectories with pre-determined names, not arbitrary or user-defined names.
    • Subdirectories of .signac may contain files or subdirectories with arbitrary names (e.g. files named according to a job id or scheduler job number).
  • Names of files in .signac do not need to begin with a period or contain the word "signac" because they are already prefixed with .signac/.
    • For example, I recommend renaming .signac_sp_cache.json.gz to .signac/statepoint_cache.json.gz.

Consequences that should follow from these rules:

  • Removing files in .signac should have a relatively low impact on the project.
  • .signac should be safe to add to .gitignore.

I think @csadorf's proposal of namespaces (subdirectories in .signac) is worth considering but might be an unnecessary layer. I think we can safely assume that framework developers can coordinate as needed to avoid conflict between filenames (we've had no trouble in the root directory, and it only gets easier to manage in a hidden subdirectory).

@vyasr
Copy link
Contributor

vyasr commented Nov 15, 2020

I won't have time to think more about this for a while, so when I saw this comment I figured I'd add my thoughts now (I also added the signac_statepoints.json file to the table).

While source control is something to consider, and code vs data is an important distinction, I think what determines whether or not something belongs in .signac is based on the mission statement of signac: providing organization for data on the filesystem that can then be accessed with or without signac. That reasoning establishes a clear dividing line: all data and information describing its organization (the workspace, statepoints, hdf5 stores, project doc/store, etc) must be visible, while any intermediate or cache file generated by signac that would not be necessary for a non-signac user to interpret the data space can be hidden in .signac. This reasoning does not change any of the rules @bdice proposes, it just places a different emphasis on how to think about them. It also says nothing about the internal organization of .signac; the suggestion that subdirectories must be used for arbitrarily named files is fine with me, it would help keep things cleaner.

From that perspective, and knowing how signac.rc is currently used, there is only one reason why signac.rc needs to be visible: the need to know the workspace. To my knowledge there is no other configuration information stored in signac.rc that is relevant to parsing the data. The fact that the workspace is configurable is what makes this dividing line blurry, since that has immediate implications on how to parse the data (for instance, if there are folders workspace1 to workspace10 and they all contain different data, which is the active workspace?). That makes me wonder how useful the ability to configure the workspace directory is. Given that you can always just move the current workspace to a different location and make a new directory (or symlink one), what is the benefit of this feature? IMO if we removed that, then signac.rc would be a candidate to move into .signac. The only other benefit to having it there is to be able to see whether a directory is a signac project, but I don't think that's a very strong argument since you can just as easily look for .signac.

@csadorf
Copy link
Contributor Author

csadorf commented Nov 16, 2020

I mostly agree with @bdice 's rules and I also agree with @vyasr ' assessment and slight shift in perspective. I also agree that removing the ability to change the workspace would probably make a lot of sense not only in this context, but also in general (this has been proposed and discussed before).

With this I suggest the following addendum to the so far proposed rules:

  1. Configuration files within a project root directory are placed in .signac/config.
  2. The project workspace is always $PROJECT_ROOT/workspace (could be a separate change).
  3. A project root directory is defined as the parent directory of a directory named .signac.

This has the following advantages:

  1. Our git-analogy is even better adhered to (compare .git/config etc.)
  2. For a user who does not use signac, it does not matter what workspace is active or whatever, really only the data matters.

@cbkerr
Copy link
Member

cbkerr commented Jan 4, 2021

@vyasr :

The fact that the workspace is configurable is what makes this dividing line blurry, since that has immediate implications on how to parse the data (for instance, if there are folders workspace1 to workspace10 and they all contain different data, which is the active workspace?). That makes me wonder how useful the ability to configure the workspace directory is. Given that you can always just move the current workspace to a different location and make a new directory (or symlink one), what is the benefit of this feature?

@csadorf

I also agree that removing the ability to change the workspace would probably make a lot of sense not only in this context, but also in general (this has been proposed and discussed before)

This issue is the only place that I could find discussion of this topic, but I want to argue for maintaining the ability to configure the name of the workspace directory.

I personally haved used the ability to rename workspace directory to increase the readability of the data. I was prototyping a genetic algorithm managed with signac (only run locally so far), and I used nested projects with the top-level project managing the generations and each generation having projects that managed individuals. To capture this, I named the top workspace directory generations. Each job in that had a workspace directory called individuals.

I acknowledge that it might be that because I had the option, I found a use for it, and that if I didn’t have the option, I’d have managed in another way (I have a readme file anyway).

…knowing how signac.rc is currently used, there is only one reason why signac.rc needs to be visible: the need to know the workspace. To my knowledge there is no other configuration information stored in signac.rc that is relevant to parsing the data.
– Vyas

Allow workspace-name configurability might not conflict with moving signac.rc to .signac/config. In the end, it is the signac user’s responsibility to ensure that the project is understandable when they share it. They choose how to export it or can make use of the view command. In this vein, they could include a readme file indicating where to look for the data.

The only other benefit to having it there is to be able to see whether a directory is a signac project, but I don’t think that’s a very strong argument since you can just as easily look for .signac.
– Vyas

I personally like having a signac.rc file visible to identify a signac project when browsing directories, but I often show hidden files in Mac’s Finder or ls -la anyway so hiding the config files wouldn’t affect me.

Will all users and their collaborators be familiar with git and know to look for a .git file?

@bdice:

Removing files in .signac should have a relatively low impact on the project.

I like the idea that I can delete the .signac file and signac will just regenerate cache files, etc.

@vyasr
Copy link
Contributor

vyasr commented Jan 4, 2021

That's a valid argument for allowing configuration of the variable. I still think that the benefits of a standardized schema may outweigh this, since then it doesn't rely on user documentation at all. It would also allow us to specify a much more explicit data model in a way that would allow us to change the data layout much more easily in the future, which would be very valuable for some of the generalizations originally proposed in the signac 2.0 prototype. We can always discuss this specific configuration independently of the other parts of this issue, though.

Regarding whether users will be familiar enough with git to know to look for a .git file, I don't really think that's super relevant. Mirroring git is a nice design pattern for us, but from a user perspective what really matters is whether users know how to figure out if something is a signac directory. Teaching a user to look for .signac is marginally harder than teaching them to look for signac.rc (ls -a vs ls), but in either case it's something we'd be teaching in a tutorial so I think it's fine.

@vyasr vyasr mentioned this issue Feb 26, 2021
12 tasks
@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2021

@glotzerlab/signac-committers I'm wondering what we should do with the signac_statepoints.json file. Currently, every method of the Project that uses this file can be pointed to a different filename if necessary, but ultimately it boils down to functioning like another state point cache, but one that is more "trusted". If the file exists, the internal _get_statepoint method can read from this file if a given state point is found neither in the state point cache, nor on disk. Similarly, the repair method can use this file to reconstruct the data space if anything has been corrupted. However, I've never seen any of this functionality put to use, and I'm not sure that as a user I would know when I had a signac_statepoints.json file whose contents I trusted more than the actual data space.

The write_statepoints method can be used to generate this file, or alternatively store the same data into a different filename provided by the user. I can imagine that functionality possibly being useful, but I'm not sure about the rest. How should we deal with this? I'm leaning towards removing most (if not all) logic associated with it, but I could be overlooking significant benefits that it confers. If we want to keep it, though, we do need to address where the file should live; the "default" file should perhaps be within .signac, but users can also write to a custom file, which would not be hidden. This dichotomy seems odd to me.

@csadorf in case you had specific goals in mind when adding this feature let us know!

@bdice
Copy link
Member

bdice commented Feb 26, 2021

I agree with @vyasr on basically everything stated above.

  • I agree that I haven't seen the signac_statepoints.json file used before and am unaware of its intended benefits over the zipped state point cache generated by signac update-cache.
  • I agree that the logic for writing/reading/repairing from this file could be deprecated in favor of reading from the zipped cache. It doesn't seem like there's a good case to support both, and the zipped cache is probably faster?
  • I agree that the default location can be in .signac if we choose to keep the file and features that use it.

I'm not sure that as a user I would know when I had a signac_statepoints.json file whose contents I trusted more than the actual data space.

I think you would know that repairs from a state point cache are needed if signac raised an error when opening jobs/reading state points.

@vyasr
Copy link
Contributor

vyasr commented Feb 26, 2021

Yes, I agree that you'd know when repairs are necessary. The thing I don't know is how you would know whether to trust the signac_statepoints.json file. For all you know it could be vastly out-of-date, or contain state points that are no longer desired.

@stale
Copy link

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 2, 2021
@csadorf csadorf removed the stale label Jun 2, 2021
@vyasr vyasr added the pinned Instructs stale bot to ignore this issue. label Jun 3, 2021
@vyasr
Copy link
Contributor

vyasr commented Apr 12, 2022

I think this issue is resolved by #678 and #708. signac-flow will need to make a separate set of changes to migrate any hidden files it stores at the project root.

@vyasr vyasr closed this as completed Apr 12, 2022
@vyasr vyasr mentioned this issue Apr 14, 2022
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pinned Instructs stale bot to ignore this issue.
Projects
None yet
Development

No branches or pull requests

7 participants