Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

organize: simplified default folder naming #842

Merged
merged 23 commits into from
May 5, 2024

Conversation

pmiam
Copy link
Contributor

@pmiam pmiam commented May 3, 2024

  1. deprecate get_document_hash_folder
  2. consolidates document hashing under compute_an_id
  3. simplifies the creation of default directory names
  4. makes the papis_id value available at format time for most templates

One test needed to be updated (02fa5a6) to accommodate these changes.

A side effect of these changes are that the path creating functions which used to be independent of the database now need the database to know about the document before they're called. I think that sounds worse than it is, but let me know your thoughts. With the updated test, everything passes. Also, I manually tested the use-cache = False case and it seems to work.

Also, I also exposed a (possibly pre-existing) problem where if doc.save() is called before doc.set_folder(), the save fails due to an unset info file path.

These changes might lead to improved duplicate checking being discussed at issue #841

pmiam added 12 commits May 3, 2024 04:23
papis.id.compute_an_id and DB infrastructure produce UIDs more robustly
paths.py functions no longer have to take files lists as arguments.
However, the database now must be aware of a document before invoking
get_document_folder.

This means running doc.save() without first running doc.set_folder()
will fail due to an unset info file path.

From what I can tell all user interfaces call set_folder even for
metadata-only entries. However, it is common to use from_data in the
code. I've added a warning to the docstring but ideally the folder arg
should not be optional.
@jghauser
Copy link
Member

jghauser commented May 3, 2024

Thanks a lot for this PR, @pmiam!

A quick question: does this by default lead to folder names that are just md5 hashes? I'm not sure I'm a big fan of this as it makes the folder names not very human-readable (in fact, I think even the current default isn't great as it starts with the hash, and i think folders being organised by e.g. author family name would be more intuitive). Of course, this can be changed in the settings and doesn't present a problem as long as one uses the papis command, but I think we should aim for default values that are as approachable as possible.

@pmiam
Copy link
Contributor Author

pmiam commented May 3, 2024

Thank you for the warm welcome!

does this by default lead to folder names that are just md5 hashes?

It does, although it also makes it so the papid_id is available in templates for people like me who prefer it that way.

My personal preference is for machine-managed, flat directory trees. Frankly, that makes the most sense to me as a default, and I think it encourages use of the Papis tui or another front end, but I see your point. Especially as I was writing this, I noticed that a lot of work has been done to make human readable file names very expressive and robust to collision. I've made sure to leave that unaffected.

We could set a human friendly template as the default configuration easily.

Copy link
Collaborator

@alexfikl alexfikl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can further simplify the changes to papis.commands.add, but overall this looks like a good direction!

I agree with @jghauser that we should try for a better default folder name: HASHHASHHASH-author-author was not the worst, but just using the MD5 hash looks very opaque. I think Zotero also stores files on disk in some storage/PIMHYJGK type folder and it's very hard to find things without the interface.

@pmiam @jghauser Maybe the default could be something based on time-added and not papis_id? (both keys should always be available) E.g. {doc[time-added]}-{doc[author]}?

papis/commands/add.py Outdated Show resolved Hide resolved
papis/commands/add.py Outdated Show resolved Hide resolved
papis/document.py Outdated Show resolved Hide resolved
tests/test_paths.py Outdated Show resolved Hide resolved
@jghauser
Copy link
Member

jghauser commented May 3, 2024

@pmiam @jghauser Maybe the default could be something based on time-added and not papis_id? (both keys should always be available) E.g. {doc[time-added]}-{doc[author]}?

This is a very intriguing idea! It would avoid collisions and easily allows sorting by a very useful property (a bit of a shame that it would lead to oldest at the top by default, though i guess that cannot be avoided).

What I personally use is SURNAME_OF_FIRST_AUTHOR-YEAR-FIRST_5_WORDS_OF_TITLE (to be more precise, it's the jinja2 template I mentioned here), which i think looks quite nice and doesn't really lead to collisions (though I guess it could be complemented with a suffix as for file names). Not sure if that could be (easily) implemented without jinja2 though. I do think that having at least some of the title in the directory name is useful.

@alexfikl
Copy link
Collaborator

alexfikl commented May 4, 2024

(a bit of a shame that it would lead to oldest at the top by default, though i guess that cannot be avoided).

No worries, just reverse it with ls -r 😁

What I personally use is SURNAME_OF_FIRST_AUTHOR-YEAR-FIRST_5_WORDS_OF_TITLE (to be more precise, it's the jinja2 template I mentioned here), which i think looks quite nice and doesn't really lead to collisions (though I guess it could be complemented with a suffix as for file names). Not sure if that could be (easily) implemented without jinja2 though. I do think that having at least some of the title in the directory name is useful.

Hm, I think we have two issues here

  • First, it would be nice to have a good default for add-folder-name. I'm not sure I feel strongly here, so anything like {doc[author]}-{doc[title:.5S]} or {doc[year]}-{doc[author]} should work fine?
  • Second, we need a fallback value when constructing a folder name for when a mostly empty document comes in. This should then just use {doc[time-added]} and whatever else it can find in the document to identify it?

We can probably make that another PR too. In this PR, it's sufficient to move papis_id higher-up so that it can be used in formatting strings and deprecate get_document_hash_folder. 😁

@jghauser
Copy link
Member

jghauser commented May 4, 2024

We can probably make that another PR too. In this PR, it's sufficient to move papis_id higher-up so that it can be used in formatting strings and deprecate get_document_hash_folder. 😁

Totally agree, I'm gonna respond in the issue that you opened!

pmiam and others added 5 commits May 4, 2024 23:40
This writes the info.yaml as early as possible. papis_id is also
available in the ref template.

Note: computing papis_id so early means unique _deterministic_ hashes
can not be guaranteed!. _Random_ uids are 100% guaranteed.
Co-authored-by: Alex Fikl <alexfikl@gmail.com>
Copy link
Collaborator

@alexfikl alexfikl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few small nitpicks around the logger messages mostly, but this should be good to go! ❤️

doc/source/default-settings.rst Outdated Show resolved Hide resolved
papis/document.py Outdated Show resolved Hide resolved
papis/paths.py Outdated Show resolved Hide resolved
papis/paths.py Outdated Show resolved Hide resolved
papis/paths.py Outdated Show resolved Hide resolved
@pmiam pmiam force-pushed the PR/simplify-default-folder-name branch from c0ced16 to 083c270 Compare May 5, 2024 20:36
@pmiam
Copy link
Contributor Author

pmiam commented May 5, 2024

I'll get right on that, sorry. I've been working between celebrations today. Happy Easter!

@alexfikl
Copy link
Collaborator

alexfikl commented May 5, 2024

I'll get right on that, sorry. I've been working between celebrations today.

No worries, it's fixed and I'll merge as soon as the tests pass!

Happy Easter!

Happy Easter!

@alexfikl alexfikl merged commit 5709374 into papis:main May 5, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants