
Deep copy of original files #20

Open
mtbc opened this issue Feb 11, 2016 · 47 comments

Comments

@mtbc
Member

mtbc commented Feb 11, 2016

OMERO's 5.2 branch already offers the Duplicate request to have the server copy model subgraphs; it is tested by https://github.com/openmicroscopy/openmicroscopy/blob/develop/components/tools/OmeroJava/test/integration/DuplicationTest.java. However, copying images is disappointing: the pixel data is missing because it depends on original files, which are uniquely named and singly owned. For space reasons we don't want to actually have to copy the files in the binary repository, but we probably want the duplicator to own their duplicate, which may be moved to a different group from the original.

The initial use case, for which the existing duplicator probably already suffices, is described in http://trac.openmicroscopy.org/ome/ticket/11532: it would be possible for scripts to duplicate instruments and suchlike instead of sharing them with derived images or, even better, the duplication could be done automatically as needed, as described by https://trello.com/c/ISnICsrC/16-auto-duplicate-in-graph-operations.

A more general deep copy could be arranged if we could duplicate original files. For instance, we could allow marked "copies" of original files to have the same name but be read-only, deleting the underlying file only when the last of that name is deleted from the database. Or, we could actually use filesystem links to seemingly copy the file, except that the new copy would have to be on the same volume if hard-linking, or, if soft-linking, would be lost when the original is deleted. Then there is the matter of pyramids: we want to avoid generating duplicate pyramids, but how can the owner of the duplicate have permission to find the original's pyramids?
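For illustration, that link trade-off can be sketched outside OMERO in a few lines of Python; the repository paths here are hypothetical:

```python
import os

# Hard link: the duplicate keeps its data even if the original is deleted,
# but both paths must be on the same volume/filesystem.
try:
    os.link("/repo/alice/image.tiff", "/repo/bob/image_copy.tiff")
except OSError:
    pass  # e.g. EXDEV ("Invalid cross-device link") when the copy lands on another volume

# Soft link: works across volumes, but dangles once the original goes.
os.symlink("/repo/alice/image.tiff", "/repo/bob/image_link.tiff")
os.remove("/repo/alice/image.tiff")
os.path.exists("/repo/bob/image_link.tiff")  # False: the "copy" has lost its pixel data
```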

This issue exists to collect interesting points regarding:

  • How this grander deep-copy should, or should not, be implemented.
  • Requirements, and use cases informing them, to avoid our taking a course that ultimately disappoints.
@pwalczysko
Member

I am not sure about the exact implementation of deep copy, but at the moment I think we are trying to solve four issues with this:

  • give the clients a tree that behaves "normally", i.e. the file-manager behaviour of Copy/Cut/Paste (instead of Copy Link, Cut Link, ...), especially with respect to a subsequent Delete of the copy
  • allow copying images and moving the copies, with the purpose of publishing them in another, specific, publicly accessible group -> this is very broken at the moment
  • allow image processing on other users' data in a Read-Annotate group (see the case of mporter and other subcases)
  • allow moving processed images into another group (this is tied to point 2 above)

I think we could ask @imunro about the setup at their institute, because they were trying to tackle the Public group case seriously, so that we have some concrete use cases.

@imunro

imunro commented Feb 11, 2016

As far as publication goes, our current approach is to move data for publication into the public group.
Ideally it would be possible to shallow-copy into the public group, leaving the existing Project-Dataset structure intact.

@gusferguson

@mtbc - I will try to formulate the use cases I have in a clear format in the next day or two and add them.

@gusferguson

This is how I conceptualise Deep vs Shallow copy - (I may well be wrong!)
Shallow Copy is what happens with current copy-paste.
Note: Deep Copy shown in different group, but could be in same.

[Diagram 1: deep copy vs. shallow copy]

Deep Copy Use Cases

Publish 1:

  • image in lab group - read-annotate
  • rendering settings saved
  • annotations added
  • ROIs added
  • analysis performed
  • make deep copy —> Public folder - read-only - public access
  • delete attachments on deep copy
  • edit tags
  • subsequent changes to original image rendering settings or annotations will not affect published image
  • delete of attached file from original image will not delete attachment to published image

Publish 2:

  • image in lab group
  • rendering settings saved
  • annotations added
  • ROIs added
  • analysis performed
  • make shallow copy to Public folder - read only - public access
  • subsequent edits to ROIs, annotations, analysis are reflected in both copies - i.e. changes only have to be made in one place

Analysis:

  • read-write-1 > user-1 owns image - sets rendering settings - draws ROIs - does analysis
  • read-write-1 > user-2 - makes deep copy - resets rendering settings - changes ROIs - does analysis
  • user-1 transfers “original” to another group - private-1
  • or user-1 deletes “original”

Notes:

Once we have Deep Copy, the UI terminology will need to make the difference between the two clear.
May need to call Shallow Copy “alias” or “shortcut” and Deep Copy “copy”.
If a copy of Original File + settings.

It is not transparent from the UI what happens when I move a Shallow Copy (linked) to a different group from the original - both “copies” disappear - without any warning.
This will differ from moving one of a deep copy/original pair - where only one will move.
Will probably need a way to differentiate a Deep from a Shallow copy - e.g. add “copy” to the name (like the OSX Finder).

Questions:

  • will Deep Copy be possible by other users in Read-Write groups? - presuming it is possible by owners, group owners and admins in other groups.
  • who will own the Deep Copy - owner of original image or user who makes the Deep Copy (i.e. any user in read-write group)?

@gusferguson

@mtbc - still need to think some more about all this.

@mtbc
Member Author

mtbc commented Feb 16, 2016

From the point of view of what's technically implemented on the server (which may not be how things should be for the end user), deep copy is possible by other users even in read-only groups, so long as they don't expect the resulting image to be in the original owner's dataset, as the copy is owned by the copier.

@gusferguson

@pwalczysko - is this what Ian needs?

[Diagram 2: deep copy]

@pwalczysko
Member

I think we need to include here what @imunro actually wants:

  • shallow copy which is moveable to another group

Sorry - I commented first, then read your comment above, @gusferguson.
I think basically yes, this would be @imunro's request. But details such as whether a change to one is reflected in the other, and whether or not he actually wants all the annotations with the shallow copy, would be for him to specify.

@joshmoore
Member

I could use some discussion about "shallow copy". The "change to one is reflected in the other" sounds new to me, and also sounds quite expensive.

@gusferguson

@joshmoore - I might be causing confusion by using "shallow copy" - I just mean the current "link" copy that exists already.

@joshmoore
Member

Shallow Copy is what happens with current copy-paste.

Ah, i.e. an image is linked into two datasets at the same time? Ok. I'll try to adjust my mental map.

@imunro

imunro commented Feb 16, 2016

From what I can see, that looks as if it would satisfy the requests we've had. Basically people want a method of easily making their data public, e.g. as supporting data when a paper is published.
Currently they just move the data in question into the public group. The shortcoming of this is that if (as is usual) they are cherry-picking the 'best' examples, then moving them disrupts whatever organisation they have in place, i.e. if they then want to do further analysis on all the data, it is now in two different places. Hope that makes sense.

@gusferguson

@imunro - thanks for the clarification - I have added that as Publish use case 2.

@mtbc
Member Author

mtbc commented Feb 22, 2016

@joshmoore: Does the above introduce any new dimensions? I am thinking we are facing exactly the issues we feared. 😃 I wonder how to proceed.

@jburel
Member

jburel commented Feb 23, 2016

Projection will probably fall into the category of deep copy, at least at the graph level (a new set of raw data is created).
Currently part of the graph is shared between the projected image and the source image, preventing move for example.

@mtbc
Member Author

mtbc commented Mar 16, 2016

Absolutely. If the pixels service weren't so annoyingly featureful I'd have fixed this already! /-:

mtbc mentioned this issue Jul 25, 2016
@mtbc
Member Author

mtbc commented Dec 2, 2016

We can probably allow deep-copy of files without any database changes. However, regarding data duplication in the server's binary repository: separately from the pyramid duplication fear above, if we are to avoid data duplication in the managed repository then there is the question of what happens when the original file is deleted.

With a database change adding a Boolean "is-a-copy" column that removes such rows from the uniqueness constraint,

  1. If the original file is deleted we can just have one of the copies serve as a new original. One would have to choose which of them but there may be no wrong answer to that.
  2. This change may take us further away from a better arrangement for how OMERO organizes its idea of the filesystem.

Without any database changes,

  1. We need to copy files so that they can be read from different groups, so we would do so via filesystem-level links from different paths/names. However, the new copy may have to be in a different partition within the managed repository, so sometimes only a soft link is possible (imagine in-place import).
  2. Soft-linked copies would lose their pixel data if the original is deleted.

So, UX questions focus on: how good or bad is it if the new copy's pixel data (or copied attachments) disappear when the originals are deleted? Does it suffice if at least the copier knows which kind of copy they got?

(A question for @joshmoore might be: do you see a solution to pyramid duplication that also requires database changes? E.g., a pointer from a Pixels copy back to some other instance's pyramids.)

@manics
Member

manics commented Dec 5, 2016

Another potential factor: how well would this work with an object data store?

@pwalczysko
Member

So, UX questions focus on: how good or bad is it if the new copy's pixel data (or copied attachments) disappear when the originals are deleted? Does it suffice if at least the copier knows which kind of copy they got?

This is very very bad. If I delete the original and the copy is deleted as well, then this is not a deep copy by any stretch of the imagination. The user will be completely unable to absorb the concept of any in-betweens regarding deep copy (direct experience of the situation we have now).

@gusferguson

This is very very bad.

Couldn't agree more!

@jburel
Member

jburel commented Dec 5, 2016

If the user does a deep copy, he/she will not expect to be affected by the deletion of the original.
We cannot do that.

@mtbc
Member Author

mtbc commented Dec 5, 2016

So, my current guess is:

  • add a boolean column originalfile.iscopy with the repo, path, name uniqueness limited to !iscopy
  • for pixels and thumbnail add a bigint column binaryfrom referencing id non-cyclically

However, I am not sure that these are exactly what we will be glad we did. Chat some more? Go for it? Postpone to >5.3?

@mtbc
Member Author

mtbc commented Dec 5, 2016

(I assume we do still need to permit deletion of the original.)

@joshmoore
Member

If we don't want to prevent the deletion of the original, then I'd say the other recent comments above amount to a hard-linked mrepo-internal re-import which we could make available without the DB changes. The primary disadvantages would be:

  • either duplicates pyramids or requires more work to move them to the mrepo cc: @sbesson
  • would be disallowed (i.e. would throw an exception) if multiple mount-points exist under mrepo

From my side, I think we're still looking for features (or API breakages) which would require the DB changes, i.e. finding the related original fileset or, as Mark asks, preventing an operation like delete on the original. Do we want any relationship between old & new?

NB: The addition of binaryfrom is new to me, but especially with regard to letting thumbnails prevent deletion of original files I have some concerns.

@mtbc
Member Author

mtbc commented Dec 5, 2016

would be disallowed (i.e. would throw an exception) if multiple mount-points exist under mrepo

is a big problem, I'd have thought: after years of using the initial volume the admin adds a new one and suddenly nobody can deep-copy any more.

binaryfrom just points back to another (say) thumbnail in the same table; one would have to mv its file to a "new" original's ID if the original is deleted; it shouldn't prevent deletion of original files.

@mtbc
Member Author

mtbc commented Jul 29, 2020

So, while it requires a run of the import machinery, there is a reimport workaround available, perhaps best done by a server-side script; broadly,

  1. Duplicate the metadata.
  2. Reimport the filesets in-place and delete that new metadata.
  3. Use the filesets of the reimport in the metadata duplicate to fill out its empty images.
  4. Do something clever for attachments, tables, etc.

Alternatively, the duplicate machinery is already just about there and its genericity will tend to cover edge cases, but it needs,

  1. Adjust duplicate's graph rules to properly include the file-related objects; comparing with chgrp may help.
  2. Adjust the database schema to allow duplicate entries for the same underlying file.
  3. Adjust the server's deletion logic to delete the underlying file only when the last row for it is deleted.

I'd like to think we're relying less on server-built pixel pyramids anyway, so we can brush that duplication issue under the rug. 😃

@mtbc
Member Author

mtbc commented Jul 30, 2020

While chatting with @joshmoore, a third option came to mind; broadly,

  1. Adjust duplicate's graph rules to properly include the file-related objects; comparing with chgrp may help.
  2. Add hooks for OriginalFile duplication that handle the filesystem: using new paths, with the help of the import-time template prefix expansion code where necessary, hard-linking rather than copying where it can.

This does not require database changes but should be able to at least copy filesets, attachments, and thumbnails.

This might be the best option in terms of trading off implementation effort against outcome.
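A rough sketch of the "hard-link where it can, copy otherwise" behaviour; the helper name and paths are hypothetical, not existing OMERO code:

```python
import os
import shutil

def place_duplicate_file(src, dst):
    """Back a duplicate OriginalFile with the original's bytes:
    hard-link where possible (no extra storage, survives deletion of src),
    otherwise fall back to a real copy (e.g. across mount points)."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(src, dst)
    except OSError:           # e.g. EXDEV: different filesystem under the managed repository
        shutil.copy2(src, dst)
```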

@mtbc
Member Author

mtbc commented Aug 14, 2020

In trying to figure out where in the managed repository to put the copy of the original files: the DuplicateI steps run within an executor.execute, but the code for creating a new receiving space in the managed repository passes Ice.Current around, fiddles with omero.session.uuid to sudo, etc. It may be tricky to create the parent directories for the copy of the original files from within an IRequest step.

@joshmoore
Member

I would assume the repository has examples of going from Ice.Current to a valid executor.execute context. It may need extracting to a helper though.

@mtbc
Member Author

mtbc commented Aug 14, 2020

And the reverse too? Let's see what we can find. 👍

@mtbc
Member Author

mtbc commented Aug 14, 2020

(DuplicateI.step would like to do something like ManagedRepositoryI.internalImport.)

@mtbc
Member Author

mtbc commented Aug 14, 2020

Perhaps awkwardly, further executor.execute calls occur within the import preparation, and I don't believe they like to be nested.

@joshmoore
Member

It's true they don't like to be nested. Though once one is active other methods shouldn't need to call them. If a separate transaction is necessary, then only submit() will do.

@mtbc
Member Author

mtbc commented Aug 14, 2020

Aha, that might be exactly the clue I needed to get things working, even if I have to duplicate some execute-using code at first.

@joshmoore
Member

Quite possibly. The RepositoryDao, I think, has two copies of several methods (or did) for just that reason.

@mtbc
Member Author

mtbc commented Aug 25, 2020

The current interesting problem is backing out from failures: how to track what makeCheckedDirs actually had to create so that we can delete it again.
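A hedged sketch of that bookkeeping, in plain Python rather than the actual makeCheckedDirs code: record only the directories a call actually created, then remove them in reverse order if the duplication fails.

```python
import os

def make_dirs_tracked(path):
    """Create missing parents one level at a time; return only the
    directories this call actually created, deepest last.
    Assumes an absolute POSIX-style path."""
    created = []
    current = os.sep
    for part in path.strip(os.sep).split(os.sep):
        current = os.path.join(current, part)
        if not os.path.isdir(current):
            os.mkdir(current)
            created.append(current)
    return created

def rollback_dirs(created):
    """Back out a failed duplication: delete only what we created,
    children before parents."""
    for d in reversed(created):
        os.rmdir(d)
```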

It could be useful for another devspace to be created for testing deep copy. Bugs might mess up the binary repository or database. Unless it's easy to restore merge-ci's data? cc: @pwalczysko

@pwalczysko
Member

Unless it's easy to restore merge-ci's data? cc: @pwalczysko

Not very easy. Moderate difficulty - it needs manual reimports (there are scripts, but they need to be run manually).

@mtbc
Member Author

mtbc commented Sep 7, 2020

In considering how to provide access to DuplicateI from OMERO.web it could be worth reviewing typical duplication workflows, especially given how DuplicateI supports extra arguments. For example, by default annotations are included when an image is copied. By using typesToIgnore=['IAnnotationLink'] one can wholly omit annotations from a duplicate image or, for instance, by using typesToReference=['Annotation'] one can have the duplicate point to the original's annotations (thus preserving ownership, etc.). Also see ome/omero-cli-duplicate#7.
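For reference, here is a minimal omero-py sketch of submitting such a request and waiting for it to finish; the connection details and image ID are placeholders:

```python
import omero
import omero.callbacks
import omero.cmd
from omero.gateway import BlitzGateway

# placeholder credentials/host
conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
conn.connect()

dup = omero.cmd.Duplicate()
dup.targetObjects = {"Image": [123]}        # placeholder image ID
dup.typesToIgnore = ["IAnnotationLink"]     # wholly omit annotations from the duplicate
# ...or instead have the duplicate point at the original's annotations:
# dup.typesToReference = ["Annotation"]

handle = conn.c.sf.submit(dup)
cb = omero.callbacks.CmdCallbackI(conn.c, handle)
cb.loop(50, 1000)                           # poll for up to ~50 seconds
print(cb.getResponse())

conn.close()
```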

@pwalczysko
Member

pwalczysko commented Sep 8, 2020

Four workflows for duplicate:

  1. Publication workflow

    • duplicate parts of the data, insist that the duplication is with all the annotations as a default
    • possibly change ownership of the duplicated data
    • move the duplicated data into the publication group
  2. Cooperation workflow:

    • duplicate parts of the data, sometimes with, sometimes without annotations (depending on the workflow, but not pointing to the original annotations - that would lead to a cross-group linking attempt, see step 2 below)
    • move the data into another ad-hoc cooperation group of which the collaborator is a member
  3. Organization of own data workflow:

    • duplicate some images in order to get a better overview of my own data, so that I can have a "real" copy of an image in 2 different datasets (the first choice would be to have the annotations linked in this case, although it needs to be clear in the clients that the annotations are NOT duplicated)
    • having the option of not duplicating the annotations (mainly file annotations) would be useful for a facility manager or sysadmin worried about the load on storage capacity
    • this workflow also has the biggest potential to create "unclean practices", using duplicates in many datasets as a replacement for tagging, KVPs or the like, just because it fits the mental model of folders and subfolders -> another reason why an institutional data-management person might not want users to overindulge in it, or might even wish to stop their users from using this workflow altogether
  4. Leaving user workflow

    • duplicate the images of a leaving user to be able to continue working with them even after the user is removed from the group. Almost certainly the annotations should be duplicated with the objects.
    • change ownership of the duplicate to the target user
    • possibly even move to another group
  5. Any combination of the above, and other real-life workflows I might have not thought of.

In short, all three flags highlighted in #20 (comment) are, under the right circumstances, interesting and useful in the above four workflows.

@joshmoore @mtbc does that answer the question?

@joshmoore
Member

(A quick extra idea I had during the call: what about duplicate --legacy which has the previous behavior, i.e. turns off the binary bits?)

@pwalczysko
Member

what about duplicate --legacy which has the previous behavior, i.e. turns off the binary bits?)

I would be afraid that it is very difficult to explain the behaviour of this option to the users, but maybe someone can come up with a scenario where this flag would be an advantage to a user?

@will-moore
Member

Seems to me there are basically 2 workflows:

  1. Duplicate it into a new group (need to duplicate annotations)
  2. Duplicate and keep in same group (don't need to duplicate annotations)

The only issue might be if you do 2, then decide to move it to another group later.
By far the biggest driver for the duplicate functionality (as far as I'm aware) is "publish" or other flavours of option 1.
So in webclient the "Make a copy" feature could just support the 2 options above?

@pwalczysko
Member

pwalczysko commented Sep 9, 2020

So in webclient the "Make a copy" feature could just support the 2 options above?

@will-moore If I take your lead on the workflow listing, then I would see a sub-option of your option 2 (say, 2b):

  2. Duplicate and keep in same group (duplicate annotations optional)
    2b. Duplicate and keep in the same group (do not duplicate annotations; instead link the annotations to both copies)

Option 2b is very interesting for saving space on FileAnnotations, I would imagine, but it could also help to keep an overview by not proliferating tags etc.

@will-moore
Member

@pwalczysko Your 2b is the same as my 2 (do not duplicate annotations).
Is there any reason why you'd want to duplicate annotations if you're NOT moving to a different group? Do we want to support that in webclient? Seems like that would cause some confusion if you duplicate Tags in the same group.

In the 'don't duplicate annotations' scenario, do we mean ALL annotations? E.g. if I duplicate an image with Key-Value pairs, then I edit the KV pairs on one image, would they update on the other image? I probably wouldn't expect that. Same for Comments (although you can't edit in webclient) and Ratings. Files and Tags are OK not to duplicate.

@pwalczysko
Member

In the 'don't duplicate annotations' scenario, do we mean ALL annotations?

@mtbc might correct me, but I think the answer is yes. Not sure if any granularity is possible there, @mtbc?

Is there any reason why you'd want to duplicate annotations if you're NOT moving to a different group?

Yes, to make the annotations yours, as you can duplicate objects of other users in 3 types of groups. The duplicate gives you back a nice, one-owner tree with one-owner annotations and one-owner links.

E.g. if I duplicate an image with Key-Value pairs, then I edit the KV pairs on one image, would they update on the other image?

@will-moore that depends on whether you chose option 1 or option 2b. Btw, I thought of option 2 (or 2a ...) as "do not duplicate the annotations at all", i.e. have an unannotated duplicate image/Dataset/project as a result.

Same for Comments (although you can't edit in webclient) and Ratings. Files and Tags are OK not to duplicate.

This assumes a granularity which I am not sure is a given; see my first sentence in this comment... cc @mtbc

@pwalczysko
Member

One more use case for granular exclusion of certain annotation duplicates can be seen in ome/omero-blitz#100 (comment) - ROIs might simply take too long to duplicate, and the user might choose not to duplicate them...

@mtbc
Member Author

mtbc commented Sep 9, 2020

Not sure if any granularity is possible there

Granularity by type is available, e.g., duplication can treat tags differently from comments.
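Sketching that granularity with the same (assumed) request fields as in the earlier example, one could for instance keep comments shared but leave tags out entirely:

```python
import omero.cmd

dup = omero.cmd.Duplicate()
dup.targetObjects = {"Image": [123]}           # placeholder image ID
dup.typesToIgnore = ["TagAnnotation"]          # tags: not carried over at all
dup.typesToReference = ["CommentAnnotation"]   # comments: the duplicate points at the originals
# anything not listed (ROIs, file annotations, ...) would be duplicated as usual
```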
