fix #7240: provide disk usage count #2919
Conversation
(add thumbnail service into request object factory registry)
Kick travis please?
Done.
One of the usual ones -- travis-ci/travis-ci#2507. Kick again?
Thank you, third run-through passed. (-:
**/
class DiskUsage extends Request {
    omero::api::IdListMap objects;
    bool includeAnnotations;
Under what conditions would one want to exclude the annotations? Should this be `excludeAnnotations`, so that, if unset, you get the annotations by default?
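The default-value question can be illustrated outside Ice. The classes below are plain-Python stand-ins for the two candidate request shapes (hypothetical, not the generated Slice binding); since an unset boolean comes back false, the flag's name decides what an untouched request does:

```python
from dataclasses import dataclass

# Hypothetical stand-ins; the real class is generated from the Slice
# definition, not written by hand like this.
@dataclass
class WithInclude:
    includeAnnotations: bool = False   # left unset: annotations NOT counted

@dataclass
class WithExclude:
    excludeAnnotations: bool = False   # left unset: annotations ARE counted

print(WithInclude().includeAnnotations, WithExclude().excludeAnnotations)
```

Either spelling is expressible; the choice is purely about which behaviour a caller gets without thinking about the flag.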
For instance, deleting images may (typically?) not delete their annotations. Happy to switch to an `excludeAnnotations` if inclusion is the preferable default (or to remove the option altogether).
Ah, that's an interesting point and gets towards what we consider to be the definitive graph. So, agreed that we should have them in sync (also for chgrp, export, etc.), but my assumption is that many annotations should be included by default. Perhaps something that has to be refined afterwards. If we were confident in our graph API (whether Graph1 with `/Image` and `{"Keep"}`, or Graph2) then I'd almost suggest we simply make this a subclass of `Graph`.
I haven't finished going through the implementation yet, just the API, but this looks quite nice. Only two suggestions on possible API changes. Would you be up for writing a CLI plugin? Another next step may be integration tests, but that's possibly in concert with @ximenesuk and @bpindelski.
I don't actually know Python; mostly I've gotten by here so far through some combination of (a) Perl 5 experience, (b) Haskell experience, and (c) reading others' code, so I've not been too sure about the little code I wrote, other than that it appeared to work. I would be up for doing more Python coding if I can first take several days to at least skim-read some educational literature and experiment on that basis, but now might not be the right time. (To some extent I'll have the same issue with C++, in that I am very rusty on the beyond-C side.)

I was thinking about integration tests. I wasn't sure from the client how to do things like size the thumbnail files or pyramids (not that I looked very hard, I admit), and I don't know what the inter-platform issues are with things like line endings in import logs, so it looked like it might be tough to get right, but I would certainly be glad to help out with work on them.
Understood. Whatever works for the team. This should be a fairly thin plugin like chgrp.py, assuming we can re-use the graph API (or something quite similar).
I'd assume it'd be fairly easy to come up with a heuristic for the likes of thumbnails since they'll be standard 96x96 JPEGs. Line endings on import files will vary, it's true, but a touch on the
Thank you, interesting. At a glance, thumbnails on my current local server vary from 771 bytes to 5737 bytes. I bet this is all solvable, though, just not by me alone with only a small effort.
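Taking the observed 771-5737 byte spread, a crude midpoint heuristic (an assumption, since the real size distribution is unknown) gives a ballpark for total thumbnail storage:

```python
# Midpoint of the thumbnail sizes observed above (bytes).
low, high = 771, 5737
avg = (low + high) / 2   # 3254.0, assuming sizes are spread evenly (a guess)

# Illustrative count only; substitute a real server's thumbnail count.
thumbnail_count = 23423

total_mb = avg * thumbnail_count / 1e6
print(f"~{total_mb:.0f} MB of thumbnails")   # -> ~76 MB of thumbnails
```

The point is only that a per-thumbnail average, however rough, turns a thumbnail count into a storage estimate without touching the files.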
Understood. Similar to @aleksandra-tarkowska's
Pushed code to provide the breakdown (and some explanatory API docs). The current report is fairly simple and matches the model object graph well. Perhaps we can go for this for now, and adjust it if necessary once related client scoping is done?
I've been doing some basic testing via Python (before the last two commits) and things look good. I'll test the breakdown on Monday.
Thank you. If you have notes today of the "correct answers" for the totals then you might want to make sure that the answers remain the same on Monday. (-:
I've tested this fairly thoroughly, at the sort of level one can manage manually. I tried each of the object types in the comment and some that aren't. The values, for valid objects, always looked right, bar the log file size issue. I tried overlapping objects (e.g. a Project plus a contained Dataset) to check that nothing was added in twice, multiple object ids, etc. Everything looks good. The breakdown,
I assume that these are the scripts, which are not stored under the data directory?
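The overlap testing mentioned above (a Project plus a contained Dataset) exercises the no-double-counting guarantee. A minimal sketch of the idea, using a hypothetical `total_usage` helper rather than the PR's actual implementation: key the accumulation by file id, so a file reachable from two objects is summed only once.

```python
def total_usage(reachable_files):
    """Sum the sizes of files reachable from several model objects,
    counting each file only once.

    reachable_files: iterable of (file_id, size_in_bytes) pairs, possibly
    with repeats when objects overlap (a hypothetical shape, not the PR's
    actual data structures).
    """
    seen = {}
    for file_id, size in reachable_files:
        seen[file_id] = size   # repeats overwrite, so no double count
    return sum(seen.values())

# A Project and a Dataset it contains both reach files 1 and 2:
project_files = [(1, 100), (2, 50), (3, 25)]
dataset_files = [(1, 100), (2, 50)]
print(total_usage(project_files + dataset_files))   # -> 175, not 325
```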
Discussed with @ximenesuk yesterday; probably a good idea to test against a server the size of nightshade.
gh-2945 should make that testing a little easier. Something like:

where X is the id of some group with a lot of data. Of particular interest here is how long such a call takes.
refactors usage tracking to fit extra duties
@ximenesuk, in answer to your question in #2919 (comment): yes, you're finding the standard Python scripts.
I've extended gh-2945 to use the new response object and to add a file count column. Everything looks good with minimal testing. I'll test further using the build tomorrow.
This all looks good with more complex and overlapping arguments.
@ximenesuk: tested against the nightshade clone yet?
Not tested on nightshade, but this query on trout:

took around 20 seconds.
@jburel: Can you suggest a suitable server of nightshade's size?
(Or the 4½TB trout server may have sufficed.)
Conflicting PR. Removed from build OME-5.1-merge-push#419. See the console output for more details.
Conflict appears to be with #2403.
Compared with trout's 23,423 thumbnails, @aleksandra-tarkowska reports that nightshade has nearly 200,000. How to test on anything like that, though, I don't know, given that the disk usage calculation expects access to the actual files (pyramid sizes, etc.), not just the database: we don't have a suitable test server? Having said that, I don't know how I could speed the calculation up much anyway without having it cache old sizes, which could be bad for the use case "test the size, delete some stuff, test the size again". Inspection of the code should confirm that the time complexity should be reasonable (though of course more data means more cache misses) and that the memory requirements should be low (the only data there are many of is
Assuming a roughly linear response, and that everything else is broadly an order of magnitude larger on nightshade, that would give around 3 minutes to return the disk usage of all groups in one shot.
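The three-minute figure follows from the two assumptions stated above (roughly linear scaling, roughly an order of magnitude more data):

```python
trout_seconds = 20   # observed: the trout query above took ~20 s
scale_factor = 10    # assumption: nightshade holds ~10x more data overall
estimate = trout_seconds * scale_factor
print(f"~{estimate} s, i.e. about {estimate // 60} minutes")
# -> ~200 s, i.e. about 3 minutes
```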
What remains to be done here? As with @aleksandra-tarkowska's
Yes, I'd also say, get it in; that will make it easier for @ximenesuk to develop against it. (@aleksandra-tarkowska will then have to merge develop into #2403, but that should be pretty easy.)
Marked for a review of the API based on the CLI usage. Merging. |
fix #7240: provide disk usage count
This PR fixes http://trac.openmicroscopy.org.uk/ome/ticket/7240. Testing is best done by people who know their way around the model graph objects and `omero.data.dir`. Pick some model objects, like `omero.cmd.DiskUsage(objects={'Image': [1]}, includeAnnotations=True)`, and see what answer you get from submitting the requests to the server.

The disk usage calculated should include any pyramids, thumbnails, archived files, import logs, other images from the fileset, etc. entailed by the model objects, and, for `includeAnnotations=True`, any file annotations on those objects, including, for instance, an annotation on a well whose screen you specified. A file should never be double-counted.

In code review, scrutinize the population of `DiskUsageI.TRAVERSAL_QUERIES`.

Caveats:

- `head -c` will verify for you that the DB size corresponds to up to `Step 4` and no further.)
- `Experimenter` or `ExperimenterGroup` due to http://trac.openmicroscopy.org.uk/ome/ticket/12514. Filtering queries on `name = 'stderr'` identifies these.

Testing tips:

If you want to see the `DEBUG` messages in `Blitz-0.log` -- very handy for figuring out which files are being counted and how large the server thinks that they are -- not only will you have to adjust `etc/logback.xml` to log `omero.cmd.graphs` at `DEBUG`, but you will have to comment out the filter that denies `omero.*` debug logging; thank you to @ximenesuk for that workaround.

To count up the bytes used in uploaded files, I was using lines like,

or, for a range of thumbnails (e.g., well samples for a plate),
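The one-liners referenced above were not preserved in this archive. As a stand-in (an illustration only, not the command actually used), one way to sum byte counts under a directory such as `omero.data.dir` with the Python standard library:

```python
import os

def dir_bytes(root):
    """Sum the sizes of all regular files under root, recursively.
    A hypothetical cross-check helper, not part of the PR."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):   # skip broken symlinks etc.
                total += os.path.getsize(path)
    return total
```

Comparing such a figure against the server's reported usage only makes sense for the files the request actually covers, per the caveats above.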
--no-rebase