
fix #7240: provide disk usage count #2919

Merged
merged 9 commits into from Sep 3, 2014

Conversation

mtbc
Member

@mtbc mtbc commented Aug 7, 2014

This PR fixes http://trac.openmicroscopy.org.uk/ome/ticket/7240. Testing is best done by people who know their way around the model graph objects and omero.data.dir. Pick some model objects, like omero.cmd.DiskUsage(objects={'Image' : [1]}, includeAnnotations=True), and see what answer you get from submitting the requests to the server.

The disk usage calculated should include any pyramids, thumbnails, archived files, import logs, other images from the fileset, etc. entailed by the model objects, and, for includeAnnotations=True, any file annotations on those objects, including, for instance, an annotation on a well whose screen you specified. A file should never be double-counted.

In code review, scrutinize the population of DiskUsageI.TRAVERSAL_QUERIES.

Caveats:

Testing tips:

If you want to see the DEBUG messages in Blitz-0.log -- very handy for figuring out which files are being counted and how large the server thinks they are -- then not only will you have to adjust etc/logback.xml to log omero.cmd.graphs at DEBUG, you will also have to comment out the filter that denies omero.* debug logging. Thanks to @ximenesuk for that workaround.

To count up the bytes used in uploaded files, I was using lines like,

find ManagedRepository/user-2_3/2014-08/07/11-52-21.241 -type f -exec ls -l '{}' '+' | awk '{ sum += $5 } END { print sum }'

or, for a range of thumbnails (e.g., well samples for a plate),

ls -l Thumbnails/{25..40} | awk '{ sum += $5 } END { print sum }'
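The same byte-counting can be done portably in Python; this is a hypothetical helper, not part of the PR, but since `ls -l`'s fifth column is the file size it should agree with the pipelines above:

```python
import os


def du_bytes(paths):
    """Sum the byte sizes (st_size) of the given files, recursing into
    directories -- equivalent to the find/ls/awk pipelines above."""
    total = 0
    for path in paths:
        if os.path.isdir(path):
            for dirpath, _dirnames, filenames in os.walk(path):
                for name in filenames:
                    total += os.stat(os.path.join(dirpath, name)).st_size
        else:
            total += os.stat(path).st_size
    return total
```

For example, `du_bytes(["Thumbnails"])` from inside omero.data.dir would total all thumbnail files.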

--no-rebase

@jburel jburel added the develop label Aug 7, 2014
@mtbc mtbc mentioned this pull request Aug 7, 2014
@mtbc
Member Author

mtbc commented Aug 7, 2014

Kick travis please?

@melissalinkert
Member

Done.

@mtbc
Member Author

mtbc commented Aug 7, 2014

Compiling 8803 source files to /home/travis/build/openmicroscopy/openmicroscopy/components/blitz/target/classes

The command "./components/tools/travis-build $BUILD" exited with 247.

One of the usual ones -- travis-ci/travis-ci#2507. Kick again?

@mtbc
Member Author

mtbc commented Aug 7, 2014

Thank you, third run-through passed. (-:

**/
class DiskUsage extends Request {
    omero::api::IdListMap objects;
    bool includeAnnotations;
Member

Under what conditions would one want to exclude the annotations? Should this be excludeAnnotations so that if unset, you get the value by default?

Member Author

For instance, deleting images may (typically?) not delete their annotations. Happy to switch to an excludeAnnotations if inclusion is the preferable default (or remove the option altogether).

Member

Ah, that's an interesting point and gets towards what we consider to be the definitive graph. So, agreed that we should have them in sync (also for chgrp, export, etc), but my assumption is that many annotations should be included by default. Perhaps something that has to be refined afterwards. If we were confident in our graph API (whether Graph1 with /Image and {"Keep"} or Graph2) then I'd almost suggest we simply make this a subclass of Graph.

@joshmoore
Member

I haven't finished going through the implementation yet, just the API, but this looks quite nice. Only two suggestions on possible API changes. Would you be up for writing a bin/omero usage or similar CLI plugin in order to permit testing (/cc @sbesson) before the UIs are ready for testing? (/cc @pwalczysko @gusferguson @jburel @will-moore @aleksandra-tarkowska)

Another next step may be integration tests, but that's possibly in concert with @ximenesuk and @bpindelski.

@mtbc
Member Author

mtbc commented Aug 8, 2014

I don't actually know Python; mostly I've gotten by here so far through some combination of (a) Perl 5 experience, (b) Haskell experience, and (c) reading others' code, so I've not been too sure about the little code I wrote, other than that it appeared to work. I'd be up for doing more Python coding if I can first take several days to at least skim some educational literature and experiment on that basis, but now might not be the right time. (To some extent I'll have the same issue with C++, in that I am very rusty on the beyond-C side.)

I was thinking about integration tests. I wasn't sure from the client how to do things like size the thumbnail files or pyramids (not that I looked very hard, I admit), and I don't know what the inter-platform issues are with things like line endings in import logs, so it looked like it might be tough to get right, but I would certainly be glad to help out with work on them.

@joshmoore
Member

Would be up for doing more Python coding if I can first take several days to at least skim-read some educational literature ...

Understood. Whatever works for the team. This should be a fairly thin plugin like chgrp.py assuming we can re-use the graph API (or something quite similar).

I was thinking about integration tests. I wasn't sure from the client how to do things like size the thumbnail files or pyramids (not that I looked very hard, I admit), and I don't know what the inter-platform issues are with things like line endings in import logs, so it looked like it might be tough to get right, but I would certainly be glad to help out with work on them.

I'd assume it'd be fairly easy to come up with a heuristic for the likes of thumbnails since they'll be standard 96x96 jpegs. Line-endings on import files will vary, it's true, but a touch on the RawFileStore (as proposed to fix the sha/size bug) would get you the true size of the file for calculating client-side before running the new query.

@mtbc
Member Author

mtbc commented Aug 8, 2014

Thank you, interesting. At a glance, thumbnails on my current local server vary from 771 bytes to 5737 bytes. I bet this is all solvable, though, just not by me alone with only a small effort.

@joshmoore
Member

Understood. Similar to @aleksandra-tarkowska's SendEmail & reimportFileset method additions, we'll need to come up with a pipeline for "prototype-API changes" so we can get things in promptly (to prevent conflicts) while also making sure that we don't forget to finalize the work for 5.1.0. /cc @jburel

@mtbc
Member Author

mtbc commented Aug 8, 2014

Pushed code to provide the breakdown (and some explanatory API docs). The current report is fairly simple and matches the model object graph well. Perhaps we can go for this for now, and adjust it if necessary once related client scoping is done?

@ximenesuk
Contributor

I've been doing some basic testing via Python (before the last two commits) and things look good. I'll test the breakdown on Monday.

@mtbc
Member Author

mtbc commented Aug 8, 2014

Thank you. If you have notes today of the "correct answers" for the totals then you might want to make sure that the answers remain the same on Monday. (-:

@ximenesuk
Contributor

I've tested this fairly thoroughly, at the sort of level one can test manually. I tried each of the object types in the comment and some that aren't. The values, for valid objects, always looked right bar the log file size issue. I tried overlapping objects (e.g. a Project plus a contained Dataset) to check that nothing was counted twice, multiple object ids, etc. Everything looks good. The breakdown, bytesByReferer, is useful; a count for each element, filesByReferer, and an overall totalFilesUsed might also be useful if possible. Looking at root I found:

>>> m = DiskUsage(objects={'Experimenter':[0]})
>>> prx = conn.c.sf.submit(m)
>>> print CmdCallbackI(conn.c, prx).loop(500, 500)
object #0 (::omero::cmd::DiskUsageResponse)
{
    bytesByReferer = 
    {
        key = OriginalFile
        value = 305275
    }
    totalBytesUsed = 305275
}

I assume that these are the scripts, which are not stored under the data directory?
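As a minimal client-side sanity check on such a response, the per-referer bytes can be compared with the reported total. The dict literal below merely stands in for the DiskUsageResponse printed above; with only a single referer there is nothing to share between categories, so the figures should agree exactly:

```python
# Stand-in for the DiskUsageResponse printed above; real client code would
# read the same fields off the omero.cmd response object instead.
response = {
    "bytesByReferer": {"OriginalFile": 305275},
    "totalBytesUsed": 305275,
}

breakdown = sum(response["bytesByReferer"].values())
# A single referer cannot double-count anything, so the sums must match.
assert breakdown == response["totalBytesUsed"]
```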

@jburel
Member

jburel commented Aug 15, 2014

Discussed with @ximenesuk yesterday, probably a good idea to test against a server the size of nightshade.

@ximenesuk
Contributor

gh-2945 should make that testing a little easier. Something like:

bin/omero usage ExperimenterGroup:X --report

where X is the id of some group with a lot of data. Of particular interest here is how long such a call takes.

@mtbc
Member Author

mtbc commented Aug 19, 2014

@ximenesuk, in answer to your question in #2919 (comment),

find dist/lib/scripts/omero -type f -exec ls -l '{}' '+' | awk '{ sum += $5 } END { print sum }'
305275

Yes, you're finding the standard Python scripts.

@ximenesuk
Contributor

I've extended gh-2945 to use the new response object and add a file count column. Everything looks good with minimal testing. I'll test further using the build tomorrow.

@ximenesuk
Contributor

This all looks good with more complex and overlapping arguments.

@mtbc
Member Author

mtbc commented Aug 26, 2014

@ximenesuk: Have you tested against the nightshade clone yet?

@ximenesuk
Contributor

Not tested on nightshade, but this query on trout:

bin/omero usage ExperimenterGroup:6,7,8,9 --report --units G
Using session 0d69b1c0-8d21-4dc0-8a95-f8d6bdbc6328 (user-6@trout.openmicroscopy.org:4064). Idle timeout: 10.0 min. Current group: read-annotate-1
Total disk usage: 4504 GiB in 63070 files
 component    | size (bytes)  | files 
--------------+---------------+-------
 Pixels       | 272323008634  | 14259 
 FilesetEntry | 4562577676477 | 14739 
 Thumbnail    | 49781417      | 23423 
 Job          | 388574289     | 3625  
 OriginalFile | 3460215       | 443   
 Annotation   | 17724863335   | 7837  
(6 rows)

took around 20 seconds.
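Incidentally, summing the table's columns gives slightly more than the reported totals (about 4520 GiB vs 4504 GiB, and 64,326 vs 63,070 files). A plausible explanation, though only a guess here, is that a file referenced by more than one component is counted once in the totals but once per component in the breakdown:

```python
# Figures transcribed from the trout report above.
bytes_by_component = {
    "Pixels": 272323008634,
    "FilesetEntry": 4562577676477,
    "Thumbnail": 49781417,
    "Job": 388574289,
    "OriginalFile": 3460215,
    "Annotation": 17724863335,
}
files_by_component = {
    "Pixels": 14259,
    "FilesetEntry": 14739,
    "Thumbnail": 23423,
    "Job": 3625,
    "OriginalFile": 443,
    "Annotation": 7837,
}

GiB = 2 ** 30
print(round(sum(bytes_by_component.values()) / GiB))  # 4520, vs 4504 reported
print(sum(files_by_component.values()))               # 64326, vs 63070 reported
```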

@mtbc
Member Author

mtbc commented Aug 27, 2014

@jburel: Can you suggest a suitable server of nightshade's size?

@mtbc
Member Author

mtbc commented Aug 27, 2014

(Or the 4½TB trout server may have sufficed.)

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#419. See the console output for more details.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#420. See the console output for more details.

@mtbc
Member Author

mtbc commented Aug 29, 2014

Conflict appears to be with #2403.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#422. See the console output for more details.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#424. See the console output for more details.

@mtbc
Member Author

mtbc commented Sep 2, 2014

Compared with trout's 23,423 thumbnails, @aleksandra-tarkowska reports that nightshade has nearly 200,000. I don't know how to test at anything like that scale, though, given that the disk usage calculation expects access to the actual files (pyramid sizes, etc.), not just the database, and we don't have a suitable test server. Having said that, I don't know how I could speed the calculation up much anyway without having it cache old sizes, which could be bad for the use case of "test the size, delete some stuff, test the size again".

Inspection of the code should confirm that the time complexity is reasonable (though of course more data means more cache misses) and that the memory requirements are low (the only data held in quantity is Long instances).

@ximenesuk
Contributor

Assuming a roughly linear response, and that everything else is broadly an order of magnitude larger on nightshade, that would give around 3 minutes to return the disk usage of all groups in one shot.
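That back-of-envelope estimate is just linear scaling of the trout timing; the factor of ten is the assumed size difference, not a measurement:

```python
trout_seconds = 20  # observed above for groups 6-9 on trout
scale = 10          # assumption: nightshade is roughly an order of magnitude larger
estimate_seconds = trout_seconds * scale
print(estimate_seconds / 60)  # about 3.3, i.e. "around 3 minutes"
```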

@mtbc mtbc mentioned this pull request Sep 3, 2014
@joshmoore
Member

What remains to be done here? As with @aleksandra-tarkowska's SendEmail PR, we have a chicken-n-egg issue with new API and users to drive it. Assuming no suggestions for API modification remain, I'd say let's get it in, and finish @ximenesuk's PR which should shake out any issues.

@mtbc
Member Author

mtbc commented Sep 3, 2014

Yes, I'd also say, get it in, that will make it easier for @ximenesuk to develop against it. (@aleksandra-tarkowska will then have to merge develop into #2403 but that should be pretty easy.)

@joshmoore
Member

Marked for a review of the API based on the CLI usage. Merging.

joshmoore added a commit that referenced this pull request Sep 3, 2014
@joshmoore joshmoore merged commit 3667342 into ome:develop Sep 3, 2014
@mtbc mtbc deleted the trac-7240-disk-usage-count branch September 3, 2014 13:48
@mtbc mtbc mentioned this pull request Sep 3, 2014
@sbesson sbesson added this to the 5.1.0-m1 milestone Oct 14, 2014

7 participants