
fix #7240: provide disk usage count #2919

Merged
merged 9 commits into from Sep 3, 2014

Conversation

mtbc
Member

@mtbc mtbc commented Aug 7, 2014

This PR fixes http://trac.openmicroscopy.org.uk/ome/ticket/7240. Testing is best done by people who know their way around the model graph objects and omero.data.dir. Pick some model objects, like omero.cmd.DiskUsage(objects={'Image' : [1]}, includeAnnotations=True), and see what answer you get from submitting the requests to the server.

The disk usage calculated should include any pyramids, thumbnails, archived files, import logs, other images from the fileset, etc. entailed by the model objects, and, for includeAnnotations=True, any file annotations on those objects, including, for instance, an annotation on a well whose screen you specified. A file should never be double-counted.

In code review, scrutinize the population of DiskUsageI.TRAVERSAL_QUERIES.

Caveats:

Testing tips:

If you want to see the DEBUG messages in Blitz-0.log -- very handy for figuring out which files are being counted and how large the server thinks they are -- then not only will you have to adjust etc/logback.xml to log omero.cmd.graphs at DEBUG, you will also have to comment out the filter that denies omero.* debug logging. Thanks to @ximenesuk for that workaround.

To count up the bytes used in uploaded files, I was using lines like,

find ManagedRepository/user-2_3/2014-08/07/11-52-21.241 -type f -exec ls -l '{}' '+' | awk '{ sum += $5 } END { print sum }'

or, for a range of thumbnails (e.g., well samples for a plate),

ls -l Thumbnails/{25..40} | awk '{ sum += $5 } END { print sum }'
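The same byte-counting can be done portably in Python; this is a hypothetical helper, not part of the PR, but since `ls -l`'s fifth column is the file size it should agree with the pipelines above:

```python
import os


def du_bytes(paths):
    """Sum the byte sizes (st_size) of the given files, recursing into
    directories -- equivalent to the find/ls/awk pipelines above."""
    total = 0
    for path in paths:
        if os.path.isdir(path):
            for dirpath, _dirnames, filenames in os.walk(path):
                for name in filenames:
                    total += os.stat(os.path.join(dirpath, name)).st_size
        else:
            total += os.stat(path).st_size
    return total
```

For example, `du_bytes(["Thumbnails"])` from inside omero.data.dir would total all thumbnail files.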

--no-rebase

@jburel jburel added the develop label Aug 7, 2014
@mtbc mtbc mentioned this pull request Aug 7, 2014
@mtbc
Member Author

mtbc commented Aug 7, 2014

Kick travis please?

@melissalinkert
Member

Done.

@mtbc
Member Author

mtbc commented Aug 7, 2014

Compiling 8803 source files to /home/travis/build/openmicroscopy/openmicroscopy/components/blitz/target/classes

The command "./components/tools/travis-build $BUILD" exited with 247.

One of the usual ones -- travis-ci/travis-ci#2507. Kick again?

@mtbc
Member Author

mtbc commented Aug 7, 2014

Thank you, third run-through passed. (-:

**/
class DiskUsage extends Request {
    omero::api::IdListMap objects;
    bool includeAnnotations;
Member

Under what conditions would one want to exclude the annotations? Should this be excludeAnnotations so that if unset, you get the value by default?

Member Author

For instance, deleting images may (typically?) not delete their annotations. Happy to switch to an excludeAnnotations if inclusion is the preferable default (or remove the option altogether).

Member

Ah, that's an interesting point and gets towards what we consider to be the definitive graph. So, agreed that we should have them in sync (also for chgrp, export, etc), but my assumption is that many annotations should be included by default. Perhaps something that has to be refined afterwards. If we were confident in our graph API (whether Graph1 with /Image and {"Keep"} or Graph2) then I'd almost suggest we simply make this a subclass of Graph.

@joshmoore
Member

I haven't finished going through the implementation yet, just the API, but this looks quite nice. Only two suggestions on possible API changes. Would you be up for writing a bin/omero usage or similar CLI plugin in order to permit testing (/cc @sbesson) before the UIs are ready for testing? (/cc @pwalczysko @gusferguson @jburel @will-moore @aleksandra-tarkowska)

Another next step may be integration tests, but that's possibly in concert with @ximenesuk and @bpindelski.

@mtbc
Member Author

mtbc commented Aug 8, 2014

I don't actually know Python; mostly I've gotten by here so far through some combination of (a) Perl 5 experience, (b) Haskell experience, and (c) reading others' code, so I've not been too sure about the little code I wrote, other than that it appeared to work. I'd be up for doing more Python coding if I can first take several days to at least skim some educational literature and experiment on that basis, but now might not be the right time. (To some extent I'll have the same issue with C++, in that I am very rusty on the beyond-C side.)

I was thinking about integration tests. I wasn't sure from the client how to do things like size the thumbnail files or pyramids (not that I looked very hard, I admit), and I don't know what the inter-platform issues are with things like line endings in import logs, so it looked like it might be tough to get right, but I would certainly be glad to help out with work on them.

@joshmoore
Member

Would be up for doing more Python coding if I can first take several days to at least skim-read some educational literature ...

Understood. Whatever works for the team. This should be a fairly thin plugin like chgrp.py assuming we can re-use the graph API (or something quite similar).

I was thinking about integration tests. I wasn't sure from the client how to do things like size the thumbnail files or pyramids (not that I looked very hard, I admit), and I don't know what the inter-platform issues are with things like line endings in import logs, so it looked like it might be tough to get right, but I would certainly be glad to help out with work on them.

I'd assume it'd be fairly easy to come up with a heuristic for the likes of thumbnails since they'll be standard 96x96 jpegs. Line-endings on import files will vary, it's true, but a touch on the RawFileStore (as proposed to fix the sha/size bug) would get you the true size of the file for calculating client-side before running the new query.

@mtbc
Member Author

mtbc commented Aug 8, 2014

Thank you, interesting. At a glance, thumbnails on my current local server vary from 771 bytes to 5737 bytes. I bet this is all solvable, though, just not by me alone with only a small effort.

@joshmoore
Member

Understood. Similar to @aleksandra-tarkowska's SendEmail & reimportFileset method additions, we'll need to come up with a pipeline for "prototype-API changes" so we can get things in promptly (to prevent conflicts) while also making sure that we don't forget to finalize the work for 5.1.0. /cc @jburel

@mtbc
Member Author

mtbc commented Aug 8, 2014

Pushed code to provide the breakdown (and some explanatory API docs). The current report is fairly simple and matches the model object graph well. Perhaps we can go for this for now, and adjust it if necessary once related client scoping is done?

@ximenesuk
Contributor

I've been doing some basic testing via Python (before the last two commits) and things look good. I'll test the breakdown on Monday.

@mtbc
Member Author

mtbc commented Aug 8, 2014

Thank you. If you have notes today of the "correct answers" for the totals then you might want to make sure that the answers remain the same on Monday. (-:

@ximenesuk
Contributor

I've tested this fairly thoroughly, at the sort of level one can test manually. I tried each of the object types in the comment and some that aren't. The values, for valid objects, always looked right bar the log file size issue. I tried overlapping objects (e.g. a Project plus a contained Dataset) to check that nothing was counted twice, multiple object ids, etc. Everything looks good. The breakdown, bytesByReferer, is useful; a count for each element, filesByReferer, and an overall totalFilesUsed might also be useful if possible. Looking at root I found:

>>> m = DiskUsage(objects={'Experimenter':[0]})
>>> prx = conn.c.sf.submit(m)
>>> print CmdCallbackI(conn.c, prx).loop(500, 500)
object #0 (::omero::cmd::DiskUsageResponse)
{
    bytesByReferer = 
    {
        key = OriginalFile
        value = 305275
    }
    totalBytesUsed = 305275
}

I assume that these are the scripts, which are not stored under the data directory?
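As a minimal client-side sanity check on such a response, the per-referer bytes can be compared with the reported total. The dict literal below merely stands in for the DiskUsageResponse printed above; with only a single referer there is nothing to share between categories, so the figures should agree exactly:

```python
# Stand-in for the DiskUsageResponse printed above; real client code would
# read the same fields off the omero.cmd response object instead.
response = {
    "bytesByReferer": {"OriginalFile": 305275},
    "totalBytesUsed": 305275,
}

breakdown = sum(response["bytesByReferer"].values())
# A single referer cannot double-count anything, so the sums must match.
assert breakdown == response["totalBytesUsed"]
```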

@jburel
Member

jburel commented Aug 15, 2014

Discussed with @ximenesuk yesterday, probably a good idea to test against a server the size of nightshade.

@ximenesuk
Contributor

gh-2945 should make that testing a little easier. Something like:

bin/omero usage ExperimenterGroup:X --report

where X is the id of some group with a lot of data. Of particular interest here is how long such a call takes.

@mtbc
Member Author

mtbc commented Aug 19, 2014

@ximenesuk, in answer to your question in #2919 (comment),

find dist/lib/scripts/omero -type f -exec ls -l '{}' '+' | awk '{ sum += $5 } END { print sum }'
305275

Yes, you're finding the standard Python scripts.

@ximenesuk
Contributor

I've extended gh-2945 to use the new response object and add a file count column. Everything looks good with minimal testing. I'll test further using the build tomorrow.

@ximenesuk
Contributor

This all looks good with more complex and overlapping arguments.

@mtbc
Member Author

mtbc commented Aug 26, 2014

@ximenesuk: Have you tested against the nightshade clone yet?

@ximenesuk
Contributor

Not tested on nightshade, but this query on trout:

bin/omero usage ExperimenterGroup:6,7,8,9 --report --units G
Using session 0d69b1c0-8d21-4dc0-8a95-f8d6bdbc6328 (user-6@trout.openmicroscopy.org:4064). Idle timeout: 10.0 min. Current group: read-annotate-1
Total disk usage: 4504 GiB in 63070 files
 component    | size (bytes)  | files 
--------------+---------------+-------
 Pixels       | 272323008634  | 14259 
 FilesetEntry | 4562577676477 | 14739 
 Thumbnail    | 49781417      | 23423 
 Job          | 388574289     | 3625  
 OriginalFile | 3460215       | 443   
 Annotation   | 17724863335   | 7837  
(6 rows)

took around 20 seconds.
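Incidentally, summing the table's columns gives slightly more than the reported totals (about 4520 GiB vs 4504 GiB, and 64,326 vs 63,070 files). A plausible explanation, though only a guess here, is that a file referenced by more than one component is counted once in the totals but once per component in the breakdown:

```python
# Figures transcribed from the trout report above.
bytes_by_component = {
    "Pixels": 272323008634,
    "FilesetEntry": 4562577676477,
    "Thumbnail": 49781417,
    "Job": 388574289,
    "OriginalFile": 3460215,
    "Annotation": 17724863335,
}
files_by_component = {
    "Pixels": 14259,
    "FilesetEntry": 14739,
    "Thumbnail": 23423,
    "Job": 3625,
    "OriginalFile": 443,
    "Annotation": 7837,
}

GiB = 2 ** 30
print(round(sum(bytes_by_component.values()) / GiB))  # 4520, vs 4504 reported
print(sum(files_by_component.values()))               # 64326, vs 63070 reported
```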

@mtbc
Member Author

mtbc commented Aug 27, 2014

@jburel: Can you suggest a suitable server of nightshade's size?

@mtbc
Member Author

mtbc commented Aug 27, 2014

(Or the 4½TB trout server may have sufficed.)

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#419. See the console output for more details.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#420. See the console output for more details.

@mtbc
Member Author

mtbc commented Aug 29, 2014

Conflict appears to be with #2403.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#422. See the console output for more details.

@snoopycrimecop
Member

Conflicting PR. Removed from build OME-5.1-merge-push#424. See the console output for more details.

@mtbc
Member Author

mtbc commented Sep 2, 2014

Compared with trout's 23,423 thumbnails, @aleksandra-tarkowska reports that nightshade has nearly 200,000. I don't know how to test at anything like that scale, though, given that the disk usage calculation expects access to the actual files (pyramid sizes, etc.), not just the database, and we don't have a suitable test server. Having said that, I don't know how I could speed the calculation up much anyway without having it cache old sizes, which could be bad for the use case of "test the size, delete some stuff, test the size again".

Inspection of the code should confirm that the time complexity is reasonable (though of course more data means more cache misses) and that the memory requirements are low (the only data held in quantity is Long instances).

@ximenesuk
Contributor

Assuming a roughly linear response, and that everything else is broadly an order of magnitude larger on nightshade, that would give around 3 minutes to return the disk usage of all groups in one shot.
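That back-of-envelope estimate is just linear scaling of the trout timing; the factor of ten is the assumed size difference, not a measurement:

```python
trout_seconds = 20  # observed above for groups 6-9 on trout
scale = 10          # assumption: nightshade is roughly an order of magnitude larger
estimate_seconds = trout_seconds * scale
print(estimate_seconds / 60)  # about 3.3, i.e. "around 3 minutes"
```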

@mtbc mtbc mentioned this pull request Sep 3, 2014
@joshmoore
Member

What remains to be done here? As with @aleksandra-tarkowska's SendEmail PR, we have a chicken-n-egg issue with new API and users to drive it. Assuming no suggestions for API modification remain, I'd say let's get it in, and finish @ximenesuk's PR which should shake out any issues.

@mtbc
Member Author

mtbc commented Sep 3, 2014

Yes, I'd also say, get it in, that will make it easier for @ximenesuk to develop against it. (@aleksandra-tarkowska will then have to merge develop into #2403 but that should be pretty easy.)

@joshmoore
Member

Marked for a review of the API based on the CLI usage. Merging.

joshmoore added a commit that referenced this pull request Sep 3, 2014
@joshmoore joshmoore merged commit 3667342 into ome:develop Sep 3, 2014
@mtbc mtbc deleted the trac-7240-disk-usage-count branch September 3, 2014 13:48
@mtbc mtbc mentioned this pull request Sep 3, 2014
@sbesson sbesson added this to the 5.1.0-m1 milestone Oct 14, 2014

7 participants