there should be an mdu command #143

Open
bahamas10 opened this issue Nov 1, 2013 · 13 comments

Comments

@bahamas10
Contributor

client.ftw is the perfect API for creating a du-like command: something that can calculate usage on a per-file or per-directory basis.

I may start working on this, so I'm opening this issue as more of a placeholder than anything.

@bahamas10
Contributor Author

Bonus: by default it should calculate usage by adding up the size attributes but, with a command-line switch, calculate Manta usage by adding up size * copies.
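As a rough illustration of the two modes, here is a minimal sketch. It assumes each entry carries a numeric `size` and a `durability` (copy count), modeled on the fields that `mls -j` prints; the function name `sumUsage` and the `countCopies` option are hypothetical and not part of node-manta.

```javascript
// Sketch only: `sumUsage` and `countCopies` are hypothetical names.
// Entries are assumed to look like `mls -j` output: objects have a
// numeric `size` (bytes) and `durability` (number of copies).
function sumUsage(entries, opts) {
  const countCopies = Boolean(opts && opts.countCopies);
  let total = 0;
  for (const entry of entries) {
    if (entry.type !== 'object') continue; // directories carry no size here
    const copies = countCopies ? (entry.durability || 1) : 1;
    total += entry.size * copies;
  }
  return total;
}

const sampleEntries = [
  { name: 'a.gz', type: 'object', size: 229535627, durability: 2 },
  { name: 'a.imgmanifest', type: 'object', size: 1052, durability: 2 },
  { name: 'subdir', type: 'directory' }
];
console.log('logical bytes:', sumUsage(sampleEntries));
console.log('bytes * copies:', sumUsage(sampleEntries, { countCopies: true }));
```

The default mode matches what a plain du-style sum of sizes would report; the switch multiplies each object by its copy count.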

@davepacheco
Contributor

davepacheco commented Jun 27, 2016

This would be really valuable. Ideally, the tool should take into account both (1) the number of copies (as reported from the HTTP headers), as well as (2) the physical size used, which you can get by calling stat(1) or stat(2) from a job and looking at the number of blocks. It could report both logical size and physical size. It would be really neat if it had an option to export in ncdu's format, which means we could import it into that tool to help visualize space used.

The way I'd probably go about this is to have the tool run an mfind on its arguments looking only for objects and then run a two-phase job on those objects. The first phase would be a C program that just calls stat(2) and reports the object name, the logical size, the number of blocks used, the size of each block, and anything else that would be useful there. The second phase would be a reduce phase that would sort the inputs by object name and then create the JSON structure that ncdu expects.
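To make the reduce phase more concrete, here is a minimal sketch of turning flat per-object records into a nested structure. The record fields (`name` as a path relative to the root, `asize`, `dsize`) and the function name `buildNcduExport` are assumptions for illustration. The output shape follows the ncdu JSON export layout as commonly documented (a top-level array of major version, minor version, metadata, and a root directory, where each directory is an array whose first element describes the directory itself); check it against the ncdu documentation before relying on it.

```javascript
// Sketch only: record fields and function name are hypothetical.
// Directories are arrays: [ {name}, child, child, ... ].
// Files are plain objects: { name, asize, dsize }.
function buildNcduExport(rootName, records, timestamp) {
  const root = [{ name: rootName }];
  const dirs = new Map([['', root]]); // relative dir path -> directory node

  // Return the node for a relative directory path, creating parents as needed.
  function dirNode(path) {
    if (dirs.has(path)) return dirs.get(path);
    const idx = path.lastIndexOf('/');
    const parent = dirNode(idx === -1 ? '' : path.slice(0, idx));
    const node = [{ name: path.slice(idx + 1) }];
    parent.push(node);
    dirs.set(path, node);
    return node;
  }

  // Sort by object name, as the reduce phase described above would.
  const sorted = records.slice().sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0);

  for (const rec of sorted) {
    const idx = rec.name.lastIndexOf('/');
    const parent = dirNode(idx === -1 ? '' : rec.name.slice(0, idx));
    parent.push({
      name: rec.name.slice(idx + 1),
      asize: rec.asize, // apparent (logical) size in bytes
      dsize: rec.dsize  // disk (physical) size in bytes
    });
  }

  const meta = { progname: 'mdu-sketch', progver: '0.0', timestamp: timestamp };
  return [1, 0, meta, root];
}

const exported = buildNcduExport('/user/stor', [
  { name: 'b/y', asize: 10, dsize: 512 },
  { name: 'a', asize: 1, dsize: 512 },
  { name: 'b/x', asize: 5, dsize: 1024 }
], 1467047825);
console.log(JSON.stringify(exported));
```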

The trickiest part of that is probably dealing with the encoding between phases. If we just make the mapper a Node program, then we could just JSON.stringify the name, or even the entire record, but the startup cost would be substantially larger. That may not matter too much, given Manta's parallelism.
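A minimal sketch of the Node option mentioned above: emit one JSON document per line from the mapper and parse line by line in the reducer. Because JSON.stringify escapes newlines and other control characters inside strings, an object name containing them cannot break the line framing. The record fields (`name`, `size`, `blocks`, `blockSize`) and the helper names are hypothetical.

```javascript
// Sketch only: helper names and record fields are hypothetical.

// Mapper side: serialize one record as a single line of JSON.
function encodeRecord(rec) {
  return JSON.stringify(rec) + '\n';
}

// Reducer side: split the stream on newlines and parse each non-empty line.
function decodeRecords(text) {
  return text
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// A name with an embedded newline and tab still round-trips intact.
const tricky = {
  name: '/user/stor/odd\nname\twith spaces',
  size: 15,
  blocks: 1,
  blockSize: 512
};
const wire = encodeRecord(tricky) + encodeRecord({ name: '/user/stor/plain', size: 3, blocks: 1, blockSize: 512 });
const decoded = decodeRecords(wire);
console.log(decoded.length, decoded[0].name === tricky.name);
```

The same framing would be harder to get right from a C mapper, which is the trade-off against startup cost described above.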

@trentm
Contributor

trentm commented Jun 27, 2016

[trent.mick@us-east /trent.mick/stor/tmp/snap]$ cat afile
this is a file
[trent.mick@us-east /trent.mick/stor/tmp/snap]$ cat samecontent
this is a file
[trent.mick@us-east /trent.mick/stor/tmp/snap]$ ln afile snaplink
[trent.mick@us-east /trent.mick/stor/tmp/snap]$ ^D

$ minfo /trent.mick/stor/tmp/snap/afile
HTTP/1.1 200 OK
etag: 3c53eadf-0fa8-c3c0-c7de-db74fbecdd01
last-modified: Mon, 27 Jun 2016 17:16:29 GMT
durability-level: 2
content-length: 15
content-md5: JrtzVWzrMqXfMLczxTVe5Q==
content-type: text/plain
date: Mon, 27 Jun 2016 17:17:05 GMT
server: Manta
x-request-id: 350a0f51-9e6e-4c96-9aa9-b4ab3ca6e829
x-response-time: 24
x-server-name: 39adec6c-bded-4a14-9d80-5a8bfc1121f9
connection: keep-alive
x-request-received: 1467047825291
x-request-processing-time: 369

$ minfo /trent.mick/stor/tmp/snap/snaplink
HTTP/1.1 200 OK
etag: 3c53eadf-0fa8-c3c0-c7de-db74fbecdd01
last-modified: Mon, 27 Jun 2016 17:16:55 GMT
durability-level: 2
content-length: 15
content-md5: JrtzVWzrMqXfMLczxTVe5Q==
content-type: text/plain
date: Mon, 27 Jun 2016 17:17:09 GMT
server: Manta
x-request-id: a24a4ffa-0799-4a04-968d-c64e10894cb7
x-response-time: 10
x-server-name: 39adec6c-bded-4a14-9d80-5a8bfc1121f9
connection: keep-alive
x-request-received: 1467047829543
x-request-processing-time: 343

$ minfo /trent.mick/stor/tmp/snap/samecontent
HTTP/1.1 200 OK
etag: 3be48426-f15c-4938-9a20-c66a318093b0
last-modified: Mon, 27 Jun 2016 17:16:39 GMT
durability-level: 2
content-length: 15
content-md5: JrtzVWzrMqXfMLczxTVe5Q==
content-type: text/plain
date: Mon, 27 Jun 2016 17:17:29 GMT
server: Manta
x-request-id: 410af1ea-ccbf-4ab5-be22-a6e025e170e3
x-response-time: 11
x-server-name: 3d2b5d91-5cd9-4123-89a5-794f44eab9fd
connection: keep-alive
x-request-received: 1467047849735
x-request-processing-time: 422

Note that the etag for a snaplink'd file is the same. I wonder if it is
semantically correct to derive an "ino" value for the ncdu format from this
etag, as a way to indicate to ncdu that the snaplink is the equivalent of a
hardlink in terms of not taking extra space.
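A minimal sketch of that idea: count each distinct etag once so that a snaplink does not add to the total. This assumes the etag behaves like an object identifier, which is an observation from the output above and not an established interface; the function name `sumUniqueByEtag` is hypothetical.

```javascript
// Sketch only: treats `etag` as an object identity, which is an assumption
// based on the minfo output above, not a documented guarantee.
function sumUniqueByEtag(entries) {
  const seen = new Set();
  let total = 0;
  for (const entry of entries) {
    if (seen.has(entry.etag)) continue; // same underlying object, count once
    seen.add(entry.etag);
    total += entry.size;
  }
  return total;
}

// The three objects from the transcript above, 15 bytes each.
const snapEntries = [
  { name: 'afile', etag: '3c53eadf-0fa8-c3c0-c7de-db74fbecdd01', size: 15 },
  { name: 'snaplink', etag: '3c53eadf-0fa8-c3c0-c7de-db74fbecdd01', size: 15 },
  { name: 'samecontent', etag: '3be48426-f15c-4938-9a20-c66a318093b0', size: 15 }
];
console.log(sumUniqueByEtag(snapEntries));
```

Here afile and snaplink share an etag and are counted once, while samecontent, a separate upload of the same bytes, is counted separately.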

@davepacheco
Contributor

That's a good point, although I don't know if we want to commit to that. I think it would be reasonable if an ETag for the same content was the same, even if it was a separate copy.

@trentm
Contributor

trentm commented Jun 27, 2016

FWIW, obj.etag appears to be the metadata objectId: https://github.com/joyent/manta-muskie/blob/054fcc04fe724a0e319d6dedb185632c0f0c61bf/lib/obj.js#L520

Not sure that is a promised interface.

> I think it would be reasonable if an Etag for the same content was the same, even if it was a separate copy.

Shouldn't it capture other metadata as well? content-type, 'm-*' headers.

@davepacheco
Contributor

> Shouldn't it capture other metadata as well? content-type, 'm-*' headers.

Possibly. I'd have to review RFC 2616. Even if so, is it plausible that an implementation could allow separate copies of the same data with the same headers to have the same etag?

On the other hand, unless we add a new header for object-id, I'm not sure how else mdu could deal with snaplinks.

@trentm
Contributor

trentm commented Jun 28, 2016

@davepacheco I wonder if it would be faster to just do directory listings and use the "size" and "durability" fields for listed files:

$ mls -j ~~/stor/tmp
{"name":"5f67e820-1489-4db7-9df2-1d8e3ec5cd90-file.gz","etag":"142ad91b-73d8-6cb4-9cd9-efacf7df7a9a","size":229535627,"type":"object","mtime":"2014-10-08T22:53:25.146Z","durability":2,"parent":"/trent.mick/stor/tmp"}
{"name":"5f67e820-1489-4db7-9df2-1d8e3ec5cd90.imgmanifest","etag":"88ac47b9-e53f-c065-b446-e2d0455c0c00","size":1052,"type":"object","mtime":"2014-10-08T22:52:44.298Z","durability":2,"parent":"/trent.mick/stor/tmp"}
...

That is logical instead of physical block usage, which is a bit of a departure from standard du. I don't know if that limitation would be unacceptable.
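A minimal sketch of the listing-based approach: parse the newline-delimited JSON that `mls -j` prints and total the "size" and "durability" fields per parent directory. The function name `usageByParent` is hypothetical, and the sample lines are trimmed to the fields used here.

```javascript
// Sketch only: `usageByParent` is a hypothetical helper. Input is assumed
// to be newline-delimited JSON in the shape shown in the listing above.
function usageByParent(ndjson) {
  const totals = new Map(); // parent path -> { logical, withCopies }
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue;
    const entry = JSON.parse(line);
    if (entry.type !== 'object') continue; // skip directories
    const sum = totals.get(entry.parent) || { logical: 0, withCopies: 0 };
    sum.logical += entry.size;
    sum.withCopies += entry.size * entry.durability;
    totals.set(entry.parent, sum);
  }
  return totals;
}

const listing = [
  '{"name":"5f67e820-1489-4db7-9df2-1d8e3ec5cd90-file.gz","size":229535627,"type":"object","durability":2,"parent":"/trent.mick/stor/tmp"}',
  '{"name":"5f67e820-1489-4db7-9df2-1d8e3ec5cd90.imgmanifest","size":1052,"type":"object","durability":2,"parent":"/trent.mick/stor/tmp"}'
].join('\n');

for (const [dir, sum] of usageByParent(listing)) {
  console.log(dir, sum.logical, sum.withCopies);
}
```

This reports logical bytes only; it says nothing about the blocks actually consumed on disk.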

@davepacheco
Contributor

Yes, a tool that looked at logical usage would be much easier to build and would run much faster. For end users, that's probably more appropriate, too. But as operators, we've wasted lots of time in the past clearing out usage of lots of logical space that freed up very little physical space, and I really don't want to do that again. Ideally, this tool would have two modes, but it's the slower-running, harder-to-build one that we really want at the moment.
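To illustrate why the two modes can disagree, here is a minimal sketch that totals logical and physical bytes from per-object records. The record fields (`size`, `blocks`, `blockSize`) are hypothetical and assume the job reports the block count together with the unit it is measured in; the sample numbers are made up for illustration.

```javascript
// Sketch only: field names and sample values are hypothetical.
function summarizeUsage(records) {
  let logical = 0;
  let physical = 0;
  for (const rec of records) {
    logical += rec.size;                    // bytes as seen by the client
    physical += rec.blocks * rec.blockSize; // bytes actually allocated
  }
  return { logical: logical, physical: physical };
}

// Two objects with the same logical size but very different disk footprints.
const report = summarizeUsage([
  { name: 'highly-compressible', size: 1048576, blocks: 16, blockSize: 512 },
  { name: 'incompressible', size: 1048576, blocks: 2048, blockSize: 512 }
]);
console.log(report);
```

Deleting the first object would free half of the logical total but well under one percent of the physical total, which is the situation operators want to detect before cleaning up.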

@trentm trentm self-assigned this Jul 14, 2016
@trentm
Contributor

trentm commented Jul 14, 2016

I've started work on this... might still be a while tho.

@tebbers

tebbers commented Aug 22, 2016

@davepacheco I agree that a du-like tool would be awesome. Having had to go through this to try to understand where "all the disk space went", I'm wondering whether it would be easier (or even possible) to expose the compression ratio per file in the directory listing. Then, if one file was compressed 1x and another 6x, I think it would get you to the physical sizing a "bit" easier.

@davepacheco
Contributor

@tebbers That would definitely be nice, but it's surprisingly difficult. The problem is that ZFS doesn't appear to calculate the correct number of physical bytes used until the transaction group that created the file has been written out, which is likely well after the object's metadata has been committed. We could populate this information asynchronously, but that's non-trivial itself and leaves a window shortly after object creation where the information would be incorrect.

@tebbers

tebbers commented Aug 22, 2016

@davepacheco Do you have any pointers? I'm trying to build a du -ks equivalent in shell scripts using mfind and mget, but it's pretty time-consuming.

@davepacheco
Contributor

I've been using manta-mdu, which is close to the point where it could be polished and brought into node-manta.

@trentm trentm removed their assignment Sep 30, 2020