Detect out of space errors with posix_fallocate #445

razeh · 2018-12-19T21:34:49Z

By switching from ftruncate to posix_fallocate, we error out early if there isn't enough disk space for the incoming file. With ftruncate, a sparse file is created and then mmapped. As the bytes arrive over the network and the file is populated, we get a SIGBUS when the kernel is unable to allocate space on the disk.
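
As a rough illustration of the change described above (not the actual patch), the sketch below reserves the blocks with posix_fallocate before mapping the file, so a full disk shows up as an error code (ENOSPC) up front instead of a SIGBUS later. The helper name `reserve_and_map` is hypothetical and does not come from rcopy.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from rcopy): reserve `len` bytes of real disk
 * space for `fd`, then map the file read/write.
 * Returns the mapping, or MAP_FAILED with errno set.
 */
static void *reserve_and_map(int fd, off_t len)
{
	/*
	 * Unlike ftruncate, posix_fallocate allocates the blocks now.
	 * It returns the error number directly and does not set errno,
	 * so copy the value into errno for the caller.
	 */
	int err = posix_fallocate(fd, 0, len);

	if (err) {
		errno = err;	/* e.g. ENOSPC when the disk is full */
		return MAP_FAILED;
	}

	return mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, 0);
}
```

A caller would check for `MAP_FAILED` and report `errno` to the peer before any data is transferred. The cost, as raised in the next comment, is that the up-front allocation can take a long time on some filesystems.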

jgunthorpe · 2018-12-19T21:49:48Z

falloc can be really slow on some filesystems, are you sure this is worthwhile?

razeh · 2018-12-19T21:53:15Z

For my application, I'd really like to make sure that the file made it across the network. I'm open to alternatives; the only one that I've come up with so far is a signal handler. Are there any other suggestions?

jgunthorpe · 2018-12-19T23:28:39Z

Positive acknowledgement on success?

razeh · 2018-12-20T15:00:52Z

The server does send an acknowledgement on success -- although the status isn't reflected in the client's exit code. But when there isn't enough space on disk the server gets a SIGBUS and then exits before it can send anything.

razeh · 2018-12-20T16:44:03Z

I'd hoped that a SIGBUS signal handler that set a flag to tell _recv to return early would work, but it does not. The program stays stuck in rrecv after the SIGBUS is sent, so _recv never gets to do an early return.

jgunthorpe · 2018-12-20T21:54:10Z

SIGBUS handlers are really tricky to make.

It sounds like you have everything you need, just make the sender fail if it doesn't get a positive ack.

razeh · 2018-12-20T22:09:19Z

Yes, but to fail if I don't get a positive ack would require a timeout, and configuring that is somewhat problematic. Since I'm only using this in my own environment, and the posix_fallocate call isn't slow for my file systems, I'll just go with a local fork that uses posix_fallocate, and a slightly modified client that returns the status code it gets.
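
For reference, the timeout approach mentioned above could look roughly like the sketch below. It uses plain poll() and read() on an ordinary file descriptor so that it is self-contained; with rsockets the analogous calls would presumably be rpoll() and rrecv(). The helper name `wait_for_ack` and the one-byte status format are assumptions made for illustration, not rcopy's actual wire format.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to `timeout_ms` for a one-byte status
 * from the peer. Returns 0 and stores the byte in *status on success.
 * Returns -1 with errno set to ETIMEDOUT on timeout, ECONNRESET if the
 * peer closed without sending anything, or the underlying error.
 */
static int wait_for_ack(int fd, int timeout_ms, uint8_t *status)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t got;
	int n;

	/* Retry if interrupted by a signal. Note that each retry waits
	 * for the full timeout again; this sketch does not track the
	 * time already spent. */
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return -1;
	if (n == 0) {
		errno = ETIMEDOUT;
		return -1;
	}

	got = read(fd, status, 1);
	if (got == 1)
		return 0;
	if (got == 0)
		errno = ECONNRESET;	/* peer went away before acking */
	return -1;
}
```

The client would then map a failure from this helper, or a non-zero status, to a non-zero exit code. Choosing `timeout_ms` is the hard part: it has to cover the slowest legitimate transfer, which is the configuration problem described above.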

jgunthorpe · 2018-12-20T22:10:27Z

CM should give you an immediate connection close after the SIGBUS, and since an Ack wasn't received that should be enough to report failure.

razeh · 2018-12-20T22:54:36Z

I am not seeing an immediate connection close on the client when the server gets the SIGBUS; I'm seeing the client hang after the server crashes.

-bash-4.2$ /bin/rcopy
waiting for connection...client: 10.78.255.223
opening: g, transferring...25936828560 bytes...Bus error

And then the client simply hangs, never getting to the close:

/bin/rcopy g tcs-login-2:g
opening...transferring...

And a pstack shows that the client is stuck inside of the rsend:

-bash-4.2$ pstack 40738
#0 0x00002aaaab1127e0 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x00002aaaaaef9043 in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
#2 0x00002aaaaacdcd2d in rs_get_comp.constprop () from /usr/lib64/librdmacm.so.1
#3 0x00002aaaaace0828 in rsend () from /usr/lib64/librdmacm.so.1
#4 0x0000000000401942 in main ()

jgunthorpe · 2018-12-20T23:04:27Z

Looks like a bug in rsockets.

razeh closed this Dec 21, 2018
