@razeh

@razeh razeh commented Dec 19, 2018

By switching from ftruncate to posix_fallocate we error out early if
there isn't enough disk space for the incoming file. If we use
ftruncate a sparse file is created that is then mmapped. As the bytes
come over the network and the file is populated we get a SIGBUS when
the kernel is unable to allocate space on the disk.
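The idea can be sketched roughly as follows. This is not the patch itself: `map_preallocated` is a hypothetical helper name, and the `PROT_WRITE`/`MAP_SHARED` flags are assumptions about how the destination file is mapped. The point is only that `posix_fallocate` reports a full disk before any byte is written through the mapping.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve `size` bytes of real disk space for `fd`,
 * then map the file for writing. Returns the mapping, or MAP_FAILED with
 * errno set (for example ENOSPC when the disk is too small).
 */
static void *map_preallocated(int fd, off_t size)
{
    /* posix_fallocate returns the error number directly and does not
     * set errno, so copy it over for the caller. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        errno = err;
        return MAP_FAILED;
    }

    /* Every block behind this mapping is now allocated, so stores into
     * it should not fail later for lack of disk space. */
    return mmap(NULL, (size_t)size, PROT_WRITE, MAP_SHARED, fd, 0);
}
```

With `ftruncate` in place of `posix_fallocate`, the same function would succeed on a full disk and the failure would only show up later as a SIGBUS on the first store into an unallocated block.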

@jgunthorpe
Member

posix_fallocate can be really slow on some filesystems. Are you sure this is worthwhile?

@razeh
Author

razeh commented Dec 19, 2018

For my application, I'd really like to make sure that the file made it across the network. I'm open to alternatives; the only one that I've come up with so far is a signal handler. Are there any other suggestions?

@jgunthorpe
Member

Positive acknowledgement on success?

@razeh
Author

razeh commented Dec 20, 2018

The server does send an acknowledgement on success -- although the status isn't reflected in the client's exit code. But when there isn't enough space on disk the server gets a SIGBUS and then exits before it can send anything.

@razeh
Author

razeh commented Dec 20, 2018

I'd hoped that a SIGBUS signal handler that set a flag to tell _recv to return early would work, but it does not. The program stays stuck in rrecv after the SIGBUS is sent, so _recv never gets to do an early return.

@jgunthorpe
Member

SIGBUS handlers are really tricky to make.

It sounds like you have everything you need, just make the sender fail if it doesn't get a positive ack.

@razeh
Author

razeh commented Dec 20, 2018

Yes, but to fail if I don't get a positive ack would require a timeout, and configuring that is somewhat problematic. Since I'm only using this in my own environment, and the posix_fallocate call isn't slow for my file systems, I'll just go with a local fork that uses posix_fallocate, and a slightly modified client that returns the status code it gets.

@jgunthorpe
Member

CM should give you an immediate connection close after the SIGBUS, and since an Ack wasn't received that should be enough to report failure.
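A rough sketch of what "fail without a positive ack" could look like on the sending side. It is written against ordinary sockets with `poll` and `recv` so that it is self-contained (rsockets provides `rpoll` and `rrecv` as counterparts). The one-byte status format, the `wait_for_ack` name, and the result codes are all assumptions made for illustration, not rcopy's actual protocol.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes for the sender. */
enum ack_result {
    ACK_OK = 0,        /* receiver reported success */
    ACK_REJECTED = -1, /* receiver reported a failure status */
    ACK_CLOSED = -2,   /* connection closed before any ack arrived */
    ACK_TIMEOUT = -3,  /* nothing arrived within timeout_ms */
    ACK_ERROR = -4     /* poll or recv failed */
};

/*
 * Wait up to timeout_ms for a one-byte status from the receiver.
 * Assumed format: 0 means success, anything else means failure.
 */
static enum ack_result wait_for_ack(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    unsigned char status;
    ssize_t got;
    int n;

    /* Retry if interrupted by a signal. For simplicity the full
     * timeout is restarted on each retry. */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);

    if (n == 0)
        return ACK_TIMEOUT;
    if (n < 0)
        return ACK_ERROR;

    got = recv(fd, &status, 1, 0);
    if (got == 0)
        return ACK_CLOSED; /* peer went away without acknowledging */
    if (got < 0)
        return ACK_ERROR;

    return status == 0 ? ACK_OK : ACK_REJECTED;
}
```

The client would then exit non-zero for anything other than `ACK_OK`. If the close is delivered promptly, `ACK_CLOSED` covers the crash case and the timeout is only a backstop.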

@razeh
Author

razeh commented Dec 20, 2018

I am not seeing an immediate connection close on the client when the server gets the SIGBUS; I'm seeing the client hang after the server crashes.

-bash-4.2$ /bin/rcopy
waiting for connection...client: 10.78.255.223
opening: g, transferring...25936828560 bytes...Bus error

And then the client simply hangs, never getting to the close:

/bin/rcopy g tcs-login-2:g
opening...transferring...

And a pstack shows that the client is stuck inside of the rsend:

-bash-4.2$ pstack 40738
#0 0x00002aaaab1127e0 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x00002aaaaaef9043 in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
#2 0x00002aaaaacdcd2d in rs_get_comp.constprop () from /usr/lib64/librdmacm.so.1
#3 0x00002aaaaace0828 in rsend () from /usr/lib64/librdmacm.so.1
#4 0x0000000000401942 in main ()

@jgunthorpe
Member

Looks like a bug in rsockets.

@razeh razeh closed this Dec 21, 2018