Skip to content
This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

3821 Race in rollback, zil close, and zil flush #208

Closed
wants to merge 1 commit into from

Conversation

prakashsurya
Copy link
Member

@prakashsurya prakashsurya commented Oct 24, 2016

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: Dan Kimmel dan.kimmel@delphix.com
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com

Effectively we can’t update the “synced” uberblock until all blocks
have been written to disk, the uberblocks have been updated on-disk,
and all ZILs have been cleaned.

The old code was updating this too early which would allow the ZIL to
think that it could proceed when closing the zil. Now, a txg is only
considered "synced" after the all blocks have been written to disk, the
uberblocks have been updated on-disk, and all ZILs have been cleaned.

This will close the race that can happen between zil_commit() and
zil_clean().

Upstream bugs: DLPX-43104

@prakashsurya prakashsurya force-pushed the openzfs-DLPX-43104 branch 2 times, most recently from d66461d to c79c5ea Compare October 24, 2016 15:55
@avg-I
Copy link

avg-I commented Oct 24, 2016

@ahrens does this change have any relation to the following bugs (filed by me)?
https://www.illumos.org/issues/7199
https://www.illumos.org/issues/7200

@grwilson
Copy link
Member

@avg-I There's a possibility that they are related although it's not immediately obvious to me. This race was very subtle so it's possible that it had a wider impact than what we initially analyzed.

@avg-I
Copy link

avg-I commented Oct 25, 2016

@grwilson having looked at the actual change I think that it is probably unrelated to those issues.
But it should cover another question of mine: https://www.mail-archive.com/developer@lists.open-zfs.org/msg00030.html
Anyway, I'll re-run my tests with this change included.

@avg-I
Copy link

avg-I commented Oct 29, 2016

Reporting back that I was able to reproduce both problems with this patch applied.
But it is for a different issue anyway.
So, the change looks good to me.

@prakashsurya
Copy link
Member Author

Thanks @avg-I, I added you to the commit message as an official reviewer for this patch.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>

Effectively we can’t update the “synced” uberblock until all blocks
have been written to disk, the uberblocks have been updated on-disk,
and all ZILs have been cleaned.

The old code was updating this too early which would allow the ZIL to
think that it could proceed when closing the zil. Now, a txg is only
considered "synced" after the all blocks have been written to disk, the
uberblocks have been updated on-disk, and all ZILs have been cleaned.

This will close the race that can happen between zil_commit() and
zil_clean().

Upstream bugs: DLPX-43104
@prakashsurya
Copy link
Member Author

@zettabot go

@prakashsurya
Copy link
Member Author

@ahrens I think this is good to go, but please just verify the zloop output prior to RTI-ing and merging this; I think that failure is seen on master, but I'm not certain.

@ahrens
Copy link
Member

ahrens commented Nov 6, 2016

Yes, that's definitely a known problem with ztest.

@ahrens ahrens closed this in 47890f8 Nov 6, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants