Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: remove --worker-id, replace with --node-id #844

Merged
merged 4 commits into from
Feb 6, 2017

Conversation

ploxiln
Copy link
Member

@ploxiln ploxiln commented Jan 5, 2017

This is one of the for-1.0 changes contemplated in #741

This is the minimal-effort approach. If --worker-id is specified with an int, parsing options will fail, and the user will hopefully notice the change instead of just specifying "true" or "false" to work-around the failure. Or is backwards-compatibility (with just a "deprecated" warning) desired?

@ploxiln
Copy link
Member Author

ploxiln commented Jan 5, 2017

added another change contemplated in #741, removing the ID from the metadata filename

@ploxiln ploxiln changed the title rename nsqd option --worker-id to --node-id nsqd: remove --worker-id, replace with --node-id Jan 5, 2017
nsqd/nsqd.go Outdated
@@ -370,6 +379,11 @@ func (n *NSQD) PersistMetadata() error {
return err
}

err = os.Remove(oldFileName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with this approach is that it breaks the ability to roll back from a nsqd upgrade (old nsqd would not see new file).

As much as i hate things sticking around, I'd rather see a phased deprecation where one release persists both versions with defined precedence on startup (or requiring equality between two files), and a subsequent release remove the old file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. will attempt

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that mean we should have a point release between where we are and 1.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strictly speaking I think we have 3 options w/r/t version promises. We

a) backport a few small deprecation/forwards compatible things that get released as 0.3.9/0.4 and complete removal in 1.0 (documenting the intermediate version requirements for clean transitions)
b) 1.0 deprecates and provides forwards compatibility w/ removal in 2.0
c) we just cold turkey switch in 1.0 and give strong documentation for the transition risk

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flag is an easy case because it's hard to accidentally have nsqd running with the wrong ID. Deploying a newer or older version can also mean deploying with new or old flags as well (and if you get it wrong, nsqd stops and tells you).

The metadata file case is harder. A missing metadata file is normal for first-time run, it's not an error. We may want to keep this safety rail in for a "long" time anyway, because some users may not be upgrading all instances every time. If someone didn't update and validate all nodes/instances on this new point release, and later they find out about 1.0, they might end up in an easy-silent-data-loss situation, I guess.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an idea: I don't know what it means for windows, but we could just drop a symlink in when we remove the old filename; that way i think we preserve rollback. We could then refuse to run if both are present (and one isn't a symlink) to make sure rolling forward again is clean. The symlink would mean most importantly we only update a single file going forward when persisting the metdata

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They symlink idea sounds fine to me. It might even work on windows, though nsqd would probably have to be run as "Administrator".

https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/#dSlfXGIY3B5ZQF0B.97
golang/go@cf521ce

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is (a), I dislike the idea of maintaining compatibility (and this symlink proposal sounds questionable).

That said, I don't think it's necessary to remove the the old file? If we leave it alone, the rollback path becomes easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is that some messages will be processed before the version rollback, so the old metadata file will have stale info, be missing new messages, have references to queue files that don't exist anymore...

@@ -80,7 +80,8 @@ func nsqdFlagSet(opts *nsqd.Options) *flag.FlagSet {
flagSet.Bool("version", false, "print version string")
flagSet.Bool("verbose", false, "enable verbose logging")
flagSet.String("config", "", "path to config file")
flagSet.Int64("worker-id", opts.ID, "unique seed for message ID generation (int) in range [0,4096) (will default to a hash of hostname)")
flagSet.Int64("node-id", opts.ID, "unique part for message IDs, (int) in range [0,1024) (default is hash of hostname)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has this range been wrong this whole time? sheesh 😭

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny, but it was checked correctly in nsqd.New() anyway. Maybe everybody just lets the default "hash of hostname" take care of it.

@ploxiln
Copy link
Member Author

ploxiln commented Jan 5, 2017

Pushed up change to maintain old and new metadata files in parallel. It wasn't possible for metadata loading to abort nsqd startup before, I changed that so it is ...

@ploxiln
Copy link
Member Author

ploxiln commented Jan 5, 2017

(squashed in minor fixes, left major commits for discussion)

nsqd/nsqd.go Outdated
}
if err != nil {
fn = fnID
data = dataID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked at this block multiple times and I still can't really make sense of it.

Is there a simpler way to implement this? What if LoadMetadata took a filename and the caller did some of this compatibility logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use err and errID as a shorthand to indicate if fn or fnID exist. If it was any other error it would have aborted/returned already

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... so if err != nil here, then fn did not exist and data is empty, but fnID did exist and dataID was successfully read into. Below code can get whichever was found using just fn and data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all that said I'll be glad to refactor once I figure out what the final strategy will be :)

Copy link
Member

@mreiferson mreiferson Jan 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location of the comment is misleading, I was referring to the whole new set of changes related to error handling, not this specific line or if block.

re: strategy, it seems like regardless of version choices, we're going to land forward/backward compatible code, so these changes will be necessary in some form.

@ploxiln
Copy link
Member Author

ploxiln commented Jan 6, 2017

The "what to do with the old metadata file" discussion is by default hidden as a "comment on outdated changes" so I'll continue it here. I do kinda like the symlink idea. By the way, on both macOS and Linux you can use rename to atomically replace a plain file with a symlink and vice-versa. Older nsqd will read the file via the symlink, save to the .tmp file, then move it over the symlink.

@judwhite does nsqd normally run as "administrator" when run as a service, and is it reasonable to have windows users run it as administrator when run manually? Any thoughts on the windows symlink situation?

@ploxiln
Copy link
Member Author

ploxiln commented Jan 7, 2017

some interesting reading:
golang/go#8914 (comment)
https://github.com/golang/go/blob/release-branch.go1.7/src/internal/syscall/windows/syscall_windows.go#L130

go 1.5+ apparently uses MoveFileEx() with the appropriate flags on Windows, so we might not need an atomicRename() wrapper to do the special thing on windows anymore

http://www.weirdnet.nl/apple/rename.html
macOS rename() had atomicity issues until recently

@jehiah
Copy link
Member

jehiah commented Jan 7, 2017

I would like to make sure we support forwards and backwards and forwards again safely. The scenario i am concerned about is nuanced, but it's:

a) running nsqd persisting to nsqd.1.dat
b) upgrade; uses nsqd.dat; some persistence for nsqd.1.dat -> nsqd.dat so rollback works
c) rollback; loads nsqd.1.dat; nsqd.dat is stale
d) roll forward again; nsqd.dat incorrectly used.

I think there are two sane options to support this back and forth scenario safely. da) refuse to startup if both files are present (And old isn't a symlink). db) use the file w/ the newer timestamp (if timestamps are equal; refuse to start if file contents differ)

For me, this is a real aspect of trying deploying a canary build; reverting, and then deploying an updated canary again. I also don't think this is extremely unusual.

@mreiferson
Copy link
Member

mreiferson commented Jan 7, 2017

Given that we decided to support --node-id for 1.0, and the main concern is compatibility, the file name change feels unnecessarily complex and not worth it.

Let's just keep the file name as is, it only matters for pedantic correctness.

@jehiah
Copy link
Member

jehiah commented Jan 7, 2017

I just hate having to explain how to predict what a new hosts ID will be when I see files be attempted to be moved to a new host. I really want to see that go!

@ploxiln
Copy link
Member Author

ploxiln commented Jan 7, 2017

I think of the data-path as a portable package of nsqd state. It doesn't make sense that if you decide to change the node-id, or move the state to a new host (and use automatic node-id), then you have to rename just this one file. It would be nice to clean this up, and now would be the time to start the transition (though not necessarily in this PR).

By the way, I think it would be fine to drop backwards compatibility in a 1.x minor update, I don't think we have to wait until 2.0. This is because 1.0 will be forwards-compatible with that, it would only break compatibility with pre-1.0 - so this is a bit different than supporting/deprecating a flag that you can still use in 1.0.

@mreiferson
Copy link
Member

OK, those seem rare but reasonable arguments.

I'll leave code review to @jehiah, good catch on the rename changes!

nsqd/nsqd.go Outdated
if err != nil {
return err
}

stat, err := os.Lstat(fileNameID)
if err != nil || (stat.Mode()&os.ModeSymlink) == 0 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're curious about the spacing around &, gofmt does this to emphasize the operator precedence. I guess that means I don't need the parentheses, I can remove them if preferred.
golang/go#12720

"version": "1.0.0-alpha"
}`

tmpDir, err := ioutil.TempDir("", "nsq-test-")
Copy link
Member Author

@ploxiln ploxiln Jan 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TempDir() does the random-suffix part automatically. Docs say:

Multiple programs calling TempDir simultaneously will not choose the same directory.

(I mention this because the mustStartNSQD() helper composes its own random-ish suffix for TempDir() if DataPath is unset.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, another nice find/cleanup

@ploxiln
Copy link
Member Author

ploxiln commented Jan 9, 2017

I tested on Windows 7. Everything seems to work, when nsqd is run as Administrator (I used an Administrator PowerShell, and a windows build of curl to test). When not run as Administrator it fails to PersistMetadata() on startup and aborts, as expected. Still waiting for windows folks to comment on the symlink / Administrator / service / container situation on Windows, to see how acceptable this is. cc @judwhite @elvarb @esiqveland

mreiferson
mreiferson previously approved these changes Jan 9, 2017
Copy link
Member

@mreiferson mreiferson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this LGTM, but I think we should wait for someone w/ Windows to confirm expectations around Administrator privileges.

@mreiferson
Copy link
Member

@ploxiln I'm inclined to land this and folks can follow up with opinions if we need to make changes before release.

Want to squash?

@ploxiln
Copy link
Member Author

ploxiln commented Jan 12, 2017

Sounds good.

IMHO these commits are logically split into separate topics, but I have no problem squashing them all together as is preferred. Done.

nsqd/nsqd.go Outdated
err = os.Remove(fileNameID)
}
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ploxiln if os.IsNotExist(err) is true you'll return a value for err, is that intentional, or would you prefer to return nil?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug, I was trying to be too concise, nice catch.
Looks like I'll back-out this windows-path variant anyway.

@judwhite
Copy link
Contributor

I think parity in the system between OS's is more important than my personal taste. Though we won't get there with symlinks, if there are going to be two files on Linux I'm fine with the same on Windows. I'll defer; whatever the core maintainers think is best.

@ploxiln
Copy link
Member Author

ploxiln commented Jan 27, 2017

I threw out the latest windows variant I mentioned yesterday, and squashed everything down as I found appropriate, just now.

@mreiferson
Copy link
Member

as there is no easy way to generate the proper node ID that's currently auto generated. That makes it infeasible to implement out of band scripting for a rollback where you move nsqd.dat to the right nsqd.%d.dat for rollback because you don't know the value for %d.

@jehiah what do you mean? It's deterministic, why can't it be scripted? I agree the symlink makes rollback easier, obviously.

@mreiferson
Copy link
Member

mreiferson commented Jan 28, 2017

If you want the symlink, fine, let's just land this already. My only requirement is that I don't think the 1.0 version should include code that handles symlinks (creation or cleanup).

@ploxiln
Copy link
Member Author

ploxiln commented Jan 28, 2017

So in 1.0, you want it to

  • try to read from new metadata file, if success, done
  • try to read from old metadata file only if new one does not exist
  • write new metadata file (and do nothing else)

@mreiferson
Copy link
Member

IMO it should read/write to the new metadata file and do nothing else.

@ploxiln
Copy link
Member Author

ploxiln commented Jan 28, 2017

It seems harsh if one wouldn't be able to upgrade from 0.3.8 to 1.0 without either manually renaming the metadata file or losing topics/channels - unless there would be a pre-1.0 transition release series that was around for 6 months.

I think the minimum that makes sense is to just fall back to reading the old metadata file if the new one does not yet exist. That's really not much code. And it's sufficient if you never rollback ...

os.Rename() now does the same thing on Windows
that atomicRename() did (since go 1.5)
no need to add our own pseudo-random "unique" suffix
symlink old metadata filename to new
when loading, if both exist, ensure they match
this makes rollback possible without losing messages
(when rolling back forward, some manual intervention is required)

on windows, Symlink() needs Administrator privs,
so just write plain old metadata file

includes tests
@ploxiln
Copy link
Member Author

ploxiln commented Jan 29, 2017

For sake of discussion ... I've pushed up a commit that reduces this back down to what I think the minimum for 1.0 should be.

@ploxiln
Copy link
Member Author

ploxiln commented Jan 29, 2017

With the "minimum back-compat" code, the rollback procedure would be something like:

  1. stop nsqd
  2. [ -e "$DATAPATH"/nsqd.[0-9]*.dat ] || exit 1
  3. mv -v "$DATAPATH"/nsqd.dat "$DATAPATH"/nsqd.[0-9]*.dat
  4. rollback and start nsqd

I'd also like to point out that with either the symlink-style or minimal-style back-compat code, 1.0 being backward-compatible with 0.x would not necessitate all of 1.x keeping any back-compat code - if 1.0 was ever run it would write the new metadata file, which later 1.x would find even without any back-compat code. So this is not comparable to keeping a function present in a library.

@ploxiln
Copy link
Member Author

ploxiln commented Feb 1, 2017

ping

I think there are two options:

  • remove my last commit, merge with symlink-based rollback-compat, make a pre-1.0 stable release series soon
  • squash my last commit in, merge with one-time-upgrade-compat only, make a 1.0 release soon

@mreiferson
Copy link
Member

Sorry I wasn't more clear, I meant to land your changes including symlinks and stamp a non-1.0 release, then remove that temporary code and work towards 1.0.

@ploxiln
Copy link
Member Author

ploxiln commented Feb 2, 2017

I removed that last commit. This should be ready to go.

@ploxiln
Copy link
Member Author

ploxiln commented Feb 6, 2017

ping

@jehiah jehiah merged commit a507eb9 into nsqio:master Feb 6, 2017
@jehiah
Copy link
Member

jehiah commented Feb 6, 2017

<3 I heart a 1.0-migration release build for those who want easy rollout.

@jehiah
Copy link
Member

jehiah commented Feb 16, 2017

FYI: i'm working to push this build out on the Bitly stack. I'm using the snippet below to prep our ability to rollback seamlessly (without causing complications rolling forward again)

DATA_DIR="/data/nsqd"
if [ -e "$DATA_DIR/nsqd.dat" ] && [ ! -h "$DATA_DIR/nsqd.dat" ]; then
    OLD_NAME=$(find $DATA_DIR -type l -lname "$DATA_DIR/nsqd.dat")
    if [ -n "$OLD_NAME" ]; then
        echo "[`date +'%Y%m%d %H:%M:%S'`] rolling back $DATA_DIR/nsqd.dat back to $OLD_NAME"
        mv $DATA_DIR/nsqd.dat $OLD_NAME || exit 1
    fi
fi

@mreiferson
Copy link
Member

@jehiah you building that with go 1.8 too by any chance?

@jehiah
Copy link
Member

jehiah commented Feb 16, 2017

@mreiferson no unfortunately; this is my last pre-req task i want to finish before i start working on 1.8 on our stack.

@mreiferson
Copy link
Member

NP, just thinking in terms of a release for these changes I usually use the latest stable version of Go at the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants