
Update event stream #6853

Merged

davidMcneil merged 14 commits into master from dmcneil/event-stream Aug 19, 2019

Conversation

@davidMcneil (Contributor) commented Aug 14, 2019

Resolves #6761 #6740

This PR first addresses #6740 using the existing natsio and nitox NATS clients. It then removes those clients and switches to the non-streaming client rust-nats. In initial testing, rust-nats (and non-streaming connections in general) appears much more reliable. However, testing was done against a standalone NATS server rather than directly against Automate; this is blocked on Automate supporting a plain NATS connection.

There are several TODOs in this PR. They should be straightforward to address once we verify that this is the direction we want to go for our NATS client. #6770 is the spike to evaluate the NATS clients.

davidMcneil added 6 commits Aug 13, 2019
@chef-expeditor commented Aug 14, 2019

Hello davidMcneil! Thanks for the pull request!

Here is what will happen next:

  1. Your PR will be reviewed by the maintainers.
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

@davidMcneil (Contributor, Author) commented Aug 14, 2019

This resolves #6740 by adding the event-stream-connect-timeout CLI option, which can also be set with the HAB_EVENT_STREAM_CONNECT_TIMEOUT environment variable. The option takes a numeric value: the number of seconds to wait for an event stream connection before exiting the Supervisor. The value 0 is treated specially: it means there is no timeout, and the Supervisor should start immediately regardless of the state of the event stream connection.
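For illustration, the 0-means-no-timeout handling could look like this (a minimal sketch; the helper name is invented and this is not the Supervisor's actual code):

use std::env;
use std::time::Duration;

/// Read HAB_EVENT_STREAM_CONNECT_TIMEOUT, mapping 0 (or, in this sketch,
/// an unset or unparseable value) to `None`, i.e. "no timeout: start
/// the Supervisor immediately".
fn event_stream_connect_timeout() -> Option<Duration> {
    let secs: u64 = env::var("HAB_EVENT_STREAM_CONNECT_TIMEOUT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}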

@davidMcneil force-pushed the dmcneil/event-stream branch from f299d31 to 60df8c6 Aug 14, 2019
@davidMcneil (Contributor, Author) commented Aug 15, 2019

Unfortunately, we were not able to avoid forking the rust-nats library. We needed to make the following changes:

  • make the connect method public
  • add support for auth token credentials
  • percent-decode the appropriate parts of the NATS connection string (see the sketch below)

The fork can be found here.
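For illustration, percent-decoding credentials out of a connection string might look like this (a sketch using the percent-encoding crate, not the fork's actual code):

use percent_encoding::percent_decode_str;

fn main() {
    // An auth token like "p@ss/w0rd" has to be percent-encoded to be
    // embedded in a NATS connection string; decode it before use.
    let encoded = "p%40ss%2Fw0rd";
    let token = percent_decode_str(encoded)
        .decode_utf8()
        .expect("token was not valid UTF-8");
    assert_eq!(token, "p@ss/w0rd");
}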

@davidMcneil force-pushed the dmcneil/event-stream branch from 35e4fdb to d137ddc Aug 15, 2019
davidMcneil added 3 commits Aug 15, 2019
Contributor left a comment

Looks good overall... I had a couple of observations and documentation tweaks, though. I think we may need to adjust how we're handling subjects, too.

Nice work!

 trace!("About to queue an event: {:?}", event);
-if let Err(e) = self.0.unbounded_send(event) {
+if let Err(e) = self.0.try_send(event) {
     error!("Failed to queue event: {:?}", e);

@christophermaier (Contributor) commented Aug 15, 2019

Probably worth documenting that if we fill up the channel (because we're not currently connected to the NATS server), we'll drop additional messages on the floor, since try_send will return an error.

Actually, it'd be good to use TrySendError::is_full() to get more information in our error logging.
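A sketch of what that logging could look like, assuming the futures 0.1 bounded channel whose try_send appears in the diff above (the wrapper function is invented for illustration):

use futures::sync::mpsc::Sender;
use log::{error, trace};

fn queue_event(tx: &mut Sender<String>, event: String) {
    trace!("About to queue an event: {:?}", event);
    if let Err(e) = tx.try_send(event) {
        if e.is_full() {
            // We aren't connected to the NATS server and the buffer has
            // filled up; this event is intentionally dropped on the floor.
            error!("Dropping event, channel is full: {:?}", e);
        } else {
            error!("Failed to queue event: {:?}", e);
        }
    }
}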

@davidMcneil (Contributor, Author) commented Aug 16, 2019

The current error message we give when try_send fails due to a full channel is "Failed to queue event: send failed because channel is full". Is there something more you would like to see?

@christophermaier (Contributor) commented Aug 16, 2019

Nope, that's good!

@christophermaier (Contributor) commented Aug 16, 2019

I think a comment in the code about dropping messages would still be useful as documentation of intent.

components/sup/src/event/stream.rs (resolved)
Ok(())
});

Runtime::new().expect("Couldn't create event stream runtime!")

@christophermaier (Contributor) commented Aug 15, 2019

It's probably worth noting here: the reason all this was initially running in a thread in the first place is that the nitox library had an issue where it didn't play well with other futures on its reactor. To work around that, I put it off on its own reactor on a separate thread.

Since rust-nats presumably doesn't have that issue, we could theoretically move all of this to run directly on the Supervisor's main reactor. If we were to do that, however, we'd need to do it in such a way that we could cleanly shut it down when the Supervisor needs to come down, or we'd run into the same underlying issue that was behind #6712 and was fixed by #6717.

(It's perfectly fine to leave it as is; I'm just thinking it would be good to leave a comment here to ensure that any well-meaning refactoring engineer that comes after us knows what the trade-offs are.)
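For reference, the "own reactor on a separate thread" arrangement described above looks roughly like this (a sketch against the tokio 0.1 current_thread API; the thread name and placeholder future are illustrative):

use std::thread;

use futures::future;
use tokio::runtime::current_thread::Runtime;

fn start_event_thread() -> thread::JoinHandle<()> {
    thread::Builder::new()
        .name("event-stream".into())
        .spawn(|| {
            let mut runtime =
                Runtime::new().expect("Couldn't create event stream runtime!");
            // The NATS publishing loop would be spawned here; a no-op
            // future stands in for it.
            runtime.spawn(future::ok(()));
            // Drive this reactor until all spawned futures finish,
            // independently of the Supervisor's main reactor.
            runtime.run().expect("Event stream runtime failed!");
        })
        .expect("Couldn't spawn event stream thread!")
}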

@davidMcneil (Contributor, Author) commented Aug 16, 2019

These commits here and here address this.

It seems like we need a more robust and ergonomic solution for ensuring all futures are stopped before shutting down. That is outside the scope of this PR, but I want to get some ideas out there.

What if we made a "wrapper" around two runtimes? The wrapper could expose a spawn_divergent method (or something similarly named, to indicate that the future never ends), which would spawn the future on a runtime that calls shutdown_now instead of shutdown_on_idle when we shut down; see the sketch below.

I'm not sold on this solution, but it would be nice not to have to keep a handle for every divergent future. I wonder how other tokio projects handle correctly ending unbounded futures. Thoughts?
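A rough sketch of that two-runtime idea against the tokio 0.1 API (DualRuntime and spawn_divergent are invented names, not anything that exists in this PR):

use futures::Future;
use tokio::runtime::Runtime;

struct DualRuntime {
    // Futures that are expected to finish on their own.
    finite: Runtime,
    // "Divergent" futures that never resolve on their own.
    divergent: Runtime,
}

impl DualRuntime {
    fn spawn<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.finite.spawn(f);
    }

    // Spawn a future that is known never to end.
    fn spawn_divergent<F>(&mut self, f: F)
        where F: Future<Item = (), Error = ()> + Send + 'static
    {
        self.divergent.spawn(f);
    }

    fn shutdown(self) {
        // Let well-behaved futures run to completion...
        self.finite.shutdown_on_idle().wait().unwrap();
        // ...but forcibly stop the divergent ones.
        self.divergent.shutdown_now().wait().unwrap();
    }
}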

@christophermaier (Contributor) commented Aug 16, 2019

Neat 😄

I'm not sure what the best way forward is here. I like the "handle" approach, since it's explicit, but it does require a little bookkeeping. I haven't seen other approaches for this, though (which is what motivated the handle solution in the first place).

Your "two Runtimes" approach is also an interesting one, and is worth digging into, I think.

As long as the code in this PR doesn't get us back into a 0.83.0 bug situation, I'm 👍 on merging it.

@@ -11,7 +11,7 @@ use tokio::{prelude::Stream,
             runtime::current_thread::Runtime};

 /// All messages are published under this subject.
-const HABITAT_SUBJECT: &str = "habitat";
+const HABITAT_SUBJECT: &str = "habitat.event.healthcheck";

@christophermaier (Contributor) commented Aug 15, 2019

We send out more events than just health checks, though... if we're going to have different subjects per event, we'll need to handle that a little differently.
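One possible shape for per-event subjects, with invented event names (the real event types and subject scheme would need to match what Automate expects):

/// Hypothetical event types; the real set lives in the Supervisor's
/// event module.
enum Event {
    HealthCheck,
    ServiceStarted,
    ServiceStopped,
}

/// Derive the NATS subject from the event type instead of publishing
/// everything under a single constant.
fn subject_for(event: &Event) -> &'static str {
    match event {
        Event::HealthCheck => "habitat.event.healthcheck",
        Event::ServiceStarted => "habitat.event.service_started",
        Event::ServiceStopped => "habitat.event.service_stopped",
    }
}

fn main() {
    assert_eq!(subject_for(&Event::HealthCheck), "habitat.event.healthcheck");
}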

@davidMcneil (Contributor, Author) commented Aug 16, 2019

Thanks for catching this! I will follow up with the Automate team and see what they expect.

davidMcneil added 2 commits Aug 16, 2019
@davidMcneil force-pushed the dmcneil/event-stream branch from e62b6ef to b9d80a2 Aug 16, 2019
Contributor left a comment

Once the event subject situation is resolved, I'm 👍

@davidMcneil merged commit 160756e into master Aug 19, 2019
5 checks passed

  • DCO: This commit has a DCO Signed-off-by
  • buildkite/habitat-sh-habitat-master-verify: Build #3157 passed (28 minutes, 29 seconds)
  • buildkite/habitat-sh-habitat-master-website: Build #283 passed (45 seconds)
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • expeditor/config-validation: Validated your Expeditor config file
@chef-ci deleted the dmcneil/event-stream branch Aug 19, 2019
@ericcalabretta (Contributor) commented Aug 19, 2019

@davidMcneil If a user sets HAB_EVENT_STREAM_CONNECT_TIMEOUT=0, would the Supervisor attempt to connect to Automate if it was not initially available?

Automate may be unavailable because of an upgrade, a failure, etc., and users may still want to start the Supervisor and its services. It sounds like setting the value to 0 would accomplish this, but users would also want the Supervisor to connect to Automate when/if it becomes available again, i.e. when the upgrade completes or service is restored after whatever the failure was.

@davidMcneil (Contributor, Author) commented Aug 19, 2019

@ericcalabretta Regardless of the value of HAB_EVENT_STREAM_CONNECT_TIMEOUT, Habitat will always try to connect to Automate when it goes to publish an event (if it is currently disconnected). HAB_EVENT_STREAM_CONNECT_TIMEOUT only affects startup behavior. So if a value of 0 is used, Habitat will eventually connect to Automate, provided Automate comes up at the correct URL with the correct auth token.
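In other words, the connection is (re)established lazily at publish time. A highly simplified sketch of that behavior (Connection, connect, and Publisher are stand-ins, not the actual rust-nats or Supervisor API):

/// Stand-in for a NATS connection to Automate.
struct Connection;

/// Stand-in for a connection attempt; returns None while Automate is down.
fn connect() -> Option<Connection> {
    None // Imagine a real connection attempt here.
}

struct Publisher {
    connection: Option<Connection>,
}

impl Publisher {
    /// Publish an event, first (re)connecting if we are disconnected.
    fn publish(&mut self, _event: &[u8]) {
        if self.connection.is_none() {
            self.connection = connect();
        }
        match &self.connection {
            Some(_conn) => { /* send the event over the connection */ }
            None => { /* still disconnected; drop or buffer the event */ }
        }
    }
}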

@ericcalabretta (Contributor) commented Aug 19, 2019

@davidMcneil That's perfect, thanks for the clarifications.

@mwrock added the X-feature label Aug 21, 2019