Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start the trigger in a known state when using a dynamic configuration URL #350

Merged
merged 22 commits into from Feb 15, 2019

Conversation

Projects
None yet
3 participants
@tekkamanendless
Copy link
Contributor

commented Feb 9, 2018

In my Jenkins instance, I seemingly randomly have these instances where two patchset-created events, sent seconds apart, have a problem: one of them will be processed normally, and the other will not.

After a whole bunch of logging and debugging, I have traced the problem to the dynamicGerritProjects list. On the event that works, the list will be the proper size; however, on the event that doesn't, it'll be empty. This is ultimately due to the fact that dynamically configured triggers start with a bad initial state.

This request solves this problem with:

  • Trigger restarts should not have a bad initial state.

There are also some logging updates to:

  • Generally add some context to otherwise context-free log statements.

On trigger restarts...

The start action of GerritTrigger doesn't actually fetch the dynamic project configuration. Instead, it starts a thread to do it. This means that there is a window of time between when start returns and when the thread completes its first task that the dynamic project list is empty.

For classic use, this was probably not a huge issue. However, using imperative Jenkinsfile jobs, this is a major problem because after every job is run, the configuration "changes" (if you use the properties command in the job).

In my case, when multiple Gerrit events were submitted in rapid succession, some would be "lost" due to the trigger-restart time window where there were no dynamic projects listed. In the case of those "lost" events, the trigger would run, and the isInteresting method would return false, stating that the event was not relevant to any project or branch in the configuration (logging shows that the length of dynamicGerritProjects is zero in those cases).

This commit ensures that the dynamic project list is updated as part of the start action so that the trigger is always valid.


On logging...

Some of the logs showed some potentially helpful debugging information, but without context. For example, when determining that an event would be ignored by the trigger, this was printed: Nothing interesting here, move along folks!. This message did not state the event, so looking through my logs to see what went wrong did not provide very much help (I have a lot of jobs triggered with this plugin). The new statement looks like Event is not interesting; event: {}. Most of the log changes are along this line: adding context such as the event or the job.

tekkamanendless added some commits Feb 5, 2018

Added locking around `dynamicGerritProjects`.
In my Jenkins instance, I seemingly randomly have these instances where two
patchset-created events, sent seconds apart, have a problem: one of them
will be processed normally, and the other will not.

After a whole bunch of logging and debugging, I have traced the problem to
the `dynamicGerritProjects` list.  On the one that works, the list will be
the proper size; however, on the one that doesn't, it'll be empty.

I suspect that this has to do with the task that updates the dynamic trigger
configuration in the background at the moment when the new event comes in.
Because the timing code is all coordinated by Jenkins, it is at least
plausible that race conditions could be more likely than expected, since
tasks will tend to run at whatever internal intervals Jenkins uses for them.

The logging changes generally add some context to otherwise context-free
log statements.

The locking around `dynamicGerritProjects` aims to control every access of
that variable.  For some strange reason, making the lock a normal class member
did not work (it was always null, even when intantiated in the definition as
well as the two constructors).  However, using a static constant object seems
to have at least stopped all the null-pointer exceptions.

I'm currently testing this on my instance, and we tend to have this problem
come up once or twice a week.  We'll see how it goes.
Trigger restarts should not have a bad initial state.
The `start` action here doesn't actually fetch the dynamic project configuration.
Instead, it starts a thread to do it.  This means that there is a window of time
between when `start` returns and when the thread completes its first task that the
dynamic project list is empty.

For classic use, this was probably not a huge issue.  However, using imperative
Jenkinsfile jobs, this is a major problem because after every job is run, the
configuration "changes" (if you use the `properties` command in the job).

In our case, when multiple Gerrit events were submitted in rapid succession, some
would be "lost" due to the trigger-restart time window where there were no dynamic
projects listed.

This commit ensures that the dynamic project list is updated as part of the `start`
action so that the trigger is always valid.
@Jimilian

This comment has been minimized.

Copy link
Contributor

commented Feb 15, 2018

Sorry for feeling my PRs too late.

@rsandell

This comment has been minimized.

Copy link
Member

commented Feb 28, 2018

#353 is now merged, so you can go ahead and adapt it. Unless @Jimilian has better suggestions?

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Feb 28, 2018

I've merged in all the latest changes (the cache proxy); we're down to a line change in "start" and some updated logging lines.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Oct 25, 2018

Any guidance on where we want to go from here? My use case absolutely requires that all jobs be properly configured with their triggers before they run, but I'm willing to accept that I'm a minority in this case. Would you be happy with a setting that toggles this on somehow?

@rsandell

This comment has been minimized.

Copy link
Member

commented Nov 2, 2018

You need to fix it so that updateTriggerConfigURL() isn't called on the main thread during Jenkins startup and slows down startup times. The cache helps but I fear it won't be good enough.
During "normal" operations I can live with having the update run directly in the same thread as whatever called start (in freestyle land that would be the http thread that saved the job) but not during Jenkins startup.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Nov 7, 2018

Okay, I think that I understand what you guys are going for.

Here's what I'm thinking would do the trick:

  1. The trigger part has to be set up immediately. From the moment that the job is registered, it needs to be listening for Gerrit events, otherwise it will lose them (it can't stop and start listening).
  2. The EventListener should be stripped down to just write every event to a queue (in GerritTrigger). It won't do any processing; it'll just write them down and move on.
  3. A queue-reading thread should attempt to pull items off of the queue as soon as they come in and process them the way that EventListener used to ("isInteresting" and all that). However, that queue-reading thread will only be active if the dynamic configuration has been loaded (for non-dynamic configurations, it will always be active).

Thus, when the job initially starts up, the queue will be open for business, but the queue-reading thread will not be pulling anything from the queue yet. Once a configuration has been loaded (non-dynamic configurations are trivially considered "loaded" immediately), the queue-reading thread will go through them, one at a time, and figure out if they should trigger the job.

When jobs are "restarted" (which is upon each and every run if you are running a pipeline job that calls "properties(...)" up front), there will be no discontinuity since no events will be lost. We will pause queue-reading thread upon "stop" and resume it on "start".

Upon initial startup, there will be no additional delays, since we can now safely afford to have the dynamic-configuration loading happen out-of-band whenever it gets around to it. The moment that the dynamic configuration is known, all of the events that happened from the moment that "start" was initially called will be processed.

This is the behavior that I am expecting (after a job is present in Jenkins, all events that happen are captured by it), and it meets your startup thread requirements.

If that works for you, I'll get that written and submitted tomorrow.

tekkamanendless added some commits Nov 8, 2018

Switch to an event queue
This change makes the EventListener merely listen for events and send
them to the GerritTrigger.  The GerritTrigger now has a background thread
called an EventProcessor.  When the GerritTrigger receives an event from
the EventListener, it sends the event to the GerritProcessor, which
maintains a queue.

In dynamic configuration mode, the queue is "paused" for reading until
the configuration has been loaded.  In non-dynamic configuration mode,
the queue is "unpaused" for reading immediately.

Between stop and start calls, the queue is maintained so that no events
are lost.
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Nov 8, 2018

I have a few weeks of in-house testing to go before I'll be ready with this, but this is generally what I think meets your requirements.

@rsandell

This comment has been minimized.

Copy link
Member

commented Nov 12, 2018

Sorry again for not responding timely.
Putting the events in a queue will be very risky as it will change the triggering behaviour severely especially around dependant triggered jobs. So that is a no go in my mind.

The events are already handled on a thread pool, with each thread handling one event and checking all the jobs, so pausing one of those threads while the dynamic trigger is updating should be safer.
You could potentially achieve that for example by the update thread flipping a semaphore when the update is happening, and when start has been called and then flipping it back when it's done writing the dynamic info. Then you just wait for the semaphore to be open before calling isInteresting.

And as I commented before, it will probably be fine to call the update directly in the start method as long as you don't do it during Jenkins startup ('Jenkins.getInstance().getInitLevel() != InitMilestone.COMPLETED'). and maybe do the semaphore only then to simplify.

tekkamanendless added some commits Nov 16, 2018

Use a latch to force EventListener to wait until the project list is …
…ready

The EventListener always "waits" for the project list to be ready before
taking action on an event.  For static configurations, this is always
a no-op, since the latch counter will always be zero.  For dynamic
configurations, the latch will be initialized to "1" when the trigger
is started; as soon as the configuration has been loaded, it will
decrease to zero.

When the job is updated or repeatedly stopped and started, the latch
will only be reset to "1" if the dynamic configuration URL changes.
This way, things can move smoothly if you're using a pipeline job
that calls "properties()" to reset itself every time.
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Nov 16, 2018

Okay, so this time I think I have what you want.

I added a latch that the EventListener waits on before calling isInteresting. For static projects, the latch is set to 0 immediately upon start, so there is effectively no waiting. For dynamic projects, the latch is initialized to 1 upon start and is decreased to 0 when the trigger URL has been fetched.

There's a little bit of logic in there for reusing the same latch between stop and start calls if the URL didn't change. For a pipeline job that calls properties, it'll stop and start itself after every run, so it's important to not hold everyone up if the trigger URL part of the job didn't change between runs.

@rsandell

This comment has been minimized.

Copy link
Member

commented Nov 19, 2018

Yepp, your approach seems good.

tekkamanendless added some commits Nov 19, 2018

The latch is now transient.
This updates the code to make the latch "transient" (it will not be
serialized); instead, we initialize it to the zero state when
`GerritTrigger` is constructed.
Revert the SureFire workaround.
FYI: -Djdk.net.URLClassPath.disableClassPathURLCheck=true
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Nov 19, 2018

First round of testing has gone well. I haven't noticed any bad behaviors with a pipeline job calling properties() every time that it's run. All of the events are being handled properly.

Jenkins question for you: am I guaranteed that one of the defined GerritTrigger constructors will be called? As long as a GerritTrigger instance cannot be made without calling one of those constructors (I'm mainly concerned with the XML state stuff), then we should be good with the latch.

Initialize the latch in `readResolve`
`readResolve` is called when the plugin is loaded from disk, and
this bypasses the constructors (since we don't have a zero-argument
version).  In that case, we need to set up the latch appropriately.

I took it out of the constructors because it was redundant, since
it's defined that way at the class level.

I also added some logging for when the different constructor options
are called.
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2018

Answer to my own question: either GerritTrigger(List), GerritTrigger(List, ...), or readResolve() will be called. readResolve() is magically called when deserializing from XML.

@rsandell

This comment has been minimized.

Copy link
Member

commented Dec 19, 2018

So how are you progressing? Do you feel that you are done? I don't see any unit tests for example verifying that it works and won't regress in the future.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Dec 19, 2018

My changes are ready as far as I'm concerned.

On the unit testing side, I just realized that there are currently no tests covering any cases where setTriggerConfigURL is called. I'll see what I can do about putting together something that'll try to cover that case. If you have any tips for how you'd like to see that done, I'll try to follow them. Otherwise, I'll wing it and we can discuss the implementation here.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Dec 19, 2018

Quick update: I can make some simple unit tests around making sure that the dynamic project list loads when it's supposed to load with some in-memory HTTP servers or temp files, but I have no idea how to spin up enough of the framework to ensure that the trigger is started and listening for Gerrit events to properly demonstrate that an event received after "start" but before the URL is fetched is processed properly after the URL is fetched. I'm going to need some serious assistance from someone who has way more experience with unit-testing Jenkins plugins.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Dec 19, 2018

I've submitted what I believe to be a halfway-decent unit test to cover this. Take a look and let me know what else you want.

Move the @PrepareForTest change to just the new test method
Apparently adding `GerritTrigger.class` to `@PrepareForTest` at the
top level breaks things, so I've overridden `@PrepareForTest` at the
method level.
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Dec 20, 2018

For the record, if I run the test without my changes, it fails as expected, so I'm happy with it.

[ERROR] Failures: 
[ERROR]   GerritTriggerTest.testDynamicTriggerConfigurationTimeGap:1810 
Wanted but not invoked:
toGerritRunListener.onTriggered(
    same(abstractProject),
    same(PatchsetCreated: Change-Id for #1000: Iddaaddaa123456789 PatchSet: 1)
);
-> at com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTriggerTest.testDynamicTriggerConfigurationTimeGap(GerritTriggerTest.java:1810)

However, there were other interactions with this mock:
toGerritRunListener.isProjectTriggeredAndIncomplete(
    abstractProject,
    PatchsetCreated: Change-Id for #1000: Iddaaddaa123456789 PatchSet: 1
);
-> at com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTrigger.isInteresting(GerritTrigger.java:953)

toGerritRunListener.isTriggered(
    abstractProject,
    PatchsetCreated: Change-Id for #1000: Iddaaddaa123456789 PatchSet: 1
);
-> at com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTrigger.isInteresting(GerritTrigger.java:956)
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2019

Has anyone had a chance to look at the changes made in December (including the tests)?

@rsandell

This comment has been minimized.

Copy link
Member

commented Feb 1, 2019

Hmm it's the gerritEvent method that should call onTriggered and it seems like you are setting up the mock correctly, so why it isn't called is a bit of a mystery to me as well. Perhaps make sure the trigger object isn't returning true for isSilentMode?

tekkamanendless added some commits Feb 1, 2019

Simplify everything because a stopped instance will never start
Jenkins plugin instances are never "restarted"; instead, they are
just created anew and started from scratch.  The new one may be
loaded from storage using `readResolve`, but an existing one will
never be reconfigured directly.
@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2019

At this point, I believe that I have addressed everyone's concerns, and my test instance of Gerrit is still happy with everything.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Feb 2, 2019

It looks like there was a problem with the Jenkins tooling; it couldn't spin up a node. Is there a button that you could press to re-trigger the build?

[Pipeline] node

Still waiting to schedule task

???

Creating placeholder flownodes because failed loading originals.



GitHub has been notified of this commit’s build result



java.io.IOException: Tried to load head FlowNodes for execution Owner[Plugins/gerrit-trigger-plugin/PR-350/28:Plugins/gerrit-trigger-plugin/PR-350 #28] but FlowNode was not found in storage for head id:FlowNodeId 1:8
@rsandell

This comment has been minimized.

Copy link
Member

commented Feb 8, 2019

Sometimes closing and reopening the pr will make Jenkins start a new build. But now always. Simplest is to push a bogus commit that only changes a comment or something simple like that.

@tekkamanendless

This comment has been minimized.

Copy link
Contributor Author

commented Feb 8, 2019

Okay, I removed a couple debug lines that weren't adding anything, and Jenkins actually ran the build this time.

@rsandell rsandell merged commit 6c836a4 into jenkinsci:master Feb 15, 2019

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good
Details

@tekkamanendless tekkamanendless deleted the tekkamanendless:dynamic-list-locking branch Feb 15, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.