[BUG] Problem registering governance engine / Asset consumer exceptions #6549
Comments
May be related to #6552. This scenario works correctly with the local graph repository.
On a third try (step by step) it passed. Ran again (run all) and it failed. Similar to the CTS, this looks like it may be a timing-related issue.
Here's a fragment of the log from an error:
Also note that AssetConsumer OMAS is reporting continual errors (more at https://gist.github.com/c069ada7a9c9e3767322a3a8bb9512a3). Full logs are at https://1drv.ms/u/s!ApVqcIDT57-fmskFDq72ALtm-VVKDA?e=4DolJ3 - however, this also occurs with the data catalog notebook.
Whilst cocoMDS4 does not have a local repository, it does have connected remote repositories.
This seems to work in 3.10 - but even in 3.8/3.9 it was intermittent.
Querying remote members for cocoMDS4:
This also shows up in DINO with an empty cohort list. These are all consistent with the exception seen in the log from AssetConsumer (which is enabled on cocoMDS4) - it has no chance of retrieving anything. cocoMDS5 (proxy) has the same issue. Compare with cocoMDS2, which shows up correctly: https://gist.github.com/f6534a9c706684c64b40b6af0fba7859. This means cocoMDS4 cannot serve data lake users (its role).
The cohort membership issue may be related to #6353 -- this was a timing issue which affected CTS.
Repeated the test with the jars locally: the issue does not occur. This would support this being a repeat of the issue mentioned above - timing differences on cloud (faster) vs local (slow mac). Additionally, Kafka/ZooKeeper differ - Strimzi vs Homebrew. Repeated on Rancher Desktop: the issue does occur. cocoMDS4 is nearly the last server to start (OLS and MDSx after), and the last member of cocoCohort - perhaps an issue with the reconfiguration. Will look at the Kafka message sequence.
This same issue occurs with the coco labs on k8s with 3.10. For example, if we look at the cocoMDS1 registrations:
In this case cocoMDS2 starts up, registers, refreshes its registration, and the other servers in the cohort respond. Interestingly, cocoMDS5 (proxy) does not seem to send out a re-registration request. Looking then at activity around cocoMDS4:
Quite different - there are no audit messages relating to the re-registration requests from elsewhere being received, even though they are being sent. It does still send a registration refresh (as can be seen from the other servers responding to it).
It appears as if cocoMDS4 is not responding to registration events in this case (NEW_MEMBER_IN_COHORT / processRegistrationEvent).
Checking Kafka offsets:
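For reference, a check like this can also be done programmatically. Here is a hypothetical sketch (not taken from this investigation) using the Kafka AdminClient; the broker address and the consumer group id are placeholders, since Egeria derives its own group ids:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CheckGroupLag
{
    public static void main(String[] args) throws Exception
    {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker

        try (AdminClient admin = AdminClient.create(props))
        {
            // Committed positions for a (placeholder) consumer group
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("example-cohort-group")
                         .partitionsToOffsetAndMetadata()
                         .get();

            // End offsets for the same partitions, so that lag = end - committed
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(committed.keySet().stream()
                                 .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())))
                         .all()
                         .get();

            committed.forEach((tp, offsetAndMetadata) ->
            {
                if (offsetAndMetadata != null)
                {
                    long end = ends.get(tp).offset();
                    System.out.printf("%s committed=%d end=%d lag=%d%n",
                                      tp, offsetAndMetadata.offset(), end, end - offsetAndMetadata.offset());
                }
            });
        }
    }
}
```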
The output showed that all consumer groups have 0 lag, so the clients are up to date with all topics. One possible scenario could be that in cocoMDS4 we:
This is less likely to occur with metadata servers since more events are sent and received. It also explains why this is only seen in some situations, and why random delays can affect the behaviour as different servers become synchronized. Finally, once a cohort is established and the topics are all set up, offset management from then on means we don't hit this issue. The possible fixes would include
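To make the suspected window concrete, here is a minimal, hypothetical consumer (not Egeria's connector code; the broker, topic and group names are invented) showing how auto.offset.reset=latest skips anything published between subscribe() and the point where the first poll() actually resolves the start position:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LatestOffsetWindow
{
    public static void main(String[] args)
    {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cocoMDS4-registration-example");    // placeholder group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props))
        {
            // subscribe() returns immediately: no partitions are assigned and no start
            // offset exists yet, even though the application may already report the
            // listener as "active" and go on to publish its registration refresh.
            consumer.subscribe(Collections.singletonList("example.cohort.registration"));

            // Any responses published before the first poll() completes the group join
            // and resolves the position to "latest" are never delivered here.
            while (true)
            {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println("received: " + r.value()));
            }
        }
    }
}
```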
Another possible scenario - we are receiving the messages ok, but not managing the registry store correctly. Looking at the logs, we see the cohort registry store is being created multiple times...?
In fact the audit event is misleading. The audit log message is raised when we fail to read from the file; however, it is NOT created at this point. In fact it appears as if by 13:27:15 we still don't have a valid store saved - this is many minutes after any registration events would have been seen, and is confirmed by exceptions in the debug log. Creates of the registry can be seen in the (debug) logs. This could be cause or effect - either we have omitted to write, or the root cause remains not receiving/processing the registration events.
Confirmation that cocoMDS4 sees its own refresh event - so the listener is definitely active:
And that cocoMDS4 sees refreshes from other cohort members (here is cocoMDS3):
Thanks for this excellent analysis. I am thinking that, now we have a separate topic for the registration events, we can create more traffic on it without affecting the instance events. I was thinking that members of the cohort could send refresh registration requests at regular intervals, which would ensure this timing window is eliminated and give us the basis of a heartbeat mechanism.
The log of the above is slightly misleading.
@mandy-chessell A heartbeat could be a useful backstop -- but I think we have a timing window here; I just can't quite pin it down yet. There's a lot of log data.
Moved cocoMDS4 over to the 'factory' platform for testing (local change) in order to allow for clearer debugging, since many (non-audit) log entries don't include the thread etc., making it tricky to track/debug the issue. The problem still occurs. Proven that the RE_REGISTRATION event is being sent/received by other servers around 10:29:30:
And that cocoMDS4 completes joining the consumer group well before this:
(This occurs for multiple groups up until 10:20:28 -- for the different topics we consume -- but the point is that all are prior to the above, so the offset should be correct, i.e. prior to the refresh.) Yet the listener appears to start ok for cocoMDS4, but no incoming events are reported:
Further debugging shows the listener is working as expected (aside: the default poll is 1s, which seems short, but it is configurable). As per the initial hunch, there is a Kafka API call to establish the start point for reading messages. Here's some activity from cocoMDS4:
So here we
Where we detect offsets and then get messages, we see behaviour more like:
This coincides with our own debugging of incoming messages. When we compare the times for cocoMDS4, which settles on offset 25, we see that our application thinks the listener is initialized around
So what we're seeing here is this timing window. Whilst we think the listener is active immediately, it's not really ready until many seconds later -- since it's only then that the offset is figured out and set. A subtle Kafka behaviour, noting we also have multiple threads at work here. This is consistent with my earlier theory, but provides hard evidence that the issue is the start offset. Setting the offset based on the time we 'decided' to start the Kafka topic connector would be better (and it doesn't matter if it is 30, 60 or 120s later that it actually takes effect). But this should only be done if there isn't a current offset; if there is, we should just continue where we left off, as the server may have been restarted (or another replica is being run). Another alternative would be to check at what point in our Kafka interaction we can be sure the offset is set, and ensure no events are sent out until this point - but this would need extra synchronization logic (we have separate threads for producer/consumer etc.) and would likely be more complex.
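As an illustration of that first approach, here is a minimal sketch (not the actual Egeria change; the method name and structure are invented) that rewinds only the partitions with no committed offset to the time the connector was started, leaving any partition with a saved position to continue where it left off:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StartOffsetRewind
{
    /**
     * Rewind any partition without a committed offset to the timestamp at which we
     * decided to start the topic connector, instead of accepting "latest".
     *
     * @param consumer           an assigned Kafka consumer (assignment must be non-empty)
     * @param connectorStartTime epoch millis captured when the connector was started
     */
    public static void rewindIfNoCommittedOffset(KafkaConsumer<String, String> consumer,
                                                 long                          connectorStartTime)
    {
        Set<TopicPartition> assigned = consumer.assignment();
        Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(assigned);

        // Only look up a start position for partitions that have no saved position
        Map<TopicPartition, Long> toLookUp = new HashMap<>();
        for (TopicPartition tp : assigned)
        {
            if (committed.get(tp) == null)
            {
                toLookUp.put(tp, connectorStartTime);
            }
        }

        // Translate the start time into offsets and seek; a null entry means there is
        // no message at or after that time yet, so the default position is fine.
        Map<TopicPartition, OffsetAndTimestamp> byTime = consumer.offsetsForTimes(toLookUp);
        byTime.forEach((tp, offsetAndTime) ->
        {
            if (offsetAndTime != null)
            {
                consumer.seek(tp, offsetAndTime.offset());
            }
        });
    }
}
```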
To review actual messages we can use
See https://gist.github.com/2ddfb6b242741ea5b6973336c53b3097 for all registration events. Looking at the REFRESH_REGISTRATION_REQUEST (coming from our producer), we see this is:
That time equates to the timestamp shown below; the RESPONSES from the other servers in the cohort begin shortly after this, i.e.:
This is
Currently testing a fix...
On testing, I found that when I tried to query the allocated partitions (so I could validate/set the offset), the set was empty.
The reason is that automatic partition assignment does not occur until after the first poll() on a topic. Prior to Kafka 2.4, a poll(0) could be used to force a metadata refresh without consuming any events; however, this is not possible in later versions of Kafka. The two options are therefore to either
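For illustration, one common pattern for coping with the late assignment (a sketch with placeholder names, not necessarily what the eventual PR does) is to supply a ConsumerRebalanceListener on subscribe(), so the seek happens inside onPartitionsAssigned(), which Kafka only invokes from within poll() once partitions have actually been assigned:

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SeekOnAssignment
{
    public static void subscribeWithRewind(KafkaConsumer<String, String> consumer,
                                           String                        topicName,
                                           long                          connectorStartTime)
    {
        consumer.subscribe(Collections.singletonList(topicName), new ConsumerRebalanceListener()
        {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions)
            {
                // nothing needed for this sketch
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions)
            {
                // Called from within poll() once the assignment is known, so the
                // partition set is no longer empty and seeking is safe here.
                Map<TopicPartition, Long> byStartTime = new HashMap<>();
                partitions.forEach(tp -> byStartTime.put(tp, connectorStartTime));

                Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(byStartTime);
                offsets.forEach((tp, offsetAndTime) ->
                {
                    if (offsetAndTime != null)
                    {
                        consumer.seek(tp, offsetAndTime.offset());
                    }
                });
            }
        });
    }
}
```

(The other usual route is to issue an initial poll() and defer the seek until consumer.assignment() is non-empty.)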
The updated fix now seems to work - raising a PR. For the same scenario we now do a rewind to go back to the desired start time - and find those missing registration events. The result is that cohort membership is now correct for cocoMDS4.
#6549 Fix kafka consumer initial seek position
Is there an existing issue for this?
Current Behavior
Noticed this when checking the notebooks for release.
This notebook is explicitly marked as not to be run. However, it looks to be an issue worth checking, possibly after some refactoring.
Expected Behavior
governance engine registration passes
Steps To Reproduce
Run the incomplete/unsupported notebook 'automated curation' in a coco pharma tutorial environment
Environment
- Egeria: 3.8
Any Further Information?
No response