Move metrics in lbcore #300
Conversation
Thanks! Makes more sense with lightblue-platform/lightblue-core#808 review also.
 * CDI injection.
 */
private static final RequestMetrics metrics =
        new DropwizardRequestMetrics(MetricRegistryFactory.getMetricRegistry());
This should probably move to the RestConfiguration class, where it can pass it to the LightblueFactory.
Also, I forgot that getMetricRegistry() also initializes the JMX reporter. We should probably make that clearer somehow. I'd go with simply getJmxMetricRegistry().
Changed getMetricRegistry() to getJmxMetricRegistry() for clarity.
@@ -111,7 +122,7 @@ public Response getSearchesForEntity(@PathParam("entity") String entity,
     }
     CallStatus st=new FindCommand(freq.getEntityVersion().getEntity(),
                                   freq.getEntityVersion().getVersion(),
-                                  freq.toJson().toString()).run();
+                                  freq.toJson().toString(), metrics).run();
I'm surprised we need to pass in the metrics here. Mediator should already know how to track this; it's now not the responsibility of the HTTP layer to track metrics.
Actually, I had an idea about this that I just commented on the other PR. We have the option to start tracking in the REST layer (here), and then pass the context along.
Another alternative is to go back to doing most tracking in the REST layer (not passing the ctx along), but for bulk requests just pass along the RequestMetrics so it can track the individual requests itself. You'll still need the RequestMetrics interface in core to accomplish that.
I would prefer starting metric tracking at the REST layer and then passing the ctx along to the mediator for all requests (including bulk). One benefit of tracking metrics at the mediator level is that we get more insight into errors that occur there. If we track everything at the REST level, it just marks mediator errors as completed requests instead of error requests.
So RequestMetrics will be passed along with each request to the commands; the commands will initialize the context and then pass the ctx to the mediator for further tracking at the mediator level (error tracking and marking the request as completed).
I was thinking of marking request completion at the REST level, but then that would not be able to handle bulk requests. So ending the request has to be handled at the mediator level itself.
> If we track everything at REST level, it just marks mediator errors as completed requests, instead of error requests.

We can use the errors in the Response object and add all of those to the context. This is the same error resolution that was being done in the mediator, so it's no different; it just happens at a different time and place, which doesn't matter for error counters.

> I was thinking of marking request completion at REST level, but then that would not be able to handle Bulk requests. So request ending should have to be handled at mediator level itself.

Why not? The bulk request code would then do the job of tracking the individual requests (start, end, and errors). Around the bulk request itself would be the REST layer.
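The alternative described here can be sketched roughly as follows. This is an illustration only: the interface names echo the RequestMetrics API under discussion in this PR, but the standalone class and method names are assumptions, not lightblue code. The bulk handler itself starts, errors, and ends a metrics context per sub-request, while the REST layer would wrap the bulk request as a whole.

```java
import java.util.List;

// Illustrative sketch, not lightblue code: each sub-request of a bulk call
// gets its own metrics context (start, error marking, end), so per-request
// counters stay accurate even though only one HTTP request was made.
class BulkMetricsSketch {
    interface Context {
        void markRequestException(Exception e);
        void endRequestMonitoring();
    }

    interface RequestMetrics {
        Context startRequest(String name);
    }

    /** Runs each sub-request under its own metrics context; returns the failure count. */
    static int runBulk(List<Runnable> subRequests, RequestMetrics metrics) {
        int failures = 0;
        for (int i = 0; i < subRequests.size(); i++) {
            Context ctx = metrics.startRequest("bulk.sub." + i);
            try {
                subRequests.get(i).run();
            } catch (RuntimeException e) {
                ctx.markRequestException(e);
                failures++;
            } finally {
                ctx.endRequestMonitoring(); // ended per sub-request, not at the REST layer
            }
        }
        return failures;
    }
}
```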
I wrote that comment before I made the changes to track errors via the response object. With response-object-based tracking, bulk requests can be handled easily, as well as mediator errors.
@svkm102 Can you do a find/replace on the version number to substitute an underscore ( _ ) for the dot ( . ) in the version? Graphite treats . specially, so it will throw things off if we pass that along. So something like metrics:name=api.find.test.1.0.0.requests.active would become metrics:name=api.find.test.1_0_0.requests.active.
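A minimal sketch of the requested substitution; the helper and class names here are hypothetical, not the actual lightblue implementation:

```java
// Hypothetical helper showing the substitution: Graphite treats '.' as a
// path separator, so the version "1.0.0" must become "1_0_0" before being
// embedded in a metric name.
class MetricNames {
    static String sanitizeVersion(String version) {
        return version.replace('.', '_');
    }

    // Builds the kind of name discussed above, e.g. api.find.test.1_0_0.requests.active
    static String findMetricName(String entity, String version) {
        return "api.find." + entity + "." + sanitizeVersion(version) + ".requests.active";
    }
}
```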
Note that I didn't mean to change it in the command calls, just wherever you are defining the name of the metric being captured.
Updated the version handling to replace "." with "_" in lightblue-core. That should take care of versions in all commands.
@@ -86,6 +90,7 @@ public static AbstractLockCommand getLockCommand(String request) {

     @Override
     public CallStatus run() {
+        RequestMetrics.Context context = metrics.startLockRequest(getLockCommandName(), domain);
Does this have to be done at the HTTP layer?
I understand if locking is special, just wondering.
I don't think we have a lock implementation at the CRUD level. Everything lock-related seems to be at the REST level only. The same is the case with GenerateCommand. So this is the only place their metrics can be tracked.
Makes sense
// only when stream is fully written.
if (stream) {
    context = metrics.startStreamingEntityRequest("find", entity, version);
}
We don't need to pollute the HTTP layer with this here; we can still track it in the Mediator. We just have to listen for when the streaming results are depleted.
Moved streaming request tracking to the mediator.
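One way the mediator can end monitoring only when the streamed results are depleted is to wrap the result iterator. This is a sketch under assumptions: the Context interface mirrors the one discussed in this PR, but the wrapper class itself is hypothetical, not lightblue code.

```java
import java.util.Iterator;

// Sketch: the mediator hands back an iterator that ends the metrics context
// exactly once, at the moment the streamed results run out, so the HTTP
// layer never has to know about streaming metrics.
class MeteredStream {
    interface Context {
        void endRequestMonitoring();
    }

    static <T> Iterator<T> endOnDepletion(Iterator<T> results, Context ctx) {
        return new Iterator<T>() {
            private boolean ended = false;

            public boolean hasNext() {
                boolean more = results.hasNext();
                if (!more && !ended) {
                    ended = true;
                    ctx.endRequestMonitoring(); // fires once, at depletion
                }
                return more;
            }

            public T next() {
                return results.next();
            }
        };
    }
}
```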
@@ -65,13 +66,13 @@ public AbstractRestCommand() {
      * @return
      * @throws Exception
      */
-    protected Mediator getMediator() {
+    protected Mediator getMediator(RequestMetrics metrics) {
I suggest not passing RequestMetrics to getMediator like this. It's a bit odd and makes the code a bit messy. See comments in lightblue-platform/lightblue-core#808.
Removed.
Just about there! Thanks!
if (context != null) {
    context.endRequestMonitoring();
}
// TODO: What if there is IOException?
Still need to mark any errors that occur here I think. Otherwise I suspect they'll bubble up to RESTEasy and get lost.
Added a catch block to mark the request exception.
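The shape of such a catch block might look like the following; this is a standalone sketch with hypothetical names (not the lightblue code), assuming unchecked exceptions propagate out of the stream write:

```java
// Sketch: an exception thrown while writing the streamed response is recorded
// on the metrics context before it propagates to the container, and
// monitoring always ends exactly once.
class StreamWriteSketch {
    interface Context {
        void markRequestException(Exception e);
        void endRequestMonitoring();
    }

    static void writeStream(Runnable write, Context context) {
        try {
            write.run();
        } catch (RuntimeException e) {
            context.markRequestException(e); // otherwise the error bubbles up to RESTEasy and is lost
            throw e;
        } finally {
            context.endRequestMonitoring();
        }
    }
}
```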
        return new CallStatus(new Response());
    } else {
-       return new CallStatus(getMediator().find(ireq));
+       return new CallStatus(getMediator().find(ireq, metricCtx));
    }
} catch (Error e) {
I think we also need to mark errors for conditions above:
try {
    ireq = getJsonTranslator().parse(FindRequest.class, JsonUtils.json(request));
} catch (Exception e) {
    LOGGER.error("find:parse failure: {}", e);
    return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Error during the parse of the request"));
}
LOGGER.debug("Find request:{}", ireq);
try {
    validateReq(ireq, entity, version);
} catch (Exception e) {
    LOGGER.error("find:validate failure: {}", e);
    return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Request is not valid"));
}
Similar errors can probably occur in other commands, too.
Added error conditions in all missing places in all commands.
A few small fixes; I think one more update and we should be good.
@@ -155,14 +163,26 @@ public CallStatus run() {
        streamResponse=getMediator().findAndStream(ireq);
        return new CallStatus(new Response());
    } else {
-       return new CallStatus(getMediator().find(ireq));
+       metricCtx.endRequestMonitoring();
+       r = getMediator().find(ireq);
I think you'll want to remove metricCtx.endRequestMonitoring(); here, since it's already in the finally block (it's also in the wrong order here).
Copy/paste at the wrong location. Thanks for catching it :) Fixed.
    LOGGER.error("find:parse failure: {}", e);
    return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Error during the parse of the request"));
}
LOGGER.debug("Find request:{}", ireq);
try {
    validateReq(ireq, entity, version);
} catch (Exception e) {
    metricCtx.markRequestException(e);
For these cases we might want to mark with the exception instance returned from Error.get(...) to be consistent with other errors. That said, it looks like we would lose the context of what the specific exception is if we did that, because lightblue's Error exceptions will just be logged simply as "Error". Unfortunately this is already the case for all of the other Errors in the Response. Not very specific!
I have an idea to fix that though. Will comment about that in core.
Oh, never mind; it looks like you already added improvements around the lightblue Error objects :-). Nice! So yeah, using the Error.get value to log the errors here like the others should be more consistent.
Replaced all instances of metricCtx.markRequestException(e) with metricCtx.markRequestException(e, e.getErrorCode()) (for lightblue-specific errors) or metricCtx.markRequestException(e, e.getMessage()) (for other exceptions) to get error details into the metric.
            }
        }
    };
}

@Override
public CallStatus run() {
    metricCtx = metrics.startStreamingEntityRequest("find", entity, version);
We only want to start the streaming version if stream is true; otherwise, use the regular version.
Also, a slight nitpick: it looks like there is a mix of tabs and spaces in some of the changes.
I had it originally, but it looks like it got removed during refactoring. Thanks for pointing it out. Fixed.
    LOGGER.error("failure: {}", e);
    return new CallStatus(e);
} catch (Exception e) {
    metricCtx.markRequestException(e, e.getMessage());
    LOGGER.error("failure: {}", e);
    return new CallStatus(Error.get(RestCrudConstants.ERR_REST_ERROR, e.toString()));
So I think this case will still benefit from using the Error instance so it's consistent with the response like other Errors that we handle, like so:
LOGGER.error("failure: {}", e);
Error error = Error.get(RestCrudConstants.ERR_REST_ERROR, e.toString());
metricCtx.markRequestException(error);
return new CallStatus(error);
This implies an overload like:
@Override
public void markRequestException(Error e) {
    markRequestException(e, e.getErrorCode());
}
Which I think is a good idea anyway. We want to encapsulate as much as possible, not rely on clients to know how to map an Error to marking exceptions.
Would use that for above catch too.
I misunderstood your earlier comment regarding the same. Thanks for the clarification and example. Updated it as suggested.
    }
}

protected abstract JsonNode runLockCommand(Locking locking);

public abstract String getLockCommandName();
Would probably keep this protected if possible FWIW
Changed visibility.
@@ -50,6 +53,7 @@ public CallStatus run() {
    try {
        req = getJsonTranslator().parse(BulkRequest.class, JsonUtils.json(request));
    } catch (Exception e) {
+       metricCtx.markRequestException(e, e.getMessage());
        LOGGER.error("bulk:parse failure: {}", e);
        return new CallStatus(Error.get(RestCrudConstants.ERR_REST_ERROR, "Error parsing request"));
Again, for these I would use the Error type returned from Error.get(RestCrudConstants.ERR_REST_ERROR, "Error parsing request").
Fixed.
Missed the saved search stuff before. Thanks!
}

@Override
public CallStatus run() {
    RequestMetrics.Context savedSearchMetricCtx = metrics.startEntityRequest("savedsearch", entity, version);
Can we use the saved search name in the metrics for this, actually? That will tell us exactly how certain queries are performing. Because this has an additional parameter, I'd probably track this as startSavedSearchRequest.
Added startSavedSearchRequest to track monitoring of savedSearch with searchName as an added parameter.
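For illustration, a hypothetical metric-name builder showing how the search name could appear in the metric path; this is an assumption, not the actual lightblue implementation (it also applies the dot-to-underscore substitution requested earlier in the review):

```java
// Hypothetical: a saved-search metric name carrying the search name itself,
// so per-query performance shows up as its own series in Graphite.
class SavedSearchMetricName {
    static String metricName(String searchName, String entity, String version) {
        return "api.savedsearch." + searchName + "." + entity + "." + version.replace('.', '_');
    }
}
```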
}

@Override
public CallStatus run() {
    RequestMetrics.Context savedSearchMetricCtx = metrics.startEntityRequest("savedsearch", entity, version);
    RequestMetrics.Context findMetricCtx = null;
I'm not sure we need to track both saved search and find for a saved search. The saved search metric counts as a find, so you could argue that we can combine this with Graphite queries if we want. Then we don't have to worry about tracking two things in the code. This also technically allows us to do things like compare non-saved searches with saved searches. If we put saved searches in finds too, then it is sort of lossy.
Removed redundant find monitoring.
}

@Override
public CallStatus run() {
    RequestMetrics.Context metricCtx = metrics.startSavedSearchRequest("savedsearch", searchName, entity, version);
I think passing the "savedsearch" argument is not needed now (we know it is a saved search already).
Forgot to remove the extra param. Thanks for catching this one. Fixed.
-   return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, e.toString()));
+   return new CallStatus(error);
} finally {
    if (!stream) {
There are failure cases where we are streaming but the context is not ended. We should unit test these cases to make sure they are caught and handled. For example, the catch block at line 150: a failure there with a streaming request will never close the context.
I think if you make line 185 if (streamResponse == null) { instead of if (!stream) { it will fix these cases, but I'd highly recommend having unit tests to make sure these branches are all handled, as it's too easy to miss one.
Using the streamResponse null check handles both streaming and non-streaming cases. Changed the condition accordingly and added test cases in FindCommandTest that cover streaming and non-streaming requests with metrics handling, to cover all cases.
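The control flow under test can be sketched in isolation. Variable names follow the PR's FindCommand, but this standalone class is illustrative, not lightblue code: monitoring ends in the finally block only when no stream was handed off, which covers the non-streaming path and streaming requests that fail before the hand-off.

```java
// Sketch: 'if (streamResponse == null)' in the finally block ends monitoring
// for non-streaming requests AND for streaming requests that failed before a
// stream was created; a successful hand-off leaves ending to the stream itself.
class FindMetricsFlow {
    interface Context {
        void endRequestMonitoring();
    }

    static boolean run(boolean stream, boolean failEarly, Context ctx) {
        Object streamResponse = null;
        try {
            if (failEarly) {
                throw new RuntimeException("parse failure"); // e.g. the catch block discussed above
            }
            if (stream) {
                streamResponse = new Object(); // stream now owns ending the context
            }
            return true;
        } catch (RuntimeException e) {
            return false;
        } finally {
            if (streamResponse == null) { // instead of if (!stream)
                ctx.endRequestMonitoring();
            }
        }
    }
}
```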
Did a final once-over; all I can see is the issue with errors during streaming find requests, as mentioned. Otherwise looks good!
Conflicts: crud/src/test/java/com/redhat/lightblue/rest/crud/cmd/FindCommandTest.java
Nice tests you added. Thank you so much for your patience, I think this is ready to try out in dev. Thanks!
(Note there is one change I requested to core, but this PR is good to go when that is).
@svkm102 I merged your core PR, but this one has some failing tests. Can you have a look when you have the chance?
No description provided.