
Move metrics in lbcore #300

Merged · 20 commits · Sep 1, 2017

Conversation

@svkm102 (Contributor) commented Aug 11, 2017

No description provided.

@svkm102 svkm102 changed the title WIP : Move metrics in lbcore Move metrics in lbcore Aug 16, 2017
@alechenninger (Contributor) left a comment:

Thanks! This makes more sense alongside the lightblue-platform/lightblue-core#808 review.

* CDI injection.
*/
private static final RequestMetrics metrics =
new DropwizardRequestMetrics(MetricRegistryFactory.getMetricRegistry());
Contributor:

This should probably move to the RestConfiguration class, where it can pass it to the LightblueFactory.

Also, I forgot that getMetricRegistry() also initializes the JMX reporter. We should probably make that clearer somehow. I'd go with simply getJmxMetricRegistry().

Contributor Author:

Changed getMetricRegistry() to getJmxMetricRegistry() for clarity.
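
For readers following along, a minimal sketch of what a factory method with that name might look like with Dropwizard Metrics 3.x. The class name comes from the diff above, but the body is an assumption, not the actual lightblue implementation:

```java
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;

public final class MetricRegistryFactory {
    private static MetricRegistry registry;

    private MetricRegistryFactory() {}

    // The name makes it explicit that a JMX reporter is started as a side effect.
    public static synchronized MetricRegistry getJmxMetricRegistry() {
        if (registry == null) {
            registry = new MetricRegistry();
            // Publishes every metric in the registry as a JMX MBean.
            JmxReporter.forRegistry(registry).build().start();
        }
        return registry;
    }
}
```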

@@ -111,7 +122,7 @@ public Response getSearchesForEntity(@PathParam("entity") String entity,
}
CallStatus st=new FindCommand(freq.getEntityVersion().getEntity(),
freq.getEntityVersion().getVersion(),
freq.toJson().toString()).run();
freq.toJson().toString(), metrics).run();
Contributor:

I'm surprised we need to pass in the metrics here. The Mediator should already know how to track this; it's no longer the responsibility of the HTTP layer to track metrics.

@alechenninger (Contributor) commented Aug 21, 2017:

Actually, I had an idea about this that I just commented on in the other PR. We have the option of starting tracking in the REST layer (here) and then passing the context along.

Another alternative is to go back to doing most tracking in the REST layer (not passing the ctx along), but for bulk requests just pass along the RequestMetrics so it can track the individual requests itself. You'll still need the RequestMetrics interface in core to accomplish that.

Contributor Author:

I would prefer starting metric tracking at the REST layer and then passing the ctx along to the mediator for all requests (including bulk). One benefit of tracking metrics at the mediator level is that we get more insight into errors that occur there. If we track everything at the REST level, mediator errors just get marked as completed requests instead of error requests.

So RequestMetrics will be passed along with each request to the commands, the commands will initialize the context, and then the ctx is passed to the mediator for further tracking at the mediator level (error tracking and marking the request as completed).

I was thinking of marking request completion at the REST level, but that would not be able to handle bulk requests, so ending the request has to be handled at the mediator level itself.

Contributor:

> If we track everything at the REST level, mediator errors just get marked as completed requests instead of error requests.

We can use the errors in the Response object and add all of those to the context. This is the same error resolution that was being done in the mediator, so it's no different; it just happens at a different time and place, which doesn't matter for error counters.

> I was thinking of marking request completion at the REST level, but that would not be able to handle bulk requests, so ending the request has to be handled at the mediator level itself.

Why not? The bulk request code would then do the job of tracking the individual requests (start, end, and errors); the REST layer would still wrap the bulk request itself.
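
A rough sketch of that proposal; the Response.getErrors() accessor and the context method names here are assumptions based on the identifiers visible elsewhere in this diff:

```java
// Start monitoring at the REST layer, run the request through the mediator,
// then fold any errors the mediator collected into the metrics context.
RequestMetrics.Context metricCtx = metrics.startEntityRequest("find", entity, version);
try {
    Response response = getMediator().find(ireq);
    if (response.getErrors() != null) {
        for (Error error : response.getErrors()) {
            metricCtx.markRequestException(error);
        }
    }
    return new CallStatus(response);
} finally {
    metricCtx.endRequestMonitoring();
}
```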

Contributor Author:

I wrote that comment before I made the changes to track errors via the Response object. With Response-object-based tracking, bulk requests can be handled easily, as can mediator errors.

Member:

@svkm102 Can you do a find/replace on the version number to substitute an underscore (_) for the dot (.) in the version? Graphite treats . specially, so it will throw things off if we pass that along. So something like metrics:name=api.find.test.1.0.0.requests.active would become metrics:name=api.find.test.1_0_0.requests.active.

Member:

Note that I didn't mean to change it in the command calls, just wherever you are defining the name of the metric being captured.

Contributor Author:

Updated the version handling to replace "." with "_" in lightblue-core. That should take care of versions in all commands.
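
For illustration, the substitution when building a metric name might look roughly like this (a sketch; the actual helper lives in lightblue-core and may differ):

```java
// Graphite treats '.' as a path separator, so "1.0.0" would otherwise be split
// across three metric path segments.
static String escapeVersion(String version) {
    return version.replace(".", "_");
}

// escapeVersion("1.0.0") -> "1_0_0", giving
// metrics:name=api.find.test.1_0_0.requests.active
```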

@@ -86,6 +90,7 @@ public static AbstractLockCommand getLockCommand(String request) {

@Override
public CallStatus run() {
RequestMetrics.Context context = metrics.startLockRequest(getLockCommandName(), domain);
Contributor:

Does this have to be done at the HTTP layer?

I understand if locking is special, just wondering.

Contributor Author:

I don't think we have a lock implementation at the CRUD level; everything lock-related seems to be at the REST level only. The same is true of GenerateCommand, so this is the only place their metrics can be tracked.

Contributor:

Makes sense

// only when stream is fully written.
if(stream) {
context = metrics.startStreamingEntityRequest("find", entity, version);
}
Contributor:

We don't need to pollute the HTTP layer with this here; we can still track it in the Mediator. We just have to listen for when the streaming results are depleted.

Contributor Author:

Moved streaming request tracking to the mediator.
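
One hypothetical way the mediator side could "listen for depletion" is to wrap the result iterator and close the metrics context once it is exhausted; all names here are illustrative rather than the actual core API:

```java
import java.util.Iterator;

// Illustrative wrapper: ends request monitoring only once the streamed results
// have been fully consumed, instead of when the HTTP handler returns.
class MonitoredResultStream<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final RequestMetrics.Context context;
    private boolean ended;

    MonitoredResultStream(Iterator<T> delegate, RequestMetrics.Context context) {
        this.delegate = delegate;
        this.context = context;
    }

    @Override
    public boolean hasNext() {
        boolean hasNext = delegate.hasNext();
        if (!hasNext && !ended) {
            // Stream fully written: the request is only now complete.
            ended = true;
            context.endRequestMonitoring();
        }
        return hasNext;
    }

    @Override
    public T next() {
        return delegate.next();
    }
}
```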

@@ -65,13 +66,13 @@ public AbstractRestCommand() {
* @return
* @throws Exception
*/
protected Mediator getMediator() {
protected Mediator getMediator(RequestMetrics metrics) {
Contributor:

I suggest not passing RequestMetrics to getMediator like this. It's a bit odd and makes the code a bit messy. See comments in lightblue-platform/lightblue-core#808.

Contributor Author:

Removed.

@alechenninger (Contributor) left a comment:

Just about there! Thanks!


if (context != null) {
context.endRequestMonitoring();
}
// TODO: What if there is IOException?
Contributor:

Still need to mark any errors that occur here I think. Otherwise I suspect they'll bubble up to RESTEasy and get lost.

Contributor Author:

Added a catch block to mark the request exception.
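
The general shape being applied to the commands, expressed as a hypothetical helper (the method name and wrapping are mine; the markRequestException/endRequestMonitoring calls follow the diff):

```java
// Hypothetical helper: run a command body, mark any exception on the metrics
// context, and always end monitoring, so failures can't bubble up to RESTEasy
// without being counted.
static CallStatus runWithMetrics(RequestMetrics.Context context,
                                 java.util.concurrent.Callable<CallStatus> body) {
    try {
        return body.call();
    } catch (Exception e) {
        context.markRequestException(e, e.getMessage());
        return new CallStatus(Error.get(RestCrudConstants.ERR_REST_ERROR, e.toString()));
    } finally {
        context.endRequestMonitoring();
    }
}
```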

return new CallStatus(new Response());
} else {
return new CallStatus(getMediator().find(ireq));
return new CallStatus(getMediator().find(ireq, metricCtx));
}
} catch (Error e) {
Contributor:

I think we also need to mark errors for conditions above:

             try {
                 ireq = getJsonTranslator().parse(FindRequest.class, JsonUtils.json(request));
             } catch (Exception e) {
                 LOGGER.error("find:parse failure: {}", e);
                 return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Error during the parse of the request"));
             }
             LOGGER.debug("Find request:{}", ireq);
             try {
                 validateReq(ireq, entity, version);
             } catch (Exception e) {
                 LOGGER.error("find:validate failure: {}", e);
                 return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Request is not valid"));
             }

Similar errors can probably occur in other commands, too.

Contributor Author:

Added error marking for these conditions in all the missing places across the commands.

@alechenninger (Contributor) left a comment:

A few small fixes; I think one more update and we should be good.

@@ -155,14 +163,26 @@ public CallStatus run() {
streamResponse=getMediator().findAndStream(ireq);
return new CallStatus(new Response());
} else {
return new CallStatus(getMediator().find(ireq));
metricCtx.endRequestMonitoring();
r = getMediator().find(ireq);
Contributor:

I think you'll want to remove metricCtx.endRequestMonitoring(); here since it's in the finally block (also it's in the wrong order here).

Contributor Author:

Copy/paste at the wrong location. Thanks for catching it :). Fixed.

LOGGER.error("find:parse failure: {}", e);
return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, "Error during the parse of the request"));
}
LOGGER.debug("Find request:{}", ireq);
try {
validateReq(ireq, entity, version);
} catch (Exception e) {
metricCtx.markRequestException(e);
Contributor:

For these cases we might want to mark with the exception instance returned from Error.get(...) to be consistent with other errors. That said, it looks like we would lose the context of what the specific exception is if we did that, because lightblue's Error exceptions will just be logged simply as "Error". Unfortunately this is already the case for all of the other Errors in the Response. Not very specific!

I have an idea to fix that though. Will comment about that in core.

Contributor:

Oh, never mind: it looks like you already added improvements around the lightblue Error objects :-). Nice! So yes, using the Error.get value to log the errors here like the others should be more consistent.

Contributor Author:

Replaced all instances of metricCtx.markRequestException(e) with metricCtx.markRequestException(e, e.getErrorCode()) (for lightblue-specific errors) and metricCtx.markRequestException(e, e.getMessage()) (for other exceptions), to get error details into the metric.

}
}
};
}

@Override
public CallStatus run() {
metricCtx = metrics.startStreamingEntityRequest("find", entity, version);
Contributor:

We only want to start the streaming version if stream is true. Otherwise, use the regular version.

Also, a slight nitpick: it looks like there is a mix of tabs and spaces in some of the changes.

Contributor Author:

I had it originally, but it looks like it got removed during refactoring. Thanks for pointing it out. Fixed.

LOGGER.error("failure: {}", e);
return new CallStatus(e);
} catch (Exception e) {
metricCtx.markRequestException(e, e.getMessage());
LOGGER.error("failure: {}", e);
return new CallStatus(Error.get(RestCrudConstants.ERR_REST_ERROR, e.toString()));
Contributor:

So I think this case will still benefit from using the Error instance so it's consistent with the response like other Errors that we handle, like so:

              LOGGER.error("failure: {}", e);
              Error error = Error.get(RestCrudConstants.ERR_REST_ERROR, e.toString());
              metricCtx.markRequestException(error);
              return new CallStatus(error);

This implies an overload like:

        @Override
        public void markRequestException(Error e) {
            markRequestException(e, e.getErrorCode());
        }

Which I think is a good idea anyway. We want to encapsulate as much as possible, not rely on clients to know how to map an Error to marking exceptions.

I would use that for the catch above, too.

Contributor Author:

I misunderstood your earlier comment about this. Thanks for the clarification and example. Updated it as suggested.

}
}

protected abstract JsonNode runLockCommand(Locking locking);

public abstract String getLockCommandName();
Contributor:

I would probably keep this protected if possible, FWIW.

Contributor Author:

Changed visibility.

@@ -50,6 +53,7 @@ public CallStatus run() {
try {
req = getJsonTranslator().parse(BulkRequest.class, JsonUtils.json(request));
} catch (Exception e) {
metricCtx.markRequestException(e, e.getMessage());
LOGGER.error("bulk:parse failure: {}", e);
return new CallStatus(Error.get(RestCrudConstants.ERR_REST_ERROR, "Error parsing request"));
Contributor:

Again for these I would use the Error type returned from Error.get(RestCrudConstants.ERR_REST_ERROR, "Error parsing request").

Contributor Author:

Fixed.

@alechenninger (Contributor) left a comment:

Missed the saved search stuff before. Thanks!

}

@Override
public CallStatus run() {
RequestMetrics.Context savedSearchMetricCtx = metrics.startEntityRequest("savedsearch", entity, version);
Contributor:

Can we actually use the saved search name in the metrics for this? That would tell us exactly how certain queries are performing. Because this has an additional parameter, I'd probably track this as startSavedSearchRequest.

Contributor Author:

Added startSavedSearchRequest to track saved search monitoring, with searchName as an added parameter.

}

@Override
public CallStatus run() {
RequestMetrics.Context savedSearchMetricCtx = metrics.startEntityRequest("savedsearch", entity, version);
RequestMetrics.Context findMetricCtx = null;
Contributor:

I'm not sure we need to track both a saved search and a find for a saved search. The saved search metric counts as a find, so you could argue that we can combine them with Graphite queries if we want. Then we don't have to worry about tracking two things in the code. This also technically lets us do things like compare non-saved searches with saved searches; if we put saved searches in finds too, then it is sort of lossy.

Contributor Author:

Removed redundant find monitoring.

}

@Override
public CallStatus run() {
RequestMetrics.Context metricCtx = metrics.startSavedSearchRequest("savedsearch", searchName, entity, version);
Contributor:

I think passing the "savedsearch" argument is not needed now (we know it is a saved search already).

Contributor Author:

Forgot to remove the extra param. Thanks for catching this one. Fixed.
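
Pulling the thread together, the slice of RequestMetrics exercised by these commands presumably ends up looking roughly like this (reconstructed from the calls in this PR, not the authoritative lightblue-core definition):

```java
public interface RequestMetrics {
    Context startEntityRequest(String operation, String entity, String version);
    Context startStreamingEntityRequest(String operation, String entity, String version);
    Context startLockRequest(String lockOperation, String domain);
    Context startSavedSearchRequest(String searchName, String entity, String version);

    interface Context {
        void endRequestMonitoring();
        // Detail is the error code for lightblue Errors, or the message otherwise.
        void markRequestException(Exception e, String detail);
        // Convenience overload suggested above, delegating to the two-argument form.
        void markRequestException(Error e);
    }
}
```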

return new CallStatus(Error.get(RestCrudConstants.ERR_REST_FIND, e.toString()));
return new CallStatus(error);
} finally {
if (!stream) {
Contributor:

There are failure cases where we are streaming, but the context is not ended. We should unit test these cases to make sure they are caught and handled. For example, the catch block at line 150: a failure there with a streaming request will never close the context.

I think if you make line 185 if (streamResponse == null) { instead of if (!stream) { it will fix these cases, but I'd highly recommend having unit tests to make sure these branches are all handled, as it's too easy to miss one.

Contributor Author:

Using the streamResponse null check handles both streaming and non-streaming cases better. Changed the condition accordingly and added test cases in FindCommandTest that exercise both streaming and non-streaming cases with metrics handling, to cover all the branches.
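
For example, a test along these lines (Mockito-based, with illustrative names rather than the exact FindCommandTest methods) can assert that a parse failure still marks the error and closes the context:

```java
import static org.mockito.Mockito.*;

import org.junit.Test;

public class FindCommandMetricsSketchTest {

    @Test
    public void findWithParseProblemMarksErrorAndEndsContext() {
        RequestMetrics metrics = mock(RequestMetrics.class);
        RequestMetrics.Context ctx = mock(RequestMetrics.Context.class);
        when(metrics.startEntityRequest(anyString(), anyString(), anyString())).thenReturn(ctx);

        // "not json" forces the parse branch to fail before the mediator is reached.
        // Constructor arguments follow the FindCommand call shown in the diff above.
        new FindCommand("test", "1.0.0", "not json", metrics).run();

        verify(ctx).markRequestException(any(Error.class));
        verify(ctx).endRequestMonitoring();
    }
}
```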

@alechenninger (Contributor):

Did a final once-over; all I can see is the issue with errors during streaming find requests, as mentioned. Otherwise looks good!

Sunny Mourya added 3 commits August 29, 2017 11:44
Conflicts:
	crud/src/test/java/com/redhat/lightblue/rest/crud/cmd/FindCommandTest.java
@alechenninger (Contributor) left a comment:

Nice tests you added. Thank you so much for your patience; I think this is ready to try out in dev. Thanks!

(Note there is one change I requested to core, but this PR is good to go when that is).

@derek63 (Member) commented Aug 31, 2017:

@svkm102 I merged your core PR, but this one has some failing tests. Can you have a look when you have the chance?

Failed tests:   runStreamFindWithParseProblemAndMetrics(com.redhat.lightblue.rest.crud.cmd.FindCommandTest): expected:<1> but was:<0>
  runFindWithInvalidAndMetrics(com.redhat.lightblue.rest.crud.cmd.FindCommandTest): expected:<1> but was:<0>
  runStreamFindWithInvalidAndMetrics(com.redhat.lightblue.rest.crud.cmd.FindCommandTest): expected:<1> but was:<0>
  runFindWithParseProblemAndMetrics(com.redhat.lightblue.rest.crud.cmd.FindCommandTest): expected:<1> but was:<0>

@svkm102 svkm102 merged commit 753b819 into lightblue-platform:master Sep 1, 2017
@svkm102 svkm102 deleted the metrics-refractoring branch November 7, 2017 06:15