Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limiting granularity of Phobos distributed traces #46

Closed
object opened this issue Mar 9, 2022 · 23 comments
Closed

Limiting granularity of Phobos distributed traces #46

object opened this issue Mar 9, 2022 · 23 comments

Comments

@object
Copy link

object commented Mar 9, 2022

I tried to use Phobos-instrumented system with Honeycomb APM and quickly got into a problem with rate limit. Honeycomb sets maximum rate of 2000 events per second for a dataset (perhaps higher rate for Enterprise account, I didn't check) and our application exceeded that. While this is understandable because each incoming request results in a complex processing workflow across the Akka cluster nodes, the traffic in fact is relatively low.

I wonder whether you think Phobos tracing can be extended with activity/actor filtering so an Akka application can be configured to to sample just subsets of actors/activities. I haven't thoroghly thought about filtering principles, just trying to figure out how to deal with challenges like described above, when Phobos traces hit certain rate limit set by the APM provider.

@Aaronontheweb
Copy link
Member

I'll write up a longer response to this, but there are definitely ways to accomplish this in Phobos 1.x through:

  1. Trace filtering
  2. Toggle tracing on/off for specific actors

And in Phobos 2.x you can do both of those plus use OpenTelemetry processors to filter out traces before they're exported to a service like HoneyComb. I'll get to work writing up a more detailed tuning guide on this subject.

@object
Copy link
Author

object commented Mar 9, 2022

Thank you. I wasn't aware of either trace filtering or toggle for specific actors. Will appreciated some details on how to achieve this.

@Aaronontheweb
Copy link
Member

Here's the official literature we have so far: https://phobos.petabridge.com/articles/performance.html#performance-optimization-best-practices

But I can probably go deeper on that

@object
Copy link
Author

object commented Mar 9, 2022

Great, that's the good start for me.

@object
Copy link
Author

object commented Mar 10, 2022

Trying to get filtering to work, but still get unwanted messages. This is what I did:

                builder.WithTracing (fun b ->
                    b.SetTracer(tracer) |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.Core.ShardMessages.MediaSetShardMessage>() |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.Ps.PsShardMessages.PsShardMessage>() |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.WebApi.HealthCheck.Actors.UpdateSubcomponentState>() |> ignore)

This is supposed to filter out all message except for the three explicitly specified types above, right? But I still see other messages, both from out domain and Akka system messages (e.g. Akka.Cluster.ClusterHeartbeatSender+Heartbeat) logged.

@object
Copy link
Author

object commented Mar 10, 2022

Hmm, tried AddIncludeMessageFilter, AddExcludeMessageFilter and AddMessageFilter (with ExcludeTypeFilterRule and IncludeTypeFilterRule), but I am still getting various message types logged.

Is the following understanding correct?

  • If I only specify AddIncludeMessageFilter, then only messages of type T will be logged (tried that, didn't work)
  • If I only specify AddExcludeMessageFilter, then only messages of type T will be skipped

What happens if I try the combination of two? Anyway, looks like I can't get any filtering at all.

@Aaronontheweb
Copy link
Member

So the filtering APIs have been, IMHO, intuitive at best and I'd like to simplify what we're doing there. Are you working against Phobos 2.0? I have more leeway to make major changes there and I think I can follow some of the instrumentation patterns that other OTel libraries are doing to accomplish the same goal. I'd be happy to prototype something today and show you what it looks like here.

@object
Copy link
Author

object commented Mar 10, 2022

I am still on Phobos 1.x. I am working on an OTel proof of concept to demo something tomorrow so I guess I won't be able to switch to Phobos now. So I appreciate a hint about what can be wrong with my filtering. But I can switch to Phobos 2.0 next week, if you think it will work better in that respect.

@object
Copy link
Author

object commented Mar 10, 2022

Right now I don't seem to get any filtering to work.

@Aaronontheweb
Copy link
Member

I'll take a look at the 1.x bits right now then.

@Aaronontheweb
Copy link
Member

Aaronontheweb commented Mar 10, 2022

So the way the filters work:

  1. By default, all messages that are already in a trace are included - this behavior can be changed via .IncludeMessagesAlreadyInTrace(false) on the builder. This drops a lot of trace traffic when this happens.
  2. As long as the traced message in Phobos matches any of your filters, it will be included in the trace.

Can you try .IncludeMessagesAlreadyInTrace(false) and see if that helps?

@object
Copy link
Author

object commented Mar 10, 2022

Thanks, will do.

@Aaronontheweb
Copy link
Member

In the meantime, I'm still going to look at improving this as none of this is intuitive.

@object
Copy link
Author

object commented Mar 10, 2022

@Aaronontheweb I see that the number of irrelevant messages has drastically reduced. With a few exceptions, I only see messages from Include filter. Example of a message that is still logged is Akka.Cluster.Tools.PublishSubscribe.Internal.Status that is published by /system/distributedPubSubMediator (Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubMediator).

@Aaronontheweb
Copy link
Member

@object that's great to hear - so I just found a bug in Phobos based on your feedback here:

/// <summary>
    ///     INTERNAL API
    ///     Used to implement filter rules.
    /// </summary>
    internal static class ClusterCustomFilteringRules
    {
        public static readonly ImmutableHashSet<IFilterRule> DistributedPubSubFilter =
            ImmutableHashSet<IFilterRule>.Empty.Add(new IncludeFunctionFilterRule(o =>
            {
                switch (o)
                {
                    // covers the bulk of useful messages
                    case IDistributedPubSubMessage i when i is IWrappedMessage:
                    case Subscribe _:
                    case SubscribeAck _:
                    case Unsubscribe _:
                    case UnsubscribeAck _:
                    case Put _:
                    case Remove _:
                        return true;
                    default:
                        return false;
                }
            }));
    }

We implemented this but..... It is not currently being used inside any of the filters 🤦

We'll push an update which includes this

@Aaronontheweb
Copy link
Member

In theory your filter should already cover this without this filter - so there's likely another issue with the DistributedPubSub filtering specifically that I need to look into.

@Aaronontheweb
Copy link
Member

Ok, looks like @Arkatufus has patched this - we will push a new Phobos release in the next day or two.

@Aaronontheweb
Copy link
Member

Aaronontheweb commented Mar 17, 2022

Update on this - @Arkatufus did indeed patch the DistributedPubSub issue but I decided that our filtering system, generally, isn't robust enough because it doesn't do the following things well:

  1. Distinguish between traces that are explicitly rejected versus "not handled" by a filter - which makes it easy for subsequent rules to accidentally override a previous rule that rejected the message. Only one rule needs to return true for the message to be traced today.
  2. Clearly explain the order in which rules will be applied (they are applied in the order in which they are declared, currently) - and this makes it confusing for teams to build a complete filtration system that is inherently understandable.
  3. Lastly, the behavior around how we handle messages when they are already part of a trace is distinct from whether we start a new trace altogether - this is not clear to end-users at all and is rather unwieldy.

I have a draft PR open right now that makes some breaking changes to the filtering system in order to rectify this.

This code is all for the 1.x branch of Phobos but it will look similar for 2.x. We will release this update for both versions concurrently.

var phobosConfig = new PhobosConfigBuilder()
  .WithTracing(m =>
  {
      m.SetTracer(tracer);

      // don't include messages in trace that don't satisfy other filters
      m.VerboseTracing(false); // this setting is new
      m.IncludeMessagesAlreadyInTrace(false);

      // only accept FilteredMessage types
      m.AddIncludeMessageFilter<FilteredMessage>();
  });

Using the current filtering API, this code means "don't trace any messages under any circumstances unless they implement the FilteredMessage base class."

var phobosConfig = new PhobosConfigBuilder()
          .WithTracing(m =>
          {
              m.SetTracer(tracer);

              // don't include messages in trace that don't satisfy other filters
              m.VerboseTracing(false);
              m.IncludeMessagesAlreadyInTrace(true); // set to true this time

              // only accept FilteredMessage types
              m.AddIncludeMessageFilter<FilteredMessage>();
          });

Using the current filtering API, this code means "don't trace any messages under any circumstances unless they implement the FilteredMessage base class OR they are already part of a trace."

Finally, if the builder pattern isn't easy enough to use we're going to offer the option to replace it altogether with a single ITraceFilter implementation:

/// <summary>
/// Each <see cref="ActorSystem"/> can implement exactly one of these to control
/// the decision-making around which traces to include and which ones to exclude.
///
/// When this type is configured, it will replace all of the <see cref="IFilterRule"/>s
/// that have been previously defined.
/// </summary>
/// <seealso cref="FilterDecision"/>
public interface ITraceFilter
{
    /// <summary>
    /// Evaluates a message processed by an actor and determines whether or not to include it in the current
    /// or a new trace.
    /// </summary>
    /// <param name="message">The message currently being processed by an actor.</param>
    /// <param name="alreadyInTrace">Whether or not this message is already part of an active trace.</param>
    /// <returns><c>true</c> if we are going to trace the message, <c>false</c> otherwise.</returns>
    /// <remarks>
    /// Should return <c>true</c> unless you are explicitly trying to filter a given message out of your trace.
    /// </remarks>
    bool ShouldTraceMessage(object message, bool alreadyInTrace);
}

All of the filtering context is passed in via the method and all of the user-defined filtering rules exist in a single location for this ActorSystem, structured in exactly the order implemented by the user. None of the settings from the builder pattern apply when a custom ITraceFilter is used.

I should have a small demo video of this done soon that shows it in action.

@object
Copy link
Author

object commented Mar 17, 2022

This looks good and reasonable @Aaronontheweb. I am already on Phobos 2, so once you have a beta that supports it, I can give it a try.

@Aaronontheweb
Copy link
Member

Hi @object - we have pushed new Phobos 2.0 binaries that have an API change that should nip this problem in the bud.

Please see the release notes here and follow the configuration instructions - I think this should help quite a bit: https://phobos.petabridge.com/articles/releases/RELEASE_NOTES.html#200-beta4httpssdkbincompublisherpetabridgeproductphobospackagesphobosactorversions200-beta4-and-150-beta1httpssdkbincompublisherpetabridgeproductphobospackagesphobosactorversions150-beta1-march-17th-2022

@Aaronontheweb
Copy link
Member

Hi @object , have you been able to give this a try yet?

@object
Copy link
Author

object commented Mar 22, 2022

@Aaronontheweb Sitting with a test project now, will give you an update today or tomorrow.

@object
Copy link
Author

object commented Mar 22, 2022

@Aaronontheweb What happens with the intermediate spans that don't satisfy filter conditions?

Scenario 1. "don't trace any messages under any circumstances unless they implement the FilteredMessage base class."

We have a transaciton that consists of spans [FilteredMessage1; NotFilteredMessage; FilteredMessage2]
The transaction contains NotFilteredMessage that should be filtered out. Will NotFilteredMessage in the middle terminate the transaciton so only the first FilteredMessage will be seen? If not, then how the transaction will be shown in an APM dashboard? Or there will be two separate transactions [FilteredMessage1] and [FilteredMessage2]?

Scenario 2. "don't trace any messages under any circumstances unless they implement the FilteredMessage base class OR they are already part of a trace."

A: [FilteredMessage1; NotFilteredMessage:; FilteredMessage2]
For 2.A all messages should be included.
B: [NotFilteredMessage; FilteredMessage1; FilteredMessage2]
What happens here? Will the transaction start from FilteredMessage1 and include [FilteredMessage1; FilteredMessage2]?
C: [NotFilteredMessage1; FilteredMessage1; NotFilteredMessage2; FilteredMessage2]
Will the transaction include [FilteredMessage1; NotFilteredMessage2; FilteredMessage2]?

@object object closed this as completed Mar 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants