Handling Metrics and SLA Reporting for Throughput Violating Topics via Datastream Update API Part 1 #928

Merged

Conversation

shrinandthakkar (Collaborator, Author):

The EventProducer of every DatastreamTask reports SLA and latency metrics for every datastream record. But when a topic (or even a single one of its partitions) exceeds the configured throughput thresholds, it introduces latency and SLA misses in the mirroring pipeline.

This pull request is the first part of the changes to handle metrics and SLA reporting for throughput-violating topics via the datastream update API. It introduces the following changes:

  1. The datastream update endpoint now accepts a datastream together with a list of throughput-violating topics, carried in the datastream metadata (see the sketch after this list). The ZkAdapter persists this information in the DatastreamStore.
  2. The update API touches every server host in its normal code path; in that path, every host (Coordinator) maintains a shared cache (a Datastream -> Violating Topics map) holding the latest violations for every datastream.
  3. A new callback lets the EventProducer fetch the latest list of offending topics from the Coordinator, so that the correct set of topics is excluded from metrics and SLA reporting for every record.
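
A minimal sketch of how such a list could be carried in the datastream's (string -> string) metadata and read back on the server. The metadata key name, the comma-separated encoding, and the helper class are assumptions made for illustration, not the exact implementation in this PR.

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: encodes/decodes the violating-topic list in datastream metadata.
public final class ViolatingTopicsMetadataSketch {
  static final String VIOLATING_TOPICS_KEY = "throughputViolatingTopics"; // assumed key name

  // Caller side: encode the topic set before invoking the datastream update endpoint.
  static void write(Map<String, String> datastreamMetadata, Set<String> violatingTopics) {
    datastreamMetadata.put(VIOLATING_TOPICS_KEY, String.join(",", violatingTopics));
  }

  // Server side: decode the topic set; absent or empty metadata means no current violations.
  static Set<String> read(Map<String, String> datastreamMetadata) {
    String encoded = datastreamMetadata.get(VIOLATING_TOPICS_KEY);
    if (encoded == null || encoded.isEmpty()) {
      return Collections.emptySet();
    }
    return Arrays.stream(encoded.split(",")).map(String::trim).collect(Collectors.toSet());
  }
}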

Part 2 of this series will take care of:

  1. Excluding the reporting of metrics for these throughput-violating topics within the EventProducer.
  2. Introducing separate metrics for throughput-violating topics.


@shrinandthakkar force-pushed the inlogs-bad-actors-update-api branch 2 times, most recently from eb4de71 to ba7ba60 on March 17, 2023 07:32
@shrinandthakkar changed the title from "Handling Metrics and SLA Reporting for Throughput Violating Topics via Datastream Update API" to "Handling Metrics and SLA Reporting for Throughput Violating Topics via Datastream Update API Part 1" on Mar 17, 2023
@shrinandthakkar force-pushed the inlogs-bad-actors-update-api branch 8 times, most recently from ee784a5 to 6622bed on March 22, 2023 23:41
(t) -> getThroughputViolatingTopics(t.getDatastreams()) :
(t) -> new HashSet<>();

EventProducer producer =
shrinandthakkar (Collaborator, Author):

From @vmaheshw

Would it make sense to simplify this at the Datastream level rather than the DatastreamTask level? For example, in digest there is only one datastream, so this function will result in 10 copies for the same datastream.

Also, hashing based on the datastream name would be cheaper and more efficient.

shrinandthakkar (Collaborator, Author):

Instead of moving this to the Datastream level, I refactored it so that this callback is initialized only once in the Coordinator, and that single instance is passed to each EventProducer at init.
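
A rough sketch of that refactor with stand-in types (the real Coordinator, DatastreamTask, and EventProducer signatures differ): the Coordinator builds the callback once, and every producer it creates shares the same instance.

import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class CoordinatorCallbackSketch {
  // Stand-in for Brooklin's DatastreamTask, reduced to the one accessor the callback needs.
  interface Task {
    List<String> getDatastreams();
  }

  // Stand-in for the real EventProducer, which would receive the callback at init.
  static class Producer {
    private final Function<Task, Set<String>> _violatingTopicsProvider;

    Producer(Function<Task, Set<String>> violatingTopicsProvider) {
      _violatingTopicsProvider = violatingTopicsProvider;
    }

    Set<String> violatingTopicsFor(Task task) {
      return _violatingTopicsProvider.apply(task);
    }
  }

  // Built exactly once; shared by every EventProducer created by this Coordinator.
  private final Function<Task, Set<String>> _violatingTopicsProvider;

  CoordinatorCallbackSketch(boolean throughputViolationHandlingEnabled) {
    _violatingTopicsProvider = throughputViolationHandlingEnabled
        ? (t) -> getThroughputViolatingTopics(t.getDatastreams())
        : (t) -> Collections.emptySet();
  }

  // Stand-in for the Coordinator's lookup into its violating-topics cache.
  Set<String> getThroughputViolatingTopics(List<String> datastreams) {
    return Collections.emptySet();
  }

  Producer createProducer() {
    return new Producer(_violatingTopicsProvider);
  }
}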

private final ReadWriteLock _throughputViolatingTopicsMapReadWriteLock = new ReentrantReadWriteLock();
private final Lock _throughputViolatingTopicsMapWriteLock = _throughputViolatingTopicsMapReadWriteLock.writeLock();
private final Lock _throughputViolatingTopicsMapReadLock = _throughputViolatingTopicsMapReadWriteLock.readLock();

shrinandthakkar (Collaborator, Author):

From @vmaheshw

Do we need to go to low-level locks for this? Can we not rely on ConcurrentHashMap? The violation calculation does not have to be precise down to the second.

shrinandthakkar (Collaborator, Author):

I would prefer not to use a ConcurrentHashMap: while a "replace-all" on this map is in progress, there is a chance that metrics and SLAs for these violating topics would still get reported. To keep things consistent, I went with this approach.
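
A minimal sketch of the pattern being described; the field names mirror the snippet above, while the map shape and method names are assumptions. Lookups take the read lock, and the "replace-all" from an update swaps the whole view under the write lock, so readers never observe a half-replaced map.

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ViolatingTopicsCacheSketch {
  private final ReadWriteLock _throughputViolatingTopicsMapReadWriteLock = new ReentrantReadWriteLock();
  private final Lock _throughputViolatingTopicsMapWriteLock = _throughputViolatingTopicsMapReadWriteLock.writeLock();
  private final Lock _throughputViolatingTopicsMapReadLock = _throughputViolatingTopicsMapReadWriteLock.readLock();

  // datastream name -> topics currently violating throughput thresholds
  private final Map<String, Set<String>> _throughputViolatingTopicsMap = new HashMap<>();

  // Called from the update path: replace the whole view atomically with respect to readers.
  void updateViolatingTopics(Map<String, Set<String>> latest) {
    _throughputViolatingTopicsMapWriteLock.lock();
    try {
      _throughputViolatingTopicsMap.clear();
      latest.forEach((datastream, topics) -> _throughputViolatingTopicsMap.put(datastream, new HashSet<>(topics)));
    } finally {
      _throughputViolatingTopicsMapWriteLock.unlock();
    }
  }

  // Called from the EventProducer callback: read a consistent snapshot for one datastream.
  Set<String> getViolatingTopics(String datastreamName) {
    _throughputViolatingTopicsMapReadLock.lock();
    try {
      return new HashSet<>(_throughputViolatingTopicsMap.getOrDefault(datastreamName, Collections.emptySet()));
    } finally {
      _throughputViolatingTopicsMapReadLock.unlock();
    }
  }
}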

Collaborator:

I recommend simplicity, especially since this is not on a critical path.

@vmaheshw (Collaborator) left a comment:

I am okay with the read-write lock if the other approver is also fine.

Please address the validation-check comment.

Comment on lines +41 to +42
private static final Double ONE_MEBIBYTE = (double) (1024 * 1024);
private static final Double ZNODE_BLOB_SIZE_LIMIT = ONE_MEBIBYTE;
Collaborator:

Do you really need 2 variables?

shrinandthakkar (Collaborator, Author):

The purposes of these two variables are slightly different:

  • ZNODE_BLOB_SIZE_LIMIT --> validates the data size per znode against the znode limit (1 MiB).
  • ONE_MEBIBYTE --> converts bytes to MiB to get the encoded data size in MiB.

Ideally I could use the same variable, but I kept both so that the two different operations happening on the same value are easier to follow.
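
For reference, a sketch of the byte-to-MiB conversion being described, using hypothetical names rather than the helper actually added in this PR.

import java.nio.charset.StandardCharsets;

// Sketch with hypothetical names: illustrates converting a serialized blob's size to MiB.
final class BlobSizeSketch {
  private static final double ONE_MEBIBYTE = 1024 * 1024;

  // UTF-8 byte length of the serialized datastream JSON, expressed in MiB.
  static double blobSizeInMebibytes(String datastreamJson) {
    return datastreamJson.getBytes(StandardCharsets.UTF_8).length / ONE_MEBIBYTE;
  }
}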

@@ -127,6 +128,12 @@ public void updateDatastream(String key, Datastream datastream, boolean notifyLe
throw new DatastreamException("Datastream does not exists, can not be updated: " + key);
}

// As this limit is ZK specific, adding this validation check specifically in ZookeeperBackedDatastreamStore.
double datastreamBlobSizeInMBs = getBlobSizeInMBs(DatastreamUtils.toJSON(datastream));
Validate.isTrue(datastreamBlobSizeInMBs <= ZNODE_BLOB_SIZE_LIMIT,
Collaborator:

The validation check is tricky, especially in the case of a programmatic update. This will block the datastream update until the logic in the caller is fixed. Can you think of a less disruptive way?

Collaborator:

What are your concerns, @vmaheshw? I understand that we need to add similar validation on the client side as well. If a client request breaches the ZK node size limit, it will eventually fail the update anyway; the failure will just happen in the ZkAdapter/ZkClient layer. To me, this is a more explicit and descriptive way to fail.

Collaborator:

I initially thought that it could impact datastream restarts, but I was wrong.

However, in the scenario where the blob size goes above 1 MiB because of the violation list, we will not be able to disable this at the server level and will have to rely on the external service to disable the feature. Until then, any other update/allowlisting will not go through.

shrinandthakkar (Collaborator, Author):

@vmaheshw, understood your concern.

I have added steps in both the update and create paths so that the throughputViolatingTopic metadata is honored only if the corresponding config is enabled on our server.
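
A sketch of that gating with assumed names (the config flag, method, and metadata key are illustrative, not the PR's actual identifiers): when the feature is disabled on a server, the violating-topics metadata is simply dropped before the datastream is persisted.

import java.util.Map;

// Illustrative only: all names here are assumptions, not the PR's actual config or metadata keys.
final class ViolatingTopicsGatingSketch {
  private static final String VIOLATING_TOPICS_KEY = "throughputViolatingTopics"; // assumed key

  private final boolean _throughputViolatingTopicsHandlingEnabled; // assumed server config flag

  ViolatingTopicsGatingSketch(boolean enabled) {
    _throughputViolatingTopicsHandlingEnabled = enabled;
  }

  // Applied in both the create and update paths before the datastream is persisted.
  void sanitizeMetadata(Map<String, String> datastreamMetadata) {
    if (!_throughputViolatingTopicsHandlingEnabled) {
      datastreamMetadata.remove(VIOLATING_TOPICS_KEY);
    }
  }
}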

@jzakaryan (Collaborator) left a comment:

Left a few comments. Overall looks good.

@jzakaryan (Collaborator) left a comment:

Going forward, please retain the commit history in your PRs. Don't rebase/squash those commits locally. Having a commit history retained in the PR helps reviewers see changes over time.

@shrinandthakkar merged commit 6ebb701 into linkedin:master on Mar 31, 2023
shrinandthakkar added a commit that referenced this pull request on Apr 13, 2023
* Releasing a new version And Minor improvements

* Using immutable empty set & keeping SNAPSHOT so as not to accidentally release any version

---------

Co-authored-by: Shrinand Thakkar <sthakkar@sthakkar-mn2.linkedin.biz>