Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust dual read monitoring data match logic and remove log rate limiter #968

Merged
merged 2 commits into from
Jan 19, 2024

Conversation

bohhyang
Copy link
Contributor

@bohhyang bohhyang commented Jan 18, 2024

Dual read monitoring have been having tons of false-positive uri property mismatches. We used a rate limiter logger which was flooded by uri logs and starved critical logs from service and cluster properties.

Changes

  1. Eliminate two kinds of mismatches: 1) uriSpecificProperties from xDS flow having inner empty maps will be treated the same as an empty uriSpecificProperties. 2) data read from FS has a version of "-1". Allow data match if (at least) one data is read FS.
  2. Use unlimited logger for service and cluster property warn logs, while keep using a rate limiter logger for uri warn logs.
  3. adjust log messages to show the property type, name, data and version, in both LBs clearly.

Test Done

Unit tests.

Manual test with QEI d2-proxy.
Log snippet shows the data matches sequence for service "workplaceTypeTaxonomyVersions", its cluster "DataMan" and the uris (I turned debug level log to info level just to show the logs):

2024/01/19 08:51:29.322 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qjayMYGSaFtuedww==] Added new entry ServiceStoreProperties for workplaceTypeTaxonomyVersions for New LB.
2024/01/19 08:51:29.322 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qjayMYGSaFtuedww==] Current entries of ServiceStoreProperties for workplaceTypeTaxonomyVersions: 
Old LB: null
New LB: CacheEntry{_version=253657044435, _timeStamp='2024/01/19 08:51:29.321', _data=ServiceStoreProperties [_stableServiceProperties=ServiceProperties [_clusterName=DataMan, _path=/workplaceTypeTaxonomyVersions, _serviceName=workplaceTypeTaxonomyVersions, _loadBalancerStrategyList=[relative, degrader], _loadBalancerStrategyProperties={http.loadBalancer.pointsPerWeight=100, http.loadBalancer.lowEmittingInterval=20000, http.loadBalancer.highEmittingInterval=150000, http.loadBalancer.updateIntervalMs=5000, http.loadBalancer.consistentHashAlgorithm=distributionBased, http.loadBalancer.ringRampFactor=2.0, http.loadBalancer.highWaterMark=3000.0, http.loadBalancer.quarantine.maxPercent=0.1, http.loadBalancer.lowWaterMark=500.0}, _transportClientProperties={http.idleTimeout=10500000, allowedClientOverrideKeys=http.protocolVersion, http.poolSize=100, http.protocolVersion=HTTP_2, http.maxResponseSize=20971520, http.sslIdleTimeout=10500000, http.requestTimeout=10000}, _relativeStrategyProperties={highErrorRate=0.1, lowErrorRate=0.02, downStep=0.2, enableFastRecovery=true, relativeLatencyLowThresholdFactor=4.0, relativeLatencyHighThresholdFactor=5.0, slowStartThreshold=0.16, quarantineProperties={quarantineMaxPercent=0.1}, initialHealthScore=0.01, ringProperties={consistentHashAlgorithm=distributionBased}, upStep=0.05, emittingIntervalMs=150000}, _degraderProperties={degrader.minCallCount=1, degrader.highErrorRate=0.1, degrader.lowLatency=3000, degrader.downStep=0.05, degrader.highLatency=5000, degrader.initialDropRate=0.99, degrader.slowStartThreshold=0.16, degrader.lowErrorRate=0.02}, prioritizedSchemes=[https], bannedUris=[], serviceMetadata={}, backupRequests=[], enableClusterSubsetting=false, minimumClusterSubsetSize=-1], _canaryConfigs=null, _canaryDistributionStrategy=null]}
2024/01/19 08:51:29.668 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qot/cmC3hYTrXTuA==] Matched ServiceStoreProperties for workplaceTypeTaxonomyVersions. 
Old LB: CacheEntry{_version=253657044435, _timeStamp='2024/01/19 08:51:29.668', _data=ServiceStoreProperties [_stableServiceProperties=ServiceProperties [_clusterName=DataMan, _path=/workplaceTypeTaxonomyVersions, _serviceName=workplaceTypeTaxonomyVersions, _loadBalancerStrategyList=[relative, degrader], _loadBalancerStrategyProperties={http.loadBalancer.pointsPerWeight=100, http.loadBalancer.lowEmittingInterval=20000, http.loadBalancer.highEmittingInterval=150000, http.loadBalancer.updateIntervalMs=5000, http.loadBalancer.consistentHashAlgorithm=distributionBased, http.loadBalancer.ringRampFactor=2.0, http.loadBalancer.highWaterMark=3000.0, http.loadBalancer.quarantine.maxPercent=0.1, http.loadBalancer.lowWaterMark=500.0}, _transportClientProperties={http.idleTimeout=10500000, allowedClientOverrideKeys=http.protocolVersion, http.poolSize=100, http.protocolVersion=HTTP_2, http.maxResponseSize=20971520, http.sslIdleTimeout=10500000, http.requestTimeout=10000}, _relativeStrategyProperties={highErrorRate=0.1, lowErrorRate=0.02, downStep=0.2, enableFastRecovery=true, relativeLatencyLowThresholdFactor=4.0, relativeLatencyHighThresholdFactor=5.0, slowStartThreshold=0.16, quarantineProperties={quarantineMaxPercent=0.1}, initialHealthScore=0.01, ringProperties={consistentHashAlgorithm=distributionBased}, upStep=0.05, emittingIntervalMs=150000}, _degraderProperties={degrader.minCallCount=1, degrader.highErrorRate=0.1, degrader.lowLatency=3000, degrader.downStep=0.05, degrader.highLatency=5000, degrader.initialDropRate=0.99, degrader.slowStartThreshold=0.16, degrader.lowErrorRate=0.02}, prioritizedSchemes=[https], bannedUris=[], serviceMetadata={}, backupRequests=[], enableClusterSubsetting=false, minimumClusterSubsetSize=-1], _canaryConfigs=null, _canaryDistributionStrategy=null]}
New LB: CacheEntry{_version=253657044435, _timeStamp='2024/01/19 08:51:29.321', _data=ServiceStoreProperties [_stableServiceProperties=ServiceProperties [_clusterName=DataMan, _path=/workplaceTypeTaxonomyVersions, _serviceName=workplaceTypeTaxonomyVersions, _loadBalancerStrategyList=[relative, degrader], _loadBalancerStrategyProperties={http.loadBalancer.pointsPerWeight=100, http.loadBalancer.lowEmittingInterval=20000, http.loadBalancer.highEmittingInterval=150000, http.loadBalancer.updateIntervalMs=5000, http.loadBalancer.consistentHashAlgorithm=distributionBased, http.loadBalancer.ringRampFactor=2.0, http.loadBalancer.highWaterMark=3000.0, http.loadBalancer.quarantine.maxPercent=0.1, http.loadBalancer.lowWaterMark=500.0}, _transportClientProperties={http.idleTimeout=10500000, allowedClientOverrideKeys=http.protocolVersion, http.poolSize=100, http.protocolVersion=HTTP_2, http.maxResponseSize=20971520, http.sslIdleTimeout=10500000, http.requestTimeout=10000}, _relativeStrategyProperties={highErrorRate=0.1, lowErrorRate=0.02, downStep=0.2, enableFastRecovery=true, relativeLatencyLowThresholdFactor=4.0, relativeLatencyHighThresholdFactor=5.0, slowStartThreshold=0.16, quarantineProperties={quarantineMaxPercent=0.1}, initialHealthScore=0.01, ringProperties={consistentHashAlgorithm=distributionBased}, upStep=0.05, emittingIntervalMs=150000}, _degraderProperties={degrader.minCallCount=1, degrader.highErrorRate=0.1, degrader.lowLatency=3000, degrader.downStep=0.05, degrader.highLatency=5000, degrader.initialDropRate=0.99, degrader.slowStartThreshold=0.16, degrader.lowErrorRate=0.02}, prioritizedSchemes=[https], bannedUris=[], serviceMetadata={}, backupRequests=[], enableClusterSubsetting=false, minimumClusterSubsetSize=-1], _canaryConfigs=null, _canaryDistributionStrategy=null]}
2024/01/19 08:51:29.669 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qot/cmC3hYTrXTuA==] Current entries of ServiceStoreProperties for workplaceTypeTaxonomyVersions: 
Old LB: null
New LB: null

2024/01/19 08:51:30.143 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qv9q3dngdZSKCk0A==] Added new entry ClusterStoreProperties for DataMan for New LB.
2024/01/19 08:51:30.143 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qv9q3dngdZSKCk0A==] Current entries of ClusterStoreProperties for DataMan: 
Old LB: null
New LB: CacheEntry{_version=253657044272, _timeStamp='2024/01/19 08:51:30.143', _data=ClusterStoreProperties [_stableClusterProperties=ClusterProperties [_clusterName=DataMan, _prioritizedSchemes=[], _properties={}, _bannedUris=[], _partitionProperties=com.linkedin.d2.balancer.properties.NullPartitionProperties@6df0a51e, _sslSessionValidationStrings=[dataman], _darkClusterConfigMap={}, _delegated=false], _canaryConfigs=null, _canaryDistributionStrategy=null, _failoutProperties=null]}
2024/01/19 08:51:30.160 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qwOS5ilZCYy6w7tg==] Added new entry UriProperties for DataMan for New LB.
2024/01/19 08:51:30.160 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0qwOS5ilZCYy6w7tg==] Current entries of UriProperties for DataMan: 
Old LB: null
New LB: CacheEntry{_version=253607588919|2, _timeStamp='2024/01/19 08:51:30.160', _data=UriProperties [_clusterName=DataMan, _urisBySchemeAndPartition={https={0=[https://ltx1-app5667.stg.linkedin.com:3874/dataman, https://ltx1-app6217.stg.linkedin.com:3874/dataman]}}, _partitions={https://ltx1-app5667.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}, https://ltx1-app6217.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}}, _uriSpecificProperties={https://ltx1-app5667.stg.linkedin.com:3874/dataman={}, https://ltx1-app6217.stg.linkedin.com:3874/dataman={}}]}

2024/01/19 08:51:30.416 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0q0IKLLrOFoNYaC8A==] Matched ClusterStoreProperties for DataMan. 
Old LB: CacheEntry{_version=253657044272, _timeStamp='2024/01/19 08:51:30.416', _data=ClusterStoreProperties [_stableClusterProperties=ClusterProperties [_clusterName=DataMan, _prioritizedSchemes=[], _properties={}, _bannedUris=[], _partitionProperties=com.linkedin.d2.balancer.properties.NullPartitionProperties@6df0a51e, _sslSessionValidationStrings=[dataman], _darkClusterConfigMap={}, _delegated=false], _canaryConfigs=null, _canaryDistributionStrategy=null, _failoutProperties=null]}
New LB: CacheEntry{_version=253657044272, _timeStamp='2024/01/19 08:51:30.143', _data=ClusterStoreProperties [_stableClusterProperties=ClusterProperties [_clusterName=DataMan, _prioritizedSchemes=[], _properties={}, _bannedUris=[], _partitionProperties=com.linkedin.d2.balancer.properties.NullPartitionProperties@6df0a51e, _sslSessionValidationStrings=[dataman], _darkClusterConfigMap={}, _delegated=false], _canaryConfigs=null, _canaryDistributionStrategy=null, _failoutProperties=null]}
2024/01/19 08:51:30.416 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0q0IKLLrOFoNYaC8A==] Current entries of ClusterStoreProperties for DataMan: 
Old LB: null
New LB: null

2024/01/19 08:51:30.441 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0q0gdDgbUuAqeczSw==] Matched UriProperties for DataMan that only differ in version. 
Old LB: CacheEntry{_version=-1|2, _timeStamp='2024/01/19 08:51:30.441', _data=UriProperties [_clusterName=DataMan, _urisBySchemeAndPartition={https={0=[https://ltx1-app5667.stg.linkedin.com:3874/dataman, https://ltx1-app6217.stg.linkedin.com:3874/dataman]}}, _partitions={https://ltx1-app5667.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}, https://ltx1-app6217.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}}, _uriSpecificProperties={}]}
New LB: CacheEntry{_version=253607588919|2, _timeStamp='2024/01/19 08:51:30.160', _data=UriProperties [_clusterName=DataMan, _urisBySchemeAndPartition={https={0=[https://ltx1-app5667.stg.linkedin.com:3874/dataman, https://ltx1-app6217.stg.linkedin.com:3874/dataman]}}, _partitions={https://ltx1-app5667.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}, https://ltx1-app6217.stg.linkedin.com:3874/dataman={0=[ weight =1.0 ]}}, _uriSpecificProperties={https://ltx1-app5667.stg.linkedin.com:3874/dataman={}, https://ltx1-app6217.stg.linkedin.com:3874/dataman={}}]}
2024/01/19 08:51:30.441 INFO [DualReadLoadBalancerMonitor] [__r2d2DefaultClient__.r2d2Client.d2Executor-4-1] [d2-proxy] [AAYPT0q0gdDgbUuAqeczSw==] Current entries of UriProperties for DataMan: 
Old LB: null
New LB: null

Copy link
Collaborator

@brycezhongqing brycezhongqing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks

@bohhyang bohhyang merged commit 089994f into master Jan 19, 2024
1 of 2 checks passed
@bohhyang bohhyang deleted the boyang/dualReadMonitor branch January 19, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants