
fix(kuma-cp) TrafficRoute in multizone issue #1979

Merged: 14 commits into master from fix/tr-in-multizone on May 25, 2021

Conversation

lobkovilya (Contributor):

Summary

In some specific cases, TrafficRoutes in multizone didn't work correctly. This happened because envoy.Cluster doesn't support hosts that share the same IP/port but carry different metadata for subsets.

This PR changes the way TrafficRoutes are implemented: instead of ClusterSubsets, we now use a separate cluster for each TrafficRoute.Split.

This change affects metrics, because previously the Service name and the Cluster name were the same, but now each Cluster gets an autogenerated name service_name-<num>, where num is the index of the Split. In order to keep existing Grafana dashboards working, the Kuma Data Plane hijacks the metrics and merges the metrics of all clusters that belong to the same service.
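
To make the metrics point concrete, here is a minimal sketch of the merging idea, assuming Prometheus-style per-cluster counters and the autogenerated -<num> suffix described above. Names and the suffix-stripping regex are illustrative only; the actual kuma-dp hijacker is more careful (it knows the real split indexes rather than guessing from the name):

```go
package main

import (
	"fmt"
	"regexp"
)

// splitSuffix matches the autogenerated "-<num>" index appended to a
// service name for each TrafficRoute.Split (assumption for this sketch;
// a real implementation would track the actual split clusters instead of
// pattern-matching, since a service name could itself end in "-<num>").
var splitSuffix = regexp.MustCompile(`-\d+$`)

// mergeClusterCounters folds per-cluster counters back into per-service
// counters, so dashboards keyed by service name keep working.
func mergeClusterCounters(perCluster map[string]float64) map[string]float64 {
	perService := map[string]float64{}
	for cluster, value := range perCluster {
		service := splitSuffix.ReplaceAllString(cluster, "")
		perService[service] += value
	}
	return perService
}

func main() {
	merged := mergeClusterCounters(map[string]float64{
		"backend-0": 40, // split cluster for one destination subset
		"backend-1": 60, // split cluster for another subset
		"web":       10, // service without splits keeps its name
	})
	fmt.Println(merged) // map[backend:100 web:10]
}
```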

Full changelog

  • update Envoy configurers
  • merge metrics in kuma-dp
  • rename Subset to Cluster and Cluster to Service, plus small refactoring
  • add e2e tests

Issues resolved

Fix #XXX

Documentation

@lobkovilya requested a review from a team as a code owner on May 12, 2021, 13:37.
pkg/xds/cache/cla/cache.go:

-func (c *Cache) GetCLA(ctx context.Context, meshName, meshHash, service string, apiVersion envoy_common.APIVersion) (proto.Message, error) {
-	key := fmt.Sprintf("%s:%s:%s:%s", apiVersion, meshName, service, meshHash)
+func (c *Cache) GetCLA(ctx context.Context, meshName, meshHash string, cluster envoy_common.Cluster, apiVersion envoy_common.APIVersion) (proto.Message, error) {
+	key := fmt.Sprintf("%s:%s:%s:%s", apiVersion, meshName, cluster.Name(), meshHash)
Contributor:

How would this work in the following case?

apiVersion: kuma.io/v1alpha1
kind: TrafficRoute
mesh: default
metadata:
  name: route-all-default
spec:
  sources:
    - match:
        kuma.io/service: 'web'
  destinations:
    - match:
        kuma.io/service: 'backend'
  conf:
    split:
      - weight: 50
        destination:
          kuma.io/service: backend
          version: '1'
      - weight: 50
        destination:
          kuma.io/service: backend
          version: '2'
---
apiVersion: kuma.io/v1alpha1
kind: TrafficRoute
mesh: default
metadata:
  name: route-other-default
spec:
  sources:
    - match:
        kuma.io/service: 'other'
  destinations:
    - match:
        kuma.io/service: 'backend'
  conf:
    split:
      - weight: 50
        destination:
          kuma.io/service: backend
          version: '3'
      - weight: 50
        destination:
          kuma.io/service: backend
          version: '4'

If the cache is maintained for the whole mesh, wouldn't we generate the same -0 and -1 cluster names for both routes and end up with a mismatch?

Contributor (author):

Yes, you're right. I added a Hash() method to use in MeshHash (instead of cluster.Name()).
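
For context, a minimal sketch of the idea behind a tag-aware Hash(), with hypothetical types rather than the actual Kuma code: by covering the destination tags, two clusters that share an autogenerated name no longer produce colliding cache keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Cluster is a hypothetical stand-in for envoy_common.Cluster, just to
// show the idea: Name() alone can collide, Hash() also covers the tags.
type Cluster struct {
	name string            // autogenerated, e.g. "backend-0"
	tags map[string]string // destination tags of the split
}

func (c Cluster) Name() string { return c.name }

// Hash derives a stable identifier from the name plus the sorted tags, so
// two splits with the same autogenerated name but different tag subsets
// yield different cache keys.
func (c Cluster) Hash() string {
	pairs := make([]string, 0, len(c.tags))
	for k, v := range c.tags {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return c.name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	a := Cluster{"backend-0", map[string]string{"kuma.io/service": "backend", "version": "1"}}
	b := Cluster{"backend-0", map[string]string{"kuma.io/service": "backend", "version": "3"}}
	fmt.Println(a.Name() == b.Name()) // true: names collide across routes
	fmt.Println(a.Hash() == b.Hash()) // false: tags disambiguate the key
}
```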

@lobkovilya (author):

@Mergifyio update

@mergify bot commented on May 24, 2021:

Command update: success. Branch has been successfully updated.

@CLAassistant:

CLA assistant check: 1 out of 2 committers have signed the CLA.

✅ lobkovilya
❌ mergify[bot]

test/e2e/trafficroute/kuma_universal.go:

	}
	return errors.Errorf("resulted split values %v and %v aren't within original values %v and %v",
		echo1Resp/float64(maxAttempts)*100, echo2Resp/float64(maxAttempts)*100, v1Weight, v2Weight)
}
Contributor:

What is the advantage of this algorithm vs. simply sending 100 requests and seeing how they are distributed?

	}

	for _, ne := range notExpected {
		Expect(stdout).ToNot(ContainSubstring(ne))
Contributor:

Is it safe to call Expect from inside Eventually? I was under the impression that this is not a good idea. Will it allow Eventually to do the next try, or will it fail the test?
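
For reference, a common Gomega pattern that sidesteps this question is to have the polled function return an error and assert Succeed(), so every failed attempt simply triggers a retry instead of failing the test outright. A sketch under that assumption; curlService is a hypothetical helper, not the PR's actual code:

```go
package trafficroute_test

import (
	"strings"

	. "github.com/onsi/gomega"
	"github.com/pkg/errors"
)

// curlService is a hypothetical helper: it sends one request to the given
// address and returns the response body.
var curlService func(destination string) (string, error)

func assertEventuallyReachesV1() {
	Eventually(func() error {
		stdout, err := curlService("echo-server_kuma-test_svc_8080.mesh")
		if err != nil {
			return err // transient failure: Eventually retries
		}
		if !strings.Contains(stdout, "echo-v1") {
			return errors.Errorf("expected echo-v1 in response, got: %q", stdout)
		}
		return nil // nil error: Succeed() matches and polling stops
	}, "30s", "1s").Should(Succeed())
}
```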

kuma.io/service: echo-server_kuma-test_svc_8080
version: v4
`
Expect(NewClusterSetup().Install(YamlUniversal(trafficRoute)).Setup(global)).To(Succeed())
Contributor:

nit: YamlUniversal(trafficRoute)(global) is shorter
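
For readers unfamiliar with the test framework: YamlUniversal returns an install function that can also be applied to a cluster directly, which is what makes the shorter form work. A simplified sketch with hypothetical types, not the framework's exact definitions:

```go
// Cluster and ApplyYaml are hypothetical here; they stand in for the
// framework's cluster abstraction.
type Cluster interface {
	ApplyYaml(yaml string) error
}

// InstallFunc is a function that installs something on a cluster.
type InstallFunc func(cluster Cluster) error

// YamlUniversal returns a function, so it can either be handed to
// Install(...) or invoked on a cluster directly.
func YamlUniversal(yaml string) InstallFunc {
	return func(c Cluster) error { return c.ApplyYaml(yaml) }
}

// Hence the two equivalent forms:
//   Expect(NewClusterSetup().Install(YamlUniversal(trafficRoute)).Setup(global)).To(Succeed())
//   Expect(YamlUniversal(trafficRoute)(global)).To(Succeed())
```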

err := waitRatio("echo-server_kuma-test_svc_8080.mesh", v1Weight, v2Weight, "echo-v1", "echo-v2")
Expect(err).ToNot(HaveOccurred())
})
}
Contributor:

I feel like waitRatio and collectResponses are a bit cryptic. The names do not suggest that these are assertions, and I think both of them contain somewhat complicated logic.

Here's what I would do (sketched in code below):

  • Write a method collectResponses := func(destination string) map[string]int (instance to number of responses) that only sends 10 (or 100 if needed) requests and collects the responses, without doing any assertions
  • In should access all instances of the service, I would write: eventually(collectResponses), then assert in the test that instances 1, 2, and 4 responded and 3 did not
  • In should route 100 percent of the traffic to the different service, I would write: eventually(collectResponses), then assert that all responses came from one instance
  • In should route split traffic between the versions with 20/80 ratio, I would write: eventually(collectResponses), then count whether the traffic is distributed properly

I think this will also make it more flexible to add new test cases, and collectResponses could be reused between both tests.
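
A rough sketch of that suggestion, as it might look inside an It block (curlService is a hypothetical helper returning the name of the instance that answered; this is not the PR's actual code):

```go
// collectResponses sends 10 requests and counts responses per instance,
// with no assertions; each test asserts on the returned map itself.
collectResponses := func(destination string) map[string]int {
	responses := map[string]int{}
	for i := 0; i < 10; i++ {
		if instance, err := curlService(destination); err == nil {
			responses[instance]++
		}
	}
	return responses
}

// "should route 100 percent of the traffic to the different service":
// eventually all responses come from a single instance.
Eventually(func() map[string]int {
	return collectResponses("echo-server_kuma-test_svc_8080.mesh")
}, "30s", "1s").Should(HaveLen(1))

// "should route split traffic between the versions with 20/80 ratio":
// eventually the counts approximate the weights (a real test would build
// in some tolerance rather than exact bounds).
Eventually(func() bool {
	r := collectResponses("echo-server_kuma-test_svc_8080.mesh")
	return r["echo-v1"] <= 3 && r["echo-v2"] >= 7
}, "60s", "1s").Should(BeTrue())
```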

@jakubdyszkiewicz (Contributor):

That's some next-level sorcery with those chained assertions 👀

lobkovilya merged commit f0211ec into master on May 25, 2021.
lobkovilya deleted the fix/tr-in-multizone branch on May 25, 2021, 14:26.