
Rolling upgrade fails when all members change their IP addresses #14088

Closed
leszko opened this issue Nov 8, 2018 · 47 comments

@leszko (Contributor) commented Nov 8, 2018

When the cluster has some data (tested with ~1 GB) and a rolling upgrade is performed one member at a time, sometimes when the last member gets restarted it fails with the following error.

Nov 07, 2018 8:41:27 AM com.hazelcast.internal.cluster.impl.DiscoveryJoiner
INFO: [10.16.0.19]:5701 [dev] [3.10.5] [10.16.0.15]:5701 is added to the blacklist.
Nov 07, 2018 8:41:27 AM com.hazelcast.nio.tcp.TcpIpConnectionErrorHandler
WARNING: [10.16.0.19]:5701 [dev] [3.10.5] Removing connection to endpoint [10.16.0.15]:5701 Cause => java.net.SocketException {Host is unreachable to address /10.16.0.15:5701}, Error-Count: 13
Nov 07, 2018 8:41:27 AM com.hazelcast.internal.partition.impl.PartitionReplicaStateChecker
OFF: [10.16.0.19]:5701 [dev] [3.10.5] Could not get a response from master about migrations! -> com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [10.16.0.15]:5701, partitionId: -1, operation: com.hazelcast.internal.partition.operation.HasOngoingMigration, service: hz:core:partitionService

It looks like some ongoing migrations may still hold the old address and the member cannot connect, or it is something with the master election. Maybe a simple retry on member discovery would solve the issue.

The easiest way to reproduce it is on Kubernetes (there, usually all Pods are assigned new IPs). However, I believe the issue is not limited to Kubernetes.

Steps to reproduce on Kubernetes (using Helm Chart):

  1. Start a Hazelcast cluster
helm install --name my-release --set cluster.memberCount=6,image.tag=3.10.3 stable/hazelcast
  2. Insert some data (you can skip this step, but the issue happens less frequently without any data; see the client sketch after these steps)
  3. Upgrade the cluster
helm upgrade my-release --set cluster.memberCount=6,image.tag=3.10.4 stable/hazelcast
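For the data-insert step, here is a minimal client-side sketch of the kind of load used in this report (the map name "test" and the client address are placeholder assumptions, not taken from the chart):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class FillData {
        public static void main(String[] args) {
            ClientConfig clientConfig = new ClientConfig();
            // Placeholder address: use however the cluster is exposed (port-forward, LoadBalancer, ...).
            clientConfig.getNetworkConfig().addAddress("127.0.0.1:5701");

            HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
            IMap<Integer, Long[]> map = client.getMap("test");

            // Mirrors the load described later in this thread: 100 entries, each a new Long[1000000].
            for (int i = 0; i < 100; i++) {
                map.put(i, new Long[1_000_000]);
            }
            client.shutdown();
        }
    }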
@metanet (Contributor) commented Nov 8, 2018

Hi Rafal,

I guess the failure occurs while the last node is shutting down, not during its restart. Can you share full logs of the failure?

Thanks,

@leszko (Contributor, Author) commented Nov 8, 2018

Hi, actually it happens when the last one starts, meaning that all the members are already on the new version.

Please see the logs attached.
logs.txt

@mdogan (Member) commented Nov 9, 2018

This looks like a Byzantine failure 👹 One of the nodes is lying about the master.

The following pattern is repeated:

Nov 07, 2018 8:40:51 AM com.hazelcast.internal.cluster.impl.ClusterJoinManager
WARNING: [10.16.0.19]:5701 [dev] [3.10.5] Ambiguous master response: This node has a master
 [10.16.0.15]:5701, but does not have a connection to [10.16.0.16]:5701. 
Sent master response as [10.16.0.12]:5701. Master field will be unset now...

Nov 07, 2018 8:40:51 AM com.hazelcast.internal.cluster.impl.ClusterJoinManager
WARNING: [10.16.0.19]:5701 [dev] [3.10.5] Could not create connection to possible master [10.16.0.15]:5701

Nov 07, 2018 8:40:54 AM com.hazelcast.nio.tcp.TcpIpConnector
INFO: [10.16.0.19]:5701 [dev] [3.10.5] Could not connect to: /10.16.0.15:5701. Reason: SocketException[Host is unreachable to address /10.16.0.15:5701]

Btw, the "Ambiguous master response..." log is a bit misleading. It should read something like:
(Fix #14090)

WARNING: [10.16.0.19]:5701 [dev] [3.10.5] Ambiguous master response! Received master response 
[10.16.0.12]:5701 from [10.16.0.16]:5701. This node has a master [10.16.0.15]:5701, 
but does not have an active connection to it. Master field will be unset now...

Anyway, what happens is: during discovery [10.16.0.19]:5701 asks for the master address. Some unknown node sends [10.16.0.15]:5701 as the response, but [10.16.0.19]:5701 cannot connect to it. When [10.16.0.16]:5701 sends [10.16.0.12]:5701 as the response, [10.16.0.19]:5701 resets the known master field. But just after that it receives [10.16.0.15]:5701 again, and this loop seems to repeat.

@leszko: Can you please enable debug or trace logs and reproduce this issue again?

@metanet (Contributor) commented Nov 9, 2018

Seems k8s is the right tool to make Hazelcast meet Byzantine failures :)

@Danny-Hazelcast (Member) commented Nov 9, 2018

I have not encountered this type of failure mode; however, in my tests the IP addresses remain constant throughout the rolling upgrade.

Was this issue caused by the changing IP addresses, or by ongoing migrations?

My rolling upgrade process is:

  1. Gracefully shut down one member.
  2. Restart the member JVM process using the new version of the Hazelcast jars.
  3. Wait until all members report that the cluster is safe.

and repeat until all members have been upgraded. At each step I wait until the cluster is safe
before upgrading the next member (a sketch of this check follows).
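Regarding step 3, a rough sketch of what "wait until the cluster is safe" can look like, using a member-side HazelcastInstance (the polling interval is an arbitrary choice):

    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.PartitionService;

    public final class ClusterSafety {
        // Block until the member reports the cluster as safe (all partition backups in sync).
        public static void awaitClusterSafe(HazelcastInstance hz) throws InterruptedException {
            PartitionService partitionService = hz.getPartitionService();
            while (!partitionService.isClusterSafe()) {
                Thread.sleep(1000L); // arbitrary polling interval
            }
        }
    }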

Two questions:

  1. Do I need to wait for cluster-safe each time?

  2. What steps does the helm command perform?
    helm upgrade my-release --set cluster.memberCount=6,image.tag=3.10.4 stable/hazelcast

Thanks.

@leszko (Contributor, Author) commented Nov 9, 2018

@mdogan I've attached the FINEST logs. Hope it helps.
logs-finest.txt

@leszko (Contributor, Author) commented Nov 9, 2018

@Danny-Hazelcast I'm not sure it's related to the fact that all IPs have changed, but the thing is that scaling in always works correctly (some IPs do not change) while the rolling upgrade fails (all IPs change).

To answer your questions:

  1. In my case, I don't actually check that the cluster is safe; I rely on graceful shutdown when terminating the members.
  2. The command itself just changes the image used in Kubernetes. What Kubernetes does can be described as follows (let's say the starting point is 6 members, numbered 0...5, on version 3.10.3); a sketch of the graceful-shutdown settings follows this list:
    • Kubernetes starts to terminate Member 5
    • Kubernetes waits until it is gracefully terminated (hazelcast.shutdownhook.policy=GRACEFUL)
    • When Member 5 is terminated, it starts a new Member 5 (with a new IP, with version 3.10.4)
    • Kubernetes waits 30 seconds
    • Kubernetes starts to terminate Member 4
    • ....
    • Kubernetes starts to terminate Member 0
    • Kubernetes waits until it is gracefully terminated (hazelcast.shutdownhook.policy=GRACEFUL)
    • When Member 0 is terminated, it starts a new Member 0 (with a new IP, with version 3.10.4)
    • Member 0 tries to start but fails with the logs attached. After some time Member 0 shuts down, so Kubernetes restarts it again, and then it works fine.
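For reference, a minimal member-side sketch of enabling the graceful shutdown mentioned above (the max-wait value is an illustrative assumption, not taken from the Helm chart):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class GracefulMember {
        public static void main(String[] args) {
            Config config = new Config();
            // The shutdown hook migrates data away before the JVM exits.
            config.setProperty("hazelcast.shutdownhook.policy", "GRACEFUL");
            // Illustrative upper bound, in seconds, for the graceful shutdown to finish.
            config.setProperty("hazelcast.graceful.shutdown.max.wait", "600");
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        }
    }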

@mdogan mdogan self-assigned this Nov 15, 2018

@mdogan (Member) commented Dec 11, 2018

As far as I understand, this is related to Kubernetes discovery. Details coming...

Instead of doing a version upgrade, I tried shutting down the whole cluster via Management Center. Since 6 members are required, all members are re-created by Kubernetes.

When I tried version 3.10.x with Helm template version 1.0.1, after the shutdown some of the members generally could not discover the others. This was also very obvious on Management Center: the member list kept changing until split-brain discovery kicked in and the clusters merged back.

helm install --name my-release --set cluster.memberCount=6,image.tag=3.10.6 hazelcast/hazelcast --version 1.0.1

But when I use the latest template with Hazelcast 3.11, I am not able to reproduce this issue. All members formed a single cluster after restart.

helm install --name my-release --set cluster.memberCount=6 hazelcast/hazelcast

The discovery config in version 1.0.1 was:

  <discovery-strategy enabled="true" class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
    <properties>
      <property name="service-name">${serviceName}</property>
      <property name="namespace">${namespace}</property>
    </properties>
  </discovery-strategy>

In 1.1.0, the latest version, it is:

  <kubernetes enabled="true">
    <service-name>${serviceName}</service-name>
    <namespace>${namespace}</namespace>
    <resolve-not-ready-addresses>true</resolve-not-ready-addresses>
  </kubernetes>

Notice the newly added property, resolve-not-ready-addresses, in the latest version. Its default value is false in HazelcastKubernetesDiscoveryStrategy.

When I set that property to true using a custom config with the earlier template and Hazelcast 3.10.6, the issue disappears (a programmatic equivalent is sketched after the XML below).

helm install --name my-release --set cluster.memberCount=6,image.tag=3.10.6 hazelcast/hazelcast --version 1.0.1 -f values.yaml
  <discovery-strategy enabled="true" class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
    <properties>
      <property name="service-name">${serviceName}</property>
      <property name="namespace">${namespace}</property>
      <property name="resolve-not-ready-addresses">true</property>
    </properties>
  </discovery-strategy>
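For completeness, a programmatic sketch equivalent to the 1.0.1 XML with the extra property switched on (the service name and namespace values are placeholders):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.DiscoveryStrategyConfig;
    import com.hazelcast.config.JoinConfig;
    import com.hazelcast.core.Hazelcast;

    public class KubernetesDiscoverySketch {
        public static void main(String[] args) {
            Config config = new Config();
            config.setProperty("hazelcast.discovery.enabled", "true");

            JoinConfig join = config.getNetworkConfig().getJoin();
            join.getMulticastConfig().setEnabled(false);
            join.getTcpIpConfig().setEnabled(false);

            DiscoveryStrategyConfig k8s = new DiscoveryStrategyConfig(
                    "com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy");
            k8s.addProperty("service-name", "my-release-hazelcast"); // placeholder
            k8s.addProperty("namespace", "default");                 // placeholder
            k8s.addProperty("resolve-not-ready-addresses", "true");
            join.getDiscoveryConfig().addDiscoveryStrategyConfig(k8s);

            Hazelcast.newHazelcastInstance(config);
        }
    }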
@leszko (Contributor, Author) commented Dec 11, 2018

I tried today with the latest Helm Chart with the following hazelcast.xml (inside values.yaml):

      <?xml version="1.0" encoding="UTF-8"?>
      <hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.10.xsd"
                     xmlns="http://www.hazelcast.com/schema/config"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

        <properties>
          <property name="hazelcast.discovery.enabled">true</property>
        </properties>
        <network>
          <join>
            <multicast enabled="false"/>
            <discovery-strategies>
              <discovery-strategy enabled="true"
                  class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
                <properties>
                  <property name="service-name">${serviceName}</property>
                  <property name="namespace">${namespace}</property>
                  <property name="resolve-not-ready-addresses">true</property>
                </properties>
              </discovery-strategy>
            </discovery-strategies>
          </join>
        </network>
      </hazelcast>

I did a rolling upgrade from 3.10.5 to 3.10.6 and can confirm that the same issue still exists. IMO it's not related to the discovery or to the resolve-not-ready-addresses property.

@mdogan If you cannot reproduce it, could you fill your cluster with a significant amount of data?

@mdogan (Member) commented Dec 12, 2018

By the same error, do you mean the following log? Or do you see any other issues too, such as the cluster not forming with the correct size, etc.?

Nov 07, 2018 8:41:27 AM com.hazelcast.internal.partition.impl.PartitionReplicaStateChecker
OFF: [10.16.0.19]:5701 [dev] [3.10.5] Could not get a response from master about migrations! 
-> com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: [10.16.0.15]:5701, 
partitionId: -1, operation: com.hazelcast.internal.partition.operation.HasOngoingMigration, service: hz:core:partitionService
@leszko (Contributor, Author) commented Dec 12, 2018

The same log. I attached it in one of my comments (logs.txt). The cluster is not formed and it fails with the error that you pasted.

@mdogan (Member) commented Dec 12, 2018

To make this clear: the log starting with Could not get a response from master about migrations... is not an error or a failure. Its log level is OFF, which is not supposed to be printed at all, but it seems the JDK logger prints it by default.

What do you mean by "cluster is not formed"? Do you observe two or more different clusters with different member lists? Do you see a member join failure?

@leszko (Contributor, Author) commented Dec 12, 2018

Hmm, logs.txt contains all the logs. So this error keeps repeating, the member does not form a cluster on its own, and it does not join any cluster. I didn't actually check what's blocking it.

@mdogan (Member) commented Dec 12, 2018

If you can switch to log4j, then that log will be completely invisible, as expected.
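A minimal sketch of that switch, assuming log4j is already on the classpath (the same can be done with -Dhazelcast.logging.type=log4j on the JVM command line):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;

    public class Log4jMember {
        public static void main(String[] args) {
            Config config = new Config();
            // Select log4j instead of the default JDK logger.
            config.setProperty("hazelcast.logging.type", "log4j");
            Hazelcast.newHazelcastInstance(config);
        }
    }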

I tried a few times with some data loaded into the cluster, but during the upgrade the Docker process on macOS went crazy and the Hazelcast members, Management Center and kubectl became completely unresponsive.

@leszko (Contributor, Author) commented Dec 12, 2018

Could you try on GKE? I think local tests are not good, because you cannot use a really large amount of data and you don't have real networking.

@mdogan (Member) commented Dec 13, 2018

I can try on GKE too, but I think testing with a huge amount of data is irrelevant here, because the purpose of this test is to exercise Kubernetes member discovery and membership handling inside a Kubernetes cluster.
We are not testing migrations and/or backup recovery here; we already have extensive tests for those.

Migrations don't have any effect on the cluster membership management logic. Their effect is indirect, by increasing the load on CPU, memory and network. But I don't think we are doing load testing here.

@leszko (Contributor, Author) commented Dec 13, 2018

Do your "migration and/or backup recovery" tests include the scenario where the IP addresses change after recovery?

@mdogan (Member) commented Dec 13, 2018

Changing an IP address means removing existing members and adding new ones. Without hot-restart there is no notion of fail-recovery in Hazelcast. What you think of as recovery is actually adding new members to replace the failed ones. And yes, we have tests for removing & adding members.

@mdogan (Member) commented Dec 13, 2018

What I mean is, running this test with a huge amount of data can lead to high load on the members, and some members may become unresponsive. That can lead to membership failures. But is this what you want to test?

@leszko (Contributor, Author) commented Dec 13, 2018

Hmm, if you cover the scenario of a rolling upgrade where all IPs have changed and there is a lot of data, then it's weird that it fails on Kubernetes... Actually, I thought that the reason for this failure is that there are ongoing migrations with an IP assigned, but after the rolling upgrade that IP does not exist anymore, so it crashes. Could something like that be a problem?

It's not a matter of temporary unresponsiveness; the issue is that the cluster never recovers after the rolling upgrade.

@mdogan (Member) commented Dec 13, 2018

... Actually, I thought that the reason for this failure is that there are ongoing migrations with an IP assigned, but after the rolling upgrade that IP does not exist anymore, so it crashes. Could something like that be a problem?

No. Members going away and coming back during migration is a very usual & expected thing. When a member is removed while it is a participant of an ongoing migration, that migration is marked as failed and a new re-partitioning scheme is calculated. Also, a node restarting with a different IP address is not an issue; the harder problem is when it restarts with the same IP. The migration mechanism compares member UUIDs to validate incoming operations & requests.

It's not a matter of temporary unresponsiveness; the issue is that the cluster never recovers after the rolling upgrade.

Can you be more specific here? What do you mean by "the cluster never recovers"? What do you observe exactly? Is it that a member cannot join the existing cluster after the restart & upgrade and the cluster remains with 5 members? If yes, what happens to that restarting member? Does it create a standalone cluster, or does it crash at some point?

When I look at the log file at #14088 (comment), I see that the member eventually joins the cluster.

@leszko (Contributor, Author) commented Dec 13, 2018

Hmmm, so it's probably even more complex than I thought :/

Let me do some more testing on k8s and I'll come back here with more info, i.e. exactly what happens, step by step, to cause the issue.

@leszko (Contributor, Author) commented Dec 14, 2018

I've reproduced it while capturing all the details and DEBUG logs, so let me try to describe the issue as well as I can. We can later think about how to reproduce it outside k8s if needed.

Initial information

  • 3 Hazelcast IMDG members version 3.10.5 (rolling upgraded to 3.10.6)
  • the issue exists only if there is a significant amount of data in the cluster
  • graceful shutdown is enabled
  • tested on GKE (k8 on Google Cloud Platform)
  • the following Helm Chart was used to deploy and upgrade Hazelcast
  • I used leszko/hazelcast Docker images, which are the same as hazelcast/hazelcast, but have DEBUG logging level enabled

What happens step-by-step

Step 1: Install Hazelcast cluster

$ helm install --name my-release --set image.repository=leszko/hazelcast,image.tag=3.10.5 .

The cluster is formed correctly; the Pods (with Hazelcast members) have the following IP addresses:

  • Member 0: 10.16.0.47 (logs)
  • Member 1: 10.16.2.23 (logs)
  • Member 2: 10.16.0.48

Step 2: Fill the data

I connect with an external client and insert some data (100 entries, each a new Long[1000000]).

Everything looks fine so far.

Step 3: Start rolling upgrade

$ helm upgrade my-release --set image.repository=leszko/hazelcast,image.tag=3.10.6 .

The rolling upgrade is started; in k8s it works like this: terminate Member 2, wait until Member 2 is terminated (graceful shutdown), start Member 2, wait 60 s, terminate Member 1, wait until Member 1 is terminated (graceful shutdown), start Member 1, ..., and so on.

Step 4: Member 2 is restarted

Member 2 (10.16.0.48) is terminated and a new Member 2 is started (10.16.0.49). It joins the cluster. No issues here (logs).

Step 5: Member 1 is restarted

Member 1 (10.16.2.23) is terminated and a new Member 1 is started (10.16.2.24). It joins the cluster. No issues here (logs).

Step 6: Member 0 is restarted

Member 0 (10.16.0.47) is terminated and a new Member 0 is started (10.16.0.50). Here the issue starts: before joining the cluster, the member tries to connect to 10.16.0.47 (the previous Member 0, which does not exist anymore). It tries for a few minutes and then shuts down (logs).

Step 7: Member 0 is started again

Kubernetes automatically starts Member 0 again when it detects that it was shut down. So a new Member 0 is started (10.16.0.50). This time it starts successfully (logs).

Implications

  • The good thing is that no data was lost
  • One problem (minor) is that it does not look healthy from the user's perspective
  • Another problem (major) is that, in the case of Hazelcast Enterprise, if you upgrade by a minor version the cluster version is not automatically updated, so in Step 7 we would end up with a split brain (the new member would be on version 3.11, the old cluster on version 3.10)

Next steps

Since the issue does not seem so straightforward, I think the best approach would be for @mdogan to have a look at the logs, and if nothing comes to mind we could sit together on this on Monday. Maybe connect with a debugger or just think it through together. @mdogan, what do you think?

@mdogan (Member) commented Dec 14, 2018

Here is what happens:

This is the initial cluster:

Members {size:3, ver:3} [
	Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168
	Member [10.16.2.23]:5701 - 4c60af92-4dfe-4028-bba5-f061cb13a324 this
	Member [10.16.0.48]:5701 - 47c73a07-2daf-4cfe-a992-4127104626db
]

When Member 2 is restarted cluster becomes:

Members {size:4, ver:4} [
	Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168
	Member [10.16.2.23]:5701 - 4c60af92-4dfe-4028-bba5-f061cb13a324
	Member [10.16.0.48]:5701 - 47c73a07-2daf-4cfe-a992-4127104626db
	Member [10.16.0.49]:5701 - a00b4bb5-0b9e-4d8e-8c2e-bd9df0cc94fa this
]

As you can notice, there are now 4 members; the other two members have not yet realized that [10.16.0.48]:5701 - 47c73a07-2daf-4cfe-a992-4127104626db has left. When a member is disconnected, Hazelcast tries to reconnect to that member a few times, and by default this takes between 500 ms and 1 s. This can be configured to a lower value (see the sketch below).
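Presumably the knobs meant here are the connection-monitor properties; a sketch with illustrative values (this mapping is my assumption, not confirmed in the thread):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;

    public class ConnectionMonitorTuning {
        public static void main(String[] args) {
            Config config = new Config();
            // Assumed knobs: minimum interval (ms) between counted IO errors and
            // the number of faults tolerated before the connection is considered broken.
            config.setProperty("hazelcast.connection.monitor.interval", "50");
            config.setProperty("hazelcast.connection.monitor.max.faults", "1");
            Hazelcast.newHazelcastInstance(config);
        }
    }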

After they realize [10.16.0.48]:5701 - 47c73a07-2daf-4cfe-a992-4127104626db has left, the cluster becomes:

Members {size:3, ver:5} [
	Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168
	Member [10.16.2.23]:5701 - 4c60af92-4dfe-4028-bba5-f061cb13a324
	Member [10.16.0.49]:5701 - a00b4bb5-0b9e-4d8e-8c2e-bd9df0cc94fa this
]

Similarly when Member 1 is restarted, cluster becomes:

Members {size:4, ver:6} [
	Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168
	Member [10.16.2.23]:5701 - 4c60af92-4dfe-4028-bba5-f061cb13a324
	Member [10.16.0.49]:5701 - a00b4bb5-0b9e-4d8e-8c2e-bd9df0cc94fa
	Member [10.16.2.24]:5701 - 43dc54d8-6f92-4170-b570-0589df698778 this
]

Then, after they realize Member [10.16.2.23]:5701 - 4c60af92-4dfe-4028-bba5-f061cb13a324 has left:

Members {size:3, ver:7} [
	Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168
	Member [10.16.0.49]:5701 - a00b4bb5-0b9e-4d8e-8c2e-bd9df0cc94fa
	Member [10.16.2.24]:5701 - 43dc54d8-6f92-4170-b570-0589df698778 this
]

When Member 0 is restarted, the problem becomes a little tricky, since it was the master of the cluster, which accepts join requests and decides membership changes.

During restart, Member 0 discovers the other two nodes via Kubernetes discovery and asks them about the master node, to be able to send a join request:

FINEST: [10.16.0.50]:5701 [dev] [3.10.6] Found node service with address: [10.16.0.50]:5701
FINEST: [10.16.0.50]:5701 [dev] [3.10.6] Found node service with address: [10.16.0.49]:5701
FINEST: [10.16.0.50]:5701 [dev] [3.10.6] Found node service with address: [10.16.2.24]:5701
FINE: [10.16.0.50]:5701 [dev] [3.10.6] Will send master question to each address in: [[10.16.0.49]:5701, [10.16.2.24]:5701]

But since they have not yet realized that [10.16.0.47]:5701 has left, they respond with:

FINE: [10.16.0.50]:5701 [dev] [3.10.6] Handling master response [10.16.0.47]:5701 from [10.16.2.24]:5701
FINE: [10.16.0.50]:5701 [dev] [3.10.6] Handling master response [10.16.0.47]:5701 from [10.16.0.49]:5701
FINE: [10.16.0.50]:5701 [dev] [3.10.6] Setting master address to [10.16.0.47]:5701
WARNING: [10.16.0.50]:5701 [dev] [3.10.6] Could not create connection to possible master [10.16.0.47]:5701

Up to this point, everything is as expected to me, according to the current implementation.

One issue is that [10.16.0.49]:5701 and [10.16.2.24]:5701 do not realize [10.16.0.47]:5701's departure from socket errors in time, so [10.16.0.47]:5701 is only removed after one minute due to heartbeat failure (the relevant heartbeat properties are sketched after the log below):

WARNING: [10.16.0.49]:5701 [dev] [3.10.6] Suspecting Member [10.16.0.47]:5701 - 0df4ee85-30fb-40b6-93e9-927870fe8168 because it has not sent any heartbeats since 2018-12-13 15:43:26.622. Now: 2018-12-13 15:44:29.185, heartbeat timeout: 60000 ms, suspicion level: 1.00
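The 60-second window above corresponds to the failure-detector heartbeat settings; a sketch of lowering them (values are illustrative, and shorter timeouts increase the risk of false suspicions):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;

    public class HeartbeatTuning {
        public static void main(String[] args) {
            Config config = new Config();
            // The default heartbeat timeout is 60 s, matching "heartbeat timeout: 60000 ms" in the log.
            config.setProperty("hazelcast.heartbeat.interval.seconds", "1");
            config.setProperty("hazelcast.max.no.heartbeat.seconds", "30");
            Hazelcast.newHazelcastInstance(config);
        }
    }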

Another issue is that once [10.16.0.50]:5701 learns the master as [10.16.0.47]:5701, it keeps trying to connect to it until the join times out. This was reported earlier by @tezc: #14051

For Kubernetes discovery, the solution might be easier, since using the Kubernetes registry we know that [10.16.0.47]:5701 doesn't exist anymore.

@mdogan (Member) commented Dec 14, 2018

I'll think about how we can fix #14051.

As far as I understand, this is not related to the version upgrade but to restarting the master member with a new IP address.

@leszko (Contributor, Author) commented Dec 14, 2018

Thanks @mdogan for the analysis. Makes sense to me. The proper solution is, as you mentioned, to fix #14051.

For Kubernetes discovery, solution might be easier, since using Kubernetes registry, we know [10.16.0.47]:5701 doesn't exist anymore.

We could fix it that way, but then I think we wouldn't address the root cause. Anyway, this change would need to be done in Core as well.

@mmedenjak (Contributor) commented Jan 25, 2019

Potentially fixed by: #14343

@leszko (Contributor, Author) commented Jan 31, 2019

I just tested and #14343 does not solve the issue.

@mdogan (Member) commented Jan 31, 2019

Then I'm out of ideas now. I cannot reproduce the issue myself and cannot see any obvious problem after #14343.

@mdogan mdogan removed their assignment Jan 31, 2019

@leszko (Contributor, Author) commented Feb 18, 2019

FYI: I've reproduced the same issue on AKS (Azure k8s), so the issue is not limited to GKE.

@leszko (Contributor, Author) commented Feb 26, 2019

@mdogan, I couldn't reproduce the issue from the current master. I got the same exception, but the member always manages to recover (after a number of errors). Could you please check the attached logs and say whether that's the expected behavior? If yes, then we can close this issue.
azure-logs.txt

@mdogan (Member) commented Feb 26, 2019

Thanks for the update @leszko.

10.244.1.14 is the restarted member and the previous master was 10.244.1.11. 10.244.1.14 learns the previous master address, 10.244.1.11, from 10.244.1.13 and tries to join 10.244.1.11 for nearly 45 seconds. Then 10.244.1.13 becomes the master, most probably because it suspected 10.244.1.11 due to a heartbeat failure. And 10.244.1.14 successfully joins the cluster.

To my understanding this looks good. Do you have the logs from 10.244.1.13?

@leszko (Contributor, Author) commented Feb 26, 2019

@mdogan

Maybe it's worth mentioning that I have graceful shutdown enabled (not sure if it changes anything). I retried and this time I ended up with a split brain (which merged after 5 min; see the note on the merge delay after the logs). Please see the logs of all 3 members.
azure-0.txt
azure-1.txt
azure-2.txt
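A side note on the timing: the roughly five-minute merge is consistent with the default split-brain merge delay (hazelcast.merge.first.run.delay.seconds, default 300). A sketch of shortening it, with illustrative values:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;

    public class SplitBrainMergeTuning {
        public static void main(String[] args) {
            Config config = new Config();
            // Illustrative values: first merge check after 60 s, then every 30 s.
            config.setProperty("hazelcast.merge.first.run.delay.seconds", "60");
            config.setProperty("hazelcast.merge.next.run.delay.seconds", "30");
            Hazelcast.newHazelcastInstance(config);
        }
    }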

@mmedenjak (Contributor) commented Mar 6, 2019

I suggest closing this issue based on the comments from the email:

Rafal:

I've added a comment in the GH issue, please check. In general, I couldn't reproduce exactly the same issue, but still see similar errors in logs (but the member recovers from them). If that's expected, then we can close the GH issue and make the issue resolved.

Mehmet:

To my understanding result looks good. I don't know why, but during pod restart/upgrade, network connections are not reset (somehow), and other members only realize missing member with heartbeat timeouts.

Seems to me there is nothing we can do on our side anymore and the original issue is resolved.

@leszko (Contributor, Author) commented Mar 6, 2019

I'd feel safer if @mdogan had a look at the latest logs I've sent.

The latest issue (split brain) is definitely better than what we had when I opened the issue (crashing completely) ;) That said, I'm still not super satisfied that we accept a split brain as kind of a normal part of the rolling upgrade flow on Kubernetes.

@mmedenjak (Contributor) commented Mar 6, 2019

I thought he had and this was his reply:

To my understanding result looks good. I don't know why, but during pod restart/upgrade, network connections are not reset (somehow), and other members only realize missing member with heartbeat timeouts.

But I may be mistaken.

@mdogan (Member) commented Mar 6, 2019

The latest case looks very complicated. I couldn't understand it very well. As far as I can tell, here is what happens:

  • When 10.244.0.12:5701 starts up at 2:39:18, it tries to connect to 10.244.1.5, 10.244.2.6 and 10.244.0.11. It discovers 10.244.0.11 as the master, most probably from 10.244.1.5 or 10.244.2.6. But 10.244.0.11 is off, so it cannot connect to it until 10.244.1.5 and 10.244.2.6 go offline. Then it starts as standalone at 2:43:27.

  • When 10.244.2.7:5701 starts up at 2:40:18, it tries to connect to 10.244.1.5, 10.244.0.12 and 10.244.2.6. It discovers 10.244.2.6 as the master, most probably from 10.244.1.5, and tries to join 10.244.2.6 (which is offline at that point) until 10.244.1.5 goes offline. Then it discovers 10.244.0.12:5701 and joins it at 2:43:33.

  • When 10.244.1.6:5701 starts up at 2:41:18, it tries to connect to 10.244.2.7 and 10.244.0.12. But since they are already searching for a master, they ignore it. Then 10.244.1.6:5701 starts as standalone at 2:41:25, before the other two create a cluster. I don't understand why this happens; debug logs are not available, so it's difficult to tell.

There are also connections from the addresses 10.240.0.4, 10.240.0.5, 10.240.0.6, 10.244.1.1, 10.244.2.1; I assume these are from some liveness and/or health checks.

@mdogan (Member) commented Mar 6, 2019

I think the main issue with k8s is that network connections are not interrupted/reset, so members don't notice missing members until the failure detector timeout kicks in.

@leszko (Contributor, Author) commented Mar 7, 2019

Thanks @mdogan for having a look. I think we can conclude that we still have some issue (when run on k8s), but it is not as critical as the one we started with ;)

I think we should close this issue and create a new one with just the split-brain description and logs. This new issue doesn't have to go into 3.12, but we should still address it at some point.

@mdogan @mmedenjak, are you fine with that? Can I close this issue and open a new one with the "3.13" milestone?

@leszko (Contributor, Author) commented Mar 7, 2019

Interestingly enough, I did a final check of the rolling upgrade across a minor version (Hazelcast Enterprise, 3.11.2 => 3.12-BETA-1) and the rolling upgrade itself went fine. Then I wanted to terminate the whole cluster (one member at a time) and got the following logs.

gke-logs.txt

Again, I cannot say that something went wrong, because no data was lost, I wanted to terminate the cluster, and it finally got terminated. However, it definitely does not look healthy.

I use graceful shutdown, so I could imagine that the last member, or the second-to-last, would get some errors, because they have no other member to which they could migrate the data. However, I scaled down 6 -> 5 -> 4 -> ... and each member shut down with such issues.

@mdogan (Member) commented Mar 7, 2019

It looks like the same problem. The master shuts down, but sometimes the nodes cannot detect its termination because they don't receive a socket reset (EOF / FIN). That's why, each time, the master candidate (the next member after the master) waits until the failure detector timeout.

Currently, a member doesn't send a shutdown message to the cluster while leaving, since we rely on socket events.

@jerrinot (Contributor) commented Mar 8, 2019

This looks like a nasty issue.

I guess it's the same problem as when you have a cluster running on VirtualBox/VMware and you pause one machine: it takes time for the other machines to realize this, simply because there is no one to send the FIN/RST. This is definitely worth further investigation.

I wonder whether there is any quick fix we could apply now. For example, the TCP keep-alive feature could help, but that's just my gut feeling for now (see the note below).
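For what it's worth, a sketch of the keep-alive angle: Hazelcast already enables SO_KEEPALIVE on member sockets by default (hazelcast.socket.keep.alive), and how quickly a dead peer is then detected depends on OS-level keepalive timers (e.g. net.ipv4.tcp_keepalive_time), which are outside Hazelcast's control:

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;

    public class KeepAliveSketch {
        public static void main(String[] args) {
            Config config = new Config();
            // SO_KEEPALIVE on member-to-member sockets; true is already the default.
            config.setProperty("hazelcast.socket.keep.alive", "true");
            Hazelcast.newHazelcastInstance(config);
        }
    }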

@jerrinot (Contributor) commented Mar 8, 2019

@mdogan (Member) commented Mar 8, 2019

But what we noticed was that only one or two of the TCP connections held by Castle would raise an exception and get closed. The remaining connections in Castle would continue accepting data for as much as 10-15 minutes before finally raising exceptions. For this duration, these connections were essentially a black-hole as the container they connected to no longer existed.

Very similar observation...

@leszko (Contributor, Author) commented Mar 8, 2019

Steps to reproduce the issue:

Step 1: Install Hazelcast cluster

$ helm install --name my-release --set image.tag=3.11.2,cluster.memberCount=4,service.type=LoadBalancer,hazelcast.licenseKey=<license-key> hazelcast/hazelcast-enterprise

Step 2: Fill the data

I connect with an external dummy client and insert some data (200 entries, each a new Long[1000000]).

Step 3: Start rolling upgrade

$ helm upgrade my-release --set image.tag=3.12-BETA-1,cluster.memberCount=4,service.type=LoadBalancer,hazelcast.licenseKey=<license-key> hazelcast/hazelcast-enterprise

In my case, this step went correctly this time.

Step 4: Terminate the cluster

$ helm delete --purge my-release

Here, I got the exception from my last comment. Each member had issues while trying to terminate.

Please see the logs:
logs-member-0.txt
logs-member-1.txt
logs-member-2.txt
logs-member-3.txt

@mdogan (Member) commented Mar 25, 2019

Hopefully fixed by #14343 and #14752

@mdogan mdogan closed this Mar 25, 2019

@leszko (Contributor, Author) commented Mar 28, 2019

I played with the current master and could not reproduce the issue. Thanks a lot @mdogan for fixing it. Great job 👍
