Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HZ-581] Address issues with hostnames #20014

Merged
merged 86 commits into from Jan 11, 2022

Conversation

ufukyilmaz
Copy link
Contributor

@ufukyilmaz ufukyilmaz commented Dec 1, 2021

This PR includes the connection manager changes which are
prerequisite for resolving the hostname related issues.
What I'm trying to do in this PR is to handle the multiple addresses
that refer to the members in the connection manager layer, and I tried
to make only the public address of the members visible at the higher
level of the code except for the places where connection initiation is
made that takes addresses from the config.
Also, tried to keep the scope of this PR especially at the connection
manager layer and avoided some changes that would cause changes
in the other layers such as operation service.

The changes in this PR are:

  • LocalAddressRegistry is introduced to store Address-UUID
    and UUID-Addresses mappings.
  • Use endpoint UUID's for managing the connection registration. It's mainly
    adjusting the connection managers to this uuid changes
  • Apply these new uuid changes to mock network classes as well.
  • Use Member's MEMBER protocol public address if available (o/w when
    advanced network is being used, then use the corresponding protocol's
    public address) as the remoteAddress of connection which previously
    exposes the other target addresses to the outside of the connection
    manager layer (OperationRunnerImpl gets this remote address of the
    connection and sets it as the caller address of Operation so they expose)

The most difficult part of the PR is the determination of the lifecycle of the
UUID->Address, Address->UUID entries' lifecycle. We must determine
when to remove these entries from our registry. We must remove these entries at
some point since they can get stale (and also resource usage)
The events that can leave these entries stale are as follows:

  1. In some scenarios, the UUID for the specific host machine (for the specific
    network address) can be changed. These are:
    • restarting a member on the same machine. The new hz member has a different
      UUID than the previous member despite using the same network addresses
    • invocation of ClusterServiceImpl#reset triggers the UUID change on the
      member without any change on the network address to which the member is bound.
      This method is only called before the split brain merge happens.
  2. At Hot Restart (Persistence) recovery, a new member started on a different
    host machine can use the UUID of a crashed member on a different address/es.

These two cases should be considered while determining the lifecycle of this
entries.
For now, UUID-Address entries registration takes place in:
For the members:

  • Member handshake processing for the member connections
    For the clients:
  • In the connection registration phase of client connection registrations

UUID-Address entries removal performed:
For the members:

  • when all connections between members are closed. (There may be multiple
    connections between members, and we must not clear these entries when
    one of these connections is closed.)

For the clients:

  • We remove entries belonging to client connections when client connections are
    closed (safe to remove as there is only one active connection).

In testing:
Need to test the consistency of the entries in this LocalAddressRegistry
(there must be no stale entries inside), and need to add tests that guarantee
that the entries inside do not leak.

Nightly test run on the latest commit: http://jenkins.hazelcast.com/job/ufuk-nightly-runner/26/

Checklist:

  • Labels (Team:, Type:, Source:, Module:) and Milestone set
  • Label Add to Release Notes or Not Release Notes content set
  • Request reviewers if possible
  • Send backports/forwardports if fix needs to be applied to past/future releases
  • New public APIs have @Nonnull/@Nullable annotations
  • New public APIs have @since tags in Javadoc

@ufukyilmaz ufukyilmaz added Team: Core Source: Internal PR or issue was opened by an employee Module: Network I/O labels Dec 1, 2021
@ufukyilmaz ufukyilmaz added this to the 5.1 milestone Dec 1, 2021
@ufukyilmaz ufukyilmaz self-assigned this Dec 1, 2021
@hazelcast hazelcast deleted a comment from hz-devops-test Dec 1, 2021
@hazelcast hazelcast deleted a comment from hz-devops-test Dec 2, 2021
@AyberkSorgun AyberkSorgun changed the title Hostname fixes part 1 HZ-528 Hostname fixes part 1 Dec 3, 2021
@ufukyilmaz ufukyilmaz changed the title HZ-528 Hostname fixes part 1 HZ-581 Hostname fixes part 1 Dec 3, 2021
@ufukyilmaz ufukyilmaz changed the title HZ-581 Hostname fixes part 1 [HZ-581] Hostname fixes part 1 Dec 6, 2021
@hazelcast hazelcast deleted a comment from hz-devops-test Jan 3, 2022
@hazelcast hazelcast deleted a comment from hz-devops-test Jan 4, 2022
Ufuk Yılmaz added 5 commits January 5, 2022 13:22
Since, it creates so much complexity without much gain.
Because if we close this connection after it starts to be used from one
side, we lose some packets on the way.
…part-1

# Conflicts:
#	hazelcast/src/test/java/com/hazelcast/instance/TestNodeContext.java
@ufukyilmaz ufukyilmaz changed the title [HZ-581] Hostname fixes part 1 [HZ-581] Hostname fixes Jan 10, 2022
@ufukyilmaz ufukyilmaz changed the title [HZ-581] Hostname fixes [HZ-581] Address issues with hostname Jan 10, 2022
@frant-hartm frant-hartm changed the title [HZ-581] Address issues with hostname [HZ-581] Address issues with hostnames Jan 10, 2022
Ufuk Yılmaz added 2 commits January 10, 2022 17:49
Formerly, the tests were assuming that there is only one
connection on each plane between the members. So, tests
started to fail when we remove duplicate handling.
Copy link
Member

@kwart kwart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to have this in the BETA release.

We still investigate some failures within the split-brain scenarios (reproducer in https://github.com/hazelcast/hazelcast-qe/tree/hostnames-bouncing/it/hostnames).

@ramizdundar ramizdundar merged commit 16ec195 into hazelcast:master Jan 11, 2022
@mmedenjak mmedenjak mentioned this pull request Jan 12, 2022
1 task
ufukyilmaz pushed a commit to ufukyilmaz/hazelcast that referenced this pull request Jan 12, 2022
ufukyilmaz pushed a commit to ufukyilmaz/hazelcast that referenced this pull request Jan 12, 2022
@kwart kwart mentioned this pull request Feb 1, 2022
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants