
[BUG] Errors/Broken operations during rolling upgrade of clusters from 1.3 to 2.0 #2228

Closed
saikaranam-amazon opened this issue Nov 3, 2022 · 28 comments
Labels
bug, sprint backlog, triaged, v1.3.7

Comments

@saikaranam-amazon
Member

What is the bug?
Errors/Broken search results during rolling upgrade of clusters from 1.3 to 2.0

How can one reproduce the bug?

  1. Create a 1.3 cluster with at least 2 nodes.
  2. Create an index with 2 primaries so that at least one primary is allocated per node.
  3. Upgrade one of the nodes to OS 2.0.
  4. Invoke a search query from the 1.3 node that targets all shards.
  5. Observe that the search fails on the upgraded node:
{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 1,
        "index" : "test-index",
        "node" : "O7kxX-lMTAKvXBj91-LQ8Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
          "caused_by" : {
            "type" : "class_not_found_exception",
            "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        ".....
          }
        }
      }
    ]

Notice that the failed shard count is 1

What is the expected behavior?
Rolling upgrade of clusters should complete without any issues.

What is your host/environment?

  • OS: 1.3 to OS: 2.0 upgrade
  • Plugins: Security
@saikaranam-amazon added the bug and untriaged labels Nov 3, 2022
@saikaranam-amazon
Member Author

Upon investigation,

@krishna-ggk
Contributor

Looks like this was just discovered a few days ago - #2168

@scrawfor99
Collaborator

Hi @saikaranam-amazon, thank you for filing this issue. Currently, the best option is likely a full cluster restart upgrade to migrate between major versions.

@scrawfor99
Collaborator

Hi @saikaranam-amazon are you also running the tarball installation? Thank you.

@saikaranam-amazon
Member Author

@scrawfor99 - Tested using docker containers with one node on 1.3 and other on 2.0 OS versions.

  • Should be seen using tarball installation as well.

@scrawfor99
Collaborator

@scrawfor99 - Tested using docker containers with one node on 1.3 and other on 2.0 OS versions.

* Should be seen using tarball installation as well.

Did you run the containers separately or with a compose file? I do not know if it matters, but I am trying to pinpoint the cause of the issue as well as which setups are impacted. The original discussion seemed to suggest that the Kubernetes install may not be impacted.

@peternied
Member

peternied commented Nov 3, 2022

Looks like we have two issues tracking the same problem:

[Update] These issues are not related

@scrawfor99
Collaborator

scrawfor99 commented Nov 3, 2022

I am posting some curl requests for easy reuse. I am not sure of the best way to diagnose the root cause, so I have been trying to reproduce the steps that trigger the error. If these are not useful to anyone else, feel free to ignore them:

Create a new index with multiple primaries: curl -X PUT --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1' -H 'Content-Type: application/json' -d '{"settings": {"index": {"number_of_shards": 3,"number_of_replicas": 1}}, "mappings": {"properties": {"age": {"type": "integer"} } }, "aliases":{"sample-alias1": {}} }'

Search from node 0: curl -X GET --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1/_search' -H 'Content-Type: application/json' -d '{"from": 0}'

@cliu123
Member

cliu123 commented Nov 3, 2022

Looks like we have two issues tracking the same problem:

They are different issues. The BWC failure (#2221) occurs on upgrading from 2.4.0 to 3.0.0, which is new. This issue is about a failing rolling upgrade from 1.x to 2.x.

@scrawfor99
Collaborator

scrawfor99 commented Nov 3, 2022

This appears to be the commit that introduced the version update system: #454

If it cannot upgrade from 1.x to 2.x, it may have to do with the switch from v6 to v7 security versioning, as I believe that also occurred during that change.

@scrawfor99
Collaborator

Update: Looking further, it seems the 1.3 to 2.0 change that renamed packages from OpenDistro to OpenSearch is the likely cause: a search that starts on the 1.3 node throws a ClassNotFoundException on the 2.0 node, since com.amazon.opendistroforelasticsearch.security.user.User became org.opensearch.security.user.User. In other words, the search starts on the 1.3 node, runs there, then passes the request to the 2.0 node without accounting for the refactor between versions. This would also align with rolling upgrades working for 1.x->1.x and 2.x->2.x but not 1.x->2.x, since that is when the rename took place.

A quick workaround may be to add a catch for the ClassNotFoundException that retries the search request using the new naming convention and combines the results from the 1.3 nodes with those from the 2.0 nodes. @cliu123, what do you think? I am not sure of the commands to go through the steps above (I have not yet had a chance to watch the demo video of how requests are processed), nor where the search calls originate (which is where the catch would go), but I hope this points people in the right direction, as I am fairly sure this is the cause.
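To make the analysis above concrete, here is a minimal, self-contained sketch (class and method names are hypothetical, not the plugin's actual API) of remapping the old ODFE package prefix to the OpenSearch one during Java deserialization, rather than catching ClassNotFoundException after the fact:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

public class OdfeNameRemapDemo {
    static final String ODFE_PACKAGE = "com.amazon.opendistroforelasticsearch";
    static final String OS_PACKAGE = "org.opensearch";

    // Rewrite a serialized class name from the ODFE namespace to the
    // OpenSearch namespace; names outside the ODFE package are untouched.
    public static String remap(String className) {
        return className.startsWith(ODFE_PACKAGE)
            ? className.replace(ODFE_PACKAGE, OS_PACKAGE)
            : className;
    }

    // An ObjectInputStream that applies the remapping before class lookup,
    // so a 2.0 node could read a stream written by a 1.3 node in ODFE
    // compatibility mode without throwing ClassNotFoundException.
    static class RemappingObjectInputStream extends ObjectInputStream {
        RemappingObjectInputStream(InputStream in) throws IOException {
            super(in);
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            return Class.forName(remap(desc.getName()), false,
                    getClass().getClassLoader());
        }
    }

    public static void main(String[] args) {
        // prints org.opensearch.security.user.User
        System.out.println(remap("com.amazon.opendistroforelasticsearch.security.user.User"));
    }
}
```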

@krishna-ggk
Contributor

krishna-ggk commented Nov 4, 2022

@scrawfor99 Thanks for the analysis. Here is my thinking based on code read.

It seems that 1.3 always operates in ODFE backward-compatibility mode when "sending" data on the write path, as it does not know whether the destination node is an ES or an OpenSearch node (link).

        public ObjectStreamClass replace(final ObjectStreamClass desc) {
            final String name = desc.getName();
            if (name.startsWith(OS_PACKAGE)) {
                return nameToDescriptor.computeIfAbsent(name, s -> {
                    SpecialPermission.check();
                    // we can't modify original descriptor as it is cached by ObjectStreamClass, create clone
                    final ObjectStreamClass clone = AccessController.doPrivileged(
                        (PrivilegedAction<ObjectStreamClass>)() -> SerializationUtils.clone(desc)
                    );
                    DescriptorNameSetter.setName(clone, s.replace(OS_PACKAGE, ODFE_PACKAGE));
                    return clone;
                });
            }
            return desc;
        }
    }

To support a 1.3 to 2.x rolling upgrade, 2.x will have to understand 1.3 classes. I understand that we can handle ClassNotFoundException, but do you think we can avoid the additional try/catch by handling it in the descriptor itself (as 1.3 does)?

        protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
            ObjectStreamClass desc = super.readClassDescriptor();
            final String name = desc.getName();
            if (name.startsWith(ODFE_PACKAGE)) {
                desc = ObjectStreamClass.lookup(Class.forName(name.replace(ODFE_PACKAGE, OS_PACKAGE)));
                logger.debug("replaced descriptor name from [{}] to [{}]", name, desc.getName());
            }
            return desc;
        }
    }

One question - ideally it seems we should be able to support rolling upgrades from ODFE to 2.x too, and this appears to be the only wrinkle. Should we consider supporting ODFE to 2.x rolling upgrades as well, since many users may want to upgrade directly to 2.x?

Thoughts?

@cwperks
Member

cwperks commented Nov 4, 2022

@krishna-ggk Here's some findings from a cursory look at this issue:

Upgrade from ODFE to OS 2.x is not supported due to an issue with JDK 17 and illegal reflective access. See relevant PR here: #1691.

setiah goes into detail on the issue here: #1653 (comment), and vrozov mentions technical limitations of the solution suggested by nknize here: #1259 (comment)

I'm confused that you are receiving an error message referencing com.amazon.opendistroforelasticsearch.security.user.User, as this is not the class name of the user in 1.3. The fully-qualified class name for User is org.opensearch.security.user.User: https://github.com/opensearch-project/security/blob/1.3/src/main/java/org/opensearch/security/user/User.java

Where could that reference be coming from?

I see that the 1.3 node writes this for backwards compatibility with ODFE nodes.

I'm able to reproduce with a _nodes request from the 1.3.6 node: curl -X GET --insecure 'https://admin:admin@mixed-cluster-opensearch-node2-1:9200/_nodes'

@krishna-ggk
Contributor

@krishna-ggk Here's some findings from a cursory look at this issue:

Upgrade from ODFE to OS 2.x is not supported due to an issue with JDK 17 and illegal reflective access. See relevant PR here: #1691.

setiah goes in length on the issue here: #1653 (comment) and vrozov mentions technical limitations of the solution mentioned by nknize here: #1259 (comment)

@cwperks Thanks for bringing focus to this; the same now applies to OS 1.x to OS 2.x upgrades as well. It appears that any reflection-based solution is then ruled out.

@cwperks
Member

cwperks commented Nov 7, 2022

@krishna-ggk We may be able to conditionally apply the ODFE rewrite logic in 1.3 depending on the mixture of nodes in the cluster.

ClusterInfoHolder (https://github.com/opensearch-project/security/blob/1.3/src/main/java/org/opensearch/security/configuration/ClusterInfoHolder.java#L86) has a mechanism for determining node versions within the cluster, and the logic that serializes the user object is located here:

getThreadContext().putHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER, Base64Helper.serializeObject(origUser));

@cliu123
Member

cliu123 commented Nov 7, 2022

The other potential solution is to avoid reflection entirely, if possible, sidestepping the illegal reflective access problem with JDK 17.

@cwperks added the triaged and sprint backlog labels and removed the untriaged label Nov 7, 2022
@peternied added the v2.4.0 label Nov 8, 2022
@peternied
Member

@cwperks @cliu123 is one of you working on this issue? I don't see an assignee, and this seems like a candidate for 2.4.0 - is that correct?

FYI @scrawfor99 since you are driving the 2.4.0 release

@scrawfor99
Collaborator

My understanding is that this issue is being worked on, but there is unlikely to be a trivial solution. I do not know that a fix is in the making for 2.4.0 specifically, so much as an early patch, should the solution be as complicated as it appears. Correct me if I am wrong, @cliu123 @cwperks.

@peternied self-assigned this Nov 8, 2022
@peternied
Member

peternied commented Nov 8, 2022

I'll be looking into this issue. Looks like the problem we are running into is related to this code:

private String getUser() {
    User user = threadPool.getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_USER);
    if (user == null && threadPool.getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER) != null) {
        user = (User) Base64Helper.deserializeObject(threadPool.getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER));
    }
    return user == null ? null : user.getName();
}

Audit logs get passed around, and we are seeing this issue in interop between 1.3 and 2.x because the thread context is sent over the wire containing an object with a rewritten namespace. Because the descriptor-rewriting code in DescriptorReplacer [1] was pulled for JDK 17, the object's class won't be found on the class path.

Next steps:

  • Need to add 1.3 -> 2.0 migration tests somewhere to verify where we are at
  • Look into alternative ways to handle that object's deserialization - it doesn't need to be treated as a 'User'; it just needs to be null, or an Object with a .getName() method.
  • Determine if there are other sources of failure that need to be investigated

[1]

private static class DescriptorReplacer {

@krishna-ggk
Contributor

Thanks @peternied. One question - while I understand that the current failure was for an instance of User, should we be worried about encountering deserialization failures for instances of other security context classes?

@peternied
Member

@krishna-ggk You are correct to worry that there might be other sources of failure. We are at the initial steps of this investigation, and we will need an automated way to validate the cluster beforehand.

@cliu123 removed the v2.4.0 label Nov 9, 2022
@cliu123
Member

cliu123 commented Nov 9, 2022

This is not going to be a part of 2.4.0. I've removed the 2.4.0 label.

@peternied removed their assignment Nov 14, 2022
@peternied
Member

I've created a way to test this scenario and others like it, but haven't been able to look into this bug itself.

@cwperks
Member

cwperks commented Nov 17, 2022

@peternied I will look into this today and see if there's a possibility of getting the min node version from the ClusterInfoHolder to conditionally apply the serialization logic that replaces the package name with the opendistro package name.

If the min node in the cluster is OS 1, then no need to perform the rewrite logic.

If the min node in the cluster is ODFE, then apply the rewrite logic.

@ronniepg

@cwperks This issue will happen even when the min node in the cluster is OS 1. As reported in #2168, the cluster was originally set up on OS 1.x and was never upgraded from ODFE.

Also attaching the patch which fixed the rolling upgrade.
#2168-patch.txt

@cwperks
Member

cwperks commented Nov 18, 2022

@ronniepg Understood and that's what this PR is targeting: #2268

There is logic in 1.3, kept for backwards compatibility with ODFE, that always rewrites package names from org.opensearch to com.amazon.opendistroforelasticsearch so that when messages are picked up by ODFE nodes, they can understand them. That's a problem when going from OS 1 to OS 2, because OS 2 does not understand the com.amazon.opendistroforelasticsearch packages. The PR above aims to apply the serialization rewrite only when there are ODFE nodes in the cluster: if you have only OS 1 nodes and are going to OS 2, it should not perform the package rewrite when serializing for the transport action.

@ronniepg

Thanks for the reference to #2268. So with this fix, an OS 1.x cluster will need to be upgraded to the latest 1.x version, which contains the fix, before a rolling upgrade to 2.x.

The patch I attached earlier was for OS 2.x. ODFE nodes and OS 1 nodes will still be able to talk to each other since, as you mentioned, OS 1.3 handles both the serialization and deserialization. Once the cluster is fully upgraded to OS 1, it can then be rolling-upgraded to OS 2.

I guess the fix in the OS 1.x code would be preferred, so as not to keep any 'opendistro' references in the 2.x code base.

@cwperks added the v1.3.7 label Nov 29, 2022
@DarshitChanpura
Member

Closing this as it was addressed in #2268
