
[BUG] Errors/Broken operations during rolling upgrade of clusters from 1.3 to 2.0 #2228

Closed
saikaranam-amazon opened this issue Nov 3, 2022 · 28 comments
Labels
bug, sprint backlog, triaged, v1.3.7

Comments

@saikaranam-amazon
Member

What is the bug?
Errors/Broken search results during rolling upgrade of clusters from 1.3 to 2.0

How can one reproduce the bug?

  1. Create a 1.3 cluster with at least 2 nodes.
  2. Create an index with 2 primaries so that at least one primary is allocated per node.
  3. Upgrade one of the nodes to OS 2.0.
  4. Invoke a search query from the 1.3 node that targets all shards.
  5. Observe that the search fails on the upgraded node:
{
  "took" : 40,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 1,
        "index" : "test-index",
        "node" : "O7kxX-lMTAKvXBj91-LQ8Q",
        "reason" : {
          "type" : "exception",
          "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
          "caused_by" : {
            "type" : "class_not_found_exception",
            "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
          }
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        ".....
          }
        }
      }
    ]

Notice that the failed shard count is 1

What is the expected behavior?
Rolling upgrade of clusters should complete without any issues.

What is your host/environment?

  • OS: 1.3 to OS: 2.0 upgrade
  • Plugins: Security
@saikaranam-amazon added the bug and untriaged labels Nov 3, 2022
@saikaranam-amazon
Member Author

Upon investigation,

@krishna-ggk
Contributor

Looks like this was just discovered a few days ago - #2168

@scrawfor99
Collaborator

Hi @saikaranam-amazon, thank you for filing this issue. Currently, the best option is likely a full cluster restart upgrade to migrate between major versions.

@scrawfor99
Collaborator

Hi @saikaranam-amazon are you also running the tarball installation? Thank you.

@saikaranam-amazon
Member Author

@scrawfor99 - Tested using docker containers with one node on 1.3 and other on 2.0 OS versions.

  • Should be seen using tarball installation as well.

@scrawfor99
Collaborator

@scrawfor99 - Tested using docker containers with one node on 1.3 and other on 2.0 OS versions.

* Should be seen using tarball installation as well.

Did you run the containers separately or with a compose file? I do not know if it matters, but I am trying to pinpoint the cause of the issue as well as which setups are impacted. The original discussion seemed to suggest that the Kubernetes install may not be impacted.

@peternied
Member

peternied commented Nov 3, 2022

Looks like we have two issues tracking the same problem:

[Update] These issues are not related

@scrawfor99
Collaborator

scrawfor99 commented Nov 3, 2022

I am posting some curl requests for easy reuse. I am not sure of the best way to diagnose the root cause, so I have been trying to reproduce the steps that trigger the error. If these are not useful to anyone else, feel free to ignore them:

Create a new index with multiple primaries: curl -X PUT --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1' -H 'Content-Type: application/json' -d '{"settings": {"index": {"number_of_shards": 3,"number_of_replicas": 1}}, "mappings": {"properties": {"age": {"type": "integer"} } }, "aliases":{"sample-alias1": {}} }'

Search from node 0: curl -X GET --insecure -u 'admin:admin' 'https://localhost:9200/sample-index1/_search' -H 'Content-Type: application/json' -d '{"from": 0}'

@cliu123
Member

cliu123 commented Nov 3, 2022

Looks like we have two issues tracking the same problem:

They are different issues. The BWC failure (#2221) occurs on upgrading from 2.4.0 to 3.0.0, which is new. This issue is about a failing rolling upgrade from 1.x to 2.x.

@scrawfor99
Collaborator

scrawfor99 commented Nov 3, 2022

This appears to be the commit that introduced the version update system: #454

If it cannot upgrade from 1.x to 2.x, it may have to do with the switch from v6 to v7 security versioning, as I believe that also occurred during that change.

@scrawfor99
Collaborator

Update: Looking further, it seems the 1.3 to 2.0 change that renamed packages from OpenDistro to OpenSearch is the likely cause: a search that starts on the 1.3 node throws a ClassNotFoundException on the 2.0 node, since com.amazon.opendistroforelasticsearch.security.user.User became org.opensearch.security.user.User. In other words, the search starts on the 1.3 node, runs there, then passes the request to the 2.0 node without accounting for the refactor between versions. This would also align with rolling upgrades working for 1.x->1.x and 2.x->2.x but not 1.x->2.x, since that is when the rename took place.

A quick workaround may be to add a catch for the ClassNotFoundException that retries the search request using the new naming convention and combines the results from the 1.3 nodes with those from the 2.0 nodes. @cliu123, what do you think? I am not sure of the commands to go through the steps above (I have not yet had a chance to watch the demo video of how requests are processed), nor where the search calls originate (which is where the catch would go), but I hope this points people in the right direction, as I am fairly sure this is the cause.
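To make the analysis above concrete, here is a minimal, self-contained sketch (class and method names are hypothetical, not the plugin's actual API) of remapping the old ODFE package prefix to the OpenSearch one during Java deserialization, rather than catching ClassNotFoundException after the fact:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

public class OdfeNameRemapDemo {
    static final String ODFE_PACKAGE = "com.amazon.opendistroforelasticsearch";
    static final String OS_PACKAGE = "org.opensearch";

    // Rewrite a serialized class name from the ODFE namespace to the
    // OpenSearch namespace; names outside the ODFE package are untouched.
    public static String remap(String className) {
        return className.startsWith(ODFE_PACKAGE)
            ? className.replace(ODFE_PACKAGE, OS_PACKAGE)
            : className;
    }

    // An ObjectInputStream that applies the remapping before class lookup,
    // so a 2.0 node could read a stream written by a 1.3 node in ODFE
    // compatibility mode without throwing ClassNotFoundException.
    static class RemappingObjectInputStream extends ObjectInputStream {
        RemappingObjectInputStream(InputStream in) throws IOException {
            super(in);
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            return Class.forName(remap(desc.getName()), false,
                    getClass().getClassLoader());
        }
    }

    public static void main(String[] args) {
        // prints org.opensearch.security.user.User
        System.out.println(remap("com.amazon.opendistroforelasticsearch.security.user.User"));
    }
}
```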

@krishna-ggk
Contributor

krishna-ggk commented Nov 4, 2022

@scrawfor99 Thanks for the analysis. Here is my thinking based on code read.

It seems that 1.3 always operates in ODFE backward-compatibility mode when "sending" data on the write path, as it does not know whether the destination node is an ES or an OpenSearch node (link).

        public ObjectStreamClass replace(final ObjectStreamClass desc) {
            final String name = desc.getName();
            if (name.startsWith(OS_PACKAGE)) {
                return nameToDescriptor.computeIfAbsent(name, s -> {
                    SpecialPermission.check();
                    // we can't modify original descriptor as it is cached by ObjectStreamClass, create clone
                    final ObjectStreamClass clone = AccessController.doPrivileged(
                        (PrivilegedAction<ObjectStreamClass>)() -> SerializationUtils.clone(desc)
                    );
                    DescriptorNameSetter.setName(clone, s.replace(OS_PACKAGE, ODFE_PACKAGE));
                    return clone;
                });
            }
            return desc;
        }
    }

To support a 1.3 to 2.x rolling upgrade, 2.x will have to understand 1.3 classes. I understand that we can handle ClassNotFoundException, but do you think we can avoid the additional try/catch by handling it in the descriptor itself (as 1.3 does)?

        protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
            ObjectStreamClass desc = super.readClassDescriptor();
            final String name = desc.getName();
            if (name.startsWith(ODFE_PACKAGE)) {
                desc = ObjectStreamClass.lookup(Class.forName(name.replace(ODFE_PACKAGE, OS_PACKAGE)));
                logger.debug("replaced descriptor name from [{}] to [{}]", name, desc.getName());
            }
            return desc;
        }
    }

One question - ideally it seems we should be able to support rolling upgrades from ODFE to 2.x too, and this appears to be the only wrinkle. Should we consider supporting ODFE to 2.x rolling upgrades as well, since many users may want to upgrade directly to 2.x?

Thoughts?

@cwperks
Member

cwperks commented Nov 4, 2022

@krishna-ggk Here's some findings from a cursory look at this issue:

Upgrade from ODFE to OS 2.x is not supported due to an issue with JDK 17 and illegal reflective access. See relevant PR here: #1691.

setiah goes into detail on the issue here: #1653 (comment), and vrozov mentions technical limitations of the solution suggested by nknize here: #1259 (comment)

I'm confused that you are receiving an error message referencing com.amazon.opendistroforelasticsearch.security.user.User, as this is not the class name of the user in 1.3. The fully-qualified class name for User is org.opensearch.security.user.User: https://github.com/opensearch-project/security/blob/1.3/src/main/java/org/opensearch/security/user/User.java

Where could that reference be coming from?

I see that the 1.3 node writes this for backwards compatibility with ODFE nodes.

I'm able to reproduce with a _nodes request from the 1.3.6 node: curl -X GET --insecure 'https://admin:admin@mixed-cluster-opensearch-node2-1:9200/_nodes'

@krishna-ggk
Contributor

@krishna-ggk Here's some findings from a cursory look at this issue:

Upgrade from ODFE to OS 2.x is not supported due to an issue with JDK 17 and illegal reflective access. See relevant PR here: #1691.

setiah goes in length on the issue here: #1653 (comment) and vrozov mentions technical limitations of the solution mentioned by nknize here: #1259 (comment)

@cwperks Thanks for bringing focus to this; the same now applies to OS 1.x to OS 2.x upgrades as well. It appears that any reflection-based solution is then ruled out.

@cwperks
Member

cwperks commented Nov 7, 2022

@krishna-ggk We may be able to conditionally apply the ODFE rewrite logic in 1.3 depending on the mixture of nodes in the cluster.

ClusterInfoHolder (https://github.com/opensearch-project/security/blob/1.3/src/main/java/org/opensearch/security/configuration/ClusterInfoHolder.java#L86) has a mechanism for determining node versions within the cluster, and the logic that serializes the user object is located here:

getThreadContext().putHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER, Base64Helper.serializeObject(origUser));

@cliu123
Member

cliu123 commented Nov 7, 2022

The other potential solution is to avoid reflection entirely, if possible, sidestepping the illegal reflective access problem with JDK 17.

@cwperks added the triaged and sprint backlog labels and removed the untriaged label Nov 7, 2022
@peternied added the v2.4.0 label Nov 8, 2022
@peternied
Member

@cwperks @cliu123 is one of you working on this issue? I don't see an assignee, and this seems like a candidate for 2.4.0 - is that correct?

FYI @scrawfor99 since you are driving the 2.4.0 release

@scrawfor99
Collaborator

My understanding is that this issue is being worked on, but there is unlikely to be a trivial solution. I do not know that a fix is in the making for 2.4.0 specifically, so much as an early patch, should the solution be as complicated as it appears. Correct me if I am wrong, @cliu123 @cwperks.

@peternied self-assigned this Nov 8, 2022
@peternied
Member

peternied commented Nov 8, 2022

I'll be looking into this issue. Looks like the problem we are running into is related to this code:

private String getUser() {
    User user = threadPool.getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_USER);
    if (user == null && threadPool.getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER) != null) {
        user = (User) Base64Helper.deserializeObject(threadPool.getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER));
    }
    return user == null ? null : user.getName();
}

Audit logs get passed around, and we are seeing this issue in interop between 1.3 and 2.x because the thread context is sent over the wire containing an object with a rewritten namespace. Because the descriptor-rewriting code in DescriptorReplacer [1] was pulled for JDK 17, the object's class won't be found on the class path.

Next steps:

  • Need to add 1.3 -> 2.0 migration tests somewhere to verify where we are at
  • Look into alternative ways to handle that object's deserialization - it doesn't need to be treated as a 'User'; it just needs to be null, or an Object with a .getName() method.
  • Determine if there are other sources of failure that need to be investigated

[1]

private static class DescriptorReplacer {

@krishna-ggk
Contributor

Thanks @peternied. One question - while I understand that the current failure was for an instance of User, should we be worried about encountering deserialization failures for instances of other security context classes?

@peternied
Member

@krishna-ggk You are correct to worry that there might be other sources of failure. We are at the initial steps of this investigation, and we will need an automated way to validate the cluster beforehand.

@cliu123 removed the v2.4.0 label Nov 9, 2022
@cliu123
Member

cliu123 commented Nov 9, 2022

This is not going to be a part of 2.4.0. I've removed the 2.4.0 label.

@peternied removed their assignment Nov 14, 2022
@peternied
Member

I've created a way to test this scenario and others like it, but haven't been able to look into this bug itself.

@cwperks
Member

cwperks commented Nov 17, 2022

@peternied I will look into this today and see if there's a possibility of getting the min node version from the ClusterInfoHolder to conditionally apply the serialization logic that replaces the package name with the opendistro package name.

If the min node in the cluster is OS 1, then no need to perform the rewrite logic.

If the min node in the cluster is ODFE, then apply the rewrite logic.

@ronniepg

@cwperks This issue will happen even when the min node in the cluster is OS 1. As reported in #2168, the cluster was originally set up on OS 1.x and was never upgraded from ODFE.

Also attaching the patch which fixed the rolling upgrade.
#2168-patch.txt

@cwperks
Member

cwperks commented Nov 18, 2022

@ronniepg Understood and that's what this PR is targeting: #2268

There is logic in 1.3, kept for backwards compatibility with ODFE, that always rewrites package names from org.opensearch to com.amazon.opendistroforelasticsearch so that when messages are picked up by ODFE nodes, they can understand them. That's a problem when going from OS 1 to OS 2, because OS 2 does not understand the com.amazon.opendistroforelasticsearch packages. The PR above aims to apply the serialization rewrite only when there are ODFE nodes in the cluster: if you have only OS 1 nodes and are going to OS 2, it should not perform the package rewrite when serializing for the transport action.

@ronniepg

Thanks for the reference to #2268. So with this fix, an OS 1.x cluster will need to be upgraded to the latest 1.x version, which contains the fix, before a rolling upgrade to 2.x.

The patch I attached earlier was for OS 2.x. ODFE nodes and OS 1 nodes will still be able to talk to each other since, as you mentioned, OS 1.3 handles both the serialization and deserialization. Once the cluster is fully upgraded to OS 1, it can then be rolling-upgraded to OS 2.

I guess the fix in the OS 1.x code would be preferred, so as not to keep any 'opendistro' references in the 2.x code base.

@cwperks added the v1.3.7 label Nov 29, 2022
@DarshitChanpura
Member

Closing this as it was addressed in #2268
