One JetStream cluster node restart, KeyValue client may get expired value #4162

Closed
yixinin opened this issue May 15, 2023 · 7 comments · Fixed by #4171
Comments

@yixinin

yixinin commented May 15, 2023

After one JetStream cluster node restarts, the KeyValue client may get a stale (outdated) value.

server info:

 Starting nats-server
   Version:  2.9.16
   Git:      [f84ca24]
   Go build: go1.19.8
   Name:     NCK6Q7PPK4US5CDFZXMWSP57HLELDR77LGY5SPNBE4GCTHMNYZQQGFLR
   ID:       NCK6Q7PPK4US5CDFZXMWSP57HLELDR77LGY5SPNBE4GCTHMNYZQQGFLR
 Created system account: "$SYS"
 Listening for client connections on 0.0.0.0:4222
 Get non local IPs for "0.0.0.0"
   ip=172.16.67.134
 Server is ready
 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
Starting nats-server
  Version:  2.9.16
  Git:      [f84ca24]
  Cluster:  test-cluster
  Name:     s1
  Node:     3ahZoO2Q
  ID:       NDH3IQQ74UTRYE3H7YY4HU5MBTHGDODQOPSRGIJIZ377EAOR7YF4JN5E
Plaintext passwords detected, use nkeys or bcrypt
Using configuration file: .\s1\config.conf
Starting http monitor on 0.0.0.0:8001
Starting JetStream
    _ ___ _____ ___ _____ ___ ___   _   __  __
 _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
| || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
 \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|

         https://docs.nats.io/jetstream

---------------- JETSTREAM ----------------
  Max Memory:      6.00 GB
  Max Storage:     1.00 TB
  • s1
server_name=s1
listen=4001
http_port=8001
accounts {
    $SYS {
        users = [
          {
            user: "admin",
            pass: "$2asdasfasfas"
          }
        ]
    }
}

jetstream {
    store_dir=C:/Users/xx/nats/s1/data
}

cluster {
    name: test-cluster
    listen: 0.0.0.0:6001
    routes:[
        nats-route://localhost:6002
        nats-route://localhost:6003
    ]
}
  • s2
server_name=s2
listen=4002
http_port=8002
accounts {
    $SYS {
        users = [
          {
            user: "admin",
            pass: "$2asdasfasfas"
          }
        ]
    }
}

jetstream {
    store_dir=C:/Users/xx/nats/s2/data
}

cluster {
    name: test-cluster
    listen: 0.0.0.0:6002
    routes:[
        nats-route://localhost:6001
        nats-route://localhost:6003
    ]
}
  • s3
server_name=s3
listen=4003
http_port=8003
accounts {
    $SYS {
        users = [
          {
            user: "admin",
            pass: "$2asdasfasfas"
          }
        ]
    }
}

jetstream {
    store_dir=C:/Users/xx/nats/s3/data
}

cluster {
    name: test-cluster
    listen: 0.0.0.0:6003
    routes:[
        nats-route://localhost:6002
        nats-route://localhost:6001
    ]
}

Versions of nats-server and affected client libraries used:

nats-server -version: 2.9.16
github.com/nats-io/nats.go v1.25.0

OS/Container environment:

Windows 11 22H2 22621.1702

Steps or code to reproduce the issue (a consolidated, runnable sketch of these steps follows the list):

  • start s1,s2,s3
  • connect and obtain a JetStream context
    var servers = []string{"nats://localhost:4001", "nats://localhost:4002", "nats://localhost:4003"}
    nc, err := nats.Connect(strings.Join(servers, ","))
    js, err := nc.JetStream()
  • create KeyValue
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
		Bucket:   "test",
		TTL:      time.Second * 60,
		Replicas: 3,
		Storage:  nats.FileStorage,
		History:  1,
	})
  • put key1 = val1
     ver, err := kv.Put("key1", []byte("val1"))
     if err != nil {
          return err
     }
     fmt.Println("put[1] ver =", ver)
  • start a goroutine
	var ctx, cancel = context.WithCancel(context.Background())
	defer cancel()
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			default:
				item, err := kv.Get("key1")
				if err != nil {
					fmt.Println(err)
					return
				}
				ver := item.Revision()
				fmt.Println("get key1 = ", string(item.Value()), "ver =", ver)
			}
		}
	}()
  • shutdown server s1
  • put key1 = val2
     ver, err := kv.Put("key1", []byte("val2"))
     if err != nil {
          return err
     }
     fmt.Println("put[2] ver =", ver)
  • start server s1
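Putting the snippets above together, a runnable sketch of the reproduction (the manual shutdown and restart of s1 are replaced here by a wait for Enter and a final sleep, which are not part of the original snippets; URLs, bucket name, TTL and replica count are taken from this report):

	// repro.go - consolidated from the steps above.
	package main

	import (
		"context"
		"fmt"
		"strings"
		"time"

		"github.com/nats-io/nats.go"
	)

	func main() {
		servers := []string{"nats://localhost:4001", "nats://localhost:4002", "nats://localhost:4003"}
		nc, err := nats.Connect(strings.Join(servers, ","))
		if err != nil {
			panic(err)
		}
		defer nc.Close()

		js, err := nc.JetStream()
		if err != nil {
			panic(err)
		}

		kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
			Bucket:   "test",
			TTL:      time.Second * 60,
			Replicas: 3,
			Storage:  nats.FileStorage,
			History:  1,
		})
		if err != nil {
			panic(err)
		}

		ver, err := kv.Put("key1", []byte("val1"))
		if err != nil {
			panic(err)
		}
		fmt.Println("put[1] ver =", ver)

		// Poll key1 in the background while the servers are restarted by hand.
		ctx, cancel := context.WithCancel(context.Background())
		defer cancel()
		go func() {
			for {
				select {
				case <-ctx.Done():
					return
				default:
					item, err := kv.Get("key1")
					if err != nil {
						fmt.Println(err)
						return
					}
					fmt.Println("get key1 =", string(item.Value()), "ver =", item.Revision())
				}
			}
		}()

		// Shut down s1 by hand now, then press Enter to write the second value.
		fmt.Scanln()

		ver, err = kv.Put("key1", []byte("val2"))
		if err != nil {
			panic(err)
		}
		fmt.Println("put[2] ver =", ver)

		// Restart s1 by hand and watch the output for stale "val1" reads.
		time.Sleep(30 * time.Second)
	}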

Expected result:

put[1] ver = 1
get key1 = val1 ver 1
get key1 = val1 ver 1
... more [get key1 = val1 ver 1]
put[2] ver = 2
get key1 = val2 ver 2
get key1 = val2 ver 2
get key1 = val2 ver 2
... more [get key1 = val2 ver 2]

Actual result:

put[1] ver = 1
get key1 = val1 ver 1
get key1 = val1 ver 1
... more [get key1 = val1 ver 1]
put[2] ver = 2
get key1 = val2 ver 2
get key1 = val2 ver 2
get key1 = val1 ver 1 // ----- unexpected value
get key1 = val1 ver 1 // ----- unexpected value
get key1 = val2 ver 2
get key1 = val2 ver 2
get key1 = val1 ver 1 // ----- unexpected value
get key1 = val2 ver 2
get key1 = val2 ver 2
get key1 = val2 ver 2
... more [get key1 = val2 ver 2]

@derekcollison
Member

Can you confirm at the beginning S1 is the leader?

This may be a Go client issue re-using the same delivery subject under the covers when it creates a new consumer for the KV watcher.

/cc @piotrpio and @wallyqs

@yixinin
Author

yixinin commented May 16, 2023

@derekcollison It behaves the same whether the restarted node is the leader or a follower.
I'm not sure whether it's a client or a server issue; removing all data files before restarting nats-server (s1) works around it, in some way ...

@derekcollison
Member

I am inclined to think that if the KV watcher used a different inbox it would work correctly, waiting on @piotrpio to chime in.

@piotrpio
Contributor

@derekcollison I do not think this is related to the KV watcher - in the given example, watchers are not used (not even internally). The only thing kv.Get("key1") does is send a direct get request to the server.

I reproduced the issue and it seems it does not matter whether we kill the leader or a follower - for a short period after the server restart (I guess until consensus is reached), the values returned by direct get sometimes contain an old revision (from before the server was killed). The issue disappears once I turn off AllowDirect and start using STREAM.MSG.GET.%s instead.

To me this looks like an issue with direct get - I'm investigating and will let you know what I find out.
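A minimal sketch of that AllowDirect workaround against nats.go, assuming the bucket from this report ("test") and the default KV_<bucket> naming of the backing stream:

	// Sketch only: disable direct gets on the bucket's backing stream so that
	// kv.Get goes through the regular $JS.API.STREAM.MSG.GET.<stream> path
	// (served by the stream leader) instead of being answered by any replica.
	package main

	import "github.com/nats-io/nats.go"

	func disableDirectGets(js nats.JetStreamContext) (nats.KeyValue, error) {
		info, err := js.StreamInfo("KV_test") // KV_<bucket> naming assumed
		if err != nil {
			return nil, err
		}
		cfg := info.Config
		cfg.AllowDirect = false
		if _, err := js.UpdateStream(&cfg); err != nil {
			return nil, err
		}
		// Re-bind the bucket; an existing handle may still have the
		// direct-get setting cached from when it was created.
		return js.KeyValue("test")
	}

Whether that trade-off is acceptable depends on the deployment; it only sidesteps the stale reads described above by avoiding the direct get path entirely.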

@derekcollison
Member

Thanks for taking a look. I'll investigate and see if we can fix this up in the 2.9.17 release.

@derekcollison
Member

And apologies @piotrpio, you are right, these are direct gets. Will fix.

@derekcollison
Member

I have a test that now shows this, will fix.
