Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session reestablishment issues #383

Closed
LDTips opened this issue Feb 27, 2024 · 5 comments
Closed

Session reestablishment issues #383

LDTips opened this issue Feb 27, 2024 · 5 comments

Comments

@LDTips
Copy link

LDTips commented Feb 27, 2024

During testing of gNMIc's behavior when the subscription device (router) was turned off, it was noticed that not always a session is reestablished. The behavior regarding session reestablishment differs for different version - for some gNMIc versions, restarting the router causes the gNMIc to stop receiving data, as it does not try establishing a new session. This is an issue for the continuity of the session - router restart should not require to restart gNMIc as well

Below attached see logs showing behavior for different versions. For older 0.34.x version the behavior is correct, but not for the newest 0.36.x

For 0.34.x

$ docker run -it --rm -p 7890:7890 -v $(pwd):/rpc --net host ghcr.io/openconfig/gnmic:0.34.3 subscribe --insecure -a 192.168.122.101 -u admin -p admin --path /state/system/cpu[sample-period=300]/summary/usage --log
WARNING: Published ports are discarded when using host network mode
2024/02/22 08:51:21.389641 [gnmic] version=0.34.3, commit=d461fa9, date=2023-11-14T02:13:11Z, gitURL=https://github.com/openconfig/gnmic, docs=https://gnmic.openconfig.net
2024/02/22 08:51:21.389708 [gnmic] using config file ""
!! {"name":"default-1708591881","paths":["/state/system/cpu[sample-period=300]/summary/usage"],"mode":"stream"}
2024/02/22 08:51:21.391192 [gnmic] starting output type file
2024/02/22 08:51:21.391522 [file_output:default-stdout] initialized file output: {"FileName":"","FileType":"stdout","Format":"json","Multiline":true,"Indent":"  ","Separator":"\n","SplitEvents":false,"OverrideTimestamps":false,"AddTarget":"","TargetTemplate":"","EventProcessors":null,"MsgTemplate":"","ConcurrencyLimit":1,"EnableMetrics":false,"Debug":false,"CalculateLatency":false}
2024/02/22 08:51:21.391740 [gnmic] queuing target "192.168.122.101"
2024/02/22 08:51:21.391770 [gnmic] subscribing to target: "192.168.122.101"
2024/02/22 08:51:21.391857 [gnmic] starting target "192.168.122.101" listener
2024/02/22 08:51:21.395693 [gnmic] target "192.168.122.101" gNMI client created
2024/02/22 08:51:21.395718 [gnmic] sending gNMI SubscribeRequest: subscribe='subscribe:{subscription:{path:{elem:{name:"state"} elem:{name:"system"} elem:{name:"cpu" key:{key:"sample-period" value:"300"}} elem:{name:"summary"} elem:{name:"usage"}}}}', mode='STREAM', encoding='JSON', to 192.168.122.101
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708591879927314768,
  "time": "2024-02-22T08:51:19.927314768Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14615884
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14615884"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.43"
      }
    }
  ]
}

{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708591889918090740,
  "time": "2024-02-22T08:51:29.91809074Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14733407
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14733407"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.45"
      }
    }
  ]
}

[ At this point, the router is rebooted ]


2024/02/22 08:53:47.478755 [gnmic] target "192.168.122.101": subscription default-1708591881 rcv error: rpc error: code = Unavailable desc = error reading from server: read tcp 192.168.122.1:42704->192.168.122.101:57400: read: connection reset by peer
2024/02/22 08:53:47.478901 [gnmic] target "192.168.122.101": subscription default-1708591881 rcv error: retrying in 10s
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708592036407277039,
  "time": "2024-02-22T08:53:56.407277039Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14701125
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14701125"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "33.44"
      }
    }
  ]
}

And now for 0.36.x the session does not get established after router restarts

$ docker run -it --rm -p 7890:7890 -v $(pwd):/rpc --net host ghcr.io/openconfig/gnmic:0.36.0 subscribe --insecure -a 192.168.122.101 -u admin -p admin --path /state/system/cpu[sample-period=300]/summary/usage --log
WARNING: Published ports are discarded when using host network mode
2024/02/22 08:51:16.841513 [gnmic] version=0.36.0, commit=83d62ed, date=2024-02-13T19:29:11Z, gitURL=https://github.com/openconfig/gnmic, docs=https://gnmic.openconfig.net
2024/02/22 08:51:16.841566 [gnmic] using config file ""
2024/02/22 08:51:16.841774 [gnmic] starting output type file
2024/02/22 08:51:16.842098 [file_output:default-stdout] initialized file output: {"FileName":"","FileType":"stdout","Format":"json","Multiline":true,"Indent":"  ","Separator":"\n","SplitEvents":false,"OverrideTimestamps":false,"AddTarget":"","TargetTemplate":"","EventProcessors":null,"MsgTemplate":"","ConcurrencyLimit":1,"EnableMetrics":false,"Debug":false,"CalculateLatency":false}
2024/02/22 08:51:16.842242 [gnmic] queuing target "192.168.122.101"
2024/02/22 08:51:16.842264 [gnmic] subscribing to target: "192.168.122.101"
2024/02/22 08:51:16.842293 [gnmic] starting target "192.168.122.101" listener
2024/02/22 08:51:16.846160 [gnmic] target "192.168.122.101" gNMI client created
2024/02/22 08:51:16.846192 [gnmic] sending gNMI SubscribeRequest: subscribe='subscribe:{subscription:{path:{elem:{name:"state"} elem:{name:"system"} elem:{name:"cpu" key:{key:"sample-period" value:"300"}} elem:{name:"summary"} elem:{name:"usage"}}}}', mode='STREAM', encoding='JSON', to 192.168.122.101
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591875378015875,
  "time": "2024-02-22T08:51:15.378015875Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14517980
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14517980"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.41"
      }
    }
  ]
}

{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591885368041968,
  "time": "2024-02-22T08:51:25.368041968Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14719334
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14719334"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.44"
      }
    }
  ]
}
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591895368374507,
  "time": "2024-02-22T08:51:35.368374507Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14742859
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14742859"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.45"
      }
    }
  ]
}

[ At this point, the router is rebooted ]
[ And nothing more happens since then ]
@karimra
Copy link
Collaborator

karimra commented Feb 28, 2024

Thanks for reporting this, I was able to reproduce and it's essentially due to this change (grpc/grpc-go#6834) in grpc-go v1.60.0. (gNMIc v0.34.3 uses grpc-go v1.59.0)

In short, the change in grpc-go v1.60.0 makes the gRPC client use the OS TCP keepalive defaults instead of the Go stdlib ones. Most linux dist use:

tcp_keepalive_time = 7200s
tcp_keepalive_intvl = 75s
tcp_keepalive_probes =  9

While Go stdlib uses 15s, 15s and 10. This results in a broken TCP connection being reset after about 15 + 10 * 15 = 165 seconds

I tested setting the OS TCP keep alive using with the below commands, gNMIc reconnects successfully once the TCP connection is reset and the router is back up.

sysctl -w \
  net.ipv4.tcp_keepalive_time=15 \
  net.ipv4.tcp_keepalive_intvl=15 \
  net.ipv4.tcp_keepalive_probes=10

I will have to think a bit about the best way to fix this, ideally I can enable gRPC keepalives by default providing most gNMI servers out there support it. Or use a custom TCP Dialer to avoid grpc-go creating one with the OS defaults. I will keep you posted.

@LDTips
Copy link
Author

LDTips commented Feb 28, 2024

Alright thanks for the confirmation. Do you know how could I fix this issue with these sysctl rules for the containerised gNMIc version? Or is just the better solution for the time being to use 0.34 version instead?

@karimra
Copy link
Collaborator

karimra commented Feb 28, 2024

Depends how you are running the container, docker run has a --sysctl flag, so you can do something like this:

docker run \
--sysctl net.ipv4.tcp_keepalive_time=15 \
--sysctl net.ipv4.tcp_keepalive_intvl=15 \
--sysctl net.ipv4.tcp_keepalive_probes=10 \
-it --rm -p 7890:7890 -v XXXXXX

That typically is not allowed with --net host, so you might want to run it on its own netns. Or modify the host values if that doesn't impact anything else.
Docker compose has similar options for sysctl.

@karimra
Copy link
Collaborator

karimra commented Mar 4, 2024

v0.36.1 has a default TCP keepalive of 15s, please check it out.

@LDTips
Copy link
Author

LDTips commented Mar 9, 2024

Quick test shows that the fix works. Thank you for the fix!
I will let you know if this issue arises again

@karimra karimra closed this as completed Mar 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants