BUG: akash provider lease-shell stops working when pod gets restarted due to eviction #42

Closed
arno01 opened this issue Apr 30, 2022 · 12 comments
Labels: repo/provider (Akash provider-services repo issues), sev2

Comments

arno01 commented Apr 30, 2022

Reproducer

  1. Deploy something with a 1 or 10 GB storage request (== limit).
  2. Consume more than the limit from step 1:

root@ssh-6fd4f4bdf9-7b2p2:/# dd if=/dev/zero of=/test-count-2048 bs=10M count=2048
2048+0 records in
2048+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 13.8965 s, 1.5 GB/s
root@ssh-6fd4f4bdf9-7b2p2:/# Error: lease shell failed: remote process exited with code 137

  3. This will cause the pod to restart due to:

  "reason": "Evicted",
  "note": "Container ssh exceeded its local ephemeral storage limit \"10737418240\". ",

See the entire akash provider lease-events log below; a minimal sketch of the eviction mechanism follows these steps.

  4. akash provider lease-shell stops working:

$ akash provider lease-shell --tty --stdin -- ssh bash
Error: lease shell failed: remote command execute error: service with that name is not running: the service has failed
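For reference, here is a minimal client-go sketch (Go) of the limit the kubelet enforces in this scenario. This is an assumption about how the requested storage size ends up on the container, not the provider's actual code; the point is that writing past the ephemeral-storage limit is what produces the Evicted event quoted above.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Hypothetical container spec: the deployment's storage size becomes an
    // ephemeral-storage limit. The kubelet evicts the pod once its local
    // writable-layer usage exceeds this value.
    c := corev1.Container{
        Name:  "ssh",
        Image: "ubuntu:21.10",
        Resources: corev1.ResourceRequirements{
            Limits: corev1.ResourceList{
                corev1.ResourceEphemeralStorage: resource.MustParse("10Gi"),
            },
        },
    }
    q := c.Resources.Limits[corev1.ResourceEphemeralStorage]
    fmt.Println(q.Value()) // 10737418240, the exact figure in the Evicted note
}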

version 0.16.4-rc0

Both the akash provider and the client are version 0.16.4-rc0:

$ curl -sk "https://provider.europlots.com:8443/version" | jq -r
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-status after the eviction

$ akash provider lease-status
{
  "services": {
    "ssh": {
      "name": "ssh",
      "available": 1,
      "total": 1,
      "uris": [
        "31ai266lqddovfbslrlj1vtcfk.ingress.europlots.com"
      ],
      "observed_generation": 1,
      "replicas": 1,
      "updated_replicas": 1,
      "ready_replicas": 1,
      "available_replicas": 1
    }
  },
  "forwarded_ports": {
    "ssh": [
      {
        "host": "ingress.europlots.com",
        "port": 22,
        "externalPort": 32459,
        "proto": "TCP",
        "available": 1,
        "name": "ssh"
      }
    ]
  }
}

lease-events logs

$ akash provider lease-events 
{
  "type": "Normal",
  "reason": "Sync",
  "note": "Scheduled for sync",
  "object": {
    "kind": "Ingress",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "31ai266lqddovfbslrlj1vtcfk.ingress.europlots.com"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Scheduled",
  "note": "Successfully assigned vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg/ssh-6fd4f4bdf9-7b2p2 to node2",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Pulled",
  "note": "Container image \"ubuntu:21.10\" already present on machine",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Created",
  "note": "Created container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Started",
  "note": "Started container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Warning",
  "reason": "Evicted",
  "note": "Container ssh exceeded its local ephemeral storage limit \"10737418240\". ",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Killing",
  "note": "Stopping container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-7b2p2"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Scheduled",
  "note": "Successfully assigned vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg/ssh-6fd4f4bdf9-fwn5g to node2",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Pulled",
  "note": "Container image \"ubuntu:21.10\" already present on machine",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Created",
  "note": "Created container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "Started",
  "note": "Started container ssh",
  "object": {
    "kind": "Pod",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9-fwn5g"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "SuccessfulCreate",
  "note": "Created pod: ssh-6fd4f4bdf9-7b2p2",
  "object": {
    "kind": "ReplicaSet",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "SuccessfulCreate",
  "note": "Created pod: ssh-6fd4f4bdf9-fwn5g",
  "object": {
    "kind": "ReplicaSet",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh-6fd4f4bdf9"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
{
  "type": "Normal",
  "reason": "ScalingReplicaSet",
  "note": "Scaled up replica set ssh-6fd4f4bdf9 to 1",
  "object": {
    "kind": "Deployment",
    "namespace": "vpjq3g0uoce5ffa9j85h74t9skosfj92dp4ce7eamhsdg",
    "name": "ssh"
  },
  "lease_id": {
    "owner": "akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h",
    "dseq": 5673203,
    "gseq": 1,
    "oseq": 1,
    "provider": "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"
  }
}
arno01 added the P1 label Apr 30, 2022
@hydrogen18

I thought I had looked at something similar in the past; let me see if I can find it.

@hydrogen18

Looks similar to issue akash-network/node#1480. Maybe this got reintroduced somehow?

boz commented May 3, 2022

maybe the fix wasn't applied to the master branch previously

@hydrogen18

Reproduced on the master branch; I'll have to track this down to see what is going on.

boz commented May 3, 2022

I mean: was it fixed on mainnet/main when mainnet/main was v0.14.x, then lost when master was merged into mainnet/main for 0.16.x?

@hydrogen18

It's hitting this line; I think we need to filter out the pods that have failed (and the like) before trying to run the command:

https://github.com/ovrclk/akash/blob/master/provider/cluster/kube/client_exec.go#L100
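A minimal sketch of the kind of filter being suggested (a hypothetical helper, not the actual patch): an evicted pod lingers in a terminal phase, so it has to be skipped before an exec target is chosen.

package kube

import (
    corev1 "k8s.io/api/core/v1"
)

// runningPods keeps only pods that are actually running and not being
// deleted, so lease-shell never targets the evicted (terminal-phase) pod
// that the ReplicaSet has already replaced.
func runningPods(pods []corev1.Pod) []corev1.Pod {
    out := make([]corev1.Pod, 0, len(pods))
    for _, p := range pods {
        if p.Status.Phase == corev1.PodRunning && p.DeletionTimestamp == nil {
            out = append(out, p)
        }
    }
    return out
}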

@hydrogen18

@arno01 oh yeah, now that I look at this, are you sure the pod restarts? When I do this locally while watching the Kubernetes cluster, the pod moves to "Completed". After a while the provider closes the lease because the containers aren't running.

The Kubernetes pod has a restart policy of Always, but apparently that doesn't mean anything of the sort:

$ kubectl get pod --namespace=cul2933lrothig1100l4s5ra710m53f6sol2mncvhht3m web-77db64bfd-cn8jk  -o=jsonpath='{.spec.restartPolicy}' && echo
Always

I tried changing it to "OnFailure" (since "Never" seems like a poor choice), but that gives me this error:

E[2022-05-04|13:50:34.410] applying deployment                          module=provider-cluster-kube err="Deployment.apps \"bew\" is invalid: spec.template.spec.restartPolicy: Unsupported value: \"OnFailure\": supported values: \"Always\"" lease=akash178ctpsxaa4fcyq0fwtds4qx2ha0maluwll87wx/12/1/1/akash1xglzcfu4g9her6xhz95fk78h9555qaxz70cf4s service=bew
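Since a Deployment only supports restartPolicy: Always, and an evicted pod is replaced rather than restarted in place, another option is to let the API server filter for live pods. This is a sketch assuming client-go; the label key is hypothetical, not necessarily what the provider actually sets on service pods.

package kube

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listRunning returns only pods in phase Running; evicted or completed
// pods are dropped server-side, since status.phase is a supported field
// selector for pods.
func listRunning(ctx context.Context, kc kubernetes.Interface, ns, svc string) (*corev1.PodList, error) {
    return kc.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
        LabelSelector: "akash.network/manifest-service=" + svc, // hypothetical label key
        FieldSelector: "status.phase=Running",
    })
}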

@hydrogen18

@sacreman any suggestions here?

arno01 commented May 10, 2022

@hydrogen18 I've tested this again just now:

TL;DR: it looks like this issue is isolated to a single provider, Europlots.
I would consider closing this issue, but since you've also reproduced it, maybe you want to check a few more things?

  • [Lumen] Can confirm: the pod moves to "Completed" and a new one gets created;
  • [Lumen] Cannot reproduce the issue -> I can lease-shell into the new pod without issues;
  • [Akash.Pro] Cannot reproduce the issue on my provider;
  • [Europlots] Can reproduce the issue:
$ akash provider lease-shell --tty --stdin -- ssh bash
Error: lease shell failed: remote command execute error: service with that name is not running: the service has failed
  • The :8443/version reports are the same (1:1);
  • It looks like the lease-events reports aren't sorted by time; see the output below, where the "Scheduled" event comes before the "Killing" event (for Lumen).

Evidence (Lumen)

$ curl -s -k https://provider.mainnet-1.ca.aksh.pw:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.1.txt

$ akash provider lease-events > lease-events.1
$ cat lease-events.1 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Ingress"     "k9dq760v49f5t6l6v2hbqts7ac.ingress.mainnet-1.ca.aksh.pw"  "Normal"   "Sync"               "Scheduled for sync"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Scheduled"          "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-54fvx to k8s-node-9.mainnet-1.ca"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Pulling"            "Pulling image ""ubuntu:21.10"""
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Pulled"             "Successfully pulled image ""ubuntu:21.10"" in 3.38558492s"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Created"            "Created container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-54fvx"                                     "Normal"   "Started"            "Started container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Scheduled"          "Successfully assigned ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8/ssh-7c9bb88b9f-rzxbx to k8s-node-5.mainnet-1.ca"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Pulling"            "Pulling image ""ubuntu:21.10"""
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Pulled"             "Successfully pulled image ""ubuntu:21.10"" in 3.385080374s"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Created"            "Created container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Started"            "Started container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Pod"         "ssh-7c9bb88b9f-rzxbx"                                     "Normal"   "Killing"            "Stopping container ssh"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "ReplicaSet"  "ssh-7c9bb88b9f"                                           "Normal"   "SuccessfulCreate"   "Created pod: ssh-7c9bb88b9f-rzxbx"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "ReplicaSet"  "ssh-7c9bb88b9f"                                           "Normal"   "SuccessfulCreate"   "Created pod: ssh-7c9bb88b9f-54fvx"
5823330  1  1  "akash1q7spv2cw06yszgfp4f9ed59lkka6ytn8g4tkjf"  "Deployment"  "ssh"                                                      "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-7c9bb88b9f to 1"
$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              1/1     Running     0               2m1s    10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-54fvx                                              0/1     ContainerCreating   0               9s      <none>            k8s-node-9.mainnet-1.ca        <none>           <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              0/1     Completed           0               2m11s   10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

$ kubectl get pods -A -o wide | grep ssh
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-54fvx                                              1/1     Running     0               47s     10.233.99.87      k8s-node-9.mainnet-1.ca        <none>           <none>
ujrprcbfd0sjljt11f1rbignp2b65knk76qjphearskt8   ssh-7c9bb88b9f-rzxbx                                              0/1     Completed   0               2m49s   10.233.109.137    k8s-node-5.mainnet-1.ca        <none>           <none>

Evidence (Europlots)

$ curl -s -k https://provider.europlots.com:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.5",
    "gitCommit": "c285e781331a3785a7f436042c65c5641ce8a9e9",
    "gitTreeState": "clean",
    "buildDate": "2022-03-16T15:52:18Z",
    "goVersion": "go1.17.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.2.txt

$ akash provider lease-events > lease-events.2
$ cat lease-events.2 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Ingress"     "gi52llqkrh8u98i6m3j0udd95c.ingress.europlots.com"  "Normal"   "Sync"               "Scheduled for sync"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Scheduled"          "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-gt6t2 to node3"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Created"            "Created container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Started"            "Started container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-gt6t2"                               "Normal"   "Killing"            "Stopping container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Scheduled"          "Successfully assigned e8eivkd2u9j2vcvp7jjjsgi3uc65on2sqro3td0bjpfro/ssh-6ff4cf85f-zh4bk to node3"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Created"            "Created container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Pod"         "ssh-6ff4cf85f-zh4bk"                               "Normal"   "Started"            "Started container ssh"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "ReplicaSet"  "ssh-6ff4cf85f"                                     "Normal"   "SuccessfulCreate"   "Created pod: ssh-6ff4cf85f-gt6t2"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "ReplicaSet"  "ssh-6ff4cf85f"                                     "Normal"   "SuccessfulCreate"   "Created pod: ssh-6ff4cf85f-zh4bk"
5823531  1  1  "akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc"  "Deployment"  "ssh"                                               "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-6ff4cf85f to 1"

I've asked the provider for the kubectl get pods -A -o wide output, but he is away.
Shortly before I asked him, he said that he has some deployment that is still Terminating.
He was testing storage speed with a Chia deployment and closed the lease, but it is still running:

# kubectl get pods --all-namespaces
NAMESPACE                                       NAME                                       READY   STATUS        RESTARTS       AGE
...
...
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha   chia-65b7fc4d96-62px7                      1/1     Terminating   0              163m
rgvf9cu3vacspjp0o9hdn8q4hc1pno1k7g2u8mjgrh3ha   chia-65b7fc4d96-v85qr                      1/1     Running       0              39m

Given that the namespace is the same, there must be some issue on his side.

Evidence (Akash.Pro)

This is my provider

$ curl -s -k https://provider.akash.pro:8443/version | jq 
{
  "akash": {
    "version": "v0.16.4-rc0",
    "commit": "38b82258c14e3d0a2ed3d15a8d4140ec8c826a84",
    "buildTags": "\"osusergo,netgo,ledger,static_build\"",
    "go": "go version go1.17.6 linux/amd64",
    "cosmosSdkVersion": "v0.45.1"
  },
  "kube": {
    "major": "1",
    "minor": "23",
    "gitVersion": "v1.23.6",
    "gitCommit": "ad3338546da947756e8a88aa6822e9c11e7eac22",
    "gitTreeState": "clean",
    "buildDate": "2022-04-14T08:43:11Z",
    "goVersion": "go1.17.9",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

lease-events.3.txt

$ cat lease-events.3 | jq -r '[(.lease_id | .dseq, .gseq, .oseq, .provider), (.object | .kind, .name), .type, .reason, .note] | @csv' | column -t -s","
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Ingress"     "7efc7i47i9euj7laotatgtpt7c.ingress.akash.pro"  "Normal"   "Sync"               "Scheduled for sync"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Scheduled"          "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-2hqht to node1"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Created"            "Created container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-2hqht"                          "Normal"   "Started"            "Started container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Scheduled"          "Successfully assigned ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc/ssh-79cc8d4674-zszts to node1"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Pulled"             "Container image ""ubuntu:21.10"" already present on machine"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Created"            "Created container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Started"            "Started container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Warning"  "Evicted"            "Container ssh exceeded its local ephemeral storage limit ""1073741824"". "
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Pod"         "ssh-79cc8d4674-zszts"                          "Normal"   "Killing"            "Stopping container ssh"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "ReplicaSet"  "ssh-79cc8d4674"                                "Normal"   "SuccessfulCreate"   "Created pod: ssh-79cc8d4674-zszts"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "ReplicaSet"  "ssh-79cc8d4674"                                "Normal"   "SuccessfulCreate"   "Created pod: ssh-79cc8d4674-2hqht"
5823715  1  1  "akash1nxq8gmsw2vlz3m68qvyvcf3kh6q269ajvqw6y0"  "Deployment"  "ssh"                                           "Normal"   "ScalingReplicaSet"  "Scaled up replica set ssh-79cc8d4674 to 1"
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       1/1     Running   0          27s    10.233.90.30   node1   <none>           <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-2hqht                       0/1     ContainerCreating   0          0s     <none>         node1   <none>           <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       0/1     Completed           0          33s    10.233.90.30   node1   <none>           <none>
root@node1:~# kubectl get pods -A -o wide | grep ssh
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-2hqht                       1/1     Running     0          2s     10.233.90.31   node1   <none>           <none>
ddqp0svbeqjnkiicq5d53c3dfduo83cm03b14btomvgsc   ssh-79cc8d4674-zszts                       0/1     Completed   0          35s    10.233.90.30   node1   <none>           <none>

@tidrolpolelsef

I'm confused that we can't seem to reproduce this uniformly across all providers at this point. Do we know if there are any differences in configuration between those providers?

@chandadharap

There is a workaround per @boz, so I'm making this sev2.

troian transferred this issue from akash-network/node Feb 17, 2023
troian added the repo/provider (Akash provider-services repo issues) label Feb 17, 2023
@andy108369

There have been new findings in https://github.com/ovrclk/engineering/issues/538.
Closing in favor of that one.
