Investigate occasional activate-job errors #18

Closed
harryttd opened this issue Sep 8, 2020 · 2 comments · Fixed by #91

harryttd (Collaborator) commented Sep 8, 2020

I'm noticing that the activate-job occasionally errors out. Usually it restarts and the next try works; other times all restarts fail.

Examples:

Running a cmd similar to k -n tqtezos1 logs activate-job-2vmh8 -c activate to retrieve the logs.

<<<<4: 500 Internal Server Error
  [ { "kind": "temporary", "id": "failure",
      "msg":
        "(Invalid_argument \"Json_encoding.construct: consequence of bad union\")" } ]
Error:
  (Invalid_argument "Json_encoding.construct: consequence of bad union")
<<<<2: 500 Internal Server Error
  [ { "kind": "permanent", "id": "proto.006-PsCARTHA.context.storage_error",
      "missing_key": [ "rolls", "owner", "current" ], "function": "copy" } ]
Error:
  Storage error:
    Cannot copy undefined key 'rolls/owner/current'.

Seb sent me some code (from the Tezos source code, I believe):

let () =
  register_error_kind
    `Permanent
    ~id:"context.storage_error"
    ~title: "Storage error (fatal internal error)"
    ~description:
      "An error that should never happen unless something \
       has been deleted or corrupted in the database."

Sometimes I see this error:

<<<<4: 500 Internal Server Error
  [ { "kind": "temporary", "id": "failure", "msg": "Fitness too low" } ]
Error:
  Fitness too low

Seb says he's seen that one very often when trying to activate a protocol on a chain where a protocol is already activated. It could be that minikube has not removed all the necessary resources/storage after I deleted the namespace and re-applied the yml. This could also be related to the second error, where data appears to have been deleted and/or corrupted.
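
For what it's worth, this is roughly the check I've started running before re-applying, to see whether anything from a previous run survived (kubectl spelled out instead of the k alias; the activate-job pod name suffix will differ per run):

# PVs are cluster-scoped, so they can outlive the tqtezos1 namespace.
kubectl get pv

# Claims inside the namespace; any leftover claim means old chain storage may get reused.
kubectl get pvc -n tqtezos1

# Check the activate job's logs again after re-applying.
kubectl -n tqtezos1 logs activate-job-2vmh8 -c activate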

harryttd (Collaborator, Author) commented Sep 9, 2020

I noticed that even when deleting the tqtezos namespace, the persistent volumes are sometimes not removed. Usually they are deleted along with the rest of the namespace. When the PVs were not removed and I applied the yaml again, I got the "Fitness too low" error, which makes sense since the reused volume already has an activated protocol stored on it.

After manually deleting the PVs and applying the yaml again, activate-job worked. Then I deleted the namespace, confirmed the PVs were removed, and reapplied the yaml; now I'm getting the rest of the activate-job errors. Deleting the namespace and reapplying once more works again.
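
A sketch of the cleanup sequence that gets activate-job working again (the manifest path and PV name below are placeholders, not the actual file names in this repo):

# Tear down the namespace; this removes the namespaced resources.
kubectl delete namespace tqtezos1

# PVs are not namespaced, so list them and delete any leftovers by hand.
kubectl get pv
kubectl delete pv <leftover-pv-name>

# Re-apply the manifests and watch the activate-job pod come up.
kubectl apply -f <manifests>.yaml
kubectl -n tqtezos1 get pods -w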

EDIT:
Noticed that if I leave the cluster running overnight, close my MacBook, and delete the namespace the next day, the PVs persist.
SSH'ing into minikube shows the volumes still exist.
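
For reference, this is how I'm checking inside the VM; the hostpath directory below is an assumption about minikube's default storage provisioner and may differ by minikube version:

# Open a shell in the minikube VM.
minikube ssh

# The default hostpath provisioner keeps PV data here (assumed path).
ls /tmp/hostpath-provisioner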

@agtilden agtilden added this to the 1.0.0.0 milestone Sep 23, 2020
@brandisimus brandisimus moved this from In progress to Backlog in private-chain-infrastructure Jan 15, 2021
brandisimus commented

Aryeh, please add documentation for this in Development.md.
