Investigate occasional activate-job errors #18

Closed
harryttd opened this issue Sep 8, 2020 · 2 comments · Fixed by #91

harryttd (Collaborator) commented Sep 8, 2020

I'm noticing that the activate-job occasionally errors out. Usually it restarts and the next try works; other times all restarts fail.

Examples:

Running a cmd similar to k -n tqtezos1 logs activate-job-2vmh8 -c activate to retrieve the logs.

<<<<4: 500 Internal Server Error
  [ { "kind": "temporary", "id": "failure",
      "msg":
        "(Invalid_argument \"Json_encoding.construct: consequence of bad union\")" } ]
Error:
  (Invalid_argument "Json_encoding.construct: consequence of bad union")
<<<<2: 500 Internal Server Error
  [ { "kind": "permanent", "id": "proto.006-PsCARTHA.context.storage_error",
      "missing_key": [ "rolls", "owner", "current" ], "function": "copy" } ]
Error:
  Storage error:
    Cannot copy undefined key 'rolls/owner/current'.

Seb sent me some code (from the Tezos source code, I believe):

let () =
  register_error_kind
    `Permanent
    ~id:"context.storage_error"
    ~title: "Storage error (fatal internal error)"
    ~description:
      "An error that should never happen unless something \
       has been deleted or corrupted in the database."

Sometimes I see this error:

<<<<4: 500 Internal Server Error
  [ { "kind": "temporary", "id": "failure", "msg": "Fitness too low" } ]
Error:
  Fitness too low

Seb says he's seen that one very often when trying to activate a protocol on a chain where a protocol is already activated. It could be that minikube has not removed all the necessary resources/storage after I deleted the namespace and re-applied the yml. This could also be related to the second error, where data appears to have been deleted and/or corrupted.
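
For what it's worth, this is roughly the check I've started running before re-applying, to see whether anything from a previous run survived (kubectl spelled out instead of the k alias; the activate-job pod name suffix will differ per run):

# PVs are cluster-scoped, so they can outlive the tqtezos1 namespace.
kubectl get pv

# Claims inside the namespace; any leftover claim means old chain storage may get reused.
kubectl get pvc -n tqtezos1

# Check the activate job's logs again after re-applying.
kubectl -n tqtezos1 logs activate-job-2vmh8 -c activate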

harryttd (Collaborator, Author) commented Sep 9, 2020

I noticed that even when deleting the tqtezos namespace, the persistent volumes are sometimes not removed. Usually they are deleted along with the rest of the namespace. When the PVs were not removed and I applied the yaml again, I got the "Fitness too low" error, which makes sense since the reused volume already has an activated protocol stored on it.

After manually deleting the PVs and applying the yaml again, activate-job worked. Then I deleted the namespace, confirmed the PVs were removed, and reapplied the yaml; now I'm getting the rest of the activate-job errors. Deleting the namespace and reapplying once more works again.
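
A sketch of the cleanup sequence that gets activate-job working again (the manifest path and PV name below are placeholders, not the actual file names in this repo):

# Tear down the namespace; this removes the namespaced resources.
kubectl delete namespace tqtezos1

# PVs are not namespaced, so list them and delete any leftovers by hand.
kubectl get pv
kubectl delete pv <leftover-pv-name>

# Re-apply the manifests and watch the activate-job pod come up.
kubectl apply -f <manifests>.yaml
kubectl -n tqtezos1 get pods -w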

EDIT:
Noticed that if I leave the cluster running overnight, close my MacBook, and delete the namespace the next day, the PVs persist.
SSH'ing into minikube shows the volumes still exist.
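
For reference, this is how I'm checking inside the VM; the hostpath directory below is an assumption about minikube's default storage provisioner and may differ by minikube version:

# Open a shell in the minikube VM.
minikube ssh

# The default hostpath provisioner keeps PV data here (assumed path).
ls /tmp/hostpath-provisioner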

@agtilden agtilden added this to the 1.0.0.0 milestone Sep 23, 2020
@brandisimus brandisimus moved this from In progress to Backlog in private-chain-infrastructure Jan 15, 2021
brandisimus commented

Aryeh, please add documentation for this in Development.md.
