create_service_account #143
Comments
Thanks for logging this issue. Can you please confirm you have: bastion_enabled = true ? |
All these are set in my tfvars. admin_instance_principal = true |
It looks like this line is never run. However, if you comment out this line and ssh to the server, running the script manually on the server does result in the kubeconfig being generated. So I think the problem is generate_kubeconfig.sh failing, and then the service account creation bombs out as a result. |
The generate_kubeconfig.sh is supposed to run automatically once:
We have put depends_on in a few places to make the ordering of these actions deterministic. Looks like we may have missed some. Or this was possibly introduced when we shifted the instance_principal to the admin from the bastion. We'll hunt and fix it. |
Could be related to #140 |
I think you are right here. This morning I made the following change to enable us to log the output of the scripts.
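For example, a minimal way to capture that output is to redirect the script's stdout and stderr into a log file on the admin host; a sketch only, with the file name matching the kubeconfig.log referenced later in this thread:

```bash
#!/bin/bash
# Illustrative only: run generate_kubeconfig.sh and capture its stdout and
# stderr into /home/opc/kubeconfig.log so failures can be inspected later.
/home/opc/generate_kubeconfig.sh > /home/opc/kubeconfig.log 2>&1
```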
/home/opc/generate_kubeconfig.sh: line 5: /usr/local/bin/oci: No such file or directory

So essentially, what I think is happening is that generate_kubeconfig.sh is being run before admin_instance_principal is enabled, which means that oci is not available when generate_kubeconfig.sh is run. |
definitely an ordering issue here.
The above code has got rid of the issue with the oci client being run before it's installed; now the only thing that's left is that the oci command is still being run before the admin instance_principal is enabled successfully.
Yeah, this is an ordering issue. Stuck at the same place: looks like the oci client is not installed before "write_kubeconfig_on_admin" runs. |
The instance_principal, if enabled, is created immediately after the admin instance is created. See here: https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/docs/dependencies.adoc By the time cloud-init has finished on the compute instance, the dynamic group and the policy for instance_principal would have been created already. I'm adding @redscaresu's fix and also a dependency on install_kubectl. Given that the installation of kubectl on the admin host is done through null_resource.install_kubectl_admin and therefore requires the compute instance to be up, this should ensure that the instance_principal will have been created by then. Together, I think these 2 should be enough. If not, then we'll look at the instance_principal in the base module, maybe add an explicit dependency there. All the additional functionality that we add now has its dependencies documented here. I've submitted a PR: #146 . Can you please test and let us know? Thanks very much for your patience and help to hunt this down. |
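For illustration, a hedged sketch of what that extra dependency could look like; resource and variable names here are illustrative and may not match the module exactly:

```hcl
# Sketch only: make kubeconfig generation wait for the kubectl/oci tooling on
# the admin host, which in turn requires the admin compute instance to be up.
resource "null_resource" "write_kubeconfig_on_admin" {
  depends_on = [null_resource.install_kubectl_admin]

  connection {
    host         = var.admin_private_ip
    user         = "opc"
    private_key  = file(var.ssh_private_key_path)
    bastion_host = var.bastion_public_ip
  }

  provisioner "remote-exec" {
    inline = ["bash /home/opc/generate_kubeconfig.sh"]
  }
}
```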
@hyder looks like that has worked now. Thanks for your help. |
Thanks @redscaresu. @sauraahu can you please confirm if this works for you as well? We can then merge and cut a new release for the registry. |
Sorry, I've been testing a bit more. While this is an improvement, I don't think the issue is totally gone. It looks like there is still an ordering issue here. On the first apply we still get an ordering issue whereby we attempt to create the kubeconfig before the admin instance_principal is set up. On the second apply it is able to create the kubeconfig and then create the service account. According to some logging I added, this is what happens. On the first terraform apply, it tries to create the kubeconfig and fails because it does not have the admin instance_principal applied to it. On the second terraform apply
|
A bit more information: on the first terraform apply I am still unable to log onto the bastion, so it looks like I am trying to do the remote-execs to the bastion and admin before I have physical access to those machines. A second terraform apply seems to resolve this ordering issue. |
I've added an additional wait to ensure the instance_principal has been created before the kubeconfig is generated. Can you please try again? You'll need to pull the branch again. |
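For reference, such a wait is typically a small polling loop run on the admin host before the kubeconfig is generated; a sketch only, using a cheap oci call purely as a probe for working instance_principal credentials:

```bash
#!/bin/bash
# Illustrative only: poll until instance_principal auth actually works on the
# admin host before generating the kubeconfig. `oci os ns get` is just a cheap
# request that exercises the instance_principal credentials.
export OCI_CLI_AUTH=instance_principal
until /usr/local/bin/oci os ns get > /dev/null 2>&1; do
  echo "waiting for instance_principal to become active..."
  sleep 10
done
```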
Still fails on the first terraform apply, but the ordering issue is solved on the second apply. It looks like there is a certain amount of time between enabling the admin instance_principal and actually being granted this privilege. https://gist.github.com/redscaresu/e1e989abf48f2024cacd9b15593f285a The above error log shows that the resource write_kubeconfig_on_admin fails. Looking at the kubeconfig.log file from that run,
we can see we have looped twice before trying to execute
that means even though |
@redscaresu I've added a 30s sleep between instance_principal being detected and generating the kubeconfig. Can you please test again? Thanks again for your patience. |
So I tried that too, and unfortunately a simple sleep does not work. So I then tried to implement a rudimentary try/catch:
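Based on the description below (a maximum of 20 loops with 20 seconds of sleep each), a sketch of that kind of bounded retry looks like this:

```bash
#!/bin/bash
# Illustrative only: try the oci call up to 20 times, sleeping 20 seconds
# between attempts, and bail out if instance_principal never becomes usable.
export OCI_CLI_AUTH=instance_principal
attempt=0
until /usr/local/bin/oci os ns get > /dev/null 2>&1; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 20 ]; then
    echo "instance_principal still not active after ${attempt} attempts, giving up" >&2
    exit 1
  fi
  echo "attempt ${attempt}: instance_principal not ready yet, sleeping 20s"
  sleep 20
done
```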
The log is below: https://gist.github.com/redscaresu/f4c4ef86b9ad79c237a94b4458630541

It is interesting that no matter how long we wait, we never get the admin instance_principal permission we need. It's almost as if we are waiting on a terraform operation to complete before we are given this permission; I just do not know what resource that is. In the log you can see we hit the max of 20 loops, with their corresponding 20 seconds of sleep each, before bombing out. I don't think it matters how long we wait because it will always bomb out; we could wait for 100 loops and it would not matter. Something needs to complete before we run resource "null_resource" "write_kubeconfig_on_admin"; I just don't know what that is.

The below is the terraform output. You can see that module.oke.null_resource.write_kubeconfig_on_admin finally bombs out after hitting the 20th loop in the bash script and subsequently causes create_service_account.sh to bomb out. module.oke.null_resource.write_kubeconfig_on_admin will never have the admin instance_principal on the first pass. https://gist.github.com/redscaresu/697b82ded02c640e95aae3a04465f48a |
This is probably a red herring, but...
This seems weird: am I mistaken, or does it look like write_kubeconfig_on_admin is waiting for module.oke.null_resource.create_service_account? write_kubeconfig_on_admin must come first; we can't create the service account until that's done.
I tested this by setting |
I have modified the kubeconfig.tf so that the kubeconfig on the admin host is created the same way as on the local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content). The kubeconfig got created, but create_service_account.sh failed. Running create_service_account.sh manually works. |
Nice! Can you show me what your dependency on the create_service_account terraform resource is? Is it dependent on the resource that creates your kubeconfig now? |
Oops, I created the kubeconfig in the wrong place and didn't set the KUBECONFIG env var. Going to destroy and apply again, fingers crossed. |
I have some progress:
|
Looks like cloud-init did not finish yet; we should wait for /home/opc/admin.finish |
Got it working. I fixed it by putting the following code at the beginning of create_service_account.sh:
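Per the previous comment, that guard boils down to blocking until cloud-init writes its completion marker; a sketch along those lines:

```bash
#!/bin/bash
# Illustrative only: block at the top of create_service_account.sh until
# cloud-init on the admin host has written its completion marker.
while [ ! -f /home/opc/admin.finish ]; do
  echo "waiting for cloud-init to finish (/home/opc/admin.finish not found yet)"
  sleep 10
done
```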
Not the most elegant fix; it would be better not to connect to the admin host and run create_service_account.sh until admin is ready. OCI_CLI_AUTH is in fact set by cloud-init; we are just logging onto the admin host too early. I suppose generate_kubeconfig.sh can be fixed the same way; however, I didn't use this script to generate the kubeconfig on the admin host. Instead, I did it like this:
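A hedged sketch of that data-source approach, with illustrative variable names (as noted in the next comment, the trade-off is that the rendered kubeconfig ends up in the Terraform state):

```hcl
# Illustrative only: render the kubeconfig from the OKE data source and copy
# it to the admin host, instead of running oci/generate_kubeconfig.sh there.
# Assumes /home/opc/.kube already exists on the admin host.
data "oci_containerengine_cluster_kube_config" "kube_config" {
  cluster_id = var.cluster_id
}

resource "null_resource" "write_kubeconfig_on_admin" {
  connection {
    host         = var.admin_private_ip
    user         = "opc"
    private_key  = file(var.ssh_private_key_path)
    bastion_host = var.bastion_public_ip
  }

  provisioner "file" {
    content     = data.oci_containerengine_cluster_kube_config.kube_config.content
    destination = "/home/opc/.kube/config"
  }
}
```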
|
Aha! I think this was the issue. So, I'm moving the delay to another null_resource instead. We were previously rendering the kubeconfig from the data source. However, that was storing the kubeconfig in the state file, which I thought was not a good idea. Oddly:
I'll add @styledigger's findings and make another push soon. Can I trouble you to test again? |
Ok, I've pushed an update to my branch. Can you please make a pull and test again? |
Given I still couldn't replicate the issue, I'll need 2 confirmations from those who have been able to reproduce it in order to confirm we've fixed it: @redscaresu and @styledigger at least. |
Will test it now. @hyder Just to be sure, updates have been pushed to git@github.com:hyder/terraform-oci-oke.git? |
Yes, in branch issue-143. Use the following if you want to test with your existing clone:
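Assuming a remote named after the fork, the usual sequence on an existing clone would be something like:

```bash
# Illustrative only: add the fork as an extra remote on an existing clone and
# check out the fix branch from it.
git remote add hyder https://github.com/hyder/terraform-oci-oke.git
git fetch hyder
git checkout -b hyder-issue-143 hyder/issue-143
```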
Or you can do a fresh clone from my fork instead and checkout the issue-143 branch |
I did a fresh clone: Switched to a new branch 'hyder-issue-143'
Now terraform plan fails:
|
I just took https://github.com/hyder/terraform-oci-oke.git and checked out the issue-143 branch; it created the cluster successfully for me. |
Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from. |
Sure. Meanwhile, I am thinking of hosting an example of how to create an OKE cluster and deploy a sample hello world application, as that would require some additional steps like building a docker image (or using an existing one), uploading the image to OCIR, and creating a sample .yml with a deployment and service configured. What would be the right place to put that example? |
The only issue I am getting is during terraform destroy; I don't know the reason:

Error: Service error:Conflict. The Subnet ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqmzf2mjizqomds2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa references the VNIC ocid1.vnic.oc1.ap-mumbai-1.abrg6ljr4hnwbd4m4fsfmzldixv657vkzfbyeqlrmsyd7eusr6c4px4xcngq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d870a1beb10aeb1ab29a4110a93ae2b4/98933FF0DF57B2AF5B029C4FB4DF4B3A/EE342E4DD141CE0F6DDD1637CBABA34E

On the second run:

Error: Error in function call on modules/auth/outputs.tf line 5, in output "ocirtoken": Call to function "element" failed: cannot use element function with an empty

Error: Error in function call on modules/auth/outputs.tf line 9, in output "ocirtoken_id": Call to function "element" failed: cannot use element function with an empty

Error: Invalid index on .terraform/modules/base/terraform-oci-base-1.1.3/modules/admin/outputs.tf line 9, in output "admin_instance_principal_group_name": The given key does not identify an element in this collection value. |
Both apply and destroy now work for me. |
Yeah, both apply and destroy work for me now. I am testing other features like the dashboard, OCIR secret, helm etc. I will open a separate bug if I find an issue there. Meanwhile, you can merge this please. |
Right, so let me summarize why this happened:
Adding the provider.tf is documented in the quickstart doc, although we only recently updated it, so all of us collectively forgot it should be added. I'll be merging now. Thanks a lot, everyone, for your help and patience in troubleshooting this. On the plus side, we have consequently made the underlying base module more robust, so anybody who's building on top of this repo and using the admin host to install things into their OKE cluster can rely on a more definite pattern. |
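For anyone hitting the same thing, a minimal provider.tf sketch follows; the variable names are placeholders, and the quickstart doc remains the authoritative reference for what the module expects:

```hcl
# Illustrative only: a minimal provider.tf using API-key auth; variable names
# are placeholders, not necessarily the ones the module defines.
provider "oci" {
  tenancy_ocid     = var.tenancy_id
  user_ocid        = var.user_id
  fingerprint      = var.api_fingerprint
  private_key_path = var.api_private_key_path
  region           = var.region
}
```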
Fixed in #146 |
Community Note
Terraform Version and Provider Version
Terraform v0.12.23
Affected Resource(s)
data "template_file" "create_service_account"
resource null_resource "create_service_account"
scripts/create_service_account.template.sh
Debug Output
Github Gist: https://gist.github.com/redscaresu/8fccaaff9666194e698e3c28615953f7
Expected Behavior
run create_service_account.sh successfully
Actual Behavior
The script create_service_account.sh is successfully copied to the admin server; however, it does not run and errors out. Commenting out the script in its entirety results in the successful completion of the terraform apply. When the script is left in, the following error is received:
Error: error executing "/tmp/terraform_156019056.sh": Process exited with status 1
If I ssh to the admin host and run the script manually, I receive the following error.
Steps to Reproduce
terraform apply