
create_service_account #143

Closed
redscaresu opened this issue Apr 2, 2020 · 38 comments
Labels
bug Something isn't working

Comments

@redscaresu

redscaresu commented Apr 2, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version and Provider Version

Terraform v0.12.23

Affected Resource(s)

data "template_file" "create_service_account"
resource null_resource "create_service_account"
scripts/create_service_account.template.sh

Debug Output

Github Gist: https://gist.github.com/redscaresu/8fccaaff9666194e698e3c28615953f7

Expected Behavior

run create_service_account.sh successfully

Actual Behavior

The script create_service_account.sh is successfully copied to the admin server, but it does not run and errors out. Commenting out the script in its entirety allows the terraform apply to complete successfully; when the script is left in, the following error is received:

Error: error executing "/tmp/terraform_156019056.sh": Process exited with status 1

If I ssh to the admin host and run the script manually I receive the following error.

[xxxx@admin ~]$ ./create_service_account.sh
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Steps to Reproduce

  1. terraform apply
@redscaresu redscaresu added the bug Something isn't working label Apr 2, 2020
@hyder
Contributor

hyder commented Apr 2, 2020

Thanks for logging this issue. Can you please confirm you have:

bastion_enabled = true
admin_enabled = true
admin_instance_principal = true

?

@redscaresu
Author

redscaresu commented Apr 2, 2020

Thanks for logging this issue. Can you please confirm you have:

bastion_enabled = true
admin_enabled = true
admin_instance_principal = true

?
Thanks for your response

All these are set in my tfvars.

admin_instance_principal = true
admin_enabled = true
bastion_enabled = true

@redscaresu
Author

redscaresu commented Apr 2, 2020

It looks like this line is never run:

https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/modules/oke/kubeconfig.tf#L97

However, if you comment out this line

"rm -f $HOME/generate_kubeconfig.sh"

and SSH to the server, running the script manually on the server does result in the kubeconfig being generated.

So I think the problem is that generate_kubeconfig.sh fails, and the service account creation bombs out as a result.

@hyder
Contributor

hyder commented Apr 3, 2020

The generate_kubeconfig.sh is supposed to run automatically once:

  1. the oci-cli has been installed on the admin
  2. and the oke cluster created

We have put depends_on in a few places to make the ordering of these actions deterministic. It looks like we may have missed some, or this was possibly introduced when we shifted the instance_principal from the bastion to the admin. We'll hunt it down and fix it.

@hyder
Contributor

hyder commented Apr 3, 2020

Could be related to #140

@redscaresu
Author

redscaresu commented Apr 3, 2020

I think you are right here. This morning I made the following change to enable us to log the output of the scripts.

"$HOME/create_service_account.sh >>kubeconfig.log 2>&1",
"$HOME/generate_kubeconfig.sh >>kubeconfig.log 2>&1",

/home/opc/generate_kubeconfig.sh: line 5: /usr/local/bin/oci: No such file or directory
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?

So essentially what I think is happening is that generate_kubeconfig.sh is run before admin_instance_principal is enabled, which means that oci is not available when generate_kubeconfig.sh runs.

@redscaresu
Author

redscaresu commented Apr 3, 2020

definitely an ordering issue here.

while [ ! -f /home/opc/admin.finish ]
do
  sleep 30
done
oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config  --region ${region} --token-version 2.0.0

The above code gets rid of the issue with the oci client being run before it is installed; now the only thing left is that the oci command is run before the admin instance_principal is enabled successfully.

@saurabhuja
Contributor

saurabhuja commented Apr 5, 2020

Yeah, this is an ordering issue. I'm stuck at the same place:
module.oke.null_resource.write_kubeconfig_on_admin[0] (remote-exec): ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

Looks like the oci client is not installed before "write_kubeconfig_on_admin" runs.
The question is: how do we order admin_instance_principal before this?

@hyder
Contributor

hyder commented Apr 5, 2020

The instance_principal, if enabled, is created immediately after the admin instance is created. See here:

https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/docs/dependencies.adoc

By the time cloud-init has finished on the compute instance, the dynamic group and the policy for the instance_principal will already have been created.

I'm adding @redscaresu's fix and also a dependency on install_kubectl. Given that the installation of kubectl on the admin is done through null_resource.install_kubectl_admin and therefore requires the compute instance to be up, this should ensure that the instance_principal has been created by then. Together, I think these two should be enough. If not, we'll look at the instance_principal in the base module and maybe add an explicit dependency there.

The dependencies of all the additional functionality we add are now documented there.
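For illustration, here is a rough sketch of what such an explicit ordering could look like. The resource names follow the ones mentioned in this thread, but the connection variables are assumptions rather than the module's actual definitions:

resource "null_resource" "write_kubeconfig_on_admin" {
  # Sketch only: do not run until kubectl/oci tooling is installed on the
  # admin host and the OKE cluster exists.
  depends_on = [
    null_resource.install_kubectl_admin,
    oci_containerengine_cluster.k8s_cluster,
  ]

  connection {
    type        = "ssh"
    host        = var.admin_private_ip             # assumed variable name
    user        = "opc"
    private_key = file(var.ssh_private_key_path)   # assumed variable name
  }

  provisioner "remote-exec" {
    inline = ["$HOME/generate_kubeconfig.sh"]
  }
}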

I've submitted a PR: #146 . Can you please test and let us know?

Thanks very much for your patience and help hunting this down.

@redscaresu
Author

@hyder looks like that has worked now. Thanks for your help.

@hyder
Contributor

hyder commented Apr 6, 2020

Thanks @redscaresu. @sauraahu can you please confirm if this works for you as well? We can then merge and cut a new release for the registry.

@redscaresu
Author

redscaresu commented Apr 6, 2020

Sorry, I've been testing a bit more. While this is an improvement, I don't think the issue is totally gone.

It looks like there is still an ordering issue: on the first apply we still attempt to create the kubeconfig before the admin instance_principal is set up.

On the second apply it is able to create the kubeconfig and then create the service account. According to some logging I added, this is what happens.

On the first terraform apply it tries to create the kubeconfig and fails because the admin instance_principal has not been applied to it yet:
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

On the second terraform apply:

New config written to the Kubeconfig file /home/opc/.kube/config
serviceaccount/kubeconfigsa created
clusterrolebinding.rbac.authorization.k8s.io/kubeconfigsa-crb created

@redscaresu
Author

A bit more information.

On the first terraform apply I am still unable to log on to the bastion, so it looks like the remote-execs to the bastion and admin are attempted before I have access to those machines.

A second terraform apply seems to resolve this ordering issue.

@hyder
Contributor

hyder commented Apr 7, 2020

I've added an additional wait to ensure the instance_principal has been created before the kubeconfig is generated. Can you please try again? You'll need to pull the branch again.

@redscaresu
Author

It still fails on the first terraform apply, but the ordering issue is solved on the second apply.

It looks like there is a certain amount of time between enabling the admin instance_principal and actually being granted the privilege.

https://gist.github.com/redscaresu/e1e989abf48f2024cacd9b15593f285a

The above error log shows that resource write_kubeconfig_on_admin fails.

Looking at the kubeconfig.log file from the run:

waiting for admin to be ready
waiting for admin to be ready
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

we can see we looped twice before trying to execute:

oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config --region ${region} --token-version 2.0.0

That means that even though /home/opc/ip.finish exists, it is not enough to know whether we have been granted the admin instance_principal, which means the existence of oci_identity_dynamic_group and oci_identity_policy is not enough to ensure that we have the admin instance_principal yet.

@hyder
Contributor

hyder commented Apr 7, 2020

@redscaresu I've added a 30s sleep between instance_principal being detected and generating the kubeconfig. Can you please test again?

Thanks again for your patience.

@redscaresu
Author

redscaresu commented Apr 8, 2020

So I tried that too, and unfortunately a simple sleep does not work.

So I then implemented a rudimentary retry loop:

while [ ! -f /home/opc/admin.finish ]  || [ ! -f /home/opc/ip.finish  ];
do
  echo "waiting for admin to be ready"; sleep 10;
done

for i in `seq 1 20`;
do
  oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config  --region ${region} --token-version 2.0.0 && break
  sleep 20
done

The log is below

https://gist.github.com/redscaresu/f4c4ef86b9ad79c237a94b4458630541

It is interesting that no matter how long we wait, we never get the instance_principal permission we need. It's almost as if we are waiting on a Terraform operation to complete before we are given this permission; I just do not know which resource that is.

In the log you can see we hit the maximum of 20 loops, each with its 20 seconds of sleep, before bombing out. I don't think it matters how long we wait; it will always bomb out, even if we waited for 100 loops.

Something needs to complete before we run resource "null_resource" "write_kubeconfig_on_admin". I just don't know what that is.

Below is the Terraform output. You can see that module.oke.null_resource.write_kubeconfig_on_admin finally bombs out after hitting the 20th loop in the bash script, which subsequently causes create_service_account.sh to bomb out. module.oke.null_resource.write_kubeconfig_on_admin will never have the instance_principal on the first pass.

https://gist.github.com/redscaresu/697b82ded02c640e95aae3a04465f48a

@redscaresu
Author

redscaresu commented Apr 8, 2020

This is probably a red herring but....

module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed]
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"

This seems weird: am I mistaken, or does it look like write_kubeconfig_on_admin is waiting for module.oke.null_resource.create_service_account? write_kubeconfig_on_admin must come first; we can't create the service account until that's done.

module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed]
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"

I tested this by setting create_service_account = false and it still failed with the same problem, so this is likely a red herring. While write_kubeconfig_on_admin was being created, I checked that the dynamic group with the associated admin instance and policy was there, and it was present. This is very strange.

@styledigger

styledigger commented Apr 8, 2020

I have modified the kubeconfig.tf so that kubeconfig on admin host is created the same way as on local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content).

The kubeconfig got created, but the create_service_account.sh failed:
The connection to the server localhost:8080 was refused...

Running create_service_account.sh manually works.

@redscaresu
Author

redscaresu commented Apr 8, 2020

I have modified the kubeconfig.tf so that kubeconfig on admin host is created the same way as on local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content).

The kubeconfig got created, but the create_service_account.sh failed:
The connection to the server localhost:8080 was refused...

Running create_service_account.sh manually works.

Nice! Can you show me what the dependency on your create_service_account Terraform resource is? Is it dependent on the resource that creates your kubeconfig now?

@styledigger

Oops, I created the kubeconfig in the wrong place and didn't set the KUBECONFIG env var. Going to destroy and apply again, fingers crossed.

@styledigger

I have some progress:

  • kubeconfig got created
  • create_service_account.sh fails with Unable to connect to the server: getting credentials: exec: exec: "oci": executable file not found in $PATH

@styledigger

Looks like cloud-init had not finished yet; we should wait for /home/opc/admin.finish.

@styledigger

styledigger commented Apr 8, 2020

Got it working.
Apart from waiting for cloud-init to finish, I also had to set OCI_CLI_AUTH=instance_principal.

I fixed it by adding the following code at the beginning of create_service_account.sh:

while [ ! -f /home/opc/admin.finish ];
do
  echo "waiting for admin to be ready"; sleep 10;
done
sleep 10
export OCI_CLI_AUTH=instance_principal
....

Not the most elegant fix; it would be better not to connect to the admin host and run create_service_account.sh until the admin is ready. OCI_CLI_AUTH is in fact set by cloud-init; we are just logging onto the admin host too early.

I suppose generate_kubeconfig.sh can be fixed the same way; however, I didn't use this script to generate the kubeconfig on the admin host. Instead, I did it like this:


resource "null_resource" "write_kubeconfig_on_admin" {
  connection {
    host        = var.oke_admin.admin_private_ip
    private_key = file(var.oke_ssh_keys.ssh_private_key_path)
    timeout     = "40m"
    type        = "ssh"
    user        = "opc"

    bastion_host        = var.oke_admin.bastion_public_ip
    bastion_user        = "opc"
    bastion_private_key = file(var.oke_ssh_keys.ssh_private_key_path)
  }

  depends_on = [oci_containerengine_cluster.k8s_cluster]

  provisioner "file" {
    content     = data.oci_containerengine_cluster_kube_config.kube_config.content
    destination = "~/.kube/config"
  }

  count = var.oke_admin.bastion_enabled == true && var.oke_admin.admin_enabled == true ? 1 : 0
}

@hyder
Contributor

hyder commented Apr 8, 2020

OCI_CLI_AUTH is in fact set by cloud-init; we are just logging onto the admin host too early.

Aha! I think this was the issue. So, I'm moving the delay to another null_resource instead.
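As a rough sketch of that idea (resource and variable names here are assumptions; only the admin.finish/ip.finish marker files come from the discussion above):

resource "null_resource" "wait_for_admin_instance_principal" {
  # Sketch only: a dedicated wait resource, so the delay is not baked into the
  # kubeconfig/service-account scripts themselves.
  connection {
    type        = "ssh"
    host        = var.admin_private_ip             # assumed variable name
    user        = "opc"
    private_key = file(var.ssh_private_key_path)   # assumed variable name
  }

  provisioner "remote-exec" {
    inline = [
      # admin.finish and ip.finish are written by cloud-init, as discussed above
      "while [ ! -f /home/opc/admin.finish ] || [ ! -f /home/opc/ip.finish ]; do echo 'waiting for admin to be ready'; sleep 10; done",
    ]
  }
}

# Resources that generate the kubeconfig or create the service account can then
# declare: depends_on = [null_resource.wait_for_admin_instance_principal]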

We were rendering the kubeconfig before; however, that stored the kubeconfig in the state file, which I thought was not a good idea.

Oddly:

  1. I haven't run into any of the above issues at all
  2. The other scripts (e.g. ocirsecret, metricserver, calico) that all depend on the kubeconfig haven't run into this problem either.

I'll add @styledigger's findings and make another push soon. Can I trouble you to test again?

@hyder
Contributor

hyder commented Apr 8, 2020

Ok, I've pushed an update to my branch. Can you please make a pull and test again?

@hyder
Contributor

hyder commented Apr 8, 2020

Given that I still couldn't replicate the issue, I'll need two confirmations from those who have been able to, in order to confirm we've fixed it: at least @redscaresu and @styledigger.

@styledigger

Will test it now. @hyder Just to be sure, updates have been pushed to git@github.com:hyder/terraform-oci-oke.git?

@hyder
Contributor

hyder commented Apr 9, 2020

Yes, in branch issue-143.

Use the following if you want to test with your existing clone:

git checkout -b hyder-issue-143 master
git pull https://github.com/hyder/terraform-oci-oke.git issue-143

Or you can do a fresh clone from my fork instead and checkout the issue-143 branch

@styledigger

I did a fresh clone:
git clone git@github.com:hyder/terraform-oci-oke.git

Switched to a new branch 'hyder-issue-143':

git checkout -b hyder-issue-143 master 
git pull https://github.com/hyder/terraform-oci-oke.git issue-143

Now terraform plan fails:

...
module.base.module.vcn.data.oci_core_services.all_oci_services[0]: Refreshing state...
module.network.data.oci_core_services.all_oci_services[0]: Refreshing state...

Error: Null value found in list

  on modules\policies\datasources.tf line 9, in data "oci_identity_regions" "home_region":
   9: data "oci_identity_regions" "home_region" {

Null values are not allowed for this attribute value.


Error: Invalid function argument

  on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 9, in data "template_file" "ad_names":
   9:   count    = length(data.oci_identity_availability_domains.ad_list.availability_domains)
    |----------------
    | data.oci_identity_availability_domains.ad_list.availability_domains is null

Invalid value for "value" parameter: argument must not be null.


Error: Null value found in list

  on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 18, in data "oci_identity_regions" "home_region":
  18: data "oci_identity_regions" "home_region" {

Null values are not allowed for this attribute value.


Error: Attempt to index null value

  on .terraform\modules\base\terraform-oci-base-1.1.3\modules\admin\locals.tf line 12, in locals:
  12:   admin_image_id = var.oci_admin.admin_image_id == "Oracle" ? data.oci_core_images.admin_images.images.0.id : var.oci_admin.admin_image_id
    |----------------
    | data.oci_core_images.admin_images.images is null

This value is null, so it does not have any indices.


Error: Attempt to index null value

  on .terraform\modules\base\terraform-oci-base-1.1.3\modules\bastion\locals.tf line 12, in locals:
  12:   bastion_image_id = var.oci_bastion.bastion_image_id == "Autonomous" ? data.oci_core_images.autonomous_images.images.0.id : var.oci_bastion.bastion_image_id
    |----------------
    | data.oci_core_images.autonomous_images.images is null

This value is null, so it does not have any indices.

@saurabhuja
Contributor

I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.

@hyder
Contributor

hyder commented Apr 10, 2020

I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.

Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from.

@saurabhuja
Contributor

saurabhuja commented Apr 10, 2020

I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.

Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from.

Sure. Meanwhile, I am thinking of adding an example of how to create an OKE cluster and deploy a sample hello-world application, as that would require some additional steps like building a Docker image (or using an existing one), uploading the image to OCIR, and creating a sample .yml with a deployment and service configured. What would be the right place to put that example?

@saurabhuja
Contributor

saurabhuja commented Apr 10, 2020

The only issue I am getting is during terraform destroy; I don't know the reason.
First time:
module.network.oci_core_subnet.pub_lb[0]: Still destroying... [id=ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqm...2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa, 10m10s elapsed]

Error: Service error:Conflict. The Subnet ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqmzf2mjizqomds2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa references the VNIC ocid1.vnic.oc1.ap-mumbai-1.abrg6ljr4hnwbd4m4fsfmzldixv657vkzfbyeqlrmsyd7eusr6c4px4xcngq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d870a1beb10aeb1ab29a4110a93ae2b4/98933FF0DF57B2AF5B029C4FB4DF4B3A/EE342E4DD141CE0F6DDD1637CBABA34E

Second time or run:
module.base.module.vcn.oci_core_vcn.vcn: Destruction complete after 1s

Error: Error in function call

on modules/auth/outputs.tf line 5, in output "ocirtoken":
5: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.token, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple

Call to function "element" failed: cannot use element function with an empty
list.

Error: Error in function call

on modules/auth/outputs.tf line 9, in output "ocirtoken_id":
9: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.id, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple

Call to function "element" failed: cannot use element function with an empty
list.

Error: Invalid index

on .terraform/modules/base/terraform-oci-base-1.1.3/modules/admin/outputs.tf line 9, in output "admin_instance_principal_group_name":
9: value = var.oci_admin.admin_enabled == true && var.oci_admin.enable_instance_principal == true ? oci_identity_dynamic_group.admin_instance_principal[0].name : null
|----------------
| oci_identity_dynamic_group.admin_instance_principal is empty tuple

The given key does not identify an element in this collection value.

@styledigger

styledigger commented Apr 11, 2020

Both apply and destroy now work for me.
The problem is that there is no provider.tf in the issue-143 branch's root folder, so the OCI provider loads the required OCIDs from ~/.oci/config (which has wrong values in my case).

@saurabhuja
Contributor

Yeah, both apply and destroy work for me now. I am testing other features like the dashboard, OCIR secret, helm, etc. I will open a separate bug if I find an issue there. Meanwhile, please go ahead and merge this.

@hyder
Contributor

hyder commented Apr 12, 2020

Right, so let me summarize why this happened:

  1. we removed the provider.tf ("Remove provider.tf to make this project a reusable module from the registry" #130)
  2. this made the Terraform provider use the oci config, which for some of you may have different permissions, particularly the ability to create dynamic groups
  3. as a result of the dynamic group for the instance_principal not being created, the admin host didn't enjoy instance_principal privileges
  4. this resulted in the admin host being unable to use the oci cli to generate the kubeconfig
  5. since the kubeconfig was not generated, the service accounts could not be created either

Adding the provider.tf is documented in the quickstart doc, although we only updated it recently, so all of us collectively forgot that it should be added.
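For reference, a minimal provider.tf along those lines might look like the sketch below; the variable names are illustrative, and the home-region alias is an assumption based on the identity resources (dynamic groups, policies) this module creates:

# Sketch of a minimal provider.tf; variable names are illustrative.
provider "oci" {
  tenancy_ocid     = var.tenancy_id
  user_ocid        = var.user_id
  fingerprint      = var.api_fingerprint
  private_key_path = var.api_private_key_path
  region           = var.region
}

# Identity resources such as dynamic groups and policies are created in the
# home region, so an aliased provider is assumed to be needed as well.
provider "oci" {
  alias            = "home"
  tenancy_ocid     = var.tenancy_id
  user_ocid        = var.user_id
  fingerprint      = var.api_fingerprint
  private_key_path = var.api_private_key_path
  region           = var.home_region
}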

I'll be merging now.

Thanks a lot, everyone, for your help and patience in troubleshooting this. On the plus side, we have made the underlying base module more robust as a result, so anybody who is building on top of this repo and using the admin host to install things into their OKE cluster can rely on a more dependable pattern.

@hyder
Contributor

hyder commented Apr 12, 2020

Fixed in #146
