Openstack creating secgroup timeout #8819

Closed
ChiefAlexander opened this issue Sep 13, 2016 · 22 comments

@ChiefAlexander

When creating multiple security groups with the OpenStack provider, I ran into an issue where Terraform would fail after 30 seconds.

Affected Resource(s)

  • openstack_compute_secgroup_v2

Terraform Configuration Files

I have 5 created security groups in a test file, just increased the number

resource "openstack_compute_secgroup_v2" "security_group" {
  name = "sec_group"
  description = "This is a security group"
  rule {
    from_port = 22
    to_port = 22
    ip_protocol = "tcp"
    cidr = "0.0.0.0/0"
  }
  rule {
    from_port = 1
    to_port = 65535
    ip_protocol = "udp"
    cidr = "${var.openstack_cidr}"
  }
  rule {
    from_port = 1
    to_port = 65535
    ip_protocol = "tcp"
    cidr = "${var.openstack_cidr}"
  }
  rule {
    from_port = 0
    to_port = 0
    ip_protocol = "icmp"
    cidr = "0.0.0.0/0"
  }
  rule {
    from_port = 8
    to_port = 0
    ip_protocol = "icmp"
    cidr = "0.0.0.0/0"
  }
}

resource "openstack_compute_secgroup_v2" "security_group2" {
  name = "sec_group2"
  description = "This is a security group"
  rule {
    from_port = 22
    to_port = 22
    ip_protocol = "tcp"
    cidr = "0.0.0.0/0"
  }
  rule {
    from_port = 1
    to_port = 65535
    ip_protocol = "udp"
    cidr = "${var.openstack_cidr}"
  }
  rule {
    from_port = 1
    to_port = 65535
    ip_protocol = "tcp"
    cidr = "${var.openstack_cidr}"
  }
  rule {
    from_port = 0
    to_port = 0
    ip_protocol = "icmp"
    cidr = "0.0.0.0/0"
  }
  rule {
    from_port = 8
    to_port = 0
    ip_protocol = "icmp"
    cidr = "0.0.0.0/0"
  }
}

Debug Output

https://gist.github.com/ChiefAlexander/87875d431c5eaeedee699c8340ba47cc

Expected Behavior

Terraform should not have aborted the apply.

Actual Behavior

Terraform quits after 30s of trying to create the security groups

Steps to Reproduce

  1. terraform apply

Important Factoids

Our OpenStack instance has a known issue that makes it slow to create security groups.

@ChiefAlexander
Author

ChiefAlexander commented Sep 13, 2016

I am assuming that this is a timeout issue, since the Create function of the secgroup resource (https://github.com/hashicorp/terraform/blob/master/builtin/providers/openstack/resource_openstack_compute_secgroup_v2.go#L96) has no stateConf like the one in its delete function: https://github.com/hashicorp/terraform/blob/master/builtin/providers/openstack/resource_openstack_compute_secgroup_v2.go#L225

I could not find where it may be pulling in a default timeout, nor does the error point to a timeout issue. Openstack logs show the pipe breaking and no other telling errors.

@jtopjian
Contributor

It does sound like a timeout of some sort, but I agree, the error doesn't indicate that. I've run into the EOFs before -- I'll have to check past issues to refresh my memory.

> I have 5 created security groups in a test file, just increased the number

Can you elaborate on this more?

If you're creating five security groups, maybe the Nova API endpoint is hitting a bottleneck. Can you try creating 1 group, then 2, and so on, and see if there's a consistent number where the creation fails?

Can you also try running apply with -parallelism=1? That will cause each resource to be created one by one rather than a bunch in parallel.

@ChiefAlexander
Author

ChiefAlexander commented Sep 13, 2016

I have a complex Terraform file that I am using to stand up a Mesos cluster. Once we encountered the error, I created a simple Terraform file for testing so that we could narrow down the issue. I found that creating one security group succeeds but creating five at once does not. I used the same security group settings as those shown above, renaming each copy with an increasing number.

Running the same test file from my debug output with -parallelism=1 was successful.

Another thing to note: when Terraform fails, OpenStack shows the security groups as created, but Terraform has no state for them and will not destroy them.

@jtopjian
Contributor

> Running the same test file from my debug output with -parallelism=1 was successful.

OK, so this definitely sounds like the Nova API endpoint is being overloaded. But...

> Another thing to note: when Terraform fails, OpenStack shows the security groups as created, but Terraform has no state for them and will not destroy them.

Yes, this is due to the lack of safely checking to see if the security group was created or not. Definitely a bug.

Thanks for reporting this. :)
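
One way to picture the missing safety check described above: when the create call returns an error, look the group up before giving up, so that a group created on the server side is still recorded. The sketch below is only an illustration under stated assumptions. `groupAPI`, `createOrRecover`, and `flakyAPI` are hypothetical names, not gophercloud or Terraform APIs, and matching by name assumes names are unique within the project, which may not hold in every deployment.

```go
package main

import (
	"errors"
	"fmt"
)

type secGroup struct {
	ID   string
	Name string
}

// groupAPI is a hypothetical stand-in for the calls a provider would make.
type groupAPI interface {
	Create(name string) (*secGroup, error)
	List() ([]secGroup, error)
}

// createOrRecover tries to create a group. If the create call fails, it
// lists existing groups and returns a match by name, so the caller can
// still record the group's ID.
func createOrRecover(api groupAPI, name string) (*secGroup, error) {
	sg, err := api.Create(name)
	if err == nil {
		return sg, nil
	}
	groups, listErr := api.List()
	if listErr != nil {
		return nil, fmt.Errorf("create failed (%v) and list failed: %w", err, listErr)
	}
	for i := range groups {
		if groups[i].Name == name {
			return &groups[i], nil
		}
	}
	return nil, err
}

// flakyAPI simulates a backend whose create response is always lost.
// When persist is true the group is stored anyway, as seen in this thread.
type flakyAPI struct {
	persist bool
	groups  []secGroup
}

func (f *flakyAPI) Create(name string) (*secGroup, error) {
	if f.persist {
		f.groups = append(f.groups, secGroup{ID: fmt.Sprintf("id-%d", len(f.groups)+1), Name: name})
	}
	return nil, errors.New("EOF")
}

func (f *flakyAPI) List() ([]secGroup, error) { return f.groups, nil }

func main() {
	sg, err := createOrRecover(&flakyAPI{persist: true}, "sec_group")
	fmt.Println(sg, err)
}
```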

jtopjian added the bug label Sep 13, 2016

@jtopjian
Contributor

Ah - I found where I've run into the EOF / secgroup error before:

https://github.com/hashicorp/terraform/blob/master/builtin/providers/openstack/resource_openstack_compute_instance_v2.go#L617

So it sounds like in this case, even though EOF is happening, the security group is still being created? I wonder if it's safe enough to pass on EOF in this case...

Are you able to alter the security group resource to pass on the EOF such as how it's done in the instance resource, then build from source and test? It'd definitely help since you have an environment that can easily trigger this. No big deal if not.

@ChiefAlexander
Author

I shall give a best attempt but really make no promises as I am not proficient in golang but have been wanting to learn. Trial by fire perhaps?

@jtopjian
Contributor

Sounds like a plan :)

But do let me know if you aren't able to get the patch in place. I can make one up for you to test in the next day or so and I'd even be happy to compile a linux binary so you can skip that part, too.

@ChiefAlexander
Author

The error appears to be hitting here https://github.com/hashicorp/terraform/blob/master/builtin/providers/openstack/resource_openstack_compute_secgroup_v2.go#L116

Changing that if statement to:

if err != nil && err.Error() != "EOF" {

still results in the same error. I have tried adding more of the block you pointed to, but I am not getting past it.

@jtopjian
Contributor

That's interesting... the crash/error output still points to that same line?

Try adding this:

fmt.Printf("[DEBUG] foobar error output: %#v", err)

And then search for "foobar" for easy grepping and see what the err looks like in its entirety.

@ChiefAlexander
Author

That was a great idea:

2016/09/13 16:25:05 [DEBUG] plugin: terraform: openstack-provider (internal) 2016/09/13 16:25:05 [DEBUG] foobar error output: &gophercloud.UnexpectedResponseCodeError{URL:"http://[removedip]:8774/v2/8d2c65973bea4975b54a70377dc09922/os-security-groups", Method:"POST", Expected:[]int{200}, Actual:403, Body:[]uint8{0x7b, 0x22, 0x66, 0x6f, 0x72, 0x62, 0x69, 0x64, 0x64, 0x65, 0x6e, 0x22, 0x3a, 0x20, 0x7b, 0x22, 0x6d, 0x65, 0x73, 0x73, 0x61, 0x67, 0x65, 0x22, 0x3a, 0x20, 0x22, 0x51, 0x75, 0x6f, 0x74, 0x61, 0x20, 0x65, 0x78, 0x63, 0x65, 0x65, 0x64, 0x65, 0x64, 0x20, 0x66, 0x6f, 0x72, 0x20, 0x72, 0x65, 0x73, 0x6f, 0x75, 0x72, 0x63, 0x65, 0x73, 0x3a, 0x20, 0x5b, 0x27, 0x73, 0x65, 0x63, 0x75, 0x72, 0x69, 0x74, 0x79, 0x5f, 0x67, 0x72, 0x6f, 0x75, 0x70, 0x27, 0x5d, 0x22, 0x2c, 0x20, 0x22, 0x63, 0x6f, 0x64, 0x65, 0x22, 0x3a, 0x20, 0x34, 0x30, 0x33, 0x7d, 0x7d}}
2016/09/13 16:25:35 [DEBUG] plugin: terraform: openstack-provider (internal) 2016/09/13 16:25:35 [DEBUG] foobar error output: &url.Error{Op:"Post", URL:"http://[removedip]:8774/v2/8d2c65973bea4975b54a70377dc09922/os-security-groups", Err:(*errors.errorString)(0xc420072040)}

We think the 30s timeout is coming from Nova, as you suggested. We are making a change tonight to see if that gets us through. I will update tomorrow with a status.
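
The Body field in the first log line above is printed as raw bytes because of the %#v verb. As a small aid for reading it, the sketch below converts those bytes back to text and parses them as JSON. The byte values are copied from the log line; `novaFault` and `decodeFault` are hypothetical names, and the struct shape is inferred from this one response rather than from any API documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// loggedBody is the Body field copied verbatim from the first log line.
var loggedBody = []byte{
	0x7b, 0x22, 0x66, 0x6f, 0x72, 0x62, 0x69, 0x64, 0x64, 0x65, 0x6e, 0x22, 0x3a,
	0x20, 0x7b, 0x22, 0x6d, 0x65, 0x73, 0x73, 0x61, 0x67, 0x65, 0x22, 0x3a, 0x20,
	0x22, 0x51, 0x75, 0x6f, 0x74, 0x61, 0x20, 0x65, 0x78, 0x63, 0x65, 0x65, 0x64,
	0x65, 0x64, 0x20, 0x66, 0x6f, 0x72, 0x20, 0x72, 0x65, 0x73, 0x6f, 0x75, 0x72,
	0x63, 0x65, 0x73, 0x3a, 0x20, 0x5b, 0x27, 0x73, 0x65, 0x63, 0x75, 0x72, 0x69,
	0x74, 0x79, 0x5f, 0x67, 0x72, 0x6f, 0x75, 0x70, 0x27, 0x5d, 0x22, 0x2c, 0x20,
	0x22, 0x63, 0x6f, 0x64, 0x65, 0x22, 0x3a, 0x20, 0x34, 0x30, 0x33, 0x7d, 0x7d,
}

// novaFault mirrors the JSON shape of this particular body (an assumption
// based on the decoded text, not on a documented schema).
type novaFault struct {
	Forbidden struct {
		Message string `json:"message"`
		Code    int    `json:"code"`
	} `json:"forbidden"`
}

// decodeFault extracts the message and code from a body of that shape.
func decodeFault(body []byte) (string, int, error) {
	var f novaFault
	if err := json.Unmarshal(body, &f); err != nil {
		return "", 0, err
	}
	return f.Forbidden.Message, f.Forbidden.Code, nil
}

func main() {
	fmt.Println(string(loggedBody))
	msg, code, err := decodeFault(loggedBody)
	fmt.Println(msg, code, err)
}
```

Decoded, the body reads {"forbidden": {"message": "Quota exceeded for resources: ['security_group']", "code": 403}}, so the 403 in the first log line is a security group quota error and is separate from the url.Error logged 30 seconds later.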

@ChiefAlexander
Author

ChiefAlexander commented Sep 14, 2016

After our change we are still seeing the same issue. We had found that nova had a setting for url_timeout=30, which matched up with our 30s timeout terraform side (https://access.redhat.com/solutions/2150241). Upping that timeout to 60s still had terraform failing at 30s during the creation of security groups.

After our change I no longer see the 403 error. Just seeing the other error mentioned:

2016/09/13 21:22:48 [DEBUG] plugin: terraform: openstack-provider (internal) 2016/09/13 21:22:48 [DEBUG] foobar error output: &url.Error{Op:"Post", URL:"http://[removedip]:8774/v2/8d2c65973bea4975b54a70377dc09922/os-security-groups", Err:(*errors.errorString)(0xc420012240)}

@jtopjian
Contributor

Darn.

Is there a helpful error message if you do:

log.Printf("%s", err.Error())

@ChiefAlexander
Author

ChiefAlexander commented Sep 14, 2016

No extra info in that error message:

2016/09/14 14:24:08 [DEBUG] plugin: terraform: openstack-provider (internal) 2016/09/14 14:24:08 [DEBUG] foobar error output: Post http://[removedip]:8774/v2/8d2c65973bea4975b54a70377dc09922/os-security-groups: EOF

To make sure I am coding this the way you intend, here is what I have at line 116:

    if err != nil && err.Error() != "EOF" {
        log.Printf("[DEBUG] foobar error output: %s", err.Error())
        return fmt.Errorf("Error creating OpenStack security group: %s", err)
    }

@jtopjian
Contributor

That looks correct to me. :)

@jtopjian
Contributor

Oh, hold on. Is it correct to say the entire error message is:

Post http://[removedip]:8774/v2/8d2c65973bea4975b54a70377dc09922/os-security-groups: EOF

And not simply

EOF

?

If so, try:

eof := strings.Contains("EOF", err.Error())
if err != nil && ! eof {

@ChiefAlexander
Author

I was able to use your code to get around the EOF error, I just had to flip around the Contains statement:

eof := strings.Contains(err.Error(), "EOF")

However Terraform then had a panic.
Full debug output of run: https://gist.github.com/ChiefAlexander/9b91d31e8f7ea1ae859d23c36285119b
Full panic output: https://gist.github.com/ChiefAlexander/0f21a5b3bfedec41b21a18e497eb1257

Before posting I did one more build and added another log line after the if statement we were failing on, to see whether we got any further. It appears we did.
I think it is now breaking on line 120.

@jtopjian
Contributor

> I was able to use your code to get around the EOF error, I just had to flip around the Contains statement:

Oops - right.

The crash log is reporting that it's making it to line 120. It's possible that the sg being returned doesn't actually contain anything due to the EOF being returned. This is definitely strange. Is there anything in your Nova API logs that points to what might be happening?

I think the correct fix here might be to add a StateChangeConf in the Create. I can take a look at doing that.

Another possible workaround is defining security groups via the Neutron API -- maybe that'll help sort things out?

@jtopjian
Contributor

> I think the correct fix here might be to add a StateChangeConf in the Create. I can take a look at doing that.

Actually, scratch that. The error is happening in Create and that's even before one can check on the status of the create request.

This still sounds like something funny is going on with Nova API -- especially if running the requests one at a time works out. Definitely check out the api logs and also see if switching to the openstack_networking_secgroup_* ends up working; maybe it's an issue between nova and neutron communicating.

Keep me in the loop, though - hopefully we can get this one resolved. :)

@ChiefAlexander
Author

ChiefAlexander commented Sep 16, 2016

> This still sounds like something funny is going on with Nova API

I agree. We actually found the bug that is causing our slowness. Because OpenStack is slow, we are hitting timeouts throughout the stack (HAProxy, the APIs) that are left at their 30s defaults. That is my current working theory.

I had assumed Terraform was timing out because I could not track down where its timeouts came from; I just thought a 30s default was set somewhere.

Unless you want to keep this open for adding a safe check to the create, I am good to close this out. I can also open another, more specific issue for the safe check.

@jtopjian
Contributor

> Unless you want to keep this open for adding a safe check to the create, I am good to close this out. I can also open another, more specific issue for the safe check.

Let's keep this one open.

Thank you for all of your help with this!

@ChiefAlexander
Author

> Thank you for all of your help with this!

No no, Thank YOU for all your help :)

@ghost

ghost commented Apr 10, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@hashicorp hashicorp locked and limited conversation to collaborators Apr 10, 2020