Terraform... has inproper state.... and now my infrastructure is borked.... #2058

Closed
jwaldrip opened this issue May 24, 2015 · 15 comments

Comments

@jwaldrip
Contributor

State is no longer the state. Why doesn't Terraform understand the current state of what's in AWS? I had to make a change in AWS manually, and ever since Terraform cannot do anything...... HELP.

@Tokynet

Tokynet commented May 24, 2015

Not sure if this is a real request or a joke, but just in case...

Did you try "terraform refresh"?

I have edited resources from the AWS console that were initially created
through Terraform, and after doing a refresh everything still works.

-Miguel
On May 24, 2015 2:33 AM, "Jason Waldrip" notifications@github.com wrote:

State is no longer the state. Why doesnt terraform understand the current
state of whats in AWS? I had to make a change in AWS manually, and ever
since terraform cannot do jack shit...... HELP.



@jwaldrip
Contributor Author

I did.... Unfortunately it did not help at all. Now I am digging through commits. It also doesn't help that Atlas doesn't "show you" a previous version of the state file.

@mitchellh
Contributor

[Note: I modified your initial comment to remove profanity. I find it doesn't help the conversation since it tends to make others emotional, and your initial frustration is well received without it. I otherwise didn't modify any content.]

Before every command, we copy the state into a terraform.tfstate.backup (within the same directory as the original state). If you're using remote state, this would be in the .terraform folder. You can also always specify this using command-line flags, but it is on by default.

That is your best chance to recover the state.

We're working on infrastructure import, but it isn't ready. It is simply a difficult problem, so I'm sorry about that. I'm also very sorry that there is very little I can do here, except to hope you have the backup file. If not, there are guides online to rebuild the state file manually.

Because we have infrastructure import as another issue, I'm going to close this.

Please let me know if I can assist in any other way.

Once you're able to resolve your issues in some way, I'd love to learn how the "state is no longer the state" happened, and also what happened to the backup file if that couldn't be recovered either. We have copious tests around all of this so it would be good to find any edge cases.

@Tokynet is correct. You mostly only need the ID of the resource to recover it (and the next version of Terraform will have a feature that helps re-import simple resources this way). So I'm assuming you lost your IDs from your state. The silver lining in this is that without that Terraform also cannot destroy your infrastructure (if it doesn't know about it).

So what I'm guessing is that you've lost the ability to maintain your infrastructure, but it still exists somewhere. In which case we need infrastructure import, and we're working on it.

@mitchellh
Contributor

In addition to the large comment above, here are some very specific actionable items so we can help you:

  1. What state currently exists? (Please gist if it contains no secrets, or email us!)
  2. What is the plan you're seeing? What are you expecting?
  3. What is your config?

Given the above, we should be able to recover this no problem.

@jwaldrip
Contributor Author

The backup file was identical to the tfstate file. I was able to recover by reverting my commit history and manually updating the assets in AWS. This was a long process but it seemed to work. Now I am trying to get back to the state I had. Making changes slowly.

@jwaldrip
Contributor Author

Has there been any thought given to Terraform somehow tracking the state of all AWS entities, even if they weren't originally defined in Terraform? It would be cool if there was some sort of 2 way sync.

@mitchellh
Contributor

@jwaldrip Not all AWS entities, just infrastructure import. I think the latter is priority number one, but the former would be interesting to talk about and may work well with import.

The biggest issue in tracking everything is that the API calls are really slow and throttled pretty heavily. This is also the reason that Terraform uses a local cache of state that it "refreshes": even at around 50 to 100 servers, the API calls needed simply to refresh every resource take quite a while.

@miked0004

miked0004 commented Nov 11, 2016

I have the same problem. FWIW, I found this utility that helps: https://github.com/dtan4/terraforming
This is only mildly useful as it does not do in-depth enough scanning, in my experience.

@cantide5ga

cantide5ga commented Feb 23, 2017

Why doesnt terraform understand the current state of whats in AWS?

@mitchellh I doubt this will be seen, but I'm curious why Terraform manages the state like this instead of using the AWS Describe APIs behind the scenes as the single source of truth. I'm guessing this is to maintain the agnostic nature across cloud providers? You may also be addressing exactly this with this comment.

@mitchellh
Contributor

mitchellh commented Feb 23, 2017

@cantide5ga https://www.terraform.io/docs/state/purpose.html

I should add there: AWS APIs actually aren't able to 100% enumerate the settings used to configure them. There are some creation-time settings that aren't available ever again via Reads. And, TF supports a lot more than AWS. Many other API calls have the same issue.

@WarFox

WarFox commented Sep 21, 2017

I came across a similar issue, but I was experimenting with it, for science.

Here is what happened.

I was experimenting with the Terraform Consul backend on https://demo.consul.io/ . I created an S3 bucket with the following Terraform file.

terraform {
  backend "consul" {
    address = "demo.consul.io"
    path = "getting-started-warfox"
    lock = false
  }
}

provider "aws" {
  region = "eu-west-1"
}

# New resource for the S3 bucket our application will use.
resource "aws_s3_bucket" "example" {
  bucket = "warfox-terraform-getting-started-guide"
  acl    = "private"
}

I checked if the state information was available in the https://demo.consul.io/ service. It was available.

Then I deleted the state information from https://demo.consul.io/. Note that the S3 bucket is still intact in AWS, so Terraform is no longer aware of the S3 bucket. Now I re-ran terraform apply. This time Terraform tried to create the S3 bucket and failed, because an S3 bucket with the same name already exists and is owned by me.

This was the error message:

* aws_s3_bucket.example: Error creating S3 bucket: BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it. 

I reckon this scenario can occur when you use the Consul backend without proper ACLs defined. The key-value pair could accidentally get deleted or modified, and then the "state is no longer the state".
Since the state file is separate from the backend service providers, this kind of scenario may occur from time to time.

The terraform refresh command did not help me.

However,

terraform import aws_s3_bucket.example warfox-terraform-getting-started-guide

command updated Terraform state information in Consul.

So we have some way to recover from such problems.
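
To make the accidental deletion described above less likely, the backend can point at a Consul cluster you control, with locking left on and an ACL token required. The following is only a minimal sketch under stated assumptions: the address and path are hypothetical, the token is assumed to come from the CONSUL_HTTP_TOKEN environment variable, and the Consul ACL policy that restricts writes to this key is assumed to exist already.

```hcl
terraform {
  backend "consul" {
    # Hypothetical private Consul endpoint. On the public demo server
    # used above, anyone can edit or delete the key.
    address = "consul.example.internal:8500"
    scheme  = "https"
    path    = "terraform/getting-started"

    # Keep state locking enabled (the experiment above turned it off).
    lock = true

    # access_token is deliberately not set here. This sketch assumes the
    # ACL token is supplied via the CONSUL_HTTP_TOKEN environment variable
    # so that it never lands in version control.
  }
}
```

This does not replace backups of the key itself; it only narrows who can modify or delete it.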

@calvin-hartwell

Sorry to bump, but I'm interested in this scenario. Are there any plans or commands which can be used to rebuild the index? I.e., is Terraform able to inspect AWS to understand what has or has not been built, irrespective of a corrupted or deleted index?

We also ran into an issue recently with the state file. If this functionality doesn't exist, I'm happy to work on it + raise PR.

Thanks

@MatdeB-SL

I have also had this problem where Terraform gets out of sync with the state that exists, often due to crashes. Terraform refresh can only operate on entities which are mentioned in the Terraform state file, therefore it can't import the missing entities.

Cause:
Terraform doesn't add entries to the state file until they have finished being created, therefore if the apply process is interrupted while running, the state file will not include any entities that had not completed creation.

The entities that had begun creation will complete creation on the cloud provider but will never be reflected in the state file, meaning that Terraform can neither replace nor destroy them.

Solution:
The best solution I have found is to load an old terraform state where apply had completed successfully. You can then call refresh or plan with this state and it will pick up the missed entities.

Steps:

  1. Enable versioning of state in the backend (this is done with an S3 bucket for us).
  2. Find a previous version of the state file with complete applied state, and restore it as the current version.
  3. Update the Digest value in the lock table (only required if using locks). If you run terraform plan, it will tell you what the new digest value should be.
  4. Run terraform plan or apply against the restored state file and it should include any missed entities.

This won't work if the entities in question are newly added.
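
For reference, steps 1 and 3 above depend on a versioned state bucket and a lock table. Here is a rough sketch of what that setup might look like, not the exact configuration we use: all bucket, table, and key names are hypothetical, and it assumes AWS provider v4 or later, where versioning is a separate resource.

```hcl
# Bootstrap configuration (applied once, separately from the project).

# Bucket that stores the state; the name is hypothetical.
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-terraform-state"
}

# Keep every previous version of the state object, so an older, fully
# applied state can be restored as the current version.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table for the S3 backend. The backend expects a string hash key
# named "LockID"; the Digest value is stored in this table.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "example-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# Project configuration: point the backend at the bucket and table above.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/terraform.tfstate" # hypothetical key
    region         = "eu-west-1"
    dynamodb_table = "example-terraform-locks"
  }
}
```

The two parts are shown together only for brevity. The bucket and table have to exist before the backend block can be initialized.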

@gggeek

gggeek commented Mar 29, 2019

It seems I also hit some weird, unrelated but not dissimilar corner-case and managed to make it worse by trying to apply the procedure described in ticket #18643.

An imprecise recap of the events is:

  • start with a working infrastructure (aws) and matching state (s3 storage). Instances are managed via autoscaling groups, and blue-green rollouts mandate usage of create_before_destroy and name_prefix instead of name for aws_launch_configuration elements
  • get asked to tighten the definition of security groups. This introduces cycles in resource definitions, which are worked around with a hack: using a data source to read info of sg2 that is used in the ingress rule of sg1. sg2 of course has sg1 in its rules. Things still work, as long as you have both sgs already created first...
  • decide to clean up tagging and naming: change the "name" of some resources, including the security groups in the name of having a common naming pattern
  • try a tf apply, and realize that changing the names of existing resources poses serious challenges to tf. Possibly related to the deep dependency chain, possibly simply because changing names is a no-no
  • roll back the tf files to use old naming
  • realize that tf still does not get back into a "sane state" on its own. tf apply consistently fails, despite tf refresh
  • try manually deleting all resources which do not incur data loss, hoping that tf will rebuild them; hit major problems with manual removal of security groups
  • finally manage to get to a state where tf can successfully deploy the whole platform again, except for complaining at every 'apply' that there are resources that it wants to depose and cannot find any more
  • in an attempt to silence those warnings, manually alter tf state by using 'tf state rm', as suggested in the other ticket
  • run 'tf apply' again and come to the conclusion that the situation is now even worse off than before
  • destroy most resources manually again
  • run 'tf apply' again
  • weep
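
As an aside on the security group cycle mentioned in the list: a commonly used alternative to the data source hack is to declare both groups without inline rules and attach the cross-references as standalone aws_security_group_rule resources. This is only a sketch of that pattern, not the configuration from this project; the variable, the name prefixes, and the port are hypothetical.

```hcl
# Hypothetical input; the real VPC ID would come from your own config.
variable "vpc_id" {
  type = string
}

# Both groups are declared with no inline ingress blocks, so neither
# group depends on the other and both can be created on a fresh apply.
resource "aws_security_group" "sg1" {
  name_prefix = "sg1-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "sg2" {
  name_prefix = "sg2-"
  vpc_id      = var.vpc_id
}

# The mutual references live in separate rule resources, which depend
# on both groups but do not make the groups depend on each other.
resource "aws_security_group_rule" "sg1_from_sg2" {
  type                     = "ingress"
  security_group_id        = aws_security_group.sg1.id
  source_security_group_id = aws_security_group.sg2.id
  protocol                 = "tcp"
  from_port                = 443 # hypothetical port
  to_port                  = 443
}

resource "aws_security_group_rule" "sg2_from_sg1" {
  type                     = "ingress"
  security_group_id        = aws_security_group.sg2.id
  source_security_group_id = aws_security_group.sg1.id
  protocol                 = "tcp"
  from_port                = 443 # hypothetical port
  to_port                  = 443
}
```

Mixing inline ingress blocks and standalone rules on the same group tends to cause conflicts, so a group should use one style or the other.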

Now, I am sure a lot of the steps above should never have been taken.

At the moment, my biggest problem is that 'tf apply' wants to recreate a security group which already exists, and which my account has no permissions to delete manually.

Is there a way for me to achieve the equivalent of "manually editing the state file to make tf understand that the existing security group does not need to be rebuilt"? According to the docs, terraform import might suit this case, but I am not too sure.
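
For what it's worth, here is a sketch of how adopting the existing security group into state might look. It assumes Terraform 1.5 or later, which added import blocks and is newer than the version in use here; every ID and name below is a placeholder, and the resource address is hypothetical.

```hcl
# On older versions the CLI equivalent is:
#   terraform import aws_security_group.existing <security-group-id>
import {
  to = aws_security_group.existing
  id = "sg-0123456789abcdef0" # placeholder: the real security group ID
}

resource "aws_security_group" "existing" {
  # These arguments have to match the real group. A different name or
  # description would make the next plan propose replacing it.
  name        = "existing-group-name"   # placeholder
  description = "existing description"  # placeholder
  vpc_id      = "vpc-0123456789abcdef0" # placeholder

  # Since the account cannot delete this group anyway, make any plan
  # that would destroy it fail instead of attempting the deletion.
  lifecycle {
    prevent_destroy = true
  }
}
```

After the import, terraform plan should show whether the configuration and the real group still differ, for example in their ingress and egress rules.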

@ghost

ghost commented Aug 13, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@hashicorp hashicorp locked and limited conversation to collaborators Aug 13, 2019
9 participants