
inspec fails with kitchen verify -c N #119

Closed
pudge opened this issue Nov 16, 2016 · 20 comments
Labels
Type: Bug Doesn't work as expected.

Comments

@pudge

pudge commented Nov 16, 2016

Executing multiple simultaneous kitchen verify runs using concurrency (-c 5, for example) fails, apparently using the same configuration for all nodes (note that the target in the attached log has the same port for each run, even though the boxes are all running on different ports).

When running serially instead of concurrently, it works fine.

inspec-concurrent-kitchen.log.txt

@otakup0pe

Confirming that we are seeing this as well.

@baurmatt

baurmatt commented Dec 1, 2016

We're also seeing this problem.

@stefanandres

Can I do anything to help resolve this issue? I don't really know the code, but this problem occurs in each of our CI runs. :/

@quulah

quulah commented Aug 2, 2017

Is inspec/inspec#1598 related?

@ghost

ghost commented Sep 10, 2017

This has been a problem for almost a year now.

@nhudacin

That's the workaround, running kitchen verify serially?

@pudge
Author

pudge commented Sep 11, 2017

That's the workaround, running kitchen verify serially?

It worked for me, @nhudacin.
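
For anyone who wants that workaround in script form, here is a minimal sketch (assuming a POSIX shell; kitchen verify without -c runs instances one at a time, and kitchen list -b prints bare instance names, one per line):

# Verify each instance one at a time, stopping at the first failure
for instance in $(kitchen list -b); do
  kitchen verify "$instance" || exit 1
done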

@ghost

ghost commented Sep 13, 2017

@adamleff do you know of anyone that can help?

@adamleff
Contributor

@pantocrator27 Unfortunately, no one is actively working on fixing this. As an open source project, we absolutely welcome community members contributing fixes and providing reproduction steps for those willing to help.

I just did a quick test with the latest test-kitchen and kitchen-inspec and could not reproduce this, so for someone to engage on this, whether or not they work for Chef, we'll need more concrete steps we can use to reproduce the issue.

Thank you.

@ghost

ghost commented Sep 13, 2017

Thanks @adamleff, I will work on providing reproduction steps. For the time being, what I am personally finding is that the error appears when concurrency is set higher than the number of instances (platforms × test suites).
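
Building on that observation, one way to avoid over-subscribing is to derive the concurrency from the actual instance count instead of hard-coding it. A sketch, assuming a POSIX shell and that every instance should be verified:

# Cap concurrency at the number of instances (platforms x suites)
instances=$(kitchen list -b | wc -l)
kitchen verify -c "$instances"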

@adamleff added the bug label Sep 20, 2017
@dragon788

Seeing this as well. The weird thing is that ONLY the verify portion fails when concurrency is greater than -c 2; everything else is fine. It appears to have something to do with the thor/busser/busser-serverspec transfers to the instance (tar errors, checksum errors, and file-not-found errors have all cropped up in the logs). Oddly, -c 2 works, but -c 3 or higher fails consistently if the instances involved in the -c N run share the same base AMI/image.

Below is what we run on our CI system. Yes, we could just run kitchen verify or kitchen test and it would perform all the previous steps, but we want to fail fast and know explicitly if there is a timeout or permissions issue and where it happens. We've also found that kitchen test is flaky at times, while kitchen converge && kitchen verify works 98% of the time (when not using -c 3).

Our CI build script:

export LC_CTYPE=en_US.UTF-8
chef exec cookstyle --parallel --color
chef exec foodcritic -f correctness .
chef exec rspec --color --tty
chef exec kitchen create --color -c4
chef exec kitchen converge --color -c4
chef exec kitchen setup --color -c4
chef exec kitchen verify --color -c2
chef exec kitchen destroy --color -c4

Excerpt of .kitchen.yml:

driver:
  name: ec2
  aws_ssh_key_id: secret-key
  security_group_ids: ["sg-groups"]
  region: us-west-2
  subnet_id: subnet-our-ids
  iam_profile_name: iam-role-chef-test-kitchen
  instance_type: t2.medium
  interface: private
  tags:
    Name: test-kitchen-patching-wrapper

provisioner:
  name: chef_zero
  require_chef_omnibus: 13.6.4

transport:
  username: ubuntu
  ssh_key: ~/.ssh/our-chef-key.pem
  connection_timeout: 10
  connection_retries: 5

platforms:
  - name: ubuntu-14.04
    driver:
      image_id: lightly-modified-1404-ami
      user_data: test/user_data.sh
      block_device_mappings:
        - device_name: /dev/sda1
          ebs:
            volume_type: gp2
            volume_size: 50
            delete_on_termination: true
        - device_name: /dev/sdb
          ebs:
            volume_type: gp2
            volume_size: 100
            delete_on_termination: true
    transport:
      name: sftp
  - name: windows-2012r2
    transport:
      username: administrator
      connection_retry_sleep: 15
      connection_retries: 60

suites:
  - name: tuesday_patch
    run_list:
      - recipe[patching_wrapper::default]
    includes: ["ubuntu-14.04"]
  - name: thursday_patch
    run_list:
      - recipe[patching_wrapper::default]
    attributes:
        jenkins_role: 'master'
    includes: ["ubuntu-14.04"]
  - name: windows_chef_upgrade
    run_list:
      - recipe[patching_wrapper::default]
    includes: ["windows-2012r2"]

@dragon788

So the crazy thing is we can run kitchen verify -c4 against AWS from our local workstations pretty consistently without any errors, but it always fails on our CI system (which actually lives in AWS) above -c2.

It could also be that our Windows instance is dodging a bullet by virtue of the WinRM transport being slow for file transfers, but oddly it always seems to be thursday_patch that loses the race (condition) at -c3 or above.

See this gist for an example of failed run output: https://gist.github.com/dragon788/03c77c7aac1c27efb826387e87b892ef

@jurajseffer

When using --parallel or -c I usually get:

>>>>>>     Failed to complete #verify action: [no implicit conversion of nil into String] on default-consul3
>>>>>>     Failed to complete #verify action: [no implicit conversion of nil into String] on default-consul1

but I also got

>>>>>>     Failed to complete #verify action: [Client error, can't connect to 'ssh' backend: Train::Transports::SSH does not implement #connect()] on default-consul2

It's not consistent: sometimes it works, but most of the time no parallel runs succeed. This seems quite broken.

@ricoli

ricoli commented Feb 22, 2018

Any progress on this? Also seeing it when using CentOS 6 on AWS...

@ghost

ghost commented Mar 8, 2018

While monitoring this, I noticed that if I run kitchen list during a run, Last Action says Set Up but Last Error says Type Error.

@slve

slve commented Mar 9, 2018

I also wanted to prefix the output and came to this solution using GNU parallel,
where $BOX is the pattern I originally passed to kitchen test and -j3 means 3 concurrent runs.

kl=$(kitchen list | cut -d' ' -f1 | sed 1d | grep "$BOX")
parallel -j3 --tag kitchen test {} ::: $kl

@ghost

ghost commented Mar 21, 2018

I have switched to using parallel as well, per @slve's recommendation, with better success.

@gionn

gionn commented May 4, 2018

To avoid breaking on deprecation notices while test-kitchen starts up, and to stop kitchen.log from being truncated by every process that starts up (though the output would be garbled in any case):

kl=$(kitchen list -b -l fatal| cut -d' ' -f1 | sed 1d | grep "$BOX")
parallel -j3 --tag kitchen test --no-log-overwrite {} ::: $kl

@mike10010100

We ran into the same issue when running kitchen test -c while using kitchen-dokken. We found a solution by specifying the same number of distinct volumes as there are platforms running simultaneously.

For example, if you have 5 platforms, under driver: you need to add:

volumes: [
    '/var/lib/docker', '/var/lib/docker-one', '/var/lib/docker-two', '/var/lib/docker-three', '/var/lib/docker-four'
  ]

After this addition, we've been able to run concurrent kitchen test runs without this error.
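
For reference, a minimal sketch of how that volumes list might sit in a .kitchen.yml driver block (name: dokken comes from kitchen-dokken; the five-volume list mirrors the five-platform example above):

driver:
  name: dokken
  # one distinct Docker volume per concurrently running platform
  volumes: [
    '/var/lib/docker', '/var/lib/docker-one', '/var/lib/docker-two',
    '/var/lib/docker-three', '/var/lib/docker-four'
  ]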

@tas50 added the Type: Bug (Doesn't work as expected.) label and removed the bug label Jan 14, 2019
@kekaichinose

What’s happening? Why was this issue closed?
This issue was closed during a much-needed review of legacy issues and issues that were opened against older versions of InSpec, i.e. < v3.

Why do I care?
You would care if you are still seeing this issue and/or feel it needs to be addressed in the current version of InSpec.

What do I need to do?
If this issue is no longer important, no further action is necessary. However, if you think this is something that should be addressed, please open a new issue and refer to the original issue in the description.
