Parallel builds from CircleCI don't aggregate correctly #1341

danadaldos · 2019-08-24T16:52:26Z

Context:

As I understand it, from looking at documentation and replies to other issues, Coveralls.io has three requirements in order to correctly record and aggregate parallel builds (please correct me if I'm wrong):

1. The JSON data for each submitted job needs parallel: true set, either via the ENV var COVERALLS_PARALLEL=true or via mix coveralls.circle --parallel (they accomplish the same thing). This suspends the final analysis until the webhook arrives.
2. Incoming jobs must share the same service_number. As long as they come in marked "parallel", Coveralls designates them with the shared build ID and a decimal showing each job's place (e.g. 21989.3, 21989.4).
3. A POST to the webhook with the repo token, to signal that the build is finished. The docs recommend https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN; other sources recommend https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[build_num]=$BUILD_NUMBER&payload[status]=done" to explicitly send the status and the build number.
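
As a minimal sketch of the webhook call described above, the following Python snippet only assembles the URL and form body and does not send anything. The function name `build_webhook_request` and the placeholder token are hypothetical; the endpoint and the `payload[build_num]` / `payload[status]` keys are the ones quoted in this thread, and which value to pass as the build number depends on your CircleCI setup.

```python
from typing import Tuple
from urllib.parse import urlencode

# Endpoint as quoted in this thread.
WEBHOOK_URL = "https://coveralls.io/webhook"


def build_webhook_request(repo_token: str, build_num: str) -> Tuple[str, str]:
    """Return (url, form_body) for the 'build finished' webhook call.

    Hypothetical helper: it builds the request pieces but sends nothing.
    """
    url = f"{WEBHOOK_URL}?{urlencode({'repo_token': repo_token})}"
    # safe="[]" keeps the bracketed keys readable, matching the curl -d form.
    body = urlencode(
        {"payload[build_num]": build_num, "payload[status]": "done"},
        safe="[]",
    )
    return url, body


url, body = build_webhook_request("REPO_TOKEN_PLACEHOLDER", "22043")
print(url)
print(body)
```

The two printed strings correspond to the URL and the `-d` argument of the curl form of the call.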

Description:

I have parallelization on CircleCI building correctly with 4 containers and reporting to Coveralls.io via the excoveralls library.

I have set the --parallel flag when excoveralls runs which correctly adds the parallel: true param to the JSON (see below). I also have the COVERALLS_PARALLEL=true set in various places just to be sure.

As the build runs, I see jobs reporting with the expected 22043.1, 22043.2 designations, but then the jobs are replaced with the later job 22043.3, and finally 22043.4, which is the final job and the one that remains on the build. The results do not aggregate correctly and we see a massive drop in coverage compared to master. Each container on CircleCI ends with Successfully uploaded the report to 'https://coveralls.io'.

A sanity check with 1 container (parallelization: 1) on CircleCI showed that the splitting and building were working correctly at ~93% coverage: https://coveralls.io/builds/25348343

I have tried a number of different webhook calls, including the documented:

notify:
  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN

As well as one that explicitly includes the done status. Notice that [build_num] has been replaced with [service_number]; I have tried both ways:

notify:
  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[service_number]=$CIRCLE_BUILD_NUMBER&payload[status]=done"

Neither this, nor manual calls to curl -k https://coveralls.io/webhook... from the terminal have caused the resulting build to work correctly. Posting to either manually from the terminal gave a response of {"done":true}%, which tells me that it's working correctly (other variations resulted in errors).

Note: I am not using workflows, the "service_job_id" and the "service_number" in the JSON payload are the same number, namely the $CIRCLE_BUILD_NUMBER (see JSON below).

JSON:

{"git":{"branch":"dd-parallelize-ci-test","head":{"committer_name":"danadaldos","id":"f34c07b9920025405bb4ec0ed48f50a00e4c3158","message":"Try relying on manual webhook call"}},"parallel":true,"repo_token":"<CORRECT REPO TOKEN REDACTED>","service_job_id":"22043","service_name":"circle-ci","service_number":"22043","service_pull_request":null,"source_files":[{"coverage": ...

Screenshots:

Build `22043` showing first two jobs:

Same build showing only the final job and skewed results:

Related Issues:

#1191
#1178
#1093

kelvintyb · 2020-04-05T09:56:36Z

Will this be looked at? I believe it's blocking most people who use parallel runs in CircleCI. @afinetooth

afinetooth · 2020-04-06T21:30:43Z

@kelvintyb this is being looked at. Team is aware, and I will try to reproduce to gain further insight. No ETA yet, but will feed back asap.

nickmerwin · 2020-04-09T17:49:44Z

Hi @kelvintyb and @danadaldos, could you please post your .circleci/config.yml so we can better understand your setup?

If you'd prefer not to post here publicly, you could email it to us at support@coveralls.io

danadaldos · 2020-04-09T18:11:12Z

@nickmerwin @afinetooth

These are my current setup files. They are currently NOT CONFIGURED TO USE COVERALLS. See the steps listed below for the changes that I made in order to configure Coveralls within this setup.

.circleci/config.yml:
https://gist.github.com/danadaldos/41bf98fe8bafac177fbfe7243bcc2545

test script:
https://gist.github.com/danadaldos/ee84345672d07caed447fdad66da61e1

They have changed somewhat since I posted this issue 8 months ago. Namely, now we are using CircleCI workflows, and when we made that change and instituted parallelization on CircleCI, I tried reinstating Coveralls.io (this was about a month ago). I made the following changes:

Configuration Steps

1. Change our test script to run: mix coveralls.circle --parallel ${TESTFILES}
   Note: The ExCoveralls library adds the "parallel: true" flag to the JSON that is sent with each container when you add the --parallel flag.
2. Add COVERALLS_PARALLEL=true to the docker environment in our .circleci/config.yml:

     docker:
       - image: circleci/elixir:1.8.2-browsers
         environment:
           COVERALLS_PARALLEL: true

   I have confirmed via SSH into Circle builds that this is set correctly in the environment.

3. Add a final step in .circleci/config.yml:

  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done"

I have tried calling this in various ways and from various places, including a separate job in CircleCI that posts the correct payload[build_num]=$CIRCLE_WORKFLOW_ID, as well as calling it manually from my terminal once all test containers have finished. I have also tried other IDs; CIRCLE_WORKFLOW_ID is the one shared by all containers.

I am reluctant to post any of the specific .circleci/config.yml files that I used because I have tried tweaking literally every variable I could think of to try to get Coveralls to recognize all of the containers in a build. If any of the steps I listed above are wrong, please let me know.

nickmerwin · 2020-04-09T18:14:08Z

@danadaldos I believe this may be the issue:

Note: I am not using workflows, the "service_job_id" and the "service_number" in the JSON payload are the same number, namely the $CIRCLE_BUILD_NUMBER (see JSON below).

The excoveralls lib uses the CIRCLE_BUILD_NUM environment variable for the Job ID:

  defp get_job_id do
    # When using workflows, each job has a separate `CIRCLE_BUILD_NUM`, so this needs to be used as the Job ID and not
    # the Job Number.
    System.get_env("CIRCLE_BUILD_NUM")
  end

https://github.com/parroty/excoveralls/blob/master/lib/excoveralls/circle.ex#L65

It's expecting to be run within a workflow so that this is unique per parallel job.

Here's how our Ruby library handles it:

config[:service_job_number]   = ENV['CIRCLE_NODE_INDEX']

https://github.com/lemurheavy/coveralls-ruby/blob/master/lib/coveralls/configuration.rb#L66

Otherwise, Coveralls thinks the jobs are duplicates, since the Build number and Job ID match, and removes them. That is why you see multiple jobs at first in our UI and then see them removed shortly after.

The Elixir lib may need to be updated to support this non-workflow based Circle parallel setup by additionally using CIRCLE_NODE_INDEX.

E.g.:

  defp get_job_id do
    "#{System.get_env("CIRCLE_BUILD_NUM")}-#{System.get_env("CIRCLE_NODE_INDEX")}"
  end
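
To illustrate why identical IDs collapse into a single job, here is a toy Python model of the behavior described above. It is an assumption-laden sketch, not Coveralls' actual implementation: the helper names `job_id` and `surviving_jobs` are hypothetical, and the rule "a later job with the same (service_number, service_job_id) replaces the earlier one" is inferred from the symptoms reported in this issue.

```python
def job_id(build_num: str, node_index: str) -> str:
    """Composite ID in the style of the suggested change: BUILD_NUM-NODE_INDEX."""
    return f"{build_num}-{node_index}"


def surviving_jobs(jobs):
    """Toy model (assumption): jobs sharing (service_number, service_job_id)
    are treated as duplicates, and the most recent one wins."""
    kept = {}
    for job in jobs:
        kept[(job["service_number"], job["service_job_id"])] = job
    return list(kept.values())


# Four containers that all report the build number as their job ID.
colliding = [
    {"service_number": "22043", "service_job_id": "22043", "node": i}
    for i in range(4)
]

# The same containers with the node index folded into the job ID.
distinct = [
    {"service_number": "22043", "service_job_id": job_id("22043", str(i)), "node": i}
    for i in range(4)
]

print(len(surviving_jobs(colliding)))  # 1: only the last container's job remains
print(len(surviving_jobs(distinct)))   # 4: every container's job is kept
```

Under this model, the first case matches the symptom in the original report (only the final job remains on the build), while the composite ID keeps one job per container.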

danadaldos · 2020-04-09T18:17:20Z

@nickmerwin Yes, please see my most recent comment. I updated the information. We are using workflows now and I am sending "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done" to Coveralls.

nickmerwin · 2020-04-09T18:18:48Z

Thanks @danadaldos, can you link me to your most recent test build using the new workflows setup?

danadaldos · 2020-04-09T19:32:30Z

@nickmerwin Here is our most recent build on Coveralls.io: https://coveralls.io/builds/29361557
When we run Coveralls locally, we're at ~94% coverage.

To be clear, this build was set up exactly as I mentioned:
COVERALLS_PARALLEL: true, mix coveralls.circle --parallel, webhooks: - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done"

nickmerwin · 2020-04-09T19:44:56Z

Thanks @danadaldos, could you SSH into a build and confirm that CIRCLE_WORKFLOW_WORKSPACE_ID is being set?

It appears that CIRCLE_BUILD_NUM is ...5DF4D28A52B7 and is coming over to Coveralls as both the service_number and service_job_id, which is why 6 out of the 7 jobs are considered duplicates and are being culled.

danadaldos · 2020-04-09T19:45:36Z

@nickmerwin Yeah, it will take me a minute to get things reconfigured for Coveralls.

danadaldos · 2020-04-09T20:14:53Z

@nickmerwin I'm sorry, I was giving you wrong information. I actually did get this to build correctly. The issue that I'm now having is that in order for this to build correctly, it's taking 30 minutes to do so. Our test suite on CircleCI finishes after 5 minutes normally. Here is a link to a build around the same time that built correctly with all containers: https://coveralls.io/builds/29358320
You can see on that build that the Job ID is 4c383bb7-6b4b-4369-a459-0ed7e4d9bfe2.40 with the .40 indicating the number of containers.

Any insight as to why it takes so long? I have a build currently running on CircleCI that I will link to once it reports to Coveralls.

danadaldos · 2020-04-09T20:23:40Z

And just to be thorough, my working setup is:

Use mix coveralls.circle --parallel ${TESTFILES} (same as above)
Set COVERALLS_PARALLEL: true to the docker env in CircleCI (same as above)
Report finished workflow via a job in CircleCI, which is the same approach they use in the orb: https://circleci.com/orbs/registry/orb/coveralls/coveralls

...
  notify_coveralls:
    docker:
      - image: circleci/elixir:1.8.2-browsers
        environment:
          COVERALLS_PARALLEL: true
    steps:
      - run: |
          curl "https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN" \
            -d "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done"
          exit 0

workflows:
  build_and_test:
    jobs:
      - run_credo
      - run_tests
      - notify_coveralls:
            requires:
              - run_tests

These are the only changes between a 5 minute build and a 30 minute build with Coveralls.

danadaldos · 2020-04-09T20:56:01Z

@nickmerwin

Most recent job finished after 40 minutes: https://coveralls.io/builds/29978345

Here's a screenshot of our workflows for comparison's sake:

nickmerwin · 2020-04-10T17:23:35Z

@danadaldos I checked the calculation time for that build on our side and it was only 2.2 seconds after the webhook came in. I suspect Circle may have queued up the webhook for those 40 minutes. Perhaps you could add another webhook receiver like https://requestbin.com to confirm the delay.

We keep metrics on how quickly our background processors dequeue jobs here:

https://status.coveralls.io

Since the original issue of parallel coverage run merging is resolved, I'm closing the issue for now, but will monitor the thread for any other questions that arise.

Thank you!

danadaldos · 2020-04-12T14:57:33Z

@nickmerwin Thank you for following up on this. It's extremely helpful to know how long it took on your end, and now I can take that information to CircleCI to see what's up. Thank you again!

Update - I did try sending the webhook to Requestbin.com and there was no delay. The CI job finished in ~5 minutes and Requestbin received the message right when it finished.

Finally, if I still have your ear @nickmerwin, please please address the documentation found here: https://docs.coveralls.io/parallel-build-webhook
I am fairly certain that the webhook listed for CircleCI is flat-out wrong. CircleCI must have changed its implementation since that was written, because if you don't explicitly include the payload with the build number/workflow ID, Coveralls doesn't report anything.

afinetooth added coverage-merging parallel-builds labels Apr 6, 2020

nickmerwin closed this as completed Apr 10, 2020

afinetooth added circle-ci documentation labels Apr 13, 2020

afinetooth assigned nickmerwin Apr 14, 2020

afinetooth added this to High Priority in Priority Issues Apr 14, 2020

fniephaus added a commit to hpi-swa/smalltalkCI that referenced this issue Aug 5, 2020

Set service_job_number instead of flag_name

3586359

See lemurheavy/coveralls-public#1341 (comment)

dhaspden mentioned this issue Sep 18, 2020

Fix issue with CircleCI parallel workflows not picking up separate builds parroty/excoveralls#228

Merged
