Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUMULUS-3223: Fix failed granule stuck in queued #3373

Merged
merged 15 commits into from
May 18, 2023
Merged

Conversation

jennyhliu
Copy link
Contributor

@jennyhliu jennyhliu commented May 9, 2023

Summary: Summary of changes

Addresses CUMULUS-3223 Granule status stuck in 'queued'

Changes

  • Detailed list or prose of changes
  • ...

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

Copy link
Member

@Jkovarik Jkovarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI Jenny - this looks really good, I have a question I'd like to discuss before this getting merged re: the update to the redrive on the sqs lambda, the rest are just informational questions.

Copy link
Member

@Jkovarik Jkovarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This updates looks really good @jennyhliu thanks for addressing my initial concern. There are a couple of additional comments/concerns I feel are worth talking through, let me know what you think!

CHANGELOG.md Outdated

- **CUMULUS-3223**
- Update `@cumulus/cmrjs/cmr-utils.getGranuleTemporalInfo` to handle the error when the cmr file s3url is not available
- Update `sfEventSqsToDbRecords` lambda to return partial batch failure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we might want to spell this out a bit, it's not going to be entirely transparent to end users what this means, but it likely matters to integrators who've been working with the DLQ

const executionEvent = parseSQSMessageBody(message);
const cumulusMessage = await getCumulusMessageFromExecutionEvent(executionEvent);

let reprocessFailed = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused re: what this has to do with reprocessing - if I'm reading this right and it's more about processing the execution event for a cumulus message, this should probably be something like:

Suggested change
let reprocessFailed = true;
let processedExecutionEventMessage = false;

and have the below assertion negated for readability. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or .... perhaps remove the conditional logic and instead do something like:

  await Promise.all(sqsMessages.map(async (message) => {
    let cumulusMessage;
    const executionEvent = parseSQSMessageBody(message);
    try {
      cumulusMessage = await getCumulusMessageFromExecutionEvent(executionEvent);
    } catch (error) {
      log.error(`Writing message failed: ${JSON.stringify(message)} on getting message from execution event`, error);
      return batchItemFailures.push({ itemIdentifier: message.messageId });
    }
    try {
      return await writeRecords({ ...event, cumulusMessage, knex });
    } catch (error) {
      log.error(`Writing message failed: ${JSON.stringify(message)})`, error);
      return sendSQSMessage(process.env.DeadLetterQueue, message);
    }
  }));

@@ -147,6 +147,7 @@ resource "aws_sqs_queue_policy" "sf_event_sqs_to_db_records_input_queue_policy"
resource "aws_lambda_event_source_mapping" "sf_event_sqs_to_db_records_mapping" {
event_source_arn = aws_sqs_queue.sf_event_sqs_to_db_records_input_queue.arn
function_name = aws_lambda_function.sf_event_sqs_to_db_records.arn
function_response_types = ["ReportBatchItemFailures"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs related to this are at https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html under Implementing partial batch responses

handlerResponse,
} = await runHandler({
...t.context,
cumulusMessage: null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

This does result in the method under test throwing in the proxyquire mock, and validates the unit case, but does this change warrant a fully integrated unit test or a lambda-invocation integration test
to show something like a message with a granule that should succeed, one that failed on writeGranules and one that failed on getExecutionEvent do the right thing given a single execution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jkovarik I have addressed all your comments under e71f482

const executionEvent = parseSQSMessageBody(message);
const cumulusMessage = await getCumulusMessageFromExecutionEvent(executionEvent);

try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment on review - in cases where there's a mangled SQS message, we're still sending the entire batch back. That seems unlikely in most operational situations, but it's a legit corner case. This is an improvement in either case, so it's not going to hold up the PR, but might be worth considering.

try {
return await writeRecords({ ...event, cumulusMessage, knex });
} catch (error) {
log.error(`Writing message failed: ${JSON.stringify(message)}`, error);
return sendSQSMessage(process.env.DeadLetterQueue, message);
}
}));

return { batchItemFailures };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have units covering this is what the function returns, however we don't have a test, in integration, that shows on getCumulusMessageFromExecutionEvent failure, the outputs map to the correct convention and the system does what we expect. I'm fine with the results of our discussion in the Arch meeting re: performance testing being outside the scope of this ticket.

Copy link
Member

@Jkovarik Jkovarik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Jenny! This looks good, thanks for addressing my concerns.

@jennyhliu jennyhliu merged commit 0d4bf26 into master May 18, 2023
@jennyhliu jennyhliu deleted the CUMULUS-3223 branch May 18, 2023 21:18
jennyhliu added a commit that referenced this pull request May 31, 2023
* CUMULUS-3223:Fix failed granule stuck in queued

* skip sqsQueueExists test

* update getGranuleTemporalInfo

* update test match schema

* update getGranuleTemporalInfo

* remove extra await

* remove skip sqsQueueExists

* update `@cumulus/cumulus-message-adapter-js` to `2.0.5`

* update sfEventSqsToDbRecords to return partial batch failure

* fix typo

* handle process error seperately, multiple message test

* update test to process multiple messages
jennyhliu added a commit that referenced this pull request Jun 5, 2023
* CUMULUS-3223:Fix failed granule stuck in queued

* skip sqsQueueExists test

* update getGranuleTemporalInfo

* update test match schema

* update getGranuleTemporalInfo

* remove extra await

* remove skip sqsQueueExists

* update `@cumulus/cumulus-message-adapter-js` to `2.0.5`

* update sfEventSqsToDbRecords to return partial batch failure

* fix typo

* handle process error seperately, multiple message test

* update test to process multiple messages
jennyhliu added a commit that referenced this pull request Jun 23, 2023
* CUMULUS-3223:Fix failed granule stuck in queued

* skip sqsQueueExists test

* update getGranuleTemporalInfo

* update test match schema

* update getGranuleTemporalInfo

* remove extra await

* remove skip sqsQueueExists

* update `@cumulus/cumulus-message-adapter-js` to `2.0.5`

* update sfEventSqsToDbRecords to return partial batch failure

* fix typo

* handle process error seperately, multiple message test

* update test to process multiple messages
jennyhliu added a commit that referenced this pull request Jun 23, 2023
npauzenga added a commit that referenced this pull request Jul 14, 2023
* [CUMULUS-3226] Remove refs to async operations dynamo table (#3331)

* Remove refs to async operations dynamo table

* add changelog entry

* Remove additional async operation table refs

* Update process_dead_letter_archive role

* CUMULUS-3053: removing references to dynamodb from docs (#3322)

* CUMULUS-3053: removing references to dynamodb from docs

* resolving lint errors

* CUMULUS-3285: Updated isAuthBearTokenRequest to handle non-Bearer authorization header (merge to master) (#3350) (#3352)

* CUMULUS-3285: Updated isAuthBearTokenRequest to handle non-Bearer authorization header (#3341)

* Jk rel16/cumulus 3079 (#3357)

* Update endpoints/granules :granuleName parameter to :granuleId (#3333)

* Move PUT logic to PATCH.  Implement error for PUT endpoint

* Add put units/refactor

* Update API client to use PATCH protocol

cumulus's api is moving all existing PUT requests to PATCH.  This
commit updates the api-client to use those, adds 'PATCH' to typings
and updates the test fixtures for the thin unit tests.

* Fix broken API unit

* Update CHANGELOG

* Fix naming/minor refactor

* Fix Dynamodb granule.files default bug

* Fix elasticsearch default array write value

* Update/add PUT endpoint and relevant tests

* Update CHANGELOG

* Update unit test to take advantage of fixed unit helper

* Update API version

* Add api-client method to use updated PUT endpoint

* Add fixed annotations to CHANGELOG

* Update API version unit

* Minor comment refactor

* Add PUT API spec test

* Fix incorrect test label

* Add missing unit tests

* Add removed unit tests

* Jk/cumulus 3072 add header check (#3257)

* Add API version compliance checks to app middleware

* Update middleware/add test defaults/fix typing

* Fix broken unit tests

* cleanup

* Add middleware units

* Add endpoint tests to validate middleware

* Fix unit test

* Remove unneeded type exports

* Remove unneeded test

* Remove async

* Minor formatting fix/changeset reduction

* Add missing ts-check

* Update CHANGELOG

* Fix bad import

* Update 'version' header to 'Cumulus-API-Version'

* Respond to PR feedback

* validateApiCompliance -> requireApiVersion

* Update packages/api/lib/request.js

Co-authored-by: Marc <marc@element84.com>

* Fix isMinVersionApi expression

* Update granule units/fix prior test fixtures

* Update packages/api/app/middleware.js

Co-authored-by: Marc <marc@element84.com>

* Fix spacing

* Add api-client required headers

* Update header restriction with new name, fix api-package header
inclusions

* Fix broken unit

* Bring in fix from #3270

* Update packages/api/app/middleware.js

Co-authored-by: Marc <marc@element84.com>

* isNumber -> Number.isFinite

---------

Co-authored-by: Marc <marc@element84.com>

* Update PUT to use collectionId endpoint *only*

* Update CHANGELOG

* Fix api-client to call right PUT endpoint

* Remove unneeded log output

* Update tests, add granuleId parameter updates

* Fix bad logic assertion

* Update endpoints/granules :granuleName parameter to :granuleId

* Fix lint 🔔

* Fix lint

* Fix merged unit test

* Fix merged comments

---------

Co-authored-by: Marc <marc@element84.com>

* Reorder/updated CHANGELOG

* Minor P3 release edits

---------

Co-authored-by: Marc <marc@element84.com>

* Cumulus 3120 bugfix (#3364)

* [CUMULUS-3226] Remove refs to async operations dynamo table (#3331)

* Remove refs to async operations dynamo table

* add changelog entry

* Remove additional async operation table refs

* Update process_dead_letter_archive role

* Update endpoints/granules :granuleName parameter to :granuleId (#3333)

* Move PUT logic to PATCH.  Implement error for PUT endpoint

* Add put units/refactor

* Update API client to use PATCH protocol

cumulus's api is moving all existing PUT requests to PATCH.  This
commit updates the api-client to use those, adds 'PATCH' to typings
and updates the test fixtures for the thin unit tests.

* Fix broken API unit

* Update CHANGELOG

* Fix naming/minor refactor

* Fix Dynamodb granule.files default bug

* Fix elasticsearch default array write value

* Update/add PUT endpoint and relevant tests

* Update CHANGELOG

* Update unit test to take advantage of fixed unit helper

* Update API version

* Add api-client method to use updated PUT endpoint

* Add fixed annotations to CHANGELOG

* Update API version unit

* Minor comment refactor

* Add PUT API spec test

* Fix incorrect test label

* Add missing unit tests

* Add removed unit tests

* Jk/cumulus 3072 add header check (#3257)

* Add API version compliance checks to app middleware

* Update middleware/add test defaults/fix typing

* Fix broken unit tests

* cleanup

* Add middleware units

* Add endpoint tests to validate middleware

* Fix unit test

* Remove unneeded type exports

* Remove unneeded test

* Remove async

* Minor formatting fix/changeset reduction

* Add missing ts-check

* Update CHANGELOG

* Fix bad import

* Update 'version' header to 'Cumulus-API-Version'

* Respond to PR feedback

* validateApiCompliance -> requireApiVersion

* Update packages/api/lib/request.js

Co-authored-by: Marc <marc@element84.com>

* Fix isMinVersionApi expression

* Update granule units/fix prior test fixtures

* Update packages/api/app/middleware.js

Co-authored-by: Marc <marc@element84.com>

* Fix spacing

* Add api-client required headers

* Update header restriction with new name, fix api-package header
inclusions

* Fix broken unit

* Bring in fix from #3270

* Update packages/api/app/middleware.js

Co-authored-by: Marc <marc@element84.com>

* isNumber -> Number.isFinite

---------

Co-authored-by: Marc <marc@element84.com>

* Update PUT to use collectionId endpoint *only*

* Update CHANGELOG

* Fix api-client to call right PUT endpoint

* Remove unneeded log output

* Update tests, add granuleId parameter updates

* Fix bad logic assertion

* Update endpoints/granules :granuleName parameter to :granuleId

* Fix lint 🔔

* Fix lint

* Fix merged unit test

* Fix merged comments

---------

Co-authored-by: Marc <marc@element84.com>

* CUMULUS-3053: removing references to dynamodb from docs (#3322)

* CUMULUS-3053: removing references to dynamodb from docs

* resolving lint errors

* CUMULUS-3285: Updated isAuthBearTokenRequest to handle non-Bearer authorization header (merge to master) (#3350)

* CUMULUS-3285: Updated isAuthBearTokenRequest to handle non-Bearer authorization header (#3341)

* CUMULUS-3121/3120 (#3360)

* CUMULUS-3121/3120 v15.1.0 backport (#3346)

* backport PR

* finalizing docs and changing remaining groups

* PR feedback + testing changes

* PR feedback

* reverting change

* Revert "update changelog"

This reverts commit ae4627c.

* reverting changes

* PR feedback

* removing EgressLambda from doc

* Update xml2js 0.4.22->0.5 strict (#3330) (#3339)

* Update xml2js 0.4.22->0.5 strict

* Address GHSA-776f-qx25-q3cc/update allow list

* Update CHANGELOG

* Update package pins to 0.5.0 for xmljs

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback

* PR feedback

* PR feedback

* fixing documentation linting

* PR feedback

* PR feedback

* PR feedback

* adding variables to tf-modules/workflow

* PR feedback

* PR feedback

* reverting previous change

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback (docs+changelog)

* fixing docs

* undoing commit

* PR feedback

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* CUMULUS-3121/3120 (#3360)

* CUMULUS-3121/3120 v15.1.0 backport (#3346)

* backport PR

* finalizing docs and changing remaining groups

* PR feedback + testing changes

* PR feedback

* reverting change

* Revert "update changelog"

This reverts commit ae4627c.

* reverting changes

* PR feedback

* removing EgressLambda from doc

* Update xml2js 0.4.22->0.5 strict (#3330) (#3339)

* Update xml2js 0.4.22->0.5 strict

* Address GHSA-776f-qx25-q3cc/update allow list

* Update CHANGELOG

* Update package pins to 0.5.0 for xmljs

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback

* PR feedback

* PR feedback

* fixing documentation linting

* PR feedback

* PR feedback

* PR feedback

* adding variables to tf-modules/workflow

* PR feedback

* PR feedback

* reverting previous change

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback (docs+changelog)

* fixing docs

* undoing commit

* PR feedback

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* CUMULUS-3121/3120 (#3360)

* CUMULUS-3121/3120 v15.1.0 backport (#3346)

* backport PR

* finalizing docs and changing remaining groups

* PR feedback + testing changes

* PR feedback

* reverting change

* Revert "update changelog"

This reverts commit ae4627c.

* reverting changes

* PR feedback

* removing EgressLambda from doc

* Update xml2js 0.4.22->0.5 strict (#3330) (#3339)

* Update xml2js 0.4.22->0.5 strict

* Address GHSA-776f-qx25-q3cc/update allow list

* Update CHANGELOG

* Update package pins to 0.5.0 for xmljs

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback

* PR feedback

* PR feedback

* fixing documentation linting

* PR feedback

* PR feedback

* PR feedback

* adding variables to tf-modules/workflow

* PR feedback

* PR feedback

* reverting previous change

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback (docs+changelog)

* fixing docs

* undoing commit

* PR feedback

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* CUMULUS-3121/3120 (#3360)

* CUMULUS-3121/3120 v15.1.0 backport (#3346)

* backport PR

* finalizing docs and changing remaining groups

* PR feedback + testing changes

* PR feedback

* reverting change

* Revert "update changelog"

This reverts commit ae4627c.

* reverting changes

* PR feedback

* removing EgressLambda from doc

* Update xml2js 0.4.22->0.5 strict (#3330) (#3339)

* Update xml2js 0.4.22->0.5 strict

* Address GHSA-776f-qx25-q3cc/update allow list

* Update CHANGELOG

* Update package pins to 0.5.0 for xmljs

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback

* PR feedback

* PR feedback

* fixing documentation linting

* PR feedback

* PR feedback

* PR feedback

* adding variables to tf-modules/workflow

* PR feedback

* PR feedback

* reverting previous change

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* PR feedback (docs+changelog)

* fixing docs

* undoing commit

* PR feedback

---------

Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* fixing CHANGELOG.md

* removing docs + updating changelog

* changing doc link

---------

Co-authored-by: Nate Pauzenga <npauzenga@gmail.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>
Co-authored-by: Marc <marc@element84.com>
Co-authored-by: nasamoduyebo <123400008+nasamoduyebo@users.noreply.github.com>
Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>

* CUMULUS-3243:Updated granule delete logic (#3338) (#3366) (#3367)

* CUMULUS-3243:Updated granule delete logic (#3338)

* CUMULUS-3243:Updated granule delete logic to delete granule which is not in DynamoDB

* add unit tests

* delete files not in master

* update CHANGELOG

* remove dynamodb from test

* Release 16.0.0 (#3376)

* update docs

* Version up to v16.0.0

* Minor edits/update release doc

* Add PI release version notes

* Minor CL edit

* Update CHANGELOG.md

Co-authored-by: Nate Pauzenga <npauzenga@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Nate Pauzenga <npauzenga@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Nate Pauzenga <npauzenga@gmail.com>

* Address PR feedback

---------

Co-authored-by: Nate Pauzenga <npauzenga@gmail.com>

* [CUMULUS-3172] Update data integrity/migration docs (#3387)

* update docs

* update docs

* add new doc to sidebars

* Update docs/upgrade-notes/rds-phase-3-data-migration-guidance.md

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* Update docs/upgrade-notes/rds-phase-3-data-migration-guidance.md

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* Update docs/upgrade-notes/rds-phase-3-data-migration-guidance.md

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* Update docs/upgrade-notes/rds-phase-3-data-migration-guidance.md

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

---------

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* Fix lint issues (#3392)

* Jk/cumulus 3307 (#3394)

* Backport CUMULUS-3307/PR #3386

* Fixup

* Re-issue v16 docs (#3400)

* CUMULUS-3223: Fix failed granule stuck in queued (#3373) (#3402)

* CUMULUS-3223:Fix failed granule stuck in queued

* skip sqsQueueExists test

* update getGranuleTemporalInfo

* update test match schema

* update getGranuleTemporalInfo

* remove extra await

* remove skip sqsQueueExists

* update `@cumulus/cumulus-message-adapter-js` to `2.0.5`

* update sfEventSqsToDbRecords to return partial batch failure

* fix typo

* handle process error seperately, multiple message test

* update test to process multiple messages

* Jk/cumulus 3315 (#3407)

* CUMULUS-3135 - Update integration test scripts to fail on test timeout (#3401)

* Update integration test scripts to fail on test timeout

* Fixup

* Fixup

* Update script interpreter for test runs

* Fix script

* Fixup

* Fixup

* Update timeout pass/fail conditional

* Jk/cumulus 3135 fix integration tests (#3403)

* Update integration test scripts to fail on test timeout

* Fixup

* Fixup

* Update script interpreter for test runs

* Fix script

* Fixup

* Fixup

* Update api/client and integration test usage of it to fix test failures

* Fix formatting/lint/etc

* Update test/minor fix

---------

Co-authored-by: etcart <37375117+etcart@users.noreply.github.com>

---------

Co-authored-by: etcart <37375117+etcart@users.noreply.github.com>

* v16.0.1 alpha release for testing (#3409)

* bump to 16.0.1-alpha.0

* resolve conflicts

---------

Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>

* Update serve.js to match main

* bump version

* Add missing method for erasing PG tables (#3419)

* add missing method for erasing PG tables

* Remove duplicate declaration

* Remove duplicate declaration

* Take serve.js and serveUtils.js from main

* Update schema endpoint to utilize P3 api/lib schema

---------

Co-authored-by: Jonathan Kovarik <jonathan.kovarik@gmail.com>

* Release 16.0.0 (#3428)

* version bump

* update changelog version

* changelog fixup

* changelog fixup

* update cumulus versions in new lambda

* bump cumulus versions

---------

Co-authored-by: nasamoduyebo <123400008+nasamoduyebo@users.noreply.github.com>
Co-authored-by: jennyhliu <34660846+jennyhliu@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <kovarik@nsidc.org>
Co-authored-by: Marc <marc@element84.com>
Co-authored-by: Naga Nages <66387215+Nnaga1@users.noreply.github.com>
Co-authored-by: etcart <37375117+etcart@users.noreply.github.com>
Co-authored-by: Jonathan Kovarik <jonathan.kovarik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants