Remove the wait for GITC response #7

Closed
frankinspace opened this issue Oct 3, 2023 · 10 comments · Fixed by #30
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

Comments

@frankinspace
Member

After some internal discussions, PO.DAAC feels that we should just eliminate the “wait for response” step entirely. We will still process GITC responses but it will be “out-of-band” and not tied to the Cumulus post processing workflow. This will have a few consequences:

  1. The “Browse Image Workflow” cumulus post-processing workflow will be marked as “Complete” once we have successfully placed an outgoing CNM message onto the GITC queue.
  2. Messages sent to GITC will no longer be “throttled” on the PO.DAAC end. Right now we limit concurrent step function executions to 2,000, which includes step functions waiting for responses. This means there are only ever 2,000 granules “in flight” at any given time (note: for OPERA, 1 granule can equate to tens of images sent to GITC, so 2,000 granules might mean 20,000 “in flight” images). If we remove the “wait for response”, PO.DAAC will more or less be sending to GITC at the same rate OPERA is sending to PO.DAAC.
  3. GITC responses will just be placed in an S3 bucket in the PO.DAAC account. Audits would have to be run by matching the object key, so we would want to make sure the CNM “identifier” is something we can use to trace back to the original granule that produced the image (see the sketch below).

The main question I want to explore (given item #2) is: do we feel that GITC can handle an even higher throughput of messages if we remove the waiting?
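As a rough illustration of item 3, here is a minimal sketch of what a traceable CNM identifier and the matching audit key could look like. The delimiter, field order, and bucket layout below are assumptions for illustration only, not the actual bignbit convention:

```python
# Hypothetical sketch only: delimiter, field order, and key layout are
# assumptions, not the actual bignbit/PO.DAAC convention.
def build_identifier(collection: str, granule_id: str, image_filename: str) -> str:
    """Build a CNM identifier that can be traced back to the source granule."""
    return f"{collection}__{granule_id}__{image_filename}"

def audit_key(identifier: str) -> str:
    """Derive the S3 object key under which the GITC response would be audited."""
    collection, granule_id, image_filename = identifier.split("__", 2)
    return f"gitc-responses/{collection}/{granule_id}/{image_filename}.cnm-r.json"

# Example: a response keyed this way can be matched back to its granule.
print(audit_key(build_identifier("OPERA_L3_DSWX-HLS", "EXAMPLE_GRANULE_ID", "image-1.png")))
```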

@frankinspace added the enhancement, help wanted, and question labels on Oct 3, 2023
@mattp0

mattp0 commented Oct 3, 2023

This is something we were also trying to come up with a good design for.

One thought I had was to have the CNM-R from GITC trigger another workflow with the same granule ID, so in Cumulus you would have multiple workflows attached to a single granule. It was brought up in Cumulus office hours that this approach is problematic for the way Cumulus currently handles multi-workflow granules: there is no way to decouple the nominal ingest workflow's granule status from something like the GIBS workflow at the granule level in the RDS.

We could use the granule ID with an additional suffix, like GRANULE_ID-gitc and GRANULE_ID-gitc-response, which would let you search the Cumulus dashboard for GRANULE_ID and get 3 matching granules, each with its own status.
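A toy sketch of that suffix convention (the base granule ID below is made up):

```python
# Toy sketch of the suffix convention proposed above; the suffixes follow the
# comment, the example granule ID is made up.
def derived_granule_ids(granule_id: str) -> list[str]:
    return [
        granule_id,                     # nominal ingest workflow record
        f"{granule_id}-gitc",           # record for the workflow sending to GITC
        f"{granule_id}-gitc-response",  # record for the GITC response workflow
    ]

# A dashboard search for the base granule ID would match all three records,
# each carrying its own workflow status.
print(derived_granule_ids("EXAMPLE_GRANULE_ID"))
```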

@frankinspace
Member Author

frankinspace commented Oct 3, 2023

Does the Cumulus dashboard pull in all step function workflows like that, or is there another integration mechanism we need to know about?

For example, if I just go into the AWS console, create a new step function workflow, and execute it, does that execution show up in the Cumulus dashboard? I kind of assumed we'd also have to create an entry in Cumulus' database somewhere.

Also, our operations team has had the same complaint about decoupling the ingest workflow status from the status of the post-processing workflows.

@mattp0

mattp0 commented Oct 3, 2023

As long as steps are returning the Cumulus granule object, the dashboard will pull the outcome of the workflow as the status of the granule. We are returning the granules object in every step so that the CMA picks it up and populates RDS.
I think this is a bit of a hacky workaround, but you end up with 3 granule records that reference one granule across 3 different workflows.
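For reference, a minimal sketch of that pass-through pattern, assuming the cumulus-message-adapter-python helper (run_cumulus_task); the exact import and event shape should be checked against the Cumulus docs:

```python
# Minimal sketch assuming the cumulus-message-adapter-python helper; verify the
# import path and event shape against the Cumulus documentation.
from run_cumulus_task import run_cumulus_task

def task(event, context):
    # Return the granules object unchanged so the CMA writes it back into the
    # Cumulus message and the granule status is populated in RDS.
    return {"granules": event["input"].get("granules", [])}

def lambda_handler(event, context):
    return run_cumulus_task(task, event, context)
```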

@frankinspace
Member Author

Ah OK, just so I make sure I understand: you're saying that if the workflow triggered by the CNM-R from GITC had steps where each step is a lambda implementing the CMA library, the Cumulus integration is taken care of for us?

@mattp0

mattp0 commented Oct 3, 2023

Yes, and to keep the granule from messing up the nominal ingest status we could make the granuleId slightly different so that it can be associated with a unique workflow execution in the dashboard. Ideally it's a stop-gap until they fix the operational issue of tracking a granule's status across many workflows.

Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

We are still working on integrating OPERA. What sort of response times are you seeing from GITC for CNM-R? I see you said you are up to 2,000 concurrent (the wait time has to be less than the 10 minute lambda timeout, right?)

@frankinspace
Member Author

frankinspace commented Oct 3, 2023

> Yes, and to keep the granule from messing up the nominal ingest status we could make the granuleId slightly different so that it can be associated with a unique workflow execution in the dashboard. Ideally it's a stop-gap until they fix the operational issue of tracking a granule's status across many workflows.

Got it, we were planning on just throwing the entire CNM-R JSON response into the cumulus-internal S3 bucket, keyed by collection/granule/uuid or something similar. This would take it one step further and make it available in the Cumulus dashboard, which I agree would be nice for operations.
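Roughly what we had in mind, as a sketch (bucket name and key layout are placeholders):

```python
# Rough sketch only: bucket name and key layout are placeholders, not the
# final convention.
import json
import uuid
import boto3

s3 = boto3.client("s3")

def save_cnm_r(cnm_r: dict, collection: str, granule_id: str) -> str:
    """Drop the raw CNM-R JSON into the internal bucket keyed by collection/granule/uuid."""
    key = f"cnm-r/{collection}/{granule_id}/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket="cumulus-internal",  # placeholder bucket name
        Key=key,
        Body=json.dumps(cnm_r).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```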

> Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

I don't understand your question.

> We are still working on integrating OPERA. What sort of response times are you seeing from GITC for CNM-R? I see you said you are up to 2,000 concurrent (the wait time has to be less than the 10 minute lambda timeout, right?)

According to our Kibana metrics, the average time to process 1 granule (over the last 30 days, ~1.27 million images total) has been 21 minutes. The concurrency limit is set as described in throttling-queued-executions, and it tracks the number of workflows submitted for execution. We currently have the limit set to 2,000 for the BrowseImageGeneration post-processing workflow. Currently we allow up to 24 hours before we time out the response from GITC. It is not limited by the lambda timeout because we are using step function task tokens to cause the workflow to enter a "wait for token" state.
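For context, the current wait is released when the lambda that receives the GITC CNM-R reports back to Step Functions using the task token that was sent out alongside the CNM message. A minimal sketch (the CNM-R field names are assumed from the CNM message spec; the handler itself is illustrative):

```python
# Illustrative sketch of releasing a "wait for token" state from the CNM-R
# handler; how the task token is routed to this handler is not shown.
import json
import boto3

sfn = boto3.client("stepfunctions")

def handle_gitc_response(cnm_r: dict, task_token: str) -> None:
    # A SUCCESS response lets the waiting workflow continue; anything else
    # fails the waiting state so the error/timeout path runs instead.
    if cnm_r.get("response", {}).get("status") == "SUCCESS":
        sfn.send_task_success(taskToken=task_token, output=json.dumps(cnm_r))
    else:
        sfn.send_task_failure(taskToken=task_token, error="GITCError",
                              cause=json.dumps(cnm_r))
```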

@mattp0

mattp0 commented Oct 4, 2023

> Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

> I don't understand your question.

> After some internal discussions, PO.DAAC feels that we should just eliminate the “wait for response” step entirely.

Sorry, I was proposing that the CNM-R approach described, with a new workflow that handles GITC responses, could be a solution for keeping the wait-for-response functionality without it being part of bignbit.

> According to our Kibana metrics, the average time to process 1 granule (over the last 30 days, ~1.27 million images total) has been 21 minutes. The concurrency limit is set as described in throttling-queued-executions, and it tracks the number of workflows submitted for execution. We currently have the limit set to 2,000 for the BrowseImageGeneration post-processing workflow. Currently we allow up to 24 hours before we time out the response from GITC. It is not limited by the lambda timeout because we are using step function task tokens to cause the workflow to enter a "wait for token" state.

I see, I was confused about how the system was handling concurrency! I will check this out; we have not explored throttling-queued-executions or step function task tokens.

@frankinspace
Member Author

> Sorry, I was proposing that the CNM-R approach described, with a new workflow that handles GITC responses, could be a solution for keeping the wait-for-response functionality without it being part of bignbit.

I understand. Your suggestion of starting another workflow on response is a good idea, but it does not directly address the reason why we want to eliminate the "wait for token" step.

The reason we want to eliminate the wait step is that it couples the DAAC's ingest process to the status of the GITC system. What I mean by that is: if GITC processing is having an issue or downtime, we don't want the DAAC processing to get backed up enough that messages start getting lost. We experienced that a few times during our initial deployments with OPERA data. By removing the wait, the browse image workflow is considered "complete" once a message is successfully put onto the GITC queue.

@frankinspace
Member Author

frankinspace commented May 2, 2024

Plan we will be implementing during 24.2:

  • Image set name will be updated to include 3 parts: the image filename, the dataday, and the granule concept id
  • CNM message sent to GIBS will use the image set name as the identifier in the message
  • CNM message sent to GIBS will be uploaded to S3 in the same location the CNM-R message is saved
  • CNM-R received from GIBS will include the same identifier. Upon receipt, bignbit will parse the identifier to look up the granule in order to place the response in the correct location within the audit bucket (see the sketch after this list)
  • Filename of CNM and CNM-R will be similar such that they appear "next to" each other during directory listings
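A rough sketch of how that identifier could be built and parsed on receipt (the delimiter and field order are assumptions, not the final implementation):

```python
# Rough sketch only: delimiter and field order are assumptions, not the final
# bignbit implementation.
def build_image_set_name(image_filename: str, dataday: str, concept_id: str) -> str:
    return f"{image_filename}_{dataday}_{concept_id}"

def parse_image_set_name(identifier: str) -> dict:
    # Split from the right so underscores inside the image filename are preserved.
    image_filename, dataday, concept_id = identifier.rsplit("_", 2)
    return {
        "image_filename": image_filename,
        "dataday": dataday,
        "granule_concept_id": concept_id,
    }
```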

@frankinspace
Member Author

Testing changes in SIT, ready soon
