Remove the wait for GITC response #7

Closed
frankinspace opened this issue Oct 3, 2023 · 10 comments · Fixed by #30
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)

Comments

@frankinspace
Member

After some internal discussions, PO.DAAC feels that we should just eliminate the “wait for response” step entirely. We will still process GITC responses but it will be “out-of-band” and not tied to the Cumulus post processing workflow. This will have a few consequences:

  1. The “Browse Image Workflow” cumulus post-processing workflow will be marked as “Complete” once we have successfully placed an outgoing CNM message onto the GITC queue.
  2. Messages sent to GITC will no longer be “throttled” on the PO.DAAC end. Right now we limit concurrent step function executions to 2,000, which includes step functions waiting for responses. This means there are only ever 2,000 granules “in flight” at any given time (note: for OPERA, 1 granule can equate to tens of images sent to GITC, so 2,000 granules might mean 20,000 “in flight” images). If we remove the “wait for response”, PO.DAAC will more or less be sending to GITC at the same rate OPERA is sending to PO.DAAC.
  3. GITC responses will just be placed in an S3 bucket in the PO.DAAC account. Audits would have to be run by matching the object key, so we would want to make sure the CNM “identifier” is something we can use to trace back to the original granule that produced the image (see the sketch below).

The main question I want to explore (given item #2) is: do we feel that GITC can handle an even higher throughput of messages if we remove the waiting?
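As a rough illustration of item 3, here is a minimal sketch of what a traceable CNM identifier and the matching audit key could look like. The delimiter, field order, and bucket layout below are assumptions for illustration only, not the actual bignbit convention:

```python
# Hypothetical sketch only: delimiter, field order, and key layout are
# assumptions, not the actual bignbit/PO.DAAC convention.
def build_identifier(collection: str, granule_id: str, image_filename: str) -> str:
    """Build a CNM identifier that can be traced back to the source granule."""
    return f"{collection}__{granule_id}__{image_filename}"

def audit_key(identifier: str) -> str:
    """Derive the S3 object key under which the GITC response would be audited."""
    collection, granule_id, image_filename = identifier.split("__", 2)
    return f"gitc-responses/{collection}/{granule_id}/{image_filename}.cnm-r.json"

# Example: a response keyed this way can be matched back to its granule.
print(audit_key(build_identifier("OPERA_L3_DSWX-HLS", "EXAMPLE_GRANULE_ID", "image-1.png")))
```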

@frankinspace added the enhancement, help wanted, and question labels on Oct 3, 2023
@mattp0

mattp0 commented Oct 3, 2023

This is something we were also trying to come up with a good design for.

One thought I had was to have the CNM-R from GITC trigger another workflow with the same granule ID, so in Cumulus you would have multiple workflows attached to a single granule. It was brought up in Cumulus office hours that this approach is problematic for the way Cumulus currently handles multi-workflow granules: there is no way to decouple the nominal ingest workflow's granule status from something like the GIBS workflow at the granule level in the RDS.

We could use the granule ID with an additional suffix, like GRANULE_ID-gitc and GRANULE_ID-gitc-response, which would let you search the Cumulus dashboard for GRANULE_ID and get 3 matching granules, each with its own status.
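A toy sketch of that suffix convention (the base granule ID below is made up):

```python
# Toy sketch of the suffix convention proposed above; the suffixes follow the
# comment, the example granule ID is made up.
def derived_granule_ids(granule_id: str) -> list[str]:
    return [
        granule_id,                     # nominal ingest workflow record
        f"{granule_id}-gitc",           # record for the workflow sending to GITC
        f"{granule_id}-gitc-response",  # record for the GITC response workflow
    ]

# A dashboard search for the base granule ID would match all three records,
# each carrying its own workflow status.
print(derived_granule_ids("EXAMPLE_GRANULE_ID"))
```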

@frankinspace
Member Author

frankinspace commented Oct 3, 2023

Does the Cumulus dashboard pull in all step function workflows like that, or is there another integration mechanism we need to know about?

For example, if I just go into the AWS console, create a new step function workflow, and execute it, does that execution show up in the Cumulus dashboard? I kind of assumed we'd also have to create an entry in Cumulus' database somewhere.

Also, our operations team has had the same complaint about decoupling the ingest workflow status from the status of the post-processing workflows.

@mattp0

mattp0 commented Oct 3, 2023

As long as steps are returning the Cumulus granule object, the dashboard will pull the outcome of the workflow as the status of the granule. We are returning the granules object in every step so that the CMA picks it up and populates RDS.
I think this is a bit of a hacky workaround, but you end up with 3 granule records that reference one granule across 3 different workflows.
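For reference, a minimal sketch of that pass-through pattern, assuming the cumulus-message-adapter-python helper (run_cumulus_task); the exact import and event shape should be checked against the Cumulus docs:

```python
# Minimal sketch assuming the cumulus-message-adapter-python helper; verify the
# import path and event shape against the Cumulus documentation.
from run_cumulus_task import run_cumulus_task

def task(event, context):
    # Return the granules object unchanged so the CMA writes it back into the
    # Cumulus message and the granule status is populated in RDS.
    return {"granules": event["input"].get("granules", [])}

def lambda_handler(event, context):
    return run_cumulus_task(task, event, context)
```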

@frankinspace
Member Author

Ah OK, just so I make sure I understand: you're saying that if the workflow triggered by the CNM-R from GITC had steps where each step is a lambda implementing the CMA library, the Cumulus integration is taken care of for us?

@mattp0

mattp0 commented Oct 3, 2023

Yes, and to keep the granule from messing up the nominal ingest status we could make the granuleId slightly different so that it can be associated with a unique workflow execution in the dashboard. Ideally it's a stop-gap until they fix the operational issue of tracking a granule's status across many workflows.

Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

We are still working on integrating OPERA. What sort of response times are you seeing from GITC for CNM-R? I see you said you are up to 2,000 concurrent (the wait time has to be less than the 10 minute lambda timeout, right?)

@frankinspace
Member Author

frankinspace commented Oct 3, 2023

> Yes, and to keep the granule from messing up the nominal ingest status we could make the granuleId slightly different so that it can be associated with a unique workflow execution in the dashboard. Ideally it's a stop-gap until they fix the operational issue of tracking a granule's status across many workflows.

Got it, we were planning on just throwing the entire CNM-R JSON response into the cumulus-internal S3 bucket, keyed by collection/granule/uuid or something similar. This would take it one step further and make it available in the Cumulus dashboard, which I agree would be nice for operations.
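Roughly what we had in mind, as a sketch (bucket name and key layout are placeholders):

```python
# Rough sketch only: bucket name and key layout are placeholders, not the
# final convention.
import json
import uuid
import boto3

s3 = boto3.client("s3")

def save_cnm_r(cnm_r: dict, collection: str, granule_id: str) -> str:
    """Drop the raw CNM-R JSON into the internal bucket keyed by collection/granule/uuid."""
    key = f"cnm-r/{collection}/{granule_id}/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket="cumulus-internal",  # placeholder bucket name
        Key=key,
        Body=json.dumps(cnm_r).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```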

> Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

I don't understand your question.

> We are still working on integrating OPERA. What sort of response times are you seeing from GITC for CNM-R? I see you said you are up to 2,000 concurrent (the wait time has to be less than the 10 minute lambda timeout, right?)

According to our Kibana metrics, the average time to process 1 granule (over the last 30 days, ~1.27 million images total) has been 21 minutes. The concurrency limit is set as described in throttling-queued-executions, and it tracks the number of workflows submitted for execution. We currently have the limit set to 2,000 for the BrowseImageGeneration post-processing workflow. Currently we allow up to 24 hours before we time out the response from GITC. It is not limited by the lambda timeout because we are using step function task tokens to cause the workflow to enter a "wait for token" state.
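For context, the current wait is released when the lambda that receives the GITC CNM-R reports back to Step Functions using the task token that was sent out alongside the CNM message. A minimal sketch (the CNM-R field names are assumed from the CNM message spec; the handler itself is illustrative):

```python
# Illustrative sketch of releasing a "wait for token" state from the CNM-R
# handler; how the task token is routed to this handler is not shown.
import json
import boto3

sfn = boto3.client("stepfunctions")

def handle_gitc_response(cnm_r: dict, task_token: str) -> None:
    # A SUCCESS response lets the waiting workflow continue; anything else
    # fails the waiting state so the error/timeout path runs instead.
    if cnm_r.get("response", {}).get("status") == "SUCCESS":
        sfn.send_task_success(taskToken=task_token, output=json.dumps(cnm_r))
    else:
        sfn.send_task_failure(taskToken=task_token, error="GITCError",
                              cause=json.dumps(cnm_r))
```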

@mattp0

mattp0 commented Oct 4, 2023

> Could we have collections and a rule set up to trigger when a CNM-R comes to this new GITC workflow, which would be the GITC-Response decoupling of the wait for response from bignbit?

> I don't understand your question.

> After some internal discussions, PO.DAAC feels that we should just eliminate the “wait for response” step entirely.

Sorry, I was proposing that the CNM-R approach described, with a new workflow that handles GITC responses, could be a solution for keeping the wait-for-response functionality without it being part of bignbit.

> According to our Kibana metrics, the average time to process 1 granule (over the last 30 days, ~1.27 million images total) has been 21 minutes. The concurrency limit is set as described in throttling-queued-executions, and it tracks the number of workflows submitted for execution. We currently have the limit set to 2,000 for the BrowseImageGeneration post-processing workflow. Currently we allow up to 24 hours before we time out the response from GITC. It is not limited by the lambda timeout because we are using step function task tokens to cause the workflow to enter a "wait for token" state.

I see, I was confused about how the system was handling concurrency! I will check this out; we have not explored throttling-queued-executions or step function task tokens.

@frankinspace
Member Author

> Sorry, I was proposing that the CNM-R approach described, with a new workflow that handles GITC responses, could be a solution for keeping the wait-for-response functionality without it being part of bignbit.

I understand. Your suggestion of starting another workflow on response is a good idea, but it does not directly address the reason why we want to eliminate the "wait for token" step.

The reason we want to eliminate the wait step is that it couples the DAAC's ingest process to the status of the GITC system. What I mean by that is: if GITC processing is having an issue or downtime, we don't want the DAAC processing to get backed up enough that messages start getting lost. We experienced that a few times during our initial deployments with OPERA data. By removing the wait, the browse image workflow is considered "complete" once a message is successfully put onto the GITC queue.

@frankinspace
Member Author

frankinspace commented May 2, 2024

Plan we will be implementing during 24.2:

  • Image set name will be updated to include 3 parts: the image filename, the dataday, and the granule concept id
  • CNM message sent to GIBS will use the image set name as the identifier in the message
  • CNM message sent to GIBS will be uploaded to S3 in the same location the CNM-R message is saved
  • CNM-R received from GIBS will include the same identifier. Upon receipt, bignbit will parse the identifier to look up the granule in order to place the response in the correct location within the audit bucket (see the sketch after this list)
  • Filename of CNM and CNM-R will be similar such that they appear "next to" each other during directory listings
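A rough sketch of how that identifier could be built and parsed on receipt (the delimiter and field order are assumptions, not the final implementation):

```python
# Rough sketch only: delimiter and field order are assumptions, not the final
# bignbit implementation.
def build_image_set_name(image_filename: str, dataday: str, concept_id: str) -> str:
    return f"{image_filename}_{dataday}_{concept_id}"

def parse_image_set_name(identifier: str) -> dict:
    # Split from the right so underscores inside the image filename are preserved.
    image_filename, dataday, concept_id = identifier.rsplit("_", 2)
    return {
        "image_filename": image_filename,
        "dataday": dataday,
        "granule_concept_id": concept_id,
    }
```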

@frankinspace
Member Author

Testing changes in SIT, ready soon
