
[feature] Add new field to store document reference for persistent storage #1833

Closed
pxp928 opened this issue Apr 11, 2024 · 8 comments · Fixed by #1844
Labels
enhancement New feature or request

Comments

@pxp928 (Collaborator) commented Apr 11, 2024

Is your feature request related to a problem? Please describe.
With the new addition of the blob store, GUAC can now store the documents that were collected and ingested into GUAC. This will allow users to find the original document (or re-ingest it in the case of failure) if they need to. In order to accurately determine the location of a document in the blob store, a document reference (blob store key) needs to be stored in the DB.

Currently, replacing the origin with this information, as we did in #1811, would result in losing the original location from which the document originated. We should add a new field to preserve the existing functionality.

This would require an update to the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used. We would also add a new DocumentRef field to the SourceInformation struct in Document:

type Document struct {
	Blob              []byte
	Type              DocumentType
	Format            FormatType
	Encoding          EncodingType
	SourceInformation SourceInformation
}
// SourceInformation provides additional information about where the document comes from
type SourceInformation struct {
	// Collector describes the name of the collector providing this information
	Collector string
	// Source describes the source from which the collector got this information
	Source string
	// DocumentRef describes the location of the document in the blob store
	DocumentRef string
}

We could either do this for all "verbs" or we can do it for only non-ephemeral documents like hasSBOM, hasSLSA, and CertifyVEX.

In the future, though, we may want to move to a more in-toto-attestation-specific ingestion for verbs like certifyBad, certifyGood, hasSourceAt, and pkgEquals, to name a few, so that we have evidence that something was attested by someone at a given time. We would again store this in the blob store for record keeping.
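To make the proposed semantics concrete, here is a minimal Go sketch of how a processor might populate the new DocumentRef field: empty when no persistent blob store is configured, and a content-addressed key when one is. The helper names (blobKey, newSourceInformation) and the sha256-based key scheme are hypothetical illustrations, not GUAC's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// SourceInformation mirrors the struct proposed above, with the new
// DocumentRef field added.
type SourceInformation struct {
	Collector   string
	Source      string
	DocumentRef string
}

// blobKey is a hypothetical helper producing a content-addressed key,
// similar in shape to the "sha256:..." keys a blob store might use.
func blobKey(blob []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(blob))
}

// newSourceInformation leaves DocumentRef empty when the persistent blob
// store is not in use, matching the proposed schema semantics.
func newSourceInformation(collector, source string, blob []byte, blobStoreEnabled bool) SourceInformation {
	si := SourceInformation{Collector: collector, Source: source}
	if blobStoreEnabled {
		si.DocumentRef = blobKey(blob)
	}
	return si
}

func main() {
	withStore := newSourceInformation("file_collector", "file:///sbom.json", []byte("{}"), true)
	withoutStore := newSourceInformation("file_collector", "file:///sbom.json", []byte("{}"), false)
	fmt.Println(withStore.DocumentRef)
	fmt.Println(withoutStore.DocumentRef == "")
}
```

Because the field is simply empty when unused, existing clients that never read documentRef are unaffected.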

Describe the solution you'd like
Update the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used.

A decision needs to be made on whether this should be done for all "verbs" or only a select few.

Describe alternatives you've considered

We could concatenate the strings together with a separator value (perhaps something like https://github.com/guacsec/guac/blob/main/pkg/assembler/helpers/package.go#L27, chosen so that it does not collide with real data in the future)?

For example:

mcr.microsoft.com/oss/kubernetes/kubectl@sha256:8035089a59a6f8577255f494c1ced250e1206667d8462869fc0deeca98d79427guac-empty-@@sha256:8534561615616161894984126517

That way we can split out the source in the future and get both the original source and the new blob store key back.

But following this method, we would lose the ability to query by origin, a loss of functionality in GUAC.
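The separator alternative described above can be sketched in a few lines of Go. This is illustrative only: the joinSourceRef/splitSourceRef helpers are hypothetical, and the "guac-empty-@@" sentinel is borrowed from the example string above; a real implementation would have to guarantee the sentinel can never occur inside a source string (the escaping problem raised later in this thread).

```go
package main

import (
	"fmt"
	"strings"
)

// refSeparator is an illustrative sentinel, modeled on the "guac-empty-@@"
// marker in the example above. If it could ever appear in real data, the
// split below would be ambiguous.
const refSeparator = "guac-empty-@@"

// joinSourceRef concatenates the original source and the blob store key
// into a single stored string.
func joinSourceRef(source, documentRef string) string {
	return source + refSeparator + documentRef
}

// splitSourceRef recovers both parts, failing if the separator is missing
// or appears more than once.
func splitSourceRef(combined string) (source, documentRef string, err error) {
	parts := strings.Split(combined, refSeparator)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("expected exactly one separator, got %d parts", len(parts))
	}
	return parts[0], parts[1], nil
}

func main() {
	combined := joinSourceRef("mcr.microsoft.com/oss/kubernetes/kubectl@sha256:8035...", "sha256:8534...")
	src, ref, err := splitSourceRef(combined)
	fmt.Println(src, ref, err)
}
```

Note that even with a working split, the DB would still hold one combined string, which is why origin-based queries would break; a dedicated documentRef field avoids both the escaping and the query problem.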

@pxp928 added the enhancement label Apr 11, 2024
@pxp928 (Collaborator, Author) commented Apr 11, 2024

cc @nchelluri

@pxp928 changed the title from "[feature] Add new filed to store document reference for persistent storage" to "[feature] Add new field to store document reference for persistent storage" Apr 11, 2024
@mlieberman85 (Collaborator)

I say we do it for all verbs. It's a bit of extra work, but it helps not just from an evidence perspective but also from a debuggability and maintainability perspective for GUAC.

We can use the blobs, whether they are API responses, documents, or something else, to help debug when something goes wrong, since we have the content saved. It also helps if we parse a document, API request/response, etc. differently in the future and need to re-parse the data.

@mihaimaruseac (Collaborator)

+1 on doing it for all verbs.

We should not combine fields with separators. I think doing so will open us up to issues when we need to escape the separator further down the line.

@lumjjb (Contributor) commented Apr 11, 2024

+1, adding this to SourceInformation looks good.

> This will allow users to find the original document (or re-ingest in the case of failure) if they need.

I would say that this will not quite directly meet the use case of "re-ingest in the case of failure". There needs to be another solution for that, something more on the pub/sub side with a reprocessing pipeline, but that seems out of scope for this issue.

As an aside, one of the issues we've run into before with one of our other projects that has a similar data pipeline is duplicates during reprocessing (@mdeicas has been looking at this).

@pxp928 (Collaborator, Author) commented Apr 11, 2024

> This will allow users to find the original document (or re-ingest in the case of failure) if they need.
>
> I would say that this will not quite directly meet the use case of "re-ingest in the case of failure". There needs to be another solution for that, something more on the pub/sub side with a reprocessing pipeline, but that seems out of scope for this issue.

Oh yes, this is not the solution to "re-ingest in case of a failure". This is for when your database blows up and you have to start from scratch.

@pxp928 (Collaborator, Author) commented Apr 11, 2024

> As an aside, one of the issues we've run into before with one of our other projects that has a similar data pipeline is duplicates during reprocessing (@mdeicas has been looking at this).

Interesting. If there are lessons learned we can apply here, that would be great.

@mdeicas (Collaborator) commented Apr 19, 2024

Sorry for the late response, but I think the lesson learned is that an ingestion pipeline may have been designed with an assumption of only ingesting documents once, or otherwise to be idempotent, and so it won't support re-ingesting documents to pick up new parsing features. It might be prudent to document this somewhere for clients?

@pxp928 (Collaborator, Author) commented Apr 19, 2024

Hmm, that is an interesting case. I added it to our agenda to discuss.
