
[feature] Add new field to store document reference for persistent storage #1833

Closed
pxp928 opened this issue Apr 11, 2024 · 8 comments · Fixed by #1844
Labels
enhancement New feature or request

Comments

@pxp928 (Collaborator) commented Apr 11, 2024

Is your feature request related to a problem? Please describe.
With the new addition of the blob store, GUAC can now store the documents that were collected and ingested into GUAC. This will allow users to find the original document (or re-ingest it in the case of failure) if they need to. In order to accurately determine the location of a document in the blob store, a document reference (blob store key) needs to be stored in the DB.

Currently, replacing the origin with this information, as we did in #1811, would result in losing the original location from which the document originated. We should add a new field to preserve the existing functionality.

This would require an update to the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used. We would also add a new DocumentRef field to the SourceInformation struct in Document:

type Document struct {
	Blob              []byte
	Type              DocumentType
	Format            FormatType
	Encoding          EncodingType
	SourceInformation SourceInformation
}
// SourceInformation provides additional information about where the document comes from
type SourceInformation struct {
	// Collector describes the name of the collector providing this information
	Collector string
	// Source describes the source from which the collector got this information
	Source string
	// DocumentRef describes the location of the document in the blob store
	DocumentRef string
}

We could either do this for all "verbs" or we can do it for only non-ephemeral documents like hasSBOM, hasSLSA, and CertifyVEX.

In the future, though, we may want to move to a more in-toto-attestation-specific ingestion for verbs like certifyBad, certifyGood, hasSourceAt, and pkgEquals, to name a few, so that we have evidence that something was attested by someone at a given time. We would again store this in the blob store for record keeping.
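To make the proposed semantics concrete, here is a minimal Go sketch of how a processor might populate the new DocumentRef field: empty when no persistent blob store is configured, and a content-addressed key when one is. The helper names (blobKey, newSourceInformation) and the sha256-based key scheme are hypothetical illustrations, not GUAC's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// SourceInformation mirrors the struct proposed above, with the new
// DocumentRef field added.
type SourceInformation struct {
	Collector   string
	Source      string
	DocumentRef string
}

// blobKey is a hypothetical helper producing a content-addressed key,
// similar in shape to the "sha256:..." keys a blob store might use.
func blobKey(blob []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(blob))
}

// newSourceInformation leaves DocumentRef empty when the persistent blob
// store is not in use, matching the proposed schema semantics.
func newSourceInformation(collector, source string, blob []byte, blobStoreEnabled bool) SourceInformation {
	si := SourceInformation{Collector: collector, Source: source}
	if blobStoreEnabled {
		si.DocumentRef = blobKey(blob)
	}
	return si
}

func main() {
	withStore := newSourceInformation("file_collector", "file:///sbom.json", []byte("{}"), true)
	withoutStore := newSourceInformation("file_collector", "file:///sbom.json", []byte("{}"), false)
	fmt.Println(withStore.DocumentRef)
	fmt.Println(withoutStore.DocumentRef == "")
}
```

Because the field is simply empty when unused, existing clients that never read documentRef are unaffected.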

Describe the solution you'd like
Update the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used.

A decision needs to be made on whether this should be done for all "verbs" or only a select few.

Describe alternatives you've considered

We could concatenate the strings together with a separator value (perhaps something like https://github.com/guacsec/guac/blob/main/pkg/assembler/helpers/package.go#L27, chosen so that it does not collide with real data in the future)?

For example:

mcr.microsoft.com/oss/kubernetes/kubectl@sha256:8035089a59a6f8577255f494c1ced250e1206667d8462869fc0deeca98d79427guac-empty-@@sha256:8534561615616161894984126517

That way we can split out the source in the future and get both the original source and the new blob store key back.

But following this method, we would lose the ability to query by origin, a loss of functionality in GUAC.
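The separator alternative described above can be sketched in a few lines of Go. This is illustrative only: the joinSourceRef/splitSourceRef helpers are hypothetical, and the "guac-empty-@@" sentinel is borrowed from the example string above; a real implementation would have to guarantee the sentinel can never occur inside a source string (the escaping problem raised later in this thread).

```go
package main

import (
	"fmt"
	"strings"
)

// refSeparator is an illustrative sentinel, modeled on the "guac-empty-@@"
// marker in the example above. If it could ever appear in real data, the
// split below would be ambiguous.
const refSeparator = "guac-empty-@@"

// joinSourceRef concatenates the original source and the blob store key
// into a single stored string.
func joinSourceRef(source, documentRef string) string {
	return source + refSeparator + documentRef
}

// splitSourceRef recovers both parts, failing if the separator is missing
// or appears more than once.
func splitSourceRef(combined string) (source, documentRef string, err error) {
	parts := strings.Split(combined, refSeparator)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("expected exactly one separator, got %d parts", len(parts))
	}
	return parts[0], parts[1], nil
}

func main() {
	combined := joinSourceRef("mcr.microsoft.com/oss/kubernetes/kubectl@sha256:8035...", "sha256:8534...")
	src, ref, err := splitSourceRef(combined)
	fmt.Println(src, ref, err)
}
```

Note that even with a working split, the DB would still hold one combined string, which is why origin-based queries would break; a dedicated documentRef field avoids both the escaping and the query problem.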

@pxp928 added the enhancement label Apr 11, 2024
@pxp928 (Collaborator, Author) commented Apr 11, 2024

cc @nchelluri

@pxp928 changed the title from "[feature] Add new filed to store document reference for persistent storage" to "[feature] Add new field to store document reference for persistent storage" Apr 11, 2024
@mlieberman85 (Collaborator)

I say we do it for all verbs. It's a bit of extra work, but it helps not just from an evidence perspective but also from a debuggability and maintainability perspective for GUAC.

We can use the blobs, whether they are API responses, documents, or something else, to help debug when something goes wrong, since we have the content saved. It also helps if we parse a document, API request/response, etc. differently in the future and need to re-parse the data.

@mihaimaruseac (Collaborator)

+1 on doing it for all verbs.

We should not combine fields with separators. I think doing so will open us up to issues when we need to escape the separator further down the line.

@lumjjb (Contributor) commented Apr 11, 2024

+1, adding this to SourceInformation looks good.

> This will allow users to find the original document (or re-ingest in the case of failure) if they need.

I would say that this will not quite directly meet the use case of "re-ingest in the case of failure". There needs to be another solution for that, something more on the pub/sub side with a reprocessing pipeline, but that seems out of scope for this issue.

As an aside, one of the issues we've run into before with one of our other projects that has a similar data pipeline is duplicates during reprocessing (@mdeicas has been looking at this).

@pxp928 (Collaborator, Author) commented Apr 11, 2024

> This will allow users to find the original document (or re-ingest in the case of failure) if they need.
>
> I would say that this will not quite directly meet the use case of "re-ingest in the case of failure". There needs to be another solution for that, something more on the pub/sub side with a reprocessing pipeline, but that seems out of scope for this issue.

Oh yes, this is not the solution to "re-ingest in case of a failure". This is for when your database blows up and you have to start from scratch.

@pxp928 (Collaborator, Author) commented Apr 11, 2024

> As an aside, one of the issues we've run into before with one of our other projects that has a similar data pipeline is duplicates during reprocessing (@mdeicas has been looking at this).

Interesting. If there are lessons learned we can apply here, that would be great.

@mdeicas (Collaborator) commented Apr 19, 2024

Sorry for the late response, but I think the lesson learned is that an ingestion pipeline may have been designed with an assumption of only ingesting documents once, or otherwise to be idempotent, and so it won't support re-ingesting documents to pick up new parsing features. It might be prudent to document this somewhere for clients?

@pxp928 (Collaborator, Author) commented Apr 19, 2024

Hmm, that is an interesting case. I added it to our agenda to discuss.
