feat: Re-implement importer in Go#4808
Hidden character warning
Conversation
There was a problem hiding this comment.
This file gets copied here via go:generate from the osv-schema submodule.
I'm not really sure what the best practice is here.
We currently don't have any automation to update this, nor check that it is up to date.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This is a significant and well-executed rewrite of the importer to Go. The code is well-structured, leveraging concurrency with worker pools and interfaces for testability. The separation of concerns for different source types (Git, GCS, REST) is clean. My review focuses on improving robustness, clarity, and maintainability in a few areas. Overall, great work on this large refactoring!
| if srcTimestamp != nil { | ||
| msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10) | ||
| } else { | ||
| msg.Attributes["src_timestamp"] = "" | ||
| } |
There was a problem hiding this comment.
Instead of setting src_timestamp to an empty string when it's not available, it's generally better to omit the attribute from the message altogether. This avoids potential parsing issues on the consumer side and makes the message cleaner.
if srcTimestamp != nil {
msg.Attributes["src_timestamp"] = strconv.FormatInt(srcTimestamp.Unix(), 10)
}There was a problem hiding this comment.
The Python code did this, so I'm hesitant to change it without changing the worker first.
| if err := os.MkdirAll(dir, 0755); err != nil { | ||
| return nil, fmt.Errorf("failed to create directory: %w", err) | ||
| } | ||
| cmd := exec.CommandContext(ctx, "tar", "-xf", tarPath, "-C", dir) |
There was a problem hiding this comment.
The code shells out to the tar command to extract the repository archive. This creates an external dependency on the tar binary being present and supporting zstd. To make the application more self-contained and portable, consider using Go libraries to handle the decompression and extraction, such as github.com/klauspost/compress/zstd for zstd decompression and the standard archive/tar package for tar extraction.
There was a problem hiding this comment.
I'd like to do this in principle, but it's a lot of lines of code to actually do this in go (especially iterating through the tar archive), versus just one for shelling out.
| } | ||
| // Only log when we see a new ID prefix (i.e. roughly once per data source) | ||
| prefix := filepath.Base(path) | ||
| prefix := filepath.Base(obj.Name) |
There was a problem hiding this comment.
huh? How did this work before?
There was a problem hiding this comment.
I changed Objects to return metadata instead of just filenames (to save on number of requests that the importer needs to make)
| return &VulnerabilityStore{client: client} | ||
| } | ||
|
|
||
| func (s *VulnerabilityStore) ListBySource(ctx context.Context, source string, skipWithdrawn bool) iter.Seq2[*models.VulnSourceRef, error] { |
There was a problem hiding this comment.
Are there plans to split this source thing apart? Or is that more part of the database migration step.
There was a problem hiding this comment.
In Datastore, it's annoying but I don't think it really impacts our ability to do queries
(I don't think we'd ever need to query a path without knowing the source, and the < source + ";" works fine)
I think it's fine to leave it here, but it does make sense to have these fields separate if/when we move to postgres
| filterPath := func(p string) string { | ||
| if !strings.HasSuffix(p, sourceRepo.Extension) { | ||
| return "" | ||
| } | ||
| if dirPath := sourceRepo.Git.Path; dirPath != "" { | ||
| if !strings.HasSuffix(dirPath, "/") { | ||
| dirPath += "/" | ||
| } | ||
| if !strings.HasPrefix(p, dirPath) { | ||
| return "" | ||
| } | ||
| } | ||
| if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) { | ||
| return "" | ||
| } | ||
|
|
||
| return p | ||
| } |
There was a problem hiding this comment.
| filterPath := func(p string) string { | |
| if !strings.HasSuffix(p, sourceRepo.Extension) { | |
| return "" | |
| } | |
| if dirPath := sourceRepo.Git.Path; dirPath != "" { | |
| if !strings.HasSuffix(dirPath, "/") { | |
| dirPath += "/" | |
| } | |
| if !strings.HasPrefix(p, dirPath) { | |
| return "" | |
| } | |
| } | |
| if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) { | |
| return "" | |
| } | |
| return p | |
| } | |
| filterPath := func(p string) string { | |
| // Has Extension suffix? | |
| if !strings.HasSuffix(p, sourceRepo.Extension) { | |
| return "" | |
| } | |
| // Has Git.Path Prefix? | |
| if dirPath := sourceRepo.Git.Path; dirPath != "" { | |
| if !strings.HasSuffix(dirPath, "/") { | |
| dirPath += "/" | |
| } | |
| if !strings.HasPrefix(p, dirPath) { | |
| return "" | |
| } | |
| } | |
| // Does it match an ignore pattern? | |
| if shouldIgnore(path.Base(p), sourceRepo.IDPrefixes, compiledIgnorePatterns) { | |
| return "" | |
| } | |
| return p | |
| } |
There was a problem hiding this comment.
Are these comments really adding anything?
like, Has Extension suffix? and strings.HasSuffix(p, sourceRepo.Extension) basically communicate the same amount of info
go/internal/importer/rest.go
Outdated
| return err | ||
| } | ||
| req = req.WithContext(ctx) | ||
| resp, err := config.HTTPClient.Do(req) |
There was a problem hiding this comment.
Is there an err if response is not 200?
If so do we want to do the fallback when HEAD reqs are not supported?
If not, can we check if it returns 200 and if not log a warning?
There was a problem hiding this comment.
err is nil if there's a non-2xx response.
done, and moved to a checkHEAD function
| timeToUpdate = lastModTime | ||
| } | ||
| sourceRepo.REST.LastUpdated = &timeToUpdate | ||
| sourceRepo.REST.IgnoreLastImportTime = false |
There was a problem hiding this comment.
Should we do the update to switch this back immediately? I guess it shouldn't matter if another importer run doesn't run at the same time.
There was a problem hiding this comment.
If there is some transient error in the importer and it doesn't complete, I don't know if we want to write back immediately versus retrying.
And yeah, since only one importer may run at a time, I think this is okay.
go/cmd/importer/main.go
Outdated
| } | ||
| } | ||
|
|
||
| // importerSampleRate returns the sample rate for the importer (not the individual vulnerability entries). |
There was a problem hiding this comment.
Can you explain the difference between importer and individual vulnerability entries here?
There was a problem hiding this comment.
Added some more detail here and in vulnerabilitySampleRate()
another-rex
left a comment
There was a problem hiding this comment.
Probably should mention in the PR/commit description that this does not do git fetches anymore
|
Big re-implementation of the importer (and importer-deleter) into Go:
I'd appreciate people reviewing this to see how follow-able/self-documenting the code is, and point out where things may be unclear.