**Neo - PR Security Review**: No security issues found.
**Walkthrough**

The duplicate detection logic in the runner was enhanced to track the IP addresses associated with each content hash (simhash). Instead of treating identical content as a duplicate regardless of source IP, the system now deduplicates only responses that originate from the same IP, so the same content served from different IPs is retained.
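The IP-aware behaviour described in the walkthrough can be sketched with a plain map standing in for the runner's simhash cache (all identifiers below are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// perIPDedupe is a toy stand-in for the runner's cache: a response is a
// duplicate only when the same content hash has already been seen from the
// same IP.
type perIPDedupe struct {
	seen map[uint64][]string // content hash -> IPs it was served from
}

func newPerIPDedupe() *perIPDedupe {
	return &perIPDedupe{seen: map[uint64][]string{}}
}

// duplicate reports whether (hash, ip) was already recorded; otherwise it
// records the pair and returns false.
func (d *perIPDedupe) duplicate(hash uint64, ip string) bool {
	for _, known := range d.seen[hash] {
		if known == ip {
			return true
		}
	}
	d.seen[hash] = append(d.seen[hash], ip)
	return false
}

func main() {
	d := newPerIPDedupe()
	fmt.Println(d.duplicate(42, "1.2.3.4")) // false: first sighting
	fmt.Println(d.duplicate(42, "5.6.7.8")) // false: same content, new IP is kept
	fmt.Println(d.duplicate(42, "1.2.3.4")) // true: same content, same IP
}
```

The real implementation additionally does a near-match comparison on the hashes rather than an exact map lookup, which is where the review comment below comes in.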
**Estimated code review effort**: 🎯 3 (Moderate) | ⏱️ ~20 minutes
**Pre-merge checks**: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@runner/runner.go`:
- Around line 644-656: The loop over r.simHashes.GetALL(false) currently returns
or mutates on the first matching storedHash, causing nondeterministic dedupe
when multiple stored hashes are within threshold; change the logic in the block
that uses simhash.Compare, sliceutil.Contains, and r.simHashes.Set so you first
scan all stored entries: track whether any stored hash is within threshold and
whether any of those already contains ip, and only after the loop either: (a)
return true if any matching storedIPs already contained ip, or (b) attach ip to
an appropriate stored hash (or create a new entry for respSimHash) and return
false; remove the early r.simHashes.Set(...) and return inside the loop to
ensure all matches are considered before deciding.
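The restructured flow the prompt asks for — scan every stored entry first, decide only afterwards — might look roughly like this, with a plain map and a Hamming-distance helper standing in for the real cache and `simhash.Compare` (all names are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming is a stand-in for simhash.Compare: the number of differing bits.
func hamming(a, b uint64) int { return bits.OnesCount64(a ^ b) }

// dedupeAllMatches scans every stored hash before deciding, instead of
// mutating and returning on the first near match.
func dedupeAllMatches(stored map[uint64][]string, hash uint64, ip string, threshold int) bool {
	var firstMatch uint64
	matched := false
	for storedHash, ips := range stored {
		if hamming(storedHash, hash) > threshold {
			continue
		}
		for _, known := range ips {
			if known == ip {
				return true // a matching hash already saw this IP: duplicate
			}
		}
		if !matched {
			firstMatch, matched = storedHash, true
		}
	}
	// No matching hash knew this IP: record it and report "not a duplicate".
	if matched {
		stored[firstMatch] = append(stored[firstMatch], ip)
	} else {
		stored[hash] = []string{ip}
	}
	return false
}

func main() {
	stored := map[uint64][]string{
		0b1000: {"9.9.9.9"},
		0b1001: {"1.2.3.4"},
	}
	// Both stored hashes are within distance 1 of 0b1000. Under first-match
	// logic the answer would depend on map order; here it does not.
	fmt.Println(dedupeAllMatches(stored, 0b1000, "1.2.3.4", 1)) // true
}
```

Returning `true` early inside the loop is still safe: once any near match already contains the IP, the response is a duplicate regardless of the remaining entries.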
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2beeeee2-d2a0-41d8-8c4b-2d196e0eb71c
📒 Files selected for processing (2)
runner/runner.go, runner/runner_test.go
```diff
 for storedHash, storedIPs := range r.simHashes.GetALL(false) {
 	if simhash.Compare(storedHash, respSimHash) > 3 {
 		continue
 	}
 	if ip == "" || sliceutil.Contains(storedIPs, ip) {
 		gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
 		return true
 	}
 	_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
 	return false
 }
-_ = r.simHashes.Set(respSimHash, struct{}{})
+_ = r.simHashes.Set(respSimHash, []string{ip})
```
**Scan all matching simhashes before deciding this response is unique.**

Because `GetALL(false)` is iterated as a map, the iteration order is not stable. If two stored hashes are both within the `<= 3` threshold of `respSimHash`, line 652 can attach `ip` to the first match it sees and line 653 can return `false` before a later matching hash that already contains `ip` is checked. That makes the dedupe result nondeterministic and can leak duplicates.
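The nondeterminism can be made concrete by replaying the first-match logic over an explicit visit order — a slice standing in for the map, so both possible iteration orders can be compared (a standalone sketch, not the PR's code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// entry mirrors one stored cache record: a simhash and the IPs it was seen from.
type entry struct {
	hash uint64
	ips  []string
}

// firstMatchDedupe replays the logic under review on an explicit visit order.
func firstMatchDedupe(storedOrder []entry, hash uint64, ip string, threshold int) bool {
	for i := range storedOrder {
		if bits.OnesCount64(storedOrder[i].hash^hash) > threshold {
			continue
		}
		for _, known := range storedOrder[i].ips {
			if known == ip {
				return true
			}
		}
		storedOrder[i].ips = append(storedOrder[i].ips, ip)
		return false // decided on the first near match, as in the current code
	}
	return false
}

func main() {
	a := entry{hash: 0b1000, ips: []string{"9.9.9.9"}}
	b := entry{hash: 0b1001, ips: []string{"1.2.3.4"}}
	// Same data, two possible map iteration orders, two different answers:
	fmt.Println(firstMatchDedupe([]entry{b, a}, 0b1000, "1.2.3.4", 1)) // true
	fmt.Println(firstMatchDedupe([]entry{a, b}, 0b1000, "1.2.3.4", 1)) // false
}
```

When the entry that already contains the IP happens to be visited second, the first near match swallows the decision and the duplicate leaks through.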
💡 Proposed fix

```diff
 func (r *Runner) duplicate(result *Result) bool {
 	respSimHash := simhash.Simhash(simhash.NewWordFeatureSet(converstionutil.Bytes(result.Raw)))
 	ip := result.HostIP
 	for storedHash, storedIPs := range r.simHashes.GetALL(false) {
 		if simhash.Compare(storedHash, respSimHash) > 3 {
 			continue
 		}
 		if ip == "" || sliceutil.Contains(storedIPs, ip) {
 			gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
 			return true
 		}
-		_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
-		return false
 	}
-	_ = r.simHashes.Set(respSimHash, []string{ip})
+	if storedIPs, err := r.simHashes.GetIFPresent(respSimHash); err == nil {
+		_ = r.simHashes.Set(respSimHash, append(storedIPs, ip))
+	} else {
+		_ = r.simHashes.Set(respSimHash, []string{ip})
+	}
 	return false
 }
```
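One caveat worth checking against the actual cache type before applying the fix: if `r.simHashes` is a gcache-style cache, `GetIFPresent` typically returns `(interface{}, error)`, so the stored value would need a type assertion before it can be appended to. A minimal standalone sketch of that pattern (the tiny cache and all names here are hypothetical, not the project's real cache):

```go
package main

import (
	"errors"
	"fmt"
)

// tinyCache mimics the (interface{}, error) shape of gcache-style lookups.
type tinyCache struct{ m map[uint64]interface{} }

var errKeyNotFound = errors.New("key not found")

func (c *tinyCache) GetIFPresent(k uint64) (interface{}, error) {
	if v, ok := c.m[k]; ok {
		return v, nil
	}
	return nil, errKeyNotFound
}

func (c *tinyCache) Set(k uint64, v interface{}) error {
	c.m[k] = v
	return nil
}

// appendIP attaches ip to an existing entry, asserting the interface{} value
// back to []string first, or creates a fresh entry.
func appendIP(c *tinyCache, hash uint64, ip string) {
	if v, err := c.GetIFPresent(hash); err == nil {
		if ips, ok := v.([]string); ok {
			_ = c.Set(hash, append(ips, ip))
			return
		}
	}
	_ = c.Set(hash, []string{ip})
}

func main() {
	c := &tinyCache{m: map[uint64]interface{}{}}
	appendIP(c, 7, "1.2.3.4")
	appendIP(c, 7, "5.6.7.8")
	v, _ := c.GetIFPresent(7)
	fmt.Println(v.([]string)) // [1.2.3.4 5.6.7.8]
}
```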
Proposed changes
Close #2014
Checklist
Summary by CodeRabbit

Release Notes

- **Bug Fixes**: Duplicate response detection now considers the source IP, so identical content served from different IPs is no longer dropped as a duplicate.
- **Tests**: Runner tests updated to cover the IP-aware deduplication behavior.