
adding ip aware dedup#2463

Merged
dogancanbakir merged 1 commit into dev from 2014-ip-aware-duplicate-filter on Mar 25, 2026

Conversation


@Mzack9999 Mzack9999 commented Mar 22, 2026

Proposed changes

Close #2014

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced duplicate detection logic to account for content source, ensuring results from different hosts are properly distinguished rather than being incorrectly filtered as duplicates.
  • Tests

    • Added comprehensive test coverage for duplicate detection scenarios across various edge cases.

@auto-assign auto-assign bot requested a review from dogancanbakir March 22, 2026 12:59

neo-by-projectdiscovery-dev bot commented Mar 22, 2026

Neo - PR Security Review

No security issues found

Highlights

  • Adds IP-aware duplicate detection to improve deduplication accuracy
  • Stores simhash → IP list mappings in bounded cache (1000 entries)
  • Comprehensive test coverage for various IP/content combinations
Hardening Notes
  • Consider adding mutex protection around simHashes cache operations in runner.go:644-654 to prevent race conditions when multiple goroutines call duplicate() concurrently
  • Add monitoring for the simHashes cache growth to detect scenarios where many IPs return similar content, which could consume memory despite the 1000-entry cap
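The first hardening note — guarding the shared simHashes cache against concurrent `duplicate()` calls — can be illustrated with a minimal sketch. The `ipCache` type and its methods below are hypothetical stand-ins (the real runner uses a bounded gcache, not a plain map); the point is only the mutex around every read and write of the hash-to-IPs mapping:

```go
package main

import (
	"fmt"
	"sync"
)

// ipCache is a hypothetical stand-in for the simHashes cache in runner.go,
// guarded by a mutex so concurrent duplicate() calls cannot race on the map.
type ipCache struct {
	mu      sync.Mutex
	entries map[uint64][]string
}

func newIPCache() *ipCache {
	return &ipCache{entries: make(map[uint64][]string)}
}

// addIP records ip under hash; safe for concurrent use.
func (c *ipCache) addIP(hash uint64, ip string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[hash] = append(c.entries[hash], ip)
}

// ips returns a copy of the IPs stored for hash, so callers
// never hold a reference into the locked map's backing array.
func (c *ipCache) ips(hash uint64) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.entries[hash]...)
}

func main() {
	c := newIPCache()
	var wg sync.WaitGroup
	// 100 goroutines appending under the same hash, as concurrent
	// duplicate() calls would; without the mutex this is a data race.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c.addIP(42, fmt.Sprintf("10.0.0.%d", n))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.ips(42))) // 100
}
```

Running this under `go run -race` is a quick way to confirm the locking is sufficient; dropping the mutex makes the race detector fire immediately.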



coderabbitai bot commented Mar 22, 2026

Walkthrough

The duplicate detection logic in the runner was enhanced to track IP addresses associated with each content hash (simhash). Instead of treating identical content as duplicates regardless of source IP, the system now only deduplicates responses that originate from the same IP, allowing the same content from different IPs to be retained.

Changes

Cohort / File(s) Summary
Duplicate Detection Logic
runner/runner.go
Modified simHashes storage from gcache.Cache[uint64, struct{}] to gcache.Cache[uint64, []string] to track IPs per simhash. Reworked duplicate detection to be IP-aware: skips responses only if they match/near-match an existing simhash AND the IP is empty or already exists in the stored IP list. Appends new IPs to existing simhashes or initializes with current IP if no match found.
Deduplication Test Suite
runner/runner_test.go
Added comprehensive TestRunner_duplicate test suite covering: identical content on same IP (duplicate), identical content on different IPs (not duplicate), near-duplicates with IP differentiation, empty HostIP fallback behavior, and scale testing with 50 subdomains across same and different IPs.
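The behavior described above can be sketched in a self-contained form. Everything here is a simplified illustration, not the PR's actual code: `simHash` is a toy word-feature simhash standing in for the simhash library the runner uses, the `dedup` struct replaces the bounded gcache with a plain map, and the threshold of 3 differing bits mirrors the `> 3` comparison in the diff:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"slices"
	"strings"
)

// simHash is a minimal word-feature simhash: each word's FNV-64 hash
// votes on every bit position; the sign of the tally sets the output bit.
func simHash(text string) uint64 {
	var votes [64]int
	for _, w := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(w))
		f := h.Sum64()
		for i := 0; i < 64; i++ {
			if f&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var out uint64
	for i := 0; i < 64; i++ {
		if votes[i] > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// dedup mirrors the PR's cache shape: simhash -> IPs seen for that content.
type dedup struct {
	simHashes map[uint64][]string
}

// duplicate reports whether (body, ip) should be dropped. Identical or
// near-identical content from a new IP is kept and the IP recorded, which
// is the behavior this PR adds.
func (d *dedup) duplicate(body, ip string) bool {
	h := simHash(body)
	for storedHash, storedIPs := range d.simHashes {
		if bits.OnesCount64(storedHash^h) > 3 { // near-duplicate threshold
			continue
		}
		if ip == "" || slices.Contains(storedIPs, ip) {
			return true // content already seen from this IP (or IP unknown)
		}
		d.simHashes[storedHash] = append(storedIPs, ip)
		return false // same content, different IP: keep it
	}
	d.simHashes[h] = []string{ip} // first sighting of this content
	return false
}

func main() {
	d := &dedup{simHashes: map[uint64][]string{}}
	body := "HTTP/1.1 200 OK default server page"
	fmt.Println(d.duplicate(body, "1.1.1.1")) // false: first sighting
	fmt.Println(d.duplicate(body, "1.1.1.1")) // true: same content, same IP
	fmt.Println(d.duplicate(body, "2.2.2.2")) // false: same content, new IP
}
```

The three calls in `main` correspond to the first three test scenarios listed above: first sighting kept, same-IP repeat dropped, different-IP repeat retained.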

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A hash with an IP in tow,
Now tracks which servers come and go—
Same content, different IPs bright,
No longer lost in dedup's sight!
One server, one result, that's right! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'adding ip aware dedup' directly describes the main change, implementing IP-aware deduplication logic in the duplicate detection function.
  • Linked Issues Check — ✅ Passed: the changes implement IP-aware deduplication as required by issue #2014, storing host IPs per simhash and only marking responses as duplicates when both content and IP match or the IP is empty.
  • Out of Scope Changes Check — ✅ Passed: all changes are directly scoped to implementing IP-aware deduplication, covering the detection logic and corresponding test coverage.


@Mzack9999 Mzack9999 self-assigned this Mar 22, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@runner/runner.go`:
- Around line 644-656: The loop over r.simHashes.GetALL(false) currently returns
or mutates on the first matching storedHash, causing nondeterministic dedupe
when multiple stored hashes are within threshold; change the logic in the block
that uses simhash.Compare, sliceutil.Contains, and r.simHashes.Set so you first
scan all stored entries: track whether any stored hash is within threshold and
whether any of those already contains ip, and only after the loop either: (a)
return true if any matching storedIPs already contained ip, or (b) attach ip to
an appropriate stored hash (or create a new entry for respSimHash) and return
false; remove the early r.simHashes.Set(...) and return inside the loop to
ensure all matches are considered before deciding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2beeeee2-d2a0-41d8-8c4b-2d196e0eb71c

📥 Commits

Reviewing files that changed from the base of the PR and between 9836829 and 7d765f4.

📒 Files selected for processing (2)
  • runner/runner.go
  • runner/runner_test.go

Comment on lines +644 to +656

	for storedHash, storedIPs := range r.simHashes.GetALL(false) {
		if simhash.Compare(storedHash, respSimHash) > 3 {
			continue
		}
		if ip == "" || sliceutil.Contains(storedIPs, ip) {
			gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
			return true
		}
		_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
		return false
	}
-	_ = r.simHashes.Set(respSimHash, struct{}{})
+	_ = r.simHashes.Set(respSimHash, []string{ip})

⚠️ Potential issue | 🟠 Major

Scan all matching simhashes before deciding this response is unique.

Because GetALL(false) is iterated as a map, the order here is not stable. If two stored hashes are both within the <= 3 threshold of respSimHash, Line 652 can attach ip to the first one seen and Line 653 returns false before a later matching hash that already contains ip is checked. That makes dedupe results nondeterministic and can leak duplicates.

💡 Proposed fix
 func (r *Runner) duplicate(result *Result) bool {
 	respSimHash := simhash.Simhash(simhash.NewWordFeatureSet(converstionutil.Bytes(result.Raw)))
 	ip := result.HostIP
 
 	for storedHash, storedIPs := range r.simHashes.GetALL(false) {
 		if simhash.Compare(storedHash, respSimHash) > 3 {
 			continue
 		}
 		if ip == "" || sliceutil.Contains(storedIPs, ip) {
 			gologger.Debug().Msgf("Skipping duplicate response (simhash %d, ip %s) for URL %s\n", respSimHash, ip, result.URL)
 			return true
 		}
-		_ = r.simHashes.Set(storedHash, append(storedIPs, ip))
-		return false
 	}
 
-	_ = r.simHashes.Set(respSimHash, []string{ip})
+	if storedIPs, err := r.simHashes.GetIFPresent(respSimHash); err == nil {
+		_ = r.simHashes.Set(respSimHash, append(storedIPs, ip))
+	} else {
+		_ = r.simHashes.Set(respSimHash, []string{ip})
+	}
 	return false
 }
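The reviewer's concern — that attaching the IP to the first within-threshold hash and returning early can skip a later matching entry that already contains the IP — can be demonstrated with a small order-independent rewrite. This is a sketch of the suggested scan-all pattern, not the merged code; `store`, `near`, and the hash values are made up for illustration:

```go
package main

import (
	"fmt"
	"math/bits"
	"slices"
)

// store maps a simhash to the IPs it has been seen from.
type store map[uint64][]string

// near mirrors the <= 3 bit-difference threshold from the review comment.
func near(a, b uint64) bool { return bits.OnesCount64(a^b) <= 3 }

// duplicate examines every stored hash before deciding, so Go's randomized
// map iteration order cannot change the outcome. Mutation happens only
// after the full scan proves no matching entry already covers this IP.
func (s store) duplicate(h uint64, ip string) bool {
	var match uint64
	found := false
	for stored, ips := range s {
		if !near(stored, h) {
			continue
		}
		if ip == "" || slices.Contains(ips, ip) {
			return true // some near-duplicate already covers this IP
		}
		match, found = stored, true // remember a candidate, keep scanning
	}
	if found {
		s[match] = append(s[match], ip) // known content, new IP: record it
	} else {
		s[h] = []string{ip} // brand-new content
	}
	return false
}

func main() {
	// Two stored hashes, 0 and 1, are both within threshold of an
	// incoming hash of 0. With an early return, which entry the map
	// iterates first would decide the result; here it cannot.
	s := store{0: {"1.1.1.1"}, 1: {"2.2.2.2"}}
	fmt.Println(s.duplicate(0, "2.2.2.2")) // true, regardless of map order
}
```

With the early-return version from the diff, the same input would return false whenever the map happened to yield hash 0 first, which is exactly the leaked-duplicate scenario the comment describes.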

@dogancanbakir dogancanbakir merged commit 2cc5330 into dev Mar 25, 2026
16 checks passed
@dogancanbakir dogancanbakir deleted the 2014-ip-aware-duplicate-filter branch March 25, 2026 11:11

Development

Successfully merging this pull request may close these issues.

Duplicate filter feature ideas (-fd, -filter-duplicates)

2 participants