Repair Network Flows documentation #22449
The 2026-05-07 closure of SOW-0014 was premature. The learn netlify deploy preview for PR netdata#2852 surfaced major content errors that the prior validation pass missed: multiple statements that contradict the source code, generic flow-monitoring advice imported from research notes that does not apply to Netdata, several invented behaviours, and structural choices that read as academic / generic guidance rather than as a practical Netdata-specific guide.
Move file from done/ back to current/, set Status: in-progress, and append a `## Regression - 2026-05-07` section with:
- 21 findings transcribed verbatim from the user (F1..F21);
- code-citation verdicts for each (per-flow sampling multiplication at decoder/record/core/record.rs:24-26 confirms F4/F5/F15 wrong; template persistence at decoder/protocol/{v9,ipfix}/templates.rs confirms F14 wrong; etc.);
- root cause analysis (subagents extracted data accurately but missed behavioural framing claims; codex review focused on autocomplete code, not docs prose; validation evidence was structural, not semantic; closure was driven by "all phases done" rather than "all claims true");
- a three-phase repair plan: per-finding fixes one at a time with no batching (R1), per-page audit subagents that re-run until clean (R2), final close with a whole-section review (R3).
The SOW remains in current/ until every F1..F21 has a fix landed with code citation, every page passes a per-page audit, and a whole-section review returns no new findings.
The /docs/network-flows URL rendered as an auto-generated tile grid because the section root meta block carried only `label:` -- no leaf content. Learn's `get_dir_make_file_and_recurse` then synthesised a category index page from the children. Hoist `edit_url:` and `description:` to the section root, pointing at the existing README.md. Drop the redundant child "Overview" entry that pointed to the same file. Pattern reference: every section that renders an Overview at its root URL on learn.netdata.cloud (Collecting Metrics, Dashboards and Charts, Netdata Cloud, Welcome to Netdata, etc.) carries `edit_url:` directly on the section root. Logged as F1 in SOW-0014 regression log with code references.
Two related findings, fixed together because they touched the same paragraphs.
- The "doubling" effect (per-packet ingress+egress accounting on a single router) was conflated with bidirectional traffic symmetry.
- The doubling fix said "filter by one exporter, one interface, in one direction". The "in one direction" is redundant on top of "one interface" and misleads readers into expecting another 50% halving.
- The bidirectional-traffic explanation said "when you see traffic X-to-Y and Y-to-X of similar volume, that's one conversation, not two". Bidirectional conversations are usually asymmetric (downloads vs ACKs), so "similar volume" is wrong as an identification heuristic.
Rewritten:
- Doubling fix is now: one exporter + one interface (Input Interface OR Output Interface, pick one). Each packet crossing that interface produces exactly one record on it.
- The mirror-conversation section is renamed and reframed: separate packets in each direction, separate records, typically asymmetric volumes. Per-direction accounting, not duplication.
Files touched:
- README.md, quick-start.md: paragraphs rewritten.
- summary-sankey.md, anti-patterns.md, validation.md: "in one direction" lines fixed in place. Anti-patterns / validation will be rewritten more broadly under F14-F17 but the wrong claims are removed now.
Logged as F2+F3 in SOW-0014 regression log.
…e myth)
Two related findings, fixed together because they are the same wrong
claim repeated across the documentation. Source-code reality:
src/crates/netflow-plugin/src/decoder/record/core/record.rs:24-26
let sampling_rate = rec.sampling_rate.max(1);
rec.bytes = rec.bytes.saturating_mul(sampling_rate);
rec.packets = rec.packets.saturating_mul(sampling_rate);
`sampling_rate` is set per-record from each protocol's appropriate
source (legacy header, v9 IE / Sampling Options Template, IPFIX IE /
options, sFlow per-sample rate, or static override). Multiplication
runs PER FLOW at decode time. Mixed sampling rates across exporters,
interfaces, or time are handled correctly automatically.
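A minimal sketch of why mixed rates need no special handling, assuming a simplified record type. `FlowRecord`, `apply_sampling`, and `estimated_total_bytes` are hypothetical names for illustration; only the three scaling lines mirror the code quoted above.

```rust
/// Hypothetical, simplified flow record. The real record type in the plugin
/// carries many more fields; only the ones needed for scaling are modelled.
#[derive(Debug, Clone, PartialEq)]
struct FlowRecord {
    bytes: u64,
    packets: u64,
    sampling_rate: u64,
}

/// Scale one record by its own sampling rate, mirroring the quoted lines.
fn apply_sampling(rec: &mut FlowRecord) {
    // A missing rate (0) is clamped to 1, so the record is left unscaled.
    let sampling_rate = rec.sampling_rate.max(1);
    // Saturating arithmetic caps at u64::MAX instead of wrapping on overflow.
    rec.bytes = rec.bytes.saturating_mul(sampling_rate);
    rec.packets = rec.packets.saturating_mul(sampling_rate);
}

/// Sum the estimated bytes of records that may each carry a different rate.
fn estimated_total_bytes(records: &[FlowRecord]) -> u64 {
    records
        .iter()
        .cloned()
        .map(|mut rec| {
            apply_sampling(&mut rec);
            rec.bytes
        })
        .fold(0u64, |acc, bytes| acc.saturating_add(bytes))
}
```

Because each record is scaled before any aggregation, a 1-in-1000 exporter and a 1-in-100 exporter can be summed directly. A record with rate 0 contributes only its raw bytes, which is the undercount case that stays documented.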
Removed false claims wherever they appeared:
- README.md "What sampling does to your numbers" -- "works correctly
only if all your exporters use the same sampling rate" and "the
clean path: keep sampling rates uniform across your network".
Rewrote the paragraph to state per-flow multiplication, explain
why the UI does not surface a single rate (mixed rates have no
meaningful display value; uniform rates are already known to the
operator), and keep the real statistical-floor caveat (sampling
can miss small / short flows regardless of rate uniformity).
- field-reference.md and anti-patterns.md -- `RAW_BYTES` no longer
framed as "use when sampling is uniform". Now correctly framed as
the literal pre-multiplication value the exporter sent.
- troubleshooting.md "Bandwidth doesn't match SNMP" -- "Mixed
sampling rates ... isn't comparable to any single SNMP
measurement" replaced with the actual mistake (comparing
aggregates of many interfaces to a single interface SNMP counter).
Per-flow multiplication is correct regardless of rate uniformity.
- validation.md -- "undocumented sampling rate changes" dropped from
the silent-failure intro; the "Sampling rate change" monitoring
table row removed (per-flow multiplication absorbs rate changes).
- investigation-playbooks.md -- "Sampling rate of the exporter (so
the numbers can be interpreted)" deliverable removed and "A
change in sampling rate during the analysis window invalidates
the trend" caveat removed. Both wrong under per-flow scaling.
- anti-patterns.md cross-protocol-counts section -- "Same goes for
sampling-rate differences across exporters" removed; the
protocol-counts-not-comparable point stays.
The F2/F3 doubling-fix wording ("filter by exporter + interface +
direction") was also wrong and got cleaned in the same anti-patterns
summary table row.
What stays: NetFlow v7 / v5 with rate=0 / v9 / IPFIX without a
Sampling Options Template are real cases where the plugin sees no
rate and undercounts. Those remain documented as the actual silent
failure mode.
Items deferred to F14 / F15 which rewrite their containing sections:
- validation.md silent-failure items #2, #3, #5 (F14 removes them as
a block).
- anti-patterns.md "Ignoring the sampling rate" section + its
summary-table row (F15 removes the section entirely).
Logged as F4+F5 in SOW-0014 regression log with code references.
…ment
The globe and city map render the same query response with the same
table beneath. The 3D projection is not "less useful for analysis";
it is a different rendering with different strengths.
Replaced both occurrences with a neutral framing that the 2D city map
is best for in-continent precision and the 3D globe is best when
distance and great-circle paths matter (transcontinental traffic,
undersea cables, intercontinental CDN routing).
While here, fixed the same page's "Mirroring" subsection to drop the
F2 symmetry myth ("25 top-N = 12 conversations" implied a 1:1 pairing
of A->B and B->A by volume; bidirectional traffic is usually
asymmetric).
Logged as F6 in SOW-0014 regression log.
The docs claimed "the Network Flows tab should appear in the top navigation". The actual Netdata UI exposes Network Flows as a Function under the **Live** tab. Verified against docs/dashboards-and-charts/live-tab.md.
Adopted convention:
- "Open Network Flows" (verb)
- "the Network Flows view" (noun)
- "Click the Live tab in the top navigation; Network Flows appears in the Functions list on the right" (setup context).
Swept every "Network Flows tab" reference in:
- installation.md (the original bad sentence + follow-up).
- troubleshooting.md, investigation-playbooks.md, anti-patterns.md.
- visualization/dashboard-cards.md (4 occurrences).
- visualization/summary-sankey.md.
Logged as F7 in SOW-0014 regression log.
Cleans up an unjustified schema redundancy. Today the plugin accepts both top-level `size_of_journal_files` / `duration_of_journal_files` under `journal` AND per-tier values under `journal.tiers.<tier>`. The runtime already uses only per-tier values; the global was just a default that flowed through `retention_for_tier()`. The two paths gave operators a configuration surface with no underlying behavioural difference.
After:
- `JournalConfig` carries only `tiers: JournalTierRetentionOverrides` (plus journal_dir and the query guardrails). No globals.
- Each `JournalTierRetentionConfig` is `Option<ByteSize> / Option<Duration>` directly. Omitted fields fall back to the built-in tier defaults (uniform 10GB / 7d, preserving today's default behaviour). Explicit `null` disables that limit on that tier; validation still requires at least one positive limit.
- The `RetentionLimitOverride<T>` enum, its serializer, deserializer, and `resolve` helpers are removed -- no longer reachable.
- The orphan `parse_bytesize` helper that fed the removed clap `value_parser` is gone.
Tests rewritten to exercise the per-tier-only schema:
- `journal_tier_retention_uses_built_in_tier_defaults`
- `journal_tier_retention_uses_per_tier_values_when_present`
- `journal_rotation_size_derives_from_tier_size_budget`
- `journal_rotation_size_uses_100mb_for_time_only_retention`
- `journal_validation_rejects_tier_size_below_100mb`
- `journal_validation_allows_time_only_retention_when_size_is_disabled`
- `journal_tier_retention_null_disables_size_limit_for_that_tier_only`
Plus mechanical updates to memory_tests / startup_memory_tests to use the new schema. Full crate: 427 passed, 0 failed.
Documentation:
- configuration.md: rewrote the `## journal` section with a per-tier-only schema. Updated the production retention profile example. Replaced the existing "Top-level retention" subsection with a "Per-tier retention" subsection.
- retention-querying.md: dropped the global-form example; per-tier example only; cross-link to configuration.md.
Breaking change notice: any existing user config using `journal.size_of_journal_files` / `journal.duration_of_journal_files` at the top level will now fail to deserialize (deny_unknown_fields). Migrate by moving those values under `journal.tiers.<tier>.*`. The plugin is recently shipped (PR netdata#22439, 2026-05-07); breaking-change risk is low.
Logged as F8 in SOW-0014 regression log.
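The per-tier rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: `TierRetention` and `validate` are hypothetical names, sizes are plain byte counts in decimal units rather than `ByteSize`, and the validation order is guessed from the test names listed in the commit.

```rust
use std::time::Duration;

const MB: u64 = 1_000_000; // assumption: decimal units
const MIN_TIER_SIZE_BYTES: u64 = 100 * MB; // from the "below_100mb" test name

/// Hypothetical stand-in for one tier's retention settings.
/// `None` models an explicit `null`, i.e. that limit is disabled.
#[derive(Debug, Clone, PartialEq)]
struct TierRetention {
    size: Option<u64>,
    duration: Option<Duration>,
}

impl Default for TierRetention {
    /// Omitted fields fall back to the built-in defaults (10GB / 7d).
    fn default() -> Self {
        TierRetention {
            size: Some(10_000 * MB),
            duration: Some(Duration::from_secs(7 * 24 * 3600)),
        }
    }
}

/// Reject undersized tiers and tiers with no positive limit at all.
fn validate(tier: &TierRetention) -> Result<(), String> {
    if let Some(size) = tier.size {
        if size < MIN_TIER_SIZE_BYTES {
            return Err(format!(
                "tier size {} is below the {}-byte minimum",
                size, MIN_TIER_SIZE_BYTES
            ));
        }
    }
    let has_duration = tier.duration.map_or(false, |d| !d.is_zero());
    if tier.size.is_none() && !has_duration {
        return Err("at least one positive limit is required".to_string());
    }
    Ok(())
}
```

In this model a time-only tier (`size: None`, positive `duration`) validates, matching the `journal_validation_allows_time_only_retention_when_size_is_disabled` test name, while a tier with both limits disabled is rejected.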
Code investigation: these two journal config knobs were declared, validated, and exposed in YAML, but nothing in `src/.../query/` ever read them. The actual tier auto-pick logic in `query/planner/spans.rs::plan_query_tier_spans_recursive` selects the coarser tier strictly from window / bucket-duration alignment math -- it does not consult either knob. Verdict: dead schema.
Removed both:
- field declarations on `JournalConfig`
- non-zero / ordering validation in validate_journal
- two YAML test fixtures that mentioned them
- the two configuration.md table rows + code-block lines + the "query-window limits" explanation
- the retention-querying.md sentence that referenced them
Updated the retention-querying explanation of tier auto-pick to match the actual behaviour: the planner uses bucket alignment, not config-driven window caps.
Build + tests: 427 passed, 0 failed.
Breaking change notice: any user config that set `journal.query_1m_max_window` or `journal.query_5m_max_window` will now fail to deserialize (deny_unknown_fields). The keys had no effect before; migration is delete-only.
Logged as F9 in SOW-0014 regression log.
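A rough sketch of window-driven tier selection with no config knob involved. The 100-bucket minimum and the 1m / 5m / 1h tier sizes come from the planner description in the later audit notes in this log. `Tier` and `pick_tier` are hypothetical names, bucket-boundary alignment is ignored, and falling back to the raw tier below 100 minutes is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Raw,
    Minute1,
    Minute5,
    Hour1,
}

/// Minimum number of whole buckets a tier must supply for the window.
const MIN_BUCKETS: u64 = 100;

/// Walk the rollup tiers coarsest first and take the first one that
/// yields at least MIN_BUCKETS buckets for the requested window.
fn pick_tier(window: Duration) -> Tier {
    let candidates = [
        (Tier::Hour1, 3600u64),
        (Tier::Minute5, 300),
        (Tier::Minute1, 60),
    ];
    for &(tier, bucket_secs) in candidates.iter() {
        if window.as_secs() / bucket_secs >= MIN_BUCKETS {
            return tier;
        }
    }
    // Assumption: windows too short for any rollup use the raw tier.
    Tier::Raw
}
```

Under these assumptions a 100-hour window selects the 1-hour tier, an 8h20m window selects the 5-minute tier, and a 100-minute window selects the 1-minute tier.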
…values_per_field; document properly
Two journal config knobs, opposite verdicts after code investigation:
- query_max_groups: REAL. Read at query/service.rs:52 and threaded into
the projected group accumulator via query/projected/apply.rs:48. When
the accumulator's grouped_total() exceeds the limit, additional group
keys are folded into a synthetic __overflow__ bucket
(query/grouping/labels.rs:17, query/grouping/model/compact.rs:35) and
the response carries a warning ("Group accumulator limit reached;
additional groups were folded into __overflow__" at
query/timeseries.rs:124). Bounds memory on accidentally wide group-by
combinations. Keep; document properly.
- query_facet_max_values_per_field: DEAD. Declared, validated for
non-zero, but the consumer at query/facets/render.rs:19,27 uses the
hardcoded constant DEFAULT_FACET_ACCUMULATOR_MAX_VALUES_PER_FIELD
(query/request/constants.rs:17) instead of the config knob. The two
coincidentally have the same default value (5000) but the config knob
is never threaded to the consumer. Remove.
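The overflow behaviour described for query_max_groups can be sketched as below. This is a simplified model, not the plugin's accumulator: `GroupAccumulator` and its methods are hypothetical, group keys are plain strings, and only the `__overflow__` label and the warning text are taken from the commit message.

```rust
use std::collections::HashMap;

const OVERFLOW_KEY: &str = "__overflow__";
const OVERFLOW_WARNING: &str =
    "Group accumulator limit reached; additional groups were folded into __overflow__";

/// Hypothetical accumulator that bounds the number of distinct group keys.
struct GroupAccumulator {
    max_groups: usize,
    groups: HashMap<String, u64>,
    overflow: Option<u64>,
}

impl GroupAccumulator {
    fn new(max_groups: usize) -> Self {
        GroupAccumulator { max_groups, groups: HashMap::new(), overflow: None }
    }

    /// Add bytes to a group; new keys beyond the limit go to the overflow bucket.
    fn add(&mut self, key: &str, bytes: u64) {
        if let Some(total) = self.groups.get_mut(key) {
            *total = total.saturating_add(bytes);
        } else if self.groups.len() < self.max_groups {
            self.groups.insert(key.to_string(), bytes);
        } else {
            self.overflow = Some(self.overflow.unwrap_or(0).saturating_add(bytes));
        }
    }

    /// Total for one key; the synthetic overflow bucket is addressed by its label.
    fn total(&self, key: &str) -> Option<u64> {
        if key == OVERFLOW_KEY {
            self.overflow
        } else {
            self.groups.get(key).copied()
        }
    }

    /// Warning to attach to the response once anything was folded.
    fn warning(&self) -> Option<&'static str> {
        self.overflow.map(|_| OVERFLOW_WARNING)
    }
}
```

Memory stays bounded by `max_groups` entries plus one counter, which is the point of the guardrail: an accidentally wide group-by degrades into one large overflow row plus a warning rather than unbounded growth.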
Code:
- types/journal.rs: removed the dead field; added a doc comment for
query_max_groups explaining the __overflow__ bucket and the warning.
- defaults.rs: removed default_query_facet_max_values_per_field().
- validation/journal.rs: removed the non-zero check for the dead knob.
- plugin_config_tests.rs: removed
validate_rejects_zero_query_facet_max_values_per_field test entirely;
cleaned YAML fixtures.
Stock config + README:
- src/crates/netflow-plugin/configs/netflow.yaml: rewrote the journal
block to use the per-tier retention form (carries over the F8 schema
in the stock config); dropped both dead knobs; added clear comments
for query_max_groups.
- src/crates/netflow-plugin/README.md: example updated, explanatory
paragraph rewritten to describe what query_max_groups actually does.
Docs:
- configuration.md: Query guardrails table now lists only
query_max_groups, with full description of overflow behaviour.
- retention-querying.md: Group-by limit section consolidated.
- visualization/filters-facets.md: removed the "Facet limits"
subsection that documented the dead knob.
Build + tests: 426 passed, 0 failed (one dead-knob validation test
removed).
Breaking change notice: any user config setting
journal.query_facet_max_values_per_field will now fail to deserialize.
The key had no effect before; migration is delete-only.
Logged as F10 in SOW-0014 regression log.
The file existed as 0 bytes since the original documentation rewrite.
The netlify deploy preview rendered it as an empty page. Multiple other
pages cross-link to it (asn-resolution, static-metadata, network-identity,
README, configuration, the four provider integration cards), so the
empty placeholder was both a UX failure and a coverage gap.
Authored from scratch, code-grounded against:
- src/.../plugin_config/types/enrichment/geoip.rs (config schema)
- src/.../plugin_config/runtime.rs (auto-detect path)
- src/.../enrichment/data/geoip/resolver.rs (load + 30s refresh + per-IP
composing-multiple-databases lookup)
- src/.../enrichment/data/network/asn.rs (AS-name rendering)
Page covers: the fields IP intelligence populates (with tier-preservation
notes), the configuration schema, auto-detection, refresh cadence, lookup
order vs the broader ASN provider chain, the four provider integration
cards (DB-IP / MaxMind / IPtoASN / Custom), private-IP rendering,
IPv6/IPv4 database split behaviour, staleness and accuracy caveats, and a
failure-modes table.
Frontmatter `learn_rel_path` matches the bgp-routing / network-identity
siblings ("Network Flows/Enrichment Concepts") for now -- the source
frontmatter is informational; the actual sidebar position derives from
docs/.map/map.yaml, and F20 will rename the section consistently.
Logged as F11 in SOW-0014 regression log.
…hment"
User: "'Encrichement Concepts' is a wrong title. 'Flows Enrichement' is the right one."
Renamed in:
- docs/.map/map.yaml line 499 (the section label that drives the actual sidebar position on Learn).
- All seven `learn_rel_path` frontmatter values across docs/network-flows/enrichment/*.md. Prior state was inconsistent (4 files had "Network Flows/Enrichment", 2 had "Network Flows/Enrichment Concepts", 1 had the F11-introduced "Flows Enrichment"). Settled on the canonical "Network Flows/Flows Enrichment" everywhere.
Logged as F20 in SOW-0014 regression log.
…ols"
User: "'Sources' is too generic. 'Flow Protocols' is the right one."
Renamed in:
- integrations/categories.yaml: flows.sources.name now "Flow Protocols".
- The three protocol-card frontmatter values
(`netflow.md`, `ipfix.md`, `sflow.md`) now declare
`learn_rel_path: "Network Flows/Flow Protocols"`.
- src/crates/netflow-plugin/metadata.yaml: removed three
self-referencing learn URLs that pointed at
/docs/network-flows/sources/{netflow,ipfix,sflow}. These links
were broken before the rename (no /docs/network-flows/sources
directory exists in source) and would stay broken under the new
label. Replaced with the surviving "Network Flows Overview"
anchor.
- Re-ran integrations/gen_integrations.py + gen_docs_integrations.py
to regenerate the three protocol cards. Both exit clean.
Logged as F21 in SOW-0014 regression log.
User: "Netdata logs in namespace 'netdata'. Journalctl needs `--namespace netdata`."
`-u netdata` selects the systemd UNIT and captures only stdout/stderr the unit emits to the journal. Netdata writes structured logs into a journal NAMESPACE called `netdata`. Without `--namespace netdata`, users see at most unit-level startup/shutdown messages -- not the plugin output that helps with debugging.
Swept all `journalctl -u netdata` invocations to `journalctl --namespace netdata` across:
- quick-start.md
- troubleshooting.md (5 occurrences)
- installation.md
- enrichment/network-identity.md
Grep clean afterwards.
Logged as F18 in SOW-0014 regression log.
…tern
User: "How is it possible for users to ignore the sampling rate if we calculate the estimated volume at ingestion? You invented reasons for it. ... section must be removed."
The premise was wrong on multiple counts: per-flow multiplication is always consistent (each record carries its own rate), users CAN'T "ignore" the rate because Netdata applies it automatically at decode time, and the "uniform rates required" framing is exactly the myth F4 + F5 already corrected.
The two real concerns the section conflated remain documented elsewhere:
- small flows missed at high sampling rates -- preserved in the Overview's "What sampling does to your numbers" section and in investigation-playbooks "Caveats".
- exporter sends no rate (v7 / v5 rate=0 / v9-IPFIX without Sampling Options Template) -- preserved in troubleshooting "Bandwidth doesn't match SNMP" and in validation.md.
Removed the entire section. The section renumbering will land with F17 once all three section removals have completed.
Logged as F15 in SOW-0014 regression log.
…i-pattern
User: "Geolocation does not position internal IPs on the map. ... section must be removed."
Code-verified at src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:40-72. `apply_geo_record` writes country/state/city/latitude/longitude only when the MMDB record carries non-empty values for those fields. For RFC 1918 / private IPs, the MMDB either has no entry or has one tagged `ip_class: "private"` with no country/city/coords. Internal IPs simply do not appear on geographic maps. The "in random countries" claim was invented.
Removed the entire section. The section renumbering lands with F17. The troubleshooting.md "Internal IPs in random countries" subsection carries the same invented claim and will be addressed under F19.
Logged as F16 in SOW-0014 regression log.
…ds" + renumber
User: "Netdata does not support alerting of flows yet. Remove this section."
The section's own footnote already acknowledged the issue: "Netdata's alerting on flow data is in development; for now this pattern lives in your monitoring practice, not in the plugin." So it was advice for users to apply outside Netdata, not a Netdata anti-pattern.
Plus: with F15, F16, F17 all deleting sections, this commit renumbers the anti-patterns.md sections to be sequential again (1 ... 9). Dropped three rows from the summary table at the bottom (Ignored sampling, GeoIP for internal IPs, Absolute thresholds) that referred to the removed sections.
Updated the cross-link in visualization/time-series.md that called out "time-shifted comparison beats absolute thresholds" -- replaced with a general anti-patterns reference since the specific section is gone.
Logged as F17 in SOW-0014 regression log.
User: "This page has a mix of all the above issues: sampling, geoip, etc."
Surgical fixes after F2-F18 land:
- Removed "Internal IPs in random countries" subsection -- same invented claim as F16 (GeoIP does not position internal IPs on maps; code-verified at src/.../enrichment/data/geoip/decode.rs:40-72).
- Rewrote "Things that look like bugs but aren't" entries to remove the F2/F3 "filter to one direction" framing and the F16 GeoIP myth. The two doubling-related bullets now point at exporter+interface filtering and at Source/Destination ASN filtering for bidirectional. Renamed "tier-0" -> "raw-tier" for consistency with the field reference.
Items kept (framing already correct after earlier findings):
- "Sampling rate not honoured by the exporter" -- correct (F4/F5).
- Doubling references in the SNMP-mismatch table (F2/F3).
- ASN provider chain debug recipe.
- Decapsulation destructive-on-non-tunnel.
Items deferred to per-page audit (R2):
- "Cisco's default template refresh is 30 minutes" -- vendor-specific claim; verify against current Cisco IOS-XE/XR docs in R2.
Logged as F19 in SOW-0014 regression log.
User: "I think the entire 'Validation and Data Quality' is completely off. It mentions again sampling rates, etc."
Code-verified facts driving the rewrite:
- Per-flow sampling multiplication at decode time (decoder/record/core/record.rs:24-26): users don't need to monitor "sampling rate change" or "sampling rate misinterpretation".
- Template persistence across restarts (decoder/protocol/v9/templates.rs:106 + decoder/protocol/ipfix/templates/data.rs:67): users don't need to monitor "template loss after collector restart".
- UDP buffer overflow alert already exists (src/health/health.d/udp_errors.conf:6-19): kernel-level UDP drops are signalled by an existing system alert, not a silent failure.
Rewrite:
- New "What you actually need to watch" table -- five real failure modes (kernel UDP drops via existing alert, exporter stopped sending, wrong interfaces being exported, exporter sampling without communicating the rate, stale MMDB).
- Removed the three invented silent-failure items (sampling misinterpretation, sampling change, template loss).
- Removed the "Internal IP enrichment validation" section (F16 confirmed GeoIP does not position internal IPs).
- Renamed "Sampling rate sanity check" to "Sampling rate verification" with the uniform-rate myth gone; kept the practical RAW_BYTES vs BYTES comparison.
- Removed the "Template cache health" subsection.
- Renamed the alerting table to "Plugin-side signals worth alerting on" and clarified these are signals the plugin exposes for the operator, not "silent failures" the dashboard hides.
Logged as F14 in SOW-0014 regression log.
…actical guide
User: "People want sizing and planning directions. This is not an academic paper, not a blog."
Rewrote the page from scratch around the user's seven requirements:
- plugin cap (single-thread post-decode; ~25k flows/s sustained; ISP-scale anchor)
- how ingestion rate maps to storage (single table, 4 rows; ~800 bytes/flow empirical)
- raw tier dominates; bound it; example per-tier production config
- fast NVMe is the right call for the raw tier; slow storage means shorter retention
- memory: routing-trie footprint + page-cache headroom
- query speed: indexed fields fast; FTS = full scan of raw tier
- distributed deployment as the scaling answer (one agent per router/site; federated via Netdata Cloud; no central aggregation needed for flow data)
Removed:
- All benchmark tables and methodology. Engineering benchmark numbers remain in src/crates/netflow-plugin/README.md.
- The "Bounding storage for capacity planning" formula derivation (ignored tier rollover and dedup; partly invalid).
Logged as F13 in SOW-0014 regression log.
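A back-of-envelope sketch of how ingestion rate maps to raw-tier storage, using the ~25k flows/s and ~800 bytes/flow figures quoted above. The function names are hypothetical, and the estimate deliberately ignores rollup tiers, rotation, and deduplication, so treat the results as an order-of-magnitude guide rather than a sizing formula.

```rust
/// Raw-tier write rate in bytes per second.
fn ingest_bytes_per_sec(flows_per_sec: u64, bytes_per_flow: u64) -> u64 {
    flows_per_sec.saturating_mul(bytes_per_flow)
}

/// Hours of ingestion until a size cap is reached at a constant rate.
fn hours_until_size_cap(cap_bytes: u64, flows_per_sec: u64, bytes_per_flow: u64) -> f64 {
    let rate = ingest_bytes_per_sec(flows_per_sec, bytes_per_flow);
    if rate == 0 {
        // Nothing is written, so the cap is never reached.
        return f64::INFINITY;
    }
    cap_bytes as f64 / rate as f64 / 3600.0
}

/// Bytes needed to hold the given number of hours at a constant rate.
fn bytes_for_retention(hours: u64, flows_per_sec: u64, bytes_per_flow: u64) -> u64 {
    ingest_bytes_per_sec(flows_per_sec, bytes_per_flow)
        .saturating_mul(hours)
        .saturating_mul(3600)
}
```

At 25,000 flows/s and 800 bytes/flow the write rate is 20 MB/s, so 24 hours of raw data needs about 1.73 TB (decimal), while a 200 GB cap would be reached after roughly 2.8 hours.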
…ion Overview
User on F12: "Retention is closer to configuration and querying is closer to visualization. ... If you need to put generic visualization rules, these should be a generic 'Visualization/Overview' page, to explain FTS, sharing, grouping, etc."
User on F22: "The 'Section index' in the overview page is not needed. Learn already shows the index as a side bar."
F12:
- New visualization/overview.md page collects "how queries work", "group-by limit and overflow", "full-text search", "URL sharing", filtering pointers, "picking the right view".
- retention-querying.md slimmed to retention-only (tiers, what survives rollup, tier auto-pick, "no data", what forces raw tier, default retention misconfig). Sidebar label renamed to "Retention and Tiers".
- map.yaml: Visualization sub-section root now carries edit_url + description pointing at visualization/overview.md (matches the F1 pattern). Retention sidebar label renamed.
F22:
- Removed the "## Section index" block from README.md. The Learn sidebar already renders the same hierarchy. The "Where to start" role-based pointer block stays (not a sidebar duplicate). The "specific feature in depth" bullet now points readers at the sidebar.
Logged as F12 + F22 in SOW-0014 regression log.
…Overview
Per user: the bullet was redundant with the F22 cleanup. Sidebar guidance is implicit; the four role-based bullets above are the intended "where to start" entry points.
Documentation audits against source code surfaced a small number of inaccuracies and a few low-severity polish items across the most-edited pages. Fixed in-place, with the same surgical-edit policy as Phase R1.
Critical/high severity:
- configuration.md: drop the false claim that listener / protocols / journal keys can also appear at the top level. The flatten attribute is clap-only (CLI flag flatten), not serde; with deny_unknown_fields the YAML schema rejects unknown top-level keys. Stock file uses the nested form.
- validation.md: replace the wrong `dRcv` ss column reference with the actual `d<N>` value inside the `skmem:(...)` line (the sock_drop counter from iproute2 ss output).
- validation.md: replace the unreachable RAW_BYTES vs BYTES dashboard comparison with the supported approach -- group by the Sampling Rate field. RAW_BYTES is filtered from supported_flow_field_names and is not surfaced as a default table column.
Medium severity:
- README.md: rename "Source ASN" to "Source AS Name" in the default group-by description; the actual default uses SRC_AS_NAME, which the dashboard renders as "Source AS Name". Same fix swept through quick-start.md, investigation-playbooks.md, troubleshooting.md, visualization/time-series.md, visualization/summary-sankey.md so the doc text matches the dashboard label.
- README.md, quick-start.md, anti-patterns.md, validation.md, visualization/summary-sankey.md: soften the "doubling by default" framing. Both ingress + egress export is a common configuration but not a property of the protocol, and vendor best practice is ingress-only. Wording now reflects that.
- validation.md: note the udp_errors alert ships as `to: silent` by default; operators must override `to:` to receive notifications.
- validation.md: add a `du -sh` example for cross-checking on-disk tier sizes (cross-link to sizing-capacity.md).
Low-severity polish:
- README.md: classifier expression language is an Akvorado-compatible subset (matches classifiers.md framing).
- README.md: rollup tier note now mentions the dropped fields, so the tier auto-pick claim is not over-broad.
- sizing-capacity.md: drop the "after rotation and compression" qualifier on the 800-bytes/flow figure (the bench window is too short to reflect rotation cycles); rephrase the ingest description and the "spinning rust" sentence; soften the BMP/BioRIS RSS guidance to a rough estimate with bench numbers anchored.
- configuration.md: document the query_max_groups / query-max-groups alias; document enrichment.geoip.optional and the abort-vs-warn semantics; clarify that default_sampling_rate and override_sampling_rate both accept a single integer or a per-prefix map; mention the 100 MB rotation-size fallback when size_of_journal_files is null.
- anti-patterns.md: add the missing "What it costs" line to sections 8 and 9 for shape consistency with sections 1-7.
Second round of per-page audits against source code surfaced several
critical inaccuracies plus the usual long tail of low-severity polish.
Critical / high:
- retention-querying.md: rewrite the tier auto-pick rules. The previous
thresholds were inverted ("8h20m and longer -> 1-hour") -- the actual
planner walks coarsest first and accepts the first tier with at least
100 aligned buckets, so >=100h -> 1h, 8h20m..<100h -> 5m,
100min..<8h20m -> 1m. Verified at
src/crates/netflow-plugin/src/query/planner/timeseries.rs:34-46 and
TIMESERIES_MIN_BUCKETS=100 at src/crates/netflow-plugin/src/query/request/constants.rs:18.
- retention-querying.md: rewrite the rollup-preserved field list. The
previous list claimed AS path, BGP communities, MPLS labels, MACs, and
post-NAT addresses survive into rollups -- they do not. The actual
rollup tier carries only the fields defined in
src/crates/netflow-plugin/src/tiering/rollup/schema/fields/defs/{core,exporter,interface,network,presence}.rs;
every other field is raw-only and forces the query to the raw tier.
- retention-querying.md: correct the "no data" / fallback semantics. The
planner does NOT fall back to a coarser tier for raw-only queries --
rollups don't carry the field, so the span returns empty. Confirmed
at src/crates/netflow-plugin/src/query/planner/prepare.rs:25-28 and
src/crates/netflow-plugin/src/query/planner/spans.rs:99-105
(lower_fallback_candidate_tiers returns &[] for Raw).
- validation.md: replace the "group by Sampling Rate field" verification
with an SNMP-magnitude cross-check. The SAMPLING_RATE field is
filtered out of supported_flow_field_names
(src/crates/netflow-plugin/src/query/request/constants.rs:80),
excluded from the groupable set
(src/crates/netflow-plugin/src/query/fields/rules.rs:33), and not
available as a facet
(src/crates/netflow-plugin/src/facet_catalog.rs:123). Users cannot
pick it from the dashboard. The honest verification path is SNMP
magnitude or a per-prefix override.
- validation.md: correct the alert threshold framing. RcvbufErrors is
read with RRD_ALGORITHM_INCREMENTAL
(src/collectors/proc.plugin/proc_net_netstat.c:400-434), so the value
Netdata stores is per-second. The "lookup: average -1m absolute" plus
"$this > 10" in src/health/health.d/udp_errors.conf means >10
errors/SECOND averaged over 1 minute, not >10/minute as previously
stated.
- quick-start.md: correct the field labels used in the doubling fix
step from "Input Interface Name" / "Output Interface Name" to
"Ingress Interface Name" / "Egress Interface Name", matching the
display labels in src/crates/netflow-plugin/src/presentation/display.rs:39-40.
Same fix swept across anti-patterns.md, troubleshooting.md,
validation.md, investigation-playbooks.md, and
visualization/summary-sankey.md.
- quick-start.md: drop the false "60-second template refresh" claim for
softflowd. softflowd's `expint` flag controls expiry-check interval,
not template refresh; the NetFlow v9 template interval in softflowd
is a compile-time default
(NF9_DEFAULT_TEMPLATE_INTERVAL=16 in netflow9.c) with no CLI knob.
- quick-start.md: complete the Juniper J-Flow snippet. The previous
example defined a sampling instance but never bound it to a
forwarding card and never set a sampling rate, so it would not
produce flows. Add `set chassis fpc 0 sampling-instance NETDATA` and
`set forwarding-options sampling instance NETDATA input rate 1000`,
with a short note explaining the FPC binding requirement.
- quick-start.md: correct the dashboard navigation step from "click the
Network Flows tab" to "open the Live tab and select Network Flows
from the Functions list".
- ip-intelligence.md: correct the GeoLite2 / DB-IP / IPtoASN cadence
claims. DB-IP Lite is monthly. MaxMind GeoLite2 City/Country update
on weekdays; GeoLite2 ASN updates daily since June 2024. IPtoASN is
not MMDB -- it is a public-domain TSV feed that includes both ASN
and country and must be converted to MMDB before the plugin can read
it (the plugin only supports MMDB).
- ip-intelligence.md: correct the dual-stack guidance. Most current
providers ship a single dual-stack MMDB; the previous "configure
both an IPv4 file and a separate IPv6 file" advice was misleading.
- ip-intelligence.md: clarify the asn_providers chain semantics. The
`geoip` provider is a terminal "use 0" shortcut -- when reached the
AS number is forced to 0 (the AS name still comes from the MMDB
lookup independently). Confirmed at
src/crates/netflow-plugin/src/enrichment/asn/resolve.rs:75-109.
- ip-intelligence.md: distinguish the database-composition rules. ASN
fields use pure last-wins; geo fields are written only when the
matching record has a non-empty value, so a later database with an
empty city does not overwrite an earlier database's city
(src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:40-72).
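The validation.md threshold correction above can be checked with a small calculation. This is a minimal sketch, not Netdata's health engine: it assumes one counter sample per second, approximates the incremental algorithm as "delta divided by elapsed time", and approximates `lookup: average -1m absolute` as the mean of absolute per-second rates. The function names are hypothetical.

```python
def per_second_rates(samples):
    """Turn (time_s, counter) samples into per-second rates, as an
    incremental (counter-delta) algorithm would store them."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]


def alert_fires(samples, threshold=10.0):
    """Approximate 'average over the window, absolute' then '$this > 10'."""
    rates = [abs(r) for r in per_second_rates(samples)]
    return sum(rates) / len(rates) > threshold


# 11 errors spread over a whole minute: about 0.18 errors/s on average.
slow = [(t, t * 11 // 60) for t in range(61)]
# 11 errors every second: 660 errors over the minute.
fast = [(t, t * 11) for t in range(61)]

print(alert_fires(slow), alert_fires(fast))
```

Under these assumptions, eleven drops in a minute stay far below the threshold, while eleven drops per second cross it, which is the distinction the corrected wording makes.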
Medium / low:
- sizing-capacity.md: rewrite the example raw-tier YAML so the size cap
matches the page's own 25k flows/s framing. The previous example used
200GB / 24h; at 25k flows/s the size cap would fire after ~2.8h, not
24h. Now uses 2TB / 24h with a paragraph on how to scale down for
lighter loads, and explains the size-vs-duration relationship.
- validation.md: drop the misleading `decoder_state_dir` config-key
reference (it is a derived path, not a user-facing key); add the
`-n` flag to the ss command to keep the port numeric in the output.
- anti-patterns.md / troubleshooting.md / validation.md /
visualization/summary-sankey.md: consistent doubling-framing hedge
("a common configuration; vendor best practice is ingress-only")
across all five docs that mention doubling.
- anti-patterns.md: summary-table row "Doubled aggregate" qualified
with "(when ingress + egress are both exported)".
- configuration.md: correct the `override_sampling_rate` default
example from `{}` to `~` (the actual schema default is None).
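The size-vs-duration arithmetic behind the sizing-capacity.md bullet can be reproduced from the figures quoted in that bullet. This is a rough sketch under stated assumptions: decimal units (1 GB = 10^9 bytes), a constant per-flow on-disk footprint, and a footprint back-solved from the "200GB fills in ~2.8h at 25k flows/s" statement rather than measured. The helper names are hypothetical.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def implied_bytes_per_flow(cap_bytes, flows_per_s, hours_until_full):
    """Back-solve the average on-disk bytes per flow from a cap that fills
    after a known number of hours at a known flow rate."""
    return cap_bytes / (flows_per_s * hours_until_full * 3600)


def cap_for_duration(bytes_per_flow, flows_per_s, hours):
    """Size cap needed so the duration limit, not the size limit, fires first."""
    return bytes_per_flow * flows_per_s * hours * 3600


per_flow = implied_bytes_per_flow(200 * GB, 25_000, 2.8)
needed_24h = cap_for_duration(per_flow, 25_000, 24)
print(f"{per_flow:.0f} bytes/flow, {needed_24h / TB:.2f} TB for 24h")
```

With these inputs the implied footprint is a little under 800 bytes per flow and a 24h window needs roughly 1.7 TB, so a 2TB cap leaves some headroom. Lighter loads scale the cap down linearly with the flow rate.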
retention-querying.md:
- correct the Time-Series sub-100-min fallback. The planner walks
coarsest-first and falls back to the 1-minute tier
(TierKind::Minute1) when no tier has >=100 aligned buckets, not to
raw. Verified at
src/crates/netflow-plugin/src/query/planner/timeseries.rs:39-46.
- split the field-eligibility list. The "force raw" set is exactly
RAW_ONLY_FIELDS plus V9_*/IPFIX_* prefixes
(src/crates/netflow-plugin/src/query/fields/rules.rs:5-11 +
src/crates/netflow-plugin/src/query/request/constants.rs:46-57). The
previously included AS path / BGP communities / MPLS labels / MAC
addresses / NAT addresses do NOT switch tier; they are dropped from
rollup output and return null on rollup queries. Page now describes
both classes separately.
troubleshooting.md:
- correct Cisco's default template refresh from "30 minutes" to
"600 seconds (10 minutes)" -- this is the IOS / IOS-XE Flexible
NetFlow `template data timeout` default.
- replace the wrong "/proc/net/udp ... RcvbufErrors column" recipe. The
/proc/net/udp file lists open sockets without per-socket drop
counters; the kernel-wide RcvbufErrors total lives under the Udp:
line of /proc/net/snmp, which is what Netdata's proc.plugin reads
(src/collectors/proc.plugin/proc_net_netstat.c:1521). Also dropped
the contradictory "30-60 seconds" template-refresh hint.
ip-intelligence.md:
- correct the ASN composition rule. The previous "last-wins for ASN,
but geo writes only when non-empty" framing was wrong --
src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:3-28
filters empty / zero values for ASN fields just like geo, so both
sets follow the same "last database with a non-empty value wins"
rule.
- correct the GeoLite2 cadence. MaxMind's documentation publishes City
and Country twice weekly (Tuesday and Friday); GeoLite2 ASN moved to
every weekday in June 2024.
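The Time-Series tier fallback described for retention-querying.md can be sketched as follows. This is an illustration only, not the planner's code: the tier names come from the per-tier configuration keys mentioned elsewhere in this PR, while the bucket sizes and the way "aligned buckets" are counted (whole buckets that start and end inside the window) are assumptions.

```python
# Coarsest first, as the commit message describes. Bucket sizes in seconds.
TIERS = [("hour_1", 3600), ("minute_5", 300), ("minute_1", 60)]
MIN_BUCKETS = 100


def aligned_buckets(start, end, step):
    """Count whole buckets of size `step` that fit inside [start, end)."""
    first = -(-start // step) * step  # round start up to a bucket boundary
    last = (end // step) * step       # round end down to a bucket boundary
    return max(0, (last - first) // step)


def pick_tier(start, end):
    """Walk coarsest-first; fall back to the 1-minute tier, never to raw."""
    for name, step in TIERS:
        if aligned_buckets(start, end, step) >= MIN_BUCKETS:
            return name
    return "minute_1"


print(pick_tier(0, 100 * 3600))  # long window: coarsest tier qualifies
print(pick_tier(0, 100 * 300))   # medium window: 5-minute tier qualifies
print(pick_tier(0, 50 * 60))     # under 100 minutes: falls back to 1-minute
```

The last call shows the corrected behaviour: a window shorter than 100 minutes cannot yield 100 buckets in any tier, and the sketch returns the 1-minute tier rather than raw.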
…+ Enrichment Methods
The previous category tree split enrichment-method integrations across
three separate sub-categories (IP Intelligence, BGP Routing, Network
Identity Sources), which made the integrations page navigation harder
than it needed to be and did not match the conceptual model: the
operator is choosing a *data source* for enrichment, regardless of what
kind of data it produces.
Collapsed under flows.enrichment-methods:
- ip-intelligence: dbip, maxmind, iptoasn, custom-mmdb
- bgp-routing: bmp, bioris
- network-identity: aws-ip-ranges, gcp-ip-ranges, azure-ip-ranges,
netbox, generic-ipam
flows.sources keeps its existing membership (netflow / ipfix / sflow)
and the "Flow Protocols" name unchanged. Both YAML files validated.
Per-card content merges and the new cross-cutting "Enrichment" +
"Enrichment Intel Downloader" pages land in subsequent commits.
…t integration cards
Round-by-round, agent-per-card merges. Each card absorbed the durable
provider-specific content from its corresponding concept page; cross-cutting
content was extracted for the new "Enrichment" page (separate commit). The
agents verified every behavioural claim against current source code at file:line
and every upstream URL by WebFetch -- not a mechanical sweep.
IP Intelligence:
- dbip: framed as the auto-detected default, monthly Lite cadence, CC-BY-4.0,
populated-fields breakdown (geo + ASN), AS0 Private/Unknown labels driven by
the DB-IP-built ip_class flag, raw-tier-only city/lat/lon.
- maxmind: GeoLite2 vs commercial GeoIP2 split, account-id + license-key auth,
twice-weekly Tuesday/Friday cadence for City/Country, every-weekday for ASN
since June 2024, geoipupdate setup. Important correction: the bundled
topology-ip-intel-downloader does NOT support MaxMind (only dbip and iptoasn);
the previous card's hint at the netdata downloader was misleading.
- iptoasn: PDDL public-domain feed, hourly TSV cadence (the previous card
said "daily" -- wrong), bundled topology-ip-intel-downloader natively
supports the TSV->MMDB conversion (correcting an earlier prompt assumption
to the contrary), three setup examples including ASN-only and combined
with DB-IP geo.
- custom-mmdb: reframed as the escape hatch for operators producing custom
MMDBs (CIDR overlays, internal AS labels). Lists the field names the
decoder reads from any MMDB, with file:line evidence; cites mmdbwriter
libraries; recommends `optional: true` during build iteration.
BGP Routing:
- bmp: BMP-v3-only handling -- v1/v2 silently dropped (previously implicit).
RFC 8671 cited for JunOS post-policy support since 18.3R1, separately
from RFC 7854 which was previously lumped together. Cisco IOS-XE BMP
added (was missing). Nokia SR OS added. JunOS minimum 13.3 documented.
No IANA-registered port for BMP.
- bioris: corrected the topology -- Netdata connects to a USER-RUN bio-rd
cmd/ris/ daemon over user-supplied gRPC, NOT directly to RIPE RIS. The
user's bio-rd daemon does the BGP/BMP peering with upstream sources. No
shipped collector list (ris_instances is required and operator-supplied).
Memory cost (~hundreds of MB per peer for full-table feeds), no eviction,
raw-tier-only AS path / communities.
Network Identity:
- aws-ip-ranges: schema reference (top-level + per-entry), live cadence
softened ("whenever AWS IP space changes, often several times per day"
rather than the folklore "every 15 minutes" -- AWS docs do not promise
any fixed schedule). Three jq examples including network_border_group as
site. Plugin's actual config key is `transform`, not `jq_program` (was
wrong in the migration prompt).
- gcp-ip-ranges: cloud.json vs goog.json comparison; cloud.json today
reports `service: "Google Cloud"` uniformly so per-service pivots are
not possible from this file. No fixed Google cadence in the docs.
- azure-ip-ranges: URL rotates weekly. Service Tag Discovery REST API as
authoritative alternative. API data lags JSON file by up to four weeks;
new IPs aren't used for at least one week after publication. Three
workaround patterns documented honestly.
- netbox: documented breaking change in NetBox 4.2 -- the `site` foreign
key on Prefix was replaced with the generic `scope` field; the previous
example used `(.site.name // "")` which silently breaks on 4.2+. New
card ships scope-aware (4.x) and legacy (3.x/4.0/4.1) examples plus a
fallback `(.scope.name // .site.name // "")`. Two token formats
documented (legacy v1 hex and v2 nbt_<key>.<token>).
- generic-ipam: full RemoteNetworkSourceConfig schema (13 options including
proxy, tls.enable, tls.verify, tls.skip_verify with explicit "rejected
by validation" notes). Honest call-outs: POST is sent without a body
(fetch.rs:11-17), interval floored at 60s by service.rs:73, TLS
verification cannot be disabled (validation/enrichment.rs:183-192).
Categories: each card now under flows.enrichment-methods (the previous
ip-intelligence / bgp-routing / network-identity sub-categories were
collapsed in the previous commit). Generated .md files updated via
integrations/gen_docs_integrations.py.
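The NetBox fallback expression `(.scope.name // .site.name // "")` mentioned in the netbox card notes relies on jq's alternative operator, which moves to the right-hand side when the left-hand side is null or false. The Python below is a hedged illustration of that selection logic on a NetBox-style prefix object; the function name and sample records are hypothetical, and the plugin itself evaluates the jq expression rather than anything like this.

```python
def site_name(prefix: dict) -> str:
    """Mimic (.scope.name // .site.name // ""): take the first candidate
    that is neither null nor false, else the empty string."""
    for key in ("scope", "site"):
        obj = prefix.get(key)
        value = obj.get("name") if isinstance(obj, dict) else None
        if value is not None and value is not False:
            return value
    return ""


# Hypothetical records: a 4.2+ style prefix, a legacy prefix, and a bare one.
print(site_name({"prefix": "10.0.0.0/24", "scope": {"name": "AMS1"}}))
print(site_name({"prefix": "10.1.0.0/24", "scope": None, "site": {"name": "LON2"}}))
print(repr(site_name({"prefix": "10.2.0.0/24"})))
```

The fallback form lets one transform cover both schema generations, at the cost of hiding which field actually supplied the value.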
…ownloader page
Three new cards under flows.enrichment-methods, completing the "every
enrichment method is an integration" model. Each card was authored by
an agent that read the corresponding concept page and the source code,
then produced a metadata.yaml entry with every claim cited at
file:line. The agents flagged real inconsistencies between the concept
pages and the code; corrections were absorbed into the cards.
static_metadata:
- Three configuration surfaces: enrichment.metadata_static.exporters,
enrichment.networks, and enrichment.override_sampling_rate (plus
default_sampling_rate for the distinction).
- Field-population table tied to source at apply/metadata.rs:41-53,
data/network/write.rs:93-125, apply/metadata.rs:78-97.
- Five corrections vs the previous concept page: boundary "undefined"
vs numeric 0 are byte-identical in output; lookup priority is
dominated by prefix specificity, not source-kind;
override_sampling_rate matches the UDP datagram source IP; the
"networks merges last and wins" framing was overstated.
classifiers:
- Two evaluation surfaces: exporter_classifiers +
interface_classifiers, the latter called twice per flow (once per
interface side).
- Akvorado-compatible expression-language *subset* -- explicitly only
the operators and actions implemented at
enrichment/classifiers/parse.rs; every example in the card uses
syntax verified against the parser and the existing test suite.
- Output normalisation includes "+" (concept page only listed ". -");
static metadata short-circuits classifier evaluation
(enrichment/classify.rs:117-119, :150-154); first-write-wins per slot
(runtime/eval/action.rs:43-46); default cache 5m with >=1s validation
(defaults.rs:46-48, validation/enrichment.rs:10-12).
decapsulation:
- Two modes (srv6, vxlan) per protocol.rs:50-57; default none.
- Three transport feeders: NetFlow v9 IE 104 (decoder.rs:80), IPFIX
IE 315 (decoder.rs:74), sFlow SampledHeader (sflow/record.rs:44-69).
- Inner parsers at decoder/common.rs:3-18 (VXLAN port 4789) and :35-63
(SRv6 SRH walker); merge points at packet/transport.rs:21-33 and
record/packet/parse/transport.rs:14-21.
- Vendor-verification finding: Cisco IOS-XE / IOS-XR collect datalink
frame-section could NOT be verified (cisco.com 403'd anonymous
WebFetch and Akvorado's IOS-XE recipe deliberately omits L2
frame-section export). The card flags Cisco support as unverified and
instructs operators to validate by template inspection. Juniper
inline-monitoring with datalink-frame-size confirmed via the Akvorado
mirror; sFlow header sampling confirmed via the project's decoder.
intel-downloader.md (new operator-tool page):
- Documents /usr/sbin/topology-ip-intel-downloader -- supported
sources, CLI flags, atomic replacement, auto-detect integration with
the netflow plugin's 30s reload window.
- Findings: no packaged systemd timer or cron file (operators must
install their own; page provides a starter unit + timer pair);
MaxMind support confirmed absent (no license_key field anywhere in
config.go; only iptoasn:combined and dbip:asn-lite/country-lite/
city-lite are recognised by builtInSource); MaxMind users directed to
geoipupdate as the alternative.
- Hidden capability documented: interesting_cidrs config knob lets
operators stamp public CIDRs as netdata.ip_class = "interesting" in
both ASN and geo MMDBs (write.go:228-246).
The categories.yaml category for all 14 enrichment methods is now
flows.enrichment-methods (previously split across ip-intelligence /
bgp-routing / network-identity).
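The "atomic replacement" property noted for the downloader matters because the plugin reloads MMDB files on change and must never see a half-written file. The pattern can be sketched in Python as below. This is not the downloader's implementation; it is the general write-to-temp-then-rename technique, with a hypothetical function name and file path.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write `data` to a temp file in the same directory, flush it to disk,
    then rename it over `path` so readers see either the old or new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


# Usage with a throwaway directory and a hypothetical file name.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "example.mmdb")
    atomic_write(target, b"first build")
    atomic_write(target, b"second build")
    with open(target, "rb") as handle:
        result = handle.read()
    leftovers = [n for n in os.listdir(workdir) if n.startswith(".tmp-")]

print(result, leftovers)
```

Creating the temp file in the destination directory is the important design choice: a rename across filesystems is not atomic.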
…; update map.yaml
Restructure phase 5+8+9: collapse the docs/network-flows/enrichment/
directory (7 concept pages: asn-resolution, bgp-routing, classifiers,
decapsulation, ip-intelligence, network-identity, static-metadata) into
ONE consolidated cross-cutting page at docs/network-flows/enrichment.md,
plus per-method integration cards (already added in previous commits).
The new Enrichment page (447 lines) is the single home for
cross-cutting concepts that span every enrichment method:
- Order of evaluation per flow record (8-step pipeline cited at
apply/resolve.rs:5-50 and init.rs:50-64).
- The two provider chains (asn_providers / net_providers, the geoip
terminal "use 0" shortcut, the AS-number-vs-AS-name distinction).
- Composition rules: specificity dominates, ties to static, per-field
non-empty-wins merge.
- The MMDB shared mechanism (auto-detect path order,
last-non-empty-wins composition, 30s signature reload, IPv4/IPv6
dual-stack handling).
- Network sources operational properties (fetch loop, 60s floor, jq
schema, deny_unknown_fields, TLS-no-disable, no pagination, no auth
helpers, POST-without-body, journal diagnostics).
- Static-metadata-blocks-classifiers semantics.
- Classifier evaluation surfaces and ordering.
- Decapsulation inner-packet override.
- Routing overlay (BMP+BioRIS shared trie).
- Cross-method operational properties: refresh windows, restart
behaviour, no in-process freshness signal, empty-tree disables
enricher, rollup tier survival table, geographic accuracy,
sampling-rate knobs, integration test gap.
The page resolved several discrepancies the original concept pages had
against the source code -- e.g. the "static metadata > classifiers >
network sources > GeoIP > BGP routing" precedence claim was misleading
(actual rule: specificity dominates, ties to static, merge primitive is
non-empty-wins); the bmp alias for routing lives at providers.rs:10,12,
not validation/enrichment.rs; the GeoIP terminal shortcut sets the AS
number to 0 but the AS *name* still comes from the MMDB independently;
POST is sent without a body (fetch.rs:11-17); interval is silently
floored at 60s (service.rs:73). Each correction is cited at file:line.
map.yaml: removed the "Flows Enrichment" sub-section (which contained
the 7 deleted pages); added two new entries between Configuration and
Field Reference:
- "Enrichment" -> docs/network-flows/enrichment.md
- "Enrichment Intel Downloader" -> docs/network-flows/intel-downloader.md
Cross-references updated in 8 surviving pages to point at either the
new Enrichment page (for cross-cutting concepts) or the relevant
integration cards under src/crates/netflow-plugin/integrations/ (for
per-method specifics): configuration, validation, intel-downloader,
quick-start, installation, troubleshooting, visualization/maps-globe.
The 7 deleted concept pages had their durable content fully absorbed
into the Enrichment page (cross-cutting) and the 14 integration cards
(per-method); each integration card cites file:line evidence for every
behavioural claim and was re-verified against current source code by
its merge agent.
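The "last-non-empty-wins" composition named in this commit message can be illustrated with a small merge function. This is a sketch of the rule as described, not the decoder's code: it assumes each database lookup yields a flat dict, that "empty" means `None`, an empty string, or numeric zero, and that databases are applied in configuration order. The field names and sample values are hypothetical.

```python
def is_empty(value) -> bool:
    """Treat None, empty strings, and numeric zero as 'no value'."""
    return value is None or value == "" or value == 0


def merge_lookups(lookups):
    """Apply lookups in order; a later database overrides a field only
    when it carries a non-empty value for that field."""
    merged = {}
    for record in lookups:
        for field, value in record.items():
            if not is_empty(value):
                merged[field] = value
    return merged


first_db = {"city": "Athens", "country": "GR", "asn": 64500}
second_db = {"city": "", "country": "GR", "asn": 0, "as_name": "EXAMPLE-NET"}

print(merge_lookups([first_db, second_db]))
```

Here the second database contributes its AS name but cannot blank out the city or the AS number supplied by the first, which is the behaviour the corrected framing describes for both geo and ASN fields.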
The merge agents preserved Learn-URL cross-references from the original concept pages (e.g. https://learn.netdata.cloud/docs/network-flows/enrichment/ip-intelligence). Those URLs now 404 because the seven concept pages were collapsed into one. Sweep them all to point at the new consolidated https://learn.netdata.cloud/docs/network-flows/enrichment page, which covers the cross-cutting concepts (MMDB shared mechanism, asn_providers chain, network-source operational properties, etc.) that the per-method references were calling out. Generated cards regenerated via gen_docs_integrations.py to flush the new URL into the .md outputs.
Pull request overview
This pull request aligns Network Flows documentation and generated integration cards with current supported behavior, while also updating the netflow-plugin to (1) use per-tier retention configuration and (2) correctly persist timestamp_source into _SOURCE_REALTIME_TIMESTAMP.
Changes:
- Fix netflow-plugin ingestion so `timestamp_source` persists the decoded source timestamp as `_SOURCE_REALTIME_TIMESTAMP` (keeping journal entry realtime as receive/write time) and add an end-to-end test.
- Migrate journal retention configuration to per-tier `journal.tiers.*` and remove unsupported/obsolete journal query knobs.
- Restructure/refresh Network Flows docs + integration cards (new categories, new visualization overview, updated enrichment/network-source cards, moved/removed older concept pages).
Reviewed changes
Copilot reviewed 60 out of 62 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/crates/netflow-plugin/src/startup_memory_tests.rs | Updates tests to rely on per-tier retention defaults. |
| src/crates/netflow-plugin/src/query/tests.rs | Minor formatting tweak in request deserialization tests. |
| src/crates/netflow-plugin/src/plugin_config/validation/journal.rs | Removes validation for deleted knobs; validates per-tier retention limits. |
| src/crates/netflow-plugin/src/plugin_config/types/journal.rs | Implements per-tier retention schema and removes retired journal query fields. |
| src/crates/netflow-plugin/src/plugin_config/defaults.rs | Removes now-unused retention override helpers and facet limit default. |
| src/crates/netflow-plugin/src/plugin_config.rs | Drops re-export of removed RetentionLimitOverride. |
| src/crates/netflow-plugin/src/plugin_config_tests.rs | Updates config tests for per-tier retention and removed knobs. |
| src/crates/netflow-plugin/src/memory_tests.rs | Updates fixtures to use per-tier retention config. |
| src/crates/netflow-plugin/src/main_tests.rs | Adds E2E test ensuring _SOURCE_REALTIME_TIMESTAMP reflects decoded source timestamp; adds helpers to read first raw journal entry fields. |
| src/crates/netflow-plugin/src/ingest/service/runtime.rs | Persists decoded source timestamp into journal entry timestamps via _SOURCE_REALTIME_TIMESTAMP. |
| src/crates/netflow-plugin/src/ingest_resource_bench_tests.rs | Minor formatting cleanups in benchmark test code. |
| src/crates/netflow-plugin/src/facet_runtime.rs | Minor formatting in tests. |
| src/crates/netflow-plugin/src/api/flows/handler.rs | Minor formatting of spawn_blocking autocomplete path. |
| src/crates/netflow-plugin/README.md | Updates documented config to per-tier retention and query_max_groups semantics. |
| src/crates/netflow-plugin/integrations/sflow.md | Updates generated integration card content and config options for per-tier retention and docs navigation. |
| src/crates/netflow-plugin/integrations/netflow.md | Updates generated NetFlow integration card for per-tier retention and revised guidance. |
| src/crates/netflow-plugin/integrations/maxmind_geoip_-_geolite2.md | Rewrites MaxMind integration card with clarified setup, limits, and composition behavior. |
| src/crates/netflow-plugin/integrations/iptoasn.md | Rewrites IPtoASN integration card (cadence, downloader behavior, composition). |
| src/crates/netflow-plugin/integrations/ipfix.md | Updates IPFIX integration card for per-tier retention and revised guidance. |
| src/crates/netflow-plugin/integrations/gcp_ip_ranges.md | Expands GCP network-source integration card (cloud.json vs goog.json, options, examples). |
| src/crates/netflow-plugin/integrations/decapsulation.md | Adds new generated decapsulation integration card under enrichment methods. |
| src/crates/netflow-plugin/integrations/db-ip_ip_intelligence.md | Updates DB-IP integration card with new defaults and operational guidance. |
| src/crates/netflow-plugin/integrations/custom_mmdb_database.md | Updates custom MMDB integration card (schema expectations, atomic refresh guidance). |
| src/crates/netflow-plugin/integrations/azure_ip_ranges.md | Expands Azure Service Tags integration card (URL rotation, transform schema, examples). |
| src/crates/netflow-plugin/integrations/aws_ip_ranges.md | Expands AWS IP ranges integration card (schema, transform requirements, options). |
| src/crates/netflow-plugin/configs/netflow.yaml | Migrates sample config to per-tier retention and removes retired query knobs. |
| integrations/categories.yaml | Restructures Network Flows categories (Flow Protocols + Enrichment Methods). |
| docs/network-flows/visualization/time-series.md | Updates tier terminology and defaults; describes raw-tier forcing conditions. |
| docs/network-flows/visualization/summary-sankey.md | Updates UI navigation and default group-by language (AS Name). |
| docs/network-flows/visualization/overview.md | Adds shared visualization/query behavior overview page. |
| docs/network-flows/visualization/maps-globe.md | Updates tier terminology and troubleshooting guidance for map/globe views. |
| docs/network-flows/visualization/filters-facets.md | Updates tier terminology and removes docs for deleted facet-limit knob. |
| docs/network-flows/visualization/dashboard-cards.md | Updates wording and troubleshooting commands for plugin health charts. |
| docs/network-flows/validation.md | Reworks validation routine and monitoring guidance. |
| docs/network-flows/troubleshooting.md | Updates troubleshooting steps, commands, and terminology; reflects per-tier retention docs. |
| docs/network-flows/README.md | Updates overview, defaults, and conceptual guidance (sampling, doubling, bidirectional traffic). |
| docs/network-flows/quick-start.md | Updates quick-start configs and dashboard navigation guidance. |
| docs/network-flows/investigation-playbooks.md | Updates playbooks for new terminology/default fields and UI navigation. |
| docs/network-flows/intel-downloader.md | Adds a new page documenting topology-ip-intel-downloader behavior and ops. |
| docs/network-flows/installation.md | Updates installation guidance and IP intelligence defaults/links. |
| docs/network-flows/field-reference.md | Updates field semantics text (timestamps, selectivity, observation time mapping). |
| docs/network-flows/enrichment/static-metadata.md | Removes old enrichment concept page (superseded by new structure/cards). |
| docs/network-flows/enrichment/network-identity.md | Removes old concept page (superseded by new structure/cards). |
| docs/network-flows/enrichment/decapsulation.md | Removes old decapsulation doc page (replaced by integration card/updated docs). |
| docs/network-flows/enrichment/classifiers.md | Removes old classifiers concept page (superseded by new structure/cards). |
| docs/network-flows/enrichment/bgp-routing.md | Removes old BGP routing concept page (superseded by new structure/cards). |
| docs/network-flows/enrichment/asn-resolution.md | Removes old ASN resolution concept page (superseded by new structure/cards). |
| docs/.map/map.yaml | Updates docs sidebar structure for Network Flows section and adds new pages. |
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
9 issues found across 85 files
Confidence score: 4/5
- This PR is likely safe to merge, but there is mild user-facing risk because several documentation updates in `src/crates/netflow-plugin/integrations/*.md`, `docs/network-flows/enrichment.md`, and `.agents/skills/*` are factually inaccurate (mostly severity 4/10, high confidence).
- The most impactful issue is misleading setup/troubleshooting guidance in integration docs (for example `iptoasn.md`, `aws_ip_ranges.md`, and `generic_json-over-http_ipam.md`), which could cause incorrect configuration choices or confusion during onboarding rather than runtime breakage.
- There is also a small portability concern in `.agents/skills/codacy-audit/scripts/pr-issues.sh` where bare `mktemp` may behave inconsistently across GNU/BSD/macOS; this is low severity and straightforward to harden.
- Pay close attention to `src/crates/netflow-plugin/integrations/iptoasn.md`, `src/crates/netflow-plugin/integrations/aws_ip_ranges.md`, `src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md`, and `docs/network-flows/enrichment.md` - correct inaccurate behavior claims so users are not misled.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".agents/skills/integrations-lifecycle/pipeline.md">
<violation number="1" location=".agents/skills/integrations-lifecycle/pipeline.md:455">
P2: This line overstates CI behavior: `check-markdown.yml` does not enforce that committed `COLLECTORS.md` is fresh, only that regenerated docs pass ingest.</violation>
</file>
<file name=".agents/skills/codacy-audit/scripts/pr-issues.sh">
<violation number="1" location=".agents/skills/codacy-audit/scripts/pr-issues.sh:77">
P3: Use an explicit portable mktemp template instead of bare `mktemp`.
(Based on your team's feedback about portable mktemp usage across GNU/BSD/macOS.) [FEEDBACK_USED]</violation>
</file>
<file name=".agents/skills/learn-pr-preview/SKILL.md">
<violation number="1" location=".agents/skills/learn-pr-preview/SKILL.md:45">
P2: The source-copy command omits intentional untracked docs files, so the documented uncommitted-preview flow is incomplete and can produce a stale preview.</violation>
</file>
<file name="src/crates/netflow-plugin/integrations/iptoasn.md">
<violation number="1" location="src/crates/netflow-plugin/integrations/iptoasn.md:52">
P2: The new wording implies `topology-ip-intel-downloader` is always bundled, but repository docs say packaged 32-bit installs do not include it. Add a platform/installation caveat to avoid broken setup instructions.</violation>
</file>
<file name="src/crates/netflow-plugin/integrations/netbox.md">
<violation number="1" location="src/crates/netflow-plugin/integrations/netbox.md:88">
P3: The performance note claims trie lookup for runtime network-source enrichment, but current code does a linear scan over source records.</violation>
<violation number="2" location="src/crates/netflow-plugin/integrations/netbox.md:309">
P3: The troubleshooting text says HTTP errors are silently consumed, but the refresher logs them as warnings.</violation>
</file>
<file name="src/crates/netflow-plugin/integrations/aws_ip_ranges.md">
<violation number="1" location="src/crates/netflow-plugin/integrations/aws_ip_ranges.md:231">
P2: The new docs incorrectly say `transform: "."` fails with an "empty result" error; with this plugin it typically fails because the emitted object has no required `prefix` field. Update this section to describe schema/row-mapping failure instead of empty output.</violation>
</file>
<file name="src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md">
<violation number="1" location="src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md:358">
P2: This new troubleshooting note is factually incorrect: URL-embedded credentials are handled by reqwest as HTTP Basic auth, so saying they "do not work" can mislead users.</violation>
</file>
<file name="docs/network-flows/enrichment.md">
<violation number="1" location="docs/network-flows/enrichment.md:30">
P2: This statement is inaccurate: provider-chain-only configuration can still enable the enricher, so it does not run *only* when an enrichment input source is configured.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Router as Router/Exporter
participant UDP as UDP Socket
participant Parser as Protocol Decoder
participant Enrich as Enrichment Pipeline
participant Journal as Journal (4 Tiers)
participant Planner as Query Tier Planner
participant Dashboard as Network Flows UI
participant Intel as Intel Downloader
participant MMDB as MMDB Files
participant Config as netflow.yaml
participant DocGen as Doc Generator
participant Learn as learn.netdata.cloud
Note over Router,Learn: PR repairs documentation, persistence, and defaults to match actual behavior
Router->>UDP: Send NetFlow / IPFIX / sFlow datagrams
UDP->>Parser: Decode raw protocol fields
Parser->>Parser: Process timestamp_source config
alt timestamp_source = netflow_first_switched
Parser->>Journal: NEW: Write decoded source timestamp as _SOURCE_REALTIME_TIMESTAMP
Parser->>Journal: Keep entry realtime as receive/write time
else default (input)
Parser->>Journal: Write receive time as _SOURCE_REALTIME_TIMESTAMP
end
Parser->>Enrich: Decoded flow record
Enrich->>MMDB: GeoIP / ASN lookup (auto-detect or explicit path)
MMDB-->>Enrich: Country, city, ASN data
Enrich->>Enrich: Apply classifier rules, static metadata, provider chains
Note over Enrich: Order: decode → decap → GeoIP → static → network sources → classifiers → routing overlay
Enrich->>Journal: Write enriched record to raw tier
Journal->>Journal: Rollup to 1m / 5m / 1h tiers (drops high-cardinality fields)
Planner->>Journal: Select tier for query
alt Raw-only fields (IP/port/city) or FTS
Planner->>Journal: Force raw tier
Journal-->>Planner: Raw data (bounded by raw retention)
else Preserved fields (country/ASN/interface)
Planner->>Journal: Use coarser tier per window alignment
Journal-->>Planner: Aggregated data
end
Dashboard->>Planner: flows / autocomplete query
Planner-->>Dashboard: Top-N groups, facet values
Dashboard->>Dashboard: Render Sankey, Time-Series, Maps
Intel->>Intel: CHANGED: Fetch DB-IP / IPtoASN data
Intel->>MMDB: Atomic rename(2) of topology-ip-asn.mmdb, .geo.mmdb
MMDB-->>Enrich: Reloads on file change within 30s
Config->>Config: Removed top-level retention knobs and query guardrails
Config->>Config: Per-tier retention only (raw, minute_1, minute_5, hour_1)
Config->>Config: Removed query_1m_max_window, query_5m_max_window, query_facet_max_values_per_field
DocGen->>DocGen: CHANGED: Generate integration cards under Flow Protocols / Enrichment Methods
DocGen->>DocGen: Collapse 7 enrichment pages into single Enrichment guide
DocGen->>DocGen: CHANGED: flows category treated as Monitor Anything section
DocGen->>Learn: Publish regenerated COLLECTORS.md, integration cards, docs
Learn-->>Dashboard: Ingest for in-app help and search
Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
Summary
- … `timestamp_source` persistence so the decoded source timestamp is stored in `_SOURCE_REALTIME_TIMESTAMP` while journal entry realtime remains receive/write time.
- … `.agents/sow/done/` with final validation evidence.
Validation
- `cargo test --manifest-path src/crates/netflow-plugin/Cargo.toml timestamp_source -- --nocapture`
- `python3 integrations/gen_integrations.py`
- `python3 integrations/gen_docs_integrations.py`
- `python3 -c 'import yaml; yaml.safe_load(open("src/crates/netflow-plugin/metadata.yaml")); print("metadata yaml ok")'`
- … `docs/network-flows/`, `metadata.yaml`, and generated integration-card body text.
- `git diff --check`
- `.agents/sow/audit.sh` confirmed SOW 14 status/directory consistency after move; it still exits nonzero on an unrelated pre-existing sensitive-data pattern in `.agents/skills/mirror-netdata-repos/SKILL.md` and existing root TODO-file warnings.
Summary by cubic
Repairs and restructures Network Flows docs and integration cards to match actual behavior, fixes source-timestamp persistence in the `netflow-plugin`, and simplifies retention/query config to per-tier only. Also ensures Learn renders the section overview, adds Network Flows to the generated "Monitor anything" page, and improves startup by moving the raw rebuild scan off the async startup path.
Bug Fixes
- … `_SOURCE_REALTIME_TIMESTAMP` when `timestamp_source` is used; adds an end-to-end test.
- … `Classifiers`, `Decapsulation`, and `Static Metadata`; updated generators so `COLLECTORS.md` includes a "Network Flows" section.
Migration
- … `journal.size_of_journal_files` and `journal.duration_of_journal_files` into per-tier entries under `journal.tiers.*`. Each tier requires at least one positive limit; set a field to `null` to disable that limit on that tier.
- … `journal.query_1m_max_window`, `journal.query_5m_max_window`, and `journal.query_facet_max_values_per_field`. Keep `journal.query_max_groups` (documented to fold overflow into `__overflow__`).
Written for commit 165e02c. Summary will update on new commits.
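The per-tier rule in the migration notes ("each tier requires at least one positive limit; `null` disables a limit") can be expressed as a small check. This is a hedged sketch, not the plugin's validator: it assumes the per-tier keys keep the names `size_of_journal_files` and `duration_of_journal_files`, that limits have already been parsed to plain numbers (bytes and seconds), and that `None` stands in for YAML `null`. The sample values are hypothetical.

```python
LIMIT_KEYS = ("size_of_journal_files", "duration_of_journal_files")  # assumed names


def invalid_tiers(tiers: dict) -> list:
    """Return the names of configured tiers that have no positive limit."""
    bad = []
    for name, limits in tiers.items():
        values = [limits.get(key) for key in LIMIT_KEYS]
        if not any(v is not None and v > 0 for v in values):
            bad.append(name)
    return bad


# Hypothetical parsed config: sizes in bytes, durations in seconds.
config = {
    "raw": {"size_of_journal_files": 2 * 10**12, "duration_of_journal_files": 86_400},
    "minute_1": {"size_of_journal_files": None, "duration_of_journal_files": 14 * 86_400},
    "hour_1": {"size_of_journal_files": None, "duration_of_journal_files": None},
}

print(invalid_tiers(config))
```

In this example the `minute_1` tier is acceptable with only a duration limit, while `hour_1` would be rejected because both limits are disabled.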