Repair Network Flows documentation #22449

Merged
ktsaou merged 34 commits into netdata:master from ktsaou:netflow-docs-repair
May 8, 2026

@ktsaou ktsaou commented May 7, 2026

Summary

  • Repair Network Flows documentation and generated integration cards after SOW 14 regression review.
  • Align public docs with supported behavior for retention, IP intelligence defaults, enrichment, visualization limits, troubleshooting, and generated default-behavior text.
  • Fix timestamp_source persistence so the decoded source timestamp is stored in _SOURCE_REALTIME_TIMESTAMP while journal entry realtime remains receive/write time.
  • Close SOW 14 by moving it to .agents/sow/done/ with final validation evidence.

Validation

  • cargo test --manifest-path src/crates/netflow-plugin/Cargo.toml timestamp_source -- --nocapture
  • python3 integrations/gen_integrations.py
  • python3 integrations/gen_docs_integrations.py
  • python3 -c 'import yaml; yaml.safe_load(open("src/crates/netflow-plugin/metadata.yaml")); print("metadata yaml ok")'
  • Targeted public-doc quality scans over docs/network-flows/, metadata.yaml, and generated integration-card body text.
  • git diff --check
  • .agents/sow/audit.sh confirmed SOW 14 status/directory consistency after move; it still exits nonzero on an unrelated pre-existing sensitive-data pattern in .agents/skills/mirror-netdata-repos/SKILL.md and existing root TODO-file warnings.

Summary by cubic

Repairs and restructures Network Flows docs and integration cards to match actual behavior, fixes source‑timestamp persistence in the netflow-plugin, and simplifies retention/query config to per‑tier only. Also ensures Learn renders the section overview, adds Network Flows to the generated “Monitor anything” page, and improves startup by moving the raw rebuild scan off the async startup path.

  • Bug Fixes

    • Persist the decoded source timestamp to _SOURCE_REALTIME_TIMESTAMP when timestamp_source is used; adds an end‑to‑end test.
    • Move raw rebuild scanning off the async startup path to avoid blocking plugin start.
    • Documentation overhaul: collapsed seven enrichment concept pages into a single Enrichment guide; added “Enrichment Intel Downloader” and “Visualization/Overview”; corrected doubling vs bidirectional symmetry, per‑flow sampling handling, UI navigation (Live → Network Flows), GeoIP internal‑IP behavior, tier auto‑pick rules, rollup‑field survival, and Time‑Series bucket limits; Learn map now renders the Overview at the section root; regenerated and recategorized integration cards under “Flow Protocols” and “Enrichment Methods”, including new Classifiers, Decapsulation, and Static Metadata; updated generators so COLLECTORS.md includes a “Network Flows” section.
  • Migration

    • Move top‑level journal.size_of_journal_files and journal.duration_of_journal_files into per‑tier entries under journal.tiers.*. Each tier requires at least one positive limit; set a field to null to disable that limit on that tier.
    • Remove unsupported keys: journal.query_1m_max_window, journal.query_5m_max_window, and journal.query_facet_max_values_per_field. Keep journal.query_max_groups (documented to fold overflow into __overflow__).

Written for commit 165e02c. Summary will update on new commits.

ktsaou added 30 commits May 7, 2026 19:09
The 2026-05-07 closure of SOW-0014 was premature. The learn netlify
deploy preview for PR netdata#2852 surfaced major content errors that the
prior validation pass missed: multiple statements that contradict
the source code, generic flow-monitoring advice imported from research
notes that does not apply to Netdata, several invented behaviours, and
structural choices that read as academic / generic guidance rather
than as a practical Netdata-specific guide.

Move file from done/ back to current/, set Status: in-progress, and
append a `## Regression - 2026-05-07` section with:

- 21 findings transcribed verbatim from the user (F1..F21);
- code-citation verdicts for each (per-flow sampling multiplication
  at decoder/record/core/record.rs:24-26 confirms F4/F5/F15 wrong;
  template persistence at decoder/protocol/{v9,ipfix}/templates.rs
  confirms F14 wrong; etc.);
- root cause analysis (subagents extracted data accurately but missed
  behavioural framing claims; codex review focused on autocomplete
  code, not docs prose; validation evidence was structural, not
  semantic; closure was driven by "all phases done" rather than
  "all claims true");
- a three-phase repair plan: per-finding fixes one at a time with
  no batching (R1), per-page audit subagents that re-run until clean
  (R2), final close with a whole-section review (R3).

The SOW remains in current/ until every F1..F21 has a fix landed
with code citation, every page passes a per-page audit, and a
whole-section review returns no new findings.
The /docs/network-flows URL rendered as an auto-generated tile grid
because the section root meta block carried only `label:` -- no leaf
content. Learn's `get_dir_make_file_and_recurse` then synthesised a
category index page from the children.

Hoist `edit_url:` and `description:` to the section root, pointing at
the existing README.md. Drop the redundant child "Overview" entry that
pointed to the same file.

Pattern reference: every section that renders an Overview at its root
URL on learn.netdata.cloud (Collecting Metrics, Dashboards and Charts,
Netdata Cloud, Welcome to Netdata, etc.) carries `edit_url:` directly
on the section root.

Logged as F1 in SOW-0014 regression log with code references.
Two related findings, fixed together because they touched the same
paragraphs.

- The "doubling" effect (per-packet ingress+egress accounting on a
  single router) was conflated with bidirectional traffic symmetry.
- The doubling fix said "filter by one exporter, one interface, in
  one direction". The "in one direction" is redundant on top of "one
  interface" and misleads readers into expecting another 50% halving.
- The bidirectional-traffic explanation said "when you see traffic
  X-to-Y and Y-to-X of similar volume, that's one conversation, not
  two". Bidirectional conversations are usually asymmetric (downloads
  vs ACKs), so "similar volume" is wrong as an identification
  heuristic.

Rewritten:

- Doubling fix is now: one exporter + one interface (Input Interface
  OR Output Interface, pick one). Each packet crossing that interface
  produces exactly one record on it.
- The mirror-conversation section is renamed and reframed: separate
  packets in each direction, separate records, typically asymmetric
  volumes. Per-direction accounting, not duplication.

Files touched:
- README.md, quick-start.md: paragraphs rewritten.
- summary-sankey.md, anti-patterns.md, validation.md: "in one
  direction" lines fixed in place. Anti-patterns / validation will
  be rewritten more broadly under F14-F17 but the wrong claims
  are removed now.

Logged as F2+F3 in SOW-0014 regression log.
…e myth)

Two related findings, fixed together because they are the same wrong
claim repeated across the documentation. Source-code reality:

  src/crates/netflow-plugin/src/decoder/record/core/record.rs:24-26
  let sampling_rate = rec.sampling_rate.max(1);
  rec.bytes = rec.bytes.saturating_mul(sampling_rate);
  rec.packets = rec.packets.saturating_mul(sampling_rate);

`sampling_rate` is set per-record from each protocol's appropriate
source (legacy header, v9 IE / Sampling Options Template, IPFIX IE /
options, sFlow per-sample rate, or static override). Multiplication
runs PER FLOW at decode time. Mixed sampling rates across exporters,
interfaces, or time are handled correctly automatically.
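
As a minimal sketch of why mixed rates need no special handling, the snippet below applies the same three operations to a simplified record. `FlowRecord`, `apply_sampling`, and `estimated_total_bytes` are hypothetical names for illustration only; the real record type in the plugin carries many more fields and may use different integer widths.

```rust
/// Hypothetical, simplified flow record (not the plugin's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct FlowRecord {
    pub bytes: u64,
    pub packets: u64,
    /// Sampling rate reported for this record; 0 means "no rate reported".
    pub sampling_rate: u64,
}

/// Scale one record by its own sampling rate, mirroring the quoted lines.
pub fn apply_sampling(rec: &mut FlowRecord) {
    // A missing (0) rate is clamped to 1, so the counters stay unscaled.
    let sampling_rate = rec.sampling_rate.max(1);
    // Saturating multiplication caps at u64::MAX instead of wrapping.
    rec.bytes = rec.bytes.saturating_mul(sampling_rate);
    rec.packets = rec.packets.saturating_mul(sampling_rate);
}

/// Sum estimated bytes over records that may each carry a different rate.
pub fn estimated_total_bytes(records: &[FlowRecord]) -> u64 {
    records
        .iter()
        .cloned()
        .map(|mut rec| {
            apply_sampling(&mut rec);
            rec.bytes
        })
        .fold(0u64, |acc, bytes| acc.saturating_add(bytes))
}
```

Because each record is scaled before any aggregation, a 1:100 exporter and a 1:1000 exporter can be summed directly. A record with no reported rate is counted at face value, which is the undercounting case the docs keep describing.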

Removed false claims wherever they appeared:

- README.md "What sampling does to your numbers" -- "works correctly
  only if all your exporters use the same sampling rate" and "the
  clean path: keep sampling rates uniform across your network".
  Rewrote the paragraph to state per-flow multiplication, explain
  why the UI does not surface a single rate (mixed rates have no
  meaningful display value; uniform rates are already known to the
  operator), and keep the real statistical-floor caveat (sampling
  can miss small / short flows regardless of rate uniformity).

- field-reference.md and anti-patterns.md -- `RAW_BYTES` no longer
  framed as "use when sampling is uniform". Now correctly framed as
  the literal pre-multiplication value the exporter sent.

- troubleshooting.md "Bandwidth doesn't match SNMP" -- "Mixed
  sampling rates ... isn't comparable to any single SNMP
  measurement" replaced with the actual mistake (comparing
  aggregates of many interfaces to a single interface SNMP counter).
  Per-flow multiplication is correct regardless of rate uniformity.

- validation.md -- "undocumented sampling rate changes" dropped from
  the silent-failure intro; the "Sampling rate change" monitoring
  table row removed (per-flow multiplication absorbs rate changes).

- investigation-playbooks.md -- "Sampling rate of the exporter (so
  the numbers can be interpreted)" deliverable removed and "A
  change in sampling rate during the analysis window invalidates
  the trend" caveat removed. Both wrong under per-flow scaling.

- anti-patterns.md cross-protocol-counts section -- "Same goes for
  sampling-rate differences across exporters" removed; the
  protocol-counts-not-comparable point stays.

The F2/F3 doubling-fix wording ("filter by exporter + interface +
direction") was also wrong and was cleaned up in the same commit, in
the anti-patterns summary table row.

What stays: NetFlow v7 / v5 with rate=0 / v9 / IPFIX without a
Sampling Options Template are real cases where the plugin sees no
rate and undercounts. Those remain documented as the actual silent
failure mode.

Items deferred to F14 / F15 which rewrite their containing sections:
- validation.md silent-failure items #2, #3, #5 (F14 removes them as
  a block).
- anti-patterns.md "Ignoring the sampling rate" section + its
  summary-table row (F15 removes the section entirely).

Logged as F4+F5 in SOW-0014 regression log with code references.
…ment

The globe and city map render the same query response with the same
table beneath. The 3D projection is not "less useful for analysis";
it is a different rendering with different strengths.

Replaced both occurrences with a neutral framing that the 2D city map
is best for in-continent precision and the 3D globe is best when
distance and great-circle paths matter (transcontinental traffic,
undersea cables, intercontinental CDN routing).

While here, fixed the same page's "Mirroring" subsection to drop the
F2 symmetry myth ("25 top-N = 12 conversations" implied a 1:1 pairing
of A->B and B->A by volume; bidirectional traffic is usually
asymmetric).

Logged as F6 in SOW-0014 regression log.
The docs claimed "the Network Flows tab should appear in the top
navigation". The actual Netdata UI exposes Network Flows as a
Function under the **Live** tab. Verified against
docs/dashboards-and-charts/live-tab.md.

Adopted convention:
- "Open Network Flows" (verb)
- "the Network Flows view" (noun)
- "Click the Live tab in the top navigation; Network Flows appears
  in the Functions list on the right" (setup context).

Swept every "Network Flows tab" reference in:
- installation.md (the original bad sentence + follow-up).
- troubleshooting.md, investigation-playbooks.md, anti-patterns.md.
- visualization/dashboard-cards.md (4 occurrences).
- visualization/summary-sankey.md.

Logged as F7 in SOW-0014 regression log.
Cleans up an unjustified schema redundancy. Today the plugin accepts
both top-level `size_of_journal_files` / `duration_of_journal_files`
under `journal` AND per-tier values under `journal.tiers.<tier>`. The
runtime already uses only per-tier values; the global values were
just defaults that flowed through `retention_for_tier()`. The two
paths gave operators two ways to configure the same thing, with no
underlying behavioural difference.

After:
- `JournalConfig` carries only `tiers: JournalTierRetentionOverrides`
  (plus journal_dir and the query guardrails). No globals.
- Each `JournalTierRetentionConfig` is `Option<ByteSize> /
  Option<Duration>` directly. Omitted fields fall back to the
  built-in tier defaults (uniform 10GB / 7d, preserving today's
  default behaviour). Explicit `null` disables that limit on that
  tier; validation still requires at least one positive limit.
- The `RetentionLimitOverride<T>` enum, its serializer, deserializer,
  and `resolve` helpers are removed -- no longer reachable.
- The orphan `parse_bytesize` helper that fed the removed clap
  `value_parser` is gone.

Tests rewritten to exercise the per-tier-only schema:
- `journal_tier_retention_uses_built_in_tier_defaults`
- `journal_tier_retention_uses_per_tier_values_when_present`
- `journal_rotation_size_derives_from_tier_size_budget`
- `journal_rotation_size_uses_100mb_for_time_only_retention`
- `journal_validation_rejects_tier_size_below_100mb`
- `journal_validation_allows_time_only_retention_when_size_is_disabled`
- `journal_tier_retention_null_disables_size_limit_for_that_tier_only`

Plus mechanical updates to memory_tests / startup_memory_tests to use
the new schema. Full crate: 427 passed, 0 failed.
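
The validation rules named above (at least one positive limit per tier, no tier size below 100 MB, `null` disabling a single limit) can be sketched roughly as follows. This is an illustration, not the plugin's code: `TierRetention`, `built_in_default`, and `validate` are hypothetical names, sizes are plain byte counts rather than `ByteSize`, and decimal units (10 GB = 10,000,000,000 bytes) are an assumption.

```rust
use std::time::Duration;

/// Hypothetical stand-in for one tier's resolved retention settings.
/// `None` models an explicit `null`, i.e. that limit is disabled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TierRetention {
    pub size_bytes: Option<u64>,
    pub duration: Option<Duration>,
}

/// Assumed minimum tier size of 100 MB, in decimal bytes.
pub const MIN_TIER_SIZE_BYTES: u64 = 100_000_000;

/// Built-in default described in the commit: 10 GB / 7 days.
pub fn built_in_default() -> TierRetention {
    TierRetention {
        size_bytes: Some(10_000_000_000),
        duration: Some(Duration::from_secs(7 * 24 * 3_600)),
    }
}

/// Reject tiers with no positive limit or with a size below the minimum.
pub fn validate(tier: &TierRetention) -> Result<(), String> {
    let size_ok = tier.size_bytes.map_or(false, |s| s > 0);
    let time_ok = tier.duration.map_or(false, |d| !d.is_zero());
    if !size_ok && !time_ok {
        return Err("a tier needs at least one positive limit".to_string());
    }
    if let Some(size) = tier.size_bytes {
        if size < MIN_TIER_SIZE_BYTES {
            return Err(format!("tier size {size} is below the 100 MB minimum"));
        }
    }
    Ok(())
}
```

A time-only tier (`size_bytes: None` with a positive duration) passes, which matches the intent of `journal_validation_allows_time_only_retention_when_size_is_disabled`.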

Documentation:
- configuration.md: rewrote the `## journal` section with a
  per-tier-only schema. Updated the production retention profile
  example. Replaced the existing "Top-level retention" subsection
  with a "Per-tier retention" subsection.
- retention-querying.md: dropped the global-form example; per-tier
  example only; cross-link to configuration.md.

Breaking change notice: any existing user config using
`journal.size_of_journal_files` / `journal.duration_of_journal_files`
at the top level will now fail to deserialize (deny_unknown_fields).
Migrate by moving those values under `journal.tiers.<tier>.*`. The
plugin is recently shipped (PR netdata#22439, 2026-05-07); breaking-change
risk is low.

Logged as F8 in SOW-0014 regression log.
Code investigation: these two journal config knobs were declared,
validated, and exposed in YAML, but nothing in `src/.../query/`
ever read them. The actual tier auto-pick logic in
`query/planner/spans.rs::plan_query_tier_spans_recursive` selects
the coarser tier strictly from window / bucket-duration alignment
math -- it does not consult either knob.

Verdict: dead schema. Removed both:
- field declarations on `JournalConfig`
- non-zero / ordering validation in validate_journal
- two YAML test fixtures that mentioned them
- the two configuration.md table rows + code-block lines + the
  "query-window limits" explanation
- the retention-querying.md sentence that referenced them

Updated the retention-querying explanation of tier auto-pick to
match the actual behaviour: the planner uses bucket alignment, not
config-driven window caps.
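
A later commit in this PR gives the concrete rule: the planner walks tiers coarsest first and accepts the first one that yields at least 100 aligned buckets (`TIMESERIES_MIN_BUCKETS`). The sketch below illustrates that rule under stated assumptions: `Tier`, `aligned_buckets`, and `pick_tier` are hypothetical names, timestamps are plain epoch seconds, "aligned" is taken to mean whole buckets on bucket boundaries inside the window, and the extra rule that raw-only fields force the raw tier is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Raw,
    Minute1,
    Minute5,
    Hour1,
}

/// Minimum aligned buckets a tier must supply (100 per the PR text).
const MIN_BUCKETS: u64 = 100;

/// Rollup tiers, coarsest first, paired with bucket length in seconds.
const ROLLUPS: [(Tier, u64); 3] = [
    (Tier::Hour1, 3_600),
    (Tier::Minute5, 300),
    (Tier::Minute1, 60),
];

/// Count whole buckets that fit in [after, before) on bucket boundaries.
fn aligned_buckets(after: u64, before: u64, bucket_secs: u64) -> u64 {
    // Round the start up and the end down to the nearest boundary.
    let start = (after + bucket_secs - 1) / bucket_secs * bucket_secs;
    let end = before / bucket_secs * bucket_secs;
    end.saturating_sub(start) / bucket_secs
}

/// Pick the coarsest tier with enough aligned buckets, else the raw tier.
pub fn pick_tier(after: u64, before: u64) -> Tier {
    ROLLUPS
        .iter()
        .find(|(_, secs)| aligned_buckets(after, before, *secs) >= MIN_BUCKETS)
        .map(|(tier, _)| *tier)
        .unwrap_or(Tier::Raw)
}
```

Under these assumptions a boundary-aligned 100-hour window selects the 1-hour tier, a window of 8h20m (100 five-minute buckets) selects the 5-minute tier, and anything under 100 aligned one-minute buckets stays on raw.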

Build + tests: 427 passed, 0 failed.

Breaking change notice: any user config that set
`journal.query_1m_max_window` or `journal.query_5m_max_window` will
now fail to deserialize (deny_unknown_fields). The keys had no
effect before; migration is delete-only.

Logged as F9 in SOW-0014 regression log.
…values_per_field; document properly

Two journal config knobs, opposite verdicts after code investigation:

- query_max_groups: REAL. Read at query/service.rs:52 and threaded into
  the projected group accumulator via query/projected/apply.rs:48. When
  the accumulator's grouped_total() exceeds the limit, additional group
  keys are folded into a synthetic __overflow__ bucket
  (query/grouping/labels.rs:17, query/grouping/model/compact.rs:35) and
  the response carries a warning ("Group accumulator limit reached;
  additional groups were folded into __overflow__" at
  query/timeseries.rs:124). Bounds memory on accidentally wide group-by
  combinations. Keep; document properly.

- query_facet_max_values_per_field: DEAD. Declared, validated for
  non-zero, but the consumer at query/facets/render.rs:19,27 uses the
  hardcoded constant DEFAULT_FACET_ACCUMULATOR_MAX_VALUES_PER_FIELD
  (query/request/constants.rs:17) instead of the config knob. The two
  coincidentally have the same default value (5000) but the config knob
  is never threaded to the consumer. Remove.
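
To make the `query_max_groups` behaviour concrete, here is a rough sketch of an accumulator that folds extra group keys into `__overflow__` once the limit is reached. `GroupAccumulator` and its methods are hypothetical; the plugin's projected accumulator is structured differently, and whether it folds at or just past the limit is not something this sketch claims to reproduce.

```rust
use std::collections::HashMap;

pub const OVERFLOW_KEY: &str = "__overflow__";

/// Hypothetical bounded group-by accumulator (illustration only).
pub struct GroupAccumulator {
    max_groups: usize,
    groups: HashMap<String, u64>,
    /// Set once at least one value has been folded into the overflow bucket.
    overflow: Option<u64>,
}

impl GroupAccumulator {
    pub fn new(max_groups: usize) -> Self {
        Self { max_groups, groups: HashMap::new(), overflow: None }
    }

    /// Add a value to its group, or to the overflow bucket if the limit is hit.
    pub fn add(&mut self, key: &str, value: u64) {
        if self.groups.contains_key(key) || self.groups.len() < self.max_groups {
            let slot = self.groups.entry(key.to_owned()).or_insert(0);
            *slot = slot.saturating_add(value);
        } else {
            let slot = self.overflow.get_or_insert(0);
            *slot = slot.saturating_add(value);
        }
    }

    /// Totals per group, with the synthetic overflow bucket appended if used.
    pub fn totals(&self) -> HashMap<String, u64> {
        let mut out = self.groups.clone();
        if let Some(folded) = self.overflow {
            out.insert(OVERFLOW_KEY.to_owned(), folded);
        }
        out
    }

    /// Warning text quoted in the commit, present only after folding.
    pub fn warning(&self) -> Option<&'static str> {
        self.overflow.map(|_| {
            "Group accumulator limit reached; additional groups were folded into __overflow__"
        })
    }
}
```

Memory stays bounded by `max_groups` plus one bucket, while totals remain complete because folded values are still counted.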

Code:
- types/journal.rs: removed the dead field; added a doc comment for
  query_max_groups explaining the __overflow__ bucket and the warning.
- defaults.rs: removed default_query_facet_max_values_per_field().
- validation/journal.rs: removed the non-zero check for the dead knob.
- plugin_config_tests.rs: removed
  validate_rejects_zero_query_facet_max_values_per_field test entirely;
  cleaned YAML fixtures.

Stock config + README:
- src/crates/netflow-plugin/configs/netflow.yaml: rewrote the journal
  block to use the per-tier retention form (carries over the F8 schema
  in the stock config); dropped both dead knobs; added clear comments
  for query_max_groups.
- src/crates/netflow-plugin/README.md: example updated, explanatory
  paragraph rewritten to describe what query_max_groups actually does.

Docs:
- configuration.md: Query guardrails table now lists only
  query_max_groups, with full description of overflow behaviour.
- retention-querying.md: Group-by limit section consolidated.
- visualization/filters-facets.md: removed the "Facet limits"
  subsection that documented the dead knob.

Build + tests: 426 passed, 0 failed (one dead-knob validation test
removed).

Breaking change notice: any user config setting
journal.query_facet_max_values_per_field will now fail to deserialize.
The key had no effect before; migration is delete-only.

Logged as F10 in SOW-0014 regression log.
The file existed as 0 bytes since the original documentation rewrite.
The netlify deploy preview rendered it as an empty page. Multiple other
pages cross-link to it (asn-resolution, static-metadata, network-identity,
README, configuration, the four provider integration cards), so the
empty placeholder was both a UX failure and a coverage gap.

Authored from scratch, code-grounded against:
- src/.../plugin_config/types/enrichment/geoip.rs (config schema)
- src/.../plugin_config/runtime.rs (auto-detect path)
- src/.../enrichment/data/geoip/resolver.rs (load + 30s refresh + per-IP
  composing-multiple-databases lookup)
- src/.../enrichment/data/network/asn.rs (AS-name rendering)

Page covers: the fields IP intelligence populates (with tier-preservation
notes), the configuration schema, auto-detection, refresh cadence, lookup
order vs the broader ASN provider chain, the four provider integration
cards (DB-IP / MaxMind / IPtoASN / Custom), private-IP rendering,
IPv6/IPv4 database split behaviour, staleness and accuracy caveats, and a
failure-modes table.

Frontmatter `learn_rel_path` matches the bgp-routing / network-identity
siblings ("Network Flows/Enrichment Concepts") for now -- the source
frontmatter is informational; the actual sidebar position derives from
docs/.map/map.yaml, and F20 will rename the section consistently.

Logged as F11 in SOW-0014 regression log.
…hment"

User: "'Enrichment Concepts' is a wrong title. 'Flows Enrichment' is
the right one."

Renamed in:
- docs/.map/map.yaml line 499 (the section label that drives the actual
  sidebar position on Learn).
- All seven `learn_rel_path` frontmatter values across
  docs/network-flows/enrichment/*.md. Prior state was inconsistent (4
  files had "Network Flows/Enrichment", 2 had "Network Flows/Enrichment
  Concepts", 1 had the F11-introduced "Flows Enrichment"). Settled on
  the canonical "Network Flows/Flows Enrichment" everywhere.

Logged as F20 in SOW-0014 regression log.
…ols"

User: "'Sources' is too generic. 'Flow Protocols' is the right one."

Renamed in:
- integrations/categories.yaml: flows.sources.name now "Flow Protocols".
- The three protocol-card frontmatter values
  (`netflow.md`, `ipfix.md`, `sflow.md`) now declare
  `learn_rel_path: "Network Flows/Flow Protocols"`.
- src/crates/netflow-plugin/metadata.yaml: removed three
  self-referencing learn URLs that pointed at
  /docs/network-flows/sources/{netflow,ipfix,sflow}. These links
  were broken before the rename (no /docs/network-flows/sources
  directory exists in source) and would stay broken under the new
  label. Replaced with the surviving "Network Flows Overview"
  anchor.
- Re-ran integrations/gen_integrations.py + gen_docs_integrations.py
  to regenerate the three protocol cards. Both exit clean.

Logged as F21 in SOW-0014 regression log.
User: "Netdata logs in namespace 'netdata'. Journalctl needs
`--namespace netdata`."

`-u netdata` selects the systemd UNIT and captures only stdout/stderr
the unit emits to the journal. Netdata writes structured logs into a
journal NAMESPACE called `netdata`. Without `--namespace netdata`,
users see at most unit-level startup/shutdown messages -- not the
plugin output that helps with debugging.

Swept all `journalctl -u netdata` invocations to
`journalctl --namespace netdata` across:
- quick-start.md
- troubleshooting.md (5 occurrences)
- installation.md
- enrichment/network-identity.md

Grep clean afterwards.

Logged as F18 in SOW-0014 regression log.
…tern

User: "How is it possible for users to ignore the sampling rate if we
calculate the estimated volume at ingestion? You invented reasons for
it. ... section must be removed."

The premise was wrong on multiple counts: per-flow multiplication is
always consistent (each record carries its own rate), users CAN'T
"ignore" the rate because Netdata applies it automatically at decode
time, and the "uniform rates required" framing is exactly the myth F4
+ F5 already corrected.

The two real concerns the section conflated remain documented
elsewhere:
- small flows missed at high sampling rates -- preserved in the
  Overview's "What sampling does to your numbers" section and in
  investigation-playbooks "Caveats".
- exporter sends no rate (v7 / v5 rate=0 / v9-IPFIX without Sampling
  Options Template) -- preserved in troubleshooting "Bandwidth doesn't
  match SNMP" and in validation.md.

Removed the entire section. The section renumbering will land with
F17, once all three section removals are complete.

Logged as F15 in SOW-0014 regression log.
…i-pattern

User: "Geolocation does not position internal IPs on the map. ...
section must be removed."

Code-verified at
src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:40-72.
`apply_geo_record` writes country/state/city/latitude/longitude only
when the MMDB record carries non-empty values for those fields. For
RFC 1918 / private IPs, the MMDB either has no entry or has one
tagged `ip_class: "private"` with no country/city/coords. Internal
IPs simply do not appear on geographic maps. The "in random
countries" claim was invented.
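
The behaviour described here, writing a geo field only when the database record carries a non-empty value, might look roughly like the sketch below. The types `GeoRecord` and `FlowGeo` and the function `apply_geo_sketch` are hypothetical simplifications, not the code at the cited path.

```rust
/// Hypothetical subset of what an MMDB lookup might return.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct GeoRecord {
    pub country: Option<String>,
    pub city: Option<String>,
    pub latitude: Option<f64>,
    pub longitude: Option<f64>,
    pub ip_class: Option<String>,
}

/// Hypothetical geo fields stored on a flow.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FlowGeo {
    pub country: Option<String>,
    pub city: Option<String>,
    pub latitude: Option<f64>,
    pub longitude: Option<f64>,
}

/// Return the trimmed string only if it has content.
fn non_empty(value: &Option<String>) -> Option<String> {
    value
        .as_deref()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
}

/// Copy only the fields the record actually carries; leave the rest untouched.
pub fn apply_geo_sketch(flow: &mut FlowGeo, rec: &GeoRecord) {
    if let Some(country) = non_empty(&rec.country) {
        flow.country = Some(country);
    }
    if let Some(city) = non_empty(&rec.city) {
        flow.city = Some(city);
    }
    if let (Some(lat), Some(lon)) = (rec.latitude, rec.longitude) {
        flow.latitude = Some(lat);
        flow.longitude = Some(lon);
    }
}

/// A flow can be drawn on a map only if it has coordinates.
pub fn is_mappable(flow: &FlowGeo) -> bool {
    flow.latitude.is_some() && flow.longitude.is_some()
}
```

A record tagged `ip_class: "private"` with no country, city, or coordinates leaves the flow unchanged, so it never becomes mappable.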

Removed the entire section. The section renumbering lands with F17.

The troubleshooting.md "Internal IPs in random countries" subsection
carries the same invented claim and will be addressed under F19.

Logged as F16 in SOW-0014 regression log.
…ds" + renumber

User: "Netdata does not support alerting of flows yet. Remove this
section."

The section's own footnote already acknowledged the issue: "Netdata's
alerting on flow data is in development; for now this pattern lives in
your monitoring practice, not in the plugin." So it was advice for
users to apply outside Netdata, not a Netdata anti-pattern.

Plus: with F15, F16, F17 all deleting sections, this commit renumbers
the anti-patterns.md sections to be sequential again (1 ... 9). Dropped
three rows from the summary table at the bottom (Ignored sampling,
GeoIP for internal IPs, Absolute thresholds) that referred to the
removed sections. Updated the cross-link in
visualization/time-series.md that called out
"time-shifted comparison beats absolute thresholds" -- replaced with a
general anti-patterns reference since the specific section is gone.

Logged as F17 in SOW-0014 regression log.
User: "This page has a mix of all the above issues: sampling, geoip,
etc."

Surgical fixes after F2-F18 land:

- Removed "Internal IPs in random countries" subsection -- same
  invented claim as F16 (GeoIP does not position internal IPs on
  maps; code-verified at
  src/.../enrichment/data/geoip/decode.rs:40-72).
- Rewrote "Things that look like bugs but aren't" entries to remove
  the F2/F3 "filter to one direction" framing and the F16 GeoIP
  myth. The two doubling-related bullets now point at
  exporter+interface filtering and at Source/Destination ASN
  filtering for bidirectional. Renamed "tier-0" -> "raw-tier" for
  consistency with the field reference.

Items kept (framing already correct after earlier findings):
- "Sampling rate not honoured by the exporter" -- correct (F4/F5).
- Doubling references in the SNMP-mismatch table (F2/F3).
- ASN provider chain debug recipe.
- Decapsulation destructive-on-non-tunnel.

Items deferred to per-page audit (R2):
- "Cisco's default template refresh is 30 minutes" -- vendor-
  specific claim; verify against current Cisco IOS-XE/XR docs in R2.

Logged as F19 in SOW-0014 regression log.
User: "I think the entire 'Validation and Data Quality' is completely
off. It mentions again sampling rates, etc."

Code-verified facts driving the rewrite:
- Per-flow sampling multiplication at decode time
  (decoder/record/core/record.rs:24-26): users don't need to monitor
  "sampling rate change" or "sampling rate misinterpretation".
- Template persistence across restarts
  (decoder/protocol/v9/templates.rs:106 +
  decoder/protocol/ipfix/templates/data.rs:67): users don't need to
  monitor "template loss after collector restart".
- UDP buffer overflow alert already exists
  (src/health/health.d/udp_errors.conf:6-19): kernel-level UDP drops
  are signalled by an existing system alert, not a silent failure.

Rewrite:
- New "What you actually need to watch" table -- five real failure
  modes (kernel UDP drops via existing alert, exporter stopped
  sending, wrong interfaces being exported, exporter sampling without
  communicating the rate, stale MMDB).
- Removed the three invented silent-failure items (sampling
  misinterpretation, sampling change, template loss).
- Removed the "Internal IP enrichment validation" section (F16
  confirmed GeoIP does not position internal IPs).
- Renamed "Sampling rate sanity check" to "Sampling rate
  verification" with the uniform-rate myth gone; kept the practical
  RAW_BYTES vs BYTES comparison.
- Removed the "Template cache health" subsection.
- Renamed the alerting table to "Plugin-side signals worth alerting
  on" and clarified these are signals the plugin exposes for the
  operator, not "silent failures" the dashboard hides.

Logged as F14 in SOW-0014 regression log.
…actical guide

User: "People want sizing and planning directions. This is not an
academic paper, not a blog."

Rewrote the page from scratch around the user's seven requirements:
- plugin cap (single-thread post-decode; ~25k flows/s sustained;
  ISP-scale anchor)
- how ingestion rate maps to storage (single table, 4 rows;
  ~800 bytes/flow empirical)
- raw tier dominates; bound it; example per-tier production config
- fast NVMe is the right call for the raw tier; slow storage means
  shorter retention
- memory: routing-trie footprint + page-cache headroom
- query speed: indexed fields fast; FTS = full scan of raw tier
- distributed deployment as the scaling answer (one agent per
  router/site; federated via Netdata Cloud; no central aggregation
  needed for flow data)
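
For a feel of the magnitudes, here is a back-of-envelope sketch using only the two figures quoted above (~800 bytes/flow, ~25k flows/s). It covers the raw tier only and deliberately ignores rollup tiers, rotation, and deduplication, so treat it as an order-of-magnitude estimate rather than the page's sizing table. The function names are hypothetical.

```rust
/// Empirical raw-tier cost per flow quoted in the commit (approximate).
const BYTES_PER_FLOW: u64 = 800;
const SECONDS_PER_DAY: u64 = 86_400;

/// Rough raw-tier bytes written per day at a sustained ingest rate.
pub fn raw_bytes_per_day(flows_per_sec: u64) -> u64 {
    flows_per_sec
        .saturating_mul(SECONDS_PER_DAY)
        .saturating_mul(BYTES_PER_FLOW)
}

/// Rough seconds of raw data that fit in a size budget; None at zero ingest.
pub fn seconds_within_budget(flows_per_sec: u64, budget_bytes: u64) -> Option<u64> {
    let bytes_per_sec = flows_per_sec.saturating_mul(BYTES_PER_FLOW);
    if bytes_per_sec == 0 {
        None
    } else {
        Some(budget_bytes / bytes_per_sec)
    }
}
```

At 25,000 flows/s this gives about 1.7 TB per day of raw data, and a 10 GB raw budget would hold only about 500 seconds, which is why the raw tier has to be bounded explicitly at that scale.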

Removed:
- All benchmark tables and methodology. Engineering benchmark
  numbers remain in src/crates/netflow-plugin/README.md.
- The "Bounding storage for capacity planning" formula derivation
  (ignored tier rollover and dedup; partly invalid).

Logged as F13 in SOW-0014 regression log.
…ion Overview

User on F12: "Retention is closer to configuration and querying is
closer to visualization. ... If you need to put generic visualization
rules, these should be a generic 'Visualization/Overview' page, to
explain FTS, sharing, grouping, etc."
User on F22: "The 'Section index' in the overview page is not needed.
Learn already shows the index as a side bar."

F12:
- New visualization/overview.md page collects "how queries work",
  "group-by limit and overflow", "full-text search", "URL sharing",
  filtering pointers, "picking the right view".
- retention-querying.md slimmed to retention-only (tiers, what
  survives rollup, tier auto-pick, "no data", what forces raw tier,
  default retention misconfig). Sidebar label renamed to "Retention
  and Tiers".
- map.yaml: Visualization sub-section root now carries edit_url +
  description pointing at visualization/overview.md (matches the F1
  pattern). Retention sidebar label renamed.

F22:
- Removed the "## Section index" block from README.md. The Learn
  sidebar already renders the same hierarchy. The "Where to start"
  role-based pointer block stays (not a sidebar duplicate). The
  "specific feature in depth" bullet now points readers at the
  sidebar.

Logged as F12 + F22 in SOW-0014 regression log.
…Overview

Per user: the bullet was redundant with the F22 cleanup. Sidebar
guidance is implicit; the four role-based bullets above are the
intended "where to start" entry points.
Documentation audits against source code surfaced a small number of
inaccuracies and a few low-severity polish items across the most-edited
pages. Fixed in-place, with the same surgical-edit policy as Phase R1.

Critical/high severity:

- configuration.md: drop the false claim that listener / protocols /
  journal keys can also appear at the top level. The flatten attribute
  is clap-only (CLI flag flatten), not serde; with deny_unknown_fields
  the YAML schema rejects unknown top-level keys. Stock file uses the
  nested form.
- validation.md: replace the wrong `dRcv` ss column reference with the
  actual `d<N>` value inside the `skmem:(...)` line (the sock_drop
  counter from iproute2 ss output).
- validation.md: replace the unreachable RAW_BYTES vs BYTES dashboard
  comparison with the supported approach -- group by the Sampling Rate
  field. RAW_BYTES is filtered from supported_flow_field_names and is
  not surfaced as a default table column.
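
For the `skmem:(...)` fix, a small sketch of pulling the `d<N>` drop counter out of one line of `ss` output (as printed with the memory option, `-m`) may help. `sock_drops` is a hypothetical helper, and the sample line in the test follows the usual `skmem:(r..,rb..,t..,tb..,f..,w..,o..,bl..,d..)` layout.

```rust
/// Extract the `d<N>` (socket drop) counter from a line containing `skmem:(...)`.
/// Returns None when the line has no skmem group or no parsable `d` field.
pub fn sock_drops(line: &str) -> Option<u64> {
    let marker = "skmem:(";
    let start = line.find(marker)? + marker.len();
    let rest = &line[start..];
    let inner = &rest[..rest.find(')')?];
    inner.split(',').map(str::trim).find_map(|field| {
        // Only the drop counter starts with a bare `d`; the other fields
        // begin with r, rb, t, tb, f, w, o, or bl.
        let digits = field.strip_prefix('d')?;
        digits.parse::<u64>().ok()
    })
}
```

A non-zero value that keeps growing between two readings suggests the socket is dropping datagrams.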

Medium severity:

- README.md: rename "Source ASN" to "Source AS Name" in the default
  group-by description; the actual default uses SRC_AS_NAME, which the
  dashboard renders as "Source AS Name". Same fix swept through
  quick-start.md, investigation-playbooks.md, troubleshooting.md,
  visualization/time-series.md, visualization/summary-sankey.md so the
  doc text matches the dashboard label.
- README.md, quick-start.md, anti-patterns.md, validation.md,
  visualization/summary-sankey.md: soften the "doubling by default"
  framing. Both ingress + egress export is a common configuration but
  not a property of the protocol, and vendor best practice is
  ingress-only. Wording now reflects that.
- validation.md: note the udp_errors alert ships as `to: silent` by
  default; operators must override `to:` to receive notifications.
- validation.md: add a `du -sh` example for cross-checking on-disk
  tier sizes (cross-link to sizing-capacity.md).

Low-severity polish:

- README.md: classifier expression language is an Akvorado-compatible
  subset (matches classifiers.md framing).
- README.md: rollup tier note now mentions the dropped fields, so the
  tier auto-pick claim is not over-broad.
- sizing-capacity.md: drop the "after rotation and compression" qualifier
  on the 800-bytes/flow figure (the bench window is too short to reflect
  rotation cycles); rephrase the ingest description and the "spinning
  rust" sentence; soften the BMP/BioRIS RSS guidance to a rough estimate
  with bench numbers anchored.
- configuration.md: document the query_max_groups / query-max-groups
  alias; document enrichment.geoip.optional and the abort-vs-warn
  semantics; clarify that default_sampling_rate and override_sampling_rate
  both accept a single integer or a per-prefix map; mention the 100 MB
  rotation-size fallback when size_of_journal_files is null.
- anti-patterns.md: add the missing "What it costs" line to sections 8
  and 9 for shape consistency with sections 1-7.

Second round of per-page audits against source code surfaced several
critical inaccuracies plus the usual long tail of low-severity polish.

Critical / high:

- retention-querying.md: rewrite the tier auto-pick rules. The previous
  thresholds were inverted ("8h20m and longer -> 1-hour") -- the actual
  planner walks coarsest first and accepts the first tier with at least
  100 aligned buckets, so >=100h -> 1h, 8h20m..<100h -> 5m,
  100min..<8h20m -> 1m. Verified at
  src/crates/netflow-plugin/src/query/planner/timeseries.rs:34-46 and
  TIMESERIES_MIN_BUCKETS=100 at src/crates/netflow-plugin/src/query/request/constants.rs:18.

- retention-querying.md: rewrite the rollup-preserved field list. The
  previous list claimed AS path, BGP communities, MPLS labels, MACs, and
  post-NAT addresses survive into rollups -- they do not. The actual
  rollup tier carries only the fields defined in
  src/crates/netflow-plugin/src/tiering/rollup/schema/fields/defs/{core,exporter,interface,network,presence}.rs;
  every other field is raw-only and forces the query to the raw tier.

- retention-querying.md: correct the "no data" / fallback semantics. The
  planner does NOT fall back to a coarser tier for raw-only queries --
  rollups don't carry the field, so the span returns empty. Confirmed
  at src/crates/netflow-plugin/src/query/planner/prepare.rs:25-28 and
  src/crates/netflow-plugin/src/query/planner/spans.rs:99-105
  (lower_fallback_candidate_tiers returns &[] for Raw).

- validation.md: replace the "group by Sampling Rate field" verification
  with an SNMP-magnitude cross-check. The SAMPLING_RATE field is
  filtered out of supported_flow_field_names
  (src/crates/netflow-plugin/src/query/request/constants.rs:80),
  excluded from the groupable set
  (src/crates/netflow-plugin/src/query/fields/rules.rs:33), and not
  available as a facet
  (src/crates/netflow-plugin/src/facet_catalog.rs:123). Users cannot
  pick it from the dashboard. The honest verification path is SNMP
  magnitude or a per-prefix override.

- validation.md: correct the alert threshold framing. RcvbufErrors is
  read with RRD_ALGORITHM_INCREMENTAL
  (src/collectors/proc.plugin/proc_net_netstat.c:400-434), so the value
  Netdata stores is per-second. The "lookup: average -1m absolute" plus
  "$this > 10" in src/health/health.d/udp_errors.conf means >10
  errors/SECOND averaged over 1 minute, not >10/minute as previously
  stated.

- quick-start.md: correct the field labels used in the doubling fix
  step from "Input Interface Name" / "Output Interface Name" to
  "Ingress Interface Name" / "Egress Interface Name", matching the
  display labels in src/crates/netflow-plugin/src/presentation/display.rs:39-40.
  Same fix swept across anti-patterns.md, troubleshooting.md,
  validation.md, investigation-playbooks.md, and
  visualization/summary-sankey.md.

- quick-start.md: drop the false "60-second template refresh" claim for
  softflowd. softflowd's `expint` flag controls expiry-check interval,
  not template refresh; the NetFlow v9 template interval in softflowd
  is a compile-time default
  (NF9_DEFAULT_TEMPLATE_INTERVAL=16 in netflow9.c) with no CLI knob.

- quick-start.md: complete the Juniper J-Flow snippet. The previous
  example defined a sampling instance but never bound it to a
  forwarding card and never set a sampling rate, so it would not
  produce flows. Add `set chassis fpc 0 sampling-instance NETDATA` and
  `set forwarding-options sampling instance NETDATA input rate 1000`,
  with a short note explaining the FPC binding requirement.

- quick-start.md: correct the dashboard navigation step from "click the
  Network Flows tab" to "open the Live tab and select Network Flows
  from the Functions list".

- ip-intelligence.md: correct the GeoLite2 / DB-IP / IPtoASN cadence
  claims. DB-IP Lite is monthly. MaxMind GeoLite2 City/Country update
  on weekdays; GeoLite2 ASN updates daily since June 2024. IPtoASN is
  not MMDB -- it is a public-domain TSV feed that includes both ASN
  and country and must be converted to MMDB before the plugin can read
  it (the plugin only supports MMDB).

- ip-intelligence.md: correct the dual-stack guidance. Most current
  providers ship a single dual-stack MMDB; the previous "configure
  both an IPv4 file and a separate IPv6 file" advice was misleading.

- ip-intelligence.md: clarify the asn_providers chain semantics. The
  `geoip` provider is a terminal "use 0" shortcut -- when reached the
  AS number is forced to 0 (the AS name still comes from the MMDB
  lookup independently). Confirmed at
  src/crates/netflow-plugin/src/enrichment/asn/resolve.rs:75-109.

- ip-intelligence.md: distinguish the database-composition rules. ASN
  fields use pure last-wins; geo fields are written only when the
  matching record has a non-empty value, so a later database with an
  empty city does not overwrite an earlier database's city
  (src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:40-72).
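
The alert-threshold correction for validation.md above comes down to units: an incremental counter is stored as a per-second rate, and the one-minute lookup averages those rates. A minimal Rust sketch of that arithmetic follows. The function names and the `(unix_seconds, counter)` sample shape are hypothetical illustrations, not Netdata's health engine or the proc.plugin code.

```rust
/// Turns a monotonically increasing counter into per-second rates,
/// one rate per pair of consecutive samples.
/// Each sample is (unix_seconds, counter_value); this shape is an assumption.
fn per_second_rates(samples: &[(u64, u64)]) -> Vec<f64> {
    samples
        .windows(2)
        .filter_map(|pair| {
            let (t0, c0) = pair[0];
            let (t1, c1) = pair[1];
            // Skip pairs whose timestamps go backwards or do not advance.
            let dt = t1.checked_sub(t0)?;
            if dt == 0 {
                return None;
            }
            // saturating_sub treats a counter reset as "no increase".
            Some(c1.saturating_sub(c0) as f64 / dt as f64)
        })
        .collect()
}

/// Averages the absolute per-second rates over the window and compares the
/// result with a per-second threshold (10.0 for the "$this > 10" rule).
fn alert_fires(samples: &[(u64, u64)], threshold_per_sec: f64) -> bool {
    let rates = per_second_rates(samples);
    if rates.is_empty() {
        return false;
    }
    let avg = rates.iter().map(|r| r.abs()).sum::<f64>() / rates.len() as f64;
    avg > threshold_per_sec
}
```

With one sample per second over a minute, a counter that grows by 10 in total averages about 0.17 errors/second and stays quiet; a counter that grows by 11 every second averages 11 errors/second and fires.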

Medium / low:

- sizing-capacity.md: rewrite the example raw-tier YAML so the size cap
  matches the page's own 25k flows/s framing. The previous example used
  200GB / 24h; at 25k flows/s the size cap would fire after ~2.8h, not
  24h. Now uses 2TB / 24h with a paragraph on how to scale down for
  lighter loads, and explains the size-vs-duration relationship.

- validation.md: drop the misleading `decoder_state_dir` config-key
  reference (it is a derived path, not a user-facing key); add the
  `-n` flag to the ss command to keep the port numeric in the output.

- anti-patterns.md / troubleshooting.md / validation.md /
  visualization/summary-sankey.md: consistent doubling-framing hedge
  ("a common configuration; vendor best practice is ingress-only")
  across all five docs that mention doubling.

- anti-patterns.md: summary-table row "Doubled aggregate" qualified
  with "(when ingress + egress are both exported)".

- configuration.md: correct the `override_sampling_rate` default
  example from `{}` to `~` (the actual schema default is None).
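
The size-vs-duration relationship behind the sizing-capacity.md fix can be checked with two lines of arithmetic. The sketch below assumes the 800-bytes/flow figure quoted for sizing-capacity.md and decimal units (1 GB = 10^9 bytes); the function names are hypothetical and nothing here is plugin code.

```rust
/// Hours until a size cap is reached at a steady ingest rate.
fn hours_until_size_cap(cap_bytes: f64, flows_per_sec: f64, bytes_per_flow: f64) -> f64 {
    cap_bytes / (flows_per_sec * bytes_per_flow) / 3600.0
}

/// Bytes needed to hold `hours` of flows at a steady ingest rate.
fn bytes_for_duration(hours: f64, flows_per_sec: f64, bytes_per_flow: f64) -> f64 {
    hours * 3600.0 * flows_per_sec * bytes_per_flow
}
```

At 25k flows/s and 800 bytes/flow the ingest rate is 20 MB/s, so `hours_until_size_cap(200e9, 25_000.0, 800.0)` is roughly 2.8 hours, while `bytes_for_duration(24.0, 25_000.0, 800.0)` is about 1.73 TB, which is why a 2TB cap is the one that lets the 24h duration limit take effect.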
retention-querying.md:
- correct the Time-Series sub-100-min fallback. The planner walks coarsest-first
  and falls back to the 1-minute tier (TierKind::Minute1) when no tier has
  >=100 aligned buckets, not to raw. Verified at
  src/crates/netflow-plugin/src/query/planner/timeseries.rs:39-46.
- split the field-eligibility list. The "force raw" set is exactly
  RAW_ONLY_FIELDS plus V9_*/IPFIX_* prefixes
  (src/crates/netflow-plugin/src/query/fields/rules.rs:5-11 +
  src/crates/netflow-plugin/src/query/request/constants.rs:46-57). The
  previously included AS path / BGP communities / MPLS labels / MAC addresses /
  NAT addresses do NOT switch tier; they are dropped from rollup output and
  return null on rollup queries. Page now describes both classes separately.
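
The corrected Time-Series tier rule (walk coarsest first, accept the first tier with at least 100 buckets, otherwise fall back to the 1-minute tier) can be sketched as below. This is a simplification under stated assumptions: it counts buckets by integer division and ignores bucket alignment, it does not model the raw-only field check, and the `Tier` enum and function name are hypothetical stand-ins rather than the code in timeseries.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Minute1,
    Minute5,
    Hour1,
}

/// Mirrors the TIMESERIES_MIN_BUCKETS=100 constant cited above.
const MIN_BUCKETS: u64 = 100;

/// Picks a rollup tier for a query window given in seconds.
fn pick_timeseries_tier(window_secs: u64) -> Tier {
    // Candidates ordered coarsest first, with their bucket width in seconds.
    const CANDIDATES: [(Tier, u64); 3] = [
        (Tier::Hour1, 3600),
        (Tier::Minute5, 300),
        (Tier::Minute1, 60),
    ];
    for (tier, bucket_secs) in CANDIDATES {
        if window_secs / bucket_secs >= MIN_BUCKETS {
            return tier;
        }
    }
    // No tier reaches 100 buckets: fall back to the 1-minute tier, not raw.
    Tier::Minute1
}
```

Under these assumptions a 100-hour window selects the 1-hour tier, 8h20m selects the 5-minute tier, 100 minutes selects the 1-minute tier, and a 30-minute window also lands on the 1-minute tier through the fallback.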

troubleshooting.md:
- correct Cisco's default template refresh from "30 minutes" to
  "600 seconds (10 minutes)" -- this is the IOS / IOS-XE Flexible NetFlow
  `template data timeout` default.
- replace the wrong "/proc/net/udp ... RcvbufErrors column" recipe. The
  /proc/net/udp file lists open sockets without per-socket drop counters; the
  kernel-wide RcvbufErrors total lives under the Udp: line of
  /proc/net/snmp, which is what Netdata's proc.plugin reads
  (src/collectors/proc.plugin/proc_net_netstat.c:1521). Also dropped the
  contradictory "30-60 seconds" template-refresh hint.

ip-intelligence.md:
- correct the ASN composition rule. The previous "last-wins for ASN, but
  geo writes only when non-empty" framing was wrong -- src/crates/netflow-plugin/src/enrichment/data/geoip/decode.rs:3-28
  filters empty / zero values for ASN fields just like geo, so both sets
  follow the same "last database with a non-empty value wins" rule.
- correct the GeoLite2 cadence. MaxMind's documentation publishes City and
  Country twice weekly (Tuesday and Friday); GeoLite2 ASN moved to every
  weekday in June 2024.
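
The corrected composition rule ("last database with a non-empty value wins", for ASN and geo fields alike) can be illustrated with a small merge. The `IpRecord` struct and its four fields are a hypothetical reduction of what a lookup returns, chosen for the example; the decoder in decode.rs handles more fields than this.

```rust
#[derive(Debug, Default, Clone, PartialEq)]
struct IpRecord {
    asn: u32,        // 0 is treated as "empty"
    as_name: String, // "" is treated as "empty"
    country: String,
    city: String,
}

/// Merges per-database lookup results in configuration order.
/// A later database overwrites a field only when its own value is non-empty.
fn compose(databases: &[IpRecord]) -> IpRecord {
    let mut out = IpRecord::default();
    for rec in databases {
        if rec.asn != 0 {
            out.asn = rec.asn;
        }
        if !rec.as_name.is_empty() {
            out.as_name = rec.as_name.clone();
        }
        if !rec.country.is_empty() {
            out.country = rec.country.clone();
        }
        if !rec.city.is_empty() {
            out.city = rec.city.clone();
        }
    }
    out
}
```

If the first database yields AS 64500 / "EXAMPLE-A" / "GR" / "Athens" and the second yields only country "DE", the merged record keeps the ASN, AS name, and city from the first database and takes the country from the second.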
…+ Enrichment Methods

The previous category tree split enrichment-method integrations across three
separate sub-categories (IP Intelligence, BGP Routing, Network Identity Sources),
which made the integrations page navigation harder than it needed to be and did
not match the conceptual model: the operator is choosing a *data source* for
enrichment, regardless of what kind of data it produces.

Collapsed under flows.enrichment-methods:

- ip-intelligence: dbip, maxmind, iptoasn, custom-mmdb
- bgp-routing: bmp, bioris
- network-identity: aws-ip-ranges, gcp-ip-ranges, azure-ip-ranges, netbox,
  generic-ipam

flows.sources keeps its existing membership (netflow / ipfix / sflow) and the
"Flow Protocols" name unchanged.

Both YAML files validated. Per-card content merges and the new cross-cutting
"Enrichment" + "Enrichment Intel Downloader" pages land in subsequent commits.
…t integration cards

Round-by-round, agent-per-card merges. Each card absorbed the durable
provider-specific content from its corresponding concept page; cross-cutting
content was extracted for the new "Enrichment" page (separate commit). The
agents verified every behavioural claim against current source code at file:line
and every upstream URL by WebFetch -- not a mechanical sweep.

IP Intelligence:

- dbip: framed as the auto-detected default, monthly Lite cadence, CC-BY-4.0,
  populated-fields breakdown (geo + ASN), AS0 Private/Unknown labels driven by
  the DB-IP-built ip_class flag, raw-tier-only city/lat/lon.
- maxmind: GeoLite2 vs commercial GeoIP2 split, account-id + license-key auth,
  twice-weekly Tuesday/Friday cadence for City/Country, every-weekday for ASN
  since June 2024, geoipupdate setup. Important correction: the bundled
  topology-ip-intel-downloader does NOT support MaxMind (only dbip and iptoasn);
  the previous card's hint at the netdata downloader was misleading.
- iptoasn: PDDL public-domain feed, hourly TSV cadence (the previous card
  said "daily" -- wrong), bundled topology-ip-intel-downloader natively
  supports the TSV->MMDB conversion (correcting an earlier prompt assumption
  to the contrary), three setup examples including ASN-only and combined
  with DB-IP geo.
- custom-mmdb: reframed as the escape hatch for operators producing custom
  MMDBs (CIDR overlays, internal AS labels). Lists the field names the
  decoder reads from any MMDB, with file:line evidence; cites mmdbwriter
  libraries; recommends `optional: true` during build iteration.

BGP Routing:

- bmp: BMP-v3-only handling -- v1/v2 silently dropped (previously implicit).
  RFC 8671 cited for JunOS post-policy support since 18.3R1, separately
  from RFC 7854 which was previously lumped together. Cisco IOS-XE BMP
  added (was missing). Nokia SR OS added. JunOS minimum 13.3 documented.
  No IANA-registered port for BMP.
- bioris: corrected the topology -- Netdata connects to a USER-RUN bio-rd
  cmd/ris/ daemon over user-supplied gRPC, NOT directly to RIPE RIS. The
  user's bio-rd daemon does the BGP/BMP peering with upstream sources. No
  shipped collector list (ris_instances is required and operator-supplied).
  Memory cost (~hundreds of MB per peer for full-table feeds), no eviction,
  raw-tier-only AS path / communities.

Network Identity:

- aws-ip-ranges: schema reference (top-level + per-entry), live cadence
  softened ("whenever AWS IP space changes, often several times per day"
  rather than the folklore "every 15 minutes" -- AWS docs do not promise
  any fixed schedule). Three jq examples including network_border_group as
  site. Plugin's actual config key is `transform`, not `jq_program` (was
  wrong in the migration prompt).
- gcp-ip-ranges: cloud.json vs goog.json comparison; cloud.json today
  reports `service: "Google Cloud"` uniformly so per-service pivots are
  not possible from this file. No fixed Google cadence in the docs.
- azure-ip-ranges: URL rotates weekly. Service Tag Discovery REST API as
  authoritative alternative. API data lags JSON file by up to four weeks;
  new IPs aren't used for at least one week after publication. Three
  workaround patterns documented honestly.
- netbox: documented breaking change in NetBox 4.2 -- the `site` foreign
  key on Prefix was replaced with the generic `scope` field; the previous
  example used `(.site.name // "")` which silently breaks on 4.2+. New
  card ships scope-aware (4.x) and legacy (3.x/4.0/4.1) examples plus a
  fallback `(.scope.name // .site.name // "")`. Two token formats
  documented (legacy v1 hex and v2 nbt_<key>.<token>).
- generic-ipam: full RemoteNetworkSourceConfig schema (13 options including
  proxy, tls.enable, tls.verify, tls.skip_verify with explicit "rejected
  by validation" notes). Honest call-outs: POST is sent without a body
  (fetch.rs:11-17), interval floored at 60s by service.rs:73, TLS
  verification cannot be disabled (validation/enrichment.rs:183-192).

Categories: each card now under flows.enrichment-methods (the previous
ip-intelligence / bgp-routing / network-identity sub-categories were
collapsed in the previous commit). Generated .md files updated via
integrations/gen_docs_integrations.py.
…ownloader page

Three new cards under flows.enrichment-methods, completing the "every
enrichment method is an integration" model. Each card was authored by an
agent that read the corresponding concept page and the source code, then
produced a metadata.yaml entry with every claim cited at file:line. The
agents flagged real inconsistencies between the concept pages and the
code; corrections were absorbed into the cards.

static_metadata:

- Three configuration surfaces: enrichment.metadata_static.exporters,
  enrichment.networks, and enrichment.override_sampling_rate (plus
  default_sampling_rate for the distinction).
- Field-population table tied to source at apply/metadata.rs:41-53,
  data/network/write.rs:93-125, apply/metadata.rs:78-97.
- Five corrections vs the previous concept page: boundary "undefined" vs
  numeric 0 are byte-identical in output; lookup priority is dominated
  by prefix specificity, not source-kind; override_sampling_rate matches
  the UDP datagram source IP; the "networks merges last and wins"
  framing was overstated.
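
The "prefix specificity, not source-kind" correction can be pictured as a longest-prefix match in which source kind only breaks ties. The sketch below is IPv4-only and returns a single winning entry, which is a deliberate simplification: per this PR the real path merges per field. The types, names, and the static-wins-ties ordering as coded here are illustrative assumptions, not the plugin's data structures.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceKind {
    Remote,
    Static,
}

#[derive(Debug, Clone, PartialEq)]
struct NetworkEntry {
    network: Ipv4Addr,
    prefix_len: u8,
    kind: SourceKind,
    name: &'static str,
}

/// True when `ip` falls inside the entry's CIDR block.
fn contains(entry: &NetworkEntry, ip: Ipv4Addr) -> bool {
    let len = u32::from(entry.prefix_len.min(32));
    let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
    (u32::from(ip) & mask) == (u32::from(entry.network) & mask)
}

/// Most specific matching entry; on equal prefix length a static entry
/// is preferred over a remote one.
fn best_match<'a>(entries: &'a [NetworkEntry], ip: Ipv4Addr) -> Option<&'a NetworkEntry> {
    entries
        .iter()
        .filter(|e| contains(e, ip))
        .max_by_key(|e| (e.prefix_len, e.kind == SourceKind::Static))
}
```

In this model a remote /16 beats a static /8 for an address inside both, and a static /16 beats a remote /16 for the same prefix.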

classifiers:

- Two evaluation surfaces: exporter_classifiers + interface_classifiers,
  the latter called twice per flow (once per interface side).
- Akvorado-compatible expression-language *subset* -- explicitly only the
  operators and actions implemented at enrichment/classifiers/parse.rs;
  every example in the card uses syntax verified against the parser and
  the existing test suite.
- Output normalisation includes "+" (concept page only listed ". -");
  static metadata short-circuits classifier evaluation
  (enrichment/classify.rs:117-119, :150-154); first-write-wins per slot
  (runtime/eval/action.rs:43-46); default cache 5m with >=1s validation
  (defaults.rs:46-48, validation/enrichment.rs:10-12).

decapsulation:

- Two modes (srv6, vxlan) per protocol.rs:50-57; default none.
- Three transport feeders: NetFlow v9 IE 104 (decoder.rs:80), IPFIX
  IE 315 (decoder.rs:74), sFlow SampledHeader (sflow/record.rs:44-69).
- Inner parsers at decoder/common.rs:3-18 (VXLAN port 4789) and
  :35-63 (SRv6 SRH walker); merge points at packet/transport.rs:21-33
  and record/packet/parse/transport.rs:14-21.
- Vendor-verification finding: Cisco IOS-XE / IOS-XR collect datalink
  frame-section could NOT be verified (cisco.com 403'd anonymous
  WebFetch and Akvorado's IOS-XE recipe deliberately omits L2 frame-
  section export). The card flags Cisco support as unverified and
  instructs operators to validate by template inspection. Juniper
  inline-monitoring with datalink-frame-size confirmed via the
  Akvorado mirror; sFlow header sampling confirmed via the project's
  decoder.
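
For readers unfamiliar with what the VXLAN inner parser has to skip, the sketch below shows the standard 8-byte VXLAN header from RFC 7348 (flags, reserved bytes, 24-bit VNI, reserved byte) on UDP port 4789. It is written from the RFC layout as an illustration; the function name is hypothetical and this is not the code at decoder/common.rs, which may validate differently.

```rust
const VXLAN_UDP_PORT: u16 = 4789;
const VXLAN_HEADER_LEN: usize = 8;

/// Returns (VNI, inner Ethernet frame bytes) when the UDP payload carries a
/// VXLAN header with the "valid VNI" flag set, otherwise None.
fn parse_vxlan(udp_dst_port: u16, udp_payload: &[u8]) -> Option<(u32, &[u8])> {
    if udp_dst_port != VXLAN_UDP_PORT || udp_payload.len() < VXLAN_HEADER_LEN {
        return None;
    }
    // Byte 0 holds the flags; bit 0x08 (the I flag) marks the VNI as valid.
    if (udp_payload[0] & 0x08) == 0 {
        return None;
    }
    // Bytes 4..=6 hold the 24-bit VNI in network byte order; byte 7 is reserved.
    let vni = (u32::from(udp_payload[4]) << 16)
        | (u32::from(udp_payload[5]) << 8)
        | u32::from(udp_payload[6]);
    Some((vni, &udp_payload[VXLAN_HEADER_LEN..]))
}
```

The returned slice starts at the inner Ethernet frame, which is the part a decapsulating collector would parse to override the outer addresses and ports.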

intel-downloader.md (new operator-tool page):

- Documents /usr/sbin/topology-ip-intel-downloader -- supported sources,
  CLI flags, atomic replacement, auto-detect integration with the
  netflow plugin's 30s reload window.
- Findings: no packaged systemd timer or cron file (operators must
  install their own; page provides a starter unit + timer pair);
  MaxMind support confirmed absent (no license_key field anywhere in
  config.go; only iptoasn:combined and dbip:asn-lite/country-lite/
  city-lite are recognised by builtInSource); MaxMind users directed to
  geoipupdate as the alternative.
- Hidden capability documented: interesting_cidrs config knob lets
  operators stamp public CIDRs as netdata.ip_class = "interesting" in
  both ASN and geo MMDBs (write.go:228-246).

The categories.yaml category for all 14 enrichment methods is now
flows.enrichment-methods (previously split across ip-intelligence /
bgp-routing / network-identity).
…; update map.yaml

Restructure phase 5+8+9: collapse the docs/network-flows/enrichment/ directory
(7 concept pages: asn-resolution, bgp-routing, classifiers, decapsulation,
ip-intelligence, network-identity, static-metadata) into ONE consolidated
cross-cutting page at docs/network-flows/enrichment.md, plus per-method
integration cards (already added in previous commits).

The new Enrichment page (447 lines) is the single home for cross-cutting
concepts that span every enrichment method:

- Order of evaluation per flow record (8-step pipeline cited at
  apply/resolve.rs:5-50 and init.rs:50-64).
- The two provider chains (asn_providers / net_providers, the geoip
  terminal "use 0" shortcut, the AS-number-vs-AS-name distinction).
- Composition rules: specificity dominates, ties to static, per-field
  non-empty-wins merge.
- The MMDB shared mechanism (auto-detect path order, last-non-empty-wins
  composition, 30s signature reload, IPv4/IPv6 dual-stack handling).
- Network sources operational properties (fetch loop, 60s floor, jq
  schema, deny_unknown_fields, TLS-no-disable, no pagination, no auth
  helpers, POST-without-body, journal diagnostics).
- Static-metadata-blocks-classifiers semantics.
- Classifier evaluation surfaces and ordering.
- Decapsulation inner-packet override.
- Routing overlay (BMP+BioRIS shared trie).
- Cross-method operational properties: refresh windows, restart behaviour,
  no in-process freshness signal, empty-tree disables enricher, rollup
  tier survival table, geographic accuracy, sampling-rate knobs,
  integration test gap.

The page resolved several discrepancies the original concept pages had
against the source code -- e.g. the "static metadata > classifiers >
network sources > GeoIP > BGP routing" precedence claim was misleading
(actual rule: specificity dominates, ties to static, merge primitive is
non-empty-wins); the bmp alias for routing lives at providers.rs:10,12,
not validation/enrichment.rs; the GeoIP terminal shortcut sets the AS
number to 0 but the AS *name* still comes from the MMDB independently;
POST is sent without a body (fetch.rs:11-17); interval is silently
floored at 60s (service.rs:73). Each correction is cited at file:line.

map.yaml: removed the "Flows Enrichment" sub-section (which contained
the 7 deleted pages); added two new entries between Configuration and
Field Reference:
- "Enrichment" -> docs/network-flows/enrichment.md
- "Enrichment Intel Downloader" -> docs/network-flows/intel-downloader.md

Cross-references updated in 8 surviving pages to point at either the new
Enrichment page (for cross-cutting concepts) or the relevant integration
cards under src/crates/netflow-plugin/integrations/ (for per-method
specifics): configuration, validation, intel-downloader, quick-start,
installation, troubleshooting, visualization/maps-globe.

The 7 deleted concept pages had their durable content fully absorbed
into the Enrichment page (cross-cutting) and the 14 integration cards
(per-method); each integration card cites file:line evidence for every
behavioural claim and was re-verified against current source code by
its merge agent.

The merge agents preserved Learn-URL cross-references from the original
concept pages (e.g. https://learn.netdata.cloud/docs/network-flows/enrichment/
ip-intelligence). Those URLs now 404 because the seven concept pages were
collapsed into one. Sweep them all to point at the new consolidated
https://learn.netdata.cloud/docs/network-flows/enrichment page, which
covers the cross-cutting concepts (MMDB shared mechanism, asn_providers
chain, network-source operational properties, etc.) that the per-method
references were calling out.

Generated cards regenerated via gen_docs_integrations.py to flush the
new URL into the .md outputs.
@ktsaou ktsaou requested a review from Copilot May 7, 2026 21:23
@ktsaou ktsaou marked this pull request as ready for review May 7, 2026 21:23
Contributor

Copilot AI left a comment

Pull request overview

This pull request aligns Network Flows documentation and generated integration cards with current supported behavior, while also updating the netflow-plugin to (1) use per-tier retention configuration and (2) correctly persist timestamp_source into _SOURCE_REALTIME_TIMESTAMP.

Changes:

  • Fix netflow-plugin ingestion so timestamp_source persists the decoded source timestamp as _SOURCE_REALTIME_TIMESTAMP (keeping journal entry realtime as receive/write time) and add an end-to-end test.
  • Migrate journal retention configuration to per-tier journal.tiers.* and remove unsupported/obsolete journal query knobs.
  • Restructure/refresh Network Flows docs + integration cards (new categories, new visualization overview, updated enrichment/network-source cards, moved/removed older concept pages).

Reviewed changes

Copilot reviewed 60 out of 62 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/crates/netflow-plugin/src/startup_memory_tests.rs` | Updates tests to rely on per-tier retention defaults. |
| `src/crates/netflow-plugin/src/query/tests.rs` | Minor formatting tweak in request deserialization tests. |
| `src/crates/netflow-plugin/src/plugin_config/validation/journal.rs` | Removes validation for deleted knobs; validates per-tier retention limits. |
| `src/crates/netflow-plugin/src/plugin_config/types/journal.rs` | Implements per-tier retention schema and removes retired journal query fields. |
| `src/crates/netflow-plugin/src/plugin_config/defaults.rs` | Removes now-unused retention override helpers and facet limit default. |
| `src/crates/netflow-plugin/src/plugin_config.rs` | Drops re-export of removed RetentionLimitOverride. |
| `src/crates/netflow-plugin/src/plugin_config_tests.rs` | Updates config tests for per-tier retention and removed knobs. |
| `src/crates/netflow-plugin/src/memory_tests.rs` | Updates fixtures to use per-tier retention config. |
| `src/crates/netflow-plugin/src/main_tests.rs` | Adds E2E test ensuring _SOURCE_REALTIME_TIMESTAMP reflects decoded source timestamp; adds helpers to read first raw journal entry fields. |
| `src/crates/netflow-plugin/src/ingest/service/runtime.rs` | Persists decoded source timestamp into journal entry timestamps via _SOURCE_REALTIME_TIMESTAMP. |
| `src/crates/netflow-plugin/src/ingest_resource_bench_tests.rs` | Minor formatting cleanups in benchmark test code. |
| `src/crates/netflow-plugin/src/facet_runtime.rs` | Minor formatting in tests. |
| `src/crates/netflow-plugin/src/api/flows/handler.rs` | Minor formatting of spawn_blocking autocomplete path. |
| `src/crates/netflow-plugin/README.md` | Updates documented config to per-tier retention and query_max_groups semantics. |
| `src/crates/netflow-plugin/integrations/sflow.md` | Updates generated integration card content and config options for per-tier retention and docs navigation. |
| `src/crates/netflow-plugin/integrations/netflow.md` | Updates generated NetFlow integration card for per-tier retention and revised guidance. |
| `src/crates/netflow-plugin/integrations/maxmind_geoip_-_geolite2.md` | Rewrites MaxMind integration card with clarified setup, limits, and composition behavior. |
| `src/crates/netflow-plugin/integrations/iptoasn.md` | Rewrites IPtoASN integration card (cadence, downloader behavior, composition). |
| `src/crates/netflow-plugin/integrations/ipfix.md` | Updates IPFIX integration card for per-tier retention and revised guidance. |
| `src/crates/netflow-plugin/integrations/gcp_ip_ranges.md` | Expands GCP network-source integration card (cloud.json vs goog.json, options, examples). |
| `src/crates/netflow-plugin/integrations/decapsulation.md` | Adds new generated decapsulation integration card under enrichment methods. |
| `src/crates/netflow-plugin/integrations/db-ip_ip_intelligence.md` | Updates DB-IP integration card with new defaults and operational guidance. |
| `src/crates/netflow-plugin/integrations/custom_mmdb_database.md` | Updates custom MMDB integration card (schema expectations, atomic refresh guidance). |
| `src/crates/netflow-plugin/integrations/azure_ip_ranges.md` | Expands Azure Service Tags integration card (URL rotation, transform schema, examples). |
| `src/crates/netflow-plugin/integrations/aws_ip_ranges.md` | Expands AWS IP ranges integration card (schema, transform requirements, options). |
| `src/crates/netflow-plugin/configs/netflow.yaml` | Migrates sample config to per-tier retention and removes retired query knobs. |
| `integrations/categories.yaml` | Restructures Network Flows categories (Flow Protocols + Enrichment Methods). |
| `docs/network-flows/visualization/time-series.md` | Updates tier terminology and defaults; describes raw-tier forcing conditions. |
| `docs/network-flows/visualization/summary-sankey.md` | Updates UI navigation and default group-by language (AS Name). |
| `docs/network-flows/visualization/overview.md` | Adds shared visualization/query behavior overview page. |
| `docs/network-flows/visualization/maps-globe.md` | Updates tier terminology and troubleshooting guidance for map/globe views. |
| `docs/network-flows/visualization/filters-facets.md` | Updates tier terminology and removes docs for deleted facet-limit knob. |
| `docs/network-flows/visualization/dashboard-cards.md` | Updates wording and troubleshooting commands for plugin health charts. |
| `docs/network-flows/validation.md` | Reworks validation routine and monitoring guidance. |
| `docs/network-flows/troubleshooting.md` | Updates troubleshooting steps, commands, and terminology; reflects per-tier retention docs. |
| `docs/network-flows/README.md` | Updates overview, defaults, and conceptual guidance (sampling, doubling, bidirectional traffic). |
| `docs/network-flows/quick-start.md` | Updates quick-start configs and dashboard navigation guidance. |
| `docs/network-flows/investigation-playbooks.md` | Updates playbooks for new terminology/default fields and UI navigation. |
| `docs/network-flows/intel-downloader.md` | Adds a new page documenting topology-ip-intel-downloader behavior and ops. |
| `docs/network-flows/installation.md` | Updates installation guidance and IP intelligence defaults/links. |
| `docs/network-flows/field-reference.md` | Updates field semantics text (timestamps, selectivity, observation time mapping). |
| `docs/network-flows/enrichment/static-metadata.md` | Removes old enrichment concept page (superseded by new structure/cards). |
| `docs/network-flows/enrichment/network-identity.md` | Removes old concept page (superseded by new structure/cards). |
| `docs/network-flows/enrichment/decapsulation.md` | Removes old decapsulation doc page (replaced by integration card/updated docs). |
| `docs/network-flows/enrichment/classifiers.md` | Removes old classifiers concept page (superseded by new structure/cards). |
| `docs/network-flows/enrichment/bgp-routing.md` | Removes old BGP routing concept page (superseded by new structure/cards). |
| `docs/network-flows/enrichment/asn-resolution.md` | Removes old ASN resolution concept page (superseded by new structure/cards). |
| `docs/.map/map.yaml` | Updates docs sidebar structure for Network Flows section and adds new pages. |

Comment thread src/crates/netflow-plugin/src/plugin_config/types/journal.rs Outdated
Comment thread docs/network-flows/visualization/time-series.md Outdated
Comment thread docs/network-flows/visualization/maps-globe.md Outdated
@github-actions github-actions Bot added the area/collectors Everything related to data collection label May 7, 2026
@ktsaou
Member Author

ktsaou commented May 7, 2026

@cubic-dev-ai please review again

@ktsaou ktsaou requested a review from Copilot May 7, 2026 23:14
@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented May 7, 2026

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

9 issues found across 85 files

Confidence score: 4/5

  • This PR is likely safe to merge, but there is mild user-facing risk because several documentation updates in src/crates/netflow-plugin/integrations/*.md, docs/network-flows/enrichment.md, and .agents/skills/* are factually inaccurate (mostly severity 4/10, high confidence).
  • The most impactful issue is misleading setup/troubleshooting guidance in integration docs (for example iptoasn.md, aws_ip_ranges.md, and generic_json-over-http_ipam.md), which could cause incorrect configuration choices or confusion during onboarding rather than runtime breakage.
  • There is also a small portability concern in .agents/skills/codacy-audit/scripts/pr-issues.sh where bare mktemp may behave inconsistently across GNU/BSD/macOS; this is low severity and straightforward to harden.
  • Pay close attention to src/crates/netflow-plugin/integrations/iptoasn.md, src/crates/netflow-plugin/integrations/aws_ip_ranges.md, src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md, and docs/network-flows/enrichment.md - correct inaccurate behavior claims so users are not misled.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".agents/skills/integrations-lifecycle/pipeline.md">

<violation number="1" location=".agents/skills/integrations-lifecycle/pipeline.md:455">
P2: This line overstates CI behavior: `check-markdown.yml` does not enforce that committed `COLLECTORS.md` is fresh, only that regenerated docs pass ingest.</violation>
</file>

<file name=".agents/skills/codacy-audit/scripts/pr-issues.sh">

<violation number="1" location=".agents/skills/codacy-audit/scripts/pr-issues.sh:77">
P3: Use an explicit portable mktemp template instead of bare `mktemp`.

(Based on your team's feedback about portable mktemp usage across GNU/BSD/macOS.) [FEEDBACK_USED]</violation>
</file>

<file name=".agents/skills/learn-pr-preview/SKILL.md">

<violation number="1" location=".agents/skills/learn-pr-preview/SKILL.md:45">
P2: The source-copy command omits intentional untracked docs files, so the documented uncommitted-preview flow is incomplete and can produce a stale preview.</violation>
</file>

<file name="src/crates/netflow-plugin/integrations/iptoasn.md">

<violation number="1" location="src/crates/netflow-plugin/integrations/iptoasn.md:52">
P2: The new wording implies `topology-ip-intel-downloader` is always bundled, but repository docs say packaged 32-bit installs do not include it. Add a platform/installation caveat to avoid broken setup instructions.</violation>
</file>

<file name="src/crates/netflow-plugin/integrations/netbox.md">

<violation number="1" location="src/crates/netflow-plugin/integrations/netbox.md:88">
P3: The performance note claims trie lookup for runtime network-source enrichment, but current code does a linear scan over source records.</violation>

<violation number="2" location="src/crates/netflow-plugin/integrations/netbox.md:309">
P3: The troubleshooting text says HTTP errors are silently consumed, but the refresher logs them as warnings.</violation>
</file>
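
To make the netbox.md performance point concrete, here is a minimal sketch of what a linear scan over source records looks like. It is an illustration only, not the plugin's Rust code: the record layout (`prefix`, `name`), the sample data, and the function name `lookup_linear` are hypothetical, and picking the most specific matching prefix is an assumption about the desired result.

```python
import ipaddress

# Hypothetical source records; the plugin's real record type is not shown in this PR.
RECORDS = [
    {"prefix": ipaddress.ip_network("10.0.0.0/8"), "name": "corp"},
    {"prefix": ipaddress.ip_network("10.1.0.0/16"), "name": "dc-east"},
    {"prefix": ipaddress.ip_network("192.0.2.0/24"), "name": "docs"},
]


def lookup_linear(addr, records):
    """Scan every record and keep the most specific match: O(n) per lookup."""
    ip = ipaddress.ip_address(addr)
    best = None
    for rec in records:
        net = rec["prefix"]
        # Skip records of the other IP version, then test containment.
        if ip.version != net.version or ip not in net:
            continue
        if best is None or net.prefixlen > best["prefix"].prefixlen:
            best = rec
    return best
```

A trie keyed on prefix bits would return the same record while visiting at most one node per address bit, which is why the documentation should claim trie lookup only if the code actually uses one.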

<file name="src/crates/netflow-plugin/integrations/aws_ip_ranges.md">

<violation number="1" location="src/crates/netflow-plugin/integrations/aws_ip_ranges.md:231">
P2: The new docs incorrectly say `transform: "."` fails with an "empty result" error; with this plugin it typically fails because the emitted object has no required `prefix` field. Update this section to describe schema/row-mapping failure instead of empty output.</violation>
</file>

<file name="src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md">

<violation number="1" location="src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md:358">
P2: This new troubleshooting note is factually incorrect: URL-embedded credentials are handled by reqwest as HTTP Basic auth, so saying they "do not work" can mislead users.</violation>
</file>
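
For context on the URL-embedded credentials point, the sketch below shows how the userinfo part of a URL maps to an HTTP Basic `Authorization` header value. It uses only the Python standard library and is not reqwest's implementation; the function name `basic_auth_header` and the example host are hypothetical.

```python
import base64
from urllib.parse import unquote, urlsplit


def basic_auth_header(url):
    """Return the Basic auth header value implied by user:pass@ in a URL, or None."""
    parts = urlsplit(url)
    if parts.username is None:
        return None
    # Userinfo may be percent-encoded in the URL, so decode it first.
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```

Calling `basic_auth_header("https://user:pass@ipam.example.com/api")` yields a `Basic ...` value, while a URL without userinfo yields `None`.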

<file name="docs/network-flows/enrichment.md">

<violation number="1" location="docs/network-flows/enrichment.md:30">
P2: This statement is inaccurate: provider-chain-only configuration can still enable the enricher, so it does not run *only* when an enrichment input source is configured.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Router as Router/Exporter
    participant UDP as UDP Socket
    participant Parser as Protocol Decoder
    participant Enrich as Enrichment Pipeline
    participant Journal as Journal (4 Tiers)
    participant Planner as Query Tier Planner
    participant Dashboard as Network Flows UI
    participant Intel as Intel Downloader
    participant MMDB as MMDB Files
    participant Config as netflow.yaml
    participant DocGen as Doc Generator
    participant Learn as learn.netdata.cloud

    Note over Router,Learn: PR repairs documentation, persistence, and defaults to match actual behavior

    Router->>UDP: Send NetFlow / IPFIX / sFlow datagrams
    UDP->>Parser: Decode raw protocol fields
    Parser->>Parser: Process timestamp_source config

    alt timestamp_source = netflow_first_switched
        Parser->>Journal: NEW: Write decoded source timestamp as _SOURCE_REALTIME_TIMESTAMP
        Parser->>Journal: Keep entry realtime as receive/write time
    else default (input)
        Parser->>Journal: Write receive time as _SOURCE_REALTIME_TIMESTAMP
    end

    Parser->>Enrich: Decoded flow record
    Enrich->>MMDB: GeoIP / ASN lookup (auto-detect or explicit path)
    MMDB-->>Enrich: Country, city, ASN data
    Enrich->>Enrich: Apply classifier rules, static metadata, provider chains

    Note over Enrich: Order: decode → decap → GeoIP → static → network sources → classifiers → routing overlay

    Enrich->>Journal: Write enriched record to raw tier
    Journal->>Journal: Rollup to 1m / 5m / 1h tiers (drops high-cardinality fields)

    Planner->>Journal: Select tier for query
    alt Raw-only fields (IP/port/city) or FTS
        Planner->>Journal: Force raw tier
        Journal-->>Planner: Raw data (bounded by raw retention)
    else Preserved fields (country/ASN/interface)
        Planner->>Journal: Use coarser tier per window alignment
        Journal-->>Planner: Aggregated data
    end

    Dashboard->>Planner: flows / autocomplete query
    Planner-->>Dashboard: Top-N groups, facet values
    Dashboard->>Dashboard: Render Sankey, Time-Series, Maps

    Intel->>Intel: CHANGED: Fetch DB-IP / IPtoASN data
    Intel->>MMDB: Atomic rename(2) of topology-ip-asn.mmdb, .geo.mmdb
    MMDB-->>Enrich: Reloads on file change within 30s

    Config->>Config: Removed top-level retention knobs and query guardrails
    Config->>Config: Per-tier retention only (raw, minute_1, minute_5, hour_1)
    Config->>Config: Removed query_1m_max_window, query_5m_max_window, query_facet_max_values_per_field

    DocGen->>DocGen: CHANGED: Generate integration cards under Flow Protocols / Enrichment Methods
    DocGen->>DocGen: Collapse 7 enrichment pages into single Enrichment guide
    DocGen->>DocGen: CHANGED: flows category treated as Monitor Anything section
    DocGen->>Learn: Publish regenerated COLLECTORS.md, integration cards, docs
    Learn-->>Dashboard: Ingest for in-app help and search
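
The `timestamp_source` branch in the diagram above can be sketched as follows. This is a minimal Python illustration of the described behavior, not the plugin's Rust code: the names `JournalTimestamps` and `pick_timestamps` are hypothetical, and falling back to the receive time when no decoded timestamp is available is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JournalTimestamps:
    entry_realtime_usec: int   # journal entry realtime: always receive/write time
    source_realtime_usec: int  # value stored as _SOURCE_REALTIME_TIMESTAMP


def pick_timestamps(timestamp_source: str, receive_usec: int,
                    decoded_usec: Optional[int]) -> JournalTimestamps:
    """Choose the source timestamp per timestamp_source; entry realtime never changes."""
    if timestamp_source == "netflow_first_switched" and decoded_usec is not None:
        source = decoded_usec
    else:
        # Default ("input"), or assumed fallback when nothing was decoded.
        source = receive_usec
    return JournalTimestamps(entry_realtime_usec=receive_usec,
                             source_realtime_usec=source)
```

The point of the fix is visible in the return value: only `source_realtime_usec` depends on the configuration, so journal ordering by entry realtime stays tied to when the record was received and written.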

Note: This PR contains a large number of files. cubic only reviews up to 75 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

Comment thread .agents/skills/integrations-lifecycle/pipeline.md Outdated
Comment thread .agents/skills/learn-pr-preview/SKILL.md Outdated
Comment thread src/crates/netflow-plugin/integrations/iptoasn.md Outdated
Comment thread src/crates/netflow-plugin/integrations/aws_ip_ranges.md Outdated
Comment thread src/crates/netflow-plugin/integrations/generic_json-over-http_ipam.md Outdated
Comment thread docs/network-flows/enrichment.md Outdated
Comment thread .agents/skills/codacy-audit/scripts/pr-issues.sh Outdated
Comment thread src/crates/netflow-plugin/integrations/netbox.md Outdated
Comment thread src/crates/netflow-plugin/integrations/netbox.md Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 82 out of 85 changed files in this pull request and generated 2 comments.

Comment thread src/crates/netflow-plugin/src/plugin_config/types/journal.rs
Comment thread docs/network-flows/visualization/overview.md Outdated
Member Author

ktsaou commented May 7, 2026

@cubic-dev-ai please review again

ktsaou requested a review from Copilot May 7, 2026 23:33
Contributor

cubic-dev-ai Bot commented May 7, 2026

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 82 out of 85 changed files in this pull request and generated 2 comments.

Comment thread src/crates/netflow-plugin/src/ingest/rebuild.rs Outdated
Comment thread src/crates/netflow-plugin/src/ingest/rebuild.rs Outdated
Member Author

ktsaou commented May 7, 2026

@cubic-dev-ai please review again

ktsaou requested a review from Copilot May 7, 2026 23:44
Contributor

cubic-dev-ai Bot commented May 7, 2026

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.


sonarqubecloud Bot commented May 7, 2026

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 82 out of 85 changed files in this pull request and generated no new comments.

@ktsaou ktsaou merged commit 3562289 into netdata:master May 8, 2026
130 of 147 checks passed

Labels

area/collectors (Everything related to data collection), area/docs, area/metadata (Integrations metadata)
