At LSFMM 2024 we reviewed automating measuring memory fragmentation and essentially we had 0 consensus on how to visualize memory fragmentation.
This project aims at trying to expand on that with tracepoints, eBPF scripts, and matplotlib visualizations to help us get closer to something sensible.
Our analysis reveals fragmentation patterns across different XFS block sizes under parallel writeback workloads:
The visualizations demonstrate how external fragmentation events, migration patterns, and compaction success rates vary between configurations, providing critical insights for filesystem performance optimization.
We primarily aim at evaluating as a case study, does Large Block Size support on the Linux kernel create worse memory fragmentation situation for users? Uses of this visualization effort can be used for other analysis. The ultimate goal is to include this as part of kdevops for monitoring and analyzing any workflow.
This effort is brought to you by:
- Swarna Prabhu
- David Bueso
- Luis Chamberlain
You can use the 20250903-compaction-tracepoints branch with two patches which are important in current analysis:
-
mm: add simple folio migration debugfs stats
Provides a way to evaluate folio migrations over time. -
mm: add compaction success/failure tracepoints and fragmentation tracking
Provides a tracepoint at compaction success/failures with enough semantics for us to evaluate at what order compaction succeeded or failed and at what fragmentation index.
apt-get install python3-bpfccThe fragmentation_tracker.py is an eBPF script which tracks all tracepoints with order, fragmentation index and page mobility. Since the compaction tracepoints are out of tree, they are optional. The script will ignore those tracepoints if your kernel lacks them.
Just run:
make # Generate all visualizations (fast mode - samples to 50K events)
make test # Verify setup is working correctlyThe default make will:
- Extract data from
pw-data-v1.tar.gz(if needed) - Generate all visualizations using fast sampling (50K events max per file)
- Complete in under 2 minutes using all CPU cores in parallel
For specific analyses:
make pw-analysis # Run only parallel writeback analysis (also parallel)
make simple # Run only simple demo analysis
make compare # Run only A/B comparison demoNote: All targets now run in parallel by default using all CPU cores for maximum speed. The data is automatically extracted from the compressed archive on first run to stay within GitHub's file size limits.
Performance Note: The pw-data-v1 files contain ~500K events each. The fast visualizer automatically samples large files to 50K events for quick processing while preserving statistical significance. Full analysis completes in under 2 minutes.
The parallel writeback evaluation tests memory fragmentation patterns across different XFS configurations with varying block sizes. This analysis helps determine whether larger block sizes lead to worse fragmentation under parallel writeback workloads.
The pw-data-v1/ directory contains fragmentation data from the following XFS configurations:
- XFS 4K: Baseline 4KB block size configuration
- XFS 8K-dev: 8KB blocks on development kernel
- XFS 16K-dev: 16KB blocks on development kernel
- XFS 32K: 32KB blocks on stable kernel
- XFS 32K-dev: 32KB blocks on development kernel
Running make pw-analysis generates:
-
Individual Configuration Analysis (
output/*_single.png)- Compaction events timeline
- External fragmentation (extfrag) event distribution
- Migration pattern heatmap with severity indicators
- Per-configuration statistics summary
-
A/B Comparisons (
output/pw2-*.png)- 4K vs 16K block size comparison
- 4K vs 32K block size comparison
- 16K vs 32K block size comparison
- Development kernel comparisons
- Statistical comparison tables
- External Fragmentation Events: Page steals vs claims from different migrate types
- Migration Patterns: Tracking migrations between MOVABLE, UNMOVABLE, and RECLAIMABLE zones
- Compaction Success Rate: Percentage of successful vs failed compaction attempts
- Fragmentation Severity: Color-coded indicators showing impact of different migration patterns
The visualizations use the following color coding:
- π΄ Red: Bad migrations (e.g., MOVABLE β UNMOVABLE)
- π’ Green: Good migrations (e.g., UNMOVABLE β MOVABLE)
- π‘ Yellow: Neutral migrations
- π΅ Blue/Orange: Dataset A vs B in comparison mode
Migration severity indicators:
!!: Very bad migration pattern (severity β€ -2)!: Bad migration pattern (severity β€ -1)+: Good migration pattern (severity β₯ 1)=: Neutral migration pattern
All visualization outputs are saved to the output/ directory:
output/
βββ pw2-xfs-reflink-4k_single.png # Individual 4K analysis
βββ pw2-xfs-reflink-16k-4ks-dev_single.png # Individual 16K analysis
βββ pw2-xfs-reflink-32k-4ks_single.png # Individual 32K analysis
βββ pw2-xfs-reflink-4k-vs-16k.png # 4K vs 16K comparison
βββ pw2-xfs-reflink-4k-vs-32k.png # 4K vs 32K comparison
βββ pw2-xfs-reflink-16k-vs-32k.png # 16K vs 32K comparison
βββ pw2-all-configs-comparison.png # All configurations overviewTwo extra tracepoints are added so we can get enough context for at what order and at what fragmentation index did compaction succeed or fail over time. These are not upstream, and so the
π Refer to the kernel external fragmentation documentation.
Tracking /proc/pagetypeinfo can be insufficient as it is ultimately limited by not having information on mixed pageblocks. The real trouble for fragmentation begins when there are no pages of the required migrate type to satisfy an allocation. The missing pages are taken/stolen from another migrate type instead, producing pageblock with mixed migrate types.
This is where this tracepoint helps - it provides the most insight on system fragmentation. This can occur if either the allocator fallback order is smaller than a pageblock order (order-9 on x86-64) or through compaction capturing, and it is considered an event that will cause external fragmentation issues in the future. This can ultimately impede compaction from defragmenting and creating opportunities for large pages/folio, just with a single unmovable allocation in a movable pageblock. Among other factors, when falling back to another migratetype during allocation, this allocation type will depend on how stealing semantics work, as to whether or not a pageblock is polluted or entirely claimed to satisfy new allocations. This is important because it goes back to again the movable vs unmovable semantics. Since unmovable/reclaimable allocations would cause permanent fragmentation if they fell back to allocating from a movable block (polluting it), such cases will claim the whole block regardless of the allocation size. Later movable allocations can steal from this block, which is less problematic. The kernel heuristics tries as much as possible to avoid fragmentation and mixing pageblocks by stealing the largest possible pages such that the pageblock's migratetype can be simply changed and avoid fragmentation altogether. Further, when a page is freed, it will return to a free list with migrate type of its pageblock.
Linux attempts to mitigate against these mm_page_alloc_extfrag events by temporarily boosting reclaim for the respective zone watermark when these events occur via the watermark_boost_factor tunnable.
Once
reclaim
has finished, kcompacd is awoken to defragment the pageblocks. Note that this
boosted reclaim is simple such to avoid disrupting the system, so things like
writeback and swapping and are avoided by kswapd. If boosting has occurred, it
will be reflected in the boost field in
/proc/zoneinfo.
Another important and most recent option is the
defrag_mode,
which pushes for reclaim/compaction to be done in full extent before incurring
in mixed pageblock fallback events. Another way to mitigate against such
occurrences is by the buddy allocator trying to get pages from a lower NUMA
local zone (ZONE_DMA32 specifically) before fragmenting a higher zone (Normal).
This makes a very big difference. The reason it's only ZONE_DMA32 and no other
lower zone type is a balance between avoiding premature low memory pressure and
size of the zone.







