breadchunks

Heading-aware, token-budgeted semantic chunker for Markdown.

Given a Markdown document, breadchunks splits it by heading hierarchy and merges/splits chunks to stay within a character budget. Designed for RAG pipelines and embedding workflows where section context matters.

Supported platforms

Prebuilt binaries: macOS arm64/x64, Linux glibc arm64/x64, Windows x64. Alpine/musl and Windows arm64 users must build from source via napi build.

How it works

Three-phase pipeline:

Phase 1 — Split: Split at header boundaries. Every paragraph becomes its own chunk, tagged with its full heading breadcrumb (H1 > H2 > H3). Code blocks are protected — # comment inside a fenced block is never treated as a Markdown heading.
Phase 2 — Merge same-breadcrumb: Merge adjacent chunks that share the same breadcrumb and are below minLength.
Phase 3 — Parent absorption (bottom-up, h6→h1): Absorb small child sections into their parent header when the combined size stays under maxLength.

Supported Markdown: ATX headers only (# H1 through ###### H6). Setext headers (====/---- underlines) are not recognized. Backtick-fenced code blocks (```) and inline code (`) are protected. Tilde fences (~~~) and 4-space-indented code are not — # inside them is treated as a header. Switch to backtick fences if your document uses tildes.

Chunk length is a character count after collapsing all whitespace runs to a single space. The same logic applies when computing the length of breadcrumb + "\n\n" + text (the full string an embedding model sees).

Rust

[dependencies]
breadchunks = "0.1"

use breadchunks::{chunk, ChunkOptions};

let markdown = "# Introduction\n\nHello world.\n\n## Details\n\nMore info.";
let chunks = chunk(markdown, Some(ChunkOptions {
    min_length: Some(400),
    max_length: Some(2000),
    ..Default::default()
}));

for c in &chunks {
    println!("[{}] {}", c.breadcrumb, &c.text[..c.text.len().min(80)]);
}

`ChunkOptions`

Field	Type	Default	Description
`min_length`	`Option<u32>`	`512`	Target minimum chunk size (chars)
`max_length`	`Option<u32>`	`3072`	Hard maximum chunk size (chars)
`phase`	`Option<u32>`	`3`	Stop after this phase (1, 2, or 3)
`title`	`Option<String>`	`None`	Document title — prepended to every breadcrumb

`Chunk`

Field	Type	Description
`level`	`u32`	Heading depth (0 = preface, 1–6 = h1–h6)
`header`	`Option<String>`	Text of the nearest heading
`headers`	`Vec<Option<String>>`	Full 6-slot heading stack (h1–h6)
`breadcrumb`	`String`	Human-readable path: `"H1 > H2 > H3"`
`text`	`String`	Chunk body (without the heading line or breadcrumb). To get the full string an embedding model sees, prepend `breadcrumb + "\n\n"` when `breadcrumb` is non-empty.
`length`	`usize`	Character count of `breadcrumb + "\n\n" + text` after whitespace collapse. `text` alone is shorter; callers must prepend `breadcrumb` to reproduce this measurement.

`default_length_counter`

Collapses all whitespace runs to a single space, trims, then counts Unicode characters (not bytes). This is what populates chunk.length. Use it for consistent counts when building the string you send to an embedding model (breadcrumb + "\n\n" + text). Export it for consistent counts elsewhere:

use breadchunks::default_length_counter;
let n = default_length_counter("hello  world"); // 11

Node (N-API)

npm install breadchunks

import { chunk } from 'breadchunks'

// Preferred: async batch with Buffers (runs on libuv threadpool)
const [chunks] = await chunk([Buffer.from(markdown)], { minLength: 400, maxLength: 2000 })
for (const c of chunks) {
  console.log(`[${c.breadcrumb}]`, c.text.slice(0, 80))
}

// Process multiple documents in one call
const results = await chunk([docA, docB, docC])
// results[0] → Chunk[] for docA, results[1] → Chunk[] for docB, …

TypeScript types are included (index.d.ts). Options and return shape mirror the Rust API.

chunk accepts a batch of Buffer | string inputs. Buffer is preferred — it avoids a round-trip UTF-8 re-encode from the JS string heap. Pass string when you already have one.

Node `ChunkOptions`

Field	Type	Default	Description
`minLength`	`number?`	`512`	Target minimum chunk size (chars)
`maxLength`	`number?`	`3072`	Hard maximum chunk size (chars)
`phase`	`number?`	`3`	Stop after this phase (1, 2, or 3)
`title`	`string?`	`undefined`	Document title — prepended to every breadcrumb

Development

# Rust crate tests + coverage report
cd crate
cargo test
cargo llvm-cov --html   # opens target/llvm-cov/html/index.html

# Node package (requires Rust toolchain + Node 18+)
cd package
npm install
npm run build:debug
npm test

Releasing

Releases are managed via the Release workflow. Trigger it manually from GitHub Actions with a patch, minor, or major bump. The workflow:

Bumps crate/Cargo.toml and package/package.json versions, commits, and tags.
Cross-compiles the native module for all supported platforms.
Publishes the Rust crate to crates.io and the Node package to npm.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github		.github
crate		crate
fixtures		fixtures
package		package
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
codecov.yml		codecov.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

breadchunks

Supported platforms

How it works

Rust

`ChunkOptions`

`Chunk`

`default_length_counter`

Node (N-API)

Node `ChunkOptions`

Development

Releasing

License

About

Uh oh!

Releases 7

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

breadchunks

Supported platforms

How it works

Rust

ChunkOptions

Chunk

default_length_counter

Node (N-API)

Node ChunkOptions

Development

Releasing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`ChunkOptions`

`Chunk`

`default_length_counter`

Node `ChunkOptions`

Packages