Commit a8ada80

fix(llms.txt): support sections and notes
1 parent eba1ed4 commit a8ada80

4 files changed: +394 -70 lines changed

README.md

Lines changed: 10 additions & 65 deletions
@@ -22,7 +22,7 @@
 
 - 🧠 Custom built HTML to Markdown Convertor Optimized for LLMs (~50% fewer tokens)
 - 🔍 Generates [Minimal](./packages/mdream/src/preset/minimal.ts) GitHub Flavored Markdown: Frontmatter, Nested & HTML markup support.
-- ✂️ LangChain compatible [Markdown Text Splitter](#text-splitter) for single-pass chunking.
+- ✂️ LangChain compatible [Markdown Text Splitter](./packages/mdream/README.md#markdown-splitting) for single-pass chunking.
 - 🚀 Ultra Fast: Stream 1.4MB of HTML to markdown in ~50ms.
 - ⚡ Tiny: 6kB gzip, zero dependency core.
 - ⚙️ Run anywhere: [CLI Crawler](#mdream-crawl), [Docker](#docker-usage), [GitHub Actions](#github-actions-integration), [Vite](#vite-integration), & more.
@@ -329,71 +329,16 @@ const markdown = htmlToMarkdown('<h1>Hello World</h1>')
 console.log(markdown) // # Hello World
 ```
 
-See the [Mdream Package README](./packages/mdream/README.md) for complete documentation on API usage, streaming, presets, and the plugin system.
+**Core Functions:**
+- [htmlToMarkdown](./packages/mdream/README.md#api-usage) - Convert HTML to Markdown
+- [streamHtmlToMarkdown](./packages/mdream/README.md#api-usage) - Stream HTML to Markdown
+- [parseHtml](./packages/mdream/README.md#api-usage) - Parse HTML to AST
 
-## Text Splitter
-
-Mdream includes a [LangChain](https://python.langchain.com/api_reference/text_splitters/markdown/langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter.html) compatible Markdown splitter that runs efficiently in single pass.
-
-This provides significant performance improvements over traditional multi-pass splitters and allows
-you to integrate with your custom Mdream plugins.
-
-```ts
-import { htmlToMarkdownSplitChunks } from 'mdream/splitter'
-
-const chunks = await htmlToMarkdownSplitChunks('<h1>Hello World</h1><p>This is a paragraph.</p>', {
-  chunkSize: 1000,
-  chunkOverlap: 200,
-})
-console.log(chunks) // Array of text chunks
-```
-
-See the [Text Splitter Documentation](./packages/mdream/docs/splitter.md) for complete usage and configuration.
-
-## Streaming llms.txt Generation
-
-Generate `llms.txt` and `llms-full.txt` files by streaming pages to disk without keeping full content in memory. Ideal for programmatic generation from crawlers or build systems.
-
-```ts
-import { createLlmsTxtStream } from 'mdream/llms-txt'
-
-const stream = createLlmsTxtStream({
-  siteName: 'My Docs',
-  description: 'Documentation site',
-  origin: 'https://example.com',
-  generateFull: true,
-  outputDir: './dist',
-})
-
-const writer = stream.getWriter()
-
-// Stream pages as they're processed
-await writer.write({
-  title: 'Home',
-  content: '# Welcome\n\nHome page content.',
-  url: '/',
-  metadata: { description: 'Welcome page' },
-})
-
-await writer.write({
-  title: 'About',
-  content: '# About\n\nAbout page content.',
-  url: '/about',
-})
-
-await writer.close()
-```
-
-**Options:**
-- `siteName` - Site name for header (default: 'Site')
-- `description` - Site description for header
-- `origin` - Base URL to prepend to relative URLs
-- `generateFull` - Generate llms-full.txt with complete content (default: false)
-- `outputDir` - Directory to write files (default: process.cwd())
-
-**Output:**
-- `llms.txt` - List of pages with titles and descriptions
-- `llms-full.txt` - Complete page content with frontmatter (if `generateFull: true`)
+**Utilities:**
+- [Presets](./packages/mdream/README.md#presets) - Pre-configured plugin combinations
+- [Plugin System](./packages/mdream/README.md#plugin-system) - Customize conversion behavior
+- [Markdown Splitting](./packages/mdream/README.md#markdown-splitting) - Split HTML into chunks
+- [llms.txt Generation](./packages/mdream/README.md#llmstxt-generation) - Generate llms.txt artifacts
 
 ## Credits
 
packages/mdream/README.md

Lines changed: 96 additions & 0 deletions
@@ -422,6 +422,102 @@ const chunks = htmlToMarkdownSplitChunks(html, withMinimalPreset({
 }))
 ```
 
+## llms.txt Generation
+
+Generate [llms.txt](https://llmstxt.org) files from HTML content for improved LLM discoverability. Mdream provides both streaming and batch APIs for creating llms.txt artifacts.
+
+### createLlmsTxtStream
+
+Stream llms.txt generation without keeping full content in memory:
+
+```ts
+import { createLlmsTxtStream } from 'mdream'
+
+const stream = createLlmsTxtStream({
+  siteName: 'My Docs',
+  description: 'Documentation site',
+  origin: 'https://example.com',
+  outputDir: './dist',
+  generateFull: true, // Also generate llms-full.txt
+  sections: [
+    {
+      title: 'Getting Started',
+      description: 'Quick start guide',
+      links: [
+        { title: 'Installation', href: '/install', description: 'How to install' },
+        { title: 'Quick Start', href: '/quickstart' },
+      ],
+    },
+  ],
+  notes: ['Generated by mdream', 'Last updated: 2024'],
+})
+
+const writer = stream.getWriter()
+await writer.write({
+  title: 'Home',
+  content: '# Welcome\n\nHome page content.',
+  url: '/',
+  metadata: {
+    description: 'Welcome page',
+  },
+})
+await writer.close()
+```
+
+This creates:
+- `llms.txt` - Links to all pages with metadata
+- `llms-full.txt` - Complete content with frontmatter (if `generateFull: true`)
+
+### generateLlmsTxtArtifacts
+
+Process HTML files or ProcessedFile objects:
+
+```ts
+import { generateLlmsTxtArtifacts } from 'mdream'
+
+const result = await generateLlmsTxtArtifacts({
+  patterns: '**/*.html', // Glob pattern for HTML files
+  siteName: 'My Site',
+  origin: 'https://example.com',
+  generateFull: true,
+  sections: [
+    {
+      title: 'Resources',
+      links: [
+        { title: 'Docs', href: '/docs' },
+      ],
+    },
+  ],
+  notes: 'Footer notes',
+})
+
+console.log(result.llmsTxt) // llms.txt content
+console.log(result.llmsFullTxt) // llms-full.txt content
+console.log(result.processedFiles) // Array of processed files
+```
+
+### Structure
+
+llms.txt follows this structure:
+
+```markdown
+# Site Name
+
+> Site description
+
+## Custom Section
+
+Section description
+
+- [Link Title](url): Optional description
+
+## Pages
+
+- [Page Title](url): Page description
+
+Custom notes
+```
+
 ## Credits
 
 - [ultrahtml](https://github.com/natemoo-re/ultrahtml): HTML parsing inspiration
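
The llms.txt layout documented under "Structure" in the added README section can be illustrated with a small standalone renderer. This is a minimal sketch, not mdream's implementation: `renderLlmsTxt`, `formatLink`, and the `Llms*` interfaces are hypothetical names, and the assumption that `notes` accepts either a string or an array of strings is inferred only from the two README examples (`notes: [...]` and `notes: 'Footer notes'`).

```ts
// Hypothetical types mirroring the option names used in the README examples.
// They are NOT mdream's exported types.
interface LlmsLink { title: string, href: string, description?: string }
interface LlmsSection { title: string, description?: string, links?: LlmsLink[] }
interface LlmsPage { title: string, url: string, description?: string }
interface LlmsTxtInput {
  siteName: string
  description?: string
  sections?: LlmsSection[]
  pages: LlmsPage[]
  notes?: string | string[] // assumption: both forms appear in the README
}

// One list item: "- [Title](url)" with an optional ": description" suffix.
function formatLink(title: string, url: string, description?: string): string {
  const link = `- [${title}](${url})`
  return description ? `${link}: ${description}` : link
}

// Build the document as blocks separated by blank lines, in the order
// shown in the README: header, description, custom sections, pages, notes.
function renderLlmsTxt(input: LlmsTxtInput): string {
  const blocks: string[] = [`# ${input.siteName}`]
  if (input.description)
    blocks.push(`> ${input.description}`)

  for (const section of input.sections ?? []) {
    blocks.push(`## ${section.title}`)
    if (section.description)
      blocks.push(section.description)
    const links = section.links ?? []
    if (links.length > 0)
      blocks.push(links.map(l => formatLink(l.title, l.href, l.description)).join('\n'))
  }

  if (input.pages.length > 0) {
    blocks.push('## Pages')
    blocks.push(input.pages.map(p => formatLink(p.title, p.url, p.description)).join('\n'))
  }

  // Normalize notes to an array; each note becomes its own trailing paragraph.
  const notes = input.notes === undefined
    ? []
    : Array.isArray(input.notes) ? input.notes : [input.notes]
  blocks.push(...notes)

  return `${blocks.join('\n\n')}\n`
}
```

Calling `renderLlmsTxt` with one section, one page, and a single note yields the same block order as the "Structure" example. How the real library resolves relative URLs against `origin` or where exactly it places notes is not covered by this sketch.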
