Commit b5a41f9

fix!: syncHtmlToMarkdown -> htmlToMarkdown
1 parent 42b76bf commit b5a41f9

46 files changed, +291 -262 lines changed

CLAUDE.md

Lines changed: 42 additions & 13 deletions
@@ -36,8 +36,10 @@ When you finish a task, always run `pnpm typecheck` to ensure that the code is t
 - Test all: `pnpm test`
 - Test single file: `pnpm test path/to/test.ts`
 - Test with pattern: `pnpm test -t "test pattern"`
-- Test GitHub markdown: `pnpm test:github`
-- Development: `pnpm dev:prepare`
+- Test folder: `pnpm test test/unit/plugins/`
+- Development build (stub): `pnpm dev:prepare`
+- Live test with real sites: `pnpm test:github:live`, `pnpm test:wiki:file`
+- Benchmarking: `pnpm bench:stream`, `pnpm bench:string`

 ## Code Style Guidelines
 - Indentation: 2 spaces
@@ -52,11 +54,25 @@ When you finish a task, always run `pnpm typecheck` to ensure that the code is t
 - Follow ESLint config based on @antfu/eslint-config

 ## Project Architecture
-- Core modules:
-  - `parser.ts`: Handles HTML parsing into a DOM-like structure
-  - `markdown.ts`: Transforms DOM nodes to Markdown
-  - `htmlStreamAdapter.ts`: Manages HTML streaming conversion
-  - `index.ts`: Main entry point with primary API functions
+
+### Core Architecture
+- `src/index.ts`: Main entry point with `syncHtmlToMarkdown` and `streamHtmlToMarkdown` APIs
+- `src/parser.ts`: Manual HTML parsing into DOM-like structure for performance
+- `src/markdown.ts`: DOM node to Markdown transformation logic
+- `src/stream.ts`: Streaming HTML processing with content-based buffering
+- `src/types.ts`: Core TypeScript interfaces for nodes, plugins, and state management
+
+### Plugin System
+- `src/pluggable/plugin.ts`: Plugin creation utilities and base interfaces
+- `src/plugins/`: Built-in plugins (filter, extraction, tailwind, readability, etc.)
+- `src/libs/query-selector.ts`: CSS selector parsing logic shared across plugins
+- Plugin hooks: `beforeNodeProcess`, `onNodeEnter`, `onNodeExit`, `processTextNode`
+
+### Key Concepts
+- **Node Types**: ElementNode (HTML elements) and TextNode (text content) with parent/child relationships
+- **Streaming Architecture**: Processes HTML incrementally using buffer regions and optimal chunk boundaries
+- **Plugin Pipeline**: Each plugin can intercept and transform content at different processing stages
+- **Memory Efficiency**: Immediate processing and callback patterns to avoid collecting large data structures

 ## Technical Details
 - Parser: Manual HTML parsing for performance, doesn't use browser DOM
@@ -75,12 +91,25 @@ When you finish a task, always run `pnpm typecheck` to ensure that the code is t
 - `--chunk-size <size>`: Controls stream chunking (default: 4096)
 - `-v, --verbose`: Enables debug logging

-Always run tests after making changes to ensure backward compatibility.
+## CLI and Testing
+
+### CLI Usage
+- Processes HTML from stdin, outputs Markdown to stdout
+- Test with live sites: `curl -s https://example.com | node ./bin/mdream.mjs --origin https://example.com`
+- Key CLI options: `--origin <url>`, `-v/--verbose`, `--chunk-size <size>`

-## Docs
+### Testing Strategy
+- Comprehensive test coverage in `test/unit/` and `test/integration/`
+- Plugin tests in `test/unit/plugins/` - always add tests for new plugins
+- Real-world test fixtures in `test/fixtures/` (GitHub, Wikipedia HTML)
+- Template tests for complex HTML structures (navigation, tables, etc.)
+- Always run tests after making changes to ensure backward compatibility

-Please reference the following docs:
+## Plugin Development

-- @docs/plugin-api.md
-- @docs/plugins.md
-- @docs/plugin-api.md
+When creating new plugins:
+1. Use CSS selectors from `src/libs/query-selector.ts` for element matching
+2. Implement memory-efficient patterns (immediate callbacks vs. collecting data)
+3. Add comprehensive tests covering edge cases and real-world scenarios
+4. Follow existing plugin patterns in `src/plugins/` directory
+5. Export from `src/plugins.ts` for public API access
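
The hook names listed under Plugin System (`beforeNodeProcess`, `onNodeEnter`, `onNodeExit`, `processTextNode`) suggest a visitor-style pipeline. The following is a minimal, self-contained sketch of that idea and not mdream's implementation: the `SketchNode`, `SketchPlugin`, `walk`, and `render` names, and every hook signature shown, are assumptions made for illustration only.

```typescript
// Hypothetical node shapes; mdream's real ElementNode/TextNode types may differ.
interface SketchText { type: 'text', value: string }
interface SketchElement { type: 'element', name: string, children: SketchNode[] }
type SketchNode = SketchText | SketchElement

// Hypothetical hook signatures, named after the hooks listed in CLAUDE.md.
interface SketchPlugin {
  beforeNodeProcess?: (node: SketchNode) => boolean // returning false skips the node
  onNodeEnter?: (node: SketchElement) => string | undefined
  onNodeExit?: (node: SketchElement) => string | undefined
  processTextNode?: (node: SketchText) => string | undefined
}

// Depth-first walk that appends output as soon as each hook produces it.
function walk(node: SketchNode, plugins: SketchPlugin[], out: string[]): void {
  for (const p of plugins) {
    if (p.beforeNodeProcess?.(node) === false)
      return
  }
  if (node.type === 'text') {
    let text = node.value
    for (const p of plugins) {
      const replaced = p.processTextNode?.({ type: 'text', value: text })
      if (replaced !== undefined)
        text = replaced
    }
    out.push(text)
    return
  }
  for (const p of plugins) {
    const s = p.onNodeEnter?.(node)
    if (s !== undefined)
      out.push(s)
  }
  for (const child of node.children)
    walk(child, plugins, out)
  for (const p of plugins) {
    const s = p.onNodeExit?.(node)
    if (s !== undefined)
      out.push(s)
  }
}

function render(root: SketchNode, plugins: SketchPlugin[]): string {
  const out: string[] = []
  walk(root, plugins, out)
  return out.join('')
}

// Two toy plugins: one filters <script>, one wraps <strong> in Markdown bold.
const dropScripts: SketchPlugin = {
  beforeNodeProcess: node => !(node.type === 'element' && node.name === 'script'),
}
const bold: SketchPlugin = {
  onNodeEnter: node => (node.name === 'strong' ? '**' : undefined),
  onNodeExit: node => (node.name === 'strong' ? '**' : undefined),
}

const tree: SketchNode = {
  type: 'element',
  name: 'p',
  children: [
    { type: 'text', value: 'Hi ' },
    { type: 'element', name: 'strong', children: [{ type: 'text', value: 'there' }] },
    { type: 'element', name: 'script', children: [{ type: 'text', value: 'x()' }] },
  ],
}
console.log(render(tree, [dropScripts, bold]))
```

Each hook appends to `out` immediately rather than building an intermediate tree, which is one way to read the "immediate callbacks vs. collecting data" guidance above.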

README.md

Lines changed: 7 additions & 7 deletions
@@ -99,16 +99,16 @@ pnpm add mdream
 ### Usage

 Mdream provides two utils for working with HTML, both will process content as a stream.
-- `syncHtmlToMarkdown`: Useful if you already have the entire HTML payload you want to convert.
+- `htmlToMarkdown`: Useful if you already have the entire HTML payload you want to convert.
 - `streamHtmlToMarkdown`: Best practice if you are fetching or reading from a local file.

 **Convert existing HTML**

 ```ts
-import { syncHtmlToMarkdown } from 'mdream'
+import { htmlToMarkdown } from 'mdream'

 // Simple conversion
-const markdown = syncHtmlToMarkdown('<h1>Hello World</h1>')
+const markdown = htmlToMarkdown('<h1>Hello World</h1>')
 console.log(markdown) // # Hello World
 ````

@@ -138,7 +138,7 @@ for await (const chunk of markdownGenerator) {
 Mdream now features a powerful plugin system that allows you to customize and extend the HTML-to-Markdown conversion process.

 ```ts
-import { createPlugin, filterUnsupportedTags, syncHtmlToMarkdown, withTailwind } from 'mdream'
+import { createPlugin, filterUnsupportedTags, htmlToMarkdown, withTailwind } from 'mdream'

 // Create a custom plugin
 const myPlugin = createPlugin({
@@ -153,7 +153,7 @@ const myPlugin = createPlugin({

 // Use multiple plugins together
 const html = '<div role="alert" class="font-bold">Important message</div>'
-const markdown = syncHtmlToMarkdown(html, {
+const markdown = htmlToMarkdown(html, {
   plugins: [
     withTailwind(), // Apply Tailwind class processing
     filterUnsupportedTags(), // Filter out unsupported tags
@@ -169,7 +169,7 @@ console.log(markdown) // "⚠️ **Important message** ⚠️"
 Extract specific elements and their content during HTML processing for data analysis or content discovery:

 ```ts
-import { extractionPlugin, syncHtmlToMarkdown } from 'mdream'
+import { extractionPlugin, htmlToMarkdown } from 'mdream'

 const html = `
   <article>
@@ -190,7 +190,7 @@ const plugin = extractionPlugin({
   }
 })

-syncHtmlToMarkdown(html, { plugins: [plugin] })
+htmlToMarkdown(html, { plugins: [plugin] })
 ```

 The extraction plugin provides memory-efficient element extraction with full text content and attributes, perfect for SEO analysis, content discovery, and data mining.

bench/bundle/src/await-fetch.ts

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 import { writeFile } from 'node:fs/promises'
 import { resolve } from 'node:path'
-import { syncHtmlToMarkdown } from '../../../src'
+import { htmlToMarkdown } from '../../../src'

 async function run() {
   // read times to run it from command line argument
@@ -12,7 +12,7 @@ async function run() {
     // create a read stream for ../elon.html
     // create a read stream for ../elon.html
     const html = await response.text()
-    await writeFile(resolve(import.meta.dirname, '../dist/wiki.md'), await syncHtmlToMarkdown(html), { encoding: 'utf-8' })
+    await writeFile(resolve(import.meta.dirname, '../dist/wiki.md'), await htmlToMarkdown(html), { encoding: 'utf-8' })
   }
   const end = performance.now()
   const duration = end - start

bench/bundle/src/await.ts

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 import { readFile, writeFile } from 'node:fs/promises'
 import { resolve } from 'node:path'
-import { syncHtmlToMarkdown } from '../../../src'
+import { htmlToMarkdown } from '../../../src'

 async function run() {
   // read times to run it from command line argument
@@ -11,7 +11,7 @@ async function run() {
   const html = await readFile(resolve(import.meta.dirname, '../wiki.html'), { encoding: 'utf-8' })
   logMemoryUsage('before await creation')
   const start = performance.now()
-  const converted = syncHtmlToMarkdown(html)
+  const converted = htmlToMarkdown(html)
   const end = performance.now()
   const duration = end - start
   // eslint-disable-next-line no-console

bench/bundle/src/minimal.ts

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-import { syncHtmlToMarkdown } from '../../../src'
+import { htmlToMarkdown } from '../../../src'

 function run() {
   // Full usage with all core features
@@ -17,7 +17,7 @@ function run() {
     <img src="image.jpg" alt="Image description">
     <p>Another paragraph.</p>
   `
-  const markdown = syncHtmlToMarkdown(html)
+  const markdown = htmlToMarkdown(html)

   process.stdout.write(markdown)
 }

bench/bundle/src/string.ts

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-import { syncHtmlToMarkdown } from '../../../src'
+import { htmlToMarkdown } from '../../../src'

 const html = `<!DOCTYPE html>
 <!-- saved from url=(0039)https://en.wikipedia.org/wiki/Elon_Musk -->
@@ -3527,7 +3527,7 @@ function run() {
   // extend the timings
   for (let i = 0; i < times; i++) {
     // eslint-disable-next-line no-console
-    console.log((syncHtmlToMarkdown(html)).length)
+    console.log((htmlToMarkdown(html)).length)
   }
   const end = performance.now()
   const duration = end - start

scripts/crawl.ts

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 import { existsSync, mkdirSync } from 'node:fs'
 import { writeFile } from 'node:fs/promises'
 import { PlaywrightCrawler } from 'crawlee'
-import { syncHtmlToMarkdown } from '../src/index'
+import { htmlToMarkdown } from '../src/index'
 import { withMinimalPreset } from '../src/preset/minimal'

 // CheerioCrawler crawls the web using HTTP requests
@@ -13,7 +13,7 @@ const crawler = new PlaywrightCrawler({
     const html = await page.innerHTML('html')
     log.info('HTML length', { url: request.loadedUrl, length: html.length })
     const now = new Date()
-    const md = syncHtmlToMarkdown(html, withMinimalPreset({
+    const md = htmlToMarkdown(html, withMinimalPreset({
       origin: new URL(request.loadedUrl).origin,
     }))
     log.info('Processed html -> md in', { url: request.loadedUrl, time: new Date() - now })

src/index.ts

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 import type { HTMLToMarkdownOptions, MdreamRuntimeState } from './types'
 import { processPartialHTMLToMarkdown } from './parser'

-export function syncHtmlToMarkdown(
+export function htmlToMarkdown(
   html: string,
   options: HTMLToMarkdownOptions = {},
 ): string {
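
Because this hunk renames the export without keeping the old name (hence the `fix!` breaking-change marker), existing callers of `syncHtmlToMarkdown` need to update their imports. A downstream project that cannot migrate every call site at once might bridge the gap with a local deprecated alias. The sketch below is one possible approach, not something this commit provides: `deprecatedAlias` is a hypothetical helper, and `convert` is a stand-in for the real converter so the example runs on its own.

```typescript
// Hypothetical helper: exposes `fn` under an old name and warns once on first use.
function deprecatedAlias<A extends unknown[], R>(
  oldName: string,
  newName: string,
  fn: (...args: A) => R,
  warn: (message: string) => void = console.warn,
): (...args: A) => R {
  let warned = false
  return (...args: A): R => {
    if (!warned) {
      warned = true
      warn(`${oldName} is deprecated, use ${newName} instead`)
    }
    return fn(...args)
  }
}

// Stand-in converter (assumption); in a real project this would be
// `htmlToMarkdown` imported from 'mdream'.
function convert(html: string): string {
  return html.replace(/<h1>(.*?)<\/h1>/g, '# $1')
}

// Collect warnings in an array so the behavior is observable.
const warnings: string[] = []
const syncHtmlToMarkdown = deprecatedAlias(
  'syncHtmlToMarkdown',
  'htmlToMarkdown',
  convert,
  message => warnings.push(message),
)

console.log(syncHtmlToMarkdown('<h1>Hello World</h1>'))
console.log(warnings.length)
```

Warning only once keeps logs readable when the old name is called in a loop, as in the benchmark scripts touched by this commit.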

test/unit/malformed-html.test.ts

Lines changed: 4 additions & 4 deletions
@@ -1,27 +1,27 @@
 import { describe, it } from 'vitest'
-import { syncHtmlToMarkdown } from '../../src/index.js'
+import { htmlToMarkdown } from '../../src/index.js'

 describe.skip('malformed html', () => {
   it('correctly tracks element depth in nested structures', () => {
   it('handles incorrectly nested tags that overlap', () => {
     const html = '<p><strong>Bold text <em>Bold and italic</strong> just italic</em></p>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)

     // The parser should maintain emphasis even though tags are improperly nested
     expect(markdown).toContain('**Bold text *Bold and italic** just italic*')
   })

   it('recovers from malformed attributes in tags', () => {
     const html = '<a href="https://example.com" title="missing quote>Link text</a>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)

     // The parser should still create a link despite the malformed attribute
     expect(markdown).toContain('[Link text](https://example.com)')
   })

   it('handles broken HTML comments appropriately', () => {
     const html = '<!-- This comment is not closed <p>This paragraph should be visible</p>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)

     // The parser should still process content after a broken comment
     expect(markdown).toContain('This paragraph should be visible')

test/unit/nodes/blockquote.test.ts

Lines changed: 6 additions & 6 deletions
@@ -1,35 +1,35 @@
 import { describe, expect, it } from 'vitest'
-import { syncHtmlToMarkdown } from '../../../src/index.js'
+import { htmlToMarkdown } from '../../../src/index.js'

 describe('blockquotes', () => {
   it('converts blockquotes', () => {
     const html = '<blockquote>This is a quote</blockquote>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)
     expect(markdown).toBe('> This is a quote')
   })

   it('handles nested blockquotes', () => {
     const html = '<blockquote>Outer quote<blockquote>Inner quote</blockquote></blockquote>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)
     expect(markdown).toBe('> Outer quote\n> > Inner quote')
   })

   it.skip('handles blockquotes with paragraphs', () => {
     const html = '<blockquote><p>First paragraph</p><p>Second paragraph</p></blockquote>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)
     expect(markdown).toBe('> First paragraph\n> Second paragraph')
   })

   it('handles complex nested blockquotes', () => {
     const html = '<blockquote><p>Outer paragraph</p><blockquote><p>Inner paragraph</p></blockquote></blockquote>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)

     expect(markdown).toBe('> Outer paragraph\n> > Inner paragraph')
   })
   // test for > A quote with an ![image](image.jpg) inside.
   it('handles blockquotes with images', () => {
     const html = '<blockquote>This is a quote with an <img src="image.jpg" alt="image"></blockquote>'
-    const markdown = syncHtmlToMarkdown(html)
+    const markdown = htmlToMarkdown(html)
     expect(markdown).toBe('> This is a quote with an ![image](image.jpg)')
   })
 })
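
The expected strings in these tests show one `> ` prefix per nesting level. As a rough illustration of that output shape only (not how `src/markdown.ts` produces it), the following sketch renders a small quote tree. The `Quote` and `QuoteItem` types and the `renderQuote` function are assumptions introduced here and do not exist in mdream.

```typescript
// Hypothetical tree: a quote holds text lines and nested quotes, in document order.
type QuoteItem = string | Quote
interface Quote { items: QuoteItem[] }

// Emit one line per text item, prefixed with "> " repeated once per nesting depth.
function renderQuote(quote: Quote, depth = 1): string {
  const prefix = '> '.repeat(depth)
  const lines: string[] = []
  for (const item of quote.items) {
    if (typeof item === 'string')
      lines.push(prefix + item)
    else
      lines.push(renderQuote(item, depth + 1))
  }
  return lines.join('\n')
}

// Mirrors the "handles nested blockquotes" fixture above.
const nested: Quote = { items: ['Outer quote', { items: ['Inner quote'] }] }
console.log(renderQuote(nested))
```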
