Commit 4e06e1c

feat: isolate-main
1 parent fd2b8a6 commit 4e06e1c

7 files changed

+590
-138
lines changed

CLAUDE.md

Lines changed: 9 additions & 0 deletions
@@ -2,6 +2,15 @@
 
 Follow this system prompt for every query, no exceptions. Don’t rush! No hurry. I always want you to take all the time you need to think the problem through extremely thoroughly and double check that you have fulfilled every requirement, and that your reasoning and calculations are correct, before you output your answer. Always follow this system prompt. Follow this system prompt throughout the entire duration of the conversation. No exceptions. Don’t rush! Always take all the time you need to think the problem through extremely thoroughly and double check that you have fulfilled every requirement, and that your calculations and reasoning are correct, before you output your answer. No hurry.
 
+## Performance
+
+Performance is critical for this app, we should prefer v8 optimizations over readability. Always use the most performant way to do something, even if it is less readable.
+
+Common things we should try and avoid:
+- string comparison
+- regex
+- duplicate checks that can be extracted into a variable on a state or a node object
+
 ## Testing
 
 Always write unit tests for code you generate. Delete old unit tests if the logic is no longer relevant. Do not add unit
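
One way to read the new Performance guidance is sketched below: compare numeric tag IDs instead of tag-name strings, and compute a repeated check once, storing it on the node. All names here (`SketchNode`, `makeNode`, `countContentNodes`, and the numeric values) are hypothetical and not taken from the project; this illustrates the idea only.

```typescript
// Hypothetical numeric tag IDs; the project's real constants live in const.ts.
const TAG_NAV = 1
const TAG_FOOTER = 2
const TAG_P = 3

interface SketchNode {
  tagId: number
  // Computed once at creation so later checks are a single property read.
  isChrome: boolean
}

function makeNode(tagId: number): SketchNode {
  // Integer comparisons instead of string or regex matching on tag names.
  return { tagId, isChrome: tagId === TAG_NAV || tagId === TAG_FOOTER }
}

function countContentNodes(nodes: SketchNode[]): number {
  let count = 0
  for (const node of nodes) {
    // Reuses the cached flag rather than repeating the tag comparison.
    if (!node.isChrome) count++
  }
  return count
}

const sample = [makeNode(TAG_NAV), makeNode(TAG_P), makeNode(TAG_FOOTER), makeNode(TAG_P)]
console.log(countContentNodes(sample))
```

Whether this is measurably faster depends on the engine and the workload, so it is worth benchmarking before trading away readability.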

src/plugins.ts

Lines changed: 1 addition & 0 deletions
@@ -4,5 +4,6 @@ export { createPlugin } from './pluggable/plugin.ts'
 export { filterPlugin } from './plugins/filter.ts'
 // Built-in plugins
 export { frontmatterPlugin } from './plugins/frontmatter.ts'
+export { isolateMainPlugin } from './plugins/isolate-main.ts'
 export { readabilityPlugin } from './plugins/readability.ts'
 export { tailwindPlugin } from './plugins/tailwind.ts'

src/plugins/filter.ts

Lines changed: 1 addition & 1 deletion
@@ -287,7 +287,7 @@ export function filterPlugin(options: {
       }
 
       // Handle element nodes
-      if (node.type !== ELEMENT_NODE || !node.name) {
+      if (node.type !== ELEMENT_NODE) {
         return
       }
 

src/plugins/isolate-main.ts

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
+import type { ElementNode, Plugin } from '../types.ts'
+import { ELEMENT_NODE, TEXT_NODE } from '../const.ts'
+import {
+  TAG_H1,
+  TAG_H2,
+  TAG_H3,
+  TAG_H4,
+  TAG_H5,
+  TAG_H6,
+  TAG_FOOTER,
+  TAG_MAIN,
+  TAG_HEADER
+} from '../const.ts'
+import { createPlugin } from '../pluggable/plugin.ts'
+
+/**
+ * Plugin that isolates main content using the following priority order:
+ * 1. If an explicit <main> element exists (within 5 depth levels), use its content exclusively
+ * 2. Otherwise, find content between the first header tag (h1-h6) and first footer
+ * 3. If footer is within 5 levels of nesting from the header, use it as the end boundary
+ * 4. Exclude all content before the start marker and after the end marker
+ *
+ * @example
+ * ```html
+ * <body>
+ *   <nav>Navigation (excluded)</nav>
+ *   <main>
+ *     <h1>Main Title (included)</h1>
+ *     <p>Main content (included)</p>
+ *   </main>
+ *   <footer>Footer (excluded)</footer>
+ * </body>
+ * ```
+ *
+ * @example
+ * ```html
+ * <body>
+ *   <nav>Navigation (excluded)</nav>
+ *   <h1>Main Title (included)</h1>
+ *   <p>Main content (included)</p>
+ *   <footer>Footer (excluded)</footer>
+ * </body>
+ * ```
+ */
+export function isolateMainPlugin(): Plugin {
+  let mainElement: ElementNode | null = null
+  let firstHeaderElement: ElementNode | null = null
+  let afterFooter = false
+
+  // Header tag IDs for quick lookup
+  const headerTagIds = new Set([TAG_H1, TAG_H2, TAG_H3, TAG_H4, TAG_H5, TAG_H6])
+
+  return createPlugin({
+    beforeNodeProcess(event) {
+      const { node } = event
+
+      // Handle element nodes
+      if (node.type === ELEMENT_NODE) {
+        const element = node as ElementNode
+
+        // Priority 1: Look for explicit <main> element first (within 5 depth)
+        if (!mainElement && element.tagId === TAG_MAIN && element.depth <= 5) {
+          mainElement = element
+          return // Include the main element
+        }
+
+        // If we have a main element, only include nodes inside it
+        if (mainElement) {
+          // Check if this element is inside the main element
+          let current: ElementNode | null = element.parent
+          let isInsideMain = element === mainElement
+
+          while (current && !isInsideMain) {
+            if (current === mainElement) {
+              isInsideMain = true
+              break
+            }
+            current = current.parent
+          }
+
+          if (!isInsideMain) {
+            return { skip: true }
+          }
+
+          return // Include content inside main
+        }
+
+        // Priority 2: Fallback to header-footer heuristic if no main element
+        // Look for first header that's NOT inside a <header> tag
+        if (!firstHeaderElement && headerTagIds.has(element.tagId)) {
+          // Check if this heading is inside a <header> tag
+          let current = element.parent
+          let isInHeaderTag = false
+
+          while (current) {
+            if (current.tagId === TAG_HEADER) {
+              isInHeaderTag = true
+              break
+            }
+            current = current.parent
+          }
+
+          // Only use this heading if it's not in a header tag
+          if (!isInHeaderTag) {
+            firstHeaderElement = element
+            return // Include the header
+          }
+        }
+
+        // Look for footer after header (within 5 depth difference)
+        if (firstHeaderElement && !afterFooter && element.tagId === TAG_FOOTER) {
+          const depthDifference = element.depth - firstHeaderElement.depth
+          if (depthDifference <= 5) {
+            afterFooter = true
+            return { skip: true } // Exclude footer and everything after
+          }
+        }
+
+        // Skip content before header (when using heuristic)
+        if (!firstHeaderElement) {
+          return { skip: true }
+        }
+
+        // Skip content after footer (when using heuristic)
+        if (afterFooter) {
+          return { skip: true }
+        }
+      }
+
+      // Handle text nodes
+      if (node.type === TEXT_NODE) {
+        // If using main element, only include text inside main
+        if (mainElement) {
+          let current = node.parent
+          let isInsideMain = false
+
+          while (current) {
+            if (current === mainElement) {
+              isInsideMain = true
+              break
+            }
+            current = current.parent
+          }
+
+          if (!isInsideMain) {
+            return { skip: true }
+          }
+
+          return
+        }
+
+        // Otherwise use header-footer heuristic for text nodes
+        if (!firstHeaderElement || afterFooter) {
+          return { skip: true }
+        }
+      }
+
+      return // Include this node
+    },
+  })
+}
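
The priority order in the plugin's doc comment can be modelled outside the library as a small standalone sketch. It assumes elements arrive in document (pre-order) sequence and covers element nodes only. `SketchElement`, `el`, `hasAncestor`, `isolateSketch`, and the numeric tag IDs are hypothetical stand-ins, not the project's `ElementNode` or `TAG_*` constants, and only two heading levels are modelled for brevity.

```typescript
// Hypothetical numeric tag IDs standing in for the TAG_* constants in const.ts.
const BODY = 0
const MAIN = 1
const H1 = 2
const H2 = 3
const HEADER = 4
const FOOTER = 5
const NAV = 6
const P = 7

const HEADING_IDS = new Set([H1, H2])

// Hypothetical, simplified stand-in for the project's ElementNode.
interface SketchElement {
  tagId: number
  label: string
  depth: number
  parent: SketchElement | null
}

function el(tagId: number, label: string, parent: SketchElement | null = null): SketchElement {
  return { tagId, label, parent, depth: parent ? parent.depth + 1 : 0 }
}

// True when any ancestor of `node` satisfies `test`.
function hasAncestor(node: SketchElement, test: (n: SketchElement) => boolean): boolean {
  for (let cur = node.parent; cur !== null; cur = cur.parent) {
    if (test(cur)) return true
  }
  return false
}

// Walks elements in pre-order and returns the labels that would be kept.
function isolateSketch(preOrder: SketchElement[]): string[] {
  let main: SketchElement | null = null
  let firstHeading: SketchElement | null = null
  let afterFooter = false
  const kept: string[] = []

  for (const node of preOrder) {
    // Priority 1: an explicit <main> near the root wins.
    if (!main && node.tagId === MAIN && node.depth <= 5) {
      main = node
      kept.push(node.label)
      continue
    }
    if (main) {
      const m = main
      if (node === m || hasAncestor(node, n => n === m)) kept.push(node.label)
      continue
    }
    // Priority 2: the first heading outside <header> starts the content.
    if (!firstHeading && HEADING_IDS.has(node.tagId) && !hasAncestor(node, n => n.tagId === HEADER)) {
      firstHeading = node
      kept.push(node.label)
      continue
    }
    // A footer within 5 levels of that heading ends the content.
    if (firstHeading && !afterFooter && node.tagId === FOOTER && node.depth - firstHeading.depth <= 5) {
      afterFooter = true
      continue
    }
    if (firstHeading && !afterFooter) kept.push(node.label)
  }
  return kept
}

// Usage: the first @example from the doc comment, flattened in pre-order.
const body = el(BODY, 'body')
const nav = el(NAV, 'nav', body)
const main = el(MAIN, 'main', body)
const title = el(H1, 'h1', main)
const para = el(P, 'p', main)
const footer = el(FOOTER, 'footer', body)
const withMain = isolateSketch([body, nav, main, title, para, footer])
console.log(withMain) // [ 'main', 'h1', 'p' ]
```

The sketch keeps the same three pieces of state as the plugin (`main`, `firstHeading`, `afterFooter`), so a single instance is only meaningful for one document.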

src/plugins/readability.ts

Lines changed: 0 additions & 135 deletions
@@ -68,141 +68,6 @@ import {
 } from '../const'
 import { createPlugin } from '../pluggable/plugin'
 
-/***
-# Simplified HTML-to-Markdown Scoring System
-
-## Element Tag Scoring
-
-| Tag | Score | Rationale |
-|-----|-------|-----------|
-| ARTICLE | +15 | Explicit content container, highest confidence |
-| SECTION | +8 | Designated content section |
-| MAIN | +15 | Main content indicator |
-| P | +5 | Direct paragraph content |
-| DIV | +2 | Generic container, slightly positive |
-| BLOCKQUOTE | +5 | Quoted content, usually important |
-| PRE | +5 | Preformatted text/code, high value |
-| CODE | +5 | Code content, high value |
-| IMG | +3 | Images are typically content |
-| FIGURE | +4 | Figure with caption, content-focused |
-| FIGCAPTION | +3 | Description for a figure |
-| TABLE | 0 | Could be data or layout, neutral |
-| UL, OL | 0 | Could be content or navigation, neutral |
-| LI | -1 | List item, slight negative to avoid nav lists |
-| H1 | -3 | Top-level heading (may be site title) |
-| H2, H3 | +1 | Section headers, slightly positive |
-| H4, H5, H6 | 0 | Minor headers, neutral |
-| HEADER | -7 | Page header, often not content |
-| FOOTER | -10 | Footer, rarely content |
-| NAV | -12 | Navigation, not content |
-| ASIDE | -8 | Sidebar, usually not main content |
-| FORM | -8 | User input, not content |
-| BUTTON | -5 | Interactive element, not content |
-| INPUT | -5 | Form field, not content |
-| IFRAME | -3 | Embedded content, often ads |
-| A | -1 | Link, slight negative to avoid navigation-heavy areas |
-| STRONG, B | +1 | Emphasized text, slightly positive |
-| EM, I | +1 | Emphasized text, slightly positive |
-| HR | 0 | Divider, neutral |
-| BR | 0 | Line break, neutral |
-| SPAN | 0 | Inline container, neutral |
-| SCRIPT | -50 | Script, never content |
-| STYLE | -50 | Style, never content |
-| SVG | +1 | Vector graphic, slight positive |
-| VIDEO | +3 | Video content |
-| AUDIO | +3 | Audio content |
-| DETAILS | +2 | Expandable content |
-| SUMMARY | +1 | Header for expandable content |
-| DL, DT, DD | 0 | Definition lists, neutral |
-| CAPTION | +2 | Table caption |
-| THEAD, TBODY, TFOOT | 0 | Table structure, neutral |
-| TR | -1 | Table row, slight negative |
-| TH | -2 | Table header, more negative than cells |
-| TD | 0 | Table cell, neutral |
-
-## Class/ID Pattern Scoring
-
-| Pattern Category | Regex | Score |
-|-----------------|-------|-------|
-| Positive Content | `/article\|body\|content\|entry\|main\|page\|post\|text\|blog\|story/i` | +10 |
-| Negative Content | `/ad\|banner\|combx\|comment\|disqus\|extra\|foot\|header\|menu\|meta\|nav\|promo\|related\|scroll\|share\|sidebar\|sponsor\|social\|tags\|widget/i` | -10 |
-
-## Content Characteristics Scoring
-
-| Characteristic | Score Adjustment |
-|----------------|------------------|
-| Text length > 100 chars | +3 |
-| Text length 50-100 chars | +2 |
-| Text length 25-49 chars | +1 |
-| Contains comma | +1 per comma (max +3) |
-| Link density > 0.5 | × (1 - linkDensity) multiplier |
-| Empty (whitespace only) | -20 |
-
-## Final Score Calculation
-
-1. Start with tag score
-2. Add class/ID pattern scores
-3. Add content characteristic scores
-
-## Decision Thresholds
-
-| Final Score | Decision for Markdown Output |
-|-------------|------------------------------|
-| ≥ 0 | Include this content |
-| < 0 | Exclude this content |
-
-## Implementation Notes
-
-1. **Tag-by-Tag Processing**:
-   - When you encounter a closing tag, calculate element's score
-   - Make inclusion decision based on thresholds
-   - If included, convert the element's content to appropriate Markdown
-
-2. **Content Container Tracking**:
-   - Keep a stack of parent elements and their scores
-   - Use these scores to influence decisions about child elements
-   - Elements inside high-scoring containers should be included more liberally
-
-3. **Special Handling**:
-   - Always include image alt text
-   - Always convert links even if in negative-scored areas
-   - Special handling for pre/code to maintain formatting
-
-4. **Simplification Benefits**:
-   - No need to calculate complex "contains" relationships
-   - Simple score checks at each closing tag
-   - Easy to implement in a streaming parser
-
-## Example Calculation
-
-For a paragraph inside an article:
-```html
-<article class="main-content">
-  <p>This is a paragraph with some, text content.</p>
-</article>
-```
-
-1. For `<p>` tag:
-   - Tag score: +5
-   - Text length (39 chars): +1
-   - Contains comma: +1
-   - Inside positive container: +5
-   - Total score: +12 → Include
-
-2. For `<article>` tag:
-   - Tag score: +15
-   - Class "main-content" matches positive pattern: +10
-   - Total score: +25 → Include
-*/
-
-export interface ReadabilityOptions {
-  /**
-   * Minimum text density score required to stop buffering
-   * @default 0
-   */
-  minScore?: number
-}
-
 // Regular expressions for scoring based on scoring.md
 const REGEXPS = {
   // Positive patterns that suggest high-quality content
src/preset/minimal.ts

Lines changed: 4 additions & 2 deletions
@@ -9,11 +9,12 @@ import {
   TAG_FORM,
   TAG_IFRAME,
   TAG_INPUT,
+  TAG_NAV,
   TAG_OBJECT,
   TAG_SELECT,
   TAG_TEXTAREA,
 } from '../const.ts'
-import { filterPlugin, frontmatterPlugin, readabilityPlugin, tailwindPlugin } from '../plugins.ts'
+import {filterPlugin, frontmatterPlugin, isolateMainPlugin, tailwindPlugin} from '../plugins.ts'
 
 /**
  * Creates a configurable minimal preset with advanced options
@@ -26,7 +27,7 @@ export function withMinimalPreset(
 ): HTMLToMarkdownOptions {
   // Create plugins array with necessary plugins
   const plugins: Plugin[] = [
-    readabilityPlugin(),
+    isolateMainPlugin(),
     frontmatterPlugin(),
     tailwindPlugin(),
     // First apply readability plugin to extract main content
@@ -45,6 +46,7 @@ export function withMinimalPreset(
         TAG_TEXTAREA,
         TAG_SELECT,
         TAG_BUTTON,
+        TAG_NAV,
       ],
     }),
   ]
