Fast and accurate web content extraction for Node.js.
High-performance NAPI bindings for rs-trafilatura - a Rust port of trafilatura. Extracts clean, readable content from web pages while removing boilerplate, navigation, and advertisements.
- Fast: 71 files/s for articles, 46 files/s overall (native Rust)
- Accurate: F1 0.966 on ScrapingHub benchmark, F1 0.859 across 7 page types
- Page Type Classification: Auto-detects 7 page types (article, forum, product, collection, listing, documentation, service)
- Per-Type Extraction: Specialized extraction profiles for each page type
- Extraction Quality Predictor: ML-based confidence scoring (0.0-1.0)
- Markdown Output: GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks
- Rich Metadata: Title, author, date, description, categories, tags, license, images from JSON-LD, Open Graph, Dublin Core
- Configurable: 28 options to tune precision/recall tradeoff, content selection, and output format
- Robust: Handles malformed HTML with automatic character encoding detection
npm install trafilatura

import { extract } from 'trafilatura'
const html = `
<html>
<head><title>Example Article</title></head>
<body>
<nav>Home | About | Contact</nav>
<article>
<h1>Main Title</h1>
<p>This is the main content of the article.</p>
</article>
<footer>Copyright 2024</footer>
</body>
</html>
`
const result = extract(html)
console.log('Title:', result.metadata.title)
console.log('Content:', result.contentText)
console.log('Page type:', result.metadata.pageType)
console.log('Quality:', result.extractionQuality)

import { extract } from 'trafilatura'
const result = extract(html, {
outputMarkdown: true,
includeImages: true,
favorPrecision: true,
url: 'https://example.com/article',
})
console.log(result.contentMarkdown)
console.log(result.images)

import { extract } from 'trafilatura'
const result = extract(html, {
pageType: 'product', // Force product page extraction
})

For HTML with unknown encoding:
import fs from 'node:fs'
import { extractBytes } from 'trafilatura'
const htmlBuffer = await fs.promises.readFile('page.html')
const result = extractBytes(htmlBuffer, { url: 'https://example.com' })

extract(html, options?): Extracts content from an HTML string. The options object is optional.

extractBytes(buffer, options?): Extracts content from a Buffer, detecting the character encoding automatically. The options object is optional.
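
The two entry points above differ only in the input they accept, so callers that receive either decoded strings or raw bytes may want a small dispatcher. The following is a minimal sketch, not part of the package: `extractAny` and its `api` parameter are hypothetical names, and the functions are passed in rather than imported so the sketch can be run with stubs.

```javascript
// Hypothetical helper: dispatch to the string or byte entry point.
// `api` is expected to carry the `extract` and `extractBytes` functions
// exported by the package; it is injected so the sketch stays testable.
function extractAny(input, options, api) {
  if (typeof input === 'string') {
    // Already-decoded HTML goes to the string entry point.
    return api.extract(input, options)
  }
  if (Buffer.isBuffer(input)) {
    // Raw bytes go to the entry point that detects the encoding.
    return api.extractBytes(input, options)
  }
  if (input instanceof Uint8Array) {
    // View the same memory as a Buffer without copying it.
    const buf = Buffer.from(input.buffer, input.byteOffset, input.byteLength)
    return api.extractBytes(buf, options)
  }
  throw new TypeError('input must be a string, Buffer, or Uint8Array')
}
```

In real use, `api` would be an object holding the `extract` and `extractBytes` functions imported from 'trafilatura'. The `Buffer.isBuffer` check comes before the `Uint8Array` check because every Buffer is also a Uint8Array.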
The result object has the following fields:

| Field | Type | Description |
|---|---|---|
| contentText | string? | Main content as plain text |
| contentHtml | string? | Main content as HTML |
| contentMarkdown | string? | Main content as Markdown |
| commentsText | string? | Comments section as text |
| commentsHtml | string? | Comments section as HTML |
| images | ImageData[] | Extracted images |
| metadata | Metadata | Extracted metadata |
| classificationConfidence | number? | ML classifier confidence (0.0-1.0) |
| extractionQuality | number | Extraction quality confidence (0.0-1.0) |
| warnings | string[] | Processing warnings |
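
Because several of the fields above are optional, consuming code usually has to pick whichever content field is present and decide what to do with low-confidence results. A minimal sketch of that handling follows. `summarizeResult` is a hypothetical helper, the 0.5 threshold is an arbitrary illustration rather than a value recommended by the library, and the sketch assumes that absent optional fields are `null` or `undefined`.

```javascript
// Hypothetical helper: pick the richest available content field and flag
// low-confidence extractions. The default threshold is illustrative only.
function summarizeResult(result, minQuality = 0.5) {
  // Assumption: optional fields (marked `?` in the table) are null or
  // undefined when absent, so `??` falls through to the next candidate.
  const content = result.contentMarkdown ?? result.contentText ?? result.contentHtml ?? ''
  const format = result.contentMarkdown != null ? 'markdown'
    : result.contentText != null ? 'text'
    : result.contentHtml != null ? 'html'
    : 'none'
  return {
    content,
    format,
    imageCount: result.images.length,
    warnings: result.warnings,
    // extractionQuality is a 0.0-1.0 score according to the table.
    lowConfidence: result.extractionQuality < minQuality,
  }
}
```

A caller might log `warnings` and skip or re-queue any page where `lowConfidence` is true.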

Supported options:

| Option | Type | Description |
|---|---|---|
| includeComments | boolean | Include comments in output |
| includeTables | boolean | Include tables |
| includeImages | boolean | Include images |
| includeLinks | boolean | Include links |
| favorPrecision | boolean | Favor precision over recall |
| favorRecall | boolean | Favor recall over precision |
| targetLanguage | string | Target language code |
| url | string | Source URL |
| authorBlacklist | string[] | Author names to exclude |
| deduplicate | boolean | Remove duplicate content |
| minExtractedSize | number | Minimum extracted content size |
| minExtractedLen | number | Minimum extracted length |
| maxExtractedLen | number | Maximum extracted length |
| minOutputSize | number | Minimum output size |
| minOutputCommSize | number | Minimum comments size |
| minScore | number | Minimum quality score |
| maxDuplicateRatio | number | Max duplicate ratio threshold |
| maxLinkDensity | number | Max link density threshold |
| minParagraphCluster | number | Min paragraph cluster size |
| includeFormatting | boolean | Include text formatting |
| onlyWithMetadata | boolean | Only extract pages with metadata |
| maxTreeDepth | number | Maximum DOM tree depth |
| minWordLength | number | Minimum word length |
| useFallbackExtraction | boolean | Use fallback extraction |
| dedupCacheSize | number | Deduplication cache size |
| includeTitleInContent | boolean | Include title in content |
| outputMarkdown | boolean | Output as Markdown |
| pageType | string | Override page type |
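
With this many options, it can help to build the options object from a named mode plus a few overrides. The sketch below is one way to do that. `buildOptions` and its mode names are hypothetical, and treating `favorPrecision` and `favorRecall` as mutually exclusive is an assumption of this sketch, since the table does not say how the library behaves when both are set.

```javascript
// Hypothetical helper: build an options object from a tuning mode.
// Only option names that appear in the table above are used.
function buildOptions(mode, overrides = {}) {
  const presets = {
    precision: { favorPrecision: true },
    recall: { favorRecall: true },
    balanced: {},
  }
  if (!Object.prototype.hasOwnProperty.call(presets, mode)) {
    throw new Error(`unknown mode: ${mode}`)
  }
  // Overrides are applied last so callers can add or replace any option.
  const options = { ...presets[mode], ...overrides }
  // Assumption: setting both flags is a caller mistake, so fail early.
  if (options.favorPrecision && options.favorRecall) {
    throw new Error('favorPrecision and favorRecall should not both be set')
  }
  return options
}
```

For example, `buildOptions('precision', { outputMarkdown: true })` yields an object that could be passed as the second argument to `extract`.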
# Install build dependencies
npm install
# Build the native module
npm run build
# Run the tests
npm test

License: MIT

- trafilatura - Original Python implementation by Adrien Barbaresi
- rs-trafilatura - Rust port by Murrough Foley