gorango/napi-rs-trafilatura

napi-rs-trafilatura

Fast and accurate web content extraction for Node.js.

High-performance NAPI bindings for rs-trafilatura, a Rust port of trafilatura. The library extracts clean, readable content from web pages while removing boilerplate, navigation, and advertisements.

Features

  • Fast: 71 files/s for articles, 46 files/s overall (native Rust)
  • Accurate: F1 0.966 on ScrapingHub benchmark, F1 0.859 across 7 page types
  • Page Type Classification: Auto-detects 7 page types (article, forum, product, collection, listing, documentation, service)
  • Per-Type Extraction: Specialized extraction profiles for each page type
  • Extraction Quality Predictor: ML-based confidence scoring (0.0-1.0)
  • Markdown Output: GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks
  • Rich Metadata: Title, author, date, description, categories, tags, license, images from JSON-LD, Open Graph, Dublin Core
  • Configurable: 28 options to tune precision/recall tradeoff, content selection, and output format
  • Robust: Handles malformed HTML with automatic character encoding detection

Installation

npm install trafilatura

Usage

import { extract } from 'trafilatura'

const html = `
<html>
  <head><title>Example Article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Main Title</h1>
      <p>This is the main content of the article.</p>
    </article>
    <footer>Copyright 2024</footer>
  </body>
</html>
`

const result = extract(html)
console.log('Title:', result.metadata.title)
console.log('Content:', result.contentText)
console.log('Page type:', result.metadata.pageType)
console.log('Quality:', result.extractionQuality)

With Options

import { extract } from 'trafilatura'

const result = extract(html, {
  outputMarkdown: true,
  includeImages: true,
  favorPrecision: true,
  url: 'https://example.com/article',
})

console.log(result.contentMarkdown)
console.log(result.images)
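
The content fields of ExtractResult (contentText, contentHtml, contentMarkdown) are all optional, so calling code may want a fallback order. The following is a minimal sketch of one way to do that. ContentFields and pickContent are hypothetical names for illustration and are not part of the library; the preference order (Markdown, then text, then HTML) is an assumption.

```typescript
// Hypothetical local type: only the three optional content fields of ExtractResult.
interface ContentFields {
  contentText?: string | null
  contentHtml?: string | null
  contentMarkdown?: string | null
}

interface PickedContent {
  format: 'markdown' | 'text' | 'html'
  content: string
}

// Hypothetical helper: returns the first non-empty content field,
// preferring Markdown, then plain text, then HTML. Returns null if
// none of the three fields holds a non-empty string.
function pickContent(result: ContentFields): PickedContent | null {
  if (result.contentMarkdown) return { format: 'markdown', content: result.contentMarkdown }
  if (result.contentText) return { format: 'text', content: result.contentText }
  if (result.contentHtml) return { format: 'html', content: result.contentHtml }
  return null
}
```

A result returned by extract could be passed straight to pickContent, since the helper only reads fields that the ExtractResult table lists.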

Page Type Override

import { extract } from 'trafilatura'

const result = extract(html, {
  pageType: 'product', // Force product page extraction
})
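
When the override comes from user input or configuration, it may help to check it against the seven page types named in the Features list before passing it on. This is a sketch under assumptions: PAGE_TYPES, isPageType and pageTypeOverride are hypothetical names, and the README does not state how the library treats an unrecognized pageType string, so the sketch simply omits the option in that case.

```typescript
// The seven page types named in the Features list.
const PAGE_TYPES = [
  'article',
  'forum',
  'product',
  'collection',
  'listing',
  'documentation',
  'service',
] as const

type PageType = (typeof PAGE_TYPES)[number]

// Hypothetical guard: narrows an arbitrary string to a known page type.
function isPageType(value: string): value is PageType {
  return (PAGE_TYPES as readonly string[]).indexOf(value) !== -1
}

// Hypothetical helper: returns { pageType } for a known type, or an empty
// object so that automatic classification is left in charge otherwise.
function pageTypeOverride(value: string): { pageType?: PageType } {
  const normalized = value.trim().toLowerCase()
  return isPageType(normalized) ? { pageType: normalized } : {}
}
```

The returned object can be spread into the options passed to extract, for example `{ url, ...pageTypeOverride(input) }`.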

Working with Bytes

For HTML with unknown encoding:

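import fs from 'node:fs'
import { extractBytes } from 'trafilatura'

// Read the raw bytes so the library can detect the character encoding itself
const htmlBuffer = await fs.promises.readFile('page.html')
const result = extractBytes(htmlBuffer, { url: 'https://example.com' })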

API

extract(html: string, options?: Options): ExtractResult

Extracts content from an HTML string. The options argument may be omitted.

extractBytes(buffer: Buffer, options?: Options): ExtractResult

Extracts content from a Buffer, detecting the character encoding automatically. The options argument may be omitted.

ExtractResult

| Field | Type | Description |
| --- | --- | --- |
| contentText | string? | Main content as plain text |
| contentHtml | string? | Main content as HTML |
| contentMarkdown | string? | Main content as Markdown |
| commentsText | string? | Comments section as text |
| commentsHtml | string? | Comments section as HTML |
| images | ImageData[] | Extracted images |
| metadata | Metadata | Extracted metadata |
| classificationConfidence | number? | ML classifier confidence (0.0-1.0) |
| extractionQuality | number | Extraction quality confidence (0.0-1.0) |
| warnings | string[] | Processing warnings |
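
The extractionQuality score can be used to decide whether a result is worth keeping. The following is a minimal sketch of such a gate. QualityFields and checkResult are hypothetical names, and the 0.5 default threshold is an arbitrary assumption, not a value recommended by the library.

```typescript
// Hypothetical local type: the ExtractResult fields this check reads.
interface QualityFields {
  contentText?: string | null
  extractionQuality: number
}

interface QualityCheck {
  ok: boolean
  problems: string[]
}

// Hypothetical helper: a result passes only if it has non-blank text and
// its quality score is at least minQuality. The default of 0.5 is an
// assumption chosen for illustration.
function checkResult(result: QualityFields, minQuality: number = 0.5): QualityCheck {
  const problems: string[] = []
  const text = result.contentText || ''
  if (text.trim().length === 0) {
    problems.push('no text content')
  }
  if (result.extractionQuality < minQuality) {
    problems.push('quality ' + result.extractionQuality + ' is below ' + minQuality)
  }
  return { ok: problems.length === 0, problems }
}
```

Collecting the reasons in a list, rather than returning a bare boolean, makes it easier to log why a page was rejected.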

Options

| Option | Type | Description |
| --- | --- | --- |
| includeComments | boolean | Include comments in output |
| includeTables | boolean | Include tables |
| includeImages | boolean | Include images |
| includeLinks | boolean | Include links |
| favorPrecision | boolean | Favor precision over recall |
| favorRecall | boolean | Favor recall over precision |
| targetLanguage | string | Target language code |
| url | string | Source URL |
| authorBlacklist | string[] | Author names to exclude |
| deduplicate | boolean | Remove duplicate content |
| minExtractedSize | number | Minimum extracted content size |
| minExtractedLen | number | Minimum extracted length |
| maxExtractedLen | number | Maximum extracted length |
| minOutputSize | number | Minimum output size |
| minOutputCommSize | number | Minimum comments size |
| minScore | number | Minimum quality score |
| maxDuplicateRatio | number | Max duplicate ratio threshold |
| maxLinkDensity | number | Max link density threshold |
| minParagraphCluster | number | Min paragraph cluster size |
| includeFormatting | boolean | Include text formatting |
| onlyWithMetadata | boolean | Only extract pages with metadata |
| maxTreeDepth | number | Maximum DOM tree depth |
| minWordLength | number | Minimum word length |
| useFallbackExtraction | boolean | Use fallback extraction |
| dedupCacheSize | number | Deduplication cache size |
| includeTitleInContent | boolean | Include title in content |
| outputMarkdown | boolean | Output as Markdown |
| pageType | string | Override page type |
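
favorPrecision and favorRecall pull in opposite directions, so it can be convenient to choose between them through a single setting. This is a sketch only: TuningOptions, TuningMode and withMode are hypothetical names, and the idea that the two flags should not both be true is an assumption, since the README does not say how the library handles that case.

```typescript
// Hypothetical local type: a small subset of the Options table.
interface TuningOptions {
  favorPrecision?: boolean
  favorRecall?: boolean
  includeComments?: boolean
  includeTables?: boolean
}

type TuningMode = 'precision' | 'recall' | 'balanced'

// Hypothetical helper: copies the base options and sets at most one of
// favorPrecision / favorRecall to true, overriding whatever base contained.
function withMode(mode: TuningMode, base: TuningOptions = {}): TuningOptions {
  return {
    ...base,
    favorPrecision: mode === 'precision',
    favorRecall: mode === 'recall',
  }
}
```

The result could then be passed as the second argument to extract, for example `extract(html, withMode('recall', { includeComments: true }))`.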

Build from Source

# Install build dependencies
npm install

# Build the native module
npm run build

Test

npm test

License

MIT

Acknowledgments
