napi-rs-trafilatura

Fast and accurate web content extraction for Node.js.

High-performance NAPI bindings for rs-trafilatura - a Rust port of trafilatura. Extracts clean, readable content from web pages while removing boilerplate, navigation, and advertisements.

Features

Fast: 71 files/s for articles, 46 files/s overall (native Rust)
Accurate: F1 0.966 on ScrapingHub benchmark, F1 0.859 across 7 page types
Page Type Classification: Auto-detects 7 page types (article, forum, product, collection, listing, documentation, service)
Per-Type Extraction: Specialized extraction profiles for each page type
Extraction Quality Predictor: ML-based confidence scoring (0.0-1.0)
Markdown Output: GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks
Rich Metadata: Title, author, date, description, categories, tags, license, images from JSON-LD, Open Graph, Dublin Core
Configurable: 28 options to tune precision/recall tradeoff, content selection, and output format
Robust: Handles malformed HTML with automatic character encoding detection

Installation

npm install trafilatura

Usage

import { extract } from 'trafilatura'

const html = `
<html>
  <head><title>Example Article</title></head>
  <body>
    <nav>Home | About | Contact</nav>
    <article>
      <h1>Main Title</h1>
      <p>This is the main content of the article.</p>
    </article>
    <footer>Copyright 2024</footer>
  </body>
</html>
`

const result = extract(html)
console.log('Title:', result.metadata.title)
console.log('Content:', result.contentText)
console.log('Page type:', result.metadata.pageType)
console.log('Quality:', result.extractionQuality)

With Options

import { extract } from 'trafilatura'

const result = extract(html, {
  outputMarkdown: true,
  includeImages: true,
  favorPrecision: true,
  url: 'https://example.com/article',
})

console.log(result.contentMarkdown)
console.log(result.images)

Page Type Override

import { extract } from 'trafilatura'

const result = extract(html, {
  pageType: 'product', // Force product page extraction
})
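
When the page type comes from user input or a config file, it can help to validate it before passing it as `pageType`. The sketch below is not part of the library: `PAGE_TYPES` and `toPageType` are hypothetical names, and the string spellings are assumed from the Features list above. Only 'product' appears verbatim in the examples.

```typescript
// The seven page types named in the Features list.
// Assumption: the option accepts these lowercase spellings; only 'product'
// is confirmed by the example above.
const PAGE_TYPES = [
  'article',
  'forum',
  'product',
  'collection',
  'listing',
  'documentation',
  'service',
] as const

type PageType = (typeof PAGE_TYPES)[number]

// Hypothetical helper: normalizes a free-form string and returns a known
// page type, or undefined when the value is not one of the seven.
function toPageType(value: string): PageType | undefined {
  const normalized = value.trim().toLowerCase()
  return PAGE_TYPES.find((t) => t === normalized)
}
```

A caller could then pass `pageType` only when `toPageType(input)` returns a value, and fall back to automatic detection otherwise.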

Working with Bytes

For HTML with unknown encoding:

import fs from 'node:fs'
import { extractBytes } from 'trafilatura'

// Read the raw bytes so the library can detect the character encoding
const htmlBuffer = await fs.promises.readFile('page.html')
const result = extractBytes(htmlBuffer, { url: 'https://example.com' })

API

extract(html: string, options?: Options): ExtractResult

Extracts content from an HTML string. The options argument is optional.

extractBytes(buffer: Buffer, options?: Options): ExtractResult

Extracts content from a Buffer, detecting the character encoding automatically. The options argument is optional.

ExtractResult

Field	Type	Description
contentText	string?	Main content as plain text
contentHtml	string?	Main content as HTML
contentMarkdown	string?	Main content as Markdown
commentsText	string?	Comments section as text
commentsHtml	string?	Comments section as HTML
images	ImageData[]	Extracted images
metadata	Metadata	Extracted metadata
classificationConfidence	number?	ML classifier confidence (0.0-1.0)
extractionQuality	number	Extraction quality confidence (0.0-1.0)
warnings	string[]	Processing warnings
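
Because several content fields are optional, callers usually need a fallback order. The sketch below mirrors the table above in a local interface and picks the first non-empty content field once a quality threshold is met. `ExtractResultLike`, `pickContent`, and the 0.5 default threshold are illustrative assumptions, not part of the library; the shapes of `ImageData` and `Metadata` are not documented here, so they are left loosely typed.

```typescript
// Local mirror of the ExtractResult table. Fields marked `string?` or
// `number?` in the table are modeled as optional properties.
interface ExtractResultLike {
  contentText?: string
  contentHtml?: string
  contentMarkdown?: string
  commentsText?: string
  commentsHtml?: string
  images: unknown[] // ImageData shape is not documented here
  metadata: Record<string, unknown> // Metadata shape is not documented here
  classificationConfidence?: number
  extractionQuality: number
  warnings: string[]
}

// Hypothetical helper: returns Markdown if present, then plain text, then
// HTML, or undefined when quality is below the threshold or nothing usable
// was extracted. The 0.5 default is an arbitrary illustrative value.
function pickContent(
  result: ExtractResultLike,
  minQuality: number = 0.5,
): string | undefined {
  if (result.extractionQuality < minQuality) return undefined
  const candidates = [
    result.contentMarkdown,
    result.contentText,
    result.contentHtml,
  ]
  for (const candidate of candidates) {
    if (candidate != null && candidate.trim() !== '') return candidate
  }
  return undefined
}
```

A result returned by `extract` could be passed to `pickContent` directly if its fields match the table; a suitable threshold depends on the pages being processed.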

Options

Option	Type	Description
includeComments	boolean	Include comments in output
includeTables	boolean	Include tables
includeImages	boolean	Include images
includeLinks	boolean	Include links
favorPrecision	boolean	Favor precision over recall
favorRecall	boolean	Favor recall over precision
targetLanguage	string	Target language code
url	string	Source URL
authorBlacklist	string[]	Author names to exclude
deduplicate	boolean	Remove duplicate content
minExtractedSize	number	Minimum extracted content size
minExtractedLen	number	Minimum extracted length
maxExtractedLen	number	Maximum extracted length
minOutputSize	number	Minimum output size
minOutputCommSize	number	Minimum comments size
minScore	number	Minimum quality score
maxDuplicateRatio	number	Max duplicate ratio threshold
maxLinkDensity	number	Max link density threshold
minParagraphCluster	number	Min paragraph cluster size
includeFormatting	boolean	Include text formatting
onlyWithMetadata	boolean	Only extract pages with metadata
maxTreeDepth	number	Maximum DOM tree depth
minWordLength	number	Minimum word length
useFallbackExtraction	boolean	Use fallback extraction
dedupCacheSize	number	Deduplication cache size
includeTitleInContent	boolean	Include title in content
outputMarkdown	boolean	Output as Markdown
pageType	string	Override page type
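
Some option combinations look contradictory on paper, such as setting both `favorPrecision` and `favorRecall`, or a minimum length above the maximum. How the library treats these cases is not documented here, so the sketch below is only a caller-side sanity check. `OptionsSubset` and `findOptionProblems` are hypothetical names, and the idea that these combinations are mistakes is an assumption.

```typescript
// A subset of the Options table, declared locally so the sketch stands alone.
interface OptionsSubset {
  favorPrecision?: boolean
  favorRecall?: boolean
  minExtractedLen?: number
  maxExtractedLen?: number
}

// Hypothetical helper: returns human-readable descriptions of option
// combinations that are probably unintended. An empty array means no
// problem was found by these two checks.
function findOptionProblems(options: OptionsSubset): string[] {
  const problems: string[] = []
  // Assumption: favoring precision and recall at once is a caller mistake.
  if (options.favorPrecision && options.favorRecall) {
    problems.push('favorPrecision and favorRecall are both set')
  }
  // Assumption: a minimum above the maximum can never be satisfied.
  if (
    options.minExtractedLen !== undefined &&
    options.maxExtractedLen !== undefined &&
    options.minExtractedLen > options.maxExtractedLen
  ) {
    problems.push('minExtractedLen is greater than maxExtractedLen')
  }
  return problems
}
```

Running such a check before calling `extract` keeps configuration mistakes visible instead of relying on undocumented behavior.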

Build from Source

# Install build dependencies
npm install

# Build the native module
npm run build

Test

npm test

License

MIT

Acknowledgments

trafilatura - Original Python implementation by Adrien Barbaresi
rs-trafilatura - Rust port by Murrough Foley
