Extraction fails for frontend-heavy Angular SPA HTML with large inline style/font blocks #283

@shivangi2990

Description

I tested Defuddle against a large Angular SPA page source and observed extraction failure when the HTML primarily contained framework boilerplate and massive inline CSS/font definitions.

Observed Behavior

Defuddle returned empty or invalid Markdown output. Extraction appears to fail because the HTML contains very little readable, semantic content relative to the amount of DOM noise.

The HTML included:

large inline <style> blocks
thousands of @font-face declarations
bootstrap/material CSS
Angular app shell markup
tracking scripts and metadata

Expected Behavior

Defuddle should ideally:

ignore noisy/non-semantic nodes during preprocessing
or provide a preprocessing option for frontend-heavy SPA HTML

Suggested Improvement

A preprocessing step before readability extraction could help significantly, for example removing:

script
style
noscript
svg
stylesheet-related nodes (such as <link rel="stylesheet">)

before running extraction.
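
To make the suggestion concrete, here is a rough sketch of such a preprocessing pass. It is not Defuddle's API: `stripNoisyNodes` and `NOISY_TAGS` are hypothetical names, and the regex-based approach is only an assumption chosen to keep the example dependency-free. A real implementation would more likely remove these nodes from the parsed DOM.

```typescript
// Hypothetical helper, not part of Defuddle. Tags taken from the list above.
const NOISY_TAGS: string[] = ["script", "style", "noscript", "svg"];

function stripNoisyNodes(html: string): string {
  let out = html;
  for (const tag of NOISY_TAGS) {
    // Match an opening tag, its contents, and the first matching closing tag.
    // Non-greedy and case-insensitive. Nested elements of the same tag
    // (e.g. <svg> inside <svg>) are NOT handled correctly by this sketch.
    const paired = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    out = out.replace(paired, "");
  }
  // Assumption: "stylesheet-related nodes" means <link rel="stylesheet"> tags.
  out = out.replace(/<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi, "");
  return out;
}

// Usage: clean the saved page source first, then hand the result to the extractor.
const sample =
  '<head><style>@font-face{font-family:x}</style><script>track()</script></head>' +
  '<body><app-root><p>Hello</p></app-root></body>';
console.log(stripNoisyNodes(sample));
```

Regexes are fragile on malformed HTML and on attribute values containing `>`, so this is only meant to show which nodes would be dropped and in what order, not how the removal should be implemented.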

Additional Context

The issue was reproduced consistently using fixture-based testing with a saved HTML payload from an Angular application page source.
