
Matching any of multiple tags #30

@impredicative


I currently have:

<body>
    { <p @text:content /> }
</body>

Obviously, this matches all p tags in body at any level. However, I want something like:

<body>
    { <p|h[1-6] @text:content /> }
</body>

or more explicitly:

<body>
    { <p|h1|h2|h3|h4|h5|h6 @text:content /> }
</body>

That is, I also want to match h1 through h6, not just p. This doesn't seem to be supported by hext at this time. This is an important and urgent use case for me: extracting text from an HTML article for machine learning purposes. However, I don't want to match any other tags at this time. Is there any way to do this?

Currently, to use hext for this purpose, I first have to replace all h1-h6 tags with p tags via string replacement, which is hacky and error-prone.
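
For reference, the string-replacement workaround described above might look roughly like the following. This is only a minimal sketch using Python's standard `re` module; the function name `headings_to_paragraphs` is made up for illustration and is not part of hext.

```python
import re

# Matches the start of an opening or closing h1..h6 tag, e.g. "<h2" or "</h2".
# The \b keeps it from matching other tag names that merely begin with "h1".."h6".
_HEADING_TAG = re.compile(r"<(/?)h[1-6]\b", flags=re.IGNORECASE)


def headings_to_paragraphs(html: str) -> str:
    """Rewrite <h1>..<h6> tags (opening and closing) as <p> tags.

    Hypothetical helper: attributes are left untouched, only the tag name changes.
    """
    return _HEADING_TAG.sub(r"<\g<1>p", html)


print(headings_to_paragraphs('<body><h2 class="t">Title</h2><p>Text</p></body>'))
```

Because this operates on raw text rather than a parsed document, it would also rewrite anything that merely looks like a heading tag inside comments or script content, which is the kind of error the workaround risks.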
