Idea: allow to set content extraction rules for each domain #4

Closed
larryhudson opened this issue Sep 29, 2023 · 2 comments

@larryhudson
Owner

larryhudson commented Sep 29, 2023

  • After testing this out with a few different websites, I found small things in the text output that I'd like to tweak, because they make the audio a bit harder to listen to.
  • One option is being able to manually edit the text content before it becomes audio, but I'll cover that in a separate issue. Here, I'll talk about setting 'rules' or filters for content.
  • In the app source code, I've got some hard-coded tweaks for Wikipedia in particular; see below:
    addTransformations({
      // Match wikipedia.org and any subdomain, e.g. en.wikipedia.org
      patterns: [/([\w]+\.)?wikipedia\.org\/*/],
      pre: (document) => {
        // Remove elements that do not read well as audio
        const selectorsToRemove = [
          "figure",
          "img",
          "figcaption",
          "sup.reference",
          "sup.noprint",
          "div.thumb",
          "table.infobox",
          "ol.references",
          ".mw-editsection",
        ];
        selectorsToRemove.forEach((selector) => {
          document.querySelectorAll(selector).forEach((elem) => {
            elem.remove();
          });
        });
        return document;
      },
      post: (document) => {
        // No changes yet; return the document as-is
        return document;
      },
    });
  • I want to make it easier to add rules / filters for domains. For example:
    • On the Mixmag website, full image URLs were read out in the middle of the audio. I think I would filter out the 'figure' tag.
    • On The Conversation, there are many "Read more:" links interspersed within the actual prose content. I would filter out paragraphs that begin with the text "Read more: " (so this needs a text match, not just a 'selector' to filter out).
    • Rules could be either 'global' (apply to all sites) or domain-specific.
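
The "Read more:" case cannot be expressed as a CSS selector, so it needs a text-based filter. A minimal sketch of one, assuming the function receives a DOM `document` like the `pre` hook above does; `removeParagraphsStartingWith` is a hypothetical name, not something that exists in the app:

```javascript
// Hypothetical helper (not part of the app): removes every <p> element
// whose text begins with the given prefix, ignoring leading whitespace.
// `document` is assumed to be a DOM Document, as in the `pre` hook.
function removeParagraphsStartingWith(document, prefix) {
  document.querySelectorAll("p").forEach((paragraph) => {
    const text = (paragraph.textContent || "").trimStart();
    if (text.startsWith(prefix)) {
      paragraph.remove();
    }
  });
  return document;
}
```

Inside a `pre` hook this could be called as `removeParagraphsStartingWith(document, "Read more: ")`.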

How we would implement this

  • Add a database table called 'extraction rules'. Columns would be: label, active (boolean), domain (null means the rule is global), rule type (selector or regex?), and rule content. [The exact implementation might need a bit more experimentation.]
  • Move those hard-coded Wikipedia tweaks into database rules
  • When you're extracting content, get matching rules from the database and apply them to the content
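
The lookup step could work roughly like this. This is only a sketch: an in-memory array stands in for the proposed database table, and the field names (`ruleType`, `ruleContent`), the sample rows, and `matchingRules` are all assumptions made for illustration.

```javascript
// Hypothetical rows standing in for the proposed 'extraction rules' table.
const extractionRules = [
  { label: "Remove figures", active: true, domain: null, ruleType: "selector", ruleContent: "figure" },
  { label: "Remove Wikipedia infoboxes", active: true, domain: "wikipedia.org", ruleType: "selector", ruleContent: "table.infobox" },
  { label: "Remove 'Read more' links", active: true, domain: "theconversation.com", ruleType: "regex", ruleContent: "^Read more: " },
  { label: "Disabled example", active: false, domain: null, ruleType: "selector", ruleContent: "img" },
];

// Returns the active rules that apply to a page URL: global rules
// (domain is null) plus rules whose domain equals the page's hostname
// or is a parent domain of it.
function matchingRules(rules, pageUrl) {
  const hostname = new URL(pageUrl).hostname;
  return rules.filter((rule) => {
    if (!rule.active) return false;
    if (rule.domain === null) return true;
    return hostname === rule.domain || hostname.endsWith("." + rule.domain);
  });
}
```

For example, `matchingRules(extractionRules, "https://en.wikipedia.org/wiki/Podcast")` returns the "Remove figures" and "Remove Wikipedia infoboxes" rules. Matching on the hostname suffix with a leading dot keeps subdomains covered without matching unrelated hosts that merely end in the same letters.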
@larryhudson
Owner Author

I've almost got this working. One thing that is causing a little bit of weirdness is my decision to treat extraction rules where the 'domain' is null as 'global' rules that apply all the time. When you submit a form with an empty input, the database field is not null; it is an empty string.

So maybe I should rethink that decision and have a different way of setting global vs domain-specific rules.
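
One possible way around this, if null stays as the marker for global rules, is to normalize the form value before saving it. A minimal sketch; `normalizeDomain` is a hypothetical helper, and an explicit 'global' flag on the rule would be an equally valid alternative:

```javascript
// Hypothetical helper: converts a raw form value into the value stored
// in the 'domain' column. Missing, empty, or whitespace-only input
// becomes null (global rule); anything else is trimmed and lowercased.
function normalizeDomain(rawValue) {
  if (rawValue === null || rawValue === undefined) return null;
  const trimmed = String(rawValue).trim().toLowerCase();
  return trimmed === "" ? null : trimmed;
}
```

With this in place, an empty form input and a missing field are both stored as null, so the 'null means global' check behaves the same either way.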

@larryhudson
Owner Author

This is done now.
