Idea: allow to set content extraction rules for each domain #4

larryhudson · 2023-09-29T21:13:40Z

After testing this out with a few different websites, there are a few small things in the text output that I'd like to tweak, because they make the audio a bit harder to listen to.
One option is being able to manually tweak the text content before it becomes audio, but I'll talk about that in a different issue. Here, I'll talk about setting 'rules' or filters for content.

In the app source code, I've got some hard-coded tweaks for Wikipedia in particular, see below:

astro-sqlite-tts-feed/src/utils/extract-article.js

Lines 42 to 71 in f037b31

    
    addTransformations({
      patterns: [/([\w]+.)?wikipedia.org\/*/],
      // */
      pre: (document) => {
        // do something with document
        const selectorsToRemove = [
          "figure",
          "img",
          "figcaption",
          "sup.reference",
          "sup.noprint",
          "div.thumb",
          "table.infobox",
          "ol.references",
          ".mw-editsection",
        ];
        selectorsToRemove.forEach((selector) => {
          document.querySelectorAll(selector).forEach((elem) => {
            elem.parentNode.removeChild(elem);
          });
        });
        return document;
      },
      post: (document) => {
        // do something with document
        return document;
      },
    });

I want to make it easier to add rules / filters for domains. For example:
- On the Mixmag website, full image URLs were read out in the middle of the audio. I think I would filter out the 'figure' tag.
- On The Conversation, there are many "Read more:" links interspersed within the actual prose content. I would filter out paragraphs that begin with the text "Read more: ", so a rule needs to be able to match on text, not just on a 'selector'.
- Rules could either be 'global' (apply to all sites) or domain-specific.
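
The two kinds of filter described above could be sketched roughly like this. The rule shape (`domain`, `ruleType`, `ruleContent`), the `"text-prefix"` type name, the `applyRule` helper and the example domains are all assumptions for illustration, not what the app currently does. It only relies on the standard DOM `querySelectorAll`, `textContent` and `remove()`.

```javascript
// Hypothetical rule objects; field names and domains are illustrative assumptions.
const exampleRules = [
  { domain: "mixmag.net", ruleType: "selector", ruleContent: "figure" },
  { domain: "theconversation.com", ruleType: "text-prefix", ruleContent: "Read more: " },
];

// Apply a single rule to a DOM Document and return the same document.
function applyRule(document, rule) {
  if (rule.ruleType === "selector") {
    // Remove every element matching the CSS selector.
    document.querySelectorAll(rule.ruleContent).forEach((elem) => elem.remove());
  } else if (rule.ruleType === "text-prefix") {
    // Remove paragraphs whose text starts with the given prefix.
    document.querySelectorAll("p").forEach((p) => {
      if (p.textContent.trim().startsWith(rule.ruleContent)) {
        p.remove();
      }
    });
  }
  return document;
}
```

A text-based rule type like this is one way to cover the "Read more:" case, since a CSS selector alone cannot match on an element's text.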

How we would implement this

- Add a database table called 'extraction rules'. Columns would be: label, active (boolean), domain (if null, the rule is global), rule type (selector or regex?), and rule content. The exact implementation might need a bit more experimentation.
- Move the hard-coded Wikipedia tweaks into database rules.
- When extracting content, get the matching rules from the database and apply them to the content.
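
As a rough sketch of the lookup step, here is how matching rules might be selected for an article URL. The rows are an in-memory stand-in for the proposed table, and the column names (`ruleType`, `ruleContent`), the `getMatchingRules` helper and the subdomain matching behaviour are assumptions. A real version would run an equivalent query against the database.

```javascript
// In-memory stand-in for rows of the proposed 'extraction rules' table (hypothetical data).
const rules = [
  { label: "Drop figures everywhere", active: true, domain: null, ruleType: "selector", ruleContent: "figure" },
  { label: "Wikipedia reference markers", active: true, domain: "wikipedia.org", ruleType: "selector", ruleContent: "sup.reference" },
  { label: "Disabled example", active: false, domain: "wikipedia.org", ruleType: "selector", ruleContent: "table.infobox" },
];

// Return active rules that are global (domain === null) or match the URL's hostname.
function getMatchingRules(allRules, articleUrl) {
  const hostname = new URL(articleUrl).hostname;
  return allRules.filter((rule) => {
    if (!rule.active) return false;
    if (rule.domain === null) return true;
    // Assumption: a rule for "wikipedia.org" also covers subdomains such as "en.wikipedia.org".
    return hostname === rule.domain || hostname.endsWith("." + rule.domain);
  });
}

const matched = getMatchingRules(rules, "https://en.wikipedia.org/wiki/Text-to-speech");
console.log(matched.map((rule) => rule.label));
```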

larryhudson · 2023-10-01T02:10:17Z

I've almost got this working. One thing that is causing a little bit of weirdness is my decision to treat extraction rules where the 'domain' is null as 'global' rules that apply all the time. The problem is that when you submit a form with an empty input, the database field is not null; it is an empty string.

So maybe I should re-think that decision, and have a different way of setting global vs domain-specific rules.
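
One possible way around the empty-string problem, shown only as a sketch and not necessarily how it ended up being resolved, is to normalise the form value before saving it. The `normalizeDomain` helper is hypothetical.

```javascript
// Convert a raw form value into either a trimmed, lowercased domain or null.
// Empty or whitespace-only input is treated as "no domain", i.e. a global rule.
function normalizeDomain(rawValue) {
  if (rawValue === null || rawValue === undefined) return null;
  const trimmed = String(rawValue).trim().toLowerCase();
  return trimmed === "" ? null : trimmed;
}

console.log(normalizeDomain(""));               // null
console.log(normalizeDomain(" Wikipedia.org ")); // "wikipedia.org"
```

The alternative mentioned above, a separate way of marking a rule as global, would avoid relying on null at all.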

larryhudson · 2023-10-01T02:36:23Z

This is done now.

This was referenced Sep 29, 2023

Idea: allow to manually tweak text content before converting to audio #5

Closed

Idea: add a button for re-extracting article text content #12

Closed

larryhudson mentioned this issue Oct 1, 2023

Add ability to add custom extraction rules for text to speech content extraction #13

Merged

larryhudson closed this as completed Oct 1, 2023
