[FR] News Downloader - html filtering #6185

georgew21 · 2020-05-25T08:16:44Z

KOReader version: 2020.5
Device: PW3

Feature Request

Hello dear developers,

I find News Downloader inconvenient. In many feeds I have to turn a lot of pages to reach the text of the article (on the other hand, if I set download full article to false, I only get the description and not the full article). It is also annoying to go back and then wait for each new article to open separately.

I would like to suggest 2 changes.

First, the option to download only specific tags of the webpage (for example div "body", mainarticle, etc.), with the possibility for the user to add more tags, so that if the plugin doesn't automatically get the full article, the user can add some tags.
Second, saving all the articles of a feed in a single HTML file with a table of contents at the beginning and hyperlinks to each article.

Have a great day!

Frenzie · 2020-05-28T22:22:49Z

@NiLuJe Better to discuss that proxy thing here. ;-)

#6205 mentioned https://flak.tedunangst.com/post/miniwebproxy which looks like an interesting alternative to Proxomitron and its modern semi-clone Privoxy. Definitely an interesting program.

hngt · 2020-06-01T08:02:45Z

I think trying to put Go into KOReader is not really the optimal solution. There is LuaXML and it seems it would fit our use-case perfectly. It is just a matter of implementing it, and I am not that good with Lua.

EDIT: It seems LuaXML is already implemented, so isn't it a matter of just creating a CSS element filter list?

Frenzie · 2020-06-01T09:47:37Z

I didn't mean "interesting" in the "ship it" sense, quite the opposite. Apologies for any confusion. It was more thinking out loud in the "once upon a time I used Proxomitron" sense.

> EDIT: It seems LuaXML is already implemented, so isn't it a matter of just creating a CSS element filter list?

Probably, query_selector would be the most obvious thing to look into if you want to implement something like that. (You'll need a much newer version of LuaXML for parseQuery & friends of course.)

hngt · 2020-06-02T19:03:48Z

@Frenzie

After a day of research, I have finally managed to find a solution that is comparable in complexity to miniwebproxy. The luarock "htmlparser" does the task very well, and I've managed to write a simple filter just to check whether it would fit this use case.

-- Read the whole HTML file into memory.
local file = assert(io.open("./file.html", "r"))
local text = file:read("*a")
file:close()

local htmlparser = require("htmlparser")
local root = htmlparser.parse(text)

-- Candidate CSS selectors for the main article container.
local selectors = {
    "main",
    "article",
    "div#main",
    "#main-article",
    ".main-content",
    "#body",
    "#content",
    ".content",
    "div#article",
    "div.article",
    "div.post",
    "div.post-outer",
    ".l-root",
    ".content-container",
    ".StandardArticleBody_body",
    "div#article-inner",
    "div#newsstorytext",
    "div.general",
}

-- Print the content of every element matched by each selector, in list order.
for _, sel in ipairs(selectors) do
    local elements = root:select(sel)
    for _, e in ipairs(elements) do
        print(e:getcontent())
    end
end

Are you (and other people important to the project) fine with including it in KOReader (it is licensed under LGPLv3)? Then I could easily write the function within the next week. The algorithm is generally not very fast (I have not benchmarked it on my e-reader yet), but on my laptop it is instantaneous for simpler sites and takes a staggering 1 second for a Reuters article (which shrinks from 255 KiB to 11 KiB). Nonetheless, I still think this is an extremely worthy endeavour.
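The test script above prints every match for every selector, so nested containers (for example an article inside main) could be emitted twice. One possible refinement, shown here only as a sketch and not as what the eventual patch does, is to stop at the first selector that matches. The function name `filter_html` is hypothetical, and `root` is assumed to be any object with a `select` method returning an array of elements that expose `getcontent()`, as the htmlparser root in the script does.

```lua
-- Sketch: return the content of the first selector that matches anything.
-- `root` is assumed to behave like the htmlparser root used above:
-- root:select(selector) returns an array of elements, each with getcontent().
-- `filter_html` is a hypothetical name, not part of KOReader or htmlparser.
local function filter_html(root, selectors)
    for _, sel in ipairs(selectors) do
        local elements = root:select(sel)
        if #elements > 0 then
            -- Collect the content of every element matched by this selector.
            local parts = {}
            for _, e in ipairs(elements) do
                parts[#parts + 1] = e:getcontent()
            end
            -- Also return the selector that matched, which helps debugging.
            return table.concat(parts, "\n"), sel
        end
    end
    -- No selector matched; the caller can fall back to the unfiltered page.
    return nil
end
```

With the script above this would be called as `filter_html(htmlparser.parse(text), selectors)`. Ordering the list from most to least specific then decides which container wins.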

Frenzie · 2020-06-02T20:25:30Z

It looks acceptable to me; how about you @NiLuJe ?

hngt · 2020-06-02T21:00:29Z

Also, the second idea by @georgew21 is nice too, but I think it needs another issue as it is a vastly different problem, though IMHO doable.

NiLuJe · 2020-06-02T21:34:35Z

I don't use the feature, so, can't really comment, but we certainly do bundle other stuff via luarocks, so I don't have any issue with that ;).

hngt · 2020-06-03T19:23:40Z

@NiLuJe @Frenzie
How do I add packages to KOReader's luarocks, then? I can't find the relevant parts of the code. The functional part of the patch is already written, and all that is left is adding the luarocks dependency.

NiLuJe · 2020-06-03T19:38:03Z

c.f., how lua-Spore is handled (thirdparty/lua-Spore/CMakeLists.txt & Makefile.third in koreader-base).

hngt · 2020-06-04T19:19:59Z

I think the second suggestion deserves a new issue, as it is a whole beast unto itself and will require a very different method (i.e. download/merge all feeds and then use epubDownloader, which will need a different version if we want it to filter elements and have chapters).

This gazette mode (let's call it that) would be very handy for things like tweetRSS, where each individual feed entry is quite small.
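To make the idea concrete, here is a minimal sketch of the simpler variant from the original request: a single HTML file with a table of contents linking to each article. It does not touch epubDownloader, and every name in it (`build_gazette`, `escape_html`, the `{ title, html }` article shape) is hypothetical and chosen for illustration only.

```lua
-- Escape the characters that are special in HTML text and attribute values.
local function escape_html(s)
    return (s:gsub("[&<>\"]", {
        ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
    }))
end

-- Build one HTML document from a feed.
-- articles: array of { title = string, html = string } (assumed shape),
-- where `html` is the already filtered article body.
local function build_gazette(feed_title, articles)
    local toc, body = {}, {}
    for i, article in ipairs(articles) do
        local id = "article-" .. i
        local title = escape_html(article.title)
        -- One table-of-contents entry per article, linking to its anchor.
        toc[#toc + 1] = string.format('<li><a href="#%s">%s</a></li>', id, title)
        -- The article itself, preceded by a heading that carries the anchor id.
        body[#body + 1] = string.format('<h2 id="%s">%s</h2>\n%s', id, title, article.html)
    end
    return table.concat({
        "<html><head><title>" .. escape_html(feed_title) .. "</title></head><body>",
        "<h1>" .. escape_html(feed_title) .. "</h1>",
        "<ol>", table.concat(toc, "\n"), "</ol>",
        table.concat(body, "\n"),
        "</body></html>",
    }, "\n")
end
```

An EPUB-based version would presumably map each article to a chapter instead of an anchor, but that depends on details not settled in this thread.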

Frenzie · 2020-06-04T19:22:52Z

@lich-tex Please feel free to open a new issue. I intend to close this one as the CSS selector issue.

Fixes #6185.

Frenzie added the Plugin label May 25, 2020

pazos mentioned this issue May 28, 2020

[feature request] newsdownloader html filtering #6205

Closed

pazos changed the title ~~[FR] News Downloader - contents and article mode~~ [FR] News Downloader - html filtering May 28, 2020

hngt mentioned this issue Jun 3, 2020

[NewsDownloader] Added an HTML filter through a CSS selector #6228

Merged

Frenzie linked a pull request Jun 4, 2020 that will close this issue

[NewsDownloader] Added an HTML filter through a CSS selector #6228

Merged

Frenzie added this to the 2020.06 milestone Jun 4, 2020

hngt mentioned this issue Jun 4, 2020

[FR] NewsDownloader - Gazette Mode - Feed entries in one EPUB file #6234

Closed

Frenzie closed this as completed in #6228 Jun 4, 2020

Frenzie pushed a commit that referenced this issue Jun 4, 2020

[NewsDownloader] Added an HTML filter through a CSS selector (#6228)

b741fce

Fixes #6185.
