[FR] News Downloader - html filtering #6185

Closed
georgew21 opened this issue May 25, 2020 · 11 comments · Fixed by #6228

@georgew21

  • KOReader version: 2020.5
  • Device: PW3

Feature Request

Hello dear developers,

I find News Downloader inconvenient. In many feeds, I need to turn a lot of pages to reach the text of the article (on the other hand, if I set "download full article" to false, I get only the description and not the full article). Also, it's annoying to go back and then wait for each new article to open separately.

I would like to suggest 2 changes.

  1. First, an option to download only specific elements of the webpage (for example, div "body", "mainarticle", etc.), with the ability for the user to add more, so that if the plugin doesn't automatically get the full article, the user can add some tags.

  2. Second, saving all the articles of a feed in a single HTML file, with a table of contents at the beginning and hyperlinks to each article.

Have a great day!

@Frenzie Frenzie added the Plugin label May 25, 2020
@pazos pazos changed the title [FR] News Downloader - contents and article mode [FR] News Downloader - html filtering May 28, 2020
@Frenzie
Member

Frenzie commented May 28, 2020

@NiLuJe Better to discuss that proxy thing here. ;-)

#6205 mentioned https://flak.tedunangst.com/post/miniwebproxy which looks like an interesting alternative to Proxomitron and its modern semi-clone Privoxy. Definitely an interesting program.

@hngt
Contributor

hngt commented Jun 1, 2020

I think trying to put Go into KOReader is not really the optimal solution. There is LuaXML and it seems it would fit our use-case perfectly. It is just a matter of implementing it, and I am not that good with Lua.

EDIT: It seems LuaXML is already implemented, so isn't it a matter of just creating a CSS element filter list?

@Frenzie
Member

Frenzie commented Jun 1, 2020

I didn't mean "interesting" in the "ship it" sense, quite the opposite. Apologies for any confusion. It was more thinking out loud in the "once upon a time I used Proxomitron" sense.

> EDIT: It seems LuaXML is already implemented, so isn't it a matter of just creating a CSS element filter list?

Probably, query_selector would be the most obvious thing to look into if you want to implement something like that. (You'll need a much newer version of LuaXML for parseQuery & friends of course.)

@hngt
Contributor

hngt commented Jun 2, 2020

@Frenzie

After a day of research, I have finally managed to find a solution that is comparable in complexity to miniwebproxy. The luarock "htmlparser" does the task very well, and I've managed to write a simple filter just to check whether it would fit this use case.

local htmlparser = require("htmlparser")

-- Read the whole HTML file into memory.
local file = assert(io.open("./file.html", "r"))
local text = file:read("*a")
file:close()

local root = htmlparser.parse(text)

-- CSS selectors that commonly wrap the main article body.
local selectors = {
    "main",
    "article",
    "div#main",
    "#main-article",
    ".main-content",
    "#body",
    "#content",
    ".content",
    "div#article",
    "div.article",
    "div.post",
    "div.post-outer",
    ".l-root",
    ".content-container",
    ".StandardArticleBody_body",
    "div#article-inner",
    "div#newsstorytext",
    "div.general",
}

-- Print the inner content of every element matched by each selector, in order.
for _, sel in ipairs(selectors) do
    local elements = root:select(sel)
    for _, e in ipairs(elements) do
        print(e:getcontent())
    end
end

Are you (and other people important to the project) fine with including it in KOReader? (It is licensed under LGPLv3.) Then I could easily write the function within the next week. The algorithm is generally not very fast (I have not benchmarked it on my e-reader yet), but on my laptop it is instantaneous for simpler sites and takes a staggering 1 second for a Reuters article (which shrinks from 255 KiB to 11 KiB). Nonetheless, I still think this is an extremely worthy endeavour.
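
A minimal sketch of how the selector loop could be wrapped into a reusable function, assuming the goal is to keep only the first matching element and to fall back to the unfiltered page when nothing matches. The name `filter_html` and the fallback behaviour are assumptions for illustration, not part of the proposed patch; `root` stands for whatever `htmlparser.parse` returns.

```lua
-- Hypothetical helper: return the content of the first element matched by
-- any selector, trying the selectors in order. `root` only needs a
-- `select(self, selector)` method returning a list of elements, each with a
-- `getcontent(self)` method (as used with the htmlparser rock above).
local function filter_html(root, selectors, fallback)
    for _, sel in ipairs(selectors) do
        local elements = root:select(sel)
        if elements and elements[1] then
            -- Also return the selector that matched, which helps debugging.
            return elements[1]:getcontent(), sel
        end
    end
    -- Assumption: when nothing matches, keep the whole page rather than
    -- dropping the article.
    return fallback, nil
end
```

With the htmlparser rock, usage would be along the lines of `filter_html(htmlparser.parse(text), selectors, text)`. Stopping at the first match avoids emitting the same text twice when, say, an `article` element is nested inside `main`.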

@Frenzie
Member

Frenzie commented Jun 2, 2020

It looks acceptable to me; how about you @NiLuJe ?

@hngt
Contributor

hngt commented Jun 2, 2020

Also, the second idea by @georgew21 is nice too, but I think it needs a separate issue, as it is a vastly different problem, though IMHO doable.

@NiLuJe
Member

NiLuJe commented Jun 2, 2020

I don't use the feature, so, can't really comment, but we certainly do bundle other stuff via luarocks, so I don't have any issue with that ;).

@hngt
Contributor

hngt commented Jun 3, 2020

@NiLuJe @Frenzie
How then do I add packages to KOReader's luarocks? I can't find the relevant parts of the code. The functional part of the patch is already written; all that is left is adding the luarocks dependency.

@NiLuJe
Member

NiLuJe commented Jun 3, 2020

Cf. how lua-Spore is handled (thirdparty/lua-Spore/CMakeLists.txt & Makefile.third in koreader-base).

@hngt
Contributor

hngt commented Jun 4, 2020

I think the second suggestion deserves a new issue, as it is a whole beast unto itself and will require a very different method (i.e. download/merge all feeds and then use epubDownloader, which will need a different version if we want it to filter elements and have chapters).

This gazette mode (let's call it that) would be very handy for stuff like tweetRSS, where each individual piece of feed content is quite small.
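
As a rough sketch of the merging step, assuming each article has already been downloaded and filtered into a title and an HTML body: the function below concatenates them into one HTML document with a linked table of contents at the top. The function names, the layout of the `articles` table, and the `article-N` anchor ids are hypothetical; packaging the result as an EPUB is left out.

```lua
-- Escape characters that are special in HTML text and attribute values.
local function escape_html(s)
    return (s:gsub("[&<>\"]", {
        ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
    }))
end

-- Hypothetical helper: `articles` is a list of { title = ..., html = ... }.
-- Returns a single HTML string with a table of contents linking to each article.
local function build_gazette(feed_title, articles)
    local toc, body = {}, {}
    for i, article in ipairs(articles) do
        local id = "article-" .. i
        local title = escape_html(article.title)
        toc[#toc + 1] = string.format('<li><a href="#%s">%s</a></li>', id, title)
        -- The article body is assumed to be HTML already, so it is not escaped.
        body[#body + 1] = string.format('<h2 id="%s">%s</h2>\n%s', id, title, article.html)
    end
    return table.concat({
        "<html><head><title>" .. escape_html(feed_title) .. "</title></head><body>",
        "<h1>" .. escape_html(feed_title) .. "</h1>",
        "<ol>",
        table.concat(toc, "\n"),
        "</ol>",
        table.concat(body, "\n"),
        "</body></html>",
    }, "\n")
end
```

Each heading carries the id that its table-of-contents entry links to, so a reader that supports in-document links can jump straight to an article.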

@Frenzie
Member

Frenzie commented Jun 4, 2020

@lich-tex Please feel free to open a new issue. I intend to close this one as the CSS selector issue.
