Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewsDownloader externalize download engine #3618

Merged
merged 24 commits into from
Jan 31, 2018

Conversation

mwoz123
Copy link
Contributor

@mwoz123 mwoz123 commented Jan 18, 2018

appreciated help to build :
https://github.com/daurnimator/lua-http as lua library

I stuck on segfault :(

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 28, 2018

Ok. so I'll add lua-http as backend next time.

This time I'll only focus on interfacing backend (already done)
and fix problems with redirects.

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 28, 2018

Can someone take a look at my code?
Especially around this line:
https://github.com/koreader/koreader/pull/3618/files#diff-17a7fb48fe6727b445dbfa79f04c4c55R51
As something might be wrong here as it sometimes creates empty (0b) files (for 301, and other situation that was working previously).

return processDownloadAndReturnSink(redirected_url)
end
else
logger.warn("translator HTTP status not okay:", status)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not translator here ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


function InternalDownloadBackend:download(url, path)
local sink = processDownloadAndReturnSink(url)
ltn12.sink.file(io.open(path, 'w'))
Copy link
Member

@Frenzie Frenzie Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure what's going on here.

  1. sink is for use as an argument to LuaSocket: http://w3.impa.br/~diego/software/luasocket/ltn12.html#sink.table
    You can return the table that sink created for you, not the sink. You're doing that of course, but you're naming the table itself sink, which I find confusing at best. It's not a sink. It's just a results table for the sink.
  2. Regardless of all that, the ltn12.sink here is unrelated to the local sink right above it, isn't it?
  3. Anyway, so you'd probably want to table.concat(table_name) (please name it something not sink) just like you're doing in that getResponseAsString() function. But that probably makes more sense as return table.concat(table_name) on line 40.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks I fixed it.


local InternalDownloadBackend = {}

local function processDownloadAndReturnSink(url)
Copy link
Member

@Frenzie Frenzie Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just call it something simple like get or getURL. Skip the getResponseAsString since this function should just do return table.concat(response).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I replaced processDownloadAndReturnSink with getResponseAsString as this ways' better.


if status ~= "HTTP/1.1 200 OK" then
logger.dbg("InternalDownloadBackend: HTTP response code <> 200. Response code: ", status)
if status and string.sub(status, 1, 1) ~= "3" then -- handle 301, 302...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misreading this makes no sense as written. It should be == 3, not not ~= 3. But since the status code is a number you can just do code > 300 and code < 400. Just use it instead of ignoring it up above on line 24.

Also are you sure that this will only apply to HTTPS: that is, regular HTTP only returns something after all the internal redirections have already been taken care of?

Copy link
Contributor Author

@mwoz123 mwoz123 Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right copy-paste from pseudocode.

actually as status is something like:
"HTTP/1.1 200 OK"
"HTTP/1.1 400 Bad Request"
it should be :
string.sub(status, 10, 1) == "3"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something wrong with that. I give extra debug, substring without second arg:
logger.dbg("InternalDownloadBackend: HTTP response code <> 200. Response code: ", status)
local code = string.sub(status, 10)
logger.dbg("RESPO:", code)

gives:

01/29/18-22:09:24 DEBUG InternalDownloadBackend: HTTP response code <> 200. Response code: HTTP/1.1 301 Moved Permanently
01/29/18-22:09:24 DEBUG RESPO: 301 Moved Permanently

adding second arg to substring:
logger.dbg("InternalDownloadBackend: HTTP response code <> 200. Response code: ", status)
local code = string.sub(status, 10, 1)
logger.dbg("RESPO:", code)

gives:

01/29/18-22:12:09 DEBUG InternalDownloadBackend: HTTP response code <> 200. Response code: HTTP/1.1 301 Moved Permanently
01/29/18-22:12:09 DEBUG RESPO:
./luajit: plugins/newsdownloader.koplugin/internaldownloadbackend.lua:38: InternalDownloadBackend: Don't know how to handle HTTP status:

see here 01/29/18-22:12:09 DEBUG RESPO: <- it's blank!!! empty null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you should use the number code, not the human readable string status. Up above where you're explicitly dismissing it in the dummy variable _, replace that with something like code and use it the way I said: code > 300 and code < 400

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But status returns me full string status:
"HTTP/1.1 200 OK"
"HTTP/1.1 400 Bad Request"

I don't have number (code only) status:(

found ugly but working workaround.

local socket_url = require("socket.url")

local InternalDownloadBackend = {}
local max_redirects = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably doesn't really matter but most browsers default to 20.


function InternalDownloadBackend:download(url, path)
local response = self:getResponseAsString(url)
local file = assert(io.open(path, 'w'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that an assert means insta-crash in case it fails. Which is excellent during development, but not so much for regular use. ;-)

There are a couple of methods we've added to get around this issue:

local dbg = require("dbg")
if dbg.is_on then
    -- do some asserting or erroring
end

Example use:

if dbg.is_on then
local mt = {
__index = function(t, k)
local prop_value = rawget(t, k)
local prop_exists = prop_value ~= nil
if not prop_exists then
local warning = rawget(t, "_name") and string.format("Size.%s.%s", rawget(t, "_name"), k)
or string.format("Size.%s", k)
error("Size: this property does not exist: " .. warning)
end
assert(prop_exists)
return prop_value
end
}
setmetatable(Size, mt)
for el, table in pairs(Size) do
table._name = el
setmetatable(Size[el], mt)
end
end

Then there's also dbg.guard and dbg.dassert.

dbg.dassert(file ~= nil)

dbg:guard(UIManager, 'schedule',
function(self, time, action)
assert(time[1] >= 0 and time[2] >= 0, "Only positive time allowed")
assert(action ~= nil)
end)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I'll remove assert

@codecov-io
Copy link

codecov-io commented Jan 29, 2018

Codecov Report

Merging #3618 into master will decrease coverage by 0.12%.
The diff coverage is 38.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3618      +/-   ##
==========================================
- Coverage   53.22%   53.09%   -0.13%     
==========================================
  Files         200      201       +1     
  Lines       26446    26693     +247     
==========================================
+ Hits        14075    14173      +98     
- Misses      12371    12520     +149
Impacted Files Coverage Δ
frontend/apps/reader/modules/readersearch.lua 63.82% <ø> (ø) ⬆️
frontend/apps/reader/modules/readerwikipedia.lua 14.09% <0%> (-0.27%) ⬇️
frontend/ui/trapper.lua 10.34% <0%> (-0.09%) ⬇️
frontend/ui/widget/keyvaluepage.lua 19.21% <0%> (-0.16%) ⬇️
frontend/ui/widget/textboxwidget.lua 45.92% <0%> (ø) ⬆️
frontend/ui/widget/htmlboxwidget.lua 20.8% <0%> (-0.52%) ⬇️
frontend/ui/widget/buttondialog.lua 90.47% <100%> (+0.64%) ⬆️
frontend/ui/widget/bookstatuswidget.lua 89.01% <100%> (ø) ⬆️
frontend/ui/widget/confirmbox.lua 75.22% <100%> (+0.67%) ⬆️
frontend/ui/widget/menu.lua 77.67% <100%> (+0.77%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 775b7a3...893ccbc. Read the comment docs.

@mwoz123 mwoz123 changed the title [w.i.p] NewsDownloader externalize download engine NewsDownloader externalize download engine Jan 29, 2018
@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 29, 2018

can I retrigger CI?

/bin/bash: line 4: 2849 Segmentation fault ./luajit /home/ko/.luarocks/bin/busted --sort-files --no-auto-insulate --output=gtest --exclude-....

this doesn't seem to be something I should be responsible for.


local httpRequest = parsed.scheme == 'http' and http.request or https.request
-- first argument returned by skip is code
local _, headers, status = socket.skip(1, httpRequest(request))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3618 (comment)

I don't have number (code only) status:(

This line is where you're explicitly ignoring it.

local code, headers, status

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks:)

@Frenzie
Copy link
Member

Frenzie commented Jan 30, 2018

Is the redirect handling working as intended on Reuters? If so I'll merge this.

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 30, 2018

Is the redirect handling working as intended on Reuters? If so I'll merge this.

Reuters works (fixed). PC world full download still doesn't (so no changes here).

Added better exception handling.

@@ -212,28 +212,38 @@ end

function NewsDownloader:processFeedSource(url, limit, unsupported_feeds_urls, download_full_article)

local response = DownloadBackend:getResponseAsString(url)
local ok, response = pcall(function()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pcall(DownloadBackend.getResponseAsString, url)

Copy link
Contributor Author

@mwoz123 mwoz123 Jan 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replacing

local ok, response = pcall(function()
        return DownloadBackend:getResponseAsString(url)
    end)

with local ok, response = pcall(DownloadBackend.getResponseAsString, url)

brokes code.
It displays error for all sources after <1 sec.

@Frenzie
Copy link
Member

Frenzie commented Jan 30, 2018

Reuters works (fixed). PC world full download still doesn't (so no changes here).

Oh, I just fixed both of those in #3644 :-P

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 30, 2018

Oh, I just fixed both of those in #3644 :-P

ok, should I revert internaldownloadbackend to inital extract:
https://github.com/mwoz123/koreader/blob/2ec2e955f4a2aef1d464f4e0e512a5f531d0a7ab/plugins/newsdownloader.koplugin/internaldownloadbackend.lua
?
EDIT: or should it work as is?

@Frenzie
Copy link
Member

Frenzie commented Jan 30, 2018

Sure!

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 30, 2018

@Frenzie I suspect your commit didn't fix any of the problems.
Routers again have 301 moved perm, and PCWorld 400 bad request.

What I did:
I reverted back internaldownloadbackend see commit above (7ec4025)

I merge upstream master to my branch latest develop:

marcin@ubuntu:/media/marcin/Dane/koreader$ git remote -v
origin  https://github.com/mwoz123/koreader.git (fetch)
origin  https://github.com/mwoz123/koreader.git (push)
upstream        https://github.com/koreader/koreader.git (fetch)
upstream        https://github.com/koreader/koreader.git (push)
marcin@ubuntu:/media/marcin/Dane/koreader$ git fetch upstream 
marcin@ubuntu:/media/marcin/Dane/koreader$ git merge upstream/master 
Already up-to-date.
marcin@ubuntu:/media/marcin/Dane/koreader$ date
Tue Jan 30 21:45:11 CET 2018

build koreader. run download in news donwloader.

Can you check if it works for you ?
Other wise I'll rollback 7ec4025
as it will fix at least Reuters.

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

As always, git submodule update --recursive.

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

Ok. If it's fine then please merge

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

Wait, is that file even used?

local LuaHttpDownloadBackend = {}

function LuaHttpDownloadBackend:getResponseAsString(url)
local _, stream = assert(http_request.new_from_uri(url):go())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the asserts had been replaced by something more user-friendly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, you stuck it behind a pcall.

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

Personally I would leave that

I don't particularly want any dead code lying around in master. That's why Git has branches.

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

Oh, no please don't remove it. It's already prepared for external lib. Please don't make me additional work, once I'm back to the topic

@Frenzie Frenzie merged commit d0f9b53 into koreader:master Jan 31, 2018
@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

Thanks

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

No problem.

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

@Frenzie I tested again (submodule updated). And again it seems it doesn't work for full download Reuters and full download PCWorld.

Maybe I'm doing something wrong - I did fetch upstream(oryginal koreader), merge latest koreader master to my local master and update submodules?

@Frenzie could you check it?

It does create files with correct names (it did so before, as we've title in feeds first xml we download), but when you open it you have "301 Moved" (for Reuters) and "400 bad request" (for PC World).
Maybe you'd just checked if the files are created, without opening them?

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

Oh, my bad. That's a bug in this version of NewsDownloader. It should only create files when the status returned is 200. I believe that was fixed in the code as it was?

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

But the code was to handle prepare for change download backend, and in addition I added http 301 handling, and just at the end I added better exception handling.

But I thought you mean that after submodule bump it should also handle 301 and other (non 200, e.g 400) statuses, in old code without my extra changes for 301?

That's what i assumed from :

Oh, I just fixed both of those in #3644 :-P

I thought that's why you wanted to revert my changes handling 301 statuses?

ok, should I revert internaldownloadbackend to inital extract: (...)

Sure!

Ok so I'll pr for handling 301 again

@mwoz123
Copy link
Contributor Author

mwoz123 commented Jan 31, 2018

Done please merge . #3652

@Frenzie
Copy link
Member

Frenzie commented Jan 31, 2018

I thought it was fixed because like you said it wrote the files. It didn't write any files previously for me; it just said there was a problem with the feed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants