
Add LIFO queue option for recursive download #1

Closed
wants to merge 1 commit

Conversation

john-peterson (Member)

basic problem

the basic problem is that the FIFO queue can create a long delay between downloading a page and downloading the links in it. this is different from the browser behavior the page is designed for, resulting in wget failures that a browser user doesn't experience

savannah link

this patch is also posted at https://savannah.gnu.org/bugs/?37581

making it optional

To get your patch into git please add a command-line option to activate LIFO behavior.

ok, the patch is updated here: #1

the patch file is https://github.com/mirror/wget/pull/1.patch

reason to place html pages at the top of the queue

if ll_bubblesort isn't used, only the deepest-level links are downloaded directly after their parent page, despite using LIFO
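
to illustrate (a sketch only, not the patch code; the struct and field names are made up, not wget's actual child-list types), reordering a child link list so that html links come first could look roughly like this:

/* sketch: move html links to the front of a singly linked child list so
   they are enqueued first and therefore end up deepest in a LIFO queue.
   made-up types, not wget's actual ones */
#include <stdbool.h>

struct child_link {
  const char *url;
  bool is_html;                 /* made-up flag: link is expected to be html */
  struct child_link *next;
};

static void
sort_html_first (struct child_link **head)
{
  bool swapped = true;
  while (swapped)
    {
      swapped = false;
      for (struct child_link **p = head; *p && (*p)->next; p = &(*p)->next)
        {
          struct child_link *a = *p, *b = a->next;
          if (!a->is_html && b->is_html)
            {
              /* bubble the non-html link past the html link */
              a->next = b->next;
              b->next = a;
              *p = b;
              swapped = true;
            }
        }
    }
}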

alternative solution

enqueuing children directly after their parent seems difficult

another solution is to enqueue the depth n+1 links directly after enqueuing their parent depth n link, instead of continuing to enqueue depth n links

this requires interrupting the depth n enqueue at html links, dequeuing everything (including the html link), enqueuing the depth n+1 links, and then continuing the depth n enqueue. this requires a big reorganization or doesn't make sense

a way to do this could be to store the non-enqueued links in a temporary queue and enqueue them after everything else

the LIFO solution is better than this solution because

  • it's simpler code
  • the alternative's only benefit is small: it would download html pages from top to bottom instead of in an arbitrary order (the sort places html pages at the top of the queue in an arbitrary order)

enqueue html last doesn't work

keeping FIFO and enqueuing html links last (with sort) doesn't solve the problem because all depth n links are still downloaded before any depth n+1 links

test case description

I am not sure why you expect that all the resources from 60 "branches" can be downloaded in less than 60s when the "branches" itself can't.

i don't mean that all resources can be downloaded fast, just that they are downloaded directly after the page that contains them

the example is an image hosting site (imagevenue.com) where every image has its own html page (imagevenue.com/img.php) containing a generated image link that expires a while after the html page is generated, to prevent direct links to the image files

all links can be downloaded with lifo because each branch page has only 1 link in this example, and there's more than enough time to download that 1 link if the download begins directly after the link is generated

if a branch page (e.g. imagevenue.com/img.php) had many images (links), there could still be a problem. but the problem would be the same for regular users (browsers) that download the resources directly after the page is loaded, so the fault is the site's rather than wget's

test

imagevenue fail

this fails to download the imagevenue.com/img.php images because it's downloading all the img.php pages before the temporary image links in them, and by the time it gets to them they're expired

wget -rHpE -l1 -t2 -T10 -np -nc -nH -nd -e robots=off -D'imagevenue.com' -R'th_*.jpg,th_*.JPG,.gif,.png,.css,.js' http://forum.glam0ur.com/hot-babe-galleries/11956-merilyn-sekova-aka-busty-merilyn.html

this downloads images directly after an img.php page is downloaded, so they don't have time to expire

wget -rHpE -l1 -t2 -T10 -np -nc -nH -nd --queue-type=lifo -e robots=off -D'imagevenue.com' -R'th_*.jpg,th_*.JPG,.gif,.png,.css,.js' http://forum.glam0ur.com/hot-babe-galleries/11956-merilyn-sekova-aka-busty-merilyn.html

invalid input

invalid input is rejected:

wget --queue-type=fiffo

wget: --queue-type: Invalid value ‘fiffo’.
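
for reference, the validation amounts to matching the value against a fixed table of accepted names, roughly as in this sketch (simplified, made-up function names, not the exact patch code):

/* sketch: validate a --queue-type value against a table of accepted names.
   simplified, not the exact patch code */
#include <stdio.h>
#include <string.h>

enum queue_type { queue_type_fifo, queue_type_lifo };

static const struct { const char *name; enum queue_type value; } choices[] = {
  { "fifo", queue_type_fifo },
  { "lifo", queue_type_lifo },
};

static int
decode_queue_type (const char *val, enum queue_type *place)
{
  for (size_t i = 0; i < sizeof choices / sizeof choices[0]; i++)
    if (strcmp (val, choices[i].name) == 0)
      {
        *place = choices[i].value;
        return 1;
      }
  return 0;                     /* unknown value, report an error */
}

int
main (void)
{
  enum queue_type qt;
  if (!decode_queue_type ("fiffo", &qt))
    fprintf (stderr, "wget: --queue-type: Invalid value 'fiffo'.\n");
  return 0;
}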

download order

this test shows the FIFO and LIFO download order

i created this local site:

$ tree
.
├── a
│   ├── a
│   │   ├── a-a-x.jpg
│   │   ├── a-a-y.jpg
│   │   └── a-a.html
│   ├── a-x.jpg
│   ├── a-y.jpg
│   ├── a.html
│   └── b
│       ├── a-b-x.jpg
│       ├── a-b-y.jpg
│       └── a-b.html
├── b
│   ├── a
│   │   ├── b-a-x.jpg
│   │   ├── b-a-y.jpg
│   │   └── b-a.html
│   ├── b
│   │   ├── b-b-x.jpg
│   │   ├── b-b-y.jpg
│   │   └── b-b.html
│   ├── b-x.jpg
│   ├── b-y.jpg
│   └── b.html
├── i.html
├── x.jpg
└── y.jpg

6 directories, 21 files

i.html

<a href="a/a.html"><img src="x.jpg"></a>
<a href="b/b.html"><img src="y.jpg"></a>

a.html

<a href="a/a-a.html"><img src="a-x.jpg"></a>
<a href="b/a-b.html"><img src="a-y.jpg"></a>

a-a.html

<img src="a-a-x.jpg">
<img src="a-a-y.jpg">

a-b.html

<img src="a-b-x.jpg">
<img src="a-b-y.jpg">

b.html

<a href="a/b-a.html"><img src="b-x.jpg"></a>
<a href="b/b-b.html"><img src="b-y.jpg"></a>

b-a.html

<img src="b-a-x.jpg">
<img src="b-a-y.jpg">

b-b.html

<img src="b-b-x.jpg">
<img src="b-b-y.jpg">

FIFO downloads links long after their parent page, especially the deepest-level links

wget -vdrp -nd http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Enqueuing|Dequeuing|Saving to"

Enqueuing http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Enqueuing http://localhost/code/html/test/download/a/a.html at depth 1
Enqueuing http://localhost/code/html/test/download/x.jpg at depth 1
Enqueuing http://localhost/code/html/test/download/b/b.html at depth 1
Enqueuing http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Enqueuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Enqueuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’

LIFO downloads links directly after their parent page

wget -vdrp -nd --queue-type=lifo http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Enqueuing|Dequeuing|Saving to"

Enqueuing http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Enqueuing http://localhost/code/html/test/download/a/a.html at depth 1
Enqueuing http://localhost/code/html/test/download/b/b.html at depth 1
Enqueuing http://localhost/code/html/test/download/x.jpg at depth 1
Enqueuing http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Enqueuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Enqueuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’

john-peterson changed the title from "Fixing recursion resource expiration problem" to "Add LIFO queue option for recursive download" on Jan 2, 2015
rockdaboot (Contributor)

Nice, could you just add these two little changes:
(one fixes a warning for me, the other is a show-stopper which prevents generating the docs here)
After amending, could you post your suggestion to the bug-wget@gnu.org mailing list? A short explanation + a link to this page should be ok. Most people there don't mess with the Savannah bug tracker.

diff --git a/doc/wget.texi b/doc/wget.texi
index 67f74ba..a981fd2 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1916,7 +1916,7 @@ case.
Turn on recursive retrieving. @Xref{Recursive Download}, for more
details. The default maximum depth is 5.

-@itemx --queue-type=@var{queuetype}
+@item --queue-type=@var{queuetype}
Specify the queue type (@pxref{Recursive Download}). Accepted values are @samp{fifo} (the default)
and @samp{lifo}.

diff --git a/src/init.c b/src/init.c
index cd17f98..71b1203 100644
--- a/src/init.c
+++ b/src/init.c
@@ -1448,7 +1448,7 @@ cmd_spec_recursive (const char *com, const char *val, void *place_ignored _GL_UN
/*
   Validate --queue-type and set the choice. */

static bool
-cmd_spec_queue_type (const char *com, const char *val, void *place_ignored)
+cmd_spec_queue_type (const char *com, const char *val, void *place_ignored _GL_UNUSED)
{
static const struct decode_item choices[] = {
{ "fifo", queue_type_fifo },

john-peterson (Member, Author)

the other is a show-stopper which prevents generating the docs here

how do i generate the docs to detect that error? this command doesn't show any error about that

(cd doc; make)

rockdaboot (Contributor)

Normally, this will be done automatically by 'make'.
Maybe something is missing on your installation (e.g. pod2man, textinfo, makeinfo) so the creation is skipped ?
The error was:
wget.texi:1919: @itemx must follow @item
Makefile:1346: recipe for target 'wget.info' failed
make[2]: *** [wget.info] Error 1

john-peterson (Member, Author)

Normally, this will be done automatically by 'make'.

is there a make target for that, like make docs, that does something different from (cd doc; make)?

Maybe something is missing on your installation (e.g. pod2man, textinfo, makeinfo) so the creation is skipped ?

there's nothing about texinfo in config.log. this is the makeinfo and pod2man output:

configure:38656: checking for makeinfo
configure:38683: result: ${SHELL} /d/repo/wget/build-aux/missing --run makeinfo

configure:38745: checking for pod2man
configure:38763: found /usr/bin/pod2man
configure:38776: result: /usr/bin/pod2man

am i supposed to run that command to get more info?

$ /d/repo/wget/build-aux/missing --run makeinfo
makeinfo: missing file argument.
Try `makeinfo --help' for more information.

makeinfo is 4.13

$ makeinfo --version
makeinfo (GNU texinfo) 4.13

rockdaboot (Contributor)

'textinfo' is a typo, should be texinfo ;-)
'cd doc; make clean; make' should output

test -z "wget.dvi wget.pdf wget.ps wget.html"
|| rm -rf wget.dvi wget.pdf wget.ps wget.html
test -z "*~ *.bak *.cat *.pod" || rm -f *~ *.bak *.cat *.pod
rm -rf wget.t2d wget.t2p
rm -f vti.tmp
oms@blitz-lx:~/src/wget/doc$ make
./texi2pod.pl -D VERSION="1.16.1.36-8238-dirty" ./wget.texi wget.pod
/usr/bin/pod2man --center="GNU Wget" --release="GNU Wget 1.16.1.36-8238-dirty" wget.pod > wget.1

So maybe it is this ./texi2pod.pl working differently here (or for you)?

john-peterson (Member, Author)

i get the error now. not sure why i didn't get it before, maybe because i didn't do make clean

(cd doc; make clean; make)

../../doc/wget.texi:1919: @itemx must follow @item
Makefile:1346: recipe for target `../../doc/wget.info' failed
make: *** [../../doc/wget.info] Error 1

john-peterson (Member, Author)

email

After amending, could you post your suggestion to bug-wget@gnu.org mailing list ? A short explanation + a link to this page should be ok.

ok, email sent

feedback wanted for this patch #1

john-peterson (Member, Author)

basic problem

as I understand your aim, you want Wget behave a bit more like a browser in respect to downloading. This means after downloading the first HTML page, first download non-HTML links (mainly images), second HTML pages.

yes

depth doesn't matter

I don't see a reason why the 'deepness' of those HTML pages should matter when queuing. Since a user doesn't know how deep the link is that he clicks on.

yup, depth doesn't matter

alternative solution

enqueue html last isn't enough

This leads to a queuing without sorting: put the HTML links at the bottom and the non-HTML links to the top. This would lead to a download order that you documented under 'lifo download links directly after its parent page'.

keeping FIFO and enqueuing html links last (with sort) isn't enough because all depth n links are still downloaded before any depth n+1 links

FIFO enqueue html last ≠ LIFO enqueue html first

john-peterson (Member, Author)

enqueue html last isn't enough

This is not what I said. I said: enqueue html last + enqueue non-html first

This basically the same as having two queues: one for HTML and one for non-HTML. non-HTML working as LIFO, always picked before HTML. If empty, pick from HTML queue (FIFO).

show it with code because i don't understand

the current FIFO code is:

while (1)
    // FIFO
    url_dequeue

    if (descend)
        for (; child; child = child->next)
            url_enqueue

the LIFO solution is:

while (1)
    // LIFO
    url_dequeue

    if (descend) {
        // place html pages on top
        ll_bubblesort(&child);
        for (; child; child = child->next)
            url_enqueue
    }
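
for comparison, the two-queue idea quoted above (a non-html LIFO that is always drained first, and a html FIFO used when it is empty) could be sketched like this. it's only an illustration with made-up names, not code from the patch:

/* sketch: non-html links go on a LIFO stack that is always drained first,
   html links go in a FIFO queue that is used when the stack is empty.
   made-up names, not code from the patch */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct node { const char *url; struct node *next; };

static struct node *non_html_top;             /* LIFO stack */
static struct node *html_head, *html_tail;    /* FIFO queue */

static void
enqueue_url (const char *url, bool is_html)
{
  struct node *n = malloc (sizeof *n);
  n->url = url;
  n->next = NULL;
  if (!is_html)
    {
      n->next = non_html_top;                 /* push on the stack */
      non_html_top = n;
    }
  else
    {
      if (html_tail)                          /* append to the queue */
        html_tail->next = n;
      else
        html_head = n;
      html_tail = n;
    }
}

static const char *
dequeue_url (void)
{
  /* always prefer pending non-html resources, fall back to html pages */
  struct node **from = non_html_top ? &non_html_top : &html_head;
  struct node *n = *from;
  if (!n)
    return NULL;
  *from = n->next;
  if (n == html_tail)
    html_tail = NULL;
  const char *url = n->url;
  free (n);
  return url;
}

int
main (void)
{
  enqueue_url ("a.html", true);
  enqueue_url ("x.jpg", false);
  enqueue_url ("b.html", true);
  enqueue_url ("y.jpg", false);
  for (const char *u; (u = dequeue_url ()); )
    printf ("download %s\n", u);              /* y.jpg, x.jpg, a.html, b.html */
  return 0;
}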

the LIFO solution can fix the problem that links expire before they're dequeued for download

the result of using LIFO instead of FIFO is that links are downloaded immediately after the page they're in, instead of after all other queued links have been downloaded, which can take a considerable time

Test case: the download targets are all deepest (depth 2) links. They expire a while after their parent depth 1 page is downloaded. The FIFO queue downloads all depth 1 pages before downloading any depth 2 links. This takes so long that the depth 2 links expire before they're dequeued for download
john-peterson (Member, Author)

closed in favor of #2
