
wget.sh generated but nothing follows #6

Closed
briatte opened this issue Oct 7, 2015 · 22 comments

@briatte

briatte commented Oct 7, 2015

Hi,

Would you mind adding some notes on how to troubleshoot the script?

I'm trying to download this list with the following parameters:

export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate"

The next commands then generate the wget.sh file and try to run it, but the generated file does not seem to do anything when run:

./crawler.sh -sh > wget.sh
bash wget.sh

Thanks in advance for any pointers. The wget.sh file I get is copied below.

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:---no-check-certificate}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
@icy
Owner

icy commented Oct 7, 2015

Would you mind adding some notes on how to troubleshoot the script?

I will. Basically, it's about adding a wget option to get more verbose messages.

The next commands then generate the wget.sh file and try to run it, but the generated file does not seem to do anything when run:

What OS are you running? Do you have any output from the command ./crawler.sh -sh > wget.sh?

I've tried running it (exactly as you did, except that I don't need export _WGET_OPTIONS="--no-check-certificate" on my Arch Linux machine), and I get a good result (as below).

I suggest you remove the temporary directory (the ggplot2 directory in the place where you ran the crawler.sh command) and start again. You may also record all logs for future debugging (crawler.sh > test.log 2>&1).
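
As a minimal sketch of that kind of debugging setup (assuming _WGET_OPTIONS is passed verbatim to every wget call, as in the generated wget.sh, and that your wget build supports --debug):

# Make wget much more talkative and keep full logs for later inspection.
export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate --debug"

# stdout is the generated script; stderr carries the crawler's progress messages.
./crawler.sh -sh > wget.sh 2> crawler.log

# Run the generated script and keep its output too.
bash wget.sh > wget-run.log 2>&1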

Hope this helps

Result on my machine

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:-}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
__wget__ "./ggplot2//mbox/m.0cgvmtmwmac.kmLcl5JnAwAJ" \
  "https://groups.google.com/forum/message/raw?msg=ggplot2/0cgvmtmwmac/kmLcl5JnAwAJ"
__wget__ "./ggplot2//mbox/m.40Qd5d_OTpg.8Cw2WxXsGgAJ" \ 
  "https://groups.google.com/forum/message/raw?msg=ggplot2/40Qd5d_OTpg/8Cw2WxXsGgAJ"

## a lot more commands

@briatte
Author

briatte commented Oct 7, 2015

I'm running Mac OS X 10.9.5, and here's the requested output:

:: Creating './ggplot2//threads/t.0' with 'forum/ggplot2'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2'...
--2015-10-07 17:03:37--  https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2
Resolving groups.google.com... 64.233.166.139, 64.233.166.101, 64.233.166.138, ...
Connecting to groups.google.com|64.233.166.139|:443... connected.
WARNING: cannot verify groups.google.com's certificate, issued by '/C=US/O=Google Inc/CN=Google Internet Authority G2':
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'STDOUT'

    [ <=>                                   ] 6,188       --.-K/s   in 0.008s  

2015-10-07 17:03:38 (714 KB/s) - written to stdout [6188]

cat: ./ggplot2//msgs/m.*: No such file or directory

Anything weird in that output?

I have tried refreshing the ggplot2 folder completely, to no avail.

@briatte
Author

briatte commented Oct 7, 2015

A few more details about my configuration:

  • GNU Wget 1.15 built on darwin13.1.0.
  • awk version 20070501

(I had to install wget through homebrew.)

@briatte
Author

briatte commented Oct 7, 2015

Okay, just ran your script with Xubuntu, and it works fine.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I supposed that those that got scraped are the most recent ones.

Thanks again for your help!

@icy
Owner

icy commented Oct 7, 2015

Okay, just ran your script with Xubuntu, and it works fine.

Perfect. I don't have a Mac to test on; I'll ask someone to help improve the script.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I supposed that those that got scraped are the most recent ones.

By default, crawler.sh will fetch all threads and messages from your Google Groups archive. When you use the -rss option (as in crawler.sh -rss), it will read the group's Atom feed for the latest messages.

For example, I can fetch a 4-year archive of my group (http://l.archlinuxvn.org/archlinuxvn/). After I fetch all messages, I only need to run crawler.sh -rss once every hour to keep an exact mirror of my group.
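
For instance, that hourly refresh could be scheduled with a crontab entry along these lines (a sketch only; /path/to/mirror and the group name are placeholders):

# Hypothetical crontab line: pull the latest messages from the group's Atom feed every hour.
0 * * * * cd /path/to/mirror && _GROUP="archlinuxvn" ./crawler.sh -rss >> rss.log 2>&1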

@briatte
Author

briatte commented Oct 7, 2015

Hmm, I have run the following commands, and my mbox folder has only 95 messages… Is Google limiting the number of messages that I can retrieve?

export _GROUP="ggplot2"
./crawler.sh -sh > wget.sh
bash wget.sh

Similarly, I get only one file in threads/, called t.0, with only 23 lines.

Sorry if my questions are very basic. I'm struggling to understand how this all works.

@icy
Owner

icy commented Oct 7, 2015

Sorry if my questions are very basic. I'm struggling to understand how this all works.

Let me check. There may be something wrong with the script!

@icy icy added the bug label Oct 7, 2015
icy added a commit that referenced this issue Oct 7, 2015
icy added a commit that referenced this issue Oct 7, 2015
@icy
Owner

icy commented Oct 7, 2015

I've fixed the regular expression issue in the last two commits. Please try to run wget -sh again (you don't need to remove the current temporary directory.)

Thanks a lot!

@icy icy self-assigned this Oct 7, 2015
@briatte
Author

briatte commented Oct 7, 2015

The scraper has been running for some time now, and everything seems to be all right with crawler.sh. I have not yet tested wget.sh, but I expect it will run fine.

Thanks a lot!

@briatte briatte closed this as completed Oct 7, 2015
@icy
Owner

icy commented Oct 7, 2015

Ah, my bad: it's not wget -sh; what I meant was crawler.sh.

Thanks again for your patience. I'll reopen this ticket because there is still a problem with Mac support.

@icy icy reopened this Oct 7, 2015
@briatte
Author

briatte commented Oct 7, 2015

As far as I can tell, it's not your fault: it must have to do with the versions of sed / awk / bash / wget that are installed as part of Mac OS X. My best guess is that the issue is either with awk or with wget.

Also note that I am using Mac OS X 10.9.5, which is quite old by now (the current OS X release is 10.11).

What versions of awk and wget are you running?

@icy
Owner

icy commented Oct 8, 2015

I understand.

My versions are GNU awk 4.1.3 and GNU wget 1.16.3. It's possibly the order of options that matters. (Similar issue: icy/pacapt#59.)

@icy icy added the enhancement label Oct 8, 2015
@cuonglm
Contributor

cuonglm commented Oct 13, 2015

@icy @briatte: I bet that it's not an awk problem.

awk '{print $NF}' works in all known awk variants, including oawk from the Heirloom Toolchest and Brian Kernighan's own awk.

@icy
Owner

icy commented Oct 13, 2015

@Gnouc I thought it was due to a wget issue. I used the -O (output) option at the end of the argument list, as below:

wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";

As I recall, that won't work on a FreeBSD system. It's similar to what you said about the grep foo -q issue in the pacapt project.
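
If the option order really were the culprit, a possible workaround (just a sketch; with GNU wget the two orderings behave the same) would be to move -O before the URL in the generated __wget__ function:

__wget__ ()
{
    if [[ ! -f "$1" ]]; then
        # Same call as before, but with the output option placed before the URL,
        # in case a stricter option parser stops at the first non-option argument.
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS -O "$1" "$2";
        __wget_hook "$1" "$2";
    fi
}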

@cuonglm
Contributor

cuonglm commented Oct 13, 2015

@icy: If you use GNU tools, you're fine with that.

wget google.com -O /tmp/test works fine on my FreeBSD 11.

@cmpitg
Contributor

cmpitg commented Oct 13, 2015

Confirmed working on FreeBSD 10.2 as well, with GNU wget from FreshPorts.

@icy
Owner

icy commented Oct 13, 2015

Thanks @Gnouc and @cmpitg (happy to see you again ;))

@cmpitg
Contributor

cmpitg commented Oct 18, 2015

Me too :-).

@luk4hn
Contributor

luk4hn commented Dec 15, 2015

@icy: I just did a quick test on OS X 10.10.5.
The problem is that the BSD sed version on OS X doesn't interpret \n as a newline, so it breaks the _links_dump() function.
Replacing sed -e "s#['\"]#\n#g" with tr "['\"]" "[\r\n]" worked.

@cuonglm
Contributor

cuonglm commented Dec 15, 2015

@luk4hn What's the point of "[\r\n]"? It will replace ' with \r and " with \n.

If you're worried about the newline character on OS X, then insert it literally:

sed -e "s#['\"]#\
#g"

or use bash quoting $'\n'.
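
Spelled out, that $'\n' variant could look like this (a sketch of just the substitution; the rest of the pipeline in _links_dump() stays unchanged):

# bash expands $'\n' to a real newline before sed sees it, so both GNU and BSD sed
# receive a backslash-escaped literal newline in the replacement text.
sed -e "s#['\"]#\\"$'\n'"#g"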

@luk4hn
Contributor

luk4hn commented Dec 15, 2015

@Gnouc: Hehe, I was just trying to point out the problem.
Thank you for the bash quoting 👍

@icy icy assigned icy and unassigned icy Dec 15, 2015
@icy
Owner

icy commented Dec 16, 2015

@luk4hn Can you please send a pull request?

@icy icy closed this as completed in b2058a6 Dec 19, 2015
icy added a commit that referenced this issue Dec 19, 2015
Fix #6: pass \n to sed as ANSI-C quoting

The BSD version of sed won't interpret '\n' as a newline character.
Passing '\n' to sed via ANSI-C quoting avoids this problem.