
wget.sh generated but nothing follows #6

Closed
briatte opened this issue Oct 7, 2015 · 22 comments

@briatte

briatte commented Oct 7, 2015

Hi,

Would you mind adding some notes on how to troubleshoot the script?

I'm trying to download this list with the following parameters:

export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate"

The next commands then generate the wget.sh file and try to run it, but the generated file does not seem to do anything when run:

./crawler.sh -sh > wget.sh
bash wget.sh

Thanks in advance for any pointers. The wget.sh file I get is copied below.

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:---no-check-certificate}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
@icy
Owner

icy commented Oct 7, 2015

Would you mind adding some notes on how to troubleshoot the script?

I will. Basically, it's about adding a wget option to get more verbose messages.

The next commands then generate the wget.sh file and try to run it, but the generated file does not seem to do anything when run:

What OS are you running? Do you have any output from the command ./crawler.sh -sh > wget.sh?

I've tried running it (exactly as you did, except that I don't need export _WGET_OPTIONS="--no-check-certificate" on my Arch Linux machine), and I get a good result (as below).

I suggest you remove the temporary directory (the ggplot2 directory in the place where you ran the crawler.sh command) and start again. You may also record all logs for future debugging (crawler.sh > test.log 2>&1).
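
As a minimal sketch of that kind of debugging setup (assuming _WGET_OPTIONS is passed verbatim to every wget call, as in the generated wget.sh, and that your wget build supports --debug):

# Make wget much more talkative and keep full logs for later inspection.
export _GROUP="ggplot2"
export _WGET_OPTIONS="--no-check-certificate --debug"

# stdout is the generated script; stderr carries the crawler's progress messages.
./crawler.sh -sh > wget.sh 2> crawler.log

# Run the generated script and keep its output too.
bash wget.sh > wget-run.log 2>&1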

Hope this helps

Result on my machine

#!/usr/bin/env bash

export _GROUP="${_GROUP:-ggplot2}"
export _D_OUTPUT="${_D_OUTPUT:-./ggplot2/}"
export _USER_AGENT="${_USER_AGENT:-Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0}"
export _WGET_OPTIONS="${_WGET_OPTIONS:-}"

__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}
__wget__ "./ggplot2//mbox/m.0cgvmtmwmac.kmLcl5JnAwAJ" \
  "https://groups.google.com/forum/message/raw?msg=ggplot2/0cgvmtmwmac/kmLcl5JnAwAJ"
__wget__ "./ggplot2//mbox/m.40Qd5d_OTpg.8Cw2WxXsGgAJ" \ 
  "https://groups.google.com/forum/message/raw?msg=ggplot2/40Qd5d_OTpg/8Cw2WxXsGgAJ"

## a lot more commands

@briatte
Author

briatte commented Oct 7, 2015

I'm running Mac OS X 10.9.5, and here's the requested output:

:: Creating './ggplot2//threads/t.0' with 'forum/ggplot2'
:: Fetching data from 'https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2'...
--2015-10-07 17:03:37--  https://groups.google.com/forum/?_escaped_fragment_=forum/ggplot2
Resolving groups.google.com... 64.233.166.139, 64.233.166.101, 64.233.166.138, ...
Connecting to groups.google.com|64.233.166.139|:443... connected.
WARNING: cannot verify groups.google.com's certificate, issued by '/C=US/O=Google Inc/CN=Google Internet Authority G2':
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'STDOUT'

    [ <=>                                   ] 6,188       --.-K/s   in 0.008s  

2015-10-07 17:03:38 (714 KB/s) - written to stdout [6188]

cat: ./ggplot2//msgs/m.*: No such file or directory

Anything weird in that output?

I have tried refreshing the ggplot2 folder completely, to no avail.

@briatte
Author

briatte commented Oct 7, 2015

A few more details about my configuration:

  • GNU Wget 1.15 built on darwin13.1.0.
  • awk version 20070501

(I had to install wget through homebrew.)

@briatte
Author

briatte commented Oct 7, 2015

Okay, just ran your script with Xubuntu, and it works fine.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I supposed that those that got scraped are the most recent ones.

Thanks again for your help!

@icy
Owner

icy commented Oct 7, 2015

Okay, just ran your script with Xubuntu, and it works fine.

Perfect. I don't have a Mac to test on; I'll ask someone to help improve the script.

Last question: what do I need to set to scrape old messages? The default settings seem to have scraped only a tiny fraction of the emails, and I supposed that those that got scraped are the most recent ones.

By default, crawler.sh will fetch all threads and messages from your Google Groups archive. When you use the -rss option (as in crawler.sh -rss), it will read the group's Atom feed for the latest messages.

For example, I can fetch a 4-year archive of my group (http://l.archlinuxvn.org/archlinuxvn/). After I fetch all messages, I only need to run crawler.sh -rss once every hour to keep an exact mirror of my group.
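
For instance, that hourly refresh could be scheduled with a crontab entry along these lines (a sketch only; /path/to/mirror and the group name are placeholders):

# Hypothetical crontab line: pull the latest messages from the group's Atom feed every hour.
0 * * * * cd /path/to/mirror && _GROUP="archlinuxvn" ./crawler.sh -rss >> rss.log 2>&1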

@briatte
Author

briatte commented Oct 7, 2015

Hmm, I have run the following commands, and my mbox folder has only 95 messages… Is Google limiting the number of messages that I can retrieve?

export _GROUP="ggplot2"
./crawler.sh -sh > wget.sh
bash wget.sh

Similarly, I get only one file in threads/, called t.0, with only 23 lines.

Sorry if my questions are very basic. I'm struggling to understand how this all works.

@icy
Owner

icy commented Oct 7, 2015

Sorry if my questions are very basic. I'm struggling to understand how this all works.

Let me check. There may be something wrong with the script!

@icy icy added the bug label Oct 7, 2015
icy added a commit that referenced this issue Oct 7, 2015
icy added a commit that referenced this issue Oct 7, 2015
@icy
Owner

icy commented Oct 7, 2015

I've fixed the regular expression issue in the last two commits. Please try to run wget -sh again (you don't need to remove the current temporary directory.)

Thanks a lot!

@icy icy self-assigned this Oct 7, 2015
@briatte
Author

briatte commented Oct 7, 2015

The scraper has been running for some time now, and everything seems to be all right with crawler.sh. I have not yet tested wget.sh, but I expect it will run fine.

Thanks a lot!

@briatte briatte closed this as completed Oct 7, 2015
@icy
Owner

icy commented Oct 7, 2015

Ah, my bad: it's not wget -sh; what I meant was crawler.sh.

Thanks again for your patience. I'll reopen this ticket because there is still a problem with Mac support.

@icy icy reopened this Oct 7, 2015
@briatte
Author

briatte commented Oct 7, 2015

As far as I can tell, it's not your fault: it must have to do with the versions of sed / awk / bash / wget that are installed as part of Mac OS X. My best guess is that the issue is either with awk or with wget.

Also note that I am using Mac OS X 10.9.5, which is quite old by now (the current OS X release is 10.11).

What versions of awk and wget are you running?

@icy
Owner

icy commented Oct 8, 2015

I understand.

My versions are GNU awk 4.1.3 and GNU wget 1.16.3. It's possibly the order of options that matters. (Similar issue: icy/pacapt#59.)

@icy icy added the enhancement label Oct 8, 2015
@cuonglm
Contributor

cuonglm commented Oct 13, 2015

@icy @briatte: I bet that it's not an awk problem.

awk '{print $NF}' works in all known awk variants, including oawk from the Heirloom Toolchest and Brian Kernighan's own awk.

@icy
Owner

icy commented Oct 13, 2015

@Gnouc I thought it was due to a wget issue. I used the -O (output) option at the end of the argument list, as below:

wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";

As I recall, that won't work on a FreeBSD system. It's similar to what you said about the grep foo -q issue in the pacapt project.
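
If the option order really were the culprit, a possible workaround (just a sketch; with GNU wget the two orderings behave the same) would be to move -O before the URL in the generated __wget__ function:

__wget__ ()
{
    if [[ ! -f "$1" ]]; then
        # Same call as before, but with the output option placed before the URL,
        # in case a stricter option parser stops at the first non-option argument.
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS -O "$1" "$2";
        __wget_hook "$1" "$2";
    fi
}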

@cuonglm
Contributor

cuonglm commented Oct 13, 2015

@icy: If you use GNU tools, you're fine with that.

wget google.com -O /tmp/test works fine on my FreeBSD 11.

@cmpitg
Contributor

cmpitg commented Oct 13, 2015

Confirmed working on FreeBSD 10.2 as well, with GNU wget from FreshPorts.

@icy
Owner

icy commented Oct 13, 2015

Thanks @Gnouc and @cmpitg (happy to see you again ;))

@cmpitg
Contributor

cmpitg commented Oct 18, 2015

Me too :-).

@luk4hn
Contributor

luk4hn commented Dec 15, 2015

@icy: I just did a quick test on OS X 10.10.5.
The problem is that the BSD sed version on OS X doesn't interpret \n as a newline, so it breaks the _links_dump() function.
Replacing sed -e "s#['\"]#\n#g" with tr "['\"]" "[\r\n]" worked.

@cuonglm
Contributor

cuonglm commented Dec 15, 2015

@luk4hn What's the point of "[\r\n]"? It will replace ' with \r and " with \n.

If you're worried about the newline character on OS X, then insert it literally:

sed -e "s#['\"]#\
#g"

or use bash quoting $'\n'.
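
Spelled out, that $'\n' variant could look like this (a sketch of just the substitution; the rest of the pipeline in _links_dump() stays unchanged):

# bash expands $'\n' to a real newline before sed sees it, so both GNU and BSD sed
# receive a backslash-escaped literal newline in the replacement text.
sed -e "s#['\"]#\\"$'\n'"#g"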

@luk4hn
Contributor

luk4hn commented Dec 15, 2015

@Gnouc: Hehe, I was just trying to point out the problem.
Thank you for the bash quoting 👍

@icy icy assigned icy and unassigned icy Dec 15, 2015
@icy
Owner

icy commented Dec 16, 2015

@luk4hn Can you please send a pull request?

@icy icy closed this as completed in b2058a6 Dec 19, 2015
icy added a commit that referenced this issue Dec 19, 2015
Fix #6: pass \n to sed as ANSI-C quoting

The BSD version of sed won't interpret '\n' as a newline character.
Passing '\n' to sed via ANSI-C quoting avoids this problem.