
Add back the tumblr Googlebot UA instruction now that it works again; bump Firefox UA
ivan committed Dec 16, 2018
1 parent d5b7aa6 commit a9727902cd5e126c6d2cb8453dc10b9298cd76d2
Showing with 7 additions and 4 deletions.
  1. +5 −2 README.md
  2. +1 −1 libgrabsite/__init__.py
  3. +1 −1 libgrabsite/main.py
@@ -393,8 +393,11 @@ or by using https://archive.is/ instead of grab-site.

#### Tumblr blogs

-Don't crawl from Europe: tumblr redirects to a GDPR `/privacy/consent` page and
-the `Googlebot` user agent override no longer has any effect.
+Either don't crawl from Europe (because tumblr redirects to a GDPR `/privacy/consent` page), or add `Googlebot` to the user agent:
+
+```
+--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0 but not really nor Googlebot/2.1"
+```

Use [`--igsets=singletumblr`](https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/singletumblr)
to avoid crawling the homepages of other tumblr blogs.
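
As a rough illustration of the README instructions in this hunk, the sketch below assembles a `grab-site` command line that appends the Googlebot suffix to the new default Firefox UA and enables the `singletumblr` ignore set. The helper name `tumblr_grab_site_argv` and the URL `https://example.tumblr.com/` are hypothetical and not part of grab-site; only `--ua`, `--igsets=singletumblr`, and the UA strings come from the commit.

```python
import shlex

# Default UA as set in libgrabsite/main.py by this commit.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) "
    "Gecko/20100101 Firefox/64.0"
)

# Suffix from the README example that gets "Googlebot" into the UA string.
GOOGLEBOT_SUFFIX = " but not really nor Googlebot/2.1"


def tumblr_grab_site_argv(blog_url, base_ua=DEFAULT_UA):
    """Hypothetical helper: build an argv list for crawling one tumblr blog."""
    return [
        "grab-site",
        "--ua", base_ua + GOOGLEBOT_SUFFIX,
        "--igsets=singletumblr",
        blog_url,
    ]


if __name__ == "__main__":
    # example.tumblr.com is a placeholder, not a real target.
    print(shlex.join(tumblr_grab_site_argv("https://example.tumblr.com/")))
```

Building the argument list rather than a single string sidesteps shell quoting problems with the spaces and parentheses in the UA; `shlex.join` (Python 3.8+) is used only to print a copy-pasteable form.
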
@@ -1 +1 @@
-__version__ = '2.1.11'
+__version__ = '2.1.12'
@@ -109,7 +109,7 @@ def is_multicast(text):
'Try to limit each WARC file to around BYTES bytes before rolling over '
'to a new WARC file (default: 5368709120, which is 5GiB).')

-@click.option('--ua', default="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0",
+@click.option('--ua', default="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0",
metavar='STRING', help='Send User-Agent: STRING instead of pretending to be Firefox on Windows.')

@click.option('--wpull-args', default="",
