tumblr archives may not play back properly #126

Open
RomeSilvanus opened this issue Aug 3, 2018 · 9 comments
Comments

@RomeSilvanus

I hope it is okay if I reopen this here and put three issues in it at once!

Coming from my old issue #94

1.) t.umblr.com redirect

I tried this once again and the t.umblr.com redirect still doesn't work.

This is the command I'm using for this test (note: I am currently using the Docker build by slang800/grab-site, though I don't think that should make any difference):

URL=http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might
DEST=imgur_test

docker exec \
grab-site-server \
grab-site \
--dir=/data/Pony/"Tumblr archive"/"$DEST"/tmp \
--finished-warc-dir=/data/Pony/"Tumblr archive"/"$DEST"/warc \
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
--igsets=singletumblr \
"$URL"

The t.umblr.com link redirects to Imgur. According to your other post, grab-site should follow the link when --no-offsite-links isn't used, but this still isn't the case.

This is how the grab-site grab looks:
screenshot 2018-08-03 22 13 43

When trying to open the link in OpenWayback and WebarchivePlayer I get a 404:
screenshot 2018-08-03 22 14 00

Coming back to what you said about the &t= parameter timing out, I don't think it matters, since the original URL and the one grab-site grabbed are the same:
screenshot 2018-08-03 22 15 12

What I want it to do is follow these redirects and also grab the page that they redirect to. Many Tumblrs I try to archive use these redirects and that leaves me with a lot of broken posts.
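
For reference, the destination of a t.umblr.com redirect appears to be carried in the URL's `z` query parameter, judging from the redirect URL that shows up in the wpull log later in this thread. A minimal Python sketch for extracting it, assuming that reading of `z` is right (the helper name `redirect_target` is made up for illustration and is not part of grab-site):

```python
from urllib.parse import urlsplit, parse_qs


def redirect_target(url):
    """Return the decoded `z` parameter of a t.umblr.com redirect URL, or None.

    Assumption: `z` holds the offsite destination, as the logged URL suggests.
    """
    parts = urlsplit(url)
    if parts.hostname != "t.umblr.com":
        return None
    values = parse_qs(parts.query).get("z")
    return values[0] if values else None


# Redirect URL copied from the wpull log quoted in this thread.
example = (
    "https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif"
    "&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D"
    "&b=t%3Aq4se2ivApp6dqsTf9TlZTw"
    "&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2F"
    "devise-the-most-impossible-but-it-just-might&m=1"
)

print(redirect_target(example))
```

This only shows which offsite URL a post depends on; it does not by itself make grab-site fetch that URL.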

2.) Images on different subdomains

I also found some other problems with Tumblr that grab-site doesn't grab right.

This is how it should look:
screenshot 2018-08-03 22 38 43

However, Tumblr redirects to a different subdomain:
screenshot 2018-08-03 22 30 43

Trying to open these gives me a 302:
screenshot 2018-08-03 22 30 53
That then opens the right image, but the image still doesn't show up on the page.

3.) Audio files

( the URL with the audio: http://ask-firefox.tumblr.com/post/106080600065/im-not-big-on-mistletoe-but-if-you-wanna-rave )

Some links to audio files look as follows:
screenshot 2018-08-03 22 34 30

They redirect to this, but grab-site doesn't follow them and just leaves a 404 in the .warc instead.
screenshot 2018-08-03 22 35 05

I tried adding a.tumblr.com to the singletumblr ignoreset, but this didn't help.

==============================================

Maybe there is a Regex fix for all of these issues I can put in the singletumblr ignoreset, but I don't really know my way around Regex at all.

@ivan
Contributor

ivan commented Aug 3, 2018

Thanks for the report. I have confirmed the first issue so far. You are right that the tumblr igset is unhelpfully ignoring non-tumblr domains, including the t.umblr.com redirector. I'll see if I can fix the igset. As a workaround (though it might crawl too much tumblr), just not using the singletumblr igset might help crawl offsite stuff.

(the commits linked by github here don't fix the problem, ignore them)

@RomeSilvanus
Author

RomeSilvanus commented Aug 3, 2018

Sadly, not using the singletumblr ignoreset isn't an option, since I try to crawl a few hundred to a thousand Tumblr blogs periodically, and I certainly don't want to download all of Tumblr with every single crawl.

But I look forward to a possible solution to these problems!

(Maybe there could be an option to specify URLs that always get included in a crawl, regardless of the ignoreset used? Like a whitelist or a file containing the URLs.)

Edit:

Also, trying without the singletumblr ignoreset just makes the .warc redirect me to the live Imgur website instead of to a page grabbed by grab-site.

ivan changed the title from "Reopen - Tumblr redirect #94 - Also some other things" to "singletumblr ignore set ignores all non-tumblr domains" Aug 4, 2018
ivan added a commit that referenced this issue Aug 6, 2018
@ivan
Contributor

ivan commented Aug 6, 2018

I confirmed that problems 1.) and 3.) are fixed in ca8fd22 (or, for 1, at least t.umblr.com is no longer ignored). Can you please check if 2.) is fixed, or give me some way to reproduce that problem?

The imgur problem is going to require a separate investigation, so I opened #127 for it.

@RomeSilvanus
Author

What URLs did you test this with? I installed a brand new copy of ca8fd22, but with the URLs I use it's still the same as before.

1.) t.umblr.com redirect:

Same URL. Redirect is still not in the archive and gives a 404.
screenshot 2018-08-07 21 57 38

2.) Navigation buttons

Confirmed working

3.) Audio redirect:

Tried the same URL again. The audio files still do not get grabbed. 404.

The URL:
https://www.tumblr.com/audio_file/ask-firefox/106080600065/tumblr_nh2ihqEeN51replby?plead=please-dont-download-this-or-our-lawyers-wont-let-us-host-audio

is a redirect that actually leads to:
https://a.tumblr.com/tumblr_nh2ihqEeN51replbyo1.mp3
(the display URL doesn't actually change to it, but the embed HTML5 player uses it)

I can only guess that grab-site doesn't follow it properly, since it's not even in the log.
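
To make the failure mode concrete: a replay tool can only serve what was captured, so if the redirect is in the archive but its target is not, playback ends in a 404. The following Python sketch is a toy model of that lookup, not how OpenWayback or Webrecorder Player actually work. The `archive` dict is a hypothetical stand-in for a WARC index, and the 302 status for the audio_file URL is an assumption (the thread only says it is a redirect).

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_in_archive(url, archive, max_hops=10):
    """Follow redirects using only captured records.

    `archive` maps URL -> (status, location_or_None).
    Returns (final_url, status); status is 404 when a hop was never captured.
    """
    for _ in range(max_hops):
        record = archive.get(url)
        if record is None:
            return url, 404  # nothing captured for this URL
        status, location = record
        if status in REDIRECT_CODES and location:
            url = location  # move on to the redirect target
            continue
        return url, status
    raise ValueError("too many redirects")


audio = ("https://www.tumblr.com/audio_file/ask-firefox/106080600065/"
         "tumblr_nh2ihqEeN51replby"
         "?plead=please-dont-download-this-or-our-lawyers-wont-let-us-host-audio")
mp3 = "https://a.tumblr.com/tumblr_nh2ihqEeN51replbyo1.mp3"

# Assumed status code 302; only the redirect was captured here.
partial = {audio: (302, mp3)}
# Both the redirect and its target were captured here.
complete = {audio: (302, mp3), mp3: (200, None)}

print(follow_in_archive(audio, partial))
print(follow_in_archive(audio, complete))
```

With `partial`, the lookup stops at the a.tumblr.com URL with a 404, which matches the symptom described above; with `complete`, it resolves to the .mp3 with a 200.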

@ivan
Contributor

ivan commented Aug 7, 2018

I tried with the URLs you gave for 1 and 3, then used gs-dump-urls on wpull.db to check whether wpull grabbed them. Can you try with --igon to confirm that something isn't ignoring t.umblr.com?

@ivan
Contributor

ivan commented Aug 7, 2018

# grab-site --version
1.7.0
# grab-site http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might --igsets=singletumblr --ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1"
[...]
# grep t.umblr.com wpull.log
2018-08-07 20:38:02,352 - wpull.processor.web - INFO - Fetching ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’.
2018-08-07 20:38:03,698 - wpull.processor.web - INFO - Fetched ‘https://t.umblr.com/redirect?z=http%3A%2F%2Fi.imgur.com%2FwjnoZ.gif&t=ZTY0ODM4NWU0NjM4NWQ5NWM0N2Q0NzQ2OWU1YTA0MDA4ZmM2OWYxOCxybTlkYWZZNg%3D%3D&b=t%3Aq4se2ivApp6dqsTf9TlZTw&p=http%3A%2F%2Fwoonastuck.tumblr.com%2Fpost%2F32979225783%2Fdevise-the-most-impossible-but-it-just-might&m=1’: 200 OK. Length: unspecified [text/html; charset=utf-8].

@RomeSilvanus
Author

RomeSilvanus commented Aug 7, 2018

I see the problem. I use it with --no-offsite-links since without it grab-site downloads way too many other Tumblrs and websites.
I was under the impression that it would still grab the redirect even with this flag, since it is more or less an embed in the start URL. It is in the log, though.

It does work when not using --no-offsite-links, but that's not really a good solution since, as I said, it tends to download way too much unrelated data. It makes my ~850MB .warc into a 5GB+ .warc.

But even then, the redirect does not work, and its target is not in the archive. It doesn't show up in either application I tried (Webrecorder Player, OpenWayback).

screenshot 2018-08-07 22 58 12
screenshot 2018-08-07 22 58 17

I know you said it needs more work, but I assumed that it would at least display the embed image on the Tumblr page.


Using grab-site with:

~/gs-venv/bin/grab-site \
--igon \
--dir=/home/user/test/woonastuck/tmp \
--finished-warc-dir=/home/user/test/woonastuck/warc \
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
--igsets=singletumblr \
"http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might"


For the audio:

At least it does fetch the audio file now. But viewing it in either application still gives me a 404.
The HTML5 audio player on the page still doesn't find the file.

So either grab-site doesn't rewrite something correctly, or there's a general problem with these applications.

ivan changed the title from "singletumblr ignore set ignores all non-tumblr domains" to "tumblr archives may not play back properly" Oct 9, 2018
@beret

beret commented Dec 15, 2018

It looks like the changed singletumblr igset might be preventing crawls that start at the root of a tumblelog (e.g. https://staff.tumblr.com) when the start URL lacks a trailing slash.

Is this expected behavior?
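
The actual singletumblr patterns are not quoted in this thread, so the following is purely a hypothetical illustration of how an ignore regex that depends on a trailing slash could produce this behavior. Both patterns and the helper `is_ignored` are invented for the example; they are not taken from grab-site.

```python
import re

# Hypothetical pattern: ignore every URL that is NOT under "staff.tumblr.com/".
# The negative lookahead requires the slash, so the bare host slips through it.
strict = re.compile(r"^https?://(?!staff\.tumblr\.com/)")

# Hypothetical variant that also accepts the host with nothing after it.
lenient = re.compile(r"^https?://(?!staff\.tumblr\.com(?:/|$))")


def is_ignored(pattern, url):
    """A URL counts as ignored when the ignore pattern matches it."""
    return pattern.search(url) is not None


for url in ("https://staff.tumblr.com", "https://staff.tumblr.com/"):
    print(url, is_ignored(strict, url), is_ignored(lenient, url))
```

Under the `strict` pattern the start URL without a trailing slash is itself ignored, while the same URL with a slash is not; the `lenient` variant treats both forms as in scope.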

@ivan
Contributor

ivan commented Dec 15, 2018

No, that's unexpected and undesired. I'll file a bug for it. Thanks for the report.
