tumblr archives may not play back properly #126
Comments
Thanks for the report. I have confirmed the first issue so far. You are right that the tumblr igset is unhelpfully ignoring non-tumblr domains, including the (the commits linked by GitHub here don't fix the problem, ignore them)
Sadly not using the But I look forward to a possible solution to these problems! (Maybe there could be an option to specify URLs that always get included in a crawl, regardless of the ignore set used? Like a whitelist or a file containing the URLs.) Edit: Also trying without the
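The whitelist idea above could be sketched roughly as follows. This is only an illustration of the requested behavior, not an existing grab-site option: `should_fetch`, `ignore_patterns`, and `always_include` are hypothetical names, and the sketch assumes ignore patterns are regular expressions applied with search semantics.

```python
import re
from typing import Iterable, Set


def should_fetch(url: str, ignore_patterns: Iterable[str], always_include: Set[str]) -> bool:
    """Decide whether a URL would be crawled.

    Hypothetical behavior: URLs listed in `always_include` are fetched
    even when an ignore pattern matches them.
    """
    if url in always_include:
        return True
    # Otherwise, any matching ignore pattern excludes the URL.
    return not any(re.search(pattern, url) for pattern in ignore_patterns)
```

With this ordering the whitelist always wins, so a broad ignore set could stay in place while a few off-site URLs are still pulled in.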
What URLs did you test this with? I installed a brand new copy of ca8fd22, but with the URLs I use it's still the same as before.
1.) t.umblr.com redirect: Same URL. The redirect is still not in the archive and gives a 404.
2.) Navigation buttons: Confirmed working.
3.) Audio redirect: Tried the same URL again. The audio files still do not get grabbed (404). The URL: is a redirect that actually leads to: I can only guess that grab-site doesn't follow it properly, since it's not even in the log.
I tried with the URLs you gave for 1 and 3, then used
I see the problem. I use it with It does work when not using But even then, the redirect does not work, and it's not in the archive. Nor does it show up in either application I tried (Webrecorder Player, OpenWayback). I know you said it needs more work, but I assumed that it would at least display the embed image on the Tumblr page. Using grab-site with:
For the audio: At least it does fetch the audio file now. But viewing it in either application still gives me a 404. So either grab-site doesn't rewrite something correctly, or there's a general problem with these applications.
It looks like the changed singletumblr igset might be preventing crawls starting at the root of a tumblelog, e.g. Is this expected behavior?
No, that's unexpected and undesired, I'll file a bug for it. Thanks for the report.
I hope it is okay to reopen this here and to put three issues in it at once!
Coming from my old issue #94
1.) t.umblr.com redirect
I tried this once again and the t.umblr.com redirect still doesn't work.
This is the command I'm using for this test (note, I am currently using the Docker build by slang800/grab-site, I don't think it should make any difference though):
URL=http://woonastuck.tumblr.com/post/32979225783/devise-the-most-impossible-but-it-just-might
DEST=imgur_test
docker exec \
grab-site-server \
grab-site \
--dir=/data/Pony/"Tumblr archive"/"$DEST"/tmp \
--finished-warc-dir=/data/Pony/"Tumblr archive"/"$DEST"/warc \
--ua "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0 but not really nor Googlebot/2.1" \
--igsets=singletumblr \
"$URL"
The t.umblr.com URL links to Imgur. According to your other post, it should follow the link when --no-offsite-links isn't used; however, this still isn't the case.
This is how the grab-site grab looks:
When trying to open the link in OpenWayback and WebarchivePlayer I get a 404:
Coming back to what you said about the timeout by &t=, I don't think it matters since both the original URL and the one grab-site grabbed are the same:
What I want it to do is follow these redirects and also grab the page that they redirect to. Many Tumblrs I try to archive use these redirects and that leaves me with a lot of broken posts.
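To check which hop of such a redirect is missing from an archive, the chain can be walked by hand. The following is a minimal sketch, not part of grab-site: `redirect_chain` and `missing_from_archive` are hypothetical helpers, and `fetch` is an injected function (an assumption) that returns the status code and `Location` header for a URL, so the logic can be tried without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import urljoin

# fetch(url) -> (status_code, Location header or None); supplied by the caller.
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_chain(url: str, fetch: Fetch, max_hops: int = 10) -> List[str]:
    """Return every URL visited while following redirects from `url`."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        next_url = urljoin(chain[-1], location)
        if next_url in chain:  # stop on a redirect loop
            break
        chain.append(next_url)
    return chain


def missing_from_archive(chain: Iterable[str], archived_urls: Iterable[str]) -> List[str]:
    """List the hops of a redirect chain that are absent from the archive."""
    archived = set(archived_urls)
    return [u for u in chain if u not in archived]
```

For playback to work, every URL in the chain needs a record in the WARC: the redirect response itself and the final target. If `missing_from_archive` returns the target, the redirect was recorded but never followed.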
2.) Images on different subdomains
I also found some other problems with Tumblr that grab-site doesn't grab right.
This is how it should look:
However, Tumblr redirects to a different subdomain:
Trying to open these gives me a 302:
That then opens the right image, but it still doesn't show up on the page.
3.) Audio files
( the URL with the audio: http://ask-firefox.tumblr.com/post/106080600065/im-not-big-on-mistletoe-but-if-you-wanna-rave )
Some links to audio files look as follows:
They redirect to this, but grab-site doesn't follow them and just leaves a 404 in the .warc instead.
I tried adding a.tumblr.com to the singletumblr ignoreset, but this didn't help.
==============================================
Maybe there is a Regex fix for all of these issues I can put in the singletumblr ignoreset, but I don't really know my way around Regex at all.
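One way to experiment with candidate patterns before putting them into an ignore set is to test them against sample URLs. The sketch below assumes ignore patterns are Python regular expressions applied with search semantics. Both patterns are hypothetical illustrations, not the actual contents of the singletumblr ignore set, and the URL paths are made up; only the hostnames come from this report.

```python
import re
from typing import Iterable, List


def ignored_by(url: str, patterns: Iterable[str]) -> List[str]:
    """Return the patterns that match `url`; an empty list means it would be crawled."""
    return [p for p in patterns if re.search(p, url)]


# Hypothetical pattern: ignore every host that is not tumblr.com or a subdomain of it.
ONLY_TUMBLR = r"^https?://(?!(?:[^/]+\.)?tumblr\.com/)[^/]+/"

# Hypothetical variant: additionally let the t.umblr.com redirector through.
TUMBLR_AND_REDIRECTOR = r"^https?://(?!(?:[^/]+\.)?tumblr\.com/|t\.umblr\.com/)[^/]+/"

# Illustrative URLs (paths are invented for the example).
redirector = "http://t.umblr.com/redirect?z=example"
audio = "http://a.tumblr.com/example.mp3"

for pattern in (ONLY_TUMBLR, TUMBLR_AND_REDIRECTOR):
    for url in (redirector, audio):
        verdict = "ignored" if ignored_by(url, [pattern]) else "crawled"
        print(f"{verdict}: {url}")
```

Note that t.umblr.com is not a subdomain of tumblr.com, so a "tumblr only" pattern like the first one would also drop the redirector, which would be consistent with the behavior described in issue 1.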