[instagram] Add support for GraphSidecar media types #201

iamleot · 2019-03-23T17:49:59Z

Refactor _extract_postpage() to always return a list of media items.

Fetch common keywords and gracefully handle the GraphSidecar media type
by extracting each individual media item and adding `sidecar_media_id' and `sidecar_shortcode' keywords to indicate the parent of the sidecar
children.

While here, join the copyright comment lines into a single one.

Closes #178.
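
A minimal sketch of the flattening described above, not the code from this PR. The helper name `flatten_post`, the sample data, and the `__typename` / `edge_sidecar_to_children` input layout are assumptions made for illustration; only the `sidecar_media_id' and `sidecar_shortcode' keywords come from the description.

```python
def flatten_post(post):
    """Always return a list of media dicts for a single post.

    Hypothetical helper; the input layout is an assumed, simplified
    version of the JSON embedded in a post page.
    """
    if post["__typename"] != "GraphSidecar":
        # GraphImage / GraphVideo: the post itself is the only media.
        return [{
            "typename": post["__typename"],
            "media_id": post["id"],
            "shortcode": post["shortcode"],
        }]

    medias = []
    for edge in post["edge_sidecar_to_children"]["edges"]:
        child = edge["node"]
        medias.append({
            "typename": child["__typename"],
            "media_id": child["id"],
            "shortcode": child["shortcode"],
            # Keywords pointing back at the parent sidecar.
            "sidecar_media_id": post["id"],
            "sidecar_shortcode": post["shortcode"],
        })
    return medias


# Made-up sample data, for illustration only.
sidecar = {
    "__typename": "GraphSidecar",
    "id": "100",
    "shortcode": "PARENT",
    "edge_sidecar_to_children": {"edges": [
        {"node": {"__typename": "GraphImage", "id": "101",
                  "shortcode": "CHILD1"}},
        {"node": {"__typename": "GraphVideo", "id": "102",
                  "shortcode": "CHILD2"}},
    ]},
}
single = {"__typename": "GraphImage", "id": "200", "shortcode": "SINGLE"}

for media in flatten_post(sidecar) + flatten_post(single):
    print(media)
```

Returning a list in both branches means the caller can iterate without checking the post type first.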


iamleot · 2019-03-23T18:10:02Z

Relevant parts of the Travis CI logs (possible TLDR;).

It seems that all instagram extractor related tests pass:

test_InstagramImageExtractor_1 (test.test_results.TestExtractorResults) ... ok
test_InstagramImageExtractor_2 (test.test_results.TestExtractorResults) ... ok
test_InstagramImageExtractor_3 (test.test_results.TestExtractorResults) ... ok
test_InstagramUserExtractor_1 (test.test_results.TestExtractorResults) ... ok

...and the failure is a komikcast one (but I have not investigated further):

======================================================================
FAIL: test_KomikcastMangaExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 252, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 94, in _run_test
    self.assertEqual(keyword, tjob.hash_keyword.hexdigest())
AssertionError: '837a7e96867344ff59d840771c04c20dc46c0ab1' != '0b83000d3c6073c830be49dd7b69b3eae16d8343'
- 837a7e96867344ff59d840771c04c20dc46c0ab1
+ 0b83000d3c6073c830be49dd7b69b3eae16d8343
-------------------- >> begin captured stdout << ---------------------
https://komikcast.com/komik/090-eko-to-issho/
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
komikcast: DEBUG: Using KomikcastMangaExtractor for 'https://komikcast.com/komik/090-eko-to-issho/'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): komikcast.com:443
urllib3.connectionpool: DEBUG: https://komikcast.com:443 "GET /komik/090-eko-to-issho HTTP/1.1" 301 None
urllib3.connectionpool: DEBUG: https://komikcast.com:443 "GET /komik/090-eko-to-issho/ HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 463 tests in 518.691s
FAILED (failures=1)

mikf · 2019-03-23T18:33:02Z

Thanks a lot for this. (Ignore the one failed test, it's not important.)

There is one small thing that might be an issue for users who have already downloaded a few images and are now re-downloading everything to get the new, additional images that are now being downloaded:

Take for example https://www.instagram.com/p/BoHk1haB5tM/. Before your change it got 1 image with ID 1875629777499953996. Now it gets the same image (and 4 others), but with a different ID than before (1875628837415270345). This could cause users to have downloaded the same image twice.

gallery_dl/extractor/instagram.py

mikf · 2019-03-23T18:42:41Z

Reading through your changes again made me realize the old ID was for the whole sidecar and the new IDs are for the actual media files, meaning they are now "more correct" than before, so I guess it's fine the way it is right now.

And one can also use some filename shenanigans to give those IDs a bit more meaning:
"{sidecar_media_id:?/_/}{media_id}.{extension}" for example.


Add a possible leading `media_id' of the sidecar for GraphSidecar media. Thanks to @mikf for the suggestion!
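
To illustrate what the suggested format string "{sidecar_media_id:?/_/}{media_id}.{extension}" is meant to produce, here is a plain-Python emulation. It is not gallery-dl's formatter: `build_filename` is a hypothetical helper, the `jpg` extension is made up, and the sketch assumes the `?/_/` conditional appends `_` only when `sidecar_media_id` is present and non-empty. The two IDs are the ones quoted earlier in this thread.

```python
def build_filename(kwdict):
    """Emulate the suggested filename format (hypothetical helper)."""
    sidecar_id = kwdict.get("sidecar_media_id")
    # Prefix with the parent ID and an underscore only for sidecar children.
    prefix = "{}_".format(sidecar_id) if sidecar_id else ""
    return "{}{}.{}".format(prefix, kwdict["media_id"], kwdict["extension"])


child = {
    "sidecar_media_id": "1875629777499953996",
    "media_id": "1875628837415270345",
    "extension": "jpg",  # assumed for the example
}
plain = {"media_id": "1875628837415270345", "extension": "jpg"}

print(build_filename(child))  # 1875629777499953996_1875628837415270345.jpg
print(build_filename(plain))  # 1875628837415270345.jpg
```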

iamleot · 2019-03-23T18:47:59Z

That's a good idea! I have just adjusted it, thanks again!

iamleot · 2019-03-23T18:53:06Z

And, regarding possible ID conflicts, that's right. Previously only what is now sidecar_media_id was used, while before commit fde0c25 only the children's IDs were used (leading to possible duplicates of the same image in the case of GraphImage, or a "screenshot" of the video in the case of GraphVideo). Unfortunately, I think we have no graceful way to handle that other than suggesting that users remove the duplicates manually.

mikf · 2019-03-23T21:10:20Z

I found an issue that's indirectly caused by this PR: GraphSidecar posts with multiple videos will download those videos multiple times, once for each shortcode.

Example: https://www.instagram.com/p/BtOvDOfhvRr/
This post has 2 videos and each one will get downloaded twice, because the youtube-dl downloader will interpret both video URLs as a playlist of 2 videos.

Possible solution: add a _ytdl_index field to the metadata dict that specifies the playlist index for youtube-dl, and I will add the necessary changes in the downloader module. This might be tricky to implement if there are GraphSidecar posts with both videos and images, so I'm open to any suggestions in this regard.

(Maybe this should be its own issue/pull request, but I thought it would fit in here as well)
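
The duplication and the proposed fix can be sketched with a toy model. It does not call youtube-dl: `resolve_playlist` and `download` are hypothetical stand-ins, the URLs and entry names are made up, and the sketch assumes `_ytdl_index` is a 1-based position in the playlist.

```python
def resolve_playlist(url):
    """Stand-in for youtube-dl: every child URL resolves to the
    parent's full playlist (made-up entry names)."""
    return ["video-1", "video-2"]


def download(url, metadata):
    """Return the playlist entries that would be fetched for one URL."""
    entries = resolve_playlist(url)
    index = metadata.get("_ytdl_index")
    if index is None:
        return entries            # no hint: the whole playlist is fetched
    return [entries[index - 1]]   # assumed 1-based playlist index


children = ["ytdl:child-1", "ytdl:child-2"]  # made-up URLs

without_index = [
    entry for url in children for entry in download(url, {})
]
with_index = [
    entry
    for i, url in enumerate(children, 1)
    for entry in download(url, {"_ytdl_index": i})
]

print(without_index)  # ['video-1', 'video-2', 'video-1', 'video-2']
print(with_index)     # ['video-1', 'video-2']
```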

iamleot · 2019-03-23T21:52:19Z

Whooops, nice catch @mikf!

I guess the problem is that, for each child shortcode (e.g. for BtOvDOfhvRr these are
BtOu6eqhX6N and BtOu61khfr_), youtube-dl will just follow the redirect, so each ytdl: special URL leads to the parent BtOvDOfhvRr.

Another possible kludge is to extract just the first GraphVideo in a GraphSidecar (maybe by directly returning the sidecar_shortcodeid), but that way we would lose possible metadata about the individual children... So I think it's probably better to introduce a _ytdl_index as you have proposed. I will try to implement it ASAP!

GraphSidecar children ytdl: URLs, when consumed by youtube-dl, redirect to the URL of their parent. In GraphSidecar-s with multiple GraphVideo-s this leads to downloading the same video multiple times. Add a `_ytdl_index' field to indicate the index in the youtube-dl playlist corresponding to each child of the sidecar. This will be used by the `ytdl' downloader.
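
On the extractor side, assigning the index could look roughly like the sketch below. This is not the committed code: `assign_ytdl_index` is a hypothetical helper, the sample shortcodes are made up, and it assumes the youtube-dl playlist contains only the videos of a sidecar, numbered from 1 in post order. Whether images affect the numbering is exactly the open question for mixed sidecars raised above.

```python
def assign_ytdl_index(medias):
    """Add a 1-based `_ytdl_index` to each GraphVideo child, in place.

    Hypothetical helper; images are skipped and do not consume an index
    (an assumption about how the playlist is numbered).
    """
    index = 0
    for media in medias:
        if media["typename"] == "GraphVideo":
            index += 1
            media["_ytdl_index"] = index
    return medias


# Made-up mixed sidecar: video, image, video.
medias = assign_ytdl_index([
    {"typename": "GraphVideo", "shortcode": "VIDEO_A"},
    {"typename": "GraphImage", "shortcode": "IMAGE_B"},
    {"typename": "GraphVideo", "shortcode": "VIDEO_C"},
])

for media in medias:
    print(media["shortcode"], media.get("_ytdl_index"))
```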

iamleot · 2019-03-24T10:07:09Z

Hello Mike,

Mike Fährmann writes:
> Merged #201 into master.

Thank you for quickly merging it and for all the reviews/suggestions!

arisboch · 2019-03-24T14:13:34Z

Still doesn't work ;-(

mikf · 2019-03-24T14:28:39Z

In what way? Does it crash?
If you have a multi-image post where it doesn't get all images, please post its URL so it can be fixed.

iamleot · 2019-03-24T14:39:42Z

Hello Aris,

Aris Boch writes:
> Still doesn't work ;-(

Can you please share the complete gallery-dl incantation (including the URL(s) used) and maybe also add the `--verbose' option? Thanks!

iamleot mentioned this pull request Mar 23, 2019

[Instagram] Only one image of multi-image posts gets downloaded #178

Closed

mikf reviewed Mar 23, 2019


[instagram] Use `yield from' instead of `for ... yield' (thanks @mikf)!

5b8419d

[instagram] Adjust filename for GraphSidecar medias

fde0c25


mikf merged commit 1e38f65 into mikf:master Mar 24, 2019
