Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[instagram] Add support for GraphSidecar media types #201

Merged
merged 4 commits into from
Mar 24, 2019
Merged

[instagram] Add support for GraphSidecar media types #201

merged 4 commits into from
Mar 24, 2019

Conversation

iamleot
Copy link
Contributor

@iamleot iamleot commented Mar 23, 2019

Refactor _extract_postpage() to always return a list of medias.

Fetch common keywords and gracefully handle GraphSidecar media type
by extracting each single media and adding sidecar_media_id' and sidecar_shortcode' keywords to indicate the parent of sidecar
childrens.

While here join the copyright comment lines in a single one.

Closes #178.

Refactor _extract_postpage() to always return a list of medias.

Fetch common keywords and gracefully handle GraphSidecar media type
by extracting each single media and adding `sidecar_media_id' and
`sidecar_shortcode' keywords to indicate the parent of sidecar
childrens.

While here join the copyright comment lines in a single one.

Closes #178.
@iamleot
Copy link
Contributor Author

iamleot commented Mar 23, 2019

Relevant parts of the Travis CI logs (possible TLDR;).

It seems that all instagram extractor related tests passes:

test_InstagramImageExtractor_1 (test.test_results.TestExtractorResults) ... ok
test_InstagramImageExtractor_2 (test.test_results.TestExtractorResults) ... ok
test_InstagramImageExtractor_3 (test.test_results.TestExtractorResults) ... ok
test_InstagramUserExtractor_1 (test.test_results.TestExtractorResults) ... ok

...and the failure is a komikast one (but I have not investigated further):

======================================================================
FAIL: test_KomikcastMangaExtractor_1 (test.test_results.TestExtractorResults)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 252, in test
    self._run_test(extr, url, result)
  File "/home/travis/build/mikf/gallery-dl/test/test_results.py", line 94, in _run_test
    self.assertEqual(keyword, tjob.hash_keyword.hexdigest())
AssertionError: '837a7e96867344ff59d840771c04c20dc46c0ab1' != '0b83000d3c6073c830be49dd7b69b3eae16d8343'
- 837a7e96867344ff59d840771c04c20dc46c0ab1
+ 0b83000d3c6073c830be49dd7b69b3eae16d8343
-------------------- >> begin captured stdout << ---------------------
https://komikcast.com/komik/090-eko-to-issho/
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
komikcast: DEBUG: Using KomikcastMangaExtractor for 'https://komikcast.com/komik/090-eko-to-issho/'
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): komikcast.com:443
urllib3.connectionpool: DEBUG: https://komikcast.com:443 "GET /komik/090-eko-to-issho HTTP/1.1" 301 None
urllib3.connectionpool: DEBUG: https://komikcast.com:443 "GET /komik/090-eko-to-issho/ HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 463 tests in 518.691s
FAILED (failures=1)

@mikf
Copy link
Owner

mikf commented Mar 23, 2019

Thanks a lot for this. (Ignore the one failed test, its not important.)

There is one small thing that might be an issue for users who have already downloaded a few images and are now re-downloading everything to get the new, additional images that are now being downloaded:

Take for example https://www.instagram.com/p/BoHk1haB5tM/. Before your change it got 1 image with ID 1875629777499953996. Now it gets the same image (and 4 others), but with a different ID than before (1875628837415270345). This could cause users to have downloaded the same image twice.

@mikf
Copy link
Owner

mikf commented Mar 23, 2019

Reading through your changes again made me realize the old ID was for the whole sidecar and the new IDs are for the actual media files, meaning they are now "more correct" than before, so I guess it's fine the way it is right now.

And one can also use some filename shenanigans to give those IDs a bit more meaning:
"{sidecar_media_id:?/_/}{media_id}.{extension}" for example.

Add a possible leading `media_id' of the sidecar for GraphSidecar
media.

Thanks to @mikf for the suggestion!
@iamleot
Copy link
Contributor Author

iamleot commented Mar 23, 2019

That's a good idea! I have just adjusted it, thanks again!

@iamleot
Copy link
Contributor Author

iamleot commented Mar 23, 2019

And, regarding possible IDs conflicts, that's right. Previously only the (now) sidecar_media_id was used while before commit fde0c25 only the childrens IDs were used (leading to possible duplicates of the same image in case of GraphImage or a "screenshot" of video in case of GraphVideo). Unfortunately I think that we have no grace way to possibly handle that except to suggest users to manually remove them.

@mikf
Copy link
Owner

mikf commented Mar 23, 2019

I found an issue that's indirectly caused by this PR: GraphSidecar posts with multiple videos will download those videos multiple times, once for each shortcode.

Example: https://www.instagram.com/p/BtOvDOfhvRr/
This post has 2 videos and each one will get downloaded twice, because the youtube-dl downloader will interpret both video URLs as a playlist of 2 videos.

Possible solution: add a _ytdl_index field to the metadata dict that specifies the playlist index for youtube-dl and I will add the necessary changes in the downloader module. This might be tricky to implement if there are GraphSidecar posts with both videos and images, so I'm open for any suggestion in this regard.

(Maybe this should be its own issue/pull request, but I thought it would fit in here as well)

@iamleot
Copy link
Contributor Author

iamleot commented Mar 23, 2019

Whooops, nice catch @mikf!

I guess that the problem is that each children shortcode - e.g. for BtOvDOfhvRr these are:
BtOu6eqhX6N and BtOu61khfr_ youtube-dl will just follow the redirect and each ytdl: special URL will leads to the parent BtOvDOfhvRr.

Another possible kludge is to extract just the first GraphVideo in a GraphSidecar (maybe by directly returning the sidecar_shortcodeid) but in that way we will lose possible metadata information about the single children... So, I think it's probably better to introduce a _ytdl_index as you have proposed. I will try to implement it ASAP!

GraphSidecar children ytdl: URLs when consumed by youtube-dl
redirects to the URL of their parent.  In GraphSidecar-s with
multiple GraphVideo-s this leads to downloading the same video
multiple times.

Add a `_ytdl_index' field to indicate the index of the youtube-dl
playlist corresponding the children of the sidecar.

This will be used by the `ytdl' downloader.
@mikf mikf merged commit 1e38f65 into mikf:master Mar 24, 2019
@iamleot
Copy link
Contributor Author

iamleot commented Mar 24, 2019 via email

@arisboch
Copy link

Still doesn't work ;-(

@mikf
Copy link
Owner

mikf commented Mar 24, 2019

In what way? Does it crash?
If you have a multi-image post where it doesn't get all images, please post its URL so it can be fixed.

@iamleot
Copy link
Contributor Author

iamleot commented Mar 24, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants