Support for BlueSky? #4438

Open
Erkhyan opened this issue Aug 18, 2023 · 27 comments

Comments

@Erkhyan

Erkhyan commented Aug 18, 2023

The site is still invite-only for now, but I’m willing to provide an invite code as soon as I get a new one (should be in ~5 days).

@EpicLPer

Heya, I have a few invite codes left over if @mikf wants one to implement this :)

@GiovanH
Contributor

GiovanH commented Oct 26, 2023

I've implemented some basics for this in an unrelated project. Chitose makes it relatively easy, but I'm not sure what the contribution guidelines are for new library dependencies. @mikf ?

The logic is basically:

    import json
    import logging
    import netrc
    import posixpath
    import typing
    import urllib.error

    import chitose  # third-party AT Protocol client

    # Stub for a name defined elsewhere in the original project:
    PostReference = typing.Any  # the project's own post-reference type


    # Illustrative wrapper class; in the original project these methods live on an
    # existing class that also provides bskyTupleToUri() and NOUN_POST.
    class Downloader:

        def login(self, instance="bsky.social"):
            # read credentials for the instance from ~/.netrc
            rc = netrc.netrc()
            (BSKY_USER, _, BSKY_PASSWD) = rc.authenticators(instance)

            self.api = chitose.BskyAgent(service=f'https://{instance}')
            self.api.login(BSKY_USER, BSKY_PASSWD)
            logging.info(f"Logged into {instance} as {BSKY_USER}")

        def getPostMedia(self, json_obj) -> typing.Iterable[typing.Tuple[str, str]]:
            # yield (filename, url) pairs for every image embedded in a post
            for image_def in json_obj.get('embed', {}).get('images', []):
                src_url = image_def['fullsize']
                # CDN filenames end in '@jpeg', '@png', etc.; turn that into a normal extension
                name = posixpath.split(src_url)[-1].replace('@', '.')
                yield (name, src_url)

        def bskyGetThread(self, post_reference: PostReference) -> dict:
            # getPostThread returns the post plus its reply tree as JSON text
            thread_response = self.api.get_post_thread(uri=self.bskyTupleToUri(post_reference))
            return json.loads(thread_response)

        def getSkeetJsonApi(self, post_reference: PostReference, reason=""):
            try:
                thread_response = self.bskyGetThread(post_reference)
                thread_response['thread']['post']['id'] = post_reference.post_id

                logging.info(f"Downloaded new {self.NOUN_POST} for {post_reference} ({reason})")
                return thread_response['thread']['post']

            except urllib.error.HTTPError as e:  # type: ignore[attr-defined]
                # log the server's response before re-raising
                logging.error(e.headers)
                logging.error(e.fp.read())
                raise

@Iron-Squid

Bluesky is now open to the public, FYI:

https://www.pcmag.com/news/twitter-alternative-bluesky-makes-posts-publicly-viewable

@EpicLPer

Bluesky is now open to the public, FYI:

https://www.pcmag.com/news/twitter-alternative-bluesky-makes-posts-publicly-viewable

Not for every account, though: you can manually set whether your posts are publicly viewable or visible only to people who are logged in.

@qub1750ul

BlueSky posts are always public. You can request that your profile be hidden from the unauthenticated, human-friendly web interface, but that doesn't make it private: it will always be readable via the public API.
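
A minimal sketch of that point, assuming the public AppView host public.api.bsky.app and the app.bsky.feed.getAuthorFeed XRPC endpoint (plain Python, not gallery-dl code):

    import requests

    # read a user's feed anonymously via the public AppView
    resp = requests.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed",
        params={"actor": "bsky.app", "limit": 5},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json().get("feed", []):
        post = item["post"]
        print(post["uri"], post["record"].get("text", ""))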

@mikf
Owner

mikf commented Feb 10, 2024

I've added a bunch of bluesky code.
Could someone test it and let me know what else should be added/improved/etc?

@biznizz

biznizz commented Feb 11, 2024

Started experimenting with version 1.26.7 on bluesky, using username, password, and cookies.

		 "bluesky":
        {
            "username": "EMAIL",
			"password": "PASS",
			"retweets": false,
			"original": true,
			"cookies": "C:\\Users\\USER\\cookies.txt",
			"cookies-update": true
        },

Using this post as a test: https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z

Picture downloaded with gallery-dl results in a 1470 x 1260 JPG.
(attached image: 2023-12-31T18_07_37_3khucm2ygso2z_1)

Opening picture in browser with "Open in new tab" gives a picture of 1000 x 857 JPG: https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg

Clicking on picture to open it up in browser and "Open in new tab" gives a picture of 2000 x 1714 JPG: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg

Removing "original": true, and cookies lines results in same resolution downloaded; between "preview" and "fullsize" resolutions. Not sure if "fullsize" is actually that, or if the one ripped is the true size, but figured this should be known for clarity's sake.

@Erkhyan
Author

Erkhyan commented Feb 11, 2024

I just tested the same link, same parameters except for not providing cookies.

Downloading the link the first time gave me the 2000 × 1714 JPG file.

All subsequent downloads of the same link using the same exact settings gave me the 1470 × 1260 JPG file.

@mikf
Owner

mikf commented Feb 11, 2024

Each image uploaded to bluesky has 3 different versions (at least, I haven't found more at this point).

(from https://bsky.app/profile/mikf.bsky.social/post/3kkn2rkvdls2v)

gallery-dl is currently downloading everything in original size (55bbd49).
Guess I'll add an option for this.
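
For reference, the three versions correspond to three URL patterns seen elsewhere in this thread (DID and CID taken from the example post above; the hosts and paths are Bluesky implementation details, so treat this as illustration only):

    did = "did:plc:zyctzyihzisjnrdoiw75xvhm"
    cid = "bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii"

    # thumbnail (downscaled), fullsize (the ~2000px CDN version), original as stored
    thumbnail = f"https://cdn.bsky.app/img/feed_thumbnail/plain/{did}/{cid}@jpeg"
    fullsize  = f"https://cdn.bsky.app/img/feed_fullsize/plain/{did}/{cid}@jpeg"
    original  = f"https://bsky.social/xrpc/com.atproto.sync.getBlob?did={did}&cid={cid}"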

edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to log in, but you can remove the password after the first login.

@biznizz

biznizz commented Feb 11, 2024

It's confusing and a bit annoying how bluesky handles this image resizing. You know it's bad when Twitter is more consistent with file sizes than this new alternative. So, if a 3000 x 3000 pic is uploaded, it'll always be downsized to 2000 x 2000 with no way to get the true original size, all after having to put up with image conversion and severe recompression.

mikf added a commit that referenced this issue Feb 15, 2024
and reduce default depth and parentHeight values
mikf added a commit that referenced this issue Feb 15, 2024
allow extracting 'user' metadata and
make 'facets' extraction optional
mikf added a commit that referenced this issue Feb 17, 2024
Both https://bsky.app/search?q=QUERY and https://bsky.app/search/QUERY
are recognized as search URLs, where QUERY gets forwarded unmodified as the
'q' parameter for app.bsky.feed.searchPosts.

User searches are not supported yet.
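
A rough sketch of what that forwarding amounts to, assuming the public AppView host public.api.bsky.app accepts unauthenticated app.bsky.feed.searchPosts calls (plain Python, not gallery-dl internals):

    import requests

    # QUERY is passed through unmodified as the 'q' parameter
    resp = requests.get(
        "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts",
        params={"q": "QUERY", "limit": 25},
        timeout=30,
    )
    resp.raise_for_status()
    for post in resp.json().get("posts", []):
        print(post["uri"])
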
@Freso

Freso commented Feb 25, 2024

Even though I call this original, it is still a modified version of the uploaded file, as in every file gets converted to JPEG and even uploaded JPEGs get re-compressed.

But I think this is something Bluesky does when uploading the images, no? I.e., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.

edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to login, but you can remove password after the first login.

Is there a reason for asking for/using login information at all? Better rate limits? (As @qub1750ul mentioned earlier, all Bluesky posts (incl. images) are always public, so there’s no need for logging in to access them.)

@mikf
Owner

mikf commented Feb 25, 2024

But I think this is something Bluesky does when uploading the images, no? Ie., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.

Seems like it. Bluesky does not store the originally uploaded image.

Is there a reason for asking for/using login information at all?

Certain (private) feeds, like /likes or /lists/<LIST-ID>, only return posts when logged in.

You don't need to log in if all you want to do is download a user's media.
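
In other words, something like this (hedged example: -u/-p are gallery-dl's username/password options, and the /likes URL form follows the feed mentioned above):

    gallery-dl https://bsky.app/profile/bsky.app
    gallery-dl -u EMAIL -p PASS https://bsky.app/profile/bsky.app/likes

The first command needs no credentials; the second, for a likes feed, does.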

@quentinwolf

I just updated to gain access to the Bluesky functionality, but I have a question. The Python script I run for a bot uses a separate downloader function, and when I attempt to run gallery-dl with the usual works-with-everything command (using the example post above):

gallery-dl --get-urls --no-download --option search-endpoint=graphq1 https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z

Rather than outputting the actual image URL, https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg, which is what shows up in a browser, it spits out a blob URL:

https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:zyctzyihzisjnrdoiw75xvhm&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii

Granted, I can see similarities in the URLs, so a bit of URL rewriting could produce what I need to pass over to my separate downloading function. I'm just wondering whether there are any command-line flags, when using --get-urls and --no-download, to output the correct https://cdn.bsky.app/img/feed_fullsize/plain/did: URL instead of https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did: ?
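
For what it's worth, the rewriting described above can be done outside gallery-dl. A rough sketch (the cdn.bsky.app pattern is copied from the URLs in this thread and is an implementation detail, so treat it as illustrative only):

    from urllib.parse import urlsplit, parse_qs

    def blob_to_cdn(url, size="feed_fullsize"):
        # turn a com.atproto.sync.getBlob URL into the cdn.bsky.app form
        # seen earlier in this thread
        query = parse_qs(urlsplit(url).query)
        did = query["did"][0]
        cid = query["cid"][0]
        return f"https://cdn.bsky.app/img/{size}/plain/{did}/{cid}@jpeg"

    print(blob_to_cdn(
        "https://bsky.social/xrpc/com.atproto.sync.getBlob"
        "?did=did:plc:zyctzyihzisjnrdoiw75xvhm"
        "&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii"
    ))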

@mikf
Owner

mikf commented Feb 29, 2024

@quentinwolf See #4438 (comment)

There is currently no such option, but I'd think original resolution is better than the upscaled-to-2000px version.

@Freso

Freso commented Feb 29, 2024

instead output the correct

The https://bsky.social/xrpc/com.atproto.sync.getBlob?did=$DID&cid=$CID URL is the more “correct” URL since that one won’t break when(/if) other AT protocol nodes start getting added to the federation network. (DID is a unique user/account identifier across all AT protocol instances, CID is a unique content identifier.) The cdn.bsky.app URLs are implementation details specific to how Bluesky is handling the AT protocol and probably shouldn’t be considered stable (emphasis mine):

Blobs for a specific account can be listed and downloaded using endpoints in the com.atproto.sync.* NSID space. These endpoints give access to the complete original blob, as uploaded. A common pattern is for applications to mirror both the original blob and any downsized thumbnail or preview versions via separate URLs (eg, on a CDN), instead of deep-linking to the getBlob endpoint on the original PDS.
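
A sketch of the "listed and downloaded" part of that quote, assuming the com.atproto.sync.listBlobs endpoint on the same bsky.social PDS that serves the getBlob URLs above (parameter and response field names follow the AT protocol lexicons, so double-check them):

    import requests

    did = "did:plc:zyctzyihzisjnrdoiw75xvhm"  # example DID from this thread
    resp = requests.get(
        "https://bsky.social/xrpc/com.atproto.sync.listBlobs",
        params={"did": did, "limit": 10},
        timeout=30,
    )
    resp.raise_for_status()
    for cid in resp.json().get("cids", []):
        # each CID can then be fetched as an original blob via getBlob
        print(f"https://bsky.social/xrpc/com.atproto.sync.getBlob?did={did}&cid={cid}")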

@mikf
Owner

mikf commented Feb 29, 2024

These endpoints give access to the complete original blob, as uploaded

Now that's not true, at least for the bsky.app instance it isn't

https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:cslxjqkeexku6elp5xowxkq7&cid=bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq is not the same file I uploaded. Or is this URL somehow wrong, e.g. wrong CID?

@Freso

Freso commented Mar 2, 2024

I think it’s true in that it’s the “complete original blob, as uploaded” by bsky.app to their storage backend, even if not by the user to bsky.app, hence also my earlier comment about Bluesky’s handling of uploaded images.

I haven’t looked at what’s going on in the browser, but the JPEGifying and (potential) downscaling could even be happening browser-side (I know there are JavaScript libraries that do this anyway), so the original‐original might never touch any bsky.app infrastructure at all.

@mikf
Owner

mikf commented Mar 2, 2024

JPEGifying and (potential) downscaling

Even JPEG files that don't get downscaled are modified:
https://bsky.app/profile/mikf.bsky.social/post/3kkzcewddop2o

@a84r7a3rga76fg

Requesting that a unique ID be added: a string of numbers/letters unique to the account.

@mikf
Owner

mikf commented Mar 9, 2024

Bluesky's equivalent to Twitter's unchanging, unique user IDs is the DID.

Each user has a handle and a DID, and both can be used with gallery-dl.

https://bsky.app/profile/bsky.app
https://bsky.app/profile/did:plc:z72i7hdynmk6r22z27h6tvur

A user's DID can be found at author['did'] (or user['did'] when enabled).

gallery-dl --filter "print(author['did']) or abort()" https://bsky.app/profile/bsky.app
gallery-dl -o metadata=user --filter "print(user['did']) or abort()" https://bsky.app/profile/bsky.app

It is also included in -K and -j outputs.

@a84r7a3rga76fg

Doesn't work for archives because of the colon

[bluesky][warning] Failed to open download archive at 'D:\test/gallery-dl/archives/bluesky/did:plc:z72i7hdynmk6r22z27h6tvur.sqlite' (OperationalError: unable to open database file)

@mikf
Owner

mikf commented Mar 9, 2024

Then replace : (:R:/_/) or remove the first 8 characters ([8:]) in your format string.
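
For example, assuming an archive path that interpolates the DID the way the error message suggests (the author[did] field name here is only illustrative), either variant should yield a Windows-safe filename:

    "archive": "D:/test/gallery-dl/archives/bluesky/{author[did]:R:/_/}.sqlite"
    "archive": "D:/test/gallery-dl/archives/bluesky/{author[did][8:]}.sqlite"

The first replaces every : with _, the second drops the leading did:plc: prefix.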

@a84r7a3rga76fg

That worked. I want to request the equivalent of the Mastodon extractor's {instance}. It's similar to {category}, except it includes the domain as well.

@a84r7a3rga76fg

I don't think dots are allowed in the username. Can there be a version of author['handle'] without the domain (any text that comes after the first dot)? Such a keyword exists for Mastodon and Misskey.

@EpicLPer

EpicLPer commented Mar 20, 2024

I don't think dots are allowed in the username. Can there be a version of author['handle'] without the domain (any text that comes after the first dot)? Such a keyword exists for Mastodon and Misskey.

Dots are allowed if you use a custom domain name. I know this for a fact because I've done so with an alt account of mine (NSFW, so I can't post the name here).

EDIT: I mean sub-domains by this; for example, "@sub.epiclper.com" would be a valid Bluesky username.

@Freso

Freso commented Mar 20, 2024

I don't think dots are allowed in the username.

In Bluesky/the AT protocol, usernames are domain names (or as the documentation says: Handles are DNS names), so not only are dots allowed, they are required to have at least one in them. :) Most languages will have libraries for handling domain names (or you can just split on . and grab the first part of the resulting array), so you can use that if you’re only interested in the sub‐most part of the domain name. Do keep in mind if you do that, that you shouldn’t expect those to be unique – e.g., @freso.dk and @freso.bsky.social would both resolve to freso.
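
A trivial sketch of the split described above, including the collision warned about:

    for handle in ("freso.dk", "freso.bsky.social", "sub.epiclper.com"):
        print(handle.split(".", 1)[0])  # -> freso, freso, sub (the first two collide)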

@Kuroo2021

Using this config:

    "bluesky":
    {
        "username": "12@abc.com",
        "password": "bl;ahblah",
        "filename": "{createdAt[:19]}{post_id}{num}.{extension}",
        "directory": ["{category}", "{author[handle]}"],
        "include": "avatar,media",
        "reposts": false,
        "retweets": false,
        "original": true,
        "cookies-update": true
    },

It scans the URLs I have in the txt file, but it doesn't download the files it finds.

JackTildeD added a commit to JackTildeD/gallery-dl-forked that referenced this issue Apr 24, 2024
* [bluesky] add 'instance' metadata field (mikf#4438)