
Questions, Feedback, and Suggestions #4 #5262

Open
mikf opened this issue Mar 1, 2024 · 69 comments
Comments

@mikf
Owner

mikf commented Mar 1, 2024

Continuation of the previous issue, as a central place for any sort of question or suggestion that doesn't deserve its own separate issue.

Links to older issues: #11, #74, #146.

@BakedCookie

For most sites I'm able to sort files into year/month folders like this:

"directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]

However for redgifs it doesn't look like there's a date keyword available for directory. There's only a date keyword available for filename. Is this an oversight?

@mikf
Owner Author

mikf commented Mar 2, 2024

Yep, that's a mistake that happened when adding support for galleries in 5a6fd80.
Will be fixed with the next git push.

edit: 82c73c7

@taskhawk

taskhawk commented Mar 6, 2024

There's a typo in extractor.reddit.client-id & .user-agent:

"I'm not a rebot"

@the-blank-x
Contributor

There's also another typo in extractor.reddit.client-id & .user-agent, "reCATCHA"

@biggestsonicfan

Can you grab all the media from quoted tweets? Example.

mikf added a commit that referenced this issue Mar 7, 2024
#5262 (comment)

It's implemented as a search for 'quoted_tweet_id:…' on Twitter.
mikf added a commit that referenced this issue Mar 7, 2024
#5262 (comment)

This one was on the same line as the previous one ... (9fd851c)
@mikf
Owner Author

mikf commented Mar 7, 2024

Regarding typos, thanks for pointing them out.
I would be surprised if there aren't at least 10 more somewhere in this file.

@biggestsonicfan
This is implemented as a search for quoted_tweet_id:… on Twitter's end.
I've added an extractor for it similar to the hashtags one (40c0553), but it only does said search under the hood.

@BakedCookie

BakedCookie commented Mar 7, 2024

Normally %-encoded characters in the URL get converted nicely when running gallery-dl, eg.

https://gelbooru.com/index.php?page=post&s=list&tags=nighthawk_%28circle%29
gives me a nighthawk_(circle) folder

but for this url:
https://gelbooru.com/index.php?page=post&s=list&tags=shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29

I'm getting a shin&#039;ya_(shin&#039;yanchi) folder. Shouldn't I be getting a shin'ya_(shin'yanchi) folder instead?

EDIT: Actually, I think there's just something wrong with that URL. I had it saved for a long time and searching that tag normally gives a different URL (https://gelbooru.com/index.php?page=post&s=list&tags=shin%27ya_%28shin%27yanchi%29). I still got valid posts from the weird URL so I didn't think much of it.

@mikf
Owner Author

mikf commented Mar 7, 2024

%28 and so on are URL escaped values, which do get resolved.
&#039; is the HTML escaped value for '.

You could use {search_tags!U} to convert them.
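The two escaping layers can be seen in a quick Python sketch (urllib for the %-escapes, html for the &#039; entities, which is the kind of conversion {search_tags!U} performs):

```python
import html
import urllib.parse

tag = "shin%26%23039%3Bya_%28shin%26%23039%3Byanchi%29"

url_decoded = urllib.parse.unquote(tag)    # resolves %26, %23, %3B, %28, %29
html_decoded = html.unescape(url_decoded)  # resolves the remaining &#039; entities

print(url_decoded)   # shin&#039;ya_(shin&#039;yanchi)
print(html_decoded)  # shin'ya_(shin'yanchi)
```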

@taskhawk

taskhawk commented Mar 8, 2024

Is there support to remove metadata like this?

gallery-dl -K https://www.reddit.com/r/carporn/comments/axo236/mean_ctsv/

...
preview['images'][N]['resolutions'][N]['height']
  144
preview['images'][N]['resolutions'][N]['url']
  https://preview.redd.it/mcerovafack21.jpg?width=108&crop=smart&auto=webp&s=f8516c60ad7fa17c84143d549c070738b8bcc989
preview['images'][N]['resolutions'][N]['width']
  108
...

Post-processor:

"filter-metadata":
    {
      "name": "metadata",
      "mode": "delete",
      "event": "prepare",
      "fields": ["preview[images][0][resolutions]"]
    }

I've tried a few variations but no dice.

"fields": ["preview[images][][resolutions]"]
"fields": ["preview[images][N][resolutions]"]
"fields": ["preview['images'][0]['resolutions']"]

@YuanGYao

YuanGYao commented Mar 8, 2024

Hello, I left a comment in #4168 . Does the _pagination method of the WeiboExtractor class in weibo.py return when data["list"] is an empty list?
When I used gallery-dl to batch download the album page of Weibo, the download also appeared incomplete.
Through testing on the web page, I found that Weibo's getImageWall api sometimes returns an empty list when the image is not completely loaded. I think this may be what causes gallery-dl to terminate the download.

@mikf
Owner Author

mikf commented Mar 8, 2024

@taskhawk
fields selectors are quite limited and can't really handle lists.
You might want to use a python post processor (example) and write some code that does this.

def remove_resolutions(metadata):
    for image in metadata.get("preview", {}).get("images") or ():
        image.pop("resolutions", None)

(untested; the .get() chain guards against posts where preview and/or images is missing)
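Wiring a function like this up in the config might look something like the following (the ~/remove_res.py path and file name are made up for illustration):

```json
"reddit": {
    "postprocessors": [{
        "name": "python",
        "event": "prepare",
        "function": "~/remove_res.py:remove_resolutions"
    }]
}
```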

@YuanGYao
Yes, the code currently stops when Weibo's API returns no more results (empty list).
This is probably not ideal, as I've hinted at in #4168 (comment)

@YuanGYao

YuanGYao commented Mar 9, 2024

@mikf
Well, I think for Weibo's album page, since_id should be used to determine whether the image is fully loaded.
I updated my comment in #4168 (comment) and attached the response returned by Weibo's getImageWall API.
I think this should help solve this problem.

@BakedCookie

Not sure if I'm missing something, but are directory specific configurations exclusive to running gallery-dl via the executable?

Basically, I have a directory for regular tags, and a directory for artist tags. For regular tags I use "directory": ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"] since the tag number is manageable. For artist tags though, there's way more of them so this "directory": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"] makes more sense.

So right now the only way I know to get this per-directory configuration to work, is to copy the gallery-dl executable everywhere I want to use a master configuration override. Am I missing something? It feels like there should be a better way.

@Hrxn
Contributor

Hrxn commented Mar 11, 2024

Huh? No, the configuration always works the same way. You're simply using different configuration files?

@BakedCookie

@Hrxn

From the readme:

When run as executable, gallery-dl will also look for a gallery-dl.conf file in the same directory as said executable.

It is possible to use more than one configuration file at a time. In this case, any values from files after the first will get merged into the already loaded settings and potentially override previous ones.

I want to override my master configuration %APPDATA%\gallery-dl\config.json in specific directories with a local gallery-dl.conf but it seems like that's only possible with the standalone executable.

@taskhawk

taskhawk commented Mar 11, 2024

You can load additional configuration files from the console with:

-c, --config FILE           Additional configuration files

You just need to specify the path to the file and any options there will overwrite your main configuration file.

Edit: From my understanding, yeah, automatic loading of local config files in each directory is only possible by having the standalone executable in each directory. Are different directory options the only thing you need?

@BakedCookie

@taskhawk

Thanks, that's exactly what I was looking for! Guess I didn't read the documentation thoroughly enough.

For now the only thing I'd want to override is the directory structure for artist tags. I don't think it's possible to determine from the metadata alone if a given tag is the name of an artist or not, so I thought the best way to go about it is to just have a separate directory for artists, and use a configuration override. So yeah, loading that override with the -c flag works great for that purpose, thanks again!

@taskhawk

taskhawk commented Mar 11, 2024

You kinda can, but you need to enable tags for Gelbooru in your configuration to get them, which will require an additional request:

    "gelbooru": {
      "directory": {
        "search_tags in tags_artists": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                           : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      },
      "tags": true
    },

Set "tags": true in your config and run a test with gallery-dl -K "https://gelbooru.com/index.php?page=post&s=list&tags=TAG" so you can see the tags_* keywords.

Of course, this depends on the artists being correctly tagged. Not sure if it happens on Gelbooru, but at least in other boorus and booru-like sites I've come across posts with the artist tagged as a general tag instead of an artist tag. Another limitation is that your search tag can only include one artist at a time, doing more will require a more complex expression to check all tags are present in tags_artists.

What I do instead is that I inject a keyword to influence where it will be saved, like this:

gallery-dl -o keywords='{"search_tags_type":"artists"}' "https://gelbooru.com/index.php?page=post&s=list&tags=ARTIST"

And in my config I have

    "gelbooru": {
      "directory": ["boorus", "{search_tags_type}", "{search_tags}"]
    },

You can have:

    "gelbooru": {
      "directory": {
        "search_tags_type == 'artists'": ["{category}", "{search_tags[0]!u}", "{search_tags}", "{date:%Y}", "{date:%m}"],
        ""                             : ["{category}", "{search_tags}", "{date:%Y}", "{date:%m}"]
      }
    },

You can do this for other tag types, like general, copyright, characters, etc.

Because it's a chore to type that option every time I made a wrapper script, so I just call it like this because artists is my default:

~/script.sh "TAG"

For other tag types I can do:

~/script.sh --copyright "TAG"
~/script.sh --characters "TAG"
~/script.sh --general "TAG"
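A wrapper along those lines could be sketched in Python like this (the flag handling and default are guesses at what such a script might do; taskhawk's actual script is not shown in the thread):

```python
import json

def build_command(args):
    """Turn ['--copyright', 'TAG'] or just ['TAG'] into a gallery-dl argument list.

    Defaults to the 'artists' tag type, matching the workflow described above.
    """
    tag_type = "artists"                      # default tag type
    if args and args[0].startswith("--"):
        tag_type = args[0][2:]                # e.g. --copyright -> copyright
        args = args[1:]
    keywords = json.dumps({"search_tags_type": tag_type})
    url = "https://gelbooru.com/index.php?page=post&s=list&tags=" + args[0]
    return ["gallery-dl", "-o", "keywords=" + keywords, url]

print(build_command(["--copyright", "TAG"]))
```

The returned list can then be passed to subprocess.run() to actually invoke gallery-dl.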

@BakedCookie

Thanks for pointing out there's a tags option available for the gelbooru extractor. I already used it in the kemono extractor to get the name of the artist, but it didn't occur to me that gelbooru might also have such an option (and just accepted that the tags aren't categorized).

For artists I store all the URLs in their respective gelbooru.txt, rule34.txt, etc. files like so:

https://gelbooru.com/index.php?page=post&s=list&tags=john_doe
https://gelbooru.com/index.php?page=post&s=list&tags=blue-senpai
https://gelbooru.com/index.php?page=post&s=list&tags=kaneru
...

And then just run gallery-dl -c gallery-dl.conf -i gelbooru.txt. Since the search_tags ends up being the artist anyway, getting tags_artists is probably not worth the extra request. Same for general tags, and copyright tags, in their respective directories. With this workflow I can't immediately see where I'd be able to utilize keyword injection, but it's definitely a useful feature that I'll keep in mind.

@Wiiplay123
Contributor

When I'm making an extractor, what do I do if the site doesn't have different URL patterns for different page types? Every single page is just a numerical ID that could be a forum post, image, blog post, or something completely different.

@mikf
Owner Author

mikf commented Mar 19, 2024

@Wiiplay123 You handle everything with a single extractor and decide what type of result to return on the fly. The gofile code is a good example of this, I think, or aryion.

@I-seah

I-seah commented Mar 20, 2024

Hi, what options should I use in my config file to change the format of dates in metadata files? I would like to use "%Y-%m-%dT%H:%M:%S%z" for the values of "date" and "published" (from coomer/kemono downloads).

And would it also be possible to do this for json files that ytdl creates? I downloaded some videos with gallery-dl but the dates got saved as "upload_date": "20230910" and "timestamp": 1694344011, so I think it might be better to convert the timestamp to a date to get a more precise upload time, but I'm not sure if it's possible to do that either.

@throwaway26425

throwaway26425 commented Mar 31, 2024

hi, how do I download files from oldest to newest?

I'm using this:

https://www.instagram.com/{my_user}/saved/all-posts/

and I need to start downloading from the oldest posts first, how do I do that?

@JailSeed

Hi! Is it possible to download posts from Pixiv from a specified bookmark page? For example I want to download not all bookmarks but only from page 2. I tried this URL /bookmarks/artworks?p=2 but gallery-dl still downloads all my bookmarks.

@Hrxn
Contributor

Hrxn commented Apr 1, 2024

@mikf
Would a formatting option like "{title!t:?/__/R__//}" be legitimate?
Would an order of operations like this

  1. trim
  2. replace '__' with ''
  3. append '__' if title

be possible?

@mikf
Owner Author

mikf commented Apr 1, 2024

@throwaway26425
Not possible, especially not with how IG returns its results.
You could theoretically grab all download links (-g) from newest to oldest, reverse their order, and then download those.

@JailSeed
Not really supported. You could use --range, but that selects by file count and not post count.

@Hrxn
This would work, but it probably crashes when there's no title. !t would need to be applied after ? or at least after some form of check that title is a string.

This could be more reliably done in an f-string:

\fF …{title.strip().replace("__", "") + "__" if title else ""}…
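In plain Python, the expression inside that f-string behaves like:

```python
def format_title(title):
    # strip whitespace, drop "__" sequences, append "__" only for non-empty titles
    return title.strip().replace("__", "") + "__" if title else ""

print(format_title(" My__Title "))  # MyTitle__
print(format_title(""))             # (empty string)
```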

@Hrxn
Contributor

Hrxn commented Apr 1, 2024

@mikf Thanks, that helps.
Agree about the f-string part, but I think in this case the site always provides a title, so I don't think there's anything that argues against continuing to use "{title!t:?/__/R__//}".

@Hrxn
Contributor

Hrxn commented Apr 3, 2024

@mikf The scenario: Submission on reddit, hosted on redgifs, but it's actually an image (yes, I know.. edge case. But I've seen it at least once)

I believe it should be possible to solve this with a conditional directory setting using what we already got in gallery-dl, but I'm not sure.

Accessing metadata coming from reddit can be done with locals().get('_reddit_'), but I'm unsure if we can proceed from there without breaking anything.

Example from -K on a reddit link:

is_video
  False

but at the same time

media['oembed']['type']
  video

and

post_hint
  rich:video

which.. totally makes sense..

The easiest way would probably be something like this

                "directory": {
                    "'_reddit_' in locals() and extension in ('mp4', 'webm')" : ["Video"],
                    "'_reddit_' in locals() and extension in ('gif', 'apng')" : ["Gif"],
                    "'_reddit_' in locals() and extension in ('jpg', 'png')"  : ["Picture"],

using extension from redgifs, which already exists! But it does not work for "directory", because it's a metadata entry for "file and filter". Would it be very complicated to make extension also available as a directory metadata value?

@mikf
Owner Author

mikf commented Apr 3, 2024

Wouldn't a classify post processor work here?

It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.

@Hrxn
Contributor

Hrxn commented Apr 4, 2024

classify would work, and I'm using it for everything in redgifs except for the image subcategory. There I need to differentiate between downloading a single item directly on redgifs vs. a submission on reddit hosted on redgifs, and I'm not sure how to achieve that otherwise.

config excerpt (a bit simplified), giving me the output paths I'm using for a while now and would like to keep:

"redgifs":
{
    "image":
    {
        "directory": {
            "'_reddit_' in locals()": ["+Clips"],
            "locals().get('bkey')"  : ["Redgifs", "Clips", "{bkey}"],
            ""                      : ["Redgifs", "Clips", "Unsorted"]
        }
    }
}

I'm using "parent-directory": true and "parent-metadata": "_reddit_" for reddit, obviously, and the result is basically this:

input URL from..   Output Destination
redgifs            "base-directory" / Redgifs / <bkey | Unsorted> / <filename with metadata from redgifs only>
reddit             "base-directory" / Reddit / Submissions / <subreddit title> / <bkey | Unsorted> / +Clips / [1]

[1] = <filename with metadata from redgifs and from _reddit_>

This is an example with a direct submission link from reddit, but it works the same with different categories from reddit (with a different "prefix" name instead of Submissions, of course)

It wouldn't really be complicated to make extension available for directories, but it is kind of wrong given the current "directory" semantics.

Ah, okay. I thought this would be just one more metadata field, basically, without breaking anything.
Best to forget this approach then, I'll see if I can come up with another one.

@mikf
Owner Author

mikf commented Apr 4, 2024

Wouldn't it be possible to use reddit>redgifs as category to distinguish Reddit-posted Redgifs links from regular ones and only use the post processor there?

"reddit>redgifs":
{
    "image":
    {
        "directory": ["+Clips"],
        "postprocessors": ["classify"]
    }
},
"redgifs":
{
    "image":
    {
        "directory": {
            "locals().get('bkey')"  : ["Redgifs", "Clips", "{bkey}"],
            ""                      : ["Redgifs", "Clips", "Unsorted"]
        }
    }
}

@Hrxn
Contributor

Hrxn commented Apr 4, 2024

Good idea. Almost forgot that this option exists.
To be honest, I've never used this "new" extractor>child-extractor option syntax.

Seems like it should be the right fit for such a task. But does this change anything with regard to how the "archive" option works? Or is this just an additional step, i.e. the options in "reddit>redgifs", for example, get simply added "on the top", and everything else like archive options etc. is kept as is?

@mikf
Owner Author

mikf commented Apr 4, 2024

Yep, this is just an additional step. It will still load options from "redgifs" when they are not specified for "reddit>redgifs".

@AyHa1810

@mikf I would like to use base-directory as a keyword within the config file to use relative paths within the base directory without it breaking when I use --directory, as in the example below:

"pixiv": {
    "postprocessors": [
        {
            "name": "python",
            "event": "prepare",
            "function": "{base-directory}/utils.py:pixiv_tags"
        }
    ]
}

is it possible to do so?

@Hrxn
Contributor

Hrxn commented Apr 11, 2024

@mikf Congrats for making it into the GitHub 10k stars club! 🔥

@mikf
Owner Author

mikf commented Apr 11, 2024

@AyHa1810
The path for function (paths in general, really) does not support {…} replacement fields, only environment variables and home directories ~. Otherwise you'd be able to access base-directory by enabling metadata-path. It would probably be best to define an environment variable and use it for both base-directory and function.

Also, --directory overrides your config's base-directory value, so accessing it then wouldn't even result in the same value as the one specified in your config.
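For example (GDL_HOME is a made-up variable name here; paths in the config get environment variables expanded):

```json
"extractor": {
    "base-directory": "${GDL_HOME}/downloads",
    "pixiv": {
        "postprocessors": [{
            "name": "python",
            "event": "prepare",
            "function": "${GDL_HOME}/utils.py:pixiv_tags"
        }]
    }
}
```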

@Hrxn
Thank you.

@fireattack
Contributor

fireattack commented Apr 13, 2024

Correct me if I'm wrong, but it looks like when using something like "skip": "abort:1" together with "archive", not only items already in the archive are counted as "skipped" (so they count towards the abort threshold), but also ones whose files already exist.

Is there a way to make it only count existing items in "archive" as skipped, but not the ones that have existing files (while preferably still not redownloading these)?

Basically, what I want to accomplish is a way to periodically download all posts until reaching the last downloaded record (hence abort:1). But between two download sessions I may have already downloaded some of these posts manually and put the files into the folder. I don't want those to terminate my download session prematurely.

@Hrxn
Contributor

Hrxn commented Apr 13, 2024

@fireattack Suggestion: Use different "archive-format" settings for different sub-extractors, this way you can download some posts manually and entire user profiles etc. independent of each other.

@docholllidae

hello, i don't really understand what the difference is between sleep and sleep-request.
could someone eli5 please, particularly in the context of downloading a twitter profile?

@AyHa1810

@AyHa1810 The path for function (paths in general, really) does not support {…} replacement fields, only environment variables and home directories ~. Otherwise you'd be able to access base-directory by enabling metadata-path. It would probably be best to define an environment variable and use it for both base-directory and function.

Also, --directory overrides your config's base-directory value, so accessing it then wouldn't even result in the same value as the one specified in your config.

@Hrxn Thank you.

I mean, before it gets converted to a path it's just a string, right? So it should be possible imo

also yeah, I do want it to be overridden by the --directory option :P

@mikf
Owner Author

mikf commented Apr 15, 2024

@fireattack
Files skipped by archive or existing files are treated the same and there is currently no way to separate them.

@docholllidae
--sleep causes gallery-dl to sleep before each file download.
--sleep-request causes gallery-dl to sleep before each non-download HTTP request like loading a webpage, API calls, etc.

It is usually the latter that gets restricted by some sort of rate limit, as is the case for Twitter.
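As a config sketch (the ranges below are arbitrary illustration values, not recommendations):

```json
"twitter": {
    "sleep": "1.0-3.0",
    "sleep-request": "6.0-12.0"
}
```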

@AyHa1810

@mikf the env var method works, thanks for the suggestion!

@taskhawk

taskhawk commented Apr 16, 2024

I think I haven't come across any ugoira using PNGs for its images. Does anyone have an example they could share?

@throwaway26425

throwaway26425 commented Apr 22, 2024

how do I prevent myself from getting banned on Instagram?

I'm currently using:

--sleep 2-10
--sleep-request 15-45

should I increase those numbers?? how much?

(are there any other parameters that I can use to prevent myself from being banned on IG?)

JackTildeD added a commit to JackTildeD/gallery-dl-forked that referenced this issue Apr 24, 2024
* save cookies to tempfile, then rename

avoids wiping the cookies file if the disk is full

* [deviantart:stash] fix 'index' metadata (mikf#5335)

* [deviantart:stash] recognize 'deviantart.com/stash/…' URLs

* [gofile] fix extraction

* [kemonoparty] add 'revision_count' metadata field (mikf#5334)

* [kemonoparty] add 'order-revisions' option (mikf#5334)

* Fix imagefap extractor

* [twitter] add 'birdwatch' metadata field (mikf#5317)

should probably get a better name,
but this is what it's called internally by Twitter

* [hiperdex] update URL patterns & fix 'manga' metadata (mikf#5340)

* [flickr] add 'contexts' option (mikf#5324)

* [tests] show full path for nested values

'user.name' instead of just 'name' when testing for
"user": { … , "name": "…", … }

* [bluesky] add 'instance' metadata field (mikf#4438)

* [vipergirls] add 'like' option (mikf#4166)

* [vipergirls] add 'domain' option (mikf#4166)

* [gelbooru] detect returned favorites order (mikf#5220)

* [gelbooru] add 'date_favorited' metadata field

* Update fapello.py

get fullsize image instead of resized

* fapello.py Fullsize image

by removing ".md" and ".th" in the image url, it will download fullsize images

* [formatter] fix local DST datetime offsets for ':O'

'O' would get the *current* local UTC offset and apply it to all
'datetime' objects it gets applied to.
This would result in a wrong offset if the current offset includes
DST and the target 'datetime' does not or vice-versa.

'O' now determines the correct local UTC offset while respecting DST for
each individual 'datetime'.

* [subscribestar] fix 'date' metadata

* [idolcomplex] support new pool URLs

* [idolcomplex] fix metadata extraction

- replace legacy 'id' values with alphanumeric ones, since the former are
  no longer available
- approximate 'vote_average', since the real value is no longer
  available
- fix 'vote_count'

* [bunkr] remove 'description' metadata

album descriptions are no longer available on album pages
and the previous code erroneously returned just '0'

* [deviantart] improve 'index' extraction for stash files (mikf#5335)

* [kemonoparty] fix exception for '/revision/' URLs

caused by 03a9ce9

* [steamgriddb] raise proper exception for deleted assets

* [tests] update extractor results

* [pornhub:gif] extract 'viewkey' and 'timestamp' metadata (mikf#4463)

mikf#4463 (comment)

* [tests] use 'datetime.timezone.utc' instead of 'datetime.UTC'

'datetime.UTC' was added in Python 3.11
and is not defined in older versions.

* [gelbooru] add 'order-posts' option for favorites (mikf#5220)

* [deviantart] handle CloudFront blocks in general (mikf#5363)

This was already done for non-OAuth requests (mikf#655)
but CF is now blocking OAuth API requests as well.

* release version 1.26.9

* [kemonoparty] fix KeyError for empty files (mikf#5368)

* [twitter] fix pattern for single tweet (mikf#5371)

- Add optional slash
- Update tests to include some non-standard tweet URLs

* [kemonoparty:favorite] support 'sort' and 'order' query params (mikf#5375)

* [kemonoparty] add 'announcements' option (mikf#5262)

mikf#5262 (comment)

* [wikimedia] suppress exception for entries without 'imageinfo' (mikf#5384)

* [docs] update defaults of 'sleep-request', 'browser', 'tls12'

* [docs] complete Authentication info in supportedsites.md

* [twitter] prevent crash when extracting 'birdwatch' metadata (mikf#5403)

* [workflows] build complete docs Pages only on gdl-org/docs

deploy only docs/oauth-redirect.html on mikf.github.io/gallery-dl

* [docs] document 'actions' (mikf#4543)

or at least attempt to

* store 'match' and 'groups' in Extractor objects

* [foolfuuka] improve 'board' pattern & support pages (mikf#5408)

* [reddit] support comment embeds (mikf#5366)

* [build] add minimal pyproject.toml

* [build] generate sdist and wheel packages using 'build' module

* [build] include only the latest CHANGELOG entries

The CHANGELOG is now at a size where it takes up roughly 50kB or 10% of
an sdist or wheel package.

* [oauth] use Extractor.request() for HTTP requests (mikf#5433)

Enables using proxies and general network options.

* [kemonoparty] fix crash on posts with missing datetime info (mikf#5422)

* restore LD_LIBRARY_PATH for PyInstaller builds (mikf#5421)

* remove 'contextlib' imports

* [pp:ugoira] log errors for general exceptions

* [twitter] match '/photo/' Tweet URLs (mikf#5443)

fixes regression introduced in 40c0553

* [pp:mtime] do not overwrite '_mtime' for None values (mikf#5439)

* [wikimedia] fix exception for files with empty 'metadata'

* [wikimedia] support wiki.gg wikis

* [pixiv:novel] add 'covers' option (mikf#5373)

* [tapas] add 'creator' extractor (mikf#5306)

* [twitter] implement 'relogin' option (mikf#5445)

* [docs] update docs/configuration links (mikf#5059, mikf#5369, mikf#5423)

* [docs] replace AnchorJS with custom script

use it in rendered .rst documents as well as in .md ones

* [text] catch general Exceptions

* compute tempfile path only once

* Add warnings flag

This commit adds a warnings flag

It can be combined with -q / --quiet to display warnings.
The intent is to provide a silent option that still surfaces
warning and error messages so that they are visible in logs.

* re-order verbose and warning options

* [gelbooru] improve pagination logic for meta tags (mikf#5478)

similar to 494acab

* [common] add Extractor.input() method

* [twitter] improve username & password login procedure (mikf#5445)

- handle more subtasks
- support 2FA
- support email verification codes

* [common] update Extractor.wait() message format

* [common] simplify 'status_code' check in Extractor.request()

* [common] add 'sleep-429' option (mikf#5160)

* [common] fix NameError in Extractor.request()

… when accessing 'code' after a requests exception was raised.

Caused by the changes in 566472f

* [common] show full URL in Extractor.request() error messages

* [hotleak] download files with 404 status code (mikf#5395)

* [pixiv] change 'sanity_level' debug message to a warning (mikf#5180)

* [twitter] handle missing 'expanded_url' fields (mikf#5463, mikf#5490)

* [tests] allow filtering extractor result tests by URL or comment

python test_results.py twitter:+/i/web/
python test_results.py twitter:~twitpic

* [exhentai] detect CAPTCHAs during login (mikf#5492)

* [output] extend 'output.colors' (mikf#2566)

allow specifying ANSI colors for all loglevels
(debug, info, warning, error)

* [output] enable colors by default

* add '--no-colors' command-line option

---------

Co-authored-by: Luc Ritchie <luc.ritchie@gmail.com>
Co-authored-by: Mike Fährmann <mike_faehrmann@web.de>
Co-authored-by: Herp <asdf@qwer.com>
Co-authored-by: wankio <31354933+wankio@users.noreply.github.com>
Co-authored-by: fireattack <human.peng@gmail.com>
Co-authored-by: Aidan Harris <me@aidanharr.is>
@Immueggpain

How do I put the artist name in the file path for e-hentai? The "artist:xxx" entry is in tags, and I can't find a variable to use for "directory": ["{artist}"].

@mikf
Owner Author

mikf commented Apr 25, 2024

@taskhawk
I slightly modified the Danbooru extractor to have it go through all ugoira posts uploaded there (https://danbooru.donmai.us/posts?tags=ugoira), and none of them had .png frames.
I'm aware that this is just a small subset, but at least its data can be accessed a lot faster than on Pixiv itself.

@throwaway26425
Using the same --user-agent string as the browser you got your cookies from might help.
Updating the HTTP headers sent during API requests is also something that needs to be done again ...

@Immueggpain
See #2117

@throwaway26425

throwaway26425 commented Apr 26, 2024

Using the same --user-agent string as the browser you got your cookies from might help.

I'm using -o browser=firefox, is that the same?

or, do I need to use both?

-o browser=firefox
--user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"

Updating the HTTP headers sent during API requests is also something that needs to be done again.

I don't understand this, can you please explain it better? :(

@mikf
Owner Author

mikf commented Apr 26, 2024

I'm using -o browser=firefox, is that the same?

browser=firefox overrides your user agent to Firefox 115 ESR, regardless of your --user-agent setting. It also sets a bunch of extra HTTP headers and TLS cipher suites to somewhat mimic a real browser, but maybe you're better off without this option.

I don't understand this, can you please explain it better? :(

The instagram code sends specific HTTP headers when making API requests, which might now be out-of-date, meaning I should update them again. The last time I did this was October 2023 (969be65).

@fireattack
Contributor

fireattack commented May 3, 2024

I'm pretty sure this has been asked before but can't find it.

My goal is to run gallery-dl as a module to download, while also getting a record of processed posts (URLs, post IDs) so I can use that info in some custom functions.

I've read #642, but I still don't quite get it. It looks like you have to use DownloadJob for downloading, but in parallel use DataJob (or even a customized Job) to get the data?

My current code is pretty simple, just

def load_config():
    ....
def set_config(user_id):
    ....

def update(user_id):
    load_config()
    profile_url = set_config(user_id)
    job.DownloadJob(profile_url).run()

I tried to patch DownloadJob's handle_url so I can save the URLs and metadata into something like self.mydata, but that isn't enough because handle_queue creates a new job with job = self.__class__(extr, self) for the actual downloading, which makes it more complicated than I'd like to pass the data back to the "parent" instance.

So I'm curious if there is an easier way to just do it other than re-write a whole new Job? Thanks in advance.
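Since handle_queue spawns children via self.__class__(extr, self), one pattern is a subclass that shares a single list through the parent link. A minimal sketch of that pattern (using a stub base class here instead of the real gallery_dl.job.DownloadJob, so the shape is illustrative only):

```python
class StubJob:
    """Stand-in for gallery_dl.job.DownloadJob (illustrative only)."""
    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent

    def handle_url(self, url, kwdict):
        pass  # the real class downloads the file here

class RecordingJob(StubJob):
    """Collects (url, id) pairs; child jobs share the parent's list."""
    def __init__(self, url, parent=None):
        super().__init__(url, parent)
        # reuse the parent's list so nested jobs append to the same record
        self.records = parent.records if isinstance(parent, RecordingJob) else []

    def handle_url(self, url, kwdict):
        self.records.append((url, kwdict.get("id")))
        return super().handle_url(url, kwdict)

parent = RecordingJob("https://example.org/user/profile")
child = RecordingJob("https://example.org/post/1", parent)  # as handle_queue would do
child.handle_url("https://example.org/file.jpg", {"id": 42})
print(parent.records)  # [('https://example.org/file.jpg', 42)]
```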

@throwaway26425

throwaway26425 commented May 7, 2024

does gallery-dl mark Instagram stories as "Seen" when you're downloading them? (using cookies)

@climbTheStairs

I have a suggestion, though I'm not sure how feasible or practical it would be.

Current behavior:

  • twitter num starts at 1 for all posts
  • pixiv num starts at 0 for all posts
  • reddit num starts at 1 for posts containing multiple images and is 0 for posts containing one

Could the behavior for indices be made consistent across all sites?
