Extractor documentation #6489
Replies: 2 comments
For what it's worth, I've attempted some of this myself. I'm not a strong programmer, but thought doing some documentation might help me understand my gaps. Here is my attempt: https://github.com/SpiffyChatterbox/gallery-dl/wiki There are also some other discussions: #5750
I've taken another stab at this, to see if another format is useful. Please take a look and provide feedback, or edit it yourself: https://github.com/mikf/gallery-dl/wiki/Developing-Extractors Full transparency - I used an AI to create this, so it may have errors. I read through it myself and found a few things, but I'm not a gallery-dl expert, so I may have missed something.
The purpose of this discussion is to improve the documentation on how to create extractors.
I've noticed that there is no proper doc regarding this topic and it was asked multiple times in the past.
The best discussion I found that goes into more detail on how to create an extractor is this one: #1656
But it didn't properly explain how to use `Message.Queue`, for which I managed to find slightly more info here: #1345
Also, the Message class has a docstring that explains a little of what it does, which also helped me understand what `Message.Queue` does.
So, I tried compiling as much information as I could find about this topic, taking into account the above 2 discussions.
Hopefully, this can later be added into the Wiki for easy access:
https://github.com/mikf/gallery-dl/wiki
Contributions are greatly appreciated, as there might be some things I got wrong or left incomplete.
I am still working on more details on this doc, but for now, this is what I have.
@mikf, if possible, can you confirm whether the below is correct, or if there is something I got wrong?
Extractor
To create an extractor, the following 2 methods are important:
- `items()` should run the below statements:
  - `yield Message.Directory, metadata`: sets the target directory for the files that follow.
  - `yield Message.Url, url, metadata`: passes on the URL of a file to download, together with its metadata.
  - `yield Message.Queue, url, metadata`: passes on a URL that another extractor should handle. For example, `yield Message.Queue, youtube_url, metadata` to use the youtube extractor to download the video. To pick the extractor yourself, set the `_extractor` property and pass it as metadata.
- `request()` is used for HTTP requests. It works more or less like requests' `session.request()`, in that you'd do something like `self.request(url, params=params, headers=headers).json()` to, for example, fetch a JSON resource.

The below attributes are also important for your extractor:
- `category`: The category of the extractor. You can think of this as the name of the target site. For example, in an extractor for facebook, you would put `facebook` here. If the same extractor works for multiple sites (like all Wordpress sites using the Madara theme), you would put `wordpress-madara` or something like that.
- `subcategory`: What you are extracting. For example, instagram has multiple subcategories like posts, stories, reels, tagged posts, and others.
- `*_fmt`: The default format strings (overridden by the config), each with its own purpose:
  - `directory_fmt`: The default name of the directory when downloading items, e.g. `directory_fmt = ("{category}", "{username}")`
  - `filename_fmt`: The default filename of the target item to download, e.g. `filename_fmt = "{media_id}.{extension}"`
  - `archive_fmt`: The default ID under which an item is stored in the archive, e.g. `archive_fmt = "{media_id}"`
- `pattern`: The regular expression that should match all URLs the extractor can handle. The resulting match object is the first real argument of an extractor's `__init__()`.

Sample code
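Before looking at a full extractor, the attributes listed above can be tried in isolation. This is only a rough sketch: the site `example.org`, the URL layout, and the metadata values are made up, and plain `str.format()` stands in for gallery-dl's own formatter, which supports more syntax than shown here.

```python
import re

# Hypothetical values for an imaginary site; only the attribute
# names (pattern, *_fmt) come from the description above.
pattern = r"(?:https?://)?(?:www\.)?example\.org/user/([^/?#]+)/posts"
directory_fmt = ("{category}", "{username}")
filename_fmt = "{media_id}.{extension}"
archive_fmt = "{media_id}"

# The match object is what an extractor's __init__() receives.
match = re.match(pattern, "https://example.org/user/alice/posts")
username = match.group(1)

# Made-up metadata of the kind an extractor would yield with a file URL.
metadata = {
    "category": "example",
    "username": username,
    "media_id": 42,
    "extension": "jpg",
}

# Fill each format string with the metadata fields.
directory = "/".join(part.format(**metadata) for part in directory_fmt)
filename = filename_fmt.format(**metadata)
archive_id = archive_fmt.format(**metadata)

print(directory)   # example/alice
print(filename)    # 42.jpg
print(archive_id)  # 42
```

The capture group in `pattern` is what lets the extractor recover the username from the URL, and the same metadata dictionary feeds all three format strings.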
The sample is split into two classes.
First, we create the `BaseExampleExtractor` class, inheriting from gallery-dl's `Extractor` class. The `BaseExampleExtractor` class contains the base configuration for all your extractors.
Then, we create the `ExamplePostsExtractor` class, inheriting from `BaseExampleExtractor`, which implements the actual logic to download the items.
Separating the logic this way keeps things organized and scalable. Sure, you could use only 1 class and put all the logic in it. But that is not easy to scale or maintain, and it gets hard to read if you have multiple types (or subcategories) of items being downloaded in the same extractor.
As a rule, try to have 1 extractor class per subcategory, and have each extractor class inherit from a base class with the default config for all your extractors.
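The two-class layout described above could look roughly like the following. This is a minimal sketch, not the original sample: `Message`, `Extractor`, and `FakeResponse` are small stand-ins so the code runs without gallery-dl or network access, and the site `example.org`, its API path, and the post fields are invented. In a real extractor module you would import `Extractor` and `Message` from gallery-dl instead of defining them.

```python
import re


class Message:
    """Stand-in for gallery-dl's Message identifiers."""
    Directory, Url, Queue = "directory", "url", "queue"


class FakeResponse:
    """Stand-in for an HTTP response that only supports .json()."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class Extractor:
    """Stand-in for gallery-dl's Extractor base class."""

    def __init__(self, match):
        self.match = match

    def request(self, url, **kwargs):
        # Canned data instead of a real HTTP request.
        return FakeResponse([
            {"media_id": 1, "file": "https://example.org/media/1.jpg"},
            {"media_id": 2, "video": "https://video.example/watch/2"},
        ])


class BaseExampleExtractor(Extractor):
    """Shared configuration for every extractor of the example site."""
    category = "example"
    root = "https://example.org"
    directory_fmt = ("{category}", "{username}")
    filename_fmt = "{media_id}.{extension}"
    archive_fmt = "{media_id}"


class ExamplePostsExtractor(BaseExampleExtractor):
    """Extractor for the posts of a single user."""
    subcategory = "posts"
    pattern = r"(?:https?://)?example\.org/user/([^/?#]+)/posts"

    def __init__(self, match):
        super().__init__(match)
        self.user = match.group(1)

    def items(self):
        url = f"{self.root}/api/user/{self.user}/posts"
        posts = self.request(url).json()

        yield Message.Directory, {"username": self.user}
        for post in posts:
            post["username"] = self.user
            if "file" in post:
                # A file on the site itself: download it directly.
                post["extension"] = post["file"].rpartition(".")[2]
                yield Message.Url, post["file"], post
            else:
                # An external video: let another extractor handle it.
                yield Message.Queue, post["video"], post


match = re.match(ExamplePostsExtractor.pattern,
                 "https://example.org/user/alice/posts")
extractor = ExamplePostsExtractor(match)
results = list(extractor.items())
for message in results:
    print(message)
```

A second subcategory (say, stories) would be another small class inheriting from `BaseExampleExtractor`, with its own `subcategory`, `pattern`, and `items()`, while the shared format strings stay in one place.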