Make incremental downloading of new images easier (scraping) #5255
I saw this but I think it's not adequate for this job. It could fail under weird circumstances, like when not downloading retweets and the user retweeted a large number of images.
Use a download archive instead.

Use a metadata postprocessor that parses the string and overwrites the original:

```json
{
    "fix-kemono-date":
    {
        "name": "metadata",
        "mode": "modify",
        "event": "prepare",
        "fields": {
            "date": "{added[:20]:RT/ /}"
        }
    }
}
```

This should leave it in the same format.
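(For what it's worth, my reading of that format string: `[:20]` keeps the first 20 characters of `added` and `:RT/ /` swaps the "T" for a space, so in plain Python it would be roughly the following, assuming `added` is an ISO-style timestamp.)

```python
added = "2023-06-01T12:34:56"          # hypothetical value of kemono's `added` field
date = added[:20].replace("T", " ")    # slice, then swap the ISO "T" separator for a space
print(date)                            # -> "2023-06-01 12:34:56"
```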
I didn't get this one.
You can probably use a metadata postprocessor here too, to make most of that metadata available as fields so you don't have to go extracting it later.
You can use an I haven't messed with retweets so I have no clue with those.
The skip option already covers that. |
To summarize what my script does compared to native gallery-dl features:
I'm not convinced that In theory, What is the purpose of A date-based approach also makes sense because you could stop fetching at text-only tweets. Unfortunately this doesn't work with In the rest of the post I'm replying with details; you can skip it if it's too much.
This is a good feature I wasn't aware of yet. But it lacks one thing my script does: it deletes files older than the time window I'm setting with the date filter. The `--download-archive` file only stores the filenames, no date, so old entries can't be removed. If it saved a date, you could prune old entries with a SQL statement. Maybe a download date column could be added to it? (I started my script 4 years ago. I have almost 10000 files saved with it, and these are only the files I manually selected to keep. Yes, the list of files is going to build up.)
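Just to make the pruning idea concrete: if the archive ever grew a download-date column, old entries could be dropped with a single SQL statement. A minimal sketch, assuming a hypothetical `date` column alongside the existing `entry` column (the current archive files have no such column, and the table/column names should be checked against your own file):

```python
import sqlite3
import time

ARCHIVE = "twitter-archive.sqlite3"     # hypothetical archive path
CUTOFF = time.time() - 7 * 24 * 3600    # same one-week window as the date filter

con = sqlite3.connect(ARCHIVE)
# "archive"/"entry" mirror gallery-dl's current schema; "date" is the
# hypothetical extra column this whole idea depends on.
con.execute("DELETE FROM archive WHERE date < ?", (CUTOFF,))
con.commit()
con.close()
```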
Interesting, but still a bit clumsy. The hardest part is checking that the date format is correctly parsed. What I actually want is the following:
In this case, I don't care about the actual post date. But in general it would be awesome if all extractors had a "posted or uploaded at" metadata field that worked the same.
Calling Some sites don't seem to sort posts by date in all cases. Twitter normally does, except for retweets or pinned posts, so I added exceptions for these cases. (I want the retweets for some users.) For other sites, I gave up trying to use
If I don't need dummy files, I could just copy the entire gallery-dl output directory. The
Retweets are like linking an older post made by another user. There are two problems with retweets:
I assume you mean
Not much you can do about that, as that is a limitation of the site itself. This happens to me with booru and booru-like sites, where files can be tagged with the tags I'm interested in at any moment after I made a full run, and having just
Is a middle-ground option between aborting right away and continuing until the end, by giving it some headroom to grab new files that may have become available since the last run. If you adjust the abort number considering the number of results per "page", for example, you're basically saying "look for new files in the last n pages; if nothing is new, abort; if you find something new, keep looking for n more pages". I used to use it with Pixiv, but now I just do a normal
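As a rough illustration of that headroom idea (page size, archive path, and URL are placeholders, not anything from this thread): with `-A`/`--abort N`, which stops after N consecutive already-downloaded files, you can derive N from the per-page result count.

```python
import subprocess

PER_PAGE = 60   # approximate results per "page" for the site (assumption)
HEADROOM = 2    # keep scanning this many extra pages past the newest known file

subprocess.run([
    "gallery-dl",
    "--download-archive", "pixiv-archive.sqlite3",   # hypothetical archive path
    "-A", str(PER_PAGE * HEADROOM),                  # abort after N consecutive skips
    "https://www.pixiv.net/en/users/12345",          # placeholder URL
])
```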
It's as I said above, not much you can do about it if that's what the site offers. With Twitter however you can use the Because I don't deal with text tweets I'm not sure if it's possible but I feel like you should be able to record an entry for them in the archive so At least for Twitter I think you should move away from using dates to stop processing.
That's true, but it's not as bad as it sounds. My largest archive has 2,247,980 entries and a file size of only 163.2 MiB. That's peanuts compared to the 2.1 TiB of corresponding downloaded media. My Twitter archive has 1,589,556 entries and it's only 108.3 MiB against ~844 GiB of media. I don't notice much of a performance hit either. You can modify the format of the entries in the archive file; see https://gdl-org.github.io/docs/configuration.html
Yeah, a bit more uniformity on the dates among extractors would be better. That could extend to other shared data fields among extractors too. I have dealt with that by normalizing the metadata myself with postprocessors (preprocessors? most are triggered before downloading the file).
No, the https://gdl-org.github.io/docs/configuration.html You should check it out for more Twitter options; if you didn't know about the archive files, you probably don't know about other Twitter options that could be helpful to you.
Very interesting. Unfortunately things seem to be getting more complicated than I thought, instead of easier. It looks like Writing a script which just periodically fetches new images from an account remains a complicated task, at least if you want retweets and try to reduce bandwidth.
I just remembered one important case that sucks to use
Hey OP, hit me up if you ever finished writing that script, I'm trying to do the same thing here and also got stuck at only scraping new images! Lol
My script is way too weird and special-cased. It probably doesn't even solve the problem with scraping only new images. It would be better to come up with a way to make this easier to do with gallery-dl directly.
It's crazy that neither gallery-dl itself by default nor any of its forks can do such a supposedly simple task... Have you found any alternative?
No. gallery-dl supports many sites, and I'm using some of them (not just twitter). Also, coming up with a good method to do this isn't as simple as it seemed at first. Though when nitter still worked, I had my own code which processed its HTML.
I made a wrapper script that handles each site I download from differently. For twitter, it checks what the largest twitter id (the most recent one) in my local copy of the user is (using the username from the url to find the folder), then uses that id to add this argument to the command before running it: Edit: It's better to use
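For anyone trying the same thing, here is a rough sketch of that kind of wrapper. The filename pattern, paths, and the exact filter expression are my own assumptions (not necessarily what the commenter's script passes); it relies on the twitter extractor's `tweet_id` field and on `abort()` being callable inside `--filter` expressions.

```python
import re
import subprocess
from pathlib import Path

def newest_tweet_id(folder: Path) -> int:
    # Assumes downloaded filenames start with the numeric tweet id,
    # e.g. "1234567890123_1.jpg"; adjust the regex to your filename format.
    ids = [int(m.group(1)) for f in folder.iterdir()
           if (m := re.match(r"(\d+)", f.name))]
    return max(ids, default=0)

last_id = newest_tweet_id(Path("downloads/twitter/someuser"))   # hypothetical layout

# Fetch only tweets newer than the newest local one, and stop the run
# as soon as an older tweet shows up.
subprocess.run([
    "gallery-dl",
    "--filter", f"tweet_id > {last_id} or abort()",
    "https://twitter.com/someuser/media",
])
```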
I suspect this won't work too well. gallery-dl downloads from newest to oldest post. So if it gets interrupted and only some of the new posts got downloaded, you'll miss some posts. (Depending on how you call gallery-dl again.) It's even worse if you want retweets and sticky posts. This could be improved by
Just brainstorming after not thinking about it for a while and only reading your post.
I've just had an idea on how to potentially prevent the issue with
This sounds like a very good idea. Still might break easily with retweets and sticky posts. |
Hm. This is just what I need. Guessing there's no easy way to tell it last X posts or last X days with no need to keep updating the date range values in the script. :3
I agree!
keeps archive IDs in memory and only writes them to disk in a 'finalize' step
I'm running gallery-dl periodically on a list of twitter accounts (and other sites) to download new images. Doing this without wasting a lot of bandwidth (and getting throttled or blocked earlier) is pretty tricky, because gallery-dl doesn't seem to have a native mechanism to support this.
I'm facing the following problems with the following hacks to work them around:
- `--filter (date >= datetime.fromtimestamp(x))` (`x` replaced with the current UTC time minus the time range, about a week). `date` is the wrong field. Kemono has an `added` field in a different format, which I need to special-case and parse. On Kemono, `date` is the original publication date on the source site, while `added` is the Kemono upload date which is the correct field for this purpose.
- `or abort()` to stop network access completely.
- `or (user["name"] != author["name"])` to the filter.
- `-o extractor.twitter.pinned=false` to the command line.
- `--write-metadata` to the command line to recover some information, like actual author for retweets.
- `gallery-dl` like this, my script iterates the gallery-dl directory, looks for new images, restores the native filename (`filename` field from the metadata files), adds the author name to it, moves them to my output directory, and creates a dummy file to prevent gallery-dl from downloading it again (a rough sketch of this step follows below).

Shouldn't this be easier? At least making gallery-dl fetch only new images could probably be a builtin feature, instead of having to mess with site-specific filter hacks.
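To make that last step concrete, a simplified sketch of the kind of post-run pass described above. The directory layout, the metadata keys, and the zero-byte dummy-file convention are assumptions reconstructed from the description, not the actual script:

```python
import json
import shutil
from pathlib import Path

SRC = Path("gallery-dl/twitter")    # gallery-dl output directory (hypothetical)
DST = Path("collection")            # final output directory (hypothetical)
DST.mkdir(parents=True, exist_ok=True)

for meta_file in SRC.rglob("*.json"):              # files written by --write-metadata
    media = meta_file.with_suffix("")              # metadata sits next to "<file>.json"
    if not media.exists() or media.stat().st_size == 0:
        continue                                   # missing, or already replaced by a dummy
    meta = json.loads(meta_file.read_text())
    author = meta.get("author", {}).get("name", "unknown")
    native = meta.get("filename", media.stem)      # restore the native filename
    shutil.move(str(media), str(DST / f"{author}_{native}{media.suffix}"))
    media.touch()                                  # zero-byte dummy so gallery-dl
                                                   # treats the file as already downloaded
```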