Don't grab all pages if download cap is enabled #48

Closed
OmgImAlexis opened this issue Feb 17, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@OmgImAlexis

On the initial index, if "Download cap" is set, pages should only be fetched until the cap is reached instead of fetching every single page of videos.

is older than cap age 2021-01-18 07:55:30.639997+00:00, skipping

I'd like to avoid seeing this over and over in my logs if possible.

@meeb
Owner

meeb commented Feb 17, 2021

While this would be nice, I don't think it's possible. If it is possible, it would require a lot more hacking into youtube-dl internals, which I'd like to avoid so that updating libraries etc. stays trivial. youtube-dl's extract_info() with extract_flat=True just returns all video IDs on a channel or playlist. From a quick check, there's no way to know the order of videos on a playlist or channel, so you can't skip crawling "page 22" just because "page 21" already has videos older than the download cap. You need to index every video on a playlist/channel to find potentially new videos, which means checking every video's upload date against any set age caps. That message could be classed as a debug log message rather than an info log message, though, so it could be suppressed that way...
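The "check every upload date against the cap, but log at debug level" idea could look roughly like the sketch below. It is not TubeSync's code: the function name, the logger name and the `(video_id, upload_datetime)` pair shape are assumptions made for illustration.

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger("indexer_sketch")  # hypothetical logger name


def filter_by_download_cap(videos, cap_date):
    """Return the IDs of videos uploaded on or after cap_date.

    `videos` is an iterable of (video_id, upload_datetime) pairs. That shape
    is an assumption for this sketch, not TubeSync's actual data model.
    """
    kept = []
    for video_id, uploaded in videos:
        if uploaded < cap_date:
            # Logged at DEBUG rather than INFO, so the message stays hidden
            # unless debug logging is switched on.
            log.debug("%s is older than cap age %s, skipping", video_id, cap_date)
            continue
        kept.append(video_id)
    return kept
```

Every indexed video is still inspected, so this only quietens the logs; it does not reduce the number of pages fetched.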

@OmgImAlexis
Author

Yep looks like I need to add a PR to youtube-dl.

ytdl-org/youtube-dl#1816

@OmgImAlexis
Author

Dateafter shouldn’t download pages outside of the range.

@meeb
Owner

meeb commented Feb 18, 2021

While that upstream issue, if added, might help the logs, it likely won't remove the requirement that TubeSync index all YouTube video IDs in a playlist each time it does an index, as they are not assured to be returned chronologically by YouTube when it gets crawled. The initial requirement with TubeSync is to "find all new video IDs", which still means indexing entire channels and playlists. This flag, if implemented upstream in youtube-dl, would likely just limit what's returned by extract_info() rather than limit what's actually requested from YouTube. If there is some feature enforcing chronological ordering that could be used for YouTube, it likely wouldn't be transferable to other sites which will get support in TubeSync in the future either.

Of course, if the devs of youtube-dl, who admittedly have a far superior knowledge of the internals of YouTube APIs/front ends and their own codebase than I do, find a way to actually make this work properly, I would implement it. In the foreseeable short term, however, you can expect TubeSync to index entire channels and playlists on every index and compare the upload dates to function properly. I'll still see if changing the log severity is sensible, though, to stop annoying users who attempt to index very large channels a lot.

meeb added the enhancement (New feature or request) label Feb 21, 2021
@DeftNerd

DeftNerd commented Feb 23, 2021

I hate to suggest a major refactor, but I wanted to give some ideas that might help with this problem.

Using youtube-dl to generate an index of all the videos and store them in a database to slowly download them does make sense, but when looking for new videos, TubeSync seems to be configured to redownload the entire index of videos again to look for updates.

A more efficient method to look for new videos would be to use the integrated YouTube RSS feeds. They're always ordered by "published" date:

https://www.youtube.com/feeds/videos.xml?channel_id=someidhere

Adding an RSS/XML parser to the system might be a slight hassle, but it would significantly reduce the risk of youtube getting mad at excessive page indexing.
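As a rough idea of how small such a parser could be, here is a sketch using only the Python standard library. The namespace URIs and the `yt:videoId` / `published` element names are assumptions about the feed format and should be checked against a real feed; `parse_feed` and `SAMPLE_FEED` are hypothetical names, and the sample is hand-written rather than real feed output.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for YouTube's Atom feeds; verify against a real feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}


def parse_feed(xml_text):
    """Return a list of (video_id, published) tuples in feed order."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", NS):
        video_id = entry.findtext("yt:videoId", namespaces=NS)
        published = entry.findtext("atom:published", namespaces=NS)
        if video_id:  # skip entries that carry no video ID
            entries.append((video_id, published))
    return entries


# Hand-written sample for illustration only.
SAMPLE_FEED = """<feed xmlns:yt="http://www.youtube.com/xml/schemas/2015"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <yt:videoId>abc123</yt:videoId>
    <published>2021-02-20T10:00:00+00:00</published>
  </entry>
  <entry>
    <yt:videoId>def456</yt:videoId>
    <published>2021-02-18T09:30:00+00:00</published>
  </entry>
</feed>"""

print(parse_feed(SAMPLE_FEED))
```

Fetching the feed itself is left out on purpose; only the parsing step is shown.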

@meeb
Owner

meeb commented Feb 24, 2021

Cheers for the suggestion!

I had noticed the RSS feeds, but compared to the current youtube-dl based method they don't actually reduce the number of requests made to YouTube that much. After getting a list of video IDs for a channel or playlist, TubeSync still needs to make one request per video to get its metadata, and these are the bulk of the requests to YouTube that seem to be triggering the rate limiting. For example, adding a channel with 1000 videos in it results in about 25 requests for indexing, then 1000 requests for metadata. Once indexed, it's "just" 25 requests per indexing interval, which is probably fine.
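The arithmetic in that example can be written out as a tiny helper. The function and parameter names are hypothetical, and the figures are simply the ones quoted in the comment above, not measurements.

```python
def estimate_requests(video_count, index_requests):
    """Return (initial, per_reindex) request counts for one channel.

    initial     = index requests plus one metadata request per video
    per_reindex = index requests only (metadata for any newly found
                  videos is not counted in this simplified sketch)
    """
    initial = index_requests + video_count
    per_reindex = index_requests
    return initial, per_reindex


# Figures from the comment: 1000 videos, about 25 indexing requests.
initial, per_reindex = estimate_requests(video_count=1000, index_requests=25)
print(initial, per_reindex)  # 1025 25
```

With these numbers the metadata requests account for 1000 of the 1025 initial requests, which is why replacing only the indexing step saves comparatively little.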

Additionally, unless I'm blind, I can't see any way to get more than the most recent 14 or so videos via RSS (there's no ?page=2 or similar accepted parameter that I can find), so while that would indeed work for picking up new content easily, it doesn't solve the initial "index all media on a channel" requirement.

Also, I assume that if a channel added more than 14 videos between indexing runs, it would have to fall back to the current method as well. That's pretty unlikely, I guess, but no doubt someone will find a channel that does this and trigger an edge case of missing content.
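One way to detect that edge case is to check whether the feed overlaps what has already been indexed. The sketch below assumes the feed lists the newest videos contiguously, so a single already-known ID means nothing older was skipped; `plan_update` and its return values are hypothetical, not part of TubeSync.

```python
def plan_update(feed_ids, known_ids):
    """Decide how to update from a feed of the most recent video IDs.

    Returns ("feed", new_ids) when at least one feed entry is already
    known, meaning the feed overlaps the existing index.
    Returns ("full_index", []) when every feed entry is new, because more
    videos may have been published than the feed can show.
    """
    known = set(known_ids)
    new_ids = [video_id for video_id in feed_ids if video_id not in known]
    if feed_ids and len(new_ids) == len(feed_ids):
        # No overlap with the index: the feed may not reach back far enough.
        return ("full_index", [])
    return ("feed", new_ids)
```

A never-indexed channel has no known IDs, so it also takes the full-index path, which matches the point that the feed cannot serve the initial index.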

Using the feeds could shave off a few requests per day, but likely not enough to solve things for anyone experiencing 429 rate limiting. For that, I'll probably just have to add a delay of 60 seconds or so between metadata requests, to spread the requests out for newly added channels, if people keep experiencing problems.
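A fixed delay between metadata requests could be sketched as follows. The `fetch` callable stands in for whatever actually retrieves one video's metadata; it and the function name are assumptions, and the `sleep` parameter exists only so the pause can be swapped out when testing.

```python
import time


def fetch_metadata_slowly(video_ids, fetch, delay_seconds=60, sleep=time.sleep):
    """Call fetch(video_id) for each ID, pausing between consecutive calls.

    `fetch` is a caller-supplied function (hypothetical here). No pause is
    inserted before the first request or after the last one.
    """
    results = {}
    for position, video_id in enumerate(video_ids):
        if position > 0:
            sleep(delay_seconds)
        results[video_id] = fetch(video_id)
    return results
```

At 60 seconds per request, the 1000-video example above would take roughly 16 to 17 hours to fetch all metadata, so the delay trades speed for a lower request rate.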

I'll add it to the future roadmap as a possible feature, as using the feeds would be a nicer way to keep channels updated with new content. It won't replace anything too significant internally and it's also not that much work really: just use a different indexer once a source has been indexed at least once. It wouldn't require any massive internal reworking.

@meeb
Owner

meeb commented Feb 24, 2021

I'll track the RSS feature in #73 and the log level / log spam reduction options in #74 - I'll close this for now as I don't think there's anything left to add to the original issue, but feel free to comment or re-open it if you want to add more suggestions or comments.

meeb closed this as completed Feb 24, 2021