Support episodes cleanup #44

Closed · IanShow15 opened this issue Nov 17, 2019 · 21 comments

Comments

@IanShow15 commented Nov 17, 2019

I have set one of the channels' page_size to 10, and there were more than 10 new videos on the channel after the first query. After the second update, 10 more new episodes were downloaded. The feed displays the most recent 10 episodes, but the previously downloaded files still remain in the download folder. Is this expected behavior? I feel like if the XML file doesn't include the old downloaded files, they should be removed after the feed is generated on each query.

@mxpv (Owner) commented Nov 18, 2019

Is this expected behavior?

Yep, this is how it works as of now. The CLI fetches page_size episodes via the API and, for each episode, checks whether it needs to be downloaded to disk (and downloads it if necessary).
Your point makes sense; there should be some kind of cleanup mechanism, especially for video files.
I think we could define an update_strategy field instead of page_size, with values like "keep the entire feed in sync", "keep the last 5 episodes", etc.

@mxpv added this to the 2.1 milestone Nov 18, 2019
@DeXteRrBDN (Contributor) commented Nov 19, 2019

update_strategy would be an interesting addition. I was looking to add this too. I could help with this.

@mxpv mentioned this issue Nov 22, 2019
@mxpv changed the title from "downloaded episodes beyond page_size won't get deleted or included in the feed xml" to "Support episodes cleanup" Nov 22, 2019
@tuxpeople commented Dec 3, 2019

Not sure if I understand correctly. For my use case, the best fit would be two variables: one for how many items to check on YouTube, and one for how many downloaded files should be kept (and added to the RSS feed). I would set those two to different values.

@amcgregor commented Dec 3, 2019

There's an interesting chicken-and-egg problem in identifying whether an episode is "safe to remove". My own process generates RSS XML feeds from all video content within a (channel, playlist) download directory; any additions appear after each new run. The client (e.g. the Podcasts app) tracks watched state, and the server side is never informed of that, which is actually a massive reason I prefer the Podsync approach to media consumption. Google doesn't need to know what I watch and when.

Automated clean-up becomes problematic and tricky. If it's tied to the HTTP request to pull in the video file, the episode may be cleaned up after only being partially watched (e.g. pause long enough to break the streaming connection… and your video gets deleted, great job!) If it's tied entirely to episode count, unwatched episodes may be unintentionally removed by an arbitrary metric ("n episodes"). If tied to a sensible metric, e.g. time ("keep all episodes from the last 7 days"), then episodes you just don't quite get to may disappear automatically. I'm fortunate in that I have tens of terabytes of available storage at home, so currently I don't bother to delete episodes at all. (I actually have all digital media I have touched since 2001… just shy of 50 TiB at the moment. The first video: a bad VHS transfer of "blowing up the whale". ;)

A combination of metrics for identifying cullable (able to be safely removed ;) episodes such as "7 days after the last time of access + three months for untouched episodes" might be optimal.

@psyciknz commented Dec 3, 2019

I'm not sure we need to over-complicate this... I thought the main use of a podcast was single-use episodes. So after x days, clean up old episodes; it's not life or death if you miss one.

If you're after "I want to save the entire internet at home", then don't remove anything, or put the podcasts somewhere more permanent.

Just my thought, but it's slanted by how I use podcasts.

@DeXteRrBDN (Contributor) commented Dec 3, 2019

I agree with @tuxpeople's approach. Having two variables would also solve the issue for me:

page_size = 5
keep_size = 10

5 items to download
10 items to keep as downloaded

Regarding what @amcgregor said, I would not mind losing episodes that haven't been played (even if already downloaded) if they fall beyond the keep_size value.

@tuxpeople commented Dec 4, 2019

@amcgregor Not sure if podsync needs to solve that problem. We're speaking about podcasts here, not someone's wedding video :-P So the files do not need to be saved until the end of the world. But that's just how I use podcasts.

Besides the statement from @psyciknz, just a little addition: if you would like to have EVERYTHING in your feed, you end up also implementing RFC 5005 (see https://podlove.org/paged-feeds/).

What we would need to discuss is what kind of value keep_size is. It could be a number of episodes, an age of episodes, or both.

@psyciknz commented Dec 4, 2019

I'd say number of episodes. Then it's up to the user to set it as appropriate for the release frequency.

@DeXteRrBDN (Contributor) commented Dec 4, 2019

As page_size is expressed as a number of episodes to look at, keep_size should also be a number of episodes to keep.

Maybe an option to define what "size" means (items, age) would be useful, but that's probably a task for the future.

@amcgregor commented Dec 5, 2019

So the files do not need to be saved until the end of the world.

In my case, that is explicitly the point. Actually, a step further: my archive needs to survive the end of the world. (It's geographically redundant, with offsite backups, and locally redundant, with extensive RAID, multiple servers, and a replicated MongoDB GridFS cluster on top hosting a non-hierarchical metadata filesystem.) "All digital media I have touched since 2001" includes a complete copy of Wikipedia, most of Wikimedia (e.g. dict, books, and so forth), Project Gutenberg, every StackExchange site including StackOverflow, plus ~45 days of music (24 hours a day, no repeats, every genre), millions of works of fiction, hundreds of thousands of works of non-fiction… forming a forever library. A body of knowledge sufficient to learn to read, understand, and propagate that body of knowledge, with instructions sufficient to survive long enough to do so. (Agriculture on up.) Yes, this sounds ludicrous. And yes, the initial off-site backup took three months. Thank the Elders of the Internet for Backblaze.

I am sadly content in the knowledge that my own requirements will not be met by any of the simplistic "rate limiting" approaches being agreed upon here, and that my own hackish attempt to replicate Podsync functionality already offers substantially more powerful controls over media ingest. For example, a frequent scenario I encounter: download i episodes on each run, preserve j episodes total per channel, covering k months of time at most.

Hypothetically applied: 3 months of episodes, say 200 episodes within that time period, downloading at most 10 episodes every 6 hours and preserving the 30 most recent. That requires three synchronization periods, or 18 hours, to be "fully caught up" and ready to go with all expected media available for viewing, while not waiting on complete ingest of one channel before continuing to the next; downloading all 200 episodes for that three-month period in one go, by contrast, would take 12 hours while processing nothing else, missing or skipping additional synchronization periods. This algorithm can offer guarantees about behavior while staying flexible with respect to time, count, and batch size, and it reduces blocking (improving "turnaround time" on refreshes/updates).

(Edited to add: periodic regular data synchronization, ingress and egress, combined with feed generation, is my literal day job. That template engine used in my hackish recreation was invented at work for the purpose of streaming RSS feed generation. ;)

@DeXteRrBDN (Contributor) commented Dec 5, 2019

Well, I don't see the issue here. We can implement the new config value with the following option:

keep_size: 0 //Disable 'episode cleanup' functionality, keep forever

@kgraefe commented Dec 5, 2019

amcgregor is not even using podsync, yet keeps trying to make discussions difficult.

As for the variable name, I'd name it keep_items. If we later want more flexibility we can add more variables like keep_age (e.g. "keep items newer than x days") and keep_size (e.g. "keep items until storage size exceeds x MiB").

@amcgregor commented Dec 6, 2019

amcgregor is not even using podsync

I did. And I paid for the privilege of deeper archiving and higher quality. Then it ceased functioning. So I eliminated it as a dependency in my media-consumption workflow: I examined its mechanism of operation and replicated the essential process with a literal shell script and common, highly functional open-source tools (GNU parallel across multiple machines with a live progress indicator is beautiful). I now rely on functionality that the abstraction of podsync does not provide for, despite the underlying mechanism supporting all of my needs. Discussions on exposing related functionality up from the underlying tools are all being driven towards the lowest common denominator of least functionality, with short-sighted implications. I want to use podsync. I have come to realize I might never be able to. (Without rewriting it, as I essentially have… in Bash.)

Well, I don't see the issue here.

Readily apparent, and not unique. (Pardon the snark; bit frustrating on this side, too. ;^P)

We can implement the new config value with the following option…

Giving up entirely is certainly an option. It would be a regression from what my shell script is capable of, though, so I see no point in shooting myself in the foot. Mostly I'm trying to encourage a tool whose concept (and early growth) I like to be… less… poorly/inflexibly architected… so that I can consider using it once more. Being able to combine criteria (n episodes maximum, m days maximum, j episodes pulled maximum per run) really isn't that big of an ask. youtube-dl already does it through the combination of three command-line switches, and offers even more flexible selection criteria that just aren't exposed currently.
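
For illustration, one plausible combination (my guess; these switches do exist in youtube-dl, but they aren't necessarily the exact three meant above, and the channel URL is a placeholder):

#Pull at most 10 new episodes per run, looking only at the 30 newest uploads
#from the last 3 months, and never re-downloading anything already recorded
youtube-dl --playlist-end 30 --max-downloads 10 --dateafter now-3months \
    --download-archive seen.txt 'https://www.youtube.com/channel/CHANNEL_ID'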

For clean-up, I'll be adding to my own script—for some channels or playlists—a find -mtime … -exec rm {} + invocation to remove episodes based on relatively long duration (creation) age and a shorter duration modification time age, having nginx touch files it streams to indicate possible watched state to the find call. (Or use the atime, if I can resolve the paradox of fetching the atime without updating the atime during cleanup…)

Lastly, to cover the "n episodes" culling criterion: ls *.json | sort -r | tail -n +31 | xargs -r rm ← keep only the latest 30 episodes, for example. (Yes, I'm aware that this only cleans the JSON metadata; it needs expanding upon, but even that would be effective in excluding the videos from the generated RSS file.) #TruePowerOfUNIX: the point being that these requested features are implementable with system-standard, absolutely basic, freely available GNU/Linux/BSD tools, right now, with only a few minutes of effort.
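
A rough sketch of combining the age-based and count-based criteria for a single feed directory (the path and thresholds are purely illustrative, not the values alluded to above, and filenames are assumed to contain no spaces):

#Remove episodes whose files haven't been touched for more than 90 days
#(nginx bumps the mtime when it streams them, as described above)
find /home/pi/podsync/data/somefeed/ -name '*.mp4' -mtime +90 -exec rm -f {} +
#Then keep at most the 30 newest remaining files
cd /home/pi/podsync/data/somefeed/ && ls -t | tail -n +31 | xargs -r rm -f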

@DeXteRrBDN (Contributor) commented Dec 6, 2019

I've been looking at the code, and currently Podsync does not track how many items have already been downloaded, so it does not know what can or cannot be deleted.

It looks at all the requested items, checks whether each one exists, and downloads it if not. But it does not know which already-downloaded items fall outside the current page_size.

The YouTube API returns up to 50 items per query, so we could check whether some already-downloaded items are among those first 50. That way we could build the XML from the elements inside page_size plus those that are already downloaded but outside page_size.

For items that are already downloaded, outside page_size but inside keep_items and within the first 50 items from YouTube, we can take the info from YouTube; we would, however, need to paginate the YouTube API if some downloaded items fall outside those first 50.

To avoid querying YouTube for items that are already downloaded, we should "store" the downloaded item info. We already have an XML file with that info, so we could use that file as storage. An initial step would be to modify the current code so it updates the existing XML instead of creating a new one every time.

Once we can modify the XML file instead of recreating it, we can add the new functionality: read the current XML and remove items (including the stored files) using whatever criteria we decide on (max total size, max items, max age, etc.).

So we could split the task into two steps:

  • Modify the XML generation from "recreate" to "read & modify".
  • Read the XML items and delete those that do not pass the user's configuration.

@kgraefe commented Dec 6, 2019

Oh, I wasn't aware that the XML is currently not read by podsync. That means older items will exist as files but not in the XML file. Deleting those files via cron and controlling the number of episodes with the page_size parameter will be good enough for me.

@billflu commented Dec 7, 2019

I just created this script (it can probably be cleaned up) to compare the mp4s referenced in the XML files against what is downloaded, and then remove the extra mp4s. When videos are high quality and 30-60 minutes long, they can take up quite a bit of space.

#Cleans up extra mp4 files which are downloaded by podsync, but not referenced
#These files are likely older than the current length of the feed
#
#Directory of podsync data (with trailing /)
podsyncdata='/home/pi/podsync/data/'
#Find referenced mp4 files in the xml feeds
grep -Eoh '[A-Za-z0-9_-]{11}\.mp4' "$podsyncdata"*.xml | sort -u > xml-mp4.txt
#Find downloaded mp4 files
find "$podsyncdata" -name '*.mp4' -exec basename {} \; | sort -u > mp4.txt
#Compare the lists to see which downloaded files aren't referenced
comm -23 mp4.txt xml-mp4.txt > diff-mp4.txt
#Remove the extra downloaded files
while read -r line; do rm "$podsyncdata"*/"$line"; done < diff-mp4.txt
#Clean up temporary files
rm *mp4.txt

Next up might be a script to clean up partial downloads; it could then wait for the next run or possibly kick off youtube-dl itself.
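
A minimal sketch of that idea, assuming the partial files are the *.part leftovers that youtube-dl writes (the path and the one-day age threshold are only illustrative):

#Remove partial downloads that haven't been touched for a day
find /home/pi/podsync/data/ -name '*.part' -mmin +1440 -exec rm -f {} +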

@kgraefe commented Dec 10, 2019

Yeah, or just do

cd /home/pi/podsync/data/
for f in */*.mp4 */*.mp3; do
    grep -q "$f" *.xml || rm "$f"
done

@billflu commented Dec 10, 2019

Yeah, or just do

cd /home/pi/podsync/data/
for f in */*.mp4 */*.mp3; do
    grep -q "$f" *.xml || rm "$f"
done

Touché kgraefe! I knew there had to be a way to do it more efficiently.

@mxpv removed this from the 2.1 milestone Dec 10, 2019
@Rumik commented Dec 14, 2019

Is the above script usable? If so, how so? Auto cleanup would be very handy! Especially because I have no idea where my episodes are being downloaded to! They're not in the data directory! lol

Thanks :)

@kgraefe commented Dec 20, 2019

You have to save it as a shell script file and add a crontab entry so cron runs it periodically. (I don't know how to do that on your NAS; on a Linux box I'd run crontab -e.)

E.g. I have in my crontab something like:

  0      3       *       *       *       /bin/sh /home/pi/bin/clean-data.sh

which means "run /bin/sh /home/pi/bin/clean-data.sh every day at 3:00 A.M."

You may have to add PATH settings at the beginning of the script, as cron runs in a different (minimal) environment:

PATH=/bin:/usr/local/bin:/usr/bin

Otherwise it may not find the grep and rm executables.
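
For reference, a complete clean-data.sh along these lines might look like this (a sketch only, reusing the loop from the earlier comment and the data directory from the earlier examples):

#!/bin/sh
#clean-data.sh: remove media files no longer referenced by any feed XML
#cron runs with a minimal environment, so set PATH explicitly
PATH=/bin:/usr/local/bin:/usr/bin

cd /home/pi/podsync/data/ || exit 1
for f in */*.mp4 */*.mp3; do
    grep -q "$f" *.xml || rm "$f"
done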

@amcgregor commented Jan 5, 2020

Ah, for those digging into this problem, there's a slightly more "refined" approach I'm now investigating. Instead of relying on arbitrary rules around time-based expiry/expunging, pure episode-count limits, etc.: since the machine acting as my server is also a macOS machine running Podcasts.app locally, why can't I pull the view state from the Podcasts app?

Turns out, you can!

/Users/$USER/Library/Group\ Containers/??????????.groups.com.apple.podcasts/Documents/MTLibrary.sqlite

That's the path to the SQLite database backing Podcasts.app. The question marks are a globbing pattern, indicating that your particular "group ID" may differ from mine, but the path should work as-is when passed to command-line programs such as the sqlite3 REPL. Now, finding the path to the database is only the first part. The episode data is stored in the ZMTEPISODE table.

I'm choosing to identify episodes that are possible to clean up using a URL prefix match on the domain name I'm hosting the podcasts from (the ZENCLOSUREURL column). Another idea that popped to mind was matching the format of the episode ID (the ZGUID column), but the URL will be more reliable. This table tracks play status, playhead position, saved state, and more, but we only need the path portion of the URL.

Cleanup requires finding episodes that:

  • Have been marked as played at some point in time.
  • Have not been marked as "saved", i.e. marked for intentional preservation.
  • Are not in a (short) list of "archival" playlists/channels, ones that should never be cleaned up.
SELECT substr(ZENCLOSUREURL, 25) FROM ZMTEPISODE
WHERE ZENCLOSUREURL LIKE 'https://cast.example.com/%'
AND ZSAVED = 0 AND ZUNPLAYEDTAB != 1
AND ZAUTHOR NOT IN (...archival...)

(The 25-character prefix removal is correct for my domain name in use; this lets me feed this query to sqlite3 then use the output as on-disk file paths, one episode per line.)
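
For illustration, a sketch of wiring that together from the shell (the domain and the 25-character offset follow the example above; the archival-channel exclusion is left out, a single matching group container is assumed, and the data root is a placeholder):

#Delete the on-disk files for played, unsaved episodes known to Podcasts.app
sqlite3 "$HOME/Library/Group Containers/"*.groups.com.apple.podcasts/Documents/MTLibrary.sqlite \
    "SELECT substr(ZENCLOSUREURL, 25) FROM ZMTEPISODE
     WHERE ZENCLOSUREURL LIKE 'https://cast.example.com/%'
     AND ZSAVED = 0 AND ZUNPLAYEDTAB != 1" |
while read -r episode; do
    rm -f "/path/to/podsync/data/$episode"   #data root is a placeholder
done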

Just an idea that's been bouncing around my head this last week, having poked around in the sqlite3 command-line tool a bit to see what information is available there. If this information is available, why not use it? :)

mxpv added a commit that referenced this issue Mar 8, 2020
@mxpv closed this Mar 8, 2020