
Switch to self hosted version (aka Podsync CLI) #38

Closed
14 tasks done
mxpv opened this issue Oct 27, 2019 · 14 comments

Comments

@mxpv
Owner

mxpv commented Oct 27, 2019

All the workarounds I've made over the last few months have been about finding ways to bypass various limitations (API tokens have limited quotas, the number of download requests is limited). I believe the path forward is to switch to a self-hosted version, which would not only allow normal use, but also make it possible to add some neat features (like mp3 support).

This issue is to track progress:

  • Add CLI entrypoint
  • TOML configuration
  • Adapt current feed engine for YouTube
  • Adapt current feed engine for Vimeo
  • youtube-dl wrapper
  • Implement feed updater
  • Implement episode downloader
  • Web server with static file hosting
  • Mp3 encoder
  • Write Dockerfile for easy deployment
  • Write unit tests
  • Integrate with GitHub actions
  • Update documentation and write tutorials for different cloud providers
  • Versioning, releaser script

Patreon post: https://www.patreon.com/posts/self-hosted-aka-31073377
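
For a rough idea of how the checklist above could fit together, here is a minimal sketch of a TOML config plus a CLI invocation. Every key name and flag below is an illustrative assumption, not the final format:

# Sketch only: write a tiny TOML config for one feed, then run the
# (hypothetical) self-hosted binary against it.
cat > config.toml <<'EOF'
[feeds.my_channel]
url = "https://www.youtube.com/channel/UCxxxxxxxx"
format = "audio"          # or "video"
update_period = "12h"
EOF

./podsync --config config.toml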

@grafmik

grafmik commented Oct 27, 2019

Hello Max, and thanks for your work so far and for your effort to make a self-hosted version!

Just wondering, why the mp3 format? Isn't that for audio only? Do you mean mp4?

Keep it up!

@jonathanp0

Since some YouTube channels simply consist of people talking for long periods of time in front of a camera, it's useful to be able to convert the video to an audio file and listen to it using podcast software. Podsync already has this feature.

@amcgregor

amcgregor commented Oct 27, 2019

I found the internal architecture of podsync to be… rather excessively complicated. (A testament to the power of ingenuity, but too complicated for me to be comfortable setting up locally.) DynamoDB? Lambda? Golang! Docker. But also Node… (How many programming languages does one need at a single time? Node does not touch my machines.) I looked at all that, then just dove into the code looking for the ultimate youtube-dl invocation, then extracted and isolated it and automated it using GNU parallel. That Gist also describes the patches made to youtube-dl to avoid excessive numbers of HTTP requests. (E.g. actually sys.exit() when failing a video because it is too old; all subsequent videos will be older.)

I've resolved the issue with playlists being named after their origin channel instead of the actual name of the playlist, and will continue to keep this tiny little shell script updated. I also added optional lines for rate limiting, randomized sleep periods, and SOCKS5 proxy configuration; that is, ssh -D 8088 example.com and the proxy would be --proxy "socks5://localhost:8088/". The only real remaining issue is feed thumbnails. With this setup, it's taken YouTube two weeks to begin to throw up Captcha challenges, after ingesting 5,895 episodes totalling 1.6 TiB across 141 channels / playlists. (Each run taking, on average, around 10 minutes, run via cron every few hours.)
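
For anyone following along, those pieces combine roughly like this. Only the youtube-dl and GNU parallel flags are real options; the tunnel host, limits, and the channels.txt URL list are placeholders:

# Open a SOCKS5 tunnel through a remote host, as described above.
ssh -f -N -D 8088 example.com

# Pull each channel/playlist from a plain-text URL list, two at a time,
# with rate limiting and randomized sleeps between downloads.
parallel -j 2 -a channels.txt \
  youtube-dl \
    --proxy "socks5://localhost:8088/" \
    --limit-rate 1M \
    --sleep-interval 5 --max-sleep-interval 30 \
    --download-archive archive.txt \
    {}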

@amcgregor

amcgregor commented Oct 27, 2019

Ah, adding a second comment as it's an important note: my shell script there (basically a text file containing a channel or playlist URL per line…) explicitly gives you control over per-channel quality settings (see line 41; split that up with multiple formats if needed, I do) as well as extended video selection criteria, such as title exclusions (see line 74). (Run youtube-dl -h to see the many, many options available.)
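
In youtube-dl terms, those two knobs look roughly like this; the format string and exclusion pattern below are just examples, not what the script actually uses:

# Prefer up-to-1080p video+audio (merging separate streams requires ffmpeg),
# fall back to the best pre-muxed file, and skip titles matching the pattern.
youtube-dl \
  --format "bestvideo[height<=1080]+bestaudio/best[height<=1080]/best" \
  --reject-title "(?i)trailer|teaser" \
  "https://www.youtube.com/user/SomeChannel"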

@grafmik

grafmik commented Oct 27, 2019

Hello Alice,

Thanks for this work! I had started to dive into Max's code, starting with the early commits. Did you know podsync started as a .NET project? :)

I can understand why Max used a database: he had to store every user's playlist. For a self-hosted, single-user version, generating/serving just one file may indeed be a simpler/better solution.
As for Node, I feel you.
As for Docker, it could be a nice feature to add to your code, as it could pin down the right environment, especially for versions of parallel and python.

Anyway, I'm grateful because you made me save some time and effort.

@amcgregor

amcgregor commented Oct 27, 2019

especially for versions of parallel and python.

Any version will do:

brew install parallel

Python 3 is already a pretty universal standard; the given code will work with any Python 3.3 or newer, that is, virtually any Python 3 released since 2012. Including the version that comes pre-installed on macOS.

Edited to add: thus, in this particular case, Docker would simplify nothing, and complicate everything. Like a Spartan soldier taking everything and giving nothing.

store every user playlist

On-disk directories are the database, in my case. My Python script and template will transform any directory containing youtube-dl .info.json files into a podcast. (Future improvement: only regenerate the index.xml if there are actually new/updated episodes, but feed generation is so minor compared to content collection that it's a low priority.)

Edited to add: ingest (pull.sh invocations of youtube-dl) is one half of the problem: actually getting the content. The other half, turning those collected media files into podcast feeds, is tackled entirely separately.
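
That "only regenerate when something changed" improvement could be as small as a timestamp check; the paths are placeholders and generate_feed.py is a stand-in name for the actual Python script:

# Rebuild index.xml only when some .info.json is newer than the existing feed.
for dir in /media/podcasts/*/; do
  if [ ! -e "$dir/index.xml" ] || \
     [ -n "$(find "$dir" -name '*.info.json' -newer "$dir/index.xml" | head -n 1)" ]; then
    ./generate_feed.py "$dir"   # stand-in for the real feed generator
  fi
done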

@grafmik

grafmik commented Oct 27, 2019

I understand you want to keep things simple. Loved the 300 reference and can't help imagining Docker as a bare-torso warrior now.

As I already said, I didn't read the podsync code, but does it store every mp4 on their server?
I'm using your script right now (I'll also check these YouTube channels of yours, just curious).
I see the content (mp4) is directly "youtube-dl-ed" right here on the machine.

I can see the dl.podsync.net/* URLs link to googlevideo.com. Is there an upload wrapper somewhere that could avoid using space on the podsync self-hosted server?

@amcgregor

amcgregor commented Oct 27, 2019

…does it store every mp4 on their server?

Yes, as part of the background "updater" process. That is Python code, so it invokes youtube_dl directly, whereas my shell script invokes the youtube-dl command itself. One layer out. ;)

«googlevideo.com links» … Is there an upload wrapper somewhere that could avoid using space on the podsync self-hosted server?

Well, while youtube-dl on the command line will download the video content by default, if you are careful to pick a video format that comes "pre-muxed" (that is, audio and video together), you can hypothetically avoid downloading the video at all and pull the actual origin links from the .info.json for use in the RSS feed. Or, in Podsync's case, serve them behind a 302 redirect, likely after checking the local cache status vs. the availability of the pre-muxed link from YouTube.

That's a key difference, I think. I get 1080p episodes, as I re-mux the independent streams. Hypothetically I could choose a 4K --format. (But ye gods, the storage space, then!)
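
A sketch of that "origin link" idea using only stock youtube-dl flags (the video ID is a placeholder):

# Print the direct, pre-muxed media URL without downloading anything,
# so a feed could point at the origin instead of a local copy.
youtube-dl --format "best[ext=mp4]" --get-url "https://www.youtube.com/watch?v=VIDEO_ID"

# Or keep just the metadata and pull the same URL out of the .info.json later.
youtube-dl --format "best[ext=mp4]" --skip-download --write-info-json \
  "https://www.youtube.com/watch?v=VIDEO_ID"

One caveat: those googlevideo.com links expire after a few hours, which is presumably part of why a redirect/cache layer is involved at all.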

@leekillough

Just wondering, why the mp3 format? Isn't that for audio only? Do you mean mp4?

Some of us want audio-only, since we like listening to the audio of podcasts posted to YouTube but don't have the time to watch the video: we're doing other things while we listen, such as driving, or we simply don't care to see the podcaster's studio when what they say is more important than what it looks like.

I consider it a welcome addition.

Self-hosting seems like the way to go too, eliminating single points of failure.

@davidAlittle

Some of us want audio-only… I consider it a welcome addition.

Seconding the audio-only option. I don't know what APIs you're using, but I can tell you that as a YouTube Red subscriber (actually a Google Play Music subscriber, but that's the same thing now), there is a way to stream only the audio, since this is a premium feature specifically offered as part of Red.

@amcgregor

amcgregor commented Nov 6, 2019

Direct use of youtube-dl (the command-line program powering all of this media ingest) permits retrieval of just the audio. My little automation script wins again: it can already do this! ;P
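
Concretely, that boils down to youtube-dl's audio-extraction flags (the output template and playlist URL are just examples):

# Download the best available audio and convert it to mp3 (requires ffmpeg).
youtube-dl \
  --extract-audio --audio-format mp3 --audio-quality 0 \
  --output "%(uploader)s/%(title)s.%(ext)s" \
  "https://www.youtube.com/playlist?list=PLACEHOLDER"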

Some of us want audio-only…

It really is a bit flabbergasting to be repeatedly asked for something the user already has the ability to do… and search for.

@mirth

mirth commented Nov 7, 2019

Is it expected that docker-compose pull produces the following?

ERROR: for api  pull access denied for mxpv/podsync_api, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR: for updater  pull access denied for mxpv/updater, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR: for resolver  pull access denied for mxpv/podsync_lambda, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR: for nginx  pull access denied for mxpv/nginx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

@mxpv
Owner Author

mxpv commented Nov 7, 2019

CLI docker images are not yet published.

@mxpv
Owner Author

mxpv commented Nov 16, 2019

New functionality, docs, and tutorials will be added in follow-up PRs.
