An audio archiver tool to create an audio library out of web videos. Download, clip, and normalize audio streams
Please see the CHANGELOG for a release history.
- A Rust environment to build the project
- Latest stable and above is supported. May or may not work on previous versions
- Either
yt-dlp
oryoutube-dl
ffmpeg
# To run locally
cargo build --release
cargo run --release -- --help
# To install & run from anywhere
cargo install --path .
cd /some/where/under/the/rainbow
gawr --help
This tool has multiple ways to be configured:
If a configuration variable is present in multiple of those locations, the priority is the following: Command line arguments
> Environment variables
> Configuration file
> Default values
.
Available command line arguments can be checked with the --help
argument :
An audio archiver tool to create an audio library out of web videos. Download, clip, and normalize audio streams
Usage: gawr [OPTIONS]
Options:
--config <config> The path to the TOML config file [default: .gawr.toml]
--id <id> The IDs of playlists or videos
--out <out> The path to the output directory
--cache <cache> The path to the cache file, avoiding processing multiple times the same videos
--split <split> Either keep the entire video or create clips based on timestamps in the description [possible values: full, slow]
--ext <ext> The file extension to use for the output files. Defines the file container format to use [possible values: mka, mkv, ogg, webm]
--clip_regex <clip_regex> Regular expressions to extract timestamps from description.
Must capture `time` and `title` groups (starting timestamp & clip title).
For every description line, every pattern will be tested until one matches.
A default pattern that should handle most cases is used if none is provided.
Must use the [Regex crate syntax](https://docs.rs/regex/latest/regex/#syntax)
--shuffle Randomize the order in which the videos are downloaded. Do not influence how clips are processed
--cores <cores> Assume the machine has this number of cores. Used to modify the number of worker threads spawned.
When using a value of 0 (default), auto-detect the number of cores from the system
--log <log> The logging level to use [possible values: ERROR, WARN, INFO, DEBUG, TRACE]
--bitrate <bitrate> The audio bitrate to use for output files. Must follow the `ffmpeg` bitrate format
-h, --help Print help
-V, --version Print version
Configuration variables can also be set using environment variables.
To do so, simply add the GAWR_
prefix and uppercase the variable.
For example GAWR_CLIP_REGEX=...
.
Finally, environment variables can be set using a TOML file.
By default, the one read is .gawr.toml
but this can be set using the --config
command line argument.
# Required variables (dummy template values)
id = ["<ID>", "<ID>"]
out = "<PATH>"
cache = "<CACHE>"
split = "<clips|full>"
# Optional variables (default values)
bitrate = 96
clip_regex = [
"^(?:\\d+\\. *)?(?P<time>[0-9]+(:[0-9]+)+) *.? +(?:[0-9]+(:[0-9]+)+)? *.? +(?P<title>.+)$",
"^(?:\\d+\\. *)?(?P<title>.+) *.? +(?P<time>[0-9]+(:[0-9]+)+) *.? +(?:[0-9]+(:[0-9]+)+)?$",
]
cores = 0
ext = "ogg"
log = "info"
shuffle = false
- The tool downloads audio streams, potentially clip them and apply audio normalization before saving them
- All of this is done on multiple threads to minimize the time downloading / processing
- It saves the current state in a sqlite local database to be able to restart at any time without re-doing work it has already done
This is where the tool starts and:
- parse the command line arguments
- read the environment variables and configuration file
- combine all of them and verify all conf variables are valid and the required ones have been specified
Then:
- checks if the external programs are present (
yt-dlp
oryoutube-dl
, andffmpeg
) - initializes the actors
The main processing loop is structured using an actor model as schematized below :
-
The list of playlist video IDs is downloaded, and compared to the sqlite cache database to see whether there are new video stream to download
-
Each of these video IDs is sent to the Download Actor which downloads the video audio stream and passes them to the next actor
-
The Timestamp Actor parses the video description to detect timestamps in the video. It then passes each video section (a start time, and an end time) to one of the next actors
-
The Clipper Actors clips the audio file to keep only the wanted video section, applies audio normalization and other
ffmpeg
conversions to get the output clip audio file
The actor model is useful here since it allows each actor to run on its own thread and thus to optimize the work done conurrently :
- As soon as the Download Actor has passed the audio file to the next actor, it will begin downloading the next one.
- Timestamps for different audio files can be processed at the same time
As this tool uses external programs, which communicate through the network, potentially using non-standard APIs... errors are bound to happen.
A lot has been done to handle as best as possible failures:
- Retry operations
- Detect unavailable video streams
- Process files using temporary files, to avoid trashing the output directory in case of crash/failure
- Save current state to handle unexpected crashes of the tool
At this point, the tool can be expected to work decently for personal usage, and should not require manual fiddling to put it out of a trash state. (but if that happens, feel free to create a new issue)
Feel free to create issues and/or submit pull request. I cannot guarantee how long it would take to solve/review them as it depends on how busy I am at the moment.