gawr

An audio archiver tool to create an audio library out of web videos. Download, clip, and normalize audio streams

CHANGELOG

Please see the CHANGELOG for a release history.

Documentation quick links

Pre-requisites
How to build & run
Configuration
How it works
Contributing

Pre-requisites

A Rust environment to build the project
- The latest stable release and newer are supported. Older versions may or may not work.
Either yt-dlp or youtube-dl
ffmpeg

How to build & run

# To run locally
cargo build --release
cargo run --release -- --help

# To install & run from anywhere
cargo install --path .
cd /some/where/under/the/rainbow
gawr --help

Configuration

This tool has multiple ways to be configured:

Command line arguments
Environment variables
Configuration file

If a configuration variable is set in more than one of these places, the priority is: Command line arguments > Environment variables > Configuration file > Default values.
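The priority order can be illustrated with a minimal sketch. The `resolve` function below is a hypothetical helper written for this README, not the tool's actual code; it only shows how the first value that is set wins.

```rust
/// Pick the first value that is set, following the documented priority:
/// command line > environment > configuration file > default.
/// Hypothetical helper, for illustration only.
fn resolve<T>(cli: Option<T>, env: Option<T>, file: Option<T>, default: T) -> T {
    cli.or(env).or(file).unwrap_or(default)
}
```

For instance, `resolve(None, Some("webm"), Some("mkv"), "ogg")` picks the environment value, since no command line value is given.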

Command Line Arguments

The available command line arguments can be listed with the --help argument:

An audio archiver tool to create an audio library out of web videos. Download, clip, and normalize audio streams

Usage: gawr [OPTIONS]

Options:
      --config <config>          The path to the TOML config file [default: .gawr.toml]
      --id <id>                  The IDs of playlists or videos
      --out <out>                The path to the output directory
      --cache <cache>            The path to the cache file, avoiding processing multiple times the same videos
      --split <split>            Either keep the entire video or create clips based on timestamps in the description [possible values: full, clips]
      --ext <ext>                The file extension to use for the output files. Defines the file container format to use [possible values: mka, mkv, ogg, webm]
      --clip_regex <clip_regex>  Regular expressions to extract timestamps from description.
                                 Must capture `time` and `title` groups (starting timestamp & clip title).
                                 
                                 For every description line, every pattern will be tested until one matches.
                                 A default pattern that should handle most cases is used if none is provided.
                                 
                                 Must use the [Regex crate syntax](https://docs.rs/regex/latest/regex/#syntax)
                                 
      --shuffle                  Randomize the order in which the videos are downloaded. Do not influence how clips are processed
      --cores <cores>            Assume the machine has this number of cores. Used to modify the number of worker threads spawned.
                                 
                                 When using a value of 0 (default), auto-detect the number of cores from the system
                                 
      --log <log>                The logging level to use [possible values: ERROR, WARN, INFO, DEBUG, TRACE]
      --bitrate <bitrate>        The audio bitrate to use for output files. Must follow the `ffmpeg` bitrate format
  -h, --help                     Print help
  -V, --version                  Print version
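
The behaviour described for --cores (0 means auto-detection) could look roughly like the sketch below. `resolve_cores` is a hypothetical name, and falling back to 1 when detection fails is an assumption, not something the tool documents.

```rust
use std::thread;

/// Resolve the number of cores to assume: 0 means "auto-detect".
/// Hypothetical helper, for illustration only.
fn resolve_cores(requested: usize) -> usize {
    if requested == 0 {
        // Assumption: fall back to 1 if the system cannot report its parallelism.
        thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
    } else {
        requested
    }
}
```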

Environment Variables

Configuration variables can also be set using environment variables.

To do so, add the GAWR_ prefix to the variable name and uppercase it, for example GAWR_CLIP_REGEX=...
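
As a rough illustration of the naming rule, the sketch below derives the environment variable name from a configuration variable. Both function names are hypothetical and are not taken from the tool's source.

```rust
use std::env;

/// Build the environment variable name for a configuration variable,
/// e.g. `clip_regex` becomes `GAWR_CLIP_REGEX`.
fn env_var_name(variable: &str) -> String {
    format!("GAWR_{}", variable.to_uppercase())
}

/// Read a configuration variable from the environment, if it is set.
fn read_from_env(variable: &str) -> Option<String> {
    env::var(env_var_name(variable)).ok()
}
```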

Configuration File

Finally, configuration variables can be set using a TOML file. By default, the file read is .gawr.toml, but a different path can be given with the --config command line argument.

# Required variables (dummy template values)
id = ["<ID>", "<ID>"]
out = "<PATH>"
cache = "<CACHE>"
split = "<clips|full>"

# Optional variables (default values)
bitrate = 96
clip_regex = [
    "^(?:\\d+\\. *)?(?P<time>[0-9]+(:[0-9]+)+) *.? +(?:[0-9]+(:[0-9]+)+)? *.? +(?P<title>.+)$",
    "^(?:\\d+\\. *)?(?P<title>.+) *.? +(?P<time>[0-9]+(:[0-9]+)+) *.? +(?:[0-9]+(:[0-9]+)+)?$",
]
cores = 0
ext = "ogg"
log = "info"
shuffle = false

How it works

Short version

The tool downloads audio streams, optionally clips them, and applies audio normalization before saving them
All of this is done on multiple threads to minimize the time spent downloading and processing
It saves its current state in a local sqlite database so that it can be restarted at any time without redoing work it has already done
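
The idea of skipping work that is already recorded can be sketched as follows. This is only an illustration: an in-memory `HashSet` stands in for the sqlite database, and `new_ids` is a hypothetical helper.

```rust
use std::collections::HashSet;

/// Keep only the IDs that are not already recorded in the cache,
/// preserving their original order.
/// Assumption: a `HashSet` stands in for the sqlite cache database.
fn new_ids<'a>(playlist: &[&'a str], cache: &HashSet<String>) -> Vec<&'a str> {
    playlist
        .iter()
        .copied()
        .filter(|id| !cache.contains(*id))
        .collect()
}
```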

Long version

Configuration & Initialization

This is where the tool starts. It:

parses the command line arguments
reads the environment variables and the configuration file
combines all of them, then verifies that every configuration variable is valid and that the required ones have been specified

Then:

checks if the external programs are present (yt-dlp or youtube-dl, and ffmpeg)
initializes the actors
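
A check for external programs could be sketched as below. This is an assumption about one possible approach (trying to spawn the program with its version flag), not a description of how the tool does it; `is_available` and `pick_downloader` are hypothetical names.

```rust
use std::process::{Command, Stdio};

/// Return true if `program` could be spawned with the given version flag.
/// The exit code is ignored: only the ability to start the program matters.
fn is_available(program: &str, version_flag: &str) -> bool {
    Command::new(program)
        .arg(version_flag)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .is_ok()
}

/// Return the first candidate accepted by `probe`, in order of preference.
fn pick_downloader<'a>(
    candidates: &[&'a str],
    probe: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    candidates.iter().copied().find(|name| probe(name))
}
```

For example, `pick_downloader(&["yt-dlp", "youtube-dl"], |p| is_available(p, "--version"))` prefers yt-dlp when both are installed, and `is_available("ffmpeg", "-version")` covers ffmpeg.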

The actors

The main processing loop is structured using an actor model, as outlined below:

The list of playlist video IDs is downloaded and compared to the sqlite cache database to see whether there are new video streams to download
Each of these video IDs is sent to the Download Actor, which downloads the video's audio stream and passes it to the next actor
The Timestamp Actor parses the video description to detect timestamps in the video. It then passes each video section (a start time and an end time) to one of the next actors
The Clipper Actors clip the audio file to keep only the wanted video section, then apply audio normalization and other ffmpeg conversions to produce the output clip audio file
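
The step from timestamps to sections can be sketched as follows. This is a simplified illustration under two assumptions that are not stated above: each section ends where the next one starts, and the last one runs to the end of the audio. All names are hypothetical.

```rust
/// Parse a `H:MM:SS` or `M:SS` timestamp into seconds.
fn parse_timestamp(text: &str) -> Option<u64> {
    let mut seconds = 0u64;
    for part in text.split(':') {
        seconds = seconds * 60 + part.parse::<u64>().ok()?;
    }
    Some(seconds)
}

#[derive(Debug, PartialEq)]
struct Section {
    title: String,
    start: u64,
    /// `None` means "until the end of the audio stream".
    end: Option<u64>,
}

/// Turn (timestamp, title) pairs into sections.
/// Assumption: each section ends where the next one starts.
fn to_sections(marks: &[(&str, &str)]) -> Option<Vec<Section>> {
    let starts: Vec<u64> = marks
        .iter()
        .map(|(time, _)| parse_timestamp(time))
        .collect::<Option<_>>()?;
    let sections = marks
        .iter()
        .enumerate()
        .map(|(i, (_, title))| Section {
            title: title.to_string(),
            start: starts[i],
            end: starts.get(i + 1).copied(),
        })
        .collect();
    Some(sections)
}
```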

The actor model is useful here since it allows each actor to run on its own thread, and thus to optimize the work done concurrently:

As soon as the Download Actor has passed the audio file to the next actor, it will begin downloading the next one.
Timestamps for different audio files can be processed at the same time
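
To make the idea concrete, here is a minimal two-stage pipeline built on standard library channels and threads. It is a sketch of the general actor pattern, not the tool's implementation: the stage bodies are placeholders, and `run_pipeline` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Run a two-stage pipeline: a "download" stage feeding a "clip" stage.
/// Both stage bodies are placeholders, not the tool's real work.
fn run_pipeline(video_ids: Vec<String>) -> Vec<String> {
    let (to_download, download_inbox) = mpsc::channel::<String>();
    let (to_clipper, clipper_inbox) = mpsc::channel::<String>();
    let (to_output, output_inbox) = mpsc::channel::<String>();

    // Download actor: forwards each result, then moves on to the next message.
    let downloader = thread::spawn(move || {
        for id in download_inbox {
            let audio = format!("{id}.audio"); // placeholder for a download
            if to_clipper.send(audio).is_err() {
                break;
            }
        }
    });

    // Clipper actor: runs on its own thread, concurrently with the downloader.
    let clipper = thread::spawn(move || {
        for audio in clipper_inbox {
            let clip = format!("{audio}.clip"); // placeholder for ffmpeg work
            if to_output.send(clip).is_err() {
                break;
            }
        }
    });

    for id in video_ids {
        to_download.send(id).expect("download actor stopped early");
    }
    // Closing the first channel lets each actor finish and close the next one.
    drop(to_download);

    let results: Vec<String> = output_inbox.iter().collect();
    downloader.join().expect("download actor panicked");
    clipper.join().expect("clipper actor panicked");
    results
}
```

Because each stage owns its inbox, the downloader can start on the next ID while the clipper is still busy with the previous file.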

When things go bad

As this tool uses external programs, which communicate through the network, potentially using non-standard APIs... errors are bound to happen.

A lot has been done to handle failures as well as possible:

Retry operations
Detect unavailable video streams
Process files using temporary files, to avoid trashing the output directory in case of crash/failure
Save current state to handle unexpected crashes of the tool
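
As an illustration of the first point, a retry helper could look like the sketch below. The name `retry`, the fixed delay, and the attempt limit are assumptions made for this example; the tool's actual retry policy is not described here.

```rust
use std::thread;
use std::time::Duration;

/// Call `operation` until it succeeds or `max_attempts` attempts were made,
/// sleeping `delay` between attempts. At least one attempt is always made.
/// Returns the last error if every attempt fails.
fn retry<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut operation: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match operation() {
            Ok(value) => return Ok(value),
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => {
                attempt += 1;
                thread::sleep(delay);
            }
        }
    }
}
```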

At this point, the tool can be expected to work decently for personal usage, and should not require manual fiddling to get it out of a broken state. If that does happen, feel free to create a new issue.

Contributing

Feel free to create issues and/or submit pull requests. I cannot guarantee how long it will take to solve or review them, as it depends on how busy I am at the moment.
