Jellyfin HA transcoding fork: Redis-backed session failover + experimental PostgreSQL provider #16415

ZoltyMat · 2026-03-14T06:51:24Z


I've been working on a fork of Jellyfin focused on one specific problem: making HLS transcoding survive pod restarts in a multi-replica Kubernetes deployment.

What it does

Right now, Jellyfin assumes transcode state lives in one server process. If that pod dies, active transcodes die with it. This fork adds a small HA layer so transcode ownership can survive a pod restart:

A new ITranscodeSessionStore abstraction for durable transcode session tracking
A RedisTranscodeSessionStore implementation with lease-based ownership
Atomic pod takeover using a Redis Lua script when a lease expires
Lease-aware cleanup so one pod does not delete segments another pod still needs
A NullTranscodeSessionStore fallback, so single-instance deployments behave exactly like upstream with no config changes
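To make the shape of those pieces concrete, here is a minimal sketch in Python rather than the fork's C#. The field names, method names, and the `may_delete_segments` helper are assumptions for illustration, not the fork's actual API; only the store and session names echo the list above.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class TranscodeSession:
    session_id: str
    owner_pod: str
    lease_expires_at: float  # seconds, on whatever clock the store uses


class TranscodeSessionStore(Protocol):
    """Illustrative stand-in for the ITranscodeSessionStore abstraction."""

    def get(self, session_id: str) -> Optional[TranscodeSession]: ...
    def register(self, session: TranscodeSession) -> None: ...


class NullTranscodeSessionStore:
    """No-op store: remembers nothing, so callers never see a shared session."""

    def get(self, session_id: str) -> Optional[TranscodeSession]:
        return None

    def register(self, session: TranscodeSession) -> None:
        pass


def may_delete_segments(
    store: TranscodeSessionStore, session_id: str, this_pod: str, now: float
) -> bool:
    """Lease-aware cleanup: refuse only while another pod holds a live lease."""
    session = store.get(session_id)
    if session is None:
        return True  # no shared record, so behave like a single instance
    if session.owner_pod == this_pod:
        return True  # this pod owns the session and may clean up its own files
    return session.lease_expires_at <= now  # foreign lease must have expired
```

With the no-op store, `may_delete_segments` always returns `True`, which is how a single-instance deployment would keep its existing cleanup behavior in this sketch.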

I also added an experimental PostgreSQL provider for shared-database deployments, since SQLite is not a good fit once multiple replicas are involved.

What the HA flow looks like

Pod A starts an HLS transcode and registers the session in Redis
Pod A renews the lease while it owns the session
If Pod A dies, the lease expires
Pod B receives the next request, atomically claims the expired lease, and resumes from the last completed segment on shared storage
The client sees a short buffer pause instead of a hard failure
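The flow above can be sketched with an in-memory stand-in for Redis. This is not the fork's Lua script: a lock plays the role of the script's atomicity, and the class name, method name, and injected `now` clock are all hypothetical.

```python
import threading
from typing import Dict, Tuple


class InMemoryLeaseStore:
    """In-process stand-in for Redis; the lock mimics Lua-script atomicity."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # session id -> (owner pod, lease expiry time)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def claim_or_renew(
        self, session_id: str, pod: str, now: float, lease_seconds: float
    ) -> bool:
        """Return True if `pod` owns the session after this call."""
        with self._lock:
            current = self._leases.get(session_id)
            if current is not None:
                owner, expires_at = current
                if owner != pod and expires_at > now:
                    return False  # another pod still holds a live lease
            # Either unowned, already ours (renewal), or expired (takeover).
            self._leases[session_id] = (pod, now + lease_seconds)
            return True


store = InMemoryLeaseStore()
store.claim_or_renew("s1", "pod-a", now=0.0, lease_seconds=30.0)   # Pod A registers
store.claim_or_renew("s1", "pod-a", now=20.0, lease_seconds=30.0)  # Pod A renews
```

If Pod A stops renewing, a claim from Pod B is rejected until the lease expiry passes and succeeds afterwards. Because the check and the write happen under one lock, two pods racing for the same expired lease cannot both win.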

How to run it

There are three practical modes:

1. Single instance

No config needed. It falls back to the no-op store automatically.

2. Local HA test

Run two Jellyfin instances against:

the same Redis
the same shared transcode directory

That is enough to test failover behavior locally.

3. Kubernetes / k3s

This is the intended deployment model. You need:

2+ Jellyfin replicas
Redis
shared RWX storage for transcode output
shared media storage
ideally PostgreSQL if you want a proper shared DB setup

The key config settings are:

Jellyfin:TranscodeStore:RedisConnectionString
Jellyfin:TranscodeStore:LeaseDurationSeconds
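For container deployments, these keys are usually easier to pass as environment variables. The sketch below assumes the fork reads them through the standard .NET configuration system with the environment-variable provider enabled, where `__` stands in for `:`. The values are placeholders, not recommended settings.

```python
from typing import Dict


def to_env_var(config_key: str) -> str:
    """Map a .NET configuration key to its environment-variable spelling."""
    return config_key.replace(":", "__")


def build_env(settings: Dict[str, str]) -> Dict[str, str]:
    """Translate a dict of config keys into environment-variable entries."""
    return {to_env_var(key): value for key, value in settings.items()}


env = build_env({
    "Jellyfin:TranscodeStore:RedisConnectionString": "redis:6379",  # hypothetical value
    "Jellyfin:TranscodeStore:LeaseDurationSeconds": "30",           # hypothetical value
})

for name, value in env.items():
    print(f"{name}={value}")
```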

Repo and write-up:

Source: https://github.com/ZoltyMat/jellyfin-ha
Full change summary vs upstream: https://github.com/ZoltyMat/jellyfin-ha/blob/main/docs/FORK-DIFF.md
Write-up with diagrams and k8s manifests: https://blog.zolty.systems/posts/jellyfin-ha-kubernetes

What would be required to merge upstream

I do not expect this to be merged as-is without discussion. If there is interest, I think the realistic path is to split it into small pieces:

Introduce ITranscodeSessionStore, TranscodeSession, and NullTranscodeSessionStore only
Add the DI wiring with no behavior change unless configured
Add HLS session registration and lease renewal hooks
Add lease-aware cleanup in DeleteTranscodeFileTask
Add takeover logic in the HLS/session path
Discuss whether Redis should be the first supported distributed store, or whether the interface should land before any concrete implementation
Treat PostgreSQL as a separate discussion entirely

I think the HA transcode work has a better chance of review if it is separated from the PostgreSQL provider and migration tooling.

Why I'm posting it

I'm not trying to maintain a permanent hard fork. I built this to see whether Jellyfin could be made to behave well in a replicated environment without rewriting major subsystems. The answer seems to be yes, but it needs maintainers to decide whether this kind of deployment is something upstream wants to support.

If there's interest, I'm happy to break the work into smaller PRs, clean up anything that does not match project direction, and rework the design around maintainer feedback.

jaredglaser · 2026-05-30T19:52:10Z


I would love this to work without the need for shared storage or a shared DB. Allowing a small amount of loss (< 1 min) by using replication and a read-only replica that becomes writable during failover would be awesome. The client knows where it is in a stream anyway, so except for very specific cases the failover would not be noticeable to users at all.

1 reply

jaretclifton Jun 1, 2026

I've had a lot of success using lsyncd to replicate data from my "primary" JF node to my cold-standby "secondary" node. lsyncd replicates changed files every 2 minutes or so, and a script runs a SQLite backup every 5 minutes. That backup is shuttled from the primary to the secondary via rsync. A watcher script on the secondary checks the VIP (all of this sits behind HAProxy) as well as the primary instance directly. If it sees 3 failures in a row, it copies the current database to a backup location on the NVMe, copies the latest DB synced from the primary into the proper location (setting appropriate permissions), and finally starts the JF service.
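The "3 failures in a row" trigger could look something like this. It is only a sketch of the counting logic, not the actual watcher script: the class name is made up, and it assumes a round counts as a failure only when both the VIP and the primary are unreachable.

```python
class FailoverWatcher:
    """Counts consecutive failed probe rounds and signals when to promote."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, vip_ok: bool, primary_ok: bool) -> bool:
        """Record one probe round; return True when the secondary should start."""
        if vip_ok or primary_ok:
            # Assumption: any successful probe resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

Resetting on any success keeps a single dropped probe from triggering the database swap.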

When the primary is healthy again, the secondary shuts down. Yes, this can lose data: anything watched on the secondary is missing from the primary's DB. This is only really a problem if the primary stays offline for more than a few minutes, since the client tracks its own location as you mentioned, and it doesn't cause an issue for things like a crash or reboot of the primary.

Janky? Sure.
Functional? Very.
Full of holes? Highly likely.
Fun to build out? Absolutely.

No shared storage and no shared db required. Your mileage may vary. :)

ZoltyMat · 2026-06-01T22:58:21Z

Author

If you leave Redis in for session tracking, HA for direct play becomes very doable. A lot of the shared storage and extra bits are just there as a kind of self-challenge: if I were to build out a Netflix competitor, that's how I'd do it.

@jaredglaser That's a great point. Tolerating a small amount of loss (<1 min) and leaning on replication plus a promotable read replica is a really clean way to approach this without requiring shared storage at all. The client tracking its own position already covers most of the gap.

@jaretclifton Love this setup. lsyncd + SQLite backup over rsync with an HAProxy VIP and a watcher script is honestly pretty elegant for what it is. "Janky? Sure. Functional? Very." is basically the best summary of homelab HA philosophy. The fact that it handles the common cases (crash/reboot) cleanly without needing shared storage or a distributed DB is exactly the kind of pragmatic solution that works well in practice.

2 replies

jaredglaser Jun 7, 2026

I would imagine you would want to avoid the approach in @jaretclifton's setup, since it induces data loss. There could easily be cases where you lose a considerable amount of data. If you already have a load balancer anyway, it would be better to have an API endpoint that modifies the read/write state. During failover, the load balancer tells the failover node to enter write mode. Then, when the original node passes a health check, it is flipped into read mode before being adopted back into the load balancer.
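As a rough sketch of that ordering (no such endpoint exists in Jellyfin today as far as this thread says, so every name here is hypothetical):

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    READ = "read"
    WRITE = "write"


@dataclass
class Node:
    name: str
    mode: Mode
    in_rotation: bool


def fail_over(failed: Node, standby: Node) -> None:
    """Load balancer drops the failed node and promotes the standby to write."""
    failed.in_rotation = False
    standby.mode = Mode.WRITE
    standby.in_rotation = True


def readmit(recovered: Node, passed_health_check: bool) -> bool:
    """Flip a recovered node to read mode *before* it rejoins the pool."""
    if not passed_health_check:
        return False
    recovered.mode = Mode.READ  # demote first so two writers never coexist
    recovered.in_rotation = True
    return True
```

The point of the ordering in `readmit` is that the old primary never serves traffic while it still believes it is the writer.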

jaretclifton Jun 9, 2026

If you can find a way to get sqlite3_rsync to behave nicely for anything over 1GB, or anything that uses Plex's (yes, I run Plex and Jellyfin on the same system doing the same shuttling) weird ICU collation, then please let me know so I can swap to it. In all my testing, it would choke on anything over 1GB and absolutely choke on the funky ICU collation. That's why I tolerate the potential for data loss. Postgres is the way for sure, but that's going to take forever for the devs to get working properly.
