Automatic subtitle synchronization tool
Did you know that hundreds of movies, especially from the 1950s and '60s, are now in the public domain and available online? Great! Let's download Plan 9 from Outer Space. As a non-native English speaker, I prefer watching movies with subtitles, which can also be found online for free. However, sometimes there is a problem: the subtitles are not in sync with the movie.
But fear not. This tool can resynchronize the subtitles without any human input. A correction for both shift and playing speed can be found automatically... using "AI & machine learning"
macOS / OSX
brew install ffmpeg
pip install autosubsync
Linux (Debian & Ubuntu)
Make sure you have Pip, e.g.,
sudo apt-get install python-pip
Then install FFmpeg and this package:
sudo apt-get install ffmpeg
sudo pip install autosubsync
Note: If you are running Ubuntu 14 (but not 12 and 16, which are fine), you'll need to jump through some more hoops to install FFmpeg.
autosubsync [input movie] [input subtitles] [output subs]

# for example
autosubsync plan-9-from-outer-space.avi \
  plan-9-out-of-sync-subs.srt \
  plan-9-subtitles-synced.srt

See autosubsync --help for more details.
- Automatic speed and shift correction
- Typical synchronization accuracy ~0.15 seconds (see performance)
- Wide video format support through ffmpeg
- Supports all reasonably encoded SRT files in any language
- Should work with any language in the audio (only tested with a few though)
- Quality-of-fit metric for checking sync success
import autosubsync
autosubsync.synchronize("movie.avi", "subs.srt", "synced.srt")
# see help(autosubsync.synchronize) for more details
Training the model
- Collect a bunch of well-synchronized video and subtitle files and put them
in a file called
- Run (and see)
- populates the
- runs cross-validation
- populates the
Assumes trained model is available as
python3 autosubsync/main.py input-video-file input-subs.srt synced-subs.srt
Build and distribution
- Create virtualenv:
python3 -m venv venvs/test-python3
- Activate venv:
pip install -e .
pip install wheel
python setup.py bdist_wheel
The basic idea is to first detect speech on the audio track, that is, for each point in time, t, in the film, to estimate if speech is heard. The method described below produces this estimate as a probability of speech p(t). Another input to the program is the unsynchronized subtitle file containing the timestamps of the actual subtitle intervals.
Synchronization is done by finding a time transformation t → f(t) that makes s(f(t)), the synchronized subtitles, best match p(t), the detected speech. Here s(t) is the (unsynchronized) subtitle indicator function whose value is 1 if any subtitles are visible at time t and 0 otherwise.
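To make the notation concrete, here is a minimal sketch of the indicator s(t) and its resynchronized version s(f(t)) for a linear f(t) = a t + b. The function names and the interval values are made up for illustration and are not taken from this package's code.

```python
def subtitle_indicator(intervals, t):
    """s(t): 1 if any subtitle interval (start, end), in seconds, covers t, else 0."""
    return int(any(start <= t < end for start, end in intervals))


def resynced_indicator(intervals, t, a=1.0, b=0.0):
    """s(f(t)) for the linear transformation f(t) = a * t + b."""
    return subtitle_indicator(intervals, a * t + b)


# Hypothetical subtitle timestamps (seconds), as read from an SRT file
intervals = [(1.0, 3.0), (5.0, 6.5)]

# Evaluate s(f(t)) at a few film times with a shift of b = 2 seconds
values = [resynced_indicator(intervals, t, a=1.0, b=2.0) for t in (0.0, 2.0, 3.5)]
```

Treating each interval as half-open (start included, end excluded) is an assumption of this sketch.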
Speech detection (VAD)
Speech detection is done by first computing a spectrogram of the audio, that is, a matrix of features, where each column corresponds to a frame of duration Δt and each row a certain frequency band. Additional features are engineered by computing a rolling maximum of the spectrogram with a few different periods.
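The rolling-maximum feature engineering could look roughly like the sketch below. It assumes the spectrogram is already available as a list of rows (one per frequency band, one value per frame) and uses a trailing window; the window alignment, the periods and the function names are assumptions, not necessarily what this project does.

```python
def rolling_max(row, period):
    """Trailing rolling maximum of one frequency band over `period` frames."""
    return [max(row[max(0, i - period + 1): i + 1]) for i in range(len(row))]


def add_rolling_max_features(spectrogram, periods=(2, 4)):
    """Stack the original rows with one rolling-max copy of each row per period.

    Columns still correspond to frames, so every frame simply gets a
    longer feature vector.
    """
    features = [list(row) for row in spectrogram]
    for period in periods:
        features.extend(rolling_max(row, period) for row in spectrogram)
    return features


# Toy spectrogram: 2 frequency bands x 5 frames (made-up numbers)
spectrogram = [
    [0.1, 0.9, 0.2, 0.1, 0.5],
    [0.3, 0.2, 0.8, 0.1, 0.0],
]
features = add_rolling_max_features(spectrogram, periods=(2, 3))
```

With two bands and two periods, the result has 2 + 2 * 2 = 6 feature rows per frame.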
Using a collection of correctly synchronized media files, one can create a training data set, where each feature column is associated with a correct label. This allows training a machine learning model to predict the labels, that is, detect speech, on any previously unseen audio track - as the probability of speech p(iΔt) on frame number i.
The weapon of choice in this project is logistic regression, a common baseline method in machine learning, which is simple to implement. The accuracy of speech detection achieved with this model is not very good, only around 72% (AURoC). However, the speech detection results are not the final output of this program but just an input to the synchronization parameter search. As mentioned in the performance section, the overall synchronization accuracy is quite fine even though the speech detection is not.
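For readers unfamiliar with logistic regression, the following is a tiny from-scratch sketch trained with plain gradient descent on synthetic toy data. It only illustrates how per-frame feature columns and 0/1 speech labels turn into a probability; the actual project may well rely on a library implementation, and all names, hyperparameters and numbers here are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_speech_probability(weights, bias, x):
    """p(speech) for one feature column x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(columns, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the mean log loss."""
    n = len(columns)
    weights = [0.0] * len(columns[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(columns, labels):
            err = predict_speech_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic toy data: one feature column per frame, label 1 = subtitles visible
columns = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
labels = [0, 1, 0, 1]

weights, bias = train_logistic_regression(columns, labels)
p_loud = predict_speech_probability(weights, bias, [0.85, 0.9])
p_quiet = predict_speech_probability(weights, bias, [0.1, 0.1])
```

On this toy data the model assigns a higher speech probability to the "loud" frame than to the "quiet" one.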
Synchronization parameter search
This program only searches for linear transformations of the form f(t) = a t + b, where b is shift and a is speed correction. The optimization method is brute force grid search where b is limited to a certain range and a is one of the common skew factors. The parameters minimizing the loss function are selected.
The data produced by the speech detection phase is a vector representing the speech probabilities in frames of duration Δt. The metric used for evaluating match quality is expected linear loss:
loss(f) = Σ_i [ s(f_i) (1 - p_i) + (1 - s(f_i)) p_i ],
where p_i = p(iΔt) is the probability of speech and s(f_i) = s(f(iΔt)) = s(a iΔt + b) is the subtitle indicator resynchronized using the transformation f at frame number i.
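The loss and the brute-force search over (a, b) can be sketched as follows. This is a simplified illustration, not the package's implementation: the function names, the shift range, the step size and the toy probabilities are all assumptions.

```python
def subtitle_indicator(intervals, t):
    """s(t): 1 if any subtitle interval (start, end) covers t, else 0."""
    return int(any(start <= t < end for start, end in intervals))


def expected_linear_loss(p, intervals, a, b, dt):
    """Sum over frames i of s(f_i)(1 - p_i) + (1 - s(f_i)) p_i, with f(t) = a*t + b."""
    total = 0.0
    for i, p_i in enumerate(p):
        s_i = subtitle_indicator(intervals, a * i * dt + b)
        total += s_i * (1.0 - p_i) + (1 - s_i) * p_i
    return total


def grid_search(p, intervals, dt, speeds, max_shift, shift_step):
    """Try every (a, b) on the grid; return (loss, a, b) with the smallest loss."""
    n_steps = int(round(max_shift / shift_step))
    shifts = [k * shift_step for k in range(-n_steps, n_steps + 1)]
    return min(
        (expected_linear_loss(p, intervals, a, b, dt), a, b)
        for a in speeds
        for b in shifts
    )


# Toy input: frames of dt = 0.5 s, speech detected between 2.0 s and 4.0 s
dt = 0.5
p = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
intervals = [(3.0, 5.0)]  # timestamps in the out-of-sync subtitle file

best_loss, best_a, best_b = grid_search(
    p, intervals, dt, speeds=[1.0], max_shift=2.0, shift_step=0.5
)
```

In this toy case the search picks b = 1.0, the shift at which the subtitle interval lines up exactly with the frames where speech was detected.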
Speed/skew detection is based on the assumption that an error in playing speed is not an arbitrary number but is caused by a frame rate mismatch, which constrains the possible playing speed multiplier to be a ratio of two common frame rates that is sufficiently close to one. In particular, it must be one of the following values
- 24/23.976 = 30/29.97 = 60/59.94 = 1001/1000
or the reciprocal (1/x).
The reasoning behind this is that if the frame rate of (digital) video footage needs to be changed and the target and source frame rates are close enough, the conversion is often done by skipping any re-sampling and just changing the nominal frame rate. This effectively changes the playing speed of the video and the pitch of the audio by a small factor which is the ratio of these frame rates.
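As a sanity check of the arithmetic, the candidate multipliers can be derived from a list of frame rates. The sketch below uses exact fractions, relying on the fact that the nominal rates 23.976, 29.97 and 59.94 are exactly 24000/1001, 30000/1001 and 60000/1001. The closeness threshold and the function name are assumptions chosen for illustration.

```python
from fractions import Fraction


def speed_candidates(frame_rates, max_deviation):
    """Ratios of two different frame rates that lie within max_deviation of one."""
    candidates = set()
    for source in frame_rates:
        for target in frame_rates:
            ratio = target / source
            if ratio != 1 and abs(ratio - 1) <= max_deviation:
                candidates.add(ratio)
    return sorted(candidates)


frame_rates = [
    Fraction(24000, 1001), Fraction(24),  # 23.976 and 24 fps
    Fraction(30000, 1001), Fraction(30),  # 29.97 and 30 fps
    Fraction(60000, 1001), Fraction(60),  # 59.94 and 60 fps
]

# Assumed threshold: keep only ratios within 1% of one
candidates = speed_candidates(frame_rates, max_deviation=Fraction(1, 100))
```

All three pairs collapse to the same two values, 1001/1000 and its reciprocal 1000/1001.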
Based on somewhat limited testing, the typical shift error in auto-synchronization seems to be around 0.15 seconds (cross-validation RMSE) and generally below 0.5 seconds. In other words, it seems to work well enough in most cases but could be better. Speed correction errors did not occur.
Auto-syncing a full-length movie currently takes about 3 minutes and utilizes around 1.5 GB of RAM.
I first searched Google to see whether someone had already tried to solve the same problem and found this great blog post whose author had implemented a solution using more or less the same approach that I had in mind. The post also included good points that I had not realized, such as using correctly synchronized subtitles as training data for speech detection.
Instead of starting from the code linked in that blog post, I decided to implement my own version from scratch, since this might have been a good application for trying out RNNs. They turned out to be unnecessary, but this was a nice project nevertheless.
Other similar projects
- https://github.com/tympanix/subsync Apparently based on the blog post above, looks good
- https://github.com/smacke/subsync Newer project, uses WebRTC VAD (instead of DIY machine learning) for speech detection
- https://github.com/pulasthi7/AutoSubSync-old & https://github.com/pulasthi7/AutoSubSync (looks inactive)