Binary Files

This directory contains compiled binary files for the latest versions of various nzilbb.ag components.

These include:

nzilbb.ag.jar - the primary API/object model for nzilbb.ag
nzilbb.formatter.???.jar - a number of de/serialization modules that can, for example, be installed in LaBB-CAT to add support for format conversions.
???-to-???.jar - stand-alone utilities to perform specific format conversions - see below...

Standalone Format Converters

There are a number of converter utilities that can be used to convert files from one format to another, including:

trs - Transcriber transcripts
eaf - ELAN files
vtt - web subtitles (Web VTT)
slt - SALT transcripts
cha - CLAN CHAT transcripts
textgrid - Praat TextGrids
pdf - PDF files
tex - LaTeX files
txt - plain text files
kaldi - input files for the Kaldi automatic speech recognition training system

to↓ from→	trs	eaf	vtt	slt	cha	textgrid	txt
trs		eaf-to-trs	vtt-to-trs	slt-to-trs	cha-to-trs	textgrid-to-trs
eaf	trs-to-eaf		vtt-to-eaf	slt-to-eaf	cha-to-eaf	textgrid-to-eaf	txt-to-eaf
vtt	trs-to-vtt	eaf-to-vtt		slt-to-vtt	cha-to-vtt	textgrid-to-vtt
slt	trs-to-slt	eaf-to-slt
cha	trs-to-cha	eaf-to-cha	vtt-to-cha
textgrid	trs-to-textgrid	eaf-to-textgrid	vtt-to-textgrid	slt-to-textgrid	cha-to-textgrid
txt	trs-to-txt
pdf	trs-to-pdf	eaf-to-pdf	vtt-to-pdf	slt-to-pdf	cha-to-pdf	textgrid-to-pdf
tex	trs-to-tex	eaf-to-tex	vtt-to-tex	slt-to-tex		textgrid-to-tex
kaldi	trs-to-kaldi	eaf-to-kaldi				textgrid-to-kaldi
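The converter file names follow the ???-to-???.jar pattern used in the table. As a minimal sketch, a shell helper can build the jar name for a format pair and check whether that file has been downloaded. The function names converter_jar and have_converter are made up for this illustration, and not every format pair has a converter, so the table remains the reference for what actually exists.

```shell
# Print the jar name for a conversion, following the <from>-to-<to>.jar pattern.
converter_jar() {
  printf '%s-to-%s.jar\n' "$1" "$2"
}

# Succeed only if that jar is present in the given directory
# (third argument, defaulting to the current directory).
have_converter() {
  [ -f "${3:-.}/$(converter_jar "$1" "$2")" ]
}
```

For example, have_converter vtt textgrid && java -jar "$(converter_jar vtt textgrid)" would start the converter only if the jar is in the current directory.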

To use a particular converter, you need to have Java installed on your system. Download the file, and double-click it to run.

If double-clicking doesn't work, you can run the converter from the command line, by entering:

java -jar vtt-to-textgrid.jar

By default, converters display a window onto which you can drag and drop files for converting. However, they can also be run in 'batch mode', which allows you to automatically convert a list of files from the command line - e.g.

java -jar trs-to-textgrid.jar --batchmode *.trs
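The *.trs pattern above only matches files in the current directory. As a sketch, find can collect transcripts from a whole directory tree and pass them to a single batch run. This assumes trs-to-textgrid.jar is in the current directory and that files in subdirectories are handled the same way as files listed by hand. The convert_trs_tree function and the JAVA override variable are hypothetical names used only for this example.

```shell
# Convert every .trs file under a directory tree in one batch-mode run.
# JAVA may be overridden (e.g. JAVA=echo) to preview the command instead of running it.
convert_trs_tree() {
  find "$1" -type f -name '*.trs' \
    -exec "${JAVA:-java}" -jar trs-to-textgrid.jar --batchmode {} +
}
```

Running JAVA=echo convert_trs_tree /path/to/corpus first shows which files would be passed to the converter, and convert_trs_tree /path/to/corpus then performs the conversion.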

Some conversions have configurable output, e.g.

java -jar trs-to-txt.jar *.trs

...will include annotations and participant names in the output text files, but:

java -jar trs-to-txt.jar --textonly *.trs

...produces text files that exclude all annotations and participant names.

The --usage command-line switch prints information about command-line options.

As many formats do not support the meta-data, annotation granularity or ontology of other formats, many of these conversions necessarily entail loss of data. However, mappings are made from one format to another wherever possible.

For notes about specific correspondences or data losses, use the --help command-line switch, or use the Help|Information menu option of the conversion utility concerned.

Getting TextGrid Transcripts of YouTube Videos

The basic process of making a YouTube video corpus with TextGrid transcripts is:

Download YouTube videos with their closed-caption subtitle files.
Convert the closed-caption subtitle files to TextGrids.

1. Download YouTube videos with subtitles

There's a tool called yt-dlp which can be used for downloading videos from YouTube. It's a command-line program that can be given a URL for downloading. If the URL is a playlist, then it will download all videos in the playlist into separate files.

It has many useful command-line options:

--help - Displays information about all the other command-line options.
-f - Sets the format for the video, e.g. -f wav should download WAV files, -f mp4 will download MP4s, etc.
--extract-audio - Extracts the audio from the video after downloading it. This option is useful if -f wav doesn't work because WAV is not available.
--audio-format - Specifies what format to use if you're using --extract-audio.
--sub-lang - Downloads subtitle/closed-caption files for the given ISO language code, e.g. --sub-lang en-GB will download British English subtitles. The subtitle files are saved in ...vtt files.

So for example the following command will download MP4 files and English closed-captions for the given playlist:

yt-dlp -f mp4 --sub-lang en "https://www.youtube.com/playlist?list=PLdsZeeCVYnY3em55M2d3Iq3H-jSPLIWBF"

...and the following will download WAV files instead of MP4 for the given playlist:

yt-dlp --extract-audio --audio-format wav --sub-lang en "https://www.youtube.com/playlist?list=PLdsZeeCVYnY3em55M2d3Iq3H-jSPLIWBF"
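If no .vtt files appear after downloading, it may be that yt-dlp also needs to be told explicitly to write subtitle files. The following is a sketch only, using yt-dlp's --write-subs (uploaded subtitles), --write-auto-subs (automatically generated captions), --sub-langs, --sub-format and --skip-download options. The fetch_subtitles function and the YT_DLP override variable are hypothetical names for this example, and option behaviour may differ between yt-dlp versions, so check yt-dlp --help.

```shell
# Fetch only the subtitle files for a URL, without downloading the media again.
# Arguments: URL, then an optional language code (defaults to "en").
# YT_DLP may be overridden (e.g. YT_DLP=echo) to preview the command.
fetch_subtitles() {
  url=$1
  lang=${2:-en}
  "${YT_DLP:-yt-dlp}" --skip-download --write-subs --write-auto-subs \
    --sub-langs "$lang" --sub-format vtt "$url"
}
```

Quoting the URL when calling it, e.g. fetch_subtitles "https://www.youtube.com/playlist?list=..." en-GB, stops the shell from treating the ? in the URL as a wildcard.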

Caveat: YouTube's automatically-generated captions often repeat phrases as the recording progresses, so much of the resulting transcript can be doubled up. Make sure you eyeball your TextGrids at the end to ensure the results are what you expect!
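A rough way to spot the doubling before converting is to count caption lines that exactly repeat the previous caption line. This is a heuristic sketch, not part of any of the tools above: count_repeated_lines is a made-up name, and it only catches whole-line repeats, not partially overlapping phrases.

```shell
# Count caption text lines in a WebVTT file that exactly repeat
# the previous caption text line.
count_repeated_lines() {
  awk '
    # Skip the file header, cue timing lines, and blank lines.
    /^WEBVTT/ || /-->/ || /^[ \t\r]*$/ { next }
    $0 == prev { repeats++ }
    { prev = $0 }
    END { print repeats + 0 }
  ' "$1"
}
```

A count that is a large fraction of the file's caption lines suggests the transcript is doubled up and the resulting TextGrid needs a closer look.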

2. Convert subtitles to Praat TextGrids

vtt-to-textgrid.jar is a utility for converting VTT files to TextGrids.

(If you prefer ELAN transcripts, try vtt-to-eaf.jar instead)

It's a Java program, so you must have a recent version of Java installed for it to work.

Downloading the vtt-to-textgrid.jar file and double-clicking it should work, but if not, you can run it from the command line like this:

java -jar vtt-to-textgrid.jar

When it starts, it displays a window with a large white area.

Drag and drop VTT files into that area (or use the + button), and then click Convert. This will save each TextGrid in the same folder as the VTT file.

Validate Transcriber Transcripts

Some earlier versions of Transcriber occasionally saved transcript files with inconsistent turn alignments: the end time of a turn could be after the start time of the next turn.

These transcripts cause problems when processing transcripts for forced alignment, etc.
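To illustrate the kind of inconsistency involved, the sketch below lists turns that start before the previous turn has ended. It is not how transcriber-validator.jar works internally. It assumes each Turn start tag sits on a single line and carries startTime and endTime attributes, and find_overlapping_turns is a made-up name.

```shell
# Report each <Turn> in a .trs file whose startTime is earlier than
# the endTime of the turn before it.
find_overlapping_turns() {
  awk '
    /<Turn / {
      start = attr($0, "startTime")
      stop = attr($0, "endTime")
      if (seen && start + 0 < prev_stop + 0)
        printf "line %d: turn starts at %s but previous turn ends at %s\n", NR, start, prev_stop
      prev_stop = stop
      seen = 1
    }
    # Return the value of the named attribute on this line, or "" if absent.
    function attr(line, name,   re, s) {
      re = name "=\"[^\"]*\""
      if (match(line, re) == 0) return ""
      s = substr(line, RSTART, RLENGTH)
      sub(/^[^"]*"/, "", s)
      sub(/"$/, "", s)
      return s
    }
  ' "$1"
}
```

Running find_overlapping_turns some-transcript.trs prints one line per suspect turn, and prints nothing if all turns are in sequence.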

The transcriber deserialization module here, transcriber-validator.jar, is a command-line tool that can be used to fix up such corrupted transcripts. If you download transcriber-validator.jar, you can invoke it using your command shell, like this:

java -jar transcriber-validator.jar some-transcript.trs

By default, the utility checks and validates the given transcript(s), saving the results in a subdirectory called valid. This ensures that transcripts are copied rather than directly changed, and so the original transcript files are untouched.

Transcripts can be changed in situ if required (i.e. changing the original file) by using the --replace command-line switch.

And if you name a directory instead of a .trs file, then the directory is recursively scanned for .trs files to process.

So to check/fix all transcripts in a given directory, use the command:

java -jar transcriber-validator.jar --replace /path/to/directory/with/trs/files

For full information about command line options:

java -jar transcriber-validator.jar --usage
