Designed for the OL2 data from the VARIAGE project
cite as (APA 7): Steiner, Linda. (2026). Variage File Checker (Version 1) [Computer software]. GitHub. https://linste-zh.github.io/VariageFileChecker
This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Please address your feedback or error reports to linste.zh@gmail.com
This code was written as part of the Variage Project (https://www.variage.ch/). The Variage Project received support from the VELUX Stiftung (https://veluxstiftung.ch/) and the UZH foundation (https://www.uzhfoundation.ch/en/).
The code was written by Linda Steiner, a student research assistant to Professor Simone Pfenninger at the English Department of the University of Zurich.
The script is called via terminal and takes as input the name of a file directory.
This directory is expected to contain the corrected transcription files (.txt) in /2_corrected_transcripts, and their original copies (.txt) together with the basic audio files (.mp3 and .wav) in /3_finished_correcting. In addition, some files may be excluded from correction (/4_ignore).
All files are expected to follow the naming convention: "TimeframeID_ParticipantID_ExerciseName.txt".
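As an illustration of the naming convention, a file name could be split into its parts roughly as follows. This is a minimal sketch, not the code used in main.py: the regular expression, the `ParsedName` type and the `parse_file_name` function are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for "TimeframeID_ParticipantID_ExerciseName.ext".
# The actual script may parse file names differently.
NAME_PATTERN = re.compile(
    r"^(?P<timeframe>[^_]+)_(?P<participant>[^_]+)_(?P<exercise>[^_.]+)"
    r"\.(?P<ext>txt|mp3|wav)$"
)


class ParsedName(NamedTuple):
    timeframe: str
    participant: str
    exercise: str
    ext: str


def parse_file_name(file_name: str) -> Optional[ParsedName]:
    """Return the parts of a conforming file name, or None if it does not match."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return ParsedName(**match.groupdict())


# A conforming name yields its four parts; "xxx.csv" does not match at all.
print(parse_file_name("T08_01_OL2.txt"))
print(parse_file_name("xxx.csv"))
```

Files that do not match the pattern (such as "xxx.csv" in the example further down) return None and can then be reported separately.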
The script browses the files in the directory and compares them against a list of expected participants, provided in the file "ids.txt". It takes the following actions:
- if a file name contains an expected participant ID, the file's existence is marked in a data frame for that participant by its type (.txt/.wav/.mp3) and by its folder (finished/corrected/ignored)
- if a file name contains an unexpected participant ID, a new row is generated with a warning in a designated column
- once all files have been browsed, the generated data frame with the tracking information is saved as a CSV file in the provided directory
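The tracking steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `build_tracker` and `tracker_to_csv` are hypothetical helpers, plain dictionaries stand in for the data frame, and the column names are taken from the example tracker file shown further down.

```python
import csv
import io

COLUMNS = ["p_id", "txt", "mp3", "wav", "other", "warnings"]


def build_tracker(expected_ids, found_files):
    """Mark each found file by type and folder.

    found_files is an iterable of (participant_id, extension, folder_label).
    """
    rows = {p_id: {col: [] for col in COLUMNS[1:]} for p_id in expected_ids}
    for p_id, ext, folder in found_files:
        if p_id not in rows:
            # Unexpected participant: add a new row and flag it.
            rows[p_id] = {col: [] for col in COLUMNS[1:]}
            rows[p_id]["warnings"].append("unknown ID")
        column = ext if ext in ("txt", "mp3", "wav") else "other"
        rows[p_id][column].append(folder)
    return rows


def tracker_to_csv(rows):
    """Serialise the tracker; several entries in one cell are joined by '; '."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for p_id, cells in rows.items():
        writer.writerow([p_id] + ["; ".join(cells[col]) for col in COLUMNS[1:]])
    return buffer.getvalue()


found = [
    ("01", "txt", "finished"),
    ("01", "txt", "corrected"),
    ("01", "mp3", "finished"),
    ("70", "txt", "corrected"),
]
rows = build_tracker(["01", "02"], found)
print(tracker_to_csv(rows))
```

Expected participants without any file keep an empty row, which is what makes missing files visible in the tracker.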
In addition, if a file is supposed to be a corrected transcript, the code:
- checks whether a minimum length is reached; otherwise it adds a warning to the designated warning column
- checks whether the transcript contains a minimum number of "ehm"s and dashes, which indicates that the file has actually been corrected; otherwise it adds a warning to the designated warning column
- fixes common mistakes in the annotation scheme of the data
- ensures the file is formatted according to expectations (i.e. initial line contains participant ID and TF, the remaining transcript is all in the following line with no line breaks)
- saves the formatted and checked output as a guaranteed UTF-8 encoded text file in the folder (/1_corrected_transcripts_formatted)
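The checks and the formatting step listed above might look roughly like this. It is a hedged sketch only: the thresholds `MIN_LENGTH`, `MIN_EHMS` and `MIN_DASHES` and both function names are assumptions for illustration and do not reflect the values or the code in Transcript.py. The warning texts follow the example tracker file shown further down.

```python
# Assumed thresholds; the real script may use different values or units.
MIN_LENGTH = 200   # minimum number of characters
MIN_EHMS = 3       # minimum number of "ehm" occurrences
MIN_DASHES = 3     # minimum number of dashes


def check_transcript(text):
    """Return a list of warnings for a corrected transcript."""
    if len(text.strip()) < MIN_LENGTH:
        # Assumption: a very short file is not checked any further.
        return ["very short"]
    warnings = []
    if text.lower().count("ehm") < MIN_EHMS:
        warnings.append("few or no ehms")
    if text.count("-") < MIN_DASHES:
        warnings.append("few or no dashes")
    return warnings


def format_transcript(header, text):
    """Put the header on the first line and the whole transcript on the second."""
    body = " ".join(text.split())  # collapses line breaks and repeated spaces
    return f"{header}\n{body}\n"


print(check_transcript(""))
print(format_transcript("T08_01", "first line\nsecond  line\n"))
```

Returning early for very short files is a design choice made here so that an empty file only carries a single warning.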
This allows the researchers to 1. find any potentially missing or corrupted files and 2. increase the quality and reliability of the final transcript file.
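One possible way to guarantee UTF-8 output for files that arrive in another encoding is to try a short list of encodings when reading and to always write UTF-8. The sketch below is an assumption about how this could be done, not a description of the actual implementation; the fallback list and the function names are hypothetical.

```python
from pathlib import Path

# Assumed order of encodings to try; adjust to the files actually encountered.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def read_text_any_encoding(path):
    """Decode a file with the first candidate encoding that works."""
    raw = Path(path).read_bytes()
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin-1")


def save_as_utf8(text, path):
    """Write text as UTF-8, creating the target folder if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
```

Note that a wrong guess among single-byte encodings cannot be detected automatically, so spot-checking special characters in the output remains advisable.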
Input directory: "OL2/Time 08"
Expected Participants: 01, 02, 03, 04, 05, 06, 07, 08, 09
Input directory structure:
- └── Time 08
    - └── 2_corrected_transcripts
        - T08_01_OL2.txt
        - T08_02_OL2.txt (uncorrected file uploaded)
        - T08_03_OL2.txt (file contains common annotation and formatting mistakes)
        - T08_04_OL2.txt (file not UTF-8 encoded)
        - T08_06_OL2.txt (file empty)
        - T08_70_OL2.txt
    - └── 3_finished_correcting
        - T08_01_OL2.txt
        - T08_01_OL2.mp3
        - T08_01_OL2.wav
        - T08_02_OL2.mp3 (missing original txt file)
        - T08_02_OL2.wav
        - T08_03_OL2.txt
        - T08_03_OL2.mp3 (missing wav file)
        - T08_04_OL2.txt
        - T08_04_OL2.mp3
        - T08_04_OL2.wav
        - T08_05_OL2.txt
        - T08_05_OL2.mp3
        - T08_05_OL2.wav
        - T08_06_OL2.wav (missing original txt file and mp3 file)
        - T08_07_OL2.txt
        - T08_07_OL2.mp3
        - T08_07_OL2.wav
    - └── 4_ignore
        - T08_09_OL2.txt
        - T08_09_OL2.mp3
        - T08_09_OL2.wav
    - T08_08_OL2.txt
    - T08_08_OL2.mp3
    - T08_08_OL2.wav
    - xxx.csv
Console Command (Windows):
C:\Users\YOURPATH\Variage File Checker> python main.py "YOURPATH\Time 08"
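For orientation, reading the directory argument from the command line could be handled along these lines. This is only a sketch under assumptions: `resolve_input_directory` and `missing_subfolders` are hypothetical helpers and are not taken from main.py; the subfolder names come from the description above.

```python
from pathlib import Path

# Subfolders the script expects inside the input directory.
EXPECTED_SUBFOLDERS = (
    "2_corrected_transcripts",
    "3_finished_correcting",
    "4_ignore",
)


def resolve_input_directory(argv):
    """Return the directory passed as the only argument, or exit with a message."""
    if len(argv) != 2:
        raise SystemExit('usage: python main.py "PATH/TO/TIMEFRAME_FOLDER"')
    directory = Path(argv[1])
    if not directory.is_dir():
        raise SystemExit(f"not a directory: {directory}")
    return directory


def missing_subfolders(directory):
    """List the expected subfolders that do not exist in the directory."""
    return [name for name in EXPECTED_SUBFOLDERS if not (directory / name).is_dir()]
```

In a script, `resolve_input_directory(sys.argv)` would be called first (after `import sys`). Quoting the path, as in the command above, keeps a folder name containing a space such as "Time 08" together as a single argument.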
Input directory structure after execution:
- └── Time 08
    - └── 1_corrected_transcripts_formatted
        - T08_01_OL2.txt
        - T08_02_OL2.txt
        - T08_03_OL2.txt (file no longer contains common annotation and formatting mistakes)
        - T08_04_OL2.txt (file now UTF-8 encoded)
        - T08_06_OL2.txt
        - T08_70_OL2.txt
        - T08_08_OL2.txt
    - └── 2_corrected_transcripts (unchanged)
        - ...
    - └── 3_finished_correcting (unchanged)
        - ...
    - └── 4_ignore (unchanged)
        - ...
Tracker file (.csv):
|p_id| txt                | mp3     | wav     | other | warnings                        |
|____|____________________|_________|_________|_______|_________________________________|
|01  |finished; corrected |finished |finished |       |                                 |
|02  |corrected           |finished |finished |       |few or no ehms; few or no dashes |
|03  |finished; corrected |finished |         |       |                                 |
|04  |finished; corrected |finished |finished |       |                                 |
|05  |finished            |finished |finished |       |                                 |
|06  |corrected           |         |finished |       |very short                       |
|07  |finished            |finished |finished |       |                                 |
|08  |open                |open     |open     |       |                                 |
|09  |ignored             |ignored  |ignored  |       |                                 |
|70  |corrected           |         |         |       |unknown ID                       |
|xxx |                    |         |         |open   |unknown ID                       |
The code directly resolved the issues with participants 03 and 04 (annotation and formatting mistakes, and a file that was not UTF-8 encoded). In addition, it reveals that:
- participants 01 and 04: seem to be okay
- participant 02: the file does not seem to have been corrected, and the original file is missing
- participant 03: the .wav file is missing
- participant 05: no corrected file was uploaded
- participant 06: the original file is missing, as is the mp3 file. In addition, the corrected file seems to be too short
- participant 07: there is no file matching the ID in corrected, but there is an unknown ID 70, so this is probably a typo
- participant 08: the files are still open, i.e. correction has not been started
- participant 09: is ignored, which is good
- there is another file "xxx" in the base folder
Thus, the remaining issues can be identified and resolved without having to manually check multiple folders, and issues that are not visible in the directory itself can be detected as well.
While this code is currently highly specific to the Variage OL2 task, it can be adapted to fit other project schemes as well. The Tracker.py and File.py classes are written in a very general way, so class instances can be adjusted as needed. For example:
- different columns in tracker csv (main.py -> l. 21)
- different base folders to be checked (main.py loops)
- different effects if a file is spotted in the base folder (main.py inside loops, e.g. l.47)
- different warnings (main.py inside loops, e.g. l.48-49)
- different checks for transcript validity (Transcript.py -> check_file())
- different formatting improvements (Transcript.py -> format_file())
This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/) and is open source. Credit has to be given, but code may be freely adapted.