File Checker Script

Designed for the OL2 data from the VARIAGE project

Cite as

APA 7: Steiner, Linda. (2026). Variage File Checker (Version 1) [Computer software]. GitHub. https://linste-zh.github.io/VariageFileChecker

This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Feedback and Error Reports

Please address your feedback or error reports to linste.zh@gmail.com

Background

This code was written as part of the Variage Project (https://www.variage.ch/). The Variage Project received support from the VELUX Stiftung (https://veluxstiftung.ch/) and the UZH foundation (https://www.uzhfoundation.ch/en/).

The code was written by Linda Steiner, a student research assistant to Professor Simone Pfenninger at the English Department of the University of Zurich.

Purpose

Input

The script is called via terminal and takes as input the name of a file directory.

This directory is expected to contain the corrected transcription files (.txt) in /2_corrected_transcripts, and their original copies (.txt) together with the audio files (.mp3 and .wav) in /3_finished_correcting. Files that are excluded from correction are placed in /4_ignore.

All files are expected to follow the naming convention: "TimeframeID_ParticipantID_ExerciseName.txt".
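As an illustration of the naming convention, a file name could be split into its parts as in the minimal sketch below. The names `ParsedName` and `parse_file_name` are hypothetical and are not taken from the repository; the actual script may parse names differently.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Hypothetical container for the parts of a file name."""
    timeframe: str
    participant: str
    exercise: str
    extension: str


def parse_file_name(file_name: str) -> Optional[ParsedName]:
    """Split e.g. 'T08_01_OL2.txt' into its parts.

    Returns None if the name does not consist of exactly three
    non-empty parts separated by underscores.
    """
    path = Path(file_name)
    parts = path.stem.split("_")
    if len(parts) != 3 or not all(parts):
        return None
    timeframe, participant, exercise = parts
    return ParsedName(timeframe, participant, exercise, path.suffix.lower())
```

With this sketch, "T08_01_OL2.txt" yields timeframe "T08", participant "01" and exercise "OL2", while a name such as "xxx.csv" does not match the convention and returns None.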

Output

The script browses the files in the directory and compares them against a list of expected participants, provided in the file "ids.txt". It takes the following actions:

if a file name contains an expected participant ID, the file's existence is marked in a data frame for that participant by its type (.txt/.wav/.mp3) and by its folder (finished/corrected/ignored)
if a file name contains an unexpected participant ID, a new row is generated with a warning in a designated column
the generated data frame with the tracking information is saved as a CSV file in the provided directory
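The tracking logic described in the list above can be sketched roughly as follows. This is not the code from Tracker.py: it uses plain dictionaries and the standard csv module instead of a data frame, and the function names as well as the column names (taken from the example tracker further down) are assumptions.

```python
import csv

# Column names as shown in the example tracker; assumed, not read from main.py.
COLUMNS = ["p_id", "txt", "mp3", "wav", "other", "warnings"]


def _empty_row(p_id):
    row = {column: "" for column in COLUMNS}
    row["p_id"] = p_id
    return row


def _append(row, column, value):
    """Add a value to a cell, separating multiple entries with '; '."""
    row[column] = f"{row[column]}; {value}" if row[column] else value


def build_tracker(expected_ids):
    """Create one empty row per expected participant ID."""
    return {p_id: _empty_row(p_id) for p_id in expected_ids}


def mark_file(tracker, p_id, extension, folder):
    """Record that a file of the given type was found in the given folder."""
    if p_id not in tracker:
        # Unexpected participant: create a new row and flag it.
        tracker[p_id] = _empty_row(p_id)
        _append(tracker[p_id], "warnings", "unknown ID")
    known = (".txt", ".mp3", ".wav")
    column = extension.lstrip(".") if extension in known else "other"
    _append(tracker[p_id], column, folder)


def write_tracker(tracker, handle):
    """Write the tracker rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(tracker.values())
```

When writing to a real file, the handle would typically be opened with newline="" and encoding="utf-8", as the csv module documentation recommends.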

In addition, if a file is supposed to be a corrected transcript, the code:

checks whether a minimum length is reached; otherwise it raises a warning in the designated warning column
checks whether the transcript contains a minimum number of "ehm"s and dashes, which indicates that the file has actually been corrected; otherwise it raises a warning in the designated warning column
fixes common mistakes in the annotation scheme of the data
ensures the file is formatted according to expectations (i.e. initial line contains participant ID and TF, the remaining transcript is all in the following line with no line breaks)
saves the formatted and checked output as a guaranteed UTF-8 encoded text file in the folder /1_corrected_transcripts_formatted
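The length and correction checks from the list above might look like the following sketch. The thresholds are invented for illustration, and `check_transcript` is a hypothetical name; the actual rules live in Transcript.py (check_file()) and may differ. The early return for very short files is also an assumption, chosen because the example tracker below shows only "very short" for an empty file.

```python
import re

# Hypothetical thresholds; the real values are defined in the project code.
MIN_WORDS = 20
MIN_EHMS = 2
MIN_DASHES = 2


def check_transcript(text):
    """Return a list of warnings for a supposedly corrected transcript."""
    if len(text.split()) < MIN_WORDS:
        # Assumption: further checks are skipped for very short files.
        return ["very short"]
    warnings = []
    ehm_count = len(re.findall(r"\behm\b", text, flags=re.IGNORECASE))
    if ehm_count < MIN_EHMS:
        warnings.append("few or no ehms")
    if text.count("-") < MIN_DASHES:
        warnings.append("few or no dashes")
    return warnings
```

The returned warnings could then be joined with "; " and written to the warning column of the tracker.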

This allows the researchers to 1. find any potentially missing or corrupted files and 2. increase the quality and reliability of the final transcript file.
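The formatting and re-encoding step could be sketched as below. The fallback encodings, the exact header text and all function names are assumptions made for illustration; the project's own formatting rules are implemented in Transcript.py (format_file()).

```python
from pathlib import Path


def decode_bytes(raw: bytes) -> str:
    """Decode file content, trying UTF-8 first, then legacy encodings."""
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails.
    return raw.decode("latin-1")


def format_transcript(text: str, header: str) -> str:
    """Return the header line followed by the transcript on one single line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and lines[0] == header:
        # Avoid duplicating a header that is already present.
        lines = lines[1:]
    body = " ".join(lines)
    return f"{header}\n{body}\n"


def save_formatted(source: Path, target_dir: Path, header: str) -> Path:
    """Write a formatted, UTF-8 encoded copy of source into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    formatted = format_transcript(decode_bytes(source.read_bytes()), header)
    target.write_text(formatted, encoding="utf-8")
    return target
```

Trying UTF-8 before the legacy encodings matters because almost any byte sequence decodes without error as cp1252 or latin-1, so those can only serve as fallbacks.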

Example

Input

Input directory: "OL2/Time 08"
Expected Participants: 01, 02, 03, 04, 05, 06, 07, 08, 09
Input directory structure:

└── Time 08
- └── 2_corrected_transcripts
  - T08_01_OL2.txt
  - T08_02_OL2.txt (uncorrected file uploaded)
  - T08_03_OL2.txt (file contains common annotation and formatting mistakes)
  - T08_04_OL2.txt (file not UTF-8 encoded)
  - T08_06_OL2.txt (file empty)
  - T08_70_OL2.txt
- └── 3_finished_correcting
  - T08_01_OL2.txt
  - T08_01_OL2.mp3
  - T08_01_OL2.wav
  - T08_02_OL2.mp3 (missing original txt file)
  - T08_02_OL2.wav
  - T08_03_OL2.txt
  - T08_03_OL2.mp3 (missing wav file)
  - T08_04_OL2.txt
  - T08_04_OL2.mp3
  - T08_04_OL2.wav
  - T08_05_OL2.txt
  - T08_05_OL2.mp3
  - T08_05_OL2.wav
  - T08_06_OL2.wav (missing original txt file and mp3 file)
  - T08_07_OL2.txt
  - T08_07_OL2.mp3
  - T08_07_OL2.wav
- └── 4_ignore
  - T08_09_OL2.txt
  - T08_09_OL2.mp3
  - T08_09_OL2.wav
- T08_08_OL2.txt
- T08_08_OL2.mp3
- T08_08_OL2.wav
- xxx.csv

Console Command (Windows):

C:\Users\YOURPATH\Variage File Checker> 
  python main.py "YOURPATH\Time 08"

Output

Input directory structure after execution:

└── Time 08
- └── 1_corrected_transcripts_formatted
  - T08_01_OL2.txt
  - T08_02_OL2.txt
  - T08_03_OL2.txt (file no longer contains common annotation and formatting mistakes)
  - T08_04_OL2.txt (file now UTF-8 encoded)
  - T08_06_OL2.txt
  - T08_70_OL2.txt
  - T08_08_OL2.txt
- └── 2_corrected_transcripts (unchanged)
  - ...
- └── 3_finished_correcting (unchanged)
  - ...
- └── 4_ignore (unchanged)
  - ...

Tracker file (.csv):


|p_id| txt                | mp3     | wav     | other | warnings                        |  
|____|____________________|_________|_________|_______|_________________________________|  
|01  |finished; corrected |finished |finished |       |                                 |    
|02  |corrected           |finished |finished |       |few or no ehms; few or no dashes |  
|03  |finished; corrected |finished |         |       |                                 |  
|04  |finished; corrected |finished |finished |       |                                 |  
|05  |finished            |finished |finished |       |                                 |  
|06  |corrected           |         |finished |       |very short                       |  
|07  |finished            |finished |finished |       |                                 |  
|08  |open                |open     |open     |       |                                 |  
|09  |ignored             |ignored  |ignored  |       |                                 |  
|70  |corrected           |         |         |       |unknown ID                       |  
|xxx |                    |         |         |open   |unknown ID                       |

The code directly resolved the issues with participants 03 (annotation and formatting mistakes) and 04 (encoding). In addition, it reveals that:

participants 01 and 04: seem to be okay
participant 02: the file does not seem to have been corrected, and the original file is missing
participant 03: the .wav file is missing
participant 05: no corrected file was uploaded
participant 06: the original file is missing, as is the mp3 file. In addition, the corrected file seems to be too short
participant 07: there is no file matching the ID in corrected, but there is an unknown ID 70, so this is probably a typo
participant 08: the files are still open, i.e. nothing has been done yet
participant 09: is ignored, as intended
there is another file "xxx" in the base folder

Thus, the remaining issues can be identified and resolved without manually checking multiple folders, and the tracker also surfaces issues that are not visible in the directory itself.
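To triage a larger tracker file, the CSV can also be filtered programmatically. The sketch below is not part of the repository; it assumes the column names from the example tracker above, and `rows_needing_attention` is a hypothetical helper.

```python
import csv


def rows_needing_attention(handle):
    """Return (p_id, missing file types, warnings) for every flagged row.

    A row is flagged if any of the txt/mp3/wav cells is empty
    or if the warnings cell is not empty.
    """
    flagged = []
    for row in csv.DictReader(handle):
        missing = [
            column
            for column in ("txt", "mp3", "wav")
            if not (row.get(column) or "").strip()
        ]
        warnings = (row.get("warnings") or "").strip()
        if missing or warnings:
            flagged.append((row["p_id"], missing, warnings))
    return flagged
```

This only catches empty cells and warnings. Cases such as participant 05, where the original exists but no corrected file was uploaded, would need an additional check on the content of the txt column.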

Adapting for your own use

While this code is currently highly specific to the Variage OL2 task, it can be adapted to fit other project schemes as well. The Tracker.py and File.py classes are written in a very general way, so class instances can be adjusted. For example:

different columns in tracker csv (main.py -> l. 21)
different base folders to be checked (main.py loops)
different effects if a file is spotted in the base folder (main.py inside loops, e.g. l.47)
different warnings (main.py inside loops, e.g. l.48-49)
different checks for transcript validity (Transcript.py -> check_file())
different formatting improvements (Transcript.py -> format_file())

This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/) and is open source. Credit must be given, but the code may be freely adapted.
