linste-zh/VariageFileChecker

File Checker Script

Designed for the OL2 data from the VARIAGE project.


Cite as

cite as (APA 7): Steiner, Linda. (2026). Variage File Checker (Version 1) [Computer software]. GitHub. https://linste-zh.github.io/VariageFileChecker

This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Feedback and Error Reports

Please address your feedback or error reports to linste.zh@gmail.com


Background

This code was written as part of the Variage Project (https://www.variage.ch/). The Variage Project received support from the VELUX Stiftung (https://veluxstiftung.ch/) and the UZH foundation (https://www.uzhfoundation.ch/en/).

The code was written by Linda Steiner, a student research assistant to Professor Simone Pfenninger at the English Department of the University of Zurich.


Purpose

Input

The script is called via terminal and takes as input the name of a file directory.
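As an illustration of this calling convention, here is a minimal sketch of how a script could read the directory from the command line. It is not taken from main.py; the function names `parse_args` and `resolve_directory` are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the directory to check from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Check a timeframe directory.")
    parser.add_argument("directory", type=Path, help='e.g. "OL2/Time 08"')
    return parser.parse_args(argv)


def resolve_directory(argv=None):
    """Return the directory as a Path, or exit with a message if it does not exist."""
    args = parse_args(argv)
    if not args.directory.is_dir():
        raise SystemExit(f"Not a directory: {args.directory}")
    return args.directory
```

Quoting the path on the command line, as in the example further down, keeps a directory name containing a space (such as "Time 08") together as a single argument.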

This directory is expected to contain the corrected transcription files (.txt) in /2_corrected_transcripts, and their original copies (.txt) together with the audio files (.mp3 and .wav) in /3_finished_correcting. In addition, some files may be excluded from correction (/4_ignore).

All files are expected to follow the naming convention: "TimeframeID_ParticipantID_ExerciseName.txt".
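The naming convention can be split into its parts with a few lines of Python. This is only a sketch: `ParsedName` and `parse_file_name` are hypothetical names, and returning `None` for non-conforming names is an assumption made for the example, not necessarily how the script itself treats such files.

```python
from pathlib import PurePath
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    timeframe: str
    participant: str
    exercise: str
    extension: str


def parse_file_name(file_name: str) -> Optional[ParsedName]:
    """Split 'TimeframeID_ParticipantID_ExerciseName.ext' into its parts."""
    path = PurePath(file_name)
    # Split on the first two underscores only, so the exercise name may contain more.
    parts = path.stem.split("_", 2)
    if len(parts) != 3 or not all(parts):
        return None  # name does not follow the convention
    timeframe, participant, exercise = parts
    return ParsedName(timeframe, participant, exercise, path.suffix.lower())
```

For example, `parse_file_name("T08_01_OL2.txt")` yields the timeframe `T08`, the participant `01`, the exercise `OL2` and the extension `.txt`.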

Output

The script browses the files in the directory and compares them against a list of expected participants, provided in the file "ids.txt". It takes the following actions:

  • if a file name contains an expected participant ID, the file's existence is marked in a data frame for that participant by its type (.txt/.wav/.mp3) and by its folder (finished/corrected/ignored)
  • if a file name contains an unexpected participant ID, a new row is generated with a warning in a designated column
  • once all files have been processed, the generated data frame with the tracking information is saved as a CSV file in the provided directory
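The tracking logic above can be sketched as follows. The script itself works with a data frame; to stay dependency-free, this sketch swaps in plain dictionaries and the standard `csv` module. The column names follow the example tracker further down, while `new_row`, `record_file` and `to_csv` are hypothetical helpers.

```python
import csv
import io

COLUMNS = ["p_id", "txt", "mp3", "wav", "other", "warnings"]


def new_row(p_id, warning=""):
    """Create an empty tracker row for one participant."""
    row = {column: "" for column in COLUMNS}
    row["p_id"] = p_id
    row["warnings"] = warning
    return row


def record_file(tracker, p_id, extension, folder_label):
    """Mark one file in the tracker; unexpected IDs get a new, flagged row."""
    if p_id not in tracker:
        tracker[p_id] = new_row(p_id, warning="unknown ID")
    column = extension if extension in ("txt", "mp3", "wav") else "other"
    row = tracker[p_id]
    # Append the folder label to whatever was already recorded in that cell.
    row[column] = "; ".join(filter(None, [row[column], folder_label]))


def to_csv(tracker):
    """Serialise the tracker to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(tracker.values())
    return buffer.getvalue()


# Expected participants would come from ids.txt; two are hard-coded here.
tracker = {p_id: new_row(p_id) for p_id in ["01", "02"]}
record_file(tracker, "01", "txt", "finished")
record_file(tracker, "01", "txt", "corrected")
record_file(tracker, "70", "txt", "corrected")  # not in the expected list
print(to_csv(tracker))
```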

In addition, if a file is supposed to be a corrected transcript, the code:

  • checks whether a minimal length is reached - else it raises a warning in the designated warning column
  • checks whether there is a minimal amount of "ehm"s and dashes in the transcript, which indicates the file has actually been corrected - else it raises a warning in the designated warning column
  • fixes common mistakes in the annotation scheme of the data
  • ensures the file is formatted according to expectations (i.e. initial line contains participant ID and TF, the remaining transcript is all in the following line with no line breaks)
  • saves the formatted and checked output as a text file that is guaranteed to be UTF-8 encoded, in the folder /1_corrected_transcripts_formatted
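The two validity checks could look roughly like the sketch below. The thresholds are placeholders, since the values used by the script are not stated here, and `check_transcript` is a hypothetical function rather than the `check_file()` method in Transcript.py. Returning only "very short" for a too-short file is an assumption based on the example tracker further down, where the empty file carries just that warning.

```python
import re

# Placeholder thresholds (assumptions); adjust them to the data at hand.
MIN_WORDS = 20
MIN_EHMS = 1
MIN_DASHES = 1


def check_transcript(text, min_words=MIN_WORDS, min_ehms=MIN_EHMS, min_dashes=MIN_DASHES):
    """Return a list of warnings for a supposedly corrected transcript."""
    if len(text.split()) < min_words:
        # Assumed behaviour: a too-short file gets no further checks.
        return ["very short"]
    warnings = []
    if len(re.findall(r"\behm\b", text, flags=re.IGNORECASE)) < min_ehms:
        warnings.append("few or no ehms")
    if text.count("-") < min_dashes:
        warnings.append("few or no dashes")
    return warnings
```

The returned list can then be joined with "; " and written to the warning column of the tracker.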

Together, these steps allow the researchers to (1) find any potentially missing or corrupted files and (2) increase the quality and reliability of the final transcript files.
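The formatting and re-encoding step from the list above might be sketched like this. Several things are assumptions: the fallback encoding (Windows-1252), the exact header text, and the premise that the text passed in does not already contain a header line. The function names are hypothetical and do not correspond to `format_file()` in Transcript.py.

```python
from pathlib import Path


def read_text_lenient(path):
    """Decode as UTF-8 if possible, else fall back to Windows-1252 (assumed fallback)."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8-sig")  # also strips a byte order mark, if present
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def format_transcript(text, header):
    """Put the header on line one and the whole transcript on line two."""
    body = " ".join(text.split())  # collapses line breaks and repeated spaces
    return f"{header}\n{body}\n"


def save_formatted(source, target_dir, header):
    """Write the formatted transcript as UTF-8 under the same file name."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source).name
    with open(target, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(format_transcript(read_text_lenient(source), header))
    return target
```

Writing to a separate folder leaves the files in /2_corrected_transcripts untouched, which matches the "(unchanged)" note in the example below.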


Example

Input

Input directory: "OL2/Time 08"
Expected Participants: 01, 02, 03, 04, 05, 06, 07, 08, 09
Input directory structure:

  • └── Time 08
    • └── 2_corrected_transcripts
      • T08_01_OL2.txt
      • T08_02_OL2.txt (uncorrected file uploaded)
      • T08_03_OL2.txt (file contains common annotation and formatting mistakes)
      • T08_04_OL2.txt (file not UTF-8 encoded)
      • T08_06_OL2.txt (file empty)
      • T08_70_OL2.txt
    • └── 3_finished_correcting
      • T08_01_OL2.txt
      • T08_01_OL2.mp3
      • T08_01_OL2.wav
      • T08_02_OL2.mp3 (missing original txt file)
      • T08_02_OL2.wav
      • T08_03_OL2.txt
      • T08_03_OL2.mp3 (missing wav file)
      • T08_04_OL2.txt
      • T08_04_OL2.mp3
      • T08_04_OL2.wav
      • T08_05_OL2.txt
      • T08_05_OL2.mp3
      • T08_05_OL2.wav
      • T08_06_OL2.wav (missing original txt file and mp3 file)
      • T08_07_OL2.txt
      • T08_07_OL2.mp3
      • T08_07_OL2.wav
    • └── 4_ignore
      • T08_09_OL2.txt
      • T08_09_OL2.mp3
      • T08_09_OL2.wav
    • T08_08_OL2.txt
    • T08_08_OL2.mp3
    • T08_08_OL2.wav
    • xxx.csv

Console Command (Windows):

C:\Users\YOURPATH\Variage File Checker> 
  python main.py "YOURPATH\Time 08"

Output

Input directory structure after execution:

  • └── Time 08
    • └── 1_corrected_transcripts_formatted
      • T08_01_OL2.txt
      • T08_02_OL2.txt
      • T08_03_OL2.txt (file no longer contains common annotation and formatting mistakes)
      • T08_04_OL2.txt (file now UTF-8 encoded)
      • T08_06_OL2.txt
      • T08_70_OL2.txt
      • T08_08_OL2.txt
    • └── 2_corrected_transcripts (unchanged)
      • ...
    • └── 3_finished_correcting (unchanged)
      • ...
    • └── 4_ignore (unchanged)
      • ...

Tracker file (.csv):


|p_id| txt                | mp3     | wav     | other | warnings                        |  
|____|____________________|_________|_________|_______|_________________________________|  
|01  |finished; corrected |finished |finished |       |                                 |    
|02  |corrected           |finished |finished |       |few or no ehms; few or no dashes |  
|03  |finished; corrected |finished |         |       |                                 |  
|04  |finished; corrected |finished |finished |       |                                 |  
|05  |finished            |finished |finished |       |                                 |  
|06  |corrected           |         |finished |       |very short                       |  
|07  |finished            |finished |finished |       |                                 |  
|08  |open                |open     |open     |       |                                 |  
|09  |ignored             |ignored  |ignored  |       |                                 |  
|70  |corrected           |         |         |       |unknown ID                       |  
|xxx |                    |         |         |open   |unknown ID                       |  

The code directly resolved the issues with participants 03 (annotation and formatting mistakes) and 04 (file encoding). In addition, the tracker reveals the following:

  • participants 01 and 04: seem to be okay
  • participant 02: the file does not seem to have been corrected, and the original .txt file is missing
  • participant 03: the .wav file is missing
  • participant 05: no corrected file was uploaded
  • participant 06: the original file is missing, as is the mp3 file. In addition, the corrected file seems to be too short
  • participant 07: there is no file matching the ID in corrected, but there is an unknown ID 70, so this is probably a typo
  • participant 08: the files are still open, i.e. they have not been worked on at all
  • participant 09: is ignored, which is good
  • there is another file "xxx" in the base folder

Thus, the remaining issues can be identified and resolved without having to check multiple folders manually, and the tracker also surfaces issues that are not visible in the directory itself.
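Reading the tracker can also be automated. The sketch below assumes the tracker is a comma-separated file whose header matches the columns shown in the table above; the rules for what counts as "needing attention" are illustrative choices, and `rows_needing_attention` is a hypothetical helper.

```python
import csv
import io


def rows_needing_attention(csv_text):
    """Map participant IDs to the reasons their row should be looked at."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        reasons = []
        if row["warnings"].strip():
            reasons.append(row["warnings"].strip())
        for column in ("txt", "mp3", "wav"):
            if not row[column].strip():
                reasons.append(f"no {column} file")
        txt = row["txt"]
        # A transcript that is neither corrected nor ignored still needs work.
        if txt.strip() and "corrected" not in txt and "ignored" not in txt:
            reasons.append("no corrected transcript")
        if reasons:
            flagged[row["p_id"]] = reasons
    return flagged


sample = """p_id,txt,mp3,wav,other,warnings
01,finished; corrected,finished,finished,,
03,finished; corrected,finished,,,
05,finished,finished,finished,,
09,ignored,ignored,ignored,,
70,corrected,,,,unknown ID
"""
for p_id, reasons in rows_needing_attention(sample).items():
    print(p_id, "->", "; ".join(reasons))
```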


Adapting for your own use

While this code is currently highly specific to the Variage OL2 task, it can be adapted to fit other project schemes as well. The Tracker.py and File.py classes are written in a very general way, so that class instances can be adjusted. For example:

  • different columns in tracker csv (main.py -> l. 21)
  • different base folders to be checked (main.py loops)
  • different effects if a file is spotted in the base folder (main.py inside loops, e.g. l.47)
  • different warnings (main.py inside loops, e.g. l.48-49)
  • different checks for transcript validity (Transcript.py -> check_file())
  • different formatting improvements (Transcript.py -> format_file())

This work is licensed under a CC-BY 4.0 International License (https://creativecommons.org/licenses/by/4.0/) and is open source. Credit has to be given, but code may be freely adapted.

About

This code was written as part of the Variage Project (https://www.variage.ch/). It is used to verify the existence and validity of file structures in a multi-level directory.
