Dataset
The original dataset provided by the LOD team from the ZLS comprises two files:
- a zipped archive `240301_LOD_Beispiller_audio.zip` with 3,908 female and 39,836 male audio records
- an Excel file `240301_LOD_Beispiller-mat-Audio.xlsx` with 43,745 rows and 3 columns
The audio files are sampled at 96 kHz or 44.1 kHz and encoded in the MPEG-4 audio format (extension .m4a).
The Excel file is structured as follows:
Text | Voice | ID |
---|---|---|
De Stiermer stoung eendeiteg am Abseits. | F | 0000e52f851b4306b0c4243fcfa5d9ef |
An der Vakanz hunn ech d'Klappluede frësch ugestrach. | F | 00010b2fe6c842112233440c752c049f |
Kanns de mer Desinfektiounsmëttel, Seef a Shampoing aus der Drogerie matbréngen? | M | 1ef230b475719e727c18be902b652513 |
An de pittoreske Gaasse vun der Alstad hale sech vill Touristen op. | M | 1ef2a7910316b4fe6964ce3548ec16c1 |
De Bauer ass amgaang, Mëscht ze spreeden. | M | 1ef3320cb679b41a61df29e6ad3b4aa8 |
Du muss de Fanger weisen, wann s de eppes wëlls soen! | M | 1ef579ad0f3f86cbad3c8d96a9b0d432 |
The third column (ID) contains the name of the related audio file.
To verify the consistency of the dataset, I did a quick check with a Python script written by ChatGPT-4o.
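A minimal sketch of such a consistency check, assuming the column names from the table above and that the archive has been unzipped into a folder of the same name (the folder name is an assumption):

```python
import os
import pandas as pd

# Read the transcription list; pandas needs openpyxl for .xlsx files
df = pd.read_excel("240301_LOD_Beispiller-mat-Audio.xlsx")

print(len(df), "rows")
print(df["Voice"].value_counts())  # expect 39,836 M and 3,908 F

# Every ID should match an audio file in the unzipped archive
audio_dir = "240301_LOD_Beispiller_audio"  # assumed name of the unzipped folder
missing = [i for i in df["ID"] if not os.path.isfile(os.path.join(audio_dir, f"{i}.m4a"))]
print(len(missing), "transcriptions without a matching audio file")
```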
To avoid memory overflows, the audio files must be resampled to at most 22.05 kHz and saved in WAVE format. To reduce the training duration, it's important to limit the character set of the TTS model: convert the capital letters to lowercase and specify in the training script only the characters used in the dataset (capital letters are not required for TTS). The transcription-ID list must be in CSV format with the pipe character as delimiter. Here is an extract of the required metadata.csv file:
```
0000e52f851b4306b0c4243fcfa5d9ef|de stiermer stoung eendeiteg am abseits.
00010b2fe6c842112233440c752c049f|an der vakanz hunn ech d'klappluede frësch ugestrach.
0006485c685acec081569083717a6b0a|meng bomi huet fréier alt emol en zigarillo gefëmmt.
000abe4693b1f5ea390ce3d0384d8ec6|nomade wandere mat hirem véi dohin, wou et eppes ze friesse fënnt.
000acd1203fe149631ce6ec64e35a414|de staat encouragéiert jonk leit, hir eegen entreprise ze grënnen.
000d72418e2cd06998d1fa8b34c4b47a|wäsch d'wonn uerdentlech aus, ier s de se desinfizéiers!
```
Note that the text must be encoded in UTF-8.
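To find out exactly which characters must be declared in the training script, a few lines like these can scan metadata.csv:

```python
# Collect every distinct character appearing in the transcriptions,
# so the training script can declare exactly this character set
chars = set()
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        _, text = line.rstrip("\n").split("|", 1)
        chars.update(text)

print("".join(sorted(chars)))
```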
There are several ways to convert the audio files from the M4A format to the WAV format. In the past, I used a Python script on my Linux Ubuntu desktop to execute the conversion. This time, on Windows 11, it's easier to use a standard Windows tool, for example the free audio converter fre:ac. The conversion process takes a while.
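For reference, a conversion along these lines can also be sketched in Python with pydub, which delegates the decoding to a local ffmpeg installation (the folder names are placeholders):

```python
import os
from pydub import AudioSegment  # requires ffmpeg on the PATH

src_dir, dst_dir = "m4a", "wavsin"  # placeholder folder names
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.endswith(".m4a"):
        audio = AudioSegment.from_file(os.path.join(src_dir, name), format="m4a")
        audio.export(os.path.join(dst_dir, name[:-4] + ".wav"), format="wav")
```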
To change the sample rate of the audio files from 96 kHz or 44.1 kHz to the required 22.05 kHz, I used the Python resampling script from Coqui-TTS:
```
python TTS/bin/resample.py --input_dir E:/TTS-for-LOD/TTS/dataset/wavsin/ --output_dir E:/TTS-for-LOD/TTS/dataset/wavs/ --output_sr 22050
```
I zipped the resulting resampled audio files into an archive wavs.zip and uploaded the data into the dataset folder of my HuggingFace account, because files in a GitHub repository cannot exceed 25 MB and this archive has 4.61 GB.
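The upload itself can also be scripted with the huggingface_hub client; a sketch, assuming a dataset repository already exists (the repo id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login`
api.upload_file(
    path_or_fileobj="wavs.zip",
    path_in_repo="dataset/wavs.zip",
    repo_id="your-account/your-dataset",  # placeholder repo id
    repo_type="dataset",
)
```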
To convert the transcription file into the required format, you only need standard Excel skills: delete the rows with female voices, delete the Voice column, change the text to lowercase, select the pipe delimiter in the menu File -> Options -> Advanced, check that the text encoding is UTF-8 and export the Excel sheet as metadata.csv. I uploaded the file into the same HuggingFace folder as the audio archive.
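For those who prefer scripting, the same steps can be sketched with pandas, assuming the column names from the table above:

```python
import pandas as pd

df = pd.read_excel("240301_LOD_Beispiller-mat-Audio.xlsx")
df = df[df["Voice"] == "M"]  # keep only the male voice

# Write one "ID|text" line per sample, lowercased, in UTF-8
with open("metadata.csv", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        f.write(f"{row['ID']}|{row['Text'].lower()}\n")
```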
During the first training run, I discovered that the loss was not decreasing as expected. I was afraid that the dataset contained some irregularities. An automatic check of the quality of the audio files didn't reveal any abnormality, so I created a Python script to listen to the audio files manually. The program play-wavs.py
is included in the script folder of the present repository. I discovered that among the first 500 male audio files, 32 were female voices. By extrapolation, I estimated that about 2,560 files in the male dataset are female. I tried several scripts to identify the files with female voices, based on pitch or other speech parameters, without success. Finally, I found a pretrained AI model on HuggingFace that does the job. I assembled my findings in the Wiki page Voice Gender Classifier.
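The details are on that Wiki page; in principle the classification comes down to a few lines with the transformers library (the model id below is only a placeholder; see the Wiki page for the model actually used):

```python
from transformers import pipeline

# Placeholder model id, not the actual pretrained model
classifier = pipeline("audio-classification", model="some-org/voice-gender-classifier")

scores = classifier("dataset/wavs/0000e52f851b4306b0c4243fcfa5d9ef.wav")
print(scores)  # e.g. [{'label': 'female', 'score': 0.99}, ...]
```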
Nevertheless, I continued the training of the male voice for 5 days, up to the threshold of 50,000 steps. I then cleaned the dataset with a Python script using the gender classifier AI model before executing a final training to obtain the expected quality. In the end, there were fewer female samples than estimated, about 800 in total. For the final training with the male voice, I used only 32,000 samples to optimize the steps during the process: 1% of 32,000 gives 320 evaluation samples, and all numbers are multiples of the batch sizes 32 and 16.
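A sketch of such a cleaning pass, reusing the classifier from the sketch above (again with a placeholder model id; the label names depend on the chosen model):

```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="some-org/voice-gender-classifier")

kept = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        file_id = line.split("|", 1)[0]
        top = classifier(f"dataset/wavs/{file_id}.wav")[0]  # best-scoring label
        if top["label"] == "male":  # label name depends on the model
            kept.append(line)

# 32,000 training samples -> 320 evaluation samples at 1%,
# both multiples of the batch sizes 32 and 16
with open("metadata_clean.csv", "w", encoding="utf-8") as f:
    f.writelines(kept[:32000])
```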