Dataset
The original dataset provided by the LOD team from the ZLS comprises two files:
- a zipped archive `240301_LOD_Beispiller_audio.zip` with 3,908 female and 39,836 male audio records
- an Excel file `240301_LOD_Beispiller-mat-Audio.xlsx` with 43,745 rows and 3 columns
The audio files are sampled at 96 kHz or 44.1 kHz and encoded in the MPEG-4 audio format (extension .m4a).
The Excel file is structured as follows:
Text | Voice | ID |
---|---|---|
De Stiermer stoung eendeiteg am Abseits. | F | 0000e52f851b4306b0c4243fcfa5d9ef |
An der Vakanz hunn ech d'Klappluede frësch ugestrach. | F | 00010b2fe6c842112233440c752c049f |
Kanns de mer Desinfektiounsmëttel, Seef a Shampoing aus der Drogerie matbréngen? | M | 1ef230b475719e727c18be902b652513 |
An de pittoreske Gaasse vun der Alstad hale sech vill Touristen op. | M | 1ef2a7910316b4fe6964ce3548ec16c1 |
De Bauer ass amgaang, Mëscht ze spreeden. | M | 1ef3320cb679b41a61df29e6ad3b4aa8 |
Du muss de Fanger weisen, wann s de eppes wëlls soen! | M | 1ef579ad0f3f86cbad3c8d96a9b0d432 |
The third column (ID) contains the name of the related audio file.
To verify the consistency of the dataset, I did a quick check with a Python script written by ChatGPT-4o.
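A minimal sketch of such a consistency check, assuming the column names from the table above and that the archive has been unzipped into a folder of the same name (the folder name is an assumption):

```python
import os
import pandas as pd

# Read the transcription list; pandas needs openpyxl for .xlsx files
df = pd.read_excel("240301_LOD_Beispiller-mat-Audio.xlsx")

print(len(df), "rows")
print(df["Voice"].value_counts())  # expect 39,836 M and 3,908 F

# Every ID should match an audio file in the unzipped archive
audio_dir = "240301_LOD_Beispiller_audio"  # assumed name of the unzipped folder
missing = [i for i in df["ID"] if not os.path.isfile(os.path.join(audio_dir, f"{i}.m4a"))]
print(len(missing), "transcriptions without a matching audio file")
```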
To avoid memory overflows, the audio files must be resampled to at most 22.05 kHz and saved in WAVE format. To reduce the training duration, it's important to limit the character set of the TTS model: convert the capital letters to lowercase and specify in the training script only the characters used in the dataset (capital letters are not required for TTS). The transcription-ID list must be in CSV format with the pipe character as delimiter. Here is an extract of the required metadata.csv file:
```
0000e52f851b4306b0c4243fcfa5d9ef|de stiermer stoung eendeiteg am abseits.
00010b2fe6c842112233440c752c049f|an der vakanz hunn ech d'klappluede frësch ugestrach.
0006485c685acec081569083717a6b0a|meng bomi huet fréier alt emol en zigarillo gefëmmt.
000abe4693b1f5ea390ce3d0384d8ec6|nomade wandere mat hirem véi dohin, wou et eppes ze friesse fënnt.
000acd1203fe149631ce6ec64e35a414|de staat encouragéiert jonk leit, hir eegen entreprise ze grënnen.
000d72418e2cd06998d1fa8b34c4b47a|wäsch d'wonn uerdentlech aus, ier s de se desinfizéiers!
```
Note that the text must be encoded in UTF-8.
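To find out exactly which characters must be declared in the training script, a few lines like these can scan metadata.csv:

```python
# Collect every distinct character appearing in the transcriptions,
# so the training script can declare exactly this character set
chars = set()
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        _, text = line.rstrip("\n").split("|", 1)
        chars.update(text)

print("".join(sorted(chars)))
```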
There are several ways to convert the audio files from the M4A format to the WAV format. In the past, I used a Python script on my Linux Ubuntu desktop to execute the conversion. This time, on Windows 11, it's easier to use a standard Windows tool, for example the free audio converter fre:ac. The conversion process takes a while.
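For reference, a conversion along these lines can also be sketched in Python with pydub, which delegates the decoding to a local ffmpeg installation (the folder names are placeholders):

```python
import os
from pydub import AudioSegment  # requires ffmpeg on the PATH

src_dir, dst_dir = "m4a", "wavsin"  # placeholder folder names
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.endswith(".m4a"):
        audio = AudioSegment.from_file(os.path.join(src_dir, name), format="m4a")
        audio.export(os.path.join(dst_dir, name[:-4] + ".wav"), format="wav")
```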
To change the sample rate of the audio files from 96 kHz or 44.1 kHz to the required 22.05 kHz, I used the Python resampling script from Coqui-TTS:
```
python TTS/bin/resample.py --input_dir E:/TTS-for-LOD/TTS/dataset/wavsin/ --output_dir E:/TTS-for-LOD/TTS/dataset/wavs/ --output_sr 22050
```
I zipped the resulting resampled audio files into an archive wavs.zip and uploaded the data into the dataset folder of my HuggingFace account, because files in a GitHub repository cannot exceed 25 MB and this archive has 4.61 GB.
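The upload itself can also be scripted with the huggingface_hub client; a sketch, assuming a dataset repository already exists (the repo id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login`
api.upload_file(
    path_or_fileobj="wavs.zip",
    path_in_repo="dataset/wavs.zip",
    repo_id="your-account/your-dataset",  # placeholder repo id
    repo_type="dataset",
)
```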
To convert the transcription file into the required format, you only need standard Excel skills: delete the rows with female voices, delete the Voice column, change the text to lowercase, select the pipe delimiter in the menu File -> Options -> Advanced, check that the text encoding is UTF-8 and export the Excel sheet as metadata.csv. I uploaded the file into the same HuggingFace folder as the audio archive.
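For those who prefer scripting, the same steps can be sketched with pandas, assuming the column names from the table above:

```python
import pandas as pd

df = pd.read_excel("240301_LOD_Beispiller-mat-Audio.xlsx")
df = df[df["Voice"] == "M"]  # keep only the male voice

# Write one "ID|text" line per sample, lowercased, in UTF-8
with open("metadata.csv", "w", encoding="utf-8") as f:
    for _, row in df.iterrows():
        f.write(f"{row['ID']}|{row['Text'].lower()}\n")
```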
During the first training run, I discovered that the loss was not decreasing as expected. I was afraid that the dataset contained some irregularities. An automatic check of the quality of the audio files didn't reveal any abnormality, so I created a Python script to listen to the audio files manually. The program play-wavs.py
is included in the script folder of the present repository. I discovered that among the first 500 male audio files, 32 were female voices. By extrapolation, I estimated that about 2,560 files in the male dataset are female. I tried several scripts to identify the files with female voices, based on pitch or other speech parameters, without success. Finally, I found a pretrained AI model on HuggingFace that does the job. I assembled my findings in the Wiki page Voice Gender Classifier.
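The details are on that Wiki page; in principle the classification comes down to a few lines with the transformers library (the model id below is only a placeholder; see the Wiki page for the model actually used):

```python
from transformers import pipeline

# Placeholder model id, not the actual pretrained model
classifier = pipeline("audio-classification", model="some-org/voice-gender-classifier")

scores = classifier("dataset/wavs/0000e52f851b4306b0c4243fcfa5d9ef.wav")
print(scores)  # e.g. [{'label': 'female', 'score': 0.99}, ...]
```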
Nevertheless, I continued the training of the male voice for 5 days, up to the threshold of 50,000 steps. I then cleaned the dataset with a Python script using the gender classifier AI model before executing a final training to obtain the expected quality. In the end, there were fewer female samples than estimated, about 800 in total. For the final training with the male voice, I used only 32,000 samples to optimize the steps during the process: 1% of 32,000 gives 320 evaluation samples, and all numbers are multiples of the batch sizes 32 and 16.
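A sketch of such a cleaning pass, reusing the classifier from the sketch above (again with a placeholder model id; the label names depend on the chosen model):

```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="some-org/voice-gender-classifier")

kept = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        file_id = line.split("|", 1)[0]
        top = classifier(f"dataset/wavs/{file_id}.wav")[0]  # best-scoring label
        if top["label"] == "male":  # label name depends on the model
            kept.append(line)

# 32,000 training samples -> 320 evaluation samples at 1%,
# both multiples of the batch sizes 32 and 16
with open("metadata_clean.csv", "w", encoding="utf-8") as f:
    f.writelines(kept[:32000])
```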