# Armenian TTS Data Preparation

This notebook prepares a high-quality dataset for Armenian Text-to-Speech (TTS) training.  
I select 35 speakers from the Armenian (hy-AM) portion of the Common Voice 17.0 corpus, ensuring diversity and audio quality, and exclude speakers with poor or problematic recordings.  
The workflow includes:
- Filtering and combining data from multiple splits (train, test, dev, validated)
- Removing duplicates
- Filtering audio clips
- Converting audio to the required format
- Splitting the data for training, validation, and testing
- Saving the final filelists for TTS model training

## 1. Select Speakers

I select 35 speakers based on EDA, ensuring both diversity and high audio quality.

In [1]:

chosen_set = {'28de0109a7cb8b421e1b1f99d7ca53c08fab2ba0d7e742273a2699601393e4e8d4e7253e3fc58b16043531aee8bb3b466f21ca5c5694a1f58aa84515c765a072',
 '2a14d06725843c8bf80df47addecb1fbadfe1851c7c3a32f316b9d5a4b920bbebbb2c59c2490d649c4fbc7ca76e700358d7a16e24dcad85a64cc2e6796a6a716',
 '2d6e456848c8c3c1dbdad8b53c79d789366af4c311f047e8f0b4173059b2b040ffa9c0d198004f489fbf191593e08c4566dbbb095b082f76ef7e1f69e7ef9b25',
 '2f6cbba7040916d191a33342165afc83423b9c9224190d53b12ca73285c16c401a947c9e525fa10d71528b92db51c8d952c68067c7773e37903ba2ca2bd226b2',
 '3023da6d296b419df5b1e00539ab90eb66c846e41a91987321a8cd7d10e01c30341dd92cff4c309ee6baaf3a547fdb66e427741d14a6f58bcb0e36b9dc0c5e1a',
 '31e3ce7f8be2d3b5b4527f0fd1ae2c03f89f05b92a4893ede411b038747bffefebaf10398403248cf4cb621d0ca494d2696fe092807607bcd03a6fdc922ec160',
 '3369eef6dfd43e57d316a6b25dde31dc9fc6588910ca540927c2fe041b6ba06ff1b6f0627070b9ae88a8db87100c0238201ac33d2db65a7ce9d79cb9101ffd21',
 '49b9cfac58dab705b163fca60263e022ff12be4a4a006759564b595e3ffe4368c37489faaeee57bcb7e5595e5943b36bad620d15bfba222aedce648a8b5b9179',
 '53f179499b5337b22dadf05b4d4b8d46cbb3f73e2324b0267365bf4e960f3e0f5a994cef69173e2f85c301ca29f9eb3c7d14cf1edb3f776a5e0213261d5065fc',
 '60b90d3554762fae5d65d7e2b6735f1acce7de3a170c5c02883e25e97304b29a7b72866885666741c880d8a5182e25fcb830c554662ca3553708afded923e090',
 '68e8fc86f9c83e2105577567d214643d186d1d0f06e7e59866699bfcb626f1ae1589f4f9be24ddf663bc130a4c5bb728531f369f145a28f87a251b35cd3793ee',
 '6ac14a247076239cdd4599e6a633cd28645f1b88253a5806a2b4768eb9671201d16ce6331e068d7e0998ac9e86913d35f4fd71bf493e1a670012bf757c66c3ff',
 '6b64ea824661a69f4b0ce5920f2286fd7747926a1f5444f2aad3f1822a106047f7750767b2e75a3261be361443749c94b097ca6a0ddf584ab0889a58d1b58471',
 '6da653d9fd2b62dfea646ac21941b9987b8f2641877b9ece4723bf7f7274c105bfd42f0e58ff755b97120e9aba1318e1d3b36eddd44d17b01a8ecd163e8984b8',
 '6f39091aac3f2771e69372b91f8c288d728806f9834be929d2b6d525c7fa537ce6b5e6cb6dee243586dff807b3969a37b36339eab6285d07f0898762f71be6f3',
 '846e31aa456f2fdca60c0e0492d90cbf7de522e8ff8a9b683cffbd9d3d6ecc0c08efac1040a77dbc1e03633fe086a3aebb2848da1742f3ff576e7fff07d15f2e',
 '84e97d814192ea9cb2ba0185bfd5333eb0d90ea920bb57a2a2dd1a94b0c5a30d4f89556affa150c2755add1e1303847cb75ec4fc2b60fa2486823361c224c763',
 '8b76c84631d777c2e17c6ad3875c28931c94d97d07fdf4bd3bae0ece675f3bbba47ac60a672feabdde5ffda9ffcf0eceba8cc1a9382c5c89849512c260597622',
 '8dd1755a8244e81bfcb3b97830ad56d2d1b2fab1fa676b516a588907958be0fc70720883b3f2770b1670859cf765501ac1492cb003c4166b0f70a5365890782e',
 '8dfd84ecd2e66431841c2d1181ed2e0ef865d54927aad33741421304478d7edb80965657dd227db55805ea3436399fd6b0efc62ceba35de6968a924d7f992838',
 '8fdd42269ec62b3504b9885ffad74176a0c4d4bb2c6cf8509ac33ac122e89a235d43e7a74755ec89ef273a9755091ddf67053f4b122903027cdbd87f911fcfdb',
 '9a08bd28a3fbc1998279690cceaa718620987fcb4b161b4600e7b66bf522fe6fb2da05d60ecad33bda3251544ee2137fb1e1f16f8d360fa9fa64777e652cefd9',
 '9ede4609148d8eaf0e1574bb56deccd03b540fecb86eb35165ffcd32704d996085035c2d4f471697b57933885c2a1072eec94cd52c7e3a7afbc59656b396ee7b',
 'a1e727f1a1016c7ded8128da3f03dd17503239b4d40a4eba9c4a95511653fe26723800494c5c315871b036f5a7b6f2b9bffd8438f193c5633d45845339ae6442',
 'a515c4d990fc6466d908eee3a2ecc7d2a188fff1a6fbb01ab1d161a049fee502b1f5ef874706376c922d06d280e51fa184e03cecdad3ce75745f72c9c220d357',
 'acd63a7f7a57557f1329a44ab570a846f2030b5b8bba3295156ad4919da391dc8106c31653604e2fac4932bc5e1908d274c8a044f2377190a640792480807e03',
 'b5f49352aa3361a1ed8000a4bd63fad53788280870a2d2cd743093a3d96013f73f4caddf5c036afaa68038c7d4e99a567aee464c7af133c413a303227d298248',
 'c92dc8928c8bc36adc25411dc06fb4e43ea0335b6eee135401fade138db0593cb86a90a1522efb4f5ced36c08fa1a7979dc4829e31d2d5b227b17f573674955d',
 'ce6d647ad6cdf1820c8c5d4f9b9dd3b23f7b3f0aee76e55d332e530965500c94552c77d86465d117a8c61a19df2be2e52ef28b030e1ad8fbe5485f95956c8600',
 'd5d3cefd2399e4bd6f46b838c33cb44b558d4a3cf1c05ef1513cca95a9d521b15210687145fd5f0282b3edca7a3b7f4ca3b4cb9b6f033dde68b30f1e81010796',
 'dfa161d397a54e431dfb86a397193d83af5743a59132448ac3a8024e95dec30fcaf83d9ad6055988a6c88be168c6887dc54deccbbc28b334864f55ded6474160',
 'e71defa9da6ea8395e67ec5d20ab8934ce7e2a2826cf62321eb0b9aa5487a9a7c57cb355e4396f0261f13ec2b2fb17e9bf835556732166562ebe98513987d2ae',
 'ea9104f968bd916962722e501880464fb7eb6759fd877fa99404fc5d19bd3a7a0b80cb6f760d4aa97b10f287e82ff2ef97caf0d5768013d559f147d880c992ab',
 'ed04f2351d36c1612157c3ac700ab6e00bab5bf46e455c6bf4a496cdc90191c266486ffa4154ef1bd4a0f46f18024ed09175d919bd4d04009c8445323497e197',
 'fcbc29219f1ffdc4c413b71745a0085d4730258e56bb73e7ea08186ca44b0d32edb1d2be9c0806f92f35c6d0d7b4d13711259fc3ad49246fc4ee85e3fe2b0f9f'}

As discussed in the EDA, I selected these 35 speakers to get a normalized, diverse set of speaker data while excluding speakers whose speech or recording quality was poor. Because this is a TTS (Text-to-Speech) project rather than an ASR (Automatic Speech Recognition) one, high-quality audio is especially important.

Next, I combine the train, test, dev, and validated TSV files, keeping only the rows that belong to the selected speakers.

In [2]:
import pandas as pd
tsv_files = [
    "/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/validated.tsv",
    "/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/train.tsv",
    "/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/test.tsv",
    "/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/dev.tsv"
]
dfs = []
for file in tsv_files:
    df = pd.read_csv(file, sep='\t')
    df = df[df['client_id'].isin(chosen_set)]
    dfs.append(df)
combined_df = pd.concat(dfs, ignore_index=True)
combined_df.to_csv("combined_filtered.tsv", sep='\t', index=False)

In [3]:
combined_df.duplicated(subset=['path']).sum() 

6296
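
The high duplicate count is expected: in Common Voice releases, `validated.tsv` already lists the clips from which `train.tsv`, `dev.tsv`, and `test.tsv` are drawn, so concatenating all four files repeats those rows. A minimal sketch with made-up rows (the file names and sentences are hypothetical) shows the effect:

```python
import pandas as pd

# Toy stand-ins for validated.tsv and train.tsv (hypothetical rows).
validated = pd.DataFrame({"path": ["a.mp3", "b.mp3", "c.mp3"],
                          "sentence": ["s1", "s2", "s3"]})
train = pd.DataFrame({"path": ["a.mp3", "b.mp3"],
                      "sentence": ["s1", "s2"]})

toy = pd.concat([validated, train], ignore_index=True)

# Rows whose 'path' already appeared earlier are flagged as duplicates.
n_dupes = int(toy.duplicated(subset=["path"]).sum())
deduped = toy.drop_duplicates(subset=["path"])
print(n_dupes, len(deduped))
```

Dropping duplicates on `path` therefore keeps exactly one row per audio clip.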

In [4]:
combined_df.drop_duplicates(subset=['path'], inplace=True)  # keep one row per audio clip
combined_df.to_csv("combined_filtered.tsv", sep='\t', index=False)

In [5]:
import pandas as pd

combined_df = pd.read_csv("/home/ubuntu/vits2_pytorch_hy/combined_filtered.tsv", sep='\t')

clip_durations = pd.read_csv("/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/clip_durations.tsv", sep='\t')

filtered_clip_durations = clip_durations[clip_durations['clip'].isin(combined_df['path'])]

filtered_clip_durations.to_csv("/home/ubuntu/vits2_pytorch_hy/filtered_clip_durations.tsv", sep='\t', index=False)
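
Once the durations are filtered, it can be useful to see how much audio each selected speaker contributes. The following is a rough sketch on toy data, not part of the original notebook; the speaker names and durations are hypothetical, while the column names `path`, `clip`, and `duration[ms]` match the files used above.

```python
import pandas as pd

# Toy stand-ins for combined_df and clip_durations (hypothetical values).
meta = pd.DataFrame({"client_id": ["spk_a", "spk_a", "spk_b"],
                     "path": ["a.mp3", "b.mp3", "c.mp3"]})
durations = pd.DataFrame({"clip": ["a.mp3", "b.mp3", "c.mp3", "d.mp3"],
                          "duration[ms]": [3000, 4500, 6000, 9000]})

# Inner join keeps only clips that belong to the selected speakers.
merged = meta.merge(durations, left_on="path", right_on="clip", how="inner")

# Total audio per speaker and overall, in seconds.
per_speaker_s = merged.groupby("client_id")["duration[ms]"].sum() / 1000
total_s = merged["duration[ms]"].sum() / 1000
print(per_speaker_s.to_dict(), total_s)
```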

In [None]:
# Find the longest clip among the selected speakers
longest_clip = filtered_clip_durations.loc[filtered_clip_durations['duration[ms]'].idxmax()]

print(f"Longest clip: {longest_clip['clip']}")
print(f"Duration: {longest_clip['duration[ms]']} ms")


Longest clip: common_voice_hy-AM_39648180.mp3
Duration: 10440 ms


The following code copies only the audio clips listed in `filtered_clip_durations.tsv` into a separate `filtered_clips` directory.

In [6]:
import pandas as pd
import os
import shutil

# Paths
clips_dir = "/home/ubuntu/vits2_pytorch_hy/cv-corpus-17.0-2024-03-15/hy-AM/clips"
filtered_tsv = "/home/ubuntu/vits2_pytorch_hy/filtered_clip_durations.tsv"
output_dir = "/home/ubuntu/vits2_pytorch_hy/filtered_clips"

# Read the filtered clip list
df = pd.read_csv(filtered_tsv, sep='\t')
clip_names = set(df['clip'])

# Make output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Copy only the needed files
for clip in clip_names:
    src = os.path.join(clips_dir, clip)
    dst = os.path.join(output_dir, clip)
    if os.path.exists(src):
        shutil.copy2(src, dst)
    else:
        print(f"Warning: {src} does not exist!")

print("Done copying filtered clips.")

Done copying filtered clips.


Run the following commands in the directory that holds the filtered clips (`filtered_clips`) to convert each MP3 to a mono WAV file with a 22050 Hz sampling rate:

```bash
sudo apt-get update
sudo apt-get install ffmpeg
for f in *.mp3; do
  ffmpeg -i "$f" -ar 22050 -ac 1 "${f%.mp3}.wav"
done
```
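
If it is more convenient to stay inside the notebook, the same conversion could be driven from Python. This is only a sketch: it assumes `ffmpeg` is already installed and on the `PATH`, and `build_ffmpeg_cmd` and `convert_all` are hypothetical helper names, not part of the project.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src):
    """Return the ffmpeg command that converts one MP3 to a 22050 Hz mono WAV."""
    src = Path(src)
    dst = src.with_suffix(".wav")
    # -y overwrites an existing output; -ar sets the sample rate, -ac the channel count.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "22050", "-ac", "1", str(dst)]


def convert_all(clips_dir):
    """Convert every MP3 in clips_dir and return the number of files converted."""
    count = 0
    for src in sorted(Path(clips_dir).glob("*.mp3")):
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(build_ffmpeg_cmd(src), check=True, capture_output=True)
        count += 1
    return count
```

Calling `convert_all("filtered_clips")` would then write each WAV file next to its source MP3, as the shell loop does.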

Generate integer speaker IDs from 0 to 34 (the model requires them).

In [7]:
client_id_map = {cid: idx for idx, cid in enumerate(combined_df['client_id'].unique())}

# Replace client_id in the DataFrame
series = combined_df['client_id'].map(client_id_map)

final_df = combined_df.copy()
final_df['client_id'] = series

final_df['client_id'].value_counts()

client_id
34    536
33    484
32    470
31    439
30    265
29    261
28    259
27    230
26    215
25    196
24    190
23    181
22    180
21    175
20    174
19    170
18    158
17    153
16    143
15    134
14    114
13    113
12    113
11    110
10     98
9      97
8      94
7      88
6      84
5      81
4      78
3      72
2      67
1      67
0      63
Name: count, dtype: int64
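
Note that `unique()` returns values in order of first appearance, so the mapping above depends on the row order of `combined_df`. If a mapping that stays the same regardless of row order is preferred, sorting the IDs first is one option. A small sketch with hypothetical IDs:

```python
import pandas as pd

# Hypothetical client IDs in arbitrary row order.
toy = pd.DataFrame({"client_id": ["ccc", "aaa", "ccc", "bbb"]})

# Sort the unique IDs so the mapping does not depend on row order.
speaker_map = {cid: idx for idx, cid in enumerate(sorted(toy["client_id"].unique()))}
toy["speaker_id"] = toy["client_id"].map(speaker_map)
print(speaker_map)
```

Saving such a mapping alongside the filelists makes it possible to trace a speaker ID back to its original `client_id` later.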

In [31]:
df = final_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Calculate split indices
n = len(df)
n_train = int(0.9 * n)
n_val = int(0.05 * n)
n_test = n - n_train - n_val  # ensures all rows are used

# Split
train_df = df.iloc[:n_train]
val_df = df.iloc[n_train:n_train+n_val]
test_df = df.iloc[n_train+n_val:]

# The splits are written to disk further below, after the audio paths are updated

In [32]:
combined_df['client_id'].nunique()

35

In [33]:
import os

def update_path_column(df, clips_dir='filtered_clips'):
    # Work on a copy so the slices produced by the split are not modified in place
    # (this also avoids pandas' SettingWithCopyWarning)
    df = df.copy()
    # Keep only the filename, switch the extension to .wav, and prepend the new directory
    df['path'] = df['path'].apply(
        lambda x: os.path.join(clips_dir, os.path.splitext(os.path.basename(x))[0] + '.wav')
    )
    return df

# Apply to all splits
train_df = update_path_column(train_df)
val_df = update_path_column(val_df)
test_df = update_path_column(test_df)

In [34]:
train_df['client_id'].nunique()

35
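
The random split keeps all 35 speakers in the training set here, but it does not guarantee that every speaker also appears in the validation and test sets. One possible alternative, which is not what this notebook does, is to split each speaker's rows separately. A minimal sketch on toy data, where `split_per_speaker` is a hypothetical helper:

```python
import pandas as pd


def split_per_speaker(df, train_frac=0.9, val_frac=0.05, seed=42):
    """Shuffle and split each speaker's rows so the ratios hold per speaker."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby("client_id"):
        group = group.sample(frac=1, random_state=seed)
        n_train = int(train_frac * len(group))
        n_val = int(val_frac * len(group))
        train_parts.append(group.iloc[:n_train])
        val_parts.append(group.iloc[n_train:n_train + n_val])
        test_parts.append(group.iloc[n_train + n_val:])
    return (pd.concat(train_parts, ignore_index=True),
            pd.concat(val_parts, ignore_index=True),
            pd.concat(test_parts, ignore_index=True))


# Toy data: two speakers with 20 clips each (hypothetical).
toy = pd.DataFrame({"client_id": [0] * 20 + [1] * 20,
                    "path": [f"clip_{i}.wav" for i in range(40)]})
train, val, test = split_per_speaker(toy)
print(len(train), len(val), len(test))
```

With the per-speaker counts shown earlier (the smallest speaker has 63 clips), each speaker would contribute at least a few clips to every split.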

In [35]:
train_df.to_csv('train.tsv', sep='\t', index=False)
val_df.to_csv('val.tsv', sep='\t', index=False)
test_df.to_csv('test.tsv', sep='\t', index=False)

In [36]:
def save_filelist(df, out_path):
    with open(out_path, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            # Cast every field to str (note: a missing value would be written as 'nan')
            path = str(row['path'])
            client_id = str(row['client_id'])
            sentence = str(row['sentence'])
            f.write(f"{path}|{client_id}|{sentence}\n")

# Save your splits
save_filelist(train_df, 'common_voice_train.txt')
save_filelist(val_df, 'common_voice_val.txt')
save_filelist(test_df, 'common_voice_test.txt')
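
Because the filelist uses `|` as a separator, a stray `|` in a sentence or a missing sentence would produce a malformed line. A quick sanity check before preprocessing might look like the sketch below; `check_filelist_lines` is a hypothetical helper and the sample lines are made up.

```python
def check_filelist_lines(lines, n_speakers=35):
    """Return (line_number, reason) tuples for lines that look malformed."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != 3:
            # A '|' inside the sentence would produce extra fields.
            problems.append((lineno, "expected 3 fields"))
            continue
        path, speaker_id, sentence = fields
        if not speaker_id.isdigit() or int(speaker_id) >= n_speakers:
            problems.append((lineno, "bad speaker id"))
        elif not sentence.strip() or sentence == "nan":
            # str(NaN) is written as 'nan' when a sentence is missing.
            problems.append((lineno, "missing sentence"))
    return problems


sample = [
    "filtered_clips/a.wav|0|Բարև\n",
    "filtered_clips/b.wav|35|Բարև\n",
    "filtered_clips/c.wav|1|nan\n",
    "filtered_clips/d.wav|2|ա|բ\n",
]
print(check_filelist_lines(sample))
```

To check a real filelist, the open file object can be passed directly, for example `check_filelist_lines(open('common_voice_train.txt', encoding='utf-8'))`.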

These filelists are now ready for `pre_processing.py`; see README.md for details.