## 1 Data Preprocessing

This notebook preprocesses the data. In addition, 4 hours of training data and 4 hours of validated data are subsampled from the original dataset for the Sursilvan idiom. This subsample will be used to train the first prototype.

In [22]:
import os
import pandas as pd
import shutil
import soundfile as sf
from tqdm import tqdm
import random
import re
from bs4 import BeautifulSoup

In [18]:
DATA_ROOT = "romansh-data"
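# Keep only directories, so archives (.tgz) and any other stray files are skipped.
FOLDER_NAMES = [folder for folder in os.listdir(DATA_ROOT) if os.path.isdir(os.path.join(DATA_ROOT, folder))]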
IDIOM_FOLDER = "rmsursilv-cc-2021-05-28"
TARGET_HOURS = {
    "train": 4.0,
    "validated": 4.0
}
RANDOM_SEED = 42
OUTPUT_FOLDER = os.path.join(DATA_ROOT, "sursilvan-small")
SPLITS = ["train.tsv", "validated.tsv", "test.tsv"]

BASE_PATH = os.path.join(DATA_ROOT, IDIOM_FOLDER)
CLIPS_PATH = os.path.join(BASE_PATH, "clips")

As the examples below show, the sentences contain HTML tags that need to be removed.

In [19]:
def print_example():
  example_df = pd.read_csv(os.path.join(DATA_ROOT, FOLDER_NAMES[0], "train.tsv"), sep='\t')
  for i in range(min(3, len(example_df))):
    print(f"    {i+1}: {example_df['sentence'].iloc[i]}...")

print_example()

    1: <p>Hei tgau ansemen, igl mies nom è Elin e gl'è puspe eneda ouras per la vossa emissiun Minisguard. Ossa matte eneda avant tgi vusoters dastgessas eir glindesde per en'emda an la Svizra franzosa a scola? Scu fiss chegl per vusoters? Tot per franzos, novs conscolars e scolasts. Mattagn betg gist uscheia simpel all'antschatta. Ma exact chegl ò ena famiglia an la nossa seria igls Svizzers fatg chest'emda, numnadamaintg è la famiglia Bernimoulin da Carouge sper Genevra sto per en'emda cò tar nous an la rumantscheia, numnadamaintg a Sevgein.<br></p>...
    2: <p>Schi vusoters levas saveir scu tgi Nicki ò passanto schiglio anc sia emda a Sevgein, alloura savez vurdar igls Cuntrasts digls 17 da november. Ed ossa nignsa tar en'otra famiglia, tar la famiglia digl Helveticus. Er chest'emda ans rachinta la famiglia digl Helveticus en'episoda or dall'istorgia dalla Svizra. E chest'eda vogl per ena battaglia agl Tessin e scu tg'igls svizzers èn sa dustos, e chegl sainza armas.<br></p>

So we strip the HTML from the `sentence` column of every `.tsv` file in the `romansh-data` folder, overwriting the files in place. Printing the same examples again confirms that the cleaning worked.

In [20]:
def clean_html(text):
    """Strip HTML tags from a sentence and collapse runs of whitespace."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_file(tsv_path):
    """Clean the 'sentence' column of a TSV file in place.

    Returns the number of rows written, or None if the file was skipped.
    """
    df = pd.read_csv(tsv_path, sep='\t')
    if 'sentence' not in df.columns:
        print(f"  Warning: No 'sentence' column found in {tsv_path}")
        return None
    df['sentence'] = df['sentence'].apply(clean_html)
    df.to_csv(tsv_path, sep='\t', index=False)
    return len(df)

# Collect every .tsv file in each idiom folder and clean it in place.
tsv_file_paths = [
    os.path.join(DATA_ROOT, folder, f)
    for folder in FOLDER_NAMES
    for f in os.listdir(os.path.join(DATA_ROOT, folder))
    if f.endswith('.tsv')
]
for path in tsv_file_paths:
    preprocess_file(path)

print_example()

    1: Hei tgau ansemen, igl mies nom è Elin e gl'è puspe eneda ouras per la vossa emissiun Minisguard. Ossa matte eneda avant tgi vusoters dastgessas eir glindesde per en'emda an la Svizra franzosa a scola? Scu fiss chegl per vusoters? Tot per franzos, novs conscolars e scolasts. Mattagn betg gist uscheia simpel all'antschatta. Ma exact chegl ò ena famiglia an la nossa seria igls Svizzers fatg chest'emda, numnadamaintg è la famiglia Bernimoulin da Carouge sper Genevra sto per en'emda cò tar nous an la rumantscheia, numnadamaintg a Sevgein....
    2: Schi vusoters levas saveir scu tgi Nicki ò passanto schiglio anc sia emda a Sevgein, alloura savez vurdar igls Cuntrasts digls 17 da november. Ed ossa nignsa tar en'otra famiglia, tar la famiglia digl Helveticus. Er chest'emda ans rachinta la famiglia digl Helveticus en'episoda or dall'istorgia dalla Svizra. E chest'eda vogl per ena battaglia agl Tessin e scu tg'igls svizzers èn sa dustos, e chegl sainza armas....
    3: Hai, chegl 
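
Printing a few rows only spot-checks the result. A slightly broader check, sketched below, scans a list of sentences for anything that still looks like a tag. The helper name `find_tag_residue` and the regular expression are assumptions made for illustration and are not part of the notebook; the pattern is deliberately simple and would also flag legitimate text shaped like a tag, such as `<unk>`.

```python
import re

# Hypothetical helper (not part of the notebook): flags strings that still
# contain something shaped like an HTML tag, e.g. <p>, </p>, <br>, <br/>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def find_tag_residue(sentences):
    """Return (index, sentence) pairs for sentences that still contain a tag."""
    return [
        (i, s)
        for i, s in enumerate(sentences)
        if isinstance(s, str) and TAG_PATTERN.search(s)
    ]


# Toy data standing in for the 'sentence' column of a cleaned TSV.
sample = ["Hai, chegl", "<p>Scu fiss chegl per vusoters?<br></p>", "Tot per franzos."]
residue = find_tag_residue(sample)
print(residue)
```

In the notebook this could be applied as `find_tag_residue(df['sentence'].tolist())` after reading a cleaned file; an empty list means no tag-like text was found.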

Then we subsample 4 hours each of training and validated data, and copy the entire test set, from the Sursilvan idiom. This subsample will be used during development to train leaner prototypes of the final model.
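
The selection rule behind the subsample is simple: shuffle the clips with a fixed seed, then keep adding clips until the next one would exceed the time budget. The sketch below shows the idea on made-up durations. The function `select_within_budget` and the toy numbers are hypothetical and only illustrate the rule; the notebook's own cell reads real durations from the audio files.

```python
import random


def select_within_budget(durations, budget_seconds, seed=42):
    """Shuffle clip indices with a fixed seed, then take clips in order
    until the next one would push the total past the budget."""
    order = list(range(len(durations)))
    random.Random(seed).shuffle(order)  # local RNG, reproducible for a given seed

    selected, total = [], 0.0
    for idx in order:
        if total + durations[idx] > budget_seconds:
            break  # stop at the first clip that would overshoot
        selected.append(idx)
        total += durations[idx]
    return selected, total


# Toy durations in seconds (made up for illustration).
toy_durations = [12.0, 20.5, 18.0, 25.0, 9.5, 30.0]
picked, total = select_within_budget(toy_durations, budget_seconds=60.0)
print(picked, total)
```

Because the loop stops at the first overshoot instead of searching for a shorter clip that would still fit, the total can land slightly below the budget.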

In [21]:
def get_audio_duration(path):
  """Return duration of a wav file in seconds."""
  try:
    with sf.SoundFile(path) as f:
      return len(f) / f.samplerate
  except Exception as e:
    print(f"⚠️ Could not read {path}: {e}")
    return 0.0


def subsample_split(df, split_name, target_hours):
  """Return a subsampled DataFrame totaling ~target_hours."""
  df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

  selected_rows = []
  total_seconds = 0.0

  for _, row in tqdm(df.iterrows(), total=len(df), desc=f"Subsampling {split_name}"):
    audio_path = os.path.join(CLIPS_PATH, row["path"])
    duration = get_audio_duration(audio_path)
    if duration == 0:
      continue
    if total_seconds + duration > target_hours * 3600:
      break
    selected_rows.append(row)
    total_seconds += duration

  sub_df = pd.DataFrame(selected_rows)
  print(f"✅ {split_name}: {len(sub_df)} utterances, {total_seconds/3600:.2f} hours")
  return sub_df


def copy_required_clips(df_list, output_clips_path):
  """Copy only audio files referenced in given list of DataFrames."""
  all_paths = set()
  for df in df_list:
    all_paths.update(df["path"].tolist())

  os.makedirs(output_clips_path, exist_ok=True)

  for rel_path in tqdm(all_paths, desc="Copying clips"):
    src_path = os.path.join(CLIPS_PATH, rel_path)
    dst_path = os.path.join(output_clips_path, rel_path)
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    shutil.copy2(src_path, dst_path)

random.seed(RANDOM_SEED)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

dfs_to_copy = []

for split_name, hours in TARGET_HOURS.items():
  tsv_path = os.path.join(BASE_PATH, f"{split_name}.tsv")
  if not os.path.isfile(tsv_path):
    print(f"❌ Missing {split_name}.tsv")
    continue

  df = pd.read_csv(tsv_path, sep="\t")
  sub_df = subsample_split(df, split_name, hours)
  output_tsv = os.path.join(OUTPUT_FOLDER, f"{split_name}.tsv")
  sub_df.to_csv(output_tsv, sep="\t", index=False)
  dfs_to_copy.append(sub_df)

test_tsv = os.path.join(BASE_PATH, "test.tsv")
if os.path.isfile(test_tsv):
  df_test = pd.read_csv(test_tsv, sep="\t")
  output_test_tsv = os.path.join(OUTPUT_FOLDER, "test.tsv")
  df_test.to_csv(output_test_tsv, sep="\t", index=False)
  dfs_to_copy.append(df_test)
  print(f"✅ test set: {len(df_test)} utterances")

output_clips_path = os.path.join(OUTPUT_FOLDER, "clips")
copy_required_clips(dfs_to_copy, output_clips_path)

print(f"\n🎉 Mini Sursilvan folder ready at '{OUTPUT_FOLDER}'")
print("Contains:")
print(f" - {len(os.listdir(output_clips_path))} audio files (referenced in TSVs)")
print(" - train, validated, test TSVs")

Subsampling train:  11%|█         | 754/6888 [00:00<00:01, 5569.60it/s]
✅ train: 754 utterances, 4.00 hours
Subsampling validated:  11%|█         | 750/6982 [00:00<00:00, 6436.01it/s]
✅ validated: 750 utterances, 4.00 hours
✅ test set: 94 utterances
Copying clips: 100%|██████████| 1504/1504 [00:03<00:00, 463.63it/s]

🎉 Mini Sursilvan folder ready at 'romansh-data/sursilvan-small'
Contains:
 - 1504 audio files (referenced in TSVs)
 - train, validated, test TSVs
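
A quick consistency check on the output folder can catch missing audio before training starts. The sketch below uses only the standard library and a hypothetical helper, `find_missing_clips`. It assumes each TSV has a `path` column holding file names relative to the `clips` folder, as the code above does. The demo builds a throwaway folder with made-up file names so the snippet runs on its own.

```python
import csv
import os
import tempfile


def find_missing_clips(tsv_paths, clips_dir):
    """Return the sorted clip paths referenced in the TSVs but absent from clips_dir."""
    missing = set()
    for tsv_path in tsv_paths:
        with open(tsv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if not os.path.isfile(os.path.join(clips_dir, row["path"])):
                    missing.add(row["path"])
    return sorted(missing)


# Self-contained demo on a temporary folder (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    clips_dir = os.path.join(tmp, "clips")
    os.makedirs(clips_dir)
    open(os.path.join(clips_dir, "a.wav"), "wb").close()  # only a.wav exists

    tsv_path = os.path.join(tmp, "train.tsv")
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows([["path", "sentence"], ["a.wav", "Hai"], ["b.wav", "Tgau"]])

    missing = find_missing_clips([tsv_path], clips_dir)
    print(missing)
```

Against the real output, one would pass the three TSVs in `OUTPUT_FOLDER` together with `output_clips_path`; an empty list means every referenced clip was copied.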



