### DATA PREPROCESSING

Download the Santali speech dataset from the [IndicVoices website](https://indicvoices.ai4bharat.org/).

###### INDICVOICES DATASET
This 1,639-hour dataset from 16,237 speakers covers 145 Indian districts and 22 languages. Despite significant progress in English ASR, progress in mid- and low-resource languages remains limited by the lack of diverse, high-quality labeled data. The study targets data collection for 22 Indian languages, which represent 1.2 billion speakers across 742 districts, taking into account their linguistic, cultural, and demographic diversity. A repository of 2.5K questions, 46.6K prompts, and 1.1K–4.1K role-play scenarios spanning 21 domains and 28 topics was used to build this versatile dataset. A dedicated quality control team and a multi-level transcription team were employed to ensure strict adherence to guidelines.

For the detailed data collection methodology and transcription guidelines, refer to their [publication](https://arxiv.org/pdf/2403.01926).

IndicVoices provides the test and validation sets, each of which contains audio files in `.wav` format and their corresponding metadata in `.json` files. Each JSON file contains information such as **duration** and **scenario**, along with details about the speaker, including their **job** and **qualification**. It also contains the dialogue segments in both verbatim and normalized form. Each entry in these two fields includes the text, `speaker_id`, and start and end timestamps.
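
To make this structure concrete, here is a minimal sketch of how one metadata record might be inspected. The record is a hypothetical, hand-written example: all values, and the `verbatim` key name, are assumptions. Only the `normalized` list with `text`, `speaker_id`, `start`, and `end` entries follows the description above. `segment_durations` is an illustrative helper, not part of any IndicVoices tooling.

```python
import json

# Hypothetical metadata record (values invented for illustration).
sample_json = """
{
  "duration": 12.4,
  "scenario": "Extempore",
  "verbatim": [
    {"speaker_id": "S1", "start": 0.0, "end": 5.2, "text": "first utterance, as spoken"},
    {"speaker_id": "S1", "start": 5.2, "end": 12.4, "text": "second utterance, as spoken"}
  ],
  "normalized": [
    {"speaker_id": "S1", "start": 0.0, "end": 5.2, "text": "first utterance"},
    {"speaker_id": "S1", "start": 5.2, "end": 12.4, "text": "second utterance"}
  ]
}
"""

def segment_durations(metadata, field="normalized"):
    """Return the length in seconds of every segment in the given field."""
    return [round(seg["end"] - seg["start"], 3) for seg in metadata.get(field, [])]

metadata = json.loads(sample_json)
for seg, dur in zip(metadata["normalized"], segment_durations(metadata)):
    # One line per segment: speaker, duration, transcript
    print(f'{seg["speaker_id"]}\t{dur:.1f}s\t{seg["text"]}')
```

Listing segment durations before chunking gives an early view of how many segments fall outside the length range that fine-tuning expects.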

We will use the normalized transcripts for our speech recognition task. Let's first extract audio chunks from this source. The preprocessing steps are:

- Up/Down sampling
- Reducing audio channels to one
- Audio files chunking
- SNR filtering
- Manifest creation
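
The list above names SNR filtering, but this section does not say how the signal-to-noise ratio is measured or which threshold is used. As a rough sketch under that gap, one simple heuristic treats the quietest frames of a recording as the noise floor and compares the average frame power against it. The function name, the frame length of 400 samples (25 ms at 16 kHz), and the 10% noise fraction are all assumptions chosen for illustration.

```python
import math

def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Crude SNR estimate in dB: the quietest frames are treated as the noise floor."""
    # Mean power of each non-overlapping frame
    powers = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    if not powers:
        raise ValueError("signal is shorter than one frame")

    # Noise floor: average power of the quietest fraction of frames
    powers.sort()
    n_noise = max(1, int(len(powers) * noise_fraction))
    noise_power = sum(powers[:n_noise]) / n_noise
    # Average power over all frames (speech plus noise)
    total_power = sum(powers) / len(powers)

    if noise_power == 0:
        return math.inf  # digital silence in the quietest frames
    return 10 * math.log10(total_power / noise_power)

# Synthetic example: 9 loud frames followed by 1 quiet frame
signal = [1.0] * 3600 + [0.1] * 400
print(f"{estimate_snr_db(signal):.1f} dB")
```

A filter would then drop chunks whose estimate falls below a chosen cutoff. That cutoff has to be tuned on the actual recordings, since this heuristic overestimates SNR for clips with digital silence and underestimates it for clips with no pauses.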

Some points to keep in mind about the audio files for wav2vec fine-tuning:
- **Audio Format:** WAV, PCM 16-bit, mono (single channel).  
- **Sampling Rate:** 16,000 Hz.  
- **Duration:** Each audio file should be between 5 and 30 seconds long.  
- **Content Guidelines:** Silence must be removed, and each file should feature only one speaker.  
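
The format requirements above can be checked automatically. The following is a minimal sketch using only Python's standard-library `wave` module, which reads uncompressed PCM WAV files. The function name `check_wav` and its return format are assumptions made for illustration. It does not check the content guidelines (silence, single speaker).

```python
import wave

def check_wav(path, min_sec=5.0, max_sec=30.0):
    """Return a list of problems; an empty list means the file meets the requirements."""
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_width = wf.getsampwidth()  # bytes per sample; 2 bytes = 16 bits
        rate = wf.getframerate()
        duration = wf.getnframes() / rate

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, found {rate} Hz")
    if not min_sec <= duration <= max_sec:
        problems.append(f"duration {duration:.2f}s is outside {min_sec}-{max_sec}s")
    return problems
```

Calling `check_wav("00001.wav")` on a compliant file returns `[]`. Any other result lists what needs to be fixed before fine-tuning.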


The folder structure should look like this:
```
datasets
   ├── santali
   │   ├── test
   │   │   ├── audio
   │   │   │   ├── 00001.wav
   │   │   │   └── 00002.wav
   │   │   └── transcription.txt
   │   ├── train
   │   │   ├── audio
   │   │   └── transcription.txt
   │   └── valid
   │       ├── audio
   │       └── transcription.txt
   └── hindi
       ├── test
       │   ├── audio
       │   └── transcription.txt
       ├── train
       │   ├── audio
       │   └── transcription.txt
       └── valid
           ├── audio
           └── transcription.txt
```


#### Create Dataset

#### Things to consider while writing the transcript

- The entire text should be transformed into uppercase.
- Any numerical digits in the text should be converted into their corresponding word form.
- All special characters, including punctuation marks, should be removed from the text.
- Words should be separated by single spaces, with no extra spaces between them.
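
As a minimal sketch of these four rules, the function below applies them to English text, because digit-to-word conversion is language-specific. `normalize_transcript` and `DIGIT_WORDS` are hypothetical names, and spelling out numbers one digit at a time is a deliberate simplification (a real pipeline would need proper number verbalization for the target language).

```python
import re

# Simplified digit-to-word table (assumption: digits are read out one by one)
DIGIT_WORDS = {
    "0": "ZERO", "1": "ONE", "2": "TWO", "3": "THREE", "4": "FOUR",
    "5": "FIVE", "6": "SIX", "7": "SEVEN", "8": "EIGHT", "9": "NINE",
}

def normalize_transcript(text):
    # Spell out ASCII digits, padding with spaces so words do not run together
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Uppercase the entire text
    text = text.upper()
    # Remove everything except A-Z and whitespace (punctuation, symbols)
    text = re.sub(r"[^A-Z\s]", "", text)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

print(normalize_transcript("Hello, world!  I have 2 cats."))
# HELLO WORLD I HAVE TWO CATS
```

For Santali, the same idea applies with a character whitelist in place of the `A-Z` pattern, which is what the `clean` function in this section does.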

In [None]:
def clean(text):
  # Keep only Ol Chiki letters, Ol Chiki digits, and spaces; drop everything else
  vocab = " ᱚᱛᱜᱝᱞᱟᱠᱡᱢᱣᱤᱥᱦᱧᱨᱩᱪᱫᱬᱭᱮᱯᱰᱱᱲᱳᱴᱵᱶᱷ᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙"
  filtered_text = ''.join(char for char in text if char in vocab)

  # TODO: convert numbers to words (not implemented here; Ol Chiki digits in `vocab` are kept as-is)

  # remove extra spaces
  return ' '.join(filtered_text.split())

In [None]:
!pip install pydub

In [None]:
import os
import json
from pydub import AudioSegment

def process_audio_files(input_folder, output_folder, transcription_file):
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the transcription file for writing
    with open(transcription_file, 'w', encoding='utf-8') as transcription:
        file_count = 1  # Counter for output filenames

        # Walk through directories and subdirectories
        for root, _, files in os.walk(input_folder):
            for file in files:
                if file.endswith(".json"):
                    # Get the corresponding .wav file
                    base_name = os.path.splitext(file)[0]
                    wav_file = os.path.join(root, base_name + ".wav")
                    json_file = os.path.join(root, file)

                    # Check if the .wav file exists
                    if not os.path.exists(wav_file):
                        print(f"Warning: Audio file {wav_file} not found for {json_file}.")
                        continue

                    # Load the JSON metadata
                    with open(json_file, 'r', encoding='utf-8') as json_fp:
                        metadata = json.load(json_fp)

                    # Load the audio file
                    audio = AudioSegment.from_wav(wav_file)

                    # Process each segment in the "normalized" field
                    for segment in metadata.get("normalized", []):
                        start = int(segment["start"] * 1000)  # Convert to milliseconds
                        end = int(segment["end"] * 1000)  # Convert to milliseconds
                        text = segment["text"]

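                        # Clean/normalize text
                        text = clean(text)

                        # Skip segments with no usable transcript or a non-positive duration
                        if not text or end <= start:
                            continue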

                        # Split the audio segment
                        audio_segment = audio[start:end]

                        # Generate the new filename
                        new_filename = f"{file_count:05d}.wav"
                        new_filepath = os.path.join(output_folder, new_filename)

                        # Export the audio segment
                        audio_segment.export(new_filepath, format="wav")

                        # Write to the transcription file
                        transcription.write(f"{new_filename}\t{text}\n")

                        # Increment the file counter
                        file_count += 1

if __name__ == "__main__":
    input_folder = "path_to_input_folder"  # Replace with the path to your input folder
    # Replace with the output folder for one split, e.g. datasets/santali/test.
    # The manifest script looks for transcription.txt directly inside each split folder.
    output_folder = "path_to_output_folder"
    transcription_file = os.path.join(output_folder, "transcription.txt")

    process_audio_files(input_folder, output_folder, transcription_file)
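
The script above exports every segment regardless of its length, while the guidelines earlier in this section ask for clips between 5 and 30 seconds. A minimal sketch of a length filter that could be applied to each segment before export is shown below. The helper name `within_duration` and the sample segments are hypothetical.

```python
def within_duration(segment, min_sec=5.0, max_sec=30.0):
    """True if the segment length (end - start, in seconds) is within the bounds."""
    duration = segment["end"] - segment["start"]
    return min_sec <= duration <= max_sec

# Hypothetical segments in the same shape as the "normalized" entries
segments = [
    {"start": 0.0, "end": 2.5, "text": "too short"},
    {"start": 2.5, "end": 14.0, "text": "kept"},
    {"start": 14.0, "end": 50.0, "text": "too long"},
]
kept = [s for s in segments if within_duration(s)]
print([s["text"] for s in kept])
```

Whether to drop short segments or merge adjacent ones from the same speaker is a design choice that this section leaves open. Dropping is simpler but discards data.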

###### Creating manifest file
The manifest file serves as a structured index or catalog of the dataset, providing the paths to audio files and their corresponding labels or transcripts.
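
The script in this section writes three files per split: a `.tsv` file (the audio root directory on the first line, then one file name and sample count per line), a `.wrd` file (word-level transcripts), and a `.ltr` file (letter-level transcripts with `|` marking word boundaries). The short sketch below shows what one entry looks like in each. The helper names and the sample values are made up for illustration.

```python
def tsv_row(filename, num_samples):
    """One .tsv entry: file name and number of audio samples, tab-separated."""
    return f"{filename}\t{num_samples}"

def to_ltr(transcript):
    """Letter-level line: characters separated by spaces, '|' as the word boundary."""
    return " ".join(transcript.replace(" ", "|")) + " |"

transcript = "HELLO WORLD"
print(tsv_row("00001.wav", 96000))  # 96000 samples = 6 seconds at 16 kHz
print(transcript)                   # .wrd line
print(to_ltr(transcript))           # .ltr line: H E L L O | W O R L D |
```

Line `i` of the `.wrd` and `.ltr` files corresponds to line `i + 1` of the `.tsv` file, because the `.tsv` file starts with the root directory.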

In [None]:
# Code credits - https://github.com/AI4Bharat/IndicWav2Vec/tree/main

import glob
import os

import soundfile as sf
import tqdm

p2root = "/path/to/root/folder"  # /content/datasets/santali in the example above

manifest = os.path.join(p2root, "manifest")
os.makedirs(manifest, exist_ok=True)

charset = set()
for folder in tqdm.tqdm(os.listdir(p2root)):
    if folder == "manifest":
        continue
    wavs = glob.glob(os.path.join(p2root, folder, "**", "*.wav"), recursive=True)
    if not wavs:
        continue  # Skip entries that contain no audio
    # Number of samples in each audio file
    samples = [len(sf.read(w)[0]) for w in wavs]
    # All audio files of a split are assumed to live in the same directory
    root = os.path.abspath(os.path.split(wavs[0])[0])
    wavs = [os.path.split(x)[-1] for x in wavs]

    # Map each file name (without extension) to its transcript
    wav2trans = dict()

    with open(os.path.join(p2root, folder, "transcription.txt"), "r", encoding="utf-8") as transcrip:
        lines = transcrip.read().strip().split("\n")
    for line in lines:
        if "\t" in line:
            file, trans = line.split("\t", 1)
        else:
            splitted_line = line.split(" ")
            file, trans = splitted_line[0], " ".join(splitted_line[1:])
        # transcription.txt may list names with or without the .wav extension
        wav2trans[os.path.splitext(file)[0]] = trans
        charset.update(trans.replace(" ", "|"))

    with open(os.path.join(manifest, folder + ".tsv"), "w", encoding="utf-8") as tsv, \
        open(os.path.join(manifest, folder + ".wrd"), "w", encoding="utf-8") as wrd, \
        open(os.path.join(manifest, folder + ".ltr"), "w", encoding="utf-8") as ltr:
        print(root, file=tsv)
        for n, d in zip(wavs, samples):
            key = os.path.splitext(n)[0]
            print(n, d, sep="\t", file=tsv)
            print(wav2trans[key], file=wrd)
            print(" ".join(list(wav2trans[key].replace(" ", "|"))) + " |", file=ltr)

# Character dictionary: one "<character> <index>" pair per line
with open(os.path.join(manifest, "dict.ltr.txt"), "w", encoding="utf-8") as dct:
    for e, c in enumerate(charset):
        print(c, e, file=dct)
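
After the manifest is written, a quick consistency check can catch misaligned files early. The sketch below assumes the layout produced above, where the `.tsv` file has one header line (the audio root) and the `.wrd` and `.ltr` files have one line per audio file. `manifest_is_consistent` is a hypothetical helper, not part of IndicWav2Vec.

```python
def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def manifest_is_consistent(prefix):
    """Check one split; prefix is the path without extension, e.g. '<root>/manifest/train'."""
    # The first .tsv line is the audio root directory, not an audio entry
    n_audio = count_lines(prefix + ".tsv") - 1
    return n_audio == count_lines(prefix + ".wrd") == count_lines(prefix + ".ltr")
```

For example, `manifest_is_consistent("/content/datasets/santali/manifest/test")` should return `True` for a split whose three files line up.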

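##### Normalize the dataset

If the source audio is in another format such as `.mp3`, convert it to 16 kHz, single-channel `.wav` with `ffmpeg`. The source file is deleted only after a successful conversion.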

In [None]:
%%bash

path="/content/datasets"  # Input directory path
ext=".mp3"  # Input file extension (e.g., .mp3)

# Iterate over all files with the given extension (safe for paths containing spaces)
find "$path" -type f -name "*$ext" -print0 | while IFS= read -r -d '' f; do
  # Replace the original extension with .wav
  output_file="${f%"$ext"}.wav"

  # Convert to .wav with 16 kHz and a single channel, one file at a time;
  # -nostdin keeps ffmpeg from consuming the file list piped into the loop
  ffmpeg -nostdin -loglevel warning -hide_banner -stats -i "$f" -ar 16000 -ac 1 "$output_file" && rm "$f"

done