In [1]:
!git clone https://github.com/k-ganda/NJIA.git

Cloning into 'NJIA'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 27 (delta 4), reused 8 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (27/27), 5.18 MiB | 17.93 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [2]:
# Change the current working directory to the 'NJIA' repository
%cd NJIA

# Verify your current working directory
!pwd

# List the contents of the repository to confirm you are in the right place
!ls

/content/NJIA
/content/NJIA
2_preprocessing  input_audio  new_colab_file.txt  README.md


In [3]:
!pip install -q transformers accelerate torchaudio librosa soundfile


In [4]:
import os
import json
import librosa
import torch
import soundfile as sf
from transformers import pipeline
from tqdm import tqdm


In [5]:
# Path is relative to the repo root, which is the working directory after `%cd NJIA`
CLEAN_AUDIO_DIR = "2_preprocessing/cleaned_audio"
OUTPUT_DIR = "3_medASR_outputs"

os.makedirs(OUTPUT_DIR, exist_ok=True)


In [6]:
asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0 if torch.cuda.is_available() else -1,
    return_timestamps=True
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


In [7]:
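def validate_audio(audio_path):
    """Return the clip duration in seconds; raise ValueError if the clip is unusable."""
    # sr=None keeps the file's native sampling rate instead of resampling
    y, sr = librosa.load(audio_path, sr=None, mono=True)
    duration = librosa.get_duration(y=y, sr=sr)

    # Raise explicitly: assert statements are skipped when Python runs with -O
    if sr != 16000:
        raise ValueError(f"{audio_path} is not 16 kHz (got {sr} Hz)")
    if duration <= 5:
        raise ValueError(f"{audio_path} is too short for testimony ({duration:.1f} s)")

    return duration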
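
The check above decodes the whole file just to read its sampling rate and length. As a minimal sketch of a lighter pre-check, the same properties can be read from the WAV header with the standard-library `wave` module. The helper names `wav_header_info` and `quick_validate` are hypothetical, and the sketch assumes the cleaned files are plain PCM WAV (the `wave` module rejects other encodings such as float WAV).

```python
import wave


def wav_header_info(path):
    """Return (sample_rate, channels, duration_seconds) read from the WAV header."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / float(sample_rate)
    return sample_rate, channels, duration


def quick_validate(path, expected_sr=16000, min_seconds=5.0):
    """Return a list of problems found; an empty list means the file passed."""
    sample_rate, channels, duration = wav_header_info(path)
    problems = []
    if sample_rate != expected_sr:
        problems.append(f"sampling rate is {sample_rate} Hz, expected {expected_sr} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if duration <= min_seconds:
        problems.append(f"duration {duration:.1f} s is not longer than {min_seconds} s")
    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue for a file in a single pass before any model time is spent.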


In [8]:
CLEAN_AUDIO_DIR = "2_preprocessing/cleaned_audio"
results = []

audio_files = [f for f in os.listdir(CLEAN_AUDIO_DIR) if f.endswith(".wav")]

for audio_file in tqdm(audio_files):
    audio_path = os.path.join(CLEAN_AUDIO_DIR, audio_file)

    duration = validate_audio(audio_path)

    # Run ASR
    transcription = asr(
        audio_path,
        generate_kwargs={
            "task": "transcribe",
            "language": "en",
            # Greedy decoding for repeatable output; this alone does not prevent hallucination
            "temperature": 0.0
        }
    )

    transcript_text = transcription["text"].strip()

    # Save individual transcript (swap only the file extension, write UTF-8 explicitly)
    txt_path = os.path.join(
        OUTPUT_DIR, os.path.splitext(audio_file)[0] + ".txt"
    )
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(transcript_text)

    results.append({
        "audio_id": audio_file,
        "duration_seconds": round(duration, 2),
        "transcript": transcript_text,
        "language": "en",
        "model": "MedASR-compatible (Whisper-large-v3)"
    })

  0%|          | 0/4 [00:00<?, ?it/s]
`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.
100%|██████████| 4/4 [00:50<00:00, 12.61s/it]
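
Because the pipeline was created with `return_timestamps=True`, each result also carries a `"chunks"` list of segments with a `"timestamp"` pair and a `"text"` field, which the loop above discards. A minimal sketch of turning those segments into timestamped lines follows. The helper name `chunks_to_lines` is hypothetical, and the sample result is made-up placeholder data, not output from this run.

```python
def chunks_to_lines(asr_result):
    """Format pipeline chunks as '[start - end] text' lines.

    A timestamp value can be None (for example when the end of the last
    segment could not be determined), so it is rendered as '?'.
    """
    def fmt(value):
        return "?" if value is None else f"{value:.2f}"

    lines = []
    for chunk in asr_result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hypothetical result shaped like the pipeline output (placeholder values)
sample_result = {
    "text": " first segment second segment",
    "chunks": [
        {"timestamp": (0.0, 3.2), "text": " first segment"},
        {"timestamp": (3.2, None), "text": " second segment"},
    ],
}
for line in chunks_to_lines(sample_result):
    print(line)
```

Keeping segment times alongside the text would let a reviewer jump to the matching point in the audio when checking a transcript.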


In [9]:
json_path = os.path.join(OUTPUT_DIR, "transcripts.json")

with open(json_path, "w") as f:
    json.dump(results, f, indent=2)

json_path


'3_medASR_outputs/transcripts.json'
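
Before handing `transcripts.json` to a downstream step, it may be worth reloading it and checking that every record has the fields written above and a non-empty transcript. This is a minimal sketch; the helper name `check_transcripts` is hypothetical, and the required keys are taken from the dictionary built in the transcription loop.

```python
import json

# Field names as written by the transcription loop
REQUIRED_KEYS = {"audio_id", "duration_seconds", "transcript", "language", "model"}


def check_transcripts(json_path):
    """Return a list of human-readable issues; an empty list means all records look sane."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    issues = []
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            issues.append(f"record {i}: missing {sorted(missing)}")
        elif not record["transcript"].strip():
            issues.append(f"record {i} ({record['audio_id']}): empty transcript")
    return issues
```

In the notebook this could be called as `check_transcripts(json_path)`, with any returned issues reviewed before the files are committed.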

In [10]:
# --- Example Output Display (For Review & Debugging) ---

import pprint

print("Number of audio files transcribed:", len(results))
print("\nSample transcription output:\n")

pprint.pprint(results[0])


Number of audio files transcribed: 4

Sample transcription output:

{'audio_id': 'case4.wav',
 'duration_seconds': 23.8,
 'language': 'en',
 'model': 'MedASR-compatible (Whisper-large-v3)',
 'transcript': 'Did he use anything, a weapon, or maybe something he found '
               "nearby? I, I don't know. He had, um, a heavy belt? No, it felt "
               'like, like a cord, like the ones used for the water pumps. It '
               'was thin and it burned my wrists. Look, there are these, uh, '
               'red lines around my arms.'}
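
Since this stage aims to keep hesitations in the transcript, a rough spot check is to count filler words in each result. The sketch below is only a heuristic: the filler list and the helper name `count_fillers` are assumptions, and a low count does not prove the model dropped anything.

```python
import re
from collections import Counter

# Assumed filler list; extend it for the speech patterns in your data
FILLER_PATTERN = re.compile(r"\b(um|uh|er|erm)\b", re.IGNORECASE)


def count_fillers(text):
    """Count whole-word filler tokens in a transcript, case-insensitively."""
    return Counter(match.group(1).lower() for match in FILLER_PATTERN.finditer(text))


# The sample transcript printed above
sample = (
    "Did he use anything, a weapon, or maybe something he found nearby? "
    "I, I don't know. He had, um, a heavy belt? No, it felt like, like a cord, "
    "like the ones used for the water pumps. It was thin and it burned my wrists. "
    "Look, there are these, uh, red lines around my arms."
)
print(count_fillers(sample))  # Counter({'um': 1, 'uh': 1})
```

The word-boundary anchors keep words such as "pumps" from being counted as "um".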


In [11]:
readme_path = os.path.join(OUTPUT_DIR, "README.md")

# Example content for the README.md file
readme_content = """
## MedASR Transcription

This stage converts cleaned survivor audio testimonies into raw clinical transcripts using a MedASR-compatible automatic speech recognition pipeline.

### Objective
To generate **verbatim, clinically faithful transcripts** from survivor speech while preserving:
- Hesitations and filler words (e.g., “uh”, “um”)
- Pauses and fragmented sentences
- Emotionally affected speech patterns

These elements are critical for forensic interpretation and downstream clinical reasoning.

---

### Model Choice
Due to limited public access to MedASR weights, this prototype demonstrates the transcription layer using **Whisper-large-v3**, an open, general-purpose ASR model, as a baseline. The pipeline is designed so that the model can be swapped for MedASR from Google’s Health AI Developer Foundations in regulated clinical environments.

---

### Processing Steps
1. Load preprocessed 16 kHz mono WAV files
2. Validate audio duration and sampling rate
3. Transcribe each audio file using the ASR pipeline
4. Save outputs as:
   - Individual `.txt` transcript files
   - A structured `transcripts.json` file for downstream NLP

---

### Output Artifacts

3_medASR_outputs/
│
├── case4.txt  (one .txt per input WAV, named after the audio file)
└── transcripts.json

---

### Ethical Considerations
- Only synthetic audio data is used
- No summarization or reinterpretation is performed at this stage
- Human review is expected before any clinical or legal use
"""

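# The README contains non-ASCII characters (curly quotes, box-drawing), so write UTF-8 explicitly
with open(readme_path, "w", encoding="utf-8") as f: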
    f.write(readme_content)

print(f"README.md created/updated at: {readme_path}")

README.md created/updated at: 3_medASR_outputs/README.md
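
The "Output Artifacts" listing in the README is written by hand, so it can drift from the files the notebook actually produced. One possible approach, sketched here with a hypothetical helper `render_tree`, is to build that listing from the output directory itself and paste or interpolate the result into the README text.

```python
import os


def render_tree(directory):
    """Render a one-level listing of `directory` in the same style as the README tree."""
    names = sorted(os.listdir(directory))
    root = os.path.basename(os.path.normpath(directory))
    lines = [f"{root}/", "│"]
    for i, name in enumerate(names):
        # The last entry gets the closing branch character
        branch = "└── " if i == len(names) - 1 else "├── "
        lines.append(branch + name)
    return "\n".join(lines)
```

In this notebook it could be called as `print(render_tree(OUTPUT_DIR))` after the transcripts and JSON file have been written.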


In [12]:
!git pull

Already up to date.
