# 0.0.0 WhisperX/Pyannote Transcription+Diarization Pipeline 

This Jupyter notebook is designed to test and evaluate a new Transcription and Diarization Pipeline with the following objectives:
1. Achieving word-level transcription accuracy to ensure detailed and precise text representation of the audio input.
2. Assessing diarization confidence levels to accurately attribute spoken segments to different speakers and measure the reliability of speaker identification.
3. Enhancing the alignment of transcriptions to be closer to natural sentence segments, thereby improving the readability and usability of the transcribed data.

The notebook leverages advanced transcription and diarization capabilities provided by the Whisper, WhisperX, and pyannote libraries. By using GPU acceleration, it processes audio data efficiently, performing alignment and diarization to produce structured outputs that are saved in CSV format for further analysis. The resources and installation instructions are included to facilitate the setup and execution of the pipeline.

Resources:
https://towardsdatascience.com/unlock-the-power-of-audio-data-advanced-transcription-and-diarization-with-whisper-whisperx-and-ed9424307281 

# 0.1 Setup
WhisperX documentation found here: https://github.com/m-bain/whisperX
================================================
1. Install Git
2. Install FFMPEG and add to PATH
3. Install Anaconda 

================================================   
4. Create Conda environment
```sh
conda create -n whisperxtranscription-env python=3.10
conda activate whisperxtranscription-env
```
5. Install PyTorch https://pytorch.org/get-started/locally/ 
```sh
pip install numpy==1.26.3 torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
```
6. Install WhisperX repository and additional packages
```sh
pip install whisperx==3.2.0

pip install speechbrain ipykernel ipywidgets charset-normalizer pandas nltk plotly matplotlib webvtt-py pypi-json srt python-dotenv tqdm

```

7. Create .env file at the same level as this notebook file with the following line
```sh
HF_TOKEN="REPLACEWITHHUGGINGFACETOKENHERE"
```
=================================================
8. For GPU usage :
Install Visual Studio Community https://visualstudio.microsoft.com/downloads/
Install NVIDIA CUDA Toolkit 12.1 https://developer.nvidia.com/cuda-12-1-0-download-archive 

Check PyTorch and CUDA installation
```sh
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```

=================================================
Fix Numpy
```sh
pip uninstall numpy -y
pip install numpy==1.26.3
```

Fix PyTorch
```sh
pip uninstall torch torchvision torchaudio -y
```

In [10]:
import torch
x = torch.rand(5, 3)
print(x)

tensor([[0.1514, 0.7911, 0.6091],
        [0.8781, 0.7379, 0.6201],
        [0.5455, 0.7667, 0.7895],
        [0.7106, 0.6113, 0.1373],
        [0.6705, 0.0542, 0.8046]])


# 0.2 Check once to see if CUDA GPU is available and PyTorch is working properly

In [9]:
# Check if CUDA GPU is available to PyTorch
import torch                                                # PyTorch
#torch.cuda.set_device(0)                                    # Set the main GPU as device to use if present
print(torch.__version__)
torch.cuda.is_available(),torch.cuda.get_device_name()      # Check if GPU is available and get the name of the GPU

2.3.0


AssertionError: Torch not compiled with CUDA enabled

In [4]:
%pip show whisperx

Name: whisperx
Version: 3.7.2
Summary: Time-Accurate Automatic Speech Recognition using Whisper.
Home-page: 
Author: Max Bain
Author-email: 
License: BSD-2-Clause
Location: /opt/miniconda3/envs/whisperxtranscription-env/lib/python3.10/site-packages
Requires: av, ctranslate2, faster-whisper, nltk, numpy, pandas, pyannote-audio, torch, torchaudio, transformers
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy
print(numpy.__version__)

1.26.3


# 1.0 Setup - Start here by adjusting variables
1. choose batch size, compute type, whisper model, and file extension to transcribe

In [None]:
import os
from tkinter import Tk, filedialog
import pandas as pd
import warnings
import torch
import whisperx
import gc
import datetime
import json
import webvtt
import logging
import dotenv

# Suppress specific deprecation warnings from torchaudio and speechbrain
warnings.filterwarnings("ignore", message=".*set_audio_backend has been deprecated.*")
warnings.filterwarnings("ignore", message=".*get_audio_backend has been deprecated.*")
warnings.filterwarnings("ignore", message=".*Module 'speechbrain.pretrained' was deprecated.*")
warnings.filterwarnings("ignore", message=".*AudioMetaData.*moved to.*")
warnings.filterwarnings("ignore", category=UserWarning)
logging.getLogger("speechbrain.utils.quirks").setLevel(logging.WARNING)



# Configuration
torch.cuda.set_device(0)  # Set GPU
device = "cuda" if torch.cuda.is_available() else "cpu" # Set device to GPU if available, otherwise CPU
language = "en"  # Set the language code en=English, es=Spanish, etc.
task = "transcribe"  # Set the task to "transcribe" or "translate" 
batch_size = 16 # Set the batch size for processing
compute_type = "float16" # Set the compute type to "float16" for faster processing
hf_token = os.getenv('HF_TOKEN') 
whisperx_model = "large-v3" # Set the WhisperX model to use
extensions = ['.ogg', '.wav', '.mp3'] # Supported audio file extensions


AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

# 2.0 Run - after adjusting variables first

Just push run here. You shouldn't need to change anything here unless you want to output less or more file types. These are mostly functions which are then called at the end of the cell.

1. You should get a popup asking to choose the folder where the files are found (It will also search subfolders).

2. You should then get a popup asking for where the transcription files should be placed (It will replicate the folder structure in which they were found)

3. You will also see a popup asking if you want to anonymize with a pseudonyms.csv file, and if so where it is located.

4. You should then see an output similar to the following (just ignore the warnings):

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.

5. When complete you will see where each were written and the folders where they were written to.


In [None]:
from tkinter import Tk, filedialog, messagebox

# Functions
def find_audio_files(base_dir, extensions):
    audio_files = []
    for root, _, files in os.walk(base_dir):
        for file in files:
            if any(file.lower().endswith(ext.lower()) for ext in extensions):
                audio_files.append(os.path.join(root, file))
    return audio_files

def anonymize_text(text, pseudonym_dict):
    for real_name, pseudonym in pseudonym_dict.items():
        text = text.replace(real_name, pseudonym)
    return text

def format_vtt_timestamp(seconds):
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{int(hours):02}:{int(minutes):02}:{int(seconds):02}.{milliseconds:03}"

def save_transcripts(segments, output_dir, relative_path, pseudonym_dict=None):
    if pseudonym_dict:
        for segment in segments:
            segment['text'] = anonymize_text(segment['text'], pseudonym_dict)
    for i, segment in enumerate(segments):
        segment['sentence_number'] = i + 1
    df = pd.DataFrame(segments)
    df['text'] = df['text'].apply(lambda x: x.lstrip())
    cols = ['sentence_number'] + [col for col in df.columns if col != 'sentence_number']
    df = df[cols]

    os.makedirs(output_dir, exist_ok=True)
    base_filename = os.path.splitext(os.path.basename(relative_path))[0]
    csv_path = os.path.join(output_dir, f"{base_filename}_transcription.csv")
    df.to_csv(csv_path, index=False, encoding='utf-8-sig')

    with open(os.path.join(output_dir, f"{base_filename}_transcription.txt"), 'w', encoding='utf-8') as f:
        for segment in segments:
            f.write(f"{segment['text'].strip()}\n")

    json_path = os.path.join(output_dir, f"{base_filename}_transcription.json")
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(segments, f, ensure_ascii=False, indent=4)

    vtt = webvtt.WebVTT()
    for segment in segments:
        caption = webvtt.Caption()
        caption.start = format_vtt_timestamp(segment['start'])
        caption.end = format_vtt_timestamp(segment['end'])
        caption.lines = [f"{segment['sentence_number']}: {segment['text'].strip()}"]
        vtt.captions.append(caption)
    vtt.save(os.path.join(output_dir, f"{base_filename}_transcription.vtt"))

def process_audio_file(audio_file, base_output_dir, relative_path, pseudonym_dict=None):
    try:
        print(f"Processing {audio_file}...")
        audio = whisperx.load_audio(audio_file)
        model = whisperx.load_model(whisperx_model, device, compute_type=compute_type)
        result = model.transcribe(audio, batch_size=batch_size, language=language, task=task)
        del model; gc.collect(); torch.cuda.empty_cache()

        model_a, metadata = whisperx.load_align_model(language_code=language, device=device)
        result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
        del model_a; gc.collect(); torch.cuda.empty_cache()

        # Correct way to load diarization model in recent whisperx
        diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
        diarize_segments = diarize_model(audio)
        result = whisperx.assign_word_speakers(diarize_segments, result)

        output_dir = os.path.join(base_output_dir, os.path.dirname(relative_path))
        save_transcripts(result["segments"], output_dir, relative_path, pseudonym_dict)
    except Exception as e:
        import traceback
        print(f"Error processing {audio_file}:\n{traceback.format_exc()}")

def main():
    # Initialize Tkinter
    root = Tk()
    root.withdraw()  # Hide the main window
    
    # Bring the root window to the front
    root.attributes('-topmost', True)

    # Popup for input folder
    input_folder = filedialog.askdirectory(title="Select Folder Containing Audio/Video Files")
    if not input_folder:
        print("No folder selected. Exiting.")
        return

    # Popup for output folder
    output_folder = filedialog.askdirectory(title="Select Folder to Save Transcriptions")
    if not output_folder:
        print("No output folder selected. Exiting.")
        return

    # Ask if a pseudonyms.csv file will be used
    use_pseudonyms = messagebox.askyesno("Pseudonyms", "Will you use a pseudonyms.csv file for to anonymize the transcripts?")
    pseudonym_dict = None

    if use_pseudonyms:
        pseudonyms_file = filedialog.askopenfilename(
            title="Select Pseudonyms CSV File",
            filetypes=[("CSV files", "*.csv")]
        )
        if not pseudonyms_file:
            print("No pseudonyms file selected. Continuing without pseudonymization.")
        else:
            # Load the pseudonyms file
            pseudonyms_df = pd.read_csv(pseudonyms_file)
            pseudonym_dict = dict(zip(pseudonyms_df['name'], pseudonyms_df['pseudonym']))
            print(f"Pseudonyms loaded from {pseudonyms_file}.")

    # Find and process audio files
    audio_files = find_audio_files(input_folder, extensions)
    print(f"Found {len(audio_files)} files to process.")

    for audio_file in audio_files:
        relative_path = os.path.relpath(audio_file, input_folder)
        process_audio_file(audio_file, output_folder, relative_path, pseudonym_dict)
        print(f"Processed {audio_file}")

    print("All files processed.")

if __name__ == "__main__":
    main()
