# 0.0.0 WhisperX/Pyannote Transcription+Diarization Pipeline 

This Jupyter notebook is designed to test and evaluate a new Transcription and Diarization Pipeline with the following objectives:
1. Achieving word-level transcription accuracy to ensure detailed and precise text representation of the audio input.
2. Assessing diarization confidence levels to accurately attribute spoken segments to different speakers and measure the reliability of speaker identification.
3. Enhancing the alignment of transcriptions to be closer to natural sentence segments, thereby improving the readability and usability of the transcribed data.

The notebook uses the Whisper, WhisperX, and pyannote libraries for transcription, alignment, and diarization. With GPU acceleration, it transcribes each audio file, aligns and diarizes the transcript, and saves the structured output in CSV, TXT, JSON, and VTT formats for further analysis. Resources and installation instructions are included to help with setting up and running the pipeline.

Resources:
https://towardsdatascience.com/unlock-the-power-of-audio-data-advanced-transcription-and-diarization-with-whisper-whisperx-and-ed9424307281 

# 0.1 Setup
WhisperX documentation found here: https://github.com/m-bain/whisperX

1. Create Python environment
conda create -n whisperx-env python=3.9
conda activate whisperx-env

2. Install PyTorch https://pytorch.org/get-started/locally/ 
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

3. Install WhisperX repository
pip install git+https://github.com/m-bain/whisperx.git

4. Additional useful packages to install
pip install charset-normalizer
pip install pandas
pip install nltk
pip install numpy
pip install plotly
pip install matplotlib
pip install jupyter ipywidgets
pip install webvtt-py
pip install pypi-json
pip install srt
pip install python-dotenv

5. Create a .env file in the same folder as this notebook file containing the following line
HF_TOKEN="REPLACEWITHHUGGINGFACETOKENHERE"
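
Before running the pipeline, it can help to confirm that the token was actually picked up. The following is a minimal sketch, assuming the token is exposed as the `HF_TOKEN` environment variable (for example after calling `load_dotenv()` from python-dotenv). `read_hf_token` and `PLACEHOLDER` are hypothetical names introduced here for illustration; they are not part of WhisperX or this notebook.

```python
import os

# Placeholder value from the .env template above
PLACEHOLDER = "REPLACEWITHHUGGINGFACETOKENHERE"

def read_hf_token(env=None):
    """Return the Hugging Face token, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env           # allow a plain dict for testing
    token = (env.get("HF_TOKEN") or "").strip().strip('"')
    if not token or token == PLACEHOLDER:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file before running diarization.")
    return token
```

Calling `read_hf_token()` right after the environment is loaded fails early with a readable message, instead of failing later when the diarization model is requested.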

# 0.2 Check once to see if CUDA GPU is available and PyTorch is working properly

In [6]:
# Check if CUDA GPU is available to PyTorch
import torch                                                # PyTorch
print(torch.__version__)                                    # Installed PyTorch version
if torch.cuda.is_available():
    torch.cuda.set_device(0)                                # Set the main GPU as device to use if present
# Check if GPU is available and get the name of the GPU (None if no GPU is found)
torch.cuda.is_available(), (torch.cuda.get_device_name() if torch.cuda.is_available() else None)

2.3.0+cu121


(True, 'NVIDIA GeForce RTX 4060 Laptop GPU')

# 2.0 Start here by adjusting variables

In [1]:
# 1. Set the device and other configuration variables
# Import necessary libraries
import os                                                   # OS
from dotenv import load_dotenv
import pandas as pd                                         # Pandas
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="torchaudio._backend") # Ignore torchaudio warnings
warnings.filterwarnings("ignore", category=UserWarning,message=".*set_audio_backend has been deprecated.*") # Ignore torchaudio warnings

import torch                                                # PyTorch
import whisperx                                             # Import the whisperx library    
import gc                                                   # for garbage collection   
import datetime                                             # for timing the process
from whisperx.utils import get_writer                       # WhisperX helper for writing transcripts to a file (not used below)
import json                                                 # Used to write the transcripts as JSON
import webvtt                                               # Used to write the transcripts as VTT captions
import srt                                                  # SRT subtitle library (not used below; no SRT output is written)

# Load the environment variables
load_dotenv()                                               # Read the .env file so that HF_TOKEN is available to os.getenv
HF_TOKEN = os.getenv('HF_TOKEN')                            # You can just replace this with your Hugging Face API token if you don't want to use the .env file

# Select the GPU only if one is available (set_device raises an error on CPU-only machines)
if torch.cuda.is_available():
    torch.cuda.set_device(0)                                # Change to 0, 1, 2, 3, 4, 5, 6, 7 depending on which GPU you want to use

device = "cuda" if torch.cuda.is_available() else "cpu"     # Set the device

batch_size = 16                                             # reduce (e.g., to 4) if low on GPU memory
compute_type = "float16"                                    # change to "int8" if low on GPU memory (may reduce accuracy); "float32" is the highest precision
hf_token = HF_TOKEN                                         # Set your Hugging Face API token in the .env file
whisperx_model = "small.en"                                 # change to "large-v2" for a larger model; English-only options are "tiny.en", "base.en", "small.en", and "medium.en" (there is no "large.en")

# Paths
base_dir = 'Data/rawAudioFiles'                             # Replace with the path to your main folder containing subfolders with audio files
output_base_dir = 'Data/rawTranscriptFiles'                 # Replace with the path to the folder where you want to save the transcripts
file_type1 = '.ogg'                                         # Audio file extensions to search for; include the leading dot
file_type2 = '.m4a'                                         # Replace with another extension (e.g., '.wav') if your files use a different format
file_type3 = '.mp3'                                         # Matching is case dependent, so '.WAV' and '.wav' are different extensions

# Define the file extensions to look for
extensions = [file_type1, file_type2, file_type3]

# Load pseudonyms CSV for anonymizing the transcripts
pseudonyms_df = pd.read_csv('data/pseudonyms.csv')                              # Load the pseudonyms CSV file. in the format name,pseudonym as column headers
pseudonym_dict = dict(zip(pseudonyms_df['name'], pseudonyms_df['pseudonym']))   # Create a pseudonym dictionary from the CSV file, only stored in ram


torch.cuda.is_available(), (torch.cuda.get_device_name() if torch.cuda.is_available() else None)  # Check if GPU is available and show the name of the card (None if no GPU), this is just here to help with a last second debug

(True, 'NVIDIA GeForce RTX 4060 Laptop GPU')
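
Because the extensions above are matched case-sensitively, a file such as `Interview.MP3` would not be found with `'.mp3'` in the list. The sketch below shows one way to search without regard to case. `find_audio_files_any_case` is a hypothetical helper written for illustration, not the function used later in this notebook, and it assumes every entry in `extensions` includes the leading dot.

```python
from pathlib import Path

def find_audio_files_any_case(base_dir, extensions):
    """Recursively collect files under base_dir whose suffix matches extensions, ignoring case."""
    wanted = {ext.lower() for ext in extensions}        # e.g. {'.ogg', '.m4a', '.mp3'}
    return sorted(
        str(path)
        for path in Path(base_dir).rglob("*")           # walk all subfolders
        if path.is_file() and path.suffix.lower() in wanted
    )
```

For example, `find_audio_files_any_case(base_dir, extensions)` returns a sorted list of matching paths, so files are processed in a repeatable order.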

# 3.0 Run after adjusting variables first

Just push run here. You shouldn't need to change anything here unless you want to output fewer or more file types. The cell mostly defines functions, which are then called at the end of the cell.
You should see an output similar to the following:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
Data has been written to Data/Trancripts_Outputs in CSV, TXT, JSON, and VTT formats
1 Data/RawAudioFiles_Inputs\Monologue.ogg has been processed and saved in Data/Trancripts_Outputs
1 audio files have been processed and saved in Data/Trancripts_Outputs


In [7]:
# 2. Functions for transcribing and diarization of audio files
# A. Function to find audio files in subfolders
def find_audio_files(base_dir, extensions): 
    audio_files = []
    for root, _, files in os.walk(base_dir):
        for audio_file in files:
            if any(audio_file.endswith(ext) for ext in extensions):
                full_audio_path = os.path.join(root, audio_file)
                audio_files.append(full_audio_path)
                print(f"Found audio file: {full_audio_path}")
    return audio_files

# B. Function to get file modification date
def get_file_modification_date(file_path):
    timestamp = os.path.getmtime(file_path)
    date = datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return date

# Function to anonymize text
def anonymize_text(text, pseudonym_dict):
    for real_name, pseudonym in pseudonym_dict.items():
        text = text.replace(real_name, pseudonym)
    return text

# Function to convert segments to different formats and save
def save_transcripts(segments, output_dir, filename):
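    # Anonymize segments (only the segment-level 'text' is rewritten; any nested fields, such as word-level entries, are saved unchanged)
    for segment in segments:
        segment['text'] = anonymize_text(segment['text'], pseudonym_dict)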
    
    # Add sentence numbers
    for i, segment in enumerate(segments):
        segment['sentence_number'] = i + 1
    
    # Convert segments to DataFrame and reorder columns
    df = pd.DataFrame(segments)

    # Clean leading spaces from the 'text' column
    df['text'] = df['text'].apply(lambda x: x.lstrip())

    # Reorder columns, ensuring 'sentence_number' is first
    cols = df.columns.tolist()
    cols = ['sentence_number'] + [col for col in cols if col != 'sentence_number']
    df = df[cols]
    
    # Save as CSV
    csv_output_file = os.path.join(output_dir, f'{filename}_transcription.csv')
    df.to_csv(csv_output_file, index=False)
    
    # Save as TXT
    txt_output_file = os.path.join(output_dir, f'{filename}_transcription.txt')
    with open(txt_output_file, 'w', encoding='utf-8') as f:
        for segment in segments:
            # Strip leading and trailing spaces from the text
            clean_text = segment['text'].strip()
            f.write(f"{clean_text}\n")

    # Save as JSON 
    json_output_file = os.path.join(output_dir, f'{filename}_transcription.json')
    # Reorder segments and clean the text field; .get() avoids a KeyError when a segment lacks one of the DataFrame columns
    segments_reordered = [{k: segment[k].lstrip() if k == 'text' else segment.get(k) for k in cols} for segment in segments]
    with open(json_output_file, 'w', encoding='utf-8') as f:
        json.dump(segments_reordered, f, ensure_ascii=False, indent=4)
  
    # Save as VTT
    vtt_output_file = os.path.join(output_dir, f'{filename}_transcription.vtt')
    vtt = webvtt.WebVTT()
    for segment in segments:
        vtt_segment = webvtt.Caption()
        vtt_segment.start = str(datetime.timedelta(seconds=segment['start']))
        vtt_segment.end = str(datetime.timedelta(seconds=segment['end']))
        # Clean leading and trailing spaces from the text and format it with the sentence number
        clean_text = segment['text'].strip()
        vtt_segment.lines = [f"{segment['sentence_number']}: {clean_text}"]
        vtt.captions.append(vtt_segment)
    vtt.save(vtt_output_file)


    print(f"Data has been written to {output_dir} for {filename} in the following formats: CSV, TXT, JSON, and VTT")

# C. Function to process each audio file
def process_audio_file(audio_file, output_dir):
    try:
        print(f"Processing file: {audio_file}")
        # Load audio
        audio = whisperx.load_audio(audio_file)
        
        # Load and transcribe using WhisperX model
        model = whisperx.load_model(whisperx_model, device, compute_type=compute_type)
        result = model.transcribe(audio, batch_size=batch_size)
        #print(result["segments"])                          # you can uncomment this line to see the transcription results

        # Clean up model from GPU if needed
        del model
        gc.collect()
        torch.cuda.empty_cache()

        # Align WhisperX output
        model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
        result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
        #print(result["segments"])                          # you can uncomment this line to see the alignment results

        # Clean up alignment model from GPU if needed
        del model_a
        gc.collect()
        torch.cuda.empty_cache()

        # Diarization with WhisperX
        diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
        diarize_segments = diarize_model(audio)
        result = whisperx.assign_word_speakers(diarize_segments, result)
        # print(diarize_segments)                           # you can uncomment this line to see the diarization results
        # print(result["segments"])                         # you can uncomment this line to see the diarization results

        
        # Save transcripts in multiple formats
        os.makedirs(output_dir, exist_ok=True)
        filename = os.path.splitext(os.path.basename(audio_file))[0]
        save_transcripts(result["segments"], output_dir, filename)

    except Exception as e:
        print(f"An error occurred while processing {audio_file}: {e}")

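    finally:
        # Ensure that all models are cleaned from memory.
        # Assigning None (instead of `del`) is safe even if an earlier step failed before the variable was created.
        diarize_model = None                                # Clean up diarize_model
        result = None                                       # Clean up result
        gc.collect()                                        # Garbage collection
        torch.cuda.empty_cache()                            # Empty cache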

# D. Main function to execute the tasks
def main(base_dir, output_base_dir, extensions):
    audio_files = find_audio_files(base_dir, extensions)
    print(f"Found {len(audio_files)} audio files.")
    counter = 1
    output_dir = output_base_dir  # Initialize output_dir to the base output directory at the start

    for audio_file in audio_files:
        process_audio_file(audio_file, output_dir)
        print(f"{counter} {audio_file} has been processed and saved")
        counter += 1

    if counter > 1:
        print(f"{counter - 1} audio files have been processed and saved in {output_dir}")
    else:
        print("No audio files were processed.")

# E. Execute the main function
main(base_dir, output_base_dir, extensions)

Found audio file: Data/rawAudioFiles\Monologue1.ogg
Found audio file: Data/rawAudioFiles\Monologue2.ogg
Found 2 audio files.
Processing file: Data/rawAudioFiles\Monologue1.ogg


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\mrhal\.cache\torch\whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
Data has been written to Data/rawTranscripts for Monologue1 in the following formats: CSV, TXT, JSON, and VTT
1 Data/rawAudioFiles\Monologue1.ogg has been processed and saved
Processing file: Data/rawAudioFiles\Monologue2.ogg


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\mrhal\.cache\torch\whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
Data has been written to Data/rawTranscripts for Monologue2 in the following formats: CSV, TXT, JSON, and VTT
2 Data/rawAudioFiles\Monologue2.ogg has been processed and saved
2 audio files have been processed and saved in Data/rawTranscripts
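
The anonymization step in this notebook uses plain string replacement, which also rewrites names that appear inside longer words (a pseudonym for "Ann" would change "Annabel" as well). The sketch below shows a whole-word alternative using regular expressions. `anonymize_whole_words` is a hypothetical helper, offered as one possible approach rather than the notebook's method, and it assumes names begin and end with letters or digits so that `\b` word boundaries apply.

```python
import re

def anonymize_whole_words(text, pseudonym_dict):
    """Replace each real name with its pseudonym, matching whole words only."""
    if not pseudonym_dict:
        return text
    # Try longer names first so that "Ann Lee" is matched before "Ann"
    names = sorted(pseudonym_dict, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: pseudonym_dict[m.group(1)], text)

print(anonymize_whole_words("Ann met Annabel and Ann Lee.", {"Ann": "P1", "Ann Lee": "P2"}))
# → P1 met Annabel and P2.
```

Matching is still case-sensitive, so "ann" would not be replaced; passing `re.IGNORECASE` would also require a case-insensitive lookup of the matched name.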

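
The VTT output builds its timestamps with `str(datetime.timedelta(...))`, which gives `0:00:01` for whole seconds and `0:00:01.500000` otherwise, while WebVTT timestamps use exactly three fractional digits (`hh:mm:ss.ttt`). A small formatter like the one below could produce that form directly. `format_vtt_timestamp` is a hypothetical helper and assumes the segment times are non-negative numbers of seconds.

```python
def format_vtt_timestamp(seconds):
    """Format a non-negative offset in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = int(round(float(seconds) * 1000))        # work in whole milliseconds
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

print(format_vtt_timestamp(3661.5))
# → 01:01:01.500
```

The returned string could be assigned to a caption's start or end time in place of the `timedelta` string.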